Thursday, January 13, 2005

google and common words

This is kinda interesting: I was curious to know how many PDFs google indexes. One simple way to find out is to search for a common word, so I tried "the filetype:pdf". The result - nada! Of course this has an obvious explanation, according to Shannon: the information content of a symbol goes down as its probability of occurrence goes up (it is the negative log of the probability). And since "the" is the most common word in the English language, it is THE most meaningless. After a little googling I saw them say so themselves in the Automatic Exclusion of Common Words section.

Being a curious monkey I decided not to take their word for it, and got some interesting results. Yes, they do not let you search on "the" in PDFs, but they allow it in HTML search despite their own disclaimer. One possible explanation is that they are changing their policy (so that people can find this) and have not updated their PDF index yet. But I dug deeper and realized that they have allowed "the" for quite a while, which is plenty of time to update their PDF index.

Then it came to me: the google index treats each HTML page as a single document AND every PDF FILE as a single document. Since PDF files are on average significantly longer than HTML pages, the probability that "the" appears somewhere in the document is greatly increased, making "the" in a PDF that much more meaningless than "the" in HTML. So how do I know how many PDFs google indexed? I just keep going down the list of the most common words:

the nada
of nada
to nada
and nada
a nope
in ;(
s -
it -
you bingo!
about 22M files.
I think this is very close to the total number of PDFs in their index, certainly within an order of magnitude.
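
For the curious, here is a rough sketch of the two ideas above: Shannon's measure of information, and why a longer document is more likely to contain any given word. It assumes every word in a document is drawn independently with a fixed probability, which real text certainly does not do, and the word probability and document lengths are made-up numbers for illustration, not anything measured from google.

```python
import math


def self_information(p):
    """Shannon self-information, in bits, of an event with probability p."""
    return -math.log2(p)


def prob_word_in_document(p_word, n_words):
    """Chance that a word shows up at least once in a document of n_words,
    assuming each word is an independent draw with probability p_word."""
    return 1.0 - (1.0 - p_word) ** n_words


# Hypothetical numbers: a word making up 0.1% of running text,
# a short HTML page of 500 words, and a long PDF of 5000 words.
P_WORD = 0.001
for label, length in (("HTML page", 500), ("PDF file", 5000)):
    p_doc = prob_word_in_document(P_WORD, length)
    bits = self_information(p_doc)
    print(f"{label}: P(word in doc) = {p_doc:.4f}, information = {bits:.4f} bits")
```

Under these assumptions the same word tells you much less about a long document than about a short one, which fits the guess that the PDF index cuts off common words more aggressively than the HTML index.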


