Last year, Google released a data set of how frequently single words and word combinations appear, based on analyzing over a trillion words on public web pages. It covers over 13 million individual words, along with the frequencies of combinations of up to 5 words. It’s available on 6 DVDs for just $180 from the Linguistic Data Consortium at the University of Pennsylvania.
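As a rough idea of how you might pull the single-word counts into memory, here’s a minimal sketch. It assumes the distribution stores one entry per line as tab-separated word/count pairs; the file name and exact layout are my guesses, so check the files on the DVDs before relying on this.

```python
def load_unigram_counts(path):
    """Load word counts from a file of 'word<TAB>count' lines.

    Assumes one entry per line, tab-separated; adjust if the
    actual files on the DVDs are laid out differently.
    """
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue  # skip malformed or header lines
            word, count = parts
            counts[word] = int(count)
    return counts
```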
If, like me, you use statistical analysis to pick out unusual words or phrases from documents, this is a godsend. It should make a great baseline to compare a document’s text against, eliminating the common phrases and leaving just the distinctive parts. I’m hoping to at least use it as an uber-stop-word file. The main downside is the restrictive license, which forbids "commercially exploiting" the data. It shouldn’t be rocket science to reproduce similar data by crawling the web when that becomes an issue, so I’ll work within those limits for now.
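To make the baseline idea concrete, here’s a sketch of one way the comparison could work, using the counts loaded above. The scoring choice (a simple log-ratio of document frequency to web frequency) and the function names are mine, not anything from the data set itself: words that appear far more often in the document than the web baseline predicts float to the top as the "distinctive" ones.

```python
import math
import re
from collections import Counter

def distinctive_words(text, baseline_counts, baseline_total, top_n=20):
    """Rank words in `text` by how over-represented they are versus the baseline.

    baseline_counts: dict of word -> web count (e.g. from load_unigram_counts)
    baseline_total:  total number of tokens behind those counts (~1 trillion here)
    """
    words = re.findall(r"[a-z']+", text.lower())
    doc_counts = Counter(words)
    doc_total = sum(doc_counts.values())
    scores = {}
    for word, count in doc_counts.items():
        doc_freq = count / doc_total
        # Words unseen in the baseline get a tiny floor count, so they score high.
        base_freq = baseline_counts.get(word, 1) / baseline_total
        scores[word] = math.log(doc_freq / base_freq)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

For the uber-stop-word variant, you could skip the scoring entirely and just take the few thousand highest-count baseline words as a filter list, dropping them from documents before any further analysis.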
The LDC has a great collection of other raw data sets too. It’s worth checking out their English Gigaword archive of millions of news stories if you want some more baseline data. Thanks to Ionut at Google Operating System for pointing me to the post on the official Google Research blog covering this release.