The Google n-gram dataset is probably as big a word frequency list as you’ll ever need, but it has very restrictive license terms that don’t allow you to publish it in any form. Since I’m interested in building some web-based services that let you query the frequency of particular words and phrases, I could fall foul of that restriction. Luckily there are some alternatives, since using the web as a source of word-frequency data has been a big topic in the linguistics community over the last few years.
The Web as Corpus site has a good collection of resources, and in particular it led me to Bill Fletcher’s work. He has written kfNgram, a free tool for generating word and phrase frequency (n-gram) lists from text and HTML files, and he’s also made some decent-sized data sets available himself, such as this list with over 100,000 entries.
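To make the idea of a word and phrase frequency list concrete, here is a minimal sketch in Python of counting n-grams in a piece of text. It is not how kfNgram works internally (I haven’t looked at its approach); the function name `ngram_counts` and the very simple tokenizer are my own assumptions for illustration.

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count n-grams (runs of n consecutive words) in a string.

    Hypothetical helper for illustration, not part of any existing tool.
    """
    # Lowercase and treat runs of letters/apostrophes as words; a real
    # tool would want a more careful tokenizer than this.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the word list.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

sample = "the cat sat on the mat and the cat slept"
for phrase, count in ngram_counts(sample, 2).most_common(3):
    print(phrase, count)
```

Setting `n=1` gives a plain word frequency list, and larger values of `n` give longer phrases. A `Counter` keeps everything in memory, which is fine for a few documents but probably not for anything web-scale.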
Also very interesting is the WebCorp project. It has an online word frequency list generator which you can point at any page you’re interested in to retrieve statistics on the text of that page. It also features a search engine which adds a layer of linguistic analysis on top of standard Google search results, with some neat touches such as displaying all occurrences of the search terms within each result, rather than just the standard abbreviated summary that Google produces.
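If you want to try the same sort of thing on your own pages, a rough sketch using only the Python standard library might look like the following. This is my guess at the general shape of a page-level word counter, not WebCorp’s implementation; `TextExtractor` and `page_word_frequencies` are made-up names, and the HTML is passed in as a string so that fetching it is left up to you.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text from HTML, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style elements.
        if self._skip_depth == 0:
            self.chunks.append(data)

def page_word_frequencies(html):
    """Return a Counter of lowercased words in the page's text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return Counter(re.findall(r"[a-z']+", text.lower()))

page = "<html><body><p>Word lists and word counts</p></body></html>"
print(page_word_frequencies(page).most_common(2))
```

This ignores plenty of real-world wrinkles, such as character encodings, navigation boilerplate and text generated by JavaScript, so treat the counts as approximate.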