Want the average frequencies of 13 million words?

Last year, Google released a list of how frequently single words and word combinations appear, based on an analysis of over a trillion words from public web pages. It covers over 13 million distinct words, plus the frequencies of combinations of up to five words. It’s available on 6 DVDs for just $180 from the Linguistic Data Consortium at the University of Pennsylvania.

If, like me, you use statistical analysis to pick out unusual words or phrases from documents, this is a godsend. It should be a great baseline to compare a document’s text against, so you can eliminate the common phrases and leave just the distinctive parts. I’m hoping to at least use it as an uber stop-word file. The main downside is the restrictive license, which forbids "commercially exploiting" the data. It shouldn’t be rocket science to reproduce similar data by crawling the web when that becomes an issue, so I’ll work within those limits for now.
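
To make the baseline idea concrete, here’s a rough sketch of how the comparison could work. It’s an illustration under a few assumptions, not a description of the actual data set or of any official tool: I’m assuming the single-word counts can be read as `word<TAB>count` lines, the function names `load_baseline` and `distinctive_words` are made up for this example, and the scoring (a log ratio of document frequency to add-one-smoothed baseline frequency) is just one simple choice among many.

```python
import math
import re
from collections import Counter


def load_baseline(lines):
    """Parse 'word<TAB>count' lines into a dict.

    The tab-separated layout is an assumption about the input file.
    """
    counts = {}
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if word and count.isdigit():
            key = word.lower()
            counts[key] = counts.get(key, 0) + int(count)
    return counts


def distinctive_words(text, baseline, top_n=5):
    """Rank a document's words by how over-represented they are vs. the baseline."""
    words = re.findall(r"[a-z']+", text.lower())
    doc_counts = Counter(words)
    doc_total = len(words)
    base_total = sum(baseline.values())
    vocab = len(baseline) + 1  # one extra slot for words the baseline never saw

    scores = {}
    for word, count in doc_counts.items():
        doc_freq = count / doc_total
        # Add-one smoothing keeps unseen words from dividing by zero.
        base_freq = (baseline.get(word, 0) + 1) / (base_total + vocab)
        scores[word] = math.log(doc_freq / base_freq)

    # Highest score first: these are the words the baseline says are unusual.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    # Tiny made-up baseline, purely for demonstration.
    baseline = load_baseline(["the\t1000", "of\t800", "cat\t5", "sat\t5"])
    print(distinctive_words("the zeppelin of the zeppelin sat", baseline, top_n=2))
```

Common words like "the" end up with low scores and drop out, which is the stop-word effect I’m after. With the real data you’d stream the counts from disk rather than hold 13 million entries in a plain dict.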

The LDC has a great collection of other raw data sets too. It’s worth checking out their English Gigaword archive of millions of news stories if you want some more baseline data. Thanks to Ionut at Google Operating System for leading me to the article in the official Google Research blog covering this release.


Pete Warden's blog


