Here’s my new tag cloud generator that uses a list of all the Wikipedia article titles to produce a visualization of the concepts on a web page. You can download the source PHP code here, or enter a URL in the box below to get a cloud:
It’s an extension of the standard tag cloud technique of counting word frequencies. I’ve included a white list of all the Wikipedia article names, as an approximation of ‘interesting concepts’. Only phrases that appear amongst the million titles are included in the cloud. I’ve also weeded out the top 10,000 most commonly used words to reduce the noise. A further extension would be to compare a word’s actual frequency with its expected average frequency, to produce statistically improbable phrases like Amazon’s.
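To make the filtering step concrete, here is a minimal sketch in Python rather than the PHP of the actual generator. It assumes the white list and the common-word list are already loaded as lowercase sets; the names `concept_counts`, `improbability`, `max_words` and `floor` are made up for illustration, and the tiny sample sets stand in for the real million-title list. The second function is one possible reading of the ‘statistically improbable’ extension, not something the generator does today.

```python
import re
from collections import Counter


def concept_counts(text, titles, common_words, max_words=3):
    """Count phrases of up to max_words words that appear in the title white list."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            # Keep the phrase only if it is a known title and not a common word.
            if phrase in titles and phrase not in common_words:
                counts[phrase] += 1
    return counts


def improbability(counts, total_words, expected_freq, floor=1e-6):
    """Ratio of actual to expected frequency; higher means more surprising.

    expected_freq is a hypothetical table of background frequencies.
    Phrases missing from it fall back to the small floor value.
    """
    return {
        phrase: (count / total_words) / expected_freq.get(phrase, floor)
        for phrase, count in counts.items()
    }


# Toy stand-ins for the real lists (assumptions, not real data).
titles = {"tag cloud", "wikipedia", "email", "the"}
common_words = {"the", "email"}
text = "The tag cloud uses Wikipedia. A tag cloud counts the words in the email."

counts = concept_counts(text, titles, common_words)
print(counts.most_common())
```

The counts would then drive the font size of each phrase in the cloud. Checking every run of one to three words against a set keeps the lookup cheap even with a very large title list.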
This is a by-product of some of my email analysis work. Tag clouds based on nothing more than the number of times a word appears in a piece of text often generate surprisingly good summaries. People tolerate the noise of incorrect words in a way they wouldn’t with a bullet-point list.
The underlying technology of semantic analysis is making very slow progress, so I’m picking applications and interfaces that are extremely tolerant of bad input, where the broad coverage you get from automating the analysis wins out over its poor quality.
One example of this is creating a profile for someone based on the contents of the emails they send. In a large company you’d have a white list of skill and project keywords, similar to the Wikipedia titles. The people who mention a keyword most often in their emails would have it added to their expertise list in a searchable employee directory. The consequences of some incorrect entries aren’t too painful. As long as there’s a white list, no private or embarrassing terms will appear there, and the profile can be hand-edited by the user to fix anything glaringly wrong.
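As a rough sketch of how that directory could be built, the Python below counts white-listed keywords per sender and gives each keyword to the people who mention it most. Everything here is an assumption for illustration: the `(sender, body)` input format, the `expertise_directory` name, the `top_people` cutoff, and the restriction to single-word keywords.

```python
import re
from collections import Counter, defaultdict


def expertise_directory(emails, keywords, top_people=2):
    """Map each sender to the white-listed keywords they mention most.

    emails is an iterable of (sender, body) pairs; keywords is a set of
    lowercase single words. Words outside the white list are never counted,
    so private terms cannot leak into a profile.
    """
    mentions = defaultdict(Counter)  # keyword -> sender -> mention count
    for sender, body in emails:
        for word in re.findall(r"[a-z0-9']+", body.lower()):
            if word in keywords:
                mentions[word][sender] += 1

    directory = defaultdict(list)
    for keyword in sorted(mentions):
        # The top_people heaviest users of a keyword get it on their profile.
        for sender, _count in mentions[keyword].most_common(top_people):
            directory[sender].append(keyword)
    return dict(directory)


# Made-up sample messages.
emails = [
    ("alice", "PHP and more php, also salary talk"),
    ("bob", "php once, sql sql sql"),
    ("carol", "php php php"),
]
directory = expertise_directory(emails, {"php", "sql"})
print(directory)
```

In this toy run "salary" never reaches anyone's profile because it isn't on the white list, and bob misses out on "php" because two colleagues mention it more often. Hand-editing would then be a matter of letting each user add or remove entries from their own list.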