Photo by Sarah Parrot
I was searching for other applications that use Wikipedia entry titles for semantic analysis of texts when I came across Chris Sizemore's conText experiment. Testing it out, I was blown away by how well it worked as an automatic tagger, better than commercial semantic analysis solutions like OpenCalais or SemanticHacker.
I used the same two texts I tried those two services with: an asteroid news article and one of my own blog posts. Here are the top ten results for the news article:
Asteroid_deflection_strategies
Asteroids_in_fiction
99942_Apophis
Impact_event
Near-Earth_asteroid
Asteroid
Planetary_defense
Human_extinction
Space_colonization
Risks_to_civilization,_humans_and_planet_Earth
And for my 4th of July blog post:
Thunder_on_the_mountain
The_Dilbert_Future
Art_of_Motion_(album)
Joseph_T._Bockrath
List_of_Elvis_Presley_songs
List_of_disco_artist
Songs_of_the_Century
Farris_Hassan
To_Tell_the_Tooth
List_of_Beatles_songs
9 of the top 10 results for the asteroid article are ones I'd pick as good categories for it. That's a much better hit ratio than OpenCalais or SemanticHacker managed in my tests. The results for the blog post are all completely unrelated, but the commercial tools do only slightly better there, picking out one or two related concepts. Text that's full of abstract musings rather than concrete nouns seems to be bad news for any semantic analysis.
In fairness I should mention that neither OpenCalais nor SemanticHacker is primarily aimed at my goal, which is to automatically extract a small set of categories from short-form pieces of text (e.g. emails), so the comparison isn't apples to apples. It's still good news for me that Chris' approach is so useful for my purpose though.
What's really fun about his project is that it's a true garden-shed effort, produced as part of the BBC radio labs from open-source parts without requiring a massive development budget. Here's how he did it:
– Download the whole of Wikipedia, and save out each article as a file on disk.
– Index all those files using the open-source search framework Lucene.
– For every candidate text, use 'More like this' (Lucene's equivalent of Google's related sites) to generate a list of the most similar Wikipedia articles. There's a rough code sketch of this below.
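To make the recipe concrete, here's my own guess at what those three steps might look like in Java against a recent Lucene release. This is a sketch, not Chris' actual code: the WikipediaTagger class, the field names and the "wikipedia-articles"/"wikipedia-index" paths are all placeholders I've made up, and only the MoreLikeThis class and its like() call come straight from Lucene itself.

```java
import java.io.StringReader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class WikipediaTagger {

    // Step 2: index every saved article (one plain-text file per article,
    // named after its Wikipedia title, e.g. "Impact_event.txt").
    static void buildIndex(Path articlesDir, Path indexDir) throws Exception {
        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            try (DirectoryStream<Path> articles = Files.newDirectoryStream(articlesDir)) {
                for (Path article : articles) {
                    Document doc = new Document();
                    doc.add(new StringField("title", article.getFileName().toString(),
                            Field.Store.YES));
                    doc.add(new TextField("body", Files.readString(article),
                            Field.Store.NO));
                    writer.addDocument(doc);
                }
            }
        }
    }

    // Step 3: turn a candidate text into a 'More like this' query and print
    // the titles of the ten most similar Wikipedia articles.
    static void tag(Path indexDir, String candidateText) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(new StandardAnalyzer());  // must match the indexing analyzer
            mlt.setFieldNames(new String[] {"body"}); // pick query terms from the body field

            Query query = mlt.like("body", new StringReader(candidateText));
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title"));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path index = Paths.get("wikipedia-index");
        buildIndex(Paths.get("wikipedia-articles"), index);
        tag(index, "A newly discovered asteroid could pass close to Earth in 2029...");
    }
}
```

The nice part is that the index only has to be built once; after that, tagging each new piece of text is a single query.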
I really like this approach. It's all statistically based, so you get very broad and robust coverage without having to sweat over hand-tuned vocabularies. I'm also a firm believer in using Wikipedia as a list of concepts for semantic analysis. The one downside is that the current implementation of the 'More like this' functionality is slow; it can take 20-30 seconds to process an article. Happily that seems open to improvement, rather than anything fundamental.
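If I were trying to speed it up, the first place I'd look is the size of the query that 'More like this' generates, since it builds a big disjunction query out of the candidate text's most distinctive words. Lucene's MoreLikeThis class exposes a few knobs for that; the values below are pure guesswork on my part and would need testing against the real index.

```java
import org.apache.lucene.queries.mlt.MoreLikeThis;

class MltSpeedTuning {
    // Guessed values: the right trade-off between speed and tag quality
    // would need experimenting against the full Wikipedia index.
    static void tuneForSpeed(MoreLikeThis mlt) {
        mlt.setMinTermFreq(3);    // ignore words that barely appear in the candidate text
        mlt.setMinDocFreq(10);    // ignore words that are rare across the Wikipedia corpus
        mlt.setMaxQueryTerms(10); // cap the number of terms in the generated query
    }
}
```

Fewer query terms means a cheaper search at some cost in recall, which for a tagging application like mine seems like the right trade.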