The best automatic tagger you’ve never heard of


Photo by Sarah Parrot
I was searching for other applications that use Wikipedia entry titles for semantic analysis of texts when I came across Chris Sizemore's conText experiment. Testing it out, I was blown away by how well it worked as an automatic tagger: it did better than commercial semantic analysis solutions like OpenCalais or SemanticHacker.

I used the same two texts I tried those two services with: an asteroid news article and one of my own blog posts. Here are the top ten results for the news article:






And for my 4th of July blog post:






Nine of the top ten results for the asteroid article are ones I'd pick as good categories for it. That's a much better hit ratio than OpenCalais or SemanticHacker managed in my tests. The results for the blog post are all completely unrelated, but the commercial tools do only slightly better, picking out one or two related concepts. Text that's full of abstract musings rather than concrete nouns seems to be bad news for any semantic analysis.

In fairness, I should mention that neither OpenCalais nor SemanticHacker is primarily aimed at my goal, which is to automatically extract a small set of categories from short-form pieces of text (e.g. emails), so the comparison isn't apples to apples. Still, it's good news for me that Chris' approach is so useful for my purpose.

What's really fun about his project is that it's a true garden-shed effort, produced as part of BBC Radio Labs from open-source parts without requiring a massive development budget. Here's how he did it:

– Download the whole of Wikipedia, and save out each article as a file on disk.
– Index all those files using the open-source search framework Lucene.
– For every candidate text, use 'More like this' (Lucene's equivalent of Google's related sites) to generate a list of the most similar Wikipedia articles.
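
To make those three steps concrete, here's a minimal sketch in Python. It is not Chris' code and it doesn't use Lucene: plain TF-IDF weighting with cosine similarity stands in for 'More like this', and a small in-memory dictionary of title-to-text stands in for the Wikipedia dump. The function names (tokenize, build_index, suggest_tags) are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(articles):
    """Index a {title: body} dict (a stand-in for the Wikipedia dump).

    Returns (vectors, idf), where vectors maps each title to a
    {term: tf-idf weight} dict and idf maps each term to its weight.
    """
    term_counts = {title: Counter(tokenize(body))
                   for title, body in articles.items()}
    n_docs = len(term_counts)

    # Document frequency: in how many articles does each term appear?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    # Smoothed inverse document frequency; rarer terms get more weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}

    vectors = {title: {t: c * idf[t] for t, c in counts.items()}
               for title, counts in term_counts.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_tags(text, vectors, idf, top_n=10):
    """Return up to top_n article titles most similar to the text."""
    counts = Counter(tokenize(text))
    # Terms that never occur in the indexed articles are ignored.
    query = {t: c * idf[t] for t, c in counts.items() if t in idf}
    scored = [(cosine(query, vec), title) for title, vec in vectors.items()]
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for score, title in scored[:top_n] if score > 0.0]


if __name__ == "__main__":
    toy_articles = {
        "Asteroid": "An asteroid is a small rocky body orbiting the sun",
        "Lucene": "Lucene is an open source search library for indexing text",
    }
    vectors, idf = build_index(toy_articles)
    print(suggest_tags("A rocky asteroid passed close to the sun", vectors, idf))
```

The article titles that come back are the suggested tags. A real search index avoids comparing the candidate text against every article the way this loop does, which matters a great deal at Wikipedia scale.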

I really like this approach. It's all statistically based, so you get very broad and robust coverage and don't have to sweat over hand-tuned vocabularies. I'm also a firm believer in using Wikipedia as a list of concepts for semantic analysis. The one downside is that the current implementation of the 'More like this' functionality is slow: it can take 20-30 seconds to process an article. Happily, that seems open to improvement rather than being a fundamental limitation.
