I’m interested in ways of automatically categorizing emails, so I’ve been experimenting with some of the recently launched semantic analysis services. Earlier I set up an OpenCalais demo, and next I tried out Semantic Hacker. Luckily they already have an online demonstration page, which made testing a lot easier. To get a rough idea of how it worked, I tried two different pieces of text. The first was a news story about deflecting asteroids that came as part of the OpenCalais test suite, and the second was the text of a recent post I wrote about Independence Day. I chose these because the news piece is full of concrete names, places and organizations, while my post covers a more abstract topic, and I wanted to understand how emails of each kind would be handled.
For the news story, Semantic Hacker did a great job of picking out the main topic, suggesting 5 extremely relevant categories.
OpenCalais, by contrast, picked out a lot of organizations and places, but didn’t really try to summarize the overall meaning of the document.
Both systems did a lousy job with the 4th of July post. Semantic Hacker suggested the Boy Scouts as the top category, followed by the Knights of Columbus, which I can only guess came up because I mention patriotism a lot. The first couple of related Wikipedia articles were reasonably relevant, at least.
OpenCalais picked up some of the places I mentioned, like Juneau, though it assumed LA was Louisiana! It didn’t get any of the more abstract concepts, apart from the holiday name itself.
So far, what I’m seeing confirms my instinct that general semantic analysis and categorization is AI-complete, but that some of these tools might be useful for limited applications, like pulling out locations, organizations and technical terms from emails. My next experiments are going to focus on statistical methods of pulling out interesting words and phrases.
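To give a flavor of what a statistical approach could look like, here is a minimal sketch that ranks the words in one message by TF-IDF against a handful of other messages. This is only my assumption about one possible starting point, not a method I’ve settled on, and the function names (tokenize, interesting_words) and the sample texts are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def interesting_words(document, background_docs, top_n=5):
    """Rank words in `document` by TF-IDF against `background_docs`.

    Hypothetical helper: words that are frequent in this document but rare
    in the background messages score highest.
    """
    doc_counts = Counter(tokenize(document))
    total = sum(doc_counts.values())

    # Document frequency: how many background messages contain each word.
    doc_freq = Counter()
    for other in background_docs:
        doc_freq.update(set(tokenize(other)))

    # Count the document itself as one more message in the collection.
    n_docs = len(background_docs) + 1

    scores = {}
    for word, count in doc_counts.items():
        tf = count / total
        # The +1 accounts for the document itself containing the word.
        idf = math.log(n_docs / (1 + doc_freq[word]))
        scores[word] = tf * idf

    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    message = ("The asteroid mission will deflect the asteroid "
               "before the asteroid reaches Earth")
    others = [
        "The meeting is on Monday",
        "The report is due before the weekend",
        "Lunch is at the usual place",
    ]
    print(interesting_words(message, others))
```

Common words like "the" show up in every message, so their score drops to zero, while a word that is repeated in one message and absent from the rest floats to the top. Phrases would need an extra step, such as scoring pairs of adjacent words the same way.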