One interesting feature of Disruptor Monkey’s Unifyr is its automatic generation of tags from web pages. Good tags are the basis of a folksonomy, and with them I could do some very useful classification and organization of data. With an organization’s email, I’d be able to show people’s areas of expertise if I knew which subjects they sent messages about. This could be the answer to the painful problem of ‘Who can I ask about X?’.
Creating true human-quality tags would require an AI that could understand the content to the same level a human could, so any automatic process will fall short of that. Is there anything out there that will produce good enough results to use?
There are two main approaches to this problem, which is sometimes known as keyword extraction, since it’s very similar to that search engine task. The first is to use statistical analysis to work out which words are significant, with no knowledge of what the words actually mean. This is fundamentally how Google’s search works. The second is to use rules about language and information about the meanings of words to pick out the right words. For example, a system that knows notebook means the same as laptop can count both words toward the same concept. Powerset is going to be using this approach to search. Danny Sullivan has a thought-provoking piece on why he doesn’t think that the method will ever live up to its promise.
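To make the notebook/laptop example concrete, here is a minimal Python sketch of counting synonyms toward one concept. The SYNONYMS table and the concept_counts function are hypothetical names made up for illustration; a real rules-based system would draw on a proper thesaurus rather than a hand-typed dictionary.

```python
import re
from collections import Counter

# Hypothetical, hand-written synonym table (an assumption for illustration).
# Each key is rewritten to the concept it stands for.
SYNONYMS = {
    "notebook": "laptop",
    "notebooks": "laptop",
    "laptops": "laptop",
}

def concept_counts(text):
    """Count words after mapping known synonyms onto a shared concept."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(SYNONYMS.get(word, word) for word in words)

counts = concept_counts("My notebook is faster than your laptop")
print(counts["laptop"])
```

Both words end up under the single key "laptop", which is the whole point of the rules-based approach: the count reflects the concept, not the spelling.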
KEA is an open-source package for keyword extraction, and is towards the rules-based end of the spectrum, though it sets up those rules using training and some standard thesauruses, rather than manually. I was initially very interested, because it’s designed to do exactly what I need, pulling descriptive keywords from a text. Unfortunately, I’d still have to set up a thesaurus and some manually tagged documents for the system to learn from before running it on any information. I would like to start off with something completely unsupervised, so it can be deployed without a skilled operator or any involved setup.
The alternative is to use statistical analysis to identify words that are uncommon in most texts, but common in the particular one you’re looking at. The simplest example I’ve seen is the PHP automatic keyword generation class. You’ll need to register to see the code, but all it does is exclude all stop words, and then return the remaining words, along with two- and three-word phrases, in descending order of frequency. The results are a long way from human tagging, but just good enough to make me think it’s worth expanding.
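The stop-word-and-frequency idea described above can be sketched in a few lines of Python. This is not the PHP class’s code, just my own rough equivalent: the STOP_WORDS list is a tiny placeholder I made up, and candidate_keywords is a hypothetical name. Because stop words are dropped before phrases are built, a phrase can join words that had a stop word between them in the original text.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a usable one has hundreds of entries.
STOP_WORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was with".split()
)

def candidate_keywords(text, max_phrase_len=3):
    """Return (term, count) pairs for words and short phrases, most frequent first."""
    # Lowercase, split into words, and drop the stop words.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter()
    # Count single words, then two-word and three-word phrases.
    for n in range(1, max_phrase_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common()

sample = "The laptop battery died. The laptop battery is old."
for term, count in candidate_keywords(sample)[:3]:
    print(count, term)
```

On a short sample like this one the repeated phrase "laptop battery" scores as highly as the individual words, which is roughly the behavior I’d want from a first pass.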
An obvious next step is to expand the stop word concept, and keep track of the general frequency of a lot more words, so you can exclude other common terms, and focus on the unusual ones. The standard way to do this is to take the frequencies from a large corpus of text, often a general one like the Brown corpus that includes hundreds of articles from a variety of sources. For my purposes, it would also be interesting to use the organization’s overall email store as the corpus, and identify the words a particular employee uses that most others in the company don’t. This would prevent things like the company name from appearing too often.
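Here is a minimal sketch of that comparison in Python, assuming the background corpus is simply a list of strings (for instance, the bodies of everyone else’s emails). The function names and the scoring formula are my own assumptions: each word’s rate in the document is divided by its add-one-smoothed rate in the background, which is one simple choice among many.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def distinctive_words(document, background_texts, top_n=10):
    """Rank the document's words by how over-represented they are versus the background."""
    doc_counts = Counter(tokenize(document))
    bg_counts = Counter()
    for text in background_texts:
        bg_counts.update(tokenize(text))

    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab_size = len(set(doc_counts) | set(bg_counts))

    scores = {}
    for word, count in doc_counts.items():
        doc_rate = count / doc_total
        # Add-one smoothing so words missing from the background don't divide by zero.
        bg_rate = (bg_counts[word] + 1) / (bg_total + vocab_size)
        scores[word] = doc_rate / bg_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

company_mail = [
    "acme sales report for the quarter",
    "acme meeting about the budget",
    "the acme picnic is on friday",
]
print(distinctive_words("acme kernel patch for the kernel scheduler", company_mail))
```

In this made-up example "kernel" comes out on top, while "acme" sinks toward the bottom because it appears throughout the background mail, which is the company-name effect described above.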
You’ll never get human-grade tags from this sort of system, but you can get keywords that are good enough for some tasks. I hope it will be good enough to identify subject-matter experts within a company, but only battle-testing will answer that.
Thanks for such an interesting post, Pete.
I’ve played with automatic tag generation for EventVue. We have the advantage of being able to use the community at a conference to determine the contextual significance of each word. The key for us is to use the dataset of existing tags as the dictionary by which we detect tags in the future.
The dictionaries we use for a tech conference would look very different from those for a lawyers’ convention. Do you think it is possible to find a master dictionary that would work in any context? The best attempt that I’ve seen is tagthe.net, and they have a long way to go.
There are some interesting subject-specific word lists here:
These are for specialized areas like medicine and physics, and include synonyms which could be useful.
I think you’re right that the usefulness of an automatic tagging system goes down as the domain gets broader. You’ll get a lot less noise if you have a small, hand-tuned list of keywords to look for.
http://tagthe.net is interesting, thanks for the link. You’re right that it’s a long way from human tagging, but running this article through the web interface gave me about 8 keywords, half of them useful. I’m hopeful something of that quality would be good enough for what I’m thinking of.
http://nosyjoe.com is using an intelligent tagging engine…
Thanks for sharing those links to the domain-specific word lists. Lists like that could be quite useful to get everything kick-started.