Maybe it was my weekly D&D game last night, but probability is on my mind. One thing I’ve learnt from working in games is that accuracy is overrated in AI. Most problems in that domain have no perfect solution. The trick is to find a technique that’s right often enough to be useful, and then make it part of a workflow that makes coping with the incorrect guesses painless for the user.
A lot of Amazon’s algorithms work like this. They recommend other books based on rough statistical measures that bring up mostly uninteresting items, but they’re right often enough to justify me spending a few seconds looking at what they found. The same goes for their statistically improbable phrases (SIPs). They’re odd and random most of the time, but usually one or two of them do give me an insight into the book’s contents.
This is interesting for email because when I’m searching through a lot of messages I need a quick way to understand something about what they contain without reading the whole text. One of the key features of Google’s search results is the summary they extract surrounding the keywords for each hit. This gives you a pretty good idea of what the page is actually discussing. In a similar way I want to present some key phrases from an email that very quickly give you a sense of what it’s about.
The main approach I’m using is vanilla SIPs, but there are a couple of other interesting heuristics (sounds so much more technical than ‘ways of guessing’). The first is looking for capitalized phrases within sentences. These are usually proper nouns, so you’ll get a rough idea of what people or places are discussed in a document. The second is to find sentences that end with a question mark, so you can see what questions are asked in an email.
These are fun because they’re both reliant on easily-parsed quirks of the language, rather than deep semantic processing. This means they’re quick and easy to implement. It also means that they’re not very portable to other languages (German capitalizes all nouns, for example), but one problem at a time!
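To give a feel for how little code the two heuristics need, here is a rough Python sketch. It is an illustration under some loud assumptions rather than the code actually used: the function names (split_sentences, capitalized_phrases, questions) are made up for this example, sentences are split naively on '.', '!' and '?', and only plain ASCII capitals are recognized. Abbreviations like "Dr." will confuse the sentence splitter, which is exactly the kind of wrong guess the surrounding workflow has to tolerate.

```python
import re

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# A run of one or more capitalized words, e.g. "New York".
CAPITALIZED_RUN = re.compile(r"\b[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*")


def split_sentences(text):
    """Split text into sentences using the naive boundary above."""
    return [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]


def capitalized_phrases(text):
    """Return capitalized phrases found after the first word of each sentence."""
    phrases = []
    for sentence in split_sentences(text):
        # Drop the first word: it is capitalized whether or not it is a proper noun.
        parts = sentence.split(None, 1)
        if len(parts) < 2:
            continue
        for phrase in CAPITALIZED_RUN.findall(parts[1]):
            # The pronoun "I" is an obvious false positive, so skip it.
            if phrase != "I":
                phrases.append(phrase)
    return phrases


def questions(text):
    """Return the sentences that end with a question mark."""
    return [s for s in split_sentences(text) if s.endswith("?")]


if __name__ == "__main__":
    email = (
        "Did you talk to Sarah Jones about the trip to New York? "
        "I think she is flying out on Monday. Let me know!"
    )
    print(capitalized_phrases(email))
    print(questions(email))
```

Skipping the first word of each sentence is what restricts the search to capitals "within sentences": it loses any proper noun that happens to start a sentence, but avoids flagging every "The" and "Did" as a name.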