A description of a document is really useful if you’re dealing with large amounts of information. If I’m searching through emails, or even when they’re coming in, I can usually decide whether a particular message is worth reading based on a short description. Unfortunately, creating a full human-quality description is an AI-complete problem, since it requires an understanding of an email’s meaning.
Automatic tag generation is a promising and practical way of creating a short-hand overview of a text, with a few unusual words pulled out. It’s somewhat natural, because people do seem to classify objects mentally using a handful of subject headings, even if they wouldn’t express a description that way in conversation.
If you asked someone what a particular email was about, she’d probably reply with a few complete sentences: "John was asking about the positronic generator specs. He was concerned they wouldn’t be ready in time, and asked you to give him an estimate." This sort of summary also requires full AI, but it is possible to at least mimic the general form of this type of description, even if the content won’t be as high-quality.
The most common place you encounter this is on Google’s search results page:
The summary is generated by finding one or two sentences in the page that contain the terms you’re searching for. If there are multiple occurrences, the sentences earliest in the text are usually favored, along with ones that pack the most terms closest together. It’s not a very natural-looking summary, but it does a good job of picking out quotations relevant to your query, and of giving you a good idea of whether the page actually talks about what you want to know.
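The general idea is simple enough to sketch. Here’s a minimal, assumed version of term-based snippet extraction: score each sentence by how many distinct query terms it contains, breaking ties in favor of earlier sentences. The function name and scoring details are my own invention, not Google’s actual algorithm.

```python
import re

def snippet(text, query_terms, max_sentences=2):
    """Pick the sentences that best match the query terms.

    A rough sketch: more distinct matching terms wins, and earlier
    sentences break ties, as described above. Not Google's real method.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    terms = [t.lower() for t in query_terms]

    def score(item):
        index, sentence = item
        lowered = sentence.lower()
        hits = sum(1 for t in terms if t in lowered)
        # Sort key: most hits first, then earliest position in the text.
        return (-hits, index)

    best = sorted(enumerate(sentences), key=score)[:max_sentences]
    best.sort(key=lambda item: item[0])  # restore document order
    return ' ... '.join(s for _, s in best)
```

A real implementation would also consider how close together the terms fall within a sentence, which this sketch ignores.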
Amazon’s statistically improbable phrases for books are an interesting approach: they try to identify combinations of words that are distinctive to a particular book. These are almost more like tags, and are found by a method similar to statistics-based automatic tagging, spotting combinations that are frequent in a particular book but not as common in a background of related items. They don’t act as a very good description in practice; they’re more useful as a tool for discovering distinctive content. I also discovered they’ve introduced capitalized phrases, which serve a similar purpose. That’s an intriguing hack on the English language to discover proper nouns, and I may need to copy that approach.
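The statistical idea behind improbable phrases can be sketched in a few lines: count two-word phrases in the document, compare each phrase’s relative frequency against a background corpus, and keep the biggest ratios. This is my own toy version, with an assumed smoothing floor for phrases absent from the background; Amazon’s actual method is unpublished and surely more sophisticated.

```python
from collections import Counter

def improbable_phrases(text, background, top_n=5):
    """Find two-word phrases far more common in `text` than in `background`.

    A toy sketch of the statistical idea, not Amazon's real algorithm.
    """
    def bigram_freqs(words):
        pairs = Counter(zip(words, words[1:]))
        total = max(sum(pairs.values()), 1)
        return {p: c / total for p, c in pairs.items()}

    doc = bigram_freqs(text.lower().split())
    bg = bigram_freqs(background.lower().split())
    floor = 1e-6  # assumed smoothing for phrases missing from the background
    scored = {p: f / bg.get(p, floor) for p, f in doc.items()}
    top = sorted(scored, key=scored.get, reverse=True)[:top_n]
    return [' '.join(p) for p in top]
```

In practice you’d also need stemming, punctuation handling, and a much larger background corpus of related books for the ratios to mean anything.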
The final, and most natural, type of summary is created by picking out key sentences from the text, and possibly shortening them. Microsoft Word’s implementation is the most widely used, and it isn’t very good. There’s also an online summarizer you can experiment with that suffers from a lot of the same problems.
There are two big barriers to getting good summaries with this method. First, it’s hard to identify which bits of the document are actually important. Most methods use location (whether a sentence is a heading or the start of a paragraph) and the statistical frequency of unusual words, but these aren’t very good predictors. Second, even once you’ve picked the sentences you want to use, there’s very little guarantee they will make any sense when strung together outside the context of the full document. You often end up with a very confusing narrative. Even MS, in the description of their auto summary tool, acknowledge that at best it produces a starting point you’ll need to edit.
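To make the location-plus-frequency heuristic concrete, here’s a minimal sketch of an extractive summarizer. It scores each sentence by the average document frequency of its words (a stand-in for "contains words that recur in this text") plus a positional boost for earlier sentences, then returns the top scorers in document order. All names and weights here are assumptions for illustration, not anyone’s production algorithm.

```python
import re
from collections import Counter

def extract_summary(text, num_sentences=2):
    """Pick the highest-scoring sentences as a crude summary.

    A toy version of the location-plus-word-frequency heuristics
    described above; real summarizers do quite a bit more.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freqs = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(index, sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        if not tokens:
            return 0.0
        # Sentences full of words that recur in the document score higher.
        content = sum(freqs[t] for t in tokens) / len(tokens)
        # Earlier sentences (openers, headings) get a positional boost.
        position = 1.0 / (1 + index)
        return content + position

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(i, sentences[i]),
                    reverse=True)[:num_sentences]
    # Emit in original order, which is exactly where the incoherence
    # problem shows up: adjacent output sentences may have no connection.
    return ' '.join(sentences[i] for i in sorted(ranked))
```

Running this on anything real makes the second barrier obvious: the selected sentences are plausible in isolation but rarely form a readable narrative.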
Overall, for my purposes, displaying something like Google’s or Amazon’s summaries for an email might be useful, though I’ll have to see if it’s any better than just showing the first sentence or two of a message. The other approaches to producing a more natural summary don’t look good enough to be worth using.