I’m very excited about the potential for email as a data source. I’m so passionate about it that it’s hard to remember that for most people it’s a new idea, and it’s not obvious how it could be useful. To explain, I usually point to existing companies like Spoke or Contact Networks that pull out basic contact information, or Microsoft’s Knowledge Network experiment that automatically locates experts. But what truly gets my heart racing are all the applications that haven’t been possible without easy access to email data.
One very promising area is extracting events from messages. Dates are one of the easier entities to spot in unstructured language. PHP has a built-in function, strtotime(), that can convert most English time strings into an absolute timestamp, even fuzzy ones like "next Thursday" or "now". Getting the rest of the information, like the name of an event, is tougher, but imagine a calendar view that simply shows the subject line of each email at every time that’s mentioned in the body of the message. You could restrict the view to genuine contacts only (at a minimum, people you’ve replied to) and then with a single click transfer any true events to your appointments calendar.
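To make the calendar idea concrete, here is a minimal sketch in Python rather than PHP. It does not use strtotime(); it swaps in two simple regular expressions that only recognize ISO dates and weekday names, and it treats "Thursday" and "next Thursday" alike as the first such day after today, which is a simplifying assumption of mine. The names find_dates, resolve_weekday and calendar_view are hypothetical, made up for this illustration.

```python
import re
from datetime import date, timedelta
from email import message_from_string

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Matches "Thursday" or "next Thursday", case-insensitively.
WEEKDAY_RE = re.compile(r"\b(next\s+)?(" + "|".join(WEEKDAYS) + r")\b",
                        re.IGNORECASE)
# Matches ISO-style dates such as 2008-03-14.
ISO_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def resolve_weekday(name, today):
    """Return the first date strictly after `today` that falls on `name`."""
    target = WEEKDAYS.index(name.lower())
    days_ahead = (target - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


def find_dates(text, today):
    """Return the sorted, de-duplicated dates mentioned in `text`."""
    found = set()
    for m in ISO_RE.finditer(text):
        try:
            found.add(date(int(m.group(1)), int(m.group(2)), int(m.group(3))))
        except ValueError:
            pass  # looks like a date but is not one, e.g. 2008-13-45
    for m in WEEKDAY_RE.finditer(text):
        found.add(resolve_weekday(m.group(2), today))
    return sorted(found)


def calendar_view(raw_messages, today):
    """Map each mentioned date to the subjects of the messages mentioning it."""
    view = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        body = msg.get_payload()
        if not isinstance(body, str):
            continue  # this sketch skips multipart messages
        for d in find_dates(body, today):
            view.setdefault(d, []).append(msg["Subject"] or "(no subject)")
    return view


if __name__ == "__main__":
    today = date(2008, 3, 10)  # a Monday
    inbox = [
        "Subject: Lunch\n\nHow about next Thursday?",
        "Subject: Launch\n\nShip date is 2008-04-01.",
    ]
    for day, subjects in sorted(calendar_view(inbox, today).items()):
        print(day.isoformat(), subjects)
```

A real version would need a much richer date grammar, plus the filter for genuine contacts, but even this crude pass is enough to place subject lines on a calendar.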
So why isn’t this already implemented? Gmail has something similar in its Gcal integration, but it’s very limited in the formats it will recognize. There are papers out there like Learned Automatic Recognition of Appointments from Email by Lauren Paone at CMU, but as a quote from the paper puts it, "Although email is ubiquitous, large and realistic email corpora are rarely available for research purposes." Paone faced serious obstacles even running realistic tests, for lack of enough email to work with.
What’s stopping progress is the mind-numbing pain of first getting data to prototype with (though the Enron corpus makes that somewhat easier) and then, even worse, trying to integrate with an email service like Outlook/Exchange or through IMAP. Innovation is rapid in the web world because anybody can spider the public internet and offer the results through a website. A few large companies like Google and Yahoo have access to their users’ emails and so can create email-based tools, and if you can persuade users to install a desktop plugin you can do the same. But the only way to move things forward is to open up access to a lot more developers. I’ll be posting on some of my efforts in that direction soon.