To build a system that pulls information from large email stores, I need three processing stages. Capture pulls the information from the source, whether that means using Exchange APIs to read from a server, libgmailer, or plain screen scraping. Analysis takes that data, extracts things like the social network, and tags the content. Presentation takes the information that the analysis produces and displays it to the end users in a compelling form.
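As a rough illustration of how the three stages could hang together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the message record layout, the function names `capture_sample`, `analyse` and `present`, and the sample addresses are all hypothetical, and the capture stage is a hard-coded stand-in rather than real Exchange or webmail access.

```python
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical message record: a dict with "from", "to" (list) and "subject".
Message = dict


def capture_sample() -> Iterator[Message]:
    """Stand-in capture stage; a real one would read from Exchange or webmail."""
    yield {"from": "alice@example.com", "to": ["bob@example.com"],
           "subject": "Lunch"}
    yield {"from": "alice@example.com",
           "to": ["bob@example.com", "carol@example.com"],
           "subject": "Plan"}
    yield {"from": "bob@example.com", "to": ["alice@example.com"],
           "subject": "Re: Lunch"}


def analyse(messages: Iterable[Message]) -> Counter:
    """Analysis stage: count how often each sender writes to each recipient."""
    edges = Counter()
    for msg in messages:
        for recipient in msg["to"]:
            edges[(msg["from"], recipient)] += 1
    return edges


def present(edges: Counter) -> str:
    """Presentation stage: plain-text view of the network, busiest links first."""
    return "\n".join(
        f"{sender} -> {recipient}: {count}"
        for (sender, recipient), count in edges.most_common()
    )


if __name__ == "__main__":
    print(present(analyse(capture_sample())))
```

The point of the sketch is only the shape: each stage consumes the previous stage's output and knows nothing about how it was produced, so any one of them can be swapped out independently.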
Most of the innovation is going to be in the analysis and presentation, but getting the capture right, whilst not ground-breaking, will be a lot of code. I need to decouple the analysis implementation from the capture technology, so that the same analysis code can be used for both webmail and Exchange, for example. That requires a common interchange format for the capture stage to write and the analysis stage to read. I want a human-readable, text-based format that is easy to debug and to implement in a variety of languages, flexible enough to cope with a lot of changes in structure, and well supported by existing tools. Those all argue for something XML-based. Luckily there’s already a draft email XML standard I can build on.
Unfortunately, it looks like it never made it past the draft stage and now seems abandoned, but it’s a good starting point for me to use. Most of the tag names come from RFC822, so it’s an easy conversion from either raw message text or the MAPI functions. It only deals with individual messages, rather than the large sets I need, but it can be logically extended to have a hierarchical folder structure.
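To make the conversion and the folder extension concrete, here is a small Python sketch using only the standard library. The element names are my own assumptions for illustration (`folder`, `message`, `body`, and lowercased RFC822 header names), not necessarily the names the draft standard uses, and the helper names `message_to_xml` and `folder_to_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# Headers copied into the XML. Element names are the lowercased header names,
# which is an assumption and may differ from the draft standard's tag names.
HEADERS = ("From", "To", "Subject", "Date")


def message_to_xml(raw: str) -> ET.Element:
    """Convert one raw RFC822 message into a <message> element."""
    parsed = message_from_string(raw)
    node = ET.Element("message")
    for name in HEADERS:
        value = parsed.get(name)
        if value is not None:
            ET.SubElement(node, name.lower()).text = value
    # Only single-part bodies are handled in this sketch.
    if not parsed.is_multipart():
        ET.SubElement(node, "body").text = parsed.get_payload()
    return node


def folder_to_xml(name, raw_messages, subfolders=()):
    """Wrap messages and already-built child folders in a <folder> element."""
    node = ET.Element("folder", {"name": name})
    for raw in raw_messages:
        node.append(message_to_xml(raw))
    for child in subfolders:
        node.append(child)
    return node


RAW = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "Noon?\n"
)

if __name__ == "__main__":
    archive = folder_to_xml("Archive", [])
    inbox = folder_to_xml("Inbox", [RAW], [archive])
    print(ET.tostring(inbox, encoding="unicode"))
```

Because folders are just elements that can contain both messages and other folders, the per-message format stays untouched and the hierarchy is layered on top of it.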