Photo by Photobunny
Breaking down information silos is the key to making better tools. Email stores are the biggest and most interesting silos out there, and one reason for the lack of progress is the lack of interchange standards between mail systems. Sure, there's IMAP/POP, and RFCs galore, but they're all either connection-oriented transport protocols or, like MIME, hard to decode with modern tools. For my own work I'm taking mails from diverse sources such as Gmail through IMAP, Outlook through OOM and Exchange through MAPI, and converting them into XML so that I can write the rest of my pipeline once and ignore where the mails came from.
Seeing Tim O'Reilly asking Postbox about their XML use reminded me that an agreed standard for email in XML would help everyone. XMTP is an effort based on RFCs, but a simple duplication of headers into XML tags is not much different than parsing the original raw text. What I needed was something that had a layered approach, hiding details like the exact type of a recipient to allow easy dumping of everybody who received it, rather than having to separately collate the to, cc and bcc headers. And nobody should ever have to deal with MIME's multi-part implementation ever again.
Here's some information on my format, with a DTD and an example encoded message. It's aimed at my need to pass around messages within a data analysis pipeline, so it skips a lot of less-used headers, but it captures what I need. I'll put together a minimal expat-based PHP parser in the future. Contact me if you're using any other email XML formats; I want to understand what else is out there.
In style I've completely avoided attributes, putting everything within the data section of a tag. This makes parsing simpler, and also brings it closer to JSON-style notation for easy data interchange using map arrays in languages like PHP.
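To illustrate why the attribute-free style maps so easily onto map arrays, here is a minimal sketch in Python (standing in for the PHP parser, which doesn't exist yet). The helper name element_to_data, the LIST_TAGS set and the sample message are my own assumptions for illustration; the nesting is a guess at the layout, not a copy of the DTD.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: these are the tags that wrap unordered lists of child objects.
LIST_TAGS = {"messagelist", "recipients", "attachments"}


def element_to_data(element):
    """Convert an attribute-free element into lists, dicts and strings."""
    children = list(element)
    if element.tag in LIST_TAGS:
        # List containers become plain lists of their children.
        return [element_to_data(child) for child in children]
    if not children:
        # Leaf tags hold their value as text, never as attributes.
        return (element.text or "").strip()
    # Everything else becomes a map keyed by child tag name.
    return {child.tag: element_to_data(child) for child in children}


# Hypothetical sample, trimmed to a few tags.
SAMPLE = """<messagelist>
  <message>
    <subject>Lunch?</subject>
    <fromaddress>alice@example.com</fromaddress>
    <recipients>
      <recipient>
        <address>bob@example.com</address>
        <role>to</role>
      </recipient>
    </recipients>
  </message>
</messagelist>"""

data = element_to_data(ET.fromstring(SAMPLE))
print(json.dumps(data, indent=2))
```

Because no information lives in attributes, the whole conversion fits in one short recursive function, and the result can be dumped straight to JSON.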
The example message demonstrates the tags in use; it contains a plain text and HTML body, along with a single image attachment. Here's an explanation of the tag types:
<messagelist> This surrounds an unordered list of <message> objects
<message> Contains all the data for a message
<messageuid> A globally unique ID for the message (e.g. a UUID)
<sourceuid> Some ID that uniquely identifies the message at the location where it originated (e.g. an EntryID in Outlook). This is different from the <messageuid> because different copies of the same message may be present in the pipeline.
<subject> The subject line of the email
<fromaddress> The email address of the sender
<fromdisplay> The display name of the sender
<deliverytime> The time of arrival for the message in the recipient's inbox. Stored in Y-m-d H:i:s format (a time zone still needs to be added; for now GMT is assumed).
<recipients> Surrounds an unordered list of <recipient> objects
<recipient> Contains information about an individual recipient
<address> The email address for a recipient
<display> The display name for a recipient
<role> The type of recipient, either 'to', 'cc' or 'bcc'
<contenttext> The plain text version of the message body or an attachment. My tools take .doc, .pdf, and .xls attachments and convert them into both text and HTML versions for easy searching, analysis and viewing.
<contenthtml> The HTML version of the message body or attachment.
<sourcefolder> Somewhat misnamed, this actually indicates whether the mail was 'sent' or 'received'.
<attachments> Surrounds an unordered list of <attachment> objects
<attachment> Begins an individual attachment
<attachmentuid> A globally unique identifier to refer to the attachment
<filename> The full filename of the attachment
<filetype> The MIME type of the attached file
<filedata64> The actual data for the attachment, base64 encoded into a text form
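As a rough sketch of how a pipeline stage might consume these tags, the Python below lists everybody who received a message without caring about their role, reads <deliverytime> as GMT, and decodes an attachment. The sample message and its exact nesting are assumptions made up for illustration rather than taken from the DTD.

```python
import base64
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical message; values and nesting are illustrative assumptions.
SAMPLE = """<message>
  <messageuid>123e4567-e89b-12d3-a456-426614174000</messageuid>
  <subject>Quarterly report</subject>
  <fromaddress>alice@example.com</fromaddress>
  <deliverytime>2008-10-01 14:30:00</deliverytime>
  <recipients>
    <recipient>
      <address>bob@example.com</address>
      <display>Bob</display>
      <role>to</role>
    </recipient>
    <recipient>
      <address>carol@example.com</address>
      <display>Carol</display>
      <role>cc</role>
    </recipient>
  </recipients>
  <attachments>
    <attachment>
      <filename>note.txt</filename>
      <filetype>text/plain</filetype>
      <filedata64>aGVsbG8=</filedata64>
    </attachment>
  </attachments>
</message>"""

message = ET.fromstring(SAMPLE)

# Everybody who received the mail, regardless of to/cc/bcc role.
all_addresses = [r.findtext("address") for r in message.iter("recipient")]

# PHP's Y-m-d H:i:s corresponds to this strptime pattern; GMT is assumed.
delivered = datetime.strptime(
    message.findtext("deliverytime"), "%Y-%m-%d %H:%M:%S"
).replace(tzinfo=timezone.utc)

# Attachment bytes are recovered by base64-decoding <filedata64>.
attachment = message.find("attachments/attachment")
payload = base64.b64decode(attachment.findtext("filedata64"))

print(all_addresses)
print(delivered.isoformat())
print(attachment.findtext("filename"), len(payload), "bytes")
```

Filtering by <role> only becomes necessary when a stage actually cares about it, which is the layered approach I was after.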