What’s the best way to search large amounts of email?

MarkMail is a really interesting demonstration site for MarkLogic’s technology. They host archives of a number of development mailing lists for projects like Apache and Perl. You can search within each list, and the results are presented in a three-panel format.

The left panel shows you the frequency of the search terms over time, and suggests some different ways to narrow your search by focusing on subsets of the list or on particular contributors who mention the term frequently. The middle panel is more like a conventional results page, listing links to all the matching messages. It also lets you reorder the results by date instead of relevance. The right panel shows you the content of the message, and other matching messages from around the same time.
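
To make the left panel's frequency chart concrete, here is a minimal sketch of how matches could be bucketed by month. It is only an illustration, not MarkMail's implementation: the `monthly_hits` function, the `(date, text)` message tuples and the sample messages are all made up for the example, and matching is a plain substring test.

```python
from collections import Counter
from datetime import date


def monthly_hits(messages, term):
    """Count messages whose text contains `term`, bucketed by (year, month)."""
    term = term.lower()
    counts = Counter()
    for sent, text in messages:
        if term in text.lower():
            counts[(sent.year, sent.month)] += 1
    # Sort chronologically so the result can be drawn as a timeline.
    return sorted(counts.items())


# Hypothetical sample data standing in for a mailing list archive.
messages = [
    (date(2007, 1, 5), "Thread safety in the connection pool"),
    (date(2007, 1, 20), "Re: thread safety in the connection pool"),
    (date(2007, 3, 2), "Vacuum settings for large tables"),
    (date(2007, 3, 9), "Is libpq thread safe?"),
]

# Print a crude text histogram, one row per month that has matches.
for (year, month), hits in monthly_hits(messages, "thread"):
    print(f"{year}-{month:02d} {'#' * hits}")
```

A real archive would need an index rather than a scan over every message, but the bucketing step would look much the same.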

I like this interface a lot. It’s the best presentation of time in search results that I’ve seen, combining the information offered by Google Trends with all the facilities of a normal search. I’m a big believer in using a horizontal split for previews too.

Beyond presentation, they also have a lot of advantages over a web search engine because they understand the structure of mail messages. They let you search on subjects, authors and unquoted text, and they can ignore boilerplate material like disclaimer sigs and checkin notices. Much like Krugle focusing on function names, they can also use their knowledge of the structure to offer more relevant general results, giving more weight to the subject line than to text in the body when working out the relevance of a result. This gives them an advantage over Google searching the same content as a web archive, since Google has no idea how significant each part of a page is. Anyone who’s ever tried to do a mailing list search for "thread" through Google will know that it can be hard if the archive pages include interface elements that use "thread" to refer to topic browsing, such as "Next in thread". As an example, here’s a Google search on the postgresql archives for thread where 2 of the top 3 results are for thread interface references. By contrast, all of the MarkMail results for the same search cover discussion of threads in the body of the message.
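
The idea of weighting the subject line above the body, and skipping quoted text, can be sketched in a few lines. This is a toy scoring function under stated assumptions, not MarkMail's actual ranking: the weights, the `score` and `unquoted_text` helpers, and the rule that quoted lines start with ">" are all chosen just for illustration.

```python
import re

# Assumed weights: a subject hit counts three times as much as a body hit.
SUBJECT_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def unquoted_text(body):
    """Drop reply-quoted lines, assumed here to be lines starting with '>'."""
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)


def count_term(text, term):
    """Count whole-word, case-insensitive occurrences of `term` in `text`."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def score(message, term):
    """Weighted hit count: subject hits outweigh hits in unquoted body text."""
    subject_hits = count_term(message["subject"], term)
    body_hits = count_term(unquoted_text(message["body"]), term)
    return SUBJECT_WEIGHT * subject_hits + BODY_WEIGHT * body_hits


# Hypothetical messages; the first only mentions the term inside quoted text.
messages = [
    {"subject": "Vacuum settings",
     "body": "> thread\n> thread\nNo idea, sorry."},
    {"subject": "Thread safety in libpq",
     "body": "Is each thread given its own connection?"},
    {"subject": "Index bloat",
     "body": "Found this in an older thread on the list."},
]

ranked = sorted(messages, key=lambda m: score(m, "thread"), reverse=True)
for message in ranked:
    print(score(message, "thread"), message["subject"])
```

A web crawler looking at the rendered archive page sees the subject, the quoted text and a "Next in thread" link as the same kind of text, which is why it cannot apply this sort of weighting.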

Under the hood they’re using an interesting mix of technology. On their blog, Jason Hunter posted a presentation covering the nuts and bolts of how they’ve built their search engine. Like me, they’ve gone the route of defining an XML format to store the messages in.

I’m currently using XML as an interchange format, but was going down the standard relational/MySQL route for my database. MarkMail is powered entirely by the XQuery language, with the data stored as XML rather than converted to some processed database format. I couldn’t find any information on the technology they use to implement this (Saxon?), but it would be a lot simpler to do a single conversion to XML and then operate on it, rather than trying to do input and output conversions from MySQL. Fascinating stuff, I’ll have to see if I can get any more information from the team.
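
As a rough sketch of the "convert once to XML, then operate on it" approach, the snippet below turns a raw message into an XML element and queries the result directly. It uses Python's standard `email` and `xml.etree.ElementTree` modules rather than XQuery, and the element names (`archive`, `message`, `from`, `subject`, `date`, `body`) are invented for the example; they are not MarkMail's format.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# A made-up message in plain RFC 822 style.
RAW = """\
From: alice@example.org
Subject: Thread safety in libpq
Date: Mon, 05 Feb 2007 10:00:00 +0000

Is each thread given its own connection?
"""


def to_xml(raw):
    """Convert one raw, single-part message into a <message> element."""
    msg = message_from_string(raw)
    root = ET.Element("message")
    for header in ("From", "Subject", "Date"):
        ET.SubElement(root, header.lower()).text = msg[header]
    ET.SubElement(root, "body").text = msg.get_payload().strip()
    return root


# The one-off conversion step: every message becomes a child of <archive>.
archive = ET.Element("archive")
archive.append(to_xml(RAW))

# Querying works on the XML itself, with no second storage format involved.
subjects = [
    m.findtext("subject")
    for m in archive.findall("message")
    if "thread" in m.findtext("body", "").lower()
]
print(subjects)
print(ET.tostring(archive, encoding="unicode"))
```

ElementTree only supports a small subset of XPath, so anything beyond simple lookups ends up as Python code around it. That kind of filtering and ranking is what an XQuery engine would let you express in the query itself.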
