Search massive XML datasets with MarkLogic

Bookpile
Photo by GeorgMayer

After my last post on the MarkMail project, I heard back from MarkLogic’s Jason Hunter with some more information about the underlying implementation. Almost all the capabilities are provided by the MarkLogic database server, which seems to offer an impressive set of features. I initially had some trouble finding the technical information on their main site, since it’s mostly geared towards the high-end content publishers who are the main users of the system, but then I came across their developer center.

“What is MarkLogic Server?” is probably the best place to start for an overview of what they offer. Essentially, the server differs from a standard relational database by accepting comparatively unstructured data, without a rigid schema, and by focusing on the fast search and retrieval performance you need for any publishing system. As they’ve demonstrated with MarkMail, this makes a good interface for large collections of email too. The technology has been battle-tested: it has been deployed in situations dealing with terabytes of data, and it can run in a distributed cluster so you can scale performance to cope with heavy loads.

As well as the MarkMail demonstration, they also have the Executive Pay Check site that lets you do a live search on 14A filings to see the salaries of leaders at public companies. This is interesting mostly because it’s doing a good job coping with some theoretically structured, but in practice quite messy, source data, with inconsistent naming and formatting for the tables holding the filings. It would require some heavy massaging to get this into a traditional relational database, but MarkLogic seems to be a lot better at handling that sort of problem.
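To make the "no rigid schema" idea a little more concrete, here is a minimal Python sketch of searching XML documents whose element names and nesting don't agree with each other. This is not MarkLogic's API or implementation, just a toy illustration using the standard library; the two sample filings, their element names, and the `search` helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Two invented documents describing the same kind of data with different
# element names and nesting (hypothetical samples, not real 14A content).
FILINGS = {
    "filing-a.xml": """
<filing>
  <company>Example Corp</company>
  <compensationTable>
    <row><name>Pat Smith</name><title>CEO</title><salary>900000</salary></row>
  </compensationTable>
</filing>""",
    "filing-b.xml": """
<proxy>
  <issuer>Sample Inc</issuer>
  <section>
    <pay-summary>
      <executive>Lee Jones</executive>
      <position>Chief Executive Officer</position>
      <base>750000</base>
    </pay-summary>
  </section>
</proxy>""",
}


def search(documents, term):
    """Return (document name, element tag, text) for each element whose own
    text contains `term`, ignoring case. No schema is assumed: every element
    in every document is visited, whatever it is called."""
    term = term.lower()
    hits = []
    for name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for element in root.iter():
            text = (element.text or "").strip()
            if term in text.lower():
                hits.append((name, element.tag, text))
    return hits


if __name__ == "__main__":
    for hit in search(FILINGS, "chief executive"):
        print(hit)
```

Loading the same two documents into relational tables would mean first deciding how `name` and `executive`, or `salary` and `base`, map onto shared columns. The sketch sidesteps that by walking whatever structure each document happens to have, at the cost of a full scan on every query; a real engine would presumably rely on indexes rather than scanning.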

There’s a free community version of the engine available for download, so I’ll be experimenting with it when I have the chance. An active developer community has grown up around the product over the last few years, with lots of documentation that I’m absorbing to help my understanding. I’m surprised that it isn’t better known in the search community, since it seems to offer some unique features that would let you build an interesting search engine for all kinds of rich content.
