Need a custom Internet Explorer or Outlook plugin?

Wisconsin photo by James Jordan


I recently came across Gigasoft Development, a small firm that specializes in writing IE and Outlook plugins. This is the first group I’ve come across that is solely focused on these, and whilst I’ve never used them myself, their work seems impressive.

If there’s any part of your software development you’d want to contract out, it’s writing extensions for Microsoft products. I know from my own explorations that it’s an incredibly deep field, with undocumented gotchas everywhere you turn. It’s a waste to devote months of your own engineering schedule to relearning all those lessons if it’s not part of your core technology. It’s pretty rare to find good web developers who can also handle hard-core Win32 hacking. Contracting out to a good team of people who already know where the booby-traps are means much quicker and cheaper development.

You can often follow a pattern where the plugin itself is just a thin shim that fetches and renders HTML from a URL you control. That gives you the flexibility and ease of web development for the UI aspects, and means you can update the application logic without touching all those installed plugins.

I also have a soft spot for Gigasoft after looking through their site and spotting that Tom’s a Packers fan from Wisconsin, and they’re based in Illinois. I always love visiting Chicago and Wisconsin when we fly back to see Liz’s family.

Using Outlook to import emails to Exchange is painfully slow

Once I’d converted the Enron emails to a PST and loaded them into Outlook, I thought I was almost done with my quest to get them on my Exchange server. The last remaining step was to copy them to an Outlook folder that’s in an account that’s hosted on that server. With the PST conversion taking about a day, I assumed it would take a while, but after running for 6 days, it’s still only up to the B’s in alphabetical order!

ExMerge is an alternative way to import a PST onto an Exchange server. It only supports non-Unicode files though, and has a 2GB limit, so that doesn’t work for the 5GB Enron data set. Another suggestion (Experts’ Exchange, so scroll down past the ads to see the comments) is to turn off cached mode and do File->Import from within Outlook. I’ve cancelled my current copy, and so far this approach seems a lot faster.

How to use IMAP as a Gmail API in PHP

Palomarstamp
Photo by Voxphoto

I’ve tended to avoid client/server APIs like IMAP or POP for my mail analysis work, because they’re inherently limited to a single account and a lot of the information I’m interested in comes from looking at an entire organization’s data. Mihai Parparita’s work with MailTrends impressed me though, so I’m going to show you how to access Gmail messages using IMAP as an API. I’ll be using a PHP script, since I have an irrational bias against Python. Something about semantically significant whitespace really gets my goat.

I’ve got a demonstration page up at http://funhousepicture.com/phpgmail/. You’ll need to enter your full Gmail address and password if you want to try it out there, or you can download the source code and run it on your own server. I’ve also included it inline below. After connecting, it will list your mailboxes and fetch all of the headers from your inbox, then show the header details and full content of the first ten messages. This may take a few seconds.

You’ll need PHP with support for the IMAP library enabled to use it yourself. I was surprised to find this wasn’t included by default in the OS X distribution, and after some considerable yak shaving trying to get my own copy of PHP compiled, along with all its dependencies, I gave up doing local development and relied on my hosted Linux server instead. Thankfully that worked right out of the box.

<?php

function gmail_login_page()
{
?>
<html>
<head><title>Gmail summary login</title>
<style type="text/css">body { font-family: arial, sans-serif; margin: 40px;}</style>
</head>
<body>
<div>This page demonstrates how to access your Gmail account using IMAP in PHP. </div><br/>
<div>Enter your full email address and password, and the next page will show a selection of information about your account.</div><br/>
<div>See <a href="http://petewarden.typepad.com/">http://petewarden.typepad.com/</a> for more information.</div><br/>
<hr/><br/>
<div>
<form action="index.php" method="POST">
<input type="text" name="user"> Gmail address<br/>
<input type="password" name="password"> Password<br/>
<br/>
<input type="submit" value="Get summary">
</form>
</div>
<hr/>
</body>
</html>
<?php
}

function gmail_summary_page($user, $password)
{
?>
<html>
<head><title>Gmail summary for <?=htmlspecialchars($user)?></title>
<style type="text/css">body { font-family: arial, sans-serif; margin: 40px;}</style>
</head>
<body>
<?php
   
    $imapaddress = "{imap.gmail.com:993/imap/ssl}";
    $imapmainbox = "INBOX";
    $maxmessagecount = 10;

    display_mail_summary($imapaddress, $imapmainbox, $user, $password, $maxmessagecount);
?>
</body>
</html>
<?php
}

function display_mail_summary($imapaddress, $imapmainbox, $imapuser, $imappassword, $maxmessagecount)
{
    $imapaddressandbox = $imapaddress . $imapmainbox;

    // Report the failure without echoing the password back into the page
    $connection = imap_open($imapaddressandbox, $imapuser, $imappassword)
        or die("Can't connect to '" . htmlspecialchars($imapaddress) .
        "' as user '" . htmlspecialchars($imapuser) .
        "': " . imap_last_error());

    echo "<h1><u>Gmail information for " . htmlspecialchars($imapuser) . "</u></h1>";

    echo "<h2>Mailboxes</h2>\n";
    $folders = imap_listmailbox($connection, $imapaddress, "*")
        or die("Can't list mailboxes: " . imap_last_error());

    foreach ($folders as $val)
        echo htmlspecialchars($val) . "<br />\n";

    echo "<h2>Inbox headers</h2>\n";
    // An empty inbox gives an empty array, which is not an error
    $headers = imap_headers($connection);
    if ($headers === false)
        die("Can't get headers: " . imap_last_error());

    $totalmessagecount = sizeof($headers);

    echo $totalmessagecount . " messages<br/><br/>";

    if ($totalmessagecount<$maxmessagecount)
        $displaycount = $totalmessagecount;
    else
        $displaycount = $maxmessagecount;

    for ($count=1; $count<=$displaycount; $count+=1)
    {
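        $headerinfo = imap_headerinfo($connection, $count)
            or die("Couldn't get header for message " . $count . " : " . imap_last_error());
        // Headers can be missing (e.g. no subject) and must be escaped before display
        $from = htmlspecialchars(isset($headerinfo->fromaddress) ? $headerinfo->fromaddress : "");
        $subject = htmlspecialchars(isset($headerinfo->subject) ? $headerinfo->subject : "");
        $date = htmlspecialchars(isset($headerinfo->date) ? $headerinfo->date : "");
        echo "<em><u>".$from."</u></em>: ".$subject." - <i>".$date."</i><br />\n";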
    }

    echo "<h2>Message bodies</h2>\n";

    for ($count=1; $count<=$displaycount; $count+=1)
    {
        // An empty body is valid, so only treat false as a failure
        $body = imap_body($connection, $count);
        if ($body === false)
            die("Can't fetch body for message " . $count . " : " . imap_last_error());
        echo "<pre>". htmlspecialchars($body) . "</pre><hr/>";
    }

    imap_close($connection);
}

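// The fields are absent on the first visit, so avoid undefined-index notices
$user = isset($_POST["user"]) ? $_POST["user"] : "";
$password = isset($_POST["password"]) ? $_POST["password"] : "";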

if (!$user or !$password)
    gmail_login_page();
else
    gmail_summary_page($user, $password);

?>

Analyzing your Gmail

Mihai Parparita, a Google developer, has created a system to display information about your email over time. Mail Trends is a Python script that connects to your Gmail account through IMAP, and generates a series of tables and graphs showing information about your mail account over time. The time aspect is key; it’s one of the most interesting parts of email, and something that distinguishes it from other implicit data we have access to. He has a demonstration using part of the Enron data set, and you can see the most prolific emailers, subjects and who sends you the most email. I was hoping it would also demonstrate searching by keyword, since being able to look for specific terms is very useful for research in Google Trends and similar buzz tracking sites for the web. One of my goals is both to show graphs of search keywords over time in your mail, in the same way that MarkMail does for its public mailing list search, and to have animated tag clouds that show the most popular terms as they change over time. I’ll be watching closely for future developments; at least one of the blog commenters understands how this could build into something larger.
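To give a feel for the kind of tally involved, here’s a rough PHP sketch that counts messages per sender and per month from a list of headers you’ve already fetched. It isn’t taken from Mail Trends; the tally_mail() function and the array layout are my own assumptions for illustration, and the from/date values could come from imap_headerinfo() as in the script above.

```php
<?php
// Hypothetical helper, not part of Mail Trends: tally messages by sender
// and by month. $messages is an array of rows shaped like
// array('from' => 'someone@example.com', 'date' => 'Tue, 4 Mar 2008 20:15:00 +0000').
function tally_mail(array $messages)
{
    $bysender = array();
    $bymonth = array();

    foreach ($messages as $message) {
        $timestamp = strtotime($message['date']);
        if ($timestamp === false) {
            continue; // skip messages whose date can't be parsed
        }

        // Treat addresses case-insensitively and bucket dates by UTC month
        $sender = strtolower(trim($message['from']));
        $month = gmdate('Y-m', $timestamp);

        $bysender[$sender] = ($bysender[$sender] ?? 0) + 1;
        $bymonth[$month] = ($bymonth[$month] ?? 0) + 1;
    }

    arsort($bysender); // most prolific senders first
    ksort($bymonth);   // months in chronological order

    return array('senders' => $bysender, 'months' => $bymonth);
}
```

The 'months' array is the raw material for a graph over time, and 'senders' gives you the most-prolific-emailer table.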

On the technical side, using IMAP is a great way to work around the lack of a proper Gmail API. He’s using Python’s imaplib; I’ll have to look at the equivalents for other languages, since I have an irrational prejudice against any language in which whitespace is significant. Tabs in makefiles also bother me, but I’ve learnt to live with them. A hat tip to Brad and Googlified for pointing me towards Mail Trends.

How to convert mbox files to an Outlook pst

Kenlay
Photo by MotherPie

[Update: There's now a good alternative that includes separate PSTs for each user]

After getting the Enron emails into the mbox format, the next step was to convert them into something that the Outlook/Exchange world can understand. Thankfully I already had a great conversion program in mind, Aid4Mail. At its core it's a translator between a large number of mail formats, including Outlook, Outlook Express, Windows Mail, Eudora, Thunderbird, Netscape Messenger, Pegasus Mail and a whole bunch of generic formats including several mbox variants. It can read and write to all of these formats, and has a large number of options to transform the mail as you do so. For example, you can choose to only convert mails sent between certain dates, or to ignore attachments. If you're working with mail, I highly recommend giving this program a try; it's the Swiss Army knife of email tools.

To do the Enron conversion, I selected generic unix mbox as the input format. On the next screen I navigated to the root folder that contained all my files, and then chose Outlook pst as the destination type. I left all the other options at their defaults, so no filtering was done and the folder hierarchy was preserved. It took around 16 hours to process all 500,000 messages, and the pst file came out at around 5 GB.

I'm able to open it in Outlook and browse through the messages, and can also add them to my Exchange server. There are some issues: it doesn't preserve the original user structure, since they're all in one pst, attachments aren't included, and some of the addresses are obfuscated. Still, it's good enough to give me the testbed I need to put some of my tools through some real-world stress tests.

Once the upload has finished, you should be able to access the pst yourself at
http://funhousepicture.com/enron.pst
It's 5 GB, so it won't be all there for a good few hours, and be prepared for a long download time.

What’s the best way to search large amounts of email?

MarkMail is a really interesting demonstration site for MarkLogic’s technology. They host archives of a number of development mailing lists for projects like Apache and Perl. You can search within each list, and the results are presented in a three panel format.

The left panel shows you the frequency of the search terms over time, and suggests some different ways to narrow your search by focusing on subsets of the list or particular contributors who mention the term frequently. The middle panel is more like a conventional results page, listing links to all the matching messages. It also offers the ability to reorder the results by date instead of relevance. The right section shows you the content of the message, and other matching messages from around the same time.

I like this interface a lot. It’s the best presentation of time in search results that I’ve seen, combining the information offered by Google Trends with all the facilities of a normal search. I’m a big believer in using a horizontal split for previews too.

Beyond presentation, they also offer a lot of advantages over a web search engine in their understanding of mail messages. They allow you to search on subjects and authors, to search only unquoted text, and to ignore boilerplate material like disclaimer sigs and checkin notices. Much like Krugle focusing on function names, they can also use their knowledge of the structure to offer more relevant general results, giving more weight to the subject line than to text in the body when working out the relevance of a result. This gives them an advantage over Google searching the same content as a web archive, since Google has no idea what the significance of any part of each page is. Anyone who’s ever tried to do a mailing list search for "thread" through Google will know that it can be hard if the archive interface includes any interface elements that use thread to refer to topic-browsing, such as "Next in thread". As an example, here’s a Google search on the postgresql archives for thread where 2 of the top 3 results are for thread interface references. By contrast, all of the MarkMail results for the same search cover discussion of threads in the body of the message.
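Here’s a minimal PHP sketch of that kind of field weighting. It’s only an illustration of the idea, not MarkMail’s actual ranking; the function names and the 5-to-1 weights are assumptions I’ve picked out of thin air.

```php
<?php
// Hypothetical scoring sketch: a hit in the subject counts for more than
// a hit in the body. The default weights are arbitrary illustration values.
function score_message(array $message, $term, $subjectweight = 5, $bodyweight = 1)
{
    $term = strtolower($term);
    if ($term === '') {
        return 0; // substr_count() rejects an empty search string
    }

    $subjecthits = substr_count(strtolower($message['subject']), $term);
    $bodyhits = substr_count(strtolower($message['body']), $term);

    return ($subjecthits * $subjectweight) + ($bodyhits * $bodyweight);
}

// Return only the messages that mention $term, most relevant first.
function rank_messages(array $messages, $term)
{
    $scored = array();
    foreach ($messages as $message) {
        $score = score_message($message, $term);
        if ($score > 0) {
            $scored[] = array('score' => $score, 'message' => $message);
        }
    }

    // Sort by descending score
    usort($scored, function ($a, $b) {
        return $b['score'] - $a['score'];
    });

    return array_column($scored, 'message');
}
```

A message with "thread" in its subject ends up above one that only picks up the word from something like "Next in thread" in the body, which is the behavior a flat web index can’t give you.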

Under the hood they’re using an interesting mix of technology. On their blog, Jason Hunter posted a presentation covering the nuts and bolts of how they’ve built their search engine. Like me, they’ve gone the route of defining an XML format to store the messages in.

I’m currently using XML for an interchange format, but was going down the standard relational/MySQL route for my database. MarkMail is completely powered using the XQuery database language, backed up by data stored in XML rather than converted to some processed database format. I couldn’t find any information on the technology they use to implement this (Saxon?), but it would be a lot simpler to do a single conversion to XML, and then operate on it, rather than trying to do input and output conversions from MySQL. Fascinating stuff; I’ll have to see if I can get any more information from the team.

How to build MAPI Editor with VS 2005

Digger
Photo by Lawrence Whittemore

MAPI Editor is a great open-source example of how to access data on an Exchange server. It took a little bit of tweaking to get it compiling on my Windows Server 2003 box with Visual Studio 2005, so here are my directions:

Version Conversion. The project is still set up for Visual Studio 6, so you’ll need to convert it to work with VS 2005. You can normally just load the old .vcproj project file and have VS create a new project from it. Unfortunately this fails in this case, but an easy workaround is to load the .dsp workspace instead. The conversion works in that case.

Deprecation Frustration. The project uses _tfopen(), but this is marked as deprecated in the latest Visual Studio C runtime library, because it has security issues. Again there was a fairly simple solution, replacing it with the new _tfopen_s function. Here’s the code change in MFCOutput.cpp, line 84:

Old:    fOut = _tfopen(szFileName,szParams);
New:    _tfopen_s(&fOut, szFileName,szParams);

Inclusion Confusion. Editor.cpp includes vssym32.h on line 13. This was introduced after Server 2003, but there’s an equivalent older header you can use. Here are the changes:

Old:    #include <vssym32.h>
New:    #include <tmschema.h>

Installation Aberration. The project links to msi.lib. Unfortunately there’s a known bug with Visual Studio 2005 that means this isn’t installed in the default platform SDK that ships with the product, so the linking stage fails. The obvious solution is to install one of the separate platform SDKs that Microsoft offers. Whilst I was waiting on one of these to download, I experimented with removing it entirely from the project, and it looks like it’s actually no longer needed, since it still builds and runs fine. So, just go into the linker portion of the project settings and remove msi.lib from the input portion.

I’ve filed bugs to document these issues here, here, here and here. Fixing some of them could break backwards compatibility with older versions of the compiler, so making code changes might not make sense, but I wanted to get them documented.

Is Exchange a drag racer?

Dragracer
Photo by Bcmacsac1

Why hasn’t Microsoft Exchange changed very much in a decade? Since it holds a lot of really interesting information, you’d think that there would be lots of cool new features they could introduce.

Looking at it from the outside, I think the problem is that they’ve ended up building a drag racer. They’re using the same JET database engine as Access, but heavily customized and tuned to offer great mail handling performance. Just like a drag racer, it’s going to be hard to beat its performance going in a straight line, doing what Exchange has always had to do, passing mail between users. The problem is, it’s really hard to turn a drag racer. There are a lot of other useful things you could do with all that data, but since the Exchange engine is so specialized for its current requirements, doing anything else is usually both hard to write and slow to run. For example, accessing a few hundred mail messages can take several seconds through MAPI. Doing the same operation on similar MySQL data would take a few milliseconds, and requires a lot less code.

They recognize this themselves; the Kodiak project was an aborted attempt to put a modern, flexible database underneath Exchange. I know the bind they’re in; there are so many years of optimization in the code associated with the old JET implementation that any switch is bound to mean slower performance initially. I’ve seen companies wrestle with this dilemma for years; they can’t produce a new version that runs more slowly at the tasks customers are currently using it for, but they can’t ship the new features they’d like while they’re tied to the gnarly legacy engine.

How Gmail collapses quoted text

A friend recently asked if there was a good way to detect just the added text in an email reply. This would allow users to reply directly to emails showing things like Facebook messages, and have the reply show up in a decent form on that other service. Spotting just the new content is fairly tricky, because not only do you have the quoted text of the original message, but different email programs also add their own decorations to give attribution to the quotations, e.g.:

------ Original Message -----

On Tue, Mar 4, 2008 at 8:15 PM, Pete Warden <pete@petewarden.com> wrote:
From: Pete Warden 
Sent: Wednesday, March 04, 2008 8:17 PM
To: Pete Warden
Subject: Testing 2

The solution he is looking at for removing this boilerplate is collecting a library of examples, and figuring out some regular expressions that will match them. They’re fairly distinctive, so it should be possible to do a pretty accurate job spotting them. The main problem is that there are so many different mail programs out there, and they all seem to add slightly different decorations.
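As a starting point, here’s a small PHP sketch of what such a library might look like, covering only the three decorations shown above. The function names are hypothetical and the patterns are my own guesses from those examples; a real library would need many more variants.

```php
<?php
// A deliberately incomplete library of attribution patterns, guessed
// from the sample decorations above. Real mail programs vary a lot more.
function attribution_patterns()
{
    return array(
        // "------ Original Message -----" (the number of dashes varies)
        '/^-{2,}\s*Original Message\s*-{2,}$/i',
        // "On Tue, Mar 4, 2008 at 8:15 PM, Pete Warden <...> wrote:"
        '/^On .+ wrote:$/',
        // Outlook-style header block: "From:", "Sent:", "To:", "Subject:"
        '/^(From|Sent|To|Subject):\s*\S/',
    );
}

// True if the line looks like a decoration added by a mail program.
function is_attribution_line($line)
{
    $line = trim($line);
    foreach (attribution_patterns() as $pattern) {
        if (preg_match($pattern, $line) === 1) {
            return true;
        }
    }
    return false;
}
```

Keeping the patterns in one array makes it easy to keep adding examples as you run into new mail programs.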

Detecting the quoted text is more of an algorithmic problem, and comes down to doing a fuzzy string search to work out if some text roughly matches the contents of the original mail. Another approach would be to look for >’s at the start of a line, which would work reasonably well if it weren’t for Outlook. For once, there’s actually a helpful patent that describes how Google does this in Gmail. I really hate software patents, but at least this one contains some non-obvious parts, is not insanely broad and explains reasonably well the implementation behind it. They don’t talk about handling the boilerplate decoration very much, apart from mentioning they look for common headers like "From:". For the quotations, it looks like they do some magic with hash calculations to spot small sections of matching text between the two documents, and then try to merge them into larger blocks.
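To make the hashing idea concrete, here’s a rough PHP sketch. To be clear, this is not the algorithm from the patent, just my own simplified take on the general approach: hash every run of three consecutive words in the original, then treat a reply line as quoted if at least half of its own three-word runs show up in that set. It doesn’t attempt the merging into larger blocks, and the function names, run length and threshold are all assumptions.

```php
<?php
// Split text into lowercase words, dropping any leading ">" quote markers.
function message_words($text)
{
    $text = preg_replace('/^[>\s]+/m', '', $text);
    return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hash every run of $size consecutive words.
function shingle_hashes(array $words, $size = 3)
{
    $hashes = array();
    for ($i = 0; $i + $size <= count($words); $i++) {
        $hashes[] = crc32(implode(' ', array_slice($words, $i, $size)));
    }
    return $hashes;
}

// Return the reply lines that don't look like quotes of the original.
function new_lines_in_reply($original, $reply, $size = 3, $threshold = 0.5)
{
    // Flip so each hash becomes a key, giving fast isset() lookups
    $known = array_flip(shingle_hashes(message_words($original), $size));
    $newlines = array();

    foreach (explode("\n", $reply) as $line) {
        $trimmed = trim($line);
        if ($trimmed === '') {
            continue;
        }

        $hashes = shingle_hashes(message_words($trimmed), $size);
        if (count($hashes) === 0) {
            // Too short to fingerprint, so fall back to the ">" check
            $quoted = ($trimmed[0] === '>');
        } else {
            $matches = 0;
            foreach ($hashes as $hash) {
                if (isset($known[$hash])) {
                    $matches++;
                }
            }
            $quoted = ($matches / count($hashes)) >= $threshold;
        }

        if (!$quoted) {
            $newlines[] = $trimmed;
        }
    }
    return $newlines;
}
```

Because the original is fingerprinted as one long run of words, this catches quotes that have been re-wrapped or left unprefixed the way Outlook does, though very short quoted lines will slip through.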

Where can you get the inside track on Active Directory?

Microsoft recently released the first round of their open protocol documentation. These sorts of documents are crucial information for anyone trying to do something challenging in the Exchange world. I was hoping to get a look at the undocumented parts of MAPI, and see a discussion of the variant that Outlook uses to communicate with Exchange, but it looks like that won’t be available for a few months.

Almost as valuable is the Active Directory Technical Specification, along with the related documents on the Security Account Manager and Directory Services Replication. For example, it gives detailed information on how to create a new user account through SAM, and a full IDL for DRS. This level of information makes it possible to design software that works seamlessly within a world of Microsoft services, so it’s not only great PR, it’s a cunning move to encourage more third-party development locked to Windows. Hopefully they’ll be rolling out the Exchange specs soon!