How to build MAPI Editor with VS 2005

Digger
Photo by Lawrence Whittemore

MAPI Editor is a great open-source example of how to access data on an Exchange server. It took a little bit of tweaking to get it compiling on my Windows Server 2003 box with Visual Studio 2005, so here are my directions:

Version Conversion. The project is still set up for Visual Studio 6, so you’ll need to convert it to work with VS 2005. You can normally just open the old Visual Studio 6 workspace and have VS create a new project from it. Unfortunately that fails in this case, but an easy workaround is to load the .dsp project file instead. The conversion works in that case.

Deprecation Frustration. MFCOutput.cpp uses _tfopen(), but this is marked as deprecated in the latest Visual Studio C runtime library because it has security issues. Again there was a fairly simple solution: replacing it with the new _tfopen_s function. Here’s the code change in MFCOutput.cpp, line 84:

Old:    fOut = _tfopen(szFileName, szParams);
New:    _tfopen_s(&fOut, szFileName, szParams);
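
Note that _tfopen_s returns an errno_t rather than just handing back a NULL FILE*, and the drop-in replacement above ignores that return value. If you want to be a little more defensive, here’s a minimal sketch of the same call with the error check added; the szFileName and szParams names come from the existing code, everything else is just illustration:

    #include <cstdio>
    #include <tchar.h>

    // Illustrative helper: open an output file the "secure CRT" way.
    FILE* OpenOutputFile(const TCHAR* szFileName, const TCHAR* szParams)
    {
        FILE* fOut = NULL;
        // _tfopen_s reports failure through its errno_t return value
        errno_t err = _tfopen_s(&fOut, szFileName, szParams);
        if (err != 0 || fOut == NULL)
        {
            return NULL; // let the caller handle the failure as before
        }
        return fOut;
    }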

Inclusion Confusion. Editor.cpp includes vssym32.h on line 13. This header was introduced after Server 2003, but there’s an equivalent older header you can use. Here’s the change:

Old:    #include <vssym32.h>
New:    #include <tmschema.h>

Installation Aberration. The project links to msi.lib. Unfortunately there’s a known bug with Visual Studio 2005 that means this isn’t installed in the default Platform SDK that ships with the product, so the linking stage fails. The obvious solution is to install one of the separate Platform SDKs that Microsoft offers. Whilst I was waiting on one of those to download, I experimented with removing the library entirely from the project, and it looks like it’s actually no longer needed, since everything still builds and runs fine. So just go into Linker > Input in the project settings and remove msi.lib from the Additional Dependencies list.

I’ve filed bugs to document these issues here, here, here and here. Fixing some of them could break backwards compatibility with older versions of the compiler, so making code changes might not make sense, but I wanted to get them documented.

What does Ask’s failure mean for search?

Fail
Photo by Fridgeuk

I was really sad to hear about the layoffs and change of strategy at Ask. They were working really hard to do something different with search, and this can only help fuel the belief that search doesn’t need to change. I didn’t always like the end result, and I never switched to using them full-time, but they were the only mainstream engine that was moving the technology forward.

I won’t go into the history behind their retreat; other people have already done better jobs than I ever could. Danny Sullivan has the best overall roundup, and Jonathan Salem Baskin did a prescient piece on their marketing problems back in January. What I’m interested in is where this leaves the alternative search engine industry.

Ask had some great advantages over a startup. They had a large group of existing users to test new ideas on. Their history and contacts helped them publicize their new technology, and if they achieved a breakthrough their user base would mean a lot of word-of-mouth buzz. They also had a massive marketing budget, though it ended up being spent in some strange ways. My hope was that they would succeed either by coming up with a killer feature in-house, or by combining with one of the promising startups. Showing that users really do want more out of search would then trigger a technology arms race with the big guys, and we’d all benefit from some progress forward.

Instead, any challenger to Google’s crown will now have to organically build a user base, gather contacts and marketing resources. I’m still confident that there is a better way to search than a flat list of links, but this removes one of the paths to proving that. I hope it doesn’t make the investment climate harder for those small horizontal search engines too.

Slow food and growing up in the Cambridge countryside

Blackberry
Photo by Fannyluvstrans

For my next trip to Boulder, I’m hoping to pay a visit to The Kitchen, since I’ve had it recommended by several friends. Whilst browsing the site, I discovered that one of the chefs, Hugo Matheson, was raised in a small village in the Cambridgeshire countryside, just like me. He’s been a big part of the Slow Food movement in Boulder, and that got me thinking about how my childhood affected my relationship with food.

The village I grew up in is confusingly called Over, from the Saxon word for riverbank. With 3,000 inhabitants, it has always been based on agriculture, and I grew up opposite a farmyard. Even the residents who weren’t directly involved often grew fruit and vegetables in their gardens or allotments. A lot of the food I ate growing up was grown in the village. There is a wonderful tradition of leaving bags of produce on an unattended wooden stall at the front of your house, with a sign announcing the price and a jar to leave your money in. Another great local source is "The Cooks", an unsigned market stall in a courtyard off the High Street. They’re a local farming family who sell a wonderful range of fresh food; my mother still buys most of her ingredients from them.

Much more exciting as a child were the opportunities for scrumping, a term that sadly has no equivalent over here. Surrounded by farmland and orchards, I spent many hours gathering blackberries from patches of wasteland, plums from hedgerows, fallen apples and even digging up potatoes they’d missed after the harvest! These would all either be eaten on the spot, or brought home for cooking. It always makes me a bit sad when I drive past the orange orchards here in California and see all the fallen fruit rotting on the ground.

I never set foot in a fast food restaurant until I was 14, since they were all 10 miles away in Cambridge. Once I tried my first McDonalds it was like alcohol and the early Indians; the overwhelming amount of sugar, salt and fat got me totally hooked. Once I left Over for university, I couldn’t often afford to eat out, but when I did, the trashier the better. Most of the time I was doing home cooking though, for economy’s sake rather than as a preference, but I could never get produce that was quite as good as the fresh local ingredients I was used to. Now that I’m here in California I just cook at home as a treat; one of the joys is the wonderful array of good restaurants to choose from that do use decent ingredients.

I still have trouble getting good produce though. I’m never free when farmers’ markets are running, and getting home late and working most weekends means I’m at the mercy of the supermarkets. Ironically, I’ve found that the most upscale stores like Albertsons and Vons have the worst produce, while the best I’ve found is at Jons, a discount store. My theory is that customers like to see the fruit and vegetables on display at the upscale places, but they don’t buy them very often, so they sit around for a long time. I’ve certainly brought home a lot of vegetables from those stores that were already starting to go soft or mouldy. At Jons, most of the shoppers are poorer and I see a lot more of them buying raw ingredients to cook with, so they both care more about the quality and there’s a lot more turnover.

I’m interested in the Slow Food movement, and in what I’ll find at The Kitchen. I’m naturally sympathetic to their ideals, and love the fact that they have recipes and suppliers up on the site. I’d love to hear ideas on how we could make good fresh produce available in a convenient way that’s affordable to people on a budget.

Is Exchange a drag racer?

Dragracer
Photo by Bcmacsac1

Why hasn’t Microsoft Exchange changed very much in a decade? It holds a lot of really interesting information, so you’d think there would be lots of cool new features they could introduce.

Looking at it from the outside, I think the problem is that they’ve ended up building a drag racer. They’re using the same JET database engine as Access, but heavily customized and tuned to offer great mail-handling performance. Just like a drag racer, it’s going to be hard to beat its performance going in a straight line, doing what Exchange has always had to do: passing mail between users. The problem is, it’s really hard to turn a drag racer. There are a lot of other useful things you could do with all that data, but since the Exchange engine is so specialized for its current requirements, doing anything else is usually both hard to write and slow to run. For example, accessing a few hundred mail messages can take several seconds through MAPI. Doing the same operation on similar MySQL data would take a few milliseconds, and requires a lot less code.
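
To make that comparison a bit more concrete, here’s a rough sketch of what pulling a few hundred messages looks like against a hypothetical MySQL messages table, using the standard MySQL C API. The schema, connection details and table names are all made up for illustration; the point is how little code is involved compared with walking a MAPI contents table:

    #include <mysql/mysql.h>
    #include <cstdio>

    int main()
    {
        MYSQL* conn = mysql_init(NULL);
        // Hypothetical connection details and "messages" schema
        if (!mysql_real_connect(conn, "localhost", "user", "password", "mailstore", 0, NULL, 0))
            return 1;

        // Grab a few hundred messages for one mailbox in a single round trip
        if (mysql_query(conn, "SELECT subject, sender, sent_at FROM messages "
                              "WHERE mailbox_id = 42 ORDER BY sent_at DESC LIMIT 500"))
            return 1;

        MYSQL_RES* result = mysql_store_result(conn);
        MYSQL_ROW row;
        while ((row = mysql_fetch_row(result)))
            printf("%s | %s | %s\n", row[0], row[1], row[2]);

        mysql_free_result(result);
        mysql_close(conn);
        return 0;
    }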

They recognize this themselves; the Kodiak project was an aborted attempt to put a modern, flexible database underneath Exchange. I know the bind they’re in: there are so many years of optimization in the code associated with the old JET implementation that any switch is bound to mean slower performance initially. I’ve seen companies wrestle with this dilemma for years; they can’t produce a new version that runs more slowly at the tasks customers are currently using it for, but they can’t ship the new features they’d like while they’re tied to the gnarly legacy engine.

How Gmail collapses quoted text

Contents

A friend recently asked if there was a good way to detect just the added text in an email reply. This would allow users to reply directly to emails showing things like Facebook messages, and have the reply show up in a decent form on that other service. Spotting just the new content is fairly tricky, because you’ve not only got the quoted text of the original message; different email programs also add their own decorations to give attribution to the quotations, e.g.:

----- Original Message -----

On Tue, Mar 4, 2008 at 8:15 PM, Pete Warden <pete@petewarden.com> wrote:
From: Pete Warden 
Sent: Wednesday, March 04, 2008 8:17 PM
To: Pete Warden
Subject: Testing 2

The solution he is looking at for removing this boilerplate is collecting a library of examples, and figuring out some regular expressions that will match them. They’re fairly distinctive, so it should be possible to do a pretty accurate job spotting them. The main problem is that there are so many different mail programs out there, and they all seem to add slightly different decorations.
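
As a rough illustration of that regular-expression approach, here’s a sketch that checks a single line against a couple of the attribution patterns quoted above. The two patterns are only examples I’ve made up from those decorations; a real library would need many more, gathered from lots of different clients:

    #include <regex>
    #include <string>

    // Returns true if a line looks like quoting boilerplate rather than new content.
    // Only covers the "Original Message" separator and the "On <date> ... wrote:"
    // attribution; real mail clients produce many more variations.
    bool IsQuoteBoilerplate(const std::string& line)
    {
        static const std::regex originalMessage("^-{2,}\\s*Original Message\\s*-{2,}\\s*$");
        static const std::regex onDateWrote("^On .+ wrote:\\s*$");
        return std::regex_search(line, originalMessage) ||
               std::regex_search(line, onDateWrote);
    }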

Detecting the quoted text is more of an algorithmic problem, and comes down to doing a fuzzy string search to work out whether some text roughly matches the contents of the original mail. Another approach would be to look for >’s at the start of a line, which would work reasonably well if it wasn’t for Outlook. For once, there’s actually a helpful patent that describes how Google does this in Gmail. I really hate software patents, but at least this one contains some non-obvious parts, is not insanely broad, and explains the implementation behind it reasonably well. They don’t talk about handling the boilerplate decoration very much, apart from mentioning that they look for common headers like "From:". For the quotations, it looks like they do some magic with hash calculations to spot small sections of matching text between the two documents, and then try to merge them into larger blocks.
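
My rough reading of that hashing idea, as a sketch: break both the original message and the reply into short overlapping word "shingles", hash them, and treat any shingle that appears in both as probable quoted text. The real system apparently merges those matches into larger blocks; this simplification just scores how much of the reply looks quoted, and every name in it is mine rather than anything from the patent:

    #include <functional>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    // Split text into overlapping shingles of `width` consecutive words.
    static std::vector<std::string> Shingles(const std::string& text, size_t width = 4)
    {
        std::vector<std::string> words, result;
        std::istringstream in(text);
        for (std::string w; in >> w; ) words.push_back(w);
        for (size_t i = 0; i + width <= words.size(); ++i)
        {
            std::string s;
            for (size_t j = 0; j < width; ++j) s += words[i + j] + " ";
            result.push_back(s);
        }
        return result;
    }

    // Fraction of the reply's shingles that also appear in the original message.
    // A high score for a stretch of the reply suggests it is quoted text.
    double QuotedFraction(const std::string& original, const std::string& reply)
    {
        std::hash<std::string> hasher;
        std::set<size_t> originalHashes;
        for (const auto& s : Shingles(original)) originalHashes.insert(hasher(s));

        const std::vector<std::string> replyShingles = Shingles(reply);
        if (replyShingles.empty()) return 0.0;

        size_t matches = 0;
        for (const auto& s : replyShingles)
            if (originalHashes.count(hasher(s))) ++matches;
        return static_cast<double>(matches) / replyShingles.size();
    }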

Where can you get the inside track on Active Directory?

Queue

Microsoft recently released the first round of their open protocol documentation. These sorts of documents are crucial information for anyone trying to do something challenging in the Exchange world. I was hoping to get a look at the undocumented parts of MAPI, and see a discussion of the variant that Outlook uses to communicate with Exchange, but it looks like that won’t be available for a few months.

Almost as valuable is the Active Directory Technical Specification, along with the related documents on the Security Account Manager and Directory Services Replication. For example, it gives detailed information on how to create a new user account through SAM, and a full IDL for DRS. This level of information makes it possible to design software that works seamlessly within a world of Microsoft services, so it’s not only great PR, it’s a cunning move to encourage more third-party development locked to Windows. Hopefully they’ll be rolling out the Exchange specs soon!

Enterprise email is boring

Boredbusiness

After I chatted some more with Nat Torkington of O’Reilly, the source for my previous article, he pointed out that I’d misquoted him. He actually said "enterprise email is boring", and then outlined a few examples of the huge number of exciting things that are waiting to happen with mail:

  • Forms in email for seamless interactivity with web applications … Ajax even?
  • People you send email to form your contacts, so why doesn’t your mail client automatically update your address book and buddy list when you’ve exchanged more than a few emails?
  • NLP-based clustering of mails into topical and thematic groups (pre-/auto- filtering)
  • Better indexing of old mail and visualizations of those indexes
  • Integrated GTD/productivity systems

Xobni does a good job with automatic contact extraction and ranking, but I’ve seen very little work done on the remaining areas. I Want Sandy is a great email-based scheduling tool that could grow into a full productivity system, and there’s been research on automatic mail categorization, but that’s about it.

He also questioned why I am so focused on the server side, since it looks to him like most of the interesting stuff should happen on the client. I’m building a server-based system because that’s where the data is. There are patterns that emerge from a whole organization’s communications that you can’t see when you’re just looking at a single person’s inbox. There are companies like Trampoline Systems that offer business intelligence based on this, and lots of forensic analysis work has been done to discover patterns after the fact, but nobody’s trying to build tools to give this information to users.

Another reason driving me is ease of use. It’s much simpler to build indices and do other pre-processing work ahead of time on a server and offer the user an immediate experience through a web app than to require a client-side install and then spend time and storage space creating that data locally.

Probably the biggest stumbling block with this plan is a final point he brings up, that the pace of change in corporate IT departments is painfully slow. The most successful products in this area have been driven by an urgent and painful problem like spam, where someone will be fired if a solution isn’t found. I’ll need a very compelling application to get traction.

Email is boring

Bored

I had a great conversation on Friday night with a very savvy technology journalist. I gave him the pitch on my email work, he threw in a lot of smart and incisive questions, and he discussed some of the similar projects he’d covered. At the end, though, he threw out the line "but anyway, email is boring".

That’s what I find so interesting about it! Here’s a content-creation technology that’s used by billions of people every day, far more than will ever write anything that ends up on the web, and almost no one is doing anything innovative with it. Here’s why the web is hopping and email is languishing:

Closed technology. Email is scattered across different web services, in-house Exchange servers, social sites like Facebook, and a plethora of both web-based and PC clients. Most of these have no API you can use to programmatically access the messages, and the few that do have a very steep learning curve. That all makes it orders of magnitude easier to get to the "hello world" stage of a web app than it is to get started doing something interesting with mail.

Closed data. When you’re working with the web, there’s an enormous public corpus of data available just by spidering. Email is private, and it’s very hard to find large collections of email to work with. The Enron set is the only one I know of. That means even if you do have a brilliant idea for working with email, it’s very hard to prototype and test it.

Solve these two problems, even partially, and there’s a world of possibilities. That’s why I’m building a platform and API to let you work with email in a simple way. Write native importers that feed Exchange, Gmail and other data into a standard XML pipeline, and then you can cheaply and quickly create interesting tools to work with that information. Social networks, content analysis, collaboration tools, personal assistants, trend spotting: that’s when it all gets really exciting.
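
To give a flavour of what I mean by a standard pipeline, here’s a sketch of an importer emitting a single message as a normalized XML record. The structure and element names are just my working illustration, not a finished schema:

    #include <iostream>
    #include <string>

    // Hypothetical normalized message; a real importer would fill this in
    // from MAPI, IMAP, a webmail scrape, and so on.
    struct Message
    {
        std::string id, from, to, sent, subject, body;
    };

    static void WriteAsXml(const Message& m, std::ostream& out)
    {
        // XML escaping is omitted for brevity; real output would need it.
        out << "<message id=\"" << m.id << "\">\n"
            << "  <from>" << m.from << "</from>\n"
            << "  <to>" << m.to << "</to>\n"
            << "  <sent>" << m.sent << "</sent>\n"
            << "  <subject>" << m.subject << "</subject>\n"
            << "  <body>" << m.body << "</body>\n"
            << "</message>\n";
    }

    int main()
    {
        Message m = {"1", "pete@petewarden.com", "someone@example.com",
                     "2008-03-04T20:15:00Z", "Testing 2", "Hello world"};
        WriteAsXml(m, std::cout);
        return 0;
    }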

World War II and the Implicit Web

Spitfire

Traffic analysis is a field of espionage, focused on learning about the enemy by looking at their communication patterns without having to understand the content. Here are some examples from the Wikipedia entry:

  1. Frequent communications — can denote planning.
  2. Rapid, short communications — can denote negotiations.
  3. A lack of communication — can indicate a lack of activity, or completion of a finalized plan.
  4. Frequent communication to specific stations from a central station — can highlight the chain of command.
  5. Who talks to whom — can indicate which stations are ‘in charge’ or the ‘control station’ of a particular network. This further implies something about the personnel associated with each station.
  6. Who talks when — can indicate which stations are active in connection with events, which implies something about the information being passed and perhaps something about the personnel/access of those associated with some stations.
  7. Who changes from station to station, or medium to medium — can indicate movement, fear of interception.

Some of these might sound familiar to anyone interested in analysing implicit data. Number 4 sure sounds a lot like PageRank. The others can all be applied to any communications where you know the time, sender and recipients. Email content isn’t encrypted, but since computers can’t fully understand natural language it might as well be, so anything we can gather from the external characteristics is invaluable. There’s obviously a lot we could learn from the work that’s been done over the years.
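
As a toy example of applying point 5 to email, here’s a sketch that tallies who mails whom using nothing but sender and recipient addresses. The traffic data is invented, but run the same tally over a real archive and the candidates for "control stations" start to fall out of the counts:

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // One piece of metadata: who mailed whom. No message content is needed.
    struct Edge { std::string from, to; };

    int main()
    {
        // Invented sample traffic; a real run would read these from mail headers.
        std::vector<Edge> traffic = {
            {"alice", "bob"}, {"bob", "alice"}, {"alice", "carol"},
            {"carol", "alice"}, {"dave", "alice"}, {"alice", "dave"},
        };

        // Tally messages per sender and per directed pair of addresses.
        std::map<std::string, int> sentCounts;
        std::map<std::pair<std::string, std::string>, int> pairCounts;
        for (const Edge& e : traffic)
        {
            ++sentCounts[e.from];
            ++pairCounts[std::make_pair(e.from, e.to)];
        }

        // Whoever talks to the most people, most often, is a candidate "control station".
        for (const auto& entry : sentCounts)
            std::cout << entry.first << " sent " << entry.second << " messages\n";
        for (const auto& entry : pairCounts)
            std::cout << entry.first.first << " -> " << entry.first.second
                      << ": " << entry.second << "\n";
        return 0;
    }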

Unfortunately it’s been exclusively the territory of government intelligence services, and they don’t publish too many papers. Some of the most useful work I’ve found has been declassified World War II reports, but even there cryptanalysis tends to get the most coverage. Probably the most fascinating one I found was the post-mortem report produced on the British TA work with German signals. It’s not very enlightening about the techniques they used, but the management recommendations they make are startlingly relevant for a modern tech company, once you get past the formal language:

"The policy of recruiting personnel for T.A. should emphasize the quality of personnel, not the quantity. Conforming to the usual pattern of history, in the beginning of such an undertaking as T.A., there is available only a very limited number of experienced people. Therefore, from the whole field of possible T .A. functions only the most useful and urgent should be undertaken. As the exploitation of these functions progresses, other possible functions will be recognised by the small but able original staff. Their suggestions for organisational changes and expansion should be encouraged and taken seriously. Only from operational experience can possible new functions be properly evaluated in the first instance. Once operational opinion is agreed that certain specific undertakings should be added, the additional personnel should be, as far as possible, chosen and trained by those who have the operational responsibility. … A wholesale creation of a T.A. staff with the a priori idea of providing a sufficient quantity of people to exhaust the field of T.A. … is wasteful and operationally inefficient."

History shows that small motivated teams usually beat the lumbering man-month monstrosities that large companies keep trying to assemble. I guess that’s a lesson they learnt back in 1942!