Slow food and growing up in the Cambridge countryside

Blackberry
Photo by Fannyluvstrans

For my next trip to Boulder, I’m hoping to pay a visit to The Kitchen, since I’ve had it recommended by several friends. Whilst browsing the site, I discovered that one of the chefs, Hugo Matheson, was raised in a small village in the Cambridgeshire countryside, just like me. He’s been a big part of the Slow Food movement in Boulder, and that got me thinking about how my childhood affected my relationship with food.

The village I grew up in is confusingly called Over, from the Saxon word for riverbank. It has around 3,000 inhabitants, its economy was always based on agriculture, and I grew up opposite a farm yard. Even the residents who weren’t directly involved in farming often grew fruit and vegetables in their gardens or allotments, so a lot of the food I ate growing up was grown in the village. There is a wonderful tradition of leaving bags of produce on an unattended wooden stall at the front of your house, with a sign announcing the price and a jar to leave your money in. Another great local source is "The Cooks", an unsigned market stall in a courtyard off the High Street. They’re a local farming family who sell a wonderful range of fresh food; my mother still buys most of her ingredients from them.

Much more exciting as a child were the opportunities for scrumping, a term that sadly has no equivalent over here. Surrounded by farmland and orchards, I spent many hours gathering blackberries from patches of wasteland, plums from hedgerows and fallen apples, and even digging up potatoes the farmers had missed after the harvest! These would all either be eaten on the spot, or brought home for cooking. It always makes me a bit sad when I drive past the orange orchards here in California and see all the fallen fruit rotting on the ground.

I never set foot in a fast food restaurant until I was 14, since they were all 10 miles away in Cambridge. Once I tried my first McDonald’s, the overwhelming hit of sugar, salt and fat got me totally hooked. Once I left Over for university, I couldn’t often afford to eat out, but when I did, the trashier the better. Most of the time I was doing home cooking, for economy’s sake rather than as a preference, but I could never get produce that was quite as good as the fresh local ingredients I was used to. Now that I’m here in California I only cook at home as a treat; one of the joys is the wonderful array of good restaurants to choose from that do use decent ingredients.

I still have trouble getting good produce though. I’m never free when farmers’ markets are running, and getting home late and working most weekends means I’m at the mercy of the supermarkets. Ironically, I’ve found that the most upscale stores like Albertsons and Vons have the worst produce, and the best is at Jons, a discount store. My theory is that customers like to see the fruit and vegetables on display at the upscale places, but they don’t buy them very often, so the produce sits around for a long time. I’ve certainly brought home a lot of vegetables from there that were already starting to go soft or mouldy. At Jons, most of the shoppers are poorer and I see a lot more of them buying raw ingredients to cook with, so they care more about the quality and there’s a lot more turnover.

I’m intrigued by the Slow Food movement, and by what I’ll find at The Kitchen. I’m naturally sympathetic to their ideals, and love the fact that they have recipes and suppliers up on the site. I’d love to hear ideas on how we could make good fresh produce available in a convenient way that’s affordable to people on a budget.

Is Exchange a drag racer?

Dragracer
Photo by Bcmacsac1

Why hasn’t Microsoft Exchange changed very much in a decade? As the holder of a lot of really interesting information, you’d think that there would be lots of cool new features they could introduce.

Looking at it from the outside, I think the problem is that they’ve ended up building a drag racer. They’re using the same JET database engine as Access, but heavily customized and tuned to offer great mail-handling performance. Just like a drag racer, it’s going to be hard to beat its performance going in a straight line, doing what Exchange has always had to do: passing mail between users. The problem is, it’s really hard to turn a drag racer. There are a lot of other useful things you could do with all that data, but since the Exchange engine is so specialized for its current requirements, doing anything else is usually both hard to write and slow to run. For example, accessing a few hundred mail messages can take several seconds through MAPI. Doing the same operation on similar data in MySQL would take a few milliseconds, and requires a lot less code.
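To make that comparison concrete, here’s a sketch of the "few hundred messages" query against a relational store. I’m using Python’s built-in sqlite3 as a stand-in for MySQL, and the schema and data are invented purely for illustration; the point is that one short query replaces pages of MAPI code.

```python
import sqlite3

# Hypothetical mail schema, invented for this example
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, sent_at TEXT)"
)
conn.executemany(
    "INSERT INTO messages (sender, subject, sent_at) VALUES (?, ?, ?)",
    [("alice@example.com", f"Status update {i}", "2008-03-04") for i in range(300)],
)

# Fetching a few hundred messages is one short query and one round trip
rows = conn.execute(
    "SELECT sender, subject FROM messages ORDER BY id LIMIT 200"
).fetchall()
print(len(rows))  # 200
```

The equivalent walk through a MAPI message table means opening the store, the folder, and a contents table, setting columns, and paging through `QueryRows` calls in C++; the relational version is the whole program above.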

They recognize this themselves: the Kodiak project was an aborted attempt to put a modern, flexible database underneath Exchange. I know the bind they’re in; there are so many years of optimization in the code associated with the old JET implementation that any switch is bound to mean slower performance initially. I’ve seen companies wrestle with this dilemma for years; they can’t produce a new version that runs more slowly at the tasks customers currently use it for, but they can’t ship the new features they’d like while they’re tied to the gnarly legacy engine.

How Gmail collapses quoted text

Contents

A friend recently asked if there was a good way to detect just the added text in an email reply. This would allow users to reply directly to emails showing things like Facebook messages, and have the reply show up in a decent form on that other service. Spotting just the new content is fairly tricky, because as well as the quoted text of the original message, different email programs add their own decorations to give attribution to the quotations, e.g.:

------ Original Message -----

On Tue, Mar 4, 2008 at 8:15 PM, Pete Warden <pete@petewarden.com> wrote:
From: Pete Warden 
Sent: Wednesday, March 04, 2008 8:17 PM
To: Pete Warden
Subject: Testing 2

The solution he’s looking at for removing this boilerplate is collecting a library of examples, and figuring out some regular expressions that will match them. They’re fairly distinctive, so it should be possible to do a pretty accurate job of spotting them. The main problem is that there are so many different mail programs out there, and they all seem to add slightly different decorations.
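As a sketch of that regex-library approach, here are a few Python patterns that match the decorations shown above. A real library would need many more variants, one per mail client quirk:

```python
import re

# One pattern per attribution style; real clients vary far more than this
ATTRIBUTION_PATTERNS = [
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}$"),  # Outlook-style divider
    re.compile(r"^On .+ wrote:$"),                      # Gmail-style "On <date>, <name> wrote:"
    re.compile(r"^(From|Sent|To|Subject):\s"),          # forwarded-header block
]

def is_attribution_line(line):
    """Return True if a line looks like quotation boilerplate."""
    return any(p.match(line.strip()) for p in ATTRIBUTION_PATTERNS)

print(is_attribution_line("------ Original Message -----"))  # True
print(is_attribution_line("Thanks for the update!"))         # False
```

Running every line of a reply through a filter like this strips the decorations before you try to match the quoted text itself.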

Detecting the quoted text is more of an algorithmic problem, and comes down to doing a fuzzy string search to work out if some text roughly matches the contents of the original mail. Another approach would be to look for >’s at the start of a line, which would work reasonably well if it weren’t for Outlook. For once, there’s actually a helpful patent that describes how Google does this in Gmail. I really hate software patents, but at least this one contains some non-obvious parts, is not insanely broad, and explains the implementation behind it reasonably well. They don’t say much about handling the boilerplate decoration, apart from mentioning that they look for common headers like "From:". For the quotations, it looks like they do some magic with hash calculations to spot small sections of matching text between the two documents, and then try to merge them into larger blocks.
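Here’s a rough Python sketch of that hash-and-merge idea: hash every run of N words in the original mail, then walk the reply looking for runs whose hash appears in that set, and merge overlapping hits into blocks. The shingle size and the use of Python’s built-in hash are my own guesses for illustration, not details from the patent:

```python
N = 3  # shingle size in words; an assumption, not the patent's value

def shingles(text, n=N):
    """All n-word runs in the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def quoted_word_spans(original, reply, n=N):
    """Return the chunks of the reply that appear to be quoted from the original."""
    known = {hash(s) for s in shingles(original, n)}
    words = reply.split()
    hits = [i for i, s in enumerate(shingles(reply, n)) if hash(s) in known]
    # merge overlapping/adjacent shingle hits into word-index blocks
    blocks = []
    start = prev = None
    for i in hits:
        if start is None:
            start = i
        elif i > prev + n:
            blocks.append((start, prev + n))
            start = i
        prev = i
    if start is not None:
        blocks.append((start, prev + n))
    return [" ".join(words[a:b]) for a, b in blocks]

original = "let's meet at the coffee shop on tuesday afternoon"
reply = "sounds good! > let's meet at the coffee shop on tuesday afternoon see you then"
print(quoted_word_spans(original, reply))
```

Everything outside the returned spans is the new content, which is exactly what my friend’s reply-by-email feature needs.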

Where can you get the inside track on Active Directory?

Queue

Microsoft recently released the first round of their open protocol documentation. These sorts of documents are crucial information for anyone trying to do something challenging in the Exchange world. I was hoping to get a look at the undocumented parts of MAPI, and see a discussion of the variant that Outlook uses to communicate with Exchange, but it looks like that won’t be available for a few months.

Almost as valuable is the Active Directory Technical Specification, along with the related documents on the Security Account Manager and Directory Services Replication. For example, it gives detailed information on how to create a new user account through SAM, and a full IDL for DRS. This level of information makes it possible to design software that works seamlessly within a world of Microsoft services, so it’s not only great PR, it’s a cunning move to encourage more third-party development locked to Windows. Hopefully they’ll be rolling out the Exchange specs soon!

Enterprise email is boring

Boredbusiness

After chatting some more with Nat Torkington of O’Reilly, the source for my previous article, he pointed out I misquoted him. He actually said "enterprise email is boring", and then outlined a few examples of the huge number of exciting things that are waiting to happen with mail:

  • Forms in email for seamless interactivity with web applications … Ajax even?
  • The people you send email to are your contacts; why doesn’t your mail client automatically update your address book and buddy list when you’ve exchanged more than a few emails?
  • NLP-based clustering of mails into topical and thematic groups (pre-/auto- filtering)
  • Better indexing of old mail and visualizations of those indexes
  • Integrated GTD/productivity systems

Xobni does a good job with automatic contact extraction and ranking, but I’ve seen very little work done on the remaining areas. I Want Sandy is a great email-based scheduling tool that could grow into a full productivity system, and there’s been research on automatic mail categorization, but that’s about it.
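The address-book idea from that list is simple enough to sketch in a few lines of Python: tally mail in both directions and promote an address once traffic crosses a threshold. The threshold and the two-way requirement are my own choices, invented for illustration:

```python
from collections import Counter

PROMOTE_AFTER = 3  # "more than a few emails"; an arbitrary choice

def suggested_contacts(sent_log, received_log):
    """Addresses with enough two-way traffic to belong in the address book."""
    sent = Counter(sent_log)
    received = Counter(received_log)
    # require traffic in both directions so mailing lists and newsletters
    # you never reply to don't qualify
    return sorted(addr for addr in sent
                  if sent[addr] >= PROMOTE_AFTER and received[addr] >= PROMOTE_AFTER)

sent = ["nat@example.com"] * 4 + ["list@example.com"]
received = ["nat@example.com"] * 5 + ["list@example.com"] * 9
print(suggested_contacts(sent, received))  # ['nat@example.com']
```

A real client would also want recency weighting and a way to demote stale contacts, but even this crude rule beats maintaining an address book by hand.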

He also questioned why I am so focused on the server side, since it looks to him like most of the interesting stuff should happen on the client. I’m building a server-based system because that’s where the data is. There are patterns that emerge from a whole organization’s communications that you can’t see when you’re just looking at a single person’s inbox. There are companies like Trampoline Systems that offer business intelligence based on this, and a lot of forensic analysis work has been done to discover patterns after the fact, but nobody’s trying to build tools to give this information to users.

Another reason driving me is ease of use. It’s much simpler to build indices and do other pre-processing work ahead of time in a server and offer the user an immediate experience through a web app, than requiring a client-side install and then spending time and storage space creating that data locally.

Probably the biggest stumbling block with this plan is a final point he brings up, that the pace of change in corporate IT departments is painfully slow. The most successful products in this area have been driven by an urgent and painful problem like spam, where someone will be fired if a solution isn’t found. I’ll need a very compelling application to get traction.

Email is boring

Bored

I had a great conversation on Friday night with a very savvy technology journalist. I gave him the pitch on my email work, he threw in a lot of smart and incisive questions, and he discussed some of the similar projects he’d covered. At the end though he threw out the line "but anyway, email is boring".

That’s what I find so interesting about it! Here’s a content-creation technology that’s used by billions of people every day, far more than will ever write anything that ends up on the web, and almost no one’s doing anything innovative with it. Here’s why the web is hopping and email is languishing:

Closed technology. Email is scattered across different web services, in-house Exchange servers, social sites like Facebook, and a plethora of both web-based and PC clients. Most of these have no API you can use to programmatically access the messages, and the few that do have a very steep learning curve. That all makes it orders of magnitude easier to get to the "hello world" stage of a web app than it is to get started doing something interesting with mail.

Closed data. When you’re working with the web, there’s an enormous public corpus of data available just by spidering. Email is private, and it’s very hard to find large collections of email to work with. The Enron set is the only one I know of. That means even if you do have a brilliant idea for working with email, it’s very hard to prototype and test it.

Solve these two problems, even partially, and there’s a world of possibilities. That’s why I’m building a platform and API to let you work with email in a simple way. Write native importers that feed Exchange, Gmail, etc. data into a standard XML pipeline, and then you can cheaply and quickly create interesting tools to work with that information. Social networks, content analysis, collaboration tools, personal assistants, trend spotting: that’s when it all gets really exciting.
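A minimal sketch of what one stage of that XML pipeline might look like, using Python’s standard email and ElementTree modules. The element names here are my own invention for illustration, not an agreed schema; each importer would emit this same shape regardless of where the message came from:

```python
import xml.etree.ElementTree as ET
from email import message_from_string

def message_to_xml(raw):
    """Normalize one RFC 2822 message into a simple, importer-agnostic XML record."""
    msg = message_from_string(raw)
    root = ET.Element("message")
    for field in ("From", "To", "Subject", "Date"):
        el = ET.SubElement(root, field.lower())
        el.text = msg.get(field, "")
    ET.SubElement(root, "body").text = msg.get_payload()
    return ET.tostring(root, encoding="unicode")

raw = (
    "From: pete@petewarden.com\n"
    "To: nat@example.com\n"
    "Subject: Testing\n"
    "Date: Tue, 4 Mar 2008 20:15:00\n"
    "\n"
    "Hello!"
)
print(message_to_xml(raw))
```

Once every source is reduced to records like this, the analysis tools downstream never need to know whether the mail came from MAPI, IMAP or a web scrape.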

World War II and the Implicit Web

Spitfire

Traffic analysis is a field of espionage, focused on learning about the enemy by looking at their communication patterns without having to understand the content. Here’s some examples from the Wikipedia entry:

  1. Frequent communications — can denote planning.
  2. Rapid, short, communications — can denote negotiations.
  3. A lack of communication — can indicate a lack of activity, or completion of a finalized plan.
  4. Frequent communication to specific stations from a central station — can highlight the chain of command.
  5. Who talks to whom — can indicate which stations are ‘in charge’ or the ‘control station’ of a particular network. This further implies something about the personnel associated with each station.
  6. Who talks when — can indicate which stations are active in connection with events, which implies something about the information being passed and perhaps something about the personnel/access of those associated with some stations.
  7. Who changes from station to station, or medium to medium — can indicate movement, fear of interception.

Some of these might sound familiar to anyone interested in analysing implicit data. Number 4 sure sounds a lot like PageRank. The others can all be applied to any communications where you know the time, sender and recipients. Email content isn’t encrypted, but since computers can’t fully understand natural language it might as well be, so anything we can gather from the external characteristics is invaluable. There’s obviously a lot we could learn from the work that’s been done over the years.
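Point 4 in particular is easy to demonstrate: given nothing but sender/recipient metadata, a simple tally of how many distinct correspondents write to each address picks out the likely hub. A sketch in Python, with a traffic log invented for illustration:

```python
from collections import Counter

# Invented metadata log: (sender, recipient) pairs, no content needed
traffic = [
    ("alice@hq.example", "bob@hq.example"),
    ("carol@hq.example", "bob@hq.example"),
    ("dave@hq.example", "bob@hq.example"),
    ("bob@hq.example", "alice@hq.example"),
    ("carol@hq.example", "dave@hq.example"),
]

def likely_hub(log):
    """The address that the most distinct senders write to: a crude chain-of-command guess."""
    reporters = Counter()
    for sender, recipient in set(log):  # distinct sender->recipient pairs only
        reporters[recipient] += 1
    return reporters.most_common(1)[0][0]

print(likely_hub(traffic))  # bob@hq.example
```

Everyone reports to bob, so bob surfaces as the probable control station; the same tally over an organization’s mail headers would sketch out its real command structure.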

Unfortunately it’s been almost exclusively the territory of government intelligence services, and they don’t publish many papers. Some of the most useful work I’ve found has been in declassified World War II reports, but even there cryptanalysis tends to get the most coverage. Probably the most fascinating was the post-mortem report produced on the British TA work with German signals. It’s not very enlightening about the techniques they used, but the management recommendations they make are startlingly relevant for a modern tech company, once you get past the formal language:

"The policy of recruiting personnel for T.A. should emphasize the quality of personnel, not the quantity. Conforming to the usual pattern of history, in the beginning of such an undertaking as T.A., there is available only a very limited number of experienced people. Therefore, from the whole field of possible T .A. functions only the most useful and urgent should be undertaken. As the exploitation of these functions progresses, other possible functions will be recognised by the small but able original staff. Their suggestions for organisational changes and expansion should be encouraged and taken seriously. Only from operational experience can possible new functions be properly evaluated in the first instance. Once operational opinion is agreed that certain specific undertakings should be added, the additional personnel should be, as far as possible, chosen and trained by those who have the operational responsibility. … A wholesale creation of a T.A. staff with the a priori idea of providing a sufficient quantity of people to exhaust the field of T.A. … is wasteful and operationally inefficient."

History shows that small motivated teams usually beat the lumbering man-month monstrosities that large companies keep trying to assemble. I guess that’s a lesson they learnt back in 1942!

Enjoy a taste of Britain with Tikka Masala

Tikkamasala

There’s not much food I miss from the UK, but life isn’t worth living without an occasional Chicken Tikka Masala. Invented in the ’70s, it’s a combination of traditional Indian spices and the British love of sauces. Tender chunks of chicken are slowly cooked in a creamy tomato and yogurt sauce and served on rice with a side of naan bread. I’ll show you how to make your own at home.

Ingredients

Serves 2-3

1/4 stick of butter
1/2 teaspoon cumin seeds
cinnamon stick
8 cardamom pods
12 cloves
2 bay leaves
1 medium onion
6 cloves of garlic
1 piece of ginger root
1 tablespoon of powdered cumin
1 tablespoon of powdered coriander
1 14 oz can of diced tomatoes
1/2 teaspoon cayenne pepper
3/4 teaspoon salt
1 pair of chicken breasts (about 1lb)
2 tablespoons plain full-fat yogurt or sour cream

Preparation

The cooking itself takes about an hour, but you’ll need to prepare the ingredients before that starts. That usually takes me around 45 minutes, and I recommend a small glass of Kingfisher beer to help you along and add an authentic curry-house feel.

Garam Masala

The spice base used for many Indian dishes is known as garam masala. I’ve tried a lot of prepared mixes, but I’ve never found one that hits the spot. To make your own, take the cumin seeds, cinnamon, cardamoms, cloves and bay leaves, and place them on a small plate. Later you’ll be throwing them in hot oil to bring out the flavor.

Tikkaspices

Vegetable stock

Onion, garlic and ginger make up the stock base for the dish. Chop the onion finely and set it to one side for later. Crush the garlic into a small bowl or glass, then grate the ginger and add it to the garlic along with a little water. I love the smell of the garlic and ginger together; I’ve found the combination works great for any dish that needs a rich vegetable stock.

Tikkaginger

Chicken

This is my least favorite part of the recipe since I hate handling raw chicken. One tip I’ve found is that you can leave the frozen chicken a little under-defrosted and the slicing will be a lot simpler.

Take the breasts and slice them into roughly 1 inch cubes. I’m pretty fussy and remove any fatty or stringy sections so there’s just the pure white meat. Sprinkle the salt and cayenne pepper over the cubes, adding more than suggested if you like it hot, and leave to marinate for a few minutes.

Tikkachicken

Cooking

Now that all the components are ready, put the butter in a large, deep frying pan on a high heat. You need to get it hot, but not so hot that the butter separates or smokes. My usual test is to drop a single cumin seed into the oil. If it sizzles and pops within a few seconds, it’s hot enough.

Once up to the right temperature, add the plate of garam masala spices and stir for around a minute. You should start to smell the aroma of the spices as they mix with the oil.

Leaving the heat on high, add the onion and leave cooking for around 3 minutes, stirring frequently. The onion should be translucent and a little browned by the end.

Now add the cumin and coriander, along with the garlic/ginger mix. Mix well in the pan, and if it looks too dry add a little more water. Cook for another minute or so.

Add the canned tomatoes, mix again and leave for another minute.

Throw in the chicken and give it another good stir. Now turn the heat down to low and place a cover on the pan. After a few minutes have a peek; it should be bubbling very gently. Keep stirring every few minutes, and it should be ready in around 45 minutes. If the sauce looks too watery, leave the lid off the pan for the last 15 minutes to let it reduce. A few minutes before the end, mix in the yogurt or sour cream.

Tikkapan

Rice

Curry-house rice is usually basmati, and cooked quite dry compared to the American standard. I recommend a cup of rice to two cups of water and a medium-sized saucepan with a good lid.

Take the rice and soak in a large bowl of water for half an hour before you cook it. Then drain the water, and add the rice to the two cups of boiling water in the pan. Stirring is the enemy, since you’ll break up the grains and add sticky starch to the mixture, so just stir once when you add the rice, and then once again when it’s up to boiling. Once it’s reached that point put on the lid, turn down the heat and don’t peek for ten minutes since the steam’s doing a lot of the cooking and you don’t want to let it escape.

After ten minutes turn off the heat and fluff the rice with a fork, and then cover again until everything else is ready. Put the rice on plates as a base and add the curry on top.

Naan

You really need proper stretchy, sweet naan bread for the full Indian experience. Making that yourself is a whole different article, but you can try pitta bread as a poor substitute in a pinch, since it’s a lot easier to find.

Is PageRank the ultimate implicit algorithm?

Bookpile

PageRank has to be one of the most successful algorithms ever. I’m wary of stretching the implicit web definition until it breaks, but it shares a lot of similarities with the algorithms we need to use.

Unintended information. It processes data for a radically different purpose than the content’s creators had in mind. Links were meant to simply be a way of referencing related material; nobody thought of them as indicators of authority. This is the definition of implicit data for me: it’s the information you get from reading between the lines of the explicit content.
Completely automatic. No manual intervention means it can scale up to massive sets of data without a corresponding increase in the number of users or employees you need. This means it’s easy to be comprehensive, covering everything.
Hard to fake. When someone links to another page, they’re putting a small part of their reputation on the line. If the reader is disappointed in the destination, their opinion of the referrer drops, and this natural cost keeps the measure correlated with authority. This makes the measure very robust against manipulation.
Unreliable. PageRank is only a very crude measure of authority, and I’d imagine that a human-based system would come up with different rankings for a lot of sites.

As a contrast, consider the recipe behind a social site like Digg that aims to rank content in order of interest.

Explicit information. Every Digg vote is done in the knowledge that it will be used to rank stories on the site.
Human-driven. It relies completely on users rating the content.
Easy to fake. The voting itself is simple to game, so account creation and other measures are required to weed out bad players.
Reliable. The stories at the top of its rankings are generally ones a lot of people have found interesting; it seems good at avoiding boring content, though of course there’s plenty that doesn’t match my tastes.

A lot of work seems to be fixated on reliability, but this is short-sighted. Most implicit data algorithms can only ever produce a partial match between the output and the quality you’re trying to measure. Where they shine is their comprehensiveness and robustness. PageRank shows you can design your system around fuzzy reliability and reap the benefits of fully automatic and unfakeable measures.
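To make the "completely automatic" point concrete, here’s a minimal power-iteration PageRank in Python. The tiny link graph is invented for illustration; the algorithm is just the standard damped random-surfer iteration, with no human rating step anywhere:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a {page: [linked pages]} graph."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                # each page shares its rank evenly among the pages it links to
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling page: spread its rank evenly over everything
                for t in pages:
                    new[t] += damping * rank[page] / len(pages)
        rank = new
    return rank

links = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # 'b': linked by both a and c
```

Authority falls out of the link structure alone, which is exactly the property that lets it scale without the account-policing overhead a Digg-style explicit system needs.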

A secret open-source MAPI example, from Microsoft!

Mapiscreen

Microsoft’s Messaging API has been the core interface to data held on their mail systems since the early ’90s. For Exchange 2007 it’s deprecated in favor of their new web service protocol, but it’s still the language that Outlook speaks, and is the most comprehensive interface even for Exchange.

The underlying technology holding the mail data has changed massively over the years, and so the API has grown massive, inconsistent and obscure. It can’t be used with .Net, it requires C++ or a similar old-school language, and its behavior varies significantly between different versions of Outlook and Exchange. Some documentation and examples are available, but what you really need is the source to a complete, battle-tested application. Surprisingly, that’s where a grassroots effort from Microsoft’s Stephen Griffin comes in!

He’s the author of MAPI Editor, an administrator tool for Exchange that lets you view the complete contents and properties of your mail store. It also offers a wealth of other features, like the ability to export individual messages or entire folders as XML. Even better, he’s made it a personal mission to keep the source available. I know how tricky getting that sort of approval can be in a large company, and I’m very glad he succeeded; it’s been an invaluable reference for my work. I just wish it was given more prominence in the official Microsoft documentation; I had been working with the API for some time before I heard about it. That might be a reflection of its history, since it started off as a learning project, and evolved from being used as ad-hoc example code, to being documented in an official technical note, to shipping as part of the Exchange tools.

Another resource Stephen led me to is the MAPI mailing list. The archives are very useful, packed full of answers to both frequently and infrequently asked questions. It’s not often that you see an active technical mailing list that’s been going since 1994 either.