Try out OpenCalais’s semantic analysis for yourself

Calaisferry
Photo by graphistolage.com

I’ve been intrigued by the promise of automatically extracting information from raw text using semantic analysis, but I’ve never found a publicly-available component I could integrate into my own work that was good enough to get excited about. When OpenCalais was released I wanted to give it a spin, but there wasn’t a demo page available to run tests with. I’ve taken some of the PHP demo code they’ve released, added a CAPTCHA to deter robots, and put it online at http://funhousepicture.com/calaisdemo/

To use it, copy-and-paste some text, answer the CAPTCHA test, and click on Show Results. You should see some of the places, people and technical terms highlighted. If you mouse over, it shows what kind of object it is. You can download the source to my version of the demo here, though you’ll need to grab your own reCAPTCHA keys before it will run.
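
The highlighting step can be sketched in a few lines of JavaScript. This is only an illustration, not the demo’s actual code: the entities array and its text/type fields are an assumed shape rather than OpenCalais’s real response format, and highlightEntities and escapeHtml are hypothetical helper names.

```javascript
// Escape characters that have special meaning in HTML.
function escapeHtml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap each recognized entity in a span whose tooltip names its type.
// entities is an assumed shape: [{ text: "Juneau", type: "City" }, ...]
function highlightEntities(text, entities) {
  // Collect every occurrence of every entity in the original text.
  const hits = [];
  for (const entity of entities) {
    if (!entity.text) continue;
    let from = 0;
    while (true) {
      const start = text.indexOf(entity.text, from);
      if (start === -1) break;
      hits.push({ start: start, end: start + entity.text.length, type: entity.type });
      from = start + entity.text.length;
    }
  }
  // Sort by position; when two hits start together, prefer the longer one.
  hits.sort((a, b) => a.start - b.start || b.end - a.end);

  let html = "";
  let cursor = 0;
  for (const hit of hits) {
    if (hit.start < cursor) continue; // skip hits that overlap an earlier one
    html += escapeHtml(text.slice(cursor, hit.start));
    html += '<span class="entity" title="' + escapeHtml(hit.type) + '">' +
      escapeHtml(text.slice(hit.start, hit.end)) + "</span>";
    cursor = hit.end;
  }
  return html + escapeHtml(text.slice(cursor));
}
```

The title attribute is what produces the mouse-over label in most browsers; working from match positions in the original text, rather than repeated search-and-replace, avoids accidentally matching inside markup that was already inserted.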

Give it a try for yourself and let me know what you think. I’m primarily interested in automatically tagging business emails, and from my tests it’s got some promise. It didn’t seem to mistakenly identify many items in my material, but there were a lot of nouns it’s not designed to handle. I’d love to see something that understood dates, addresses and locations, but it doesn’t do a great job with these yet.

I’ll be running some more bake-offs figuring out what off-the-shelf semantic technology can do these days, so stay tuned.

Are you human?

Robotsindisguise
Photo by That_James

I hate having to ask my users to prove they’re human, so I wanted to make the process as painless as possible. I was only looking for a free, easy and usable way of adding spam-blocking images to my site, but ended up with one that does good too.

reCAPTCHA is a project by the inventors of the original CAPTCHA. As a web service it’s simple to add to your site, and it uses real words rather than random strings of characters, which makes it a lot easier for users. There’s no charge, and my favorite part is that every use helps decipher scanned books from the Internet Archive. One of the two words displayed is from an old book, and couldn’t be understood by the character recognition software that’s used to turn them into computer documents. Every correct answer identifies a hard-to-read word and helps to release that book to the public.

I’ve got an example online, with the source code here. There are prepackaged plugins for dozens of website systems like WordPress and MovableType. If you have a custom site, it only takes a few lines of code and five minutes to get going. Here’s what you do:

  • Go to http://recaptcha.net
  • Sign up for an account
  • Enter the name of your website and get an API key
  • Download the library for PHP and put it on your website
  • Open up the example-captcha.php file and type in your key

Now if you navigate to that PHP file in your browser, you should see a working CAPTCHA test. All you need to do then is put it where you need it in your own pages.

 

Update – I did get some feedback from someone else who’s fanatical about CAPTCHA usability. They were fond of the idea of reCAPTCHA too, but ended up switching to a custom solution because they were still seeing problems where users couldn’t understand the words. I was disappointed to hear that, but so far it’s the most usable component I’ve found.

Independence Day

Utahsunset
Utah Sunset by Tim Hamilton

I love America. That’s a phrase that’s used so much, it’s hard to hear it fresh, or say it with feeling. I first came here when I was 18, for a three-month vacation in Juneau, Alaska. The landscape seeped into my soul, and once I returned home it became a constant day-dream. I knew I had to go back. Seven years later, I moved over here, but was resigned to living in a concrete jungle since my job was in LA. Then I discovered the vast wildernesses that surround the city, and realized how much beauty there was almost anywhere in America.

I miss my family and friends back in the UK, but on a deep level I know this is where I belong. I feel at peace when I’m out in the mountains. I’m at home surrounded by Americans all determined to Do Something About It, whatever It is for them. It’s hard to talk about loving something abstract, but when I look out of a plane at the land below, I feel connected to it and everyone down there. I’ve so many reasons to be grateful for the chances America has given me, but it’s that raw feeling that keeps me here.

One of my favorite recent articles is The War over Patriotism, on the political struggles to claim patriotism for one side or another. The key insight was that the root of patriotism has to be an emotional bond to the place and people, not just an intellectual belief in our ideals. We’re always going to fall short of perfection, but that attachment keeps us trying again until we get it right.

Happy 4th of July to everyone! Now it’s time to get changed into my red coat and start trotting up and down clutching my musket while the neighbors take pot-shots at me…

Goodbye Apple

Applebadge

Yesterday was my last day as an Apple engineer. Deciding to leave was one of the toughest choices I’ve ever had to make. I’ve never seen such a dedicated and motivated team, and I’m certain that the hits will keep on coming. More than that, I’ll miss sharing my daily life with all the colleagues who became friends over the last 5 years. I have to single out my boss, Guido Hucking. He contacted me initially based purely on my open-source image processing work, hired me despite the hassles of sorting out a work visa, gave me the independence and support I needed to get things done, and then arranged for the office to close at 3:00pm on my last day so we could all head down to the local British pub!

Kingshead

I’m looking forward to the future, when I’ll be building my email ideas into a real product, but I’m always going to look back very fondly on my time at Apple. You can be sure I’ll be cheering on the Apple crew as they keep up their tradition of excellence. Here’s what I’m leaving behind:

I’ll miss bumping into Nathalie holding her tea, killing kobolds with JF P, debugging memory leaks with JF D, hearing tales of the OS world from JP, sitting in Martin’s comfy chair while we tried to find a nice fix to a nasty bug, thinking ‘What would Greg N do?’ on design questions, biking up hell hill with Angus, Richard S’s non-stop chatter, Fernando’s heroic wrestling with the build system, Brian’s willingness to dive into scary parts of the build system, Darrin for being so relentless with the filter bugs, Sid’s tube map and other great test footage, Bob’s ability to explain any C++ question, Charles’ tales of his bike commute across LA, Stephen’s calmness in the face of any bug, Greg A’s ability to skewer my thinking when I got sloppy, chatting with Eric B about hiking, having a fellow game refugee in Jed, Gavin’s desk-pounding and loud swearing late at night, Doug’s wizardry with model helicopters, Jake’s powered bike, Omid’s ability to rip out the nastiest code and turn it into a unit test, talking with Richard P about the Illuminati, the way Gilbert’s projects could always break my code, Nigel’s emergency tea stash (even though I never used it, it comforted me that it was there), Eric’s hair, Jayson and Pam’s pony-sized dog, the feeling of dread whenever I got a bug from Shin, because I knew she would have narrowed it down so precisely there was no escape, Steve’s patience as he tried to track me down to ask questions, Pete W’s ability to thrive in 3-hour-long spec meetings, Amanda’s knowledge of Better Off Dead, Enrique’s own CSI movie, Johanne’s literary adventures, Chris N’s encyclopedic knowledge of OpenGL that helped him to tell us exactly how we were messing it up as clients, Mike L’s Red Setters and WoW, Ken’s incisive questions, Gio’s drive to Get Things Done, Garret’s explanations of color science that even I could understand, running into Andrew in the garage, Steph M’s cat-herding with the schedule, Steph’s keeping the whole office purring along smoothly, Jeffy’s incredible PvP tales, Chris Bentley persuading ATI to take us out to a Red Sox game, Paul S being able to make progress even with my cruftiest filter code, Robin’s patience when doing the same with the templates, Sheila’s cheer-leading for the team, hearing Greg W’s stories of biking across the US, and David A for taking the time to give me the low-down on funding as an ex-VC.

There are hundreds more people there who’ve touched my life, and I’m sorry if I missed you out. I just want to say a big thank you to everyone, especially for being amazingly supportive of my decision to strike out on my own and leave them to deal with all my bugs!

How to create automatic blog categories with Lijit

Categories
Photo by Hawkexpress

I really like Lijit’s blog search widget, but I don’t want a cloud generated from the most popular searches. I’ve seen other blogs end up with some very inappropriate word combinations, apparently from people gaming the system. I also find the standard notion of tags very limiting; it’s only when I step back and see what I’ve been posting about that natural categories emerge. When I’m writing a post I often have no idea if it’s the first in a long series, or a one-off. I’d much rather have an automatic way of tagging all my posts, based on a few categories I describe after the fact.

If you look on my right bar, you’ll see a new ‘Categories’ list. These are actually canned Lijit searches, so clicking on them will bring up an in-context list of all the posts that match. For each category I’ve defined a Google search, often using the upper-case OR operator to pick a variety of different terms that are present in those types of posts. For example, the ‘Outdoors’ category searches for ‘hiking OR camping OR trails OR biking’.
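
As a small illustration of how such a search string is put together, here is a minimal sketch. The helper name buildCategoryQuery is hypothetical; the only detail taken from the text is that the terms are joined with an upper-case OR.

```javascript
// Join a list of terms into a single search string using the
// upper-case OR operator, ignoring blank entries.
function buildCategoryQuery(terms) {
  return terms
    .map(function (term) { return term.trim(); })
    .filter(function (term) { return term !== ""; })
    .join(" OR ");
}

// The 'Outdoors' category from the example above.
var outdoors = buildCategoryQuery(["hiking", "camping", "trails", "biking"]);
```

Here outdoors holds the same ‘hiking OR camping OR trails OR biking’ string used for the ‘Outdoors’ category.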

I’ve mentioned this to the very nice people at Lijit as a feature request for a more general widget, but for now I’ve included a simple tool below to generate your own category lists. It generates the raw HTML, and you’ll need to work out how to get it into your own blog. It also calls back into Lijit’s scripts to bring up the in-context results, so you’ll need to have the main widget already installed.
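
To give a feel for what a generator like this produces, here is a rough sketch that turns name/search pairs into an HTML list. It is not the tool’s real output: the list markup is a guess, buildCategoryList and escapeHtml are hypothetical names, and the searchUrlFor callback stands in for the Lijit call, whose actual URL and script hooks are not reproduced here.

```javascript
// Escape characters that have special meaning in HTML.
function escapeHtml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// categories: [{ name: "Outdoors", keywords: "hiking OR camping" }, ...]
// searchUrlFor: caller-supplied function mapping escaped keywords to a URL
// (an assumption; the real widget calls back into Lijit's scripts instead).
function buildCategoryList(categories, searchUrlFor) {
  var items = categories
    .filter(function (c) { return c.name && c.keywords; }) // skip incomplete rows
    .map(function (c) {
      // Spaces become '+' so the search terms can travel in a URL.
      var escapedKeywords = c.keywords.replace(/ /g, "+");
      return '<li><a href="' + escapeHtml(searchUrlFor(escapedKeywords)) + '">' +
        escapeHtml(c.name) + "</a></li>";
    });
  return "<ul>" + items.join("") + "</ul>";
}

// Usage with a placeholder search URL.
var listHtml = buildCategoryList(
  [{ name: "Outdoors", keywords: "hiking OR camping" }, { name: "", keywords: "unused" }],
  function (q) { return "https://example.com/search?q=" + q; }
);
```

Passing the URL builder in as a function keeps the sketch independent of any particular search service.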

Here’s what it takes to get this into Typepad:

  • Generate the HTML for your list using the form below. When you hit the button, copy the HTML that appears in the textbox to the clipboard.
  • Go to the TypeLists tab on your Typepad blog administration page.

Lijittutorial1

  • Click on the Create New List link.

Lijittutorial2

  • Set the type of the new list to Notes and the name to ‘Categories’

Lijittutorial3

  • Click on Add Item, and paste the HTML from the generator into the label textbox.
  • Go to the Publish tab and select the blog you want to add it to, and click Save Changes.
  • Go to Weblogs, then Design, and choose Select Content.
  • Disable the built-in categories module if you have it already selected, and click Save Changes.
  • Go to Content Ordering and drag the new ‘Categories’ list to where you want it, and save.

Now if you refresh your site, you should see the new categories appear.

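// Look up a form element by its id.
function get_object(id)
{
    return document.getElementById(id);
}

// Return the current value of a form field, or "" if it does not exist.
function get_value(id)
{
    var currentobject = get_object(id);
    if (currentobject != null)
        return currentobject.value;
    else
        return "";
}

// Build the category list HTML from the twelve name/search rows of the form.
function generate_widget()
{
    // Opening markup for the list; the tag text inside this string is missing from this copy of the page.
    var widgethtml = "";
    var username = get_value("username");
    var count;
    for (count = 0; count < 12; count += 1)
    {
        var nameid = "name" + count;
        var keywordsid = "keywords" + count;
        var namevalue = get_value(nameid);
        var keywordsvalue = get_value(keywordsid);
        // Only emit rows where both the name and the search have been filled in.
        if ((namevalue != "") && (keywordsvalue != ""))
        {
            // Spaces become '+' so the search can be used in a link.
            var keywordsescaped = keywordsvalue.replace(/ /g, "+");
            // Link markup using username and keywordsescaped; the tag text is missing from this copy.
            var currenttag = "";
            currenttag += namevalue;
            // Closing link markup, also missing from this copy.
            currenttag += "";
            widgethtml += currenttag;
        }
    }
    // Closing markup for the list, also missing from this copy.
    widgethtml += "";
    var textpreview = get_object("textpreview");
    var htmlpreview = get_object("htmlpreview");
    textpreview.value = widgethtml;
    htmlpreview.innerHTML = widgethtml;
}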

Lijit user name:

Name:

Search:

(The Name and Search fields repeat for up to 12 categories.)

Generated HTML:

Preview:

 

You can also open this in a separate page in case Typepad’s cleanup breaks the tool, and here’s a screenshot from my category creation:
Lijittutorial5

How do you use the new Exchange documentation?

Filingcabinets
Photo by Curious Yellow

Yesterday, Microsoft released a new series of their Open Specification documents, many of them related to Exchange and email. There are in-depth descriptions of all the APIs and protocols that connect the various parts of the mail ecosystem together, so it’s essential reading for anyone working with Exchange. When you first start looking through them they’re rather opaque, using a lot of internal code names and DOS-style filenames like [MS-OXCFOLD].pdf, so here are some tips I found handy.

First, download the whole Exchange archive of documents as a zip file. You’ll be doing a lot of referring back and forth between specifications, and that’s a lot easier when they’re local files.

Second, bookmark the official Exchange specification forum, or subscribe to its RSS feed. Microsoft have traditionally been very good at supporting developers, and they’re active answering questions here. It’s useful both if you end up with your own queries, and to learn from what other people are hitting.

Now the tricky part is understanding what documents to look at. There are a few conventions in the file names that help. They all start with MS-OX, but the next letter sometimes gives an idea of what category the file covers. C stands for communication protocols, O for object definitions, so MS-OXCMSG covers how to transmit a message between machines, and MS-OXOMSG defines the properties of an email object. Interestingly, MS-OXMSG has all the details of the .msg file format, though there’s already some information available on it from reverse engineering.
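
The naming convention is simple enough to sketch as a small lookup, which can be handy when sorting through the unzipped archive. This is only a rough aid built from the two letters described above: classifySpec is a hypothetical helper, and since the letter only sometimes indicates the category, anything else is reported as ‘other’.

```javascript
// Categories for the letter that follows the MS-OX prefix.
// Only the two letters described in the text are included.
var CATEGORY_BY_LETTER = {
  C: "communication protocol",
  O: "object definition"
};

// Classify a specification name such as "MS-OXCMSG" or "[MS-OXCFOLD].pdf".
// Returns null when the name does not follow the MS-OX convention at all.
function classifySpec(name) {
  var match = /^\[?MS-OX([A-Z])/i.exec(name.trim());
  if (!match) return null;
  var letter = match[1].toUpperCase();
  return CATEGORY_BY_LETTER[letter] || "other";
}

// Examples taken from the documents mentioned in this post.
var kinds = ["MS-OXCMSG", "MS-OXOMSG", "MS-OXMSG"].map(classifySpec);
```

With this lookup, MS-OXCMSG comes out as a communication protocol and MS-OXOMSG as an object definition, while MS-OXMSG falls into ‘other’ because its next letter is not one of the two known prefixes.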

I didn’t find the main MS-OXDOCO file that was supposed to explain what was included very useful. The MS-OXPROTO overview was a useful description of the overall architecture, but didn’t give me much of a clue where to start either. Since I was especially interested in the Outlook Exchange Transport Protocol (formerly known as MAPI/RPC), I started by examining the MS-OXCMSG message communication document.

Like most API documentation, it’s pretty dry and detailed, but I find the best place to start is actually at the end, by looking through the examples. These don’t have any code unfortunately, but they are pretty good at taking you through the steps of doing something useful with the protocols. It’s also a good idea to look closely at the references to other protocols in each document, since most of them work by building on top of other APIs. For the message transport, the key underlying protocol is Microsoft’s remote operation API, or ROP, defined in MS-OXCROP.

All in all, this new release of information looks like good news for anyone who has to make their product work with Exchange. You’ll still need a lot of patience and some packet sniffing tools, but this makes implementing your own services that replace parts of the mail ecosystem a lot less daunting. I’m also hoping this helps the development of interoperability libraries like Moonrug’s, which would open the door to a lot of innovative new products.