How to use Yahoo’s Placemaker API to extract places from documents

Today I was lucky enough to hear Greg Cohn walk us through all the goodies Yahoo offers developers. I'm a big fan and heavy user of their Geoplanet geocoding API, so I was stoked to hear they'd just launched a service to recognize placenames in arbitrary HTML and XML documents. Why is this so interesting? Look at what Just Landed have done by searching for the words "Just landed in" in Twitter messages and then geocoding and visualizing the placenames. Placemaker makes it a lot simpler to build tools like this with anything that can be expressed as XML or HTML. That covers web pages, REST APIs like Twitter's and even RSS feeds, so you can see why I'm excited!

I've put together a simple example, written as a bash script and tested on OS X, that shows how to use it. You can download it as geturlplaces.zip here, or I've included the source below. To use it, pass a web page address as the first argument, e.g. ./geturlplaces http://news.bbc.co.uk/

For production code you'll want a real XML parser rather than the regexes used below.

#!/bin/bash

# enter your Yahoo geo app id here – to obtain one go to http://developer.yahoo.com/wsregapp/index.php and register
# (interestingly as of May 20th 2009 it works with a bogus id!)
APPID=XXXXX

if [ $# -ne 1 ]
then
  echo "Extract a list of all the recognized place names from a web page using Yahoo's Placemaker API"
  echo "Usage: $(basename "$0") <web page url>"
  exit 65
fi

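# --data-urlencode protects any & or = in the page URL; --silent hides curl's progress meter
curl --silent --data-urlencode "documentURL=$1" -d "documentType=text/html&outputType=xml&appid=$APPID" "http://wherein.yahooapis.com/v1/document" \
  | grep '<text><!\[CDATA\[' \
  | sed -e 's/<text><!\[CDATA\[//' -e 's/\]\]><\/text>//' \
  | sort -u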
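
If you want to check the extraction step without calling the service, one option is to wrap the grep and sed stage in a function and feed it a saved response. The sketch below does that. The function names and the sample XML are my own assumptions for illustration: the sample is hand-written, not a captured response, and only mirrors the <text><![CDATA[...]]></text> lines the script above looks for.

```shell
#!/bin/bash

# Print the unique place names found in a Placemaker-style XML response on stdin.
# Assumes each name sits on its own line as <text><![CDATA[name]]></text>.
extract_places() {
  grep '<text><!\[CDATA\[' \
    | sed -e 's/^.*<text><!\[CDATA\[//' -e 's/]]><\/text>.*$//' \
    | sort -u
}

# Hand-written sample input (hypothetical wrapper element, not real API output).
sample_response() {
  cat <<'EOF'
<sample>
  <text><![CDATA[London]]></text>
  <text><![CDATA[Boulder]]></text>
  <text><![CDATA[London]]></text>
</sample>
EOF
}

sample_response | extract_places
```

Unlike the one-liner above, this version also strips the indentation in front of each name, so the output is one bare place name per line with duplicates removed.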

Privacy’s vanishing; how screwed are we?

Photo by Matanya

The whole theory behind Mailana is that people's attitudes to privacy are changing; there's a younger generation willing to open up private information as long as they get something useful in return and retain control. I've written about this before, but a recent post by Marc Hedlund brought some of my thoughts into focus.

He's a self-confessed "privacy freak" but concedes that he's on the losing side of the battle. Selfishly speaking, that's a great validation of the bet I'm making on my business, but what's interesting is his motivation. He says that people are blasé about privacy online because they've never been stalked or been the victim of identity theft. Once you go through that hell, like he has, you realize how useful all those old-fashioned notions really are.

That makes a lot of sense to me. Those are black swan events: statistically speaking, pretty unlikely to happen to you, but devastatingly bad when they do. What's worse is that easy-going attitudes towards privacy create an environment where criminals will thrive, actually making it more likely you'll be attacked in the future. By handing over personal information and even passwords we're all picking up pennies in front of a steam-roller right now.

I'm still a fan of people's new freedom to trade some privacy for something they want more, but I'm acutely aware that people are care-free about that bargain because they've never been stung. A lot of people are going to get hurt before we reach a new equilibrium, with widely understood ground rules for what's acceptable and safe.

Wild and wacky ways to access email

Photo by Stephanie Booth

When I designed the Mailana architecture I built my data pipeline around an XML format capturing the message information I need. That meant I could support a wide variety of sources by just writing a single import component for each that translated the native format into my XML. That's worked out really well, letting me pull in data directly from Exchange servers, Outlook PST files, Gmail and other IMAP services, and of course from Twitter.
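
To make the import-component idea a bit more concrete, here is a minimal bash sketch that turns the headers of one plain-text email into a small XML record. Everything specific in it is hypothetical: the element names (message, from, to, subject), the function names and the sample message are made up for illustration, and the real Mailana format isn't shown here. It also ignores folded (multi-line) headers.

```shell
#!/bin/bash

# Escape the characters that are special in XML text content.
xml_escape() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Print the value of one header (e.g. "from") from a message on stdin.
# Only the header block is searched; it ends at the first blank line.
header_value() {
  awk -v name="$1" '
    /^$/ { exit }
    index(tolower($0), tolower(name) ":") == 1 {
      sub(/^[^:]*:[ \t]*/, "")
      print
      exit
    }'
}

# Hypothetical import component: one raw message in, one XML record out.
message_to_xml() {
  local raw field value
  raw="$(cat)"
  printf '<message>\n'
  for field in from to subject; do
    value="$(printf '%s\n' "$raw" | header_value "$field" | xml_escape)"
    printf '  <%s>%s</%s>\n' "$field" "$value" "$field"
  done
  printf '</message>\n'
}

printf 'From: Alice <alice@example.com>\nTo: bob@example.com\nSubject: Q&A\n\nHello\n' | message_to_xml
```

Each source then only needs its own small translator like this one, and everything downstream reads the same record shape.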

I've been having some fascinating chats with Pete Sheinbaum, and one thing he's been enthusiastic about is tapping into the mass market by grabbing communications data that isn't easily accessible. In practice that means screen-scraping and other unconventional techniques, all of which are immensely appealing to my subversive geeky streak (see my old GoogleHotKeys project) and would be easy to integrate into Mailana as an import component. Here are some of my favorite approaches to grabbing email:

Yahoo IMAP Spoofing

Normally you can only get IMAP or POP access to your Yahoo inbox if you upgrade to a premium account. Last year they introduced their Zimbra desktop client which works even with free accounts, and it wasn't long before some enterprising coders discovered it was using a slightly modified version of IMAP. To programmatically access all Yahoo email accounts all you need to do is send one non-standard command!

Outlook Web Access Screen Scraping

The Substandard Evil Genius has a nice little snippet for logging into OWA and grabbing the HTML for an email inbox. Parsing that would give you the headers for a page of emails, and then you could grab the links to download each message's content. It's definitely tougher than using a real API, but with some time and care it's very feasible to pull down everything from an Outlook account.

TrueSwitch's Uber Screen Scraper

I hadn't run across TrueSwitch until recently, but they're a fascinating company. Their purpose in life is to let people transfer everything from old to new email accounts, including all messages and contacts. What rocks is that they use screen scraping to support all the main email services, even those like Hotmail that only have a webmail interface. Amusingly they're also used by all the main email providers to make it easy for users to switch to them, even those like Yahoo and MSN that deliberately don't offer an API, presumably to make it harder for users to switch away!

They don't offer an API, but what it does demonstrate is that it's possible to access almost everyone's email by screen scraping if you're willing to invest the time and effort.

Should you ignore the data?

Photo by EranB

I was lucky enough to hear Scott Petry, Brad Feld and Ryan McIntyre telling the story of Postini, the company Scott founded, and Brad and Ryan funded. One phrase from Scott really resonated:

Data is necessary, but not sufficient, to make good decisions

I love and live by metrics, and I'm a big fan of the Customer Development philosophy, but I think there are dangers in taking it too far, and Scott crystallized the problem. Customer Development rocks because it forces startups to gather data, something most would otherwise neglect. It sucks if it prevents you from following paths that cause your metrics to dip in the short-term in return for a long-term win.

One fascinating example of this was profitability. Scott explained how reaching profitability early became a curse, because it became immensely important every quarter after that to avoid dipping into a loss, even when spending more money would be a long-term win. Irrational primates that we are, humans get a lot more worried by a small drop that causes the company to go from profit to loss than they do by the same drop when you're already making a loss, so the statistic becomes their master, not their servant.

I asked the guys to talk more about the metrics they used, and they talked about their in-depth tools for measuring customer satisfaction which helped them to an astonishing 96% customer renewal rate! But interestingly Brad and Scott both stressed the need to sometimes ignore or go against the evidence. It all reminded me of Tom Evslin's motto "nothing great has ever been accomplished without irrational exuberance" – the trick is knowing when to take that leap of faith. To do that, you have to know that you're taking a leap in the dark, which requires knowing the evidence in the first place and making a conscious decision to override it.

You should never ignore the data, but sometimes you have to listen and make a deliberate choice to go against it.

Why I love my Kindle

Photo by Roger McLassus

I've been very wary of ebooks. I've always found reading long-form text on a computer monitor very tiring, and I couldn't imagine anything being as elegant, easy and convenient as a paperback. I was wrong!

Last month I tried out Amazon's iPhone application as an experiment, and immediately got hooked. For me the joy was the immediacy. I'd be wondering about a period I wished I knew better, thinking about a historical question or just remembering a favorite author and I could instantly find and download the book to match. Even on the iPhone's tiny screen I had no trouble reading, despite reducing the font-size to the minimum to pack more onto the page. Since we had a couple of long LA to Boulder drives coming up, Liz noticed my new addiction and gave me a Kindle 2 as an early birthday present.

I'm very, very impressed by the hardware. The screen is sharp and readable, the battery lasts forever with the wireless disabled, and the form factor is just like a conventional book. The software is well thought out and extremely minimal, just the way I like it, and purchasing is made very simple. The only downside is that it's not possible to read at night without an external light, unlike the iPhone version.

I really wish there was an API I could use to expose my reading list as a blog widget, it would be fun for me and free advertising for Amazon, but here's a rundown I manually compiled from my accounts page:

Battle Cry of Freedom : The Civil War Era

I've read accounts of different battles and studied the abolition movement years ago, but I've never had a clear picture of the entire war. McPherson starts off by doing a great job describing the background of the conflict. There's never any doubt that he sees the South as the villains, but he makes their worldview at least understandable, if not forgivable. Particularly fascinating was how the rise of the factory system turned northern freelancers into much more regimented employees, the original 'wage slaves', a system the Southern whites feared and hated.

I was more amazed than ever by Lincoln; he was headed for almost certain defeat in his re-election, but was determined to do what he thought right for the country. Seeing how often he and his team failed was inspiring too; they just kept trying despite sometimes overwhelming odds.

Lancaster Against York

The US civil war got me thinking about the Wars of the Roses in England. At the end of the medieval age there was a succession of kings contesting the throne backed by powerful factions, with frequent battles between rival forces tearing apart the kingdom. What became clear from reading this book, and especially the eventual triumph of the Tudors, was that public opinion mattered. Henry VII came to power with a flimsy hereditary claim but the support of the population. Unlike absolutist monarchies such as the French, it was never possible for English royalty to forget that their power was ultimately given by the people's consent. That political lesson matured into the democracies we have today, with especial thanks to some colonists who were able to remind King George!

Wyrd Sisters

After some heavy reading, I needed something fun to cleanse my palate. I was an avid Pratchett reader when he first started (I still have a fondness for his much-maligned early SF novels) but gave up when they started to feel too predictable. Jumping back in, the plot definitely wasn't the best part, but there were a few surprises and it was a delight to wallow in his gorgeous writing again. With so many true-to-life observations and characters, along with the Macbeth references, I enjoyed it immensely and will be digging out some more of his huge catalog.

The People of the Abyss – The Underworld of Victorian London

I have a fascination with the Victorian era, maybe because it's the last time British people actually seemed to have a purpose, even if the results of their efforts weren't always good. I was very curious to know why Jack London had gone from the wilderness to inner-city Victorian London, and had high hopes. I have to admit I'm only about a third of the way through. He's obviously writing with a strong purpose, but that makes it hard to see the real people below the apocalypse he's predicting. I also have a hard time suspending my disbelief that this American managed to blend in as a Cockney nearly as well as he claims. I'll be coming back to this, but it's heavy going.

The Diamond Age

After the last unsuccessful foray into the 19th century, I decided to return to one of my favorite books. Stephenson is best known for Snow Crash and Cryptonomicon, but I like this much better. There's no super-powerful protagonist or awkward romance with nerd-fantasy women (I swear the archeologist in Cryptonomicon is Lara Croft), just real people and intriguing debate on the value judgments we make on Victorians.

Desperate Passage: The Donner Party's Perilous Journey West

We were driving through Utah with 3 cats and a dog in my hatchback, and the snow started to come down very heavily. That naturally got me wondering about the Donner Party. All I really knew was the comic-book version, so I downloaded and read this book on the road. What really struck me was how much like a startup the pioneer parties were, how many timing and resource decisions they had to make and how little information they sometimes had. The Donner Party were brought down by a series of mistakes that left them stranded on the wrong side of the mountains when winter came. What's clear from the history is how each step seemed fairly rational at the time, but they all added up to a nightmare.

The Fall of the Roman Empire: A New History of Rome and the Barbarians

This is my current read, and I'm working my way through it. I've long been captured by the mystery of the Dark Ages, and this is an interesting approach to their beginning that takes a modern look at why Rome fell.

The best interactive social network demos

As you might be able to tell from http://twitter.mailana.com/ I'm a big fan of displaying relationships between people as a graph. Here are a few of my favorite examples:

http://www.neuroproductions.be/twitter_friends_network_browser/ has a very compact tool for browsing through your friends and followers. It shows the latest message from each person, and it's very quick and simple to browse through the graph. Unlike Mailana it doesn't use your conversations to figure out the graph, just the list of people you follow.

Muckety is a fascinating experiment that uses its database of connections between people and institutions to illustrate news stories. You can manipulate the graphs and use them to explore deeper links. I haven't found a news story that's a killer app for this (Oil Change USA have something similar with very targeted graphs showing how oil money interacts with politics), but I would love to see more journalists using tools like this to show rather than tell.

The TouchGraph folks have been doing amazing work for years with their Java visualization applet. My favorite demo is probably their Facebook graph, but check out their Google tool if you want to see the connections between websites.

For a slightly mysterious Facebook network visualizer give Nexus a test-run. It shows a lot of detail on your friends and their connections, and I'm very fond of the straightforward interface.