Discover public data with the Data Source Handbook

I’m pleased to announce that the Data Source Handbook is now available from O’Reilly. It’s a compact ebook guide to the most useful APIs and bulk data sets I’ve found, packed with examples and advice. These are hand-picked services that I’ve actually spent time using during my own work, and I chose them because they add insights and information to data you’re already likely to be dealing with. You can check out the table of contents below, and I’ve also included a couple of excerpts.

It’s organized by the kind of data that you want to look up information on, from websites to locations, email addresses to ISBNs. There’s a whole new world of free or cheap public data out there, and I’ve been having a blast exploring it myself, so I hope you’ll enjoy it as much as I have. A big thanks to everyone who helped me compile this too, from my editors Mike Loukides and Teresa Elsey to all the helpful people on Quora, along with the many friends who emailed me ideas. Keep the suggestions coming; I’ll be working on an updated edition soon.

Websites:

  •   WHOIS
  •   Blekko
  •   bit.ly
  •   Compete
  •   Delicious
  •   BackType
  •   PagePeeker

People by email:

  •   WebFinger
  •   Flickr
  •   Gravatar
  •   Amazon
  •   AIM
  •   FriendFeed
  •   Google Social Graph
  •   MySpace
  •   GitHub
  •   Rapleaf
  •   Jigsaw

People by name:

  •   WhitePages
  •   LinkedIn
  •   GenderFromName

People by account:

  •   Klout
  •   Qwerly

Search terms:

  •   BOSS
  •   Blekko
  •   Bing
  •   Google Custom Search
  •   Wikipedia
  •   Google Suggest
  •   Wolfram Alpha

Locations:

  •   SimpleGeo
  •   Yahoo
  •   Google Geocoding API
  •   CityGrid
  •   Geo-Coder-US
  •   Geodict
  •   GeoNames
  •   US Census
  •   Zillow Neighborhoods
  •   Natural Earth
  •   US National Weather Service
  •   OpenStreetMap
  •   MaxMind

Companies:

  •   CrunchBase
  •   ZoomInfo
  •   Hoovers
  •   Yahoo Finance

IP Addresses:

  •   MaxMind
  •   Infochimps

Books, films, music and products:

  •   Amazon
  •   Google Shopping
  •   Google Book Search
  •   Netflix
  •   Yahoo Music
  •   MusicBrainz
  •   The Movie DB
  •   Freebase

WHOIS

The whois Unix command is still a workhorse, and I’ve found the Domain Tools web service a decent alternative too. You can get the basic registration information for any website. In recent years, some owners have chosen ‘private’ registration, which hides their details from view, but in many cases you’ll see a name, address, email and phone number for the person who registered the site. You can also enter numerical IP addresses here and get data on the organization or individual that owns that server.

Unfortunately, the terms of service of most providers forbid automated gathering and processing of this information, but you can craft links to the Domain Tools site to make it easy for your users to access the information.

<a href="http://whois.domaintools.com/www.google.com">Info for www.google.com</a>
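If you're generating these links for a list of sites, a small helper keeps the markup consistent. This is only a sketch: the URL pattern is copied from the example link above, while the function names `escapeHtml` and `whoisLink` are hypothetical and not part of any Domain Tools API.

```javascript
// Escape text so it is safe to place inside HTML markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build a link to the Domain Tools WHOIS page for a host name.
// Assumption: the URL is simply the base address followed by the host name,
// as in the hand-written example link.
function whoisLink(hostname) {
  var url = 'http://whois.domaintools.com/' + encodeURIComponent(hostname);
  return '<a href="' + escapeHtml(url) + '">Info for ' + escapeHtml(hostname) + '</a>';
}

console.log(whoisLink('www.google.com'));
```

Escaping the host name matters if it comes from user input, since it ends up both in the URL and in the visible link text.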

There is a commercial API available through whoisxmlapi.com that offers a JSON interface and bulk downloads, which seems to contradict the terms mentioned in most WHOIS results. It costs $15 per thousand queries. Be careful, though: it requires you to send your password as a non-secure URL parameter, so don’t use a valuable one.

curl "http://www.whoisxmlapi.com/whoisserver/WhoisService?\
domainName=oreilly.com&outputFormat=json&userName=<username>&password=<password>"
{"WhoisRecord": {
"createdDate": "26-May-97",
"updatedDate": "26-May-10",
"expiresDate": "25-May-11",
"registrant": {
"city": "Sebastopol",
"state": "California",
"postalCode": "95472",
"country": "United States",
"rawText": "O'Reilly Media, Inc.\u000a1005 Gravenstein Highway North
\u000aSebastopol, California 95472\u000aUnited States\u000a",
"unparsable": "O'Reilly Media, Inc.\u000a1005 Gravenstein Highway North"
},
"administrativeContact": {
"city": "Sebastopol",
...
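
Once you have the JSON back, pulling out the registrant details is straightforward. Here's a minimal sketch that assumes the response has the shape shown in the sample above (a `WhoisRecord` object containing a `registrant` block); the function name `registrantSummary` is hypothetical, and any field that's missing is simply skipped.

```javascript
// Summarize the registrant block of a parsed WHOIS response.
// The field names (city, state, postalCode, country) come from the sample
// response; the real service may return more or fewer fields.
function registrantSummary(response) {
  var record = (response && response.WhoisRecord) || {};
  var registrant = record.registrant || {};
  var parts = [registrant.city, registrant.state, registrant.postalCode, registrant.country];
  return parts.filter(function (part) { return Boolean(part); }).join(', ');
}

// A cut-down copy of the sample response, for illustration only.
var sample = JSON.parse(
  '{"WhoisRecord": {"registrant": {"city": "Sebastopol", "state": "California", ' +
  '"postalCode": "95472", "country": "United States"}}}'
);
console.log(registrantSummary(sample));
```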

Blekko

Blekko is the newest search engine in town, and one of its selling points is the richness of the data it offers. If you type in a domain name followed by /seo, you’ll receive a page of statistics on that URL.

They are also very keen on developers accessing their data, so they offer an easy-to-use API through the /json slash tag, which returns a JSON object instead of HTML.

http://blekko.com/?q=cure+for+headaches+/json+/ps=100&auth=<APIKEY>&ft=&p=1
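
A query URL like that one can be assembled in code. This is a sketch that only mirrors the pattern of the example URL above: the function name `blekkoJsonUrl` and its parameters are hypothetical, and I'm assuming the /ps slash tag sets the page size and p the page number.

```javascript
// Build a Blekko JSON query URL following the pattern of the example URL.
// Assumptions: words are joined with '+', the /json and /ps slash tags are
// appended to the query unescaped, and `auth` carries the API key.
function blekkoJsonUrl(query, apiKey, pageSize, page) {
  var terms = query.trim().split(/\s+/).map(function (word) {
    return encodeURIComponent(word);
  }).join('+');
  return 'http://blekko.com/?q=' + terms + '+/json+/ps=' + pageSize +
    '&auth=' + encodeURIComponent(apiKey) + '&ft=&p=' + page;
}

console.log(blekkoJsonUrl('cure for headaches', '<APIKEY>', 100, 1));
```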

To obtain an API key, email apiauth@blekko.com. Their terms of service are available at https://blekko.com/ws/+/terms, and while they’re somewhat restrictive, they are flexible in practice:

“You should note that it prohibits practically all interesting uses of the blekko API. We are not currently issuing formal written authorization to do things prohibited in the agreement, but, if you are well behaved (e.g. not flooding us with queries), and we know your email address (from when you applied for an API auth key, see above), we will have the ability to attempt to contact you and discuss your usage patterns if needed.”

Currently, the /seo results aren’t available through the JSON interface, so you have to scrape the HTML to obtain them. There’s a demonstration of that at https://github.com/petewarden/pagerankgraph.

 

A fundamental bug in HTML5’s Canvas?

[Screenshot: OpenHeatMap's Canvas renderer showing pale seams inside the states]

I still get requests for an HTML5 implementation of OpenHeatMap, so I guess I've done a terrible job of telling people about the Canvas-based renderer I've had in there since it launched. The confusion comes about because I default to Flash if your browser has it installed, since it's usually faster and there's still one problem with the Canvas implementation that I haven't been able to fix.

If you look at the screenshot above, you'll see pale white lines within the states. Those are boundaries between the internal polygons that they're made of, and in the Flash version they don't show up. The fundamental problem is that if you render two polygons that share an edge, Canvas will show a visible join along that edge, whereas Flash will seamlessly meld the two together, with no difference visible if they're the same color. I've put together a minimal page here to show the issue:

http://web.mailana.com/labs/stitchingbug/

The source, along with a Flash project doing the same thing and producing the expected results, is here:

http://github.com/petewarden/stitchingbug

The fundamental issue is that it's impossible to do any complex polygonal rendering if you can't stitch polygons together without seams. I don't know exactly what Flash's fill rules are, but they produce the correct results, as do the 3D renderers I've used in the past. The Canvas behavior is the same across browsers, which makes it seem deliberate, so any references to the rules used would also be appreciated. Here's the Canvas code:

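    // `canvas` is assumed to be an existing <canvas> element.
    var ctx = canvas.getContext('2d');
    ctx.fillStyle = 'rgb(0,0,0)';

    // Left rectangle, ending at the shared edge x = 50.5.
    ctx.beginPath();
    ctx.moveTo(0, 0);
    ctx.lineTo(50.5, 0);
    ctx.lineTo(50.5, 100);
    ctx.lineTo(0, 100);
    ctx.closePath();
    ctx.fill();

    // Right rectangle, starting at the same shared edge.
    ctx.beginPath();
    ctx.moveTo(50.5, 0);
    ctx.lineTo(100, 0);
    ctx.lineTo(100, 100);
    ctx.lineTo(50.5, 100);
    ctx.closePath();
    ctx.fill();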
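
One possible way to reason about the seam, offered as an assumption rather than a confirmed account of any browser's rasterizer: if each fill is anti-aliased by pixel coverage and then composited with the default source-over rule, the pixel column that straddles x = 50.5 gets two separate half-covered fills instead of one full one. The sketch below just does that arithmetic for a single color channel, assuming a white background behind the canvas; `fillBlack` is a hypothetical helper, not a Canvas call.

```javascript
// Source-over compositing of a black fill onto a single-channel background
// value (0 = black, 255 = white). `coverage` is the fraction of the pixel
// the polygon covers, used here as the effective alpha of the fill.
function fillBlack(background, coverage) {
  return background * (1 - coverage);
}

var white = 255;
// The pixel column spanning x = 50..51 is cut in half by the shared edge
// at x = 50.5, so each rectangle covers half of it.
var afterLeft = fillBlack(white, 0.5);       // left rectangle's fill
var afterRight = fillBlack(afterLeft, 0.5);  // right rectangle's fill
console.log(afterLeft, afterRight);
```

Under these assumptions a quarter of the background still shows through along the shared edge, which would look like a pale line. A single polygon covering the whole pixel, or an edge that falls exactly on a pixel boundary, would give full coverage and no seam.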

Anybody have any insights on this? I'd love to deprecate the Flash version, but I need to understand what's going on here, and I'm at a dead end. I'd love to hear there's something obvious I'm doing wrong.

Update – Thanks for the suggestions. I purposely simplified the example to avoid alpha, but alpha is exactly where most of the 'stroke()' approaches I'd already tried fall down. I've included code to demo that below. I'm really not trying to be a jerk about this; honestly, I'd love to know that I'm an idiot, as long as I find a solution. The public humiliation will be worth it, I swear.

    ctx.fillStyle = 'rgba(0,0,0,0.5)';
    ctx.strokeStyle = 'rgba(0,0,0,0.5)';

    ctx.beginPath();
    ctx.moveTo(0, 0);
    ctx.lineTo(50.5, 0);
    ctx.lineTo(50.5, 100);
    ctx.lineTo(0, 100);
    ctx.closePath();
    ctx.fill();
    ctx.stroke();

    ctx.beginPath();
    ctx.moveTo(50.5, 0);
    ctx.lineTo(100, 0);
    ctx.lineTo(100, 100);
    ctx.lineTo(50.5, 100);
    ctx.closePath();
    ctx.fill();
    ctx.stroke();

 

Trunk.ly and Egypt

Photo by Al Jazeera

We have to be careful not to project our own obsessions onto the rest of the world. As one Egyptian commentator said, "it's not about you". Technology has clearly played a role in the protests, but I'm betting that cell phones and television are a lot more influential than social networks.

Still, it's been amazing to watch how Twitter has been used to spread the word, especially to the outside world. The only trouble is that there are so many links and comments that it's almost impossible to follow any one topic amongst the volume of messages. That's where the Trunk.ly curation service comes in.

As I was chatting to the founders Tim and Alex yesterday, they pointed me at one of their users, ExiledSurfer. Since the protests began he's been collecting a massive number of links and reports from inside the country and relaying them through his Twitter stream. When people get in touch asking for information, he points them at his trunk.ly stream that gives his links "with no noise". With the ability to search and categorize by tags, and a simple view showing just the links and snippets of information about them, it's a much better way of documenting the material than just Twitter.

With over 5,000 links in his archive, he's turned his messages from an ephemeral stream into something more like a library of reference material, all without having to do anything more than use Twitter the way he always has. This is a case where technology has enabled a single person to become vastly more effective as a news broker than we could have imagined just a few years ago.

I've been watching the progress of Trunk.ly since before it was a glimmer in Tim's eye, thanks to his wonderful weekly blog chronicling their startup progress as it happens. With a site that's headed into the top 20,000 worldwide according to Alexa, and over seven million links collected, it feels like there are a lot of people who like their curation model. I'm looking forward to seeing how it helps us all organize our knowledge, and maybe plays a small part in spreading the word in situations like Egypt.

OpenHeatMap now supports Hong Kong and Canadian constituencies

Thanks to some very helpful folks at the University of Hong Kong, I've been able to add support for parliamentary constituencies in both Hong Kong and Canada. As with all OpenHeatMap areas, all you need to do is list the names of the constituencies and the values you want to display in a CSV file, Excel spreadsheet, or a Google document. Upload it, choose your display options, and you can create your own interactive political maps.

For more information and example files to download, go to the documentation sections on Hong Kong or Canada. I look forward to seeing what you build with these, so let me know how you get on.

Five short links

Photo by Sam Teigen

Bixo Labs – A company building bespoke web data mining solutions, founded by Ken Krugler of the late, lamented Krugle code search engine. Interesting because it shows how web crawling is now moving out of the labs and big corporations, into the reach of ordinary companies. There’s a nice explanation of their crawling architecture too.

WireIt – Build your own Yahoo Pipes/Quartz Composer interface with this free JavaScript library. I’m having a blast with it myself. Thanks to Russ for the tip.

St Louis Fed – A feast of free economics data. I can’t believe I didn’t know about this before. Via Paul Kedrosky.

Deri Pipes – Talking of Yahoo Pipes, here’s an open-source clone. I see errors pop up when I try to use it unfortunately, but with Yahoo’s recent troubles I’m looking out for alternatives to their services.

Datamarket releases 13,000 data sets – Another lovely set of data appears on the web. I haven’t seen anyone make a pure data marketplace work as a business model, but I wish them luck and appreciate more sources of data. Via Paul Kedrosky.

Five short links

Photo by Chris in Plymouth

Extraordinary Claims – An in-depth look at the methodology behind both Daryl Bem’s research claiming evidence of precognition, and the critical responses to it. I’m deeply sceptical that his claims are correct myself, but Peter clearly lays out how the critics are trying to change the rules to dismiss them, rather than having a fair fight. As the Climategate coverage shows, science isn’t just about getting the right answer. Like justice, it has to be seen to be done, so having a transparent and even-handed process for dealing with heretics is important.

25 Commandments for Journalists – I’ve been thinking a lot about ‘sensationalism’ in writing, engaging the reader and how to square that with truth, justice and the American Way (of journalism). It’s been one of the most controversial topics I’ve tackled, provoking some insightful pushback from regulars like Emily Cunningham. This manifesto from Tim Radford articulates the British position far better than I’ve managed to so far, with key phrases like “Nobody has to read this crap” and “Words like ‘sensational’ and ‘trivial’ are not insults to a journalist”. The final commandment is the most important though, about the balancing act we all need to do.

25. Writers have a responsibility, not just in law. So aim for the truth. If that’s elusive, and it often is, at least aim for fairness, the awareness that there is always another side to the story. Beware of all claims to objectivity. This one is the dodgiest of all. You may report that the Royal Society says that genetic modification is a good thing, and that depleted uranium is mostly harmless. But you should remember that genetic modification was invented by people who were immediately elected to the Royal Society for their cleverness, by people already in there because they knew how to enrich uranium fuel rods and deplete the rest. So to paraphrase Miss Mandy Rice-Davies (1963) “They would say that, wouldn’t they?”

Cuatro años de ejecuciones en México (Four years of executions in Mexico) – This is exactly the sort of important story I hoped OpenHeatMap could help tell. A sobering read, even through a stuttering automatic translation.

Gremlin – A screencast introducing Gremlin, a Groovy-based graph processing language. Uses analysis of Grateful Dead playlists as the example, and makes dealing with graph traversal look easy. Thanks to Chris Diehl for the heads-up.

Wolfram Alpha’s API is free, but is it open? – Wolfram has assembled an awesome collection of knowledge and aims to make it ‘computable’, but their API only returns images and textual descriptions of their data. If we’re going to do more than just display supplemental search results to users, we’ll need a machine-readable version. Anybody know folks there that I can quiz about that? Email me if you do, thanks (or if you have any other thoughts too of course!).

Five short links

[Image: player-created Eve Online character portraits]

Eve Online User-Generated Portraits – Just look at the quality of those pictures above – they’re all created by players using Eve Online’s character generation system. Back in 1997 I worked on a pool game with animated characters, and a generator we nicknamed the Barbie Fashion Show, but I never realized how far the technology had come since then. Just check out this video of the interface to see how easy it is to create amazing results. In a funny twist, one of my old colleagues from that pool game is now working for Eve out in Iceland as a senior designer.

Carpets for airports – A connoisseur’s guide to airport flooring, revealing their secret meanings. The Da Vinci Code for carpets, with a funky Flash interface.

Without adding context, a journalist with data can be dangerous – A fantastic example of something I’ve been struggling to get across to people. At the moment we’re incredibly susceptible to believing numbers people throw at us, in a way we wouldn’t with stories told in prose. As a society, we need to wise up and develop enough savvy to build an immune system to this sort of manipulation, and part of that has to be calling out distortions like the one David pounces on.

Lighting the dark continent – Africa’s lack of development can seem staggering, but as Jon points out, it’s also a massive opportunity.

The Quantified Self Conference – I didn’t know there was a grass-roots movement around this idea; I thought it would be something driven by the product folks, but it makes sense that there are people interested in instrumenting their lives. This looks like a great opportunity to get feedback if you are in the business of offering solutions around this area.

Spotlight your startup at Strata

Photo by Bryan Stevenson

Are you a data startup who'd love to be at Strata but can't afford the admission? You now have a chance to attend the conference and show off what you've been building, thanks to the Strata Startup Showcase. There's space for fifteen startups, and successful companies will be given two free passes and five minutes to show off their work in front of investors. It's a great opportunity, but the deadline for admissions is Friday, so you'll need to be quick.

Don't forget the free Big Data Camp Unconference on the Monday before the main event too. The price is specially tailored for starving entrepreneurs' wallets.

What makes a good data API?

Picture by Victoria (Mouse World)

I’ve been working on a guide to data APIs, and making decisions about what to include has forced me to think about exactly what I look for. If you’re going to build an API that’s useful to a wide range of people, and will add value to the whole data ecosystem, here’s what you need.

  • Free, or self-service signup. Traditional commercial data agreements are designed for enterprise companies, so they’re very costly and time-consuming to experiment with. APIs that are either free or have a simple sign-up process make it a lot easier to get started.

  • Broad coverage. There have been quite a few startups that build infrastructure, and hope that users will then populate it with data. Most of the time, this doesn’t happen, so you end up with APIs that look promising on the surface but actually contain very little useful data.

  • Online API or downloadable bulk data. Most of us now develop in the web world, so anything else requires a complex installation process that makes it much harder to try out.

  • Linked to outside entities. There has to be some way to look up information that ties the service’s data to the outside world. For example, the Twitter and Facebook APIs don’t qualify because you can only find users by internal identifiers, whereas LinkedIn does because you can look up accounts by their real-world names and locations.

The first three principles are just about ease of use, but having linkable data is essential if you’re going to allow developers to innovate by combining data sources. Once you’ve got an external reference point, you can join information to come up with insights you’d never expect.
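
To make the idea of joining on an external reference point concrete, here's a minimal sketch. Everything in it is made up for illustration: the two data sets, their field names, and the `joinByKey` helper. The only point is that a shared real-world key (an email address here) lets records from unrelated sources be combined.

```javascript
// Join two data sets on a shared real-world key.
// Rows from `left` are kept only if `right` has a row with the same key,
// and the matching fields are merged into one record.
function joinByKey(left, right, key) {
  var index = {};
  right.forEach(function (row) { index[row[key]] = row; });
  return left
    .filter(function (row) {
      return Object.prototype.hasOwnProperty.call(index, row[key]);
    })
    .map(function (row) {
      return Object.assign({}, row, index[row[key]]);
    });
}

// Hypothetical results from two different services.
var accounts = [
  { email: 'ann@example.com', username: 'ann42' },
  { email: 'bob@example.com', username: 'bobby' }
];
var locations = [
  { email: 'ann@example.com', city: 'Boulder' }
];

console.log(joinByKey(accounts, locations, 'email'));
```

If the only identifier a service exposes is its own internal ID, there's nothing to put in the `key` argument, which is the problem the fourth principle describes.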

Five short links

Photo by Chris in Plymouth

The Linked Open Data cloud diagram – I disagree with the Linked Data philosophy. I think top-down, formal semantic approaches are a dead end, and believe RDF is the Devil’s Own Format. I can’t deny that the array of sources they’ve linked together is impressive though, and it’s beautifully presented here.

Taco Bell Programming – The hacker mentality can be an incredibly powerful tool for compressing days-long tasks into minutes, if you can just look at them from the right angle. Mmmm, Taco Bell….

The Perils of Kinder Surprise – I’m so glad we’re being protected from the dangers of small chocolate eggs with plastic toys inside. I never really liked them growing up in the UK, but it depresses me that here we’re paying border guards to seize an average of 25,000 of them every year. When Kinder Surprises are outlawed, only outlaws will have Kinder Surprises.

The Myth and Truth of the NYC Engineer Shortage – Hiring ‘A players’ doesn’t mean hiring people with the exact skills you need, or even experienced engineers. Hire for smarts and enthusiasm, give your experienced folks time to help them, and within a few months you’ll have productive employees. Even better, they’ll be cheaper, and more loyal than that hot-shot you keep dreaming of. Hire for the right mentality, and everything else will follow.

Elusive Forger, Giving but Never Stealing – My favorite character when reading the Norse myths as a kid was always Loki the Trickster, so I find this story of a non-profit forger slipping his works into museums’ collections delightful.