Discover public data with the Data Source Handbook

I’m pleased to announce that the Data Source Handbook is now available from O’Reilly. It’s a compact ebook guide to the most useful APIs and bulk data sets I’ve found, packed with examples and advice. These are hand-picked services that I’ve actually spent time using during my own work, and I chose them because they add insights and information to data you’re already likely to be dealing with. You can check out the table of contents below, and I’ve also included a couple of excerpts.

It’s organized by the kind of data that you want to look up information on, from websites to locations, email addresses to ISBNs. There’s a whole new world of free or cheap public data out there, and I’ve been having a blast exploring it myself, so I hope you’ll enjoy it as much as I have. A big thanks to everyone who helped me compile this too, from my editors Mike Loukides and Teresa Elsey to all the helpful people on Quora, along with the many friends who emailed me ideas. Keep the suggestions coming; I’ll be working on an updated edition soon.

Websites:

  •   WHOIS
  •   Blekko
  •   bit.ly
  •   Compete
  •   Delicious
  •   BackType
  •   PagePeeker

People by email:

  •   WebFinger
  •   Flickr
  •   Gravatar
  •   Amazon
  •   AIM
  •   Friendfeed
  •   Google Social Graph
  •   MySpace
  •   Github
  •   Rapleaf
  •   Jigsaw

People by name:

  •   WhitePages
  •   LinkedIn
  •   GenderFromName

People by account:

  •   Klout
  •   Qwerly

Search terms:

  •   BOSS
  •   Blekko
  •   Bing
  •   Google Custom Search
  •   Wikipedia
  •   Google Suggest
  •   Wolfram Alpha

Locations:

  •   SimpleGeo
  •   Yahoo
  •   Google Geocoding API
  •   CityGrid
  •   Geo-Coder-US
  •   Geodict
  •   GeoNames
  •   US Census
  •   Zillow Neighborhoods
  •   Natural Earth
  •   US National Weather Service
  •   OpenStreetMap
  •   MaxMind

Companies:

  •   CrunchBase
  •   ZoomInfo
  •   Hoovers
  •   Yahoo Finance

IP addresses:

  •   MaxMind
  •   Infochimps

Books, films, music and products:

  •   Amazon
  •   Google Shopping
  •   Google Book Search
  •   Netflix
  •   Yahoo music
  •   Musicbrainz
  •   The Movie DB
  •   Freebase

WHOIS

The whois Unix command is still a workhorse, and I’ve found this web service a decent alternative too. You can get the basic registration information for any website. In recent years, some owners have chosen ‘private’ registration, which hides their details from view, but in many cases you’ll see a name, address, email and phone number for the person who registered the site. You can also enter numerical IP addresses here and get data on the organization or individual that owns that server.

Unfortunately, the terms of service of most providers forbid automated gathering and processing of this information, but you can craft links to the Domain Tools site to make it easy for your users to access the information.

<a href="http://whois.domaintools.com/www.google.com">Info for www.google.com</a>
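If you’re generating these links for a whole list of domains, a few lines of Python will do it. This is only a sketch: the whois_link name is my own invention for illustration, and the only thing it takes from the service is the URL pattern shown in the link above.

```python
from html import escape
from urllib.parse import quote


def whois_link(domain: str) -> str:
    """Return an HTML anchor for the Domain Tools WHOIS page of `domain`.

    whois_link is a hypothetical helper name; the URL pattern is the one
    used in the hand-written link above.
    """
    # Percent-encode anything that is not safe inside a URL path segment.
    url = "http://whois.domaintools.com/" + quote(domain, safe="")
    # Escape both the URL and the label so odd input can't break the markup.
    return '<a href="{}">Info for {}</a>'.format(escape(url, quote=True), escape(domain))


print(whois_link("www.google.com"))
```

Escaping matters here because domain names often arrive from user input or scraped data, and you don’t want a stray quote or angle bracket ending up in your page.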

There is a commercial API available through whoisxmlapi.com that offers a JSON interface and bulk downloads, which seems to contradict the terms mentioned in most WHOIS results. It costs $15 per thousand queries. Be careful, though: it requires you to send your password as a non-secure URL parameter, so don’t use a valuable one.

curl "http://www.whoisxmlapi.com/whoisserver/WhoisService?\
domainName=oreilly.com&outputFormat=json&userName=<username>&password=<password>"
{"WhoisRecord": {
"createdDate": "26-May-97",
"updatedDate": "26-May-10",
"expiresDate": "25-May-11",
"registrant": {
"city": "Sebastopol",
"state": "California",
"postalCode": "95472",
"country": "United States",
"rawText": "O'Reilly Media, Inc.\u000a1005 Gravenstein Highway North\u000aSebastopol, California 95472\u000aUnited States\u000a",
"unparsable": "O'Reilly Media, Inc.\u000a1005 Gravenstein Highway North"
},
"administrativeContact": {
"city": "Sebastopol",
...
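
Once you have a response like the one above, pulling out fields is ordinary JSON handling. Here’s a minimal Python sketch, assuming the response keeps the shape shown in the excerpt. The names build_query_url and registrant_summary are hypothetical helpers of mine, and SAMPLE is a trimmed copy of the excerpt with the braces closed by hand, not a live result.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
BASE_URL = "http://www.whoisxmlapi.com/whoisserver/WhoisService"


def build_query_url(domain, username, password):
    """Build the same query URL as the curl example (hypothetical helper)."""
    params = {
        "domainName": domain,
        "outputFormat": "json",
        "userName": username,
        "password": password,  # sent in the clear, so don't reuse a valuable one
    }
    return BASE_URL + "?" + urlencode(params)


def registrant_summary(response_text):
    """Pick a few fields out of a WhoisRecord response (hypothetical helper)."""
    record = json.loads(response_text).get("WhoisRecord", {})
    registrant = record.get("registrant", {})
    return {
        "created": record.get("createdDate"),
        "expires": record.get("expiresDate"),
        "city": registrant.get("city"),
        "country": registrant.get("country"),
    }


# Trimmed copy of the excerpt above, closed by hand so that it parses.
SAMPLE = """
{"WhoisRecord": {
  "createdDate": "26-May-97",
  "updatedDate": "26-May-10",
  "expiresDate": "25-May-11",
  "registrant": {
    "city": "Sebastopol",
    "state": "California",
    "postalCode": "95472",
    "country": "United States"
  }
}}
"""

print(registrant_summary(SAMPLE))
```

Using .get() throughout means a record with private registration, where some of these fields may be missing, gives you None values rather than an exception.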

Blekko

Blekko is the newest search engine in town, and one of its selling points is the richness of the data it offers. If you type in a domain name followed by /seo, you’ll receive a page of statistics on that URL.

[Screenshot: Blekko /seo statistics page (blekko0.png)]

They are also very keen on developers accessing their data, so they offer an easy-to-use API through the /json slash tag, which returns a JSON object instead of HTML.

http://blekko.com/?q=cure+for+headaches+/json+/ps=100&auth=<APIKEY>&ft=&p=1
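
If you’d rather assemble that URL in code than by hand, something like the following Python sketch should work. The function name blekko_json_url and its page_size and page arguments are my own labels for illustration; I’m assuming that /ps sets the number of results and p the page number, based only on the example URL above.

```python
from urllib.parse import urlencode


def blekko_json_url(terms, api_key, page_size=100, page=1):
    """Build a Blekko /json query URL shaped like the example above.

    Hypothetical helper: the parameter meanings are inferred from the
    example URL, not from Blekko's documentation.
    """
    # Slash tags ride along inside the q parameter, after the search terms.
    query = "{} /json /ps={}".format(terms, page_size)
    params = {"q": query, "auth": api_key, "ft": "", "p": page}
    # Keep '/' and '=' unescaped so the slash tags stay readable in the URL.
    return "http://blekko.com/?" + urlencode(params, safe="/=")


print(blekko_json_url("cure for headaches", "<APIKEY>", page_size=100, page=1))
```

Keeping the key out of the function body makes it easy to read it from an environment variable or config file rather than pasting it into your source.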

To obtain an API key, email apiauth@blekko.com. Their terms of service are available at https://blekko.com/ws/+/terms, and while the agreement is somewhat restrictive, Blekko says it is flexible in practice:

You should note that it prohibits practically all interesting uses of the blekko API. We are not currently issuing formal written authorization to do things prohibited in the agreement, but, if you are well behaved (e.g. not flooding us with queries), and we know your email address (from when you applied for an API auth key, see above), we will have the ability to attempt to contact you and discuss your usage patterns if needed.

Currently, the /seo results aren’t available through the JSON interface, so you have to scrape the HTML to obtain them. There’s a demonstration of that at https://github.com/petewarden/pagerankgraph.

 
