Do you want a map of your blog?


One of my goals with OpenHeatMap is building a better interface to access and explore the firehose of data we’re bombarded with. I’ve always wanted a way to navigate content by location and so I’m experimenting with mapping blogs and other media sites. Above is a visualization I built of a few hundred posts from the Sunsurfer Tumblr blog. You can see at a glance the locations that the site covers, and then by mousing over you’ll see a selection of photos from that place.

I’m looking at turning this into a self-serve interface where anyone can enter a URL and receive a visualization of that whole site, but I need testers. If you have a blog or site that features a lot of locations, whether it’s pictures from around the world, news stories or anything you’re keen to see visualized like this, drop an email to and I’ll get on the case!

Geodict – an open-source tool for extracting locations from text

Photo by Mukumbura

One of my big takeaways from the Strata pre-conference meetup was the lack of standard tools (beyond grep and awk) for data scientists. With OpenHeatMap I often need to pull location information from natural-language text, so I decided to pull together a releasable version of the code I use for this. Behold, geodict!

It's a GPL-ed Python library and app that takes in a stream of text and outputs information about any locations it finds. Here's the command-line tool in action:
./ < testinput.txt

That should produce something like this:
New Zealand
Barcelona, Spain
Wellington New Zealand

For more detailed information, including the lat/lon positions of each place it finds, you can specify JSON or CSV output instead of just the names, eg

./ -f csv < testinput.txt
New Zealand,country,-41.0,174.0
"Barcelona, Spain",city,41.3833,2.18333
Wellington New Zealand,city,-41.3,174.783
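
If you want to consume that CSV output from another script, a minimal sketch with Python's standard csv module might look like the following. The column meanings (name, type, latitude, longitude) are inferred from the sample rows above rather than from any documented schema, and parse_locations is a hypothetical helper name, not part of geodict.

```python
import csv
import io

# Sample rows copied from the CSV output shown above.
SAMPLE_OUTPUT = '''New Zealand,country,-41.0,174.0
"Barcelona, Spain",city,41.3833,2.18333
Wellington New Zealand,city,-41.3,174.783
'''

def parse_locations(csv_text):
    """Turn rows of name,type,lat,lon (assumed order) into dictionaries."""
    locations = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        name, kind, lat, lon = row
        locations.append(
            {"name": name, "type": kind, "lat": float(lat), "lon": float(lon)}
        )
    return locations

for place in parse_locations(SAMPLE_OUTPUT):
    print(place["name"], place["type"], place["lat"], place["lon"])
```

Using csv.reader rather than splitting on commas matters here because quoted names like "Barcelona, Spain" contain a comma themselves.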

For more of a real-world test, try feeding in the front page of the New York Times:
curl -L "; | ./
United States
Erlanger, Ky

The tool just treats its input as plain text, so in production you'd want to use something like Beautiful Soup to strip the tags out of the HTML, but even with messy input like that it works reasonably well. You will need to do a bit of setup before you run it, primarily running to load information on over 2 million locations into your MySQL server.
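
To give a feel for the tag-stripping step, here is a rough sketch that uses the standard library's html.parser in place of Beautiful Soup, so it runs without extra installs. TextExtractor and html_to_text are hypothetical names for illustration, and a real page would need more care (encodings, entities in odd places, hidden elements).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string, joined with spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = "<html><script>var x = 1;</script><p>Flights to <b>Barcelona, Spain</b></p></html>"
print(html_to_text(page))
```

The resulting plain text is what you would then pipe into the command-line tool instead of the raw HTML.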

There are some alternative technologies out there like Yahoo's Placemaker API or general semantic APIs like OpenCalais, Zemanta or Alchemy, but I've found nothing open-source. This is important to me on a very practical level because I can often get far better results if I tweak the algorithms to known characteristics of my input files. For example if I'm analyzing a blog which often mentions newspapers then I want to ignore anything that looks like "New York Times" or "Washington Post", since they're not meaningful locations. Placemaker will return a generous helping of locations based on those sorts of mentions, adding a lot of noise to my results, but with geodict I can filter them out with some simple code changes.
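
One simple way to get that kind of filtering, sketched here as a pre-processing step rather than as a change inside geodict itself, is to blank out known non-location phrases before the text reaches the parser. The IGNORED_PHRASES list and the mask_ignored_phrases function are hypothetical names for illustration.

```python
import re

# Hypothetical ignore list: phrases that contain place names but are not places.
IGNORED_PHRASES = ["New York Times", "Washington Post"]

def mask_ignored_phrases(text, phrases=IGNORED_PHRASES):
    """Replace each ignored phrase with a space so it can't match as a location."""
    for phrase in phrases:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        text = pattern.sub(" ", text)
    return text

sample = "The New York Times reported on flooding in Wellington, New Zealand."
print(mask_ignored_phrases(sample))
```

The masked text can then be fed to the location extractor as usual. The trade-off is that a blunt ignore list will also hide genuine mentions that happen to sit inside an ignored phrase, so it works best when tuned to a specific source.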

Happily MaxMind made a rich collection of location information freely available, so I was able to combine that with some data I'd gathered myself on countries and US states to make a simple-minded but effective geo-parser for English-language text. I'm looking forward to improving it with more data and recognized types of locations, but also to seeing what you can do with it, so let me know if you do get some use out of it too!

Hack the planet?

Photo by Alby Headrick

I've just finished reading Hack the Planet, and I highly recommend it to anyone with an opinion on climate change (which really should be everyone). Eli Kintisch has written a comprehensive guide to the debates around geo-engineering, but it's also a strong argument for reducing our CO2 emissions. Like me, he's an unabashed believer in the scientific method as the best process for answering questions of fact and he's not afraid to challenge his subjects' assertions. It's fundamentally a documentary approach, not a polemic, but all the more powerful for how he tells the stories of those involved in the debate.

The core of the book is the idea that we may need to perform engineering on a massive scale to mitigate the climate changes caused by increases in greenhouse gases. There's a wide variety of schemes, from seeding the upper atmosphere with sulphur to reduce incoming radiation and spraying salt water into clouds to increase the amount of light reflected, to scrubbing CO2 directly from the atmosphere or sequestering it underground as we're generating it. His conclusion is that the potentially cheap options, like injecting new material into the atmosphere, are risky and unproven, while the safer ones, like capturing CO2 as it's generated, are way too expensive to be practical.

The risks mostly come from our extremely poor understanding of how the planet actually works. We've had a few natural experiments with volcanic gases emitted by eruptions which demonstrate that cooling does occur, but also that rainfall and other patterns may be radically altered. Research may help quantify the risks, but the large-scale field trials needed face widespread public opposition. That also highlights a long-term political risk; if war or other disruption stops the engineering effort then the climate could change suddenly and catastrophically. On the other hand, it's also vital that we understand as much as possible about the techniques. If we end up with a 'long tail' event causing far more severe climate change than we expect, we may need to rapidly implement something as an emergency measure.

That all makes me hope that we can bring down the costs of sequestration technologies. The outlook there is uncertain: despite a lot of smart folks attacking the problem, the cost of capturing one ton of carbon is still far too high for the technology to be deployed commercially. I'm optimistic this will change, but it's sobering to see the history of high hopes and broken promises from the proponents.

After his detailed view of the realities of the different geo-engineering approaches, reducing emissions emerges as a far more attractive option. As an engineer I'm inherently drawn to technological solutions, and if I was an investor I'd be betting on some form of carbon capture working out in the end, but in software terms this book paints the planet as the most convoluted legacy system you could imagine. We'll never be quite sure what will happen when we monkey with its spaghetti code, so let's hope we don't have to.

How I ended up using S3 as my database

Photo by Longhorn Dave

I've spent a lot of the last two years wrestling with different database technologies from vanilla relational systems to exotic key/value stores, but for OpenHeatMap I'm storing all data and settings in S3. To most people that sounds insane, but I've actually been very happy with that decision. How did I get to this point?

Like most people, I started by using MySQL. This worked pretty well for small data sets, though I did have to waste more time than I'd like on housekeeping tasks. The server or process would crash, or I'd change machines, or I'd run out of space on the drive, or a file would be corrupted, and I'd have to mess around getting it running again.

As I started to accumulate larger data sets (eg millions of Twitter updates) MySQL started to require more and more work to keep running well. Indexing is great for medium-scale data sets, but once the index itself grew too large, lots of hard-to-debug performance problems popped up. By the time I was recompiling the source code and instrumenting it, I'd realized that its abstraction model was now more of a hindrance than a help. If you need to craft your SQL around the details of your database's storage and query optimization algorithms, then you might as well use a more direct low-level interface.

That led me to my dalliance with key-value stores, and my first love was Tokyo Cabinet/Tyrant. Its brutally minimal interface was delightfully easy to get predictable performance from. Unfortunately it was very high maintenance, and English-language support was very hard to find, so after a couple of projects using it I moved on. I still found the key/value interface the right level of abstraction for my work; its essential property was the guarantee that any operation would take a known amount of time, regardless of how large my data grew.

So I put Redis and MongoDB through their paces. My biggest issue was their poor handling of large data loads, and I submitted patches to implement Unix file sockets as a faster alternative to TCP/IP through localhost for that sort of upload. Mongo's support team are superb, and their responsiveness made Mongo the winner in my mind. Still, I realized I was wasting too much time on the same mundane maintenance chores that frustrated me back in the MySQL days, which led me to look into databases-as-a-service.

The most well-known of these is Google's AppEngine datastore, but they don't have any way of loading large data sets, and I wasn't going to be able to run all my code on the platform. Amazon's SimpleDB was extremely alluring on the surface, so I spent a lot of time digging into it. They didn't have a good way of loading large data sets either, so I set myself the goal of building my own tool on top of their API. I failed. Their manual sharding requirements, extremely complex programming interface and mysterious threading problems made an apparently straightforward job into a death-march.

While I was doing all this, I had a revelation. Amazon already offered a very simple and widely used key/value database: S3. I'm used to thinking of it as a file system, and anyone who's been around databases for a while knows that file systems make attractive small-scale stores that become problematic with large data sets. What I realized was that S3 was actually a massive key/value store dressed up to look like a file system, and so it didn't suffer from the 'too many files in a directory' sort of scaling problems. Here are the advantages it offers:

Widely used. I can't emphasize enough how important this is for me, especially after spending so much time on more obscure systems. There are all sorts of beneficial effects that flow from using a tool that lots of others also use, from copious online discussions to the reassurance that it won't be discontinued.

Reliable. We have very high expectations of up-time for file systems, and S3 has had to meet these. It's not perfect, but backups are easy as pie, and with so many people relying on it there's a lot of pressure to keep it online.

Simple interface. Everything works through basic HTTP calls, and even external client code (eg Javascript via AJAX) can access public parts of the database without even touching your server.

Zero maintenance. I've never had to reboot my S3 server or repair a corrupted table. Enough said.

Distributed and scalable. I can throw whatever I want at S3, and access it from anywhere else. The system hides all the details from me, so it's easy to have a whole army of servers and clients all hammering the store without it affecting performance.

Of course there's a whole shed-load of features missing, most obviously the fact that you can't run any kind of query. The thing is, I couldn't run arbitrary queries on massive datasets anyway, no matter what system I used. At least with S3 I can fire up Elastic MapReduce and feed my data through a Hadoop pipeline to pull out analytics.

So that's where I've ended up, storing all of the data generated by OpenHeatMap as JSON files within both private and public S3 buckets. I'll eventually need to pull in a more complex system like MongoDB as my concurrency and flexibility requirements grow, but it's amazing how far a pseudo-file system can get you.
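
To make the "key/value store dressed up as a file system" idea concrete, here is a small in-memory sketch. It is a hypothetical stand-in, not S3's API: FlatJsonStore, its methods, and the key names are all invented for illustration. The point it shows is that slashes in a key are just characters, so a "folder" is nothing more than a shared key prefix.

```python
import json

class FlatJsonStore:
    """In-memory stand-in for a bucket: a flat mapping of keys to JSON text."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        # Serialize on write, as if the value were being stored as a JSON file.
        self._objects[key] = json.dumps(value)

    def get(self, key):
        return json.loads(self._objects[key])

    def list_keys(self, prefix=""):
        # "Directory listing" is only a prefix match over the flat key space.
        return sorted(k for k in self._objects if k.startswith(prefix))

store = FlatJsonStore()
store.put("maps/abc123/settings.json", {"title": "Blog map", "zoom": 2})
store.put("maps/abc123/data.json", {"rows": [["New Zealand", 12]]})
store.put("users/example-user.json", {"maps": ["abc123"]})

print(store.list_keys("maps/abc123/"))
print(store.get("maps/abc123/settings.json")["title"])
```

Because every read and write addresses a single key, there is no directory structure to grow unwieldy; the cost of that simplicity is the lack of queries described above.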