A Mad Engineer at SciFoo

From Cowbirds in Love

I’m not quite sure how an engineer got invited to SciFoo; it was all very mysterious, but I’m planning on having fun! I am passionate about the possibilities for answering some important questions using the new sources of data we’re all creating on the web, so I’ll be evangelizing cheap web crawling, analysis using Hadoop and of course visualization. After some last-minute idiocy on my part over the hotel, I’m now all set for the flight to San Jose tomorrow, and my first visit to the Google campus.

Talking of visualization, I was really curious to see where the other attendees would be coming from, so I built a special-edition OpenHeatMap to display the locations of the attendees mentioned on Kaitlin Thaney’s Twitter list:


Here’s the full interactive version. You can also now do the same for any Twitter list in the main app by typing in list:user/list-name in the main Twitter Followers search box.

Discrimination and sticking your hand up

Photo by Cpt. Obvious

This article by Stubbornella on women in technology was a great summary of what I've seen in my programming career. I've seldom witnessed overt discrimination, and there are no girlie calendars in the break-room, but we're terrible at assessing and promoting coders according to how effective and efficient they are. Instead there's a tendency to reward the qualities that Nicole lists under 'Cowboy-coders', even when these people have a negative impact on our ability to ship product.

Why does this matter? It's insanely hard to build great software, it's really tough to find strong engineers, and a system that discourages a large proportion of entrants with the potential to grow into those engineers is broken.

The point that Nicole mentions but doesn't go into depth on is that this isn't just about women. Some of the most technically brilliant male coders I've worked with are still quietly working away with zero recognition from either their own companies or the outside world. I've always loved talking about my work, and that's driven me to overcome my innate shyness, learn to perform in public, and generally be pretty vocal with my opinions in discussions. This is vital in a profession infested with arrogant jerks, and sometimes I wonder if I've had to become one too. I try hard to "act like I'm right, but listen like I'm wrong", but that's a very fine line.

We tend to take this as a law of nature, just how people get ahead in the programming world, but it's a sign of both poor management and a very ineffective culture. By contrast, my partner Liz is a trained actuary. Her job involves a decade of taking exams and some very deep math, and almost everyone in her old department was a woman. You don't have to be a pushy jerk to get noticed as an actuary. There's actually a system for actively finding and rewarding talent, and as if by magic, there are a lot more women in the profession.

Our problem is that we're fixated on a romantic image of programming, one full of individual rock-stars producing astonishing products thanks to all-night coding sessions. The problem is this doesn't work, at least not consistently or reliably. The reality of building great software is that it's a long process that takes a lot of teamwork. Our immature reverence for aggressive self-promoters leads to poor outcomes, and even if you don't give a fig about discrimination, that's still a massive problem.

So how do we fix it? I'm actually still a bit ambivalent about Google's scholarship specifically for women, though I can see why it makes sense in a lot of ways. I'd much rather see companies and individuals in leadership positions actively seeking out strong engineers to speak and attend conferences, rather than waiting for people to come to their attention, since that rewards loud-mouth folks like me. We need to find more people who are quietly doing great work, without waiting for them to stick up their hand.

Visualizing the war deaths in Afghanistan

Open map

Niraj Chokshi took the WikiLeaks data from Afghanistan and filtered it to produce some maps. It's always tough to build visualizations on a deadline, and so there were some issues with the initial graphs, but they presented the data in a useful and interesting way. Niraj made the underlying data he'd used available as a spreadsheet, so with almost no changes I was able to upload it into OpenHeatMap to produce some different views.

It was pretty sobering to be handling data covering hundreds of people's deaths, and I'm honestly not sure what story the data is telling us. Just looking at the location and magnitude of the enemy dead in 2004 compared to 2009 shows how much the battlefield has changed though, from a handful of hotspots along the Pakistan border to a dense ring around the whole country.

Want a map of your Twitter followers?


I've always wanted to know where the people who follow me on Twitter are from, both out of curiosity and so I can connect with them as I travel around the country. To find out, I built a tool using OpenHeatMap to visualize your followers by location. It only shows followers who have been active recently, but I've had great fun discovering connections to Guangzhou, Prague and even the exotic, enigmatic country of Canada, little known to westerners.

There are actually three different views you can use to explore Twitter as a map. You can put in your own or somebody else's handle to see their active followers, you can visualize the updates from the people you follow, or you can search on a keyword and see where in the world people are talking about that topic.

It's still just a prototype, but it feels like a step towards the interface we need to make sense of the flood of location data that's flowing all around us. I look forward to hearing your ideas on improving it, and since the component is completely open-source, feel free to build your own to show me how it should really be done!

OpenHeatMap for journalists

I’ve long admired The Guardian’s innovative approach to opening up their data, so I was very excited to see their technology editor, Charles Arthur, using it on a recent story. I was happily surprised that he was able to set up his map without ever contacting me. I’ve always intended it to be self-serve but that can be very hard to achieve with a completely new product.

Since I’m really keen to see the stories other reporters can tell with OpenHeatMap, I’ve created a four-minute video guide aimed at journalists that walks you through exactly what you need to do to build your own maps. If you have some information about places, I’ve made it drop-dead simple to create a map that tells your story, so please check the guide out and pass it along to any other folks who might be interested.

Free bulk geocoding for US addresses

Photo by Chris Blakeley

My goal with OpenHeatMap is to have the computer handle all of the messing around that’s usually required to load data into a GIS system. I want to accept anything that describes a location, rather than forcing users to spend endless time massaging their input data.

This is fairly straightforward with country names, zip codes, and even US county names, but I’ve struggled to find a good solution for turning addresses into latitude, longitude positions. All of the free APIs out there are either very accurate but have crippling limits on how often you can call them, or are unlimited but with very low precision. The going rate for commercial geocoding is $10 per thousand addresses, which ruled that right out!

Happily, I’ve found a solution. Schuyler Erle and Jo Walsh created an open-source Perl module a few years ago called Geo-Coder-US. It uses the public-domain TIGER/Line data from the US Census to look up American addresses. In my tests of the online version it was remarkably accurate (much better than OpenStreetMap’s Nominatim, for example), though the authors warn that rural coverage is not as good. The only downside was that the actual database file to accompany the code was too large for the authors to host, so I had to spend some time digging around the Census FTP site to find the right source files, download all 9 GB of them and then run the database creation, which took several hours.

To save anyone else from having to go through the same struggle, I’ve uploaded a version of the project to GitHub that contains the compiled database file. Be warned, the database is almost a gigabyte in size, so it’s not a quick download! You may also need to install Geo-Coder-US-1.00.tar.gz via CPAN to grab all the dependencies. Once you have it, cd into the directory and try running:

eg/lookup.pl "2543 Graystone Pl, Simi Valley, CA 93065"

You should see the following output:

"2543 Graystone Pl, Simi Valley, CA 93065", 34.280874, -118.766207

You can either pass multiple addresses as command line arguments, or pipe a file to the script and it will treat each line as an address and output as CSV. The original authors also include a SOAP server script for Perl, so you could also run this as a web service. I’m going to be moving OpenHeatMap to using this, so look out for more accurate address locations, at least for American data.
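If you want to consume that CSV output from another program, here is a minimal Python sketch. It assumes every output line has the same shape as the single example above (a double-quoted address, then latitude, then longitude) and that the quotes are plain ASCII double quotes. The names GeocodedAddress and parse_lookup_output are hypothetical helpers of my own, not part of Geo-Coder-US.

```python
import csv
from typing import Iterable, List, NamedTuple


class GeocodedAddress(NamedTuple):
    """One parsed line of lookup output (hypothetical helper type)."""
    address: str
    latitude: float
    longitude: float


def parse_lookup_output(lines: Iterable[str]) -> List[GeocodedAddress]:
    """Parse lines shaped like: "address", latitude, longitude.

    Assumption: the format matches the one sample line shown in the post.
    Lines that do not fit that shape are skipped rather than guessed at.
    """
    results = []
    # skipinitialspace drops the space that follows each comma separator
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 3:
            continue
        address, lat_text, lon_text = row
        try:
            results.append(
                GeocodedAddress(address, float(lat_text), float(lon_text))
            )
        except ValueError:
            # Non-numeric coordinates: treat the line as unparseable
            continue
    return results


if __name__ == "__main__":
    sample = [
        '"2543 Graystone Pl, Simi Valley, CA 93065", 34.280874, -118.766207',
    ]
    for entry in parse_lookup_output(sample):
        print(entry.address, entry.latitude, entry.longitude)
```

Using the csv module rather than splitting on commas matters here, because the address itself contains commas inside the quotes.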

A big thanks to Schuyler and Jo for making this code available in the first place. Do keep them in mind for any location consulting work you might have.

Clapham is a hole, and other curiosities of the London Underground

I never lived in London, but my Gran was born within sound of the Bow Bells and I’ve spent enough time there to know how important the London Underground network is. The distance as the crow flies is far less important than how accessible the start and destination of any journey are by Tube. I’ve always wondered how the city would look if you could see how far everywhere was from a station, so I grabbed a list of the locations (compiled by OpenStreetMap volunteers) and uploaded them to OpenHeatMap. A few surprises leapt out at me:

Clapham is a hole


Clapham Junction is one of the busiest above-ground stations in the world, but it’s nearly a mile and a half to the nearest Underground line. The whole area is a big, gaping hole in the coverage of the network, and you have to wonder if some nameless Tube planner had it in for the place?
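For anyone who wants to check a distance like that themselves, here is a rough Python sketch of the calculation: a great-circle (haversine) distance from a point to the nearest entry in a list of stations. This is not how OpenHeatMap draws the map, just one straightforward way to get the number. The coordinates below are approximate values typed in by hand for illustration, not taken from the OpenStreetMap list, and the function names are my own.

```python
import math
from typing import Dict, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_station(point: Tuple[float, float],
                    stations: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Return (name, distance in miles) of the station closest to point."""
    lat, lon = point
    return min(
        ((name, haversine_miles(lat, lon, s_lat, s_lon))
         for name, (s_lat, s_lon) in stations.items()),
        key=lambda pair: pair[1],
    )


if __name__ == "__main__":
    # Approximate, hand-entered coordinates (illustrative assumptions only)
    clapham_junction = (51.4642, -0.1703)
    tube_stations = {
        "Clapham Common": (51.4618, -0.1384),
        "Clapham South": (51.4527, -0.1480),
    }
    name, miles = nearest_station(clapham_junction, tube_stations)
    print(f"Nearest station: {name}, about {miles:.2f} miles away")
```

This measures straight-line distance, so the walk along real streets will be somewhat longer.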

It’s Grim Down South


There are a lot more lines north of the river than in South London. I have no idea why, but I know I want a direct line to Chessington for when I’m visiting my Grandad (and no, he’s not in the World of Adventure!)



Though the map of the Underground has all the elegance of a fly on a windshield, I was intrigued by a few of the feelers shooting out across the landscape. Chesham is the furthest ‘underground’ station from central London, though it doesn’t appear very subterranean to me. On the far north-west of the map, in the wild, howling wasteland of Buckinghamshire, it has the fewest visitors of any station in the network, but does have the distinction of being the most popular starting point for the Tube Challenge. Thanks to Wikipedia, I’ve learned that this involves trying to beat the Guinness World Record for visiting all 270 Tube stations in the shortest possible time. Apparently this has been going on for decades (1979 to 2000 was the ‘Bob Robinson Era’) but recently advanced computing techniques have been used to find more and more optimal routes.

I really do miss Britain sometimes. No other nation comes close to our skill at finding wonderfully creative ways to waste time. Now I need to get back to my game of Mornington Crescent…