Five short links

[Photo by Andreas]

Mapumental – A really useful visualization that promises to help you find the ideal place to live, letting you interact with and question the data in an intuitive way. This is exactly why I built OpenHeatMap: an interactive data map opens up so many possibilities. Now I’m eagerly awaiting the live product.

Guide to finding a technical co-founder – Vinicius’s guide rings very true to me, and most of his points also apply to finding a business co-founder if you’re technical. The short story is, it’s hard. Really hard. I spent a lot of time and effort on the hunt, and still failed. It helps a lot if you’re flexible about what you want to work on; then you might get lucky at co-founder dating events or via other random connections. My biggest regret is that I didn’t set myself a deadline; I kept at it long after I should have given up on the search and either dropped my idea or focused on getting traction instead (which is what I eventually ended up doing).

7,000 Neighborhood shapefiles – I only just came across this resource from Zillow; it’s incredibly handy for anyone who wants to talk a geographic language that users understand. We all talk about neighborhoods in our daily lives, but this is the only open resource I’ve found that gives you definitions you can use.

Who is the greatest diva of the last 25 years? – I love this mostly because of the brutally honest description of their methodology (“We used computers!”). Truth be told, most of the visualizations you’ll see on the web don’t have much more behind them either, including some of mine on a bad day. Always try to look beyond the veneer of authority that charts and infographics give to any argument.

Twitter contour maps – Talking of substance before style, here’s a spartan visualization with some great data and ideas behind it. I’m looking forward to the next version of Twitterplaces from Ben; he’s digging deep into some interesting territory.

OpenHeatMap now visualizes Canada and Mexico

[Screenshot of the Canada map]
http://www.openheatmap.com/view.html?map=PentaptychWardererIdioticalness

After the UK, the most-requested countries for OpenHeatMap have been Canada and Mexico, so I'm pleased to launch state/province level coverage for both of them. All you need to do is create either a Canadian province or Mexican state column in your spreadsheet and I'll take care of the rest!
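
In case it helps to see the shape of the data, here's a rough sketch in Python of the kind of spreadsheet layout I mean. The column names below are just my own placeholders rather than an official format reference, so treat it as illustrative:

# Write a minimal provinces spreadsheet: one column naming the Canadian province
# (a Mexican state column works the same way), one column with the value to map.
# Column names here are my assumption, not a definitive OpenHeatMap spec.
import csv

rows = [
    {"canadian_province": "Ontario", "value": 42},
    {"canadian_province": "Quebec", "value": 17},
    {"canadian_province": "British Columbia", "value": 8},
]

with open("provinces.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["canadian_province", "value"])
    writer.writeheader()
    writer.writerows(rows)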

Unfortunately I haven't been able to find more detailed boundary information for either country. I'd love to find electoral boundaries for both, and I'll keep looking. I also have leads on Turkey, Australia, New Zealand, Germany, Austria, Switzerland and Ireland, so I hope I can cover even more of the world.

If you don't want a thematic map, and are happy with blobby heatmaps instead of filled polygons, you can already use OpenHeatMap with international street addresses; just put them in an address column.

[Screenshot of the Mexico map]
http://www.openheatmap.com/view.html?map=TheodramaProctologicApogeny

OpenHeatMap now visualizes the UK

I've had a lot of requests from friends in the old country wanting to use OpenHeatMap to display data about areas in the United Kingdom. After quite some time wrestling with the Ordnance Survey's recent release of open data I'm pleased to announce that you can now visualize everything from counties to constituencies to council districts! To celebrate, I've created a visualization of the recent election, showing the percentage of the vote that each party received in every constituency:

[Screenshot of the UK election map]

To go along with it, I've put together a quick 'making of' video tutorial, showing how it only takes a couple of minutes to transform the Guardian's spreadsheet of election data into an interactive map:
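
If you'd rather skim code than watch the video, the transformation boils down to something like the Python sketch below. The Guardian column names ("Constituency", "Party", "% share") are my assumptions rather than the sheet's exact headers, and the real walkthrough is in the video:

# Pull one party's vote share out of a Guardian-style results sheet and write the
# two-column constituency/value layout that OpenHeatMap expects. Column names and
# file names are illustrative assumptions, not the actual spreadsheet's headers.
import csv

party = "Labour"  # whichever party's share of the vote we want to map

with open("guardian_results.csv") as infile, open("vote_share.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(["constituency", "value"])
    for row in reader:
        if row["Party"] == party:
            writer.writerow([row["Constituency"], row["% share"]])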

I just want to say a big thanks to everyone who helped me gather this data, especially Matthew Somerville of MySociety who went above and beyond to help me understand counties, Simon Rogers for making the election data freely available, along with Cameron Neylon, Colin McCrory, Chris McCray, Leigh Dodds, Bill Roberts, John Goodwin, Richard Stirling and Laura Oliver.

Do you wish you could visualize data about regions of your country? Drop me an email and I'll see what I can do (and maybe rope you in to investigate if your government makes the data openly available!)

Build a map, win an iPad!

[Photo by Foxy Coxy]

Did I mention how much I like the folks over at This Week in Relevance? They've just launched a competition where the best OpenHeatMap wins an iPad! I love that idea so much, I'm going to give a free t-shirt to everyone who enters too.

If you're looking for inspiration, just today I've helped people build maps of UEFA cup rankings, amphibian habitats, oil well starts, college alumni, and even someone's friends on Facebook, and that was without the lure of a lovely Apple gadget at the end.

As always, I'll be online as much as I can and available to chat via the awesome SnapABug (I had four simultaneous conversations going at one point this afternoon), or you can email me via pete@mailana.com or IM me at petewarden on Skype.

Where are LA’s most effective schools?

After seeing the LA Times analysis of how effective local teachers are at improving their students' performance, I was left wondering how the different neighborhoods performed. School rankings based on pure academic achievement aren't that informative, since they correlate very closely with the wealth or poverty of their students' families, but the 'value-added' approach seemed like it might produce more informative patterns. Here's the map I built, using the Times' ranking of schools from one to five, where dark blue marks the most effective elementary schools:

http://www.openheatmap.com/view.html?map=EricolinCounterweightsPuffinus

As always, here's the code I used to gather and format the data to load into OpenHeatMap:

http://github.com/petewarden/schoolcrawl
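
The real gathering code lives in that repository; as a taste of the formatting half, here's a minimal Python sketch of turning a scraped list of schools into something OpenHeatMap can plot via an address column. The input column names are my own assumptions rather than the Times' actual field names:

# Reformat a scraped CSV of schools (name, street address, 1-5 effectiveness rating)
# into the address/value columns OpenHeatMap can geocode and color. The input file
# and its columns are assumptions for illustration, not the schoolcrawl output format.
import csv

with open("la_times_schools.csv") as infile, open("schools_for_openheatmap.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(["address", "value"])
    for row in reader:
        # Assumes the scraped address already includes city and state.
        writer.writerow([row["address"], row["rating"]])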

There's an obvious cluster of under-achieving schools between Culver City and downtown LA, but I was pleased to see some pretty poor areas like Compton showing some effective schools. Northridge seems to have a lot of great schools too, along with Culver City and Santa Monica, though there are a few surprises with 'one' rated schools tucked in amongst the rest.

Remember you’re a Womble

[Image: The Wombles]

I'm excited to be doing a presentation at Defrag again this year; Eric Norlin gathers together an amazing bunch of people. As I was exchanging emails with him about the conference, I found the theme to The Wombles, the kids' show from the '70s, going through my head. I've always struggled to find the right label for what I do: Implicit Data was the original inspiration for Defrag, and these days Big Data is in vogue, but none of the labels are very descriptive. I realized that Recycled Data might be a better theme, which makes me a Womble:


Underground, overground, Wombling free
The Wombles of Wimbledon Common are we
Making good use of the things that we find
Things that the everyday folk leave behind

What's really changed in the last few years is that the technology for grabbing large amounts of data and analyzing it is now incredibly cheap. Just as mining companies are using new technology to extract metal from decades-old piles of waste material, so researchers are starting to pull useful information from data that the big players see as valueless.

I think the root cause of my troubles with Facebook was that they didn't realize what a rich source of information the public profiles they exposed to search engines were. Individually they only displayed a name, a handful of friends and some pages each user liked, which seemed worthless. What they didn't understand was that if you have enough of them, important and interesting patterns start to emerge. Even junk data becomes valuable at scale. Who'd have thought that analyzing which pages link to each other could become a gushing fountain of money for Google, once they had enough pages crawled?

I feel like a kid in a candy store; there are so many great sources of public data to choose from that I hardly know what to visualize first, and I'm surprised there aren't more people taking advantage of this bounty. From Crunchbase to Google Profiles, Twitter, and US Census data: make good use of the things you can find, things that the everyday folk leave behind, and remember you're a Womble:

The March of Twitter – Technical notes

This is a quick run-down of the technical side of my guest post chronicling the March of Twitter on Hubspot's blog. Do go check out that article; I was able to have a lot of fun using Dharmesh's data.

Putting together that analysis of the early days of Twitter involved a lot of detective work and 'filling in gaps', since I don't have access to their internal traffic data, so I want to cover exactly what I did to produce it.

The map of the spread of Twitter over several years was based on a dump of 4.5 million accounts from the Twitter Grader project. Dharmesh had already done some normalization on the location fields, so I first filtered out everybody with a non-US address. That left me with 1.5 million profiles to work with. I believe that Grader's collection methods make those a fairly random sample of the full universe of accounts, so I could use the frequency of users in different locations over time to build a visualization that accurately showed relative geographic presence, even if I can't give accurate absolute numbers. This incomplete sampling does mean that I may be missing the actual earliest user for some locations, though.

I accomplished this using ad-hoc Python code to process large CSV files. I've published these as random snippets at http://github.com/petewarden/openheatmap/blob/master/mapfileprocess/scratchpad.py
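
Those snippets are pretty ad-hoc, so here's a cleaned-up sketch of the filtering and bucketing idea. The column names and date format are assumptions on my part, not necessarily what the Grader dump actually uses:

# Drop non-US profiles, then count signups per month and state so the relative
# frequencies can drive the map. Column names ("state", "created_at") are assumed.
import csv
from collections import Counter

US_STATES = {"CA", "NY", "TX", "CO", "MA"}  # abbreviated; the real set has all 50 plus DC

signups = Counter()

with open("grader_dump.csv") as f:
    for row in csv.DictReader(f):
        state = row["state"].strip().upper()
        if state not in US_STATES:
            continue  # filtering like this left ~1.5 million of the 4.5 million profiles
        month = row["created_at"][:7]  # e.g. "2007-03", assuming ISO-style dates
        signups[(month, state)] += 1

for (month, state), count in sorted(signups.items()):
    print(month, state, count)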

The second analysis looked at the adoption levels over the first few months. This was a lot trickier, since that sort of absolute figure wasn't obviously available. Happily I discovered that Twitter gave out id numbers in a sequential way in the early days, so that @biz is id number 13, @noah is 14, etc. I needed to ensure this was actually true for the whole time period I was studying, since I was planning on searching through all the possible first few thousand ids, and if some users had arbitrarily large numbers instead I would miss them. To verify this relationship held, I looked at a selection of the earliest users in the Grader data set and confirmed that all of them had low id numbers, and that the id numbers were assigned in the order they joined. This meant I could rely on this approach, at least until December 2006. There were frequent gaps where ids were either non-assigned or pointed to closed accounts, but this didn't invalidate my sampling strategy. Another potential issue, which also affects the Twitter Grader data set, is that I'm sampling users' current locations, not the locations they had when they joined, but my hope is that most people won't have changed cities in the last four years, so the overall patterns won't be too distorted. There are also a decent number of people with no location set, but I'm hoping that doesn't impose a systematic bias either.
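
For the curious, that sanity check amounted to something like the sketch below, again with assumed column names and date formats rather than the exact ones in the dump:

# Take the pre-2007 accounts we already know about from the Grader dump and check
# that sorting them by join date leaves their ids in ascending order.
import csv
from datetime import datetime

early_users = []
with open("grader_dump.csv") as f:
    for row in csv.DictReader(f):
        joined = datetime.strptime(row["created_at"][:10], "%Y-%m-%d")  # assumed date format
        if joined < datetime(2007, 1, 1):
            early_users.append((joined, int(row["id"])))

early_users.sort()  # order by join date
ids_in_join_order = [user_id for _, user_id in early_users]
assert ids_in_join_order == sorted(ids_in_join_order), "ids were not assigned sequentially"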

For the first few thousand users I went through every possible id number and pulled the user information for that account into a local file, which I then parsed into a CSV file for further processing. Once the number of new users grew larger in August, I switched to sampling only every tenth id, making each found account represent ten users joining in the data. One hiccup was a change in late November where Twitter appeared to switch to incrementing ids by ten instead of one, so only ids ending in the digit 3 were valid, which I compensated for with a new script. Shortly after that, in December, I detected another change in the assignment algorithm that was causing a slew of 'no such account' messages during my lookups, so I decided to stop my gathering at that point.
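
Stripped of the special cases, the sampling loop looked something like the sketch below. It leans on the old version 1 users/show endpoint, which has long since been retired, so treat the URL and the helper names as illustrative rather than working code:

# Walk through candidate ids, pulling whatever account information exists for each.
# With step=10 and weight=10 each found account stands in for ten signups.
import json
import time
import urllib.error
import urllib.request

def fetch_user(user_id):
    url = f"http://api.twitter.com/1/users/show.json?user_id={user_id}"  # retired v1 endpoint
    try:
        with urllib.request.urlopen(url) as response:
            return json.load(response)
    except urllib.error.HTTPError:
        return None  # unassigned id or closed account

def sample_ids(start_id, end_id, step=1, weight=1):
    for user_id in range(start_id, end_id, step):
        user = fetch_user(user_id)
        if user is not None:
            yield {"id": user_id,
                   "screen_name": user.get("screen_name"),
                   "location": user.get("location"),
                   "created_at": user.get("created_at"),
                   "weight": weight}
        time.sleep(1)  # stay well clear of rate limits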

The code for all this processing is also included in http://github.com/petewarden/openheatmap/blob/master/mapfileprocess/scratchpad.py, though it's all ad-hoc. The data for the first few thousand users is available as a Google Spreadsheet:

https://spreadsheets.google.com/ccc?key=0ArtFa8eBGIw-dHZjOUl3eXRzX19PLUFVQUNTU3FndFE&hl=en

You can also download the derived daily and monthly figures here:

https://spreadsheets.google.com/ccc?key=0ArtFa8eBGIw-dG5FU0hJZHI3RkVVMUgtaDhyczZxM1E&hl=en

https://spreadsheets.google.com/ccc?key=0ArtFa8eBGIw-dFlpS0QxSUw5blEtVjdyd2FaT2FySmc&hl=en

I attacked this problem because I really wanted to learn from Twitter's experiences, and it didn't seem likely that the company themselves would collect and release this sort of information. Of course I'd be overjoyed to see corrections to this provisional history of the service based on internal data, if any friendly Twitter folks would care to contribute. Any other corrections or improvements to my methodology are welcome too.

Five short links

[Photo by EmilyDickinsonRidesABMX]

If it quacks like an RDBMS – This article made me feel old, but in a good way. It lists all the design constraints that Mongo has been able to avoid by focusing on modern machines. Does this mean I can finally stop targeting systems with 32-bit address spaces?

Why should engineers and scientists be worried about color? – I’m always slightly bemused that I’ve spent my entire career in the graphics world despite being color blind, but maybe it’s made me more attentive to the sort of issues raised in this article. It’s a good illustration that infographics can be just as misleading as a written description, despite their air of objectivity, and so you need to be as careful in your visual choices as you are in the words you pick. Via Daniel Stadulis

Data Mining the Heart – Good coverage of the recent wave of academic studies that use social sites as natural experiments. This is only the beginning for this sort of research; we’re all instrumenting our lives in a thousand ways each day, every time we interact with an online service.

Challenges in front of us – I feel a strong affinity for Alex and Tim as they flesh out their service. They’re unfunded, despite having paying customers, but they’re fighting like demons to build the business. I know from my own experiences that the hardest battle there is psychological, keeping yourself motivated when you seem to be shouting into an empty void.

Needle in the Haystack – The story of a bio-entrepreneur’s epic battle to save his daughter’s life by analyzing a mountain of genetic data. His persistence is inspiring, and I can’t think of a more important application of the newly-cheap tools for processing big data.

The top 10 zip codes for startups

Brad asked an interesting question on his blog today – Boulder seems packed with entrepreneurs, but what's the real density of that sort of folk relative to the general population? His guess is that almost everyone in Boulder is either working at an entrepreneurial company or going to college.

The data to answer that is floating around on the web, so I thought it would be a great demo of the value of grabbing data in bulk (as opposed to siphoning data through a preset API), and of visualizing the results. Crunchbase has a liberal robots.txt and data license, so I wrote up a crawler that pulled down the information on all 45,000 companies in their database. The US Census releases population data for zip codes, so it was then a simple matter of programming to derive per-person stats for different areas. I didn't trust the employee counts in Crunchbase (they're not the first thing someone would update), so instead I chose a couple of related indicators: the total number of companies in a location, and how much venture money they'd raised between them. Here are the top 10 zip codes for each category (there's a rough sketch of the calculation below the tables):

Amount raised per-person

CA 94104 – $629m total – $1,681,925 per person
CA 94304 – $2,822m total – $1,656,031 per person
CA 94105 – $972m total – $472,540 per person
MA 02142 – $1,013m total – $448,833 per person
IL 60606 – $739m total – $439,744 per person
CA 92121 – $1,826m total – $429,847 per person
CA 95113 – $202m total – $373,077 per person
MA 02210 – $135m total  – $229,442 per person
WA 98033 – $5,662m total – $186,292 per person
NY 10004 – $168m total – $137,404 per person

Companies per-person

CA 94104 – 87 companies – 0.233 per person
CA 94105 – 173 companies – 0.084 per person
CA 95113 – 24 companies – 0.044 per person
MA 02142 – 73 companies – 0.032 per person
MA 02210 – 19 companies – 0.032 per person
CA 94111 – 103 companies – 0.031 per person
CA 92121 – 116 companies – 0.027 per person
NY 10004 – 29 companies – 0.024 per person
IL 60606 – 39 companies – 0.023 per person
NY 10005 – 20 companies – 0.023 per person

This is a crude approach to take, since the Crunchbase data may not be a representative sample, etc, but it gives a good first approximation. I've open-sourced all the code and data, so if you have ideas on improving this, jump in.
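
For anyone who'd rather skim code than dig through the repository, here's a sketch of the per-person calculation. The input file names and columns are my shorthand for this post, not the exact formats of the Crunchbase crawl or the Census release:

# Join company counts and total funding per zip code against census population,
# then rank zips by dollars raised per person. Field names are assumptions.
import csv
from collections import defaultdict

population = {}
with open("census_zip_population.csv") as f:
    for row in csv.DictReader(f):
        population[row["zip"]] = int(row["population"])

companies = defaultdict(int)
raised = defaultdict(float)
with open("crunchbase_companies.csv") as f:
    for row in csv.DictReader(f):
        companies[row["zip"]] += 1
        raised[row["zip"]] += float(row["total_raised"] or 0)

stats = []
for zip_code, count in companies.items():
    people = population.get(zip_code)
    if not people:
        continue  # skip zips with no census population to divide by
    stats.append((zip_code, count / people, raised[zip_code] / people))

for zip_code, companies_pp, raised_pp in sorted(stats, key=lambda s: -s[2])[:10]:
    print(zip_code, f"{companies_pp:.3f} companies and ${raised_pp:,.0f} raised per person")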

Next of course I wanted to visualize this data. Thanks to the sheer mindblowing awesomeness(*) of my OpenHeatMap project, all I had to do was upload my spreadsheets to get these maps of the data:

Companies per-person 

Funds raised per-person

And here's a couple of detailed views of the funds raised in Colorado and the Bay Area:

* Mileage may vary. Standard terms and conditions apply

Using KissMetrics to improve your website

[Screenshot of the KISSMetrics funnel report]

I'm a big believer in the power of objective measurements as the best way to drive product improvements, and in the past I've built my own ramshackle logging systems to gather the data I needed. Unfortunately it always took a frustratingly long time to create the systems, and I never had enough resources to build a visualization and analysis interface that easily told me what I wanted to know. For OpenHeatMap I decided to be as aggressive as I could in finding off-the-shelf solutions for everything outside of the core of the service, so I gave KISSMetrics a try.

Much as I would enjoy a Gene Simmons-themed stats service, it's actually named after "Keep It Simple, Stupid", and they deliver on that (in a good way). Installing the code is straightforward: just a nugget of JavaScript on every page of your site. With that set up, you can define a series of pages as a 'funnel', the path you expect your users to take through the site towards your eventual goal. This was also very painless to set up, though in OpenHeatMap's case it's more of a tree with lots of alternate routes. The reporting handles this fairly well, letting you see visitors who entered after the nominal start of your funnel. You can see the sort of graph you get at the top of this post.

That's really the heart of the service for me. My goal is to get as many visitors as possible to create maps, so I religiously follow the ratio of people viewing the front page to those making it to the end of the map-building process. I started off with only around 2% making it all the way through, but now on a good day I'll see 9% building their own visualizations. Having that number to check my changes against has been essential. I've been able to tell very quickly if my changes are actually making a difference, and psychologically it's been a great motivator to work on improvements and make that dial move!

There's a lot of depth to KISSMetrics, including support for A/B testing and an API for custom events, but I have so much on my plate improving obvious problems with my service that I haven't dived in. There is a cost to all this goodness of course: $125 a month. That's a very steep price for a small-scale site like mine, but it's vital enough to my development that it's worth it. It's a good motivation to get my service to the point where I can roll out premium features too!