Remember you’re a Womble


I'm excited to be doing a presentation at Defrag again this year; Eric Norlin gathers together an amazing bunch of people. As I was exchanging emails with him about the conference, I found the theme to The Wombles, a kids' show from the 70's, going through my head. I've always struggled to find the right label for what I do. Implicit Data was the original inspiration for Defrag, and these days Big Data is en vogue, but neither is very descriptive. I realized that Recycled Data might be a better theme, which makes me a Womble:


Underground, overground, Wombling free
The Wombles of Wimbledon Common are we
Making good use of the things that we find
Things that the everyday folk leave behind

What's really changed in the last few years is that the technology for grabbing large amounts of data and analyzing it is now incredibly cheap. Just as mining companies are using new technology to extract metal from decades-old piles of waste material, so researchers are starting to pull useful information from data that the big players see as valueless.

I think the root cause of my troubles with Facebook was that they didn't realize what a rich source of information the public profiles they exposed to search engines were. Individually they only displayed a name, a handful of friends and some pages each user liked, which seemed worthless. What they didn't understand was that if you have enough of them, important and interesting patterns start to emerge. Even junk data becomes valuable at scale. Who'd have thought that analyzing which pages link to each other could become a gushing fountain of money for Google, once they had enough pages crawled?

I feel like a kid in a candy store; there are so many great sources of public data to choose from that I hardly know what to visualize first, and I'm surprised there aren't more people taking advantage of this bounty. From Crunchbase to Google Profiles, Twitter, and US Census data, make good use of the things you can find, things that the everyday folk leave behind, and remember you're a Womble:

The March of Twitter – Technical notes

This is a quick run-down of the technical side of my guest post chronicling the March of Twitter on Hubspot's blog. Do go check out that article; I was able to have a lot of fun using Dharmesh's data.

Putting together that analysis of the early days of Twitter involved a lot of detective work and 'filling in gaps', since I don't have access to their internal traffic data, so I want to cover exactly what I did to produce it.

The map of the spread of Twitter over several years was based on a dump of 4.5 million accounts from the Twitter Grader project. Dharmesh had already done some normalization on the location fields, so I first filtered out everybody with a non-US address. That left me with 1.5 million profiles to work with. I believe that Grader's collection methods make those a fairly random sample of the full universe of users, so I could use the frequency of users in different locations over time to build a visualization that accurately showed the relative geographic presence, even if I can't give accurate absolute numbers. This incomplete sampling does mean that I may be missing the actual earliest user for some locations, though.

I accomplished this using ad-hoc Python code to process large CSV files. I've published these as random snippets at http://github.com/petewarden/openheatmap/blob/master/mapfileprocess/scratchpad.py
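The scratchpad is ad-hoc, but the core filtering and counting step can be sketched roughly like this. The column names and the normalized 'City, ST' location format are my assumptions about the dump, not the actual headers:

```python
import csv
from collections import Counter

def us_location_counts(rows, location_col="location", date_col="join_date"):
    """Tally accounts per (location, month), keeping only US-style addresses.

    `rows` is any iterable of dicts, e.g. csv.DictReader over the dump.
    Assumes locations were normalized to 'City, ST' strings and join
    dates are ISO formatted ('2007-03-21'); both column names are
    guesses at the real headers.
    """
    counts = Counter()
    for row in rows:
        location = row.get(location_col, "").strip()
        # Crude US filter: 'City, ST' with a two-letter state code
        parts = [part.strip() for part in location.split(",")]
        if len(parts) != 2 or len(parts[1]) != 2 or not parts[1].isupper():
            continue
        month = row.get(date_col, "")[:7]  # bucket by 'YYYY-MM'
        counts[(location, month)] += 1
    return counts

# Typical use: us_location_counts(csv.DictReader(open("grader_dump.csv")))
```

The (location, month) counts are then enough to drive a time-stepped map of relative presence, even without absolute traffic figures.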

The second analysis looked at the adoption levels over the first few months. This was a lot trickier, since that sort of absolute figure wasn't obviously available. Happily I discovered that Twitter gave out id numbers in a sequential way in the early days, so that @biz is id number 13, @noah is 14, and so on. I needed to ensure this was actually true for the whole time period I was studying, since I was planning on searching through all the possible first few thousand ids, and if some users had arbitrarily large numbers instead I would miss them. To verify this relationship held, I looked at a selection of the earliest users in the Grader data set and verified that all of them had low id numbers, and that the id numbers were assigned in the order they joined. That convinced me I could rely on this approach, at least until December 2006. There were frequent gaps where ids were either unassigned or pointed to closed accounts, but this didn't invalidate my sampling strategy. Another potential issue, which also affects the Twitter Grader data set, is that I'm sampling users' current locations, not the locations they had when they joined, but my hope is that most people won't have changed cities in the last four years, so the overall patterns won't be too distorted. There's also a decent number of people with no location set, but I'm hoping that doesn't impose a systematic bias either.
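The sanity check on id ordering boils down to confirming that sorting a sample of early accounts by join date also sorts them by id; a minimal sketch, with an invented tuple format:

```python
def ids_are_sequential(accounts):
    """Check that numeric ids were handed out in join order.

    `accounts` is a list of (user_id, joined_date) tuples for a sample
    of early users; returns True if sorting by join date also sorts by
    id, which is the property the id-scanning strategy relies on.
    """
    by_date = sorted(accounts, key=lambda account: account[1])
    ids = [account[0] for account in by_date]
    return all(earlier < later for earlier, later in zip(ids, ids[1:]))
```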

For the first few thousand users I went through every possible id number and pulled the user information for that account into a local file, which I then parsed into a CSV file for further processing. Once the number of new users grew larger in August, I switched to sampling only every tenth id, counting each found account as ten users joining. One hiccup was a change in late November, where Twitter appeared to switch to incrementing ids by ten instead of one, so that only ids ending in the digit 3 were valid; I compensated for this with a new script. Shortly after that, in December, I detected another change in the assignment algorithm that was causing a slew of 'no such account' messages during my lookups, so I decided to stop my gathering at that point.
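The three sampling regimes (exhaustive scan, every tenth id, and only ids ending in 3) can be captured in one hypothetical helper; the weighting of each found account by the step size happens downstream when totals are computed:

```python
def candidate_ids(start, stop, step=1, last_digit=None):
    """Yield the account ids to look up for a range of the id space.

    step=1 reproduces the exhaustive early scan, step=10 the one-in-ten
    sampling used from August on, and last_digit=3 the late-November
    regime where only ids ending in 3 pointed at real accounts.
    """
    for user_id in range(start, stop, step):
        if last_digit is not None and user_id % 10 != last_digit:
            continue
        yield user_id
```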

The code for all this processing is also included in http://github.com/petewarden/openheatmap/blob/master/mapfileprocess/scratchpad.py, though it's all ad-hoc. The data for the first few thousand users is available as a Google Spreadsheet:

https://spreadsheets.google.com/ccc?key=0ArtFa8eBGIw-dHZjOUl3eXRzX19PLUFVQUNTU3FndFE&hl=en

You can also download the derived daily and monthly figures here:

https://spreadsheets.google.com/ccc?key=0ArtFa8eBGIw-dG5FU0hJZHI3RkVVMUgtaDhyczZxM1E&hl=en

https://spreadsheets.google.com/ccc?key=0ArtFa8eBGIw-dFlpS0QxSUw5blEtVjdyd2FaT2FySmc&hl=en

I attacked this problem because I really wanted to learn from Twitter's experiences, and it didn't seem likely that the company themselves would collect and release this sort of information. Of course, I'd be overjoyed to see corrections to this provisional history of the service based on internal data, if any friendly Twitter folks would care to contribute. Any other corrections or improvements to my methodology are also welcome from my readers.

Five short links

Photo by EmilyDickinsonRidesABMX

If it quacks like an RDBMS – This article made me feel old, but in a good way. It lists all the design constraints that Mongo has been able to avoid by focusing on modern machines. Does this mean I can finally stop targeting 32-bit address space systems?

Why should engineers and scientists be worried about color? – I’m always slightly bemused that I’ve spent my entire career in the graphics world despite being color blind, but maybe it’s made me more attentive to the sort of issues raised in this article. It’s a good illustration that infographics can be just as misleading as a written description, despite their air of objectivity, and so you need to be as careful in your visual choices as you are in the words you pick. Via Daniel Stadulis

Data Mining the Heart – Good coverage of the recent wave of academic studies that use social sites as natural experiments. This is only the beginning for this sort of research; we're all instrumenting our lives in a thousand ways each day, every time we interact with an online service.

Challenges in front of us – I feel a strong affinity for Alex and Tim as they flesh out their service. They’re unfunded, despite having paying customers, but they’re fighting like demons to build the business. I know from my own experiences that the hardest battle there is psychological, keeping yourself motivated when you seem to be shouting into an empty void.

Needle in the Haystack – The story of a bio-entrepreneur's epic battle to save his daughter's life by analyzing a mountain of genetic data. His persistence is inspiring, and I can't think of a more important application of the newly-cheap tools for processing big data.

The top 10 zip codes for startups

Brad asked an interesting question on his blog today – Boulder seems packed with entrepreneurs, but what's the real density of that sort of folk relative to the general population? His guess is that almost everyone in Boulder is either working at an entrepreneurial company or going to college.

The data to answer that is floating around on the web, so I thought it would be a great demo of the value of grabbing data in bulk (as opposed to siphoning data through a preset API), and of visualizing the results. Crunchbase has a liberal robots.txt and data license, so I wrote a crawler that pulled down the information on all 45,000 companies in their database. The US census releases population data for zip codes, so then it was just a simple matter of programming to derive some per-person stats for different areas. I didn't trust the employee counts in Crunchbase (they're not the first thing someone would update), so instead I chose a couple of related indicators – the total number of companies in a location, and how much venture money they'd raised between them. Here are the top 10 zip codes for each category:

Amount raised per-person

CA 94104 – $629m total – $1,681,925 per person
CA 94304 – $2,822m total – $1,656,031 per person
CA 94105 – $972m total – $472,540 per person
MA 02142 – $1,013m total – $448,833 per person
IL 60606 – $739m total – $439,744 per person
CA 92121 – $1,826m total – $429,847 per person
CA 95113 – $202m total – $373,077 per person
MA 02210 – $135m total – $229,442 per person
WA 98033 – $5,662m total – $186,292 per person
NY 10004 – $168m total – $137,404 per person

Companies per-person

CA 94104 – 87 companies – 0.233 per person
CA 94105 – 173 companies – 0.084 per person
CA 95113 – 24 companies – 0.044 per person
MA 02142 – 73 companies – 0.032 per person
MA 02210 – 19 companies – 0.032 per person
CA 94111 – 103 companies – 0.031 per person
CA 92121 – 116 companies – 0.027 per person
NY 10004 – 29 companies – 0.024 per person
IL 60606 – 39 companies – 0.023 per person
NY 10005 – 20 companies – 0.023 per person

This is a crude approach to take, since the Crunchbase data may not be a representative sample, etc, but it gives a good first approximation. I've open-sourced all the code and data, so if you have ideas on improving this, jump in.
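The per-capita calculation itself is just a group-by and a division; here's an illustrative reconstruction with invented input shapes, not the actual open-sourced code:

```python
from collections import defaultdict

def per_person_stats(companies, populations):
    """Derive per-capita startup stats for each zip code.

    `companies` is a list of (zip_code, raised_dollars) rows, as might
    be scraped from Crunchbase; `populations` maps a zip code to its
    census population. Returns a dict of
    zip_code -> (company_count, total_raised, raised_per_person).
    """
    count = defaultdict(int)
    raised = defaultdict(float)
    for zip_code, dollars in companies:
        count[zip_code] += 1
        raised[zip_code] += dollars
    stats = {}
    for zip_code, companies_here in count.items():
        population = populations.get(zip_code)
        if not population:
            continue  # skip zips missing from the census data
        stats[zip_code] = (companies_here, raised[zip_code],
                           raised[zip_code] / population)
    return stats
```

Sorting the resulting dict by either the count or the per-person value reproduces rankings like the tables above.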

Next of course I wanted to visualize this data. Thanks to the sheer mindblowing awesomeness(*) of my OpenHeatMap project, all I had to do was upload my spreadsheets to get these maps of the data:

Companies per-person 

Funds raised per-person

And here's a couple of detailed views of the funds raised in Colorado and the Bay Area:

* Mileage may vary. Standard terms and conditions apply

Using KissMetrics to improve your website


I'm a big believer in the power of objective measurements as the best way to drive product improvements, and in the past I've built my own ramshackle logging systems to gather the data I needed. Unfortunately it always took a frustratingly long time to create the systems, and I never had enough resources to build a visualization and analysis interface that easily told me what I wanted to know. For OpenHeatMap I decided to be as aggressive as I could in finding off-the-shelf solutions for everything outside of the core of the service, so I gave KISSMetrics a try.

Much as I would enjoy a Gene Simmons-themed stats service, it's actually named after "Keep It Simple, Stupid", and they deliver on that (in a good way). Installing the code is straightforward, just a nugget of JavaScript for every page on your site. With that set up, you can define a series of pages as a 'funnel', a path you expect your users to take through the site towards your eventual goal. This was also very painless to set up, though in OpenHeatMap's case it's more of a tree with lots of alternate routes. The reporting handles this fairly well, letting you see visitors who entered after the nominal start of your funnel. You can see the sort of graph you get at the top of this post.

That's really the heart of the service for me. My goal is to get as many visitors as possible to create maps, so I religiously follow the ratio of people viewing the front page to those making it to the end of the map-building process. I started off with only around 2% making it all the way through, but now on a good day I'll see 9% building their own visualizations. Having that number to check my changes against has been essential. I've been able to tell very quickly if my changes are actually making a difference, and psychologically it's been a great motivator to work on improvements and make that dial move!
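KISSMetrics computes this for you, but the underlying funnel-conversion metric is simple enough to sketch. This toy version (my own invention, not their API) counts visitors whose page views contain the funnel steps in order:

```python
def funnel_conversion(visits, funnel):
    """Fraction of visitors whose page views hit every funnel step in order.

    `visits` maps a visitor id to the ordered list of pages they viewed;
    `funnel` is the ordered list of pages defining the path to the goal.
    Intermediate detours are allowed, matching the 'tree with lots of
    alternate routes' behavior described above.
    """
    completed = 0
    for pages in visits.values():
        position = 0
        for page in pages:
            if position < len(funnel) and page == funnel[position]:
                position += 1
        if position == len(funnel):
            completed += 1
    return completed / len(visits) if visits else 0.0
```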

There's a lot of depth to KISSMetrics, including support for A/B testing and an API for custom events, but I have so much on my plate improving obvious problems with my service that I haven't dived in. There is a cost to all this goodness of course: $125 a month. That's a very steep price for a small-scale site like mine, but it's vital enough to my development that it's worth it. It's a good motivation to get my service to the point where I can roll out premium features too!

Under the Devil’s Thumb


We're just back from a three-day backpacking trip through the Rockies, and I'm still nursing my aching legs. I hadn't done any backpacking since our trip to Scotland, so hiking with full packs was a shock to my system (especially without a support network of pubs to retire to every evening). Our plan was to start at Hessie trailhead near Eldora, camp by Devil's Thumb Lake the first night, then head along the divide to Middle Boulder for the second, hiking back down King Lake trail on the last day back to the truck. Do you see the ice wall in the photo? That blew our plan apart.

The first thing to watch out for if you're thinking of exploring the wilderness from Hessie trailhead is the parking. The first parking area only has space for a dozen cars, and while there's room along the road for more, there are signs warning that they'll be towed. We were lucky and found a space someone had just left, but as we hiked in we discovered ample parking if you have a four-wheel-drive vehicle that can make it through a long stream ford. We made it to our first camp without a hitch, and the setting was beautiful: next to a lake, amongst scattered pine trees, sitting under the Devil's Thumb mountain. That night the wind was continuous and strong, even sheltered behind a stand of trees.

In the morning we set out to complete the final thousand feet of climbing to Devil's Thumb Pass on the Divide. The trail was steep and a bit treacherous towards the end, but we were a stone's throw from the top when we hit the ice. At first we explored clambering over, but the surface was too slick. Next we scrambled over the scree to work our way around it. After slipping out of my pack I was able to make it around, and halfway along the top, but had to admit defeat. With our 30-pound rucksacks, a 500-foot drop and the strong winds, it was just beyond our abilities to manage safely. We chatted with another pair of hikers as we dropped back to eat our lunch, and were within earshot as they planned their attack. It took them about 30 minutes, but they managed to work their way around, and later on we met another group who'd tackled it a couple of years before and taken the same route. We're still pretty frustrated we couldn't manage it ourselves, since we were so close to making the divide, so we're already planning another attempt.

Feeling pretty dejected, we headed back down the trail almost back to the trailhead, and then cut up Kings Lake trail to camp that night.

It was hard to stay downhearted somewhere so beautiful, and on our final day we had an amazing series of views as we climbed up to the divide along Kings Lake trail, the route we would have taken back had everything gone to plan. Near the top we took a spur trail to Betty and Bob lakes, just below the Continental Divide at about 12,000 feet.


We actually ended up working quite a bit harder than we would have doing the loop, hiking up and down to the Divide twice rather than cutting along it on the High Lonesome trail, but it was worth it to experience the world up there. Even in mid-August the winter seems close, with the flowers blooming like it's just turned spring and the most amazing mushrooms sprouting from the damp ground. We were lucky enough to have clear weather so we could see way out onto the plains, and deep into the Rockies. Despite our aching legs after the hike down, it left us wanting to return.

Paul Graham’s wrong on the value of a hacker culture (sadly)


I love Paul's essays, but he's way off-base in his latest analysis of Yahoo's rise and fall. I've no bone to pick with the historical section, but his conclusion that "in the software business, you can't afford not to have a hacker-centric culture" can be contradicted with a single word. Apple.

Apple is an amazing company, but Steve Jobs is the anti-hacker. Most of the values we love, like openness and configurability, fly in the face of his obsession with a transcendental experience for ordinary users. Third-party developers are a necessary evil, to be kept tightly constrained lest they screw up that experience. Internally, Apple has some of the most amazing developers I've ever met, and the standard is astonishing, but the designers are in complete control. Hackers love to experiment in public; Apple is obsessed with putting on a seamlessly well-rehearsed performance, so failures have to happen in private or there's hell to pay.

I could go on and on listing Apple's sometimes rocky relationship with its open-source projects or its willingness to compete with its own third-party developers, but the bottom line is that Steve has built a culture where hacker values are marginalized but he still manages to both produce amazing products and make massive amounts of money. He does so by obsessing about the user experience above all else, and he'd have no qualms about out-sourcing the code development to China just like the manufacturing, if he could get the same quality of output. He doesn't have any more affection for programmers than for anyone else involved in the product and sees designers as a lot more central to the process.

The sad thing is that I'd love a 'hacker-centric culture' to be an essential requirement for large software companies. I lost far too much time at SciFoo when I discovered the Inventables stand in the Googleplex, and I'd be in heaven if all corporations catered to my obsessions like that. I just can't square that argument with the evidence, and on the basis of this essay, neither can Paul.