Why your startup should use data competitions

Photo by Brett Jordan

When I first came across Kaggle last year I loved the idea. They run Netflix-style data competitions as a service, and by linking real-world problems with the researchers who can solve them, everybody wins! The only thing that surprised me was that I didn't see startups creating contests; it seemed like a perfect way to bring in some amazingly talented helpers on a tight budget. It stayed on my mind, and when I hit a tough prediction problem with my current company, I knew I wanted to give Kaggle a try.

Happily I've got to know Anthony, Jeremy and the team quite well since I moved to San Francisco, so they were extremely helpful when I turned to them (and they cut me a great deal at starving-startup rates!). I first reached out on Wednesday, and we had the competition live on Saturday morning:

Photo Quality Prediction

With a prize pool of $5,000, we've attracted over thirty teams in less than two days! The results are already very promising, and there are still three weeks left.

If you're a startup, have a look at the amount of intelligent and enthusiastic help we're getting, and think about the problems you face in your business. You want to be focused on your product, so why not get help from experts on the machine-learning side? I bet they'll do a much better job than you have time for, and it won't break the bank. Kaggle's help has meant it's taken very little of my time as well, freeing me up to work on our core technology.

In my next post I'll walk you through exactly what it took to set up the competition through Kaggle's site, but go check it out now, and picture what you could do with dozens of machine-learning ninjas on your team.

Why we need an open-source geocoding alternative to Google

Photo by Marc Levin

You can't use Google's geocoding for anything but map display! I've always been surprised by how many services rely on the Google Maps API for general address-to-coordinate translation, despite it being prohibited unless you're displaying the results on one of their maps. Google has provided some fantastic resources for geo developers and moved the whole field forward, but we can't rely on them for everything. The recent changes to their terms of service have alerted a few people to this long-standing issue, so here are the alternatives I've discovered over the years, and why I think you should look into open-source solutions.

Yahoo

The easiest change for an application developer is to use one of Yahoo's excellent geocoding APIs: Placefinder for street addresses, or Placemaker for more unstructured names of places like towns, provinces or countries. There are no restrictions on how you use the data, and you get 50,000 requests a day. It has good coverage worldwide (though I recently noticed an issue with Finland).

The biggest downside is that they clearly have an uncertain future. Yahoo hasn't managed to monetize their awesome developer APIs, and most of the engineers involved in setting them up have left. It's nerve-wracking to build your application on an API that could disappear at any point!

Schuyler Erle

Schuyler is a one-man open-source-geocoding machine! He wrote the original Perl module for taking US Census data and looking up addresses, and also created an updated Ruby version for the Geocommons folks. I've found it works impressively well on US addresses. The biggest drawbacks are the requirement that you download and import many gigabytes of US census data before you can set it up on your own machine, and a lack of international coverage.

Nominatim

OpenStreetMap has created the Nominatim project for converting addresses into coordinates using its open collection of mapping information. Unfortunately it's way too logical for its own good, expecting to receive addresses that are strictly hierarchical. For example, it can't understand "40 Meadow Lane, Over, Cambridge CB24 5NF, United Kingdom"; you have to mangle that into something unnatural like "40 Meadow Lane, Over, Cambridgeshire, England" before it starts to parse it, and even then it picks the wrong place as the first result. It also generally doesn't know where numbers fall on particular streets, since it relies on landmark points like pubs with numbers attached, and these are very sparse.
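If you want to see how Nominatim handles a given address yourself, its public search API accepts free-form queries. Here's a minimal Python sketch for building such a request (the endpoint and parameter names are how I understand the public API, so treat this as a starting point rather than gospel):

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def build_nominatim_query(address, limit=5):
    """Build a Nominatim search URL for a free-form address string."""
    params = {"q": address, "format": "json", "limit": limit}
    return NOMINATIM_SEARCH + "?" + urlencode(params)

url = build_nominatim_query("40 Meadow Lane, Over, Cambridgeshire, England")
# Fetching this URL returns a JSON list of candidate places, best match first
```

Trying both forms of the address above against this endpoint is a quick way to reproduce the parsing quirks I ran into.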

Data Science Toolkit

Since I couldn't find anything that met my needs, I decided to take a shot at pulling together a lot of the existing resources into a more convenient package. I took Schuyler's work on the TIGER/Line data for US addresses, and used some of the Nominatim backend code with a more flexible front-end to handle more postal addresses. I then rolled up a couple of virtual machine packages so you don't have to do the messy data importing yourself; you can grab it as an Amazon AMI or a VMware image. You can also get started using the API on the main datasciencetoolkit.org site, but I wouldn't recommend it for heavy use since it's just a single machine.
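For light experimentation against the hosted API, a street address lookup is just a URL path. A hedged Python sketch (the /street2coordinates endpoint name is taken from the site's documentation; verify it before building anything serious on top):

```python
from urllib.parse import quote

DSTK_BASE = "http://www.datasciencetoolkit.org"

def street2coordinates_url(address):
    """Build a DSTK street2coordinates request URL for one address."""
    # The address travels as a URL path component, so escape it
    return DSTK_BASE + "/street2coordinates/" + quote(address)

url = street2coordinates_url("1 Infinite Loop, Cupertino, CA 95014")
# Fetching this URL returns a JSON object keyed by the input address
```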

Its main limitation is that it only handles US and UK addresses. The UK lookups are all done through OpenStreetMap data, so it should be possible to extend it worldwide given enough work; I just haven't been able to devote the time to do that. I'd love to see someone extend the current code though, or improve a different project like Nominatim, or even start a whole new one. There's already enough data out there to build a truly open API for geocoding, so let's make it happen!

Analytics adventures with Pig and Cassandra

Photo by Jes

After a lot of head-scratching, I ended up choosing Cassandra as the main data store for my latest project. I needed a system that could handle a loading process throwing hundreds of thousands of items a minute at it, denormalized across multiple indexes, whilst simultaneously serving up results to a web application, and so far it has performed magnificently. 

Unfortunately it has ended up being a victim of its own success. Now that we have tens of millions of pieces of user-generated content, we want to ask the data questions. That's not so easy with NoSQL, so here are some notes on the solution I ended up building.


The only way to run code across data held on a large cluster of machines is to execute the processing on a similarly large cluster, preferably on machines that already have local copies of the data. That means using Hadoop, and since I'm using DataStax EC2 machine images for my Cassandra servers anyway, I started by trying to add some Brisk AMIs to my existing cluster. This is a Hadoop distribution, pre-configured to integrate with Cassandra, designed for exactly what I was hoping to do. Unfortunately I struggled to figure out the right startup parameters to get the newly-created machines talking to my existing cluster, despite some excellent help from Joaquin. I took a break from that, and discovered that hand-building the basic Cassandra components I needed wasn't too hard on my existing Hadoop machines, so I continued with more of a home-brew setup.

Pig or Hive?

In order to run Hadoop jobs on Cassandra data, you need a fast way to pull it out from your tables into a supported coding environment. The only two supported languages for this in the current Cassandra releases are Pig and Hive. Pig is a procedural data-transformation language, whereas Hive looks a lot like SQL. I think a lot more procedurally and I needed something that could handle some tough unpacking and formatting tasks, so I went with Pig.


I don't know if my experiences would have been any better with Hive, but I found I was walking a fairly lonely path using Pig in conjunction with Cassandra. Jeremy Hanna was a life-saver, and Brandon Williams put in a lot of hard work to get me up and running, but my initial encounter involved a couple of days of me tearing my hair out. I was trying to make some sense out of the results I was seeing on the latest stable release of Cassandra, but they left me baffled. It turned out that the recent introduction of types into the Cassandra adapter had broken all the existing example code, and left the schema reported for the data very different from the actual structure that was returned. Happily I was able to monkey up a messy patch, which Brandon then fixed properly, but it definitely made me realize how far out on the bleeding edge I was. That's a place I usually try to avoid for mission-critical projects!

The right tool for the job

With that overcome, I was able to move forward, but not as speedily as I had hoped. Pig's great strength is that it's a domain-specific language for data processing. It's a big bag of useful operations, with no particular grand design to the language. It reminds me of PHP or R, and I don't mean those comparisons as an insult; I have a fondness for these sorts of languages. When you're working inside their domain they're extremely productive; you almost never need to install extra dependencies, and everything's at your fingertips. Sadly, I found I was operating a bit outside of the mainstream for Pig.

Pig wrestling

As an example, I store a large array of records per user. In theory I could store each item as a new row in Cassandra, with a secondary index for the user key, but in practice the performance and storage overhead of that approach rules it out. A CSV string is a simple but effective way of holding the data, so when I'm running a Pig script I need to decode that string back into records. There is a smart CSV loader in the latest release, but it only works when you're reading in files, not on strings you've already loaded. To make the job easier, I reloaded all my data using a format that was going to be a lot easier to parse (stripping out any line-terminator or separator characters inside quoted strings, for example) and then set out to do the job using Pig's built-in primitives.
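To make the unpacking concrete, here's the equivalent decoding step sketched in Python rather than Pig — the standard csv module handles the quoted-field rules, given the cleaned-up format described above:

```python
import csv
import io

def records_from_csv_string(raw):
    """Decode one stored CSV string back into a list of row lists.

    Assumes quoted fields contain no embedded newlines or separator
    characters, matching the pre-cleaned format the loader wrote out."""
    return list(csv.reader(io.StringIO(raw)))

rows = records_from_csv_string('a,b,"c d"\n1,2,3')
# rows == [['a', 'b', 'c d'], ['1', '2', '3']]
```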

I thought I could just use STRSPLIT to break up my strings into individual rows, but it turns out that it only returns a tuple. This is bad because turning a tuple into separate records for further processing is pretty involved at best. What you really need is what Pig calls a bag, an unordered set of records that can be easily turned into a proper stream of records. TOKENIZE is almost identical to STRSPLIT and returns a bag, so I thought I was in luck. Unfortunately it doesn't take a parameter allowing you to specify what characters to split the string on. Undaunted, I thought I should contribute something back to the code base and create a patch. I dusted off my Java neurons, and created the changes I needed. That let me finally parse the CSV files I was dealing with, but I was stuck as I tried to figure out how to clean up the code to offer it as a patch.

The problem is that my new TOKENIZE takes an optional second parameter to specify the custom characters to split on. I needed to keep the existing behavior for a single-parameter call to avoid breaking old scripts. That's easy enough to handle in the execution code, I just check the number of arguments passed in, but there's also a function that exposes the signature of the UDF, and that doesn't support variable numbers of arguments; it's a known limitation. The suggested workaround is to remove the signature definition entirely, but since that's presumably there for a reason, it didn't seem a sensible approach. In the end I was stymied, but since I could move forward with a slightly custom branch of Pig, I reluctantly abandoned the patch.
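For illustration, here's the shape of the API I was after, sketched in Python rather than Pig's Java (the names are mine, not Pig's): the optional second argument specifies the split characters, and omitting it keeps the old whitespace behavior so existing callers don't break.

```python
import re

def tokenize(text, delimiters=None):
    """Split text into a bag of tokens.

    With no delimiters given, mimic the stock TOKENIZE and split on
    whitespace; otherwise split on any of the supplied characters."""
    if delimiters is None:
        return text.split()
    pattern = "[" + re.escape(delimiters) + "]"
    # Drop empty tokens produced by runs of adjacent delimiters
    return [token for token in re.split(pattern, text) if token]

tokenize("a b  c")        # ['a', 'b', 'c']
tokenize("a,b;;c", ",;")  # ['a', 'b', 'c']
```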

Happily ever after

I don't want to sound too negative about my experiences; now that I've got the basics set up I'm able to write scripts very quickly and answer all sorts of questions. It's also amazing to think about the power all this free software unleashes, and the generosity of the community who helped me out as a newbie. If you're considering a similar project though, I would either budget for more research time than you might expect, or track down a native guide, somebody who has already done something similar in a production environment.

[Update: Jon Coveney added a great comment explaining more about how to solve the TOKENIZE issue I hit, so I'm including it below, and I'll do another post when I give it a try.]

Hey Pete, thanks for writing about your experience. It's a goal of mine to find a project to use Cassandra for, and I'm sure I'll be walking through many similar problems. 

Just wanted to note something about the TOKENIZE piece. Pig can be a bit weird about variable arguments, but in this case, it shouldn't be too bad. You have two options. 

1 is to have an optional constructor which takes one parameter. You could then do 
DEFINE mytokenize TOKENIZE('*'); 

now you could just use mytokenize normally. Implementing this is as difficult as implementing the constructor. 

Another option is that you can have pig take 1 or 2 arguments. The limitation you pointed to doesn't actually apply here…that limitation is that if you have a function that takes a variable number of arguments _that also takes arguments of different types_ (this part is key) THEN you can't do it. [1] 

In your case, tokenize always takes a string, and an optional string delimiter. I personally would go the constructor route, but either is fine 🙂 I usually write UDF's with an initialize method instead of doing it all constructors anyway, so you would only have to check the number of arguments once. 

Happy hacking 

[1] To explain where this applies, think of SQL's coalesce function. This is a function that can take both functions of different types, and varying numbers of them. So you could do coalesce(1,2,3,4,5) or coalesce('hey','you','get','out') or whatever. Pig does not allow you to do this. With the getArgToFuncMapping, you can map to a function that takes a fixed number of arguments. Or you can have a varying number of arguments, but it can't be sensitive to the type of the input…while you can still then implement something like coalesce, it's going to be slow. 

Duboce Triangle excitement and business card inspiration


If you were wondering why your Muni ride home took a lot longer tonight, this is the explanation. The rear wheels of the N car I was on decided to turn off down Church Street, while the rest carried on straight along Duboce. There were a lot of sparks overhead as the power connector sheared off, followed by a nasty crunch, but we were going slowly and nobody was hurt. Happily it happened just outside my apartment, so I was able to hop off without too much disruption, and it sounds like service should be back to normal tomorrow.


I also had to share this business card I came across. All I can say is that I'll be passing along some suggestions to my co-founder and CEO Julian in the morning.

Five short links


Photo by Sami Sieranoja

JSON Pointer – For one crazy moment I thought this was an attempt to squeeze C's memory management into JavaScript. It's actually a very useful effort to standardize how we describe parts of JSON structures, a bit like XPath does for XML.
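To give a flavor of how small the idea is, here's a minimal Python resolver for pointers like /users/1/name. The ~0 and ~1 escapes follow the draft spec; treat this as a sketch, not a conforming implementation:

```python
def resolve_pointer(doc, pointer):
    """Resolve a JSON Pointer string against a parsed JSON document."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    for token in pointer.lstrip("/").split("/"):
        # '~1' and '~0' are the spec's escapes for '/' and '~'
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(doc, list):
            doc = doc[int(token)]  # list steps use numeric indexes
        else:
            doc = doc[token]       # object steps use member names
    return doc

doc = {"users": [{"name": "Ada"}, {"name": "Grace"}]}
resolve_pointer(doc, "/users/1/name")  # 'Grace'
```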

Microsoft and Hadoop – Even the Beast of Redmond digs Hadoop these days. Do I need to be a code hipster and find something more obscure to evangelize now?

Pygmalion – I've been knee-deep in Pig and Cassandra internals for the last week, trying to build an approachable analytics solution for a massive, dynamic data set. It has been something of a struggle, thanks to the combination of my unfamiliarity with both Pig and Cassandra, and the scarcity of other users. I've had some fantastic help from the community though, especially from Jeremy Hanna and Brandon Williams, and I recommend checking out Jeremy's library and talks if you're also wandering into this area.

SMS Corpus – The National University of Singapore has made around 60,000 voluntarily collected text messages in English and Chinese available as a research data set. There's precious little like this available for academic researchers, so asking for contributions is an interesting solution to the privacy problem.

Bill Nguyen – I met Bill briefly at the Color offices, and he is startlingly charismatic. This profile includes some thoughtful quotes from Paul Kedrosky and Eric Ries, but the one that rang most true was the old Hollywood saying that "nobody knows anything". I'm lousy at predicting which companies will go on to success, I have my own mental anti-portfolio of fantastic startups I could have got more deeply involved in. The only way to keep my sanity is to work on products I'm proud of, and hope everything else works out.

Blue Angels 2011


When my friend Bruno Bowden told me that Fleet Week was one of his favorite events in San Francisco, I raised an eyebrow. It sounded like something straight out of a '40s musical. Then he invited me and Joanne along to a party on his rooftop to see the airshow, and I was intrigued, but imagined we'd be staring at little dots flying over Sausalito through binoculars. Boy, was I wrong!

There's no way I could capture the full experience of the jets flying a few hundred feet directly over our heads, or watching them seemingly weave between the buildings, flying lower than we were. It was so hard to film them; they flew over so fast! We did manage to capture a couple of the larger planes in the show, but please excuse the swearing in the second one, I just couldn't believe they could do that with a full-sized jet-liner over a major city!

Big thanks to Bruno for inviting us along; it truly was one of the most unique events I've ever been to. I'm so happy they allow such an amazing show in the middle of San Francisco. Looking around, the rooftops were packed, especially with kids, and I know the pilots left behind memories that will last a lifetime.


How British cheese and a Turing test converted me to Google Plus


Photo by Ulterior Epicure

I'll admit it, I had written off Google's latest effort to be social. I have a tidy little theory that their focus on metrics leads them to local maxima but prevents them leaping across gaps to islands of fun. I expected Google Plus to be a flash in the pan as people signed up, poked around and never returned. That left me surprised to hear a continuing murmur of interest from people I trust like Marshall Kirkpatrick. I almost hate to publicize his secret, but he has managed to notch up a series of great stories by using the service for research.

On Thursday night I'd just finished off a links post and wanted to do my normal tweet about it, but Twitter was down. When it was still unavailable after an hour, I decided this was a sign I should check out Google Plus myself. I went to the site and put together a short post explaining that I was experimenting with the service.


My first impression was of how fast the interface was. It didn't have that lag that I thought was inevitable with web applications. It was so quick I was convinced they must have pre-populated the page with hidden content so it was instantly available. When I looked at the network activity it turned out I was wrong; most of their content is dynamically loaded by Ajax calls. What makes the difference is how fast their response times are, often under 100ms for me. I'm definitely a snob about this sort of thing after my time at Apple, but the snappiness and some thoughtful work on the interaction model made a profound difference to my experience. The fit and finish make me want to spend time exploring the service, in contrast to Twitter's web interface, which just feels a lot klunkier and messier.


Being able to write at a natural length was a surprising relief. I still stayed fairly short, within the same three-sentence rule that I try to use for emails, but I was over 140 characters. I don't think the service would work unless we'd already been taught brevity by Twitter, but being able to expand beyond those limits when I need to feels very liberating.


There were a surprising number of responses to my initial post. One of them was from DeWitt Clinton, in what I wrongly suspected was an auto-generated welcome message. I knew him through the Buzz and WebFinger developer mailing lists, so I thought he might be the Tom Anderson of Google Plus, the first friend for every new user. He convinced me he was human, so either the employees are very engaged with the service, or Google's code can now pass the Turing Test. It seems like there's quite a few regular people watching the streams too, which is confidence-inspiring this long after the launch.

Conversations about cheese

The comment thread grew into a couple of conversations, giving me a chance to reconnect with several friends. In particular Audrey Watters, Edd Dumbill and I started waxing nostalgic about British food, and the struggle to find decent cheese. Edd was inspired to create a short public post, and the whole experience convinced me that the service makes some great conversations possible. It left me wanting to return regularly, which gives me hope the service is sustainable.

I'm happy that I was proven wrong about Google Plus. I won't be abandoning Twitter, but I will be spending more time on Google's service because it offers a different experience, one that I was surprised to find myself enjoying. If you're curious, join me there too.

Five short links


Photo by Holeymoon

Sourcetree – I don't often recommend commercial software, mostly because my personal stack's mostly open source these days, but I've fallen in love with this tool for exploring my git repositories. GitHub's new Mac app is fantastic too, but focused on 'doing things'; I've found Sourcetree a wonderful way to explore and understand a codebase. I just discovered they've been acquired by Atlassian, so I guess I'm not the only fan!

The Sketchbook Project – I never lose my sense of wonder at how many ways the web can be used to drive creativity. By offering to scan people's sketchbooks they've motivated a community of artists from all over the world, and given me a vast set of material to browse through when my imagination needs a jump-start.

Smart Meter surveillance – How your electricity meter can reveal what TV channel you're watching. My German's not good enough to follow the main paper, but the abstract sounds very plausible. From my perspective not something to freak out over, but a good example of all the unexpected ways we leak information about our lives. We measure more and more things to improve efficiency, but the by-product is that the same data can be used for many unintended purposes too.

The Brown Revolution – An unfortunate name, but a compelling idea for sustainable grazing. I'm normally skeptical of agricultural 'silver bullets' like this, but I know from my experience maintaining trails how effective thoughtful drainage can be. When water's compressed into a narrow stream by a gully it will cut through even packed soil like a plasma torch, but keep it spread out in a wide sheet using a shallow 'rolling dip' and you'll have a surface that can survive years of storms.

Apple insiders remember Steve Jobs – I'm very sad we've lost Steve; he always seemed more like a superhero than a mortal to me. At the Guardian's request I contributed a few thoughts about my time working at Apple, and how he was a constant presence even though I barely met him. I'll be thinking of his family.

Five short links

Photo by Nick P

Is teaching MapReduce healthy? – The working conditions inside Hadoop are terrible, but it's rapidly becoming the default for large-scale data processing. Does that mean students should learn the MapReduce approach? It feels a lot like the debates over teaching ugly, confusing, widely-used C/C++ versus beautiful, elegant but niche functional languages, and which should come first in the curriculum. This article gave me some fantastic glimpses into the wider world of distributed frameworks and techniques, and left me itching to try Bloom.

Amazon comments on Spot price spikes – A refreshingly detailed and open response from a large company. I'm disappointed that the prices have suddenly become so volatile though, since that severely limits the places I can use them.

PhantomJS – A WebKit-based headless browser, lighter-weight than Selenium and driven by JavaScript. I hope I get a chance to use this, and that they can sever the vestigial dependency on X11 soon. My main use case would be generating screenshots.

Teaching data to speak humanely – Looking at the Facebook timeline, and how old metaphors of interaction disappear as the interface gets closer to the content.

Pictures of the Big Bang – Gorgeous snapshots of our universe's very first moments, courtesy of a computer simulation:


How I saved $1,000 on my monthly EC2 costs


Photo by Paul Hohmann

If you’ve been a user of my Data Science Toolkit or OpenHeatMap sites, you may have noticed they’ve been a bit flakey recently. The back story is that I’ve started a new company (of which more soon) and I had to cut back on how much I was spending on my own servers. It was costing me over $1,200 a month on all the different systems I’d set up over the last three years! This was mostly because it was quicker and easier to set up a new instance than worry about jamming it onto an existing one, and I never got around to cleaning things up. I doubt there are many people who’ve been as lazy about this as me, but if you’re looking at cutting costs, I would start by figuring out if there are servers you can merge.

By cutting out several like twitter.mailana.com and fanpageanalytics.com that were no longer being updated or heavily used, I was able to cut that in half, but I then reached two sites that have decent traffic. I started by merging DSTK and OHM onto one large server, which mostly worked but caused some hiccups. That got me down to around $400 a month, including another small instance for some legacy Mailana labs sites.

I then decided to switch to ‘spot instances’, Amazon’s auction model for buying spare server capacity cheaply. Most of the time it’s only about 12 cents an hour, a third of the normal price. I switched, but then started to experience some serious price spikes that kept taking the server down, and required manual intervention to set everything up again. At some points the price went to $15 an hour, so there are obviously capacity limits being hit. That’s very different from my experiences just a few months ago, when prices seemed a lot more stable. At this point I’d never recommend spot instances for user-facing servers; the downtime is too high. They’re still a great deal for things like MapReduce backend processing.

I need 64-bit support for a lot of the frameworks DSTK relies on, so I couldn’t go down to a small instance, but I did realize that micro instances were x86_64, so my next step was to try running both sites off one of those. Rather predictably, the processing requirements and lack of memory crippled the tiny instance, so the site was extremely flakey. It would have been only $14 a month though, so I spent some time trying to fix the issues, by configuring swap space, for instance. My conclusion was that micros are too limited for anything but light web serving.

Today I finally bit the bullet and bought a one-year reserved m1.large instance, which cost me about $900 up front and another $1,100 over the next year in usage costs. I also rolled my web.mailana.com small instance into the same server, so I’m down to about $170 a month! This is around $100 less than the unreserved cost, so I’d seriously look at reserving servers as a way of managing your costs, if you can stomach the up-front deposit.
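The arithmetic behind that monthly figure is straightforward, using the numbers above:

```python
# Amortize the one-year reserved m1.large over twelve months
upfront = 900.0        # reservation fee paid up front
usage = 1100.0         # estimated hourly charges over the year

monthly = (upfront + usage) / 12
print(round(monthly))  # 167 -- the rolled-in small instance brings it to ~$170
```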

I’ve also added a link to my O’Reilly books to the sites, in the hope I’ll cover some of my costs:

Improve your data skills (and keep this server running!) by buying my guides:

I’m happy I’ve found a solution that should allow me to keep offering the DSTK and OpenHeatMap services without breaking my bank account. Apologies to all the users who have suffered through the transition, but things should be a lot more stable from now on.