How do analytics really work at a small startup?

I was lucky enough to spend a few hours today with my friend Kevin Gates, one of the creators of Google's internal business intelligence systems, and it turned out to be a very thought-provoking chat. His mind was somewhat boggled that we were so data-obsessed at such an early stage in our life. Most people running analytics work at a large company and have a big stream of users to run experiments on. Our sample sizes are much smaller, which makes even conceptually simple approaches like A/B tests problematic. Just waiting long enough to get statistically significant results becomes a big bottleneck.
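To put rough numbers on that bottleneck, here's a back-of-the-envelope version of the standard two-proportion sample-size calculation. The conversion rates below are purely illustrative, not our real figures:

```python
from math import ceil

def users_per_variant(baseline, lift):
    """Rough users needed per A/B arm to detect an absolute conversion
    lift at 5% significance / 80% power (standard two-proportion formula)."""
    z = 1.96 + 0.84  # z-scores for two-sided alpha=0.05 and power=0.8
    p1, p2 = baseline, baseline + lift
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z ** 2 * variance / lift ** 2)

# Spotting a one-point lift on a 10% conversion rate needs roughly
# 15,000 users in each arm -- weeks of waiting when your whole user
# base is only in the tens of thousands.
print(users_per_variant(baseline=0.10, lift=0.01))
```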

We've found ways around a lot of the technical issues, for example focusing on pre/post testing rather than A/B to speed up the process, but there's a bigger philosophical question. Is it even worth focusing on data when you only have tens of thousands of users?

The key for us is that we're using the information we get primarily for decision-making (should we build out feature X?) rather than optimization (how can we improve feature X?). Our quest is to understand what users are doing and what they want. Everything we're looking at should be actionable, should answer a product question we're wrestling with. To help answer that, I sketched out a diagram of how the information flows through our tools to the team:

[Diagram: how analytics data flows from our tools to the team]

The silhouettes show where people are looking at the results of our data crunching. The primary things that everyone on our team religiously watches are the daily report emails, and the UserTesting.com videos that show ordinary people using new features of our app. The daily reports are built on top of our analytics database, which is a Postgres machine with a homebrewed web UI to create, store, and regularly run reports on the event logs it holds. We built this when our requirements expanded beyond KissMetrics' more funnel-focused UI, but we still use their web interface for some of our needs. Qualaroo is an awesome offshoot of KissMetrics that we use for in-app surveys, and we also refer to MailChimp's Mandrill dashboard and Urban Airship's statistics to understand how well our emails and push notifications are working. We have to use AppAnnie to keep track of our iOS download numbers and reviews over time.
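To give a flavour of what those daily reports look like under the hood, here's a minimal Python sketch of the pattern: query the event log in Postgres, format a summary, and mail it to the team. The table and column names here are made up for illustration, not our actual schema:

```python
import smtplib
import psycopg2
from email.mime.text import MIMEText

# Hypothetical schema: an `events` table with (user_id, name, created_at).
REPORT_SQL = """
    SELECT name, COUNT(*) AS events, COUNT(DISTINCT user_id) AS users
    FROM events
    WHERE created_at >= NOW() - INTERVAL '1 day'
    GROUP BY name
    ORDER BY users DESC;
"""

def build_report():
    # Pull yesterday's event counts out of the analytics database.
    conn = psycopg2.connect("dbname=analytics")
    with conn, conn.cursor() as cur:
        cur.execute(REPORT_SQL)
        rows = cur.fetchall()
    lines = ["%-30s %8d events %8d users" % row for row in rows]
    return "Yesterday's activity:\n\n" + "\n".join(lines)

def send_report(body, recipients):
    # Email the plain-text summary to everyone on the team.
    msg = MIMEText(body)
    msg["Subject"] = "Daily analytics report"
    msg["From"] = "reports@example.com"
    msg["To"] = ", ".join(recipients)
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)

if __name__ == "__main__":
    send_report(build_report(), ["team@example.com"])
```

Run from cron every morning, this is the whole "daily report" pipeline in miniature.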

We also have about twenty key statistics that we automatically add to a 'State of the App' Google Docs spreadsheet every day. This isn't something we constantly refer to, but it is crucial when we want to understand trends over weeks or months.
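Appending that daily row is just another small cron job. Here's a sketch of the idea using the gspread library; gspread is only one way to talk to a Google spreadsheet, and the statistic values below are hypothetical:

```python
import datetime
import gspread

# Assumes a Google service account credential file; the spreadsheet
# title and the statistics appended here are purely illustrative.
gc = gspread.service_account(filename="credentials.json")
sheet = gc.open("State of the App").sheet1

today = datetime.date.today().isoformat()
stats = [today, 41235, 3120, 0.27]  # e.g. total users, DAU, retention...
sheet.append_row(stats)
```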

Over the last 18 months we've experimented with a lot of different approaches and sources of data, but these are the ones that have proved their worth in practice. It doesn't look the same as a large company's approach to analytics, but this flow has been incredibly useful in our startup environment. It has helped us to make better and faster decisions, and most importantly spot opportunities we'd never have seen otherwise. If you're a small company and feel like you're too early to start on analytics, you may be surprised by how easy it is to get started and how much you get out of it. Give simple services like KissMetrics a try, and I bet you'll end up hooked!


How good are our geocoders?

[Photo of a confusing signpost by Oatsy 40]

My last post was a quick rant about the need for a decent open geocoder, but what's wrong with the ones we have? I've created a command-line tool to explore their quality: https://github.com/petewarden/geocodetest.

As a first pass, I pulled together a list of six addresses, some from my past and a few from spreadsheets users have uploaded to OpenHeatMap. The tool runs through the list (or any file of addresses you give it) and geocodes them through DSTK and Nominatim, returning a CSV of whether each location is within 100m of Google's result. Run the script with -h to see all the options. Here are the results, produced by running ./geocodetest.rb -i testinput.txt (there's a rough sketch of the comparison logic just after the results):

dstk,google,nominatim,address
Y,Y,Y,2543 Graystone Place, Simi Valley, CA 93065
Y,Y,N,400 Duboce Ave, #208, San Francisco CA 94117
Y,Y,N,11 Meadow Lane, Over, Cambridge, CB24 5NF, UK
N,Y,N,VIC 3184, Australia
N,Y,Y,Lindsay Crescent, Cape Town, South Africa
N,Y,N,3875 wilshire blvd los angeles CA
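The comparison itself is nothing exotic. Here's a rough Python sketch of the core idea; the real tool is a Ruby script that also queries the DSTK and Google, and the 'Google' coordinates below are just placeholders for illustration:

```python
import json
import math
import urllib.parse
import urllib.request

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    r = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nominatim_coords(address):
    """Ask the public Nominatim instance for its best-guess coordinates."""
    url = ("https://nominatim.openstreetmap.org/search?format=json&q="
           + urllib.parse.quote(address))
    req = urllib.request.Request(url, headers={"User-Agent": "geocodetest-sketch"})
    results = json.load(urllib.request.urlopen(req))
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])

def within_100m(candidate, reference):
    # Treat a missing result as a miss, just like a far-away one.
    if candidate is None or reference is None:
        return "N"
    return "Y" if haversine_m(*candidate, *reference) <= 100.0 else "N"

# Compare Nominatim's answer to a reference point (placeholder standing
# in for Google's result) and print one CSV-style row.
address = "11 Meadow Lane, Over, Cambridge, CB24 5NF, UK"
google = (52.3185, 0.0101)  # placeholder reference coordinates
print(",".join([within_100m(nominatim_coords(address), google), address]))
```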

The first three are standard test cases for me, so it's not a massive surprise that my DSTK (based on Schuyler Erle and GeoIQ's original work) works better than Nominatim for two of them. It does highlight one of the reasons I've struggled to use Nominatim though – it's not good at coping with alternative address forms. This makes it quite brittle, especially with addresses in places like the UK, where there are multiple common permutations of village, city, and county names. Nominatim doesn't return any results for #2 or #3 at all, when I'd hope for at least a town-level approximation.

Nominatim puts the Australian postal code about 30 km from Google's result, whereas the open GeoNames data in the DSTK gets me to within 400m of Google. Nominatim does much better on the South African address, since I haven't imported OSM data into the DSTK for anywhere but the UK. I did have to correct the original user-entered spelling of 'Cresent' first though, and I'd love to see an open geocoder that was robust to this sort of common mistake. The last address is another sloppy one, but we should be able to cope with that one too!
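That sort of robustness doesn't have to be rocket science either. As a toy illustration (not something the DSTK actually does today), even the Python standard library's fuzzy string matching can recover a misspelled street name against a list of candidate streets for the area:

```python
import difflib

# Hypothetical list of street names already known for the suburb.
known_streets = ["Lindsay Crescent", "Linden Avenue", "Loch Road"]

typed = "Lindsay Cresent"  # the user's misspelling
match = difflib.get_close_matches(typed, known_streets, n=1, cutoff=0.8)
print(match)  # ['Lindsay Crescent']
```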

Part of the reason there hasn't been more progress on open geocoders is that the problems are not very visible. I hope having an easy test harness changes that, and while this first pass is far from scientific, it's already inspired me to put in several fixes to my own code. I'm a big fan of the effort that's been put into the Nominatim project (I'm using their OSM loading code myself); I'm just disappointed that the results haven't been good enough to build services like OpenHeatMap on top of. I'll be expanding this tool to cover more addresses and so build a better 'map' of how we're doing, and what remains to be done. I'm excited by the opportunities to make progress here; I'll be busy working more on my own efforts, and I can't wait to hear other folks' thoughts too.

Why is open geocoding important?

[Photo of a globe by Werner Kunz]

A few years ago I had what I thought was a simple problem. I had a bunch of place names, and I needed to turn them into latitude and longitude coordinates. To my surprise, it turned out to be extremely hard. Google has an excellent geocoder, but you're only allowed to use it for data you're displaying on Google Maps, and there are rate limits and charges if you use it in bulk. Yahoo had an excellent array of geo APIs with much better conditions, but there were still rate limits, and even then their future was in doubt!

So, I ended up hacking up my own very basic solution based on open data. It turned out to be a fascinating problem, one you could spend a lifetime on, trying to draw a usable, detailed picture of the world from freely available data. I bulked up the underlying data and algorithms, and it became the core of the Data Science Toolkit. Turning addresses into coordinates may sound like a strange obsession, but it has become my white whale.

There are some folks who agree that this is an important problem, but I've been surprised there aren't more. Placenames describe our world, and we need an open and democratic way for machines to interpret them. Almost any application that uses locations needs to do this operation, and right now we have no alternative to commercial systems.

What are the practical impacts of this? We've got no control over what our neighborhoods are called, or how they're defined. We can't fix problems in the data that impact us, like correcting the location of our address so that delivery drivers can find us. We can't build applications that take in large amounts of address data unless we can afford high fees, which cuts out a whole bunch of interesting projects.

This is on my mind because I'm making another attack on improving the DSTK solution. I've already added a lot of international postal codes thanks to GeoNames, but next I want to combine the public domain SimpleGeo point-of-interest dump with OpenStreetMap data to see if I can synthesize more addressable ranges for at least some more countries. That will be an interesting challenge, but if I get something usable it opens the door to adding more coverage through any open data set that combines street addresses and coordinates. I can't wait to see where this takes me!
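As a postscript, here's a toy sketch of what I mean by synthesizing address ranges: if two points on the same street have known house numbers (from a POI dump, say), you can linearly interpolate a rough position for the numbers in between. This is my own simplification for illustration, not the DSTK's actual code, and the coordinates are made up:

```python
def interpolate_house_number(target, anchors):
    """Estimate coordinates for a house number by linear interpolation
    between the lowest and highest known (number, lat, lon) points on a
    street. A crude toy model: real streets curve and number irregularly."""
    anchors = sorted(anchors)
    (n1, lat1, lon1), (n2, lat2, lon2) = anchors[0], anchors[-1]
    if n2 == n1:
        return lat1, lon1
    t = (target - n1) / float(n2 - n1)
    return lat1 + t * (lat2 - lat1), lon1 + t * (lon2 - lon1)

# Hypothetical anchors: POIs at 3800 and 3950 Wilshire Blvd with known
# coordinates; estimate where 3875 falls between them.
anchors = [(3800, 34.0614, -118.3082), (3950, 34.0619, -118.3125)]
print(interpolate_house_number(3875, anchors))
```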