Five short links

Photo by Clara Alim

Lawnchair – “Sorta like a couch except smaller and outside”. Great little project for storing client-side data in a simple, modern way.

Synopsis Data Structures for Massive Data Sets – The more I deal with large data sets, the more I appreciate how useful approaches like Bloom Filters can be. This somewhat dense paper introduced me to a whole menagerie of other ‘synopsis data structures’ that store an incomplete but useful set of properties about a massive data set, within a surprisingly small memory footprint.
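To make the synopsis idea concrete, here's a toy Bloom filter: it answers set-membership queries from a fixed, small bit array, at the cost of occasional false positives (it can say "probably present" for something you never added, but never "absent" for something you did). This is just an illustrative sketch with arbitrary parameters, not tuned code:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: 'definitely not present' or 'probably present'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

However many items you add, the memory footprint stays at `size_bits` - that's the trade the whole family of synopsis structures makes.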

Let your gray hair light your way through unfamiliar data – Short but interesting perspective from an investment banker on the attitude you need to effectively analyze data.

Burglaries in Auckland’s Eastern Suburbs – OpenHeatMap hits New Zealand, thanks to Reuben Schwarz

GetAround – I spent yesterday trying to make it to a series of meetings scattered around Silicon Valley using only my new folding bike and trains, but the sparse non-peak Caltrain schedule made it impossible. At the end of the day I bumped into Jessica Scorpio handing out donut holes in Palo Alto, and after my day her startup sounded very appealing. It lets you rent out your car when you’re not using it, like a peer-to-peer equivalent of ZipCar. I would feel a lot better about keeping my car for visits like yesterday’s if it wasn’t sitting idle for the rest of the time.

Data is snake oil

Photo by Library Company of Philadelphia

There's a whole new world of data emerging, along with cheap and easy tools for processing it. Unfortunately a lot of snake-oil salesmen have spotted this too, and are now eagerly mis-using 'big data' in their pitches. I was reminded of this when I read the recent Wall Street Journal article on health insurance companies looking at social network data. There's been detailed demographic and purchase data available for every household in the US for decades, so why haven't they used that existing data if the approach is as effective as the many hopeful consultants claim?

It's because data is powerful but fickle. A lot of theoretically promising approaches don't work because there are so many barriers between spotting a possible relationship and turning it into something useful and actionable. Russell Jurney's post on Agile Data should give you a flavor of how long and hard the path from raw data to product usually is. Here are some of the hurdles you'll have to jump:

Acquisition. Few data sets are freely available, and even if you can afford the price, the licensing terms are likely to be restrictive. Even once you have that sorted, you're at the mercy of the providers unless you're gathering the data yourself. If they see you making money, in the best case they'll ramp up their price, and in the worst case they'll cut you off, either for reputational reasons or so they can offer a similar service themselves. Can you imagine the outcry if insurance companies penalized donors to cancer charities, as the article postulates? Nobody will want to provide data with that sort of reputational risk looming.

Coverage. No matter how good your analysis results are, if you only have source data on 10% of the targets the product will be useless.

Over-determination. Age, income and industry probably do a pretty effective job of predicting your chances of becoming overweight. Is going to the trouble of spotting that somebody's buying exercise equipment really going to improve your prediction enough to justify the expense of testing, implementing and tuning the process?

Poor correlations. The data may just not carry the answers you need. This is more common than you'd think: many relationships that seem like they should be rock-solid don't pan out when you test them against reality.

Noise. A lot of information gets lost in the noise of real-world data sets. I think of this as the Megan Fox problem; so many Facebook users were fans of her that she appeared in almost every region's top 10 list and I had to run normalizing steps to remove her malign influence on my results. That of course degraded the overall fidelity of the conclusions.
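The normalizing step amounts to ranking by how over-represented a page is in a region compared to its global popularity, rather than by raw counts. This isn't the exact code I ran; the function and the numbers below are an invented sketch of the idea:

```python
from collections import Counter

def regional_top_pages(region_counts, global_counts, k=10):
    """Rank pages by lift (regional share / global share), so pages that are
    hugely popular everywhere don't dominate every region's top-k list."""
    region_total = sum(region_counts.values())
    global_total = sum(global_counts.values())

    def lift(page):
        regional_share = region_counts[page] / region_total
        global_share = global_counts[page] / global_total
        return regional_share / global_share

    return sorted(region_counts, key=lift, reverse=True)[:k]

# Invented example: a global celebrity page swamps a local one on raw counts,
# but the local page wins once you normalize by global popularity.
region = Counter({"celebrity_page": 100, "local_sports_team": 40})
everywhere = Counter({"celebrity_page": 100000, "local_sports_team": 50})
ranked = regional_top_pages(region, everywhere)
```

The catch, as I found, is that any reweighting like this trades one distortion for another, which is part of why noise is so hard to engineer away.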

So what's the solution? As Russell says, you need a whole new approach to prototyping, focused on building something that works with actual data and lets you interactively explore what works in reality, versus the relationships you hope are there from thought experiments. At least the Aviva study in the article did try out its techniques on 60,000 records, though the report left me with lots of unanswered questions.

Next time somebody's trying to sell you on the awesomeness of their new data technique, ask to see a prototype. If they haven't got that far, it's snake oil.