How to turn data into money

Photo by Jerry Swantek (of a short snorter, a fascinating tradition)

The most important unsolved question for Big Data startups is how to make money. I consider myself somewhat of an expert on this, having discovered a thousand ways not to do it over the last two years. Here's my hierarchy showing the stages from raw data to cold, hard cash:

Data

You have a bunch of files containing information you've gathered, way too much for any human to ever read. You know there's a lot of useful stuff in there, but you can talk until you're blue in the face and the people with the checkbooks will keep them closed. The data itself, no matter how unique, is low value, since it will take somebody else a lot of effort to turn it into something they can use to make money. It's like trying to sell raw mining ore on a street corner; the buyer will have to invest so much time and effort processing it that they'd much prefer to buy a more finished version, even if it's a lot more expensive.

Down the road there will definitely be a need for data marketplaces, common platforms where producers and consumers of large information sets can connect, just as there are for other commodities. The big question is how long it will take for the market to mature: to standardize on formats and develop the processing capabilities on the data-consumer side. Companies like InfoChimps are smart to keep their flag planted in that space, since it will be a big segment someday, but they're also moving up the value chain for near-term revenue opportunities.

Charts

You take that massive deluge of data and turn it into some summary tables and simple graphs. You want to give an unbiased overview of the information, so the tables and graphs are quite detailed. This now makes a bit more sense to the potential end users; they can at least understand what it is you have, and start to imagine ways they could use it. The inclusion of all the relevant information still leaves them staring at a space-shuttle control panel, though, and only the most dogged people will invest enough time to understand how to use it.

Reports

You're finally getting a feel for what your customers actually want, and you now process your data into a pretty minimal report. You focus on a few key metrics (e.g. unique site visitors per day, time on site, conversion rate) and present them clearly in tables and graphs. You're now providing answers to the informational questions customers are asking: "Is my website doing what I want it to?", "What areas are most popular?", "What are people saying about my brand on Twitter?". There's good money to be had here, and this is the point many successful data-driven startups are at.
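
To make the 'minimal report' idea concrete, here's a rough sketch of the kind of boiling-down involved; the event format and field names (visitor_id, action, timestamp) are invented for illustration rather than taken from any real system:

```python
from collections import defaultdict

# Sketch of the "reports" stage: reduce raw event logs to a few headline
# metrics. The event dictionaries and field names here are hypothetical.
def summarize_events(events):
    daily_visitors = defaultdict(set)    # date -> distinct visitor ids
    daily_conversions = defaultdict(int) # date -> count of purchases
    for event in events:
        date = event['timestamp'][:10]   # assumes ISO 8601 timestamps
        daily_visitors[date].add(event['visitor_id'])
        if event.get('action') == 'purchase':
            daily_conversions[date] += 1
    report = []
    for date in sorted(daily_visitors):
        uniques = len(daily_visitors[date])
        conversions = daily_conversions[date]
        report.append({
            'date': date,
            'unique_visitors': uniques,
            'conversion_rate': conversions / float(uniques) if uniques else 0.0,
        })
    return report
```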

The biggest trouble is that it can be very hard to defend this position. Unless you have exclusive access to a data source, the barriers to entry are low and you'll be competing against a lot of other teams. If all you're doing is presenting information, that's pretty easy to copy, which has caused a race to the bottom on prices in spaces like 'social listening platforms'/'brand monitoring' and website analytics.

Recommendations

Now you know your customers really well, and you truly understand what they need. You're able to take the raw data and magically turn it into recommendations for actions they should take. You tell them which keywords they should spend more AdWords money on. You point out the bloggers and Twitter users they should be wooing to gain the PR they're after. You're offering them direct ways to meet their business goals, which is incredibly valuable. This is the Nirvana of data startups: you've turned into an essential business tool that your customers know is helping them make money, so they're willing to pay a lot. To get here you also have to have absorbed a tremendous amount of non-obvious detail about the customer's requirements, which is a big barrier to anyone copying you. Without the same level of background knowledge they'll deliver something that fails to meet the customer's need, even if it looks the same on the surface.

This is why Radian6 has flourished and been able to afford to buy out struggling 'social listening platforms' for a song. They know their customers and give them recommendations, not mere information. If this sounds like a consultancy approach, it's definitely approaching that, though hopefully with enough automation that finding skilled employees isn't your bottleneck.

Of course the line between the last two stages is not clear-cut (Radian6 is still very dashboard-centric, for example), and it does all sound a bit like the horrible use of 'solution' as a buzzword for tools back in the '90s, but I still find this hierarchy very helpful when I'm thinking about how to move forward. More actionable means more valuable!

Is ingestion the Achilles Heel of Big Data?

Photo by Jon Appleyard

Drew Bruenig asked me a very worthwhile question via email:

"Outside of a handful of few predictable cases (website analytics, social exchange, finance) big data piles are each incredibly unique. In the smaller data sets of consumer feedback (that are still much larger than our typical sets) it’s more efficient for me to craft an ever expanding library of scripts to deal with each set. I have yet to have a set that doesn’t require writing a new routine (save for exact reruns of surveys).

So the question is: can big data ever become big business, or are the variables too varied to allow a scalable industry?"

This gets to the heart of the biggest practical problem with Big Data right now. Processing the data keeps getting easier and cheaper, but the job of transforming your source material into a usable form remains as hard as it's ever been. As Hilary Mason put it, are we stuck using grep and awk?

A lot of the hype around Big Data assumes that it will be a growth industry as ordinary folks learn to analyze these massive data sets, but if the barrier is the need to craft custom input transformations for each new situation, it will always be a bespoke process, a cottage industry populated solely by geeks hand-rolling scripts.

Part of the hope is that new tools, techniques and standards will emerge that remove some of the need for that sort of boilerplate code. Activity Streams (activitystrea.ms/) is a good example of that in the social-network space; maybe if there were more consistent ways of specifying the data in other domains, we wouldn't need as many custom scripts? That's an open question, though; even the Activity Streams standard hasn't removed the need to ingest all the custom data formats from Twitter, etc.

Another big hope is that we'll do a better job of generalizing about the sort of data transformations we commonly need to do, and so build tools and libraries that let us specify the operations in a much more high-level way. I know there's a lot of repetition in my input handling scripts, and I'm searching for the right abstraction to use to simplify the process of creating them.
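
To make that concrete, here's the kind of abstraction I'm imagining: a small declarative mapping that describes how a new source's fields line up with a canonical schema, so adding a data set means writing a spec rather than another script. This is only a sketch; the schema, field names and sources are all invented for illustration:

```python
# Sketch of a higher-level ingestion abstraction: describe each source as a
# declarative mapping onto a common schema instead of writing a new script.
def normalize(record, mapping):
    """Apply a {canonical_field: (source_field, converter)} mapping to one record."""
    out = {}
    for canonical, (source_field, convert) in mapping.items():
        raw = record.get(source_field)
        out[canonical] = convert(raw) if raw is not None else None
    return out

# Two hypothetical sources describing the same kind of event differently.
TWITTER_LIKE = {
    'author':    ('screen_name', str),
    'text':      ('status', str),
    'timestamp': ('created_at', str),
}
BLOG_COMMENTS = {
    'author':    ('commenter', str),
    'text':      ('body', str),
    'timestamp': ('posted', str),
}

raw = {'screen_name': 'petewarden', 'status': 'Ingestion is hard', 'created_at': '2010-09-01'}
print(normalize(raw, TWITTER_LIKE))
# {'author': 'petewarden', 'text': 'Ingestion is hard', 'timestamp': '2010-09-01'}
```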

I also think we should be learning from the folks who have been dealing with Big Data for decades: enterprise database engineers. There's a cornucopia of tools for the Extract, Transform, Load stage of database processing, including some nifty open-source visual toolkits like Talend. Maybe these don't do exactly what we need, but there has to be a lot of accumulated wisdom we can build on. The commercial world does tend to be a blind spot for folks like me from a more academic/research background, so I'll be making an effort to learn more from their existing practices. On the other hand, the fact that ETL is still a specialized discipline in its own right is a sign that ingestion remains an unsolved problem even after decades of investment, so maybe our hopes shouldn't get too high!

Five short links

Photos by the_moment

Optimizing conversion rates with qualitative tests – First in a series, this post does a great job of walking through the steps you can take to figure out simple ways to improve your site. It alerted me to some services I wasn’t aware of, like feedbackarmy.com and fivesecondtest.com – via Healy Jones

Orange – An interesting node-based graphical environment for building data-mining pipelines – via Dániel Molnár

Lies, Damned Lies and Medical Science – A compelling portrait of a ‘meta-researcher’ who has made a career out of proving how bogus most medical research is. Everyone involved in data analysis should read this; as a culture we have an irrational respect for charts and tables, when in fact they’re just useful ways of telling stories. Just like normal prose those stories are only as good as the evidence behind them, and should be treated just as sceptically. via Alexis Madrigal

Scrapy – Solid, simple and mature, so far this framework for building web crawlers in Python looks very useful and I’ll be using it on some upcoming projects. I’m still not convinced that XPath is flexible enough for the sort of content extraction I need to do, but I’ll see how far I can get with it and if alternative methods are easy to bolt on. via Alex Dong

The ignorance of what is possible – Growing up, my highest ambition was to work in an office, since that meant you could sit down in a private space; everybody I knew had jobs that involved standing up and dealing with customers. Reading this article reminded me of how limited my horizons were when I was young; it was only when I moved to the US that I realized how much more was possible. There must be so much potential wasted because kids don’t see how wide the world can be, and limit their ambitions without even knowing what they’re losing.

What rules should govern robots?

Image by Andre D'Macedo

Shion Deysarker of 80legs recently laid out his thoughts on the rules that should govern web crawling. The status quo is a free-for-all, with robots.txt providing only bare-bones guidelines for crawlers to follow. Traditionally this hasn't mattered, because only a few corporations could afford the infrastructure required to crawl on a large scale. These big players have reputations and business relationships to lose, so any conflicts not covered by the minimalist system of rules can be amicably resolved through gentlemen's agreements.

These days any punk with a thousand bucks can build a crawler capable of scanning hundreds of millions of pages. Startups like mine have no cosy business relationships to restrain them, so when we're doing something entirely new we're left scratching our heads about how it fits into the old rules. There are several popular approaches:

Scofflaws

None of the major sites I've looked at and talked to have any defense against, or even monitoring of, crawlers, so as long as you stay below denial-of-service levels they'll probably never even notice your crawling. They rely on the legal force of robots.txt to squash you like a bug if you publicize your work, but there's clearly a black market developing where shady marketers will happily buy data, no questions asked, much like the trade in email lists for spammers.

An extension of this approach is crawling while logged in to a site, getting access to non-public information. This WSJ article is a great illustration of how damaging that can be, and Shion is right to single it out as unacceptable. I'd actually go further and say that any new rules should build on and emphasize the authority of robots.txt. It has accumulated a strong set of legal precedents to give it force, and it's an interface webmasters understand.
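
As a practical footnote, honoring robots.txt takes very little code; Python even ships with a parser for it in the standard library. Here's a minimal sketch of the check a well-behaved crawler should make before fetching a page (the user-agent string is just a placeholder, and real code would cache the parsed file per site):

```python
# Minimal robots.txt check using Python's standard library.
# 'my-crawler' is a placeholder user agent; a real crawler would cache the
# parsed robots.txt per site instead of re-fetching it for every URL.
from urllib import robotparser
from urllib.parse import urlsplit

def allowed_to_fetch(url, user_agent='my-crawler'):
    parts = urlsplit(url)
    robots_url = '%s://%s/robots.txt' % (parts.scheme, parts.netloc)
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses the site's robots.txt
    return parser.can_fetch(user_agent, url)

if allowed_to_fetch('http://example.com/some/page.html'):
    print('robots.txt allows this fetch')
```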

Everything not forbidden is permitted

If your gathering obeys robots.txt, then the resulting data is yours to do with as you see fit. You can analyze it to reveal information that the sources thought they'd concealed, publish derivative works, or even publish the underlying data itself if it isn't copyrightable. This was my naive understanding of the landscape when I first began crawling, since it makes perfect logical sense. What's missing is the fact that all of the actions I list above, while morally defensible, really piss website owners off. That matters because the guys with the interesting data also have lots of money and lawyers, and whatever the legal merits of the situation, they can tie you up in knots for longer than you can keep paying your lawyer.

Hands off my data!

To the folks running large sites, robots.txt is there to control what shows up in Google. The idea that it's opening up their data to all comers would strike them as bizarre. They let Google crawl them so they'll get search traffic; why would they want random companies copying the information they've worked so hard to accumulate?

It's this sense of ownership that's the biggest obstacle to the growth of independent crawler startups. Shion mentions the server and bandwidth costs, but since most crawlers only pull the HTML without any images or other large files, these are negligible. What really freaks site owners out is the loss of control.

Over the next few years, 'wildcatter' crawlers like mine will become far more common. As site owners become more aware of us, they'll be looking for ways to control how their data is used. Unless we think of a better alternative, they'll do what Facebook did and switch to a whitelist containing a handful of the big search engines, since they're the only significant drivers of traffic. This would be a tragedy for innovation, since it would block startups off from massive areas of the Internet and give the existing players in search a huge structural advantage.

To prevent this, we need to figure out a simple way of giving more control to sites that won't block innovative startups. Robots.txt needs to communicate the owner's intent more clearly, with new directives similar to 'no-archive' that lay out acceptable usage in much more detail. I'm not the only one to realize this, and I'm hopeful we'll have a more detailed proposal ironed out soon.

At the same time, sites need to take stock of what information they are exposing to the outside world, since the 'scofflaw' crawlers will continue happily ignoring robots.txt. Any security audit should include a breakdown of exactly what they're handing over to scofflaw crawlers – I bet they'd be unpleasantly surprised!

A Wisconsin century

We’re visiting Liz’s mom in her hometown of Hayward, WI this week and thought we’d take a day off from work and try to bike over a hundred miles! We’ve both been doing a lot more road-riding this summer in Boulder and with the breathing boost from going to near sea-level we thought this would be a great chance to manage our first ‘century’. It turned out to still be quite a workout even though the terrain looks flat when you’re driving, with the turn-around of the out-and-back almost 1,000 feet lower than the start and lots of hills in between. We made it though, riding 110 miles all the way to Lake Superior and back.

It wasn’t exactly a route custom-built for biking; it was all along the edge of highways, but they weren’t too busy and in most places there was a wide, clear shoulder. The drivers that passed us were courteous and we felt pretty safe, especially compared to biking in Los Angeles. Highlights included the tiny store advertising “Minnows, Movie Rentals, Tanning and Turkey Registration” in the misleadingly named Grand View, and a John Deere dealership. I have to confess I was more excited by the rival Kubota dealership down the road, thanks to fond memories of the Kubotas we drive on Santa Cruz Island when we’re helping out the rangers.

Anyway, I have no idea if anyone else will ever want to ride this route, but the map’s up above and we had a wonderful, exhausting time. Who knows, maybe we can give Hayward a biking rival to the Birkie?

How to fetch URLs in parallel using Python

Photo by FunkyMonkey

Here's my class to run web requests in parallel in Python

One of the most common patterns in my work is fetching large numbers of web resources, whether it's crawling sites directly or through REST APIs. Often the majority of the execution time is spent waiting for the HTTP request to make the round trip across the internet and return with the results. The obvious way to speed things up is to change from a synchronous 'call-wait-process' model where each call has to wait for the previous one to finish, to an asynchronous one where multiple calls can be in flight at once.

Unfortunately that's hard to do in most scripting languages, despite being a common idiom in JavaScript thanks to Ajax. Threads are too heavyweight, both in resources and in programming complexity, since we don't actually need any user-side code to run in parallel, just the waits on the network. In most languages the raw functions you need to build this are available through libcurl, but its curl_multi interface is nightmarishly obscure.

In PHP I ended up writing my own ParallelCurl class that provides a simple interface on top of that curl_multi complexity, letting you specify how many fetches to run in parallel and then just feed it URLs and callback functions. Recently, though, I've been moving to Python for longer-lived offline processing jobs, and since I couldn't find an equivalent to ParallelCurl I ported my PHP code over.

This is the result. You'll need to easy_install pycurl to use it, and I'm a Python newbie so I'm sure there's ugliness in the code, but I'm really excited that one of the big barriers to migrating more of my code is now gone. Let me know how you get on with it.
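
For anyone who just wants the shape of the technique without downloading the class, here's a stripped-down sketch of the pattern it implements on top of pycurl's multi interface. To be clear, this isn't the actual ParallelCurl code; the class and method names are invented, and real code needs more error handling:

```python
import pycurl
from io import BytesIO

class ParallelFetcher:
    """Keeps up to max_in_flight HTTP requests active at once using
    pycurl's multi interface, calling a callback as each one finishes."""

    def __init__(self, max_in_flight=10):
        self.max_in_flight = max_in_flight
        self.multi = pycurl.CurlMulti()
        self.in_flight = {}  # maps each Curl handle to (url, callback, buffer)

    def start_request(self, url, callback):
        handle = pycurl.Curl()
        buffer = BytesIO()
        handle.setopt(pycurl.URL, url)
        handle.setopt(pycurl.WRITEFUNCTION, buffer.write)
        handle.setopt(pycurl.FOLLOWLOCATION, 1)
        self.multi.add_handle(handle)
        self.in_flight[handle] = (url, callback, buffer)
        # Throttle so no more than max_in_flight requests run at once
        while len(self.in_flight) >= self.max_in_flight:
            self._pump(block=True)

    def finish_all(self):
        while self.in_flight:
            self._pump(block=True)

    def _pump(self, block=False):
        if block:
            self.multi.select(1.0)  # wait for network activity
        while True:
            ret, active = self.multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        # Hand any completed transfers to their callbacks
        queued, succeeded, failed = self.multi.info_read()
        for handle in succeeded:
            url, callback, buffer = self.in_flight.pop(handle)
            callback(url, handle.getinfo(pycurl.HTTP_CODE), buffer.getvalue())
            self.multi.remove_handle(handle)
            handle.close()
        for handle, errno, errmsg in failed:
            url, callback, buffer = self.in_flight.pop(handle)
            callback(url, None, errmsg)
            self.multi.remove_handle(handle)
            handle.close()

def print_result(url, status, content):
    print(url, status, len(content) if status else content)

fetcher = ParallelFetcher(max_in_flight=5)
for url in ['http://example.com/', 'http://example.org/']:
    fetcher.start_request(url, print_result)
fetcher.finish_all()
```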

Five short links

Photo by Cayusa

Large Scale Social Media Analysis with Hadoop – A great introduction to the power of MapReduce on massive social network data, courtesy of Jake Hofman

Junk Conferences – A guide to spotting bogus conferences and journals, aimed at the scientific world but equally applicable to startup scammers – via Felix Salmon

Honoring Cadavers – Growing up in a family of nurses I heard plenty of tales of doctors' lack of human empathy, so I love the idea of connecting them with the people behind the corpses they start with. I'd like to see every technical profession forced to deal meaningfully with the people at the receiving end of their work. At Apple, most Pro Apps engineers would volunteer to spend a shift demo-ing our software to users at the NAB show every year, and I got an amazing amount of insight and motivation from that experience.

Gary, Indiana's unbroken spirit – It seems like Gary is in even more dire straits than Detroit. I spent five years in Dundee just after the jute mills had closed, and it felt a lot like this. The European approach is to try to retain the residents and bring jobs to them, but during my time in Scotland that seemed to result in a middle class almost entirely employed by the government, and very few private companies.

The world is full of interesting things – A Google labs roundup of over a hundred cool sites on the internet, with Fan Page Analytics tucked away on page 86.