Is ingestion the Achilles Heel of Big Data?

Photo by Jon Appleyard

Drew Bruenig asked me a very worthwhile question via email:

"Outside of a handful of few predictable cases (website analytics, social exchange, finance) big data piles are each incredibly unique. In the smaller data sets of consumer feedback (that are still much larger than our typical sets) it’s more efficient for me to craft an ever expanding library of scripts to deal with each set. I have yet to have a set that doesn’t require writing a new routine (save for exact reruns of surveys).

So the question is: can big data ever become big business, or are the variables too varied to allow a scalable industry"

This gets to the heart of the biggest practical problem with Big Data right now. Processing the data keeps getting easier and cheaper, but the job of transforming your source material into a usable form remains as hard as it's ever been. As Hilary Mason put it, are we stuck using grep and awk?

A lot of the hype around Big Data assumes that it will be a growth industry as ordinary folks learn to analyze these massive data sets, but if the barrier is the need to craft custom input transformations for each new situation, it will always be a bespoke process, a cottage industry populated solely by geeks hand-rolling scripts.

Part of the hope is that new tools, techniques and standards will emerge that remove some of the need for that sort of boilerplate code. Activity Streams (activitystrea.ms/) is a good example of that in the social network space. Maybe if there were more consistent ways of specifying the data in other domains, we wouldn't need as many custom scripts? That's an open question; even the Activity Streams standard hasn't removed the need to ingest all the custom data formats from Twitter, etc.

Another big hope is that we'll do a better job of generalizing about the sort of data transformations we commonly need to do, and so build tools and libraries that let us specify the operations in a much more high-level way. I know there's a lot of repetition in my input handling scripts, and I'm searching for the right abstraction to use to simplify the process of creating them.
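To make the idea of a higher-level specification concrete, here is a minimal sketch in Python, not a finished abstraction. It assumes the input is a CSV export of a survey; the column names, the `SURVEY_SPEC` table, the `ingest` helper and the sample rows are all hypothetical, invented purely for illustration. The point is that each new data set would need only a new spec, while the loop that applies it stays the same.

```python
import csv
import io
from datetime import datetime

# Hypothetical declarative spec: output field -> (source column, converter).
# Only this table should need to change from one data set to the next.
SURVEY_SPEC = {
    "respondent": ("Respondent ID", int),
    "submitted": ("Date", lambda s: datetime.strptime(s, "%m/%d/%Y").date()),
    "rating": ("Overall rating (1-5)", int),
    "comment": ("Comments", str.strip),
}


def ingest(lines, spec):
    """Yield one normalized dict per CSV row, driven entirely by the spec."""
    for row in csv.DictReader(lines):
        yield {
            name: convert(row[column])
            for name, (column, convert) in spec.items()
        }


# Made-up sample input standing in for a real export file.
SAMPLE = """Respondent ID,Date,Overall rating (1-5),Comments
17,03/09/2011, 4 ,  Boots leaked after a week
18,03/10/2011,5,Great
"""

records = list(ingest(io.StringIO(SAMPLE), SURVEY_SPEC))
for record in records:
    print(record)
```

This only covers the easy case of renaming and converting columns. Whether the messier transformations (merging files, fixing inconsistent encodings, handling missing answers) fit the same kind of table is exactly the open question.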

I also think we should be learning from the folks who have been dealing with Big Data for decades: enterprise database engineers. There's a cornucopia of tools for the Extract, Transform, Load (ETL) stage of database processing, including some nifty open-source visual toolkits like Talend. Maybe these don't do exactly what we need, but there has to be a lot of accumulated wisdom we can build on. The commercial world does tend to be a blind spot for folks like me from a more academic/research background, so I'll be making an effort to learn more from their existing practices. On the other hand, the fact that ETL is still a specialized discipline in its own right is a sign that ingestion remains an unsolved problem even after decades of investment, so maybe our hopes shouldn't get too high!
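For readers who haven't met the Extract, Transform, Load pattern, the three stages can be sketched as small composable Python generators. This is only an illustration of the shape of the pattern, not how Talend or any other ETL product works; the function names, the field names and the in-memory list standing in for a database are all assumptions made for the example.

```python
import json


def extract(lines):
    """Extract: parse one JSON object per line, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def transform(records):
    """Transform: drop records with no text and normalize the fields."""
    for rec in records:
        text = (rec.get("text") or "").strip()
        if text:
            yield {"user": str(rec.get("user", "unknown")).lower(), "text": text}


def load(records, store):
    """Load: append to a list that stands in for a real database table."""
    count = 0
    for rec in records:
        store.append(rec)
        count += 1
    return count


# Made-up input lines standing in for a raw feed.
RAW = [
    '{"user": "Alice", "text": " hello "}',
    '',
    '{"user": "Bob", "text": ""}',
    '{"text": "no user"}',
]

store = []
loaded = load(transform(extract(RAW)), store)
print(loaded, store)
```

Keeping the stages separate means the source-specific mess lives in `extract` and `transform`, while `load` can stay the same across data sets.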
