Five short links

Pentagami
Picture by Phillip Chapman-Bell

Truthy – A research project that tracks how memes spread on the internet. Could be a big help in understanding how we can combat bogus ideas on the web.

A rough but intriguing method for assessing a school principal – I love rules of thumb, especially subtle ones that are hard to game and tied to something you care about; they're so useful in any data work. PageRank is at its heart a statistical hack like most rules of thumb, and is the most lucrative algorithm in history. Bob Sutton unearths an offline example, and I can believe that how many pupils' names a principal knows has a strong relationship to their diligence and involvement in the job.

Liquimap Demo – Steve Souza has been doing some thought-provoking work on new ways of presenting data as raster images, essentially hacking our vision systems to help us spot patterns in chaotic information.

Data Without Borders – I'm a believer in the power of data analysis to provide new insights on old problems, especially in the non-profit world, and Jake Porway's new effort seems like a great way for us to use our powers for good.

HBase vs Cassandra – I'm deciding which of the two to use for a new project, and this article has been a big help. Judging by other startups, Cassandra seems to have won, but since I'll be doing a lot of Hadoop processing, I'm trying to figure out whether Brisk on Cassandra would work, or if HBase still has advantages there.

Green Tea Kit Kats

Greenteakitkat

A few weeks ago I had a lovely gift from Ken Cukier, who writes for the Economist out of Tokyo, and was responsible for the Deluge of Data special report last year. Who knew Nestlé had the mad food science skills to create a green tea Kit Kat? I actually hesitated about eating it; I wanted to keep it as a conversation piece, but in the end I couldn't resist. It was unusual but delicious, a bit like white chocolate but with a distinct tea flavor. I'm posting this both as a thank-you to Ken, and in the hope it will encourage other friends to ply me with exotic candy from around the world; it's a trend I like!

Am I wrong about queues being Satan’s little helpers?

Fluffysatan
Photo by Lori La Tortuga

I either did a bad job explaining myself, or my last post was wrong, judging by the reaction from a Twitter engineer, and other comments by email. The point I was trying to get across was that queue-based systems look temptingly simple on the surface but require a lot of work to get right. Is it possible to build a robust pipeline based on queues? Yes. Can your team do it? Even if they can, is it worth the time compared to an off-the-shelf batch solution?

I've seen enough cases to believe that there's a queue anti-pattern for building data processing pipelines. Building streaming pipelines is still a research problem, with promising projects like S4, Storm, and Google's Caffeine, but they're all very young or proprietary. It's a tempting approach because it's such an obvious next step in data processing, and it's so easy to get started stringing queues together. That's the wrong choice for most of us mere mortals though, as we'll get sucked into dealing with all the unexpected problems I described, instead of adding features to our product.

I'm wary of using queues for data processing for the same reason I'm wary of using threads for parallelizing code. Experts can create wonderful concurrent systems using threads, but I keep shooting myself in the foot when I use them. They just aren't the right abstraction for most problems. In the same way, when you're designing a pipeline and thinking in terms of queues, take a step back and ask yourself whether you could achieve your goals with a mature batch-based system like Hadoop instead.

Queue-based data pipelines are hard, but they look easy at first. That's why I believe they're so dangerous.

Queues are the Devil’s own data structures

Plushsatan
Photo by Paula Izzo

Queues in data-processing pipelines considered harmful

Every time I start talking to a startup that's struggling to process data at scale, they seem to have built a pipeline around queues. First, an item is put into the 'load' queue, maybe something as simple as the ID of a Facebook user. A process sitting somewhere else watches the queue, pulls the item, and passes it to a function that performs the required task, maybe fetching a list of that user's friends. The result is then inserted as a payload into another queue, and the whole process repeats, possibly several levels deep.
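A minimal sketch of that pattern, using Python's standard queue module in a single process; the queue names and the friend-fetching function are made up for illustration:

```python
import queue

# Two stages of a hypothetical pipeline: a 'load' queue of user IDs feeds
# a worker, and the worker's results land in a second queue downstream.
load_queue = queue.Queue()
friends_queue = queue.Queue()

def fetch_friends(user_id):
    # Stand-in for a real API call; returns a fake friend list.
    return [user_id * 10 + i for i in range(3)]

def load_worker():
    # A process sits watching the queue, pulls an item, runs the task,
    # and inserts the result as a payload into the next queue.
    while not load_queue.empty():
        user_id = load_queue.get()
        friends_queue.put((user_id, fetch_friends(user_id)))
        load_queue.task_done()

for uid in (1, 2, 3):
    load_queue.put(uid)
load_worker()
print(friends_queue.qsize())  # 3 payloads waiting for the next stage
```

In a real system each stage would be a separate process watching a shared queue server, but the shape is the same: pull, transform, push, repeat.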

I understand why people are drawn to this pattern; a few years ago it would have been my approach too. It uses tools familiar to most web developers, and it makes sense conceptually. Don't give in though, it's a trap! Here's why:

Fragile

What happens when one of those tasks pulls an item from the queue and then fails with an error? You can do some trickery to mark the item as in-progress, and then 'garbage collect' to retry tasks that are taking too long, but that's a nasty hunk of housekeeping code to get right. You'll also need to figure out a logging strategy so you can debug what's going wrong in the case of more subtle problems. It's easy to create a simple system dealing with a few items using queues, but once things get complex fixing issues becomes a nightmare.
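To make that housekeeping concrete, here's a toy sketch of the in-progress tracking and 'garbage collection' it takes to survive a crashed worker; the timeout value and item IDs are invented:

```python
import time

# Items pulled from the queue are parked in an in-progress table with a
# timestamp, and a reaper re-queues anything held too long (a dead worker).
TIMEOUT = 30.0                # seconds an item may stay in flight

pending = [101, 102, 103]     # the queue itself
in_progress = {}              # item -> time it was pulled

def pull():
    item = pending.pop(0)
    in_progress[item] = time.time()   # mark as in-progress
    return item

def ack(item):
    # Worker finished successfully; forget the item.
    del in_progress[item]

def reap(now):
    # Retry anything that has been in flight longer than TIMEOUT.
    for item, started in list(in_progress.items()):
        if now - started > TIMEOUT:
            del in_progress[item]
            pending.append(item)

ack(pull())                   # 101 completes normally
crashed = pull()              # 102 is pulled, then its worker dies
reap(time.time() + 60)        # later, the reaper puts 102 back
print(pending)                # [103, 102]
```

Even this toy version has to decide on a timeout, handle the double-delivery that happens when a slow worker isn't actually dead, and it still has no logging — that's the nasty hunk of code in miniature.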

Unbalanced

Different stages of your system will run at different speeds, and you'll end up with upstream tasks creating ever-growing backlogs when they feed into slower consumers. You can try to fix this by allocating more workers for the slower tasks, but then another stage will become the bottleneck, and the only real solution is carefully tuning the speed of the original inputs to match the available resources. This is not fun. 
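The arithmetic behind that backlog is unforgiving. A toy model with invented rates shows why a fixed mismatch grows without bound:

```python
# Toy model: a producer emits 100 items/sec into a consumer stage that
# can only handle 60 items/sec. The backlog grows linearly forever --
# the rates here are illustrative, not measured from any real system.
produce_rate = 100   # items per second entering the stage
consume_rate = 60    # items per second the stage can process

def backlog_after(seconds):
    return max(0, (produce_rate - consume_rate) * seconds)

print(backlog_after(3600))  # 144000 unprocessed items after one hour
```

Adding workers just moves the mismatch to whichever stage is now slowest, which is why the only real fix is throttling the original inputs.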

Bottlenecked

You need some way to pass data between the different stages, since they're each a little like remote-procedure calls with arguments that need to be marshalled to pass across the network. Queue items can have data attached, but that mechanism often becomes inefficient as your payload grows beyond a few kilobytes. Instead, information will often be written to a database like Postgres or MySQL, with only the primary key passed in the queue message. Now you're paying the overhead of a database transaction for each item, or alternatively you're using a temporary disk file on a networked file system, and paying for the access cost there. Whether you're passing data in the queue, on a database, or through a file system, it's a costly, heavyweight operation, using a resource that doesn't scale very nicely as your demands grow.
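Here's what that pass-by-reference workaround looks like in miniature; a dict stands in for the database, and the payload shape is invented:

```python
import queue

# The payload lives in a database (a dict standing in for Postgres here)
# and only the primary key travels through the queue -- so every hop in
# the pipeline now costs a database write plus a downstream read.
database = {}      # primary key -> payload row
next_id = 0

def store(payload):
    global next_id
    next_id += 1
    database[next_id] = payload   # one transaction per queue hop
    return next_id

pipe = queue.Queue()
pipe.put(store({"user": 42, "friends": list(range(500))}))  # big payload

# Downstream stage: a read per item just to recover the payload.
key = pipe.get()
payload = database[key]
print(len(payload["friends"]))  # 500
```

Multiply that write/read pair by every item at every stage and the database becomes the shared bottleneck for the whole pipeline.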

Unreproducible

Supposing you realize that you need to alter your algorithm, and run it across all the data you've already gathered. There's no easy way to do that if you aren't capturing all of the inputs to the system in a historical record. What usually happens is that some custom code is hand-rolled to read out of a database and reproduce the inputs (Facebook IDs or whatever) and then feed them into the pipeline again, but the feeding code is hard to get right and almost certainly won't produce the same results.
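One defence is to journal every raw input as it arrives, so a changed algorithm can be replayed over the exact historical inputs instead of a hand-rolled database export; the journaling scheme here is a made-up sketch:

```python
import json, io

# An append-only journal of raw inputs (a StringIO stands in for a file
# on disk). Nothing is ever rewritten, so a replay sees the same inputs
# in the same order as the original run.
journal = io.StringIO()

def ingest(user_id):
    journal.write(json.dumps({"user_id": user_id}) + "\n")  # record first
    # ...then hand the item to the live pipeline as usual.

for uid in (7, 8, 9):
    ingest(uid)

# Later, a new version of the algorithm replays the journal directly.
journal.seek(0)
replayed = [json.loads(line)["user_id"] for line in journal]
print(replayed)  # [7, 8, 9]
```

This is essentially what a batch system gives you for free: the input files are the historical record.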

Obscure

Every custom pipeline built from low-level building-blocks like queues is a unique little snowflake. Any outside developer who interacts with it has to learn it from scratch; there's no standardization. This makes hiring and training engineers costly and time-consuming. It's great job security for the creators though.

The answer?

The only route to sanity is to go as stateless as possible. Queues on their own are fine, upstanding data structures, but they introduce tightly-coupled stateful dependencies into a pipeline. They encourage you to think in terms of streaming processing, which is a much harder problem to write code for than batch jobs. Hadoop is almost always the right answer, even when you're doing something that feels too simple for a MapReduce algorithm. The paradigm of 'take this folder of input log files, process them, and write out the results to this folder' might not seem that different from a queue, but because the contents of the inputs and outputs are immutable it's a much less coupled and stateful system than a cascade of queues. There are also loads of support tools like Flume for getting data into the system, tons of documentation and plenty of people who know how to use it already. You might end up using queues somewhere in there; they're still a useful tool to keep in your bag, but don't build a data pipeline around them and expect to deal with large-scale data.
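The folder-in, folder-out paradigm fits in a few lines; this word-count sketch uses temp directories and invented log contents, but the key property is real — the inputs are never mutated, and a re-run just rebuilds the output from scratch:

```python
import os, tempfile

indir = tempfile.mkdtemp()    # immutable input folder
outdir = tempfile.mkdtemp()   # results folder, rebuilt on every run

# Two input log files that the job will never modify.
for name, text in [("a.log", "apple\nbanana\n"), ("b.log", "banana\n")]:
    with open(os.path.join(indir, name), "w") as f:
        f.write(text)

# The whole job: read every input file, count words, write one output file.
counts = {}
for name in sorted(os.listdir(indir)):
    with open(os.path.join(indir, name)) as f:
        for line in f:
            word = line.strip()
            counts[word] = counts.get(word, 0) + 1

with open(os.path.join(outdir, "part-00000"), "w") as f:
    for word in sorted(counts):
        f.write(f"{word}\t{counts[word]}\n")

print(counts)  # {'apple': 1, 'banana': 2}
```

Because nothing is mutated in place, retrying after a failure is just deleting the output folder and running again — no in-progress tables, no reapers.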

My ‘Introduction to MapReduce’ video is now available

For a while now I've been visiting companies and doing a 'brown bag' lunch, where I gather a bunch of engineers and database people, and walk them through writing their own simple MapReduce jobs in Python. Ever since I discovered how straightforward the MapReduce approach actually was behind the intimidating jargon, I've been on a mission to spread the word. You can write a useful MapReduce job using just a couple of simple Python scripts, run it from the Unix command line, and then take the same scripts and run them as Hadoop streaming jobs. A few months ago I got together with the O'Reilly team and filmed an extended version of one of those training sessions, which I'm hoping will help my message reach a wider audience.
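To give a flavor of what those sessions cover, here's a word-count pair in the Hadoop Streaming style — not the course's exact scripts, just the same idea: the mapper emits "key<TAB>value" lines, a plain Unix `sort` plays the shuffle phase, and the reducer sums runs of identical keys. Split into mapper.py and reducer.py files, the identical logic runs as `cat input.txt | python mapper.py | sort | python reducer.py`, or unchanged as a Hadoop streaming job:

```python
def map_lines(lines):
    # Mapper: emit one "word\t1" line per word seen.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    # Reducer: input arrives sorted by key, so identical words are
    # adjacent and can be summed in a single streaming pass.
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# The same pipeline in-process, with sorted() standing in for Unix sort:
shuffled = sorted(map_lines(["the quick fox", "the fox"]))
print(list(reduce_lines(shuffled)))  # ['fox\t2', 'quick\t1', 'the\t2']
```

The only contract between the two scripts is lines of tab-separated text on stdin and stdout, which is why the jump from command line to cluster is so small.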

I used to think that MapReduce was an esoteric, academic approach to data processing that was too much trouble to learn. Once I wrapped my head around it, I realized how simple and useful it actually is, so my goal is to help other people over that same hump, and start using it in their daily work. The main link to the course is at:

http://oreilly.com/catalog/0636920020233/

It's $20 for the full two hour video, but check out the free preview to get a flavor before you buy. A big thanks to the students who volunteered their day. It turned out to be a long recording session, thanks to some technical issues in the second half, but they were all wonderfully patient and fantastic collaborators.

Backpacking along Salmon Creek in Big Sur

Bigsur0

Three years ago, I went on a day hike through Salmon Creek in the south of Big Sur, and was amazed by some of the hike-in campgrounds along the trail. They were in gorgeous locations, didn't require reservations and weren't heavily used. Last weekend I finally had a chance to spend a few days backpacking through a couple of them with my friend Richard. He's an ace photographer, so here are a few of the images he captured (here's the full set). I felt so rejuvenated after my time out there, despite the punishing climb through the hills, that I have to make it out there again soon.

Bigsur1

Spruce Camp, a smaller plot about 2.5 miles from the Salmon Creek trailhead. It was very damp on our first night there, so the wood fire was quite feeble, but as you can see from the top photo, the creek running through the site more than made up for it.

Bigsur2
The second night we did a shorter hike to Estrella, a much larger campground by another creek. My dog Thor loves the hiking, but once I set up the tent he's on the sleeping bag like a shot. The only way to lure him out is to slip him inside my jacket; he definitely misses central heating.

Bigsur3
The trail was in good shape until Estrella. After that, some sections were very overgrown, and there were landslides that made the footing very treacherous, especially with a backpack on.

Shutting down Wordlin.gs

Wordlingsshot

My Wordlin.gs site was an experiment to see if a merchandise-supported model would work for creative visualizations. It has had some usage, but I've yet to see a single purchase of a t-shirt, mug or poster, so I'll be shutting it down next week to save the $250-per-month Heroku costs (largely due to the database and image-processing requirements). I'll be contacting users who have put up public images so they can save them. I'm sad it didn't work out; I'd love to be able to focus on pure visualizations thanks to a solid revenue model, but I learned a lot.