Photo by Paula Izzo
Queues in data-processing pipelines considered harmful
Every time I start talking to a startup that's trying to deal with processing data at scale, and struggling, they all seem to have built a pipeline around queues. First, an item will be put into the 'load' queue, maybe something as simple as the ID of a Facebook user. A process will be sitting somewhere else that is watching the queue, pulls the item, and passes it to a function to perform the required task, maybe fetching a list of those users friends. That result will then be inserted as a payload into another queue, and the whole process is repeated, possibly several levels deep.
I understand why people are drawn to this pattern, a few years ago it would have been my approach too. It uses familiar tools to most web developers, and conceptually makes sense. Don't give in though, it's a trap! Here's why:
What happens when one of those tasks pulls an item from the queue and then fails with an error? You can do some trickery to mark the item as in-progress, and then 'garbage collect' to retry tasks that are taking too long, but that's a nasty hunk of housekeeping code to get right. You'll also need to figure out a logging strategy so you can debug what's going wrong in the case of more subtle problems. It's easy to create a simple system dealing with a few items using queues, but once things get complex fixing issues becomes a nightmare.
Different stages of your system will run at different speeds, and you'll end up with upstream tasks creating ever-growing backlogs when they feed into slower consumers. You can try to fix this by allocating more workers for the slower tasks, but then another stage will become the bottleneck, and the only real solution is carefully tuning the speed of the original inputs to match the available resources. This is not fun.
You need some way to pass data between the different stages, since they're each a little like remote-procedure calls with arguments that need to be marshalled to pass across the network. Queue items can have a data attached, but that mechanism often becomes inefficient as your payload grows beyond a few kilobytes. Instead, information will often be written to a database like PostGres or MySQL, with only the primary key passed in the queue message. Now you're paying the overhead of a database transaction for each item, or alternatively you're using a temporary disk file on a networked file system, and paying for the access cost there. Whether you're passing data in the queue, on a database, or through a file system, it's a costly, heavyweight operation, using a resource that doesn't scale very nicely as your demands grow.
Supposing you realize that you need to alter your algorithm, and run it across all the data you've already gathered. There's no easy way to do that if you aren't capturing all of the inputs to the system in a historical record. What usually happens is that some custom code is hand-rolled to read out of a database and reproduce the inputs (Facebook IDs or whatever) and then feed them into the pipeline again, but the feeding code is hard to get right and almost certainly won't produce the same results.
Every custom pipeline built from low-level building-blocks like queues is a unique little snowflake. Any outside developer who interacts with it has to learn it from scratch, there's no standardization. This makes hiring and training engineers costly and time-consuming. It's great job security for the creators though.
The only route to sanity is to go as stateless as possible. Queues on their own are fine, upstanding data structures, but they introduce tightly-coupled stateful dependencies into a pipeline. They encourage you to think in terms of streaming processing, which is a much harder problem to write code for than batch jobs. Hadoop is almost always the right answer, even when you're doing something that feels too simple for a MapReduce algorithm. The paradigm of 'take this folder of input log files, process them, and write out the results to this folder' might not seem that different from a queue, but because the contents of the inputs and outputs are immutable it's a much less coupled and stateful system than a cascade of queues. There's also loads of support tools like Flume for getting data into the system, tons of documentation and plenty of people who know how to use it already. You might end up using queues somewhere in there, they're still a useful tool to keep in your bag, but don't build a data pipeline around them and expect to deal with large-scale data.