Queues are the Devil’s own data structures

Photo by Paula Izzo

Queues in data-processing pipelines considered harmful

Every time I start talking to a startup that's trying to process data at scale, and struggling, it seems they've built their pipeline around queues. First, an item is put into the 'load' queue, maybe something as simple as the ID of a Facebook user. A process sitting somewhere else watches that queue, pulls the item, and passes it to a function that performs the required task, maybe fetching a list of that user's friends. The result is then inserted as the payload of another queue, and the whole process repeats, possibly several levels deep.
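
To make the shape concrete, here's a minimal sketch of one of those stages, using Redis lists as the queues (the queue names and the fetch_friends stub are just illustrative):

    import json
    import redis  # assuming a Redis-backed queue; any broker has the same shape

    r = redis.Redis()

    def fetch_friends(user_id):
        # Stand-in for the real work, e.g. a call to Facebook's API.
        return {'user': user_id, 'friends': []}

    def friends_stage():
        while True:
            # Block until an item appears on the 'load' queue.
            _, raw = r.brpop('load')
            user_id = json.loads(raw)
            # Do this stage's work...
            result = fetch_friends(user_id)
            # ...then insert the result as the payload for the next queue,
            # where another process repeats the same loop.
            r.lpush('friends', json.dumps(result))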

I understand why people are drawn to this pattern; a few years ago it would have been my approach too. It uses tools that are familiar to most web developers, and it makes sense conceptually. Don't give in though, it's a trap! Here's why:

Fragile

What happens when one of those tasks pulls an item from the queue and then fails with an error? You can do some trickery to mark the item as in-progress, and then 'garbage collect' to retry tasks that are taking too long, but that's a nasty hunk of housekeeping code to get right. You'll also need to figure out a logging strategy so you can debug the more subtle problems. It's easy to build a simple system around queues when you're dealing with a few items, but once things get complex, fixing issues becomes a nightmare.
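
Here's a sketch of what that trickery looks like, again assuming Redis lists; the queue names and the five-minute timeout are arbitrary, and even this toy version needs a worrying amount of bookkeeping:

    import time
    import redis

    r = redis.Redis()
    TIMEOUT = 300  # seconds before an in-flight item is presumed dead

    def pull():
        # Atomically move the item to a 'processing' list, so it isn't
        # lost forever if this worker dies mid-task.
        raw = r.rpoplpush('load', 'processing')
        if raw is not None:
            r.hset('started', raw, time.time())
        return raw

    def ack(raw):
        # The task finished cleanly: clear the in-progress markers.
        r.lrem('processing', 1, raw)
        r.hdel('started', raw)

    def garbage_collect():
        # Re-queue anything that's been in flight for too long.
        for raw in r.lrange('processing', 0, -1):
            started = float(r.hget('started', raw) or 0)
            if time.time() - started > TIMEOUT:
                r.lrem('processing', 1, raw)
                r.hdel('started', raw)
                r.lpush('load', raw)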

Unbalanced

Different stages of your system will run at different speeds, and you'll end up with upstream tasks creating ever-growing backlogs when they feed into slower consumers. You can try to fix this by allocating more workers for the slower tasks, but then another stage will become the bottleneck, and the only real solution is carefully tuning the speed of the original inputs to match the available resources. This is not fun. 

Bottlenecked

You need some way to pass data between the different stages, since each is a little like a remote-procedure call whose arguments need to be marshalled across the network. Queue items can have data attached, but that mechanism often becomes inefficient once your payload grows beyond a few kilobytes. Instead, the information is often written to a database like Postgres or MySQL, with only the primary key passed in the queue message. Now you're paying the overhead of a database transaction for each item; or alternatively you're using a temporary file on a networked file system, and paying the access cost there. Whether you're passing data in the queue, through a database, or via a file system, it's a costly, heavyweight operation, using a resource that doesn't scale gracefully as your demands grow.
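
Here's the shape of that database-backed hand-off, sketched with sqlite and an in-process queue standing in for Postgres/MySQL and a real message broker:

    import json
    import queue
    import sqlite3

    db = sqlite3.connect('payloads.db')
    db.execute('CREATE TABLE IF NOT EXISTS payloads (id INTEGER PRIMARY KEY, body TEXT)')
    q = queue.Queue()

    def enqueue_large(payload):
        # Pay for a database transaction to park the payload...
        cur = db.execute('INSERT INTO payloads (body) VALUES (?)',
                         (json.dumps(payload),))
        db.commit()
        # ...and pass only the primary key through the queue.
        q.put(cur.lastrowid)

    def dequeue_large():
        row_id = q.get()
        (body,) = db.execute('SELECT body FROM payloads WHERE id = ?',
                             (row_id,)).fetchone()
        return json.loads(body)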

Unreproducible

Suppose you realize that you need to alter your algorithm and run it across all the data you've already gathered. There's no easy way to do that unless you're capturing all of the inputs to the system in a historical record. What usually happens is that some custom code gets hand-rolled to read out of a database, reconstruct the inputs (Facebook IDs or whatever), and feed them into the pipeline again, but that feeding code is hard to get right and almost certainly won't reproduce the original results.
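
The cure is cheap: capture every raw input in an append-only log as it arrives, and a re-run becomes a replay. A sketch of the idea (the file name and format are arbitrary):

    import json
    import time

    def record_input(item, log_path='inputs.log'):
        # Append every raw input (a Facebook ID or whatever) to an
        # immutable historical record before it enters the pipeline.
        with open(log_path, 'a') as log:
            log.write(json.dumps({'ts': time.time(), 'item': item}) + '\n')

    def replay(log_path='inputs.log'):
        # Re-running a changed algorithm is now just reading the log back.
        with open(log_path) as log:
            for line in log:
                yield json.loads(line)['item']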

Obscure

Every custom pipeline built from low-level building-blocks like queues is a unique little snowflake. Any outside developer who interacts with it has to learn it from scratch, there's no standardization. This makes hiring and training engineers costly and time-consuming. It's great job security for the creators though. 

The answer?

The only route to sanity is to go as stateless as possible. Queues on their own are fine, upstanding data structures, but they introduce tightly-coupled, stateful dependencies into a pipeline. They encourage you to think in terms of stream processing, which is a much harder problem to write code for than batch jobs. Hadoop is almost always the right answer, even when you're doing something that feels too simple for a MapReduce algorithm. The paradigm of 'take this folder of input log files, process them, and write the results out to this folder' might not seem that different from a queue, but because the contents of the inputs and outputs are immutable, it's a much less coupled and stateful system than a cascade of queues. There are also loads of support tools like Flume for getting data into the system, tons of documentation, and plenty of people who already know how to use it. You might still end up using queues somewhere in there, they're a useful tool to keep in your bag, but don't build a data pipeline around them and expect it to cope with large-scale data.
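
In case that paradigm sounds abstract, the whole of it fits in a few lines. Here's a sketch, with a hypothetical process function standing in for whatever work your job does on each line:

    import os

    def run_batch(input_dir, output_dir, process):
        # 'process' is a stand-in, mapping one input line to one
        # newline-terminated output line.
        os.makedirs(output_dir, exist_ok=True)
        for name in sorted(os.listdir(input_dir)):
            # The inputs are immutable: we only ever read them...
            with open(os.path.join(input_dir, name)) as src, \
                 open(os.path.join(output_dir, name), 'w') as dst:
                # ...and write the results to a separate folder, so
                # re-running the job (or a fixed version of it) is
                # always safe.
                for line in src:
                    dst.write(process(line))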

My ‘Introduction to MapReduce’ video is now available

For a while now I've been visiting companies and doing a 'brown bag' lunch, where I gather a bunch of engineers and database people, and walk them through writing their own simple MapReduce jobs in Python. Ever since I discovered how straightforward the MapReduce approach actually was behind the intimidating jargon, I've been on a mission to spread the word. You can write a useful MapReduce job using just a couple of simple Python scripts, run it from the Unix command line, and then take the same scripts and run them as Hadoop streaming jobs. A few months ago I got together with the O'Reilly team and filmed an extended version of one of those training sessions, which I'm hoping will help my message reach a wider audience.
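
To give a flavor of how little code is involved, here's the classic word-count example (a sketch in the same spirit as the course, not the exact scripts from the video):

    # mapper.py: read raw text on stdin, emit tab-separated key/value pairs.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print('%s\t1' % word)

    # reducer.py: stdin arrives sorted by key, so one pass can sum each word.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip('\n').split('\t')
        if word != current:
            if current is not None:
                print('%s\t%d' % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print('%s\t%d' % (current, count))

You can test the whole job from the Unix command line with 'cat input.txt | python mapper.py | sort | python reducer.py', and then hand the same two scripts to Hadoop streaming as the mapper and reducer to run them at scale.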

I used to think that MapReduce was an esoteric, academic approach to data processing that was too much trouble to learn. Once I wrapped my head around it, I realized how simple and useful it actually is, so my goal is to help other people over that same hump, and start using it in their daily work. The main link to the course is at:

http://oreilly.com/catalog/0636920020233/

It's $20 for the full two-hour video, but check out the free preview to get a flavor before you buy. A big thanks to the students who volunteered their day. It turned out to be a long recording session, due to some technical issues in the second half, but they were all wonderfully patient and fantastic collaborators.

Backpacking along Salmon Creek in Big Sur

Three years ago, I went on a day hike through Salmon Creek in the south of Big Sur, and was amazed by some of the hike-in campgrounds along the trail. They were in gorgeous locations, didn't require reservations, and weren't heavily used. Last weekend I finally had a chance to spend a few days backpacking through a couple of them with my friend Richard. He's an ace photographer, so here are a few of the images he captured (here's the full set). I felt so rejuvenated after my time out there, despite the punishing climb through the hills, that I have to make it back again soon.


Spruce Camp, a smaller plot about 2.5 miles from Salmon Creek trailhead. It was very damp on our first night there, so the wood fire was quite feeble, but as you can see from the top photo the creek running through the site more than made up for it.

The second night we did a shorter hike to Estrella, a much larger campground by another creek. My dog Thor loves the hiking, but once I set up the tent he's on the sleeping bag like a shot. The only way to lure him out is to slip him inside my jacket; he definitely misses central heating.

The trail was in good shape until Estrella. After that, some sections were badly overgrown, and there were landslides that made the footing treacherous, especially with a backpack on.

Shutting down Wordlin.gs


My Wordlin.gs site was an experiment to see if a merchandise-supported model could work for creative visualizations. It has seen some usage, but I've yet to see a single purchase of a t-shirt, mug, or poster, so I'll be shutting it down next week to save the $250-a-month Heroku costs (largely due to the database and image-processing requirements). I'll be contacting users who have put up public images so they can save them. I'm sad it didn't work out, since I'd love to focus on pure visualizations backed by a solid revenue model, but I learned a lot.

Five short links

Photo by Dave M Barb

quantiFind – A commercial company with a similar philosophy to the DSTK. Take in unstructured data, and use statistical approaches to extract something structured.

WorryDream – I love the infographics approach of this site, especially the ‘latitudes I have lived at’.

Goliath – Intriguing lightweight event-based web server, written in Ruby. Sinatra has been a pleasure to use, but despite its maturity, requiring a layer like Passenger to serve parallel web requests still doesn’t sit right with me. I don’t have a well-reasoned argument behind this, so please go easy, but it has felt like too many moving parts.

Graphite – Log arbitrary events to a central server, and get instant graphs of them over time. It's a simple concept, but one I could see being very powerful. One of my favorite profiling tools when I was a game programmer was altering the screen border color as each code module executed. For '90s-era single-threaded games synced to the refresh rate, you'd end up with a stacked colored column along the side of the screen, showing the proportion of the frame devoted to AI, rendering, and so on. The simpler the profiling interface, the more likely people are to actually use it and learn the real characteristics of their systems.

Diffbot – A different take on the unstructured-to-structured approach to data. One API lets you watch a web page and get a stream of changes over time, either as a simple RSS feed or a more detailed XML format. Another does a boilerpipe-like extraction of an article’s text or a home page’s link structure.

Facts are untraceable

Photo by Brian Hefele

As we share more and more data about our lives, there's a lot of discussion about what organizations should be allowed to do with this information. The longer I've spent in this world, the more I think that this might be a pointless debate. Controlling what happens with our data requires punishing people who are caught misusing it. The trouble is, how do you tell where they got it from?

If an organization has your name, friends, and interests, here are just a few of the places that information could have come from:

– Your Facebook account, via the API or hacking.

– Your email inbox, through a browser extension or hacking, gathering a list of the people you correspond with, along with your purchase confirmations.

– Your phone company, analyzing the calls you receive and the URLs you visit on your smartphone.

– Your credit card company. They'd have trouble with the friends, though in theory spotting split checks and simultaneous purchases would be a strong clue.

– Retailers sharing data with each other about their customers.

There are now so many ways of gathering facts about your life that it's usually impossible to tell where a particular set of data came from. You can inject fake Mountweazel values into databases to catch unskilled abusers, but as soon as there are multiple independent sources for any given fact, an abuser can avoid the traps by only keeping values that are present in more than one of them. If I know your postal address, can you prove that I hacked into the DMV, rather than just getting it from a phone book or one of your friends?
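
Here's how trivially that filtering works; the names and numbers below are invented, but the set intersection is the whole trick:

    # Hypothetical example data: one planted 'trap' entry in the phone book.
    phone_book = {('Alice Smith', '555-0100'), ('Lillian Mountweazel', '555-0199')}
    stolen_db  = {('Alice Smith', '555-0100'), ('Bob Jones', '555-0123')}

    # Keep only facts confirmed by a second, independent source. Trap
    # records only exist in the database they were planted in, so this
    # silently drops them, laundering the stolen data.
    laundered = phone_book & stolen_db
    print(laundered)  # {('Alice Smith', '555-0100')} - the trap is gone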

In practice this means that creepy marketing data gathered by underhand means can easily be laundered into openly-sold data sets, since nobody can prove it has murky origins. This has always been theoretically possible, but what has changed is that there are now so many copies of our personal data floating around that it's far easier to gather and harder to trace. From a technical point of view I don't see how we can stop it, as long as we continue to instrument our lives more and more.

I'm actually very excited by the new world of data we're moving into, but I'm worried that we're giving people false assurances about how much control they can keep over their information. On the other hand the offline marketing world has gathered detailed data on all of us for decades without raising much public outrage, so maybe we don't really care?

Should You Talk to Journalists?

Photo by Danger Ranger

I've been helping to arrange some interviews for a reporter, and one of the friends I approached asked, "Is there any benefit to the interviewee?". That's actually a very perceptive question: most people jump at any chance to talk to a journalist, but there are real costs to that decision. Speaking as someone who has both written and been written about for money, I know that a journalist's job is to persuade you to talk to them, whether or not that's actually in your interest. After I thought about it, I told him it really comes down to what your goals are.

Good things that may happen

– Your work might be covered and publicized.

– He may approach you for quotes about related stories in the future.

– He might introduce you to other people in the area.

Bad potential side effects

– You lose valuable time you could spend actually building things.

– He could garble or misquote your points, leading to negative publicity.

– Other publications may decide not to publish stories if you're seen as giving an exclusive to a rival.

What may happen if you don't talk

– A competitor does provide the needed quotes, and gets the publicity.

– The journalist covers you in a negative way. This is very rare, but it's always there as a threat.

Most people radically overestimate the dangers of being misquoted, but also have unrealistic expectations of the power of good publicity. A lot of it boils down to networking and exposure, and how much that benefits you depends on what you're trying to do. If you're focused on research or making technical progress, it's probably a distraction you should ignore. If the startup/fundraising side is higher on your priority list, being able to point to articles can really help in establishing the ever-desired perception of traction.

It's worth thinking about how you'll deal with interview requests before they come up. I've always loved talking to people about what I do, and my frustration at not being able to discuss my work was a big part of why I left Apple, so I've ended up working on projects where my tendency pays off. Your situation may well be different though. Unless you're clear-headed about your goals, you'll end up wasting your time. It's also worth pondering which publications reach an audience you actually care about. It might be that comparatively-obscure industry journals will let you talk to the decision makers in your market a lot more effectively than a mainstream outlet, which should affect which journalists you spend time on.

Five short links

Photo by Doug88888

Stanford’s Wrangler – A promising data-wrangling tool, with a lot of the interactive workflow that I think is crucial.

Open Knowledge Conference – They’ve gathered an astonishing selection of speakers. I’m really hoping I can make it out to Berlin to join them.

The Privacy Challenge in Online Prize Contests – It’s good to see my friend Arvind getting his voice heard in the debate around privacy.

The Profile Engine – A site that indexes Facebook profiles and pages, with their permission.

Acunu – I met up with this team in London, and they’re doing some amazing work at the kernel level to speed up distributed key/value stores, thanks to some innovative data structures.

Kindle Profiles are so close to being wonderful

"Propose to an Englishman any … instrument, however admirable, and you will observe that the whole effort of the English mind is directed to find a difficulty, defect, or an impossibility in it. If you speak to him of a machine for peeling a potato, he will pronounce it impossible; if you peel a potato with it before his eyes, he will declare it useless, because it will not slice a pineapple"

I'd completely forgotten about this deliciously bitter quote from Charles Babbage in The Philosophical Breakfast Club, but thanks to Amazon's Kindle profiles site, I re-discovered it listed in my highlights. I was very excited when I stumbled across this social feature, since I've been looking for an automatic way to share my reading list with friends. I've even experimented with scripts to scrape the reading history from my account, but never got anything complete enough to use. My dream is a simple blog widget showing what I'm reading, without the maintenance involved in updating GoodReads with my status. I'm often reading on a plane or in bed at night, so the only way I'll have something up to date is if it uses information directly from my Kindle. The highlights page seemed like exactly what I was after, a chronological list of notes and the books I'd been reading recently:

[Screenshot of the Kindle 'Your Highlights' page]

Now all I needed to do was figure out how to make that page public. First, I had to go through all 160 books and manually tick two checkboxes next to each of them, one making the book public, and another making my notes on it available. That was a bit of a grind (and something I guess I'll need to repeat for every book as I read it), but worth it if I could easily publish my highlights. After all that though, I realized there was nothing like a 'blog' page for my notes that was visible to anyone else. The closest is this one for my public notes:

https://kindle.amazon.com/profile/Peter-C–Warden/11996/public_notes

It just shows covers for the five books whose state I most recently changed, whether or not they have any notes or highlights, and you have to click through to find the actual notes. The "Your Highlights" section that only I can access is perfect, exactly what I'd like to share with people; its simplicity is beautiful. Short of posting my account name and password here, does anyone have any thoughts on how I could get it out there? Anybody at Amazon I can beg?

Facebook and Twitter logins aren’t enough

Photo by Karen Horton

A couple of months ago I claimed "These days it doesn't make much sense to build a consumer site with its own private account system" and released a Ruby template that showed how to rely on just Facebook and Twitter for logins. It turns out I was wrong! I always knew there would be some markets without enough adoption of those two services, but I thought the tide of history would make those gaps less and less relevant. What I hadn't counted on was kids.

My Wordlings custom word cloud service has seen a lot of interest from teachers who want to use it with their students, but especially amongst pre-teens, there's little chance they're on either Facebook or Twitter. They may not even have an email address to use! Since that's not likely to change, I added a new "Sign in for Kids" option that just requires a name, even skipping a password. It has the disadvantage that once you log out, you can't edit any of your creations, but that seems a small price to pay to make the service more accessible.
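
Wordlings itself is built on the Ruby template I mentioned, but the feature is only a few lines in any web framework. Here's the shape of it as a hypothetical Python/Flask sketch, with the route name and details purely illustrative:

    from flask import Flask, request, session

    app = Flask(__name__)
    app.secret_key = 'replace-with-a-real-secret'  # signs the session cookie

    @app.route('/kids_login', methods=['POST'])
    def kids_login():
        # No email, no password: just a display name stored in the session.
        # Once the cookie is gone, so is access to the account, which is
        # the trade-off described above.
        session['name'] = request.form['name']
        return 'Signed in as %s' % session['name']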