How to speed up massive data set analysis by eliminating disk seeks

Photo by Pchweat

Building my application means analyzing billions of pieces of information about hundreds of millions of users. At this scale, not only do traditional relational databases become impractical for my needs (even loading a few tens of millions of rows into a MySQL table and then creating an index can take days), but key-value stores also fail.

Why do they fail? Let's walk through a typical data-flow example for my application. I have an input text file containing new information about a user, so I want to update that user's record in the database. Even with a key-value store that means moving the disk head to the right location to write that new information, since user records are scattered arbitrarily across the drive. That typically takes around 10ms, giving an effective limit of around 100 users per second. Even a million users will take over two hours to process at that rate, with almost all the time spent tapping our toes waiting for the hard drive.
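To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. It assumes exactly one random seek per user update and uses the 10ms figure quoted above; the function and constant names are just illustrative.

```python
# Rough estimate of how long seek-bound updates take.
# Assumption: every update costs exactly one random disk seek.
SEEK_TIME_SECONDS = 0.010  # ~10 ms per seek, the figure quoted above


def seek_bound_hours(num_updates, seek_time=SEEK_TIME_SECONDS):
    """Hours needed when each update waits on one disk seek."""
    return num_updates * seek_time / 3600


updates_per_second = 1 / SEEK_TIME_SECONDS
hours_for_a_million = seek_bound_hours(1_000_000)

print(f"{updates_per_second:.0f} updates per second")
print(f"{hours_for_a_million:.2f} hours per million users")
```

At hundreds of millions of users the same calculation runs into weeks, which is why the seek itself has to go.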

Stores like Mongo and Redis try to work around this by caching as much as they can in RAM, and using delayed writes of large sectors to disk so that updates don't block on disk seeks. This works well until the data set is too large to fit in RAM. Since my access locations are essentially random, the system ends up thrashing as it constantly swaps large chunks in and out of main memory, and we're back to being limited by disk seek speed.

So what's the solution? SSDs don't have the massive seek bottleneck of traditional disks, but I'm still waiting for them to show up as an option on EC2. Instead, I've re-engineered my analysis pipeline to avoid seeks at all costs.

The solution I've built is surprisingly low-tech, based entirely on text files and the Unix sort command-line tool. For the user record example, I run through my source data files and output a text file with one line for each update, beginning each line with the user id, e.g.:

193839: { fanof:['cheese', 'beer'] }

I then run sort on each of these individual files. Because the command is very efficient and each file is only a couple of gigabytes in size, this takes only a few seconds per file. I can then take several hundred of these sorted sub-files and use the -m option on sort to merge them very quickly into a single sorted uber-file. Merging pre-sorted files avoids the thrashing you get when sort has to order a file larger than RAM in one go.
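The sort-then-merge idea can be sketched in pure Python. This is a stand-in for illustration, not the pipeline itself, which uses the Unix sort tool: `sorted` plays the role of the per-file sort, and `heapq.merge` plays the role of `sort -m`. The sample lines and the names `sort_chunk` and `merge_sorted` are hypothetical.

```python
import heapq


def sort_chunk(lines):
    """Sort one chunk that is small enough to fit in memory."""
    return sorted(lines)


def merge_sorted(chunks):
    """Lazily merge already-sorted chunks into one sorted stream."""
    return heapq.merge(*chunks)


# Hypothetical update lines, as if produced from two source files.
chunk_a = ["193839: { location:'Wisconsin' }", "100042: { fanof:['chess'] }"]
chunk_b = ["193839: { fanof:['cheese', 'beer'] }", "100042: { name:'Example User' }"]

uber = list(merge_sorted([sort_chunk(chunk_a), sort_chunk(chunk_b)]))
for line in uber:
    print(line)
```

The merge only ever looks at the current head of each input, so it reads every chunk sequentially from start to finish. `heapq.merge` accepts any iterables, including open file objects, so the same shape works without loading whole files into memory.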

What does this buy me? Within this uber-file, all the information related to a given user id is now in adjacent lines, e.g.:

193839: { fanof:['cheese', 'beer'] }
193839: { fanof:['hockey', 'ice fishing'] }
193839: { location:'Wisconsin' }
193839: { name:'Sven Hurgessoon' }

It's now pretty simple to write a script that runs through the uber-file and outputs complete records containing all of a user's information from multiple source files. No seeking is needed: the script reads the input sequentially, writes each user to a new row or file, and finds all the source data for that user sitting on adjacent lines.
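A minimal sketch of such a script, assuming the key is everything before the first colon and treating the rest of each line as an opaque string (the sample payloads are not strict JSON, so no parsing is attempted). The function names and the second user are hypothetical.

```python
from itertools import groupby


def user_id_of(line):
    """Return the key: everything before the first colon."""
    return line.split(":", 1)[0]


def assemble_records(sorted_lines):
    """Yield (user_id, [payload, ...]) for each run of adjacent lines sharing a key."""
    for user_id, group in groupby(sorted_lines, key=user_id_of):
        payloads = [line.split(":", 1)[1].strip() for line in group]
        yield user_id, payloads


# Lines in the order they would appear in the sorted uber-file.
uber_file = [
    "193839: { fanof:['cheese', 'beer'] }",
    "193839: { fanof:['hockey', 'ice fishing'] }",
    "193839: { location:'Wisconsin' }",
    "193839: { name:'Sven Hurgessoon' }",
    "200001: { name:'Example User' }",  # hypothetical second user
]

records = list(assemble_records(uber_file))
for user_id, payloads in records:
    print(user_id, payloads)
```

`groupby` only groups consecutive items, which is exactly why the input must be sorted first. It holds one user's lines in memory at a time, so the uber-file can be far larger than RAM.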

This same technique can be applied to any attribute you want to index in your source data. You can use the fan page name as the key in the first part of each line instead, which is how I'm assembling the data on each topic.
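As a rough illustration of re-keying, the sketch below emits one line per (fan page, user) pair with the page name first, then sorts so that every user of a page ends up on adjacent lines. The input structure, the second user id, and the name `rekey_by_fan_page` are assumptions made for the example.

```python
def rekey_by_fan_page(user_pages):
    """Emit one 'page: user_id' line per pair, sorted by page name."""
    lines = [
        f"{page}: {user_id}"
        for user_id, pages in user_pages
        for page in pages
    ]
    return sorted(lines)


# Hypothetical input: each user id with the pages that user is a fan of.
user_pages = [
    ("193839", ["cheese", "beer"]),
    ("200001", ["beer"]),
]

by_page = rekey_by_fan_page(user_pages)
for line in by_page:
    print(line)
```

Both "beer" lines now sit next to each other, so the same sequential grouping pass can assemble one record per topic.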

So in summary, I'm using sort to pre-order my data before processing to avoid seeks. I'm sure I'm not the only person to discover this, but it's not something that I've run across before, and it's enabled me to cope with orders-of-magnitude larger data sets than my pipeline could handle before.
