How to upload your CSV data into SimpleDB at 1000 items a second

Weir
Photo by Old Onliner

With help from Sid Anand, Kevin Marshall (buy his book) and David Kavanagh, along with Brett Taylor, Siva Raghupathy and the rest of the SimpleDB team, I've managed to improve my loading performance by an order of magnitude. I've also added in support for loading from arbitrary CSV or JSON files, so you can use the simpledb_loader tool to do fast uploads of your own data too.

If you just want to dive in, grab the source, make sure you've got java, cd into the directory and run

./sdbloader help

to bring up the options and a mini-tutorial. You'll be able to setup a cluster of domains, and then either run a synthetic benchmark, or load data from a file.

The biggest performance improvement came from fixing a problem in my original code that caused my requests to get serialized rather than running in parallel. With that out of the way, I started hitting the throttling that Amazon starts applying if you send too many requests too soon. They're trying to penalize 'bursty' writers, so you need to start off with a comparatively low number of requests per-domain, per-second and ramp to your full rate over a few minutes. After some advice from the SimpleDB team followed by experimentation, I started off at 1 request per-second, and over the course of two minutes I ramp that up to 3 requests per-second, per-domain. Since each request can have 24 items inside it, that works out to a theoretical maximum of 72 items per-second for each domain. You can tune these values yourself by setting -minrps, -maxrps and -ramptime on the command line.

That led to the next change, tweaking the number of domains being used. The SimpleDB team recommended around 20 or 30 as a maximum, I'm guessing because that roughly corresponds to the actual number of machines they're hosted in. I actually see a performance increase with higher numbers than that, my 1000 item/second maximum was achieved with 100 domains. However I think this is likely to be a loophole in their throttling code, so I wouldn't recommend going that far. You can alter the number of domains used with the -domaincount argument, make sure you specify the same number for both setup and your loading.

The final important performance tip is to ensure that you're running from within Amazon's network, by running your data upload from an EC2 server. This makes a massive difference, I get half the speed when I'm running over my broadband connection at home.

To reproduce the speeds I'm seeing, run these commands

 ./sdbloader setup -a <access key> -s <secret key> -d 100

 ./sdbloader loadcsv -a <access key> -s <secret key> -d 100 -f testdata.csv

Those will set up the domains you need, and then try to upload 20,000 items from the test CSV file, each with multiple attributes, and a pretty typical representation of my workload. I see this taking around 19 seconds to complete, or just over 1000 items a second.

I know from Sid's work at NetFlix that this isn't the end of the road, he's getting over 10,000 items/second, but it's starting to become usable for the 210m item data set I need to upload. The main hurdles I'm hitting with the full data set are failed loads, either because of repeated 503 errors that exhaust the retries, or socket timeouts. If you want to dig deeper, the code is all fully available on github with no strings attached, just fork and go, and let me know if you make any improvements!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: