How to speed up data loads to SimpleDB

[Photo: "Speedlimit" by Random Factor]

I'm really keen to use Amazon's SimpleDB service to store my data, but the upload process is just too damn slow. A naive implementation of a loader lets me upload about 20 rows a second, and since I've got over 200 million rows, that would take around 6 months! Sid kindly shared his experiences with Netflix's massive data transfer to SimpleDB over at practicalcloudcomputing.com, where he achieved rates of over 10,000 items a second. He's been very generous with advice, but obviously can't share any proprietary code, so I've set out to write an open-source data loader in Java that implements his suggestions.
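For reference, the naive version is essentially one PutAttributes call per row, something like the sketch below. This is just an illustration assuming the AWS SDK for Java with made-up credentials and a made-up domain name; the actual loader code is in the repository linked below and may use a different client library.

```java
import java.util.Arrays;
import java.util.List;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;

public class NaiveLoader {
    public static void main(String[] args) {
        // Assumed credentials and domain name - substitute your own.
        AmazonSimpleDB sdb = new AmazonSimpleDBClient(
            new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        String domain = "my_domain";

        // One HTTP round-trip per row: this is what caps throughput
        // at roughly 20 rows a second.
        for (int i = 0; i < 10000; i++) {
            List<ReplaceableAttribute> attributes = Arrays.asList(
                new ReplaceableAttribute("value", Integer.toString(i), false));
            sdb.putAttributes(new PutAttributesRequest(domain, "row_" + i, attributes));
        }
    }
}
```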

The code is up at:
http://github.com/petewarden/simpledb_loader

It uploads 10,000 generated rows using these optimizations (sketched in code after the list):
– Calling BatchPutAttributes() to upload 20 rows at a time
– Multiple threads to run requests in parallel
– Leaving Replace as false for the overwrite behavior
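To show how those pieces fit together, here's a minimal sketch of the batched, multi-threaded approach. It assumes the AWS SDK for Java, hard-coded credentials, and a domain that already exists; the real loader in the repository handles the details differently, so treat this as an illustration of the idea rather than the actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.BatchPutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.ReplaceableItem;

public class BatchLoader {
    private static final int TOTAL_ROWS = 10000;
    private static final int BATCH_SIZE = 20;   // BatchPutAttributes accepts up to 25 items per call
    private static final int THREAD_COUNT = 16; // how many requests to keep in flight at once

    public static void main(String[] args) throws InterruptedException {
        // Assumed credentials and domain name - substitute your own.
        final AmazonSimpleDB sdb = new AmazonSimpleDBClient(
            new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        final String domain = "my_domain";

        ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);

        for (int start = 0; start < TOTAL_ROWS; start += BATCH_SIZE) {
            // Build one batch of generated rows, then hand it to the thread pool
            // so multiple BatchPutAttributes requests run in parallel.
            final List<ReplaceableItem> items = new ArrayList<ReplaceableItem>();
            for (int i = start; i < Math.min(start + BATCH_SIZE, TOTAL_ROWS); i++) {
                items.add(new ReplaceableItem("row_" + i, Arrays.asList(
                    // Replace is left false, the default overwrite behavior.
                    new ReplaceableAttribute("value", Integer.toString(i), false))));
            }
            pool.submit(new Runnable() {
                public void run() {
                    sdb.batchPutAttributes(new BatchPutAttributesRequest(domain, items));
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

The batch size stays at 20 because BatchPutAttributes caps each request at 25 items, and the thread count is just a starting point to tune against your own network and CPU.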

Despite that, I'm still only seeing around 140 items a second, which is a long way off Sid's results. I'm going to be doing some more work on this, but I'd love it if anyone from Amazon could jump in and help put together an example that implements all their best practices. Judging from the forums, a lot of people are stuck on exactly this problem, and it would make porting over existing services a lot easier.
