I'm really keen to use Amazon's SimpleDB service to store my data, but the upload process is just too damn slow. A naive loader implementation lets me upload about 20 rows a second, and since I've got over 200 million rows, that would take around 6 months! Sid kindly shared his experiences with Netflix's massive data transfer to SimpleDB over at practicalcloudcomputing.com, where he achieved rates of over 10,000 items a second. He's been very generous with advice, but obviously can't share any proprietary code, so I've set out to write an open-source data loader in Java that implements his suggestions.
The code is up at:
http://github.com/petewarden/simpledb_loader
It uploads 10,000 generated rows using these optimizations (sketched in the example below):
– Calling BatchPutAttributes() to upload 20 rows at a time
– Multiple threads to run requests in parallel
– Leaving Replace as false for the overwrite behavior
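To make those steps concrete, here's a minimal sketch of the approach. It isn't the loader's actual code: it assumes the AWS SDK for Java client, a pre-created domain I've called "demo_domain", and placeholder credentials, batch size, and thread count. Each worker submits one BatchPutAttributes call of 20 generated rows, with Replace left as false.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.BatchPutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.ReplaceableItem;

public class BatchUploadSketch {
    private static final String DOMAIN = "demo_domain"; // hypothetical, pre-created domain
    private static final int BATCH_SIZE = 20;           // rows per BatchPutAttributes call
    private static final int THREAD_COUNT = 8;          // parallel request threads (illustrative)
    private static final int TOTAL_ROWS = 10000;        // generated rows to upload

    public static void main(String[] args) throws InterruptedException {
        // Placeholder credentials; a real loader would read these from configuration.
        final AmazonSimpleDB client = new AmazonSimpleDBClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);

        for (int start = 0; start < TOTAL_ROWS; start += BATCH_SIZE) {
            // Build one batch of generated rows.
            final List<ReplaceableItem> batch = new ArrayList<ReplaceableItem>();
            int end = Math.min(start + BATCH_SIZE, TOTAL_ROWS);
            for (int row = start; row < end; row++) {
                List<ReplaceableAttribute> attrs = new ArrayList<ReplaceableAttribute>();
                // Replace stays false, so SimpleDB doesn't have to handle overwrites.
                attrs.add(new ReplaceableAttribute("value", "row_" + row, false));
                batch.add(new ReplaceableItem("item_" + row, attrs));
            }
            // Each batch becomes a single BatchPutAttributes request, submitted to the
            // pool so several requests are in flight at once.
            pool.submit(new Runnable() {
                public void run() {
                    client.batchPutAttributes(new BatchPutAttributesRequest(DOMAIN, batch));
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

The thread count is the main knob to turn, since each request spends most of its time waiting on the network; the batch size of 20 stays under SimpleDB's 25-item cap for a single BatchPutAttributes call.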
Despite that, I'm still only seeing around 140 items a second, which is a long way off Sid's results. I'm going to be doing some more work on this, but I'd love it if anyone from Amazon could jump in and help put together an example that implements all their best practices. Judging from the forums, there are a lot of people stuck on exactly this problem, and it would make porting over existing services a lot easier.