Over the past year I've been doing a lot of work analyzing very large data sets, e.g. hundreds of millions of Twitter messages. I started with MySQL, but I wasn't able to get the performance I needed, so like a lot of other engineers I moved towards key/value databases that offer far fewer features but much more control.
I found it very hard to pick a database; there are so many projects out there it's like a Cambrian explosion. To help me understand how they could meet my needs, I decided to take one of my typical data analysis jobs and turn it into a benchmark I could run against any key/value store. The benchmark takes 10,000 Facebook profiles, looks at the fan mentions and compiles a record of how fan pages are correlated, e.g. there were 406 Michael Jackson fans, and 22 of them were also fans of the Green Bay Packers.
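The core of the job is a co-occurrence count over fan pages. Here's a minimal sketch of the idea in Python (the real harness is PHP; the function name and profile format here are mine, invented for illustration):

```python
from collections import defaultdict
from itertools import combinations

def count_cofans(profiles):
    """Given profiles, each a list of fan-page names, count the total
    fans per page and the co-fans for every pair of pages."""
    fans = defaultdict(int)     # page -> number of fans
    cofans = defaultdict(int)   # (page_a, page_b) -> fans of both
    for pages in profiles:
        unique = sorted(set(pages))
        for page in unique:
            fans[page] += 1
        for a, b in combinations(unique, 2):
            cofans[(a, b)] += 1
    return fans, cofans

profiles = [
    ["Michael Jackson", "Green Bay Packers"],
    ["Michael Jackson"],
    ["Green Bay Packers", "Michael Jackson"],
]
fans, cofans = count_cofans(profiles)
print(fans["Michael Jackson"])                            # 3
print(cofans[("Green Bay Packers", "Michael Jackson")])   # 2
```

The key/value stores come in because each page's running counts get read, updated and written back repeatedly, which is exactly the access pattern the benchmark exercises.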
I'm not claiming this is a definitive benchmark for everyone, but it's an accurate reflection of the work I need to do, with repeated updates of individual rows and large values. I've uploaded all the PHP files so you can try this for yourself. Here are the results on my 2.8GHz MacBook Pro:
Null Store: 5.0s
RAM Store: 21.3s
Redis Domain Socket: 37.3s
Tokyo Domain Socket: 40.8s
Redis TCP Socket: 42.6s
Mongo TCP Socket: 43.9s
Tokyo TCP Socket: 45.3s
The 'Null Store' is a do-nothing interface, there just to measure the overhead of the non-database work in my script. The 'RAM Store' keeps the values in a PHP associative array; since it never has to touch the disk, it's a good control giving an upper limit on performance. Memcache is another step towards the real world because it runs in a separate process, but its lack of disk access also gives it an unrealistic advantage.
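The two baselines can be sketched as interchangeable implementations of the same store interface (a Python sketch of the idea; the real harness is PHP, and these class names are mine):

```python
class NullStore:
    """Does nothing: isolates the non-database overhead of the script."""
    def get(self, key):
        return None
    def set(self, key, value):
        pass

class RamStore:
    """Keeps values in an in-process dict (the PHP version uses an
    associative array): an upper limit, since it never touches disk."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value

store = RamStore()
store.set("fans:michaeljackson", 406)
print(store.get("fans:michaeljackson"))  # 406
```

Because every backend sits behind the same get/set interface, the difference between any timing and the Null Store's 5.0s is attributable to the store itself.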
The 'Redis domain socket' gives the best results of the real database engines. The domain socket part refers to a patch I added to support local file sockets. It's impressively close to the Memcache performance, less than 2 seconds behind. The 'Tokyo domain socket' comes in next, also using file sockets, followed by Redis over TCP sockets on the local machine, and then Tokyo over TCP.
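On the client side, the only difference between the two transports is the socket family and address; the bytes sent over the wire are identical. A Python sketch, assuming a Redis-style server (`/tmp/redis.sock` is a placeholder path, and `resp_command` encodes the modern Redis wire protocol for illustration):

```python
import socket

def connect(host="localhost", port=6379, unix_path=None):
    """Open either a Unix domain socket (which skips the TCP/IP stack)
    or an ordinary TCP socket to the same server."""
    if unix_path is not None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(unix_path)        # e.g. "/tmp/redis.sock"
    else:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((host, port))
    return sock

def resp_command(*args):
    """Encode a command in the Redis multi-bulk wire format; the
    payload is the same over either transport."""
    out = b"*%d\r\n" % len(args)
    for arg in args:
        data = arg.encode()
        out += b"$%d\r\n%s\r\n" % (len(data), data)
    return out

print(resp_command("SET", "fans", "406"))
```

Since the protocol work is unchanged, the gap between the domain-socket and TCP timings is pure transport overhead: loopback TCP still pays for the IP stack on every round trip.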
A long, long way behind is MySQL, at over 10 times the duration of any of the other solutions. This demonstrates pretty clearly why I had to abandon it as an engine for my purposes, despite its rich features and stability. I also tested MongoDB, but I was getting buggy results out of the analysis, so I was unable to get meaningful timings. I've included the file in case anyone's able to tell me where I'm going wrong.
[Updated: thanks to Kristina, who fixed the bug in my PHP code in the comments, I now have a timing of 43.9s for Mongo via TCP, and I've updated the code download. Bearing in mind that I'm doing a lot of unnecessary work for Mongo, like serializing, this is an impressive result, and I should be able to do better if I adapt my code to its features. I'm also trying to find info on their support for domain sockets.]
To try this on your own machine, download the PHP files, go to the directory and run the benchmark from the command line like this:
time php fananalyze.php -f data.txt -s redis -h localhost -p 6379
You can replace redis and the host arguments with the appropriate values for the database you want to test. To make sure the analysis has worked, try running a query, e.g.
php fananalyze.php -q "http://www.facebook.com/michaeljackson" -s redis -h localhost -p 6379
You should see a large JSON dump showing which pages are correlated with Jackson.
I would love to hear feedback on ways I can improve this benchmark, especially since any improvements will also improve my analysis! I'll also be rolling in some RAM caching in a later article, since I've found that the locality of my data access means delaying writes and keeping a fast in-memory cache layer around helps a lot.
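That caching layer can be sketched as a write-back wrapper around any of the stores above (a Python sketch of the idea I'll be writing up, not the final code; the class name and `flush_every` parameter are mine):

```python
class WriteBackCache:
    """Keeps recently used values in memory and delays writes, so hot
    keys never round-trip to the backing store on every update."""
    def __init__(self, backing, flush_every=1000):
        self.backing = backing
        self.cache = {}
        self.dirty = set()
        self.flush_every = flush_every

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.backing.get(key)
        return self.cache[key]

    def set(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)           # buffer the write
        if len(self.dirty) >= self.flush_every:
            self.flush()

    def flush(self):
        for key in self.dirty:
            self.backing.set(key, self.cache[key])
        self.dirty.clear()

class DictStore:
    """Stand-in for a real key/value backend."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value

backing = DictStore()
cache = WriteBackCache(backing, flush_every=2)
cache.set("a", 1)   # buffered, not yet in the backing store
cache.set("b", 2)   # hits flush_every, so both keys get flushed
print(backing.get("a"), backing.get("b"))  # 1 2
```

The trade-off is durability: anything in the dirty set is lost on a crash, which is acceptable for a re-runnable analysis job like this one.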
Here are the versions of everything I tested:
Redis 1.02 (+ my domain socket patch)
Tokyo Cabinet 1.4.9
Tokyo Tyrant 1.1.7 (+ an snprintf bug fix for OS X)
MySQL 5.1.41, InnoDB
Mongo 64 bit v1.1.4 and PHP driver v1.0.1