Boosting MongoDB performance with Unix sockets

As I've been searching for a solution to my big-data analysis problems, I've been very impressed by MongoDB's features, but even more by the astonishing level of support behind it. After I mentioned I was having trouble running Mongo in my benchmark, Kristina from 10gen not only fixed the bug in my code, she then emailed me an optimization (using the built-in _id in my array), and after that 10gen's Mathias Stearn let me know the latest build contained some more optimizations for that path. After burning days dealing with obscure Tokyo problems, I know how much time responsive support can save, so it makes me really want to use Mongo for my work.

The only fly in the ointment was the lack of Unix domain socket support. I'm running my analysis jobs on the same machine as the database, and as you can see from my benchmark results, using a file socket rather than TCP to localhost speeds up my runs significantly on the other stores that support it. I'd already added support to Redis, so I decided to dive into the Mongo codebase and see what I could manage.

Here's a patch implementing domain sockets, including a diff and complete copies of the files I changed. Running the same benchmarks gives me a time of 35.8s, vs 43.9s over TCP, and 28.9s with the RAM cache vs 31.1s on TCP. These figures are only representative of my case (large values, with the client and database on a single machine), but they do demonstrate the overhead of TCP sockets even when you're using localhost. To use it yourself, specify a port number of zero, and put the socket location (eg /tmp/mongo.sock) in place of the host name. I've patched the server, the command-line shell, and the PHP driver to all support file sockets this way.
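As a quick sketch of what that looks like in code, using the connect($hostname, $port) interface from the benchmark scripts further down, with $mongoStore standing in for whichever store wrapper you're using:

<?php

// Hypothetical sketch: $mongoStore is assumed to already hold one of the
// benchmark's KeyValueStore wrappers, talking through the patched PHP driver.

// The usual TCP connection to a local server
$mongoStore->connect('localhost', 27017);

// With the patch: a port of zero, and the socket path in place of the host
$mongoStore->connect('/tmp/mongo.sock', 0);

?>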

I don't know what Mongo's policy is on community contributions; I primarily wrote this patch to scratch my own itch, but I hope something like this will make it into the main branch. Writing the code is the easy bit, of course; the real challenge is testing it across all the platforms and configurations!

How to speed up key/value database operations using a RAM cache

In my previous post I gave raw timings for a typical analysis job on top of various key/value stores. In practice I use another trick to speed up these sorts of processes: caching values in RAM for the duration of a run and then writing them all out to the store in one go at the end. This helps performance because there's some locality in the rows I'm accessing, so it's worth keeping previously-fetched or written data in memory to reduce the amount of disk IO needed. Whether this helps you will depend on your database usage patterns, but I've found it invaluable for my analysis of very large data sets.

The way I do this is by creating a PHP associative array mapping keys to values, populating it as I fetch from the store, and delaying writes until a final storeToDisk() call at the end of the script. This is very inelegant: it means you have to watch PHP's memory usage, since the default 32MB limit is easy to hit, and ensuring that final flush call is made is error-prone. The performance boost is worth it though; here are the figures using the same test as before, but with the cache enabled:

Null: 22.1s

Ram: 23.9s

Redis domain: 27.5s

Memcache: 27.9s

Redis TCP: 29.6s

Tokyo domain: 29.9s

Mongo: 31.1s

Tokyo TCP: 33.6s

MySQL: 182.9s

To run these yourself, download the PHP files and add a -r switch to enable the RAM cache, eg

time php fananalyze.php -f data.txt -s mongo -h localhost -p 27017 -r

They're all significantly faster than the original run with no caching, and Redis using domain sockets is approaching the speed of the run with no store at all, suggesting that the store is not the bottleneck for this test. In practice, most of my runs are with hundreds of thousands of profiles, not 10,000, and the RAM cache becomes even more of a win, though the space it uses expands too! I've included the code for the cache class below:

<?php

// A key value store interface that caches the read and written values in RAM,
// as a PHP associative array, and flushes them to the supplied disk-based
// store when storeToDisk() is called

require_once('keyvaluestore.php');

class RamCacheStore implements KeyValueStore
{
    public $values;
    public $store;
   
    public function __construct($store)
    {
        $this->store = $store;
        $this->values = array();
    }

    public function connect($hostname, $port)
    {
        $this->store->connect($hostname, $port);
    }
   
    // Return the cached copy if we have one, otherwise fetch from the
    // backing store and remember the result for later reads
    public function get($key)
    {
        if (isset($this->values[$key]))
            return $this->values[$key]['value'];

        $result = $this->store->get($key);
        $this->values[$key] = array(
            'value' => $result,
            'dirty' => false,
        );
       
        return $result;
    }

    // Record the new value in RAM only, marking it dirty so storeToDisk()
    // knows it still needs writing out
    public function set($key, &$value)
    {
        $this->values[$key] = array(
            'value' => $value,
            'dirty' => true,
        );
    }
   
    // Write every dirty entry through to the underlying store, then mark
    // it clean so repeated calls don't rewrite it
    public function storeToDisk()
    {
        foreach ($this->values as $key => &$info)
        {
            if ($info['dirty'])
            {
                $this->store->set($key, $info['value']);
                $info['dirty'] = false;
            }
        }
    }
   
}

?>
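And here's a rough usage sketch; $innerStore and the require path are assumptions, standing in for whichever disk-based store class and file layout you're using:

<?php

// Rough usage sketch: $innerStore is assumed to already hold one of the
// disk-based KeyValueStore objects, and 'ramcachestore.php' is a
// hypothetical file name for the class above.

require_once('ramcachestore.php');

$store = new RamCacheStore($innerStore);
$store->connect('localhost', 27017);

$profile = array('name' => 'example');
$store->set('profile:42', $profile); // held in RAM and marked dirty

$fetched = $store->get('profile:42'); // served from the cache, no disk IO

$store->storeToDisk(); // write all the dirty entries out in one go

?>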