The worst interview question ever

Photo by B Rosen

This one article sums up everything that's wrong with engineering interviews. The author likes to ask potential hires to explain whether you can call delete this within a C++ member function. What's so wrong with that, you ask? It seems like fairly standard practice.

I've conducted a lot of interviews, and been on the other side of a few, and from my own experience and the research I know that poorly structured interviews like this are a terrible mechanism for predicting how well people will perform on the job. Just think about this interview question for a second: how much time in your coding job do you typically spend worrying about this sort of C++ trivia versus debugging, trying to understand legacy code, talking to other engineers, figuring out requirements, explaining your project to managers, etc, etc? The right answer for me is "I've no clue, it looks like a terrible idea generally, but I'd google it if needed."

The reason these sorts of questions keep coming up is the same reason the drunk kept looking under the lamp post for his keys: they're within the comfort zone of technical specialists, even though the answers aren't useful. For a long time, I did the same, even though I was frustrated with the results. Finally I received some official training at Apple, and what they taught me opened my eyes!

You can find a more detailed description here, but the most important part is "Ask about past behavior". It's the best predictor of future performance, and if you ask in the right way it's also very hard for the candidate to exaggerate or lie. You can do something general like "Tell me about your worst project", but something more specific is even better; I'd often use "Tell me about a time you hit a graphics driver bug". Candidates will start off with a superficial overview, but if you follow up with more detailed questions (eg "So, did you handle talking to Nvidia?") you'll start to build a real picture of their role and behavior, and it's almost impossible to fake that level of detail.

If C++ experience is crucial, then a much better question would be "Tell me about a time you had to debug a template issue" or "Tell me about a project you implemented using reference-counted objects". Anybody who's read enough C++ books can answer the original question, but these versions will tell you who's actually spent time in the trenches.

Easier command-line arguments in PHP

Photo by Between a Rock

One of my pet peeves is that no language I've used handles command-line arguments well. Everyone falls back to C's original argv indexed array of space-separated strings, even though there are decades-old conventions about the syntax of named arguments. There are some strong third-party libraries that make it easier, and the arcane getopt(), but nothing has emerged as a standard. Since I'm writing a lot more PHP shell scripts these days, I decided to write a PHP CLI parser that met my requirements:

Specify the arguments once. Duplication of information is ugly and error-prone, so I wanted to describe the arguments in just one place.

Automated help. The usage description should be generated from the same specification that the parser uses so it stays up-to-date.

Syntax checking. I want to be able to say which arguments are required, optional or switches, and have the parser enforce that, and catch any unexpected arguments too.

Unnamed arguments. Commands like cat take a list of files with no argument name; I wanted those to be easily accessible.

Optional defaults. It makes life a lot easier if you don't have to check to see if an argument was specified in the main script, so I wanted to ensure you could set defaults for missing optional arguments.

Human-readable specification. getopt() is close to what I need, but as well as not generating a usage description, the syntax for describing the long and short arguments is a horrible mess of a string. I want the argument specification to make sense to anyone reading the code.

Here's the result, cliargs.php. To use it, specify your arguments in the form:

array(
    '<long name of argument>' => array(
        'short' => '<single letter version of argument>',
        'type' => <'switch' | 'optional' | 'required'>,
        'description' => '<help text for the argument>',
        'default' => '<value if this is an optional argument and it isn't specified>',
    ),
    …
);
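For instance, a script that takes an input file, an optional output path and a verbose switch might describe its arguments like this (the names and defaults here are just illustrative; see the example script in the package for the actual parsing call):

$cliargs = array(
    'input' => array(
        'short' => 'i',
        'type' => 'required',
        'description' => 'the file to read data from',
    ),
    'output' => array(
        'short' => 'o',
        'type' => 'optional',
        'description' => 'where to write the results',
        'default' => 'php://stdout',
    ),
    'verbose' => array(
        'short' => 'v',
        'type' => 'switch',
        'description' => 'print extra progress information',
    ),
);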

There's an example script in the package, and documentation in the readme.txt. The code is freely reusable with no restrictions; I'm just dreaming of a world where no one ever writes another CLI argument parser ever again.

Three lessons I learnt from porting Diablo

Photo by Vizzzual

It was 1997, I'd just finished college, was really excited about getting my first job in the game industry, and I was a complete idiot. Luckily life was there to hand me a few lessons.

I'd always worked at name-badge jobs paying hourly rates, so when I was offered a whole 10,000 pounds a year, I thought it sounded amazing. It came out to around 550 pounds take-home pay a month; my rent was 400 pounds, which left me and my unemployed wife 150 pounds a month for food, transport and bills. The first lesson I learnt was to crunch the numbers on any deal, and not be distracted by a big headline figure.

The project, for Climax Inc ("Hi, I'm at Climax", not the best name), was to port Blizzard's hit game Diablo from the PC to the Playstation 1. I'd spent years obsessively coding in my bedroom, but this was the first time I'd done any professional work, so I was very definitely a junior Junior Programmer. I kept hitting frustrating problems just using the basic tools I needed for development (I'd never even touched a debugger before) and my code was so buggy I could barely get it to run. I was painfully shy, didn't know anyone else in the company, and they all seemed too busy to help. The only person who made time to help me dig myself out of my incompetence was the bloke sitting behind me, Gary Liddon. Over the course of a couple of weeks he was incredibly patient about hand-holding me through the basics of building and debugging. It was only after the team started getting organized that someone introduced Gary as the project lead, in charge of 20 programmers and with a decades-long career in games behind him.

The second lesson I learnt was that I wanted to work with people like Gary, willing to help the whole team, rather than hunting for individual glory. I've since worked with a lot of 'rock star' programmers, and while they always look good to management, they hate sharing information or credit and end up hampering projects no matter how smart they are as individuals. Gary used his massive brain to help make us all more effective instead, and I've always tried to live up to his example.

The code itself was a mess. There were hundreds of pieces of x86 assembler scattered throughout the code base, which was a problem since we were porting to the Playstation's MIPS processor. Usually just a couple of instructions long, and in the middle of functions, these snippets were pretty puzzling. Finally one of the team figured it out; somebody had struggled with C's signed/unsigned casting rules, and so they'd fallen back on the assembler instructions they understood! The whole team had a good laugh at that, and were feeling pretty superior about it all, until Gary quietly pointed out that the programmers responsible were busy swimming in royalties like Scrooge McDuck while we were porting their game for peanuts.

The third lesson I learnt was that you don't need great code to make a great product. I take pride in my work, but there's no shame in doing what it takes to get something shipped. I've seen plenty of projects die a lingering death thanks to creeping elegance!

After 6 months of spiralling into debt I finally managed to get another job, only 2,000 pounds more in salary but in a much cheaper part of the country. Not much of my code made it into the final game, and it was a pretty miserable time of my life to be honest, but sometimes the worst projects are the best teachers.

Boosting MongoDB performance with Unix sockets

Photo by Stitch

As I've been searching for a solution to my big-data analysis problems, I've been very impressed by MongoDB's features, but even more by their astonishing level of support. After I mentioned I was having trouble running Mongo in my benchmark, Kristina from 10gen not only fixed the bug in my code, she then emailed me an optimization (using the built-in _id in my array), and after that 10gen's Mathias Stearn let me know the latest build contained some more optimizations for that path. After burning days dealing with obscure Tokyo problems I know how much time responsive support can save, so it makes me really want to use Mongo for my work.

The only fly in the ointment was the lack of Unix domain socket support. I'm running my analysis jobs on the same machine as the database, and as you can see from my benchmark results, using a file socket rather than TCP on localhost speeds up my runs significantly on the other stores that support it. I'd already added support to Redis, so I decided to dive into the Mongo codebase and see what I could manage.

Here's a patch implementing domain sockets, including a diff and complete copies of the files I changed. Running the same benchmarks gives me a time of 35.8s, vs 43.9s over TCP, and 28.9s with the RAM cache vs 31.1s on TCP. These figures are only representative for my case, large values on a single machine, but generally they demonstrate the overhead of TCP sockets even if you're using localhost. To use it yourself, specify a port number of zero, and put the socket location (eg /tmp/mongo.sock) instead of the host name. I've patched the server, the command-line shell, and the PHP driver to all support file sockets this way.
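As an example, the benchmark from my key/value store posts could be pointed at the socket like this (assuming the patched server and PHP driver are both in place; the flags match the fananalyze.php script from those posts):

time php fananalyze.php -f data.txt -s mongo -h /tmp/mongo.sock -p 0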

I don't know what Mongo's policy is on community contributions, and I primarily wrote this patch to scratch my own itch, but I hope something like this will make it into the main branch. Writing the code is the easy bit, of course; the real challenge is testing it across all the platforms and configurations!

How to speed up key/value database operations using a RAM cache

Photo by Olduser

In my previous post I gave raw timings for a typical analysis job on top of various key/value stores. In practice I use another trick to speed up these sorts of processes: caching values in RAM for the duration of a run and then writing them all out to the store in one go at the end. This helps performance because there's some locality in the rows I'm accessing, so it's worth keeping previously-fetched or written data in memory to reduce the amount of disk IO needed. Whether this helps you will depend on your database usage patterns, but I've found it invaluable for my analysis of very large data sets.

The way I do this is by creating a PHP associative array mapping keys to values, populating it as I fetch from the store, and delaying writes until a final storeToDisk() call at the end of the script. This is inelegant: it means you have to watch PHP's memory usage, since the default 32MB max is easy to hit, and ensuring that final flush call is made is error-prone. The performance boost is worth it though; here are the figures using the same test as before, but with the cache enabled:

Null: 22.1s

Ram: 23.9s

Redis domain: 27.5s

Memcache: 27.9s

Redis TCP: 29.6s

Tokyo domain: 29.9s

Mongo: 31.1s

Tokyo TCP: 33.6s

MySQL: 182.9s

To run these yourself, download the PHP files and add a -r switch to enable the RAM cache, eg

time php fananalyze.php -f data.txt -s mongo -h localhost -p 27017 -r

They're all significantly faster than the original run with no caching, and Redis using domain sockets is approaching the speed with no store at all, suggesting that the store is no longer the bottleneck for this test. In practice, most of my runs are with hundreds of thousands of profiles, not 10,000, and the RAM cache becomes even more of a win, though the space used expands too! I've included the code for the cache class below:

<?php

// A key value store interface that caches the read and written values in RAM,
// as a PHP associative array, and flushes them to the supplied disk-based
// store when storeToDisk() is called

require_once('keyvaluestore.php');

class RamCacheStore implements KeyValueStore
{
    public $values;
    public $store;
   
    public function __construct($store)
    {
        $this->store = $store;
        $this->values = array();
    }

    public function connect($hostname, $port)
    {
        $this->store->connect($hostname, $port);
    }
   
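    // Read-through: return a cached copy if we have one, otherwise fetch from
    // the underlying store and remember the result for future reads.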
    public function get($key)
    {
        if (isset($this->values[$key]))
            return $this->values[$key]['value'];

        $result = $this->store->get($key);
        $this->values[$key] = array(
            'value' => $result,
            'dirty' => false,
        );
       
        return $result;
    }

    public function set($key, &$value)
    {
        $this->values[$key] = array(
            'value' => $value,
            'dirty' => true,
        );
    }
   
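    // Write-back: push every value that's been modified since the last flush
    // out to the underlying store, then mark it clean.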
    public function storeToDisk()
    {
        foreach ($this->values as $key => &$info)
        {
            if ($info['dirty'])
            {
                $this->store->set($key, $info['value']);
                $info['dirty'] = false;
            }
        }
    }
   
}

?>

Real-world benchmarking of key/value stores

Photo by Stopnlook

Over the past year I've been doing a lot of work analyzing very large data sets, eg hundreds of millions of Twitter messages. I started with MySQL, but wasn't able to get the performance I needed, so like a lot of other engineers I moved towards key/value databases that offer far fewer features but much more control.

I found it very hard to pick a database; there are so many projects out there it's like a Cambrian explosion. To help me understand how they could meet my needs, I decided to take one of my typical data analysis jobs and turn it into a benchmark I could run against any key/value store. The benchmark takes 10,000 Facebook profiles, looks at the fan mentions and compiles a record of how fan pages are correlated, eg there were 406 Michael Jackson fans, and 22 of them were also fans of the Green Bay Packers.
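The core of the job is just co-occurrence counting; a rough sketch of the idea (not the actual fananalyze.php code, and using made-up variable names) looks something like this:

// For every profile, bump a counter for each pair of pages it's a fan of,
// and keep a total count per page so the overlaps can be put in context.
$pageTotals = array();
$overlaps = array();
foreach ($profiles as $profile) {
    $pages = $profile['fan_pages'];
    foreach ($pages as $pageA) {
        if (!isset($pageTotals[$pageA])) $pageTotals[$pageA] = 0;
        $pageTotals[$pageA] += 1;
        foreach ($pages as $pageB) {
            if ($pageA === $pageB) continue;
            if (!isset($overlaps[$pageA][$pageB])) $overlaps[$pageA][$pageB] = 0;
            $overlaps[$pageA][$pageB] += 1;
        }
    }
}
// eg $pageTotals['michaeljackson'] might be 406, and
// $overlaps['michaeljackson']['greenbaypackers'] 22.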

I'm not claiming this is a definitive benchmark for everyone, but it's an accurate reflection of the work I need to do, with repeated updates of individual rows and large values. I've uploaded all the PHP files so you can also try this for yourself. Here are the results on my MacBook Pro 2.8GHz:

Null Store: 5.0s

RAM Store: 21.3s

Memcache: 35.7s

Redis Domain Socket: 37.3s

Tokyo Domain Socket: 40.8s

Redis TCP Socket: 42.6s

Mongo TCP Socket: 43.9s

Tokyo TCP Socket: 45.3s

MySQL: 543.5s

The 'Null Store' is a do-nothing interface, just to test the overhead of non-database work in my script. The 'RAM Store' keeps the values in a PHP associative array, so it's a good control giving an upper limit on performance, since it never has to touch the disk. Memcache is another step towards the real world since it's in another process, but its lack of disk access still gives it an unrealistic advantage.
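To give a sense of the harness, here's a minimal sketch (not the exact code from the download, and the real interface in keyvaluestore.php may declare more methods) of what the do-nothing store looks like; every back-end implements the same small interface, so the benchmark script doesn't care which one it's talking to:

require_once('keyvaluestore.php');

// A baseline store that satisfies the common interface but never touches a
// database, so any time it takes is pure script overhead.
class NullStore implements KeyValueStore
{
    public function connect($hostname, $port) { /* nothing to connect to */ }
    public function get($key) { return null; /* pretend the key was never stored */ }
    public function set($key, &$value) { /* discard the write */ }
}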

The 'Redis domain socket' gives the best results of the real database engines. The domain socket part refers to a patch I added to support local file sockets. It's impressively close to the Memcache performance, less than 2 seconds behind. The 'Tokyo domain socket' comes in next, also using file sockets, then Redis using TCP sockets on the local machine, and then Tokyo on TCP.

A long, long way behind is MySQL, at over 10 times the duration of any of the other solutions. This demonstrates pretty clearly why I had to abandon it as an engine for my purposes, despite its rich features and stability. I also tested MongoDB, but was getting buggy results out of the analysis so I was unable to get meaningful timings. I've included the file in case anyone's able to tell me where I'm going wrong.

[Updated: thanks to Kristina who fixed the bug in my PHP code in the comments I now have timings of 43.9s for Mongo via TCP, and I've updated the code download. Bearing in mind I'm doing a lot of unnecessary work for Mongo, like serializing, this is an impressive result and I should be able to do better if I adapt my code to its features. I'm also trying to find info on their support for domain sockets.]

To try this on your own machine, download the PHP files, go to the directory and run it from the command line like this:

time php fananalyze.php -f data.txt -s redis -h localhost -p 6379

You can replace redis and the host arguments with the appropriate values for the database you want to test. To make sure the analysis has worked, try running a query, eg

php fananalyze.php -q "http://www.facebook.com/michaeljackson" -s redis -h localhost -p 6379

You should see a large JSON dump showing which pages are correlated with Jackson.

I would love to hear feedback on ways I can improve this benchmark, especially since they will also improve my analysis! I'll also be rolling in some RAM caching in a later article, since I've found that the locality of my data access means delaying writes and keeping a fast in-memory cache layer around helps a lot.

Here are the versions of everything I tested:

Redis 1.02 (+ my domain socket patch)
libevent 1.4.13
memcache 1.4.4
PECL/memcache 2.2.5
Tokyo Cabinet 1.4.9
Tokyo Tyrant 1.1.7 (+an snprintf bug fix for OS X)
MySQL 5.1.41, InnoDB
Mongo 64 bit v1.1.4 and PHP driver v1.0.1

MacBook Pro specs

I’m thankful for Ed Pulaski

Photo from the Santa Monica Mountains Trails Council

For years one of my favorite trail work tools has been the Pulaski ax, similar to a small pick-ax with a blade on one side and a grubbing hoe on the other. It works great for digging out stubborn stumps and then chopping the roots so they can be pulled out, as you can see in the shot above.

I vaguely knew it was named after a fire-fighter who popularized it, but I never knew his amazing life-story until I read The Big Burn. It's one of the best books I've read in a long time, weaving the political story of the early days of the Forest Service with the personal nightmares and heroics of the biggest wildfire in American history.

Ed Pulaski was a grizzled woodsman in his 40s when he joined the new-born Forest Service as a Ranger in Idaho. In those early days Rangers were mostly young college boys from the East, so as a respected local who'd spent years outdoors as a prospector, rancher and railroad worker, Ed brought more credibility to the job than most.

In the summer of 1910, Teddy Roosevelt was out of office and without his support the Forest Service was being starved of funds. Since the year was unusually dry, there were spot fires throughout the Rockies, and not enough man-power to control them. It was so desperate that Pulaski, like many Rangers, ended up paying fire-fighter's wages out of his own savings.

On August 20th, hurricane-force winds whipped up the mass of spot fires into an immense burn, covering over 3 million acres, about the size of Connecticut. Ed Pulaski was leading a crew of 50 men trying to save the town of Wallace, and when it became clear the winds made the blaze unstoppable he tried to lead them out. The fire moved too fast, and they became cut off. Remembering a small mine nearby, he found the tunnel and led his crew inside as the forest was burning around them.

Standing at the entrance, he desperately tried to keep breathable air from being sucked out of the mine, hanging wet clothes as a barrier and extinguishing the supports as they kept igniting. Finally out of cloth, and badly burned and blinded by the flames, he ordered his crew to get low to the ground to make the most of the breathable air. He had to threaten the panicked men with his pistol to get them to obey, and shortly afterward he fell unconscious to the floor.

After the fire had passed, the men stumbled to their feet. Five were dead of asphyxiation, and Ed was thought to be gone too until he was woken by the fresher air. Forty-five men survived thanks to his leadership, and they stumbled the five miles back to town on feet burned raw, in shoes with the soles melted off.

They found a third of Wallace burned, but most lives saved by evacuation. Other fire-fighting crews throughout the Rockies were not as lucky, with 28 killed in just one spot, and over 125 dead in total.

Ed never fully recovered from his injuries, but went back to work as a Ranger, with medical treatment paid for by donations from his colleagues since the Service refused to help. He fought for years to get a memorial built for the fire-fighters who lost their lives, and created the Pulaski ax as a better tool for future crews.

Ironically, the great fire was a public relations coup for the Forest Service. They used the heroic story of Ed Pulaski to push through increased funding, with the promise of a zero-tolerance approach to wild fires. This use of fire as a justification for largely unrelated policies set a terrible precedent that's come back to haunt us. Now most debates around how we should use our National Forests are fought by invoking the specter of wild fires on both sides. The lack of both regular smaller fires and logging has left us with a tinder box, and from my time with forest and fire professionals, there are no simple solutions. The only approach that makes commercial sense for loggers is clear-cutting easily accessible areas, and simply letting fires burn when there's so much fuel results in far more devastation than when they were smaller but more frequent. I'm in favor of letting the professionals figure out good management plans without too much political pressure to lean towards a pre-judged outcome. I'd imagine that would involve more selective logging, which wouldn't go down well with many environmentalists, but it's also obvious that's only going to address a small part of the problem.

Despite this political knot, I'm grateful that people back in 1905 put so much of America into National Forests. After growing up in a country where every square inch has been used and reused for thousands of years, I fell in love with the immense wildernesses over here. Even just a few miles from LA you can wander for hours in beautiful mountains without seeing another soul. I'm thankful we had dedicated Rangers like Ed Pulaski to preserve that for us.

Boosting Redis performance with Unix sockets

Photo by SomeDriftwood

I've been searching for a way to speed up my analysis runs, and Redis's philosophy of keeping everything in RAM looks very promising for my uses. As I mentioned previously, I hit a lot of speed-bumps trying to get it working with PHP, but I finally got up and running and ran some tests. The results were good: faster than the purely disk-based Tokyo Tyrant setup I'd been relying on.

The only niggling issue was that I knew from Tokyo that Unix file (aka domain) sockets have a lot less overhead than TCP sockets, even on localhost. Since the interface is almost identical, I decided to dive in and spend a couple of hours patching my copy of Redis 1.02 to support file sockets. The files I've changed are available at http://web.mailana.com/labs/redis_diff.zip.

My initial results using the redis-benchmark app that ships with the code show a noticeable performance boost across the board, sometimes up to 2x. Since this is an artificial benchmark it's unlikely to be quite this dramatic in real-world situations, but with Tokyo 30%-50% increases in speed were common.

I hope these changes will get merged into newer versions of Redis; it's a comparatively small change for a big performance boost when the client and server are on the same machine.

Using Redis on PHP

Photo by Bowbrick

There's a saying that pioneers tend to come back stuck full of arrows. After a couple of days trying to get Redis working with PHP, I know that feeling!

I'm successfully using Tokyo Tyrant/Cabinet for my key/value store, but I find that for a lot of my uses disk access is a major performance bottleneck. I do lots of application-level RAM caching to work around this, but Redis's philosophy of keeping everything in main memory looked like a much lower-maintenance solution.

Getting started is simple, on OS X I was able to simply download the stable 1.02 source, make and then run the default redis-server executable. Its interface is through a TCP socket, so I then grabbed the native PHP module from the project's front page and started running some tests.

The first problem I hit was that the PHP interface silently failed whenever a value longer than 1024 characters was set. Looking into the source of the module, it was using fixed-length C arrays (they were even local stack variables with pointers returned from leaf functions!) and failing to check if the passed-in arguments were longer. This took me an hour or two to figure out in unfamiliar code, so I was a bit annoyed that there weren't more 'danger! untested!' signs around the module, though the README did state it was experimental.

Happily a couple of other developers had already run into this problem, and once I brought up the issue on the mailing list, Nicolas Favre-Félix and Nasreddine Bouafif made their fork of PHPRedis available with bug fixes for this and a lot of other issues.

The next day I downloaded and ran the updated version. This time I was able to get a lot further, but on my real data runs I was seeing intermittent empty strings being returned for keys that should have had values. This was tough to track down, and even when I uncovered the underlying cause it didn't make any sense. It happened seemingly at random, and I wasn't able to reproduce it in a simple test case. An email to the list didn't get any response, so the following day I heavily instrumented the PHP interface module to understand what was going wrong.

Finally I spotted a pattern. The command before the one that returned an empty string was always

SET 100
<1000 character-long value>

It turned out that some digit-counting code thought the number 1000 only had three digits, and truncated it to 100. The other 900 characters in the value remained in the buffer and were misinterpreted as a second command. That meant the real next command received a -ERR result. I coded up a fix and submitted a patch, and now it seems to be working at last.
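To make the failure mode concrete, here's a hypothetical illustration (not the actual PHPRedis code) of how an off-by-one digit count can corrupt a length-prefixed stream like this:

// Hypothetical digit counter with an off-by-one: 1000 comes back as 3 digits.
function count_digits($n) {
    $digits = 1;
    while ($n > 10) {     // bug: should be >= 10
        $n = (int)($n / 10);
        $digits += 1;
    }
    return $digits;
}

$value = str_repeat('x', 1000);
$length = strlen($value);                                     // 1000
$claimed = substr((string)$length, 0, count_digits($length)); // "100"
// The client claims a 100-byte payload but sends 1000 bytes; the server reads
// only 100 of them and treats the remaining 900 bytes as the start of the next
// command, which then fails with -ERR.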

Hitting this many problems so quickly has certainly made me hesitate to move forward using Redis in PHP. It's definitely not a well-trodden path, and while the list was able to bring me a solution to my first problem, I was left to debug the second one on my own, and a question about Unix domain sockets versus TCP was left unanswered as well. If you are looking at Redis yourself in PHP, make sure you're mentally prepared for something pretty experimental, and don't count on much hand-holding from the developer community.

Of course, the same goes for almost any key/value store right now; it's the wild west out there compared to the stability of the SQL world. My next stop will be MongoDB, to see if having a well-supported company behind the product improves the experience.

Nonsensical Infographics

Nonsensical Infographic 1 by Chad Hagan

20×200 is an awesome concept, an online gallery selling limited editions of works by new artists starting at $20 each. While I strive to follow Tufte and make all my visualizations tell a clear story, I'm aware they sometimes turn out more pretty than functional, so I'm in love with Chad Hagan's 'Nonsensical Infographic' series on there. Now I just need to convert them into flash animations to make them even more beautiful and confusing.

Nonsensical Infographic 2 by Chad Hagan