Getting Tokyo Tyrant to work with files larger than 2GB

Photo by Gen Kanai

I use Tokyo Tyrant/Cabinet as the key-value database for Mailana, and after some initial hiccups I've been very happy with its performance. Last night though it stopped working in the middle of preparing several hundred nightly emails, and I wanted to document the problem and the fix to help anyone else who hits this.

After a bit of investigation, I noticed that the Tyrant server kept dying with "File size limit exceeded". My casket.tch hash database file had grown to 2GB, and running on a 32-bit EC2 server Tokyo couldn't cope with anything larger. There's a standard called Large File Support on Linux that allows you to access >2GB files, but it requires a few things to work:

– A modern version of Linux. I'm on 2.6, so it has support for LFS built in.

– A modern file system that supports large files. I'm on XFS, so that was also ok.

– You need to recompile your program to use the 64-bit versions of the file operations. Happily Tokyo was using the correct off_t type for file offsets, rather than int, so I added the -D_FILE_OFFSET_BITS=64 compile flag to the configure script in both Cabinet and Tyrant and rebuilt them both. They then ran with 64-bit file offsets on a 32-bit system.

There was one other quirk I discovered. By default Tokyo only uses a 32-bit index for the hash database, so you also need to pass in the l option at runtime to cope with the larger files, e.g.:

/usr/local/bin/ttserver -host /sqlvol/tokyo.sock -port 0 -le /sqlvol/casket.tch#opts=l

After making those changes, I was able to restart my server and run the daily email updates again. The meta-data for my database seemed to have been corrupted by the issue, but all my data integrity checks passed, so I patched around the problem. Specifically, in tchdb.c:tchdbopenimpl() the file size returned from fstat() didn't match the one stored in the meta-data header, so I skipped the check:

sbuf.st_size < hdb->fsiz

Plug and Play Tech Center spam

I don't usually post spam, but for anyone out there who gets an email like this and googles it, no, I don't think it's that dream investor you've been waiting for. The fact they can't even figure out my first name is a strong sign, and I'm not the only one getting these.

From: Nickolas Turner <nturner@plugandplaytechcenter.com>

Subject: Funding Opportunity through Plug and Play Tech Center

Dear Mailana,

Are you looking for funding? Please contact Alireza@plugandplaytechcenter.com
to get in touch with our seed and early stage venture arm, as well as our
partners.

Best of luck in your ventures.

Regards,

Nick Turner

Business Relationship Associate

Plug and Play Tech Center

(650) 207-7001

Hate your bank? Use a credit union

Photo by Ronn Ashore

I spent several years suffering with grotty customer service at Citibank, and then I was hit by a check fraud that spiraled into a Kafkaesque nightmare. A house-mate snuck into my room, stole a check, forged my signature (poorly) and then cashed it for $1000. Einstein that he was, he'd had to write his driver's license and social security number on the back, which showed up when I got the photocopy back. Not wanting to tip him off, the other house-mates and I contacted the police, who were very helpful and interested. Now all we needed was the location where the check was cashed, which didn't show up on the statement.

After 3 months of both me and the police constantly calling and visiting Citibank, they still refused to provide us with any details. I was constantly fobbed off with bogus excuses. The case was allegedly in the hands of their fraud department, who must live on an island somewhere in the south Atlantic with no means of communicating with the outside world, since I was never able to get a phone number or address to contact them. I finally received a refund after blowing my top at the local branch, and then promptly closed my account, threw the house-mate's possessions out on the front lawn and sent a copy of the forged check to his parents.

I was reminded of that when I saw this article on someone being hit with an $888,888.88 bank charge, with no explanation or help from the bank staff. It sounds like exactly the same sort of organizational failure that stymied my efforts to get help. From what I can see, the big banks have spent the last decade trying to build automated systems and procedures so they can get rid of expensive staff. That mostly works for routine operations, but as soon as something unusual happens you need somebody with judgement and authority to make decisions.

So what's the answer? I moved my account to a credit union eight years ago and I've been incredibly happy with them ever since:

– The customer service has been fantastic. They have trained, motivated bank staff able and willing to sort out problems for me, both in the branch and on the phone.

– I pay zero ATM fees, even when I'm traveling, since I can use any other credit union's machine for free.

– They don't gouge me with any other fees either. The big banks make nearly 40% of their revenue from 'non-interest income', and the bigger they are, the more they rely on it. Even worse, the 20% of households that pay the bulk of overdraft fees (i.e. the poorest) account for 80% of them, averaging around $1300 each annually.

– I also get a warm glow inside because my deposits are funding straight-forward loans to local people and businesses, not financial speculation or empire building by bank CEOs. I'd rather be helping George Bailey than Gordon Gekko.

My personal account is with Keypoint Credit Union, and my business one with Lockheed, and they've both been stellar. If you're sold on the idea, there's almost certainly one that you can join, either because of where you live or the industry you work in. If you're currently with a large bank, you won't regret switching.

Super-simple A/B testing in PHP

Photo by Roadside Pictures

To really learn about what your users want you need to see how they respond to the different alternatives. Running A/B tests is a great way to do this, but even though the concept is simple, I always felt like it would require some complex coding and database setup to implement. I was wrong: inspired by Eric Ries's tips from a recent workshop I've been getting a lot of valuable feedback using just a 32-line PHP module and plain old logging to a file.

To use it yourself, all you need to do is think up a name for your test, and surround your alternatives with an if (should_ab('yourtestname', $userid)). That's it. I've deliberately made it so there's zero configuration and you can just pick an arbitrary test name, to encourage myself to test early and often. It's best if you have a proper user id to supply to the test function, but if you omit it, the client IP address will be used instead.

Now when your users load up a page they should see one version or another based on who they are, but how do you gather the information about which one worked? I'm logging all my user events to a file on the server using my custom_log() function, so whenever a user views a page I want to store what options they viewed it with. To do that, the only other function in the module returns an array containing what A/B choices were made for the current page. With that appended as a JSON string to each log entry, I can run analytics on the user's subsequent behavior, to tell which version of a front page led to the most conversions for example. The only tricky part of this approach is that you need to make sure you're logging the event at the end of the page, after all the choices have been made.
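Here's a rough sketch of what that analytics step could look like. The log layout is an assumption for illustration: it supposes each entry's message is an event name followed by the JSON-encoded choices, and that 'pageview' and 'signup' are the events of interest. The tally_ab_conversions() helper and the sample lines are made up, not part of the module.

```php
<?php
// Hypothetical sample lines: "<event> <json choices>" is an assumed message
// layout, not something custom_log() or abtesting.php dictates.
$sample_lines = array(
    '[Mon Jun 1 10:00:00 2009] [events] [client 10.0.0.1] pageview {"frontpage":true}',
    '[Mon Jun 1 10:00:05 2009] [events] [client 10.0.0.1] signup {"frontpage":true}',
    '[Mon Jun 1 10:01:00 2009] [events] [client 10.0.0.2] pageview {"frontpage":false}',
    '[Mon Jun 1 10:02:00 2009] [events] [client 10.0.0.3] pageview {"frontpage":true}',
);

// Count views and conversions for each variant of one test.
// 'A' is the branch where should_ab() returned true, 'B' the other one.
function tally_ab_conversions($lines, $testname, $view_event, $conversion_event)
{
    $tally = array(
        'A' => array('views' => 0, 'conversions' => 0),
        'B' => array('views' => 0, 'conversions' => 0),
    );
    foreach ($lines as $line) {
        // Pull out the event name and the trailing JSON object
        if (!preg_match('/\] (\w+) (\{.*\})\s*$/', $line, $matches))
            continue;
        $choices = json_decode($matches[2], true);
        if (!is_array($choices) || !isset($choices[$testname]))
            continue;
        $variant = $choices[$testname] ? 'A' : 'B';
        if ($matches[1] === $view_event)
            $tally[$variant]['views'] += 1;
        else if ($matches[1] === $conversion_event)
            $tally[$variant]['conversions'] += 1;
    }
    return $tally;
}

$tally = tally_ab_conversions($sample_lines, 'frontpage', 'pageview', 'signup');
foreach ($tally as $variant => $counts)
    print "$variant: {$counts['conversions']} conversions from {$counts['views']} views\n";
```

In practice you'd feed it the lines of the real log with file(), and you'd want far more traffic than this before trusting any difference between the two variants.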

If you want to dive deeper, there are lots of strong frameworks out there for split-testing (I particularly like kissmetrics' approach), but even using something as brain-dead as my 32-line module will be a massive leap forward if you're a non-split-tester like I was.

[Update – Doh! I got the random generator wrong, it only returned true about 30% of the time using the md5 test. I've switched it over to crc32 below and in the file]

Download abtesting.php

<?php
// A module to let you do simple A/B split testing.
// By Pete Warden ( http://petewarden.typepad.com ) – freely reusable with no restrictions

// An array to keep track of the choices that have been made, so we can log them
$g_ab_choices = array();

function should_ab($testname, $userid=null) {
    // If no user identifier is supplied, fall back to the client IP address
    if (empty($userid))
        $userid = $_SERVER['REMOTE_ADDR'];
   
    global $g_ab_choices;
    if (isset($g_ab_choices[$testname]))
        return $g_ab_choices[$testname];
       
    $key = $testname.$userid;
    $keycrc = crc32($key);
   
    $result = (($keycrc&1)==1);
   
    $g_ab_choices[$testname] = $result;
   
    return $result;
}

function get_ab_choices()
{
    global $g_ab_choices;
    return $g_ab_choices;
}
?>
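Given that I got the random generator wrong the first time, a quick sanity check of the bucketing seems worth having. This is only a sketch: ab_bucket() is a hypothetical stand-in that repeats the hashing step above without the per-request cache, and the test name and user ids are invented.

```php
<?php
// ab_bucket() is a hypothetical stand-in that repeats the module's hashing
// step without the per-request cache, so it can run on its own.
function ab_bucket($testname, $userid)
{
    // Low bit of the CRC32 of "test name + user id" picks the variant
    return ((crc32($testname.$userid) & 1) == 1);
}

// The same test name and user always land in the same bucket...
$first = ab_bucket('signupbutton', 'user42');
$second = ab_bucket('signupbutton', 'user42');
print 'Stable across calls: '.($first === $second ? 'yes' : 'no')."\n";

// ...and counting buckets over made-up user ids shows how a test splits
$counts = array('A' => 0, 'B' => 0);
for ($i = 1; $i <= 1000; $i += 1)
    $counts[ab_bucket('signupbutton', "user$i") ? 'A' : 'B'] += 1;
print "A: {$counts['A']}, B: {$counts['B']}\n";
```

If the two counts come out badly lopsided for your own ids, that's a hint the hash isn't splitting your users evenly, which is exactly the mistake I made with md5.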

How to log to custom files from PHP

Photo by Old Shoe Woman

I needed a function in PHP that worked like error_log(), but appended to a set of custom files rather than to the standard error_log. I wanted to have an easier way to organize the different types of information, so that important messages weren't buried in an avalanche of less-crucial warnings, but this sort of thing is also great fodder for analytics if you write user events to their own file.

The result is custom_log(). It takes two arguments: a category name that determines which file to write to, and the message you want to log. The message gets written to that file, prefixed with the time and client IP. You can download the code as customlog.zip or it's included below:

<?php

// A module to write out events to a set of log files. Similar to error_log(),
// but with multiple output files.
//
// You'll need to set up a directory that the process running PHP (eg Apache) has
// permission to write to. You'll also need to keep an eye on the size of the log
// files, rotate out old ones once they get too large, etc.
//
// By Pete Warden ( http://petewarden.typepad.com ) – freely reusable with no restrictions

// Edit this to set it to the folder on your server where you want the logs to live
//define('CUSTOM_LOG_ROOT_DIRECTORY', '/private/var/log/apache2/'); // OS X default Apache log directory

define('CUSTOM_LOG_ROOT_DIRECTORY', '/var/log/httpd/'); // Red Hat Linux default Apache log directory

$g_custom_log_categories = array();
$g_custom_log_shutdown_registered = false;

// This function works like error_log(), but takes an extra category argument that
// determines which file the message is appended to.
function custom_log($category, $message)
{
    global $g_custom_log_categories;
   
    // If the file hasn't been opened for appending yet, create a new file handle
    if (!isset($g_custom_log_categories[$category]))
    {
        // Make sure there's no shenanigans with special characters like ../ that
        // could be abused to write outside of the specified directory
        $sanitizedcategory = preg_replace('/[^a-zA-Z0-9]/', '_', $category);
        $filename = CUSTOM_LOG_ROOT_DIRECTORY.$sanitizedcategory;
        $filehandle = fopen($filename, 'a');
        if (empty($filehandle))
        {
            error_log("Failed to open file '$filename' for appending");
            return;
        }

        // To close any open files once the script is done, and so ensure that
        // all the messages are written to disk, register a global shutdown
        // function that fclose()'s any open handles
        global $g_custom_log_shutdown_registered;
        if (!$g_custom_log_shutdown_registered)
        {
            register_shutdown_function('custom_log_on_shutdown');
            $g_custom_log_shutdown_registered = true;
        }
       
        // Urghh, this is required to prevent a spew of warnings when more recent
        // PHP versions are set to strict errors
        if (!ini_get('date.timezone'))
            date_default_timezone_set('UTC');
       
        $g_custom_log_categories[$category] = array('filehandle' => $filehandle);
    }

    // Create the full message and append it to the file
    $categoryinfo = $g_custom_log_categories[$category];   
    $filehandle = $categoryinfo['filehandle'];
   
    $timestring = date('D M j H:i:s Y');
    $ipaddress = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-'; // not set when run from the command line
    $fullmessage = "[$timestring] [$category] [client $ipaddress] $message\n";
   
    fwrite($filehandle, $fullmessage);
}

// A clean-up function called to make sure all open file handles are closed
function custom_log_on_shutdown()
{
    global $g_custom_log_categories;
    foreach ($g_custom_log_categories as $category => $categoryinfo)
        fclose($categoryinfo['filehandle']);
}

?>
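To see how category names end up as file names, here's a small sketch of just the sanitizing step. custom_log_filename() is a hypothetical helper, not part of the module; it repeats the same preg_replace() call so the mapping can be inspected without opening any files.

```php
<?php
// custom_log_filename() is a hypothetical helper that repeats the module's
// sanitizing step so the mapping can be inspected without opening any files.
function custom_log_filename($rootdirectory, $category)
{
    // Anything that isn't a letter or digit becomes an underscore
    $sanitizedcategory = preg_replace('/[^a-zA-Z0-9]/', '_', $category);
    return $rootdirectory.$sanitizedcategory;
}

$examples = array('signups', 'user events', '../etc/passwd');
foreach ($examples as $category)
    print "'$category' -> ".custom_log_filename('/var/log/httpd/', $category)."\n";
```

One thing to watch for is that different category names can collapse to the same file, for example 'user events' and 'user_events', so it's simplest to stick to plain alphanumeric categories.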

Balsamiq: So simple, even a programmer can use it


Mock me mercilessly, I deserve it, but I've really been struggling to prototype on paper before I code. Back at Apple there was always a white-board handy and a bunch of colleagues and customer-surrogates I had to collaborate with on any feature, so I did plenty of documentation before doing any serious engineering. As a lone founder, it's seriously tempting to think I have a good enough picture in my head to just go ahead and try it out.

Wrong, wrong, wrong! For one thing I end up involving users way too late in the process, since it takes a whole bunch of coding effort before I can show them something. Even ignoring that, I've never thought things through as completely as I think I have. Just a few minutes trying to sketch out the result I'm trying to achieve will always show me something I'd missed, and that's a lot cheaper than spending hours of programming to get to the same conclusion.

One of my mental blocks to prototyping was that I couldn't find a method I felt comfortable with. I'd tried the Pencil Sketch Firefox plugin, but it just didn't work the way I wanted. OmniGraffle is fantastic for creating beautiful diagrams, but it's painful to build something that looks like a UI sketch out of its primitives. I'd fallen back to using pen and paper, but it's really hard to alter and evolve hard copy, and you have to scan it in to share it remotely. Finally I tried out Balsamiq last week, and I'm in love.

I could rhapsodize about its ease of use, but the single best feature is that it looks like a sketch. This visual metaphor is really important, because it clearly marks the results out as conceptual designs, not detailed blueprints. This stops both other people and myself from nit-picking the look-and-feel, and forces a focus on the big questions about content and placement. I don't spend hours obsessing about aligning elements, because they naturally look a bit wonky, so I'm freed to think about what the overall content should be.

You can give it a try for yourself with the online version, and the full desktop product is $79, though I got it for $40 with a Techstars discount. If you're at all involved in product development, I think you'll end up buying it too.

Blogs I’m reading now

Photo by MargoLove

Paul Jozefak just posted a list of the startup-related blogs he's reading, and that reminded me that I'd been intending to highlight some of my favorites too. I'm skipping the obvious ones (Brad, Fred, Eric Ries) to focus on lesser-known gems I'd love to see more widely read.

Bill Flagg

Bill's a Boulder entrepreneur with several great companies under his belt, but what really makes him stand out is that he's a boot-strapper. During TechStars he was a great counter-point to the focus on raising money, and he posts some awesome advice on building a company that actually generates cash. How about a billing department that encourages customers to mark down their invoices if they didn't feel like they got their money's worth? It's working for RegOnline.

Rick Segal

I love Rick's blog because of his willingness to risk offending people. I actually got fairly irate at a post he did last year, but I wouldn't have him any other way. What's even more interesting is that he's recently started on a journey from VC to startup founder, so there have been lots of great "Eat your own dogfood" posts, including a mea culpa on ever uttering the words 'lifestyle business' as a VC.

Highway 12 Ventures

Mark and George were very active in TechStars, but I never realized they blogged until Mark's stellar "Don't let the bastards grind you down". Since then I've been working through their archive, and it's chock-full of other great posts, even tips from a hostage negotiator!

Jay Parkhill

Talking of negotiations, Jay's latest post on telling who wants to actually do a deal and who's just there to argue is a must-read. He's a lawyer specializing in startups, so there's loads of other great advice like how to cope with the loss of co-founders without sinking the business.