Denormalization: The forbidden optimization


One of the key principles you learn about relational database design is to always normalize your data. This gets quietly thrown out of the window once you have to scale up web applications, either to cope with large numbers of users or with large amounts of data. I’m hitting this point with the 500,000 emails in the Enron collection, and chatting with Kwin from Oblong made me realize what a dirty little industry secret this optimization is.

Normalization means storing any fact in just one location in your database, so that you avoid update anomalies where you have conflicting information in multiple places. For instance, if you have someone’s address stored in several tables, you might forget and only update one of them when it changed, but still be using their old address for queries relying on the other tables. The downside to normalization is that many queries require joins to fetch all the information you need, and these can be slow on large data sets.

Denormalization means duplicating facts across different tables so that reads can avoid joins and so run much faster. It’s dangerous because you lose the safety net of a single authoritative copy, and you have to make your data-writing code more complex to keep every copy in sync. For my work on mail, and for most web services, the bulk of the time is spent reading data rather than writing it, which is why it’s such an appealing optimization.
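To make the tradeoff concrete, here’s a minimal PHP sketch of what the write path looks like once a fact is duplicated. The tables and column names are hypothetical: a normalized users table holds the address, and a denormalized orders table carries a copy so that listing orders never needs a join.

<?php
// Hypothetical schema:
//   users(id, address)
//   orders(id, user_id, shipping_address, ...)
// The shipping_address column duplicates users.address so the common
// read (list a user's orders along with their address) avoids a join.
function update_user_address($pdo, $userid, $newaddress)
{
    // Every copy of the fact has to change together; this is the extra
    // write-side complexity that denormalization costs you.
    $pdo->beginTransaction();

    $update_user = $pdo->prepare(
        "UPDATE users SET address = ? WHERE id = ?");
    $update_user->execute(array($newaddress, $userid));

    $update_orders = $pdo->prepare(
        "UPDATE orders SET shipping_address = ? WHERE user_id = ?");
    $update_orders->execute(array($newaddress, $userid));

    $pdo->commit();
}
?>

The payoff is on the read side: the common query becomes a plain SELECT id, shipping_address FROM orders WHERE user_id = ?, with no join back to the users table.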

It’s actually just another form of caching, on the memory-intensive end of the classic performance/memory tradeoff spectrum. Memcached is another layer of caching that works well if you’ve got a lot of repeated queries, though again it complicates the update logic. Indexing within a database is another form of caching frequently needed data, though that’s handled behind the scenes for you.
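Since memcached gets a mention, here’s a rough sketch of that pattern too, using PHP’s old Memcache extension. The key name, query and ten-minute expiry are arbitrary choices for illustration, and the same caveat applies: whatever writes the underlying row has to remember to delete or refresh the cached copy.

<?php
// Read-through cache: try memcached first, fall back to the database,
// then store the result so repeated queries skip the database entirely.
function get_user_address($memcache, $pdo, $userid)
{
    $key = "user_address_" . $userid;

    $address = $memcache->get($key);
    if ($address !== false)
        return $address;

    $query = $pdo->prepare("SELECT address FROM users WHERE id = ?");
    $query->execute(array($userid));
    $address = $query->fetchColumn();

    // Cache for ten minutes; any code that updates the address should
    // call $memcache->delete($key) so readers don't see stale data.
    $memcache->set($key, $address, 0, 600);

    return $address;
}
?>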

There are some fascinating case studies out there on how sites like eBay and Flickr have broken all the old rules to get the performance they need. Google’s BigTable doesn’t specify anything about normalization, but the fact that it’s a simple map between keys and values, with no complex queries possible, makes it very tempting to duplicate your data with keys for the common read operations.

How to write an Ajax update function with PHP

Photo by Bored-Now

I’ve been writing a lot of Ajax code to request some information from a server and then update an element on the page with the returned HTML. The basic XMLHttpRequest code to do this is pretty simple, but I’ve specialized the code to do a couple of common things. First, it always replaces the HTML of the element with the ID given in $replacename. Second, it takes in a Javascript variable name so you can dynamically alter the URL parameters that are passed in. That second part is really useful when you want a client-side event to trigger the fetch: you can write <select onchange="yourajaxfunction(this.value);"> for a menu, define the values in each menu item, and the selected value gets appended to the request URL. Here’s the PHP code for the function body:

function add_ajax_fetch_script($fetchurl, $parametersjsvar, $replacename)
{
?>
    // Create the request object, trying the older ActiveX flavours first
    // and falling back to the standard XMLHttpRequest object
    var xhr;
    try
    {
        xhr = new ActiveXObject('Msxml2.XMLHTTP');
    }
    catch (e)
    {
        try
        {
            xhr = new ActiveXObject('Microsoft.XMLHTTP');
        }
        catch (e2)
        {
            try
            {
                xhr = new XMLHttpRequest();
            }
            catch (e3)
            {
                xhr = false;
            }
        }
    }

    // When the response arrives, replace the target element's HTML,
    // or show the status code if something went wrong
    xhr.onreadystatechange = function()
    {
        if (xhr.readyState == 4)
        {
            if (xhr.status == 200)
                document.getElementById("<?=$replacename?>").innerHTML = xhr.responseText;
            else
                document.getElementById("<?=$replacename?>").innerHTML = "Error code " + xhr.status;
        }
    };

    xhr.open("GET", "<?=$fetchurl?>"+<?=$parametersjsvar?>, true);
    xhr.send(null);

<?php
}

To use this code, you’d write out the name and signature of your Javascript function, call add_ajax_fetch_script(), and then terminate the Javascript function with a closing curly brace. For example:

function yourajaxfunction(urlsuffix)
{
<?php
add_ajax_fetch_script("http://someurl.com", "urlsuffix", "someelementid");
?>
}
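To round out the picture, here’s a rough sketch of the other two pieces this assumes: a menu whose onchange event calls the Javascript function, and a hypothetical server-side script that returns the HTML fragment. The option values, element ID and fetch.php name are all made up for illustration.

<select onchange="yourajaxfunction(this.value);">
    <option value="?user=bob">Bob</option>
    <option value="?user=sue">Sue</option>
</select>
<div id="someelementid">Pick a user to see their details</div>

The server side only needs to print the replacement HTML, so the script at the fetch URL could be as simple as:

<?php
// fetch.php: returns the fragment that will replace the contents of
// the element whose ID was passed to add_ajax_fetch_script()
$user = isset($_GET['user']) ? htmlspecialchars($_GET['user']) : 'nobody';
echo '<p>Showing details for ' . $user . '</p>';
?>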

How to set up Linux for web development through Parallels on OS X

Photo by CarlosLuis

I wanted a setup for web development that matched my production server, but let me do local development on my MacBook Pro. I’m a big fan of Parallels running Windows, so I set out to get Red Hat Fedora server running too. My requirements were that I should easily be able to install extensions that aren’t standard on OS X PHP like IMAP and GD, and that I could save files to my local drive and immediately run them through the server without having to copy anything.

Getting started with Fedora was painless thanks to this ready-made Parallels disk image. I downloaded that and it loaded immediately with no setup required. I then ran yum to update the system to the latest patches, and I was in business. I had to remove yum-updatesd before I could open any of the desktop software installer applications, but once that was working I could run the Add Software application. There I chose the parts I needed, like MySQL, PHP and the other random web development additions.

After that was going, I created a test.php file containing <?php phpinfo(); ?> and placed it in /var/www/html/. Loading up Firefox inside Fedora and pointing it to http://localhost/test.php gave me the expected information dump. Everything was going smoothly, so I should have known there was trouble in store.

The only remaining part was adding a link back to my OS X filesystem within Linux, so that Apache could access my files without having to do any copying. Parallels offers a great bridge between the Windows and mac file stores, but I couldn’t find anything that easy for Fedora. What I did run across was sshfs, which uses fuse and ssh to create a virtual folder inside Linux that points back to a directory on a system accessed through the network.

I went through all the steps to get that set up, but spent a very long time getting 403 Permission Denied errors every time I tried to access OS X files through Apache. After a lot of hair-pulling, I figured out how to make everything play nicely together. It involves loosening the permissions model, so I don’t recommend doing this on a production server for security reasons, but it should be fine for local development. Here are the steps:

  • On Linux, make sure you have sshfs, and you’ve added your current Linux user to the fuse group with su -c 'gpasswd -a <username> fuse'
  • Again on Fedora, get fuse running with service fuse start and add it to the startup sequence with echo 'service fuse start' >> /etc/rc.local
  • On OS X, go to System Preferences->Sharing and turn on Remote Login. Make a note of the IP number it displays in that window.
  • To test the remote login, in a Linux terminal window do ssh <mac user name>@<mac IP number>, e.g. ssh petewarden@10.0.1.196. If this doesn’t work, you’ll need to stop and check the IP number and your Parallels network setup.
  • Now you can try to set up the filesystem connection through SSH. In your Fedora terminal, type sshfs -o allow_other,default_permissions <mac user name>@<mac IP number>:<Path to your mac folder> /var/www/html/testbed, e.g. sshfs -o allow_other,default_permissions petewarden@10.0.1.196:/Users/petewarden/Sites/testbed /var/www/html/testbed. The magic bit here is the extra options that allow other users to access the folder. Without these, the apache user won’t be able to read the files and you’ll get the 403 errors. Be warned though: these options are necessary but not sufficient, and you’ll still need to follow the steps below to clear up the remaining permission problems.
  • Because the files show up with a user ID and group ID that’s unknown to the Linux filesystem, Apache’s strict security settings will refuse to display them. To get around this, you need to make the security settings less restrictive. First you’ll need to disable Security Enhanced Linux, aka SELinux. To do this through the GUI, go to the Security preferences and click the SELinux tab. Then choose "Disabled" from the dropdown menu. You could also do this on a per-program basis, but I wanted to keep it simple.
  • Next you need to remove suEXEC, another security feature of Apache. To do this, just move the file itself (on my system at /usr/sbin/suexec) to another location, e.g. mv /usr/sbin/suexec /usr/bin/suexec_disabled
  • Finally restart Apache with service httpd restart and try navigating to one of your pages. With any luck you’ll now be able to save files from OS X and immediately see the changes in Firefox within Fedora.

One of my main reasons for this setup was to easily install extensions. After going through these steps I was able to just run yum install php-gd to get the gd graphics library, a project that had previously taken me hours of fiddling on OS X even with fink.
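If you want to double-check that PHP can actually see a newly installed extension like GD, a quick sanity test is a script along these lines (the file name is just a placeholder):

<?php
// gdcheck.php: confirm the gd extension is loaded and report its version
if (extension_loaded('gd')) {
    $info = gd_info();
    echo "GD is available: " . $info['GD Version'] . "\n";
} else {
    echo "GD is not loaded, check your php.ini and restart Apache\n";
}
?>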

Update: I’m now using the Parallels Linux instance directly from OS X. To do that I had to enable HTTP in the firewall settings on Fedora, and then run ifconfig to work out the IP address it had acquired. After that I can navigate to http://x.x.x.x/ in my OS X copy of Firefox and access my files, without ever having to switch to the Linux desktop.

Why aren’t we using humans as robots?

Photo by Regolare

Yesterday I had lunch with Stan James of Lijit fame, and it was a blast. One of the topics that’s fascinated both of us is breaking down the walls that companies put up around your data. In the 90’s it was undocumented file formats and this decade it’s EULAs on web services like Facebook. The intent is to keep your data locked in to a service, so that you’ll remain a customer, but what’s interesting is that they don’t have any legal way of enforcing exactly that. Instead they forbid processing the data with automated scripts and giving out your account information to third-party services. It’s pretty simple to detect when somebody’s using a robot to walk your site, and so this is easy to enforce.

The approach I took with Google Hot Keys was to rely on users themselves to visit sites and view pages. I was then able to analyze and extract semantic information on the client side, as a post-processing step using a browser extension. It would be pretty straightforward to do the same thing on Facebook, sucking down your friends’ information every time you visited their profiles. I Am Not A Lawyer, but this sort of approach is both impossible to detect from the server side and hard to EULA out of existence. You’re inherently running an automated script on the pages you receive just to display them, unless you only read the raw HTTP/HTML responses.

So why isn’t this approach more popular? One thing both Stan and I agreed on is that getting browser plugins distributed is really, really hard. Some days the majority of Google’s site ads seem to be for their very useful toolbar, but based on my experience only a tiny fraction of users have it installed. If Google’s marketing machine can’t persuade people to install client software, you clearly need a very compelling proposition before you’ll get a lot of uptake.

Illegal characters in PHP XML parsing

Photo by Cattoo

If you hit the error "Invalid character" while using PHP’s built-in XML parser, and you don’t see the usual "<" or "&" characters in the input, you might be running into the same control code problems I’ve been hitting. I’d always assumed, and most sites suggest, that you can put just about anything within a CDATA block, since it’s only outside one that characters like < and & need escaping. I’m wrapping the bodies of email messages in XML, within CDATA sections, but I was still seeing parser failures like these. I also tried various escaping methods instead, like htmlspecialchars(), but still hit the failure.

Digging into it was tricky, since the parser doesn’t report the actual character value it’s choking on. In one case I tracked it down to "\x99", which looks like the Microsoft variant of the trademark character. That got me wondering exactly what character set was being used, so I tried specifying ISO-8859-1 explicitly when I created the parser, but still hit the same error.

Then I realized I was cutting some corners by skipping the <?xml ?> declaration at the start of all of the strings I was creating. That’s where you can specify the character set for the document, and sure enough prefixing it with
<?xml version="1.0" encoding="ISO-8859-1"?>
got me past that first error. I thought I was home free, but looking at my test logs, it looks like it failed again overnight after going through 1300 more emails. I shall have to dig into that further and see what the issue was there.
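For reference, here’s a rough PHP sketch of the wrapping I’m describing, with one extra step I haven’t fully verified yet: stripping out the control characters that aren’t legal anywhere in XML 1.0, which may be what the remaining failures come down to. The function and element names are my own inventions.

<?php
// Build a parseable document from an email body: declare the encoding,
// strip control characters that XML 1.0 forbids outright (everything
// below 0x20 except tab, newline and carriage return), and wrap the
// result in a CDATA section.
function wrap_mail_body($body)
{
    $body = preg_replace('/[\x00-\x08\x0b\x0c\x0e-\x1f]/', '', $body);

    // Guard against a stray "]]>" in the body ending the CDATA section early
    $body = str_replace(']]>', ']]]]><![CDATA[>', $body);

    return '<?xml version="1.0" encoding="ISO-8859-1"?>' . "\n" .
        '<message><body><![CDATA[' . $body . ']]></body></message>';
}
?>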

It does seem like a design flaw that the parser chokes and dies on unrecognized characters, rather than shrugging its shoulders and carrying on. It may well be outside the spec to have control characters that aren’t legal in the declared character set, but it seems both possible and helpful to have a mode that either ignores or demotes those characters when they’re found, rather than throwing up its hands and refusing to parse any further. It has the same smell of enforcing elegance at the expense of utility that infuriated me about bondage and discipline languages like Pascal.

Don’t repeat yourself with XML and SQL

Photo by TW Collins

One key principle of Agile development is Don’t Repeat Yourself. If you’ve got one piece of data, make sure it’s only defined in one place in your code. That way it’s easy to change without either having to remember everywhere else you need to modify, or introducing bugs because your software’s inconsistent.

This gets really hard when you’re dealing with data flowing back and forth between XML and SQL. There’s a fundamental mismatch between a relational database that wants to store its information in columns and rows, and the tree structure of an XML document. Stylus do a great job describing the technical details of why XML is from Venus and SQL is from Mars, but the upshot is that it’s hard to find a common language that you can use to describe the data in both. A simple example is a list of the recipients of a particular email. The natural XML idiom would be something like this:

<message>
  <snipped the other data>
  <recipients>
    <email>bob@bob.com</email>
    <email>sue@sue.com</email>
  </recipients>
</message>

But in mysql, you’re completely listless. To accommodate a variable-length collection of items you need to set up a separate table that connects back to the owner of the data. In this case you might have a separate 'recipients' table with a row for each address, together with an identifier linking it back to the original message held in another table. It’s issues like this that make a native XML database like MarkLogic very appealing if you’re mostly dealing with XML documents.
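To make the mismatch concrete, here’s a minimal PHP sketch of what storing that recipients list looks like on the SQL side. The messages and recipients tables and their columns are hypothetical; the point is that the XML nesting turns into a foreign key from the child table back to the parent.

<?php
// Hypothetical tables:
//   messages(id, subject, body, ...)
//   recipients(message_id, email)
function save_recipients($pdo, $messageid, $recipients)
{
    $insert = $pdo->prepare(
        "INSERT INTO recipients (message_id, email) VALUES (?, ?)");
    foreach ($recipients as $email)
    {
        // One row per address, each pointing back at the owning message
        $insert->execute(array($messageid, $email));
    }
}
?>

Rebuilding the <recipients> element for a message then needs a query like SELECT email FROM recipients WHERE message_id = ?, which is exactly the sort of join the XML document never had to think about.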

What I’d like to do is define my email message data model once, and then derive both the XML parsing and mysql interaction code from that. That would let me rapidly change the details of what’s stored without having to trawl through pages of boiler-plate code. I’m getting close, sticking to a simple subset of XML that’s very close to JSON, but defining a general way to translate lists of items back and forth is really tough.

I’m trying to avoid being an architecture astronaut, but it’s one of those problems that feels worth spending a little bit of upfront time on. It passes the "will this save me more time than it takes in the next four weeks?" code ROI test. I’d welcome any suggestions too; this feels like something that must have been solved many times before.

How to convert individual email files to mbox

Pearls by Matuko Amini

I need to load my Exchange server with a large set of real emails, so I can simulate how my tools will work on a big organization’s mail. The best data set out there is the Enron collection, but since most researchers are doing static analysis, it’s only available in easy-to-process forms like a mysql database or as individual files. There’s no obvious way to turn them back into something that can be imported into Outlook or Exchange.

I needed a way to get it into a form that standard mail programs would recognize. The easiest format to convert individual files to is mbox. In this setup, a set of email messages is stored in a single file ending with .mbox. Within each file, messages are separated by a "From line". This consists of the characters 'F', 'r', 'o', 'm', ' ', followed by an email address and a date in asctime format. Each of these from lines must be preceded by a blank line. To make sure there’s no confusion with message content, any line beginning "From " in the body of a message must be changed to ">From ".
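For example, a separator looks something like this (the address and date here are just placeholders):

From bob@bob.com Tue Mar 18 12:11:51 2008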

Since this all involves heavy text processing, I turned to Perl. Here’s a copy of my mailconvert.pl script, and I’ve included it inline at the bottom. It will take a directory hierarchy of individual email files, and for each folder will create a mailbox.mbox that contains all of the messages in that folder. It recognises emails by the inclusion of a "From: " header, and uses that address and the date header to create a complete from line separator. Run it with the current working directory set to the root of the hierarchy. For example, cd to inside the maildir if you’re trying to convert the files extracted from the Enron tar.

I’ve tested with Apple Mail, and I’m able to import the files this generates. It’s a bit eerie seeing all the Enron mails show up in my inbox, and it’s a good reminder that these are messages that the senders never intended to be public. If you do use these mails yourself, please be respectful of their privacy.

Once you’re in mbox there’s a lot of tools available to convert them to Microsoft-friendly formats like psts. I’ll be covering those in a future article, along with some enhancements like grabbing the attachments and keeping the folder structure from the originals.

#!/usr/bin/perl
use strict;
use warnings;
use Cwd;
use POSIX;
use File::Find;
use Date::Parse;

# You need a date in the from line, though it seems redundant with the headers.
# Without a date there, Apple Mail at least won't parse the mbox files, so pick
# an arbitrary value to put in there if we don't find a header.
my $datedefault = "Tue Mar 18 12:11:51 2008";

# The name of the mbox file created from all the messages in the directory
my $outputfilename = "mailbox.mbox";

# Empty the file
open(OUTPUT, "> $outputfilename");
close(OUTPUT);

my $count = 0;

find(\&findcallback, cwd);

# This is called back for every file found, and appends the contents to the
# main mbox file for that directory, together with a from line of the format
# "From <email address> <asctime format date>" and a blank line.
sub findcallback
{
    my $file = $File::Find::name;

    # If it's not a file then don't do anything
    return unless -f $file;

    # Avoid processing the output file
    if ($file eq $outputfilename)
    {
        return;
    }

    if (!open(F, $file))
    {
        print "couldn't open $file\n";
        return;
    }

    my $text = "";
    my $from = "";
    my $date = "";
    while (<F>)
    {
        my $line = $_;
        # If this line is a From: header, and we haven't found one before, then
        # grab the address to use in the "From " separator between mail messages
        if( ($from eq "") and ($line =~ /^From: .*$/) )
        {
            $from = $line;
            $from =~ s/^From: /From /;
            # remove the new line
            $from =~ s/[\r\n]//g;
        }
        elsif ($line =~ /^From .*$/)
        {
            # If there's a line that looks like a "From " separator, add a > to
            # prevent it messing up the mbox parsing
            $line = ">" . $line;
        }

        # If this is a Date: header, then grab the value to use after the address
        if( ($date eq "") and ($line =~ /^Date: .*$/) )
        {
            my $inputdate = $line;
            $inputdate =~ s/^Date: //g;

            my $datevalue = str2time($inputdate);
            $date = POSIX::gmtime($datevalue);
        }
        $text .= $line;
    }

    close F;

    # If no date header was found, pick an arbitrary one with the correct format
    if ($date eq "")
    {
        $date = $datedefault;
    }

    # Work out the final string if this looks like a valid mail file
    if ((length($text)>0) and (length($from)>0))
    {
        my $outputstring = $from . " " . $date . "\n" . $text . "\r\n\r\n";

        open(OUTPUT, ">> $outputfilename");
        print OUTPUT $outputstring;
        close(OUTPUT);
    }
}

How you can parse XML with PHP

Photo by Dean Terry

I love XML, not because it’s an inherently beautiful format (it’s inelegant in a lot of ways; why do we have both attributes and character data?) but because for once we have a sensible and widely supported standard in the computing world. The power of this shows when you want to parse an XML file in PHP. Support is built in by default, powered by the Expat library. For small files you can use the SimpleXML wrapper that creates an object from the XML, but I need to parse large amounts of XML and didn’t want to keep all of that information in memory. Instead I’m hooking directly into the Expat event interface, which calls back into your code when tags and other data objects are encountered, and requires the caller to retain and assemble any information it wants to extract.

I’ve included the code below, and here’s a zip file of the example code together with a test XML file. It’s an expanded version of the example from the PHP manual, with the addition of character data handling and the storage of some data during the parsing. It takes the input XML file and outputs an indented version of all tags, showing any character data associated with each tag.

<?php
$file = "example.xml";
$depth = array();
$currenttagname = array();
$currenttagvalue = array();

function onStartElement($parser, $name, $attrs)
{
    global $depth;
    global $currenttagname;
    global $currenttagvalue;

    for ($i = 0; $i < $depth[$parser]; $i++) {
        echo "  ";
    }
    echo "$name\n";
    $depth[$parser]++;

    $currentdepth = $depth[$parser];

    if ($currenttagname[$parser]==null)
        $currenttagname[$parser] = array();

    if ($currenttagvalue[$parser]==null)
        $currenttagvalue[$parser] = array();

    $currenttagname[$parser][$currentdepth] = $name;
    // Start with an empty value; onCharacterData will append to this
    $currenttagvalue[$parser][$currentdepth] = "";
}

function onEndElement($parser, $name)
{
    global $depth;
    global $currenttagname;
    global $currenttagvalue;

    $currentdepth = $depth[$parser];

    $storedname = $currenttagname[$parser][$currentdepth];
    $storedvalue = $currenttagvalue[$parser][$currentdepth];

    for ($i = 0; $i < $depth[$parser]; $i++) {
        echo "  ";
    }
    echo $storedname;
    if ($storedvalue!="")
        echo " = " . $storedvalue;
    echo "\n";

    $depth[$parser]--;
}

function onCharacterData($parser, $data)
{
    global $depth;
    global $currenttagvalue;

    if ($currenttagvalue[$parser]==null)
        return; // ignore character data outside of tags

    // ignore new lines
    $data = str_replace("\n", "", $data);
    $data = str_replace("\r", "", $data);

    $currentdepth = $depth[$parser];

    $currenttagvalue[$parser][$currentdepth] .= $data;
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "onStartElement", "onEndElement");
xml_set_character_data_handler($xml_parser, "onCharacterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}
?>
<html>
<head><title>PHP XML Parsing Example</title></head>
<body><pre>
<?php

while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);
?>
</pre></body>
</html>

How to get search terms from referer logs


To give search clouds a try, I need to extract the terms that visitors used to find my site from the referer logs. Luckily that’s another place where regular expressions come in handy. Here are the regular expressions for the URLs of the three major search engines, MSN Live, Yahoo and Google, and they cope with regional variants like google.co.uk. You should run them with case sensitivity disabled.

google(\.[a-z]{2,6})?\.[a-z]{2,4}\/search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)
yahoo(\.[a-z]{2,6})?\.[a-z]{2,4}\/search\?[a-z0-9&=+%]*p=([a-z 0-9=+%]+)
live(\.[a-z]{2,6})?\.[a-z]{2,4}\/results.aspx\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)

You’ll end up with the search terms in the form "term1+term2+term3" in the second parenthesized group. If you want them as plain text, run urldecode() or an equivalent on the string. Here’s a PHP function that takes a log file location, and returns an array of all the search terms listed in the referer URLs:

function extract_search_terms($filename)
{
  $logcontents = file_get_contents($filename);

  $searchrelist = array(
    "/google(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
      "search\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i",
    "/yahoo(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
      "search\?[a-z0-9&=+%]*p=([a-z 0-9=+%]+)/i",
    "/live(\.[a-z]{2,6})?\.[a-z]{2,4}\/".
      "results.aspx\?[a-z0-9&=+%]*q=([a-z 0-9=+%]+)/i",
  );

  $result = array();
  foreach ($searchrelist as $searchre)
  {
    preg_match_all($searchre, $logcontents, $logmatches,
      PREG_PATTERN_ORDER);

    foreach ($logmatches[2] as $currentmatch)
      array_push($result, urldecode($currentmatch));
  }

  return $result;
}
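As a quick usage sketch, here’s how you might turn that array into the per-word counts a search cloud needs. The log file path is just a placeholder for wherever your server keeps its access log.

<?php
// Tally how often each word appears across all the search queries,
// ready for weighting the entries in a search cloud
$terms = extract_search_terms("/var/log/httpd/access_log");

$counts = array();
foreach ($terms as $term)
{
    // Each referer yields a full query like "php xml parsing",
    // so split it into individual words
    foreach (explode(" ", strtolower($term)) as $word)
    {
        if ($word == "")
            continue;
        if (!isset($counts[$word]))
            $counts[$word] = 0;
        $counts[$word]++;
    }
}

arsort($counts);
print_r($counts);
?>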