The SQL Trap

Photo by Beatrice Murch

Virtually every web developer starts off using a relational database like MySQL. It's so easy to use joins and sorts to implement the complex operations your service needs that pretty soon you end up with big chunks of application logic in your SQL queries. You don't know it, but you've just entered The SQL Trap.

I first heard the phrase from Jud Valeski, but it's something I've seen happen to every startup that deals with massive data sets, and I've struggled with it myself. You build a working service on a small scale, and that success brings in more users and data. Suddenly your database is the bottleneck.

At this point you have two choices. You can continue writing your queries in expressive high-level SQL and pour resources into speeding up the underlying system, or you can switch to the database equivalent of assembler with a key/value store and write application code to implement the complex operations on top of that.

In an ideal world a database is a black box: you send a query and the system figures out how to execute that operation speedily. Even in conventional databases, though, we end up deviating from that, e.g. by indexing certain columns we know we'll be querying on. After wrestling with speed problems I took a few steps beyond that by denormalizing my tables for common queries, avoiding joins at the cost of more fragile update logic. As my data grew, even that wasn't enough, and simple sorts on indexed columns were taking several minutes. I spent some time trying to second-guess MySQL's optimizer by tweaking various opaque limits, but it still insisted on sorting the few thousand rows by writing them out to disk and running something called FILESORT.

At this point I was in the trap. Getting further would require somebody with deeper knowledge of MySQL's internal workings, and would take a lot of tweaking of my queries and my system setup. Large companies end up throwing money at consultants and database vendors at this point, which is a great result for the providers!

Instead, as a starving startup, I had to bite the bullet and throw out all my lovely queries. I switched to a key/value database for storage, and designed the keys and values to get a workable subset of the information I needed for any query. I then sucked all that data into PHP and did the sorting and refining there.
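
A minimal sketch of that pattern, written in Python rather than PHP, with a plain dict standing in for the key/value server. The key scheme (`messages:<user>`), the record fields and the function names are all hypothetical, chosen only to illustrate fetching one value per query and doing the sort in application code.

```python
import json

# Stand-in for a key/value server: string keys mapped to string values.
# A real deployment would use a client for an actual store instead.
store = {}

def put_messages(user_id, messages):
    """Store everything one query needs under a single key (hypothetical scheme)."""
    store[f"messages:{user_id}"] = json.dumps(messages)

def recent_messages(user_id, limit):
    """Fetch one value, then sort and trim in application code instead of SQL."""
    raw = store.get(f"messages:{user_id}")
    if raw is None:
        return []
    messages = json.loads(raw)
    # In-memory equivalent of ORDER BY sent_at DESC LIMIT n.
    messages.sort(key=lambda m: m["sent_at"], reverse=True)
    return messages[:limit]

put_messages("alice", [
    {"id": 1, "sent_at": 100},
    {"id": 2, "sent_at": 300},
    {"id": 3, "sent_at": 200},
])
newest = recent_messages("alice", 2)
```

The trade-off is the one described above: the write path has to keep each denormalized value up to date, but the read path becomes a single lookup plus a sort whose cost you control.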

After the initial pain it was a massive relief. I was able to use my standard computer science skills to design the right data structures and algorithms for my operations, rather than trying to second-guess the black box of SQL optimization. Sure, I've now got a whole different set of problems with more complex application code to maintain, but it's taking a lot less time and resources than the SQL route.

Don't get me wrong, premature optimization is still the root of all evil, but if you're dealing with massive datasets and your database is becoming a bottleneck, consider dropping SQL and falling back to something more primitive, before you're trapped!

How to get Tokyo Tyrant working in PHP

Photo by NNE

Regular readers know that I've been both entranced and frustrated by Tokyo Tyrant. It's an elegantly minimal key/value database server with great performance, but I've burnt days trying to get it running reliably with PHP.

I'm extremely happy to say I've now got it working, and it's everything I dreamed it could be. The major bug stopping me was truncation of values more than 16k in size, and that turns out to be a bug in the Net_TokyoTyrant PHP wrapper (and arguably a bug in PHP's libraries). The wrapper was using a single fread() call to get values, but on a network stream a single call can return less than the requested length, so it needs to be called repeatedly in a loop to get the full result. Jeremy Hinegardner got me attacking this again after he confirmed he was using Tokyo successfully through Ruby. After some debugging made me suspicious of fread's reliability, Blagovest Buyukliev's post confirmed it was the cause and gave me a drop-in fix.
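
The general shape of that kind of fix, sketched in Python rather than PHP, looks something like the following. This is not the actual patch; `read_exactly` is a hypothetical helper name, and it assumes the caller already knows how many bytes the value should contain.

```python
import socket

def read_exactly(sock, length):
    """Keep calling recv() until `length` bytes have arrived or the peer closes."""
    chunks = []
    remaining = length
    while remaining > 0:
        # A single recv() may return fewer bytes than requested.
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError(
                f"connection closed with {remaining} bytes still expected"
            )
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

The important detail is that the loop counts the bytes actually received rather than assuming one read returns everything, which is where a single-call version silently truncates large values.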

I can't find a way to contact the original author of Net_TokyoTyrant to offer a patch, but the code is included in this updated unit test, tokyotest.php.

Incidentally, I'd highly recommend running through a unix file socket rather than a network socket on localhost; that's been a massive speedup for my use cases.

Why I hate client-side code (and the cloud will win)

Photo by Saiki

Most of my career's been spent on desktop or embedded systems code and I'm a relative newcomer to web programming. Despite the horrors of server-side development (debugger? ha!) it's so much faster to develop web services than traditional apps. The main reason is that I have control over far more of the environment when the code is running on my own box and I'm only relying on a client to display the UI. The testing matrix for Apple Motion was insane because it ran on the GPU, every piece of hardware behaved differently, and so as new graphics cards and machines came out the combinations we had to check exploded.

So, I have a lot of sympathy with Microsoft, and the Xobni folks doing client-side processing, but this novel-length KB article on troubleshooting Outlook crashes sums up why users are so happy with web apps, despite their limitations.

An implicit data bill of rights

Photo by Vkx462

I've been lucky enough to spend some time with Ken Zolot this week, who's heavily involved with startups both through MIT and the Kauffman Foundation. He threw some fantastic papers in my direction, and one of the most interesting finds was a proposal by Alex Pentland on data privacy, what he calls a New Deal on Data. I've been wrestling with how to use implicit data on people's behavior in an ethical and honest way, and Pentland's definition is really helpful.

He draws on English Common Law principles of possession, use and disposal, and applies them to data about ourselves, to match our intuitive feelings of ownership of information about ourselves.

1. You have a right to possess your data. Companies should adopt the role of a Swiss bank account for your data. You open an account (anonymously, if possible), and you can remove your data whenever you’d like.

2. You, the data owner, must have full control over the use of your data. If you’re not happy with the way a company uses your data, you can remove it. All of it. Everything must be opt-in, and not only clearly explained in plain language, but with regular reminders that you have the option to opt out.

3. You have a right to dispose or distribute your data. If you want to destroy it or remove it and redeploy it elsewhere, it is your call.

In practice these make some technical demands for the ability to export and delete information that few services provide. Try saving out your friend graph from Facebook without violating their terms-of-service!

This makes it a tough sell for corporations built around hoarding users' information as a proprietary asset. In the long term, though, users sharing information widely will benefit services that don't lock them in. You can already see that with Twitter's API; their lack of restrictions has led to applications that weren't even imagined before the data became available.

How can you measure influence?


Influence is the measure of your ability to persuade others to take an action. Micah of Lijit gave a barnstorming talk at Boulder NewTech last night, describing how they are starting to measure bloggers' influence. It's not publicly released yet, but they're combining both raw audience figures and the user activity they measure through the Lijit widget, things like searches and clicks.

This is exciting to me: nobody's been able to use implicit data on people's behavior in a widespread way, since nobody's had access to a large enough set. I'm bullish on Lijit's prospects because they are in a unique position, with hundreds of millions of user interactions across thousands of sites in their database (OneRiot is the only other company I can think of that's got access to more info, through its browser add-on).

Lijit's measure is a big step forward, but did leave me with a couple of questions. Influence has to be defined around an action, but their measure seems to be positioned as a universal metric. Lolcats is a lot more likely to make me buy a t-shirt than the Sunlight Foundation, but lolcats has no influence on how I vote. If you pick a single influence number you can't capture that.

There's also the question of who you're influencing. A picture on lolcats will get a lot more pageviews than a post on Brad's blog, but a lot more influential people in the tech community will see the blog post. Google's PageRank tackles this by taking the influence of the people who link to a site into account to calculate its influence. That means a bunch of barely-read geocities (RIP) sites linking to you doesn't matter as much as a link from the New York Times. There's no equivalent way of compensating for the relevance of the users whose activity you're measuring. Having a single Steve Jobs viewing your pages is more influential than 1000 random teenagers.

I've some thoughts on fixing this, and actually started running PageRank on Twitter conversations to figure out who was most influential on the service, but had to put that on hold to focus on other work. I can't wait for Lijit to launch the rankings; despite all my niggling, this should be a massive jump forward!
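
For readers who haven't seen it, the PageRank idea of weighting a link by the influence of whoever makes it can be sketched in a few lines of Python. This is a simplified textbook-style power iteration, not Google's production algorithm and not the code from the Twitter experiment; the damping value, iteration count and the tiny example graph are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with a small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly across the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank over everyone.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical graph: "you" gets one link from a well-linked site,
# "rival" gets two links from pages nobody links to.
links = {
    "reader1": ["nytimes"],
    "reader2": ["nytimes"],
    "reader3": ["nytimes"],
    "nytimes": ["you"],
    "geocities1": ["rival"],
    "geocities2": ["rival"],
}
ranks = pagerank(links)
```

In this toy graph "you" ends up ranked above "rival" despite having fewer inbound links, because the single link comes from a node that is itself well linked. Applying the same idea to user activity would need some comparable graph of who influences whom.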

Are you taking market risk or technology risk?

Photo by MHuang

If you're working in the pharmaceutical industry, your main risk is that your new treatment won't work. There's a massive number of medical problems people are certain to pay money to solve, if you can create a drug that works.

In the toy business, it's completely different. Building that new Pet Rock or Cabbage Patch Doll is easy, but for very hard-to-predict reasons people may not like it. You may not be able to distribute or market it even if there are some who do.

Most startups lie on a spectrum between these two extremes of technology and market risk, but I've learnt it's crucial to understand what your mix is. People from a business background prefer market risk, because that's something they know how to measure and mitigate. Techies like me have a bias towards hard engineering problems that they know how to solve.

I started off thinking that Mailana's main risk was technology: it's really hard to integrate with Exchange, build Outlook plugins and analyze millions of emails in real time. There were all sorts of end-user problems that could be solved with the information derived from that, so once the system was built, customers would come. You can chuckle at my naivety, but I never understood that there were two separate risks. I put a lot more effort into coding than understanding the market, and then discovered there were all sorts of unexpected cultural issues around privacy that scuppered my first attempt when it was in front of customers.

The beautiful thing about market risk is that you can take very simple steps to reduce it before you spend months coding. Build slideware and ask your potential customers if they'd buy a working version. Buy some relevant AdWords and point them at a dummy product page to see if anyone signs up for more information.

If you're reading this, you're not Pfizer and you do have a market risk. Take a long hard look at your business and see what you can do to reduce it.