Sewers and startups

Photo by Elsie Esq.

I had a chance to chat with Matt Mullenweg yesterday, and he focused on something I've been struggling with: building to last. It also got me thinking about plumbing.

Joseph Bazalgette is one of my engineering heroes. He built London's first sewers in the 19th century, and started by estimating how large they'd need to be to cope with the current population. He then said "Well, we're only going to be doing this once and there's always the unforeseen" and doubled the diameter! Thanks to his foresight and the beautiful workmanship of the bricklayers, those same sewers are still serving Londoners today, despite a population many times larger.

The biggest enemy of early-stage startups is time. We can't afford premature scaling, because before we've finished building a system robust enough to handle millions of active users we'll have run out of money. That means we end up accumulating technical debt as we struggle to get customers and revenue with the least possible amount of code.

The danger is we end up successful, but so deeply mired in technical debt that we spend all our time paying interest rather than making meaningful progress with the product (see the last decade of Windows). As Vernor Vinge evokes so well, there's a good chance some of our code will be in the lower layers of the stack essentially forever. It's a deep engineering sin to inflict shoddy sewers on future generations.

Matt's key insight was "When you're in the red, time is working against you. Once you're profitable, time is on your side". Getting to even Ramen profitability changes everything, and gives you the ability to build for the long term.

When I joined Apple back in 2003, the central build farm for all projects had both PowerPC and x86 Darwin boxes, and our code had to compile on both. Steve was playing a long game: years before the Intel switch he was obviously planning for it (though I only caught the significance in retrospect).

Looking at WordPress, you can see the same combination of long-term planning sustained by profitability. A lot of focus in the startup world is on exits, but I'll be ecstatic if I'm still helping build Mailana in 20 years' time. Seeing Matt's dedication to building something to last gave me hope, especially as he gave practical steps to get there.

The SQL Trap

Photo by Beatrice Murch

Virtually every web developer starts off using a relational database like MySQL. It's so easy to use joins and sorts to implement the complex operations your service needs that pretty soon you end up with big chunks of application logic in your SQL queries. You don't know it, but you've just entered The SQL Trap.

I first heard the phrase from Jud Valeski, but it's something I've seen happen to every startup that deals with massive data sets, and I've struggled with it myself. You build a working service on a small scale, and that success brings in more users and data. Suddenly your database is the bottleneck.

At this point you have two choices. You can continue writing your queries in expressive high-level SQL and pour resources into speeding up the underlying system, or you can switch to the database equivalent of assembler with a key/value store and write application code to implement the complex operations on top of that.

In an ideal world a database is a black box: you send a query and the system figures out how to execute that operation speedily. Even in conventional databases, though, we end up deviating from that, e.g. by indexing certain columns we know we'll be querying on. After wrestling with speed problems I took a few steps beyond that by denormalizing my tables for common queries, avoiding joins at the cost of more fragile update logic. As my data grew, even that wasn't enough, and simple sorts on indexed columns were taking several minutes. I spent some time trying to second-guess MySQL's optimizer by tweaking various opaque limits, but it still insisted on sorting the few thousand rows by writing them out to disk and running something called FILESORT.

At this point I was in the trap. Getting further would require somebody with deeper knowledge of MySQL's internal workings, and would take a lot of tweaking of my queries and my system setup. Large companies end up throwing money at consultants and database vendors at this point, which is a great result for the providers!

Instead, as a starving startup I had to bite the bullet and throw out all my lovely queries. I switched to a key/value database for storage, and designed the keys and values to get a workable sub-set of the information I needed for any query. I then sucked all that data into PHP and did the sorting and refining there.
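The pattern is simple enough to sketch in a few lines. This is an illustrative sketch in Python rather than the PHP I actually used, with a plain dict standing in for the key/value store and hypothetical key and field names:

```python
import json

# A plain dict stands in for the key/value store here; the key layout
# and field names are hypothetical illustrations, not my real schema.
store = {}

def put_message(user, timestamp, payload):
    # Encode the dimensions you'll query on directly into the key, so
    # a cheap prefix scan replaces an indexed SQL lookup.
    key = "msg:%s:%010d" % (user, timestamp)
    store[key] = json.dumps(payload)

def recent_messages(user, limit=10):
    # Pull every candidate value out of the store, then sort and trim
    # in application code -- the work the SQL engine used to do.
    prefix = "msg:%s:" % user
    rows = [json.loads(v) for k, v in store.items() if k.startswith(prefix)]
    rows.sort(key=lambda r: r["timestamp"], reverse=True)
    return rows[:limit]
```

The design choice is that the key carries the query structure, so the expensive, opaque parts of the SQL engine are replaced by data-structure decisions you control directly.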

After the initial pain it was a massive relief. I was able to use my standard computer science skills to design the right data-structures and algorithms for my operations, rather than trying to second-guess the black box of SQL optimization. Sure I've now got a whole different set of problems with more complex application code to maintain, but it's taking a lot less time and resources than the SQL route.

Don't get me wrong, premature optimization is still the root of all evil, but if you're dealing with massive datasets and your database is becoming a bottleneck, consider dropping SQL and falling back to something more primitive, before you're trapped!

How to get Tokyo Tyrant working in PHP

Photo by NNE

Regular readers know that I've been both entranced and frustrated by Tokyo Tyrant. It's an elegantly minimal key/value database server with great performance, but I've burnt days trying to get it running reliably with PHP.

I'm extremely happy to say I've now got it working, and it's everything I dreamed it could be. The major bug stopping me was truncation of values more than 16k in size, and that turns out to be a bug in the Net_TokyoTyrant PHP wrapper (and arguably a bug in PHP's libraries). The wrapper was using a single fread() call to get values, but this has size limits, and so it needs to be called repeatedly in a loop to get the full result. Jeremy Hinegardner got me attacking this again after he confirmed he was using Tokyo successfully through Ruby. Once some debugging made me suspicious of fread's reliability, Blagovest Buyukliev's post confirmed it was the cause and gave me a drop-in fix.
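The fix is just a loop that keeps reading until the expected number of bytes has arrived. Here's the same idea sketched in Python rather than the wrapper's PHP (the function name is mine):

```python
def read_exactly(stream, length):
    # A single read() may legally return fewer bytes than requested on
    # a socket, so keep reading until the full value has arrived -- the
    # same looping fix applied to fread() in the PHP wrapper.
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise IOError("connection closed with %d bytes left" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

The PHP version does the equivalent: call fread() repeatedly, appending to the result, until the value length declared by the server has been consumed.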

I can't find a way to contact the original author of Net_TokyoTyrant to offer a patch, but the fixed code is included in this updated unit test, tokyotest.php.

Incidentally, I'd highly recommend connecting through a unix file socket rather than a network socket on localhost; that's been a massive speedup for my use cases.

Why I hate client-side code (and the cloud will win)

Photo by Saiki

Most of my career's been spent on desktop or embedded systems code, and I'm a relative newcomer to web programming. Despite the horrors of server-side development (debugger? ha!), it's so much faster to develop web services than traditional apps. The main reason is that I have control over far more of the environment when the code is running on my own box and I'm only relying on a client to display the UI. The testing matrix for Apple Motion was insane: because it ran on the GPU, every piece of hardware behaved differently, and as new graphics cards and machines came out the combinations we had to check exploded.

So, I have a lot of sympathy with Microsoft, and the Xobni folks doing client-side processing, but this novel-length KB article on troubleshooting Outlook crashes sums up why users are so happy with web apps, despite their limitations.

An implicit data bill of rights

Photo by Vkx462

I've been lucky enough to spend some time with Ken Zolot this week, who's heavily involved with startups both through MIT and the Kauffman Foundation. He threw some fantastic papers in my direction, and one of the most interesting finds was a proposal by Alex Pentland on data privacy, what he calls a New Deal on Data. I've been wrestling with how to use implicit data on people's behavior in an ethical and honest way, and Pentland's definition is really helpful.

He draws on the English Common Law principles of possession, use and disposal, and applies them to data about ourselves, matching our intuitive feelings of ownership over that information.

1. You have a right to possess your data. Companies should adopt the role of a Swiss bank account for your data. You open an account (anonymously, if possible), and you can remove your data whenever you'd like.

2. You, the data owner, must have full control over the use of your data. If you're not happy with the way a company uses your data, you can remove it. All of it. Everything must be opt-in, and not only clearly explained in plain language, but with regular reminders that you have the option to opt out.

3. You have a right to dispose of or distribute your data. If you want to destroy it or remove it and redeploy it elsewhere, it is your call.

In practice these make some technical demands for the ability to export and delete information that few services provide. Try saving out your friend graph from Facebook without violating their terms-of-service!

This makes it a tough sell for corporations built around hoarding users' information as a proprietary asset. In the long-term though, the benefits of users sharing information widely will benefit services that don't lock in their users. You can already see that with Twitter's API; their lack of restrictions has led to applications that weren't even imagined before the data became available.

How can you measure influence?


Influence is the measure of your ability to persuade others to take an action. Micah of Lijit gave a barn-storming talk at Boulder NewTech last night, describing how they are starting to measure bloggers' influence. It's not publicly released yet, but they're combining both raw audience figures and the user activity they measure through the Lijit widget, things like searches and clicks.

This is exciting to me because nobody's been able to use implicit data on people's behavior in a widespread way, because nobody's had access to a large enough set. I'm bullish on Lijit's prospects because they are in a unique position with hundreds of millions of user interactions across thousands of sites in their database (OneRiot are the only other company I can think of that's got access to more info through its browser add-on).

Lijit's measure is a big step forward, but it did leave me with a couple of questions. Influence has to be defined around an action, but their measure seems to be positioned as a universal metric. Lolcats is a lot more likely to make me buy a t-shirt than the Sunlight Foundation is, but it has no influence on how I vote. If you pick a single influence number you can't capture that.

There's also the question of who you're influencing. A picture on lolcats will get a lot more pageviews than a post on Brad's blog, but a lot more influential people in the tech community will see the blog post. Google's PageRank tackles this by taking the influence of the people who link to a site into account to calculate its influence. That means a bunch of barely-read geocities (RIP) sites linking to you doesn't matter as much as a link from the New York Times. There's no equivalent way of compensating for the relevance of the users whose activity you're measuring. Having a single Steve Jobs viewing your pages is more influential than 1000 random teenagers.
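The PageRank idea mentioned above is a short power iteration. Here's a minimal sketch, assuming a simple link map and skipping refinements like dangling-node handling:

```python
def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    # A page's score is fed by the scores of the pages linking to it,
    # so one link from a high-rank site outweighs many links from
    # obscure ones.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        incoming = {p: 0.0 for p in pages}
        for page, targets in links.items():
            if targets:
                share = rank[page] / len(targets)
                for t in targets:
                    incoming[t] += share
        rank = {p: (1 - damping) / len(pages) + damping * incoming[p]
                for p in pages}
    return rank
```

With a link map where several sites point at a New York Times-like hub and nothing points at a lone geocities page, a single outbound link from the hub ends up worth more than the geocities link, which is exactly the compensation for linker quality described above.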

I've some thoughts on fixing this, and actually started running PageRank on Twitter conversations to figure out who was most influential on the service, but had to put that on hold to focus on other work. I can't wait for Lijit to launch the rankings; despite all my niggling, this should be a massive jump forward!

Are you taking market risk or technology risk?

Photo by MHuang

If you're working in the pharmaceutical industry, your main risk is that your new treatment won't work. There's a massive number of medical problems people are certain to pay money to solve, if you can create a drug that works.

In the toy business, it's completely different. Building that new Pet Rock or Cabbage Patch Doll is easy, but for very hard-to-predict reasons people may not like it, and even if some do, you may not be able to distribute or market it.

Most startups lie on a spectrum between these two extremes of technology and market risk, but I've learnt it's crucial to understand what your mix is. People from a business background prefer market risk, because that's something they know how to measure and mitigate. Techies like me have a bias towards hard engineering problems we know how to solve.

I started off thinking that Mailana's main risk was technology. It's really hard to integrate with Exchange, build Outlook plugins and analyze millions of emails in real time; there were all sorts of end-user problems that could be solved with the information derived from that, so once the system was built, customers would come. You can chuckle at my naivety, but I never understood that there were two separate risks. I put a lot more effort into coding than understanding the market, and then discovered there were all sorts of unexpected cultural issues around privacy that scuppered my first attempt when it was in front of customers.

The beautiful thing about market risk is that you can take very simple steps to reduce it before you spend months coding. Build slideware and ask your potential customers if they'd buy a working version. Buy some relevant AdWords and point them at a dummy product page to see if anyone signs up for more information.

If you're reading this, you're not Pfizer and you do have a market risk. Take a long hard look at your business and see what you can do to reduce it.

How to make connections with people you don’t know

Matt Van Horn from Digg gave a talk I wasn't expecting last night: the practical side of networking. The whole mission of Mailana is "You guys should talk"; I love it when I can connect two people who can help each other. To make that happen, you have to be able to build bridges with strangers, and Matt revealed his personal toolkit for reaching the right people.

Matt started off with the LOLCAT picture because if you're not used to networking it can feel creepy and exploitative. What I've realized, and Matt emphasized, is that you need to approach it as a way of helping other people, not just as a chance to take. It's a long-term project, not something you desperately turn to at the last minute when you need a job.

Having said that, the story of how Matt got to be business development manager at Digg is an example of how chutzpah pays off. He targeted Digg as a company he really wanted to work for, and queued for 2 hours at a trade show to get a business card from Jay Adelson, their CEO. After that he emailed him repeatedly trying to set up a meeting, as well as sending on relevant newspaper articles to the Digg offices. Then he guessed a couple of email addresses for their recruiter and CRO, and eventually landed an interview. They asked him to write a detailed description of the position he wanted to create in the company and how it would help Digg. Finally that landed him the job! Wouldn't you hire somebody who showed that much determination and resourcefulness?

Here are a few of the tips Matt gave out for getting in touch with people you want to talk to, but can't get a 'warm' introduction for:

Guess email addresses. Most companies have a fixed format, e.g. pete.warden@company.com, pwarden@company.com, pete@company.com. Figure it out from public examples or just guess and fire off a message.

Call at odd hours. Receptionists are usually only there 9-5, but most of us work before and after, so there's a good chance somebody helpful will pick up if you ring 7:00am to 9:00am or 5:00pm to 7:00pm.

Contact them through random social networks. Last.fm and other common sites with a social element have ways of sending their users messages. If you can find the account of the person you're looking for, send them a message and it will most likely show up in their regular inbox.

Send an 'I've worked with you' connect request on LinkedIn. Even if you haven't been a colleague, you've got a chance to explain in the note why you want to talk to them. I have a 'pro' account on LinkedIn which lets me send a limited number of messages to people outside my connections, but a sparing use of this approach is much cheaper!
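The address-guessing tip above is mechanical enough to sketch. A minimal generator in Python, using the patterns from the examples; treat the output as guesses to confirm against a public example, not a list to spam:

```python
def guess_emails(first, last, domain):
    # Generate the common corporate address formats; which one a
    # company actually uses should be checked against a public example
    # before firing off a message.
    first, last = first.lower(), last.lower()
    patterns = [
        "%s.%s" % (first, last),    # pete.warden@
        "%s%s" % (first[0], last),  # pwarden@
        first,                      # pete@
        "%s_%s" % (first, last),    # pete_warden@
        "%s%s" % (first, last),     # petewarden@
    ]
    return ["%s@%s" % (p, domain) for p in patterns]
```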

I have a few more ideas I've found effective:

Blog about people or companies you like. I'll often spend time researching companies or entrepreneurs I think are really cool so I can learn something, and then share it as a blog post. An awesome side-effect of that is that I often hear back from the people I've written about, and that's how some of my best collaborations have come about. As I wrote in Beetlejuice, Beetlejuice, Beetlejuice, just saying someone's name on the internet is often the best way to get in touch.

Comment on their blogs or Twitter streams. I find myself doing this naturally with interesting people I'm following, but it's also a great way to build a relationship and demonstrate a sustained commitment.

The key to all of these is thoughtfulness and sincerity. If you really don't care about what they're doing it will come across and you'll just be wasting time. Be natural, be passionate. Follow up, and show you're listening by referencing previous conversations when you do. Spend more time figuring out how you can help them than how they can help you.

Skynet runs on Windows/MFC


Liz and I were re-watching season 2 of the Sarah Connor Chronicles when I spotted some familiar-looking code on John Henry's bootup screen. WM_ACTIVATETOPLEVEL sure looks like a Win32 constant, and googling led me to MSDN documentation revealing it's a private message associated with MFC. It looks like autosysconf is running some C++ code to boot up the AI. The other evidence in the series is ambiguous about which side John Henry is on, but AI code in MFC is clearly evil. Interestingly, the most common use of FEP is as the Symbolics Lisp Machine's front end processor, which would suggest a much more sensible language.

Interestingly, though John Henry (and presumably his brother Skynet) appears to be Windows/x86-based, Terminators are known to use Apple II/6502 processors. The thought of dealing with porting between those two almost makes me feel sorry for our future robot overlords.