How to profile MySQL

Rocketlaunch

Photo by davesag

MySQL 5.037 added a profiler, and I've been using it to track down bottlenecks in my email analysis. If you're doing any serious database access, it will be a godsend. I'd recommend starting by looking through the official introduction but here's the highlights:

Quickstart

– Log into the mysql command line client
– type set profiling=1;
– Run your slow query (eg SELECT * FROM messages WHERE fromaddress='pete@petewarden.com';)
– type SHOW PROFILES;

You should now see a table with a single row, with the number 1 to the left of your query.

– type SHOW PROFILE FOR QUERY 1;

You'll see a table summarizing how long each stage of executing the query took:

+--------------------+-----------+
| Status | Duration |
+--------------------+-----------+
| (initialization) | 0.000063 |
| Opening tables | 0.000015 |
| System lock | 0.000005 |
| Table lock | 0.120696 |
| init | 0.000056 |
| optimizing | 0.000017 |
| statistics | 0.000249 |
| preparing | 0.000026 |
| executing | 0.000008 |
| Sorting result | 0.000005 |
| Sending data | 11.371813 |
| end | 0.000019 |
| query end | 0.000006 |
| freeing items | 0.00002 |
| closing tables | 0.000011 |
| logging slow query | 0.000004 |
+--------------------+-----------+
16 rows in set (0.00 sec)

What each row means

Some of the status names were self-explanatory. If I saw a lot of time in Table lock I knew that had to be waiting for another process to finish updating the same table. What on earth does Sending Data mean though?

I eventually found the answer in the list of thread states, after chasing some links from the profile reference page.

Use the source, Luke

The entry for Sending Data is The thread is processing rows for a SELECT statement and also is sending data to the client. That still didn't help me understand why it was stuck in that state for so long. The open-source nature of MySQL came to the rescue. By doing SHOW PROFILE source FOR QUERY 1; I got this table:

+--------------------+-----------+-----------------------+---------------+-------------+
| Status | Duration | Source_function | Source_file | Source_line |
+--------------------+-----------+-----------------------+---------------+-------------+
...
| Sending data | 11.371813 | mysql_select | sql_select.cc | 2287 |
...
+--------------------+-----------+-----------------------+---------------+-------------+

That told me exactly which file and function was taking all that time. By heading over to the MySQL product archives for my 5.0.45 version, I was able to download a snapshot of the full source code. If you have a different version, go to the list of releases, open the page for yours and then look for the source link at the bottom.

I brought up that function and could see exactly what the underlying code was doing. It still didn't completely clear up what was happening, but I think it was getting jammed on a lock waiting for another thread in my case, so I was able to focus on other simultaneous accesses to the database as the culprit.

Good old EXPLAIN

I'm mentioning this last, but it should be the first thing you try. The most useful tool for debugging slow performance is putting EXPLAIN at the start of your query and seeing what the SQL optimizer thinks of it. Look at whether it has found an index to use. If it hasn't, it's scanning rows to check their values which rapidly makes it unusable as your dataset gets larger. Look at the far right box for notes on optimizations. If you see Using filesort when you're ordering your results, run screaming and reindex your data to allow fast sorts, since this also takes forever with lots of data.

Three good reasons you should use Amazon Web Services, and one bad

Amazon

Photo by Don Cram

Amazon Web Services has received a lot of coverage since it launched. I've held off investigating it, since it smelt like one of those shiny things that end up distracting me from my core work. Last week I realized I was hitting the limits of my cheap dedicated server and needed some sort of change, so I decided to dip my toe into AWS. Two days later I had a big smile on my face, and my whole service switched over. I was surprised at how much it offered to an early-stage startup like mine. Here's some unexpected benefits of the change:

It saves me time on administration

I had to spend a few hours upfront creating a custom AMI disk image with all the software I needed pre-installed, and a couple of EBS volumes for my site code and database files. With that done, it is literally a 30 second job to create a new server instance, mount the volumes and have a new copy of the site running. With one click I can create a remotely stored backup of all my data. No early-stage startup could afford to create the infrastructure to make things this easy themselves. As a big bonus, the wonderful ElasticFox Firefox extension gives you a very fast way to manage all your services too. It's already saved me hours of messing around.

I can rapidly prototype

Having an easy way to revert to frequent backups means I can stop worrying about trashing my system and try out risky changes without fear. A straightforward way of creating new servers lets me leave a production version running while I try out something scary on a development machine. It's like the difference a version control system makes for source code. Suddenly I'm able to rapidly develop new systems, safe in the knowledge I can quickly back out my changes if they prove to be a mistake. This has helped me add new features much faster than before.

I'm only paying for what I use

If I need to do some processing on the 500,000 emails in the Enron database, having a blazingly-fast machine can save me a lot of waiting. With EC2 I pay by the hour, so I can pay $1.60 for 120 minutes of their fastest machine, and then switch back to something much cheaper for standard web serving. This flexibility lets me keep my costs very low, without compromising performance.

There is one very tempting reason to move to Amazon, but I think it's a trap:

Scalability

If you're an engineer at an early-stage startup, it's a lot of fun to spend time building to cope with a million visitors. After all, look at Twitter's troubles, wouldn't they have been better off with a more scalable architecture? No, premature scalability is a company-killing mistake. Having too many visitors to your site is a great problem to have, and at that point you'll have people throwing money at you, you can use that to re-engineer. Most companies die before that point because they don't release, and time spent on scalability is time taken from features, bug-fixing and actually getting the damn thing shipped.

If the only appealing thing about AWS for your unreleased project is scalability, forget it and focus on development.

The best automatic tagger you’ve never heard of

Gifttags

Photo by Sarah Parrot
I was searching for other applications that were using Wikipedia entry titles for semantic analysis of texts, when I came across Chris Sizemore's conText experiment. Testing it out, I was blown away by how well it worked as an automatic tagger, better than commercial semantic analysis solutions like OpenCalais or SemanticHacker.

I used the same two texts I tried those two services with, an asteroid news article and one of my own blog posts. Here's the top ten results for the news article:


Asteroid_deflection_strategies


Asteroids_in_fiction


99942_Apophis

Impact_event
Near-Earth_asteroid
Asteroid


Planetary_defense
Human_extinction
Space_colonization
Risks_to_civilization,_humans_and_planet_Earth

And for my 4th of July blog post:

Thunder_on_the_mountain
The_Dilbert_Future


Art_of_Motion_(album)
Joseph_T._Bockrath


List_of_Elvis_Presley_songs


List_of_disco_artist
Songs_of_the_Century
Farris_Hassan
To_Tell_the_Tooth

List_of_Beatles_songs

9 of the top 10 results for the asteroid article are one's I'd pick as good categories for it. That's a much better hit ratio than OpenCalais or SemanticHacker in my tests. The results for the blog post are all completely unrelated, but the commercial tools do only slightly better, picking out one or two related concepts. Having text that's full of abstract musings rather than concrete nouns seems to be bad news for any semantic analysis.

In fairness I should mention that both OpenCalais and SemanticHacker are not primarily aimed at my goal, which is to automatically extract a small set of categories from short-form pieces of text (eg emails), so the comparison isn't apples to apples. It is still good news for me that Chris' approach is so useful for my purpose though.

What's really fun about his project is that it's a true garden-shed effort, produced as part of the BBC radio labs from open-source parts without requiring a massive development budget. Here's how he did it:

– Download the whole of Wikipedia, and save out each article as a file on disk.
– Index all those files using the open-source search framework Lucene.
– For every candidate text, use 'More like this' (Lucene's equivalent of Google's related sites) to generate a list of the most similar Wikipedia articles.

I really like this approach. It's all statistically-based so you get the advantage of very broad and robust coverage and don't have to sweat over hand-tuned vocabularies. I'm also a firm believer in using Wikipedia as a list of concepts for semantic analysis. The one downside is that the current implementation of the 'More like this' functionality is slow, it can take 20-30 seconds to process an article. Happily that seems open to improvement, rather than anything fundamental.

How to find a technical co-founder

Nerdfinger

Photo by Bayat

A non-technical friend of mine recently asked for my advice as a card-carrying geek: If he doesn't have the right person in his immediate network, how can he find a technical co-founder?

When you're searching, there are two things a good engineer won't care about:

Job title
Job description

Here's three things the engineer should be passionate about:

Your big vision
Getting things done
Being the alpha geek

The job title one is almost a cliche, but if there's one
thing that drives most engineers crazy it's 'office politics', by which
they mean any status system that's not based around pure technical
ability. Caring about fancy job titles is seen as a sign you can't win
respect by the sheer awesomeness of your code. 'Founder' should be
enough if you need one to throw around.

All you should write for the job description is 'Create the
next X', or 'Build the Y-killer', or some other world-changing goal.
You don't want an employee waiting to be told what to do, you need an owner who takes the company's mission and figures out technical steps
it will take to achieve it. You should have enough familiarity with the
likely tools to weed out people without some relevant experience before
you talk to them, but I would keep that mental list away from
candidates, since the standard HR checklist of '5 years PHP experience'
is a sign of clueless-ness, and will put off people with equally good
alternative qualifications (eg Ruby on Rails instead of PHP).

So, without a job description, how do you find someone? Mailing
lists, blogs and web forums are a fantastic way to find undiscovered talent.
Find a few open-source projects that are related to the area you're
interested in. Dig out their developer mailing lists, and look at who's
posting frequently. Look through their messages and you'll get a better
sense of who they are than any interview could give you. They're
already passionate about the area if they're on the list, and
you can easily see who's rude, who can communicate, who's willing to go
out of their way to help others, who's super-smart, and who bites the bullet and implements things
rather than just talking about them. You'll also get to see code
samples, and even if you're not technical a quick scan can show you if
they're writing vaguely intelligible code. Their interaction with the
group also shows who's hungry to stay on top of the technical issues,
and so can remain alpha geek and keep the respect of engineering subordinates.

Technical conferences and user groups are great hunting grounds
too, the nerdier the better. I've worked with so many people who are
amazing engineers but interview like Rainman. If you can catch us in
our natural habitat, talking about things we love, you've got a much
better chance of seeing us in a true light.

Once you have someone in your sights, it's like dating, I'm not
sure there's any formula. The goal is to convince them you're someone
they can trust, and that you add value yourself. Demonstrating
things like your ability to get press coverage, how you can increase
the chances of being funded, and bringing in revenue are essential.
It's also tricky because engineers often discount the value of those
skills, so bringing them down to concrete results helps, eg 'Blogger
Joe wrote about my last three releases, he'll write about this
project', 'I raised x with my last company, the same investors are
interested in my new venture'. Bringing on previous technical partners
who can talk engineer is vital to help gain trust too, sort of like
duck decoys.

I've worked with so many great engineers who would make fantastic co-founders, but they're uncomfortable with traditional networking and self-promotion. You'll be doing them a favor if you hunt them down yourself!

Reasons to be cheerful

Piercingsmile

Photo by bdebaca

There's a lot of depressing posts floating around the startup blogosphere. Here's why I'm still a happy camper:

We're the luckiest people on the planet

If you're reading this, you're almost certainly a well-educated inhabitant of a developed country, with strong computer skills and enough time on your hands to browse the internet. Times may get tougher in a relative sense, but the options and resources we have make us part of the world's upper class.

Family and friends aren't going anywhere

There's no stock index for the value people bring to your life. No matter what happens, I can still go home to my family and friends.

My competitors are in the same boat

The barriers to entry have been raised for everyone. It's going to be harder for me to raise new funding and make sales, but that's true for everyone around me. Even funded startups will have a tougher time meeting their targets and getting follow-on rounds. The upside of being early-stage is that adaption is far less painful. If I can't raise enough funding to attack the enterprise market directly, I can go lean, and get revenue and domain knowledge by selling the business tools directly to individuals.

Tough times favor trying new things

There are a lot more pressing reasons to change the way you do things when businesses are under pressure. As Nat Torkington says, depressions force technologists away from iterating on existing solutions, and onto radically new approaches. The sudden increase in painful problems means lots of new opportunities to solve them.

There will be a lot less noise

A lot fewer businesses will get started. That means my startup will find it a lot easier to get noticed, rather than competing with the latest toothpick collector social network for coverage.

I'm still building something I value

The one thing I can truly control is what I'm working towards. As long as I'm able to keep making progress towards my goals for Mailana, I'll be happy. As an engineer I always have the option of continuing in complete Ramen Noodle mode to keep development progressing, so I can last a long time.

What’s the big picture?

Largepainting
Photo by Eytonz

So why did I quit my job and start a business?

I’m convinced technology can help big companies execute far better than they do today. Every day when I was at Apple I could see ways to improve how I did my job, if only I had the right tools. The crux is communicating needs. I knew there were colleagues out there willing to share the expertise, resources or external contacts we needed, but there was no way to find them. Equally, I had no way of spotting situations outside our team where we could help.

Everyone there lived on email, almost every decision, statistic, question, answer, and connection was held on the mail server. That was exactly the data we all needed to do our jobs, but the only way we could interact with it was through a primitive client-side keyword search on whatever email we happened to have on our local machine. It felt like the web before Google, the information was out there but effectively dead, since we’d never be able to find it.

I’m building Mailana to solve the problems I ran into in my daily work. The technology’s out there to locate experts, contacts, and documents, once you can break into the antiquated silos that hold all the messages captive. It’s an itch I have to scratch.

How to insert inline images in Gmail

Polaroids
Photo by kk+

One thing I really miss when I’m using Gmail compared to Apple’s Mail.app or Outlook is the ability to include inline images. Whenever I’m trying to explain almost anything I rely on pictures, and if you can’t insert them in the flow of the text they lose their context. It’s tough telling a story if you can’t illustrate it.

There isn’t a technical reason behind their lack of inline images, you can actually insert your own if you’re willing to jump through some hoops:

– First upload the image to a website, using ImageShack if you don’t have your own
– Open up that image in a new browser window
– Do a ‘Select All’ in Firefox to copy the HTML for that image
– Paste the HTML into the email composition box within Gmail

Having to upload images a big pain, and still doesn’t allow the simple insertion of a desktop app. If there’s any Gmail programmers reading, here’s what you need to do to enable inline image attachments:

– Set the Content-Disposition header of the image attachment to ‘inline’
– Set the Content-ID header to a unique identifier (the cid)
– In the HTML body of the message, reference it as <img src="cid:<that identifier>"/>

Gmail even displays messages created like this correctly, so there’s obviously the internal know-how within Google to implement it. The lack of this widely-requested feature several years after launch is a real shame, and part of the reason I’ve become worried about Gmail’s lack of progress. Here’s hoping they’re busy behind the scenes!

OpenMapi makes an Exchange-killer possible

Openmapilogo

Scott Chaslin just alerted me to the launch of OpenMapi.org. MAPI/RPC is the proprietary protocol that Outlook uses to talk to Exchange. Unlike open standards like IMAP, it deals with the full spectrum of tasks, appointments and contacts, not just mail messages. If you want to write a drop-in replacement for an Exchange server that supports Outlook, it needs to talk MAPI/RPC.

MAPI/RPC was traditionally both secret and very hard to reverse engineer, so solutions like Zimbra required you to install a connector add-in on every PC running Outlook, to communicate in some other protocol. This was a big barrier to adoption, since it means supporting and maintaining all those client installations. The barrier started to crack as companies like PostPath started offering their own emulations, and Microsoft themselves released documentation on the protocol.

What's exciting about OpenMapi is that it offers an open-source implementation of the MAPI/RPC protocol. Now this is in the wild, it's possible to develop your own server that looks exactly like Exchange to the outside world! Potentially, you could deploy a new solution silently, without all the Outlook clients even knowing anything has changed. This could be the first step in turning enterprise mail servers into a commodity. The actual transport technology behind email servers is a well-understood problem, and if you look at the history of Samba you can see what happens when you free up a core Windows service to talk to any back-end. I'm hoping this will lead to a lot more alternatives to Exchange popping up, it should dramatically lower the barrier to entry.

Personally the most interesting part is the proxy server, which should let me write a component that sits between the real Exchange server and all the Outlook clients, pulling out information I want like attachments, but passing on the rest of the work to Exchange. The real test will be how complete and robust the MAPI emulation is in practice, I'll let you know once I've put it through its paces.

[Update- I was getting confused, this description actually applies more to OpenChange. OpenMapi is not focused on the binary network protocol, and more on the actual programming interface offered by MAPI. See the comment below from Thomas]

How you can build trails by shopping at REI

Reilogo

As a rabid hiker and mountain biker, I love REI. They’ve also been long-term supporters of the Santa Monica Mountains Trails Council’s volunteer trail building, for years giving us grants for equipment and publicizing our events, often even bringing their own staff along.

Last Saturday I was chatting to Jennifer Day, the outreach officer at their Santa Monica store. She’d not only brought along a crew of volunteers to work on the trails, she also supplied free t-shirts, juice and raffle prizes. She mentioned that REI ran an online affiliation program we might be interested in. It’s a generous 7% of any purchase someone makes through the website after clicking on the ad. Since most of our members are already heavy REI shoppers, and there’s free shipping to the local stores, this was money for nothing we could spend on keeping the trails open.

There’s no catches, all you have to do is click on the ad here or in the sidebar, and for the next 30 days any purchase you make online will mean a 7% donation to the SMMTC, for the same price you would have paid anyway. If you have an organization you’d like to help instead, signing up for their affiliate program yourself is easy too.

Click here to start donating