LA’s guilty pleasure

Tacocrash
Photo by mxlanderos

If there’s one thing that unites Angelenos, it’s a fascination with car chases. The main news shows will completely shut down their regular coverage for the whole hour if there’s a live chase happening, no matter how uneventful it is. The anchors turn into sports commentators, with plenty of informed speculation about the exact tactics the police will use, when the PIT maneuver is safe, and whether the CHP or the sheriffs have jurisdiction at any given point. KTLA even has a helicopter pilot with the perfect name of Johnny McCool to cover it all.

Liz asked me last night, "Is that morally correct?" and the answer has to be "No, but I still can’t look away". It’s glorifying criminals who are putting a lot of innocent people in danger for the sake of entertainment, and feels like a disorganized version of The Running Man.

Still, for LA residents who spend a significant portion of their lives at a frustrated standstill in traffic, the sight of someone breaking free and using every trick to speed along the freeways is mesmerizing and vicariously liberating. The fact that they almost always get caught at the end provides a moral alibi, but the real payoff is seeing them in flight.

I had dinner last night with a friend who’s been collecting the most gripping examples on his blog, and we talked a lot about this local obsession. This New Yorker article is the best exploration I’ve seen, with Sheriff Baca blaming the large number of chases on a shortage of cops and lots of "highly mobile idiots", but it never manages to really explain their popularity. It looks like the number of local car chases has actually been declining since its peak in 2004, but I’m betting LA stays way ahead of the rest of the country for a long time to come.

How to easily search and replace with sed

Textjumble
Photo by wai:ti

If you’ve used any flavor of unix for programming, you’re probably familiar with grep, the tool for locating patterns in text files. That’s great if you just want to search for a string, but what if you want to replace it?

Sed, the stream editor, is the answer, but that also brings up a new question: how on earth do I use it? It has one of the most obscure interfaces ever invented; its syntax makes obfuscated Perl look like a model of clarity. Usually with a new tool I start off looking at a series of examples, like these sed one-liners, to get a rough mental model of how it works, and then dive into the documentation on specific points. That didn’t work with sed; I was still baffled even after checking those out. The man page didn’t help either; I could read the words, but they didn’t make any sense.

Finally I came across my salvation, Bruce Barnett’s introduction and tutorial for sed. He hooked me with his first section on The Awful Truth about sed, with its reassurance that it’s not my fault I’m struggling to make head or tail of anything. He then goes through all the capabilities of sed in the order he learnt them. It’s a massive list, but even if you only get a few pages in you’ll know how to do a simple search and replace. Sed is a very powerful tool, and it’s worth persevering with the rest so you can discover some of the advanced options, like replacing only between certain tags in a file (e.g. changing the content text of a particular XML tag) or working from line numbers. Bruce is an entertaining companion for the journey too; he has great fun with asides demonstrating how to make the syntax even harder to read, just for kicks.
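To give a flavor, here’s the basic substitution form the tutorial builds up to; the file names and patterns are just placeholders:

# replace every occurrence of 'oldtext' with 'newtext', writing the result to a new file
sed 's/oldtext/newtext/g' input.txt > output.txt

# edit the file in place, keeping a backup copy (support for -i varies between sed versions)
sed -i.bak 's/oldtext/newtext/g' input.txt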

Why massive datasets beat clever algorithms

Library
Photo by H Wren

Jeremy Liew recently posted some hints, tips and cheats to better datamining. The main thrust, based on Anand Rajaraman’s class at Stanford, is that finding more data is a surer way to improve your results than tweaking the algorithm. This matches both my own experience trying to do datamining and what I’ve seen with other companies’ technologies. Engineers have a bias towards making algorithms more complex, because that’s the direction that gets the most respect from their peers and offers the most intellectual challenge.

That makes us blind to the advantages of what the Jargon File calls wall-followers, after the Harvey Wallbanger robot that simply kept one hand on a wall at all times to complete a maze, and gave far better results than the sophisticated competitors using complex route-finding. Google’s PageRank is a great example, almost zen in its simplicity, with no semantic knowledge at all.
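To make that simplicity concrete, here’s a toy version of the power iteration at the heart of PageRank, written in PHP. The link graph, damping factor, and iteration count are all made up for illustration, not anything Google actually uses:

<?php
// Toy PageRank: repeatedly hand each page's score to the pages it links to.
// $links maps each page to the list of pages it links out to.
$links = array(
    'a' => array('b', 'c'),
    'b' => array('c'),
    'c' => array('a'),
);

$damping = 0.85;
$pages = array_keys($links);
$count = count($pages);

// Start every page with an equal share of the total score.
$rank = array_fill_keys($pages, 1.0 / $count);

for ($i = 0; $i < 20; $i++) {
    // Every page keeps a small baseline, then collects shares from its in-links.
    $next = array_fill_keys($pages, (1.0 - $damping) / $count);
    foreach ($links as $page => $outlinks) {
        $share = $rank[$page] / count($outlinks);
        foreach ($outlinks as $target) {
            $next[$target] += $damping * $share;
        }
    }
    $rank = $next;
}

print_r($rank);
?>

There’s no semantic knowledge anywhere in there, just the shape of the link graph.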

One hidden advantage of this simple-mindedness is very predictable behavior, since the simplicity means there are far fewer variables affecting the outcome. This is why machine learning is such a scary change for Google; there’s no guarantee that some untested combination of inputs won’t produce badly broken results.

Another obvious benefit is speed, of both development and processing. You can get up and running very fast and churn through a lot of data, which buys you far more coverage. Yahoo’s directory wasn’t beaten because Google ranked pages more accurately than humans did, but because Yahoo could only cover a tiny fraction of what was out there.

On the face of it, this doesn’t sound good for a lot of the alternative search engines competing out there. If it’s hard to beat a simple ranking algorithm, should they all pack up and go home? No, I think there’s an immense amount that can be improved, both on the presentation side and by gathering novel data sets. For example, why can’t I pull a personal page rank based on my friends’ and friends-of-friends’ preferences for sites? What about looking at my and their clickstreams to get an idea of those preferences?

Getting people to listen

Listen
Photo by Paulgi

I’ve often sat spellbound listening to a great speaker like Steve Jobs or Al Gore and wondered what makes it different from the majority of talks I have trouble paying attention to. Some of it is their passion bubbling up; part of it is sheer practice, which ironically lets them relax and sound natural (I never seem as impromptu as when I’ve rehearsed a talk 25 times). What I didn’t understand until I saw these videos by Ira Glass is that they’re also using classic story-telling techniques, mental hacks that grab the audience’s attention.

There’s lots of good advice in there and I recommend checking them all out, but the most valuable part for me was the anecdote/reflection structure. We’re always taught to lay out our thoughts with the "Say what you’re going to say, say it, and then say what you said" model, where you present your argument’s conclusion, back it up with facts, and then revisit the conclusion. This is a great way of presenting a mathematical proof, but a terrible way of engaging the interest of a human being. We’re all wired to love stories, and the basic structure of a story is a series of connected events that raise questions, which are then answered by the conclusion.

The terms Ira uses are anecdote for describing a sequence of things happening, and then a bit of reflection afterwards that tells the audience why those events were worth describing, answering the questions that they implicitly pose. The example he uses goes something like this:

"The man woke up, and it was silent. He got out of bed and looked around, and there was still no noise. He walked downstairs, and the house was completely quiet."

On the face of it, it’s the most boring set of facts imaginable, but your mind assumes you’re being told this for a reason, and anticipates the questions being answered. Is it silent because the man has gone deaf? Has the world ended? This keeps the audience listening for clues, and gives them a payoff at the end, during the reflection, when you explain the significance of the details you’ve told them. After hearing this explained, I realized that both Steve’s keynotes and Al’s presentation use this structure masterfully. They build up questions with an anecdote, and then tell you what the conclusion was.

We all end up having to persuade others to take action, and the first step is actually getting them to listen. Give this a try with your own speaking and writing; I’ve been surprised by how much it’s helped me.

Let my calendars go!

Silo
Photo by Dean Forbes

I had a phone call a few days ago that left a painful problem sitting in my lap. A friend runs a company, has a crazy schedule with lots of travel, and his wife can’t always keep up with when he’s going to be around. He knew I was on a crusade to break all sorts of juicy information out of the Exchange silo, so he asked me for my thoughts.

A password-protected web calendar that you can share with friends and family would let his wife easily keep on top of his plans. There are quite a few of these, most famously Google Calendar, but there’s no simple way to drive them from Exchange. Google itself offers an Outlook plugin that periodically syncs, but it has a lot of functional limitations. SyncMyCal gets good reviews, and offers a mobile version too, but it still requires installation on every machine you use.

It seems like a no-brainer to offer this directly from the Exchange data, rather than going through a complex dance of syncing from Exchange to Outlook to Google. Microsoft actually do have a way to view somebody else’s calendar through Outlook Web Access, but it involves manually typing out a special URL suffix. Not exactly user-friendly.

Microsoft produce some world-class tools for a networked world, but with Exchange they seem focused on incremental improvements to existing work flows, not taking advantage of the new opportunities of the web. They’ve even got the technology to compete with Google’s calendar under the hood. That URL hack has been there for two versions, but they’re hiding it away.

I couldn’t find a good solution for him, so I’m adding that to my ever-growing list of painful problems I can solve by opening up Exchange’s data store.

Why you should install Intense Debate

Debate
Photo by Ohhector

I’ve been frustrated before by loading speed problems I’ve traced back to blog widgets, so I’ve been very resistant to installing any new plugins. Over the past few weeks, though, three different people urged me to give the Intense Debate comment system a try, so I gave in and installed it. I could see the advantages for readers, but I was wary that it didn’t offer much to the blog owner. I was wrong; I’m loving it, and here’s why you should add it too.

More Comments

The old Typepad comment system had a lot of friction. Their login system never seemed to remember who I was, so I had to go to a separate screen for that. Then I would have to navigate back to the page, enter my comment, and submit it. After that came the obligatory bad-acid-trip sequence of letters to prove I wasn’t a robot, and finally my comment might appear.

 

I guess I wasn’t alone in finding this a pain. I’ve got four or five comments in 24 hours, whereas I used to be lucky to get one every couple of days. ID has a great in-context system for entering comments, and remembers who I am for more than a few minutes.

Comments

Part of the reason for their ease-of-use is that they don’t have any visible anti-spam measures. This might become a problem as they get more popular and people target them, but just getting away from the comment monoculture we have now will make spammers’ lives harder.

Brings in Visitors

People have been finding my site through commenters I share with other sites. Since every comment is an implicit vote of interest by the commenter, this is a great way of discovering new blogs you’re likely to be interested in too. I’d love to see a way of easily tracking these visits directly through the ID interface; bloggers would be interested in the statistics, and it would be a direct demonstration of the system’s value.

Rewards Commenters

The widget they offer for recent comments is clearer and more informative than the default Typepad one. I like the snippet, and the jump button makes more sense to me than the old Typepad version. The prominence it gives to comments lets me demonstrate how much I like getting them, and even better, there’s another widget that shows top commenters. Happy commenters means more comments means a happy Pete.

Great Service

I’ve had nothing but good experiences with their email support, who patiently helped me figure out an early issue that turned out to be an option I’d set by mistake. It’s a scary step to hand over your comments to an external startup, but they offer importing and exporting to XML, which makes the risk a lot more palatable.

How to profile your PHP code with Xdebug

Tapemeasures
Photo by Jek in the Box

I was adding some functionality to my mail system, and noticed it seemed to be running more slowly. I didn’t have any way to be sure though, and I didn’t know how to get performance information on my PHP code. After a bit of research, I discovered the xdebug PHP extension can gather profile data, and you can then use kcachegrind to view it.

On Fedora Linux, you can install Xdebug by running the following line in superuser mode:
yum install php-pecl-xdebug

To enable capturing of performance data, you’ll then need to edit your php.ini file, adding the following lines:

; xdebug settings
xdebug.profiler_enable = 1
xdebug.profiler_append = 1
xdebug.profiler_output_name = cachegrind.out.%s

This will output raw profile data to files named cachegrind.out.<your php script name> in /tmp. There are some options you may want to tweak, for example not appending repeated calls, or naming the output files after something other than the script name.
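For instance, if you’d rather profile only on demand and get a fresh file per run, something along these lines should work with Xdebug 2.x; double-check the setting names against the version you have installed:

; only profile requests that pass an XDEBUG_PROFILE GET/POST parameter or cookie
xdebug.profiler_enable = 0
xdebug.profiler_enable_trigger = 1
; write a new file each time, named by timestamp rather than script name
xdebug.profiler_append = 0
xdebug.profiler_output_name = cachegrind.out.%t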

Once you’ve made those changes, restart apache with
/sbin/service httpd restart

Now navigate to the pages you want to profile, and the data should appear as files in /tmp. Once you’ve done the operations you’re interested in, possibly repeating them to generate a larger sample, set xdebug.profiler_enable back to 0 in php.ini and restart apache again.

I’d now got a nice collection of data, but that wasn’t much use without a way to understand what it meant. KCachegrind is the most popular tool for viewing the output files, but it doesn’t have a native OS X version. I tried the DarwinPorts approach, but as always at least one of the dozens of dependencies failed to compile automatically, so I resorted to my Fedora Linux installation running in Parallels. If you’re on Windows, WinCacheGrind is a native version that’s had some good reviews. I couldn’t find a separate Linux binary, but it’s part of the KDevelop suite, so I was able to install that easily through the package manager.

Once that’s installed, copy over the cachegrind data files from your server, and then open them up in the application. You should see a list of function calls, and if you’re used to desktop profiling, a lot of the options for drilling down through the data should seem familiar. The KCachegrind team has some tips if you’re looking for a good way to get started.

In my case, I found that most of the time was spent inside IMAP, which is actually good news since it means I’m running close to the maximum download speed and my parsing code isn’t getting in the way too much.

Where to hike and camp in Los Angeles

Sandstonepeak2
Photo by Caroline on Crack

I’ve posted a lot of local hiking and camping guides here, but there’s never been a good way to find them all. If you use this search link, you’ll see an up-to-date list of all my outdoors posts, but here’s a collection of my greatest hits to date:

Camping on Santa Cruz Island
Camping in the Santa Monicas: Sycamore Canyon

Camping in the Santa Monicas: La Jolla Valley, my favorite ‘secret’ campground
More on the La Jolla Valley hike-in campground
Camping in the Santa Monicas: Topanga, a little-known hike-in campground
Camping in the Santa Monicas: Circle X
Camping in the Angeles: Chantry Flat

Hiking trails on Santa Cruz Island
Fancy a trail with oil bubbling from the ground?
How to hike to the highest point in the Santa Monicas
Slickrock in LA
Big Sky trail in Simi Valley
Condor Peak trail in the Angeles
Bike trails in Sycamore Canyon
See where they filmed MASH

How to post an asynchronous HTTP request in PHP

Sockets
Photo by Zombizi

The way users make things happen with web applications is by fetching from a URL and passing in some parameters, e.g. http://somewhere.com/somescript.php?action=dosomething. For an architecture like Facebook’s, this is also how API calls are made. I was pretty surprised to find that PHP has a hard time doing something this simple asynchronously.

The key problem is that most HTTP libraries are designed around two-way communication: you send off a request and then wait for a response. In this case I don’t want a reply; I just want the request to trigger some action on the other end, which might eventually involve that server calling me back with a similar fetch carrying some result, or might just update an internal database. I want my PHP script to fire off that request and then continue executing, but the cURL functionality that’s built in always waits for the response before carrying on.

At first I just needed to make an occasional call like this, and I found a hacky solution that set the timeout on the cURL fetch to 1 second. This meant the request was fired off, but then almost immediately timed out. The problem is that "almost immediately" isn’t fast enough once you start calling this frequently; that one second per call builds up, and you can’t set the timeout to 0. Here’s that code:

function curl_post_async($url, $params)
{
    // Build the URL-encoded POST body from the parameter array.
    $post_params = array();
    foreach ($params as $key => $val) {
        if (is_array($val)) {
            $val = implode(',', $val);
        }
        $post_params[] = $key.'='.urlencode($val);
    }
    $post_string = implode('&', $post_params);

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'curl');
    // The hack: give up waiting for the response after one second.
    curl_setopt($ch, CURLOPT_TIMEOUT, 1);
    $result = curl_exec($ch);
    curl_close($ch);
}

I needed something that didn’t have that one-second penalty every time. PHP doesn’t support threads out of the box, so I looked at using its flavor of fork, pcntl_fork(). It was looking promising until I realized that it’s disabled by default under Apache, with good reason, since there’s a lot of process baggage to copy when you’re running in that environment. I then toyed with the idea of using exec to spawn a cURL command-line instance to carry out the request, but that just seemed ugly, fragile and too heavy-weight. I looked at PHP’s HTTP streams too, but they are also synchronous.

I was getting frustrated because HTTP is a simple protocol at heart, and it shouldn’t be this hard to do what I need.

At last, White Shadow came to the rescue. His post talks about a few different ways of doing what I need, but crucially one of them is based on raw sockets, closing the connection immediately after writing the POST data. This is exactly what I needed; it fires off the request and then returns almost immediately. I got much better performance using this technique.

function curl_post_async($url, $params)
{
    // Build the URL-encoded POST body from the parameter array.
    $post_params = array();
    foreach ($params as $key => $val) {
        if (is_array($val)) {
            $val = implode(',', $val);
        }
        $post_params[] = $key.'='.urlencode($val);
    }
    $post_string = implode('&', $post_params);

    $parts = parse_url($url);

    // Open a raw TCP connection to the web server.
    $fp = fsockopen($parts['host'],
        isset($parts['port']) ? $parts['port'] : 80,
        $errno, $errstr, 30);

    pete_assert(($fp != 0), "Couldn't open a socket to ".$url." (".$errstr.")");

    // Write a minimal HTTP POST request by hand...
    $out = "POST ".$parts['path']." HTTP/1.1\r\n";
    $out .= "Host: ".$parts['host']."\r\n";
    $out .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $out .= "Content-Length: ".strlen($post_string)."\r\n";
    $out .= "Connection: Close\r\n\r\n";
    $out .= $post_string;

    // ...then close the socket without waiting for any response.
    fwrite($fp, $out);
    fclose($fp);
}
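Both versions are called the same way. Here’s roughly what a call looks like, reusing the example URL from earlier; the parameter values are just placeholders:

// Fire-and-forget: this returns as soon as the request has been written to the socket.
curl_post_async('http://somewhere.com/somescript.php',
    array('action' => 'dosomething'));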

An easy way to keep a running total in MySQL

Numbers
Photo by Walsh

I needed to keep track in my database of how many emails a user had sent on a given day, and it was surprisingly tough to implement. As I came across each message, I needed to add one to the total for that user on that day. If a record for that day and address already exists, it’s simple:

UPDATE messagefrequencies SET senttocount=senttocount+1
WHERE address=currentaddress AND day=currentday;

When I implemented that, I realized it didn’t do quite what I wanted. If there wasn’t already a record for that address and day, the update would silently affect nothing. There was a potentially massive number of combinations of days and addresses, and I didn’t want to create blank rows for all of them, so I needed some way to increment the current value if a matching row already exists, or create it and set it to 1 if it doesn’t.

My first attempt was to use the IF EXISTS syntax, but I discovered that’s only valid within stored procedures. The real solution turned out to be the opposite of the way I was thinking about the problem: there’s an ON DUPLICATE KEY UPDATE clause that lets you attempt a row INSERT, and then perform an update if the row already exists. One thing to watch out for is that this update syntax doesn’t take a SET; instead you just specify the columns you want to change.

INSERT INTO messagefrequencies (day, address, senttocount)
VALUES (TO_DAYS('2006-05-29 13:59:10'), '[email protected]', '1')
ON DUPLICATE KEY UPDATE senttocount = senttocount+1;
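One prerequisite that’s easy to miss: ON DUPLICATE KEY UPDATE only kicks in when the INSERT collides with a PRIMARY or UNIQUE key, so the table needs one covering the day/address pair. Here’s a rough sketch of what that schema might look like; the column types are my guesses rather than the real table definition:

CREATE TABLE messagefrequencies (
    day INT NOT NULL,
    address VARCHAR(255) NOT NULL,
    senttocount INT NOT NULL DEFAULT 0,
    UNIQUE KEY day_address (day, address)
);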