Getting people to listen

Listen
Photo by Paulgi

I’ve often sat spellbound listening to a great speaker like Steve Jobs or Al Gore, and wondered what makes their talks so different from the majority I have trouble paying attention to. Some of it is their passion bubbling up, and part is sheer practice, which ironically lets them relax and sound natural (I never seem as impromptu as when I’ve rehearsed a talk 25 times). What I didn’t understand until I saw these videos by Ira Glass is that they’re also using classic story-telling techniques, mental hacks that grab the audience’s attention.

There’s lots of good advice in there, and I recommend checking them all out, but the most valuable lesson for me was the anecdote/reflection structure. We’re always taught to lay out our thoughts with the "Say what you’re going to say, say it, and then say what you said" model, where you present your argument’s conclusion, then back it up with facts, and then revisit the conclusion. This is a great way of presenting a mathematical proof, but a terrible way of engaging the interest of a human being. We’re all wired to love stories, and the basic structure of a story is a series of connected events that raise questions, which are then answered by the conclusion.

The terms Ira uses are anecdote for describing a sequence of things happening, and then a bit of reflection afterwards that tells the audience why those events were worth describing, answering the questions that they implicitly pose. The example he uses goes something like this:

"The man woke up, and it was silent. He got out of bed and looked around, and there was still no noise. He walked downstairs, and the house was completely quiet."

On the face of it, it’s the most boring set of facts imaginable, but your mind expects that you’re being told this for some reason, and anticipates the question being answered. Is it silent because the man has gone deaf? The world has ended? This keeps the audience listening for clues, and gives them a payoff at the end, during the reflection, when you explain the significance of the details you told them. After hearing this explained, I realized that both Steve’s keynotes and Al’s presentations use this structure masterfully. They build up questions with an anecdote, and then tell you what the conclusion was.

We all end up having to persuade others to take action, and the first step is actually getting them to listen. Give this a try with your own speaking and writing; I’ve been surprised by how much it’s helped me.

Let my calendars go!

Silo
Photo by Dean Forbes

I had a phone call a few days ago that left a painful problem in my lap. A friend runs a company, has a crazy schedule with lots of travel, and his wife can’t always keep up with when he’s going to be around. He knew I’m on a crusade to break all sorts of juicy information out of the Exchange silo, so he asked me for my thoughts.

A password-protected web calendar that you can share with friends and family would let his wife easily keep on top of his plans. There are quite a few of these, most famously Google Calendar, but there’s no simple way to drive them from Exchange. Google itself offers an Outlook plugin that periodically syncs, but it has a lot of functional limitations. SyncMyCal gets good reviews and offers a mobile version too, but still requires installation on every machine you use.

It seems like a no-brainer to offer this directly from the Exchange data, rather than going through a complex dance of syncing from Exchange to Outlook to Google. Microsoft actually do have a way to view somebody else’s calendar through Outlook Web Access, but it involves manually typing out a special URL suffix. Not exactly user-friendly.

Microsoft produce some world-class tools for a networked world, but with Exchange they seem focused on incremental improvements to existing workflows, not on taking advantage of the new opportunities of the web. They’ve even got the technology to compete with Google’s calendar under the hood. That URL hack has been there for two versions, but they’re hiding it away.

I couldn’t find a good solution for him, so I’m adding that to my ever-growing list of painful problems I can solve by opening up Exchange’s data store.

Why you should install Intense Debate

Debate
Photo by Ohhector

I’ve been frustrated with loading speed problems I’ve traced back to blog widgets in the past, so I’ve been very resistant to installing any new plugins. In the past few weeks I’ve had three different people urge me to give the Intense Debate comment system a try, so I gave in and installed it. I could see the advantages for readers, but I was wary that it didn’t offer much to the blog owner. I was wrong, I’m loving it, and here’s why you should add it too.

More Comments

The old Typepad comment system had a lot of friction. Its login system never seemed to remember who I was, so I had to go to a separate screen for that. Then I would have to navigate back to the page, enter my comment, and submit it. After that came the obligatory bad-acid-trip sequence of letters to prove I wasn’t a robot, and finally my comment might appear.


I guess I wasn’t alone in finding this a pain. I’ve got four or five comments in 24 hours, whereas I used to be lucky to get one every couple of days. ID has a great in-context system for entering comments, and remembers who I am for more than a few minutes.

Comments

Part of the reason for their ease-of-use is that they don’t have any visible anti-spam measures. This might become a problem as they get more popular and people target them, but just getting away from the comment monoculture we have now will make the spammers’ lives harder.

Brings in Visitors

People have been finding my site through commenters I share with other sites. Since every comment is an implicit vote of interest by the commenter, this is a great way of discovering new blogs you’re likely to be interested in too. I’d love to see a way of easily tracking these visits directly through the ID interface; bloggers would be interested in the statistics, and it would be a direct demonstration of the system’s value.

Rewards Commenters

The widget they offer for recent comments is clearer and more informative than the default Typepad system. I like the snippet, and the jump button makes more sense to me than the old Typepad version. The prominence it gives to comments lets me demonstrate how much I like getting them, and even better, there’s another widget that shows top commenters. Happy commenters means more comments means a happy Pete.

Great Service

I’ve had nothing but good experiences with their email support, patiently helping me figure out an early issue that turned out to be an option I’d mistakenly set. It’s a scary step to hand over your comments to an external startup, but they offer importing and exporting to XML which makes the risk a lot more palatable.

How to post an asynchronous HTTP request in PHP

Sockets
Photo by Zombizi

The way users make things happen with web applications is by fetching from a URL and passing in some parameters, e.g. http://somewhere.com/somescript.php?action=dosomething. For an architecture like Facebook’s, this is also how API calls are made. I was pretty surprised to find that PHP has a hard time doing something this simple asynchronously.

The key problem is that most HTTP libraries are designed around two-way communication. You send off a request and then wait for a response. In this case I don’t want a reply, I just want the request to trigger some action on the other end, which might eventually involve that server calling me back with a similar fetch with some result, or it might just update an internal database. I want my PHP script to fire off that request and then continue executing, but the cURL functionality that’s built in always waits for the response before carrying on.

At first I just needed to make an occasional call like this, and I found a hacky solution: setting the timeout on the cURL fetch to 1 second. The request was fired off, but then almost immediately timed out. The problem is that almost immediately isn’t fast enough once you start calling this frequently; that 1 second per call builds up, and you can’t set the timeout to 0. Here’s that code:

function curl_post_async($url, $params)
{
    $post_params = array();
    foreach ($params as $key => $val) {
        if (is_array($val)) $val = implode(',', $val);
        $post_params[] = $key.'='.urlencode($val);
    }
    $post_string = implode('&', $post_params);

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'curl');
    // Give up after a second, whether or not a reply has arrived
    curl_setopt($ch, CURLOPT_TIMEOUT, 1);
    $result = curl_exec($ch);
    curl_close($ch);
}

I needed something without that one-second penalty on every call. PHP doesn’t support threads out of the box, so I looked at using its flavor of fork, pcntl_fork(). It was looking promising until I realized that it’s disabled by default under Apache, with good reason, since there’s a lot of process baggage to copy when it’s running in that environment. I then toyed with the idea of using exec to spawn a command-line cURL instance to carry out the request, but that seemed ugly, fragile and too heavy-weight. I looked at PHP’s HTTP streams too, but they are also synchronous.

I was getting frustrated because HTTP is a simple protocol at heart, and it shouldn’t be this hard to do what I need.

At last, White Shadow came to the rescue. His post covers a few different approaches, but crucially one of them is based on raw sockets, closing the connection immediately after writing the POST data. This was exactly what I needed: it fires off the request and then returns almost immediately. I got much better performance using this technique.

function curl_post_async($url, $params)
{
    $post_params = array();
    foreach ($params as $key => $val) {
        if (is_array($val)) $val = implode(',', $val);
        $post_params[] = $key.'='.urlencode($val);
    }
    $post_string = implode('&', $post_params);

    $parts = parse_url($url);

    $fp = fsockopen($parts['host'],
        isset($parts['port']) ? $parts['port'] : 80,
        $errno, $errstr, 30);

    pete_assert(($fp !== false), "Couldn't open a socket to ".$url." (".$errstr.")");

    $out = "POST ".$parts['path']." HTTP/1.1\r\n";
    $out.= "Host: ".$parts['host']."\r\n";
    $out.= "Content-Type: application/x-www-form-urlencoded\r\n";
    $out.= "Content-Length: ".strlen($post_string)."\r\n";
    $out.= "Connection: Close\r\n\r\n";
    $out.= $post_string;

    // Write the request and close immediately, without waiting for a reply
    fwrite($fp, $out);
    fclose($fp);
}
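One refinement worth considering: splitting the request assembly out of the socket code makes the wire format easy to unit-test. This sketch (the helper name build_async_post_request() is my own, not from White Shadow’s post) builds the same request string that gets written before the connection is closed.

```php
<?php
// Sketch: assemble the raw HTTP POST request that the fire-and-forget
// socket code writes. Keeping this separate from fsockopen() lets you
// verify the wire format without opening any connections.
function build_async_post_request($url, $params)
{
    $post_params = array();
    foreach ($params as $key => $val) {
        if (is_array($val)) $val = implode(',', $val);
        $post_params[] = $key.'='.urlencode($val);
    }
    $post_string = implode('&', $post_params);

    $parts = parse_url($url);
    $path = isset($parts['path']) ? $parts['path'] : '/';

    $out  = "POST ".$path." HTTP/1.1\r\n";
    $out .= "Host: ".$parts['host']."\r\n";
    $out .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $out .= "Content-Length: ".strlen($post_string)."\r\n";
    $out .= "Connection: Close\r\n\r\n";
    $out .= $post_string;
    return $out;
}
```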

Does the opening of Facebook’s source reveal anything?

Streaker
Photo by Paste Magazine

The first thing I discovered when I looked over Facebook’s recent platform code release was a security flaw that lets you run malicious Javascript through applications, bypassing their security, but I won’t be blogging any details until the team has implemented a fix.

When they recently released some of their platform code as open source (link seems to be temporarily down, but you can download the source directly here) it led to a lot of discussion on the strategic significance of the move, aimed at keeping Facebook’s lead in the application space against competitors like OpenSocial, and on the implications of the unusual CPAL license chosen.

I’m much more interested in the technical lessons you can learn about Facebook’s code and architecture from the source. From looking through it, I’m confident this is drawn from their actual production code, so it’s a rare glimpse inside the implementation of a web application battle-tested with millions of users. I’ve uploaded a version with an Xcode project for easy browsing if you want to explore for yourself on a Mac.

There’s a disappointing lack of swearing in the comments, though I did find one "omg this is so retarded" in typeaheadpro.js. With that fun out of the way, a good place to start after the main README is to search for "FBOPEN:" in all files, since this brings up comments that were added to document the parts the developers thought would be interesting to users of the open version.

Examining the basic structure confirms that Facebook are still basically a LAMP shop. The only part I wondered about was the M of MySQL, since that’s traditionally been tough to scale, but all of the database access here is through raw SQL strings. They’re known for their use of memcache to speed up data fetching, but there’s no sign of it in the code they’ve released. I was hoping for some heavy-weight examples of how to handle snooping on updates to invalidate memcache entries, but no such luck. They do have an interesting pattern of assembling their query strings using printf-style format strings and varargs, rather than directly appending, which results in cleaner-looking code. If you want to look at the implementation, it’s in lib/core/mysql.php.
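I haven’t reproduced their helpers here, but the pattern looks roughly like this sketch. To be clear, queryf() is my own minimal reconstruction of the idea, not Facebook’s actual code, and addslashes() stands in for a proper connection-aware escaper like mysql_real_escape_string().

```php
<?php
// Sketch of printf-style query assembly. Each %s placeholder gets its
// argument escaped and quoted, so the call site reads as one clean
// format string instead of a chain of concatenations.
function queryf($format /* , ...args */)
{
    $args = array_slice(func_get_args(), 1);
    $escaped = array();
    foreach ($args as $arg) {
        // addslashes() is a placeholder; real code would use a
        // connection-aware escaper such as mysql_real_escape_string()
        $escaped[] = "'".addslashes($arg)."'";
    }
    return vsprintf($format, $escaped);
}
```

The payoff is at the call site: `queryf("SELECT name FROM users WHERE id=%s", $id)` is easier to read and audit than string appends scattered through the logic.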

One component I hadn’t seen before was Thrift, Facebook’s open source framework for building cross-language APIs and data structures. It takes an interface definition file, and then creates a lot of the glue you need to implement the methods and data structures in PHP, Java, C++, Ruby and Erlang. I was interested because I’ve found I need a lowest-common-denominator data definition and code generation framework as I end up bouncing between C++, PHP and SQL tables. They don’t address the database storage side, which I hit problems with too since some basic data structures like lists inside structures don’t translate into a relational database unambiguously.

They look like they hit illegal-character problems similar to my XML parsing woes, since in strings.php they’ve got a call to iconv('utf-8', 'utf-8//IGNORE', $str) that they use to sanitize their input strings.
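Wrapped as a helper, the trick looks like this; sanitize_utf8() is my name for it, not theirs. The //IGNORE suffix tells iconv to drop byte sequences that aren’t valid in the target encoding rather than stopping at them.

```php
<?php
// Strip invalid UTF-8 byte sequences from a string. The @ suppresses
// the notice some PHP builds raise when illegal characters are detected;
// with //IGNORE, iconv skips the bad bytes and keeps converting.
function sanitize_utf8($str)
{
    return @iconv('utf-8', 'utf-8//IGNORE', $str);
}
```

Valid input passes through untouched, which makes this safe to apply defensively before handing strings to an XML parser.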

Los Angeles web ventures barbecue

Barbecue
Photo by SpacePotato

If you’re a tech entrepreneur in LA, come along to the first meeting of the local web ventures group. It’s on Saturday June 14th at 1:00pm in Sherman Oaks, organized by Wil Fernandez. There are all the practical benefits of networking, but the real point for me is to be around a bunch of people driven by the same passion. I’ve made it to too few EMS events (Saturday mornings are usually booked for trailwork), but I always walk out fired up by the determination of both the students and presenters to Get Stuff Done.

I was looking at restarting the apparently moribund LA open coffee club, but whilst researching this article found that it’s alive and well, just not covered on the main site. The joys of a fragmented web. I just missed a meeting yesterday, but I’ll be making it along to the next. If there are any other SoCal entrepreneur events I’m missing, let me know in the comments and I’ll check them out.

You prefer lovable fools to competent jerks

Clown
Photo by Cemetery Belle

That’s the argument of this Harvard Business Review article by Tiziana Casciaro and Miguel Sousa Lobo. They’ve researched how collaboration networks actually form within organizations, trying to work out how people choose who to connect with to get a task done. They tried to measure two attributes, competence and likableness, and then look at how those measures relate to who you decide to work with. Based on the combinations of likableness and competence, they classify people into incompetent jerks, competent jerks, lovable fools and lovable stars.
Competencymatrix

Two corners of the matrix are obvious: nobody wants to work with somebody who’s bad at their job and has no personal skills, and everybody is happy to work with a superstar who’s also a nice person. The surprising part is that whilst most people claim to prefer competent jerks to lovable fools as work partners, in practice they choose people they like, regardless of their competence.

They talk about some of the consequences of this: you tend to get more homogeneous groups of people working together, with less diversity of viewpoints, since people tend to like others who are similar to themselves. They also informally discuss the mechanism that drives people to prefer lovable fools over more competent but grating alternatives. They mention trust and familiarity, but it would be interesting to see how much correlation there is with the network measure of closure within a group. It seems likely that you share a lot of mutual friends with likable people, since by definition a lot of people like them, and so the reputation cost of letting you down will be a lot higher for them. Competent jerks won’t have those same third-party ties.

Based on my experience, I avoid anyone who’s a real jerk, purely because they also tend to be unreliable in delivering results. There’s only a theoretical distinction between someone who can’t do a task and someone who can but won’t, and I think managers overestimate their ability to change a jerk into someone productive, and underestimate the damage jerks do to their peers. I love Bob Sutton’s work with the No Asshole Rule, looking at the impacts of jerks in the workplace, and how to spot and deal with them.

I do think the matrix above is incomplete though. There’s a large group of employees who aren’t widely liked, but aren’t jerks either; they’re just socially disconnected from their colleagues. They’re often the bedrock of the team, quietly getting work done. These are the people that management can really help, by acting as an interface between them and the outside world, protecting them from perceived hassle and distilling the competing external demands into simpler requirements.

You’ll need to pay $7 to get the full document, but the summary gives you a good overview. There’s also a free technical paper aimed at an academic audience; the article itself focuses on the practical lessons you can draw from their research. The work relies on the standard self-reporting surveys to figure out networks; as always, I’d be fascinated to see whether automated data-mining of email and phone usage within a company would give the same picture.

The joy of nearly being eaten

Kingsnake_2

After growing up in Britain, where the apex predator is the badger, I feel lucky to be living somewhere with truly wild wildlife. The knowledge that you could be eaten or poisoned around the next corner adds an edge of alertness to any trip. The possible downside is being somebody’s next meal, but the certain upside is appreciating that you’re in a true wilderness.

Liz once saw the rear end of a mountain lion disappearing down the trail, but I’ve had to content myself with plenty of bobcats, coyotes and rattlesnakes. Two weeks ago we even had a rattler that refused to leave our worksite, and watched us warily for a few hours. Above you can see me relocating a harmless California King Snake after our maintenance had disturbed its home. Below are a few more of the lovely beasties we’ve encountered.

Scorpion_2

It’s not unusual to come across these small scorpions when you turn over a rock. So far nobody’s been stung, and from what I understand our local variety isn’t too venomous anyway. It makes me feel like I’m in a western every time I spot one though.

Blackwidow1_2

This action shot is a Black Widow in our back yard. We seem to have dozens around the outside of the house; they have the most beautiful sleek black bodies, with the distinctive red hourglass marking. We don’t have many closeup photos of them, for obvious reasons.

Walkingstick_2

I’m not too worried about this Walking Stick insect eating me, but he’s one of the coolest designs I’ve seen in a long time. He’s definitely got the Apple elegance about him, the MacBook Air of the insect world.

Search massive XML datasets with MarkLogic

Bookpile
Photo by GeorgMayer

After my last post on the MarkMail project, I heard back from MarkLogic’s Jason Hunter with some more information about the underlying implementation. Almost all the capabilities are provided by the MarkLogic database server, which seems to offer an impressive set of features. I initially had some trouble finding the technical information on their main site, since it’s mostly geared towards the high-end content publishers who are the main users of the system, but then I came across their developer center.

What is MarkLogic Server? is probably the best place to start for an overview of what they offer. Essentially they differ from a standard relational database by accepting comparatively unstructured data, without a rigid schema, and focusing on the great search and retrieval performance you need for any publishing system. As they’ve demonstrated with MarkMail, this makes a good interface for large collections of email too. The technology has been battle-tested, deployed in situations dealing with terabytes of data and with the ability to run in a distributed cluster so you can scale performance to cope with heavy loads.

As well as the MarkMail demonstration, they also have the Executive Pay Check site that lets you do a live search on 14A filings to see the salaries of leaders at public companies. This is interesting mostly because it’s doing a good job coping with some theoretically structured, but in practice quite messy, source data, with inconsistent naming and formatting for the tables holding the filings. It would require some heavy massaging to get this into a traditional relational database, but MarkLogic seems to be a lot better at handling that sort of problem.

There’s a free community version of the engine available for download, so I’ll be experimenting with it when I have the chance. An active developer community has grown up around the product over the last few years, with lots of documentation I’m absorbing to help my understanding. I’m surprised that it isn’t better known in the search community; it seems to offer some unique features that would let you build an interesting search engine for all kinds of rich content.

Is PageRank the ultimate implicit algorithm?

Bookpile

PageRank has to be one of the most successful algorithms ever. I’m wary of stretching the implicit web definition until it breaks, but it shares a lot of similarities with the algorithms we need to use.

Unintended information. It processes data for a radically different purpose than the content’s creators had in mind. Links were meant to simply be a way of referencing related material, nobody thought of them as indicators of authority. This is the definition of implicit data for me, it’s the information that you get from reading between the lines of the explicit content.
Completely automatic. No manual intervention means it can scale up to massive sets of data without a corresponding increase in the number of users or employees you need. This makes it easy to be comprehensive, covering everything.
Hard to fake. When someone links to another page, they’re putting a small part of their reputation on the line. If the reader is disappointed in the destination, their opinion of the referrer drops, and this natural cost keeps the measure correlated with authority. This makes the measure very robust against manipulation.
Unreliable. PageRank is only a very crude measure of authority, and I’d imagine that a human-based system would come up with different rankings for a lot of sites.
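To make the "completely automatic" point concrete, here’s a toy sketch of the kind of computation involved: the simplified power-iteration form of PageRank, with the usual 0.85 damping factor. This glosses over dangling pages and all of Google’s real-world refinements, but it shows how authority flows along links with no human in the loop.

```php
<?php
// Toy power-iteration PageRank (simplified model, damping factor 0.85).
// $links maps each page to the array of pages it links to. Every page's
// rank is repeatedly redistributed along its outgoing links until the
// values settle.
function pagerank($links, $damping = 0.85, $iterations = 50)
{
    $pages = array_keys($links);
    $n = count($pages);

    // Start everyone with an equal share of rank
    $rank = array();
    foreach ($pages as $p) $rank[$p] = 1.0 / $n;

    for ($i = 0; $i < $iterations; $i++) {
        // The (1 - damping) term is the "random surfer" teleport share
        $next = array();
        foreach ($pages as $p) $next[$p] = (1.0 - $damping) / $n;

        // Each page splits its current rank among the pages it links to
        foreach ($links as $from => $tos) {
            $share = $rank[$from] / max(count($tos), 1);
            foreach ($tos as $to) $next[$to] += $damping * $share;
        }
        $rank = $next;
    }
    return $rank;
}
```

Run on a three-page graph where two pages link to a third, that third page ends up with the highest rank, purely as a by-product of link structure.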

As a contrast, consider the recipe behind a social site like Digg that aims to rank content in order of interest.

Explicit information. Every Digg vote is done in the knowledge that it will be used to rank stories on the site.
Human-driven. It relies completely on users rating the content.
Easy to fake. The voting itself is simple to game, so account creation and other measures are required to weed out bad players.
Reliable. The stories at the top of its rankings are generally ones a lot of people have found interesting; it seems good at avoiding boring content, though of course there’s plenty that doesn’t match my tastes.

A lot of work seems to be fixated on reliability, but this is short-sighted. Most implicit data algorithms can only ever produce a partial match between the output and the quality you’re trying to measure. Where they shine is their comprehensiveness and robustness. PageRank shows you can design your system around fuzzy reliability and reap the benefits of fully automatic and unfakeable measures.