Five short links


Picture by Robert Edgar

I was complaining about all the problems I've had installing PostgreSQL on OS X, so I'm grateful to a commenter for pointing me to this Heroku-sponsored attempt to make it easier. It does seem a lot more usable than any of the alternatives I've tried.

Swoosh – "A generic approach to entity resolution". Fuzzily matching large numbers of records to figure out which ones represent the same objects is a fundamental operation whenever you're dealing with big unstructured data sets. It's hard to implement all the plumbing to support the matching operations, and even harder to optimize the whole process, so this Stanford framework looks appealing, and I know at least one startup is using it in production.

Charles – A startlingly-useful desktop app for inspecting the web traffic your machine is sending and receiving. Great for understanding the underlying nuts-and-bolts of an online application.

Tune in next week – I spend a lot of time trying to understand our users' psychology so we can convert the time they spend on Jetpac into actions that persuade more people to sign up, and generally build the business. This is something I struggle with, because I worry that I'm doing something unseemly, which is why this essay on the art of the cliff-hanger was so striking. Story-tellers have hacked their audiences for millennia, aiming for the moment when "curiosity is converted into a commercial transaction", which captures the balance I'm trying to strike.

Rare photos of the Soviet bomb project – There are some amazing photos from the dawn of the atomic age, but what leapt out at me was how the scientist behind the Communist bomb figured out that the US was working on a weapon. After he'd published a nuclear research paper that received no citations "Flerov did a literature search and realized that nobody was publishing on fission anymore — and indeed, all of those who had been publishing on it had dropped off the map completely. He immediately started writing letters — including to Stalin himself — pointing out that this could only indicate that the United States was working on an atomic bomb." This was information that was visible to anyone, but only he seemed to spot it.

Seven short links

Photo by Trials and Errors

I've been a bit distracted for the past few weeks, so here's a bumper crop of links from my backlog:

How to install Ruby on Rails on a Mac – It's amazing how tough getting all the dependencies for a Ruby project can be on stock OS X, especially Postgres. It's a big reason why I'm now moving to Vagrant.

A vernacular web – A look back at the early web through an art history lens. I thought it would be cheap nostalgia, but it turned out to be surprisingly revealing and insightful.

Making Maps – A wonderful guide to online map making from the Chicago Tribune.

Speak.js – The power of JavaScript in modern browsers is amazing. Here's a speech synthesizer that relies on the new typed arrays addition, compiled from LLVM!

GFX – A 3D CSS animation library. Does exactly what it says on the tin, but it's another sign of how capable modern browsers are.

Email guesser – Email user names that are derived from people's real names, plus a clear MD5 hash used by a widely-used service, give you an effective tool for guessing people's addresses. Thought-provoking work by Adam Loving.

The Bulwer-Lytton Fiction Contest – Terribly wonderful first lines of imaginary novels, inspired by the writer who came up with "It was a dark and stormy night". The Darwin Awards of literature.

Cassandra connections are costly

Photo by Sara Lando

I've been working on speeding up our page load times, and one of the slowest sections turns out to be the calls to the Cassandra database. I was surprised to see a simple value fetch taking up to 500ms. When I dug down into the Ruby gem we're using, I discovered that 400ms of that was inside a call named extract_and_validate_params(), and then I traced the time down to column_family_property(). At this point it descends into the world of auto-generated Thrift code, but as far as I could tell from inspection, the time taken is in fetching the column family schema definitions which are held as values in Cassandra's key/value store.

At that point I hopped over to the #cassandra IRC channel, and Ben Black confirmed that my suspicions were correct. The first get on a given Cassandra connection also has to fetch the schemas, so there's at least one round-trip to the database in addition to the operation you're actually performing. Ben's response was that frequent connection destruction and creation was "not a good idea, regardless of database". That's somewhat true, but the sort of connection pooling you need to avoid it is generally not much fun to handle at the application layer. When connections are expensive to create it's because they have state associated with them, and in my experience that lingering state leads to all kinds of hard-to-track-down bugs, as one piece of code sets something which then has knock-on effects in an unrelated module when it reuses the connection.

Luckily our use of Cassandra in the frontend is fairly straightforward right now, so I'm going to try the simplest possible kind of connection pooling: a single global connection that's opened once for each Ruby process, and then reused for every subsequent call. The only wrinkle I'm anticipating is error handling. If something goes wrong with a call and the connection goes down, I need to notice and re-open the connection, and it's not yet clear to me how to spot that failure reliably. Anyway, I wanted to share a bit of my workbook with the world as I'm wrestling with this, and I'll be curious to hear if anyone else has had to tackle this problem, especially in Ruby.
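
Here's a rough sketch of the shape I have in mind, not working code against the real gem. FakeClient, SharedConnection, with_client and the retry_on list are all names I've made up for illustration. I don't yet know which exceptions the Cassandra gem actually raises when a connection drops, so the caller has to pass those in. The idea is just to open one client per process the first time it's needed, reuse it, and throw it away and reopen if a call fails with one of the listed errors.

```ruby
# Hypothetical stand-in for a real Cassandra client, so the sketch runs alone.
class FakeClient
  class ConnectionError < StandardError; end

  attr_accessor :broken

  def initialize
    @broken = false
  end

  def get(key)
    raise ConnectionError, "connection lost" if @broken
    "value-for-#{key}"
  end
end

# One lazily opened client per Ruby process, reopened after a failure.
module SharedConnection
  @mutex = Mutex.new
  @client = nil
  @factory = nil

  class << self
    # The block builds a fresh client. It runs on first use and after a reset.
    def configure(&factory)
      @mutex.synchronize do
        @factory = factory
        @client = nil
      end
    end

    # Yields the shared client. If the block raises one of the retry_on
    # errors, the client is dropped and the block is run again on a new one.
    def with_client(retry_on:, retries: 1)
      attempts = 0
      begin
        yield current_client
      rescue *retry_on
        attempts += 1
        reset!
        retry if attempts <= retries
        raise
      end
    end

    # Forget the current client so the next call opens a new one.
    def reset!
      @mutex.synchronize { @client = nil }
    end

    private

    def current_client
      @mutex.synchronize { @client ||= @factory.call }
    end
  end
end

# Usage: both fetches share the one client opened by the first call.
SharedConnection.configure { FakeClient.new }
errors = [FakeClient::ConnectionError] # assumption: swap in the real gem's errors
SharedConnection.with_client(retry_on: errors) { |c| c.get("user:1") }
SharedConnection.with_client(retry_on: errors) { |c| c.get("user:2") }
```

One caveat: the retry runs the whole block a second time, so this is only safe for operations that can be repeated without harm, such as simple reads. It also does nothing about the lingering-state problem, it just keeps the state in one obvious place.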