Five short links

Hello World in Common Crawl – I've been very excited to see the wider developer community starting to realize how many awesome projects Common Crawl makes possible. Here's a simple guide to getting started with its four billion web pages.

Remember the memristor – A great explanation of an intriguing new computing component, along with a clear-eyed look at why so many of these early-stage experiments never make it into commercial production.

ZIP code data hacking – It's amazing how many of the fundamental facts about our world are kept in closed-license silos. Anybody who wants to link US data sets pretty quickly runs into the translation issues between different area designations, so this project to convert between the common ZIP codes and the governmental FIPS designations is a fantastic idea.

PlacePugs – The ultimate image placeholders, as featured above.

Sorting one million eight-digit numbers in 1MB of RAM – So much of coding is about spotting exploitable patterns in your input data and requirements.

How to host a Tumblr blog in a subdirectory on a Ruby/Sinatra site

Uhaul
Photo by Joey Rozier

One of the most fun parts of working on Jetpac has been following Cathrine’s stellar posts on the company blog. She’s had some amazing finds, including surprises like the hammock tent! We’ve had hundreds of thousands of visitors and a lot of links, but until recently I didn’t realize that having a separate subdomain at blog.jetpac.com meant that we didn’t get the full search ranking benefit. We’re hosted on Tumblr, so I assumed moving things over to a subdirectory would be fairly straightforward, but it turned out to be surprisingly tough. In the end I not only had to manually proxy all /blog content through our frontend servers, I even had to rewrite the page content on the fly to remove hard-coded links! I found a couple of examples in PHP, but nothing that worked with our Ruby and Sinatra setup, so here’s what we ended up with:

https://gist.github.com/petewarden/3950261

In brief, this code intercepts all URLs that start with /blog on a site, and pulls the content from the equivalent yourblog.tumblr.com. If the page is HTML, it also does some quick-and-dirty regex alterations to change any root URLs in links and resource references to point to the new /blog versions. Not my most elegant bit of code, but it seems to be doing the job.
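For anyone who just wants the shape of the approach without opening the gist, here's a minimal sketch of the two pieces described above: mapping a /blog path to the Tumblr URL, and rewriting links in the returned HTML. This is not the code from the gist. The names `TUMBLR_HOST`, `BLOG_PREFIX`, `blog_path?`, `tumblr_url_for` and `rewrite_root_links` are made up for illustration, the https scheme is an assumption, and the regexes are just as quick-and-dirty as the description suggests.

```ruby
# Hypothetical settings for illustration; swap in your own blog's host.
TUMBLR_HOST = "yourblog.tumblr.com"
BLOG_PREFIX = "/blog"

# True for "/blog" and anything below it, but not for "/blogroll".
def blog_path?(path)
  path == BLOG_PREFIX || path.start_with?("#{BLOG_PREFIX}/")
end

# Map "/blog/post/123" to the equivalent page on the Tumblr host.
# Assumes the blog is reachable over https.
def tumblr_url_for(path)
  remainder = path.sub(/\A#{Regexp.escape(BLOG_PREFIX)}/, "")
  remainder = "/" if remainder.empty?
  "https://#{TUMBLR_HOST}#{remainder}"
end

# Point links and resource references in the HTML at the /blog versions.
def rewrite_root_links(html)
  # Root-relative URLs first: href="/post/1" becomes href="/blog/post/1".
  # The lookahead leaves protocol-relative URLs ("//cdn...") alone.
  html = html.gsub(%r{\b(href|src|action)=(["'])/(?!/)}, "\\1=\\2#{BLOG_PREFIX}/")
  # Then absolute URLs that point back at the Tumblr host.
  html.gsub(%r{https?://#{Regexp.escape(TUMBLR_HOST)}}, BLOG_PREFIX)
end
```

In a Sinatra app you'd call these from a route that catches everything under /blog, fetch `tumblr_url_for(request.path)` with something like Net::HTTP, and only run `rewrite_root_links` when the response's content type is HTML. The order of the two substitutions matters: doing the absolute URLs first would leave them looking root-relative, and they'd get the /blog prefix twice.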

Five short links

Fivecorks
Photo by Rennett Stowe

Save vs Death – As a gamer, and especially as a role-playing gamer, I've spent a lot of time learning to understand probabilities at a gut level. I'd never made the connection, but that perspective has informed so much of what I've done in my life, from optimizing products to making life choices. It's sobering to read Jim's meditation on how that approach affects your world view when you're facing mortality head on.

It’s not what you know, but who you know: The role of connections in academic promotions – A natural experiment that demonstrates how much having a personal connection to a job candidate affects decision makers, with some thoughtful analysis of why that sometimes makes sense at an individual level, even when it results in overall unfairness.

Mapping with Geocommons and OpenHeatMap – Out of all the open-source projects I've done, OpenHeatMap has had the longest life. It's a painful mess of PHP code, but it seems to meet a real need, so I try to keep it up and running. The hardest part is that as more and more people use it, the mean time between server failures shrinks, so I end up with frequent downtime. I've been working on strategies to reduce the flakiness (the usual failure mode is the Postgres server hitting memory problems and dying), but apologies to anyone who's been affected recently.

How location technology can change publishing – My friend Drew Breunig always has interesting things to say about practical applications of location data.

Zombie.js – After the shame of having the Jetpac login system down overnight without noticing, I've been investing heavily in automated testing. So far desktop Selenium has been my favorite approach, but someone pointed me at Zombie as an alternative, and it looks very impressive. I've used PhantomJS before for screenshots, but from what I've seen so far, Zombie doesn't require X Windows or other heavy dependencies. I'll give a full report once I've had a chance to use it in production!