Does anyone have the PetesPlugin binaries?

Kaleidoscope
Photo by Yesica

I'm embarrassed to admit that I've lost the compiled versions of my open-source visual-effects plugins! In a recent overhaul of petewarden.com I managed to overwrite the binary versions, and while the source code is still available, I no longer have the After Effects and other SDKs and compilers I'd need to rebuild them. They're over twelve years old now and don't work with recent AE versions, but I still regularly get requests and questions about them, so could anybody who has downloaded binary copies of either the FreeFrame, VisualJockey or AfterEffects plugins drop me an email at pete@petewarden.com? Any help much appreciated!

Five short links

Hello World in Common Crawl – I've been very excited to see the wider developer community starting to realize how many awesome projects Common Crawl makes possible. Here's a simple guide to getting started with its four billion web pages.

Remember the memristor – A great explanation of an intriguing new computing component, along with a clear-eyed look at why so many of these early-stage experiments never make it into commercial production.

ZIP code data hacking – It's amazing how many of the fundamental facts about our world are kept in closed-license silos. Anybody who wants to link US data sets pretty quickly runs into the translation issues between different area designations, so this project to convert between the common ZIP codes and the governmental FIPS designations is a fantastic idea.

PlacePugs – The ultimate image placeholders, as featured above.

Sorting one million eight-digit numbers in 1MB of RAM – So much of coding is about spotting exploitable patterns in your input data and requirements.

How to host a Tumblr blog in a subdirectory on a Ruby/Sinatra site

Uhaul
Photo by Joey Rozier

One of the most fun parts of working on Jetpac has been following Cathrine’s stellar posts on the company blog. She’s had some amazing finds, including surprises like the hammock tent! We’ve had hundreds of thousands of visitors and a lot of links, but until recently I didn’t realize that having a separate subdomain at blog.jetpac.com meant that we didn’t get the full search ranking benefit. We’re hosted on Tumblr, so I assumed moving things over to a subdirectory would be fairly straightforward, but it turned out to be surprisingly tough. In the end I not only had to manually proxy all /blog content through our frontend servers, I even had to rewrite the page content on the fly to remove hard-coded links! I found a couple of examples in PHP, but nothing that worked with our Ruby and Sinatra setup, so here’s what we ended up with:

https://gist.github.com/petewarden/3950261

In brief, this code intercepts all URLs that start with /blog on a site, and pulls the content from the equivalent yourblog.tumblr.com. If the page is HTML, it also does some quick-and-dirty regex alterations to change any root URLs in links and resource references to point to the new /blog versions. Not my most elegant bit of code, but it seems to be doing the job.
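For anyone who wants the flavor of the link rewriting without reading the whole gist, here's a minimal sketch of that step on its own. It isn't the code from the gist: the method name, the `tumblr_host` and `prefix` parameters, and the choice to touch only `href` and `src` attributes are all assumptions I've made for illustration, and the proxying and Sinatra routing are left out entirely.

```ruby
# Hypothetical helper, not taken from the gist: rewrite links in a fetched
# Tumblr page so they point under a path prefix on the main site.
def rewrite_blog_links(html, tumblr_host: 'yourblog.tumblr.com', prefix: '/blog')
  # Absolute links back to the Tumblr host become links under the prefix.
  absolute = %r{https?://#{Regexp.escape(tumblr_host)}(?=[/"'])}
  html = html.gsub(absolute, prefix)

  # Root-relative href/src values ("/post/1") gain the prefix.
  # Protocol-relative URLs ("//cdn.example.com/x.js") and paths that already
  # start with the prefix are left alone.
  already = Regexp.escape(prefix.delete_prefix('/'))
  root_relative = %r{\b(href|src)=(["'])/(?!/|#{already}(?:/|["']))}
  html.gsub(root_relative) { "#{$1}=#{$2}#{prefix}/" }
end
```

Like the original, this is quick-and-dirty regex work rather than real HTML parsing, so it will miss URLs inside inline scripts or CSS.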

Five short links

Fivecorks
Photo by Rennett Stowe

Save vs Death – As a gamer, and especially as a role-playing gamer, I've spent a lot of time learning to understand probabilities at a gut level. I'd never made the connection, but that perspective has informed so much of what I've done in my life, from optimizing products to making life choices. It's sobering to read Jim's meditation on how that approach affects your world view when you're facing mortality head on.

It’s not what you know, but who you know: The role of connections in academic promotions – A natural experiment that demonstrates how much having a personal connection to a job candidate affects decision makers, with some thoughtful analysis of why that sometimes makes sense at an individual level, even when it results in overall unfairness.

Mapping with Geocommons and OpenHeatMap – Out of all the open-source projects I've done, OpenHeatMap has had the longest life. It's a painful mess of PHP code, but it seems to meet a real need, so I try to keep it up and running. The hardest part is that as more and more people use it, the mean time between server failures shrinks, so I end up having frequent down time. I've been working on strategies to reduce the flakiness (the usual failure mode is the postgres server hitting memory problems and dying) but apologies to anyone who's been affected recently.

How location technology can change publishing – My friend Drew Breunig always has interesting things to say about practical applications of location data.

Zombie.js – After the shame of having the Jetpac login system down overnight without noticing, I've been investing heavily in automated testing. So far desktop Selenium has been my favorite approach, but someone pointed me at Zombie as an alternative, and it looks very impressive. I've used PhantomJS before for screenshots, but from what I've seen so far Zombie doesn't require X windows or other heavy dependencies. I'll give a full report once I've had a chance to use it in production!

Are you a Bay Area tech startup that wants good job candidates?

Helpwanted1
Photo by Supermuch

One of the hardest things about the recruiting process is having to turn down great potential hires because they're not a good fit with the positions we have. Most of my time at Jetpac these days is spent doing recruiting, so I've ended up with quite a few job-seekers I've been really impressed by, but that I can't hire myself. I've been sending them on piecemeal to friends, but I decided to set up something a bit more organized, so here's a mailing list you can join to see them:

https://groups.google.com/d/forum/youreallyshouldhire

Here's how the group works:

 - You must have a candidate's permission to post their details!

 - Anyone can post candidates, I just ask that you've at least emailed or spoken to them on the phone, to do the initial "Is she a serial killer?" screening.

 - There are no blind resumes; every post should give the full contact details for the candidate.

 - This is focused on Bay Area tech startups. Feel free to start your own for other areas and industries, with my blessing.

 - This will only add value if it's mostly folks who are actively hiring at their companies, so I'll be moderating the list, hopefully with a very light touch.

If you're a jobseeker and want to be shared with some interesting tech startups, contact me through our jobs page and mention 'youreallyshouldhire'!

Five short links

Freedomburger
The 'Freedom Burger' I ate in Sitka, AK last week – a bacon cheeseburger between two grilled cheese sandwiches!

Heritrix – The open-source web crawler created by the Internet Archive, source is here. It's easy to get started writing a crawler, but there are a lot of deep issues you have to wrestle with if you want sophisticated features, so it's great to have production-tested code to reference.

The animals of O'Reilly – A wonderful initiative to highlight worthwhile wildlife projects, a lot of them involving fascinating technology hacks.

What makes Paris look like Paris? – Automatically extracting the visual elements that define a place.

MangoDB – MongoDB has been fine for the applications I've used it on, and the support has been top-notch, but some frustrated person has put way too much thought into this open-source parody.

Solr vs ElasticSearch – A good overview of how the two big open-source search frameworks stack up against each other. This quora thread has some informed opinions too.

Five short links

Pentagonalkaleidoscope
Picture by Pete Kaminski

Dancing with handcuffs – the changing geography of trust in China – There's a lot to chew on in Tricia Wang's talk and notes, but it's fascinating to watch state censorship collide with social media, even if technology isn't a silver bullet against repression.

An archive of quotes – My friend Drew has put up for download a collection of quotes he's gathered by scraping news. I'm fascinated by hacks like these that rely on human conventions around the way we produce text, in this case looking for the formatting signatures that indicate that something's a quote.

Why are Americans so… – Another beautiful hack, this time using the data from Google's auto-complete to map the adjectives that are paired with different US states.

The free CDN – A little-known feature of Google's App Engine setup that gives you a content-delivery network for no extra charge!

Craigslist blocks search bots – I've been following the PadMapper/CL case closely, because it's putting the implicit bargain that publishers make with search engines under the spotlight. It's still unclear what's going on with the latest developments, but obviously Craig wants to keep the site showing up in search results but avoid handing over all his data to third parties. There's a whole long post I need to write with my thoughts about this.

 

Five short links

Spacefive

Picture by Robert Edgar

Postgres.app – I was complaining about all the problems I've had installing PostgreSQL on OS X, so I'm grateful to a commenter for pointing me to this Heroku-sponsored attempt to make it easier. It does seem a lot more usable than any of the alternatives I've tried.

Swoosh – "A generic approach to entity resolution". Fuzzily matching large numbers of records to figure out which ones represent the same objects is a fundamental operation whenever you're dealing with big unstructured data sets. It's hard to implement all the plumbing to support the matching operations, and even harder to optimize the whole process, so this Stanford framework looks appealing, and I know at least one startup is using it in production.

Charles – A startlingly-useful desktop app for inspecting the web traffic your machine is sending and receiving. Great for understanding the underlying nuts-and-bolts of an online application.

Tune in next week – I spend a lot of time trying to understand our users' psychology so we can convert the time they spend on Jetpac into actions that persuade more people to sign up, and generally build the business. This is something I struggle with, because I worry that I'm doing something unseemly, which is why this essay on the art of the cliff-hanger was so striking. Story-tellers have hacked their audiences for millennia, aiming for the moment when "curiosity is converted into a commercial transaction", which captures the balance I'm trying to strike.

Rare photos of the Soviet bomb project – There are some amazing photos from the dawn of the atomic age, but what leapt out at me was how the scientist behind the Communist bomb figured out that the US was working on a weapon. After he'd published a nuclear research paper that received no citations "Flerov did a literature search and realized that nobody was publishing on fission anymore — and indeed, all of those who had been publishing on it had dropped off the map completely. He immediately started writing letters — including to Stalin himself — pointing out that this could only indicate that the United States was working on an atomic bomb." This was information that was visible to anyone, but only he seemed to spot it.

Seven short links

Wildwheelwonders
Photo by Trials and Errors

I've been a bit distracted for the past few weeks, so here's a bumper crop of links from my backlog:

How to install Ruby on Rails on a Mac – It's amazing how tough getting all the dependencies for a Ruby project can be on stock OS X, especially Postgres. It's a big reason why I'm now moving to Vagrant.

A vernacular web – A look back at the early web through an art history lens. I thought it would be cheap nostalgia, but it turned out to be surprisingly revealing and insightful.

Making Maps – A wonderful guide to online map making from the Chicago Tribune.

Speak.js – The power of Javascript in modern browsers is amazing. Here's a speech synthesizer that relies on the new typed arrays addition, compiled from LLVM!

GFX – A 3D CSS animation library. Does exactly what it says on the tin, but it's another sign of how capable modern browsers are.

Email guesser – Email user names that are derived from people's real names, plus a clear MD5 hash used by a widely-used service, give you an effective tool for guessing people's addresses. Thought-provoking work by Adam Loving.

The Bulwer-Lytton Fiction Contest – Terribly wonderful first lines of imaginary novels, inspired by the writer who came up with "It was a dark and stormy night". The Darwin Awards of literature.

Cassandra connections are costly

Connection
Photo by Sara Lando

I've been working on speeding up our page load times, and one of the slowest sections turns out to be the calls to the Cassandra database. I was surprised to see a simple value fetch taking up to 500ms. When I dug down into the Ruby gem we're using, I discovered that 400ms of that was inside a call named extract_and_validate_params(), and then I traced the time down to column_family_property(). At this point it descends into the world of auto-generated Thrift code, but as far as I could tell from inspection, the time taken is in fetching the column family schema definitions which are held as values in Cassandra's key/value store.

At that point I hopped over to the #cassandra IRC channel, and Ben Black confirmed that my suspicions were correct. The first get on a given Cassandra connection also has to fetch the schemas, so there's at least one round-trip to the database in addition to the operation you're actually performing. Ben's response was that frequent connection destruction and creation was "not a good idea, regardless of database", which is somewhat true, but generally the sort of connection pooling that you need to avoid it is not much fun to handle at the application layer. When connections are expensive to create it's because they have state associated with them, and in my experience that lingering state leads to all kinds of hard-to-track-down bugs, as one piece of code sets something which then has knock-on effects in an unrelated module when it reuses the connection.

Luckily our use of Cassandra in the frontend is fairly straightforward right now, so I'm going to try the simplest possible kind of connection pooling: a single global connection that's opened once for each Ruby process, and then reused for every subsequent call. The only wrinkle I'm anticipating is error handling. If something goes wrong with a call and the connection goes down, I need to notice and re-open the connection, and it's not clear how to spot that failure reliably. Anyway, I wanted to share a bit of my work-book with the world as I'm wrestling with this, and I'll be curious to hear if anyone else has had to tackle this problem, especially in Ruby.
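To make the idea concrete, here's a rough sketch of the shape I have in mind, written without any Cassandra-specific calls. Everything in it is hypothetical: the `SharedConnection` class, the `retry_on` list, and the opener block are names I've made up, and the exception classes listed are stand-ins, since which errors actually signal a dead Cassandra connection is exactly the part I haven't figured out.

```ruby
# Hypothetical wrapper: holds one lazily opened connection per process and
# reopens it once if a call raises one of the listed errors.
class SharedConnection
  # retry_on is an assumption: substitute whatever exceptions your client
  # raises when the underlying connection has gone away.
  def initialize(retry_on: [IOError, SystemCallError], &opener)
    raise ArgumentError, 'an opener block is required' unless opener
    @opener = opener
    @retry_on = retry_on
    @connection = nil
    @lock = Mutex.new
  end

  # Yields the shared connection. If the block raises a retryable error, the
  # connection is discarded, reopened, and the block is run one more time.
  # A second failure is re-raised to the caller.
  def with_connection
    @lock.synchronize do
      attempts = 0
      begin
        @connection ||= @opener.call
        yield @connection
      rescue *@retry_on
        @connection = nil
        attempts += 1
        retry if attempts < 2
        raise
      end
    end
  end
end
```

The opener block is where the real client would be created, so the schema fetch should only be paid on the first call and after a reconnect. The mutex serializes every call through the one connection, which is fine for a simple frontend but would become a bottleneck in a heavily threaded process.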