Death of a startup

Photo by Mugley

Mailana, Inc. is dead. This week I've been going through the formalities, squaring away the legal paperwork and returning the tiny amount of money I'd raised, but truthfully it had been dead a long time; I just hadn't faced it. I'm over the moon about my new startup, but it feels like time to raise a glass to the last three years of my life, and the majority of my savings.

It began as a dream while I was still at Apple. I knew I wanted to strike out on my own, but by a strange kind of luck the painfully-slow green card process kept me living the corporate life for five years while Apple's shares kept rising, and my tiny windfall from technology I sold them when I joined became enough to live on for a few years. Within a couple of weeks of getting my permanent resident status I handed in my notice, and set out to Build A Startup.

Technology risk

The hardest lesson to learn was how obsessed with technology I am. I had a problem in mind, one that I'd lived with at Apple – how to identify experts in large companies – but to be honest I chose that because I already had a solution that involved interesting engineering. I spent a year building a pipeline that could semantically analyze hundreds of millions of email messages on a shoestring budget, seamlessly interface with Exchange, and present the results as beautiful visualizations. The only thing I failed to do was sell it to the enterprises I claimed were my customers. I wasn't a complete idiot; I spent time flying to boardrooms and talking to mid-level executives, creating demo videos, I even wrangled a few free pilot programs, but fundamentally I didn't care enough.

Shiny things

That meant when I'd proven my technical point, and faced a mountain of distribution problems instead, at some level I started to look for ways out. I'd already been using Twitter as a source of hundreds of millions of public messages for my demos, and then the public versions of the visualizations started to get some attention. I'd soured on the enterprise sales experience, so I started to explore what I could do on the consumer side. The trouble was I'd completely lost sight of what problem I was tackling. At least with the original version I'd set out to fix an issue I'd spent years living with. Now I was driven purely by curiosity, hoping I'd find neglected data that was so useful the problems I'd apply it to could be an afterthought.

Lonesome founder

I knew myself well enough to spot some of this at the time, and to know that the best prescription was a business- and product-focused co-founder. I spent a lot of time dating potential partners, especially as I went through Techstars, but there was never a good enough fit. I needed somebody who was willing to bet on what we'd now call Big Data, who'd believe that there was a coming revolution that would bring data-processing problems that had previously required millions of dollars of investment within the reach of early-stage startups. Without external validation, nobody non-technical was convinced of either the concept in general, or my particular ability to execute on it.

Chasing the dragon

Muttering to myself in true mad scientist fashion, I set out to prove them all wrong by adding so many awesome features to my by-now-Gmail-addon Mailana that the world would have no choice but to sit up and take notice! One of these features was an email-contact-to-social-network-profile connector that involved me indexing public Facebook profiles, and then getting in a legal kerfuffle. Nightmarish as the situation was, the publicity and validation I received from that visualization work was addictive. I set out to explore that more, with the excuse that it provided distribution opportunities for my business, but if I was honest with myself it was because I found the whole area fascinating.

Startup neglect

I wandered farther and farther from my nominal business, first as I launched OpenHeatMap, and then as I delved into data arcana through books and journalism. It felt great, because I was actually making a difference to the world, I was having an impact! Sadly I wasn't building a company. I took one final stab at that with the Data Science Toolkit as the final product from Mailana, but in my heart I knew that it had a lot more potential as a long-term open-source project than as a revenue-generating business.

Closure, and a new start

It took a lot of soul-searching to accept, but I knew Mailana was over. I'd originally given myself an allowance of two years to spend with no revenue, and it had been over three. I was lucky enough to have a circle of trusted friends who were working on interesting projects, but Julian and Derek's was particularly appealing. I'd been working with them for months as an advisor and fell in love with the idea behind Jetpac. It gave me the chance to keep exploring a lot of the technology ideas I was fascinated by, but within the context of an actual business, with a revenue model, funding and more than one employee!

I don't regret any of the time or money I sank into Mailana, I've got so much to be thankful for over the last three years. The people I've met alone make it worth every minute, and I feel like I've now got a lifetime's worth of mistakes to learn from! I'm sorry to say goodbye to Mailana, but glad I had the chance to try something crazy and fail.

Five short links

Photo by Jannis Andrija Schnitzer

On being wrong in Paris – A great general meditation on the slippery nature of facts, but the specific example is very resonant. We tend to think of places having clear boundaries, but depending on who I was talking to I'd describe my old house as either in "Los Angeles", "Near Thousand Oaks" or "Simi Valley". Technically I wasn't in LA, but the psychological boundaries aren't that neat.

The devil in the daguerreotype details – The detail you can see on this old photograph is amazing, and I love how they delve into the capture method. I was disappointed there was nothing on the role of lenses as a limiting factor on resolution though; I'd love to know more about that.

Katta – A truly distributed version of Lucene, designed for very large data sets. I haven't used it myself yet, but I'm now very curious.

HBase vs Cassandra – An old but fair comparison of the two technologies. This mirrored the evaluation I went through when picking the backend database for Jetpac, and I ended up in the same place.

It's cheaper to keep 'em – Your strategy is sometimes pre-determined by what numbers you're paying attention to. If you start off with the assumption your job is to get new users as cheaply and fast as possible, you'll never realize how important retaining existing customers can be.

Sad Alliance

A friend inspired me to dig around in my digital attic, and resurrect a video of one of my live VJ performances. It's playing off the music of Richie Hawtin and Pete Namlook, and was created on the fly using my home-brewed software, a MIDI controller, and a live camera feedback loop. There are no clips or pre-recorded footage; everything's my own response to the audio as it's happening.

Lessons from a Cassandra disaster

Photo by Earthworm

Yesterday one of my nightmares came true: our backend went down for seven hours! I'd received an email from Amazon warning me that one of the instances in our Cassandra cluster was having reliability issues and would be shut down soon, so I had to replace it with a new node. I'm pretty new to Cassandra and I'd never done that in production, so I was nervous. Rightfully so, as it turned out.

It began simply enough: I created a new server using a DataStax AMI, gave it the cluster name, pointed it at one of the original nodes as a seed, and set 'bootstrapping' to true. It seemed to do the right thing, connecting to the cluster, picking a new token and streaming down data from the existing servers. After about an hour it appeared to complete, but the state shown by nodetool ring was still Joining, so it never became part of the cluster. After researching this on the web without any clear results, I popped over to the #cassandra IRC channel and asked for advice. I was running 0.8.1 on the original nodes and 0.8.8 on the new one, since that was the only DataStax AMI available, so the only suggestion I got was to upgrade all the nodes to a recent version and try again.
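For reference, those settings all live in cassandra.yaml on the new node. In the 0.8 series the bootstrap flag is called auto_bootstrap, and a minimal version looks something like this (the cluster name and seed address here are placeholders, not our real values):

    cluster_name: 'MyCluster'
    auto_bootstrap: true
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.0.0.1"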

This is where things started to get tough. There's no obvious way to upgrade a DataStax image and IRC gave me no suggestions, so I decided to figure out how to do it myself from the official binary releases. I took 0.8.7 and looked at where the equivalent files to the ones in the archive lived on disk. Some of them were in /usr/share/cassandra, others in /usr/bin, so I made backup copies of those directories on the machine I was upgrading. I then copied over the new files, and tried restarting Cassandra. I hit an error, and then I made the fatal mistake of trying to restore the original /usr/bin by first moving out the updated one, thus bricking that server.

Up until now the Cassandra cluster had still been functional, but the loss of the node my code contacted initially meant we lost access to the data. Luckily I'd set things up so that the frontend was mostly independent of the backend data store, so we were still able to accept new users, but we couldn't process them or show their profiles. I considered rejigging the code so that we could limp along with two of the three nodes working, but my top priority was safeguarding the data, so I decided to focus on getting the cluster back up as quickly as I could.

I girded my loins and took another try at upgrading a second node to 0.8.7, since the version mismatch was the most likely cause of the failure-to-join issue according to IRC. I was painstaking about how I did it this time though, and after a little trial and error, it worked. Here are the steps, in outline; the exact download URL, file locations and init script will vary by image, so double-check yours:
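    # (A sketch of the process - paths and URLs are assumptions, not gospel)
    # Grab the matching official binary release and unpack it
    wget http://archive.apache.org/dist/cassandra/0.8.7/apache-cassandra-0.8.7-bin.tar.gz
    tar xzf apache-cassandra-0.8.7-bin.tar.gz
    # Stop the running server before touching any files
    sudo /etc/init.d/cassandra stop
    # Back up the directories you're about to change - copy, don't move!
    sudo cp -a /usr/share/cassandra /usr/share/cassandra.backup
    sudo cp -a /usr/bin /usr/bin.backup
    # Copy the new jars over the old ones
    sudo cp apache-cassandra-0.8.7/lib/*.jar /usr/share/cassandra/
    # Copy the launch scripts, but leave the AMI's cassandra.in.sh alone
    for f in apache-cassandra-0.8.7/bin/*; do
        [ "$(basename "$f")" != "cassandra.in.sh" ] && sudo cp "$f" /usr/bin/
    done
    # Restart as a superuser
    sudo /etc/init.d/cassandra start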

There were a couple of gotchas. You shouldn't copy the bin/cassandra.in.sh file from the distribution, since it contains settings, like the location of the library files, that you want to retain from the DataStax AMI. And if you see this error:

ERROR 22:15:44,518 Exception encountered during startup.

java.lang.NullPointerException

at org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:606)

it means you've forgotten to run Cassandra as an su user!

Finally I was able to upgrade both remaining nodes to 0.8.7, and retry adding a new node. Maddeningly it still made it all the way through the streaming and indexing, only to pause on Joining forever! I turned back to IRC and explained what I'd been doing, and asked for suggestions. Nobody was quite sure what was going on, but a couple of people suggested turning off bootstrapping and retrying. To my great relief, it worked! It didn't even have to restream, the new node slotted nicely into the cluster within a couple of minutes. Things were finally up and running again, but the downtime definitely gave me a few grey hairs. Here's what I took away from the experience:

Practice makes perfect. I should have set up a dummy cluster and tried a dry run of the upgrade there. It's cheap and easy to fire up extra machines for a few hours, and would have saved a lot of pain.

Paranoia pays. I was thankful I'd been conservative in my data architecture. I'd specified three-way replication, so that even if I'd bricked the second machine, no data would have been lost. I also kept all the non-recoverable data either on a separate Postgres machine, or in a Cassandra table that was backed up nightly. The frontend was still able to limp along with reduced functionality when the backend data store was down. There are still lots of potential showstoppers of course, but the defence-in-depth approach worked during this crisis.

Communicate clearly. I was thankful that I'd asked around the team before making the upgrade, since I knew there was a chance of downtime whenever you had to upgrade a database server. We had no demos to give that afternoon, so the consequences were a lot less damaging than they could have been.

The Cassandra community rocks. I'm very grateful for all the help the folks on the #cassandra IRC channel gave me. I chose it for the backend because I knew there was an active community of developers who I could turn to when things went wrong, even when the documentation was sparse. There's no such thing as a mature distributed database, so having experienced gurus to turn to is essential, and Cassandra has a great bunch of folks willing to help.

How to brick your Ubuntu EC2 server

Photo by Mutasim Billah

sudo mv /usr/bin /usr/bin.latest - Don't do this!

What on earth possessed me to run that command? I had just attempted to upgrade to Cassandra 0.8.7 on a DataStax AMI which started as 0.8.1. That involved manually copying files, so trying to be careful I made a backup of any directories I was touching, including /usr/bin. The upgrade didn't work, so I decided to roll back by swapping out the updated and backed-up directories, and the command above was the first stage.

As far as I can tell, there's no way to do anything useful with the machine after that. Ubuntu requires sudo before you can perform any system-related task, and that command is broken by the folder change since it can't find /usr/bin/python. I had the default Ubuntu setup where you can't run su or ssh in as root, and running /usr/bin.latest/sudo gave a cryptic 'sudo: must be setuid root' error, possibly because of the dot in the path name? Even worse, the Cassandra data files required permissions I didn't have to access, so I couldn't even copy them off.
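The boring-but-safe version of that backup, for anyone else heading into system-directory surgery, is to copy rather than move, so every binary keeps working from its original path:

    # Copy the directory; /usr/bin stays intact the whole time
    sudo cp -a /usr/bin /usr/bin.backup
    # To roll back, copy files over the top instead of swapping directories
    sudo cp -a /usr/bin.backup/. /usr/bin/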

It turned into an interesting puzzle for the Unix folks in my Twitter stream, thanks for all the ideas. I'm just glad I have three-way replication, and in the worst case nightly backups of any non-reproducible data. The pain of losing hours that I could be spending on features makes this a memorable lesson though. Now I just have to persuade my replacement Cassandra node to get beyond 'Joining' after I add it. At least this is giving me plenty to blog about!

Five short links

Photo by Nick Kenrick

Massive Scale Data Mining for Education – Companies invented Big Data techniques to mine useful information from mountains of behavioral data and optimize shopping sites, and now that those techniques are well understood I'm excited to see how they can be applied to other areas. This post outlines one idea for applying them to education – I've no idea how well this would work in practice but I'd love to see the results.

Giant Tesla Coils – A project to create 200-foot-long bolts of artificial lightning. I don't have to explain how awesome this is.

Zynga's tough culture risks a talent drain – There's something about 'fun' industries like games or films that seems to encourage terrible working conditions. Much as I love data, it's really hard to use it to drive personnel decisions; see Enron for a classic example.

Forward Secrecy – Google's doing great work by supporting this improvement to https, including code contributions to OpenSSL.

Common Crawl Email List – CC is a fantastic project to create a sharable data set of web content, and I'm glad to see a community starting to grow up around it. Now, who will post the first message?

How to post a screenshot of your site to a user’s Facebook account


If a user shares your site's content as a photo on Facebook it's incredibly powerful marketing. It means you've produced something compelling enough that users want to show it off to their friends, and it gives you a good chance to entice those friends to try your service. It's pretty hard to figure out how to implement, though; there are a lot of moving parts and no examples that show how to put them all together. Since I just added this as a new feature on Jetpac and it's been a big hit, I thought I'd share how I did it.

Asking for extra permissions

One of the nicest surprises about jumping back into Facebook development after a long hiatus was how savvy users are about permissions. In the old days people tended to click through no matter what was there, but I found our acceptance rate dropped off a cliff when we had 'publish_stream' as a default permission. I'm guessing that's because users have been burned by spammy apps, so I now ask for a lot fewer permissions on the first connect. That meant that the first step after a user clicked on the 'Share' button was to ask for that extra permission.

Actually though, there's a step before that, as suggested by my friend Jeff Widman. He's had a lot of experience optimizing Facebook conversions and he recommended a short dialog explaining why I was going to ask for permissions before sending the users to Facebook's site. That seemed obvious after he said it, so I whipped up a quick explanation:

(Screenshot: the dialog explaining why we ask for the extra permission)

The next challenge was how to send the user to a new login dialog that requested the publish_stream permission. We're using the OmniAuth gem in Ruby, and with a bit of Googling I found Mike Pack's explanation of how to add an extra setup stage in Ruby on Rails. It was a bit funky, involving returning a 404 code to indicate success for example, but it worked like a charm. Here's the shape of the Sinatra version I ended up with; treat it as a sketch, since it assumes the omniauth-facebook strategy and the /share_auth route name is made up for illustration:
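    require 'sinatra'
    require 'omniauth-facebook'

    use OmniAuth::Builder do
      # The :setup lambda runs at the start of every /auth/facebook request,
      # so we can vary the requested scope per-request instead of hard-coding
      # publish_stream into the initial connect.
      provider :facebook, ENV['FACEBOOK_KEY'], ENV['FACEBOOK_SECRET'],
        :setup => lambda { |env|
          request = Rack::Request.new(env)
          scope = request.params['scope'] || 'email'
          env['omniauth.strategy'].options[:scope] = scope
        }
    end

    # The 'Share' button sends users here after the explanation dialog
    get '/share_auth' do
      redirect '/auth/facebook?scope=email,publish_stream'
    end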

Rendering the screenshot

It's surprising just how hard it is to take a screenshot of a web page. For very good security reasons it's not something you can do in a general way on the client side. It's still possible to run a signed Java Applet if you can persuade your users to click through scary security dialogs, but there's no other way I know to access the browser's rendering of the DOM. You can write your own HTML renderer into something like Canvas, or use Flash's renderer, but those have very patchy results.

That meant I had to investigate server-side rendering. In the past I'd experimented with using Firefox as a headless browser, but I was intrigued by what I'd seen of tools like PhantomJS that use QT's built-in support to render using webkit. This turned out to be a good solution, with a few things to watch out for:

– It's not truly headless. You still need X Windows to run it, but happily xvfb-run is easy to install and does the trick, at the cost of a bit of startup overhead and complexity.

– It takes about 15 seconds to render in our case, which is long enough that we needed some kind of frontend logic to tell the user that we're still working on it.

– If things go wrong, there are no obvious error messages; the output image just isn't there. I didn't debug deeply enough to figure out if there's some stderr that I'm missing, or if it's lost in the bowels of X Windows somewhere, but when there's a problem it makes figuring it out tough.

– The best way I could find to integrate it was through a system call to an external process. In my case the URL has no user input, but if it did you'd need to validate everything that goes into the command string to avoid nasty security issues. It also meant lots of kludgy bouncing of data back-and-forth between the script and the file system.

– You'll have to think about how to authenticate the call, since the server-side page request won't automatically inherit the user's normal cookies. In my case I ended up packaging the authentication information so I could appear to be the user as far as my frontend server was concerned.

– You'll see scroll-bars baked into the result unless the image is the same size as the rendered page. In my case we have CSS that always causes this problem no matter what size you specify! I'll be putting in a fix to special-case the styling as a workaround tomorrow, but there doesn't seem to be a general way to solve this.

Here's the shape of the function I wrote to convert a URL into an image. Treat it as a sketch: rasterize.js is the example script that ships with PhantomJS, and the install path is an assumption:
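    require 'tempfile'

    # Render the page at 'url' with PhantomJS and return the PNG bytes,
    # or nil if the render failed. (Sketch - paths are assumptions.)
    def screenshot_for_url(url)
      output = Tempfile.new(['screenshot', '.png'])
      begin
        # xvfb-run supplies the X server that PhantomJS needs; passing the
        # arguments as an array avoids shell-quoting problems with the URL.
        system('xvfb-run', '--auto-servernum',
               'phantomjs', '/usr/local/share/phantomjs/examples/rasterize.js',
               url, output.path)
        # PhantomJS doesn't surface errors usefully, so the only reliable
        # check is whether the output file actually has anything in it.
        return nil unless File.size?(output.path)
        File.open(output.path, 'rb') { |f| f.read }
      ensure
        output.close
        output.unlink
      end
    end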

Uploading the screenshot to Facebook

We're almost there! The final hurdle was figuring out how to use Facebook's API to actually upload the image to the user's profile. I was excited when I ran across an official blog post describing how to do this, but unfortunately it only shows how to craft a form a user can use to upload a file from their own machine. I needed something that would emulate a multipart form post from Ruby, and happily I discovered Cody Brimhall's module that did exactly that. I had to modify it a little since it still expected to pull the data from a file on the server, instead of an in-memory object, but that was easy enough. Here's the shape of the modified module, trimmed down to the essentials (the boundary is just an arbitrary string that shouldn't appear in the data):
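    require 'net/https'
    require 'uri'

    # Minimal multipart/form-data POST support, taking the image bytes from
    # memory instead of reading them from a file on disk. (A sketch of the
    # approach, not the exact module.)
    module Multipart
      BOUNDARY = 'x93kfPbo83hg023gBMdfa'

      def self.post(url, params, filename, file_data)
        parts = params.map do |name, value|
          "--#{BOUNDARY}\r\n" +
          "Content-Disposition: form-data; name=\"#{name}\"\r\n\r\n" +
          "#{value}\r\n"
        end
        # 'source' is the field name the Graph API expects for the photo bytes
        parts << "--#{BOUNDARY}\r\n" +
          "Content-Disposition: form-data; name=\"source\"; filename=\"#{filename}\"\r\n" +
          "Content-Type: image/png\r\n\r\n" +
          "#{file_data}\r\n"
        body = parts.join + "--#{BOUNDARY}--\r\n"

        uri = URI.parse(url)
        http = Net::HTTP.new(uri.host, uri.port)
        http.use_ssl = (uri.scheme == 'https')
        http.post(uri.request_uri, body,
                  'Content-Type' => "multipart/form-data; boundary=#{BOUNDARY}")
      end
    end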

And here's a snippet of code that shows how to call the Graph API with the right form data; the variable names are placeholders, and the access token has to come from the publish_stream login above:
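    # 'graph_token', 'caption' and 'profile_url' are hypothetical names
    image_data = screenshot_for_url(profile_url)
    response = Multipart.post(
      'https://graph.facebook.com/me/photos',
      { 'access_token' => graph_token, 'message' => caption },
      'screenshot.png',
      image_data)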

Before I call this, I give the user a preview of the image and the chance to write a custom caption for it:

(Screenshot: the image preview and caption dialog)

I was briefly tempted to give people an option to tag friends who are mentioned in the profile but that felt too spammy, and it turns out it's prohibited by Facebook anyway. In the future we might make it easy for users to 'Send' the photo, but it's a very delicate line to tread. We've seen decent uptake from it just appearing in the stream, so we're happy with that for now.

There you have it, the three steps towards screenshot sharing nirvana! If you've got something remarkable enough that users want to share it with their friends, now you can make it easy for them and drive your own growth at the same time.

Five short links

Photo by CJ Schmit

GitHub Secrets – One of the things I gave thanks for on Thursday was the improvements I've seen in my development environment recently. I was sceptical after being burned by previous 'upgrades', but Xcode 4 is a big step forward, and this post illustrates why GitHub has been a godsend. It's built by people who live the same problems as me, and it's great to see all the easter-egg features they've snuck in to solve them, even when they haven't been able to expose them through the UI.

Downloading the WIGLE data set – I've been working on an update for my data sources handbook, and I was excited to see a user-generated database of Wifi networks. Unfortunately it's a write-only store in a lot of ways. You can use their proprietary desktop tool to access the data in small chunks, but there's no way of downloading the complete set. I understand their reasoning in a narrow sense (they want to generate value from their data set by keeping it under wraps), but I think they're missing a big opportunity. There are already commercial providers of this information; by following the Wikipedia model instead of the Encyclopedia one, they could reach a whole different set of people.

Libraries – Where it all went wrong – An inspired rant by my friend Nat Torkington. I still love libraries, they were my haven and inspiration as a kid, but they're no longer part of my life.

Weathermob – Crowdsourced weather on your iPhone. I see Britain as a market ripe for the picking, considering the proportion of my conversations with friends and family that are about the weather. They should have used 'cloud-sourcing' somewhere in the message though.

Flamethrower Storage – A notice from an Antarctic researcher with too much time on their hands.

How to create a terrible visualization


The last couple of visualizations I've done have been complete flops, at least in terms of traffic. A geeky post about my profiling habits got more visitors than a shiny 3D globe! It's never fun to confront, but as Bob Sutton says, 'failure sucks but instructs'. In that spirit, here's what I learned about how to create an unpopular visualization:

Tell lots of stories at once

I love exploring complex data sets, but it takes a lot of effort and time. Most people are looking for a quick insight, something catchy, unexpected, but obvious once you see it. Unless you have a one-sentence message that you can get across in the first few seconds, the audience will move on. The two really popular visualizations I built (the Five Nations of Facebook and the iPhone tracking app) had a strong, simple story attached. The most recent ones have been exploratory tools without much of a narrative behind them.

Focus on the technology

I'm completely technology-driven. My starting point is always finding interesting but unused data, and trying to bring it to light as a visualization. That often involves creating new techniques to capture and show the information, but my weakness is that I'll often fall in love with those techniques, at the expense of the end result. With the globe, I relied completely on WebGL, so only people with Chrome or Firefox on a non-mobile device could view it. I get excited just by the idea of being able to run complex 3D rendering inside a browser from Javascript, but I know that leaves me in a minority.

Advanced technology like that is essential for a strong visualization, but you need something more on top. Often having a static image of the result is the most powerful product, if you can impress people with that it means you have a story that doesn't require them to interact with a tool to understand. I'd actually had my Facebook visualization out for a couple of weeks as an interactive service with almost no visitors. It was only when I created a throwaway blog post with a screenshot and some funny names that it reached millions of people.

Copy a previous success

People pass a link to their friends if they think it's remarkable, something they've not seen before. That means that the bar keeps getting higher and it's very hard to repeat an approach that worked before and get the same results. A couple of years ago there weren't very many online visualizations, so it was a lot easier to get noticed. I've been excited to see so many amazing projects appear recently, as a visualization fan it's been a golden age, but it does mean the competition is a lot stiffer. You need to build something above and beyond what's already been done to get noticed.   

Leave out the magic

One of the things I love about creating visualizations is that it's more art than science. There is no formula for success, and the only way I know to make progress is to follow my own curiosity. A visualization really taking off depends on dozens of things all going right, so the whole process does feel like magic sometimes. At heart I'm building them for my own enjoyment, and I hope that comes out in the results. People do seem to respond to a sense of fun, and the best way to create a boring visualization is to force one out.

In the past I've effectively been goofing off when I was working on a new graph, procrastinating on my real work, but these days my responsibility for Jetpac is always in the back of my mind. That has cramped my imagination, so I'll be trying to get back to my footloose and fancy-free roots and worry less about traffic. I can't guarantee a more whimsical approach will give more interesting visualizations, but I know for sure I'll be having more fun!

The View from Your Window Globe


I love Andrew Sullivan's 'View from Your Window' feature. Readers from around the world send in their favorite shots, and over weeks and months you start to see a picture of the whole world emerging. Unlike the usual news-driven photography, these are all quiet, subtle shots without any commentary and few people. Individually they're not striking, but together they become something magical.

Andrew's already published a book of the best photos, but I've always wanted a more dynamic way to explore the hundreds of images. Last year I created an OpenHeatMap showing the locations, but VFYW always makes me imagine a day in the life of the planet, so I kept trying to imagine better ways to share that vision. The recent rise of WebGL meant I could finally build something truly interactive, a 3D globe showing the world taking photos as the day progresses:

http://vfyw.jetpac.com

It does use the latest HTML5 features, so you'll need a recent version of Chrome or Firefox to use it. I have contacted Chris Bodenner at the Daily Dish to make sure they're happy with this new view of their content, but I'm not affiliated with them in any way, just a fan! A big thanks to the Google folks for their WebGL sample code too, MrDoob for his fantastic Three.js framework, and JHT for the earth textures.