Five short links

Photo by Robert1407

Hadoop Cluster Chef – A very useful project by Flip Kromer of InfoChimps. It's designed to make building compute clusters across a range of technologies very easy, and it's built by someone in the trenches so it has a lot of street-smarts baked in, things like using spot EC2 instances for rock-bottom server prices

Comparing email address validation regular expressions – This is the mother of all email REs, with a lot of testing behind it. As someone who's wrestled with this problem myself I'm impressed, though I lean towards using a much more accepting version for testing user input

How to check if an email address exists without sending an email – I wasn't aware that you could use SMTP to discover if an email address is valid. It makes the "we can't add an API to look up users by email on our service because spammers will use it to validate emails" line from social networks look even more like a flimsy excuse

Reflections on startup life – week 28 – Tim Bull has been writing weekly entries about the process of building a company around Tribalytic. Writing this stuff in realtime makes it a great antidote to the 'we knew what we were doing all along' PR spin that always obscures the real story of successful startups

Tree of Ténéré – This was the most isolated tree on the planet, over a hundred miles from its nearest neighbor. The last remnant of a grove that grew in another era when the Sahara had water, its roots stretched over 30 meters downwards, and it was a landmark for generations of travelers. Then it was run over by a truck.
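On the email validation link above, here is a minimal sketch of what I mean by a "much more accepting version". The pattern and the `looks_like_email` name are my own illustration, not taken from the linked comparison: it only checks for a single "@" with something before it and a dot somewhere in the domain, and leaves the real test to actually sending a message.

```python
import re

# Deliberately permissive (an assumption for illustration, not the linked RE):
# something, one "@", something, a dot, something. No spaces, no second "@".
PERMISSIVE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(text: str) -> bool:
    """Return True if text is plausibly an email address."""
    return PERMISSIVE_EMAIL.match(text.strip()) is not None
```

The trade-off is that this will accept some addresses that can never be delivered, but it shouldn't reject unusual ones that are perfectly valid, which is the failure I care more about when handling user input.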

Was the financial meltdown an alien invasion?

Photo by Scott Wills

Before you start worrying I've gone all Icke on you, the answer is no, but what's interesting about the question is that it's a lot harder to answer than it used to be. Todd Vernon's article on the still-unexplained stock market crash of May 3rd reminded me of something, but it took a while to put my finger on it. Then I realized it was from Vernor Vinge's A Deepness in the Sky.

In the novel Vinge describes financial weapons that are unleashed by a malevolent intelligence on an unsuspecting civilization, business plans that explode their systems. At the time I was naive enough that I couldn't picture how just transmitting an idea could have that sort of impact, but now it's pretty obvious. All you need is an alien in the early 2000's to email an investment banker "Hey, everyone wants triple-A securities. If you bundle this risky debt into tranches you can persuade the ratings agencies to certify part of it as AAA and you'll make a fortune".

What I find fascinating as a computer engineer is that this is all my profession's fault. We've built an amazing infrastructure of information plumbing to allow the financial system to operate on auto-pilot, without those pesky humans sucking up salaries and slowing the whole process down. The late lamented Tanta opened my eyes to how the job of a mortgage underwriter had changed from being a gatekeeper, almost a detective trying to sniff out risky loans, to someone whose focus was making sure that the submitted forms matched the criteria set in the computer programs they now used to grade loans. In a very literal sense these programs are artificial intelligences, expert systems that try to mechanize the thought process that we used to rely on loan officers to go through.

These loans were then bundled and sold as securities using complex computer models built on historical data (which didn't include falling house prices). These securities were then bought and resold by investors using their own complex trading programs. This was all in the debt market, but the same takeover of decisions by software occurred in equities, and the end result is a system composed of hundreds of thousands of different programs all talking to each other and making enormous financial decisions on their owners' behalf. What's truly scary is that the system has reached a scale where it's so complex that it's impossible to understand why anything happened. That's what worries me about the May 3rd crash: it's the first time a vital information system we've built has proved too complex to debug. In the pre-computer world, you could just interview everyone who bought or sold on the day of a crash and ask them why they took action. We can't do that with our programs, which is why the crash will remain a mystery.

In another part of Deepness, Vinge describes a civilization whose systems have reached the point where they're so entangled and baroque that nobody can fix them when they crash, and the whole world is destined to collapse. Again I couldn't picture that when I first read it, but now I can. In the financial crisis we had cargoes that couldn't be shipped because the standard letters of credit between banks were no longer being honored. Our fundamental mechanisms for delivering food were taken out by problems in our financial information systems! Our AIs don't need to become self-aware to turn into Skynet, they're quite capable of causing serious damage just as they are.

So how can we fix this? As a start we need all programs making financial decisions to produce a clear audit trail in a standard format, explaining not only what actions they took but why. With this sort of log available for every market participant, forensic investigators would be able to build a picture of the cause of events like May 3rd. It may well prove to be impossible for some existing code to come up with meaningful reasons for its decision tree, but that's a feature of this proposal, not a bug. If the operators can't justify the actions of the programs they're running, then that's a clear sign they're too dangerous to be interfacing with our financial system.
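To make the idea concrete, here's a rough sketch of what one entry in such an audit trail might look like. Everything in it is hypothetical: there's no existing standard I'm describing, and the `AuditRecord` fields and the `to_log_line` helper are names I've made up. The point is only that the "why" sits alongside the "what", and that a program with no reasons to offer can't write a record at all.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class AuditRecord:
    """One decision made by a trading program (hypothetical format)."""
    program: str      # which program acted
    action: str       # what it did, e.g. "SELL"
    instrument: str   # what it acted on
    quantity: int
    reasons: list     # human-readable explanations of why
    timestamp: str = field(default_factory=_utc_now)


def to_log_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line, refusing unexplained actions."""
    if not record.reasons:
        raise ValueError("refusing to log an action with no stated reason")
    return json.dumps(asdict(record), sort_keys=True)
```

A record might be built as `AuditRecord("example-trader", "SELL", "XYZ", 500, ["price fell 5% in 60 seconds", "stop-loss rule triggered"])`, with those values invented purely for illustration. An investigator could then collect the JSON lines from every participant and sort them by timestamp to reconstruct the chain of events.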

Camping at Ceran St Vrain


I just got back from a wonderful one-night camping trip with Liz and our dog Thor, the first of the summer. We'd been looking for quick-and-easy local campgrounds, and Liz found a hidden gem at Ceran St Vrain. The location's great for us, only 40 minutes from our house in Boulder, just past Jamestown in the mountains. It's got a lot of features to love – it's free with no reservation system, has campsites starting less than half a mile from the trailhead, there's a beautiful creek running through it, dogs are welcome and wood fires are allowed. It is primitive camping, so you'll need to purify your water from the stream and there's no restrooms, but I can't imagine an easier introduction to roughing it if you're graduating from car-camping.

To get there from Boulder, you can drive up 36 north of town until you reach Left Hand Canyon Road. You then continue up Left Hand through Jamestown, and then about 10 minutes after the town it switches to dirt. Less than a mile after that, there's a marked turn to the right that takes you past a small ranch house to the trailhead parking lot.

There's only a single trail leading out, which takes you across a small wooden bridge and then follows the creek downstream. After a half-mile, you should start to see the first of the areas people have been using as a campsite. These typically are between the trail and the stream, have a small firepit formed of rocks, and have space for between two and four small tents each. I counted about eight of these areas over the next quarter-mile, and there's also other spaces you could set up tents if those were taken. After that the canyon narrows and the trail heads high up on the side of the bank, so camping's not possible until it heads back down to the water after about half a mile. There's about three more sites with fire-pits in this section, and you can see the one we chose above (wine not included as standard).

If you carry on down the trail, you come to a series of junctions, including a track that theoretically could be used by dirt-bikes and quads, but seemed like it would be extremely challenging. We had two maps with us, and Liz had hiked the area several times previously, but it was still extremely confusing. There's some signs that contradict the names our maps gave for the trails, so make sure you don't attempt them without compass, maps and crossed-fingers.

I'd imagine this area is a popular destination for the locals. Most of the trails are in the canyon or have their views blocked by trees, so it's not Yosemite but it has its own low-key beauty in the pines and creeks. The downside of the lack of reservations is that there's no guarantees on how many people will already be there, but there's a lot of space available so hopefully even on busy weekends you'll be able to find a spot. The parking lot should give you an early warning for how packed it will be too.

It was a fantastic break from civilization, I'm always grateful when Liz manages to pry me away from my laptop, and I'm hoping to drag some more friends out there for quick camping trips over the summer. Maybe we need to make it a Boulder geek event, it's a shame that all these so-called 'camps' take place indoors!


Five short links

Photo by Oncle Tom

Facebook indifference at work again – Pat hits the nail on the head when he describes how dangerous it is for your business to rely on any large company when you're a small player with no leverage. The hardest part of my job at Apple was telling third-parties relying on the FxPlug API that we wouldn't be fixing bugs that were crucial for them. Often the effort on our end would be comparatively minor, but we couldn't persuade the powers-that-be that they were a high-enough priority to make it into a schedule.

Assholes who turned out to be right – Bob Sutton is by far my favorite business writer because he focuses on the everyday reality of cramming a bunch of primates into an office, not the business-school theory. Did you know the guy who invented the term supernova was both a genius and a complete psychopath whose idea of small-talk was "I myself can think of a dozen ways to annihilate all living beings in one hour"?

SNAP graph library – Not only a fascinating code-base for dealing with massive network problems, the SNAP project is also the home to the best collection of sample data sets I've ever seen.

It's OK not to write unit tests – I'm slowly figuring out where unit tests are useful and where they're overkill, so it was helpful to read a well-argued rant from an opponent.

Always Searching – Jessa Crispin argues that the secret to a well-lived life is figuring out how you can contribute to the greater good, and then pouring all your energies into that. Easier said than done of course, but I find it appealingly simple.

What I learned about engineering from the Panama Canal

Photo by Scott Ableman

I've just finished The Path Between the Seas, David McCullough's story of the building of the Panama Canal. I was left in awe of the builders' achievements, and the price they paid in lives. It really was a heroic age of engineering and almost every page had something that made me think about my own work. Here's some of my favorites.

"Now, boys, we have got her done, let's start her up and see why she doesn't work"

Release early, release often for the 1900's. Sometimes the only way to find out is to try it and see, and this was a big difference between the French and US attempts at building the canal. The Americans were brought up in a culture that valued improvisation and experimentation at all levels; the French found the flexibility needed to cope with brand-new challenges a lot harder to come by.

"Engineers are sometimes the least practical of men, they may be attracted by difficulties"

Before malaria and yellow-fever were brought under control the death-rate amongst the canal builders was horrendous. Despite that, the challenge was so compelling that the American engineers tasked with evaluating the project's potential were irresistibly drawn to it. I recognize this in my own work: if there's a powerful but hard-to-master solution to an issue (compiling Cassandra anyone?), I'll find myself inventing reasons to use it rather than a more mundane alternative, just because of how interesting the problems would be.

"The diagrams were as simple as illustrations in a child's primer, conveying their message at a glance and easy to remember. They were an inspiration, Hanna saw instantly. The inevitable problem with technical reports, with any arguments based on technical data, was that few would read them"

Communication is the key. Most people have far less time and patience than you imagine, so if you're going to persuade them you need to expend a tremendous amount of time and energy simplifying your story. The managers of the canal project spent years ignoring the overwhelming evidence and scientific consensus that mosquitoes spread malaria and yellow fever, a blindness that cost thousands of lives. Never underestimate how hard someone's established opinions will be to budge.

"You won't get fired if you do something, you will if you don't do anything. Do something if it is wrong, for you can correct that, but there is no way to correct nothing"

I don't know that my bias towards action always leads me to the wisest path (I didn't expect what would happen when I poked Facebook with a stick), but I get incredibly frustrated when I can't make progress. I'd rather screw up, take my knocks and learn a lesson than be stuck in indecision. I love these instructions to a subordinate, it perfectly captures the attitude I look for in the people I work with.

"Suppose after all that should prove to be right, and it all ends in your butterflies and morlocks. That doesn't matter now. The effort's real. It's worth going on with. It's worth it-even then" – Teddy Roosevelt in a meeting with HG Wells

I consider myself a cheerful pessimist. I've been through enough that I know how steep the odds of success are, but I've made a choice that even a hopeless fight in a good cause is worthwhile.

"Bishop began publishing weekly excavation statistics for individual steam shovels and dredges, and at once a fierce rivalry resulted, the gain in output becoming apparent almost immediately. 'It wasn't so hard before they began printing the Canal Record,' a steam-shovel man explained to a writer for the Saturday Evening Post. 'We were going along, doing what we thought was a fair day's work … [but then] away we went like a pack of idiots trying to get records for ourselves'"

Statistics are incredibly powerful motivators, we're driven to measure ourselves against others so just exposing a new metric can radically change people's behavior. The tricky thing with our engineering world is that output is much harder to quantify. If you look at lines-of-code per day you'll be rewarding hacky programmers who cut and paste, not the elegant coders able to reduce complexity.

"Even New Englanders grow almost human among their broad-minded fellow-countrymen. Any northerner can say 'nigger' as glibly as a Carolinian, and growl if one of them steps on his shadow"

The white workforce was tiny compared to the number of black laborers. They were given an astonishing range of perks from free housing to social clubs while the blacks were left to clear a home in the jungle.

It's worth looking around every now and again to really see the people who are dealing with the less glamorous side of our work, especially now we're outsourcing so much overseas. Are there potential stars we're overlooking because they're not the people we expect to excel? Are we sharing enough of our success with the people working hard for our company, even when their skills aren't as prestigious?

Five short links – Eddy’s sofa edition

Photo by Sam Cornwell

Eddy's sofa and the nightmare of a single global places register – The large web companies want to own a monopoly on some important data source; it's their strategy to lock in long-term defensible revenue. This scenario lays out how seductive a bait-and-switch approach could be for geo data, persuading everyone to input their information and then later putting up pay-walls.

Why do women leave science and engineering – A new study argues that it's not the traditional culprits, but poor promotion prospects that drive women from our industry

Hacker News on Why nobody understands your visualization – As always HN has a high standard of discussion, and pulls out some great war stories and further reading

My contrarian stance on Facebook privacy – I found myself nodding constantly as I read Tim O'Reilly's take on Facebook's recent troubles. Despite my own legal tussles with the behemoth, and my dismay at their cavalier attitude to user privacy, I think there still needs to be space for innovation

Am I human? – Adding CAPTCHA to graffiti. I wish they'd taken this further and caught some people noticing and reacting to the additions

Why nobody understands your visualization


I love visualizations, but they keep letting me down. I think I've built something that tells a crystal clear story, only to see people's eyes glaze over when they sit down in front of it. Why don't most of my pictures communicate the story I want to tell? And how was my map of Facebook connections different, what did I learn from its success?

The obvious place to look is Edward Tufte's work, but even as a fan I still find myself producing unintelligible visual explanations. Instead, I've found it more useful to think about the problem in a much simpler way. I use a visual vocabulary that nobody else understands. Without really noticing, I've invented a private language that makes perfect sense to me but conveys nothing to anyone else. So how do I make myself understood?

Keep it simple

There really is a visual vocabulary that we all learn. Our visual vocabulary is not as obvious as the verbal one because acquiring it is a much more informal process. There's no dictionaries or classes dedicated to understanding diagrams, so I had the unconscious idea that they just automatically make sense. This is seductive because our culture ensures that most of us do readily comprehend common charts like maps and bar graphs, which in fact would be incomprehensible to most of our ancestors. As the Tour through the Visualization Zoo points out "Although a map may seem a natural way to visualize geographical data, it has a long and rich history of design". Every familiar form (even the map) had to be invented, and only became widely-understood after it had proved its usefulness over a long period of time.

I try to remember that and stay humble whenever I'm tempted by a shiny new way of presenting data. There's probably about six basic ways of presenting data that everyone understands: colored maps (choropleth, hate the name!), points on maps, bar graphs, line graphs, pie charts and scatter graphs. If the new technique doesn't reuse people's understanding of those basic elements, 99% of people will look blank and move on. A few highly motivated people might puzzle out what you're showing, but if you want to tell a story to a wide audience, keep it painfully simple.

Label and annotate

I'd released an earlier version of my Facebook connections map two weeks before the version that became popular. It shows exactly the same connections, but without any coloring, labels or descriptions. The crucial breakthrough in reaching a mass audience was adding that extra guidance, making it very clear what I thought the diagram showed. I could see the same story in the spider web of connections, but that was because it was in a visual language I understood. By highlighting and naming the different areas I was able to tell that story to other people who would otherwise be baffled by the mass of lines.
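As a toy illustration of the same principle on one of the basic chart types, here's a sketch that renders a plain-text bar graph with a title, a label on every bar, and a note pointing at the one bar the story is about. It has nothing to do with the code behind the Facebook map; `render_bar_chart` and its parameters are invented for this example.

```python
def render_bar_chart(title, values, highlight=None, note="", width=20):
    """Render a labeled horizontal bar chart as plain text.

    values maps each label to a non-negative number. The bar named by
    highlight gets the note appended, so the reader is told what to see.
    """
    largest = max(values.values())
    label_width = max(len(label) for label in values)
    lines = [title, "-" * len(title)]
    for label, value in values.items():
        # Scale every bar against the largest value.
        bar_length = round(width * value / largest) if largest else 0
        line = f"{label.ljust(label_width)} | {'#' * bar_length} {value}"
        if label == highlight and note:
            line += f"  <- {note}"
        lines.append(line)
    return "\n".join(lines)
```

Calling it with made-up data such as `render_bar_chart("Signups", {"Mon": 5, "Tue": 10}, highlight="Tue", note="launch day")` produces the same bars you'd get without the extra arguments, but the title and the note do the job of saying what the picture means.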

Animate and interact

There's a massive untapped element that everyone understands: time. Before computers came along, animated graphs were almost unheard of, so they are absent from the traditional literature on creating good diagrams. The tools are still lagging behind the possibilities, but there's a rich world of vocabulary and metaphors to draw on. Our brains come packed with hardware to translate moving images into meaning and Saturday morning cartoons help train that capacity! If you can take common elements like bar graphs and animate them, you've got a whole new axis to display information on.

Even better, you can pack more information in without confusing people by making your charts interactive. If someone selects an element, you can focus on that dimension of your data in much more detail, something that's impossible when you're communicating through a sheet of pulped wood.

Show me

You may have noticed that my last point doesn't fit: I didn't use time or animation in my Facebook visualization. That was because I didn't have the technology I needed, so I have no proof to demonstrate that I'm right. I'm working to remedy that problem, so stay tuned for my next project.