Five short links

Photo by Robert1407

Hadoop Cluster Chef – A very useful project by Flip Kromer of InfoChimps. It's designed to make building compute clusters across a range of technologies very easy, and it's built by someone in the trenches so it has a lot of street-smarts baked in, things like using spot EC2 instances for rock-bottom server prices

Comparing email address validation regular expressions – This is the mother of all email REs, with a lot of testing behind it. As someone who's wrestled with this problem myself I'm impressed, though I lean towards using a much more accepting version for testing user input
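To show what I mean by a "much more accepting version", here's a minimal sketch in Python. The pattern and the looks_like_email name are my own illustration, not taken from the linked comparison. It assumes all you want from user input is something, an @, and a domain with a dot in it.

```python
import re

# Deliberately permissive (an assumption, not the RE from the linked article):
# one or more non-space, non-@ characters, an @, then a domain containing a dot.
PERMISSIVE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(text: str) -> bool:
    """Return True if the text is plausibly an email address."""
    return PERMISSIVE_EMAIL.match(text.strip()) is not None
```

This will happily accept some addresses that aren't strictly valid, which is the point: for a signup form I'd rather let a confirmation email catch the bad ones than reject a real address that a stricter pattern didn't anticipate.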

How to check if an email address exists without sending an email – I wasn't aware that you could use SMTP to discover if an email address is valid. It makes the "we can't add an API to look up users by email on our service because spammers will use it to validate emails" line from social networks look even more like a flimsy excuse
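As a rough sketch of the idea, here's how the SMTP conversation might look using Python's standard smtplib. This is my own illustration, not the linked article's code. The function names and the probe_from address are hypothetical, and it assumes you've already looked up the domain's mail server (MX host) some other way, since that needs a DNS lookup the standard library doesn't provide.

```python
import smtplib


def classify_rcpt_code(code: int) -> str:
    """Map the reply code from RCPT TO onto a rough verdict."""
    if 200 <= code < 300:
        return "accepted"
    if 500 <= code < 600:
        return "rejected"
    # 4xx and anything else: temporary failure, greylisting, and so on.
    return "unknown"


def probe_mailbox(mx_host: str, address: str,
                  probe_from: str = "probe@example.com",
                  timeout: float = 10.0) -> str:
    """Ask mx_host whether it would take mail for address, without sending any.

    mx_host must be supplied by the caller (assumption: found via an MX lookup).
    """
    with smtplib.SMTP(mx_host, 25, timeout=timeout) as server:
        server.helo()
        server.mail(probe_from)
        code, _message = server.rcpt(address)
    # Leaving the with-block sends QUIT before any message body is transmitted.
    return classify_rcpt_code(code)
```

Treat the answer as a hint rather than proof. Plenty of servers accept every recipient and bounce later, or refuse probes like this entirely, so "accepted" doesn't guarantee the mailbox exists.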

Reflections on startup life – week 28 – Tim Bull has been writing weekly entries about the process of building a company around Tribalytic. Writing this stuff in realtime makes it a great antidote to the 'we knew what we were doing all along' PR spin that always obscures the real story of successful startups

Tree of Ténéré – This was the most isolated tree on the planet, over a hundred miles from its nearest neighbor. The last remnant of a grove that grew in another era when the Sahara had water, its roots stretched over 30 meters downwards, and it was a landmark for generations of travelers. Then it was run over by a truck.

Was the financial meltdown an alien invasion?

Photo by Scott Wills

Before you start worrying I've gone all Icke on you, the answer is no, but what's interesting about the question is that it's a lot harder to answer than it used to be. Todd Vernon's article on the still-unexplained stock market crash of May 6th reminded me of something, but it took a while to put my finger on it. Then I realized it was from Vernor Vinge's A Deepness in the Sky.

In the novel Vinge describes financial weapons that are unleashed by a malevolent intelligence on an unsuspecting civilization, business plans that explode their systems. At the time I was naive enough that I couldn't picture how just transmitting an idea could have that sort of impact, but now it's pretty obvious. All you need is an alien in the early 2000s to email an investment banker "Hey, everyone wants triple-A securities. If you bundle this risky debt into tranches you can persuade the ratings agencies to certify part of it as AAA and you'll make a fortune".

What I find fascinating as a computer engineer is that this is all my profession's fault. We've built an amazing infrastructure of information plumbing to allow the financial system to operate on auto-pilot, without those pesky humans sucking up salaries and slowing the whole process down. The late lamented Tanta opened my eyes to how the job of a mortgage underwriter had changed from being a gatekeeper, almost a detective trying to sniff out risky loans, to someone whose focus was making sure that the submitted forms matched the criteria set in the computer programs they now used to grade loans. In a very literal sense these programs are artificial intelligences, expert systems that try to mechanize the thought process that we used to rely on loan officers to go through.

These loans were then bundled and sold as securities using complex computer models built on historical data (which didn't include falling house prices). These securities were then bought and resold by investors using their own complex trading programs. This was all in the debt market, but the same takeover of decisions by software occurred in equities, and the end result is a system composed of hundreds of thousands of different programs all talking to each other and making enormous financial decisions on their owners' behalf. What's truly scary is that this system has reached a scale where it's so complex that it's impossible to understand why anything happened. That's what worries me about the May 6th crash: it's the first time a vital information system we've built has proved too complex to debug. In the pre-computer world, you could just interview everyone who bought or sold on the day of a crash and ask them why they took action. We can't do that with our programs, which is why the crash will remain a mystery.

In another part of Deepness, Vinge describes a civilization whose systems have reached the point where they're so entangled and baroque that nobody can fix them when they crash, and the whole world is destined to collapse. Again I couldn't picture that when I first read it, but now I can. In the financial crisis we had cargoes that couldn't be shipped because the standard letters of credit between banks were no longer being honored. Our fundamental mechanisms for delivering food were taken out by problems in our financial information systems! Our AIs don't need to become self-aware to turn into Skynet, they're quite capable of causing serious damage just as they are.

So how can we fix this? As a start we need all programs making financial decisions to produce a clear audit trail in a standard format, explaining not only what actions they took but why. With this sort of log available for every market participant, forensic investigators would be able to build a picture of the cause of events like May 6th. It may well prove to be impossible for some existing code to come up with meaningful reasons for its decision tree, but that's a feature of this proposal, not a bug. If the operators can't justify the actions of the programs they're running, then that's a clear sign they're too dangerous to be interfacing with our financial system.
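To make the audit trail idea a bit more concrete, here's a minimal sketch of what a single log entry might look like, written in Python. No such standard exists as far as I know, so the DecisionRecord fields, the to_log_line name and the one-JSON-object-per-line format are all my own assumptions. The one rule I've built in is that an entry with no stated reasons is refused outright.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DecisionRecord:
    """One automated trading decision. Every field name here is hypothetical."""
    timestamp: str   # when the decision was made, e.g. ISO 8601 text
    program: str     # which program or strategy made it
    action: str      # what it did, e.g. "buy", "sell", "cancel"
    instrument: str  # what it acted on
    quantity: int
    reasons: list = field(default_factory=list)  # human-readable "why"


def to_log_line(record: DecisionRecord) -> str:
    """Serialize a record as one JSON line, refusing decisions with no reasons."""
    if not record.reasons:
        raise ValueError("a decision must state at least one reason")
    return json.dumps(asdict(record), sort_keys=True)
```

A program that can't fill in the reasons list can't write to the log at all, which is the "feature, not a bug" part of the proposal in miniature.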

Camping at Ceran St Vrain


I just got back from a wonderful one-night camping trip with Liz and our dog Thor, the first of the summer. We'd been looking for quick-and-easy local campgrounds, and Liz found a hidden gem at Ceran St Vrain. The location's great for us, only 40 minutes from our house in Boulder, just past Jamestown in the mountains. It's got a lot of features to love – it's free with no reservation system, has campsites starting less than half a mile from the trailhead, there's a beautiful creek running through it, dogs are welcome and wood fires are allowed. It is primitive camping, so you'll need to purify your water from the stream and there's no restrooms, but I can't imagine an easier introduction to roughing it if you're graduating from car-camping.

To get there from Boulder, you can drive up 36 north of town until you reach Left Hand Canyon Road. You then continue up Left Hand through Jamestown, and then about 10 minutes after the town it switches to dirt. Less than a mile after that, there's a marked turn to the right that takes you past a small ranch house to the trailhead parking lot.

There's only a single trail leading out, which takes you across a small wooden bridge and then follows the creek downstream. After a half-mile, you should start to see the first of the areas people have been using as a campsite. These typically are between the trail and the stream, have a small firepit formed of rocks, and have space for between two and four small tents each. I counted about eight of these areas over the next quarter-mile, and there's also other spaces you could set up tents if those were taken. After that the canyon narrows and the trail heads high up on the side of the bank, so camping's not possible until it heads back down to the water after about half a mile. There's about three more sites with fire-pits in this section, and you can see the one we chose above (wine not included as standard).

If you carry on down the trail, you come to a series of junctions, including a track that theoretically could be used by dirt-bikes and quads, but seemed like it would be extremely challenging. We had two maps with us, and Liz had hiked the area several times previously, but it was still extremely confusing. There's some signs that contradict the names our maps gave for the trails, so make sure you don't attempt them without compass, maps and crossed-fingers.

I'd imagine this area is a popular destination for the locals. Most of the trails are in the canyon or have their views blocked by trees, so it's not Yosemite but it has its own low-key beauty in the pines and creeks. The downside of the lack of reservations is that there's no guarantees on how many people will already be there, but there's a lot of space available so hopefully even on busy weekends you'll be able to find a spot. The parking lot should give you an early warning for how packed it will be too.

It was a fantastic break from civilization, I'm always grateful when Liz manages to pry me away from my laptop, and I'm hoping to drag some more friends out there for quick camping trips over the summer. Maybe we need to make it a Boulder geek event, it's a shame that all these so-called 'camps' take place indoors!


Five short links

Photo by Oncle Tom

Facebook indifference at work again – Pat hits the nail on the head when he describes how dangerous it is for your business to rely on any large company when you're a small player with no leverage. The hardest part of my job at Apple was telling third-parties relying on the FxPlug API that we wouldn't be fixing bugs that were crucial for them. Often the effort on our end would be comparatively minor, but we couldn't persuade the powers-that-be that they were a high-enough priority to make it into a schedule.

Assholes who turned out to be right – Bob Sutton is by far my favorite business writer because he focuses on the everyday reality of cramming a bunch of primates into an office, not the business-school theory. Did you know the guy who invented the term supernova was both a genius and a complete psychopath whose idea of small-talk was "I myself can think of a dozen ways to annihilate all living beings in one hour"?

SNAP graph library – Not only a fascinating code-base for dealing with massive network problems, the SNAP project is also the home to the best collection of sample data sets I've ever seen.

It's OK not to write unit tests – I'm slowly figuring out where unit tests are useful and where they're overkill, so it was helpful to read a well-argued rant from an opponent.

Always Searching – Jessa Crispin argues that the secret to a well-lived life is figuring out how you can contribute to the greater good, and then pouring all your energies into that. Easier said than done of course, but I find it appealingly simple.

What I learned about engineering from the Panama Canal

Photo by Scott Ableman

I've just finished The Path Between the Seas, the story of the building of the Panama Canal by David McCullough. I was left in awe of the builders' achievements, and the price they paid in lives. It really was a heroic age of engineering and almost every page had something that made me think about my own work. Here's some of my favorites.

"Now, boys, we have got her done, let's start her up and see why she doesn't work"

Release early, release often for the 1900s. Sometimes the only way to find out is to try it and see, and this was a big difference between the French and US attempts at building the canal. The Americans were brought up in a culture that valued improvisation and experimentation at all levels; the French found the flexibility needed to cope with brand-new challenges a lot harder to come by.

"Engineers are sometimes the least practical of men, they may be attracted by difficulties"

Before malaria and yellow-fever were brought under control the death-rate amongst the canal builders was horrendous. Despite that, the challenge was so compelling that the American engineers tasked with evaluating the project's potential were irresistibly drawn to it. I recognize this in my own work: if there's a powerful but hard-to-master solution to an issue (compiling Cassandra anyone?), I'll find myself inventing reasons to use it rather than a more mundane alternative, just because of how interesting the problems would be.

"The diagrams were as simple as illustrations in a child's primer, conveying their message at a glance and easy to remember. They were an inspiration, Hanna saw instantly. The inevitable problem with technical reports, with any arguments based on technical data, was that few would read them"

Communication is the key. Most people have far less time and patience than you imagine, so if you're going to persuade them you need to expend a tremendous amount of time and energy simplifying your story. The managers of the canal project spent years ignoring the overwhelming evidence and scientific consensus that mosquitoes spread malaria and yellow fever, a blindness that cost thousands of lives. Never underestimate how hard someone's established opinions will be to budge.

"You won't get fired if you do something, you will if you don't do anything. Do something if it is wrong, for you can correct that, but there is no way to correct nothing"

I don't know that my bias towards action always leads me to the wisest path (I didn't expect what would happen when I poked Facebook with a stick), but I get incredibly frustrated when I can't make progress. I'd rather screw up, take my knocks and learn a lesson than be stuck in indecision. I love these instructions to a subordinate, they perfectly capture the attitude I look for in the people I work with.

"Suppose after all that should prove to be right, and it all ends in your butterflies and morlocks. That doesn't matter now. The effort's real. It's worth going on with. It's worth it-even then" – Teddy Roosevelt in a meeting with HG Wells

I consider myself a cheerful pessimist. I've been through enough that I know how steep the odds of success are, but I've made a choice that even a hopeless fight in a good cause is worthwhile.

"Bishop began publishing weekly excavation statistics for individual steam shovels and dredges, and at once a fierce rivalry resulted, the gain in output becoming apparent almost immediately. 'It wasn't so hard before they began printing the Canal Record' a steam-shovel man explained to a writer for the Saturday Evening Post. 'We were going along, doing what we thought was a fair day's work … [but then] away we went like a pack of idiots trying to get records for ourselves'"

Statistics are incredibly powerful motivators: we're driven to measure ourselves against others, so just exposing a new metric can radically change people's behavior. The tricky thing with our engineering world is that output is much harder to quantify. If you look at lines-of-code per day you'll be rewarding hacky programmers who cut and paste, not the elegant coders able to reduce complexity.

"Even New Englanders grow almost human among their broad-minded fellow-countrymen. Any northerner can say 'nigger' as glibly as a Carolinian, and growl if one of them steps on his shadow"

The white workforce was tiny compared to the number of black laborers. White workers were given an astonishing range of perks from free housing to social clubs while the black laborers were left to clear a home in the jungle.

It's worth looking around every now and again to really see the people who are dealing with the less glamorous side of our work, especially now we're outsourcing so much overseas. Are there potential stars we're overlooking because they're not the people we expect to excel? Are we sharing enough of our success with the people working hard for our company, even when their skills aren't as prestigious?

Five short links – Eddy’s sofa edition

Photo by Sam Cornwell

Eddy's sofa and the nightmare of a single global places register – The large web companies want to own a monopoly on some important data source; it's their strategy to lock in long-term defensible revenue. This scenario lays out how seductive a bait-and-switch approach could be for geo data, persuading everyone to input their information and then later putting up pay-walls.

Why do women leave science and engineering – A new study argues that it's not the traditional culprits, but poor promotion prospects that drive women from our industry

Hacker News on Why nobody understands your visualization – As always HN has a high standard of discussion, and pulls out some great war stories and further reading

My contrarian stance on Facebook privacy – I found myself nodding constantly as I read Tim O'Reilly's take on Facebook's recent troubles. Despite my own legal tussles with the behemoth, and my dismay at their cavalier attitude to user privacy, I think there still needs to be space for innovation

Am I human? – Adding CAPTCHA to graffiti. I wish they'd taken this further and caught some people noticing and reacting to the additions

Why nobody understands your visualization


I love visualizations, but they keep letting me down. I think I've built something that tells a crystal clear story, only to see people's eyes glaze over when they sit down in front of it. Why don't most of my pictures communicate the story I want to tell? And how was my map of Facebook connections different, what did I learn from its success?

The obvious place to look is Edward Tufte's work, but even as a fan I still find myself producing unintelligible visual explanations. Instead, I've found it more useful to think about the problem in a much simpler way. I use a visual vocabulary that nobody else understands. Without really noticing, I've invented a private language that makes perfect sense to me but conveys nothing to anyone else. So how do I make myself understood?

Keep it simple

There really is a visual vocabulary that we all learn. Our visual vocabulary is not as obvious as the verbal one because acquiring it is a much more informal process. There's no dictionaries or classes dedicated to understanding diagrams, so I had the unconscious idea that they just automatically make sense. This is seductive because our culture ensures that most of us do readily comprehend common charts like maps and bar graphs, which in fact would be incomprehensible to most of our ancestors. As the Tour through the Visualization Zoo points out "Although a map may seem a natural way to visualize geographical data, it has a long and rich history of design". Every familiar form (even the map) had to be invented, and only became widely-understood after it had proved its usefulness over a long period of time.

I try to remember that and stay humble whenever I'm tempted by a shiny new way of presenting data. There's probably about six basic ways of presenting data that everyone understands: colored maps (choropleth, hate the name!), points on maps, bar graphs, line graphs, pie charts and scatter graphs. If the new technique doesn't reuse people's understanding of those basic elements, 99% of people will look blank and move on. A few highly motivated people might puzzle out what you're showing, but if you want to tell a story to a wide audience, keep it painfully simple.

Label and annotate

I'd released an earlier version of my Facebook connections map two weeks before the version that became popular. It shows exactly the same connections, but without any coloring, labels or descriptions. The crucial breakthrough in reaching a mass audience was adding that extra guidance, making it very clear what I thought the diagram showed. I could see the same story in the spider web of connections, but that was because it was in a visual language I understood. By highlighting and naming the different areas I was able to tell that story to other people who would otherwise be baffled by the mass of lines.

Animate and interact

There's a massive untapped element that everyone understands: time. Before computers came along, animated graphs were almost unheard of, so they are absent from the traditional literature on creating good diagrams. The tools are still lagging behind the possibilities, but there's a rich world of vocabulary and metaphors to draw on. Our brains come packed with hardware to translate moving images into meaning and Saturday morning cartoons help train that capacity! If you can take common elements like bar graphs and animate them, you've got a whole new axis to display information on.

Even better, you can pack more information in without confusing people by making your charts interactive. If someone selects an element, you can focus on that dimension of your data in much more detail, something that's impossible when you're communicating through a sheet of pulped wood.

Show me

You may have noticed that my last point doesn't fit, I didn't use time or animation in my Facebook visualization. That was because I didn't have the technology I needed, so I have no proof to demonstrate that I'm right. I'm working to remedy that problem, so stay tuned for my next project.

Five short links – friction reduction edition

Photo by J-ster

Try Gnip – My friends at Gnip have just launched a free trial service. Making it all self-serve radically lowers the barriers to just giving it a try, so if you're at all interested in sucking down masses of data from Twitter, Flickr, etc, you really should give it a shot.

The Hotlist – There's tonnes of information about what our friends are up to flowing around us these days, but it's far too time-consuming to actually make sense of it all. That's why I love this mashup of your social graph, Facebook events and a map to help you discover events you should care about.

Jumppost – It's really hard to uncover the information you need to find good apartment rentals, especially if you're looking more than a few weeks ahead of your moving date. These folks have figured out that existing tenants know when their lease is up, and they're willing to pay those tenants up to $500 to pass that information along to apartment hunters.

Everlater books – My old Techstars colleagues Nate, Natty and Ryan have been doing an awesome job with their service that makes it much simpler to share your travel memories than the alternatives of email, blogs or Flickr albums. Now they've made it easy to create books from your trip diaries, and the quality of the examples I got to handle was impressive. If you order one today, you'll get free shipping too.

Super-simple Storage Service – I've been frustrated by how much complexity is introduced into my processing pipelines by actually having to read data, so S4's new write-only storage engine promises to radically simplify my code. At $12 a year for a terabyte of data the price is right, and AT&T's adoption of it for their customer complaint database proves it's ready for enterprise use.

Are you in danger from Facebook’s privacy changes?

Photo by Johnny Grim

"How am I in danger? Do people really care about what I post – like and dislike on a social networking site? If so, what are they going to do with the information? I don't get it."

This question came up in the comments of my blog, and though it's very simple, the answer's surprisingly complex and brings up much deeper philosophical questions.

The short answer is that you're in no danger right now, despite all the gnashing of teeth and wailing in the tech community. There's no evidence that anyone's using this information for malicious purposes, just as I've seen no actual burglars using the information in Please Rob Me.

So why are the geeks so upset? They're looking down the road and imagining all the bad things that the people wearing Black Hats will be able to do once they figure out what a bonanza of information is being released. Do you remember in the 90s when techies were hating on Windows for its poor security model? That seemed pretty esoteric for ordinary people because it didn't cause many problems in their day-to-day usage. The next decade was when those bad decisions about the security architecture became important, as viruses and malware became far more common, and the measures to prevent them became a lot more burdensome. The geeks were proved right: you can't start with a shoddy security model and just patch it into something secure.

I think the inelegance of Facebook's approach is what makes engineers' skin crawl. The model they use to prevent your information leaking out is a mess, both from the API side and in the UI. This makes it almost certain that there's unintended holes that leak information that even Facebook aren't aware they're revealing, and ensures users have no clue about what they're opening up to the world.

Fueling the anger is the feeling that Facebook is being deceptive in how they change their privacy model. They appear to believe there's a simple trade-off between making money and keeping users happy, and have apparently decided that they're in a strong enough position to ignore user complaints in order to increase their revenue. They're making information public because they want Google Juice. The more user-generated content they have on the public web, the more visitors from search engines they'll get, and the more important it will be for companies to have Facebook pages and advertising.

In practical terms, why is the information they're revealing important? Here's some of the scenarios that dance through geeks' heads:

Embarrassment: There's a lot of personal information we'd rather keep to ourselves that might be revealed by our fan choices or friendships. You fan a gay club, and a homophobic potential employer spots that. Your ex-partner's divorce lawyer spots you're a fan of 'partying', and uses that as evidence against you in a child custody battle. Someone with a grudge targets your friends and family for harassment.

Big Brother: Social tools played an important part in the Green uprising in Iran, but you can bet your bottom dollar that there's now people within the regime using the same tools to track down dissidents. There's a lot of people within Iran who are fans of Mousavi, and since people generally use their real names on Facebook they could easily be found. I actually removed detailed data from FanPageAnalytics for Iran, Burma and North Korea because I was worried about this sort of usage.

Criminals: I'm skeptical that social network information will help traditional criminals, but there's a massive world of phishers, scammers and identity thieves I can see learning to use what's being revealed. If you got an email that appeared to be from one of your friends, said hello by name and gave you a link to something you were interested in, wouldn't you be a lot more likely to click on it? Facebook's starting to reveal the information criminals need to personalize social engineering attacks like phishing emails, it's just that the bad guys don't have the sophistication to use it yet.

So, don't panic, but pay attention to what Facebook's doing. In the short term the biggest security issue on the site is still the spread of traditional Windows viruses and malware so keeping your virus checkers up to date should be your first priority. Long term, we need to figure out what information we want to reveal, rather than letting Facebook decide for us.

How to lure people to your startup with analytics

Photo from Marquette University

I'm fascinated by statistics, and a lot of my work has revolved around trying to analyze and visualize online activity, whether it's Twitter conversations, email behavior or Facebook friendships. These have generated a lot of interest, but it's been hard to see how to convert that into enough revenue to create a real business.

The fundamental problem is that in most cases the statistics aren't solving a painful problem for anyone. We all love to look at ourselves in the mirror, and services that analyze our online behavior satisfy that craving, but there's seldom enough to justify a subscription or purchase. There's also the problem of driving repeat visits. Many meaningful statistics are very static, so the second visit to a site will often show exactly the same information as the first, discouraging continued engagement.

As a practical example, look at Xobni. I love what they're doing, and they initially launched with a lot of in-depth statistics about your email behavior, but in subsequent releases de-emphasized those in favor of productivity enhancements to Outlook. That's been successful for them and they're expanding back into analytics again, but it shows what a problematic driver statistics can be.

There is a tried-and-tested path to building a business around analytics though. I think of it as the Feedburner model, since they're the first company I encountered using it. They offered analytics as the carrot to give users a reason to sign up to the service, and then monetized by inserting ads into the RSS feeds they now controlled. I'm addicted to my Feedburner stats, but I'd never pay for them, so this was a great way of getting revenue out of a free service.

More generally it doesn't have to be ad-based, you just need a business model that benefits from access to a large audience of engaged users. Can you up-sell premium services that appeal to the same market? Can you get permission to pass their contact information to other companies they might be interested in, and get paid for sales leads? Is building an appealing analytics package just a good marketing investment, driving traffic to your site more cheaply than conventional advertising?

If you are interested in that approach, here's a couple of tips. First, try to show users something as soon as possible. In an ideal world they arrive at your page and immediately see a graph that tells them something interesting about themselves or something they relate to. Typically this isn't achievable, but at the very least have a single step where they enter an email address, Twitter name, etc and then within a few seconds get some information. You should also show an example of what they will get on the landing page. These techniques reduced my bounce rate massively. Never overestimate people's patience; you constantly need to be convincing them to spend time navigating your site.

The second key is presenting your statistics in an actionable way. If you can not only tell a user something interesting, but cause them to do something based on that information, then your chances of a repeat visit shoot way up. Feedburner has an 'Optimize' tab that guides you through ways of increasing your traffic. I found that changing from just showing your most-frequently-contacted friends to sending a report of the people you used to talk to and haven't for a while ('Losing touch report') and giving them a link to email each person alongside the list turned it from an 'oh, that's nice' to a must-have.
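For the curious, here's a rough sketch of how a 'Losing touch' style report could be computed, written in Python. This isn't my production code. The losing_touch name, the (contact, sent_at) input shape and the 90-day and three-message thresholds are all assumptions picked for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def losing_touch(messages, now, quiet_days=90, min_past_messages=3):
    """Find contacts you used to talk to but haven't for a while.

    messages: iterable of (contact, sent_at) pairs, sent_at being a datetime.
    Returns a list of (contact, past_message_count), most-contacted first.
    """
    cutoff = now - timedelta(days=quiet_days)
    past_counts = Counter()
    recent = set()
    for contact, sent_at in messages:
        if sent_at >= cutoff:
            recent.add(contact)          # still in touch
        else:
            past_counts[contact] += 1    # part of the older history

    report = [
        (contact, count)
        for contact, count in past_counts.items()
        if count >= min_past_messages and contact not in recent
    ]
    # Most-contacted first, then alphabetical so the order is stable.
    report.sort(key=lambda item: (-item[1], item[0]))
    return report
```

Each row of the result is a natural place to hang the 'email this person' link, which is what turns the list from a curiosity into something you act on.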

If you're as addicted to statistics as I am, but despairing about turning it into a business, it's worth thinking laterally. Analytics are a great carrot, can you use them to super-charge another business model?