Five short links

Photo by Brian Gurrola

How removing one data field saved $12m – A perfect example of why testing against reality is so important. It's like code profiling: you never know where the bottlenecks are until you measure. via Felix Salmon

Ack – I'm doing a regular series over at ReadWriteWeb, trying to cover some of the folklore that's typically passed down informally through the programmer ranks. I did one on using grep to search your source, and forgot to mention Ack, despite having used it and loved it with TextMate. Luckily the commenters kept me honest, so check it out; it really is better than grep for searching source code.

Google Refine – A very tasty-looking open-source project from Google, designed to make it easy to import semi-structured data. Alex Dong has got my hopes up that it might be the missing tool for data scientists that I keep dreaming of, so I'll be testing it out as soon as I have a chance.

GISCloud – The GIS world is one of the last holdouts where desktop software still rules. It’s inevitable that the tools will migrate to the cloud, so I’m excited to see how this progresses. via Andraz Tori

What Killed Aiyana Stanley-Jones? – This story is important because it builds a rounded picture of the factors that led to the death of a little girl in Detroit. Whatever your political persuasion, there will be parts that will back up your existing views, but other sections that may surprise you. With everything from reality TV crews to corrupt judges playing their part, there are no easy conclusions. It’s the sort of article that we need more of, exploring the problems rather than just cherry-picking them as evidence to push a solution.

What is data journalism?

Photo by Ian S

If I were the Lord High Dictator of English Usage, the very first term I'd ban is NoSQL, since all it does is enrage potential sympathizers whilst failing to accurately summarize what the new wave of database technologies has in common. Right after that though, I'd turn my beady eye towards 'data journalism', and give it a good hard stare.

The term actually has a long and illustrious history, if you consider it a synonym for database journalism. The trouble is, every modern journalist uses databases constantly, even if it's just via LexisNexis or Google searches. That makes the original term so broad as to be meaningless.

I'd prefer to reserve the name for some of the interesting and unique work that's been emerging, work that's driven by a lot of the same shifts that have propelled Big Data into prominence. Here are the characteristics that I think true data journalism stories should possess:

The data is a lead protagonist. It's common for trend stories to reach for a few statistics to back up a pre-determined conclusion, but this use of data as a Greek chorus is seldom very enlightening or rigorous, as it lends itself to cherry-picking. I'm much more interested when the journalist treats the data as an interview subject, asking it questions and letting the answers drive the story's conclusion. With the Wikileaks dumps, the data is the lead character in most of the reports, and it's unearthed some unexpected results.

The source material is public. If you quote a named source to back up your story, then anyone who wants to check whether you're distorting them can go back and talk to that person. With data-driven journalism, the only way to keep reporters honest and enable a real debate is to make the original information you base your conclusions on publicly available. Otherwise it's like using unnamed sources: you require a leap of faith from the reader that they're not being misled. The Guardian is a shining example of how easy this can be, but the New York Times consistently refuses to release copies of the original source documents they base their stories on.

There's real detective work involved. Is reporting on the unemployment rate or stock market data journalism? Almost always it's just repeating a pre-digested number, with a hand-wavey explanation thrown in for good measure – "The stock market was up today because of <random correlation>". What I love instead is when a reporter is clever about finding unusual data sources or powerful tools to uncover new information, often hidden in plain sight. My favorite recent example is Marshall Kirkpatrick's use of Needlebase to uncover information on Twitter's new data center, just by analyzing public Tweets from their employees.

On second thoughts, forget about changing the name. These principles are going to win out because they lead to more interesting and trustworthy stories, no matter what you call the genre. So let's just call it plain old good journalism instead.

My life in software

Photo by Blake Patterson

I've been writing code almost as long as I can remember, and I've had a weird, reflective relationship with my projects; they regularly change my life. Here's the code that I built, and that helped build me.

28th April 198X – Anna's birthday card

My first taste of how powerful software could be was for my sister's birthday, when I was maybe eight or nine. I'd been fooling around with the wonderful programming manual that came with the ZX Spectrum, and found a listing that played 'Happy birthday to you' via BEEP commands. I combined that with some blinking ASCII graphics I built myself to approximate a birthday cake, and presented it to her and the rest of the family. They all told me they loved it, and that was a heady feeling. That joy in showing off something I've made myself continues to be the best part of my work, even if I've never managed to do anything that's quite as comprehensible to my family since.

1991 – Mars Landscape

When I was 15 I made my first money by writing code. I put together a procedural texture generator in BASIC for the Archimedes series of computers, and made 15 pounds when Acorn User published it! Sadly it would be another six years before I'd earn anything else through my software, so I had to rely on supermarket shelf-stacking instead.

1993 – Ped => Hex

Inspired by Coldcut's pioneering video work, and the 'Public Domain' disk-sharing-by-mail world that was the poor man's BBS, I put together a demo in ARM assembler. I was most proud of the 'ohhs' and 'ahhs' it pulled out of people, not because it was technically astounding but because I used every trick I could to make it have an impact, from strobing to a psychotropic color palette. In that sense it was a logical continuation of Anna's birthday card.

1994 – MUDs

Before there was World of Warcraft, there were text-based 'Multi-User Dungeons'. I'd got hooked on them a couple of years before, when a friend and I would sneak into the University of Cambridge's computer labs, but when I went away to college it became a debilitating addiction. I ended up with an 18% overall mark averaged over all my first-year exams and coursework, and got married, disastrously, to a girl I'd originally met through one of the systems when I was just 19. I spent time upgrading and maintaining some code for one of the systems I used, which would have been an excellent foundation for the exploding world of the web, if I hadn't decided that TCP/IP networking wasn't interesting enough to keep working on!

1996 – XPacman

When I went away to university, my Acorn was becoming obsolete and I couldn't afford to get another computer, so all of my programming was on Manchester's set of Sun terminals. I grew to abhor the X Windows network-based model of GUI development, but I did learn enough to create a Pacman clone. I loved the calm and quiet of the computer labs, as a contrast to an increasingly stormy home life.

1997 – Diablo for the Playstation

After I graduated with my BSc I was giddy at the prospect of being able to work full-time writing software, and especially to follow my dream of writing game software. I got my first job with Climax Industries (endless comedy value) working on porting Diablo from the PC to the Playstation 1. The salary was $15,000 a year, which sounded like a mind-blowing amount of money compared to my hourly supermarket wages. I soon learned I was actually making less than I did stacking shelves, but the experience of working with some very savvy engineers and learning to read other people's code was a crucial education. Gosport was by far the roughest place I've ever lived; I got mugged before I'd been there a week, and the paltry pay, high rent and an unemployed wife drove me deeper into debt. I was at the point of getting letters threatening county court judgments, and that experience has left me deeply averse to ever taking out any sort of credit. After six months I found a better job up in Scotland, and could begin to dig myself out of that hole.

2001 – Pete's Plugins

My marriage finally ended, and I decided to take some time out to travel. My first stop was the US, where I intended to work for maybe six months and see a bit of the country. I fell in love, both with the country and a girl, and I've been here ever since. I'd been learning my craft in the game industry for the last few years, but it wasn't the creative endeavour I'd dreamed of as a kid. As an outlet for that, I started writing software that harked back to my demo days, effects that I could project onto screens at clubs and concerts, with me controlling the video in a way that complemented the music. I loved being able to actually see my audience reacting to my work, right in front of me; that's so rare in coding, and it drove me to produce more than I'd ever imagined I could. Over the course of a year, I wrote over forty effects in my spare time, and released them as open-source plugins for all of the popular VJ packages. Users kept bugging me to port them over to something called After Effects, which I'd heard was a tool professional video producers used. Once I did, the reaction was amazing; people were literally emailing me to ask how they could pay me for the effects. They were used to buying three or four effects for a thousand dollars, and I was giving them forty for free. I was slowly catching on to the whole capitalism thing, so I quit my day job to produce a new set of professional plugins for that market.

2003 – Apple Motion

It rapidly became obvious that setting up a small business, even one with customers clamoring to give me money, was not going to be possible as an immigrant. Happily, Apple approached me on the basis of my open-source work, offered to buy out the few months of work I'd put into my business, and brought me in to help build a new secret project they were working on. I spent five years working deeply with all sorts of graphics technologies within Apple, using my game engine experience to help deal with graphics driver issues and all sorts of GPU programming adventures.

2006 – SearchMash

I was growing increasingly fascinated by the power of the web. I could already see the writing on the wall for desktop software, and knew I wanted to work with more creative control, which meant becoming my own boss. SearchMash was my first real experiment on the web, a clunky Java applet that displayed a split-screen view of Google's search results. I went on to follow similar threads with GoogleHotKeys, and then dived into applying my graphics skills to analyzing large data sets with more social tools like Event Connector. I'm not even sure how many of these experiments are still functioning, but it was a real education in what actually works for engaging people.

2008 – Mailana

My green card finally came through, so I left Apple and set out to build some of the tools I wished I'd had while I was there: a service to locate experts and external contacts within large companies using only the information lying around on their Exchange servers. The experience was a fantastic education in how not to kick off a startup. Though I did run a few small pilots, I focused almost entirely on the technology for a year. I talked to potential investors more than I talked to potential customers. Inevitably I realized after a year that I had a big bag of technology but no business. The only bright side was that I'd ended up getting some love when, on a whim, I applied that processing pipeline to Twitter.

2009 – FanPageAnalytics

After that realization, I spent some time metaphorically wandering in the wilderness. I knew the enterprise approach was dead, but I couldn't get a good handle on a consumer angle. I moved to Boulder to get some mentoring at Techstars, but none of my experiments worked. The most interest I got was in a Gmail-based service that would send you useful information about people who'd emailed you, the same sort of thing you get when you Google someone's name. I couldn't find an API that would allow me to do that sort of query, so I decided to write my own search engine to crawl the public pages of the sites I cared about. While I was doing that, I noticed that Facebook had lots of interesting information in those crawlable pages, and thought there might be a business in providing analytics for brands, giving them extra information about their (and their competitors') fan pages on Facebook. FanPageAnalytics was born, and then quickly died as I got entangled with Facebook's legal department.

2010 – OpenHeatMap

I had a lot of time to think about what was working, and what wasn't, as I waited for the Facebook saga to end. I'd had over 500,000 people visit my Five Nations of Facebook map in just a few days, but other than the extra readership I didn't build a business relationship with any of them. I realized that people loved these online maps, that this was an amazing distribution channel for reaching an audience, and I wanted to build more, and help other people build their own. I couldn't find existing tools that did what I wanted, so OpenHeatMap was born, and that's where I am right now. I'm happy with the progress the free site is making, and a revenue-generating spin-off is in early testing. I wonder how this one will change my life?

Five short links

Photo by Jane Rahman

What do prototypes prototype? – In my last post I explored some of the problems with our model of prototypes, which inspired Joe Parry to point me at this in-depth paper. Very thought-provoking ideas on there being three axes to classify prototypes by: implementation, look and feel, and role.

How telephone directories transformed America – A hundred years before Facebook, our social structure was transformed by a book that listed our names and addresses, and gave strangers a way to contact us directly. I've often used the phone book analogy when discussing privacy, but I'd never really understood what a major shift it actually represented.

We ten million – That's a guesstimate of the number of aspiring novelists out there. Whenever I get daunted by the difficulty of making it with a startup, I remember all the amazing artists and writers I know who'd kill to share the odds we have. My favorite advice for success is "I was neither the most talented nor the most clever writer in my writing group, but I was the one who stuck with it." I hope that works for founders too, since my only super-power is the iron-plated stubbornness I inherited from my grandmother.

Some cool things I heard in New Zealand – Sounds like a fascinating conference, gathering together 300 CEOs from the design world. Lots of great quotes, but the most insightful was "When asked why Method keeps innovating, he answered 'our people give a shit.'". I know exactly what he means, you can tell within a few minutes of dealing with most companies whether the people there actually care, or if they're so beaten down they're just picking up paychecks.

Exercise more to hack better – I smoked and barely exercised until I was 25 and moved to the US, but starting made an amazing difference in my own productivity, and in my general quality of life. It's funny, but I'd always assumed exercise made you tired. In fact that endorphin cocktail has been so helpful I try to get in a decent workout every morning, just to keep me on the ball with my work.

Works like, looks like

Photo by Julian Wearne

My friend Nick Napp introduced me to some new terms when we were discussing prototypes a few days ago. In the toy industry they classify their experimental versions of products as 'works-like', 'looks-like' or 'works-like, looks-like'. As you might guess, a 'works-like' prototype demonstrates the functionality the designers are after but looks nothing like the intended product, a 'looks-like' is an empty shell that doesn't work but reflects the intended form, and a 'works-like, looks-like' is a more traditional prototype that has both the form and the function.

I love this language, because it gives us a way to express some of the lessons we've been learning in the web industry. Traditionally in engineering disciplines the risk is that your technology won't work – that the bridge will collapse after you've built it. That was true in the early days of the web: we had no confidence that we could build a photo upload site that would actually work, so we got into the habit of testing the functionality through prototypes. In our minds, the only kind of prototype was 'works-like'.

A lot of the most interesting recent techniques for startups are focused on reducing market risk instead of technology risk. I see everything from running AdWords campaigns that lead to a shell landing page for your hypothetical product, to PowerPoint prototypes or Photoshop mockups, as 'looks-likes', because they're intended to answer fuzzy questions about people's superficial reactions to the products.

Our current problem is that we've overloaded the term 'prototype' to mean both kinds of experiment. I've seen the confusion and frustration that this causes, as engineers use the effort to try to answer technical questions while the business folks gnash their teeth at the slap-dash styling that makes it useless for showing to potential customers. If we adopted the terms and mindset that they're actually two different things, then we'd have a better chance of modularizing our design process to answer those questions separately, and then have a final synthesis phase where we try to bring them together into a 'works-like, looks-like' final prototype. I found this article from the D School very enlightening on the way traditional product designers apply the same technique.

I'm going to try to clarify my thinking about the design process by using these terms going forward, and I'm curious if anyone else has run across anything similar in the web world?

A mistake

After sleeping on it, I realized that my commenters were right to gently and not so gently chastise me for using the Holocaust as a rhetorical device in my last post. Bloodlands had a powerful effect on me, as have many other books on the various horrors of the twentieth century. They left me with a lasting sense that we have to actively fight against anything that could lead to their recurrence, which was the goal of that post. Unfortunately the route I took reduced the complexities of history in a way that comes across as very glib, on a subject that demands a deadly serious approach.

Almost a decade ago, I was sitting next to an old man and his daughter at a breakfast counter in California, and we started chatting. Casually he lifted his sleeve, showed me the number tattooed on his arm, and asked if I knew what it was. I immediately understood he'd been in a concentration camp. He told me that as time went by, fewer and fewer people recognized it. We only spoke for a few minutes, but that encounter has haunted me since. It's easy to treat the Holocaust as dry history, a subject to weave clever arguments around, but it's still a living, breathing horror for many. I'm sorry, I made a mistake when I treated it the way I did.

Data, Story Telling and Mass Murders

Photo from the Calvin propaganda collection

I've just finished Bloodlands, a scholarly but powerful overview of the millions of killings committed by both the Soviets and the Nazis before and during World War II. What was especially striking was how little the murders relied on technology. I was more familiar with the western portion of the Holocaust, which required an elaborate system of transportation and bureaucracy to carry out its murders. Most of the eastern victims were simply rounded up where they lived, forced to dig their own graves and then shot.

This is important because the threat model we use for privacy is based on the dangers of giving a state bureaucracy too much information about our lives, to feed into a sophisticated Big Brother operation that constantly monitors us. The chilling examples of both Nazi Germany and the communist states demonstrate how much truth there is to that fear, but I think it also causes us to miss other risks.

Think about the Rwandan genocide. The killings weren't carried out by a sophisticated organization; it was people killing their neighbors with machetes. What the Nazis, Soviets and Rwanda all had in common was a sophisticated and effective propaganda campaign. They successfully convinced large numbers of people that unarmed men, women and children were deadly threats who had to be exterminated. Hutu broadcasters branded Tutsis as 'cockroaches', Nazi propaganda claimed Jews poisoned Aryan children, and Stalin claimed the Ukrainians were deliberately withholding food and causing starvation.

Story-telling like this is a vital component of every genocide. Killings on that sort of scale require active effort from hundreds of thousands of people, and passive acquiescence from millions. This requires a massive amount of motivation, and the only way to drive that is through effective propaganda.

What this doesn't explain is why the propaganda succeeded so well in those particular cases but not in others. My theory is that the recent introduction of new communications technologies should take a lot of the blame. The Nazis were able to harness films and radio to spread their message, the Hutus used radio almost exclusively. What the situations have in common is an elite who discover a way to tell stories in a very powerful way, thanks to a previously untapped medium. They succeed because the audience lacks a healthy scepticism. Stalin could edit people out of photos and be believed because 'Cameras don't lie'.

In this model, a new form of media is like an infection hitting a previously unexposed population. Some people figure out how it can be used to breach the weak spots in the audience's mental 'immune system', how to persuade people to believe lies that serve the propagator's purpose. Eventually the deviation from reality becomes too obvious, people wise up to the manipulation and a certain level of immunity is propagated throughout the culture.

What does this have to do with our work in the startup and data worlds? I'm passionate about this area because I truly believe it can change the world. The downside is that history shows the power inherent in any new communications tool is often abused by evil people. Why should what we're doing be any different? I've already had some of my visualizations used by 'White Power' groups to argue that the US is being taken over by Mexicans, thanks to a few border counties where Jose pops up as a common first name. In general I'm worried by the lack of scepticism about the truth behind the results of large-scale data analysis and visualization. My Facebook map was a Saturday afternoon effort with a paint program, the methodology would never survive peer review, but it still ended up getting discussed very seriously in influential publications. As long as a visualization has decent-looking production values, or an analysis claims to use a sufficiently large set of data, most people will take it at face value, no matter how murky its hidden foundations. You might call this the Freakonomics Effect; while their arguments are backed up with real evidence, a lot of people have copied the same form without putting in the hard work needed to be as sure of the conclusions.

What's vital is that we take responsibility for the effects of the tools we're creating. In practical terms that means relentlessly pushing journalists to improve their understanding and scepticism of our results, and taking a stand when we see bogus work being promoted. We need to have standards we expect reputable data scientists to adhere to, and demand enough information and reproducibility before we take data-backed claims seriously. Science polices itself thanks to an informal community structure; we need to learn lessons from that. If we don't, then our wonderful new tools will just end up being hijacked, and mislead instead of enlighten.

Five short links – Security edition

Photo by Darwin Bell

Now that I'm handling more sensitive customer data, I have an obligation to do everything I can to secure my server and code. Here are five resources I've found helpful:

Nikto – This tool runs an external scan of your webserver and spots common problems. The most serious mistake it found was that I'd forgotten to disable automatic indexes for directories without an index file.
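
In case it helps anyone else, the fix was a one-line configuration change. Here's a minimal sketch, assuming an Apache setup; putting it in a .htaccess file also assumes the server allows overriding Options:

```
# .htaccess (or the main Apache config)
# Disable automatic directory listings, so a folder without an index
# file returns an error instead of exposing its contents.
Options -Indexes
```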

Snort – Snort is an open-source intrusion detection system, and was pretty straightforward to get up and running. In its basic form it will just log suspicious events to a file, but it can be set up to use a database and send notifications too.
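
To give a flavor of the configuration, here's a rough sketch of a custom rule plus a database output line, using the classic Snort 2.x syntax; the rule ID and the credentials are hypothetical placeholders:

```
# local.rules - flag web requests that contain a directory-traversal pattern.
# sids above 1000000 are reserved for local rules; this one is made up.
alert tcp any any -> $HOME_NET 80 (msg:"LOCAL possible directory traversal"; uricontent:"../"; nocase; sid:1000001; rev:1;)

# snort.conf - send alerts to a MySQL database as well as the log file.
output database: alert, mysql, user=snort password=changeme dbname=snort host=localhost
```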

PHP Security Guide – A short, clear guide to the most common security errors in sites driven by PHP. The section on session fixation and hijacking was especially surprising to me, and I ended up rewriting some of my code to patch problems it alerted me to.
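
As an illustration, here's a minimal sketch of the session fixation fix the guide recommends, assuming a vanilla PHP login flow; check_credentials() is a stand-in for your own authentication code:

```php
<?php
session_start();

if (check_credentials($_POST['username'], $_POST['password'])) {
    // Issue a fresh session ID at the moment privileges change, so an
    // attacker who planted a known ID before login can't reuse it after.
    session_regenerate_id(true); // true discards the old session data

    $_SESSION['logged_in'] = true;
    // Loosely tie the session to the client to make hijacking harder;
    // later requests compare this hash and force a re-login on a mismatch.
    $_SESSION['user_agent'] = md5($_SERVER['HTTP_USER_AGENT']);
}
?>
```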

Browser Security Handbook – Google's guide is a comprehensive look at the flaws and pitfalls of the browser security model, and is essential reading for anyone building Javascript applications. It's sometimes a bit overwhelming though, because it covers the problems in so much depth.
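
A lot of it boils down to being careful about which context your data ends up in. Here's a rough PHP sketch of the kind of defensive encoding the handbook motivates (the variable names are made up):

```php
<?php
$untrusted = isset($_GET['name']) ? $_GET['name'] : '';

// HTML context: escape so stray markup can't turn into live tags.
echo '<p>Hello, ' . htmlspecialchars($untrusted, ENT_QUOTES, 'UTF-8') . '</p>';

// JavaScript context: json_encode emits a safely quoted string literal,
// escaping the characters that could break out of the script block.
echo '<script>var userName = ' . json_encode($untrusted) . ';</script>';
?>
```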

Salesforce's security page – Salesforce would lose a lot of sales to traditional rivals if users were bitten by security problems related to their cloud-computing model, so ensuring their external developers write secure code is important to them. The result is an excellent set of documentation that's widely applicable, even if you're not working with Salesforce. Their self-assessment tool in particular was a great starting point for an internal security audit.