What the Sumerians can teach us about data

I spent this afternoon wandering the British Museum's Mesopotamian collection, and I was struck by what the humanities graduates in charge of the displays missed. The way they told the story, the Sumerians' biggest contribution to the world was written language, but I think their greatest achievement was the invention of data.

Writing grew out of pictograms that were used to tally up objects or animals. Historians and other people who write for a living treat that as a primitive transitional use, a boring stepping-stone to the final goal of transcribing speech and transmitting stories. As a data guy, I'm fascinated by the power that being able to capture and transfer descriptions of the world must have given the Sumerians. Why did they invent data, and what can we learn from them?

First you get the data, then you get the power

The Sumerians were a nasty lot. Their idea of a fun time was wheeling a bunch of caged lions into an arena so the king and his friends could shoot them from a chariot. One of the perks of working for a king was the opportunity to drink poison and join him in his grave. They created seals and cuneiform writing as tools of power. They kept track of who owed them what, in a way that left evidence that could be used to convince a third party of the obligation. I could swear blind that you'd verbally promised me three lambs in the spring, and it would be your word against mine. With a written record of the transaction, I could convince the rest of the community that it was true. If you don't hand over those lambs, some of them might help me stick a dagger between your ribs. Since these sorts of obligations are the foundation of any state, the earliest writing was a potent source of power.

That's still true today. Gathering data is not a neutral act; it alters the power balance, usually in favor of the people collecting the information.

Power corrupts data

"The inscription on this stone is a statement of grants and privileges bestowed on the sun-god Shamash's temple by the Akkadian king Manishtushu (2269-2255 BC). It was actually written many centuries later. The object was clearly a forgery designed by the Sippar priesthood for their purposes."

As soon as records become vital in arguments about who gets what, people will figure out how to falsify them. The more important the outcome, the greater the temptation to fudge or fake them. Written records remove the problem of fallible memories, but replace it with a second-order question of provenance: how do you know the data accurately reflects what happened?

It's a good reminder that the map is not the territory. We still have a disturbing tendency to trust anything that's recorded, without understanding the subjective process that went into creating the record.

(Pre-)Digital Rights Management

This stone was planted in the ground to mark a property boundary, and the top section records the details of the claim. The bottom third is covered with threats of supernatural retribution against anyone who moves or alters the marker. The main way the Sumerians protected the integrity of their data was through curses. This may seem laughable to a modern audience, but I don't think we're so different. Do you expect the FBI to actually raid your house if you copy that VHS tape? The warnings are a way of forcefully expressing society's norms, rather than a credible threat of punishment.

As geeks we'll often roll our eyes at a technically-ineffective mechanism for preventing the copying or alteration of data, but the longevity of useless curses should make us think twice. Violating the rules is a decision taken by a person, so sometimes hacking the human element of the process is the most effective prevention.

Reading the future with data

Many of the tablets archaeologists have recovered are elaborate instruction manuals on how to interpret omens. The idea was that you'd observe events that were happening now and use them to predict what was going to happen in the future. All the examples I saw at the Museum were obvious nonsense, using inputs like the shape of animal entrails, but what struck me was how respected they must have been despite their lack of results.

We've created science as a much more elaborate process for predicting the future from data, but in many ways that's lulled us into a false sense of security. The media prominently features 'scientific' studies showing that everything gives us cancer, thanks to our insatiable appetite for certainty and reassurance in the face of something terrifying and unpredictable. The lesson for me is that the results of any data-driven project will be accepted or ignored based on people's needs and fears. In the absence of real answers, we'll take bogus ones painted with a veneer of data, just like the Sumerians.

All data matters

We actually know more about everyday life in Sumeria five millennia back than we do about Europe fifteen hundred years ago. The Sumerians recorded everything on stone or clay tablets, most of which were discarded after use with no thought for posterity. As it happened, the clay tablets proved remarkably resilient, and so archaeologists and scholars have found and decoded hundreds of thousands of them. This data exhaust gives a rich view into trade, worship, life, death, medicine and almost every other aspect of the Sumerians' world.

This is a big reason why I'm so fanatical about opening up data sources. It's great to see Twitter taking steps to archive our public conversations in the Library of Congress, but it's taken a year and they're still not finished. Even when they're done, storing the records in a single location and on a single system is a terrible long-term plan; the only approach that's proven to last centuries is wide distribution of many copies on a range of media. Craigslist is another bad example, holding information that could be vital to understanding details of our social and commercial lives in the future, data that's been on view to the public, and yet refusing to discuss archiving any of it and actively blocking anyone who tries. If there's any way you can, please think about how to open up the data you control; it's the best way to pass it on to posterity.

See for yourself

I had an amazing time in the British Museum's Mesopotamian galleries; I'd highly recommend them if you're ever in London, and they're completely free. Data was the aspect that fascinated me, but there's so much more held in the treasury of beautiful objects their scholars have collected. I guarantee you'll come away with a feeling of awe, and maybe a fresh view of the world around you too.

Street markets and change in England

Pigs' ears, anyone?

The longer I've lived in America, the more of a stranger I feel when I return to England. What struck me this visit was how exotic the local street markets feel. When I was a kid, market stalls were an absolute last resort when your parents couldn't afford to buy something in a proper shop. They were the homes of cheap batteries that never lasted, and brand-name clothes surrounded by an aura of suspicion. In San Francisco "farmers' markets" are at the opposite end of the scale. If the Ferry Building slid into the sea on a Saturday morning we'd lose half the Bay Area technology workforce.

These same stripey-neon gloves were there in the '80s

I walked around Bury St. Edmunds' and St. Ives' markets this weekend, and I was struck by how much things had changed. There was still a fine selection of dubious clothing, but there were attractions for the gourmet too, with game pies and Spanish hams.

The rain in Spain probably isn't this grim

It was a good reminder that England isn't as backwards as I sometimes assume based on my memories. My brother has been involved in market stalls on-and-off over the years and he gave me some interesting background. In Bury there used to be a waiting list for space but now there are empty spots. That's opened up opportunities to new traders with less-traditional merchandise, but it's because the older stalls are closing.

I was sad when I saw the local butcher's had just closed down too, but then my sister described a visit there with her husband's family, where the meat looked extremely unappetizing. I was reminded of that when I saw a stall in Bury selling meat from cardboard boxes.

It's easy for me to slip back into nostalgia, but supermarket meat counters are better than the average butcher's shop I grew up with. I'm hopeful that the best traders with something unique to offer will do well, and I shouldn't mourn the passing of the rest too much.

It was a good reminder of the limits of my knowledge of Britain these days. I left because I was frustrated at the resistance to change, but progress still happens, even if it's not at the pace I'd like. There's also usually a complex story behind the surface, and as an outsider I'll often miss it. The UK changes a lot more than you'd think, because the British do a great job of transforming things while maintaining the appearance of continuity, in street markets and everything else.

If you want to experience them for yourself, I highly recommend looking at the small towns near where you'll be in Britain. The network of cities was built up around fairs and markets, and you'll still hear medium-sized rural places described as "market towns". You'll often find a morning or two a week when the center is closed off and stalls are set up. It's Saturdays and Wednesdays for Bury St. Edmunds, and Mondays for St. Ives.

How to easily optimize your landing page

Photo by Ian Dolphin

As we're getting more traffic to the Jetpac home page (thanks AllThingsD!), optimizing our conversion rate has become a priority. In our case, the action we want people to take is connecting with Facebook, so we're having to work quite hard to figure out which messages work best to persuade people. In previous projects, it's often been quite a surprise exactly which sentences work, so the only way to find out is to test a lot of different approaches and see which strike a chord.

I love KissMetrics as a measuring tool for that sort of experimentation, but I couldn't find a good example of how to do the sort of tests we need. Ideally it should all be data-driven, so that even the less-technical members of the team can edit the copy options and try out new variations without my involvement. I built a small framework that does everything we need, and have just open-sourced it as a GitHub project:

https://github.com/petewarden/copyoptimizer

To use it:
– Edit index.html to add your own KissMetrics code
– Go to index.js
– Edit the g_copyChoices structure
– Use the class name of the element you want to alter as the key, and create an array of possible text values for it
– Once it's on the site, the class name and chosen text will show up in the KissMetrics reports as a drop-down option (though there's a lag in it showing up)
For example, I have a headline with the class name 'action_header', and a link I want people to click with the class 'action_button', so I have a data structure like this:
g_copyChoices = { 
  'action_header':['Please click me!', 'If you wouldn\'t mind, click here', 'I\'d really like you to<br>click below'],
  'action_button': ['Start Here', 'Next', 'Sign up']
};

When the page is first loaded, the inner HTML of the elements with those class names is replaced with a randomly-selected string from the array, and the choice is stored with KissMetrics so we can see which ones convert best in the reports. I also store the choices in cookies so that repeat visitors see the same text, and we don't pollute the metrics with varying choices.
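If you're curious what that looks like in code, here's a minimal sketch of the idea. It's not the exact code from the copyoptimizer repo: the chooseCopy function name and the cookie-naming scheme are my own illustration, though the _kmq 'set' call is KissMetrics' standard way of attaching a property to a visitor.

function chooseCopy(choices) {
  for (var className in choices) {
    var options = choices[className];
    // Reuse a previous choice if this visitor already has one in a cookie
    var cookieName = 'copychoice_' + className;
    var match = document.cookie.match(new RegExp(cookieName + '=(\\d+)'));
    var index;
    if (match) {
      index = parseInt(match[1], 10); // repeat visitor: keep their variant
    } else {
      index = Math.floor(Math.random() * options.length);
      document.cookie = cookieName + '=' + index +
        '; path=/; max-age=' + (60 * 60 * 24 * 365);
    }
    // Swap in the chosen text for every element with this class name
    var elements = document.getElementsByClassName(className);
    for (var i = 0; i < elements.length; i += 1) {
      elements[i].innerHTML = options[index];
    }
    // Record the choice as a KissMetrics property for the reports
    var properties = {};
    properties[className] = options[index];
    _kmq.push(['set', properties]);
  }
}
chooseCopy(g_copyChoices);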
Once the data has had a chance to percolate through Kiss's servers, you can choose the class name from the drop-down menu below 'Funnel Overview' on the report page and see which of the messages had the best conversion rate.

Five short links

Photos by Tang Yau Hoong

The tourists have left – Despite the early-stage hype, there are fewer VCs around than ever.

Converting addresses into lat/long coordinates in Excel using the Data Science Toolkit - I love seeing the creative way people use open-source projects once they're out in the wild.

Strata – A good overview of what's on offer at the Big Data conference, featuring your correspondent with "Embrace the Chaos", and with a 20% discount.

Google code prettify – A beautiful little Javascript hack for syntax-colored display of all sorts of computer languages in web pages.

12 Things Brad DeLong Got Wrong in his Career – A bit like a VC firm's anti-portfolio, acknowledging and even celebrating your mistakes is a fun way to keep yourself intellectually honest. I always loved the idea of the slave at a Roman Triumph whose job it was to whisper to the honored general "Remember you're mortal".

Five short links

Photo by Andrew Hudson

EntityTagger – A pleasantly practical natural-language processing paper, via Nat Torkington

How prostitution and alcohol make Uber better – A clever tabloid hook for an interesting data story. One thing I've heard that might explain part of the pattern is that police shifts vary regularly by day, which can impact arrest times.

Social Network Analysis for Telecoms – I've repeatedly heard this story used as an anecdote, but it wasn't until I was sitting next to Mike Driscoll at an event this week that he was able to point me to his original research. It's great to see the source, and I can understand why it's now a classic example of how useful data science can be.

Hue Histograms – A charming way of visualizing image color characteristics by another friend's company. I'm looking at good ways of anonymizing image data in a way that still preserves enough signal to be useful for machine learning, and this has given me some ideas.

Break an image into tiles – On the topic of images, I was pleasantly surprised at how easy ImageMagick was to install on OS X through MacPorts; I used to dread the failed dependencies. I used the recipe in the article for a hack I'm quite proud of. I needed to generate 'percent of the world seen' thumbnails for Jetpac public profiles shared on Facebook, so I manually created the HTML for a page with a grid of one hundred of the elements, one for each number, took a screenshot, and then ran it through the grid command to get the numbered images I needed. You can see it in action if you like this sneak peek of my public profile page; you can unlike it afterwards if you don't want my new pensive portrait in your stream.

How to run simple smoke tests in Ruby

Photo by Andrew Magill

One lesson I learned from Eric Ries is how powerful an incremental, reactive approach to testing can be. It's really hard to balance resources between development and testing, especially as a starving startup, but if you build tests to catch errors that have actually happened, you know you're focused on high-priority issues.

We started Jetpac with a very minimal deployment process with few automated checks, but with two stages so we could eyeball the test environment before pushing to the final set of live servers. Yesterday that manual process finally failed, after we accidentally pushed a completely broken build to the main site and took it down for a few minutes. That gave me a strong reason to add the first automatic checking to our deployment scripts, to make sure we couldn't push to production if the test environment wasn't responsive.

To start with I just wanted something very basic that would catch glaring errors that stop our Ruby app from running entirely, since that was what actually happened and they're pretty simple to detect. To do that, I wrote a short Ruby script that can be invoked from the command line and will spot empty responses, 404s and other obvious problems with a URL. We invoke it like this in our deployment bash script, after calling Capistrano to do the actual push:

smoketest.rb "http://testingenvironment.example.com"

if [ $? -gt 0 ]; then
    echo '*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!'
    echo "Deployment not allowed, test server is not responding"
    echo '*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!'
    exit 1
fi

It will print information to stderr about any problems it encounters, and handles both http and https URLs. I'd imagine as our needs grow we'll turn to something more complex like Capybara, but for now this simple script is a very quick and easy way of catching a lot of common problems.
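In case it's useful, here's a minimal sketch of the kind of checks it performs. This isn't the exact code from our repository, just the same idea built on Ruby's standard Net::HTTP:

#!/usr/bin/env ruby
# Smoke test sketch: exit 0 if the URL responds sanely, non-zero otherwise
require 'net/http'
require 'uri'

url = ARGV[0] or abort('Usage: smoketest.rb <url>')
uri = URI.parse(url)

begin
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  response = http.get(uri.path.empty? ? '/' : uri.path)
  # Catch 404s and server errors
  if response.code.to_i >= 400
    $stderr.puts "Bad status #{response.code} from #{url}"
    exit 1
  end
  # Catch a server that responds but sends nothing back
  if response.body.nil? || response.body.strip.empty?
    $stderr.puts "Empty response body from #{url}"
    exit 1
  end
rescue StandardError => e
  # Connection refused, DNS failures and friends all end up here
  $stderr.puts "Couldn't fetch #{url}: #{e.message}"
  exit 1
end

Because it communicates through the exit code, it slots straight into the bash check above.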

Death of a startup

Photo by Mugley

Mailana, Inc is dead. This week I've been going through the formalities, squaring away the legal paperwork and returning the tiny amount of money I'd raised, but truthfully it had been dead a long time; I just hadn't faced it. I'm over the moon about my new startup, but it feels like time to raise a glass to the last three years of my life, and the majority of my savings.

It began as a dream while I was still at Apple. I knew I wanted to strike out on my own, but by a strange kind of luck the painfully-slow green card process kept me living the corporate life for five years while Apple's shares kept rising, and my tiny windfall from technology I sold them when I joined became enough to live on for a few years. Within a couple of weeks of getting my permanent resident's status I handed in my notice, and set out to Build A Startup.

Technology risk

The hardest lesson to learn was how obsessed with technology I am. I had a problem in mind, one that I'd lived with at Apple – how to identify experts in large companies – but to be honest I chose that because I already had a solution that involved interesting engineering. I spent a year building a pipeline that could semantically analyze hundreds of millions of email messages on a shoestring budget, seamlessly interface with Exchange, and present the results as beautiful visualizations. The only thing I failed to do was sell it to the enterprises I claimed were my customers. I wasn't a complete idiot; I spent time flying to boardrooms and talking to mid-level executives, creating demo videos, and I even wrangled a few free pilot programs, but fundamentally I didn't care enough.

Shiny things

That meant when I'd proven my technical point, and faced a mountain of distribution problems instead, at some level I started to look for ways out. I'd already been using Twitter as a source of hundreds of millions of public messages for my demos, and then the public versions of the visualizations started to get some attention. I'd soured on the enterprise sales experience, so I started to explore what I could do on the consumer side. The trouble was I'd completely lost sight of what problem I was tackling. At least with the original version I'd set out to fix an issue I'd spent years living with. Now I was driven purely by curiosity, hoping I'd find neglected data that was so useful the problems I'd apply it to could be an afterthought.

Lonesome founder

I knew myself well enough to spot some of this at the time, and that the best prescription was a business and product-focused founder. I spent a lot of time dating potential partners, especially as I went through Techstars, but there was never a good enough fit. I needed somebody who was willing to bet on what we'd now call Big Data, who'd believe that there was a coming revolution that would bring data-processing problems that had previously required millions of dollars of investment within the reach of early-stage startups. Without external validation, nobody non-technical was convinced of either the concept in general, or my particular ability to execute on it.

Chasing the dragon

Muttering to myself in true mad-scientist fashion, I set out to prove them all wrong by adding so many awesome features to my by-now-Gmail-addon Mailana that the world would have no choice but to sit up and take notice! One of these features was an email-contact-to-social-network-profile connector that involved me indexing public Facebook profiles, and then getting into a legal kerfuffle. Nightmarish as the situation was, the publicity and validation I received from that visualization work was addictive. I set out to explore the area more, with the excuse that it provided distribution opportunities for my business, but if I was honest with myself it was because I found the whole field fascinating.

Startup neglect

I wandered farther and farther from my nominal business, first as I launched OpenHeatMap, and then as I delved into data arcana through books and journalism. It felt great, because I was actually making a difference to the world; I was having an impact! Sadly, I wasn't building a company. I took one final stab at that with the Data Science Toolkit as the final product from Mailana, but in my heart I knew that it had a lot more potential as a long-term open-source project than as a revenue-generating business.

Closure, and a new start

It took a lot of soul-searching to accept, but I knew Mailana was over. I'd originally given myself an allowance of two years to spend with no revenue, and it had been over three. I was lucky enough to have a circle of trusted friends who were working on interesting projects, but Julian and Derek's was particularly appealing. I'd been working with them for months as an advisor and fell in love with the idea behind Jetpac. It gave me the chance to keep exploring a lot of the technology ideas I was fascinated by, but within the context of an actual business, with a revenue model, funding and more than one employee!

I don't regret any of the time or money I sank into Mailana, I've got so much to be thankful for over the last three years. The people I've met alone make it worth every minute, and I feel like I've now got a lifetime's worth of mistakes to learn from! I'm sorry to say goodbye to Mailana, but glad I had the chance to try something crazy and fail.

Five short links

Photo by Jannis Andrija Schnitzer

On being wrong in Paris – A great general meditation on the slippery nature of facts, but the specific example is very resonant. We tend to think of places having clear boundaries, but depending on who I was talking to I'd describe my old house as either in "Los Angeles", "Near Thousand Oaks" or "Simi Valley". Technically I wasn't in LA, but the psychological boundaries aren't that neat.

The devil in the daguerreotype details – The detail you can see on this old photograph is amazing, and I love how they delve into the capture method. I was disappointed there was nothing on the role of lenses as a limiting factor on resolution though; I'd love to know more about that.

Katta – A truly distributed version of Lucene, designed for very large data sets. I haven't used it myself yet, but I'm now very curious.

Hbase vs Cassandra – An old but fair comparison of the two technologies. This mirrored the evaluation I went through when picking the backend database for Jetpac, and I ended up in the same place.

It's cheaper to keep 'em – Your strategy is sometimes pre-determined by which numbers you're paying attention to. If you start off with the assumption that your job is to get new users as cheaply and quickly as possible, you'll never realize how important retaining existing customers can be.

Sad Alliance

A friend inspired me to dig around in my digital attic and resurrect a video of one of my live VJ performances. It's playing off the music of Richie Hawtin and Pete Namlook, and was created on the fly using my home-brewed software, a MIDI controller, and a live camera feedback loop. There are no clips or pre-recorded footage; everything's my own response to the audio as it's happening.

Lessons from a Cassandra disaster

Photo by Earthworm

Yesterday one of my nightmares came true: our backend went down for seven hours! I'd received an email from Amazon warning me that one of the instances in our Cassandra cluster was having reliability issues and would be shut down soon, so I had to replace it with a new node. I'm pretty new to Cassandra and I'd never done that in production, so I was nervous. Rightfully so, as it turned out.

It began simply enough: I created a new server using a Datastax AMI, gave it the cluster name, pointed it at one of the original nodes as a seed, and set 'bootstrapping' to true. It seemed to do the right thing, connecting to the cluster, picking a new token and streaming down data from the existing servers. After about an hour it appeared to complete, but the state shown by nodetool ring was still Joining, so it never became part of the cluster. After researching this on the web without any clear results, I popped over to the #cassandra IRC channel and asked for advice. I was running 0.8.1 on the original nodes and 0.8.8 on the new one, since that was the only Datastax AMI available, so the only suggestion I got was to upgrade all the nodes to a recent version and try again.
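For reference, the settings involved all live in conf/cassandra.yaml. The names below match the 0.8-era format as far as I can tell, but the cluster name and seed address are invented:

cluster_name: 'Jetpac Cluster'   # must exactly match the existing cluster (made-up name)
auto_bootstrap: true             # stream a copy of the data from the existing nodes
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.1.1"    # one of the original nodes (made-up address)

You can then watch the new node's progress with nodetool -h localhost ring; it should eventually show as Normal rather than Joining.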

This is where things started to get tough. There's no obvious way to upgrade a Datastax image, and IRC gave me no suggestions, so I decided to figure out how to do it myself from the official binary releases. I took 0.8.7 and looked at where the equivalents of the files in the archive lived on disk. Some of them were in /usr/share/cassandra, others in /usr/bin, so I made backup copies of those directories on the machine I was upgrading. I then copied over the new files and tried restarting Cassandra. I hit an error, and then made the fatal mistake of trying to restore the original /usr/bin by first moving out the updated one, thus bricking that server.

Up until now the Cassandra cluster had still been functional, but the loss of the node that our code contacted first meant we lost access to the data. Luckily I'd set things up so that the frontend was mostly independent of the backend data store, so we were still able to accept new users, but we couldn't process them or show their profiles. I considered rejigging the code so that we could limp along with two of the three nodes working, but my top priority was safeguarding the data, so I decided to focus on getting the cluster back up as quickly as I could.

I girded my loins and took another try at upgrading a second node to 0.8.7, since the version mismatch was the most likely cause of the failure-to-join issue according to IRC. I was painstaking about how I did it this time though, and after a little trial and error, it worked. The steps were the same in outline: back up /usr/share/cassandra and /usr/bin, copy the files from the 0.8.7 binary release into place, and restart; the full shuffle is sketched below.
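In shell terms it was roughly this sequence. Treat it as a sketch rather than a recipe: the download URL, service name and exact file list are my reconstruction, not copied from our scripts.

cd /tmp
wget http://archive.apache.org/dist/cassandra/0.8.7/apache-cassandra-0.8.7-bin.tar.gz
tar -xzf apache-cassandra-0.8.7-bin.tar.gz
sudo /etc/init.d/cassandra stop
# Cheap insurance before overwriting anything
sudo cp -a /usr/share/cassandra /usr/share/cassandra.bak
sudo cp -a /usr/bin /usr/bin.bak
sudo cp apache-cassandra-0.8.7/lib/*.jar /usr/share/cassandra/
# Copy the bin scripts, but NOT cassandra.in.sh (see the gotcha below)
for f in apache-cassandra-0.8.7/bin/*; do
    [ "$(basename "$f")" = "cassandra.in.sh" ] || sudo cp "$f" /usr/bin/
done
sudo /etc/init.d/cassandra start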

There were a couple of gotchas. You shouldn't copy the bin/cassandra.in.sh file from the distribution, since that contains settings, like the location of the library files, that you want to retain from the Datastax AMI. And if you see this error:

ERROR 22:15:44,518 Exception encountered during startup.
java.lang.NullPointerException
    at org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:606)

it means you've forgotten to run Cassandra as an su user!

Finally I was able to upgrade both remaining nodes to 0.8.7, and retry adding a new node. Maddeningly, it still made it all the way through the streaming and indexing, only to pause on Joining forever! I turned back to IRC, explained what I'd been doing, and asked for suggestions. Nobody was quite sure what was going on, but a couple of people suggested turning off bootstrapping and retrying. To my great relief, it worked! It didn't even have to restream; the new node slotted nicely into the cluster within a couple of minutes. Things were finally up and running again, but the downtime definitely gave me a few grey hairs.
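If you hit the same wall, the change that unstuck things for me amounted to flipping the bootstrap flag in the new node's conf/cassandra.yaml and restarting it; again, the setting name here is the 0.8-era one:

auto_bootstrap: false   # skip streaming and just join the ring

Here's what I took away from the experience: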

Practice makes perfect. I should have set up a dummy cluster and tried a dry run of the upgrade there. It's cheap and easy to fire up extra machines for a few hours, and would have saved a lot of pain.

Paranoia pays. I was thankful I'd been conservative in my data architecture. I'd specified three-way replication, so that even if I'd bricked the second machine, no data would have been lost. I also kept all the non-recoverable data either on a separate Postgres machine, or in a Cassandra table that was backed up nightly. The frontend was still able to limp along with reduced functionality while the backend data store was down. There are still lots of potential showstoppers of course, but the defence-in-depth approach worked during this crisis.

Communicate clearly. I was thankful that I'd asked around the team before making the upgrade, since I knew there was a chance of downtime whenever a database server has to be upgraded. We had no demos to give that afternoon, so the consequences were a lot less damaging than they could have been.

The Cassandra community rocks. I'm very grateful for all the help the folks on the #cassandra IRC channel gave me. I chose it for the backend because I knew there was an active community of developers who I could turn to when things went wrong, even when the documentation was sparse. There's no such thing as a mature distributed database, so having experienced gurus to turn to is essential, and Cassandra has a great bunch of folks willing to help.