Five short links

Pentagonallight
Photo by Phillip Chapman-Bell

Zipscribble map of Italy – Patterns leap out of post code data when you connect adjacent codes with lines, and color them according to the most significant digits. Shows how useful pictures can be when we need to make sense of complex data. 

The Gentleman Hacker's 1903 lulz – Innovators never enjoy pesky outsiders pointing out flaws with their technology. What's interesting is that I couldn't discover any practical exploits against early radio signals, despite how obvious the flaws were in the wake of these demonstrations. I did discover there's a world of Morse code software I never imagined existed though!

ARM instruction guide – Back in 1990, I learnt ARM assembler as my second language, after Basic, and though I haven't used it since then, I enjoyed this companion to a workshop on hacking the processor's security model because it's actually a concise, useful guide to its important features.

A revolution in mathematics? – An approachable look at the underappreciated changes in maths at the end of the 19th century. The short version is that the process of creating proofs became highly formalized, which sounds dry as dust but actually opened up a world of new possibilities. It's worth quoting at length:

"Well-optimized modern definitions have unexpected advantages. They give access to material that is not (as far as we know) reflected in the physical world. A really “good” definition often has logical consequences that are unanticipated or counterintuitive. A great deal of modern mathematics is built on these unexpected bonuses, but they would have been rejected in the old, more scientific approach. Finally, modern definitions are more accessible to new users. Intuitions can be developed by working directly with definitions, and this is faster and more reliable than trying to contrive a link to physical experience…rank and-file mathematicians can use the new methods confidently and effectively, while success with older methods was mostly limited to the elite"

Stepping away from "common sense" and relying on the logical outcomes of an abstract system that doesn't provide intuitive reasons sounds a lot like where we're headed in the data world. The question "Why does that work?" will often come up when you're making choices based on A/B testing, and often the honest answer is "I don't know, but it does!".

Spark – A framework implementing a higher-level approach to writing distributed algorithms, with a more readable statement of the problem than standard MapReduce produces.

Five short links

Fiveaces
Photo by RHiNO NEAL

Your ideal performance/consistency tradeoff – It's unclear what the right number of nodes and level of redundancy for a Cassandra cluster are for any particular performance requirements, so most of us experiment until we have something that vaguely seems to work. Thanks to the folks at Berkeley, there's now a better way to figure it out via an interactive tool. Interestingly, they ended up using a Monte Carlo simulation rather than a formula, which shows how complex the problem is.

Why is finance so complex? – One of the most interesting articles I've read in a long time. It posits that finance is effectively a benign con trick, and relies on a lack of transparency to encourage people to take risks they wouldn't if they fully understood what they were getting into. The idea is that it's a collective action problem that only works if everyone jumps on board, and so the opacity helps persuade people to do that and achieve a better overall result than if they made an individually-rational choice. The model seems like it might explain other odd features of our social world.

Run a MapReduce job across five billion web pages for 25 cents – I have a massive data-crush on Common Crawl, and this is a fantastic practical demonstration of why I'm so excited. 

Clickjacking – The web's security model is more like Windows' than Unix's. It's been grafted onto an underlying system that was designed without any security foundations, and there are lots of gaps where different components interact in exploitable ways. This page explains how there's no reliable way to prevent malicious sites from hosting your site as an invisible frame and tricking users into taking actions by unknowingly clicking on it. Luckily we're in a world where software can be frequently updated, unlike '90s desktop software, so at least if this becomes widespread we might quickly see some fixes.

Muse – A noble experiment in mining useful data from your own email archives. It's still a bit too buggy to really get a feel for how interesting the results could be though.

Five short links

Fiveleaves
Photo by Let Ideas Compete

Rust – A trap to ensnare unwary web crawlers, by Tim McNamara. It creates pathological patterns of input data that will slow down naive robots by the sheer volume of processing required, whilst using minimal resources on the server thanks to elegant event-driven code. It's effectively a reversed denial-of-service attack, designed to overwhelm malicious or thoughtless crawlers of your site. Well-written and robust robot scripts will cope with malformed input of course, but the odds are that any crawler that's bringing your site to its knees with an unreasonable number of requests won't be a masterpiece of engineering!

Seeing like a database – Written by another fan of Seeing like a State, this has a great quote from Jay Owens at the end, noting "the asymmetry of personal data, open for the 99% & deep analytics for the 1%".

HttpBin – Echoes back information about HTTP requests you send it, including things like headers, data, and forced result codes. I'm just thankful it introduced me to the 418 (I'm a teapot) status code, I can't believe I've been writing web code for so long without checking for that possibility.

Drone landscapes, intelligent geotextiles, geographic countermeasures – I'd never realized how deeply adding processing to landscape structures could change our world. This is a compelling exploration of some of the possibilities, and I'm especially struck by the possibilities for a robot-readable world.

An end to bad heir days – The copyright on James Joyce's work finally expired! The enforcement process became a poster child for how the combination of insanely-long copyright terms and ornery heirs can derail the enjoyment and exploration of an artist's work. Thankfully scholars are now free to quote Joyce's work and letters, and I've just downloaded A Portrait of the Artist as a Young Man to re-read in celebration.

Five short links

Tally

Photo by Richard Paterson

The Ugliest Map in the World – Such an eyewatering color scheme, you'd think I'd designed it. The swimming-pool bottom caustics for the ocean areas really clinch it.

The Life of a Typeahead Query – An exploration of how hard it is to make an easy interface. Great to see a practical example of how someone architected a real-world system with messy requirements.

Ending the Infographic Plague – Visualizations are an excellent hack for getting publicity, which inevitably leads to pollution by bad actors.

The Mess that is NPM – I really, really want to use Node.js, but the library ecosystem isn't quite mature enough for me to use in production. There's a lot of non-technical community hacking that you need to do to create a strong set of modules, and responsible maintainership isn't something I'm perfect at with all my projects, so I know how hard it is.

Brain Grain – Tasty little HTML5 visualization of world-wide migration. It's pretty simple, but has some innovations I've not seen elsewhere and uses animation effectively.

And last but not least, Jetpac is now rounding out a fundraising round, so if you're on AngelList any comments or recommendations would be very welcome.

What the Sumerians can teach us about data

Sumerian1

I spent this afternoon wandering the British Museum's Mesopotamian collection, and I was struck by what the humanities graduates in charge of the displays missed. The way they told the story, the Sumerians' biggest contribution to the world was written language, but I think their greatest achievement was the invention of data.

Writing grew out of pictograms that were used to tally up objects or animals. Historians and other people who write for a living treat that as a primitive transitional use, a boring stepping-stone to the final goal of transcribing speech and transmitting stories. As a data guy, I'm fascinated by the power that being able to capture and transfer descriptions of the world must have given the Sumerians. Why did they invent data, and what can we learn from them?

First you get the data, then you get the power

Sumerian2

The Sumerians were a nasty lot. Their idea of a fun time was wheeling a bunch of caged lions into an arena so the king and his friends could shoot them from a chariot. One of the perks of working for a king was the opportunity to drink poison and join him in his grave. They created seals and cuneiform writing as tools of power. They kept track of who owed them what, in a way that left evidence that could be used to convince a third party of the obligation. I could swear blind that you'd verbally promised me three lambs in the spring, and it would be your word against mine. With a written record of the transaction, I could convince the rest of the community that it was true. If you don't hand over those lambs, some of them might help me stick a dagger between your ribs. Since these sorts of obligations are the foundation of any state, the earliest writing was a potent source of power.

That's still true today. Gathering data is not a neutral act, it will alter the power balance, usually in favor of the people collecting the information.

Power corrupts data

Sumerian3

"The inscription on this stone is a statement of grants and privileges bestowed on the sun-god Shamash's temple by the Akkadian king Manishtushu (2269-2255 BC). It was actually written many centuries later. The object was clearly a forgery designed by the Sippar priesthood for their purposes."

As soon as records become vital in arguments about who gets what, people will figure out how to falsify them. The more important the outcome, the more temptation there is to fudge or fake them. Written records remove the problem of fallible memories, but replace it with a second-degree question of provenance. How do you know the data accurately reflects what happened?

It's a good reminder that the map is not the territory. We still have a disturbing tendency to trust anything that's recorded, without understanding the subjective process that went into creating the record.

(Pre-)Digital Rights Management

Sumerian4

This stone was planted in the ground to mark a property boundary, and the top section records the details of the claim. The bottom third is covered with threats of supernatural retribution against anyone who moves or alters the marker. The main way Sumerians protected the integrity of their data was through curses. This may seem laughable to a modern audience, but I don't think we're so different. Do you expect the FBI to actually raid your house if you copy that VHS tape? The warnings are a way of forcefully expressing society's norms, rather than a credible threat of punishment.

As geeks we'll often roll our eyes at a technically-ineffective mechanism for preventing the copying or alteration of data, but the longevity of useless curses should make us think twice. Violating the rules is a decision taken by a person, so sometimes hacking the human element of the process is the most effective prevention.

Reading the future with data

Sumerian5

Many of the tablets archaeologists have recovered are elaborate instruction manuals on how to interpret omens. The idea was that you'd observe events that were happening now and use them to predict what was going to happen in the future. All the examples I saw at the Museum were obvious nonsense, using inputs like the shape of animal entrails, but what struck me was how respected they must have been despite their lack of results.

We've created science as a much more elaborate process for predicting the future from data, but in many ways that's lulled us into a false sense of security. The media prominently features 'scientific' studies showing that everything gives us cancer, thanks to our insatiable appetite for certainty and reassurance in the face of something terrifying and unpredictable. The lesson for me is that the results of any data-driven project will be accepted or ignored based on people's needs and fears. In the absence of real answers, we'll take bogus ones painted with a veneer of data, just like the Sumerians.

All data matters

Sumerian6

We actually know more about everyday life in Sumeria five millennia back than we do in Europe fifteen hundred years ago. The Sumerians recorded everything on stone or clay tablets, most of which were discarded after use with no thought for posterity. As it happened, the clay tablets proved remarkably resilient and so archaeologists and scholars have found and decoded hundreds of thousands of them. This data exhaust gives a rich view into trade, worship, life, death, medicine and almost every other aspect of the Sumerians' world.

This is a big reason why I'm so fanatical about opening up data sources. It's great to see Twitter taking steps to archive our public conversations in the Library of Congress, but it's taken a year and they're still not finished. Even when they're done, storing the records in a single location and on a single system is a terrible long-term plan; the only approach that's proven to last centuries is wide distribution of many copies on a range of mediums. Craigslist is another bad example, holding information that could be vital to understanding details of our social and commercial lives in the future, data that's been on view to the public, and yet refusing to discuss archiving any of it and actively blocking anyone who tries. If there's any way you can, please think about how to open up data you control; it's the best way to pass it on to posterity.

See for yourself

I had an amazing time in the British Museum's Mesopotamian galleries, I'd highly recommend it if you're ever in London, and it's completely free. Data was the aspect that fascinated me, but there's so much more held in the treasury of beautiful objects their scholars have collected, I guarantee you'll come away with a feeling of awe, and maybe a fresh view of the world around you too.

Street markets and change in England

Stives0
Pigs' ears, anyone?

The longer I've lived in America, the more of a stranger I feel when I return to England. What struck me this visit was how exotic the local street markets feel. When I was a kid, market stalls were an absolute last resort when your parents couldn't afford to buy something in a proper shop. They were the homes of cheap batteries that never lasted, and brand-name clothes surrounded by an aura of suspicion. In San Francisco "farmer's markets" are at the opposite end of the scale. If the Ferry Building slid into the sea on a Saturday morning we'd lose half the Bay Area technology workforce.

Stives1
These same stripey-neon gloves were there in the 80's

I walked around the Bury St. Edmunds and St. Ives markets this weekend, and I was struck by how much things had changed. There was still a fine selection of dubious clothing, but there were attractions for the gourmet too, with game pies and Spanish hams.

Bury0
Stives2
The rain in Spain probably isn't this grim

It was a good reminder that England isn't as backwards as I sometimes assume based on my memories. My brother has been involved in market stalls on-and-off over the years and he gave me some interesting background. In Bury there used to be a waiting list for space but now there are empty spots. That's opened up opportunities to new traders with less-traditional merchandise, but it's because the older stalls are closing.

Stives3

Stives4

I was sad when I saw the local butchers had just closed down too, but then my sister described a visit with her husband's family, where the meat looked extremely unappetizing. I was reminded of that when I saw a stall in Bury selling meat from cardboard boxes.

Bury2

It's easy for me to slip back into nostalgia, but supermarket meat counters are better than the average butcher's shop I grew up with. I'm hopeful that the best traders with something unique to offer will do well, and I shouldn't mourn the passing of the rest too much.

It was a good reminder of the limits of my knowledge of Britain these days. I left because I was frustrated at the resistance to change, but progress still happens, even if it's not at the pace I'd like. There's also usually a complex story behind the surface, and as an outsider I'll often miss it. The UK changes a lot more than you'd think, because the British do a great job of transforming things while maintaining the appearance of continuity, in street markets and everything else.

If you want to experience them for yourself, I highly recommend looking at the small towns near where you'll be in Britain. The network of cities was built up around fairs and markets, and you'll still hear medium-sized rural places described as "market towns". You'll often find that a morning or two a week the center is closed off and stalls set up. It's Saturdays and Wednesdays for Bury St. Edmunds and Mondays for St. Ives.

How to easily optimize your landing page

Typeset
Photo by Ian Dolphin

As we're getting more traffic to the Jetpac home page (thanks AllThingsD!) optimizing our conversion rate has become a priority. In our case, the action we want people to take is connecting with Facebook, so we're having to work quite hard to figure out what messages work best to persuade people. In previous projects, it's often been quite a surprise exactly what sentences work, and so the only way is to test a lot of different approaches and see which strike a chord.

I love KissMetrics as a measuring tool for that sort of experimentation, but I couldn't find a good example of how to do the sort of tests we need. Ideally it should be all data-driven, so that even the less-technical members of the team can edit the copy options and try out new variations without my involvement. I built out a small framework that does everything we need, and have just open-sourced it as a GitHub project:

https://github.com/petewarden/copyoptimizer

To use it:
– Edit index.html to add your own KissMetrics code
– Go to index.js
– Edit the g_copyChoices structure
– Use the class name of the element you want to alter as the key, and create an array of possible text values for it
– Once it's on the site, the class name and chosen text will show up in the KissMetrics reports as a drop-down option (though there's a lag in it showing up)

For example I have a headline with the class name 'action_header', and a link I want people to click with the class 'action_button', so I have an example data structure like this:

g_copyChoices = { 
  'action_header':['Please click me!', 'If you wouldn\'t mind, click here', 'I\'d really like you to<br>click below'],
  'action_button': ['Start Here', 'Next', 'Sign up']
};

When the page is first loaded, the inner HTML of the elements with those class names is replaced with a randomly-selected string from the array, and the choice is stored with KissMetrics so we can see which ones convert best in the reports. I also store the choices in cookies so that repeat visitors see the same text, and we don't pollute the metrics with varying choices.

Once the data has had a chance to percolate through Kiss's servers, you can choose the class name from the drop-down menu below 'Funnel Overview' on the report page and see which of the messages had the best conversion rate.

Kissshot0

Five short links

Fiveofhighs
Photos by Tang Yau Hoong

The tourists have left – Despite the early-stage hype, there are fewer VCs around than ever.

Converting addresses into lat/long coordinates in Excel using the Data Science Toolkit – I love seeing the creative way people use open-source projects once they're out in the wild.

Strata – A good overview of what's on offer at the Big Data conference, featuring your correspondent with "Embrace the Chaos", and with a 20% discount.

Google code prettify – A beautiful little Javascript hack for syntax-colored display of all sorts of computer languages in web pages.

12 Things Brad DeLong Got Wrong in his Career – A bit like a VC firm's anti-portfolio, acknowledging and even celebrating your mistakes is a fun way to keep yourself intellectually honest. I always loved the idea of the slave at a Roman Triumph whose job it was to whisper to the honored general "Remember you're mortal".

Five short links

Pentagonalevolution

Photo by Andrew Hudson

EntityTagger – A pleasantly practical natural-language processing paper, via Nat Torkington

How prostitution and alcohol make Uber better – A clever tabloid hook for an interesting data story. One thing I've heard that might explain part of the pattern is that police shifts vary regularly by day, which can impact arrest times.

Social Network Analysis for Telecoms – I've repeatedly heard this used as an anecdote, but it wasn't until it came up at an event this week, where I was sitting next to Mike Driscoll, that he was able to point me to his original research. It's great to see the source, and I can understand why it's now a classic example of how useful data science can be.

Hue Histograms – A charming way of visualizing image color characteristics by another friend's company. I'm looking at good ways of anonymizing image data in a way that still preserves enough signals to be useful for machine learning, and this has given me some ideas.

Break an image into tiles – On the topic of images, I was pleasantly surprised at how easy ImageMagick was to install on OS X through MacPorts, I used to dread the failed dependencies. I used the recipe in the article for a hack I'm quite proud of. I needed to generate 'percent of the world seen' thumbnails for Jetpac public profiles shared on Facebook, so I manually created the HTML for a page with a grid of one hundred of the elements, one for each number, took a screenshot and then ran it through the grid command to get the numbered images I needed. You can see it in action if you like this sneak peek of my public profile page - you can unlike it afterwards if you don't want my new pensive portrait in your stream.

How to run simple smoke tests in Ruby

Smoketest

Photo by Andrew Magill

One lesson I learned from Eric Ries is how powerful an incremental, reactive approach to testing can be. It's really hard to balance resources between development and testing, especially as a starving startup, but if you build tests to catch errors that have actually happened, you know you're focused on high-priority issues.

We started Jetpac with a very minimal deployment process with few automated checks, but two stages so we could eye-ball the test environment before pushing it to the final set of live servers. Yesterday that manual process finally failed after we accidentally pushed a completely broken build to the main site and took it down for a few minutes. That gave me a strong reason to add the first automatic checking to our deployment scripts to make sure we couldn't push to production if the test environment wasn't responsive.

To start with I just wanted something very basic that will catch glaring errors that stop our Ruby app from running entirely, since that was what actually happened and they're pretty simple to detect. To do that, I wrote a short Ruby script that can be invoked from the command line and will spot empty responses, 404's and other obvious problems with a URL. We invoke it like this in our deployment bash script, after calling Capistrano to do the actual push:

smoketest.rb "http://testingenvironment.example.com"
if [ $? -gt 0 ]; then
    echo '*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!'
    echo "Deployment not allowed, test server is not responding"
    echo '*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!*$!'
    exit 1
fi

It will print out information to stderr about any problems it encountered, and handles both http and https URLs. I'd imagine as our needs grow we'll turn to something more complex like Capybara, but for now this simple script is a very quick and easy way of catching a lot of common problems.
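
The script itself isn't reproduced in this post, so here's a minimal sketch of the kind of check described above rather than the original code. The names response_problem and url_problem are hypothetical, as are the timeout values and the choice to treat anything outside the 2xx range (redirects included) as a failure. It exits with a non-zero status when it finds a problem, which is what the `$?` check in the bash snippet relies on.

```ruby
#!/usr/bin/env ruby
# Smoke-test sketch: exits 1 if the URL given on the command line looks broken.
# Function names and the exact checks are assumptions, not the original script.
require 'net/http'
require 'uri'

# Returns nil when the response looks fine, or a short description of the problem.
def response_problem(status, body)
  return "unexpected status #{status}" unless (200..299).cover?(status.to_i)
  return 'empty response body' if body.nil? || body.strip.empty?
  nil
end

# Fetches the URL (http or https) and returns nil or a problem description.
def url_problem(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http.open_timeout = 10 # seconds, arbitrary choice
  http.read_timeout = 30 # seconds, arbitrary choice
  response = http.request_get(uri.request_uri)
  response_problem(response.code, response.body)
rescue StandardError => e
  # Connection refused, DNS failure, timeout, malformed URL and so on.
  "request failed: #{e.class}: #{e.message}"
end

# Only run the check when a URL is supplied on the command line.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  problem = url_problem(ARGV[0])
  if problem
    warn "Smoke test failed for #{ARGV[0]}: #{problem}"
    exit 1
  end
end
```

Keeping the status and body check in its own function means it can be exercised without a live server, and new checks (say, looking for an error string in the page) can be added in one place as real failures turn up.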