Five short links

Picture by Don O'Brien

DepthCam – An open-source Kinect hack that streams live depth information to a browser using WebSockets for transport and WebGL for display. If you pick the right time of day, you'll see the researcher sipping his tea and tapping at the keyboard, in depth form!

OpenGeocoder – Steve Coast is at it again, this time with a wiki-esque approach to geocoding. You type in a query string, and if it's not found you can define it yourself. I'm obsessed with the need for an open-source geocoder, and this is a fascinating take on the problem. By doing a simple string match, rather than trying to decompose and normalize the words, a lot of the complexity is removed. This is either madness or genius, but I'm hoping the latter. The tradeoff will be completely worthwhile if it makes it more likely that people will contribute.
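The tradeoff is easy to see in a few lines. Here's a hypothetical sketch of the wiki-style approach as I understand it: queries are matched as plain strings against user-contributed entries, with no attempt to parse or normalize addresses. All the names and coordinates below are my own illustration, not OpenGeocoder's actual API.

```python
# A sketch of wiki-style geocoding by exact string match.
# No address parsing, no normalization: that's where the
# complexity savings come from, and also the cost.

gazetteer = {}  # query string -> (latitude, longitude)

def define(query, lat, lon):
    # Anyone can contribute a definition, wiki-style.
    gazetteer[query] = (lat, lon)

def geocode(query):
    # Exact match only: "1 Main St" and "1 Main Street" are
    # different keys unless someone contributes both.
    return gazetteer.get(query)

define("eiffel tower", 48.8584, 2.2945)
assert geocode("eiffel tower") == (48.8584, 2.2945)
assert geocode("Eiffel Tower") is None  # not even case is normalized
```

Every variant spelling has to be contributed separately, which is exactly where the bet on community contributions comes in.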

A beautiful algorithm – I spent many hours as a larval programmer implementing different versions of Conway's Game of Life. As I read about new approaches, I was impressed by how much difference in speed there could be between my obvious brute-force implementation and those that used insights to avoid a lot of the unnecessary work. It's been two decades since I followed the area, so I was delighted to see how far it has come. In the old days, it would take a noticeable amount of time for a large grid to go through a single generation. Nowadays "it takes a second or so for Bill Gosper’s HashLife algorithm to leap one hundred and forty-three quadrillion generations into the future". There truly is something deeply inspiring about the effort that's gone into that progress, for a problem that's never had any commercial application.
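For anyone who hasn't written one, the obvious brute-force approach is a minimal sketch like this: visit every live cell's neighbourhood each generation and count neighbours. HashLife gets its quadrillion-generation leaps by memoizing repeated sub-patterns instead of doing this work over and over.

```python
from collections import Counter

def step(live):
    """Advance one generation; `live` is a set of (x, y) cells."""
    # Count, for every cell, how many live neighbours it has.
    counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next generation if it has exactly three live
    # neighbours, or two if it was already alive.
    return {
        cell
        for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

# A "blinker" oscillates with period two.
blinker = {(0, 1), (1, 1), (2, 1)}
assert step(step(blinker)) == blinker
```

Every generation costs work proportional to the live population, which is why big grids felt so slow back then.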

BerkeleyDB's architecture – This long-form analysis of the evolution of a database's architecture rings very true. Read the embedded design lesson boxes even if you don't have time for the whole article; they're opinionated, but thoughtful and backed up with evidence in the main text.

"View naming and style inconsistencies as some programmers investing time and effort to lie to the other programmers, and vice versa. Failing to follow house coding conventions is a firing offense".

"There is rarely such thing as an unimportant bug. Sure, there's a typo now and then, but usually a bug implies somebody didn't fully understand what they were doing and implemented the wrong thing. When you fix a bug, don't look for the symptom: look for the underlying cause, the misunderstanding"

Content Creep – There's a lot to think about in this exploration of media's response to a changing world. Using the abstract word "content" instead of talking concretely about stories, articles, or blog posts seems to go along with a distant relationship with the output your organization is creating. Thinking in terms of content simplifies problems too much, so that the value of one particular piece over another is forgotten.

Why Facebook’s data will change our world


When I told a friend about my work at Jetpac he nodded sagely and said "You just can't resist Facebook data can you? Like a dog returning to its own vomit". He's right, I'm completely entranced by the information we're pouring into the service. All my privacy investigations were by-products of my obsessive quest for data. So with Facebook's IPO looming, why do I think research using its data will be so world-changing?


Everyone is on Facebook. I know, you're not, but most organizations can treat you like someone without a phone or TV twenty years ago. The medium is so prevalent that if you're not on it, it's commercially viable to ignore you. This broad coverage also makes it possible to answer questions with the data that are impossible with other sources.

It's intriguing to know which phrases are trending on Twitter, but with only a small proportion of the population on the service, it's hard to know how much that reflects the country as a whole. The small and biased sample immediately makes every conclusion you draw suspect. There are plenty of other ways to mess up your study, of course, but if you have two-thirds of a population of three hundred million in your data, that makes a lot of hard problems solvable.


Love, friendship, family, cooking, travel, play, partying, sickness, entertainment, study, work: We leave traces of almost everything we care about on Facebook. We've never had records like this, outside of personal diaries. Blogs, government records, school transcripts: nothing captures such a rich slice of our lives.

The range of activities on Facebook not only lets us investigate poorly-understood areas of our behavior, it allows us to tie together many more factors than are available from any other source. How does travel affect our chances of getting sick? Are people who are close to their family different in how they date from those who are more distant?


The majority of my friends on Facebook update at least once a day, with quite a few doing multiple updates. We've found the average Jetpac user has had over 200,000 photos shared with them by their friends! This continuous and sustained instrumentation of our lives is unlike anything we've ever seen before; we generate dozens or hundreds of nuggets of information about what we're doing every week. This coverage means it's possible to follow changes over time in a way that few other sources can match.


It's at least theoretically possible for researchers to get their hands on Facebook's data in bulk. A large and increasing amount of activity on the site happens in communal spaces where people know casual friends will see it. Expectations of privacy are a fiercely fought-over issue, but the service is fundamentally about sharing in a much wider way than emails or phone calls allow.

This background means that it's technically feasible to access large amounts of data in a way that's not true for the fragmented and siloed world of email stores, and definitely isn't true for the old-school storage of phone records. The different privacy expectations also allow researchers to at least make a case for analyses like the Politico Facebook project. It's incredibly controversial, for good reason, but I expect to see some rough consensus emerge about how much we trade off privacy for the fruits of research.


I left this until last because I think it's the least distinctive part of Facebook's data. It's nice to have the explicit friendships, but every communication network can derive much better information on relationships from the implicit signals of who talks to whom. There are some advantages to recording the weak ties that most Facebook friendships represent, and it saves an extra analysis step, but even social networks themselves internally rely on implicit signals for recommendations and other applications that depend on identifying real relationships.

The Future

This is the first time in history that most people are creating a detailed record of their lives in a shared space. We've always relied on one-time, narrow surveys of a small number of people to understand ourselves. Facebook's data is a source so different from anything we could previously gather that it makes it possible to answer questions we've never been able to before.

We can already see glimmers of this as hackers machete their way through a jungle of technical and privacy problems, but once the working conditions improve we'll see a flood of established researchers enter the field. They've honed their skills on meagre traditional information sources, and I'll be excited when I see their results on far broader collections of data. The insights into ourselves that their research gives us will change our world radically.

Five short links

Photo by Vikas Rana

Dr Data's Blog – I love discovering new blogs, and this one's a gem. The State of Data posts are especially useful, with a lot of intriguing resources like csvkit.

TempoDB – Dealing with time series data at scale is a real pain, so I was pleased to run across this Techstars graduate. It's a database-as-a-service optimized for massive sets of time series data, behind a simple and modern REST/JSON API. We're generating so many streams of data from sensors and logs that the world needs something like this, as evidenced by the customers they're signing up, and I'm excited to follow their progress.

Foodborne Outbreaks – Unappetizing they may be, but this collection of food-poisoning cases is crying out to be visualized. (via Joe Adler)

Scalding – Another creation of the Cambrian Explosion of data tools, this Scala API for Cascading looks like it's informed by a lot of experience in the trenches at Twitter.

How to create a visualization – In a post on the O'Reilly blog I lay out how I tackle building a new visualization from scratch.

Five short links


Lunar Orbiter Image Recovery Project – The picture above is from an early unmanned scouting probe, sent to the moon as part of the preparation for the Apollo landings. The resolution and detail are amazing, but the story behind its salvage is even more astonishing. It's a tale of hacking at its best, as determined volunteers spent years working on technical archaeology to help save thousands of unique astronomical records. It's a fascinating intellectual adventure story, and I'm pleased to note that my namesake Dr Pete Worden played a key role.

Der Kritzler – A homebrew robot suspended from two ropes that draws on a window by motoring itself along them.

Civil War Diary Code – Looks like a worthwhile puzzle; despite my general geekiness, I've never been much good at code-breaking!

Apache Considered Harmful – For sanity's sake I normally try to avoid open-source disputes, but this article has a lot of good points about the changes in the free software world, and the implications of "Community > Code" now that GitHub has made our coding practices much more social.

The Defeated – An examination of the effect of Sri Lanka's recently-ended civil war on the Tamil community. A long but moving piece of reporting, telling the detailed stories of a civilian and a combatant, intertwined with the causes and repercussions of the violence.