Keep the web weird

Photo by Jeremy Brooks

I'm doing a short talk at SXSW tomorrow, as part of a panel on Creating the Internet of Entities. Preparing is tough because I don't believe it's possible, and even if it were I wouldn't like it. Opposing better semantic tagging feels like hating on Girl Scout cookies, but I've realized that I like an internet full of messy, redundant, ambiguous data.

The stated goal of an Internet of Entities is a web where "real-world people, places, and things can be referenced unambiguously". We already have that. Most pages give enough context and attributes for a person to figure out which real world entity it's talking about. What the definition is trying to get at is a reference that a machine can understand.

The implicit goal of this and similar initiatives like Stephen Wolfram's .data proposal is to make a web that's more computable. Right now, the pages that make up the web are a soup of human-readable text, a long way from the structured numbers and canonical identifiers that programs need to calculate with. I often feel frustrated as I try to divine answers from chaotic, unstructured text, but I've also learned to appreciate the advantages of the current state of things.

Producers should focus on producing

The web is written for humans to read, and anything that requires the writers to stop and add extra tagging will reduce how much content they create. The original idea of the Semantic Web was that we'd somehow persuade the people who create websites to add invisible structure, but we had no way to motivate them to do it. I've given up on that idea. If we as developers want to create something, we should do the work of dealing with whatever form creators have published their information in, and not expect them to jump through hoops for our benefit.

I also don't trust what creators tell me when they give me tags. Even if they're honest, there's no feedback for whether they've picked the right entity code or not. The only ways I've seen anything like this work are social bookmarking services like the late, lamented Delicio.us, or more modern approaches like Mendeley, where picking the right category tag gives the user something useful in return, so they have an incentive both to take the action and to do it right.

Ambiguity is preserved

The example I'm using in my talk is the location field on a Twitter profile. It's free-form text, and it's been my nemesis for years. I often want to plot users by location on a map, and that has meant taking those arbitrary strings and trying to figure out what they actually mean. By contrast, Facebook forces users to pick from a whitelist of city names, so there's only a small number of exact strings to deal with, and they even handily supply coordinates for each.
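To give a flavor of what that actually involves, here's a minimal sketch of matching free-form location strings against a tiny hand-built gazetteer. The place names, aliases, and coordinates below are made up for illustration; a real system needs a far bigger gazetteer and much more forgiving matching.

```python
# A minimal sketch of matching free-form Twitter-style location strings
# against a tiny hand-built gazetteer. All data here is illustrative only.

TINY_GAZETTEER = {
    "san francisco, ca": (37.7749, -122.4194),
    "new york, ny": (40.7128, -74.0060),
    "austin, tx": (30.2672, -97.7431),
}

ALIASES = {
    "sf": "san francisco, ca",
    "san francisco": "san francisco, ca",
    "nyc": "new york, ny",
    "new york city": "new york, ny",
    "austin": "austin, tx",
}

def normalize(raw):
    """Lower-case, strip odd punctuation, and collapse whitespace."""
    cleaned = "".join(c if c.isalnum() or c.isspace() or c == "," else " "
                      for c in raw.lower())
    return " ".join(cleaned.split())

def geocode_freeform(raw):
    """Return (lat, lon) for a free-form location string, or None."""
    key = normalize(raw)
    key = ALIASES.get(key, key)
    return TINY_GAZETTEER.get(key)

if __name__ == "__main__":
    for location in ["SF", "New York City", "Lower Haight", "bicoastal"]:
        print(location, "->", geocode_freeform(location))
```

Even this toy version shows the problem: anything outside the list, from neighborhood names to jokes, falls straight through.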

You'd think I'd be much happier with Facebook's approach, but actually it has made the data a lot less useful. Twitter users will often get creative, putting in neighborhood, region, or even made-up names, and those let me answer a lot of questions that Facebook's more strait-laced places can't. Neighborhoods are a fascinating example. There's no standard for their boundaries or names; they're a true folksonomy. My San Francisco apartment has been described as being in the Lower Haight, Duboce Triangle, or Upper Castro, depending on who you ask, and the Twitter location field gives me insights into the natural voting process that drives this sort of naming.

There are many other examples I could use of how powerful free-form text is, like the way the prevalence of "bicoastal" and "flyover countries" as descriptions changes over time, but the key point is that they're only possible because ambiguous descriptions are allowed. A strict reference scheme like Facebook's makes those applications impossible.

Redundancy is powerful

When we're describing something to someone else, we'll always give a lot more information than is strictly needed. Most postal addresses could be expressed as just a long zip code and a house number, but when we're mailing letters we include street, city and state names. When we're talking about someone, we'll say something like "John Phillips, the lawyer friend of Val's, with the green hair, lives in the Tenderloin", when half that information would be enough to uniquely identify the person we mean.

We do this because we're communicating with unreliable receivers: we don't know what will get lost in transmission as the postie drops your envelope in a puddle, or exactly what information will ring a bell as you're describing someone. All that extra information is manna from heaven for someone doing information processing, though. For example, I've been experimenting with a completely free map of zip code boundaries, based on the fact that I can find latitude/longitude coordinates for most postal addresses using just the street number, name, and city, which gives me a cluster of points for each zip. The same approach works for the extra terms used in conjunction with people or places – there must be a high correlation between the phrases "dashingly handsome man about town" and "Pete Warden" on pages around the web. I'm practically certain. Probably.
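Going back to the zip boundaries, here's a rough sketch of the idea, assuming you already have a CSV of geocoded addresses with zip, lat, and lon columns (the file name and format are just placeholders for illustration). Drawing a convex hull around each zip's cluster of points gives a crude but completely free approximation of the boundary.

```python
# A rough sketch of approximating zip code boundaries from geocoded addresses.
# Assumes a CSV with columns: zip, lat, lon, where the geocoding was done from
# street number, name, and city alone. The convex hull is only a crude
# stand-in for a real boundary.

import csv
from collections import defaultdict

import numpy as np
from scipy.spatial import ConvexHull

def load_points(path):
    points_by_zip = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Store as (lon, lat) so the hull vertices read as x, y pairs.
            points_by_zip[row["zip"]].append((float(row["lon"]), float(row["lat"])))
    return points_by_zip

def approximate_boundaries(points_by_zip, min_points=10):
    boundaries = {}
    for zip_code, points in points_by_zip.items():
        if len(points) < min_points:
            continue  # too few addresses to say anything useful
        pts = np.array(points)
        hull = ConvexHull(pts)
        # The hull vertices, in order, form a rough polygon for this zip.
        boundaries[zip_code] = pts[hull.vertices].tolist()
    return boundaries

if __name__ == "__main__":
    boundaries = approximate_boundaries(load_points("geocoded_addresses.csv"))
    print("Built rough polygons for", len(boundaries), "zip codes")
```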

Canonical schemes are very brittle in response to errors. If you pick the wrong code for a person or place, it's very hard to recover. Natural language descriptions are much harder for computers to deal with, but not only are they far more error-resistant, the redundant information they include often has powerful applications. The only reason Jetpac can pick good travel photos from your friends is that the 'junk' words used in the captions turned out to be strong predictors of picture quality.
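To make the caption point concrete, here's a toy sketch of the general idea, not Jetpac's actual system: train a plain bag-of-words classifier on captions that have been hand-labeled as good or junk. The example captions and labels are invented.

```python
# A toy illustration of using caption words to predict photo quality.
# This is not Jetpac's actual pipeline, just the general bag-of-words idea.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: captions plus a human judgment of the photo.
captions = [
    "sunset over the beach in kauai",
    "amazing view from the hotel balcony",
    "my lunch receipt",
    "screenshot of error message",
    "hiking the cinque terre trail",
    "whiteboard notes from the meeting",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = good travel photo, 0 = junk

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(captions, labels)

print(model.predict(["view from the summit at dawn", "parking ticket photo"]))
```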

Fighting the good fight

I'm looking forward to the panel tomorrow, because all of the participants are doing work I find fascinating and useful. Despite everything I've said, we do desperately need better standards for identifying entities, and I'm going to do what I can to help. I just see this as a problem we need to tackle more with engineering than evangelism. I think our energy is best spent on building smarter algorithms to handle a fallen world, and designing interchange formats for the data we do salvage from the chaos.

The web is literature: sprawling, ambiguous, contradictory, and weird. Let's preserve those qualities as virtues, and write better code to cope with the resulting mess.

Five short links

Photo by Phillip Hay

Kartograph – An open-source web component for rendering beautiful interactive maps using SVG. Fantastic work by Gregor Aisch.

Hard science, soft science, hardware, software – I have a blog crush on John D. Cook's site; it's full of thought-provoking articles like this. As someone who's learned a lot from the humanities, I think he gets the distinction between the sciences exactly right. Disciplines that don't have elegant theoretical frameworks and clear-cut analytical tools for answering questions do take a lot more work to arrive at usable truths.

Don't fear the web – A good overview of moral panics on the internet, and how we should react to the dangers of new technology.

Using regression isolation to decimate bug time-to-fix – Once you're dealing with massive, interdependent software systems, there's a whole different world of problems. This takes me back to my days of working with multi-million-line code bases, where automating testing and bug reporting becomes essential.

Humanitarian OpenStreetMap Team – I knew OSM did wonderful work around the world, but I wasn't aware of HOT until now; it's great to see it all collected in one place.

Five short links

Photo by Jody Morgan

Open Data Handbook Launched – I love what the Open Knowledge Foundation are doing with their manuals. Documentation is hard and unglamorous work, but has an amazing impact. I'm looking forward to their upcoming title on data journalism.

My first poofer Workshop – This one's already gone, but I'm hoping there will be another soon. I can't think of a better way to spend an afternoon than learning to build your very own ornamental flamethrower.

Using photo networks to reveal your home town – Very few people understand how the sheer volume of data that we're producing makes it possible to produce scarily accurate guesses from seemingly sparse fragments of information. When you look at a single piece in isolation it looks harmless, but pull enough together and the result becomes very revealing.

Introducing SenseiDB – Another intriguing open-source data project from LinkedIn. There's a strong focus on the bulk loading process, which in my experience is the hardest part to engineer. Reading the documentation leaves me wanting more information on their internal DataBus protocol; I bet that includes some interesting tricks.

IPUMS and NHGIS – As someone who recently spent far too long trying to match the BLS's proprietary codes for counties with the US Census's FIPS standard, I know how painful the process of making statistics usable can be. There's a world of difference between file dumps in obscure formats with incompatible time periods and units, and a clean set that you can perform calculations on. I was excited to discover the work being done at the University of Minnesota to create unified data sets that cover a long period of time, and much of the world.

Data scientists came out of the closet at Strata

Photo by Sarah Ackerman

Roger Magoulas asked me an interesting question during Strata – what was the biggest theme that emerged from this year's gathering? It took a bit of thought, but I realized that I was seeing a lot of people from all kinds of professions and organizations becoming conscious and open about their identity as data scientists.

The term itself has received a lot of criticism and there are always worries about 'big-data-washing', but what became clear from dozens of conversations was that it's describing something very real and innovative. The people I talked to came from backgrounds as diverse as insurance actuaries, physicists, marketers, geologists, quants, biologists, and web developers, and they were all excited about the same new tools and ways of thinking. Kaggle is concrete proof that the same machine-learning skills can be applied across a lot of different domains to produce better results than traditional approaches, and the same is being proved for all sorts of other techniques from NoSQL databases to Hadoop.

A year ago, your manager would probably roll her eyes if you were in a traditional sector and she caught you experimenting with the standard data science tools. These days, there's an awareness and acceptance that they have some true advantages over the old approaches, and so people have been able to make an official case for using them within their jobs. There's also been a massive amount of cross-fertilization, as it's become clear how transferable across domains the best practices are.

This year, thousands of people across the world have realized they have problems and skills in common with others they would never have imagined talking to. It's been a real pleasure seeing so much knowledge being shared across boundaries, as people realize that 'data scientist' is a useful label for helping them connect with other people and resources that can help with their problems. We're starting to develop a community, and a surprising amount of the growth is from those who are announcing their professional identity as data scientists for the first time.

Five short links

Picture by Don O'Brien

DepthCam – An open-source Kinect hack that streams live depth information to a browser using WebSockets for transport and WebGL for display. If you pick the right time of day, you'll see the researcher sipping his tea and tapping at the keyboard, in depth form!

OpenGeocoder – Steve Coast is at it again, this time with a wiki-esque approach to geocoding. You type in a query string, and if it's not found you can define it yourself. I'm obsessed with the need for an open-source geocoder, and this is a fascinating take on the problem. By doing a simple string match, rather than trying to decompose and normalize the words, a lot of the complexity is removed. This is either madness or genius, but I'm hoping the latter. The tradeoff will be completely worthwhile if it makes it more likely that people will contribute.
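To show why the exact string match strips out so much complexity, here's a minimal sketch of the idea as I understand it – my own illustration, not OpenGeocoder's actual code:

```python
# A minimal sketch of wiki-style, exact-string-match geocoding
# (not OpenGeocoder's real implementation). There's no parsing or
# normalization: either the exact string has been defined, or the
# caller is invited to define it themselves.

class WikiGeocoder:
    def __init__(self):
        self.entries = {}  # query string -> (lat, lon)

    def lookup(self, query):
        """Return (lat, lon) if someone has defined this exact string."""
        return self.entries.get(query)

    def define(self, query, lat, lon):
        """Let a user define a query string they searched for and didn't find."""
        self.entries[query] = (lat, lon)

if __name__ == "__main__":
    geo = WikiGeocoder()
    print(geo.lookup("the lower haight, san francisco"))  # None: not defined yet
    geo.define("the lower haight, san francisco", 37.7717, -122.4298)
    print(geo.lookup("the lower haight, san francisco"))  # now it's found
```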

A beautiful algorithm – I spent many hours as a larval programmer implementing different versions of Conway's Game of Life. As I read about new approaches, I was impressed by how much difference in speed there could be between my obvious brute force implementation, and those that used insights to avoid a lot of the unnecessary work. It's been two decades since I followed the area, so I was delighted to see how far it has come. In the old days, it would take a noticeable amount of time for a large grid to go through a single generation. Nowadays "it takes a second or so for Bill Gosper’s HashLife algorithm to leap one hundred and forty-three quadrillion generations into the future". There truly is something deeply inspiring about the effort that's gone into that progress, for a problem that's never had any commercial application.
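For contrast, this is roughly the obvious brute-force step I was writing back then: every cell gets visited on every generation, which is why large grids crawl, and it's the baseline that HashLife leaves so far behind.

```python
# The obvious brute-force Game of Life step: visit every cell, count its
# live neighbors, and apply the rules. Edges are treated as dead cells.

def step(grid):
    rows, cols = len(grid), len(grid[0])
    new_grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = sum(
                grid[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols
            )
            if grid[r][c]:
                new_grid[r][c] = 1 if neighbors in (2, 3) else 0
            else:
                new_grid[r][c] = 1 if neighbors == 3 else 0
    return new_grid

if __name__ == "__main__":
    # A glider on a small grid, stepped forward a few generations.
    grid = [[0] * 8 for _ in range(8)]
    for r, c in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:
        grid[r][c] = 1
    for _ in range(4):
        grid = step(grid)
    print("\n".join("".join("#" if cell else "." for cell in row) for row in grid))
```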

BerkeleyDB's architecture – This long-form analysis of the evolution of a database's architecture rings very true. Read the embedded design lesson boxes even if you don't have time for the whole article, they're opinionated but thoughtful and backed up with evidence in the main text.

"View naming and style inconsistencies as some programmers investing time and effort to lie to the other programmers, and vice versa. Failing to follow house coding conventions is a firing offense".

"There is rarely such thing as an unimportant bug. Sure, there's a typo now and then, but usually a bug implies somebody didn't fully understand what they were doing and implemented the wrong thing. When you fix a bug, don't look for the symptom: look for the underlying cause, the misunderstanding"

Content Creep - There's a lot to think about in this exploration of media's response to a changing world. Using the abstract word "content" instead of talking concretely about stories, articles, or blog posts seems to go along with a distant relationship with the output your organization is creating. Thinking in terms of content simplifies problems too much, so that the value of one particular piece over another is forgotten.

Why Facebook’s data will change our world


When I told a friend about my work at Jetpac he nodded sagely and said "You just can't resist Facebook data can you? Like a dog returning to its own vomit". He's right; I'm completely entranced by the information we're pouring into the service. All my privacy investigations were by-products of my obsessive quest for data. So with Facebook's IPO looming, why do I think research using its data will be so world-changing?

Population

Everyone is on Facebook. I know, you're not, but most organizations can treat you the way they'd have treated someone without a phone or TV twenty years ago. The medium is so prevalent that if you're not on it, it's commercially viable to ignore you. This broad coverage also makes it possible to use the data to answer questions that are impossible with other sources.

It's intriguing to know which phrases are trending on Twitter, but with only a small proportion of the population on the service, it's hard to know how much that reflects the country as a whole. The small and biased sample immediately makes every conclusion you draw suspect. There are plenty of other ways to mess up your study of course, but if you have two-thirds of a three-hundred-million-person population in your data, that makes a lot of hard problems solvable.

Coverage

Love, friendship, family, cooking, travel, play, partying, sickness, entertainment, study, work: we leave traces of almost everything we care about on Facebook. We've never had records like this, outside of personal diaries. Blogs, government records, school transcripts – nothing captures such a rich slice of our lives.

The range of activities on Facebook not only lets us investigate poorly-understood areas of our behavior, it allows us to tie together many more factors than are available from any other source. How does travel affect our chances of getting sick? Are people who are close to their family different in how they date from those who are more distant?

Frequency

The majority of my friends on Facebook update at least once a day, with quite a few doing multiple updates. We've found the average Jetpac user has had over 200,000 photos shared with them by their friends! This continuous and sustained instrumentation of our lives is unlike anything we've ever seen before; we generate dozens or hundreds of nuggets of information about what we're doing every week. This frequency means it's possible to follow changes over time in a way that few other sources can match.

Accessibility

It's at least theoretically possible for researchers to get their hands on Facebook's data in bulk. A large and increasing amount of activity on the site happens in communal spaces where people know casual friends will see it. Expectations of privacy are a fiercely fought-over issue, but the service is fundamentally about sharing in a much wider way than emails or phone calls allow.

This background means that it's technically feasible to access large amounts of data in a way that's not true for the fragmented and siloed world of email stores, and definitely isn't true for the old-school storage of phone records. The different privacy expectations also allow researchers to at least make a case for analyses like the Politico Facebook project. It's incredibly controversial, for good reason, but I expect to see some rough consensus emerge about how much we trade off privacy for the fruits of research.

Connections

I left this until last because I think it's the least distinctive part of Facebook's data. It's nice to have the explicit friendships, but every communication network can derive much better information on relationships from the implicit signals of who talks to whom. There are some advantages to recording the weak ties that most Facebook friendships represent, and it saves an extra analysis step, but even the social networks themselves internally rely on implicit signals for recommendations and other applications that depend on identifying real relationships.
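As a sketch of what relying on implicit signals can look like in practice, you just weight each pair of people by how often they actually interact; the interaction log format below is invented for illustration.

```python
# A sketch of deriving relationship strength from implicit signals rather
# than explicit friend lists. The log is a made-up list of (sender, recipient)
# pairs standing in for messages, comments, and so on.

from collections import Counter

interactions = [
    ("alice", "bob"), ("alice", "bob"), ("bob", "alice"),
    ("alice", "carol"),
    ("dave", "alice"),  # a declared friend who rarely interacts
]

def relationship_strengths(events):
    """Weight each undirected pair by how often the two people interact."""
    weights = Counter()
    for sender, recipient in events:
        weights[frozenset((sender, recipient))] += 1
    return weights

if __name__ == "__main__":
    for pair, weight in relationship_strengths(interactions).most_common():
        print(sorted(pair), weight)
```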

The Future

This is the first time in history that most people are creating a detailed record of their lives in a shared space. We've always relied on one-time, narrow surveys of a small number of people to understand ourselves. Facebook's data is an incredible source, so different from anything else we can gather that it makes it possible to answer questions we've never been able to before.

We can already see glimmers of this as hackers machete their way through a jungle of technical and privacy problems, but once the working conditions improve we'll see a flood of established researchers enter the field. They've honed their skills on meagre traditional information sources, and I'll be excited when I see their results on far broader collections of data. The insights into ourselves that their research gives us will change our world radically.

Five short links

Photo by Vikas Rana

Dr Data's Blog – I love discovering new blogs, and this one's a gem. The State of Data posts are especially useful, with a lot of intriguing resources like csvkit.

TempoDB – Dealing with time series data at scale is a real pain, so I was pleased to run across this Techstars graduate. It's a database-as-a-service optimized for massive sets of time series data, behind a simple and modern REST/JSON API. We're generating so many streams of data from sensors and logs that the world needs something like this, as evidenced by the customers they're signing up, and I'm excited to follow their progress.

Foodborne Outbreaks – Unappetizing they may be, but this collection of food-poisoning cases is crying out to be visualized. (via Joe Adler)

Scalding – Another creation of the Cambrian Explosion of data tools, this Scala API for Cascading looks like it's informed by a lot of experience in the trenches at Twitter.

How to create a visualization – In a post on the O'Reilly blog I lay out how I tackle building a new visualization from scratch.

Five short links

[Image: the Earth seen from the Moon]

Lunar Orbiter Image Recovery Project – The picture above is from an early unmanned scouting probe, sent to the moon as part of the preparation for the Apollo landings. The resolution and detail are amazing, but the story behind its salvage is even more astonishing. It's a tale of hacking at its best, as determined volunteers spent years working on technical archaeology to help save thousands of unique astronomical records. It's a fascinating intellectual adventure story, and I'm pleased to note that my namesake Dr Pete Worden played a key role.

Der Kritzler – A homebrew robot suspended from two ropes, that draws on a window by motoring itself along them.

Civil War Diary Code – Looks like a worthwhile puzzle; I just wish I were any good at ciphering. Despite my general geekiness, I've never been much good at code-breaking!

Apache Considered Harmful – For sanity's sake I normally try to avoid open-source disputes, but this article has a lot of good points about the changes in the free software world, and the implications of "Community > Code" now that github has made our coding practices much more social.

The Defeated – An examination of the effect of Sri Lanka's recently-ended civil war on the Tamil community. A long but moving piece of reporting, telling the detailed stories of a civilian and a combatant, intertwined with the causes and repercussions of the violence.

Five short links

Photo by Nugun

Brainstorm – Psychedelic raytraced graphics packed onto a display that can only show a tiny set of characters and colors. A beautiful hack.

YASIV – An interactive visualization of the connections between books on Amazon. I never found a good way to expose these sorts of force-directed network graphs within a usable product, but I remain fascinated by them; they're a powerful way of communicating relationships between large numbers of objects.

Mining of Massive Datasets – Rich, detailed, and practical, this is an invaluable overview of the techniques that you can apply to big collections of unstructured data to produce useful information, and is freely available as a PDF. I'm looking forward to learning a lot from this book; I just wish I could pay them for it without ordering a hardback copy.

LoremPixel – Simple but handy service that auto-generates placeholder images for your design prototypes, with easy control over the size and category.

Map of the Drug War – Chilling and information-rich, this visualization of Mexico's violence shows how bitter the drug war has become.

Big Data keeps getting bigger in Boulder

[Graph: Big Data meetup attendance over time]

A couple of years ago I started what became the Big Data meetup for the Boulder/Denver area, together with Jacob Rideout. The first few months were tough: despite the tight-knit tech community in the area, not many people were using or interested in technologies like Hadoop and NoSQL, so we averaged around eight or nine people. After I left Colorado the event really started to pick up steam, as you can see in the graph above.

I like to think it wasn't my absence that fueled the growth, but rather the groundswell of interest in everything under the Big Data umbrella. Boulder is an exciting place to be working on technology, and I'm not at all surprised to see so much work being done with emerging data tools. There seem to be a lot of new (and old!) companies following in the footsteps of local pioneers like Gnip, Next Big Sound and Return Path, and they're looking for people to hire, so if you're an aspiring data geek who wants to work on interesting projects, I highly recommend popping along to the next one!