Five short links

Picture by Jeff Trexler

Brenda Zulu discusses the state of the Zambian blogosphere – A reminder of the basic challenges facing technology in the developing world, with critical bloggers being chased out of the country. It's promising to hear how popular Twitter and Facebook are for microblogging though.

Rickshaw – A D3-based JavaScript framework for drawing sophisticated interactive time-series graphs.

All interesting problems are scalability problems – I don't agree with the headline, but there are some spot-on observations in this post. Almost all the costs of successful software are in maintenance, but there's a heavy survivorship bias in those figures, since many codebases never even get used. There are a lot of parallels (if you'll excuse the pun) between the constraints of tiny embedded systems and those of massively distributed software. That's what I love about engineering: the border between what's needed and what's possible is a rich fractal, with enough repeating patterns to re-apply lessons you've learned, but with plenty of variety so you've no excuse to be bored.

ArcSpread for analyzing web archives – Stanford runs a fantastic project for capturing important web pages as they change over time, and then presenting the results in a form that future historians will be able to use. This paper talks about some of the techniques they use for removing boilerplate navigation and ad content, so that researchers can work with the meat of the page.

MemoirTree – A simple but effective application for capturing oral history from the people around you. One of the joys I discovered during my forays into journalism was how everybody has an interesting story to tell you if you just sit down and ask them about their life, so I'm hoping this catches on.

Is MySQL viable for data mining?

Photo by Aitor Escauriaza

I've been involved in an interesting Twitter conversation with Rafi Kam. I don't know anything about his background or plans, but he's obviously working on a data project. I was pleased to be able to point him to EC2 for Poets as a great introduction to Amazon's hosting, but this morning he asked "I'm concerned about nosql learning time and lack of simple querying. Can mysql be a viable back end for data mining?".

The quick answer is that MySQL and other traditional databases are absolutely viable for data mining. In most cases they're actually far superior to NoSQL solutions for anything that involves exploration and experimentation, simply because they have far more mature tools and documentation and a much more flexible interface.

My advice is to always start with a relational database when you're prototyping your product. NoSQL systems like Cassandra offer advantages once you're dealing with truly massive data sets, but relational databases will get you a long, long way. Once your queries start slowing down, that's the time to look at optimizing your database, whether it's by switching to a key/value solution, or more traditional approaches like heavier indexing or even vertically scaling by just buying a faster machine!
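To make the prototyping argument a bit more concrete, here's a minimal sketch of the kind of ad-hoc exploration a relational database makes easy. I'm assuming Python's built-in sqlite3 module as a stand-in for MySQL so it runs anywhere, and the checkins table, its columns, and its rows are all made up for illustration. The SQL itself should carry over to MySQL largely unchanged.

```python
import sqlite3

# In-memory database as a stand-in for a MySQL server (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkins (user_id INTEGER, city TEXT, created_at TEXT)")

# Hypothetical sample data, purely illustrative.
rows = [
    (1, "San Francisco", "2012-04-01"),
    (2, "San Francisco", "2012-04-02"),
    (3, "Boulder", "2012-04-02"),
    (1, "London", "2012-04-03"),
    (4, "San Francisco", "2012-04-03"),
    (2, "Boulder", "2012-04-04"),
]
conn.executemany("INSERT INTO checkins VALUES (?, ?, ?)", rows)


def top_cities(conn, limit=3):
    """Ad-hoc exploratory query: which cities have the most check-ins?"""
    return conn.execute(
        "SELECT city, COUNT(*) AS n FROM checkins "
        "GROUP BY city ORDER BY n DESC, city ASC LIMIT ?",
        (limit,),
    ).fetchall()


print(top_cities(conn))

# When a query like this starts to slow down on real data, an index on the
# grouped column is usually the first thing to try, before changing databases.
conn.execute("CREATE INDEX idx_checkins_city ON checkins (city)")
print(top_cities(conn))  # same answer, the index only changes how it's computed
```

The point is that changing the question is a one-line edit to the query, with no new job to write and no schema redesign.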

Now, NoSQL and the MapReduce approach to data processing are a lot of fun to play with, so I highly recommend learning more about them and using them in toy projects to get familiar with them, but unless the point of the project is to train yourself on the tools, start with something simpler.

Five short links

Picture by Matt Handler

Girls and coding: Female peer pressure scares them off – I wish there was more data to back this argument up, but the idea seems worth investigating. "..there are no great British young geek superstars for them to relate to, male or female" is sad but true too.

Using binary search for debugging – Binary search is useful in all sorts of circumstances beyond traditional programming, and it's great to see this list of some of the unexpected places it comes in handy. Figuring out item counts by binary-searching on URL parameters is particularly cunning.
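Here's a rough sketch of how the URL-parameter trick might look in practice. The page_exists callback is hypothetical: on a real site it might fetch something like ?page=N and check whether anything comes back. Here I've faked it so the example runs offline. It also assumes that every page up to the true count exists and nothing beyond it does, which is what makes binary search applicable.

```python
def count_items(page_exists, upper_bound=1_000_000):
    """Return the largest n in [0, upper_bound] for which page_exists(n) is True.

    Assumes page_exists(n) is True for all n up to the real count and False
    afterwards, and that the real count is no larger than upper_bound.
    """
    lo = 0                 # known to "exist" (zero items is always valid)
    hi = upper_bound + 1   # treated as not existing
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if page_exists(mid):
            lo = mid       # the count is at least mid
        else:
            hi = mid       # the count is below mid
    return lo


# Hypothetical stand-in for a real HTTP check: pretend the site has 1234 items.
probes = []

def fake_page_exists(n):
    probes.append(n)
    return n <= 1234


print(count_items(fake_page_exists))  # 1234
print(len(probes), "requests instead of over a thousand")
```

Each probe halves the remaining range, so even a million-item upper bound needs only about twenty requests.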

Superfastmatch – A spectacularly useful open source tool for quickly detecting identical sections in sets of millions of documents. Originally aimed at detecting lazy journalism using cut-and-paste from press releases or Wikipedia, it's also applicable to plagiarism more widely, or even detecting all the echoes of biblical phrases in Shakespeare's work.

sued – A Canadian group spent years building up a crowd-sourced database of postal codes, an essential foundation for almost anyone doing open geographic work, and they were then sued by the Canadian Postal Service for violating their copyright! A very depressing case, but I'm hoping the support and publicity they're receiving convinces the government to back off.

Insight Data Fellows Program – Are you a PhD or post-doc who wants to apply the analytic skills you learned to the technology industry? This six week intensive course looks like a fantastic chance to be mentored by Silicon Valley data folks, and to meet lots of potential employers too!

Does Facebook’s purchase of Instagram make sense?

Picture by Oridusartic

I've spent the last year obsessed with social photo sharing as I've been building out Jetpac, so while I can't pretend I was expecting it, Facebook's acquisition of Instagram made sense to me. Here's why:

Facebook is a photo sharing site with a social network attached

The extent to which photos have always driven the growth of the network astonished me. Unlike games or even status updates, sharing pictures was an existing social behavior that the recipients understood and welcomed, giving friends and relatives of users a strong incentive to sign up themselves. Nothing else has this kind of pull; it's the bedrock of everything else they do. They currently host 140 billion photos and are adding 10 billion a month, and that's a crucial engine of engagement.

Instagram has cracked the creative app problem

Instagram's real value is in their experience building a creative app that everybody can use. Nobody else has built an interface that's clear enough to be approachable and yet can produce results that people appreciate. It may sound simple, but it's deceptively hard to replicate from the outside. People like the filtered images because they're expressing a creative act by the taker, something they've put thought and time into, but for a wide audience of creators to use it, it actually has to be a lot easier and quicker than it appears. This balancing act is not only hard to reverse-engineer, it's also helped by an aura of exclusivity in the early days that's near-impossible for an established company to replicate. The flip-side of large companies being able to get easy press is that nobody gets credit for telling their friends about a cool new service they've launched.

Instagram was on the verge of going mainstream

The company had clearly proven that their service had wide appeal, and showed all the signs of going into a rapid expansion. Even for behemoths like Facebook it gets very expensive to acquire a startup once it truly blows up, so with the cautionary tale of Yahoo's failure to buy Google in its early days in mind, it was a last chance of sorts for an acquisition. Instagram users' interaction with photos is very different from anything that Facebook offers, so if it did become widely popular there would be a real threat that they'd siphon off users.

Instagram is the first natively-mobile app

When Somini Sengupta asked me about this story for her New York Times post, I felt like I was repeating conventional wisdom, but I realized that's not something everybody's absorbed. It's profoundly shaped the way we approach Jetpac, with a laser-focus on our iPad app, because there's a deep shift in user behavior that established web companies are struggling to adapt to. Facebook is keenly aware of how important mobile is, but they're facing a classic innovator's dilemma where their core web business will suffer if they really prioritize phones and tablets. Bringing in the pioneers in mobile-only applications can't hurt as they wrestle with the changes they know they need to make.

Facebook's own valuation gives it a strong war chest for moves like this, so in their position I see why they made the purchase. The key is understanding how central photo sharing is for their business, and how much they believe in mobile.

Five short links

Picture by Pink Ponk

Why Open Science failed after the Gulf oil spill – The description of this researcher's interactions with the media rang very true. They took his reports and "eliminated a lot of the caveats and limits that Asper placed on his own results".

Sigma.js – An interactive network graph library, with support for both live force-directed layouts, and importing more complex structures from the desktop Gephi application. It has some very stylish visual defaults too.

Accumulo – I'd missed this Apache database project until now, but I'm interested in their take on the BigTable concept, especially their focus on security controls. Intriguing that it came out of the NSA too.

Visualizing live event broadcast delay – Working backwards from website traffic at different locations to figure out the broadcast delay for a TV commercial.

Online Hex Editor – Does exactly what it says on the tin. I don't know why I'm still amazed by how effective web apps can be, but it's striking how few barriers there are to replacing desktop programs.

Where am I, who am I?


"Queequeg was a native of Rokovoko, an island far away to the West and South. It is not down in any map; true places never are."

Where am I right now? Depending on who I'm talking to, I'm in SoMa, San Francisco, South Park, the City, or the Bay Area. What neighborhood is my apartment in? Craigslist had it down as Castro when it was listed. Long-time locals often describe it as Duboce Triangle, but people less concerned with fine differences lump it into the Lower Haight, since I'm only two blocks from Haight Street.

When I first started working with geographic data, I imagined this was a problem to be solved. There had to be a way to cut through the confusion and find a true definition, a clear answer to the question of "Where am I?".

What I've come to realize over the last few years is that geography is a folksonomy. Sure, there are political boundaries, but the only ones that people pay much attention to are states and countries. City limits don't have much effect on people's descriptions of where they live. Just take a look at this map of Los Angeles' official boundaries:


There's clearly little correlation between the legal city boundaries and how people describe the place that they live. You could argue that Los Angeles County is the correct region to use, but then people way out in the desert by Littlerock would be included!

The arbitrary and human nature of places is even more pronounced with neighborhoods. As I showed above, there's a surprising amount of consensus on the names of the neighborhoods, but almost none on their boundaries.

Why do I care about all this? It's crucial for data processing to recognize that if you force what the user puts in the 'Location' box into a standardized form, you're losing information. For example, knowing how somebody naturally describes where they are is going to be a lot more useful for grouping them together than a street address or latitude/longitude coordinates. If I choose the Lower Haight label, I'm more likely to be a hippy or a punk; if I pick the Castro, I want to identify with the gay image; and if I go for the Mission, I'm associating myself with hipsters.
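As a minimal sketch of what keeping the self-described label might look like, here's one way to group people by what they typed rather than by where a geocoder would put them. The profiles, the user names, and the very light normalization (lowercasing, collapsing whitespace, dropping a leading "the") are all assumptions for illustration. The important part is that the label is never converted into coordinates.

```python
from collections import defaultdict

# Hypothetical (user, free-form location text) pairs, as typed by the users.
profiles = [
    ("alice", "Lower Haight"),
    ("bob", "lower haight "),
    ("carol", "The Castro"),
    ("dave", "Mission"),
    ("erin", "Castro"),
]


def normalize(label):
    """Tidy up a label just enough to match trivial variants of the same name."""
    text = " ".join(label.lower().split())
    if text.startswith("the "):
        text = text[4:]
    return text


def group_by_label(profiles):
    """Group users by the neighborhood name they chose for themselves."""
    groups = defaultdict(list)
    for user, label in profiles:
        groups[normalize(label)].append(user)
    return dict(groups)


for label, users in group_by_label(profiles).items():
    print(label, users)
```

A geocoder might well place all five of these people within a mile of each other, while the labels they chose split them into three quite different groups.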

I'm glad Twitter has stuck with their free-form text fields, and I hope Facebook will become more flexible. Don't throw this data away, treasure it! It makes it a lot harder for machines to deal with the content that people produce, but unless you're shipping packages or targeting ICBMs, the payoff of richer knowledge of your users is worth it.

Find amazing travel photos from your friends with Jetpac on the iPad


I'm interrupting my usual stream of geek consciousness to bring you a message from our sponsors. I'm very pleased to announce that the Jetpac iPad app is now available! Some of your friends are taking astonishing travel pictures that you've never seen. Get the app and we'll give you the very best of the two hundred thousand photos your friends have shared on Facebook.

Ratings are very important to help other people discover the app, so if you do enjoy it, please consider taking a few seconds to rate us too.

There are a lot of data stories from this release, and I'll be writing about them over the next few weeks, in between new features and bug-fixes for the next update!