Save the world with data

Photo by Daniela Hartmann

I was talking to someone recently who at first I thought was an angel investor interested in my work, but it turned out he was keen on funding charitable projects. It shows I need to improve my messaging – while I am technically running a non-profit, that's not by choice. I do love supporting that world, but I'm always looking for revenue opportunities where I'm adding value for users who can afford it, even for my open-source projects.

It did get me thinking about some of my favorite non-profit teams though. They're all primarily focused on using data to do good, even if they're not purely non-profit in structure. Here are the people I'd trust to put my money to good use, if I were a social investor.

Ushahidi/SwiftRiver

With uptake worldwide, Ushahidi has been doing valuable work spreading information around disasters and political unrest, everywhere from Egypt to New Zealand and Japan. The SwiftRiver part is the brainchild of Jon Gosier, and while I'm a bit biased as I'm contributing some time and code, it's a really powerful use of technologies that are usually only focused on ad-serving. There are so many practical ways this kind of filtering and routing of information can help in an emergency.

Global Virus Forecasting Initiative

With feet on the ground in 23 countries, the GVFI has roots in the traditional non-profit world, but a clear understanding of how modern data processing can help its mission. Lucky Gunesekara, Lalith Polepeddi and the rest of their team are building a cutting-edge information infrastructure to make sense of the masses of information gathered by the teams, matching it against social media and traditional news organizations. The results are then used to track and predict human viral outbreaks and save lives. 

Media Cloud

Ethan Zuckerman is a legend in the non-profit world as a driving force behind projects like Global Voices and Geekcorps, but with Media Cloud he's applying his skills to data analysis on an ambitious scale. As the Arab revolts show, one of the biggest problems around the world is the way people are denied a voice. By tracking and quantifying media coverage in unprecedented detail, the project hopes to draw a picture of the problem, as a first step to offering solutions. It's an important project in a lot of ways – just ask yourself when you first heard about the Tunisian uprising. It took three weeks for coverage to enter most of the Western media (though France was a bit ahead of the curve, thanks to its colonial history). Those sorts of gaps in our collective vision mean we're missing out on vital information.

Five Short Links

Picture by Tiger Pixel

How US News abandoned print and learned to love its data – How a magazine started monetizing its rankings of colleges, cars, high schools and mutual funds as a side-business, and ended up closing the traditional publishing business to focus on it exclusively. People in the publishing business often think they’re selling books, magazines or shows, but those are just the delivery mechanisms for the advice and entertainment people actually crave.

eq.org.nz – The power of Ushahidi – How the local online community responded to the Christchurch earthquake. The details in this account are crucial, especially how important the ‘neutral ground’ aspect of the service was. With no corporate logos on the site, competitors felt it was safe to cooperate without worrying that it would backfire on them. That attitude may seem crazy in the face of a catastrophe, but efforts like this are way more effective when they don’t require people to behave like saints.

Article text extraction from HTML documents – In-depth bibliography on all the projects out there that take a web page and try to extract the important “body” text, without the ads, boilerplate or navigation links.

Big Data, Analytics and Storytellings – I spent an hour chatting with Lyle Wallis last week. He’s spent years fighting in the trenches applying data analysis to real-world problems. He had a lot of insights, but this post captures one of the most obvious but also most overlooked: People respond to stories. As engineering-types, it’s easy for us to miss out on the power of laying out an explanation as a narrative, but it’s a marvelous mental hack for connecting with your audience. See also Ira Glass.

Mobile internet usage in Japan – Comscore does a great job of teasing out the effects of the quake on cell data traffic. It actually surprised me how subtle the patterns of the disaster were in the data, even through something that apocalyptic.

Five Short Links

Photo by Bill Bradford

Websockets Pacman – My friend Tyler Gillies created this basic but functional networked version of the classic game, where visitors to the site each take control over one of the ghosts. Here’s the playable demo. I love this because it shows how far our tools have come – just a few years ago it would have been a major engineering challenge, now it’s just a screenful of code.

The Anti-Predictor – Part of a fascinating interview with “Mathematical Sociologist” Duncan Watts, where he lays out the evidence against the existence of an elite of influencers, at least as most online marketers understand the term. “It does matter, on average, how many followers you have and how successful you’ve been in spreading your messages in the past, but it’s a lot more random than intuition suggests.”

The ethics of live mapping in repressive regimes and hostile environments – A detailed practical guide for online revolutionaries. The mundane detail of the precautions brings to life what enormous risks they’re running.

Componentization and Open Data – The data world is following the path that the coding world took a decade ago, as common foundational building-blocks start to become freely available. Certain parts, like address lookup, can be packaged as free and open components, which then allows engineers to work on the unsolved problems that really add value (and that you can charge money for).

FBI wants public help solving encrypted notes from murder mystery – Are you a cryptography geek? Donate some spare brain cycles to helping out the FBI.

Walking up Twin Peaks from the Upper Castro

A friend was asking me about a dog-walk I've been doing a lot on the weekends, heading up to the large antenna on top of Twin Peaks. Since it took me a little experimentation to figure out the best route, I've put together this little map and guide. It's a five mile round-trip, with some serious San Francisco hill climbing, and it's almost entirely along streets. It rewards you with some amazing views all the way, though, especially from the summit. It usually takes me and Thor a little under two hours, so it's a great way to squeeze a good workout into a busy day.

I start off from my apartment at Church and Duboce, and walk a couple of blocks along Hermann, into the Duboce dog park. This is a fantastic diversion for both of us, since there's almost always a good bunch of dogs and owners. After Thor's done his socializing, we then take Noe down to Market. Following Market for a couple of blocks takes you to 17th Street. This goes straight up, and keeps teasing you with false summits. Eventually, after a tough climb, you'll turn onto Clayton.

This turn is the only slightly tricky part of the route, since you take Clayton for a block, and then cross over to Twin Peaks Boulevard as it forks off. Going up past the small Tank Hill park, you'll then continue along Twin Peaks as it cuts left, away from Clarendon. This is the final stretch, and only the first section has any sidewalk. After that, you're walking on the shoulder, and while it's not too tough, be careful of the corners with poor visibility if you have a dog who thinks he should be able to walk down the center line. Some of the downhill bikes and cars pick up quite a lot of speed. There are social trails cutting a lot of the corners, so I'd definitely consider those.

Finally you'll make it up to the viewpoint. Especially after a rain, you get a magnificent vista, stretching out over downtown, across the Bay, and into Marin. Take it all in, and try not to be too disturbed by the posters asking for information on a recent homicide in the parking lot. I'm guessing it's quite a different world at night, but during the day it's full of visitors enjoying the sights.

Create Beautiful Word Clouds with Wordlin.gs

Want to build striking word clouds around custom stencil images? Give wordlin.gs a try. Here's what it offers:

Custom Shapes. Choose from a selection of templates, or upload your own images, and the word cloud will be moulded to fit within the silhouette.

Custom Fonts. You can use any font on your machine for the cloud.

High-resolution. Large-format images mean you can order great-looking versions of your visualization as shirts, invitations, mugs or art prints.

HTML5. Using Canvas for rendering means it will even work on iPads or iPhones, as well as on the current versions of all major browsers.

I created the OpenWordCloud renderer as an open-source jQuery plugin, and then built a web service around it. I actually built it at the start of the year, and have used it myself for projects like the Gaddafi speech visualization and for analyzing my favorite novels, but the Data Science Toolkit preparation didn't leave me enough time to do a proper launch. There are still some parts I'm unhappy with, especially the menu responsiveness, but I want to get it out there to get feedback.

If you like this sort of thing, I highly recommend looking at Jonathan Feinberg's awesome Wordle too. It's a wonderful project, and offers much faster rendering thanks to its use of Java and a non-raster algorithm for word placement. I just needed something a little different (custom shapes and fonts, iOS support) and wanted to get an open-source version out there for other hackers to build on.
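For anyone curious what raster-style placement inside a custom shape can look like, here's a toy sketch in Python. It is not the OpenWordCloud code and not Wordle's algorithm: the text-art mask, the rectangular word boxes and the first-fit scan are all simplifying assumptions of mine, standing in for a real stencil image and real glyph outlines.

```python
def parse_mask(rows):
    """Turn text-art into a grid of booleans: '#' marks a cell inside the silhouette."""
    return [[ch == "#" for ch in row] for row in rows]


def fits(free, x, y, w, h):
    """True if a w-by-h box at (x, y) lies entirely on free silhouette cells."""
    if y + h > len(free) or x + w > len(free[0]):
        return False
    return all(free[yy][xx] for yy in range(y, y + h) for xx in range(x, x + w))


def place(free, w, h):
    """First-fit scan: claim the first spot where the box fits, or return None."""
    for y in range(len(free)):
        for x in range(len(free[0])):
            if fits(free, x, y, w, h):
                for yy in range(y, y + h):
                    for xx in range(x, x + w):
                        free[yy][xx] = False  # mark the cells as used
                return (x, y)
    return None


# A tiny hypothetical stencil; a real one would come from an uploaded image.
stencil = parse_mask([
    "..##..",
    ".####.",
    "######",
    ".####.",
])
```

Calling place(stencil, 4, 2) claims a 4x2 box for the biggest word, and later calls fill whatever gaps are left. Scanning every cell like this is simple but slow, which is one reason a non-raster approach can render much faster.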


Facebook isn’t so evil

I feel a little bad that my Facebook story has been getting so much attention recently, since the postscripts are a lot less negative to the company. Bret Taylor did a good job of addressing the problem once it came to his attention, I've had employees at all levels contact me offering support, and they actually are doing a fantastic job liaising with the academic community, though they tend to keep that pretty quiet.

While the couple of months when I feared being bankrupted by the lawsuit were decidedly not fun (their lawyer had recently won $711 *million* from a spammer) the aftermath has been incredibly positive for me. If nothing else, it's given me a wonderful story to catch people's attention with when I want to explain my work!

Launching the Data Science Toolkit


I'm very pleased to announce the launch of the Data Science Toolkit. It's a collection of the most useful open source tools and data sets I've found, wrapped in an easy-to-use REST/JSON interface, and available for download as a turnkey virtual machine image.

Over the past few years I've discovered some amazing open-source tools, and built a few I'm pretty proud of myself, but they've always required a lot of effort from developers to use. Take Boilerpipe for example. It's by far the best approach I've found for extracting the main text from a news story or blog post, a vital first step for many data processing operations. But, if it's only available as a Java library, only other Java developers will be able to benefit from it. By wrapping it in a web server interface, and shipping it pre-installed on a VM, I'm hoping to get it into the hands of more developers.
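To make the wrapping idea concrete, here's a minimal sketch of exposing a text extractor over HTTP with a JSON response, using only the Python standard library. It is not the toolkit's actual code: extract_text is a crude hypothetical stand-in for a real extractor like Boilerpipe, and the request and response shapes are assumptions for illustration.

```python
import io
import json
import re


def extract_text(html):
    """Hypothetical stand-in for a real extractor such as Boilerpipe:
    drops script/style blocks, strips tags, and collapses whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def app(environ, start_response):
    """WSGI app: POST raw HTML as the body, get {"text": ...} back as JSON."""
    headers = [("Content-Type", "application/json")]
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed", headers)
        return [json.dumps({"error": "POST the HTML as the request body"}).encode("utf-8")]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    html = environ["wsgi.input"].read(length).decode("utf-8", errors="replace")
    start_response("200 OK", headers)
    return [json.dumps({"text": extract_text(html)}).encode("utf-8")]


def call_app(html):
    """Invoke the app in-process, the way a WSGI server would."""
    raw = html.encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    status = []
    chunks = app(environ, lambda s, h: status.append(s))
    return status[0], json.loads(b"".join(chunks).decode("utf-8"))
```

To serve it for real, wsgiref.simple_server.make_server("", 8080, app).serve_forever() is enough for local experiments. Once the library sits behind an interface like this, a developer in any language can use it with nothing more than an HTTP client.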

The same goes for other libraries like GeoIQ/Schuyler Erle's Geocoder, a wonderful way of locating any address in the US that previously required a multi-gigabyte download and many hours of data import, or my own Geodict with its hour-long database setup. Shipping what is essentially a specialized Ubuntu distribution removes those setup times, at the cost of a large (5GB) download.

Another benefit of this approach is the ability to run scalably. When all the data you're querying is on the local machine, it's possible to add capacity just by throwing more servers at the problem, without the bandwidth, latency or other limits on calling an external API becoming the bottleneck.
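As a rough illustration of throwing more servers at the problem, here's a small sketch of a client that spreads requests across several identical copies of the VM in round-robin order. The hostnames and the /lookup path are made up for the example, and a real deployment might well put a load balancer in front instead.

```python
import itertools


class ReplicaPool:
    """Hands out request URLs across identical servers in round-robin order."""

    def __init__(self, base_urls):
        urls = list(base_urls)
        if not urls:
            raise ValueError("need at least one server")
        self._cycle = itertools.cycle(urls)

    def url_for(self, path):
        """Return the full URL for `path` on the next server in the rotation."""
        base = next(self._cycle)
        return base.rstrip("/") + "/" + path.lstrip("/")


# Hypothetical hostnames for two local copies of the VM.
pool = ReplicaPool(["http://dstk-1.local", "http://dstk-2.local/"])
```

Because every copy holds all the data locally, adding capacity is just a matter of adding another entry to that list.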

Anyway, please try out the sandbox, check out the documentation, grab the VM or just start up an EC2 instance from the public AMI image ami-9e7d8ff7. These are early days, and there's already a pile of bugs, along with features and APIs that didn't make it into this version, but I'm excited to see how people use it. I'd also love to see folks jump in and start hacking on it. It's all completely open-source, so it's your project as much as mine!


A Radioactive Map of Japan

My friend Alasdair Allan has been busy analyzing the radiation figures released by the Japanese government, measuring levels around the country. He used OpenHeatMap to visualize his results, and here’s his reassuring conclusion.

The map embedded below shows the environmental radioactivity measurements with respect to the typical maximum values for that locale. From this visualisation it is very evident that the measured values throughout Japan are normal except in the immediate area surrounding the Fukushima reactors where levels are about double normal maximum levels.

Map: Environmental Radioactivity Measurement, Ratio with respect to typical Maximum Values

However it is also immediately evident that during the period of monitoring radiation levels surrounding the troubled facility in Fukushima did not change significantly over time.

Five Short Links

Photo by Phat Controller

Nokia: Culture will out – The story of the ‘joyless’ NFC vending machine experience is an all-too-common outcome when engineers dominate designers. A fair number of people were upset by my harsh take on Rackspace’s signup experience. I know my own tendency to engineer pain into the user workflow without even thinking about it, so I do tend to compensate by being hyper-alert to rough edges. One of the hardest but most educational things about Apple was how the design process was mostly about stripping out features and hard-wiring choices, all in the service of the sort of sublime user experience Nokia ignores.

The Lost Art of Pickpocketing – It’s strange to think that pickpockets are now a dying breed, largely done in by changes in society.

I, Cyborg – I was lucky enough to meet Aaron and Amber at Strata, and their work at Cyborganthropology.com is incredibly imaginative. Just talking to them made me realize I’m now living in the future.

The Open Data Manual – A short but insightful guide to the nuts and bolts of opening up data sets, from legal issues to publicity and meetups.

Get the Data – A Stackoverflow clone for data questions, with a small but growing community of users. We need something like this for the community, so I’m going to try to get more involved there too.