Create Beautiful Word Clouds with

Want to build striking word clouds around custom stencil images? Give a try. Here's what it offers:

Custom Shapes. Choose from a selection of templates, or upload your own images, and the word cloud will be moulded to fit within the silhouette.

Custom Fonts. You can use any font on your machine for the cloud.

High-resolution. Large-format images mean you can order great-looking versions of your visualization as shirts, invitations, mugs or art prints.

HTML5. Using Canvas for rendering means it will even work on iPads or iPhones, as well as on the current versions of all major browsers.

I created the OpenWordCloud renderer as an open-source jQuery plugin, and then built a web service around it. I actually built it at the start of the year, and have used it myself for projects like the Gaddafi speech visualization and for analyzing my favorite novels, but the Data Science Toolkit preparation didn't leave me enough time to do a proper launch. There are still some parts I'm unhappy with, especially the menu responsiveness, but I want to get it out there to get feedback.

If you like this sort of thing, I highly recommend looking at Jonathan Feinberg's awesome Wordle too. It's a wonderful project, and offers much faster rendering thanks to its use of Java and a non-raster algorithm for word placement. I just needed something a little different (custom shapes and fonts, iOS support) and wanted to get an open-source version out there for other hackers to build on.
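
For anyone curious what a raster approach to fitting words inside a silhouette can look like, here is a minimal sketch in Python. It is not the OpenWordCloud code: the character-grid mask, the `fits` and `place_words` names, and the greedy row-by-row scan are all assumptions made for illustration, and words are reduced to plain bounding boxes.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height, all in mask cells


def fits(mask: List[str], taken: List[List[bool]], x: int, y: int, w: int, h: int) -> bool:
    """True if a w-by-h box at (x, y) lies wholly inside the shape on free cells.

    Assumes a rectangular mask where '#' marks cells inside the silhouette.
    """
    if y + h > len(mask) or x + w > len(mask[0]):
        return False
    return all(
        mask[row][col] == "#" and not taken[row][col]
        for row in range(y, y + h)
        for col in range(x, x + w)
    )


def place_words(mask: List[str], sizes: List[Tuple[int, int]]) -> List[Optional[Box]]:
    """Greedily give each (width, height) box the first free spot, scanning row by row."""
    taken = [[False] * len(row) for row in mask]
    placed: List[Optional[Box]] = []
    for w, h in sizes:
        spot = next(
            (
                (x, y)
                for y in range(len(mask))
                for x in range(len(mask[0]))
                if fits(mask, taken, x, y, w, h)
            ),
            None,
        )
        if spot is None:
            placed.append(None)  # no room left inside the silhouette
            continue
        x, y = spot
        for row in range(y, y + h):
            for col in range(x, x + w):
                taken[row][col] = True  # reserve the cells for this word
        placed.append((x, y, w, h))
    return placed
```

Checking every cell for every candidate position is slow, which is one reason a raster scheme like this tends to render more slowly than a geometric one.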


Facebook isn’t so evil

I feel a little bad that my Facebook story has been getting so much attention recently, since the postscripts are a lot less negative about the company. Bret Taylor did a good job of addressing the problem once it came to his attention, I've had employees at all levels contact me offering support, and they actually are doing a fantastic job liaising with the academic community, though they tend to keep that pretty quiet.

While the couple of months when I feared being bankrupted by the lawsuit were decidedly not fun (their lawyer had recently won $711 *million* from a spammer) the aftermath has been incredibly positive for me. If nothing else, it's given me a wonderful story to catch people's attention with when I want to explain my work!

Launching the Data Science Toolkit


I'm very pleased to announce the launch of the Data Science Toolkit. It's a collection of the most useful open source tools and data sets I've found, wrapped in an easy-to-use REST/JSON interface, and available for download as a turnkey virtual machine image.

Over the past few years I've discovered some amazing open-source tools, and built a few I'm pretty proud of myself, but they've always required a lot of effort from developers to use. Take Boilerpipe for example. It's by far the best approach I've found for extracting the main text from a news story or blog post, a vital first step for many data processing operations. But, if it's only available as a Java library, only other Java developers will be able to benefit from it. By wrapping it in a web server interface, and shipping it pre-installed on a VM, I'm hoping to get it into the hands of more developers.
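
To make the wrapping idea concrete, here is a minimal sketch of putting a text-extraction function behind an HTTP/JSON interface using only the Python standard library. It is not how the Data Science Toolkit is implemented: `extract_main_text` is a hypothetical stand-in that just keeps the longest line (Boilerpipe does something far smarter), and the handler, the single POST route, and the `{"text": ...}` response shape are assumptions for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


def extract_main_text(html: str) -> str:
    """Hypothetical stand-in for a real extractor: keeps the longest line."""
    lines = [line.strip() for line in html.splitlines()]
    return max(lines, key=len, default="")


class ExtractHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the posted document and answer with a small JSON object.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        payload = json.dumps({"text": extract_main_text(body)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def start_server(port: int = 0) -> HTTPServer:
    """Serve on localhost in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), ExtractHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_extract(server: HTTPServer, html: str) -> str:
    """Client side: any language that can POST and parse JSON could do this."""
    host, port = server.server_address
    request = Request(f"http://{host}:{port}/", data=html.encode("utf-8"), method="POST")
    opener = build_opener(ProxyHandler({}))  # talk to localhost directly, no proxy
    with opener.open(request) as response:
        return json.loads(response.read().decode("utf-8"))["text"]
```

The point of the design is that the caller never needs to know what language the extractor is written in; it only needs to speak HTTP and JSON.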

The same goes for other libraries like GeoIQ/Schuyler Erle's Geocoder, which is a wonderful way of locating any address in the US but previously required a multi-gigabyte download and many hours of data import, or my own Geodict with its hour-long database setup. By shipping what is essentially a specialized Ubuntu distribution, those setup times are removed, at the cost of a large (5GB) download.

Another benefit of this approach is the ability to run scalably. When all the data you're querying is on the local machine, it's possible to add capacity just by throwing more servers at the problem, without the bandwidth, latency or other limits on calling an external API becoming the bottleneck.
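
As a rough sketch of what "throwing more servers at the problem" can mean in practice, the Python below spreads queries across several identical copies of a server in round-robin order. The server names and the `send` callable are hypothetical placeholders; a real `send` would make an HTTP request to the named machine.

```python
from itertools import cycle
from typing import Callable, Sequence


def make_dispatcher(
    servers: Sequence[str], send: Callable[[str, str], str]
) -> Callable[[str], str]:
    """Return a function that hands each query to the next server in turn.

    `send(server, query)` is supplied by the caller and does the actual work.
    """
    if not servers:
        raise ValueError("at least one server is required")
    ring = cycle(servers)  # endlessly repeats the server list in order

    def dispatch(query: str) -> str:
        return send(next(ring), query)

    return dispatch
```

Because every copy holds the same local data, adding capacity is just a matter of adding another name to the list.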

Anyway, please try out the sandbox, check out the documentation, grab the VM or just start up an EC2 instance from the public AMI image ami-9e7d8ff7. These are early days, and there's already a pile of bugs along with features and APIs that didn't make it into this version, but I'm excited to see how people use it. I'd also love to see folks jump in and start hacking on it. It's all completely open-source, so it's your project as much as mine!


A Radioactive Map of Japan

My friend Alasdair Allan has been busy analyzing the radiation figures released by the Japanese government, measuring levels around the country. He used OpenHeatMap to visualize his results, and here’s his reassuring conclusion.

The map embedded below shows the environmental radioactivity measurements with respect to the typical maximum values for that locale. From this visualisation it is very evident that the measured values throughout Japan are normal except in the immediate area surrounding the Fukushima reactors where levels are about double normal maximum levels.

Environmental Radioactivity Measurement,
Ratio with respect to typical Maximum Values

However it is also immediately evident that during the period of monitoring radiation levels surrounding the troubled facility in Fukushima did not change significantly over time.

Five Short Links

Photo by Phat Controller

Nokia: Culture will out – The story of the ‘joyless’ NFC vending machine experience is an all-too-common outcome when engineers dominate designers. A fair number of people were upset by my harsh take on Rackspace’s signup experience. I know my own tendency to engineer pain into the user workflow without even thinking about it, so I do tend to compensate by being hyper-alert to rough edges. One of the hardest but most educational things about Apple was how the design process was mostly about stripping out features and hard-wiring choices, all in the service of the sort of sublime user experience Nokia ignores.

The Lost Art of Pickpocketing – It’s strange to think that pickpockets are now a dying breed, largely done in by changes in society.

I, Cyborg – I was lucky enough to meet Aaron and Amber at Strata, and their work at is incredibly imaginative. Just talking to them made me realize I’m now living in the future.

The Open Data Manual – A short but insightful guide to the nuts and bolts of opening up data sets, from legal issues to publicity and meetups.

Get the Data – A Stack Overflow clone for data questions, with a small but growing community of users. We need something like this for the community, and I’m going to try to get more involved there too.

Why API Providers Hate You

Photo by Chris JD

Between the shutdown of Ubermedia and the company's announcement that it will not allow any new third-party clients, the once-passionate love affair between Twitter and third-party developers has finally descended into arguing about whose turn it is to do the washing up. I've seen this happen before.

In my past life at Apple I was responsible for some third-party developer support, and that was the worst part of my job. Not because of the external people I was dealing with, I loved them all and still stay in touch. That was actually the problem, because I often had to be a complete asshole to them. That bug-fix that would save you days, weeks or months of engineering and support time and would be a quarter-day job for us? No, not going to happen. That API you were relying on? We're removing that, with no equivalent available. Oh, and we liked your idea so much, we're bringing out our own version next month. I actually managed to smuggle in a few changes, which helped me keep a shred of my self-respect, but it was spirit-breaking work. Here's what it taught me.

Don't Believe the Hype. Everybody in the engineering team was a fan of what third-party developers were doing with our APIs, and wanted to encourage and help them however they could. Especially in the early days, we'd sit around with developers and get excited about what was possible, getting their hopes up. Unfortunately our enthusiasm was no match for relentless schedules and bug-fix prioritization, so we always had radically fewer resources to give them than we'd hoped for. When engineering hopes collide with business priorities, revenue wins. Looking back, I feel most guilty about the way I let my own enthusiasm lead third-party developers on, and I'd imagine the Twitter engineers feel something of that too.

It's Not Personal. When your dreams and livelihood are on the line, it's hard not to feel like it's the fault of the people you're dealing with. Why won't they listen, it's so obvious it makes sense? If you could just get somebody sensible on the phone, they'd understand. You can see this on the Twitter development list as people ask for Ryan to add another point of contact, presumably in the hope that the disliked changes are a misunderstanding. In fact, most external actions like this tend to be very deliberate expressions of corporate policy, even though the messages are often sugar-coated to minimize the damage. Whoever was in that position would be forced to do the same thing by their bosses, who in turn are under pressure from investors. External developers have almost no leverage in power struggles over corporate policies, and that doesn't change depending on who the front-man is.

Never Become a Sharecropper. Take a long, hard look at the power balance behind any long-term business relationship you enter into. Imagine you become wildly successful. How much of that success can you keep if you're tied to one platform, at the mercy of arbitrary changes in the terms of service? What's the track record of the provider?

Sometimes it makes sense – Microsoft were actually awesome to work with as a third-party developer, oftentimes to the detriment of the end consumer, who was forced to turn to external tools for things that really should have been in the OS. For ISVs this was great though: you knew Microsoft would generally avoid competing directly with you, and would at least be likely to offer to buy you out if they were going to tread on your turf.

Open Wins (Eventually). We go through cycles of reincarnation, as a technological change sweeps the landscape and leaves a few companies with seemingly-unassailable strangle-holds on the market. The beauty of computing is that we keep adding new layers of indirection, and sooner or later we route around the obstruction. Microsoft looked unbeatable; now I'd bet the majority of Windows installations are on virtual machines and apps are increasingly running in a browser. Nobody beat them at their game, we all just moved to a new playing field. Facebook is the current king, they've won the social network contest convincingly. But unless they anticipate everything that people want in the future, and deliver it all, there will be applications flourishing outside of their walled garden, and that's where the future will come from.

If you keep that in mind as a developer, you'll avoid tempting dead-ends that offer great initial distribution, but don't give you any real control over the data or platform. Open source projects can be a massive pain in the arse, they're messy, incomplete and often have trouble reaching users, but they're the only foundation you can build anything lasting on.