Create Beautiful Word Clouds with

Want to build striking word clouds around custom stencil images? Give a try. Here's what it offers:

Custom Shapes. Choose from a selection of templates, or upload your own images, and the word cloud will be moulded to fit within the silhouette.

Custom Fonts. You can use any font on your machine for the cloud.

High-resolution. Large-format images mean you can order great-looking versions of your visualization as shirts, invitations, mugs or art prints.

HTML5. Using Canvas for rendering means it will even work on iPads or iPhones, as well as on the current versions of all major browsers.
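
To give a feel for how a cloud can be moulded to a silhouette, here's a minimal sketch in Python. It is not the OpenWordCloud code. It assumes the stencil has already been reduced to a grid of booleans (True where a word is allowed to go) and treats every word as a plain rectangle. The names fits and place, and the tiny stencil, are made up for illustration.

```python
from typing import List, Optional, Tuple

Mask = List[List[bool]]  # True = cell is inside the silhouette and still free


def fits(mask: Mask, top: int, left: int, height: int, width: int) -> bool:
    """Return True if a height x width box at (top, left) covers only free cells."""
    if top < 0 or left < 0:
        return False
    if top + height > len(mask) or left + width > len(mask[0]):
        return False
    return all(
        mask[row][col]
        for row in range(top, top + height)
        for col in range(left, left + width)
    )


def place(mask: Mask, height: int, width: int) -> Optional[Tuple[int, int]]:
    """Scan the grid for the first free spot, claim it, and return its (top, left)."""
    for top in range(len(mask) - height + 1):
        for left in range(len(mask[0]) - width + 1):
            if fits(mask, top, left, height, width):
                for row in range(top, top + height):
                    for col in range(left, left + width):
                        mask[row][col] = False  # later words can't overlap this one
                return top, left
    return None  # no room left inside the shape


# A hypothetical 4x6 stencil: '#' is inside the shape, '.' is outside.
stencil = [
    "..##..",
    ".####.",
    ".####.",
    "..##..",
]
mask = [[ch == "#" for ch in row] for row in stencil]

first = place(mask, 2, 4)   # a big word, 2 cells tall and 4 wide
second = place(mask, 1, 2)  # a smaller word
third = place(mask, 2, 4)   # a second big word, which no longer fits
```

A real renderer would work at pixel resolution, test the actual glyph outlines rather than bounding boxes, and try the biggest words first so they get the best spots, but the fit-then-claim loop is the core of the idea.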

I created the OpenWordCloud renderer as an open-source jQuery plugin, and then built a web service around it. I actually built it at the start of the year, and have used it myself for projects like the Gaddafi speech visualization and for analyzing my favorite novels, but the Data Science Toolkit preparation didn't leave me enough time to do a proper launch. There are still some parts I'm unhappy with, especially the menu responsiveness, but I want to get it out there to get feedback.

If you like this sort of thing, I highly recommend looking at Jonathan Feinberg's awesome Wordle too. It's a wonderful project, and offers much faster rendering thanks to its use of Java and a non-raster algorithm for word placement. I just needed something a little different (custom shapes and fonts, iOS support) and wanted to get an open-source version out there for other hackers to build on.


Facebook isn’t so evil

I feel a little bad that my Facebook story has been getting so much attention recently, since the postscripts are a lot less negative to the company. Bret Taylor did a good job of addressing the problem once it came to his attention, I've had employees at all levels contact me offering support, and they actually are doing a fantastic job liaising with the academic community, though they tend to keep that pretty quiet.

While the couple of months when I feared being bankrupted by the lawsuit were decidedly not fun (their lawyer had recently won $711 *million* from a spammer) the aftermath has been incredibly positive for me. If nothing else, it's given me a wonderful story to catch people's attention with when I want to explain my work!

Launching the Data Science Toolkit


I'm very pleased to announce the launch of the Data Science Toolkit. It's a collection of the most useful open source tools and data sets I've found, wrapped in an easy-to-use REST/JSON interface, and available for download as a turnkey virtual machine image.

Over the past few years I've discovered some amazing open-source tools, and built a few I'm pretty proud of myself, but they've always required a lot of effort from developers to use. Take Boilerpipe for example. It's by far the best approach I've found for extracting the main text from a news story or blog post, a vital first step for many data processing operations. But, if it's only available as a Java library, only other Java developers will be able to benefit from it. By wrapping it in a web server interface, and shipping it pre-installed on a VM, I'm hoping to get it into the hands of more developers.
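
As a rough illustration of what wrapping a library in a web server interface means, here's a minimal Python sketch using only the standard library. It is not the Data Science Toolkit's code or API: extract_main_text is a deliberately crude stand-in for a real extractor like Boilerpipe, and the JSON field names ("html" and "text"), the handler and the port are all assumptions chosen for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_main_text(html: str) -> str:
    """Stand-in for a real extractor such as Boilerpipe: keep the longest line."""
    lines = [line.strip() for line in html.splitlines()]
    return max(lines, key=len, default="")


def handle_json(body: bytes) -> bytes:
    """Turn a JSON request {"html": ...} into a JSON response {"text": ...}."""
    request = json.loads(body.decode("utf-8"))
    result = {"text": extract_main_text(request["html"])}
    return json.dumps(result).encode("utf-8")


class ExtractHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint: POST JSON to any path, get JSON back."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_json(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but don't start) the server; call .serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), ExtractHandler)


# Exercise the request/response logic directly, without opening a socket.
sample = json.dumps(
    {"html": "Menu\nA long paragraph with the actual story.\nFooter"}
).encode("utf-8")
response = json.loads(handle_json(sample))
```

Calling make_server().serve_forever() would expose the same logic over HTTP, at which point a developer in any language can use it with nothing more than an HTTP client and a JSON parser.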

The same goes for other libraries like GeoIQ/Schuyler Erle's Geocoder, a wonderful way of locating any address in the US that previously required a multi-gigabyte download and many hours of data import, or my own Geodict with its hour-long database setup. By shipping what is essentially a specialized Ubuntu distribution, those setup times are removed, at the cost of a large (5GB) download.

Another benefit of this approach is the ability to run scalably. When all the data you're querying is on the local machine, it's possible to add capacity just by throwing more servers at the problem, without the bandwidth, latency or other limits on calling an external API becoming the bottleneck.

Anyway, please try out the sandbox, check out the documentation, grab the VM or just start up an EC2 instance from the public AMI image ami-9e7d8ff7. This is early days, and there's already a pile of bugs along with features and APIs that didn't make it into this version, but I'm excited to see how people use it. I'd also love to see folks jump in and start hacking on it. It's all completely open-source, so it's your project as much as mine!


A Radioactive Map of Japan

My friend Alasdair Allan has been busy analyzing the radiation figures released by the Japanese government, measuring levels around the country. He used OpenHeatMap to visualize his results, and here’s his reassuring conclusion.

The map embedded below shows the environmental radioactivity measurements with respect to the typical maximum values for that locale. From this visualisation it is very evident that the measured values throughout Japan are normal except in the immediate area surrounding the Fukushima reactors where levels are about double normal maximum levels.

Environmental Radioactivity Measurement,
Ratio with respect to typical Maximum Values

However it is also immediately evident that during the period of monitoring radiation levels surrounding the troubled facility in Fukushima did not change significantly over time.

Five Short Links

Photo by Phat Controller

Nokia: Culture will out – The story of the ‘joyless’ NFC vending machine experience is an all-too-common outcome when engineers dominate designers. A fair number of people were upset by my harsh take on Rackspace’s signup experience. I know my own tendency to engineer pain into the user workflow without even thinking about it, so I do tend to compensate by being hyper-alert to rough edges. One of the hardest but most educational things about Apple was how the design process was mostly about stripping out features and hard-wiring choices, all in the service of the sort of sublime user experience Nokia ignores.

The Lost Art of Pickpocketing – It’s strange to think that pickpockets are now a dying breed, largely done in by changes in society.

I, Cyborg – I was lucky enough to meet Aaron and Amber at Strata, and their work at is incredibly imaginative. Just talking to them made me realize I’m now living in the future.

The Open Data Manual – A short but insightful guide to the nuts and bolts of opening up data sets, from legal issues to publicity and meetups.

Get the Data – A Stack Overflow clone for data questions, with a small but growing community of users. We need something like this for the community, and I’m going to try to get more involved there too.

Why API Providers Hate You

Photo by Chris JD

Between the shutdown of Ubermedia and the company's announcement that it will not allow any new third-party clients, the once-passionate love affair between Twitter and third-party developers has finally descended into arguing about whose turn it is to do the washing up. I've seen this happen before.

In my past life at Apple I was responsible for some third-party developer support, and that was the worst part of my job. Not because of the external people I was dealing with, I loved them all and still stay in touch. That was actually the problem, because I often had to be a complete asshole to them. That bug-fix that would save you days, weeks or months of engineering and support time and would be a quarter-day job for us? No, not going to happen. That API you were relying on? We're removing that, with no equivalent available. Oh, and we liked your idea so much, we're bringing out our own version next month. I actually managed to smuggle in a few changes, which helped me keep a shred of my self-respect, but it was spirit-breaking work. Here's what it taught me.

Don't Believe the Hype. Everybody in the engineering team was a fan of what third-party developers were doing with our APIs, and wanted to encourage and help them however they could. Especially in the early days, we'd sit around with developers and get excited about what was possible, getting their hopes up. Unfortunately our enthusiasm was no match for relentless schedules and bug-fix prioritization, so we always had radically fewer resources to give them than we'd hoped for. When engineering hopes collide with business priorities, revenue wins. Looking back, I feel most guilty about the way I let my own enthusiasm lead third-party developers on, and I'd imagine the Twitter engineers feel something of that too.

It's Not Personal. When your dreams and livelihood are on the line, it's hard not to feel like it's the fault of the people you're dealing with. Why won't they listen, it's so obvious it makes sense? If you could just get somebody sensible on the phone, they'd understand. You can see this on the Twitter development list as people ask for Ryan to add another point of contact, presumably in the hope that the disliked changes are a misunderstanding. In fact, most external actions like this tend to be very deliberate expressions of corporate policy, even though the messages are often sugar-coated to minimize the damage. Whoever was in that position would be forced to do the same thing by their bosses, who in turn are under pressure from investors. External developers have almost no leverage in power struggles over corporate policies, and that doesn't change depending on who the front-man is.

Never Become a Sharecropper. Take a long, hard look at the power balance behind any long-term business relationship you enter into. Imagine you become wildly successful. How much of that success can you keep if you're tied to one platform, at the mercy of arbitrary changes in the terms of service? What's the track record of the provider?

Sometimes it makes sense – Microsoft were actually awesome to work with as a third-party developer, often-times to the detriment of the end consumer who was forced to turn to external tools for things that really should have been in the OS. For ISVs this was great though, you knew Microsoft would generally avoid competing directly with you, and would likely offer to buy you out at least if they were going to tread on your turf.

Open Wins (Eventually). We go through cycles of reincarnation, as a technological change sweeps the landscape and leaves a few companies with seemingly-unassailable strangle-holds on the market. The beauty of computing is that we keep adding new layers of indirection, and sooner or later we route around the obstruction. Microsoft looked unbeatable, now I'd bet the majority of Windows installations are on virtual machines and apps are increasingly running in a browser. Nobody beat them at their game, we all just moved to a new playing field. Facebook is the current king, they've won the social network contest convincingly. But unless they anticipate everything that people want in the future, and deliver it all, there will be applications flourishing outside of their walled garden, and that's where the future will come from.

If you keep that in mind as a developer, you'll avoid tempting dead-ends that offer great initial distribution, but don't give you any real control over the data or platform. Open source projects can be a massive pain in the arse, they're messy, incomplete and often have trouble reaching users, but they're the only foundation you can build anything lasting on.

Get a job with Ushahidi

Photo by Brenda Gottssabend

I've been volunteering on the Ushahidi Swift pipeline, and it's been fascinating. The goal is to turn massive raw streams of tweets, text messages and news stories into useful information, something that's especially relevant looking at all of the recent turmoil around the world. Pulling data from unstructured text is one of the toughest and most interesting problems around, and Swift is a great platform to learn and test new approaches.

So I was pleased to hear from Jon Gosier that they're looking to hire some more engineers. The pay's not great compared to West Coast standards, but they're open to hiring globally, so it makes more sense for people in an area with a lower cost of living. They're also happy to accommodate side projects and offer a lot of flexibility as long as the work gets done. If you're looking for a job on a great team working on important problems, this could be a good chance to gain experience in some very useful areas. Here's the full job posting.


Ushahidi is currently seeking to hire individuals in the following full-time and contract positions: Sr. Web Application Developer, Online Ethnographer/Behaviorist, Computational Linguistics Expert

Sr. Web Application Developer (Python/PHP)

Experience Requirements: At least 4 years professional experience in PHP/XHTML/MySQL/CSS building web applications. This position is minimum full-time for 12 months. Developers with a background in Design, experience with Ruby (Rails), Python (Django) and PHP Frameworks are definitely preferred but all candidates are welcome to apply.

Location: Anywhere, Global

Salary: $60k per year, U.S. dollars. 80% full-time commitment expected although candidates are welcome to maintain side-projects so long as they don’t affect primary deliverables and deadlines.

Online Ethnographer/Behaviorist

Experience Requirements: PHD or PHD-Candidate level with a background in the qualitative study of network dynamics and ethnography of online communities. Position will require deep analysis of dynamics in online communities, and work alongside computer science teams to assist in the development of applications and algorithms based upon their research.

This position is minimum full-time for 12 months.

Location: Anywhere, Global

Salary: $60k per year, U.S. dollars. 80% full-time commitment expected although candidates are welcome to maintain side-projects so long as they don’t affect primary deliverables and deadlines.

Computational Linguistics Expert (Python)

Experience Requirements: At least 5 years professional experience in the development of computational linguistic algorithms using Python. Applicant would supervise the development of open-source semantic technologies, with an emphasis on modularity and scalability. This position is contract.

Location: Anywhere, Global

Salary: Contract. Negotiable.

To apply email cover letter and CV's to


Why signing up with Rackspace was a disappointing experience


[Update – I wrote this at the end of a long and frustrating day, and got the tone completely wrong, I was way too grumpy. Normally I wouldn't revise, but since it ended up on HN, I'll add this note. I'm still living in awe that I can rent a hundred-machine cluster for $10 an hour, for me that's better than a jet-pack]

Today I needed to create a Rackspace account, so I could help out a non-profit I'm supporting. I've long been intrigued by the idea of a cloud host with support as a key feature, especially after the poor experience I had when my Amazon load balancer died. That left me glad to have an excuse to try out Rackspace, but after going through signup I was left distinctly unimpressed. There wasn't anything major, but there were several aspects to the process that were jarring.


They require the password to have a mixture of upper and lower-case letters and numbers. This is a cop-out, they should be testing for general password strength rather than inflicting arbitrary rules like this. It's not a massive coding task, and it's also odd to see them passing the plain text passwords back to the server for checking, rather than doing it client side. The page is https (though with warnings on Chrome about untrusted content) so it's not a major flaw, but it's poor fit-and-finish. It did fill me with fear that they might be storing my password as plain text though, especially when I discovered they send root passwords for your boxes through email!
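
To show what testing for general strength rather than composition rules could look like, here's a minimal Python sketch. It's a crude estimate that multiplies the password length by the bits per character implied by the character classes in use. The 60-bit cut-off and the function names are my own assumptions, and a serious checker would also look for dictionary words and common patterns. The same arithmetic would run just as happily client side in JavaScript.

```python
import math
import string


def estimate_bits(password: str) -> float:
    """Crude strength estimate: length times log2 of the character pool in use."""
    pool = 0
    if any(ch in string.ascii_lowercase for ch in password):
        pool += 26
    if any(ch in string.ascii_uppercase for ch in password):
        pool += 26
    if any(ch in string.digits for ch in password):
        pool += 10
    if any(ch not in string.ascii_letters + string.digits for ch in password):
        pool += 33  # printable ASCII punctuation plus space
    if pool == 0:
        return 0.0
    return len(password) * math.log2(pool)


def is_strong_enough(password: str, minimum_bits: float = 60.0) -> bool:
    """The 60-bit cut-off is an arbitrary assumption, not a standard."""
    return estimate_bits(password) >= minimum_bits


rule_follower = "Passw0rd"  # mixed case plus a digit, but short
passphrase = "correct horse battery staple"  # all lower-case, but long

rule_follower_ok = is_strong_enough(rule_follower)
passphrase_ok = is_strong_enough(passphrase)
```

Under this estimate the short password that satisfies the mixed-case-and-digit rule falls below the cut-off, while the long all-lower-case passphrase that the rule would reject clears it easily.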


You also have to pick a product on the account creation page. This is odd, but what was worse was that it's at the top of the page and gets unchecked when your password fails their rules. This meant I still couldn't submit the form after I'd changed the password to comply, and there was no feedback as to why the submission was failing. Again, a minor but irritating missing detail.


After all that, one of their employees has to give me a phone call before I can start creating servers. It felt very old-fashioned, and meant that the self-serve aspect was gone. Amazon does something similar, but has it all automated, giving me a pin I can put in, a much smoother experience. I had to stop progress on exploring the service and wait for a call, and of course when it came I was in a meeting, so I couldn't answer it. Later in the day, I called back the number they emailed me, got the general customer service line, had to sit through two repetitions of options, none of which appeared to apply, until I finally got a real person. He took some details, and had to forward me to an on-boarding specialist. I waited for a few minutes on hold, and then he had me confirm some basic details about my credit card number and address.

I was waiting for the reason they did this as a call with a human. I wasn't even calling from the same number I'd given them, so caller ID wasn't giving them additional confirmation. I thought they'd try to wow me with a sales pitch at least, or sweep me off my feet with an insistence on answering any questions I might have, but it was just a polite data-entry conversation.
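
For comparison, the core of an automated PIN check is tiny. Here's a minimal Python sketch of the idea: show a random PIN on the signup page, have an automated call ask for it, and compare what gets keyed in. I don't know how Amazon actually implements theirs, so the function names and the four-digit length are assumptions, and the telephony side (Twilio or similar) is left out entirely.

```python
import hmac
import secrets


def new_pin(digits: int = 4) -> str:
    """Generate a random numeric PIN, zero-padded to a fixed length."""
    return f"{secrets.randbelow(10 ** digits):0{digits}d}"


def pin_matches(expected: str, keyed_in: str) -> bool:
    """Compare in constant time so timing doesn't leak how close a guess was."""
    return hmac.compare_digest(expected.encode(), keyed_in.encode())


pin = new_pin()             # shown on the signup page
ok = pin_matches(pin, pin)  # the caller keys in the same digits on their phone
```

A real system would also expire the PIN after a few minutes and limit the number of attempts.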

Shared Server Images

This one's more of a missing feature than a user experience problem, but it contributed to my grumpiness. My goal with the non-profit is to set them up with a server containing some useful software, a mixture of open source tools built by me and others. The installation process for all of these packages is a multi-page document, taking a fair amount of time. On Amazon, I'm able to do this once and build a shared AMI, making it public so anyone can use the system. It looks like on Rackspace I'll have to go through the installation process for every account that wants to use the system, since they only support server images within a single account. It's clear from comments on that announcement that I'm not the only one looking for this, but there's still no ETA on the feature.

I'm really hoping I'll see the advantages of Rackspace's approach now I'm signed up. If it was a phone company or bank, I'd just sigh and move on, but I had higher hopes for a Zappos-like experience. My first three complaints could all be solved with some fairly simple tweaks to their processes, a bit of JavaScript and some Twilio. As it stands, though Amazon's not perfect, their sign-up process actually gave me a much better user experience.

Five Short Links

Photo by Cat O

Egypt Influence Network – A network graph of Twitter users that actually tells a story. I do wish there was more discussion of the method and data used to generate it though. We all need reproducibility and peer review to strengthen our arguments.

CKAN – the Data Hub – A rich curated collection of data sets and APIs, with hopes to become the CPAN for information.

Weatherspark – Very effective interactive presentation of weather data. I especially like the faded background to the temperature graph subtly showing the mean, min and max values over time.

Spreadsheet Scraper – Simple but nice Chrome extension for pulling tabular data from web pages. Increasingly the client side is the only place to access third-party information, as robots.txt and API terms-of-service become more and more restrictive.

NoSQL @ NetFlix – It’s very helpful that Sid has done so much work explaining NetFlix’s experiences porting applications to NoSQL. It’s often hard to convince companies with existing traditional infrastructures that switching is an option worth considering, and these sorts of case studies really help.