Add humans to your data pipeline

I was lucky enough to meet Chris Van Pelt of Crowdflower tonight, and it was fascinating to hear about some of the new developments bubbling away at the company. I'm a longtime fan: they add a lot of value beyond what you get from more basic crowd-sourcing services like Mechanical Turk, but I've always seen them as only an incremental improvement on their competitors. What Chris talked me through over beers felt like a true step forward, though.

We started by chatting about their Real Time Foto Moderation tool. This is basically a penis removal tool for photo uploads; you feed in a stream of images and after a short delay you get back flagged results showing which were accepted according to the sort of criteria used by Apple's App Store for content. I was fascinated to hear about some of the rules – bare-chested guys are fine if they're outdoors, but not if they're inside!

This may not sound that revolutionary, but think about what this means. Your application code is calling an API and getting results back, but behind the curtain is a workforce of humans! Chris likes to call this an RPC, a Remote Person Call. I'm not aware of any other service that allows this kind of unsupervised interaction. Crowd-sourcing has always been much more of a batch process, with manual transfers of inputs and outputs between the human and automated stages.
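To make the Remote Person Call idea a bit more concrete, here's a minimal Python sketch of what the calling side might look like. None of this is Crowdflower's actual API: `RemotePersonCall`, `human_review`, and the filename-based rule are all names I've made up for illustration, and a thread pool with a tiny sleep stands in for the human workforce and its short delay.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def human_review(image_url):
    """Hypothetical stand-in for a person judging one image."""
    time.sleep(0.01)  # simulates the short delay while someone looks at it
    # Made-up rule for the demo: reject anything with "nsfw" in the name.
    return {"image": image_url, "accepted": "nsfw" not in image_url}


class RemotePersonCall:
    """Looks like any other async API call; the worker happens to be human."""

    def __init__(self, worker, max_workers=4):
        self._worker = worker
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, image_url):
        # Returns a Future immediately; the answer arrives later.
        return self._pool.submit(self._worker, image_url)


def moderate(rpc, image_urls):
    """Feed in a stream of images and collect flagged results in order."""
    futures = [rpc.submit(url) for url in image_urls]
    return [f.result(timeout=5) for f in futures]


rpc = RemotePersonCall(human_review)
results = moderate(rpc, ["beach.jpg", "nsfw_party.jpg"])
for r in results:
    print(r["image"], "accepted" if r["accepted"] else "rejected")
```

The point of the sketch is that `moderate` never needs to know who or what is doing the judging; it just submits work and waits for results.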

This is important because it turns human tasks into modules that can be flexibly inserted into your data pipeline just by signing up on the web site and installing a Ruby gem. This changes crowd-sourcing from a cumbersome custom process that you have to extensively plan up-front into something you can experiment with just like you would any other API. You can build prototypes in a few minutes, test ideas, benchmark against other solutions, and start shipping code much faster.

Chris is free to experiment on the other side of the abstraction layer too. He might partially or completely automate the process and applications would never need to know, as long as the quality of results is consistent. Human-driven versions are likely to be more expensive than computational ones, and the price people are willing to pay for particular services will be a strong signal of which ones are worth sinking developer time into.
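Here's a rough Python sketch of that abstraction layer, under some loud assumptions: `ModerationAPI`, `human_backend`, `automated_backend`, and `agreement` are hypothetical names of my own, text captions stand in for real images, and the 99% threshold is arbitrary. It shows a provider swapping in an automated backend only after checking that it agrees with the human one on a sample.

```python
from typing import Callable, Iterable

Backend = Callable[[str], bool]


def human_backend(caption: str) -> bool:
    """Pretend a person judged the image (made-up rule for the demo)."""
    return "shirtless indoors" not in caption


def automated_backend(caption: str) -> bool:
    """Pretend a classifier judged it, using a different implementation."""
    words = caption.split()
    return not ("shirtless" in words and "indoors" in words)


class ModerationAPI:
    """What the application sees; the backend is hidden behind it."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def is_accepted(self, caption: str) -> bool:
        return self._backend(caption)


def agreement(a: Backend, b: Backend, samples: Iterable[str]) -> float:
    """Fraction of samples on which the two backends give the same answer."""
    samples = list(samples)
    matches = sum(a(s) == b(s) for s in samples)
    return matches / len(samples)


samples = ["shirtless indoors", "shirtless outdoors", "cat indoors"]

api = ModerationAPI(human_backend)
before = [api.is_accepted(s) for s in samples]

# Only switch to automation if quality stays consistent on the sample.
if agreement(human_backend, automated_backend, samples) >= 0.99:
    api = ModerationAPI(automated_backend)

after = [api.is_accepted(s) for s in samples]
print(before == after)
```

The application keeps calling `is_accepted` the whole time, which is the sense in which it would never need to know about the swap.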

There are a lot of hard problems that benefit from a human in the loop, from sentiment analysis to transcription, and I'd love to have a library of APIs for all of them that I could drop into my data pipeline as I'm working on new features. Crowdflower is starting to make this possible, so I'll be excited to follow their progress as they roll out more services. If you have an AI-hard problem that's driving you crazy, they might have a solution that lets you pretend we've solved AI!
