Five short links

Photo by Doug88888

Stanford’s Wrangler – A promising data-wrangling tool, with a lot of the interactive workflow that I think is crucial.

Open Knowledge Conference – They’ve gathered an astonishing selection of speakers. I’m really hoping I can make it out to Berlin to join them.

The Privacy Challenge in Online Prize Contests – It’s good to see my friend Arvind getting his voice heard in the debate around privacy.

The Profile Engine – A site that indexes Facebook profiles and pages, with their permission.

Acunu – I met up with this team in London, and they’re doing some amazing work at the kernel level to speed up distributed key/value stores, thanks to some innovative data structures.

Kindle Profiles are so close to being wonderful

"Propose to an Englishman any … instrument, however admirable, and you will observe that the whole effort of the English mind is directed to find a difficulty, defect, or an impossibility in it. If you speak to him of a machine for peeling a potato, he will pronounce it impossible; if you peel a potato with it before his eyes, he will declare it useless, because it will not slice a pineapple"

I'd completely forgotten about this deliciously bitter quote from Charles Babbage in The Philosophical Breakfast Club, but thanks to Amazon's Kindle profiles site, I re-discovered it listed in my highlights. I was very excited when I stumbled across this social feature, since I've been looking for an automatic way to share my reading list with friends. I've even experimented with scripts to scrape the reading history from my account, but never got anything complete enough to use. My dream is a simple blog widget showing what I'm reading, without the maintenance involved in updating GoodReads with my status. I'm often reading on a plane or in bed at night, so the only way I'll have something up to date is if it pulls information directly from my Kindle. The highlights page looked like exactly what I was after: a chronological list of notes and the books I'd been reading recently:


Now all I needed to do was figure out how to make that page public. First, I had to go through all 160 books and manually mark two check boxes next to each of them, making the book public and then making my notes on it available. That was a bit of a grind (and something I guess I'll need to do for every book as I read it), but worth it if I could easily publish my highlights. After that, though, I realized there was nothing like a 'blog' page for my notes that anyone else could see. The closest is this one for my public notes:–Warden/11996/public_notes

It just has covers for the five books whose state I most recently altered, whether or not they have any notes or highlights, and you have to click through to find any actual notes. The "Your Highlights" section that only I can access is perfect: exactly what I would like to share with people, and its simplicity is beautiful. Short of posting my account name and password here, does anyone have any thoughts on how I could get it out there? Anybody at Amazon I can beg?

Facebook and Twitter logins aren’t enough

Photo by Karen Horton

A couple of months ago I claimed "These days it doesn't make much sense to build a consumer site with its own private account system" and released a Ruby template that showed how to rely on just Facebook and Twitter for logins. It turns out I was wrong! I always knew there would be some markets that didn't have enough adoption of those two services, but thought that the tide of history would make them less and less relevant. What I hadn't counted on was kids.

My Wordlings custom word cloud service has seen a lot of interest from teachers who want to use it with their students, but especially amongst pre-teens, there's little chance they're on either Facebook or Twitter. They may not even have an email address to use! Since that's not likely to change, I added a new "Sign in for Kids" option that requires just a name, skipping even a password. It has the disadvantage that once you log out, you can't edit any of your creations, but that seems a small price to pay to make the service more accessible.
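To make the trade-off concrete, here's a minimal sketch of how a password-less flow like this can work. Everything here is illustrative, not Wordlings' actual code: the idea is that a random token in the session cookie is the only credential, which is exactly why losing the cookie (by logging out) means losing edit access to your creations.

```python
import secrets

# Toy in-memory account store; a real service would persist this.
accounts = {}

def sign_in_for_kids(name):
    # Create an account keyed by an unguessable random token. The token
    # goes into the session cookie and is the sole proof of ownership --
    # no password, no email address required.
    token = secrets.token_urlsafe(16)
    accounts[token] = {"name": name, "creations": []}
    return token

def can_edit(token):
    # Whoever presents the token can edit; once the cookie is gone,
    # nothing ties the user back to the account.
    return token in accounts
```

The design choice is deliberate: for an audience without email addresses, a recoverable account isn't possible anyway, so the simplest credential is the session itself.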

Using Hadoop with external API calls

Photo by Joe Penniston

I've been helping a friend who has a startup which relies on processing large amounts of data. He's using Hadoop for the calculation portions of his pipeline, but has a home-brewed system of queues and servers for handling other parts like web crawling and calls to external API providers. I've been advising him to switch almost all of his pipeline to run as streaming jobs within Hadoop, but since there's not much out there on using it for that sort of problem, it's worth covering why I've found it makes sense and what you have to watch out for.

If you have a traditional "Run through a list of items and transform them" job, you can write that as a streaming job with a map step that calls the API or does another high-latency operation, and then use a pass-through reduce stage.
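That pattern can be sketched in a few lines. This is a hedged example, not production code: fetch_metadata() is a stand-in for whatever high-latency API call or web fetch your job needs, faked here so the sketch runs without network access, and the reducer just passes records through unchanged.

```python
def fetch_metadata(url):
    # Stand-in for the real external API call or page fetch.
    return {"url": url, "status": "fetched"}

def run_mapper(lines):
    # Hadoop Streaming feeds one record per line on stdin; emit
    # tab-separated key/value pairs on stdout.
    for line in lines:
        url = line.strip()
        if not url:
            continue
        result = fetch_metadata(url)
        yield "%s\t%s" % (url, result["status"])

def run_reducer(lines):
    # Pass-through reduce stage: reproduce each key/value pair as-is.
    for line in lines:
        yield line.rstrip("\n")

# In the real job you'd wire these to the streams Hadoop provides, e.g.:
#   import sys
#   for out in run_mapper(sys.stdin):
#       print(out)
```

You'd then launch it with the standard hadoop-streaming jar, pointing -mapper and -reducer at these scripts.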

The key advantage is management. There's a rich ecosystem of tools like ZooKeeper, MRJob, ClusterChef and Cascading that let you define, run and debug complex Hadoop jobs. It's actually comparatively easy to build your own custom system to execute data-processing operations, but in reality you'll spend most of your engineering time maintaining and debugging your pipeline. Having tools available to make that side of it more efficient lets you build new features much faster, and spend much more time on the product and business logic instead of the plumbing. It will also help as you hire new engineers, as they may well be familiar with Hadoop already.

The stumbling block for many people when they think about running a web crawler or external API access as a MapReduce job is the picture of an army of servers hitting the external world far too frequently. In practice, you can mostly avoid this by using a single-machine cluster tuned to run one job at a time, which serializes access to the resource you're concerned about. If you need finer control, a pattern I've often seen is a gatekeeper server that all access to a particular API has to go through. The MapReduce scripts then call that server instead of going directly to the third party's endpoint, so the gatekeeper can throttle the request rate to stay within limits, back off when there are 5xx errors, and so on.
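The core of such a gatekeeper can be quite small. Here's a minimal sketch, with illustrative names and limits: it spaces out calls by a minimum interval and retries with exponential back-off on 5xx responses. The clock and sleep functions are injectable so the logic is testable; a real gatekeeper would sit behind an HTTP front-end that the mapper tasks call.

```python
import time

class Gatekeeper:
    """Serializes and rate-limits access to one external API."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # minimum seconds between calls
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def request(self, do_call, max_retries=3):
        # do_call() performs one upstream request and returns (status, body).
        attempt = 0
        while True:
            self._throttle()
            status, body = do_call()
            if 500 <= status < 600 and attempt < max_retries:
                # Server-side error: back off exponentially and retry.
                attempt += 1
                self.sleep(self.min_interval * (2 ** attempt))
                continue
            return status, body

    def _throttle(self):
        # Enforce the minimum spacing between consecutive upstream calls.
        now = self.clock()
        if self.last_call is not None:
            wait = self.min_interval - (now - self.last_call)
            if wait > 0:
                self.sleep(wait)
        self.last_call = self.clock()
```

Because every mapper funnels through one instance of this, the third party sees a steady, polite stream of requests no matter how many Hadoop tasks are running.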

So, if you're building a new data pipeline or trying to refactor an existing one, take a good look at Hadoop. It almost certainly won't be as snug a fit as your custom code (it's like using Lego bricks instead of hand-carving), but I bet it will be faster and easier to build your product with. I'll be interested to hear from anyone who has other opinions or suggestions too, of course!