Five short links

Photo by Linda Cronin

Probabilistic data structures for web analytics and data mining – A lot of the time we're processing massive amounts of data and producing very detailed intermediate results, only to throw almost all that detail away because all we want is a much smaller summary of the data's properties. I've got a lot of mileage out of approaches like this that cut out the middleman and produce much more manageable intermediates by throwing away parts of the data early when they are unlikely to be significant.
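
As a rough illustration of the idea, here is a minimal sketch of one such structure, a k-minimum-values cardinality estimator. It is not taken from the linked article; the class name `KMVSketch`, the choice of SHA-1, and the default `k` are all assumptions made for the example. It estimates how many distinct items a stream contains while keeping only the k smallest hash values and discarding everything else as it goes.

```python
import hashlib
import heapq


class KMVSketch:
    """Estimate a distinct count while storing only k hash values (illustrative)."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []         # negated hashes, so the largest kept hash is at index 0
        self._members = set()   # the same hashes, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-random float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest one kept: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen, so the count is exact.
            return float(len(self._heap))
        # If n hashes are spread evenly over [0, 1), the k-th smallest sits near k / n.
        return (self.k - 1) / -self._heap[0]
```

Feeding every item of a large log through `add` and calling `estimate` at the end gives an approximate distinct count from a few kilobytes of state, with an error that shrinks roughly as one over the square root of `k`.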

Twenty-four hours in a VC's life – Professional investors' theoretical economic role is as discoverers of new information, finding opportunities for deploying other people's savings in useful ways. It's interesting to see some of that information transfer in action.

Haptic labs soft maps – I would have loved one of these quilt maps as a kid.

TopCoder – Using the competition model to drive software development.

American healthcare fraud and scalable investigative reporting – I'm always excited to see the new data techniques being used for more than ad targeting, and journalism is one of the most promising areas where they can make a difference.

Five short links

Photo by Pinké

Brain of Mat Kelcey – Mat's been doing some interesting work with Common Crawl, and his blog is a must-read for anyone interested in extracting data from unstructured text.

Google releases natural language dictionaries – Based around Wikipedia page titles as their list of concepts, Google Research have released a really interesting resource, a bit like a thesaurus for machines. Even better, it's available under a liberal CC BY license, so there should be no problem using it in any sort of project.

Dumb like me – A scary story for anyone who makes their living with their mind, from my friend Russ Jurney: "Smart people, like the very attractive, get special treatment they do not know they are getting". Despite all our techno-utopianism, we're still reliant on a fallible hardware platform made of meat.

NSA's security guide to iOS 5 – A wealth of detailed practical information on securing Apple mobile devices.

Things you might not know about jQuery – A good refresher on some of the less obvious cool features of the framework. I've been using .data() extensively on Jetpac.

Want a magic wand?

I've collaborated on and off with Nicholas Napp for years, including on a National Science Foundation grant for computer vision on mobile devices. He's extremely experienced in the world of traditional toys, as well as video games, and he's had a compelling dream of tying together motion sensors, smart phones, and a rich real-world game system to produce something magic. The gameplay uses precise location tracking and gestural control through a wand to give you an interface to cast spells that hurt or help other combatants in fights, or help you progress through adventures in other ways.

I love this sort of overlay of virtual layers on top of the physical world, and I believe Nick and his very talented creative collaborator Kevin Mowrer can execute on their vision. They're now on Kickstarter raising money to get started, so if you're intrigued, go check it out. I've donated $250 myself, and I can't wait to get my hands on an early wand.

Five short links

Photo by Randy Robertson

Fancy ML techniques don't matter much – "The reason I don’t like Kaggle is that it’s all about squeezing more juice out of existing data." There's a lot of hard-earned wisdom in this post, but I think he's overestimating the professional world's familiarity with machine learning techniques, and underestimating how hard they are to acquire. I love Kaggle because it allows me to outsource a whole lot of work that requires very specialized skills, so I don't have to support a full-time ML engineer, and I don't want to spend the time and resources I'd need to train an existing team-member to be good at it when we'll only use it occasionally.

Are your cookies colluding? – The Mozilla folks have released a plugin showing how ad networks are connected, with a network graph visualization that actually seems useful, rather than just being pretty.

An interactive map of the Roman Empire – Calculates the travel time and cost for journeys in the ancient world. Tools like these bring back a perspective that anyone used to modern transport has lost, especially how much cheaper and faster travel by sea was than travel by land. I was first struck by their power when I ran across time-based maps like this for the medieval world. They showed how much more connected coastal settlements were to fishing villages in other countries than to inland towns in their own, and helped me understand how England held on to Dunkirk for so long!

Image Vision Labs – Offers advanced image-processing algorithms as a service. We seem to be locked in an escalating arms race between users determined to upload pictures of their genitals, and platforms determined to stop them.

Pilot lights are evil – Data-driven detective work on where the actual energy usage is going, with a conclusion that's given away in the title, but remains surprising!

Add humans to your data pipeline


I was lucky enough to meet Chris Van Pelt of Crowdflower tonight, and it was fascinating to hear about some of the new developments bubbling away at the company. I'm a longtime fan; they add a lot of value beyond what you get from more basic crowd-sourcing services like Mechanical Turk, but I've always seen them as only an incremental improvement on their competitors. What Chris talked me through over beers felt like a true step forward though.

We started by chatting about their Real Time Foto Moderation tool. This is basically a penis removal tool for photo uploads; you feed in a stream of images and after a short delay you get back flagged results showing which were accepted according to the sort of criteria used by Apple's App Store for content. I was fascinated to hear about some of the rules – bare-chested guys are fine if they're outdoors, but not if they're inside!

This may not sound that revolutionary, but think about what this means. Your application code is calling an API, and getting results back, but behind the curtain is a workforce of humans! Chris likes to call this an RPC, a Remote Person Call. I'm not aware of any other service that allows this kind of unsupervised interaction; crowd-sourcing has always been much more of a batch process, with manual transfers of inputs and outputs between the human and automated stages.

This is important because it turns human tasks into modules that can be flexibly inserted into your data pipeline just by signing up on the web site and installing a Ruby gem. This changes crowd-sourcing from a cumbersome custom process that you have to extensively plan up-front into something you can experiment with just like you would any other API. You can build prototypes in a few minutes, test ideas, benchmark against other solutions, and start shipping code much faster.

Chris is free to experiment on the other side of the abstraction layer too. He might partially or completely automate the process and applications would never need to know, as long as the quality of results is consistent. Human-driven versions are likely to be more expensive than computational ones, and the price people are willing to pay for particular services will be a strong signal of which ones are worth sinking developer time into.
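
To make the abstraction concrete, here is a minimal sketch of what calling code might look like. It does not use the real Crowdflower API or its Ruby gem; every name in it (`moderate_uploads`, `make_human_backed_moderator`, `automated_moderator`) is hypothetical, and the "human" backend is simulated with pre-recorded answers so the example runs on its own. The point it tries to show is that the pipeline only depends on a function from image to verdict, so the implementation behind it can change without the caller noticing.

```python
from typing import Callable, Dict, Iterable

# A moderator is any callable that takes an image URL and returns True
# if the image is acceptable. Whether a person or a program makes that
# call is invisible to the code that uses it.
Moderator = Callable[[str], bool]


def moderate_uploads(urls: Iterable[str], moderator: Moderator) -> Dict[str, bool]:
    """Run every uploaded image through the moderator and collect the verdicts."""
    return {url: moderator(url) for url in urls}


def make_human_backed_moderator(judgments: Dict[str, bool]) -> Moderator:
    """Hypothetical stand-in for a remote call answered by human workers.

    A real service would send the image over the network and wait for a
    person to respond; here the answers are pre-recorded so the sketch runs.
    """
    def moderator(url: str) -> bool:
        return judgments[url]
    return moderator


def automated_moderator(url: str) -> bool:
    """Hypothetical automated replacement: a crude filename check."""
    return "flagged" not in url.lower()


if __name__ == "__main__":
    uploads = ["beach.jpg", "flagged_selfie.jpg"]
    human = make_human_backed_moderator(
        {"beach.jpg": True, "flagged_selfie.jpg": False}
    )
    # The pipeline code is identical for both backends.
    print(moderate_uploads(uploads, human))
    print(moderate_uploads(uploads, automated_moderator))
```

Swapping `human` for `automated_moderator` changes the cost and latency of each call but not the shape of the pipeline, which is what makes this kind of service easy to prototype against.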

There are a lot of hard problems that benefit from a human in the loop, from sentiment analysis to transcription, and I'd love to have a library of APIs for all those that I could drop into my data pipeline as I'm working on new features. Crowdflower is starting to make this possible, so I'll be excited to follow their progress as they roll out more services. If you have an AI-hard problem that's driving you crazy, they might have a solution that lets you pretend we've solved AI!

David Thomas, RIP


My grandfather David Thomas had a long life, and packed a lot in. He was one of the youngest lot to fight in World War II, but he didn't like to talk too much about the actual service he'd done. The easiest parts to get him talking about were the people, friends he'd lost, or who he'd stayed in touch with afterwards back into civilian life. He'd ended up in the navy, and on his way to a land base in Sierra Leone servicing torpedo bombers, he'd endured weeks below decks. He knew there wasn't much of a chance that far below if a U-boat struck, but what he remembered was the stink of so many men, without much access to a shower. He got on with it though.

That was his strength, getting on with it. At first when he came back from the war he worked on the buses, where his aircraft engine skills proved handy. When the buses went on strike, he needed to keep supporting his family and switched over to a job at the Post Office. That's one thing I remember, he always had wonderful access to catalogs showing special editions of stamps, and gave me discounted entry to the mail-order "Dinosaur Club" thanks to his connections. He was always keeping his eye out for things like that, little ways to help first his two daughters, then the grandkids like me, and finally the great-grandkids when they arrived.

He was devoted to his wife, my Nan, too, visiting her every day, all day in the hospital for months before she passed away a few years ago. He stayed active right until his end, despite an array of medical problems. It must have helped that he was surrounded by friends and family who loved him. I remember virtual traffic jams of people coming in to see him in his hospital bed, and within a few hours of a new ward the nurses would be new friends. One of the best presents I was ever able to give him was a calendar showing our pet photos, and the exact name, age, ownership, and character of all the animals in the latest one he received was a hot topic of conversation on my last visit to him two weeks ago. He devoured a box of chocolates that were another gift, but just a few days later he had a peaceful end, surrounded by family.

He's somebody I admire very much, for many reasons, but his kindness and lifetime of hard work to support his family stand out most of all. I miss him, but the positive impact he had through the way he lived his life will be around for a long time to come.

Shell Apps and Silver Bullshit

Photo by ChristianUK

I don't normally go so aggressive with a title, but this statement made me see red:

"If you plan to write an app for iOS or Android, you will save time and create a better product if you stick to Objective-C and Java, respectively."

Go to the iTunes store, download Jetpac for your iPad, and tell me whether you think it's native or HTML5. Guess what, it's heavily reliant on web code! Facebook's iOS apps take the same route, and in fact I'd recommend anybody with an 'always-online' app seriously consider the same approach.

Funnily enough, I agree with a lot of Benjamin's points about the costs of an HTML5 approach to app development. There's a lot more to creating a real app than pointing a native web view at your site, and that gets lost in the hype. What he misses is how much of a development tax you're paying when you're writing native code.

Designing in Interface Builder

"Remember web development in 2004? When you had to create pixel-perfect comps because every element on screen was an image?" I spent five years at Apple writing desktop applications, and I'm still often baffled and confused by Interface Builder. The whole process of creating native screens is an order of magnitude harder for designers, and not much better for developers. Just being able to preview the design in a web page and tweak the CSS live through Firebug is a massive time-saver.

Forgiving languages

Objective-C is finally starting to offer automatic memory management, but you'll still have to worry about buffer overflows and all sorts of other low-level details. Java is better, but not by much, and both require static typing. Modern languages like Ruby and JavaScript are a lot more forgiving, and I've found that makes development faster and doesn't seem to introduce more bugs. Again, building Jetpac in a combination of server-side Ruby and client-side JavaScript got us to market a lot quicker than we could have managed with a native approach. Just as one example, I forgot a retain on a single string in the little native code we do have, and that caused an intermittent crash that put our launch back a week.


Being able to remotely inspect a web view makes a world of difference to debugging. If you keep your app runnable in a desktop browser too, the tools for stepping through code are superb, and in a lot of ways they're now superior to something like Xcode. The amount of documentation and community answers from places like Stack Overflow for common problems is much larger for the web world than for native code, which is also a massive help for resolving bugs.


jQuery and Ruby's gem system are incredibly powerful. When we needed an autocomplete text box, grabbing a jQuery plugin was simple. Need support for Amazon S3 interaction? We could just grab a gem. The development community behind these technologies is far larger than that for native development, so you're a lot more likely to find an off-the-shelf solution to your problems.

Slow Deployment

We built up a whole set of useful practices around long installation cycles. In a lot of ways I still miss the extended stages of QA that were required for apps that shipped on physical media. The catch is, while you could theoretically do the same thing with an app that's served directly off a website, the disadvantages are so much larger that nobody does! I love being able to fix bugs and solve user problems immediately, and the iteration cycles on new features are much, much faster than anything that requires waiting on Apple's approval process.

Don't believe the hype

We still had to work hard to support hardware-accelerated scrolling and native-feeling swiping, but going HTML5 was definitely the right decision for our app. Is it for yours? I don't know, and neither does Benjamin. As always in engineering it's all about tradeoffs, and there's no alternative to researching how the drawbacks and advantages of each approach fit your unique requirements. Just don't believe anyone who tells you they know the One True Way to develop for mobile, on either side.

Five short links

Picture from Pulp Covers

Ayasdi – A very seductive new visualization and analysis tool, it feels like they've learned a lot from Palantir's success.

Benford's Law: A revised analysis – I'd been using the original study that analyzed public company accounts for fraud over time using Benford's Law as a poster child for the application of numeric methods to journalism. I'm sorry to see that it turned out to be a bogus correlation (thanks to an increase of zeroes in revenue figures) but it's a good reminder of how important peer review and humility are as we're charging ahead with our new techniques. It's the sort of mistake that keeps me awake at nights, knowing how easy it would be to make.
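
For readers who haven't met it, Benford's Law says that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with a 1. The sketch below is a minimal illustration of how a first-digit check can be run; it is not the method used in either study, and the function names are made up for the example. As the revised analysis shows, a large deviation only tells you the digits look unusual, not why.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """Return the first non-zero digit of a non-zero number."""
    if value == 0:
        raise ValueError("zero has no leading digit")
    if isinstance(value, int):
        return int(str(abs(value))[0])
    # Scientific notation puts the leading digit first, e.g. '1.23e-03'.
    return int(f"{abs(value):.16e}"[0])


def digit_counts(values):
    """Count how often each digit 1-9 leads, ignoring zeros."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    return {d: counts.get(d, 0) for d in range(1, 10)}


def chi_square(values):
    """Chi-square distance between observed leading digits and Benford's Law."""
    counts = digit_counts(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one non-zero value")
    expected = benford_expected()
    return sum(
        (counts[d] - total * expected[d]) ** 2 / (total * expected[d])
        for d in range(1, 10)
    )
```

With nine digits there are eight degrees of freedom, so a statistic above roughly 15.5 would be unusual at the 5% level if the data really followed the law. That is a prompt to look closer at the data, not a finding in itself.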

Tika – A lovely collection of open source code to handle all sorts of file conversions to text. I built some similar functionality into the Data Science Toolkit, but I'm excited to see an Apache-supported alternative.

Stanford Part-of-speech Tagger – A walk-through of a slick project for categorizing words within unstructured English-language text.

The Next Big Thing – How Amazon should be using their information on customers' book habits to drive a social network. I'm convinced that implicit signals will win out over the follow/friend model when it comes to building communities of people, but nobody's built an example that actually works yet.