When I first saw the results of the Kaggle Cats vs Dogs competition, I was amazed by how accurate they were. When I show consumers our Spotter iPhone app based on the same deep learning technology the contestants used, most people are distinctly underwhelmed thanks to all the mistakes it makes.
The problem is that while computer vision has got dramatically better in the last few years, it was so bad before that we’re still a long way behind what a human can achieve. Most of the obvious applications of computer vision, like the Fire Phone’s object recognition, implicitly assume a higher degree of accuracy than we can achieve, so users are left feeling disappointed and disillusioned by the technology. There’s a disconnect between researchers’ excitement about the improvements and future promise, and the general public’s expectations of what good computer vision should be able to do. I think we’re in a space much like the uncanny valley, where the technology is good enough to be built into applications, but bad enough that those apps will end up frustrating users.
I believe we need to stop trying to build applications that assume human levels of accuracy, and instead engineer around the strengths and weaknesses of the actual technology we have. Here are some of the approaches that can help.
Imagine a user loads a video clip and the application suggests a template and music that fit the subject, whether it’s a wedding or a kids’ soccer match. The cost and annoyance of the algorithm getting it wrong are low because it’s just a smart suggestion the user can dismiss, so the recognition accuracy only needs to be decent, not infallible. This approach of using computer vision to assist human decisions rather than replacing them can be used in a lot of applications if the designers are willing to build an interface around the actual capabilities of the technology.
A lot of the ideas I see for vision applications are essentially taking a job humans currently do, and getting a computer to do it instead (e.g. identifying products on the Fire Phone). They almost always involve taking a single photo, extracting a rock-solid identification, and then fetching related data based on that. These kinds of applications fall apart if the identification piece is inaccurate, which it currently is for everything but the simplest cases, like bar codes. Going in to build Jetpac’s City Guides, I knew that I wouldn’t be able to identify hipsters with 100% accuracy, but by analyzing a few thousand photos taken at the same place, I could get good data about the prevalence of hipsters at a venue even if there were some mistakes on individual images. As long as the errors are fairly random, throwing more samples at the problem will help. If you can, try to recast your application as something that will ingest a lot more photos than a human could ever deal with, and mine that bigger set for meaning.
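To make the "more samples" point concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption for illustration, not how Jetpac actually worked: the detector is a made-up noisy classifier with a hit rate (`tpr`) and false alarm rate (`fpr`) that I assume have been measured on a labeled test set, and the function names are hypothetical. Random errors like these still skew the raw fraction of detections, so the sketch divides the known error rates back out to get a prevalence estimate.

```python
import random


def simulate_detections(n_photos, true_rate, tpr, fpr, rng):
    """Simulate a hypothetical noisy detector run over n_photos images.

    true_rate: fraction of photos that really contain the subject.
    tpr: chance the detector fires when the subject is present.
    fpr: chance the detector fires when the subject is absent.
    """
    detections = []
    for _ in range(n_photos):
        has_subject = rng.random() < true_rate
        p_fire = tpr if has_subject else fpr
        detections.append(rng.random() < p_fire)
    return detections


def estimate_prevalence(detections, tpr, fpr):
    """Estimate the true prevalence from noisy per-photo detections.

    The observed rate is true_rate * tpr + (1 - true_rate) * fpr,
    so we solve that for true_rate and clamp the result to [0, 1].
    """
    observed = sum(detections) / len(detections)
    corrected = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, corrected))


if __name__ == "__main__":
    rng = random.Random(42)
    # Assumed numbers: 30% of photos really contain the subject, and the
    # detector is right 80% of the time with a 10% false alarm rate.
    for n in (10, 100, 5000):
        found = simulate_detections(n, 0.3, 0.8, 0.1, rng)
        print(n, round(estimate_prevalence(found, 0.8, 0.1), 3))
```

With only a handful of photos the estimate jumps around a lot, and with thousands it should settle close to the true rate, even though any single detection is unreliable. This only holds if the errors really are random; a detector that is systematically wrong about one venue won't be rescued by more photos.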
Right now, looking at photos and making sense of them is an expensive business. Even if you give a security guard a bank of monitors, they probably can’t track more than a dozen or so in any meaningful way. With the current state of computer vision, you could have hundreds of cheap cameras in a facility, and have them trigger an alert when something unusual happens, saving the guard’s superior recognition skills to make sense of the anomaly rather than trying to spot it in the first place. More generally, intelligent cameras become more like sensors that can be deployed in large numbers all over an assembly line, road tunnel, or sewer to detect when things are out of the ordinary. You’ll still need a human’s skills to investigate more deeply, but cheap computing power means you can deploy an army of smart sensors for applications you never could justify paying people to manually monitor.
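As a rough sketch of what "trigger an alert when something unusual happens" could look like, the Python below keeps a rolling baseline per camera and flags readings that stray far from it. It assumes each camera already boils a frame down to a single number (say, how much of the image changed since the last frame); that score, the `CameraMonitor` class, and the threshold values are all hypothetical choices for illustration, and a real system would likely use something richer.

```python
from collections import deque
from statistics import mean, pstdev


class CameraMonitor:
    """Flags per-frame scores that deviate sharply from a camera's recent baseline."""

    def __init__(self, window=50, threshold=4.0, min_history=10):
        self.history = deque(maxlen=window)  # recent "normal" scores
        self.threshold = threshold           # deviations (in std devs) that count as unusual
        self.min_history = min_history       # readings needed before alerting at all

    def update(self, score):
        """Record a new score and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.history) >= self.min_history:
            baseline = mean(self.history)
            # Floor the spread so a perfectly static scene doesn't divide by zero.
            spread = max(pstdev(self.history), 1e-6)
            is_anomaly = abs(score - baseline) / spread > self.threshold
        if not is_anomaly:
            # Only normal readings update the baseline, so one odd event
            # doesn't redefine what "ordinary" means for this camera.
            self.history.append(score)
        return is_anomaly


if __name__ == "__main__":
    # One monitor per camera; the camera IDs here are made up.
    monitors = {camera_id: CameraMonitor() for camera_id in ("dock-1", "dock-2")}
    readings = [1.0, 1.1] * 10 + [5.0]
    for score in readings:
        if monitors["dock-1"].update(score):
            print("dock-1 needs a human to take a look, score:", score)
```

The point of the design is that the software never has to say what happened, only that this camera is seeing something out of the ordinary. The guard does the actual recognition, and a false alarm costs a glance at a monitor rather than a wrong answer.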
I’m sure there are other approaches that will help too, but my big hope is that we can be more imaginative about designing around the limitations of current vision technology, and actually start delivering some of the promise that researchers are so excited about!