When I first saw the results of the Kaggle Cats vs Dogs competition, I was amazed by how accurate the winning entries were. But when I show consumers our Spotter iPhone app, built on the same deep learning technology the contestants used, most people are distinctly underwhelmed because of all the mistakes it makes.
The problem is that while computer vision has gotten dramatically better over the last few years, it was so bad before that we’re still a long way behind what a human can achieve. Most of the obvious applications of computer vision, like the Fire Phone’s object recognition, implicitly assume a higher degree of accuracy than we can deliver, so users are left feeling disappointed and disillusioned by the technology. There’s a disconnect between researchers’ excitement about the improvements and future promise, and the general public’s expectations of what good computer vision should be able to do. I think we’re in a space much like the uncanny valley, where the technology is good enough to be built into applications, but bad enough that those apps will end up frustrating users.
I believe we need to stop trying to build applications that assume human levels of accuracy, and instead engineer around the strengths and weaknesses of the actual technology we have. Here are some of the approaches that can help.
One approach is to make mistakes cheap. Imagine a user loads a video clip into an editing app, and the application suggests a template and music that fit the subject, whether it’s a wedding or a kids’ soccer match. The cost and annoyance of the algorithm getting it wrong are low because it’s just a smart suggestion the user can dismiss, so the recognition accuracy only needs to be decent, not infallible. This approach of using computer vision to assist human decisions rather than replace them can work in a lot of applications, if the designers are willing to build an interface around the actual capabilities of the technology.
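The suggestion pattern above can be sketched in a few lines. This is a minimal, hypothetical example (the `classify_frames` stand-in and the threshold value are mine, not from any real app): the classifier votes across frames, and the app only surfaces a suggestion when the score clears a modest bar, staying silent otherwise so a wrong guess costs the user nothing.

```python
# Sketch of the "suggest, don't decide" pattern. classify_frames is a
# stand-in for a real scene classifier returning (label, confidence) pairs.
from collections import Counter

SUGGESTION_THRESHOLD = 0.5  # modest bar: a wrong suggestion is cheap to dismiss

def classify_frames(frames):
    """Stand-in for a real classifier; in practice this runs a model per frame."""
    return [("wedding", 0.6) for _ in frames]

def suggest_template(frames):
    """Return a template suggestion, or None if we're not confident enough."""
    votes = Counter()
    for label, confidence in classify_frames(frames):
        votes[label] += confidence
    if not votes:
        return None
    label, score = votes.most_common(1)[0]
    if score / len(frames) < SUGGESTION_THRESHOLD:
        return None  # stay silent rather than risk an annoying wrong guess
    return label  # surfaced in the UI as a dismissible suggestion, not a decision

print(suggest_template(range(10)))  # → wedding
```

The key design choice is the asymmetry: below the threshold the app does nothing, because a missed suggestion is invisible while a bad one erodes trust.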
A lot of the ideas I see for vision applications essentially take a job humans currently do and get a computer to do it instead (e.g. identifying products on the Fire Phone). They almost always involve taking a single photo, extracting a rock-solid identification, and then fetching related data based on that. These kinds of applications fall apart if the identification piece is inaccurate, which it currently is for everything but the simplest cases like bar codes. Going into building Jetpac’s City Guides, I knew that I wouldn’t be able to identify hipsters with 100% accuracy, but by analyzing a few thousand photos taken at the same place, I could get good data about the prevalence of hipsters at a venue even if there were mistakes on individual images. As long as the errors are fairly random, throwing more samples at the problem helps. If you can, try to recast your application as something that ingests far more photos than a human could ever deal with, and mine that bigger set for meaning.
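Here’s a quick simulation of why aggregation works when single-image accuracy doesn’t. The error rates are made-up numbers for illustration, not measurements from any real detector: if you know roughly how often the classifier fires on true and false cases, you can invert the observed hit rate over thousands of photos to recover a solid prevalence estimate, even though any individual call is unreliable.

```python
# Illustration: recovering true prevalence from a noisy per-image classifier.
# The rates below are invented for the sketch.
import random

random.seed(42)

TRUE_POSITIVE_RATE = 0.80   # detector fires on 80% of actual positives
FALSE_POSITIVE_RATE = 0.10  # and on 10% of everything else

def estimate_prevalence(observed_rate):
    """Invert observed = p*TPR + (1-p)*FPR to recover the true prevalence p."""
    p = (observed_rate - FALSE_POSITIVE_RATE) / (TRUE_POSITIVE_RATE - FALSE_POSITIVE_RATE)
    return min(max(p, 0.0), 1.0)  # clamp to a valid proportion

# Simulate a venue where 30% of photos really contain the thing we're counting.
true_prevalence = 0.30
photos = [random.random() < true_prevalence for _ in range(5000)]
detections = [
    (random.random() < TRUE_POSITIVE_RATE) if positive
    else (random.random() < FALSE_POSITIVE_RATE)
    for positive in photos
]
observed = sum(detections) / len(detections)
print(round(estimate_prevalence(observed), 2))  # close to 0.30
```

With 5,000 photos the random per-image errors wash out, which is exactly why the venue-level numbers hold up even when any single photo might be misjudged.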
Right now, looking at photos and making sense of them is an expensive business. Even if you give a security guard a bank of monitors, they probably can’t track more than a dozen or so in any meaningful way. With the current state of computer vision, you could put hundreds of cheap cameras in a facility and have them trigger an alert when something unusual happens, saving the guard’s superior recognition skills for making sense of the anomaly rather than spotting it in the first place. More generally, intelligent cameras become more like sensors that can be deployed in large numbers all over an assembly line, road tunnel, or sewer to detect when things are out of the ordinary. You’ll still need a human’s skills to investigate more deeply, but cheap computing power means you can deploy an army of smart sensors for applications you could never justify paying people to monitor manually.
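The core of such a smart sensor can be very simple. Here’s a hedged sketch, assuming each camera reduces its view to a single activity reading (a real system might use frame differencing or a small neural net): the sensor keeps a running baseline and only pings a human when a reading deviates far from what it has seen before.

```python
# Minimal "smart sensor" sketch: learn a running baseline, alert on outliers.
# The readings and tolerance here are illustrative, not from a real deployment.

class AnomalySensor:
    def __init__(self, tolerance=3.0):
        self.mean = 0.0
        self.var = 1.0
        self.count = 0
        self.tolerance = tolerance  # how many std-devs counts as "unusual"

    def observe(self, reading):
        """Return True if the reading looks anomalous, then update the baseline."""
        self.count += 1
        if self.count > 10:  # only alert once we have some baseline
            deviation = abs(reading - self.mean) / (self.var ** 0.5 + 1e-9)
            anomalous = deviation > self.tolerance
        else:
            anomalous = False
        # Exponentially-weighted running estimates of mean and variance.
        alpha = 0.1
        delta = reading - self.mean
        self.mean += alpha * delta
        self.var = (1 - alpha) * (self.var + alpha * delta * delta)
        return anomalous

sensor = AnomalySensor()
quiet = [sensor.observe(1.0 + 0.01 * (i % 3)) for i in range(50)]  # normal activity
alert = sensor.observe(25.0)  # a sudden spike should trigger an alert
print(alert)  # → True
```

The point isn’t the statistics, which are deliberately crude; it’s that a check this cheap can run on hundreds of cameras at once, and a human only gets involved when one of them raises its hand.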
I’m sure there are other approaches that will help too, but my big hope is that we can be more imaginative about designing around the limitations of current vision technology, and actually start delivering some of the promise that researchers are so excited about!
Both Deep Belief and Spotter are really cool. Is the processing happening on board the phone?
Thanks, and yes, it’s all local. You can verify it by turning on Airplane Mode and giving it a whirl! The code’s also available at https://github.com/jetpacapp/DeepBeliefSDK/ if you want to dig in a bit deeper.