How to get computer vision out of the unimpressive valley

Photo by Severin Sadjina

When I first saw the results of the Kaggle Cats vs Dogs competition, I was amazed by how accurate they were. When I show consumers our Spotter iPhone app, built on the same deep learning technology the contestants used, most people are distinctly underwhelmed thanks to all the mistakes it makes.

The problem is that while computer vision has got dramatically better in the last few years, it was so bad before that we’re still a long way behind what a human can achieve. Most of the obvious applications of computer vision, like the Fire Phone’s object recognition, implicitly assume a higher degree of accuracy than we can achieve, so users are left feeling disappointed and disillusioned by the technology. There’s a disconnect between researchers’ excitement about the improvements and future promise, and the general public’s expectations of what good computer vision should be able to do. I think we’re in a space much like the uncanny valley, where the technology is good enough to be built into applications, but bad enough that those apps will end up frustrating users.

I believe we need to stop trying to build applications that assume human levels of accuracy, and instead engineer around the strengths and weaknesses of the actual technology we have. Here are some of the approaches that can help.

Forgiving Interfaces

Imagine a user loads a clip into a video-editing application and it suggests a template and music that fit the subject, whether it’s a wedding or a kids’ soccer match. The cost and annoyance of the algorithm getting it wrong are low because it’s just a smart suggestion the user can dismiss, so the recognition accuracy only needs to be decent, not infallible. This approach of using computer vision to assist human decisions rather than replacing them can be used in a lot of applications, if the designers are willing to build an interface around the actual capabilities of the technology.
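
To make the pattern concrete, here’s a minimal Python sketch of the ‘suggest, don’t decide’ idea: only surface a suggestion when the classifier is reasonably confident, and always leave the user free to dismiss it. The classify_scene function, the threshold, and the template names are all hypothetical placeholders, not anything a real app uses.

```python
# A minimal sketch of the "suggest, don't decide" pattern. classify_scene and
# the template names are hypothetical stand-ins for whatever vision model and
# editing templates an app actually uses.

CONFIDENCE_THRESHOLD = 0.6  # below this, stay quiet rather than guess


def suggest_template(video_frames, classify_scene, default_template="plain"):
    """Return (template, is_suggestion): a dismissible hint, never a final decision."""
    label, confidence = classify_scene(video_frames)  # e.g. ("wedding", 0.72)
    if confidence < CONFIDENCE_THRESHOLD:
        # Not sure enough: fall back to a neutral default instead of a wrong guess.
        return default_template, False
    templates = {"wedding": "romantic", "soccer": "sports_highlight"}
    return templates.get(label, default_template), True
```

The key design choice is that a wrong guess costs the user one dismissal, not a ruined edit.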

Big Data

A lot of the ideas I see for vision applications essentially take a job humans currently do and get a computer to do it instead (eg identifying products on the Fire Phone). They almost always involve taking a single photo, extracting a rock-solid identification, and then fetching related data based on that. These kinds of applications fall apart if the identification piece is inaccurate, which it currently is for everything but the simplest cases, like bar codes. Going in to build Jetpac’s City Guides, I knew that I wouldn’t be able to identify hipsters with 100% accuracy, but by analyzing a few thousand photos taken at the same place, I could get good data about the prevalence of hipsters at a venue even if there were some mistakes on individual images. As long as the errors are fairly random, throwing more samples at the problem will help. If you can, try to recast your application as something that will ingest a lot more photos than a human could ever deal with, and mine that bigger set for meaning.
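
Here’s a toy Python simulation of why this works: an imperfect per-photo classifier still produces a stable venue-level signal once you feed it enough images. The 85% accuracy and 30% true rate are made-up numbers for illustration, not Jetpac’s real figures.

```python
# A toy simulation of the "many noisy photos beat one perfect photo" idea.
import random

random.seed(0)

TRUE_RATE = 0.30   # fraction of photos at a venue that really show hipsters
ACCURACY = 0.85    # per-photo accuracy of an imperfect classifier


def noisy_classify(is_hipster):
    """Return the true label most of the time, the wrong one otherwise."""
    return is_hipster if random.random() < ACCURACY else not is_hipster


for n_photos in (10, 100, 1000, 10000):
    photos = [random.random() < TRUE_RATE for _ in range(n_photos)]
    detections = sum(noisy_classify(p) for p in photos)
    print(f"{n_photos:>6} photos -> estimated rate {detections / n_photos:.2f}")
```

With enough photos the estimate settles near a stable value, and because the classifier’s error rate can be measured on a labelled sample, that raw rate can be corrected back toward the true one, even though plenty of individual labels are wrong.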

Grunt Work

Right now, looking at photos and making sense of them is an expensive business. Even if you give a security guard a bank of monitors, they probably can’t track more than a dozen or so in any meaningful way. With the current state of computer vision, you could have hundreds of cheap cameras in a facility and have them trigger an alert when something unusual happens, saving the guard’s superior recognition skills for making sense of anomalies rather than trying to spot them in the first place. More generally, intelligent cameras become more like sensors that can be deployed in large numbers all over an assembly line, road tunnel, or sewer to detect when things are out of the ordinary. You’ll still need a human’s skills to investigate more deeply, but cheap computing power means you can deploy an army of smart sensors for applications you could never justify paying people to monitor manually.
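
As a rough illustration, a cheap ‘smart sensor’ can be as simple as frame differencing that pages a human when too many pixels change. This is only a sketch built on OpenCV’s basic image operations; the camera index, thresholds, and the send_alert hook are placeholders, and a production system would use something smarter than raw pixel differences.

```python
# A minimal motion-trigger sketch using OpenCV frame differencing. The camera
# index and thresholds are placeholder values.
import cv2


def watch(send_alert, camera_index=0, pixel_threshold=25, changed_fraction=0.02):
    """Call send_alert(frame) whenever enough of the image changes between frames."""
    cap = cv2.VideoCapture(camera_index)
    ok, frame = cap.read()
    if not ok:
        raise RuntimeError("couldn't read from camera")
    previous = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
        diff = cv2.absdiff(previous, gray)
        _, changed = cv2.threshold(diff, pixel_threshold, 255, cv2.THRESH_BINARY)
        if cv2.countNonZero(changed) > changed_fraction * changed.size:
            send_alert(frame)  # hand the interesting frame off to a human
        previous = gray
```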

I’m sure there are other approaches that will help too, but my big hope is that we can be more imaginative about designing around the limitations of current vision technology, and actually start delivering some of the promise that researchers are so excited about!

Setting up Caffe on Ubuntu 14.04

A lot of people on this morning’s webcast asked if I had an Amazon EC2 image of pre-installed Caffe. I didn’t then, but I’ve just put one together! It’s available as ami-2faaa96a in the Northern California zone. There’s also a Vagrant VM at https://d2rlgkokhpr1uq.cloudfront.net/dl_webcast.box, and I’ve got full instructions for setting up your own machine on the Caffe wiki. I’m shaving yaks, so you don’t have to!
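
If you want to confirm the install actually works, a quick classification run is the easiest smoke test. This is just a sketch along the lines of Caffe’s bundled classification example: the model and image paths are placeholders, and the Python interface has shifted between Caffe releases, so adjust it to whatever version you end up with.

```python
# A quick sanity check that the Caffe install works. The paths below are
# placeholders for whatever model definition, weights, and test image you have.
import numpy as np
import caffe

MODEL_DEF = "deploy.prototxt"      # network description (placeholder path)
PRETRAINED = "model.caffemodel"    # trained weights (placeholder path)
IMAGE = "cat.jpg"                  # any test image you have lying around

net = caffe.Classifier(MODEL_DEF, PRETRAINED,
                       image_dims=(256, 256), raw_scale=255, channel_swap=(2, 1, 0))
prediction = net.predict([caffe.io.load_image(IMAGE)])
print("top class index:", np.argmax(prediction[0]))
```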

Join me for a Deep Learning webcast tomorrow

Photo by RoadsidePictures

I’ve been having a lot of fun working with the O’Reilly team on ways to open up Deep Learning to a wider audience of developers. I’m working on a book tentatively titled “Deep Learning for Hackers” (since I’m a fan of Drew and John’s work), I’ve put up some introductory chapters as articles on the O’Reilly website, and tomorrow I’ll be doing an hour-long webcast walking through the basics of training and running deep networks with Caffe.

I hope you can join me; it’s a big help whenever I can collaborate with an audience to understand more about what developers need! If you do, I recommend downloading the Vagrant box ahead of time to make it easy to follow along. I look forward to seeing you there.

Five short links

Photo by Gianni

Databending using Audacity effects – What happens when you apply audio effects to images? Occasionally-wonderful glitch art! It always lifts my heart to see creative tinkering like this, especially when it’s well-documented.

Scan Processor Studies – On the topic of glitch art, here are some mind-bending analog video effects from 1972. They look amazingly fresh, and I discovered them while following the influences behind the phenomenal Kentucky Route Zero, which features a lot of tributes to the Bronze Age of computing.

The crisis of reproducibility in omen reading – In which I cheer as the incomparable Cosma Shalizi takes on fuzzy thinking, again.

What’s next for OpenStreetMap? – An essential overview of the Open Database License virality problem that’s stopped me from working with OSM data for geocoding. David Blackman from Foursquare gave a heart-breaking talk describing all the work he’s put into cleaning up and organizing OSM boundaries for his open-source TwoFishes project, only to find himself unable to use it in production, or even recommend that other users adopt it!

Where is my C++ replacement? – As a former graphics engine programmer, I recognize the requirements and constraints in this lament. I want to avoid writing any more C or C++, after being scared straight by all the goto-fail-like problems recently, and Go’s my personal leading contender, but there’s still no clear winner.

Why engineers shouldn’t disdain frontend development

Picture by Sofi

When I saw Cate Huston’s latest post, this part leapt out at me:

‘There’s often a general disdain for front-end work amongst “engineers”. In a cynical mood, I might say this is because they don’t have the patience to do it, so they denigrate it as unimportant.’

It’s something that’s been on my mind since I became responsible for managing younger engineers, and helping them think about their careers. It’s been depressing to see how they perceive frontend development as low status, as opposed to ‘real programming’ on crazy algorithms deep in the back end. In reality, becoming a good frontend developer is a vital step to becoming a senior engineer, even if you end up in a backend-focused role. Here’s why.

You learn to deal with real requirements

The number one reason I got when I dug into juniors’ resistance to frontend work was that the requirements were messy and constantly changing. All the heroic sagas of the programming world are about elegant systems built from components with minimal individual requirements, like classic Unix tools. The reality is that any system that actually gets used by humans, even Unix, has its elegance corrupted to deal with our crazy and contradictory needs. The trick is to fight that losing battle as gracefully as you can. Frontend work is the boss level in coping with those pressures, and it will teach you how to engineer around them. Then, when you’re inevitably faced with similar problems in other areas, you’ll be able to handle them with ease.

You learn to work with people

In most other programming roles you get to sit behind a curtain like the Great and Powerful Wizard of Oz whilst supplicants come to you begging for help. They don’t understand what you’re doing back there, so they have to accept whatever you tell them about the constraints and results you can produce. Quite frankly it’s an open invitation to be a jerk, and a lot of engineers RSVP!

Frontend work is all about the visible results, and you’re accountable at a detailed level to a whole bunch of different people, from designers to marketing; even the business folks are going to be making requests and suggestions. You have nothing to hide behind, and it’s hard to wiggle out of work by throwing up a smokescreen of jargon when the task is just changing the appearance or basic functionality of a page. You’re suddenly just another member of a team working on a problem, not a gatekeeper, and the power relationship is very different. This can be a nasty shock at first, but it’s good for the soul, and it will give you vital skills that will stand you in good stead.

A lot of programmers who’ve only worked on backend problems find their careers limited because nobody wants to work with them. Sure, you’ll be well paid if you have technical skills that are valuable, but you’ll be treated like a troll that’s kept firmly under a bridge, for fear you’ll scare other employees. Being successful in frontend work means that you’ve learned to play well with others, to listen to them, and communicate your own needs effectively, which opens the door to a lot of interesting work you’d never get otherwise. As a bonus, you’re also going to become a better human being and have more fun!

You’ll be able to build a complete product

There are a lot of reasons why being full-stack is useful, but one of my favorites is that you can prototype a fully-working side-project on your own. Maybe that algorithm you’ve been working on really is groundbreaking, but unless you can build it into a demo that other people can easily see and understand, the odds are high it will just languish in obscurity. Being able to quickly pull together an app that doesn’t make the viewer’s eyes bleed is a superpower that will make everything else you do easier. Plus, it’s so satisfying to take an idea all the way from a notepad to a screen, all by yourself.

You’ll understand how to integrate with different systems

One of the classic illusions of engineers early in their career is that they’ll spend most of their time coding. In reality, writing new code is only a fraction of the job; most of your time will go into debugging, or getting different code libraries to work together. The frontend is the point at which you have to pull together all of the other modules that make up your application. That requires a wide range of skills, not the least of which is investigating problems and assigning blame! It’s the best bootcamp I can imagine for working with other people’s code, which is another superpower for any developer. Even if you only end up working as a solo developer on embedded systems, there’s always going to be an OS kernel and drivers you rely on.

Frontend is harder than backend

The Donald Knuth world of algorithms looks a lot like physics or maths, and those are the fields most engineers think of as the hardest and hence the most prestigious. Just like we’ve discovered in AI though, the hard problems are easy, and the easy problems are hard. If you haven’t already, find a way to get some frontend experience; it will pay off handsomely. You’ll also have a lot more sympathy for all the folks on your team who are working on the user experience!

What does the future hold for deep learning?

Photo by Pierre J.

When I chat to people about deep learning, they often ask me what I think its future is. Is it a fad, or something more enduring? What new opportunities are going to appear? I don’t have a crystal ball, but I have now spent a lot of time implementing deep neural networks for vision, and I’m also old enough to have worked through a couple of technology cycles. I’m going to make some ‘sophisticated wild-ass guesses’ about where I think things will head.

Deep learning eats the world

I strongly believe that neural networks have finally grown up enough to fulfil their decades-old promise. All applications that process natural data (speech recognition, natural language processing, computer vision) will rely on them. This is already happening on the research side, but it will take a while to percolate fully through to the commercial sector.

Training and running a model will require different tools

Right now, experimenting with new network architectures and training models is done with the same tools we use to run those models to generate predictions. To me, trained neural networks look a lot like compiled programs in a very limited assembler language. They’re essentially just massive lists of weights, with a description of the order to execute them in. I don’t see any reason why the tools we use to develop them, which we use to change, iterate, debug, and train networks, should also be used to execute them in production, with its very different requirements around interoperability with existing systems and performance constraints. I also think we’ll end up with a small number of research-oriented folks who develop models, and a wider group of developers who apply them with less understanding of what’s going on inside the black box.
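
To make that ‘list of weights’ picture concrete, here’s a stripped-down numpy sketch of pure inference, with none of the training machinery attached. The shapes and values are invented purely to show the structure.

```python
# A stripped-down view of inference: once training is done, a network is just
# stored weights plus a fixed execution order. Shapes and values are invented.
import numpy as np

rng = np.random.default_rng(0)

# The "compiled program": an ordered list of (weights, biases), as saved to
# disk by whatever framework did the training.
layers = [
    (rng.standard_normal((256, 128)), np.zeros(128)),
    (rng.standard_normal((128, 10)), np.zeros(10)),
]


def run(x, layers):
    """Execute the weight list in order: matrix multiply, add bias, ReLU."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:   # no ReLU after the final layer
            x = np.maximum(x, 0)
    return x


scores = run(rng.standard_normal((1, 256)), layers)
print(scores.shape)   # (1, 10): one score per class
```

Nothing in that execution path needs the solvers, gradients, or data-augmentation machinery a training framework drags along.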

Traditional approaches will fight back

Deep learning isn’t the end of computer science, it’s just the current top dog. Millions of man-years have gone into researching other approaches to computer vision, for example, and my bet is that once researchers have absorbed some of the lessons behind deep learning’s success (eg using massive numbers of training images and letting the algorithm pick the features), we’ll see better versions of the old algorithms emerge. We might even see hybrids; for example, I’m using an SVM as the final layer of my network to enable fast retraining on embedded devices.
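
As an illustration of that hybrid pattern, here’s a hedged scikit-learn sketch: treat the trained network as a fixed feature extractor and retrain only a lightweight SVM on top. The extract_features function is a hypothetical stand-in for running an image through the network and returning its penultimate-layer activations; it isn’t the actual code from my project.

```python
# A sketch of the CNN-features-plus-SVM hybrid: reuse a trained network as a
# fixed feature extractor and retrain only a lightweight classifier on top.
# extract_features is a hypothetical stand-in for a forward pass that returns
# the penultimate-layer activations for one image.
import numpy as np
from sklearn.svm import LinearSVC


def retrain_final_layer(images, labels, extract_features):
    features = np.vstack([extract_features(img) for img in images])
    classifier = LinearSVC(C=1.0)  # fast to train, cheap to run on-device
    classifier.fit(features, labels)
    return classifier
```

Retraining just this last stage on a handful of new examples is far cheaper than re-running back-propagation through the whole network, which is what makes it attractive on embedded devices.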

There will be a gold-rush around production-ready tools

Deep learning eating the world means rapidly growing demand for new solutions, as it spreads from research into production. The tools will need to fit into legacy ecosystems, so things like integration with OpenCV and Hadoop will become very important. As they get used at large scale, the power and performance costs of running the networks will become a lot more important, as opposed to the raw speed of training that current researchers are focused on. Developers will want to be able to port their networks between frameworks, so they can use the one that has the right tradeoffs for their requirements, rather than being bound to whatever system they trained the model on as they are right now.

What does it all mean?

With all these assumptions, here’s where I think we’re headed. Researchers will focus on expanding and improving the current crop of training-focused libraries and IDEs (Caffe, Theano). Other developers will start producing solutions that can be used more widely. They’ll be able to compete on ease of use, performance (not just raw speed, but also power consumption and hardware costs), and which environments they run in (language integration, distributed systems support via Hadoop or Spark, embedded devices).

One of the ways to improve performance is with specialized hardware, but there are some serious obstacles to overcome first. One of them is that the algorithms themselves are in flux; I think there will be a lot of changes over the next few years, which makes solidifying them into chips hard. Another is that almost all the time in production systems is spent doing massive matrix multiplies, which existing GPUs happen to be great at parallelizing. Even SIMD instructions on ordinary CPUs are highly effective at giving good performance. If deep networks need to be run as part of larger systems, the latency involved in transferring between the CPU and specialized hardware will kill speed, just as it does with a lot of attempts to use GPUs in real-world programs. Finally, a lot of the interest seems to be around encoding the training process into a chip, but in my future only a small part of the technology world trains new models; everyone else just executes off-the-shelf networks. With all that said, I’m still fascinated by the idea of a new hardware approach to the problem. Since I see neural networks as programs, building chips to run them is very tempting. I’m just wary that it may be five or ten years before they make commercial sense, not two or three.
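
One way to see why GEMM-friendly hardware sets such a high bar is to time the matrix multiply in a fully-connected layer against the element-wise activation that follows it. This is only a rough numpy sketch; the absolute numbers depend entirely on your CPU and BLAS build, and the point is the ratio rather than the milliseconds.

```python
# Rough timing of where inference time goes in a fully-connected layer:
# the matrix multiply versus the element-wise activation that follows it.
import time
import numpy as np

x = np.random.rand(256, 4096).astype(np.float32)   # a batch of activations
w = np.random.rand(4096, 4096).astype(np.float32)  # one layer's weights


def clock(fn, repeats=20):
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


matmul_time = clock(lambda: x @ w)
relu_time = clock(lambda: np.maximum(x, 0))
print(f"matrix multiply: {matmul_time * 1000:.2f} ms")
print(f"ReLU activation: {relu_time * 1000:.2f} ms")
```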

Anyway, I hope this gives you some idea of how things look from my vantage point. I’m excited to be involved with a technology that’s got so much room to grow, and I can’t wait to see where it goes from here!

Five short links

Photo by <rs> snaps

Crossing the great UX/Agile divide – Methodologies are driven by human needs, and I know from personal experience that agile development takes power from designers and gives it to engineers. That doesn’t mean it’s wrong, but admitting there’s a power dynamic there makes it at least possible to talk about it candidly. “Although many software developers today enjoy the high salaries and excellent working conditions associated with white-collar work, it may not stay that way and UX design could be a contributing factor.”

Eigenmorality – A philosophical take on how PageRank-like algorithms could be used to tackle ethical dilemmas, featuring Eigenmoses and Eigenjesus.

The elephant was a Trojan horse – I almost always use Hadoop as a simple distributed job system, and rarely need MapReduce. I think this eulogy for the original approach captures a lot of why MapReduce was so important as an agent of change, even if it ended up not being used as much as you’d expect.

Neural networks, manifolds, and topology – There are a lot of important insights here. One is the Manifold Hypothesis, which essentially says there are simpler underlying structures buried beneath the noise and chaos of natural data. Without this, machine learning would be impossible, since you’d never be able to generalize from a set of examples to cope with novel inputs; there would be no pattern to find. Another is that visual representations of the problems we’re tackling can help make sense of what’s actually happening under the hood.

The polygon construction kit – Turns 3D models into instructions for building them in the real world. It’s early days still, but I want this!