Yesterday I was suddenly struck by a thought – I used to be a coder, now I teach computers to write their own programs. With the deep belief systems I’m using for computer vision, I spend most of my time creating an environment that allows the machines to decide how they want to solve problems, rather than dictating the solution myself. I’m starting to feel a lot more like a teacher than a programmer, so here’s what it’s like to teach a classroom of graphics cards.
I have to spend a lot of time figuring out how to collect a large training set of images, which have to represent the kind of pictures that the algorithm will be likely to encounter. That means you can’t just re-use photos from cell phones if you’re targeting a robotics application. The lighting, viewing angles, and even the ‘fisheye’ geometry of the lens all have to be consistent with what the algorithm will encounter in the real world or you’ll end up with poor results. I also have to make sure the backgrounds of the images are as random as possible, because if the objects I’m looking for always occur in a similar setting in the training, I’ll end up detecting that rather than the thing I actually care about.
Another crucial step is deciding what the actual categories I’m going to recognize are. They have to be the kind of thing that’s quite different between images, so separating cats from dogs is more likely to work than distinguishing American from British short-hair cat breeds. There are often edge cases too, so to get consistent categorization I’ll spend some time figuring out rules. If I’m looking for hipsters with mustaches, how much stubble does somebody need on their upper lip before they count? What if they have a mustache as part of a beard?
Once I’ve done all that, I have to label at least a thousand images for each category, with often up to a million images in total. This means designing a system to capture likely images from the web or other sources, with a UI that lets me view them rapidly and apply labels to any that fall into a category I care about. I always start by categorizing the first ten thousand or so images myself so I can get a feel for how well the categorization rules work, and what the source images are like overall. Once I’m happy the labeling process works, I’ll get help from the rest of our team, and then eventually bring in Mechanical Turks to speed up the process.
One advantage I have over my conventional teacher friends is that I get to design my own students! This is one of the least-understood parts of the deep learning process though, with most vision solutions sticking pretty close to the setup described in the original Krizhevsky paper. There are several basic components that I have to arrange in a pipeline, repeating some of them several times with various somewhat-arbitrary transformations in between. There are a lot of obscure choices to make about ordering and other parameters, and you won’t know if something’s an improvement until after you’ve done a full training run, which can easily take weeks. This means that, as one of my friends put it, we have an entire generation of graduate students trying to find improvements by trying random combinations in parallel. It’s a particularly painful emulation of a genetic algorithm since it’s powered by consuming a chunk of people’s careers, but until we have more theory behind deep learning, the only way to make progress is by using architectures that have been found to work in the past.
The training process itself involves repeatedly looping through all of the labeled images, and rewarding or punishing the neural connections in your network depending on how correctly they respond to each photo. This process is similar to natural learning, as more examples are seen the system starts to understand more about the patterns they have in common and the success rate increases. In practice deep neural networks are extremely fussy learners though, and I spend most of my time trying to understand why they’re bone-headedly not improving when they should be. There can be all sorts of problems; poorly chosen categories, bad source images, incorrectly classified objects, a network layout that doesn’t work, or bugs in the underlying code. I can’t ask the network why it’s not learning, we just don’t have good debugging tools, so I’ll usually end up simplifying the system to eliminate possible causes and try solutions more quickly than I could with a full run.
Training can take a long time for problems like recognizing the 1,000 Imagenet categories, on the order of a couple of weeks. At any point the process can spiral out of control or hit a bug, so I have to check the output logs several times a day to see how they’re doing. My girlfriend has became resigned to me tending ‘the brain’ in the corner of our living room in breaks between evening TV. Even if nothing’s gone dramatically wrong, several of the parameters need to be changed as the training process progresses to keep the learning rate up, and knowing when to make those changes is much more of an art than a science.
Once I’ve got a model fully trained, I have to figure out how well it works in practice. You might think it would be easy to evaluate a computer, they don’t have all the human problems of performance anxiety or distraction, but this part can actually be quite tough. As part of the training process I’m continually running numerical tests on how many right and wrong answers the system is giving, but, like standardized tests for kids, these only tell part of the story. One of the big advantages I’ve found with deep learning systems is that they make more understandable mistakes than other approaches. For example, users are a lot more forgiving if a picture of a cat is mis-labeled as a racoon, than if it’s categorized as a coffee cup! That kind of information is lost when we boil down performance into a single number, so I have to dive deeper.
The real test is building the model into an application and getting it in front of users. Often I’ll end up tweaking the results of the algorithm based on how I observe people reacting to it, for example suppressing the nematode label because it’s the default when the image is completely black. I’ll often spot more involved problems that require changes at the training set level, which will require another cycle through the whole process once they’re important enough to tackle.
As you can see being a computer trainer is a strange job, but as we get better at building systems that can learn, I bet it’s going to be increasingly common. The future may well belong to humble humans who work well with intelligent machines.