I’m very proud and excited to be taking my oath of allegiance this morning, the final step to becoming a US citizen after thirteen years of calling this country my home. To mark the occasion, my girlfriend Joanne wanted to interview me to answer some pressing questions about exactly why I still can’t pronounce “Water” correctly!
Yesterday a friend emailed, asking “What’s going on with deep learning? I keep hearing about more and more companies offering it, is it something real or just a fad?”. A couple of years ago I was very skeptical of the hype that had emerged around the whole approach, but then I tried it, and was impressed by the results I got. I still try to emphasize that deep neural networks are not magic, but here’s why I think they’re worth getting excited about.
They work really, really well
Neural networks have been the technology-of-the-future since the 1950s, with massive theoretical potential but lacklustre results in practice. The big turning point in public perception came when a deep learning approach won the equivalent of the World Cup for computer vision in 2012. Just look at the results table: the SuperVision team of Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton absolutely trounced their closest competitors. It wasn’t a fluke; here’s a good overview of a whole bunch of other tasks where the approach is either beating more traditional approaches or providing comparable results. I can back this up with my own experience, and deep learning entries have consistently won highly-competitive Kaggle competitions too.
I’m focused on computer vision, but deep neural networks have already become the dominant approach in speech recognition, and they’re showing a lot of promise for making sense of text too. There’s no other technique that applies to so many different areas, and that means that any improvements in one field have a good chance of applying to other problems too. People who learn how to work with deep neural nets can keep re-using that skill across a lot of different domains, so it’s starting to look like a valuable foundational skill for practical coders rather than a niche one for specialized academics. From a research perspective that breadth makes the approach worth investing in too, because it shows a lot of promise for tackling a wide range of topics.
With neural networks you’re not telling a computer what to do, you’re telling it what problem to solve. I try to describe what this means in practice in my post about becoming a computer trainer, but the key point is that the development process is a lot more efficient once you hand over implementation decisions to the machine. Instead of a human with a notebook trying to decide whether to look for corners or edges to help spot objects in images, the algorithm looks at a massive number of examples and decides for itself which features are going to be useful. This is the kind of radical change that artificial intelligence has been promising for decades, but has seldom managed to deliver until now.
There’s lots of room for improvement
Even though the Krizhevsky approach won the 2012 ImageNet competition, nobody can claim to fully understand why it works so well, or which design decisions and parameters are most important. It’s a fantastic trial-and-error solution that works in practice, but we’re a long way from understanding how it works in theory. That means that we can expect to see speed and result improvements as researchers gain a better understanding of why it’s effective, and how it can be optimized. As one of my friends put it, a whole generation of graduate students are being sacrificed to this effort, but they’re doing it because the potential payoff is so big.
I don’t want you to just jump on the bandwagon, but deep learning is a genuine advance, and people are right to be excited about it. I don’t doubt that we’re going to see plenty of other approaches trying to improve on its results, and it’s not going to be the last word in machine learning. It has been a big leap forward for the field though, and promises a lot more in years to come.
I’m very pleased to announce that I’ve managed to port the Deep Belief image recognition SDK to the Raspberry Pi! I’m excited about this because it shows that even tiny, cheap devices are capable of performing sophisticated computer vision tasks. I’ve talked a lot about how object detection is going to be commoditized and ubiquitous, but this is a tangible example of what I mean, and I’ve already had folks digging into some interesting applications: detecting endangered wildlife, traffic analysis, satellites, even intelligent toys.
I can process a frame in around three seconds, largely thanks to heavy use of the embedded GPU for the math. I had to spend quite a lot of time writing custom assembler programs for the Pi’s 12 parallel ‘QPU’ processors, but I’m grateful I could get access at that low a level. Broadcom only released the technical specs for their graphics chip in the last few months, and it’s taken a community effort to turn that into a usable set of examples and compilers. I ended up heavily patching one of the available assemblers to support more instructions, and created a set of helper macros for programming the DMA controller, so I’ve released those all as open source. I wish more manufacturers would follow Broadcom’s lead and give us access to their GPUs at the assembler level. There’s a lot of power in those chips, but it’s so hard to tune algorithms to make use of them without being able to see how they work.
Download the library, give it a try, and let me know about projects you use it on. I’m looking forward to hearing about what you come up with!
Yesterday I was suddenly struck by a thought – I used to be a coder, now I teach computers to write their own programs. With the deep belief systems I’m using for computer vision, I spend most of my time creating an environment that allows the machines to decide how they want to solve problems, rather than dictating the solution myself. I’m starting to feel a lot more like a teacher than a programmer, so here’s what it’s like to teach a classroom of graphics cards.
I have to spend a lot of time figuring out how to collect a large training set of images, which have to represent the kind of pictures that the algorithm will be likely to encounter. That means you can’t just re-use photos from cell phones if you’re targeting a robotics application. The lighting, viewing angles, and even the ‘fisheye’ geometry of the lens all have to be consistent with what the algorithm will encounter in the real world or you’ll end up with poor results. I also have to make sure the backgrounds of the images are as random as possible, because if the objects I’m looking for always occur in a similar setting in the training, I’ll end up detecting that rather than the thing I actually care about.
Another crucial step is deciding what the actual categories I’m going to recognize are. They have to be the kind of thing that’s quite different between images, so separating cats from dogs is more likely to work than distinguishing American from British short-hair cat breeds. There are often edge cases too, so to get consistent categorization I’ll spend some time figuring out rules. If I’m looking for hipsters with mustaches, how much stubble does somebody need on their upper lip before they count? What if they have a mustache as part of a beard?
Once I’ve done all that, I have to label at least a thousand images for each category, often with up to a million images in total. This means designing a system to capture likely images from the web or other sources, with a UI that lets me view them rapidly and apply labels to any that fall into a category I care about. I always start by categorizing the first ten thousand or so images myself so I can get a feel for how well the categorization rules work, and what the source images are like overall. Once I’m happy the labeling process works, I’ll get help from the rest of our team, and then eventually bring in Mechanical Turk workers to speed up the process.
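To make the thousand-images-per-category rule concrete, here’s a minimal sketch in Python of the kind of check you could run over a list of labels. The function name and the sample labels are made up for illustration; it isn’t part of any real labeling tool.

```python
from collections import Counter

# From the text: at least a thousand labeled images for each category.
MIN_PER_CATEGORY = 1000


def underfilled_categories(labels, minimum=MIN_PER_CATEGORY):
    """Return {category: count} for categories with fewer than `minimum` labels."""
    counts = Counter(labels)
    return {category: n for category, n in counts.items() if n < minimum}


# Hypothetical labels, one entry per labeled image.
labels = ["cat"] * 1200 + ["dog"] * 950 + ["raccoon"] * 40
print(underfilled_categories(labels))
```

Anything that shows up in the result needs more examples before it’s worth starting a training run.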
One advantage I have over my conventional teacher friends is that I get to design my own students! This is one of the least-understood parts of the deep learning process though, with most vision solutions sticking pretty close to the setup described in the original Krizhevsky paper. There are several basic components that I have to arrange in a pipeline, repeating some of them several times with various somewhat-arbitrary transformations in between. There are a lot of obscure choices to make about ordering and other parameters, and you won’t know if something’s an improvement until after you’ve done a full training run, which can easily take weeks. This means that, as one of my friends put it, we have an entire generation of graduate students trying to find improvements by trying random combinations in parallel. It’s a particularly painful emulation of a genetic algorithm since it’s powered by consuming a chunk of people’s careers, but until we have more theory behind deep learning, the only way to make progress is by using architectures that have been found to work in the past.
The training process itself involves repeatedly looping through all of the labeled images, and rewarding or punishing the neural connections in your network depending on how correctly they respond to each photo. This process is similar to natural learning: as more examples are seen, the system starts to understand more about the patterns they have in common and the success rate increases. In practice deep neural networks are extremely fussy learners though, and I spend most of my time trying to understand why they’re bone-headedly not improving when they should be. There can be all sorts of problems: poorly chosen categories, bad source images, incorrectly classified objects, a network layout that doesn’t work, or bugs in the underlying code. I can’t ask the network why it’s not learning because we just don’t have good debugging tools, so I’ll usually end up simplifying the system to eliminate possible causes and try solutions more quickly than I could with a full run.
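The basic loop is easier to see in a toy version. This is a sketch only: it trains a single logistic unit rather than a deep network, and the two-number “images” and all the names are invented for the example. The shape is the same though: loop over the labeled examples, compare the guess to the label, and nudge each connection up or down.

```python
import math
import random


def predict(weights, bias, features):
    """Probability that the example belongs to the positive class."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, learning_rate=0.5, seed=0):
    """Repeatedly loop over labeled examples, adjusting weights after each one."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    examples = list(examples)  # copy so shuffling doesn't disturb the caller
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, label in examples:
            # Positive error: the guess was too low; negative: too high.
            error = label - predict(weights, bias, features)
            for i, x in enumerate(features):
                # Each weight moves in proportion to its input and the error.
                weights[i] += learning_rate * error * x
            bias += learning_rate * error
    return weights, bias


# Toy stand-in for labeled images: two made-up features, label 1 or 0.
data = [([0.9, 0.1], 1), ([0.8, 0.3], 1), ([0.2, 0.9], 0), ([0.1, 0.7], 0)]
weights, bias = train(data)
accuracy = sum((predict(weights, bias, f) > 0.5) == bool(y) for f, y in data) / len(data)
print(accuracy)
```

A real network has millions of weights and many layers, which is a big part of why it’s so much harder to tell what’s gone wrong when the accuracy refuses to climb.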
Training can take a long time for problems like recognizing the 1,000 ImageNet categories, on the order of a couple of weeks. At any point the process can spiral out of control or hit a bug, so I have to check the output logs several times a day to see how it’s doing. My girlfriend has become resigned to me tending ‘the brain’ in the corner of our living room in breaks between evening TV. Even if nothing’s gone dramatically wrong, several of the parameters need to be changed as the training process progresses to keep the learning rate up, and knowing when to make those changes is much more of an art than a science.
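One common rule of thumb for those mid-training changes is to shrink the step size once the loss stops improving. The sketch below is an assumption about how you might automate that, not a description of my own process; the function name, the thresholds, and the loss values are all hypothetical.

```python
def next_learning_rate(current_rate, loss_history, patience=3, factor=0.5, min_delta=1e-3):
    """Cut the step size if the last `patience` losses brought no real improvement."""
    if len(loss_history) <= patience:
        return current_rate  # not enough history to judge yet
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    if best_before - best_recent < min_delta:
        return current_rate * factor  # plateau: take smaller steps
    return current_rate


# Hypothetical loss values read from the training logs.
still_improving = [2.3, 1.9, 1.6, 1.4, 1.2, 1.0]
plateaued = [2.3, 1.9, 1.6, 1.6, 1.61, 1.6]
print(next_learning_rate(0.01, still_improving))
print(next_learning_rate(0.01, plateaued))
```

Picking `patience` and `min_delta` is exactly the part that stays more art than science, since a noisy loss curve can look like a plateau for a while and then start dropping again.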
Once I’ve got a model fully trained, I have to figure out how well it works in practice. You might think it would be easy to evaluate a computer, since they don’t have all the human problems of performance anxiety or distraction, but this part can actually be quite tough. As part of the training process I’m continually running numerical tests on how many right and wrong answers the system is giving, but, like standardized tests for kids, these only tell part of the story. One of the big advantages I’ve found with deep learning systems is that they make more understandable mistakes than other approaches. For example, users are a lot more forgiving if a picture of a cat is mis-labeled as a raccoon than if it’s categorized as a coffee cup! That kind of information is lost when we boil down performance into a single number, so I have to dive deeper.
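Here’s a small illustration of how a single number hides that difference. The two “models” and their predictions are invented for the example; the point is only that they score identically on accuracy while making very different kinds of mistake.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of examples where the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusions(true_labels, predicted_labels):
    """Count each (true, predicted) pair where the model got it wrong."""
    return Counter((t, p) for t, p in zip(true_labels, predicted_labels) if t != p)


# Hypothetical results from two models on the same four photos.
truth = ["cat", "cat", "cat", "dog"]
model_a = ["cat", "raccoon", "cat", "dog"]
model_b = ["cat", "coffee cup", "cat", "dog"]

print(accuracy(truth, model_a), confusions(truth, model_a))
print(accuracy(truth, model_b), confusions(truth, model_b))
```

Both come out at 75% accurate, but only the breakdown of which label was confused with which tells you that one of them will look much sillier to users.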
The real test is building the model into an application and getting it in front of users. Often I’ll end up tweaking the results of the algorithm based on how I observe people reacting to it, for example suppressing the nematode label because it’s the default when the image is completely black. I’ll often spot more involved problems that require changes at the training set level, which will require another cycle through the whole process once they’re important enough to tackle.
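The simplest kind of tweak is a filter between the model and the user. As a rough sketch, assuming the model hands back a list of (label, score) pairs, which is my assumption for this example rather than any particular SDK’s format:

```python
# Labels hidden from users; 'nematode' is the example from the text.
SUPPRESSED_LABELS = {"nematode"}


def visible_predictions(predictions, suppressed=SUPPRESSED_LABELS):
    """Drop suppressed labels and return the rest, highest score first."""
    kept = [(label, score) for label, score in predictions if label not in suppressed]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical raw output for an almost completely black image.
raw = [("nematode", 0.41), ("raccoon", 0.12), ("cat", 0.33)]
print(visible_predictions(raw))
```

This only papers over the symptom. The underlying fix, teaching the network what a black image really is, means going back to the training set.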
As you can see, being a computer trainer is a strange job, but as we get better at building systems that can learn, I bet it’s going to be increasingly common. The future may well belong to humble humans who work well with intelligent machines.