Why are Eight Bits Enough for Deep Neural Networks?


Picture by Retronator

Deep learning is a very weird technology. It evolved over decades on a very different track than the mainstream of AI, kept alive by the efforts of a handful of believers. When I started using it a few years ago, it reminded me of the first time I played with an iPhone – it felt like I’d been handed something that had been sent back to us from the future, or alien technology.

One of the consequences of that is that my engineering intuitions about it are often wrong. When I came across im2col, the memory redundancy seemed crazy, based on my experience with image processing, but it turns out it’s an efficient way to tackle the problem. While there are more complex approaches that can yield better results, they’re not the ones my graphics background would have predicted.

Another key area that seems to throw a lot of people off is how much precision you need for the calculations inside neural networks. For most of my career, precision loss has been a fairly easy thing to estimate. I almost never needed more than 32-bit floats, and if I did it was because I’d screwed up my numerical design and I had a fragile algorithm that would go wrong pretty soon even with 64 bits. 16-bit floats were good for a lot of graphics operations, as long as they weren’t chained together too deeply. I could use 8-bit values for a final output for display, or at the end of an algorithm, but they weren’t useful for much else.

It turns out that neural networks are different. You can run them with eight-bit parameters and intermediate buffers, and suffer no noticeable loss in the final results. This was astonishing to me, but it's something that's been re-discovered over and over again. My colleague Vincent Vanhoucke has the only paper I've found covering this result for deep networks, but I've seen with my own eyes how it holds true across every application I've tried it on. I've also had to convince almost every other engineer I've told about this that I'm not crazy, and then watch them prove it to themselves by running their own tests, so this post is an attempt to short-circuit some of that!

How does it work?

You can see an example of a low-precision approach in the Jetpac mobile framework, though to keep things simple I keep the intermediate calculations in float and just use eight bits to compress the weights. Nervana's NEON library also supports fp16, though not eight-bit yet. As long as you accumulate to 32 bits when you're doing the long dot products that are the heart of the fully-connected and convolution operations (and that take up the vast majority of the time), you don't need float at all; you can keep all your inputs and outputs as eight bit. I've even seen evidence that you can drop a bit or two below eight without too much loss! The pooling layers are fine at eight bits too. I've generally seen the bias addition and activation functions (other than the trivial relu) done at higher precision, but 16 bits seems fine even for those.
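To make that concrete, here's a minimal sketch of an eight-bit dot product with 32-bit accumulation. The scale and zero-point names are my own shorthand for how a quantized buffer maps back to real values, not any particular library's API:

```python
import numpy as np

def quantized_dot(x_q, w_q, x_scale, x_zero, w_scale, w_zero):
    """Eight-bit dot product with 32-bit accumulation.

    x_q and w_q are uint8 buffers. Each maps back to real values as
    scale * (code - zero_point); the scale/zero-point parameters here are
    illustrative, not a specific framework's interface.
    """
    # Widen to 32 bits before multiplying, so neither the products nor the
    # running sum can overflow during a long dot product.
    acc = np.dot(x_q.astype(np.int32) - x_zero,
                 w_q.astype(np.int32) - w_zero)
    # One float multiply at the end maps the integer total back to a real value.
    return acc * (x_scale * w_scale)

# A 1,152-element dot product, a typical inner dimension for these layers.
rng = np.random.default_rng(0)
x_q = rng.integers(0, 256, size=1152, dtype=np.uint8)
w_q = rng.integers(0, 256, size=1152, dtype=np.uint8)
y = quantized_dot(x_q, w_q, x_scale=0.05, x_zero=128, w_scale=0.01, w_zero=128)
```

Everything inside the long loop stays in integers; the only float work is the single multiply at the end.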

I’ve generally taken networks that have been trained in full float and down-converted them afterwards, since I’m focused on inference, but training can also be done at low precision. Knowing that you’re aiming at a lower-precision deployment can make life easier too, even if you train in float, since you can do things like place limits on the ranges of the activation layers.
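For the down-conversion itself, the simplest scheme is a linear mapping from each weight array's min/max range onto the 0-255 code space. Here's a rough sketch of that idea (the function names are mine, and this isn't the exact Jetpac implementation):

```python
import numpy as np

def quantize_weights(weights):
    """Map a float weight array linearly onto 0..255 using its min/max range.

    Returns the uint8 codes plus the range needed to reconstruct approximate
    float values later.
    """
    w_min = float(weights.min())
    w_max = float(weights.max())
    spread = (w_max - w_min) if w_max > w_min else 1.0  # guard against constant weights
    codes = np.round((weights - w_min) * (255.0 / spread)).astype(np.uint8)
    return codes, w_min, w_max

def dequantize_weights(codes, w_min, w_max):
    """Recover approximate float weights from the eight-bit codes."""
    spread = (w_max - w_min) if w_max > w_min else 1.0
    return codes.astype(np.float32) * (spread / 255.0) + w_min

# Round-tripping costs at most half a quantization step of error per weight,
# which in practice doesn't change the network's results noticeably.
w = np.random.randn(1152, 192).astype(np.float32)
codes, lo, hi = quantize_weights(w)
w_approx = dequantize_weights(codes, lo, hi)
```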

Why does it work?

I can’t see any fundamental mathematical reason why the results should hold up so well with low precision, so I’ve come to believe that it emerges as a side-effect of a successful training process. When we are trying to teach a network, the aim is to have it understand the patterns that are useful evidence and discard the meaningless variations and irrelevant details. That means we expect the network to be able to produce good results despite a lot of noise. Dropout is a good example of synthetic grit being thrown into the machinery, so that the final network can function even with very adverse data.

The networks that emerge from this process have to be very robust numerically, with a lot of redundancy in their calculations so that small differences in input samples don’t affect the results. Compared to differences in pose, position, and orientation, the noise in images is actually a comparatively small problem to deal with. All of the layers are affected by those small input changes to some extent, so they all develop a tolerance to minor variations. That means that the differences introduced by low-precision calculations are well within the tolerances a network has learned to deal with. Intuitively, they feel like weebles that won’t fall down no matter how much you push them, thanks to an inherently stable structure.

At heart I'm an engineer, so I've been happy to see that it works in practice without worrying too much about why; I don't want to look a gift horse in the mouth! What I've laid out here is my best guess at the cause of this property, but I would love to see a more principled explanation if any researchers want to investigate more thoroughly. [Update – here's a related paper from Matthieu Courbariaux, thanks Scott!]

What does this mean?

This is very good news for anyone trying to optimize deep neural networks. On the general CPU side, modern SIMD instruction sets are often geared towards float, and so eight bit calculations don’t offer a massive computational advantage on recent x86 or ARM chips. DRAM access takes a lot of electrical power though, and is slow too, so just reducing the bandwidth by 75% can be a very big help. Being able to squeeze more values into fast, low-power SRAM cache and registers is a win too.

GPUs were originally designed to take eight bit texture values, perform calculations on them at higher precisions, and then write them back out at eight bits again, so they’re a perfect fit for our needs. They generally have very wide pipes to DRAM, so the gains aren’t quite as straightforward to achieve, but can be exploited with a bit of work. I’ve learned to appreciate DSPs as great low-power solutions too, and their instruction sets are geared towards the sort of fixed-point operations we need. Custom vision chips like Movidius’ Myriad are good fits too.

Deep networks’ robustness means that they can be implemented efficiently across a very wide range of hardware. Combine this flexibility with their almost-magical effectiveness at a lot of AI tasks that have eluded us for decades, and you can see why I’m so excited about how they will alter our world over the next few years!

Jetpac’s deep learning framework on the Beaglebone Black


Photo by Michael Nika

I’ve been having a lot of fun porting the Jetpac image recognition library to new and tinier devices, and the latest addition to the family is the Beaglebone Black. As I mentioned in my Raspberry Pi 2 port, the Eigen math library has had a lot of effort put into ARM optimizations by Benoit Jacob and Benoit Steiner recently, and I was able to use it to good effect on the Beagle. The overall time for the 1,000 category Imagenet task was 5.5 seconds, not enough for real time but still promising for a lot of applications with smaller networks or slower response needs. The default OS was a bit long in the tooth though, so I had to patch Eigen to get NEON support working.

I also updated the general project documentation to describe how to build the library from source on a new device, since I’ve been seeing it pop up as a “hello world” for deep networks on new platforms. Thanks to everyone who’s reached out, it’s been great hearing about all the cool projects out there. I just can’t wait until I get my hands on a CHIP to see how much performance we can squeeze out of a $9 computer!

Image Recognition on the Raspberry Pi 2


Photo by Shashinjutsu

I loved the original Raspberry Pi; it was a great platform to run deep neural networks on, especially with a fully-programmable GPU. I was excited when the new Pi 2 was released, because it was even more powerful for the same low price. Unfortunately, I heard back from early users that the GPU code I had been using no longer worked: the device just crashed when the example program was run.

I ordered a Pi 2, and this weekend I was finally able to devote a few hours to debugging the problems. The bad news is that I wasn't able to figure out why the GPU code is being problematic. The good news is that the CPU is so much improved on the Pi 2 that I'm able to run even faster without the GPU, at 3.2 seconds!

I've checked in my changes, and you can see full directions in the README, but the summary is that by using Eigen and gcc 4.8, NEON code on the CPU is able to run the matrix calculations very fast. One of my favorite parts of joining Google has been all the open-source heroes I've been able to hang out with, and I've gotten to know Benoit Jacob, the founder of the Eigen project, and Benoit Steiner, a top contributor. I knew they'd been doing amazing work improving ARM performance, so I was hopeful that the latest version would be a big step forward. I was pleased to discover that the top of tree is almost 25% faster than the last stable release in January!

Let me know how you get on if you do dive in. I’ve had a lot of fun with this, and I hope you do too!

Why you should visit the Living Computer Museum in Seattle


I’d never heard of the Living Computer Museum, and even when Joanne suggested it as a great geeky excursion during a visit to Seattle I wasn’t sure what to expect. We’d planned to spend thirty minutes there on the way to somewhere else, but we ended up staying for three hours to explore everything they had. I was expecting a smaller copy of Silicon Valley’s Computer History Museum, but what I found was a lot more interesting. As soon as I walked in, the first exhibit I saw was a PDP-7 with a teletype attached that I could actually play with!

After some failed tries, I typed in ‘help’ and saw a list of the commands scroll by. After that, I was able to generate a directory listing of the file system, even though it felt painfully slow at what seemed like just one character per second. Ken Thompson built the first version of Unix for a PDP-7, so I felt a kind of awe at being able to experience what our technical ancestors had to go through to build the foundations of our world.


At least they never had to face dysentery though. It's not just ancient mainframes that the museum keeps alive and playable; they also have a wide selection of seventies and eighties personal computers, with programs ready to run. I saw a lot of nostalgia as people rediscovered games and machines from their childhoods, whether it was Oregon Trail on an Apple IIe, or Ms. Pac-Man on a VIC-20.


I had fun playing through Oregon Trail, until one of my party contracted typhoid and I decided that was a good point to take a break. What really pulled me in was their collection of mainframe and mini-computers though. When I was 15, I did work experience at a chemical factory that ran on a DEC VAX. That week left me with a burning curiosity to know more, especially after I saw the wall of manuals.


I had leafed through a few of them while I was there, but I couldn’t imagine anything more fun than having access to all of the secrets they could teach me about these amazing devices. Those orange and grey binders might not look like much now, but to the teenage me they represented a ticket to somewhere magical.


That meant I was excited to learn that the museum offers remote login access to the VAX, and a couple of other old systems. I have no idea when I'd find the free time to start learning an ancient OS, but I love knowing that I have the chance.

The exhibits themselves are definitely very focused on the seventies and eighties, which made a lot more sense once I learned the nucleus of the collection was Paul Allen’s personal memorabilia. It does lead to some awkward dancing around the Apple/Microsoft relationship, and the rise of the internet isn’t really covered, but I think being selective helps overall. If you spent time with machines in that era, or you’re interested in getting a taste of what those of us who did had to put up with, I highly recommend paying a visit, I bet you’ll have fun!


Why GEMM is at the heart of deep learning


Photo by Anthony Catalano

I spend most of my time worrying about how to make deep learning with neural networks faster and more power efficient. In practice that means focusing on a function called GEMM. It’s part of the BLAS (Basic Linear Algebra Subprograms) library that was first created in 1979, and until I started trying to optimize neural networks I’d never heard of it. To explain why it’s so important, here’s a diagram from my friend Yangqing Jia’s thesis:

[Diagram: time spent per layer for Imagenet-style image recognition, from Yangqing Jia's thesis]

This is breaking down where the time’s going for a typical deep convolutional neural network doing image recognition using Alex Krizhevsky’s Imagenet architecture. All of the layers that start with fc (for fully-connected) or conv (for convolution) are implemented using GEMM, and almost all the time (95% of the GPU version, and 89% on CPU) is spent on those layers.

So what is GEMM? It stands for GEneral Matrix to Matrix Multiplication, and it essentially does exactly what it says on the tin: it multiplies two input matrices together to get an output one. The difference between it and the kind of matrix operations I was used to in the 3D graphics world is that the matrices it works on are often very big. For example, a single layer in a typical network may require the multiplication of a 256 row, 1,152 column matrix by an 1,152 row, 192 column matrix to produce a 256 row, 192 column result. Naively, that requires 57 million (256 x 1,152 x 192) floating point operations, and there can be dozens of these layers in a modern architecture, so I often see networks that need several billion FLOPs to calculate a single frame. Here's a diagram that I sketched to help me visualize how it works:

[Diagram: sketch of a GEMM multiplying two matrices to produce an output matrix]
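To put those sizes in perspective, here's a tiny sketch that runs the example layer above as a single GEMM (the matrix sizes are the ones from the text; everything else is just illustrative):

```python
import numpy as np

# The sizes from the example above: a 256 x 1,152 matrix times a
# 1,152 x 192 matrix produces a 256 x 192 result.
m, k, n = 256, 1152, 192
a = np.random.rand(m, k).astype(np.float32)
b = np.random.rand(k, n).astype(np.float32)

c = a @ b  # the GEMM: each of the m x n output values is a k-long dot product

print(c.shape)    # (256, 192)
print(m * k * n)  # 56,623,104 multiply-accumulates, the ~57 million mentioned above
```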

Fully-Connected Layers

Fully-connected layers are the classic neural networks that have been around for decades, and it’s probably easiest to start with how GEMM is used for those. Each output value of an FC layer looks at every value in the input layer, multiplies them all by the corresponding weight it has for that input index, and sums the results to get its output. In terms of the diagram above, it looks like this:

[Diagram: a fully-connected layer expressed as a GEMM]

There are ‘k’ input values, and there are ‘n’ neurons, each one of which has its own set of learned weights for every input value. There are ‘n’ output values, one for each neuron, calculated by doing a dot product of its weights and the input values.
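As a concrete sketch of that (the sizes and names here are my own, purely for illustration), a fully-connected layer can be written as exactly the matrix product just described:

```python
import numpy as np

def fully_connected(inputs, weights):
    """A fully-connected layer as a GEMM.

    inputs:  a batch of examples, one row per example, 'k' values each.
    weights: an 'n' x 'k' matrix, one row of learned weights per neuron.
    Each output value is the dot product of one weight row with one input row.
    """
    return inputs @ weights.T  # one row per example, one column per neuron

k, n = 1152, 192                               # illustrative sizes only
x = np.random.rand(256, k).astype(np.float32)  # a batch of 256 input vectors
w = np.random.rand(n, k).astype(np.float32)    # n neurons, each with k weights
y = fully_connected(x, w)                      # shape (256, 192)
```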

Convolutional Layers

Using GEMM for the convolutional layers is a lot less of an obvious choice. A conv layer treats its input as a two dimensional image, with a number of channels for each pixel, much like a classical image with width, height, and depth. Unlike the images I was used to dealing with though, the number of channels can be in the hundreds, rather than just RGB or RGBA!

The convolution operation produces its output by taking a number of 'kernels' of weights, and applying them across the image. Here's what an input image and a single kernel look like:

[Diagram: an input image and a single convolution kernel]

Each kernel is another three-dimensional array of numbers, with the depth the same as the input image, but with a much smaller width and height, typically something like 7×7. To produce a result, a kernel is applied to a grid of points across the input image. At each point where it’s applied, all of the corresponding input values and weights are multiplied together, and then summed to produce a single output value at that point. Here’s what that looks like visually:

[Diagram: a kernel applied across a grid of points on the input image]
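Before getting to the GEMM version, here's a direct nested-loop sketch of the operation just described, to make the multiplies and sums explicit (no edge padding, and the names are my own, purely illustrative):

```python
import numpy as np

def conv2d_direct(image, kernels, stride=1):
    """Direct convolution: slide each kernel across the input and sum.

    image:   (height, width, depth) input values.
    kernels: (num_kernels, k_h, k_w, depth); depth must match the image.
    Returns a (out_h, out_w, num_kernels) output.
    """
    h, w, d = image.shape
    num_k, k_h, k_w, _ = kernels.shape
    out_h = (h - k_h) // stride + 1
    out_w = (w - k_w) // stride + 1
    output = np.zeros((out_h, out_w, num_k), dtype=np.float32)
    for y in range(out_h):
        for x in range(out_w):
            # The little 3D cube of input values under this kernel position.
            patch = image[y * stride:y * stride + k_h,
                          x * stride:x * stride + k_w, :]
            for k in range(num_k):
                # Multiply every input value by the matching weight and sum
                # them into a single output value at this point.
                output[y, x, k] = np.sum(patch * kernels[k])
    return output
```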

You can think of this operation as something like an edge detector. The kernel contains a pattern of weights, and when the part of the input image it’s looking at has a similar pattern it outputs a high value. When the input doesn’t match the pattern, the result is a low number in that position. Here are some typical patterns that are learned by the first layer of a network, courtesy of the awesome Caffe and featured on the NVIDIA blog:

[Image: 96 first-layer kernels learned by the network, visualized as RGB patches]

Because the input to the first layer is an RGB image, all of these kernels can be visualized as RGB too, and they show the primitive patterns that the network is looking for. Each one of these 96 kernels is applied in a grid pattern across the input, and the result is a series of 96 two-dimensional arrays, which are treated as an output image with a depth of 96 channels. If you’re used to image processing operations like the Sobel operator, you can probably picture how each one of these is a bit like an edge detector optimized for different important patterns in the image, and so each channel is a map of where those patterns occur across the input.

You may have noticed that I’ve been vague about what kind of grid the kernels are applied in. The key controlling factor for this is a parameter called ‘stride’, which defines the spacing between the kernel applications. For example, with a stride of 1, a 256×256 input image would have a kernel applied at every pixel, and the output would be the same width and height as the input. With a stride of 4, that same input image would only have kernels applied every four pixels, so the output would only be 64×64. Typical stride values are less than the size of a kernel, which means that in the diagram visualizing the kernel application, a lot of them would actually overlap at the edges.
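To put numbers on that example (assuming the edges are padded so a kernel is applied at every stride step, which is what those sizes imply), the output size is just the input size divided by the stride, rounded up:

```python
import math

def output_size(input_size, stride):
    # Number of kernel applications along one dimension, with the edges
    # padded so the grid of applications covers the whole input.
    return math.ceil(input_size / stride)

print(output_size(256, 1))  # 256: applied at every pixel, same size as the input
print(output_size(256, 4))  # 64: applied every four pixels
```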

How GEMM works for Convolutions

This seems like quite a specialized operation. It involves a lot of multiplications and summing at the end, like the fully-connected layer, but it’s not clear how or why we should turn this into a matrix multiplication for the GEMM. I’ll talk about the motivation at the end, but here’s how the operation is expressed in terms of a matrix multiplication.

The first step is to turn the input from an image, which is effectively a 3D array, into a 2D array that we can treat like a matrix. Where each kernel is applied is a little three-dimensional cube within the image, and so we take each one of those cubes of input values and copy them out as a single column into a matrix. This is known as im2col, for image-to-column, I believe from an original Matlab function, and here’s how I visualize it:

[Diagram: the im2col transformation, copying each input patch into a column of a matrix]
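Here's a naive sketch of im2col following that description: each kernel-sized cube of input values becomes one column of the output matrix. The looping and names are my own; real implementations are much more heavily optimized:

```python
import numpy as np

def im2col(image, k_h, k_w, stride):
    """Copy each kernel-sized cube of the input into one column of a matrix.

    image: (height, width, depth). Returns a matrix with k_h * k_w * depth
    rows ('k' in the text) and one column per patch, plus the output width
    and height so the result can be reassembled later.
    """
    h, w, d = image.shape
    out_h = (h - k_h) // stride + 1
    out_w = (w - k_w) // stride + 1
    cols = np.zeros((k_h * k_w * d, out_h * out_w), dtype=image.dtype)
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y * stride:y * stride + k_h,
                          x * stride:x * stride + k_w, :]
            cols[:, y * out_w + x] = patch.ravel()
    return cols, out_h, out_w
```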

Now if you’re an image-processing geek like me, you’ll probably be appalled at the expansion in memory size that happens when we do this conversion if the stride is less than the kernel size. This means that pixels that are included in overlapping kernel sites will be duplicated in the matrix, which seems inefficient. You’ll have to trust me that this wastage is outweighed by the advantages though.

Once you have the input image in matrix form, you do the same for each kernel's weights, serializing the 3D cubes into rows to build the weight matrix for the multiplication. Here's what the final GEMM looks like:

[Diagram: the convolution expressed as a single GEMM on the im2col matrix]

Here 'k' is the number of values in each patch and kernel, so it's kernel width * kernel height * depth. The resulting matrix has one row per kernel and one column per patch. This matrix is actually treated as a 3D array by subsequent operations, by taking the number of kernels dimension as the depth, and then splitting the patches back into rows and columns based on their original position in the input image.
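Putting the pieces together, here's how that looks in code, reusing the hypothetical im2col() sketch above: the kernels are flattened into rows, the GEMM produces one row per kernel and one column per patch, and a reshape turns that back into an image with one channel per kernel.

```python
import numpy as np

def conv2d_gemm(image, kernels, stride=1):
    """Convolution expressed as a single GEMM, using im2col() from above.

    kernels: (num_kernels, k_h, k_w, depth). Each kernel becomes one row of
    the weight matrix; each patch is one column of the im2col matrix.
    """
    num_k, k_h, k_w, _ = kernels.shape
    patches, out_h, out_w = im2col(image, k_h, k_w, stride)
    weight_matrix = kernels.reshape(num_k, -1)  # num_kernels rows, k columns
    result = weight_matrix @ patches            # the GEMM itself
    # Kernels become the depth; patches go back to their row/column positions.
    return result.reshape(num_k, out_h, out_w).transpose(1, 2, 0)
```

On the same inputs this gives the same results as the direct nested-loop version earlier; the difference is that almost all of the work is now concentrated in one large, very regular matrix multiplication.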

Why GEMM works for Convolutions

Hopefully you can now see how you can express a convolutional layer as a matrix multiplication, but it’s still not obvious why you would do it. The short answer is that it turns out that the Fortran world of scientific programmers has spent decades optimizing code to perform large matrix to matrix multiplications, and the benefits from the very regular patterns of memory access outweigh the wasteful storage costs. This paper from Nvidia is a good introduction to some of the different approaches you can use, but they also describe why they ended up with a modified version of GEMM as their favored approach. There are also a lot of advantages to being able to batch up a lot of input images against the same kernels at once, and this paper on Caffe con troll uses those to very good effect. The main competitor to the GEMM approach is using Fourier transforms to do the operation in frequency space, but the use of strides in our convolutions makes it hard to be as efficient.

The good news is that having a single, well-understood function taking up most of our time gives a very clear path to optimizing for speed and power usage, both with better software implementations and by tailoring the hardware to run the operation well. Because deep networks have proven to be useful for a massive range of applications across speech, NLP, and computer vision, I’m looking forward to seeing massive improvements over the next few years, much like the widespread demand for 3D games drove a revolution in GPUs by forcing a revolution in vertex and pixel processing operations.

(Updated to fix my incorrect matrix ordering in the diagrams, apologies to anyone who was confused!)

Give Bay Area girls a head-start in tech


This summer the Stanford AI Lab has a two-week program called SAILORS aimed at local 9th-grade girls, and I think it's a wonderful chance to give promising students a strong start in a very important field. It's a scrappy grass-roots initiative within the organization though, so they do need financial support to help pay for attendees' expenses! There's no online way to sponsor the program unfortunately, but if you email them at sailors-finance@cs.stanford.edu, they'll be able to help you donate. In their own words, here's what the program is trying to accomplish:

  • To simultaneously educate and excite students about the field of AI by providing exposure to a variety of AI topics, discussing in-depth some of the cutting-edge AI research, and exploring the societal impacts of AI.
  • To foster personal growth through career development workshops, mentoring, and social events.
  • To provide students a hands-on experience with real research projects in the AI Lab.
  • The program also aims to build a close-knit community and encourage interest among underrepresented minorities in the field.

I think this is important because it’s a practical and immediate way to do something at the grassroots to address the inequalities that plague our industry, and the local area. It’s just one step, but I think it can make a real difference to the lives of the attendees.

I do have a personal reason for supporting this effort. I grew up as a "Townie" in Cambridge, England, and I was fascinated by the university but never had a chance as a child to experience what it had to offer. I think it's sad, for both sides' sake, that the university was so cut off from its local community. One of the things I love about America is that universities are far more open to the outside world, with a lot of people I know taking Stanford's continuing studies program, for example, or its online courses. There are still incredible contrasts of course, like between East Palo Alto and the main city, but at least the university is actively trying to do something about the problems.

If you can help, or if you know students who might benefit from this program, do reach out to sailors@cs.stanford.edu for more details. I’m excited to see what this initiative can accomplish!

Five short links


Photo by Brian Schoonover

Understanding genre in a collection of a million volumes – This project achieved 97% precision in identifying whether a book was poetry, prose, or non-fiction. Machines are never going to replace human scholars, but I know they can help them answer questions that would have been impossible to tackle in the past.

OpenAddresses – A wonderful resource for building geocoding tools, and one we’ve needed for a long time, I’m excited to see this collection growing.

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture – I know I'm a broken record on deep learning, but almost everywhere it's being applied, it's doing better than techniques that people have been developing for decades. This example is particularly exciting because the results can be fed into other image processing algorithms; it's a big improvement in the foundations of our understanding of natural scenes.

Book Review: On the Road – Looking back on my reading growing up, I realize that the underlying appeal of a lot of books was a world where life would be easy, at least for the heroes and by extension me. I'll always remember the review of Philip K. Dick's work that pointed out his protagonists always had jobs, and often pretty unglamorous ones, and how unusual that was in sci-fi.

It’s not an asshole problem – it’s a bystander problem – More food for thought from Cate Huston, talking about some practical ways for men to help our industry’s awful gender ratio without making a big song and dance of it.
