Deep learning is a very weird technology. It evolved over decades on a very different track than the mainstream of AI, kept alive by the efforts of a handful of believers. When I started using it a few years ago, it reminded me of the first time I played with an iPhone – it felt like I’d been handed something that had been sent back to us from the future, or alien technology.
One of the consequences of that is that my engineering intuitions about it are often wrong. When I came across im2col, the memory redundancy seemed crazy, based on my experience with image processing, but it turns out it’s an efficient way to tackle the problem. While there are more complex approaches that can yield better results, they’re not the ones my graphics background would have predicted.
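For anyone who hasn't run into im2col, here's a minimal numpy sketch of the idea (the function and variable names are my own, just for illustration):

```python
import numpy as np

def im2col(img, k):
    # Unroll every k x k patch of a 2D image into a row, so that a
    # convolution becomes a single big matrix multiply. Each input
    # pixel gets copied up to k*k times, which is the memory
    # redundancy that looked so crazy to my graphics eyes.
    h, w = img.shape
    rows = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            rows.append(img[y:y + k, x:x + k].ravel())
    return np.array(rows)

img = np.arange(16.0).reshape(4, 4)
patches = im2col(img, 3)    # 4 patches x 9 values, from 16 input pixels
kernel = np.ones(9) / 9.0   # a 3x3 box filter, flattened
out = patches @ kernel      # the convolution as one matrix product
```

Even in this toy case the 16 input pixels balloon into 36 patch values, but trading that memory for one dense matrix multiply is what makes the approach fast in practice.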
Another key area that seems to throw a lot of people off is how much precision you need for the calculations inside neural networks. For most of my career, precision loss has been a fairly easy thing to estimate. I almost never needed more than 32-bit floats, and if I did it was because I’d screwed up my numerical design and I had a fragile algorithm that would go wrong pretty soon even with 64 bits. 16-bit floats were good for a lot of graphics operations, as long as they weren’t chained together too deeply. I could use 8-bit values for a final output for display, or at the end of an algorithm, but they weren’t useful for much else.
It turns out that neural networks are different. You can run them with eight-bit parameters and intermediate buffers, and suffer no noticeable loss in the final results. This was astonishing to me, but it’s something that’s been re-discovered over and over again. My colleague Vincent Vanhoucke has the only paper I’ve found covering this result for deep networks, but I’ve seen with my own eyes how it holds true across every application I’ve tried it on. I’ve also had to convince almost every other engineer who I tell that I’m not crazy, and watch them prove it to themselves by running a lot of their own tests, so this post is an attempt to short-circuit some of that!
How does it work?
You can see an example of a low-precision approach in the Jetpac mobile framework, though to keep things simple I keep the intermediate calculations in float and just use eight bits to compress the weights. Nervana’s NEON library also supports fp16, though not eight-bit yet. As long as you accumulate to 32 bits when you’re doing the long dot products that are the heart of the fully-connected and convolution operations (and that take up the vast majority of the time), you don’t need float at all: you can keep all your inputs and outputs as eight bit. I’ve even seen evidence that you can drop a bit or two below eight without too much loss! The pooling layers are fine at eight bits too. I’ve generally seen the bias addition and activation functions (other than the trivial relu) done at higher precision, but 16 bits seems fine even for those.
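Here's a rough numpy sketch of that scheme: eight-bit inputs and weights, with the long dot product accumulated at 32 bits. The scaling choices are my own simplification for illustration, not any particular library's API.

```python
import numpy as np

rng = np.random.default_rng(42)
x_f = rng.uniform(-1, 1, 256).astype(np.float32)   # float activations
w_f = rng.uniform(-1, 1, 256).astype(np.float32)   # float weights

# Map each array onto signed eight-bit codes with a per-array scale.
x_scale = np.abs(x_f).max() / 127.0
w_scale = np.abs(w_f).max() / 127.0
x_q = np.round(x_f / x_scale).astype(np.int8)
w_q = np.round(w_f / w_scale).astype(np.int8)

# Accumulate the products at 32 bits so nothing overflows mid-sum:
# 256 terms of at most 127*127 each fits easily in an int32.
acc = np.dot(x_q.astype(np.int32), w_q.astype(np.int32))
result = acc * x_scale * w_scale   # back into real units

error = abs(result - np.dot(x_f, w_f))
```

The quantization error on a 256-element dot product like this stays tiny compared to the magnitude of the result, which is the property the whole post is about.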
I’ve generally taken networks that have been trained in full float and down-converted them afterwards, since I’m focused on inference, but training can also be done at low precision. Knowing that you’re aiming at a lower-precision deployment can make life easier too, even if you train in float, since you can do things like place limits on the ranges of the activation layers.
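A minimal sketch of that down-conversion for the weights, assuming a simple per-layer min/max linear encoding (the function names and details here are illustrative, not the exact format any framework uses):

```python
import numpy as np

def compress(weights):
    # Store each layer as eight-bit codes plus its float [min, max] range.
    lo, hi = float(weights.min()), float(weights.max())
    codes = np.round((weights - lo) / (hi - lo) * 255).astype(np.uint8)
    return codes, lo, hi

def decompress(codes, lo, hi):
    # Expand back to float at load time.
    return codes.astype(np.float32) / 255 * (hi - lo) + lo

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, (64, 64)).astype(np.float32)  # a trained layer, stand-in
codes, lo, hi = compress(w)
w8 = decompress(codes, lo, hi)

max_err = np.abs(w - w8).max()  # bounded by half a quantization step
```

The stored size drops by 4x versus float32, and the worst-case error per weight is half of one 1/255 step of the layer's range.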
Why does it work?
I can’t see any fundamental mathematical reason why the results should hold up so well with low precision, so I’ve come to believe that it emerges as a side-effect of a successful training process. When we are trying to teach a network, the aim is to have it understand the patterns that are useful evidence and discard the meaningless variations and irrelevant details. That means we expect the network to be able to produce good results despite a lot of noise. Dropout is a good example of synthetic grit being thrown into the machinery, so that the final network can function even with very adverse data.
The networks that emerge from this process have to be very robust numerically, with a lot of redundancy in their calculations so that small differences in input samples don’t affect the results. Compared to differences in pose, position, and orientation, the noise in images is a comparatively small problem to deal with. All of the layers are affected by those small input changes to some extent, so they all develop a tolerance to minor variations. That means that the differences introduced by low-precision calculations are well within the tolerances a network has learned to deal with. Intuitively, they feel like Weebles that won’t fall down no matter how much you push them, thanks to an inherently stable structure.
At heart I’m an engineer, so I’ve been happy to see that it works in practice without worrying too much about why; I don’t want to look a gift horse in the mouth! What I’ve laid out here is my best guess at the cause of this property, but I would love to see a more principled explanation if any researchers want to investigate more thoroughly. [Update – here’s a related paper from Matthieu Courbariaux, thanks Scott!]
What does this mean?
This is very good news for anyone trying to optimize deep neural networks. On the general CPU side, modern SIMD instruction sets are often geared towards float, and so eight bit calculations don’t offer a massive computational advantage on recent x86 or ARM chips. DRAM access takes a lot of electrical power though, and is slow too, so just reducing the bandwidth by 75% can be a very big help. Being able to squeeze more values into fast, low-power SRAM cache and registers is a win too.
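To make that bandwidth arithmetic concrete, here's the saving for a made-up fully-connected layer (the size is just for illustration):

```python
# Weights for a hypothetical 4096 x 4096 fully-connected layer.
params = 4096 * 4096
float_bytes = params * 4                 # 32-bit floats
int8_bytes = params * 1                  # eight-bit values
saving = 1 - int8_bytes / float_bytes    # fraction of DRAM traffic saved
print(float_bytes, int8_bytes, saving)   # 64 MB of weights shrinks to 16 MB
```

That 75% cut in traffic matters even when the raw arithmetic isn't any faster, because DRAM access dominates both power and latency.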
GPUs were originally designed to take eight bit texture values, perform calculations on them at higher precisions, and then write them back out at eight bits again, so they’re a perfect fit for our needs. They generally have very wide pipes to DRAM, so the gains aren’t quite as straightforward to achieve, but can be exploited with a bit of work. I’ve learned to appreciate DSPs as great low-power solutions too, and their instruction sets are geared towards the sort of fixed-point operations we need. Custom vision chips like Movidius’ Myriad are good fits too.
Deep networks’ robustness means that they can be implemented efficiently across a very wide range of hardware. Combine this flexibility with their almost-magical effectiveness at a lot of AI tasks that have eluded us for decades, and you can see why I’m so excited about how they will alter our world over the next few years!
This reminds me of Geoff Hinton’s talk about how neurons in the brain prefer to communicate single bits rather than real values. He touches on it about 6 minutes in and then gets back to it 45 minutes in – https://www.youtube.com/watch?v=DleXA5ADG78.
Some sort of “8-bit proof” can be taken straight from your notebook. Open any video recording tool and look at yourself. I hope you recognize yourself pretty fast. 🙂 The proof rests on the fact that your notebook camera very likely uses no more than 8 bits per BGR channel, and that appears to be enough for the task. Another consideration: a 1% change in a value is generally not significant, and one step of an 8-bit range is only about 0.4%, so it fits quite well. Of course, the math guys will level a lot of criticism at these “proofs”, but I think as a coarse assessment it does work.
About memory requirements, you’re right. After evaluating some of my tests, I discovered a machine needs about 3 terabytes of RAM just to mimic the neocortex of Homo sapiens at 8-bit precision. Having 64-bit computing cuts the range by 8. Sounds bad.
I think you’re getting confused between integer and floating point representation in your comment…
Binary ANNs use just one bit for weights. I have done some testing on binary ANNs, using the output of many of them to provide a consensus decision. This has usefulness in minimizing the memory of the ANN, while allowing them to run in parallel. Kanerva’s Sparse Distributed Memory uses 8-bit weights, although I suspect it will work with fewer bits.
I suspect that the more connections there are, the smaller the individual bit-size needs to be. At the limit, it may be just 1. Perhaps a better approach is to consider what aggregate number of bits is needed across a given network to reach some arbitrary accuracy. Are 10 connections at 32 bits about as accurate as 40 at 8 bits, or 320 at 1 bit each?
Pingback: nl comments on “Intel Buys Altera for $16.7B” | Exploding Ads
As Qian Xuesen mentioned in “Engineering Control Theory”, the deviation of a system may remain under control even when the deviations of its components are large, since the deviations of composite components may counteract each other.
I like this comment quoted from Qian Xuesen.
Pingback: Five Deep Links « Pete Warden's blog
Pingback: LessThunk.com « Why are Eight Bits Enough for Deep Neural Networks? | Pete Warden’s blog
You can reduce the numerical precision of a neuron model from floating point to discrete-value/integer networks, simplify the sigmoid function, use power-of-two weights, etc., and still get good results on many tasks. Techniques like this were used to successfully run neural networks on non-desktop CPUs that didn’t have a floating-point unit, e.g. mobile robotics platforms, FPGAs, etc. There is a list of citations on this topic dating back to 1988 at https://goo.gl/6nltlB, page 68.
Pingback: One Weird Trick for Faster Android Multithreading « Pete Warden's blog
what about this work on using fixed point numbers + stochastic rounding:
[PDF: 1502.02551.pdf]
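For context, here's a tiny numpy sketch of the stochastic rounding idea from that line of work (the function is my own illustration, not the paper's code):

```python
import numpy as np

def stochastic_round(x, rng):
    # Round down or up at random, with probability proportional to the
    # fractional part, so the rounding is unbiased in expectation.
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

rng = np.random.default_rng(1)
x = np.full(10000, 0.25)
r = stochastic_round(x, rng)
mean = r.mean()  # close to 0.25, while plain round-to-nearest gives 0.0
```

That unbiasedness is what lets very low-precision fixed-point training converge: the rounding errors average out over many updates instead of systematically flushing small gradients to zero.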
Here’s a paper describing neural network compression (used to cram neural networks into RAM), where one of the techniques is limiting the precision to 5 bits.
[PDF: 1510.00149v3.pdf]
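A rough numpy sketch of the weight-sharing idea behind that: cluster a layer's weights into 2**5 = 32 shared values and store only a 5-bit index per weight. This toy k-means is my own illustration under those assumptions, not the paper's exact procedure.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    # A minimal 1-D k-means: seed centers evenly across the weight
    # range, then alternate nearest-center assignment and re-centering.
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        idx = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centers[j] = values[idx == j].mean()
    return centers, idx

rng = np.random.default_rng(3)
w = rng.normal(0, 0.1, 4096).astype(np.float32)  # stand-in trained layer
centers, idx = kmeans_1d(w, 32)
w_shared = centers[idx]                  # every weight snapped to a centroid
bits_per_weight = np.log2(len(centers))  # 5 bits of index per weight
```

Storage drops from 32 bits to 5 bits per weight plus a tiny 32-entry codebook, while each weight only moves by a fraction of its cluster's width.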
“On the general CPU side, modern SIMD instruction sets are often geared towards float, and so eight bit calculations don’t offer a massive computational advantage on recent x86 or ARM chips.”
Um… no. Integer computation, when it can be limited to 16 bits or less, is typically 2+ times faster on NEON or SSE4.1 or later. GPUs on the other hand are heavily geared towards float and may in fact be faster with float than with integer arithmetic.
On a modern x86 chip with AVX2 you have instructions that operate on 32 8-bit numbers at once. There is even an 8-bit multiply-accumulate instruction, which conveniently gives a 16-bit result that would be a great fit for this kind of algorithm.
From what I see on Wikipedia, Cannonlake is expected to have byte and word instructions and “some Skylake Xeons” have them. But they are listed as AVX-512 instructions, not AVX2. What are the AVX2 instructions you are referring to?
Pingback: How to Quantize Neural Networks with TensorFlow « Pete Warden's blog
Pingback: Deep Learning & Pieces of Eight - OrionX.net A New Model for Strategy, Marketing, PR
Pingback: Bookmarks for October 17th from 15:10 to 20:12 : Extenuating Circumstances
You say “their almost-magical effectiveness at a lot of AI tasks” (‘their’ being Deep NNs).
I find that perplexing. What tasks have these hyped models been (relatively) successful at, besides basic pattern recognition tasks (sound and image recognition), the kind of which even cats and mice do much better? Where’s the huge AI success? It beats me!
I saw you needed this: it’s not paywalled on researchgate.net, and you can even download it.
Hope you enjoy it.
Hi Pete, I found your post because of a project I’m working on and I’d like to thank you. As a hardware engineer I’ve had the same feeling about redundant memory — more so because chip manufacturers would never recommend using less RAM. I’m glad to see it confirmed. From now on my network will be 8-bit fixed point. (RGBA!)
I am interested in your project.
Can you sell your chips?
Pingback: What Machine Learning needs from Hardware « Pete Warden's blog
Pingback: Quantization Screencast « Pete Warden's blog
Pingback: Why Are Eight Bits Enough When Using Deep Neural Networks? (Chinese translation) – 闪念基因 – personal tech sharing