It’s been a while since I last wrote about using eight bit for inference with deep learning, and the good news is that there has been a lot of progress, and we know a lot more than we did even a year ago. There are still a lot of unanswered questions too, which is why I’m waiting for a plane to take me to MobiSys, where I’ll be helping Nic Lane from UCL run a workshop for the research community to investigate some of them.
As a foundation for that, I’ll be giving a talk on what I know now, and what my hunches are. A lot of it is empirical, and we don’t have nearly enough rigorous experiments, let alone published papers, but if you take all this as provisional I hope it might still be useful. I’m also very happy to acknowledge my deep debt to my Google colleagues and others like Song Han who are the driving forces behind much of this work! Here are my notes on the areas I’ll be covering tomorrow.
Hardware implementations
Now that the original TPU paper has been published, we can point to it as a successful example of using eight bit for inference across a wide variety of models within Google. There’s also the collaboration between the Qualcomm and TensorFlow teams that enables models to run up to seven times faster on the HVX DSP than on the CPU, thanks to the use of eight bit. This means we now have more evidence that this is a good approach to use on the hardware side.
Training with forward passes
I don’t have any published papers to hand, and we haven’t documented it well within TensorFlow, but we do have support for “fake quantization” operators. If you include these in your graphs at the points where quantization is expected to occur (for example after convolutions), then in the forward pass the float values will be rounded to the specified number of levels (typically 256) to simulate the effects of quantization. In the backward pass, this rounding won’t be performed, so gradients will be calculated using full float values. This has the effect of forcing the graph to adapt to the lower precision it will encounter during inference, and in practice we’ve seen this improve the accuracy of the quantized graph dramatically, sometimes to a level indistinguishable from float. It also gives precalculated min/max ranges for the 32-bit to 8-bit downscaling that needs to happen after many operations. This saves a step on the CPU, but for hardware implementations it’s even more important, since a dynamically-calculated range may be impossible to efficiently implement.
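To make this more concrete, here’s a rough sketch of what inserting a fake quantization node after a convolution can look like. The op name and defaults below are from the TensorFlow 1.x era and may differ in your version, and the fixed (-6.0, 6.0) range is purely an illustrative assumption, not a recommendation:

```python
import tensorflow as tf

# A minimal sketch: insert a fake quantization node after a convolution so the
# forward pass sees values rounded to 256 levels, while the backward pass
# treats the op roughly as an identity and gradients stay in full float.
inputs = tf.placeholder(tf.float32, [1, 224, 224, 3])
weights = tf.Variable(tf.truncated_normal([3, 3, 3, 32], stddev=0.1))
conv = tf.nn.conv2d(inputs, weights, strides=[1, 1, 1, 1], padding='SAME')

# The (-6.0, 6.0) range here is an illustrative assumption; in practice these
# ranges are learned during training or measured from real data.
quantized_conv = tf.fake_quant_with_min_max_args(
    conv, min=-6.0, max=6.0, num_bits=8)
```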
By the way, if you do want fixed ranges but can’t retrain, there are some options for running example data through a pretrained network to bake them in instead.
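One way to do that, sketched below, is a simple calibration pass: run a few hundred representative examples through the float model and record the observed range of every tensor you intend to quantize. The `run_float_model` callable here is a hypothetical stand-in for whatever inference harness you already have, not a TensorFlow API:

```python
import numpy as np

def calibrate_ranges(calibration_examples, run_float_model):
    """Collect per-tensor min/max over a calibration set.

    `run_float_model` is a hypothetical callable mapping one input example to
    a dict of {tensor_name: np.ndarray} holding the float activations we plan
    to quantize. The returned ranges can then be baked into the graph as
    fixed min/max values.
    """
    observed_min, observed_max = {}, {}
    for example in calibration_examples:
        for name, values in run_float_model(example).items():
            observed_min[name] = min(observed_min.get(name, np.inf),
                                     float(values.min()))
            observed_max[name] = max(observed_max.get(name, -np.inf),
                                     float(values.max()))
    return observed_min, observed_max
```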
Exact zeroes are important
The current TensorFlow way of figuring out ranges just looks at the min/max of the float values and assigns those to 0 and 255. This means that real zero is almost always not exactly representable, and the closest encoded value may represent something like 0.046464, or some other arbitrary distance from exact zero. For most numbers this doesn’t matter, because the float values are assumed to occur in a ‘random’ enough way that the error on the representation of any individual value is also uniformly random. The idea is that as long as the errors generally cancel each other out, they’ll just appear as the kind of random noise that the network is trained to cope with and so not destroy the overall accuracy by introducing a bias.
The problem is that the real value of zero shows up a lot more often than you’d expect in neural network calculations. Convolutions are padded with zeros at the edges when filters overlap them, and the Relu activation function gates any negative numbers at zero. This means that any error in the representation of zero contributes disproportionately to overall results.
The solution is to ensure that real values of zero are represented as exactly as possible in the quantized encoding. The way to do this is to nudge the overall min/max values so that zero is exact. We’re not (yet) doing this in TensorFlow, but hope to add it soon. For much more information, Benoit Jacob has some excellent documentation in gemmlowp, and he is the source of most of the information above.
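To make the nudging idea concrete, here’s a small numpy sketch of the scheme as I understand it from the gemmlowp docs: keep the scale derived from the raw range, round the code that real zero would land on to the nearest integer, and shift the range so that code decodes to exactly 0.0:

```python
import numpy as np

def nudge_range(range_min, range_max, num_levels=256):
    """Adjust (range_min, range_max) so that 0.0 maps to an exact uint8 code.

    A minimal sketch of the nudging idea, assuming range_max > range_min:
    compute the scale from the raw range, find the (possibly fractional) code
    that 0.0 would map to, round it to the nearest integer, and shift the
    range so that code represents zero exactly.
    """
    scale = (range_max - range_min) / (num_levels - 1)
    zero_point_from_min = -range_min / scale
    nudged_zero_point = int(round(np.clip(zero_point_from_min,
                                          0, num_levels - 1)))
    # Shift the range so the nudged zero point decodes to exactly 0.0.
    nudged_min = -nudged_zero_point * scale
    nudged_max = nudged_min + (num_levels - 1) * scale
    return nudged_min, nudged_max, nudged_zero_point

# Example: a raw range of (-1.0, 3.5) would put real zero at code 56.67, so
# the range is shifted slightly to make code 57 decode to exactly 0.0.
print(nudge_range(-1.0, 3.5))
```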
Asymmetric ranges are inconvenient, but may be necessary
Constraining the min/max ranges so that the minimum is always the negative of the maximum is very convenient for a lot of purposes because it avoids having to apply an offset to the operands to the matrix multiply. Unfortunately the evidence for whether this allows for enough precision is mixed, with some models showing unacceptable loss of overall accuracy. This is still an open question, and an area where we need more experiments.
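To see why the offset matters, here’s a rough numpy sketch of a quantized matrix multiply with nonzero zero points (the output scales are omitted for brevity). The three correction terms are the extra work that a symmetric scheme, where both zero points are zero, avoids entirely:

```python
import numpy as np

def quantized_matmul(lhs_q, rhs_q, lhs_zero, rhs_zero):
    """Sketch of why asymmetric ranges cost extra work in a matrix multiply.

    With an asymmetric encoding each code q represents scale * (q - zero_point),
    so the product expands into the core integer multiply plus three correction
    terms built from the zero points. Multiplying the result by
    lhs_scale * rhs_scale (omitted here) would give the real-valued output.
    """
    depth = lhs_q.shape[1]
    lhs_q = lhs_q.astype(np.int32)
    rhs_q = rhs_q.astype(np.int32)
    core = lhs_q @ rhs_q                             # the cheap integer part
    row_sums = lhs_q.sum(axis=1, keepdims=True)      # per-row correction
    col_sums = rhs_q.sum(axis=0, keepdims=True)      # per-column correction
    return (core
            - rhs_zero * row_sums
            - lhs_zero * col_sums
            + lhs_zero * rhs_zero * depth)
```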
Excluding -128 can be useful
One practical issue that has come up in various contexts is that signed eight bit values run from -128 to +127. This is inconvenient because there’s one more value on the negative side than the positive, and so it requires careful handling if we want to use symmetric ranges and ensure zero is exactly representable as encoded zero. Separately, avoiding -128 for the weights has also been helpful in the ARM NEON CPU implementation, since it allows a faster code path. There isn’t much principle behind it yet, but there is some practical evidence that avoiding -128 in general may be helpful.
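As a rough illustration, a symmetric “narrow range” encoding that sidesteps both problems simply maps the largest absolute weight to 127 and clips anything that would otherwise land on -128:

```python
import numpy as np

def quantize_symmetric_narrow(weights):
    """Sketch of symmetric 8-bit quantization that deliberately avoids -128.

    Using codes in [-127, 127] keeps the range symmetric around an exact
    encoded zero and leaves -128 unused, which can also enable faster NEON
    code paths for the weights.
    """
    max_abs = np.abs(weights).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return codes, scale
```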
Lower bit depths are promising, but unproven
There have been some fantastic papers around four-bit, two-bit, or even one-bit precision for neural networks. Unfortunately, so far they’ve all had practical drawbacks that have prevented us from taking advantage of them. Song Han’s four-bit weights require a lookup table, which makes them hard to implement efficiently at runtime, though I’m intrigued to know if a simple function to handle the nonlinear distribution might work as well and be easier to optimize. We haven’t been able to achieve the accuracy we need on the models we care about using lower bit depths, or even a four-bit linear encoding. The number of one-bit ops required also seems to scale in a way that negates the advantage of their lower precision. I don’t have any papers or documented experiments to share on this, but I’m hopeful that these issues can be overcome in the future, so I’ll be keeping a close eye on the literature.
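For what it’s worth, here’s the shape of the lookup-table idea in a few lines of numpy. This is just my rough illustration of codebook-style four-bit weights, not Song Han’s actual implementation, and the names here are hypothetical:

```python
import numpy as np

def decode_codebook_weights(indices_4bit, codebook):
    """Rough illustration of lookup-table (codebook) weight storage.

    `indices_4bit` holds 4-bit codes (0-15) and `codebook` is a 16-entry float
    table, for example centroids from clustering the original weights. The
    extra indirection is what makes an efficient runtime implementation tricky.
    """
    assert codebook.shape == (16,)
    return codebook[indices_4bit]

# Hypothetical usage: a stand-in codebook and a random 4-bit index matrix.
codebook = np.linspace(-1.0, 1.0, 16).astype(np.float32)
indices = np.random.randint(0, 16, size=(64, 64)).astype(np.uint8)
dense_weights = decode_codebook_weights(indices, codebook)
```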
Models are important
A lot of what I’m discussing above are fairly low-level optimizations, but as we know from software engineering, the biggest gains are often to be found higher up the stack. Switching to a more efficient sorting algorithm will probably do more for traditional code than rewriting a less-suited one in assembler. In the same spirit, altering the model architectures so that there’s less work to do is usually a much bigger win than tweaking the bit depth. That’s why I was very pleased that we could release the Mobilenet family of models. These substantially reduce the amount of computation needed, and also work well with quantization, thanks to hard work by Andrew Howard, Benoit Jacob, Dmitry Kalenichenko, and the rest of the Mobile Vision team.
As we keep pushing on quantization, this sort of co-design between researchers and implementers is crucial to get the best results. I think there’s a whole new field beginning to emerge, which I’m not sure whether to call ML Engineering or ML Systems, looking at the whole lifecycle of a deep learning solution, all the way from initial research through to deployment in production. It’s only with that sort of integrated view that we’re going to be able to solve some of the outstanding problems we’re still facing.