It’s been a while since I last wrote about using eight bit for inference with deep learning, and the good news is that there has been a lot of progress, and we know a lot more than we did even a year ago. There are still a lot of unanswered questions too, which is why I’m waiting for a plane to take me to MobiSys, where I’ll be helping Nic Lane from UCL run a workshop for the research community to investigate some of them.

As a foundation for that, I’ll be giving a talk on what I know now, and what my hunches are. A lot of it is empirical, and we don’t have nearly enough rigorous experiments, let alone published papers, but if you take all this as provisional I hope it might still be useful. I’m also very happy to acknowledge my deep debt to my Google colleagues and others like Song Han who are the driving forces behind much of this work! Here are my notes on the areas I’ll be covering tomorrow.

### Hardware implementations

Now that the original TPU paper has been published, we can point to it as a successful example of using eight bit for inference across a wide variety of models within Google. There’s also the collaboration between the Qualcomm and TensorFlow teams that enables models to run up to seven times faster on the HVX DSP than on the CPU, thanks to the use of eight bit. This means we now have more evidence that this is a good approach to use on the hardware side.

### Training with forward passes

I don’t have any published papers to hand, and we haven’t documented it well within TensorFlow, but we do have support for “fake quantization” operators. If you include these in your graphs at the points where quantization is expected to occur (for example after convolutions), then in the forward pass the float values will be rounded to the specified number of levels (typically 256) to simulate the effects of quantization. In the backward pass, this rounding won’t be performed, so gradients will be calculated using full float values. This has the effect of forcing the graph to adapt to the lower precision it will encounter during inference, and in practice we’ve seen this improve the accuracy of the quantized graph dramatically, sometimes to a level indistinguishable from float. It also gives precalculated min/max ranges for the 32-bit to 8-bit downscaling that needs to happen after many operations. This saves a step on the CPU, but for hardware implementations it’s even more important, since a dynamically-calculated range may be impossible to efficiently implement.
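As a rough illustration of the forward and backward behaviour described above, here is a minimal NumPy sketch. It is not the TensorFlow operator: the names `fake_quantize` and `fake_quantize_grad` are made up for this example, the range is passed in by hand rather than learned, and passing the gradient through untouched is the simplest reading of "this rounding won't be performed" in the backward pass.

```python
import numpy as np

def fake_quantize(x, range_min, range_max, num_levels=256):
    """Forward pass: snap floats onto num_levels evenly spaced values.

    The result is still float, but it can only take values that an
    eight-bit encoding of [range_min, range_max] could represent.
    """
    x = np.asarray(x, dtype=np.float32)
    scale = (range_max - range_min) / (num_levels - 1)
    clipped = np.clip(x, range_min, range_max)
    level = np.round((clipped - range_min) / scale)
    return (range_min + level * scale).astype(np.float32)

def fake_quantize_grad(upstream_grad):
    """Backward pass (assumed): ignore the rounding, pass gradients through."""
    return upstream_grad

# Example: activations after a convolution, simulated at 256 levels.
activations = np.array([0.0, 0.5, 1.0, 2.0], dtype=np.float32)
simulated = fake_quantize(activations, range_min=0.0, range_max=1.0)
```

Because the output stays in float, the rest of the training graph does not need to change; only the values it sees are coarsened.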

By the way, if you do want fixed ranges but can’t retrain, there are some options for running example data through a pretrained network to bake them in instead.
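One simple way to do this, sketched here under my own assumptions, is to record the smallest and largest activation seen while running example batches through a layer. The helper `calibrate_range` and the stand-in `relu` layer are hypothetical names for illustration, not an existing TensorFlow tool, and real approaches may use something more robust than a raw min/max.

```python
import numpy as np

def calibrate_range(layer_fn, example_batches):
    """Track the smallest and largest activation seen across all batches."""
    seen_min, seen_max = np.inf, -np.inf
    for batch in example_batches:
        activations = np.asarray(layer_fn(batch))
        seen_min = min(seen_min, float(activations.min()))
        seen_max = max(seen_max, float(activations.max()))
    return seen_min, seen_max

# Example: a stand-in "layer" (just a Relu) fed with random example data.
rng = np.random.default_rng(0)
batches = [rng.normal(size=(8, 16)) for _ in range(4)]
relu = lambda x: np.maximum(x, 0.0)
fixed_min, fixed_max = calibrate_range(relu, batches)
```

The resulting pair can then be stored alongside the layer as a fixed range instead of being recomputed at inference time.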

### Exact zeroes are important

The current TensorFlow way of figuring out ranges just looks at the min/max of the float values and assigns those to 0 and 255. This means that real zero is almost always not exactly representable, and the closest encoded value may represent something like 0.046464, or some other arbitrary distance from exact zero. For most numbers this doesn’t matter, because the float values are assumed to occur in a ‘random’ enough way that the error on the representation of any individual value is also uniformly random. The idea is that as long as the errors generally cancel each other out, they’ll just appear as the kind of random noise that the network is trained to cope with and so not destroy the overall accuracy by introducing a bias.

The problem is that the real value of zero shows up a lot more often than you’d expect in neural network calculations. Convolutions are padded with zeros at the edges when filters overlap, and the Relu activation function gates any negative numbers at zero. This means that any error in the zero representation contributes disproportionately to overall results.

The solution to this is to ensure that real values of zero are represented as exactly as possible in the quantized encoding. The way to do this is to nudge the overall min/max values so that zero is exact. We’re not (yet) doing this in TensorFlow, but hope to have it in soon. For much more information, Benoit Jacob has some excellent documentation in gemmlowp, and is the source of most of the information above.
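To make the nudging idea concrete, here is a small sketch in NumPy. It follows the general scheme of picking an integer "zero point" and shifting the range to match it; the function names (`nudged_range`, `quantize`, `dequantize`) and the example range are my own, and this should not be read as the exact gemmlowp or TensorFlow code.

```python
import numpy as np

def nudged_range(range_min, range_max, qmin=0, qmax=255):
    """Shift [range_min, range_max] slightly so real 0.0 lands on an integer code."""
    # The range has to contain zero for zero to be representable at all.
    range_min = min(range_min, 0.0)
    range_max = max(range_max, 0.0)
    span = range_max - range_min
    scale = span / (qmax - qmin) if span > 0 else 1.0
    # The code that real zero would ideally map to, rounded to an integer.
    zero_point = int(round(qmin - range_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    # Rebuild the range from the integer zero point, keeping the same step size.
    nudged_min = (qmin - zero_point) * scale
    nudged_max = (qmax - zero_point) * scale
    return nudged_min, nudged_max, scale, zero_point

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    q = np.round(np.asarray(x, dtype=np.float64) / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.int32) - zero_point) * scale

# Example: an arbitrary range where zero would not otherwise be exact.
nudged_min, nudged_max, scale, zero_point = nudged_range(-1.0, 4.5)
zero_round_trip = dequantize(quantize(0.0, scale, zero_point), scale, zero_point)
```

The step size is unchanged, so the only cost is that the endpoints move by less than one quantization step.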

### Asymmetric ranges are inconvenient, but may be necessary

Constraining the min/max ranges so that the minimum is always the negative of the maximum is very convenient for a lot of purposes because it avoids having to apply an offset to the operands to the matrix multiply. Unfortunately the evidence for whether this allows for enough precision is mixed, with some models showing unacceptable loss of overall accuracy. This is still an open question, and an area where we need more experiments.
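The inconvenience is easier to see written out. In this sketch (my own illustrative code, with hypothetical function names), each operand is stored as integer codes plus an offset, or zero point. Expanding the product shows the extra row-sum, column-sum and constant terms that asymmetric ranges drag in; with both offsets at zero, only the plain integer matrix multiply remains.

```python
import numpy as np

def quantized_matmul(qa, za, qb, zb):
    """Reference version: subtract each operand's offset, then multiply in int32."""
    a = qa.astype(np.int32) - za
    b = qb.astype(np.int32) - zb
    return a @ b

def quantized_matmul_expanded(qa, za, qb, zb):
    """Same result, expanded to show the extra terms the offsets introduce."""
    qa32 = qa.astype(np.int32)
    qb32 = qb.astype(np.int32)
    depth = qa32.shape[1]
    raw = qa32 @ qb32                            # the only term left when za == zb == 0
    row_sums = qa32.sum(axis=1, keepdims=True)   # one value per row of A
    col_sums = qb32.sum(axis=0, keepdims=True)   # one value per column of B
    return raw - zb * row_sums - za * col_sums + depth * za * zb

# Example with arbitrary eight-bit codes and offsets.
rng = np.random.default_rng(0)
qa = rng.integers(0, 256, size=(3, 4), dtype=np.uint8)
qb = rng.integers(0, 256, size=(4, 5), dtype=np.uint8)
asymmetric = quantized_matmul_expanded(qa, 17, qb, 128)
```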

### Excluding -128 can be useful

One practical issue that has come up in various contexts is that signed eight bit values run from -128 to +127. This is inconvenient because there’s one more value on the negative side than the positive, and so requires careful handling if we want to use symmetric ranges and ensure zero is exactly representable as encoded zero. Unrelatedly, it’s also been helpful with the ARM NEON CPU implementation to avoid -128 for the weights to allow a faster code path. There’s not all that much principle behind it yet, but there’s thus some evidence that avoiding -128 in general may be helpful.
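A minimal sketch of what avoiding -128 looks like for weights, assuming a symmetric range scaled by the largest absolute value. The function name `quantize_symmetric` is mine, and this is only one plausible way to pick the scale.

```python
import numpy as np

def quantize_symmetric(weights, num_bits=8):
    """Quantize to signed codes in [-127, 127], leaving -128 unused."""
    weights = np.asarray(weights, dtype=np.float32)
    qmax = 2 ** (num_bits - 1) - 1               # 127 for eight bits
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

# Example: real zero maps to code zero, and the extremes map to -127 and +127.
codes, scale = quantize_symmetric([-1.0, -0.5, 0.0, 0.25, 1.0])
```

Clamping to ±127 keeps the encoding symmetric around an exact zero, at the cost of one unused code out of 256.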

### Lower bit depths are promising, but unproven

There have been some fantastic papers around four bit, two bit, or even one bit precision for neural networks. Unfortunately so far they’ve all had some practical drawbacks that have prevented us from taking advantage of them. Song Han’s four-bit weights require a lookup table, which makes them hard to implement efficiently at runtime, though I’m intrigued to know if a simple function to handle the nonlinear distribution might work as well and be easier to optimize. We haven’t been able to achieve the accuracy we need on models we care about using lower bit depths, or even four-bit linear. The number of one-bit ops required also seems to scale in a way that negates the advantage of their lower precision. Unfortunately I don’t have any papers or documented experiments to share on this though, and I’m also hopeful that these issues can be overcome in the future, so I’ll be keeping a close eye on the literature.

### Models are important

A lot of what I’m discussing above is fairly low-level optimization, but as we know from software engineering, the biggest gains are often to be found higher up the stack. Switching to a more efficient sorting algorithm will probably do more for traditional code than rewriting a less-suited one in assembler. In the same spirit, altering the model architectures so that there’s less work to do is usually a much bigger win than tweaking the bit depth. That’s why I was very pleased that we could release the Mobilenet family of models. These substantially reduce the amount of computation needed, and also work well with quantization, thanks to hard work by Andrew Howard, Benoit Jacob, Dmitry Kalenichenko, and the rest of the Mobile Vision team.

As we keep pushing on quantization, this sort of co-design between researchers and implementers is crucial to get the best results. I think there’s a whole new field beginning to emerge, which I’m not sure whether to call ML Engineering or ML Systems, looking at the whole lifecycle of a deep learning solution, all the way from initial research through to deployment in production. It’s only with that sort of integrated view that we’re going to be able to solve some of the outstanding problems we’re still facing.

Hi Pete,

Thanks! This post is of great interest to us as we’re currently analysing the effect of quantization. I have one question: why do you quantize everything to 8 bits?

We are currently studying a flexible quantization where we quantize the upper layer on 16 bits for example and the more we go down the graph, the more we quantize, to even maybe 4 bits or less for the last layers.

Best Regards,

Corine

On the topic of not using -128 & encoding zeros exactly… instead of nudging the range to guarantee exact zero representation, could one encode zero as -128? It may be tricky to make that work efficiently with today’s hardware, but something to consider when designing new hardware…

Well, at least its representation isn’t as odd in binary. The more general issue with the implementation is introducing branching at all. But this might be one of the cases where you can use clever arithmetic to make it “linear”.

> Asymmetric ranges are inconvenient, but may be necessary

Pete, asymmetric ranges are useless for conv layers since their PDFs are symmetric due to regularization. In my opinion, minor gains on scale or bn layers are not worth it and offset should be an optional parameter.

> Excluding -128 can be useful

If you use conventional 2’s complement encoding, excluding the minimum value before ops is kind of useless, because e.g. conv layer multiplications are followed by summation of the product terms with bit-width growth, rounding and saturation.

> Pete, asymmetric ranges are useless for conv layers since their PDFs are symmetric due to regularization.

> In my opinion, minor gains on scale or bn layers are not worth it and offset should be an optional parameter.

We see about a 3% drop in top-1 accuracy on small models like Mobilenet when using symmetric versus asymmetric, so the loss is significant for some applications. I’m keen to see more experiments to understand these losses better though.

Hi Pete, about the hardware: on what kind of devices (with which instruction set) can we get the expected acceleration?

Hi Pete,

Miyashita et al. showed that it’s possible to use a lower-bit representation (around 3 to 5 bits) without much loss of accuracy. They even quantized gradients in some cases.

The very interesting points for me are:

1. They used a logarithmic representation and showed that the inner product can be approximated with simple bit operations.

2. The base-2 logarithmic representation didn’t work well, but when they changed the base from 2 to the square root of 2, the logarithmic representation performed better than the linear representation.

https://arxiv.org/abs/1603.01025

Hi Pete,

Thanks for sharing all this with us!

Recently, we’ve published some new research on finding an optimal network topology at lower bit depths. In this work, we investigate the trade-off you talk about: networks at lower precision require more ops to achieve the same level of accuracy than networks at high precision. Hence, although the ops are cheaper at low precision, the network might still consume more energy due to its increased size.

In the paper [1] and presentation [2], we hence investigate the influence of both network size (width, depth) and precision (clustered, binary, weight quantized, weights and activations quantized) on benchmark accuracy. We then link this information to an energy model based on our chips to find an optimal (minimum energy) solution. It turns out the optimal network at iso-accuracy varies between 1-4b. 8b operators are never optimal.

[1] https://arxiv.org/abs/1711.00215

[2] https://www.researchgate.net/publication/320775013_Presentation_on_Minimum_Energy_Quantized_Neural_Networks_Asilomar_2017

Code for quantized networks in TensorFlow can be found here: github.com/BertMoons/

Are the results accumulated in a signed or unsigned integer?

Hi Pete, thanks for sharing!

This is a nice piece of information on the quantization of neural models, neatly explained in your blog.

Also, I read the blog post at https://gyrus.ai/blog/quantization-of-neural-network-model-for-ai-hardware/, which talks about common methods for doing quantization, the challenges, and ways to overcome them.

thanks again, cheers!
