How to shrink ONNX files

I’ve been using the ONNX Runtime a lot recently, and while it has been a lot of fun, there are a few things I’ve missed from the TensorFlow Lite world. The biggest (no pun intended) is the lack of tools to shrink the model file size, something that’s always been essential in the mobile app world. You can quantize using the standard ONNX tools, but in my experience you’ll often run into accuracy problems because all of the calculations are done at lower precision. These are usually fixable, but require some time and effort.

Instead, I like to perform “weights-only quantization”, where the calculations are still done in 32-bit floating point, but the large arrays of weight values are stored as 8-bit codes. This usually has no impact on accuracy, and the effect on latency should be pretty negligible, since the compute involved in unpacking those values is a tiny fraction of the rest of the network calculations. I couldn’t find a tool to do that for me though, so I’ve just released ONNX Shrink Ray on GitHub and PyPI. This tool processes ONNX files, finds large arrays of float32 values, and replaces them with an equivalent array of 8-bit codes followed by a DequantizeLinear operation. This typically reduces large float models to around 30% of their original size, usually with no measurable impact on accuracy.
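To make the mechanics concrete, here’s a minimal sketch of that pattern using the standard `onnx` Python package. This is not the ONNX Shrink Ray implementation (its actual heuristics, thresholds, and quantization scheme will differ); it just illustrates how a large float32 initializer can be swapped for uint8 codes plus a DequantizeLinear node:

```python
# Minimal sketch of weights-only quantization for an ONNX graph.
# NOT the ONNX Shrink Ray code, just an illustration of the pattern:
# float32 initializer -> uint8 codes + scale/zero_point + DequantizeLinear.
import numpy as np
import onnx
from onnx import helper, numpy_helper

SIZE_THRESHOLD_BYTES = 16 * 1024  # assumed cutoff: skip small weight arrays


def quantize_weights(model: onnx.ModelProto) -> onnx.ModelProto:
    graph = model.graph
    new_nodes = []
    for tensor in list(graph.initializer):
        if tensor.data_type != onnx.TensorProto.FLOAT:
            continue
        weights = numpy_helper.to_array(tensor)
        if weights.size * 4 < SIZE_THRESHOLD_BYTES:
            continue

        # Asymmetric linear quantization to uint8: w ~= (q - zero_point) * scale.
        w_min, w_max = float(weights.min()), float(weights.max())
        scale = (w_max - w_min) / 255.0
        if scale == 0.0:
            scale = 1.0
        zero_point = int(np.clip(round(-w_min / scale), 0, 255))
        quantized = np.clip(
            np.round(weights / scale) + zero_point, 0, 255
        ).astype(np.uint8)

        name = tensor.name
        graph.initializer.remove(tensor)
        graph.initializer.extend([
            numpy_helper.from_array(quantized, name + "_quantized"),
            numpy_helper.from_array(np.array(scale, dtype=np.float32), name + "_scale"),
            numpy_helper.from_array(np.array(zero_point, dtype=np.uint8), name + "_zero_point"),
        ])
        # DequantizeLinear reconstructs float32 values at runtime under the
        # original tensor name, so downstream nodes don't need to change.
        new_nodes.append(helper.make_node(
            "DequantizeLinear",
            inputs=[name + "_quantized", name + "_scale", name + "_zero_point"],
            outputs=[name],
        ))

    # These nodes only consume initializers, so prepending them keeps the
    # graph topologically sorted.
    for node in reversed(new_nodes):
        graph.node.insert(0, node)
    return model


model = quantize_weights(onnx.load("model.onnx"))
onnx.save(model, "model_shrunk.onnx")
```

Because the DequantizeLinear node writes its output under the original tensor’s name, the rest of the graph is untouched and all the heavy math still runs in float32.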

This is especially important for models that are hosted on the web or run with the ONNX web runtime, since big downloads cost money. I’ve put together a quick pricing calculator (built with Claude) to demonstrate the potential savings, with Google Cloud Storage download costs as the default. You can enter your own values to see what the impact would be in your situation.
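For a rough sense of the numbers, here’s a back-of-the-envelope version of that calculation. The model size, download count, and per-gigabyte egress rate below are placeholder assumptions, not quoted prices, so substitute your own figures:

```python
# Back-of-the-envelope download savings; all inputs are placeholder assumptions.
model_size_gb = 1.2                      # original float32 model
shrunk_size_gb = model_size_gb * 0.30    # ~30% of original after weight quantization
downloads_per_month = 100_000
egress_price_per_gb = 0.12               # placeholder USD rate; check your provider

saved_gb = (model_size_gb - shrunk_size_gb) * downloads_per_month
print(f"Bandwidth saved: {saved_gb:,.0f} GB/month")
print(f"Estimated cost saved: ${saved_gb * egress_price_per_gb:,.0f}/month")
```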

Other frameworks like GGML do offer similar kinds of weights-only quantization, but this is the only solution I know of for ONNX. I’ve also included a variation on this kind of quantization, where the values are still stored as floats, but snapped to an arbitrary number of distinct values. This is very effective when your content is compressed for delivery (which, if you’re concerned about download costs, you’re probably already doing) and has no impact on latency.
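Here’s a rough sketch of that float-preserving variant. The evenly spaced 256-level binning is an assumption for illustration, not necessarily what ONNX Shrink Ray does internally:

```python
# Sketch of the float-preserving variant: values stay float32 on disk, but are
# snapped to a limited set of levels so gzip/Brotli can compress them far better.
import numpy as np


def snap_to_levels(weights: np.ndarray, num_levels: int = 256) -> np.ndarray:
    w_min, w_max = float(weights.min()), float(weights.max())
    if w_max == w_min:
        return weights.astype(np.float32)
    step = (w_max - w_min) / (num_levels - 1)
    # Round each weight to the nearest of num_levels evenly spaced values.
    return (np.round((weights - w_min) / step) * step + w_min).astype(np.float32)
```

Since the tensors remain float32, the graph itself doesn’t change and there’s no unpacking at runtime; the size win comes entirely from the compressor finding many more repeated byte patterns in the snapped values.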

We have some other tricks up our sleeve for shrinking large models, so if you are running into this issue yourself, please do get in touch; I’ll be happy to geek out.
