Debugging Disposable ML Frameworks

Guest post by Nat Jeffries, Founding Engineer at Useful Sensors.

At Useful Sensors we love using disposable frameworks to deploy on-device transformers. Having built several such frameworks, I realized that while there are great resources for understanding and training transformer models, there are few guides for deploying them on-device. What follows are some lessons I wish I had known when I started building disposable frameworks, and some tricks I’ve learned along the way.

First, I’ve learned to test parts of the model rather than the whole thing. When you run a transcription model on a sample audio clip and get back wingdings, curse words, or nothing at all, it’s hard to know what went wrong. Instead, I compare intermediate tensor values from a known-good model against the same tensors in my custom framework, working from the input through each major block until the tensors diverge. One trick I’ve found useful is to log the sum and shape of each tensor rather than dumping the tensor values themselves.

Here’s an example in C++:

#include <cstdio>
#include <string>

// Log a compact signature of a tensor: the sum of its elements and its shape.
// Tensor here is whatever tensor type your framework defines.
void print_tensor(const Tensor* tensor, const std::string& msg) {
  float sum = 0.0f;
  for (auto elem : tensor->data) {
    sum += elem;
  }
  printf("%s: sum: %.4f shape (", msg.c_str(), sum);
  for (auto dim : tensor->shape()) {
    printf("%d ", dim);
  }
  printf(")\n");
}

Tensor* generate(Tensor* input, Tensor* mask, Tensor* seq) {
  // Log the inputs too: wrong, misordered, or badly quantized inputs are a
  // surprisingly common source of mismatches.
  print_tensor(input, "input");
  print_tensor(mask, "mask");
  auto* preprocessed = preprocess(input);
  print_tensor(preprocessed, "preprocessed");
  auto* embedding = encoder(preprocessed, mask);
  print_tensor(embedding, "embedding");
  auto* output = decoder(seq, embedding, mask);
  print_tensor(output, "output");
  return output;
}

And here’s the Python version:

import torch

def print_tensor(tensor, name):
    # Same compact signature as the C++ version: sum plus shape.
    print(f'{name} sum {torch.sum(tensor)} shape {tensor.shape}')

def generate(src, mask, seq):
    print_tensor(src, "input")
    print_tensor(mask, "input mask")

    preprocessed = preprocessor(src)
    print_tensor(preprocessed, "preprocessed")

    enc = encoder(src=preprocessed, input_mask=mask)
    print_tensor(enc, "embedding")

    output = decoder(prompt=seq, embedding=enc, input_mask=mask)
    print_tensor(output, "output")
    return output

It’s rare for two tensors with the same sum and shape to contain different values, and even when they do, the error will almost always surface one block later. Remember that this includes checking the inputs to the two models. I’ve lost count of the number of times I’ve used an incorrectly quantized input, supplied the wrong input mask, or fed inputs into the model in the wrong order.

When dealing with quantized tensors, always refer back to the floating point values they represent. Regardless of the quantization scheme, each quantized value is an approximation of an equivalent floating point value in the known-good (usually floating point) model. Recording the sums and shapes of quantized tensors after converting them back to float is a good way to check that the models match, and to quickly spot integer overflow, incorrect logic, or excessive quantization error.
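
As a concrete example, here’s a minimal sketch of that check in PyTorch, assuming a simple per-tensor scale and zero point (your quantization scheme and tensor layout may well differ):

import torch

def print_quantized_tensor(codes, scale, zero_point, name):
    # Convert the integer codes back to the floats they approximate, then log
    # the same sum/shape signature used for the known-good float model.
    restored = (codes.to(torch.float32) - zero_point) * scale
    print(f'{name} sum {torch.sum(restored)} shape {restored.shape}')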

Finally, periodically take a step back and honestly assess how clear your mental picture of what you’re implementing really is. I was reminded of this recently while adding batch decoding to our Moonshine model. I spent many days debugging subtle differences between the batch and non-batch versions of the model before realizing that I had forgotten to mask cross attention in the decoder. A simple gap in my knowledge, quickly closed by reading a guide on masking in encoder-decoder models, cost me days of wasted effort.

Hopefully these tricks can save somebody from the pitfalls I’ve fallen into. If you’re interested in deploying speech models on-device, or have tips I missed here, please reach out!

How to shrink ONNX files

I’ve been using the ONNX Runtime a lot recently, and while it has been a lot of fun, there are a few things I’ve missed from the TensorFlow Lite world. The biggest (no pun intended) is the lack of tools to shrink the model file size, something that’s always been essential in the mobile app world. You can quantize using the standard ONNX tools, but in my experience you’ll often run into accuracy problems because all of the calculations are done at lower precision. These are usually fixable, but require some time and effort.
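
For context, one of those standard paths looks something like onnxruntime’s dynamic quantization (the file names below are just placeholders); the quantized ops then do their math in int8, which is where the accuracy problems tend to creep in:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantizes the weights and runs the supported ops at int8 precision.
quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)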

Instead, I like to perform “weights-only quantization”, where the calculations are still done in 32-bit floating point, but the large arrays of weight values are stored as 8-bit codes. This usually has no impact on accuracy, and the effect on latency should be pretty negligible, since the compute involved in unpacking those values at runtime is a tiny fraction of the rest of the network’s calculations. I couldn’t find a tool to do that for me though, so I’ve just released ONNX Shrink Ray on GitHub and pypi. This tool processes ONNX files, finds large arrays of float32 values, and replaces them with an equivalent array of 8-bit codes followed by a DequantizeLinear operation. This typically reduces large float models to around 30% of their original size, usually with no measurable impact on accuracy.
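
This isn’t Shrink Ray’s actual implementation, but here’s a rough sketch of the transformation using the standard onnx Python package, handling a single initializer with a simple symmetric per-tensor scheme:

import numpy as np
import onnx
from onnx import helper, numpy_helper

def shrink_initializer(model: onnx.ModelProto, name: str) -> None:
    # Replace the float32 initializer `name` with int8 codes plus a
    # DequantizeLinear node that restores approximate floats at runtime.
    graph = model.graph
    init = next(i for i in graph.initializer if i.name == name)
    weights = numpy_helper.to_array(init).astype(np.float32)

    # Simple symmetric per-tensor quantization: one scale for the whole array.
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)
    codes = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

    graph.initializer.remove(init)
    graph.initializer.extend([
        numpy_helper.from_array(codes, name + "_quant"),
        numpy_helper.from_array(np.array(scale, dtype=np.float32), name + "_scale"),
    ])

    # The dequantize node reuses the original tensor name, so every consumer
    # of the old float32 weights keeps working unchanged.
    graph.node.insert(0, helper.make_node(
        "DequantizeLinear",
        inputs=[name + "_quant", name + "_scale"],
        outputs=[name],
        name=name + "_dequantize",
    ))

Since the int8 codes are a quarter the size of the float32 weights, this is where the roughly 70% size reduction comes from.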

This is especially important for models that are hosted on the web or using the ONNX web runtime, since big downloads cost money. I’ve put together a quick pricing calculator (built with Claude) to demonstrate the potential savings, with Google Cloud Storage download costs as the default. You can enter your own values to see what the impact would be in your situation.

Other frameworks like GGML do offer similar kinds of weight-only quantization, but this is the only solution I know of for ONNX. I’ve also included a variation on this kind of quantization, where the values are still stored as floats, but quantized to an arbitrary number of values. This is very effective when your content is compressed for delivery (which, if you’re concerned about download costs, you’re probably already doing) and has no impact on latency.
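
As a rough sketch of the idea (not necessarily the exact scheme the tool uses), snapping each weight to a small set of evenly spaced float values is enough to make the stored bit patterns highly repetitive:

import numpy as np

def snap_to_levels(weights: np.ndarray, levels: int = 256) -> np.ndarray:
    # Round each weight to the nearest of `levels` evenly spaced values,
    # keeping float32 storage so the graph and runtime are untouched.
    lo, hi = float(weights.min()), float(weights.max())
    if hi == lo:
        return weights.astype(np.float32)  # nothing to do for a constant tensor
    step = (hi - lo) / (levels - 1)
    snapped = np.round((weights - lo) / step) * step + lo
    return snapped.astype(np.float32)

Because the snapped values repeat so often, a compressor like gzip can encode the file far more compactly, even though it’s still full-size float32 on disk before compression.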

We have some other tricks up our sleeve for shrinking large models, so if you are running into this issue yourself, please do get in touch; I’ll be happy to geek out.

Introducing Moonshine, the new state of the art for speech to text

Can you imagine using a keyboard where it takes two seconds for a key press to show up on screen? That’s the typical latency for most voice interfaces, so it’s no wonder they’ve failed to catch on for most people. Today we’re open sourcing Moonshine, a new speech to text model that returns results faster and more efficiently than the current state of the art, OpenAI’s Whisper, while matching or exceeding its accuracy. The paper has the full details, but the key improvements are an architecture that offers an overall 1.7x speed boost compared to Whisper, and a flexibly-sized input window. This variable-length input is very important: Whisper always works with 30-second chunks of audio, so even if you only have a few seconds of speech you have to zero-pad the input and process much more data than you need. These two improvements mean we’re five times faster than Whisper on ten-second audio clips!

To understand what that means in practice, you can check out our Torre translator. The speed of Moonshine means we can offer almost instant translations as people are talking, making for a conversation that’s much more natural than existing solutions.

Even better, the low resource demands of Moonshine allow us to run everything locally on the device, without any network connection, safeguarding privacy and letting us run anywhere in the world, instantly.

We founded Useful to help machines understand us better, and we’re proud to share this new step forward in speech to text, since voice interfaces are a vital part of that mission. Moonshine doesn’t just help us with products like Torre; its unique design makes it possible to fit full automatic speech recognition on true embedded hardware. We’ve found the biggest obstacle to running ASR on microcontrollers and DSPs hasn’t been processing power, since accelerators help with that, but RAM limits. Even the smallest Whisper model requires at least 30MB of RAM, since modern transformers create large dynamic activation layers which can’t be stored in flash or other read-only memory. Because Moonshine’s requirements scale with the size of the input window, we are on target to transcribe full sentences a few seconds long in 8MB of RAM or less.

I can’t wait to see what people are able to build with these new models, especially on resource-constrained platforms like the Raspberry Pi, where running full speech to text has been challenging. Please do get in touch if you’ve built something neat, we’d love to hear from you!

Update – I talk a bit more about Moonshine on YouTube at youtu.be/sZVTisKqJtA.