Guest post by Nat Jeffries, Founding Engineer at Useful Sensors.
At Useful Sensors we love using disposable frameworks to deploy on-device transformers. Having built several such frameworks, I realized that while there are great resources for understanding and training transformer models, there are few guides for deploying them on-device. The following are some lessons I wish I had known when I started building disposable frameworks, and some tricks I've learned along the way.
First, test parts of the model rather than the whole thing. When you run a transcription model on a sample audio clip and get back wingdings, curse words, or nothing at all, it's hard to know what went wrong. I like to compare intermediate tensor values from a known-good model against the same tensors in my custom framework, working from the input through each major block until the tensors diverge. One trick I've found is to log the sum and shape of each tensor rather than dumping all (or a slice of) the values.
Here’s an example in C++:
#include <cstdio>
#include <string>

// Log a tensor's sum and shape so it can be compared against the
// same tensor in the known-good reference model.
void print_tensor(const Tensor* tensor, const std::string& msg) {
  float sum = 0;
  for (auto elem : tensor->data) {
    sum += elem;
  }
  printf("%s: sum: %.4f shape (", msg.c_str(), sum);
  for (auto elem : tensor->shape()) {
    printf("%d ", elem);
  }
  printf(")\n");
}
Tensor* generate(Tensor* input, Tensor* mask, Tensor* seq) {
  print_tensor(input, "input");
  print_tensor(mask, "mask");
  auto* preprocessed = preprocess(input);
  print_tensor(preprocessed, "preprocessed");
  // The encoder consumes the preprocessed input, not the raw input.
  auto* embedding = encoder(preprocessed, mask);
  print_tensor(embedding, "embedding");
  auto* output = decoder(seq, embedding, mask);
  print_tensor(output, "output");
  return output;
}
And here’s the Python version:
import torch

def print_tensor(tensor, name):
    print(f'{name} sum {torch.sum(tensor)} shape {tensor.shape}')

def generate(src, mask, seq):
    print_tensor(src, "input")
    print_tensor(mask, "input mask")
    preprocessed = preprocessor(src)
    print_tensor(preprocessed, "preprocessed")
    enc = encoder(src=preprocessed, input_mask=mask)
    print_tensor(enc, "embedding")
    output = decoder(prompt=seq, embedding=enc, input_mask=mask)
    print_tensor(output, "output")
    return output
It's rare that two tensors with the same sum and shape contain different values, and even when they do, the error will almost always surface one block later. Remember that this includes checking the inputs of the two models: I've lost count of the number of times I used an incorrectly quantized input, supplied the wrong input mask, or fed inputs into the model in the wrong order.
When dealing with quantized tensors, always refer back to the floating-point values they represent. Regardless of the quantization scheme, each quantized value is an approximation of an equivalent floating-point value in the known-good (usually floating-point) model. Logging the sums and shapes of quantized tensors after converting them back to float is a good way to verify that the models match, and to quickly catch integer overflow, incorrect logic, or excessive quantization error.
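To make that concrete, here's a rough sketch of how the logging helper above could handle quantized tensors, assuming a per-tensor affine scheme where each int value represents scale * (q - zero_point). The names are illustrative, not our actual framework's API:

import numpy as np

def print_quantized_tensor(q, scale, zero_point, name):
    # Map each int back to the float it approximates, assuming a
    # per-tensor affine scheme: real = scale * (quantized - zero_point).
    real = scale * (q.astype(np.float32) - zero_point)
    # Sum in float so the result is directly comparable to the same
    # tensor in the known-good floating-point model.
    print(f'{name} sum {real.sum():.4f} shape {q.shape}')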
Finally, make sure to periodically take a step back and honestly evaluate how clear your mental picture is of what you're trying to implement. I recently experienced this while adding batch decoding to our Moonshine model. I spent many days debugging subtle differences between the batch and non-batch versions of our model before realizing that I had forgotten to mask cross attention in the decoder. A simple gap in my knowledge, quickly closed by reading a guide on masking in encoder-decoder models, cost days of wasted effort.
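If you're facing the same gap, the fix amounts to masking out padded encoder positions in the decoder's cross attention. Here's a minimal single-head sketch of the idea, assuming a boolean padding mask over encoder positions (names and shapes are illustrative, not Moonshine's actual implementation):

import torch

def cross_attention(q, k, v, input_mask):
    # q: (batch, dec_len, d); k, v: (batch, enc_len, d)
    # input_mask: (batch, enc_len), True at padded encoder positions.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    # Without this step, padded encoder positions receive attention
    # weight, and batched outputs drift from the non-batched ones.
    scores = scores.masked_fill(input_mask[:, None, :], float('-inf'))
    return torch.softmax(scores, dim=-1) @ v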
Hopefully these tricks can save somebody from the pitfalls I’ve fallen into. If you’re interested in deploying speech models on-device or have tips I missed here, please reach out!

