Running TensorFlow Graphs on Microcontrollers

Photo by PixelMixer

I gave a talk last week at the Embedded Vision Summit, and one question that came up was how to run neural networks trained in TensorFlow on tiny, power-efficient CPUs like the EFM32. This microcontroller is built around an ARM Cortex-M4 core, and draws about 2.5 milliwatts when running at 40 MHz. It’s not much to work with, but it does at least have 256KB of RAM and 1024KB of flash program memory. This is more than enough to run some simple neural networks, especially for tasks like audio hotword detection that don’t need ultra-high accuracy to be useful. It’s a tricky job though, since it requires a combination of ML and embedded hardware expertise. Here’s how I would tackle it:

Make sure you can train a model on the desktop that achieves the accuracy you need for your product, ignoring embedded constraints. Usually the hardest part here is getting enough relevant training data, since for good results you’ll need something that reflects the actual data you’ll be seeing on the device.

Once you have a proof-of-concept model working, try to shrink it down to fit the constraints of your device. This will involve:

  • Making sure the number of weights is small enough to fit in RAM. In most cases you can compress them down to eight bits each to help (see the sketch after this list). The summarize_graph tool in tensorflow/tools/graph_transforms will give you an estimate of the number of weights.
  • Reducing the size of the fully-connected and convolutional layers so that the number of compute operations is small enough to run within your device’s budget. The benchmark_model utility in tensorflow/tools/benchmark, run with --show_flops, will give you an estimate of this number. For example, I might guess that an M4 could do 2 MFLOPs/second, and so aim for a model whose cost per inference fits within that budget. A single fully-connected layer with 256 inputs and 128 outputs works out to roughly 65,000 floating-point operations per pass (a multiply and an add for each of its 32,768 weights), so it could run around thirty times a second within that limit.

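To make the eight-bit idea concrete, here’s a minimal sketch of the kind of linear quantization scheme I have in mind: store a shared float range per tensor plus one byte per weight, and expand the bytes back into floats as you use them. The struct and function names here are just illustrative, not part of any TensorFlow API.

```c
#include <stdint.h>

/* One quantized weight tensor: a shared float range plus one byte per weight.
 * The byte array would typically live in flash as a const array, so only the
 * activations need to sit in RAM. Encoding (done offline, on the desktop):
 *   code = round((w - min) * 255 / (max - min))                             */
typedef struct {
  float min;             /* smallest original weight value in the tensor */
  float max;             /* largest original weight value in the tensor  */
  const uint8_t *codes;  /* one byte per weight                          */
  int count;             /* number of weights                            */
} QuantizedTensor;

/* Expand the i-th eight-bit code back into an approximate float weight. */
static inline float dequantize(const QuantizedTensor *t, int i) {
  float range = t->max - t->min;
  if (range == 0.0f) {
    return t->min;  /* all weights in the tensor are identical */
  }
  return t->min + ((float)t->codes[i] * range) / 255.0f;
}
```
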
When you have a model that looks like it should fit, write a custom runtime to execute it. Right now I wouldn’t recommend using the standard TensorFlow runtime on an embedded device, because the binary size overhead of a general interpreter is too large. Instead, manually write a piece of code that loads your weights and does the matrix multiplies, along the lines of the sketch below. Usually your model will be comparatively simple, just a couple of fully-connected layers for example, so hopefully the amount of coding won’t be overwhelming. I’m hoping we’ll have a better solution and some examples for this use case in the future though, since I think it’s an area where neural network solutions are going to make a massive impact!
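
As a rough illustration of what that hand-written code might look like, here’s a sketch of a single fully-connected layer that reads the eight-bit weights from the scheme above, accumulates in float, and applies a ReLU. The layout and names are assumptions on my part, not something exported by a TensorFlow tool; you’d generate the const weight and bias arrays yourself from the trained graph.

```c
/* Builds on the QuantizedTensor/dequantize sketch above.
 * output[j] = relu(bias[j] + sum_i input[i] * W[i][j])
 * Weights are stored row-major, so the code for (input i, output j)
 * lives at index i * output_size + j.                                */
static void fully_connected(const float *input, int input_size,
                            const QuantizedTensor *weights,
                            const float *bias,
                            float *output, int output_size) {
  for (int j = 0; j < output_size; ++j) {
    float acc = bias[j];
    for (int i = 0; i < input_size; ++i) {
      acc += input[i] * dequantize(weights, i * output_size + j);
    }
    output[j] = (acc > 0.0f) ? acc : 0.0f;  /* ReLU */
  }
}
```

A float inner loop like this is a reasonable first cut if your part has a hardware FPU, as the Cortex-M4F does; if it turns out to be too slow, the usual next step is to move the accumulation into fixed point so the DSP multiply-accumulate instructions can do the heavy lifting.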

One response

  1. Hi Pete – exactly the problem and solution we’re thinking of. We’re more concerned about edge devices as opposed to teeny microcontrollers – but same principle. In addition, we’ll try to use other hardware-specific optimized libraries (probably for Intel graphics) for some of the compute. Do you know of any examples of this approach having been done (or parts of it, e.g. extracting all weights), or an easy way of figuring it out (e.g. if there are just standard numpy implementations of all ops in the code base, which we can pull out)? I’ve found deeplearn.js, which supports converting a trained TensorFlow model into something that can run in JS (optionally with WebGL), e.g. here https://deeplearnjs.org/demos/mnist/mnist.html (and matmul here: https://github.com/PAIR-code/deeplearnjs/blob/68219e2c6d7f8d5343a55b416d73e1a92564e97f/src/math/backends/backend_cpu.ts#L261)
