Scaling machine learning models to embedded devices

ScaledML Talk - Pete Warden.png

I gave a talk at ScaledML today, and I’m publishing the slides and my speaking notes as a quick blog post. It will hopefully mostly be familiar to regular readers of my blog already, but I wanted to gather it all in one place!

Hi, I’m here to talk about running machine learning models on tiny computers!

As you may have guessed from the logos splashed across these slides, I’m an engineer on the TensorFlow team at Google, working on our open source machine learning library. My main goal with this talk is to get you excited about a new project we just launched! I’m hoping to get help and contributions from some of you here today, so here are my contact details for any questions or discussions, and I’ll be including these at the end too, when I hope we might have some time for live questions.

So, why am I here, at the ScaledML conference?

I want to start with a number. There are 150 billion embedded processors out there in the world, that’s more than twenty each for every man, woman, and child on earth! Not only did that number amaze me when I first came across it, but the growth rate is an astonishing 20% annually, with no signs of slowing down. That’s much faster than smartphone usage, which is almost flat, or the growth in the number of internet users, which is in the low single digits these days.

One of the things I like about this conference is how the title gives it focus, but still covers a lot of interesting areas. We often think of scale as being about massive data centers but it’s a flexible enough concept for a computing platform with this much reach to make sense to talk about here.

Incidentally, you may be wondering why I’m not talking about the edge, or the internet of things. The term “edge” is actually one of my pet peeves, and to help explain why, take a look at this diagram.

Can you see what’s missing? There are no people here! Even if we were included, we would be literally at the edge, while a datacenter sits in the center. I think it makes a lot more sense to think about our infrastructure as having people at the center, so that they have priority in the way we think about and design our technology.

My objection to the internet of things is that the majority of embedded devices are not connected to any network, and as I’ll discuss in a bit, it’s unlikely that they ever will be, at least more than intermittently.

So, now I’ve got those rants out of the way, what about the ML part of ScaledML? Why is that important in this world?

The hardest constraint embedded devices face is energy. Wiring them into mains power is hard or impossible in most environments. The maintenance burden of replacing batteries quickly becomes unmanageable as the number of devices increases (just think about your smoke alarm chirping when its battery is low, and then multiply that many-fold). The only sane way for the number of devices to keep increasing is if they have batteries that last a very long time, or if they can use energy harvesting (like solar cells from indoor lighting).

That constraint means that it’s essential to keep energy usage at one milliwatt or even lower, since that can give you a year of continuous use on a reasonably small and cheap battery, or alternatively is within the range of a decent energy harvesting system like ambient solar.

Unfortunately, anything involving radio takes a lot of energy, almost always far more than one milliwatt. Transmitting bits of information, even with approaches like bluetooth low energy, is in the tens to hundreds of milliwatts in the best of circumstances at comparatively short range. The efficiency of radio transmission doesn’t seem to be improving dramatically over time either, there seem to be some tough hurdles imposed by physics that make improvements hard.

On a happier note, capturing data through sensors doesn’t suffer from the same problem. There are microphones, accelerometers, and even image sensors that operate well below a milliwatt, even down to tens of microwatts. The same is true for arithmetic. Microprocessors and DSPs are able to process tens or hundreds of millions of calculations for under a milliwatt, even with existing technologies, and much more efficient low-energy accelerators are on the horizon.

What this means is that most data that’s being captured by sensors in the embedded world is just being discarded, without being analyzed at all. As a practical example, I was talking to a satellite company a few years ago, and was astonished to discover that they threw away most of the imagery their satellites gathered. They only had limited bandwidth to download images to base stations periodically, so they stored images at much lower resolutions than their smart-phone derived cameras were capable of, and at a much lower framerate. After going to all the trouble of getting a device into space, this seemed like a terrible waste!

What machine learning, and here I’m primarily talking about deep learning, offers is the ability to take large amounts of noisy sensor data, and spot patterns that lead to actionable information. In the satellite example, rather than storing uniform images that are mostly blank ocean, why not use image detection models to capture ships or other features in much higher detail?

What deep learning enables is decision-making without continuous connectivity to a datacenter in the cloud. This makes it possible to build entirely new kinds of products.

To give you a concrete example of what I mean, here’s a demo. This is a microcontroller development board that you can buy from SparkFun for $15, and it includes microphones and a low-power processor from Ambiq. Unfortunately nobody but the front row will be able to see this tiny LED, but when I say the word “Yes”, it should flash yellow. As you can see, it’s far from perfect, but it is doing this very basic speech recognition on-device with no connectivity, and can run on this coin battery for days. If you’re unconvinced, come and see me after the talk and I’ll show you in person!

So how did we build this demo? One of the toughest challenges we faced was that the computing environment for embedded systems is extremely harsh. We had to design for a system with less than 100 KB of RAM and read-only storage, since that’s common on many microprocessors. We also don’t have much processing power, only tens of millions of arithmetic operations, and we couldn’t rely on having floating point hardware, so we needed to cope with integer-only math. There’s also no common operating system to rely on, and many devices don’t have anything beyond bare metal interfaces that require you to access control registers directly.

To begin, we started with figuring out how we could even fit a model into these constraints. Luckily, the “Hey Google” speech team has many years of experience with these kind of applications, so we knew from discussions with them that it was possible to get useful results with models that were just tens of kilobytes in size. Part of the magic of getting a network to fit into that little memory was quantizing all the weights down to eight bits, but the actual architecture itself was a very standard single-layer convolutional network, applied to frequency spectrograms generated from the raw audio sample data.

This simple model helped us fit within the tight computational budget of the platform, since we needed to run inference multiple times a second, so even the millions of ops per second we had available had to be sliced very thinly. Once we had a model designed, I was able to use the standard TensorFlow speech command tutorial to train it.

With the model side planned out, the next challenge was writing the software to run our network. I work on TensorFlow Lite, so I wanted that as my starting point, but there were some problems. The smallest size we could achieve for a binary using the library was still hundreds of kilobytes, and it has a lot of dependencies on standard libraries like C, C++ or Posix functions. It also assumes that dynamic memory allocation is available. None of these assumptions could be relied on in the embedded world, so it wasn’t usable as it was.

There was a lot to recommend TensorFlow Lite though. It has implementations and tests for many operations, has a great set of APIs, a good ecosystem, and handles conversion from the training side of TensorFlow. I didn’t want to lose all those advantages.

In short, and apologies for a Brexit reference, but I wanted to have my cake and eat it too!

The way out of this dilemma proved to be some targeted refactoring of the existing codebase. The hardest part was separating out the fundamental properties of the framework like file formats and APIs from less-portable implementation details. We did this by focusing on converting the key components of the system into modules, with shared interfaces at the header level, but the possibility of different implementations to cope with the requirements of different environments.

We also made the choice not to boil the ocean and try to get the entirety of a complex framework running on these new platforms at once. Instead, we picked one particular application, the speech recognition demo I just showed, and attempted to get it completely working from training to deployment on a device, before we tried to expand into full op support. This gave us the ability to quickly try out approaches on a small scale and learn by doing, iteratively making progress rather than having to plan the whole thing out ahead of time with imperfect knowledge of what we’d encounter.

We also deliberately avoided focusing on optimizations. We’re not experts on every microcontroller out there, and there are a lot of them, so we wanted to empower the real experts at hardware vendors to contribute. To help with that, we tried to create clear reference implementations along with unit tests, benchmarks, and documentation, to make collaboration as easy as possible even for external engineers with no background in machine learning.

So, what does that all mean in practice? I’m not going to linger on the code, but here’s the heart of a reference implementation of depthwise convolution. The equivalent optimized versions are many screens of code, but at its center, the actual algorithm is quite simple.

One thing I want to get across is that there is a lot of artificial complexity in current implementations of machine learning operations, across most frameworks. They’re written to be fast on particular platforms, not to be understood or learned from. Reference code implementations can act as teaching tools, helping developers unfamiliar with machine learning understand what’s involved by presenting the algorithms in a form they’re familiar with. They also make a great starting point for optimizing for platforms that the original creators weren’t planning for, and together with unit tests, form living specifications to guide future work.

So, what am I hoping you’ll take away from this talk? Number one is the idea that it’s possible and useful to run machine learning on embedded platforms.

As we look to the future, it’s not clear what the “killer app”, if any, will be for this kind of “Tiny ML”. We know that voice interfaces are popular, and if we can get them running locally on cheap, low-power devices (the recent Pixel server-quality transcription release is a proof of concept that this is starting to become possible), then there will be an explosion of products using them. Beyond that though, we need to connect these solutions with the right problems. It is a case of having a hammer and looking for nails to hit with it, but it is a pretty amazing hammer!

What I’d ask of you is that you think about your own problems, the issues you face in the worlds you work in, and imagine how on-device machine learning could help. If your models could run on a 50 cent chip that could be peeled and stick anywhere, what could you do to help people?

With that in mind, the preview version of the TensorFlow Lite for Microcontrollers library is available at this link, together with documentation and a codelab tutorial. I hope you’ll grab it, get ideas on how you might use it, and give us feedback.

Here are my contact details for questions or comments, please do get in touch! I’ll look forward to hopefully taking a few questions now too, if we have time?


Launching TensorFlow Lite for Microcontrollers

SparkFun Edge Development Board - Apollo3 Blue

I’ve been spending a lot of my time over the last year working on getting machine learning running on microcontrollers, and so it was great to finally start talking about it in public for the first time today at the TensorFlow Developer Summit. Even better, I was able to demonstrate TensorFlow Lite running on a Cortex M4 developer board, handling simple speech keyword recognition. I was nervous, especially with the noise of the auditorium to contend with, but I managed to get the little yellow LED to blink in response to my command! If you’re interested in trying it for yourself, the board is available for $15 from SparkFun with the sample code preloaded. For anyone who didn’t catch it, here are the notes from my talk.

Hi, I’m Pete Warden on the TensorFlow Lite team, and I’m here to talk about a new project we’re pretty excited about. When I first joined Google back in 2014, I learned about a lot of exciting internal work that wasn’t yet public, but one of the most impressive moments was when I was introduced to Raziel, who was on the speech team at that point, and he told me that they used network models that were only thirteen kilobytes in size! I only had experience with image models, and in those days even the smallest model like Inception still took up megabytes.

I was even more amazed when he told me why these models had to be so small. They needed to run them on DSPs and other embedded chips in smartphones so Android could listen out for wake words like “Hey Google” while the main CPU was powered off to save the battery. These microcontrollers often only had tens of kilobytes of RAM and Flash memory, so they simply couldn’t fit anything larger. They also couldn’t rely on cloud connectivity because keeping any radio connection alive continuously would drain the battery in no time at all.

What struck me was that the speech team had a massive amount of experience, and had spent a lot of time experimenting, and even within the tough constraints of these devices, neural networks produced better results than any of the more traditional methods they tried. This left me wondering if they would be useful for other embedded sensor applications, and I wanted to see if we could build support for these platforms into TensorFlow. At the time few people knew about the ground-breaking work that was being done in the speech community, so I was excited to help share it more widely.

Today I’m pleased to announce that we are releasing the first, experimental support for embedded platforms in TensorFlow Lite. To show you what I mean, here’s a demonstration I have in my pocket!

This is a prototype of a development board built by SparkFun, and it has a Cortex M4 processor with 384KB of RAM and 1MB of Flash storage. The processor was built by Ambiq to be extremely low power, drawing less than one milliwatt in many cases so it’s able to run for many days on a small coin battery.

I’m going to take my life in my hands now by trying a live demo, so wish me luck! The goal is that I’m going to say the word “Yes”, and the little yellow LED here will light up. Hopefully we can use this camera contraption to show this to everyone on the screen and in the livestream.

“Yes”. “Yes”. “Yes”.

As you can see, it’s still far from perfect, but it’s managing to do a decent job of recognizing when I say the word, and not lighting up when there’s unrelated conversations.

So why is this useful? First, this is running entirely locally on the embedded chip, with no need to have any internet connectivity, so it’s good to have as part of a voice interface system. The model itself takes up less than 20KB of Flash storage space, the footprint of the TensorFlow Lite code is only another 25KB of Flash, and it only needs 30KB of RAM to operate.

Secondly, the software for this demo is entirely open source. You can grab the code for it and build it yourself. It’s also already been ported to a lot of different embedded chips, and we hope to see it appear on many more over the next few months. You can check out the code yourself at

There’s more documentation here:

If you want to customize the example, you can try this code lab:

Third, you can train your own model using this tutorial that we provide. It comes with an open dataset of over 100,000 utterances submitted by volunteers, which we’d love your help expanding through the link here:
The helpful thing about this is that if you have your own words or noises you want to recognize, you should be able to adapt this training approach to your own problem just by supplying new training data.

Fourth, the code is part of TensorFlow Lite, it uses the same APIs, file formats, and conversion tools, so it’s well integrated into the whole TensorFlow ecosystem, making it easier to use.

So, how can you try this out yourself? If you’re in the audience, I’m pleased to say that when you pick up your box this afternoon you’ll find your very own prototype SparkFun Edge board! Just remove the tab to switch the battery on, and you should find it preloaded with the TensorFlow “yes” example. Just try saying “Yes” to TensorFlow, and you should hopefully get a yellow light! We also include all the cables you need to program it with your own code through the serial port. These are the first 700 boards ever built, so there is a wiring issue that drains the battery more quickly than on the final devices, but you should be able to develop with them in exactly the same way as the production boards.

If you’re watching at home, you can order one of these for $15 from SparkFun. You’ll also find instructions for many other platforms in the documentation, so we’re happy to work with whatever devices you want to build your projects on. We welcome collaboration with developers across the community to unlock all the creativity that I know is out there, and I’m hoping to be spending a lot of my time in the future reviewing pull requests!

Finally, a big thanks to everyone who helped bring this prototype together, including the TensorFlow Lite team, especially Raziel, Rocky, Dan, Tim, and Andy; Alasdair, Nathan, Owen and Jim at SparkFun; Scott, Steve, Arpit, and Andre at Ambiq, and many people at Arm including Rod, Neil and Zach! This is still a very early experiment but I can’t wait to see what people build with this.