Scaling machine learning models to embedded devices

I gave a talk at ScaledML today, and I’m publishing the slides and my speaking notes as a quick blog post. It will hopefully mostly be familiar to regular readers of my blog already, but I wanted to gather it all in one place!

Hi, I’m here to talk about running machine learning models on tiny computers!

As you may have guessed from the logos splashed across these slides, I’m an engineer on the TensorFlow team at Google, working on our open source machine learning library. My main goal with this talk is to get you excited about a new project we just launched! I’m hoping to get help and contributions from some of you here today, so here are my contact details for any questions or discussions, and I’ll be including these at the end too, when I hope we might have some time for live questions.

So, why am I here, at the ScaledML conference?

I want to start with a number. There are 150 billion embedded processors out there in the world, that’s more than twenty each for every man, woman, and child on earth! Not only did that number amaze me when I first came across it, but the growth rate is an astonishing 20% annually, with no signs of slowing down. That’s much faster than smartphone usage, which is almost flat, or the growth in the number of internet users, which is in the low single digits these days.

One of the things I like about this conference is how the title gives it focus, but still covers a lot of interesting areas. We often think of scale as being about massive data centers but it’s a flexible enough concept for a computing platform with this much reach to make sense to talk about here.

Incidentally, you may be wondering why I’m not talking about the edge, or the internet of things. The term “edge” is actually one of my pet peeves, and to help explain why, take a look at this diagram.

Can you see what’s missing? There are no people here! Even if we were included, we would be literally at the edge, while a datacenter sits in the center. I think it makes a lot more sense to think about our infrastructure as having people at the center, so that they have priority in the way we think about and design our technology.

My objection to the internet of things is that the majority of embedded devices are not connected to any network, and as I’ll discuss in a bit, it’s unlikely that they ever will be, at least more than intermittently.

So, now I’ve got those rants out of the way, what about the ML part of ScaledML? Why is that important in this world?

The hardest constraint embedded devices face is energy. Wiring them into mains power is hard or impossible in most environments. The maintenance burden of replacing batteries quickly becomes unmanageable as the number of devices increases (just think about your smoke alarm chirping when its battery is low, and then multiply that many-fold). The only sane way for the number of devices to keep increasing is if they have batteries that last a very long time, or if they can use energy harvesting (like solar cells from indoor lighting).

That constraint means that it’s essential to keep energy usage at one milliwatt or even lower, since that can give you a year of continuous use on a reasonably small and cheap battery, or alternatively is within the range of a decent energy harvesting system like ambient solar.

Unfortunately, anything involving radio takes a lot of energy, almost always far more than one milliwatt. Transmitting bits of information, even with approaches like bluetooth low energy, is in the tens to hundreds of milliwatts in the best of circumstances at comparatively short range. The efficiency of radio transmission doesn’t seem to be improving dramatically over time either, there seem to be some tough hurdles imposed by physics that make improvements hard.

On a happier note, capturing data through sensors doesn’t suffer from the same problem. There are microphones, accelerometers, and even image sensors that operate well below a milliwatt, even down to tens of microwatts. The same is true for arithmetic. Microprocessors and DSPs are able to process tens or hundreds of millions of calculations for under a milliwatt, even with existing technologies, and much more efficient low-energy accelerators are on the horizon.

What this means is that most data that’s being captured by sensors in the embedded world is just being discarded, without being analyzed at all. As a practical example, I was talking to a satellite company a few years ago, and was astonished to discover that they threw away most of the imagery their satellites gathered. They only had limited bandwidth to download images to base stations periodically, so they stored images at much lower resolutions than their smart-phone derived cameras were capable of, and at a much lower framerate. After going to all the trouble of getting a device into space, this seemed like a terrible waste!

What machine learning, and here I’m primarily talking about deep learning, offers is the ability to take large amounts of noisy sensor data, and spot patterns that lead to actionable information. In the satellite example, rather than storing uniform images that are mostly blank ocean, why not use image detection models to capture ships or other features in much higher detail?

What deep learning enables is decision-making without continuous connectivity to a datacenter in the cloud. This makes it possible to build entirely new kinds of products.

To give you a concrete example of what I mean, here’s a demo. This is a microcontroller development board that you can buy from SparkFun for $15, and it includes microphones and a low-power processor from Ambiq. Unfortunately nobody but the front row will be able to see this tiny LED, but when I say the word “Yes”, it should flash yellow. As you can see, it’s far from perfect, but it is doing this very basic speech recognition on-device with no connectivity, and can run on this coin battery for days. If you’re unconvinced, come and see me after the talk and I’ll show you in person!

So how did we build this demo? One of the toughest challenges we faced was that the computing environment for embedded systems is extremely harsh. We had to design for a system with less than 100 KB of RAM and read-only storage, since that’s common on many microprocessors. We also don’t have much processing power, only tens of millions of arithmetic operations, and we couldn’t rely on having floating point hardware, so we needed to cope with integer-only math. There’s also no common operating system to rely on, and many devices don’t have anything beyond bare metal interfaces that require you to access control registers directly.

To begin, we started with figuring out how we could even fit a model into these constraints. Luckily, the “Hey Google” speech team has many years of experience with these kind of applications, so we knew from discussions with them that it was possible to get useful results with models that were just tens of kilobytes in size. Part of the magic of getting a network to fit into that little memory was quantizing all the weights down to eight bits, but the actual architecture itself was a very standard single-layer convolutional network, applied to frequency spectrograms generated from the raw audio sample data.

This simple model helped us fit within the tight computational budget of the platform, since we needed to run inference multiple times a second, so even the millions of ops per second we had available had to be sliced very thinly. Once we had a model designed, I was able to use the standard TensorFlow speech command tutorial to train it.

With the model side planned out, the next challenge was writing the software to run our network. I work on TensorFlow Lite, so I wanted that as my starting point, but there were some problems. The smallest size we could achieve for a binary using the library was still hundreds of kilobytes, and it has a lot of dependencies on standard libraries like C, C++ or Posix functions. It also assumes that dynamic memory allocation is available. None of these assumptions could be relied on in the embedded world, so it wasn’t usable as it was.

There was a lot to recommend TensorFlow Lite though. It has implementations and tests for many operations, has a great set of APIs, a good ecosystem, and handles conversion from the training side of TensorFlow. I didn’t want to lose all those advantages.

In short, and apologies for a Brexit reference, but I wanted to have my cake and eat it too!

The way out of this dilemma proved to be some targeted refactoring of the existing codebase. The hardest part was separating out the fundamental properties of the framework like file formats and APIs from less-portable implementation details. We did this by focusing on converting the key components of the system into modules, with shared interfaces at the header level, but the possibility of different implementations to cope with the requirements of different environments.

We also made the choice not to boil the ocean and try to get the entirety of a complex framework running on these new platforms at once. Instead, we picked one particular application, the speech recognition demo I just showed, and attempted to get it completely working from training to deployment on a device, before we tried to expand into full op support. This gave us the ability to quickly try out approaches on a small scale and learn by doing, iteratively making progress rather than having to plan the whole thing out ahead of time with imperfect knowledge of what we’d encounter.

We also deliberately avoided focusing on optimizations. We’re not experts on every microcontroller out there, and there are a lot of them, so we wanted to empower the real experts at hardware vendors to contribute. To help with that, we tried to create clear reference implementations along with unit tests, benchmarks, and documentation, to make collaboration as easy as possible even for external engineers with no background in machine learning.

So, what does that all mean in practice? I’m not going to linger on the code, but here’s the heart of a reference implementation of depthwise convolution. The equivalent optimized versions are many screens of code, but at its center, the actual algorithm is quite simple.

One thing I want to get across is that there is a lot of artificial complexity in current implementations of machine learning operations, across most frameworks. They’re written to be fast on particular platforms, not to be understood or learned from. Reference code implementations can act as teaching tools, helping developers unfamiliar with machine learning understand what’s involved by presenting the algorithms in a form they’re familiar with. They also make a great starting point for optimizing for platforms that the original creators weren’t planning for, and together with unit tests, form living specifications to guide future work.

So, what am I hoping you’ll take away from this talk? Number one is the idea that it’s possible and useful to run machine learning on embedded platforms.

As we look to the future, it’s not clear what the “killer app”, if any, will be for this kind of “Tiny ML”. We know that voice interfaces are popular, and if we can get them running locally on cheap, low-power devices (the recent Pixel server-quality transcription release is a proof of concept that this is starting to become possible), then there will be an explosion of products using them. Beyond that though, we need to connect these solutions with the right problems. It is a case of having a hammer and looking for nails to hit with it, but it is a pretty amazing hammer!

What I’d ask of you is that you think about your own problems, the issues you face in the worlds you work in, and imagine how on-device machine learning could help. If your models could run on a 50 cent chip that could be peeled and stick anywhere, what could you do to help people?

With that in mind, the preview version of the TensorFlow Lite for Microcontrollers library is available at this link, together with documentation and a codelab tutorial. I hope you’ll grab it, get ideas on how you might use it, and give us feedback.

Here are my contact details for questions or comments, please do get in touch! I’ll look forward to hopefully taking a few questions now too, if we have time?

	bouquetsweetly69036a… on Meet Fiona and Abby
	softlysuitcb91a8b8b1 on Meet Fiona and Abby
	Zero-Copy GPU Infere… on Why GEMM is at the heart of de…
	Moonshine Voice完全解説｜… on Announcing Moonshine Voice
	Moonshine KI-Sprache… on Introducing Moonshine, the new…

Pete Warden's blog

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.

Scaling machine learning models to embedded devices

One response

Leave a comment Cancel reply

Share this:

Related

One response

Leave a comment Cancel reply