What Machine Learning needs from Hardware


Photo by Kimberly D

On Monday I’ll be giving a keynote at the IEEE Custom Integrated Circuits Conference, which is quite surprising even to me, considering I’m a software engineer who can barely solder! Despite that, I knew exactly what I wanted to talk about when I was offered the invitation. If I have a room full of hardware designers listening to me for twenty minutes, I want them to understand what people building machine learning applications need out of their chips. After thirteen years(!) of blogging, I find writing a post the most natural way of organizing my thoughts, so I hope any attendees don’t mind some spoilers on what I’ll be asking for.

At TinyML last month, I think it was Simon Craske from Arm who said that a few years ago hardware design was getting a little bit boring, since the requirements seemed well understood and it was mostly just an exercise in iterating on existing ideas. The rise of machine learning (and to be more specific deep learning since that’s been the almost exclusive focus so far) has changed all that. The good news is that chip design is no longer boring, but it can be hard to understand what the new requirements are, so in this post I’ll try to cover my perspective from the software side. I won’t be proposing hardware solutions, since I don’t know what the answers are, but I will try to distill what the hundreds of teams I’ve worked with building products using machine learning are asking for.

More Arithmetic

The single most important message I want to get across is that there are a lot of new applications that are blocked from launching because we don’t have enough computing power. Many other existing products would be improved if we could run models that require more computing power than is available. It doesn’t matter whether you’re in the cloud, on a mobile device, or even on an embedded system; teams have models they’d like to run that they can’t.

To give a practical example, take a look at what might seem like a mature area, speech recognition. After heroic efforts, a team at Google was recently able to squeeze server-quality transcription onto local compute on Pixel phones. The model itself is comparatively small too, at just 80 MB. This network is pushing the limits of what a modern application processor on a mobile device can manage, but it’s almost entirely arithmetic-bound. That means if we could offer the same level of raw compute in a chip for lower energy and a cheaper price, this sort of general speech recognition could be added to almost any product. Even if you aren’t looking to expand beyond current phones, there are still problems like the cocktail party effect that can benefit from running additional neural networks to improve the overall accuracy. You can get even more context from visual sensors by looking at things like gaze direction, which, again, require more deep learning models to calculate.

This is just one application area. Every product domain I’ve worked with has similar stories, where improvements in latency and energy usage when running networks would translate directly into new or enhanced user experiences. If you look at a time profile of these networks you’ll see almost all the time is going into multiply-adds, so improving the hardware does mean improving its ability to run the arithmetic primitives efficiently.
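
To give a sense of why the multiply-adds dominate, here’s a back-of-the-envelope estimate (with purely hypothetical layer dimensions) of how many a single convolutional layer performs:

# Rough multiply-add count for one convolutional layer.
# All dimensions here are hypothetical, chosen just to show the scale involved.
output_height, output_width = 112, 112
input_channels, output_channels = 32, 64
kernel_height, kernel_width = 3, 3

# Each output value is a dot product over a kernel_height x kernel_width x input_channels patch.
multiply_adds = (output_height * output_width * output_channels *
                 kernel_height * kernel_width * input_channels)
print(f"{multiply_adds:,} multiply-adds for this one layer")
# => 231,211,008 multiply-adds, and real networks stack dozens of layers like this.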

Inference

I may be biased because my work almost entirely focuses on running already-trained models (‘inference’ in ML terms) but I believe the biggest need over the next few years is for inference hardware, not training. Someone once said to me “training scales with the number of researchers, inference scales with the number of applications times the number of users”, and that idea has always stuck with me. An individual model author’s training needs are immense, but there are comparatively few of them and their growth is limited by human educational processes. By comparison, popular applications have hundreds of millions of users, and each application can require many models, so the scale of inference calculations can easily grow much faster.

On a less philosophical note, as I talked about earlier I see a lot of teams who are able to train more complex models than they can deploy on their production platforms. Even cloud applications have compute budgets driven by the economic costs of running servers, and other devices usually have hard limits on the resources available. For those reasons, I’d love to see a lot of attention paid to speeding up ML inference from the hardware community.

Low Precision

It’s now widely accepted that eight bits are enough for running inference on convolutional neural networks. The picture is a bit more complicated for training, and for recurrent networks, because both processes require the addition of many small increments to stored values to achieve their results, but at the very least full thirty-two bit floating point values are overkill in every case I’ve seen. If you design inference hardware for eight-bit precision you’ll cover a large number of practical use cases. Of course, exactly what eight-bit means isn’t necessarily obvious, so I’m hoping the TensorFlow team will be able to produce some guidance based on our experience soon, detailing exactly what we believe the best practices are for executing eight-bit calculations.
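
To make that concrete, here’s a minimal sketch of one way to produce an eight-bit model with post-training quantization in TensorFlow Lite. The model path and calibration data are placeholders, and this is just one possible recipe rather than the official guidance I mentioned:

import numpy as np
import tensorflow as tf

# A minimal post-training quantization sketch. "my_saved_model" and the
# calibration generator are placeholders for your own model and typical inputs.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_data_gen():
    # Yield a few hundred typical examples so the converter can pick sensible
    # ranges for the eight-bit values.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)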

There’s also a lot of evidence that it’s possible to go lower than eight bits, but that is a lot less settled. As some background it’s worth reading this survey by my colleague Raghu, which has a lot of experiments investigating different possibilities.

Compatibility

I’ve saved what I expect may be my most controversial request until last. The typical design process I’ve seen from hardware teams is that they will look at some existing ML workloads, note that almost all of the time goes into just a few operations, and so design an accelerator that speeds up those critical-path ops.

This sounds fine in principle, but when an accelerator like that is integrated into a full system it often fails to live up to its potential. The problem is that even though most of the compute for almost all models does go into a handful of common operations, there are hundreds of others that often appear. Almost every model I see has some of these, and they’re almost always different from network to network. A good example is ‘non-max suppression’ in MobileSSD and similar object detection models, where we need some very specific and custom operations to merge the many bounding boxes that are output by the model into just a few coherent final results. This doesn’t require very much raw compute, but it does take a lot of logic, and is hard to express except as general C++ code. In a similar way, many audio networks have a feature generation preprocessing step that converts raw audio data into tensors to feed into the neural networks. Even more tricky are custom steps (like modified activation functions) that show up in the middle of networks. Almost none of these operations are compute intensive, but they aren’t supported by specialized accelerators.
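
To give a flavor of the kind of logic involved, here’s a stripped-down sketch of greedy non-max suppression in plain Python. Real implementations handle batching, classes, and score thresholds, but even this simplified version is control-flow heavy in a way that doesn’t map naturally onto a matrix-multiply accelerator:

def iou(box_a, box_b):
    # Boxes are (y_min, x_min, y_max, x_max).
    y1 = max(box_a[0], box_b[0])
    x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2])
    x2 = min(box_a[3], box_b[3])
    intersection = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Greedily keep the highest-scoring box, then drop any remaining box that
    # overlaps it too much, and repeat until no candidates are left.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept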

There are two common answers to this from hardware teams. The first is to fall back to a main application processor to implement these custom operations. If the accelerator is across a system bus from the main CPU this can involve a lot of latency as the two processors have to communicate and synchronize with each other. This latency can easily cancel out any speed advantages from using the accelerator in the first place. Alternatively, the team may direct users towards using ‘blessed’ models that will run entirely on the accelerator, avoiding any of the tricky custom operations. This can work for some cases, but the majority of the product teams I work with are struggling to train their models to the accuracy they require for their application, so they’re usually using custom approaches to achieve the results they need. This makes asking them to switch to a new model and figure out how to achieve similar results within tighter constraints a big ask.

This is a big problem for accelerator adoption in practice. What I’m hoping is that future accelerators will offer some kind of general compute capability so that arbitrary C++ custom operations can be easily ported and run on them. The work we’re doing on dependency-free reference implementations in TensorFlow Lite is initially aimed at microcontrollers and embedded systems, but I’m hoping it will eventually be useful for porting ops to devices like accelerators too. The nice thing is that these custom operations almost always involve much less compute than the core accelerated ops, so you don’t need a fast way of running general purpose code, just an escape valve that avoids the latency hit of delegating to the main application processor.

To illustrate the issue, I tried to estimate the number of operations in core TensorFlow using the following command from inside the source folder:

grep -Ir 'REGISTER_OP("' tensorflow/core | grep -vE '(test)|(contrib)' | wc -l

This gives me a rough estimate of 1,202 operations in TensorFlow. Some of these are internal details, or only used for debugging, but in my experience you need to be prepared to deal with many hundreds of different ops if you’re accepting models from authors. I don’t expect this problem to get any easier, since researchers seem to be creating new and improved operations faster than accelerators can support the ones that are already there!

Codesign

The exciting news is that we’re all in a great position to see improvements we make in our systems translate very quickly into better experiences for our users. The work we’re doing has the potential for a lot of impact on people’s lives, and I think we have the right tools to make fast progress. Because this is such a new area, it will require a lot of cooperation between the hardware and software communities, with rapid iteration and sharing of requirements, so I’m looking forward to continuing to share my experiences, and to hearing from hardware experts about ways we can move forward together.

Scaling machine learning models to embedded devices


I gave a talk at ScaledML today, and I’m publishing the slides and my speaking notes as a quick blog post. It will hopefully mostly be familiar to regular readers of my blog already, but I wanted to gather it all in one place!

Hi, I’m here to talk about running machine learning models on tiny computers!

As you may have guessed from the logos splashed across these slides, I’m an engineer on the TensorFlow team at Google, working on our open source machine learning library. My main goal with this talk is to get you excited about a new project we just launched! I’m hoping to get help and contributions from some of you here today, so here are my contact details for any questions or discussions, and I’ll be including these at the end too, when I hope we might have some time for live questions.

So, why am I here, at the ScaledML conference?

I want to start with a number. There are 150 billion embedded processors out there in the world, that’s more than twenty each for every man, woman, and child on earth! Not only did that number amaze me when I first came across it, but the growth rate is an astonishing 20% annually, with no signs of slowing down. That’s much faster than smartphone usage, which is almost flat, or the growth in the number of internet users, which is in the low single digits these days.

One of the things I like about this conference is how the title gives it focus, but still covers a lot of interesting areas. We often think of scale as being about massive data centers, but it’s a flexible enough concept that a computing platform with this much reach makes sense to talk about here.

Incidentally, you may be wondering why I’m not talking about the edge, or the internet of things. The term “edge” is actually one of my pet peeves, and to help explain why, take a look at this diagram.

Can you see what’s missing? There are no people here! Even if we were included, we would be literally at the edge, while a datacenter sits in the center. I think it makes a lot more sense to think about our infrastructure as having people at the center, so that they have priority in the way we think about and design our technology.

My objection to the internet of things is that the majority of embedded devices are not connected to any network, and as I’ll discuss in a bit, it’s unlikely that they ever will be, at least more than intermittently.

So, now I’ve got those rants out of the way, what about the ML part of ScaledML? Why is that important in this world?

The hardest constraint embedded devices face is energy. Wiring them into mains power is hard or impossible in most environments. The maintenance burden of replacing batteries quickly becomes unmanageable as the number of devices increases (just think about your smoke alarm chirping when its battery is low, and then multiply that many-fold). The only sane way for the number of devices to keep increasing is if they have batteries that last a very long time, or if they can use energy harvesting (like solar cells from indoor lighting).

That constraint means that it’s essential to keep energy usage at one milliwatt or even lower, since that can give you a year of continuous use on a reasonably small and cheap battery, or alternatively is within the range of a decent energy harvesting system like ambient solar.

Unfortunately, anything involving radio takes a lot of energy, almost always far more than one milliwatt. Transmitting bits of information, even with approaches like Bluetooth Low Energy, takes tens to hundreds of milliwatts in the best of circumstances at comparatively short range. The efficiency of radio transmission doesn’t seem to be improving dramatically over time either; there seem to be some tough hurdles imposed by physics that make improvements hard.

On a happier note, capturing data through sensors doesn’t suffer from the same problem. There are microphones, accelerometers, and even image sensors that operate well below a milliwatt, even down to tens of microwatts. The same is true for arithmetic. Microprocessors and DSPs are able to process tens or hundreds of millions of calculations for under a milliwatt, even with existing technologies, and much more efficient low-energy accelerators are on the horizon.

What this means is that most data that’s being captured by sensors in the embedded world is just being discarded, without being analyzed at all. As a practical example, I was talking to a satellite company a few years ago, and was astonished to discover that they threw away most of the imagery their satellites gathered. They only had limited bandwidth to download images to base stations periodically, so they stored images at much lower resolutions than their smartphone-derived cameras were capable of, and at a much lower framerate. After going to all the trouble of getting a device into space, throwing away most of what it captured seemed like a terrible waste!

What machine learning, and here I’m primarily talking about deep learning, offers is the ability to take large amounts of noisy sensor data, and spot patterns that lead to actionable information. In the satellite example, rather than storing uniform images that are mostly blank ocean, why not use image detection models to capture ships or other features in much higher detail?

What deep learning enables is decision-making without continuous connectivity to a datacenter in the cloud. This makes it possible to build entirely new kinds of products.

To give you a concrete example of what I mean, here’s a demo. This is a microcontroller development board that you can buy from SparkFun for $15, and it includes microphones and a low-power processor from Ambiq. Unfortunately nobody but the front row will be able to see this tiny LED, but when I say the word “Yes”, it should flash yellow. As you can see, it’s far from perfect, but it is doing this very basic speech recognition on-device with no connectivity, and can run on this coin battery for days. If you’re unconvinced, come and see me after the talk and I’ll show you in person!

So how did we build this demo? One of the toughest challenges we faced was that the computing environment for embedded systems is extremely harsh. We had to design for a system with less than 100 KB of RAM and read-only storage, since that’s common on many microcontrollers. We also didn’t have much processing power, only tens of millions of arithmetic operations per second, and we couldn’t rely on having floating point hardware, so we needed to cope with integer-only math. There’s also no common operating system to rely on, and many devices don’t have anything beyond bare metal interfaces that require you to access control registers directly.

To begin, we started by figuring out how we could even fit a model into these constraints. Luckily, the “Hey Google” speech team has many years of experience with these kinds of applications, so we knew from discussions with them that it was possible to get useful results with models that were just tens of kilobytes in size. Part of the magic of getting a network to fit into that little memory was quantizing all the weights down to eight bits, but the architecture itself was a very standard single-layer convolutional network, applied to frequency spectrograms generated from the raw audio sample data.
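
As a rough sketch of what that kind of network looks like (the layer dimensions here are illustrative guesses, not the exact production architecture):

import tensorflow as tf

# A toy keyword-spotting model: a single convolution over an audio spectrogram,
# followed by a fully-connected classifier. All dimensions are illustrative.
num_frames, num_freq_bins, num_classes = 49, 40, 4  # e.g. "yes", "no", silence, unknown

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, kernel_size=(10, 8), strides=(2, 2), activation="relu",
                           input_shape=(num_frames, num_freq_bins, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.summary()  # Roughly ten thousand parameters, so the weights fit in tens of kilobytes.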

This simple model helped us fit within the tight computational budget of the platform. We needed to run inference multiple times a second, so even the millions of ops per second we had available had to be sliced very thinly. Once we had a model designed, I was able to use the standard TensorFlow speech commands tutorial to train it.

With the model side planned out, the next challenge was writing the software to run our network. I work on TensorFlow Lite, so I wanted that as my starting point, but there were some problems. The smallest size we could achieve for a binary using the library was still hundreds of kilobytes, and it had a lot of dependencies on the C and C++ standard libraries and POSIX functions. It also assumed that dynamic memory allocation was available. None of these assumptions could be relied on in the embedded world, so it wasn’t usable as it was.

There was a lot to recommend TensorFlow Lite though. It has implementations and tests for many operations, a great set of APIs, and a good ecosystem, and it handles conversion from the training side of TensorFlow. I didn’t want to lose all those advantages.

In short, and apologies for a Brexit reference, but I wanted to have my cake and eat it too!

The way out of this dilemma proved to be some targeted refactoring of the existing codebase. The hardest part was separating out the fundamental properties of the framework like file formats and APIs from less-portable implementation details. We did this by focusing on converting the key components of the system into modules, with shared interfaces at the header level, but the possibility of different implementations to cope with the requirements of different environments.

We also made the choice not to boil the ocean and try to get the entirety of a complex framework running on these new platforms at once. Instead, we picked one particular application, the speech recognition demo I just showed, and attempted to get it completely working from training to deployment on a device, before we tried to expand into full op support. This gave us the ability to quickly try out approaches on a small scale and learn by doing, iteratively making progress rather than having to plan the whole thing out ahead of time with imperfect knowledge of what we’d encounter.

We also deliberately avoided focusing on optimizations. We’re not experts on every microcontroller out there, and there are a lot of them, so we wanted to empower the real experts at hardware vendors to contribute. To help with that, we tried to create clear reference implementations along with unit tests, benchmarks, and documentation, to make collaboration as easy as possible even for external engineers with no background in machine learning.

So, what does that all mean in practice? I’m not going to linger on the code, but here’s the heart of a reference implementation of depthwise convolution. The equivalent optimized versions are many screens of code, but at its center, the actual algorithm is quite simple.
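
As a rough Python sketch of that inner loop (the real reference implementation is C++ and handles padding, strides, and quantization, so this shows only the shape of the algorithm, not the actual code):

def depthwise_conv(input_data, filters):
    # input_data: [height][width][channels], filters: [k_h][k_w][channels].
    # Each channel is convolved with its own filter; channels never mix,
    # which is what distinguishes depthwise from regular convolution.
    in_h, in_w, channels = len(input_data), len(input_data[0]), len(input_data[0][0])
    k_h, k_w = len(filters), len(filters[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    output = [[[0.0] * channels for _ in range(out_w)] for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            for c in range(channels):
                total = 0.0
                for ky in range(k_h):
                    for kx in range(k_w):
                        total += input_data[y + ky][x + kx][c] * filters[ky][kx][c]
                output[y][x][c] = total
    return output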

One thing I want to get across is that there is a lot of artificial complexity in current implementations of machine learning operations, across most frameworks. They’re written to be fast on particular platforms, not to be understood or learned from. Reference code implementations can act as teaching tools, helping developers unfamiliar with machine learning understand what’s involved by presenting the algorithms in a form they’re familiar with. They also make a great starting point for optimizing for platforms that the original creators weren’t planning for, and together with unit tests, form living specifications to guide future work.

So, what am I hoping you’ll take away from this talk? Number one is the idea that it’s possible and useful to run machine learning on embedded platforms.

As we look to the future, it’s not clear what the “killer app”, if any, will be for this kind of “Tiny ML”. We know that voice interfaces are popular, and if we can get them running locally on cheap, low-power devices (the recent Pixel server-quality transcription release is a proof of concept that this is starting to become possible), then there will be an explosion of products using them. Beyond that though, we need to connect these solutions with the right problems. It is a case of having a hammer and looking for nails to hit with it, but it is a pretty amazing hammer!

What I’d ask of you is that you think about your own problems, the issues you face in the worlds you work in, and imagine how on-device machine learning could help. If your models could run on a 50 cent chip that could be peeled and stuck anywhere, what could you do to help people?

With that in mind, the preview version of the TensorFlow Lite for Microcontrollers library is available at this link, together with documentation and a codelab tutorial. I hope you’ll grab it, get ideas on how you might use it, and give us feedback.

Here are my contact details for questions or comments, please do get in touch! I’ll look forward to hopefully taking a few questions now too, if we have time?

 

Launching TensorFlow Lite for Microcontrollers

SparkFun Edge Development Board - Apollo3 Blue

I’ve been spending a lot of my time over the last year working on getting machine learning running on microcontrollers, and so it was great to finally start talking about it in public for the first time today at the TensorFlow Developer Summit. Even better, I was able to demonstrate TensorFlow Lite running on a Cortex M4 developer board, handling simple speech keyword recognition. I was nervous, especially with the noise of the auditorium to contend with, but I managed to get the little yellow LED to blink in response to my command! If you’re interested in trying it for yourself, the board is available for $15 from SparkFun with the sample code preloaded. For anyone who didn’t catch it, here are the notes from my talk.

Hi, I’m Pete Warden on the TensorFlow Lite team, and I’m here to talk about a new project we’re pretty excited about. When I first joined Google back in 2014, I learned about a lot of exciting internal work that wasn’t yet public, but one of the most impressive moments was when I was introduced to Raziel, who was on the speech team at that point, and he told me that they used network models that were only thirteen kilobytes in size! I only had experience with image models, and in those days even the smallest model like Inception still took up megabytes.

I was even more amazed when he told me why these models had to be so small. They needed to run them on DSPs and other embedded chips in smartphones so Android could listen out for wake words like “Hey Google” while the main CPU was powered off to save the battery. These microcontrollers often only had tens of kilobytes of RAM and Flash memory, so they simply couldn’t fit anything larger. They also couldn’t rely on cloud connectivity because keeping any radio connection alive continuously would drain the battery in no time at all.

What struck me was that the speech team had a massive amount of experience, and had spent a lot of time experimenting, and even within the tough constraints of these devices, neural networks produced better results than any of the more traditional methods they tried. This left me wondering if they would be useful for other embedded sensor applications, and I wanted to see if we could build support for these platforms into TensorFlow. At the time few people knew about the ground-breaking work that was being done in the speech community, so I was excited to help share it more widely.

Today I’m pleased to announce that we are releasing the first, experimental support for embedded platforms in TensorFlow Lite. To show you what I mean, here’s a demonstration I have in my pocket!

This is a prototype of a development board built by SparkFun, and it has a Cortex M4 processor with 384KB of RAM and 1MB of Flash storage. The processor was built by Ambiq to be extremely low power, drawing less than one milliwatt in many cases so it’s able to run for many days on a small coin battery.

I’m going to take my life in my hands now by trying a live demo, so wish me luck! The goal is that I’m going to say the word “Yes”, and the little yellow LED here will light up. Hopefully we can use this camera contraption to show this to everyone on the screen and in the livestream.

“Yes”. “Yes”. “Yes”.

As you can see, it’s still far from perfect, but it’s managing to do a decent job of recognizing when I say the word, and not lighting up when there are unrelated conversations.

So why is this useful? First, this is running entirely locally on the embedded chip, with no need to have any internet connectivity, so it’s good to have as part of a voice interface system. The model itself takes up less than 20KB of Flash storage space, the footprint of the TensorFlow Lite code is only another 25KB of Flash, and it only needs 30KB of RAM to operate.

Secondly, the software for this demo is entirely open source. You can grab the code for it and build it yourself. It’s also already been ported to a lot of different embedded chips, and we hope to see it appear on many more over the next few months. You can check out the code yourself at

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/micro

There’s more documentation here:

https://www.tensorflow.org/lite/guide/microcontroller

If you want to customize the example, you can try this code lab:

https://g.co/codelabs/sparkfunTF

Third, you can train your own model using this tutorial that we provide. It comes with an open dataset of over 100,000 utterances submitted by volunteers, which we’d love your help expanding through the link here:

https://aiyprojects.withgoogle.com/open_speech_recording

The helpful thing about this is that if you have your own words or noises you want to recognize, you should be able to adapt this training approach to your own problem just by supplying new training data.
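
To illustrate what supplying new training data looks like, the dataset is organized as one folder per word, each containing short WAV clips, and the training scripts expect the same layout for your own recordings. Here’s a hypothetical helper (all paths and labels are made up) that arranges clips that way:

import os
import shutil

# Arrange your own recordings into the folder-per-word layout the speech
# training scripts expect. Paths and labels below are placeholders.
recordings = {
    "on": ["my_clips/on_001.wav", "my_clips/on_002.wav"],
    "off": ["my_clips/off_001.wav", "my_clips/off_002.wav"],
}
dataset_dir = "my_speech_dataset"
for word, files in recordings.items():
    os.makedirs(os.path.join(dataset_dir, word), exist_ok=True)
    for path in files:
        shutil.copy(path, os.path.join(dataset_dir, word))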

Fourth, the code is part of TensorFlow Lite: it uses the same APIs, file formats, and conversion tools, so it’s well integrated into the whole TensorFlow ecosystem, making it easier to use.

So, how can you try this out yourself? If you’re in the audience, I’m pleased to say that when you pick up your box this afternoon you’ll find your very own prototype SparkFun Edge board! Just remove the tab to switch the battery on, and you should find it preloaded with the TensorFlow “yes” example. Just try saying “Yes” to TensorFlow, and you should hopefully get a yellow light! We also include all the cables you need to program it with your own code through the serial port. These are the first 700 boards ever built, so there is a wiring issue that drains the battery more quickly than on the final devices, but you should be able to develop with them in exactly the same way as the production boards.

If you’re watching at home, you can order one of these for $15 from SparkFun. You’ll also find instructions for many other platforms in the documentation, so we’re happy to work with whatever devices you want to build your projects on. We welcome collaboration with developers across the community to unlock all the creativity that I know is out there, and I’m hoping to be spending a lot of my time in the future reviewing pull requests!

Finally, a big thanks to everyone who helped bring this prototype together, including the TensorFlow Lite team, especially Raziel, Rocky, Dan, Tim, and Andy; Alasdair, Nathan, Owen and Jim at SparkFun; Scott, Steve, Arpit, and Andre at Ambiq, and many people at Arm including Rod, Neil and Zach! This is still a very early experiment but I can’t wait to see what people build with this.

Will Compression Be Machine Learning’s Killer App?

Photo by Greg Simenoff

When I talk to people about machine learning on phones and devices I often get asked “What’s the killer application?”. I have a lot of different answers, everything from voice interfaces to entirely new ways of using sensor data, but the one I’m most excited about in the near term is compression. Despite being fairly well-known in the research community, this seems to surprise a lot of people, so I wanted to share some of my personal thoughts on why I see compression as so promising.

I was reminded of this whole area when I came across an OSDI paper on “Neural Adaptive Content-aware Internet Video Delivery“. The summary is that by using neural networks they’re able to improve a quality-of-experience metric by 43% if they keep the bandwidth the same, or alternatively reduce the bandwidth by 17% while preserving the perceived quality. There have also been other papers in a similar vein, such as this one on generative compression, or adaptive image compression. They all show impressive results, so why don’t we hear more about compression as a machine learning application?

We don’t (yet) have the compute

All of these approaches require comparatively large neural networks, and the amount of arithmetic needed scales with the number of pixels. This means large images or video with high frames-per-second can require more computing power than current phones and similar devices have available. Most CPUs can only practically handle tens of billions of arithmetic operations per second, and running ML compression on HD video could easily require ten times that.
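
To put rough numbers on that claim, here’s the kind of back-of-the-envelope arithmetic I have in mind (the ops-per-pixel figure is a guess, and real networks vary widely):

# Ballpark estimate of the compute needed for per-pixel ML compression on HD video.
# The ops-per-pixel figure is a hypothetical placeholder; real networks vary widely.
width, height, frames_per_second = 1920, 1080, 30
ops_per_pixel = 1_000  # assume a modest network doing ~1,000 ops for every pixel

ops_per_second = width * height * frames_per_second * ops_per_pixel
print(f"{ops_per_second / 1e9:.0f} billion ops per second")
# => roughly 62 billion ops/sec, already beyond the tens of billions a phone CPU can sustain.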

The good news is that there are hardware solutions, like the Edge TPU amongst others, that offer the promise of much more compute being available in the future. I’m hopeful that we’ll be able to apply these resources to all sorts of compression problems, from video and image, to audio, and even more imaginative approaches.

Natural language is the ultimate compression

One of the other reasons I think ML is such a good fit for compression is how many interesting results we’ve had recently with natural language. If you squint, you can see captioning as a way of radically compressing an image. One of the projects I’ve long wanted to create is a camera that runs captioning at one frame per second, and then writes each one out as a series of lines in a log file. That would create a very simplistic story of what the camera sees over time, I think of it as a narrative sensor.

The reason I think of this as compression is that you can then apply a generative neural network to each caption to recreate images. The images won’t be literal matches to the inputs, but they should carry the same meaning. If you want results that are closer to the originals, you can also look at stylization, for example to create a line drawing of each scene. What these techniques have in common is that they identify parts of the input that are most important to us as people, and ignore the rest.

It’s not just images

There’s a similar trend in the speech world. Voice recognition is improving rapidly, and so is the ability to synthesize speech. Recognition can be seen as the process of compressing audio into natural language text, and synthesis as the reverse. You could imagine being able to highly compress conversations down to transmitting written representations rather than audio. I can’t imagine a need to go that far, but it does seem likely that we’ll be able to achieve much better quality and lower bandwidth by exploiting our new understanding of the patterns in speech.

I even see interesting possibilities for applying ML compression to text itself. Andrej Karpathy’s char-rnn shows how well neural networks can mimic styles given some examples, and that prediction is a similar problem to compression. If you think about how much redundancy is in a typical HTML page, it seems likely that there would be some decent opportunities for ML to improve on gzip. This is getting into speculation though, since I don’t have any ML text compression papers handy.

Compression is already a budget item

One of the things I learned while unsuccessfully trying to sell to businesses during my startup career was that it was much easier to make a sale if there was already a chunk of money allocated to what you were selling. The existence of a budget line item meant that the hard battle over whether the company should spend money on a solution had already been won; the only question left was which solution to buy. That’s one of the reasons why I think that ML could make dramatic inroads in this area, because manufacturers already have engineers, money, and silicon area earmarked for video and audio compression. If we can show that adding machine learning to existing solutions improves them in measurable ways (for example quality, speed, or power consumption) then they will be adopted quickly.

Bandwidth costs users and carriers money, and quality and battery life are selling points for products, so the motivation behind adopting ML for compression is much more direct than many other use cases. Existing research shows that it can be very effective, and I’m optimistic that there’s a lot more to be discovered, so I’m hopeful that it will develop into a key use of the technology.

 

What Does it Take to Train Deep Learning Models On-Device?

Photo by Fort George G. Meade Public Affairs Office

Over the past few weeks, a few different people have asked me about the state of model training on phones and embedded devices. The good news is that it’s definitely possible; I know of multiple examples of teams doing this successfully. The bad news is that our tools don’t yet make it easy.

Back in 2014 I released some code and a video showing how you could do simple transfer learning on a phone. At that point I was using a support vector machine to handle the on-device training of the final layer, but there was no fundamental reason I couldn’t have used back-propagation instead, it was just easier to use a technique I knew. I’m sure there are other examples even earlier than that too.

One area where on-device learning with neural networks has long been common is speech recognition. Even today, your phone will often ask you to say the wake word (for example “OK Google” or “Hey Siri”) a few times, so it can learn your pronunciation. Typically this won’t involve a complete retraining of the network using back propagation, but something that’s more like transfer learning on either the feature extraction process, or the final layer. Apple’s implementation uses a supplemental network to compare utterances to your references in what they term speaker space, so it’s pretty different than how we normally think of training. I still consider these cases valid examples of on-device learning, but often use the term “personalization” to capture the limited extent of the network changes.

There are situations where the kind of full-network back propagation familiar to deep learning practitioners building models in Python happens on devices too. A good published example is for the Google Keyboard application. The paper focuses on the novel aspects of sending anonymized model updates over the network, but those updates come from a traditional process of running forward and backward passes on the phone.

So, corporate teams are building applications that use learning on-device, but how can you do the same thing? The key thing to realize is that there are two different stages to training a model. The first is building the back-propagation machinery that computes gradients and applies them as weight updates. The second is actually feeding examples through the forward pass, and then passing the gradient changes through the backward pass.

Most libraries, including TensorFlow, handle the automatic differentiation you need to build the back propagation operations in the Python layer. This code looks at the model you’ve defined in terms of inference operations (convolution layers feeding into activation functions, etc) and builds a complementary set of operations for the back propagation, including a loss function. The important part is that once the graph is built, at least in TensorFlow, you don’t need the Python code any more. Running the backward pass just means running the sub-graph of operations that was created by the differentiation stage.

In TensorFlow terms, this means after you’ve created the backward pass you can save off the GraphDef containing both the forward and backward operations. If you then load that in C++, either on-device or on a server, theoretically you can also run gradient updates from labeled examples using Session::Run(). I say theoretically because we don’t have documentation on this process, and it involves a lot of unfamiliar steps that can be hard to get right in C++, like initializing variables, loading and saving checkpoints, and handling losses. This is a gap I’m hoping we can fill, but there are internal teams who’ve been able to get past these issues, so at least there’s proof that it’s possible. I’m sure there must be examples of external teams doing similar work, so if you do have code to share, in TensorFlow or any other framework, feel free to add comments or reply on Twitter; I’d love to share it!
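
To make the Python side of that more concrete, here’s a rough sketch using the older graph-mode APIs, with arbitrary layer sizes and node names; the C++ side would then load the resulting file and run the “train_step” node with Session::Run(). This isn’t a documented workflow, just the general shape of it:

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# A minimal sketch of baking a training sub-graph into a GraphDef. The layer
# sizes and node names here are arbitrary; the point is that once the backward
# pass exists in the graph, a C++ runtime only needs to run the "train_step" node.
inputs = tf.placeholder(tf.float32, [None, 16], name="inputs")
labels = tf.placeholder(tf.float32, [None, 2], name="labels")
logits = tf.layers.dense(inputs, 2, name="final_layer")
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss, name="train_step")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Write out a GraphDef containing both the forward and backward operations.
    tf.train.write_graph(sess.graph_def, "export_dir", "train_graph.pbtxt")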

What Image Classifiers Can Do About Unknown Objects

Photo by Brandon Giesbrecht

A few days ago I received a question from Plant Village, a team I’m collaborating with, about a problem that’s emerged with a mobile app they’re developing. It detects plant diseases, and is delivering good results when it’s pointed at leaves, but if you point it at a computer keyboard it thinks it’s a damaged crop.

Photos from David Hughes and Amanda Ramcharan

This isn’t a surprising result to computer vision researchers, but it is a shock to most other people, so I want to explain why it’s happening, and what we can do about it.

As people, we’re used to being able to classify anything we see in the world around us, and we naturally expect machines to have the same ability. Most models are only trained to recognize a very limited set of objects though, such as the 1,000 categories of the original ImageNet competition. Crucially, the training process makes the assumption that every example the model sees is one of those objects, and the prediction must be within that set. There’s no option for the model to say “I don’t know”, and there’s no training data to help it learn that response. This is a simplification that makes sense within a research setting, but causes problems when we try to use the resulting models in the real world.

Back when I was at Jetpac, we had a lot of trouble convincing people that the ground-breaking AlexNet model was a big leap forward because every time we handed over a demo phone running the network, they would point it at their faces and it would predict something like “Oxygen mask” or “Seat belt”. This was because the ImageNet competition categories didn’t include any labels for people, but most of the photos with mask and seatbelt labels included faces along with the objects. Another embarrassing mistake came when they would point it at a plate and it would predict “Toilet seat”! This was because there were no plates in the original categories, and the closest white circular object in appearance was a toilet.

I came to think of this as the “open world” versus “closed world” problem. Models were trained and evaluated assuming that there was only ever going to be a limited universe of objects presented to them, but as soon as they make it outside the lab that assumption breaks down and they are judged by users on their performance for any arbitrary object that’s put in front of them, whether or not it was in the training set.

So, What’s The Solution?

Unfortunately I don’t know of a simple fix for this problem, but there are some strategies that I’ve seen help. The most obvious start is to add an “Unknown” class to your training data. The bad news is that this just opens up a whole different set of issues:

  • What examples should go into that class? There’s an almost limitless number of possible natural images, so how do you choose which to include?
  • How many of each different type of object do you need in the unknown class?
  • What should you do about unknown objects that look very similar to the classes you care about? For example, adding a dog breed that’s not in the ImageNet 1,000 but looks nearly identical to one that is will likely force a lot of what would have been correct matches into the unknown bucket.
  • What proportion of your training data should be made up of examples of the unknown class?

This last point actually touches on a much larger issue. The prediction values you get from image classification networks are not probabilities. They assume that the odds of seeing any particular class are equal to how often that class shows up in the training data. If you try to use an animal classifier that includes penguins in the Amazon jungle you’ll experience this problem, since (presumably) all of the penguin sightings will be false positives. Even with dog breeds in a US city, the rarer breeds show up a lot more often in the ImageNet training data than they will in a dog park, so they’ll be over-represented as false positives. The usual solution is to figure out what the prior probabilities in the situation you’ll be facing in production are, and then use those to apply calibration values to the network’s output to get something that’s closer to real probabilities.
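
Here’s a toy sketch of what that kind of calibration can look like. The numbers are invented and real calibration schemes can be more sophisticated, but the core idea is to re-weight the network’s outputs by the ratio of deployment priors to training priors:

# Toy prior-calibration sketch with invented numbers.
# A softmax output implicitly assumes the class frequencies seen in training.
training_priors = {"labrador": 0.02, "rare_breed": 0.02, "cat": 0.96}
deployment_priors = {"labrador": 0.30, "rare_breed": 0.001, "cat": 0.699}

network_scores = {"labrador": 0.40, "rare_breed": 0.45, "cat": 0.15}  # raw softmax output

# Re-weight each score by how much more (or less) common the class is in
# deployment than in training, then renormalize so the results sum to one.
adjusted = {c: network_scores[c] * deployment_priors[c] / training_priors[c]
            for c in network_scores}
total = sum(adjusted.values())
calibrated = {c: v / total for c, v in adjusted.items()}
print(calibrated)  # "rare_breed" drops sharply once its real-world rarity is accounted for.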

The main strategy that helps tackle the overall problem in real applications is constraining the model’s usage to situations where the assumptions about what objects will be present matches the training data. A straightforward way of doing this is through product design. You can create a user interface that directs people to focus their device on an object of interest before running the classifier, much like applications that ask you to take photographs of checks or other documents often do.

Getting a little more sophisticated, you can write a separate image classifier that tries to identify conditions that the main image classifier is not designed for. This is different than adding a single “Unknown” class, because it acts more like a cascade, or a filter before the detailed model. In the crop disease case, the operating environment is visually distinct enough that it might be fine to just train a model to distinguish between leaves and a random selection of other photos. There’s enough similarity that the gating model should at least be able to tell if the image is being taken in a type of scene that’s not supported. This gating model would be run before the full image classifier, and if it doesn’t detect something that looks like it could be a plant, it will bail out early with an error message indicating no crops were found.
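
As a sketch of how that gating logic might sit in an application (the model objects and threshold here are hypothetical stand-ins for whatever inference wrappers you use):

def classify_crop_image(image, leaf_gate_model, disease_model, gate_threshold=0.5):
    # First run a cheap binary "does this look like a leaf?" model.
    leaf_score = leaf_gate_model.predict(image)
    if leaf_score < gate_threshold:
        # Bail out early rather than letting the main classifier guess.
        return {"error": "No crops found. Please point the camera at a leaf."}
    # Only if the gate passes do we run the full disease classifier.
    return {"disease_scores": disease_model.predict(image)}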

Applications that ask you to capture images of credit cards or perform other kinds of OCR will often use a combination of on-screen directions and a model to detect blurriness or lack of alignment, guiding users to take photos that can be successfully processed. Having an “are there leaves?” model is a simple version of this interface pattern.

This probably isn’t a very satisfying set of answers, but they’re a reflection of the messiness of user expectations once you take machine learning beyond constrained research problems. There’s a lot of common sense and external knowledge that goes into a person’s recognition of an object, and we don’t capture any of that in the classic image classification task. To get results that meet user expectations, we have to design a full system around our models that understands the world that they will be deployed in, and makes smart decisions based on more than just the model outputs.

 

Why the Future of Machine Learning is Tiny

Photo by Kevin Steinhardt

When Azeem asked me to give a talk at CogX, he asked me to focus on just a single point that I wanted the audience to take away. A few years ago my priority would have been convincing people that deep learning was a real revolution, not a fad, but there have been enough examples of shipping products that that question seems answered. I knew this was true before most people not because I’m any kind of prophet with deep insights, but because I’d had a chance to spend a lot of time running hands-on experiments with the technology myself. I could be confident of the value of deep learning because I had seen with my own eyes how effective it was across a whole range of applications, and knew that the only barrier to seeing it deployed more widely was how long it takes to get from research to deployment.

Instead I chose to speak about another trend that I am just as certain about, and will have just as much impact, but which isn’t nearly as well known. I’m convinced that machine learning can run on tiny, low-power chips, and that this combination will solve a massive number of problems we have no solutions for right now. That’s what I’ll be talking about at CogX, and in this post I’ll explain more about why I’m so sure.

Tiny Computers are Already Cheap and Everywhere


Chart from embedded.com

The market is so fragmented that it’s hard to get precise numbers, but the best estimates are that over 40 billion microcontrollers will be sold this year, and given the persistence of the products they’re in, there’s likely to be hundreds of billions of them in service. Microcontrollers (or MCUs) are packages containing a small CPU with possibly just a few kilobytes of RAM, and are embedded in consumer, medical, automotive and industrial devices. They are designed to use very small amounts of energy, and to be cheap enough to include in almost any object that’s sold, with average prices expected to dip below 50 cents this year.

They don’t get much attention because they’re often used to replace functionality that older electro-mechanical systems used to handle, in cars, washing machines, or remote controls. The logic for controlling the devices is almost unchanged from the analog circuits and relays that came before, except possibly with a few tweaks like programmable remote control buttons or windshield wipers that vary their speed with rain intensity. The biggest benefit for the manufacturer is that standard controllers can be programmed with software rather than requiring custom electronics for each task, so they make the manufacturing process cheaper and easier.

Energy is the Limiting Factor

Any device that requires mains electricity faces a lot of barriers. It’s restricted to places with wiring, and even where it’s available, it may be tough for practical reasons to plug something new in, for example on a factory floor or in an operating theatre. Putting something high up in the corner of a room means running a cord or figuring out alternatives like power-over-ethernet. The electronics required to convert mains voltage to a range that circuitry can use are expensive and waste energy. Even portable devices like phones or laptops require frequent docking.

The holy grail for almost any smart product is for it to be deployable anywhere, and require no maintenance like docking or battery replacement. The biggest barrier to achieving this is how much energy most electronic systems use. Here are some rough numbers for common components based on figures from Smartphone Energy Consumption (see my old post here for more detail):

  • A display might use 400 milliwatts.
  • Active cell radio might use 800 milliwatts.
  • Bluetooth might use 100 milliwatts.
  • Accelerometer is 21 milliwatts.
  • Gyroscope is 130 milliwatts.
  • GPS is 176 milliwatts.

A microcontroller itself might only use a milliwatt or even less, but you can see that peripherals can easily require much more. A coin battery might have 2,500 Joules of energy to offer, so even something drawing at one milliwatt will only last about a month. Of course most current products use duty cycling and sleeping to avoid being constantly on, but you can see what a tight budget there is even then.
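
For anyone who wants to check my arithmetic, here’s the calculation behind that estimate:

# Rough battery-life arithmetic for the figures above.
battery_energy_joules = 2_500    # a typical coin battery
power_draw_watts = 0.001         # one milliwatt of continuous draw
seconds = battery_energy_joules / power_draw_watts
print(f"{seconds / (60 * 60 * 24):.0f} days")  # about 29 days, so roughly a month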

CPUs and Sensors Use Almost No Power, Radios and Displays Use Lots

The overall thing to take away from these figures is that processors and sensors can scale their power usage down to microwatt ranges (for example Qualcomm’s Glance vision chip, even energy-harvesting CCDs, or microphones that consume just hundreds of microwatts) but displays and especially radios are constrained to much higher consumption, with even low-power wifi and bluetooth using tens of milliwatts when active. The physics of moving data around just seems to require a lot of energy. There seems to be a rule that the energy an operation takes is proportional to how far you have to send the bits. CPUs and sensors send bits a few millimeters, which is cheap; radios send them meters or more, which is expensive. I don’t see this relationship fundamentally changing, even as technology improves overall. In fact, I expect the relative gap between the cost of compute and radio to get even wider, because I see more opportunities to reduce computing power usage.

We Capture Much More Sensor Data Than We Use

A few years ago I talked to some engineers working on micro-satellites capturing imagery. Their problem was that they were essentially using phone cameras, which are capable of capturing HD video, but they only had a small amount of memory on the satellite to store the results, and only a limited amount of bandwidth every few hours to download to the base stations on Earth. I realized that we face the same problem almost everywhere we deploy sensors. Even in-home cameras are limited by the bandwidth of wifi and broadband connections. My favorite example of this was a friend whose December ISP usage was dramatically higher than the rest of the year, and when he drilled down it turned out his blinking Christmas lights caused the video stream compression ratio to drop sharply, since so many more frames had differences!

There are many more examples of this: all the accelerometers on our wearables and phones are only used to detect events that might wake up the device or for basic step counting, with all the possibilities of more sophisticated activity detection untouched.

What This All Means For Machine Learning

If you accept all of the points above, then it’s obvious there’s a massive untapped market waiting to be unlocked with the right technology. We need something that works on cheap microcontrollers, that uses very little energy, that relies on compute not radio, and that can turn all our wasted sensor data into something useful. This is the gap that machine learning, and specifically deep learning, fills.

Deep Learning is Compute-Bound and Runs Well on Existing MCUs

One of my favorite parts of working on deep learning implementations is that they’re almost always compute-bound. This is important because almost all of the other applications I’ve worked on have been limited by how fast large amounts of memory can be accessed, usually in unpredictable patterns. By contrast, most of the time for neural networks is spent multiplying large matrices together, where the same numbers are used repeatedly in different combinations. This means that the CPU spends most of its time doing the arithmetic to multiply two cached numbers together, and much less time fetching new values from memory.

This is important because fetching values from DRAM can easily use a thousand times more energy than doing an arithmetic operation. This seems to be another example of the distance/energy relationship, since DRAM is physically further away than registers. The comparatively low memory requirements (just tens or hundreds of kilobytes) also mean that lower-power SRAM or flash can be used for storage. This makes deep learning applications well-suited for microcontrollers, especially when eight-bit calculations are used instead of float, since MCUs often already have DSP-like instructions that are a good fit. This idea isn’t particularly new; both Apple and Google run always-on networks for voice recognition on these kinds of chips, but not many people in either the ML or embedded world seem to realize how well deep learning and MCUs match.

Deep Learning Can Be Very Energy-Efficient

I spend a lot of time thinking about picojoules per op. This is a metric for how much energy a single arithmetic operation on a CPU consumes, and it’s useful because if I know how many operations a given neural network takes to run once, I can get a rough estimate for how much power it will consume. For example, the MobileNetV2 image classification network takes 22 million ops (each multiply-add is two ops) in its smallest configuration. If I know that a particular system takes 5 picojoules to execute a single op, then it will take (5 picojoules * 22,000,000) = 110 microjoules of energy to execute. If we’re analyzing one frame per second, then that’s only 110 microwatts, which a coin battery could sustain continuously for nearly a year. These numbers are well within what’s possible with DSPs available now, and I’m hopeful we’ll see the efficiency continue to increase. That means that the energy cost of running existing neural networks on current hardware is already well within the budget of an always-on battery-powered device, and it’s likely to improve even more as both neural network model architectures and hardware improve.
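
Here’s that arithmetic written out, so you can plug in your own network and hardware numbers:

# Working through the MobileNetV2 example above.
ops_per_inference = 22_000_000   # smallest MobileNetV2 configuration, counting multiply and add separately
joules_per_op = 5e-12            # 5 picojoules per op, a plausible figure for an efficient DSP
frames_per_second = 1

energy_per_inference = ops_per_inference * joules_per_op          # 110 microjoules
average_power_watts = energy_per_inference * frames_per_second    # 110 microwatts at one frame per second

coin_battery_joules = 2_500
days = coin_battery_joules / average_power_watts / (60 * 60 * 24)
print(f"{energy_per_inference * 1e6:.0f} microjoules per frame, {days:.0f} days on a coin battery")
# => 110 microjoules per frame, and roughly 263 days of continuous one-frame-per-second operation.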

Deep Learning Makes Sense of Sensor Data

In the last few years it’s suddenly become possible to take noisy signals like images, audio, or accelerometer readings and extract meaning from them, by using neural networks. Because we can run these networks on microcontrollers, and sensors themselves use little power, it becomes possible to interpret much more of the sensor data we’re currently ignoring. For example, I want to see almost every device have a simple voice interface. By understanding a small vocabulary, and maybe using an image sensor to do gaze detection, we should be able to control almost anything in our environment without needing to reach it to press a button or use a phone app. I want to see a voice interface component that’s less than fifty cents that runs on a coin battery for a year, and I believe it’s very possible with the technology we have right now.

As another example, I’d love to have a tiny battery-powered image sensor that I could program to look out for things like particular crop pests or weeds, and send an alert when one was spotted. These could be scattered around fields and guide interventions like weeding or pesticides in a much more environmentally friendly way.

One of the industrial examples that stuck with me was a factory operator’s description of “Hans”. He’s a long-time engineer that every morning walks along the row of machines, places a hand on each of them, listens, and then tells the foreman which will have to be taken offline for servicing, all based on experience and intuition. Every factory has one, but many are starting to face retirement. If you could stick a battery-powered accelerometer and microphone to every machine (a “Cyber-Hans”) that would learn usual operation and signal if there was an anomaly, you might be able to catch issues before they became real problems.

I probably have a hundred other products I could dream up, but if I’m honest what I’m most excited about is that I don’t know how these new devices will be used, just that the technological imperative behind them is so compelling that they’ll be built and whole new applications I can’t imagine will emerge. For me it feels a lot like being a kid in the Eighties when the first home computers emerged. I had no idea what they would become, and most people at the time used them for games or storing address books, but there were so many possibilities I knew whole new worlds would emerge.

The Takeaway

The only reason to have an in-person meeting instead of sending around a document is to communicate the emotion behind the information. What I want to share with the CogX audience is my excitement and conviction about the future of ML on tiny devices, and while a blog post is a poor substitute for real presence, I hope I’ve got across some of that here. I don’t know the details of what the future will bring, but I know ML on tiny, cheap, battery-powered chips is coming and will open the door for some amazing new applications!