One weird trick to shrink convolutional networks for TinyML

A colleague recently asked for more details on an approach I recommended, but which she hadn’t seen any documentation for. I realized that it was something I’d learned from talking to model builders at Google, and I wasn’t sure there was anything written up, so in the spirit of leaving a trail of breadcrumbs for anyone coming after, I thought I should put it into a quick blog post.

The summary is that if you have MaxPool or AveragePool after a convolutional layer in a network, and you’re targeting a resource-constrained system like a microcontroller, you should try removing the pooling entirely and replacing it with a stride in the convolution instead. This has two main benefits, but it’s easiest to explain by diagramming the network before and after.

In the typical setup, shown on the left, a convolutional layer is followed by a pooling operation. This has been common since at least AlexNet, and is still found in many modern networks. The setup I often find useful is shown on the right. I’m using an example input size of 224 wide by 224 high for this diagram, but the discussion holds true for any dimensions.

The first thing to notice is that in the standard configuration, there’s a 224x224x8 activation buffer written out to memory after the convolution layer. This is by far the biggest chunk of memory required in this part of the graph, taking over 400KB (224 × 224 × 8 = 401,408 bytes) even with eight-bit values. All ML frameworks I’m aware of require this buffer to be instantiated and filled before the next operation can be invoked. In theory it might be possible to do tiled execution, in the way that’s common for image processing frameworks, but the added complexity hasn’t made it a priority so far. If you’re running on an embedded system, 400KB is a lot of RAM, especially since it’s only being used for temporary values. That makes it a tempting target for size optimization.

My second observation is that we’re only using 25% of those values, assuming MaxPool is doing a typical 2x reduction, taking the largest value out of 4 in a 2×2 window. From experience, these values are often very similar, so while doing the pooling does help overall accuracy a bit, taking any of those four values at random isn’t much worse. In essence, this is what removing the pooling and increasing the stride for convolution does.

Stride is an argument that controls the step size as a convolution filter is slid across the input. By default, many networks have windows that are offset from each other by one pixel horizontally, and one pixel vertically. This means (ignoring padding, which is a whole different discussion) the output is the same size as the input, but typically with more channels (eight in the diagram above). Instead of setting the stride to this default of 1 horizontally, 1 vertically, you can set it to 2,2. This means that each window is offset by two pixels vertically and horizontally from its neighbor. This results in an output array that is half the width and height of the input, and so has a quarter of the number of elements. In essence, we’re picking one of the four values that would have been chosen by the pooling operation, but without the comparison or averaging that is used in the standard configuration.
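To make this concrete, here’s a minimal Keras sketch of the two configurations, using the 224×224 example with eight output channels from the diagram. The three input channels and 3×3 kernel size are illustrative assumptions rather than anything from the discussion above.

import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 3))

# Standard configuration: a stride-1 convolution followed by a 2x2 MaxPool.
# The full 224x224x8 activation buffer has to exist between the two layers.
conv = tf.keras.layers.Conv2D(8, 3, strides=1, padding="same")(inputs)
pooled = tf.keras.layers.MaxPool2D(pool_size=2)(conv)

# Alternative: fold the 2x reduction into the convolution's stride, so the
# half-size output is written directly and the big buffer never exists.
strided = tf.keras.layers.Conv2D(8, 3, strides=2, padding="same")(inputs)

print(pooled.shape)   # (None, 112, 112, 8)
print(strided.shape)  # (None, 112, 112, 8)

Both branches produce a 112×112×8 output, but in the strided version the 224×224×8 intermediate simply doesn’t exist in the graph.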

This means that the output of the convolution layer uses much less memory, resulting in a smaller arena for TFL Micro, but it also reduces the computation by 75%, since only a quarter of the convolution windows are being calculated. It does result in some accuracy loss, which you can verify during training, but since it reduces the resource usage so dramatically you may even be able to increase some other parameters, like the input size or number of channels, and gain some of that accuracy back. If you do find yourself struggling for arena size, I highly recommend giving this approach a try; it’s been very helpful for a lot of our models. If you’re not sure whether your model has the convolution/pooling pattern, or want to better understand the sizes of your activation buffers and how they influence the arena you’ll need, I recommend the Netron visualizer, which can take TensorFlow Lite model files.

How to write to flash on an Arduino Nano BLE

Photo by Brecht Bug

I’ve been enjoying using the Arduino Nano 33 BLE Sense board as an all-round microcontroller for my machine learning work, but I had trouble figuring out how to programmatically write to flash memory from a sketch. I need to do this because I want to be able to download ML models over Bluetooth and then have them persist even if the user unplugs the board or resets it. After some research and experimentation I finally have a solution I’m happy with, so I’ve put an example sketch and documentation up at github.com/petewarden/arduino_nano_ble_write_flash.

The main hurdle I had to overcome was how to initialize an area of memory that would be loaded into flash when the program was first uploaded, but left untouched on subsequent resets. Since modifying linker scripts isn’t recommended in the Arduino IDE, I had to come up with a home-brewed solution using const arrays and C++’s alignas() specifier. Thankfully it seems to work in my testing.

There’s a lot more documentation in the README and inline in the sketch, but I would warn anyone interested in this that flash has a limited number of erase/write cycles it can handle reliably, so don’t go too crazy with high-frequency changes!

How to transfer files over BLE

Image from Wikipedia

I’ve now taught a lot of workshops on TinyML using the Arduino Nano 33 BLE Sense board, including the new EdX course, and while it’s a fantastic piece of technology I often have to spend a lot of time helping students figure out how to get the boards communicating with their computer. Flashing programs to the Arduino relies on having a USB connection that can use the UART serial protocol, and it turns out that there are a lot of things that can go wrong in this process. Even worse, it’s very hard to debug what’s going wrong, since the UART drivers are deep in the operating system, and vary across Windows, MacOS, and Linux computers. Students can end up getting very frustrated, even after referring to the great troubleshooting FAQ that Brian on the EdX course put together.

I’ve been trying to figure out if there’s an alternative to this approach that will make life easier. To help with that, I’ve been experimenting with how I might be able to transfer files wirelessly over the Bluetooth Low Energy protocol that the Arduino board supports, and I now have a prototype available at github.com/petewarden/ble_file_transfer. There are lots of disclaimers: it’s only a few kilobytes per second, I haven’t tested it very heavily, and it’s just a proof of concept, but I’m hoping to be able to use this to try out some approaches that will help students get started without the UART road bumps.
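To give a flavor of the core mechanics, here’s a rough sketch of the general idea. To be clear, this is a generic illustration rather than the actual protocol in the repo, and the 128-byte payload size is a made-up placeholder, since usable packet sizes vary by BLE stack and platform.

import zlib

CHUNK_SIZE = 128  # hypothetical payload size; real usable sizes vary

def to_chunks(data):
  """Yield (sequence_number, payload) pairs ready for transmission."""
  for seq, offset in enumerate(range(0, len(data), CHUNK_SIZE)):
    yield seq, data[offset:offset + CHUNK_SIZE]

def checksum(data):
  """CRC32 sent alongside the file so the receiver can verify it."""
  return zlib.crc32(data)

def reassemble_and_verify(received, expected_crc):
  """Rebuild the file from (sequence_number, payload) pairs and check it."""
  data = b"".join(payload for _, payload in sorted(received))
  return data if zlib.crc32(data) == expected_crc else None

A real implementation also has to handle acknowledgments and lost or reordered chunks, which is where most of the fiddly work ends up.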

I also wanted to share a complete example of how to do this kind of file transfer more generally, since when I went looking for similar solutions I saw a lot of questions but not many answers. It’s definitely not an application that BLE is designed for, but it does seem to be possible at least. Hopefully having a version using a well-known board and WebBLE will help someone else out in the future!

How screen scraping and TinyML can turn any dial into an API

https://github.com/jomjol/AI-on-the-edge-device

This image shows a traditional water meter that’s been converted into a web API, using a cheap ESP32 camera and machine learning to understand the dials and numbers. I expect there are going to be billions of devices like this deployed over the next decade, not only for water meters but for any older device that has a dial, counter, or display. I’ve already heard from multiple teams who have legacy hardware that they need to monitor, in environments as varied as oil refineries, crop fields, office buildings, cars, and homes. Some of the devices are decades old, so until now the only option to enable remote monitoring and data gathering was to replace the system entirely with a more modern version. This is often too expensive, time-consuming, or disruptive to contemplate. Pointing a small, battery-powered camera at the display instead offers a lot of advantages. Since there’s an air gap between the camera and the dial it’s monitoring, it’s guaranteed not to affect the rest of the system, and it’s easy to deploy as an experiment and iterate to improve it.

If you’ve ever worked with legacy software systems, this may all seem a bit familiar. Screen scraping is a common technique to use when you have a system you can’t easily change that you need to extract information from, when there’s no real API available. You take the user interface results for a query as text, HTML, or even an image, ignore the labels, buttons, and other elements you don’t care about, and try to extract the values you want. It’s always preferable to have a proper API, since the code to pull out just the information you need can be hard to write and is usually very brittle to minor changes in the interface, but it’s an incredibly common technique all the same.

The biggest reason we haven’t seen more adoption of this equivalent approach for IoT is that training and deploying machine learning models on embedded systems has been very hard. If you’ve done any deep learning tutorials at all, you’ll know that recognizing digits with MNIST is one of the easiest models to train. With the spread of frameworks like TensorFlow Lite Micro (which the example above apparently uses, though I can’t find the on-device code in that repo) and others, it’s starting to get easier to deploy on cheap, battery-powered devices, so I expect we’ll see more of these applications emerging. What I’d love to see is some middleware that understands common display types like dials, physical or LED digits, or status lights. Then someone with a device they want to monitor could build it out of those building blocks, rather than having to train an entirely new model from scratch.
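As a reminder of just how little code the MNIST starting point takes these days, here’s a minimal Keras sketch; the tiny two-layer architecture is an arbitrary choice for illustration, not anything from the project above.

import tensorflow as tf

# Load the standard MNIST digits and scale pixel values to [0, 1].
(train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()
train_x, test_x = train_x / 255.0, test_x / 255.0

# A deliberately tiny classifier: flatten the 28x28 image, one hidden layer.
model = tf.keras.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(64, activation="relu"),
  tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_x, train_y, epochs=3)
model.evaluate(test_x, test_y)

Of course, reading a real meter involves much messier images than clean MNIST digits, which is exactly why pre-built blocks for common displays would be so valuable.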

I know I’d enjoy being able to use something like this myself. I’d use a cell-connected device to watch my cable modem’s status, so I’d know when my connection was going flaky; I’d keep track of my mileage and efficiency with something stuck on my car’s dashboard watching the speedometer, odometer, and gas gauge; I’d have my own way to monitor my electricity, gas, and water meters; and I’d have my washing machine text me when it was done. I don’t know how I’d set it up physically, but I’m always paranoid about leaving the stove on, so something that watched the gas dials would put my mind at ease.

There’s a massive amount of information out in the real world that can’t be remotely monitored or analyzed over time, and a lot of it is presented on dials and displays. Waiting for all of the systems involved to be replaced with connected versions could take decades, which is why I’m so excited about this incremental approach. Just as search engines have been able to take unstructured web pages designed for people to read and index them so we can find and use them, this physical version of screen scraping takes displays aimed at humans and converts them into information usable from anywhere. A lot of different trends are coming together to make this possible, from cheap, capable hardware and widespread IoT data networks to software improvements and the democratization of all these technologies. I’m excited to do my bit to help make this happen, and I can’t wait to see all the applications that you all come up with, so do let me know your ideas!

Why Do I Think There Will be Hundreds of Billions of TinyML Devices Within a Few Years?

Image by The Noun Project

A few weeks ago I was lucky enough to have the chance to present at the Linley Processor Conference. I gave a talk on “What TinyML Needs from Hardware”, and afterwards one of the attendees emailed to ask where some of my numbers came from. In particular, he was intrigued by my note on slide 6 that “Expectations are for tens or hundreds of billions of devices over the next few years”.

I thought that was a great question, since those numbers definitely don’t come from any analyst reports, and they imply at least a doubling of the whole embedded system market from its current level of 40 billion devices a year. Clearly that statement deserves at least a few citations, and I’m an engineer so I try to avoid throwing around predictions without a bit of evidence behind them.

I don’t think I have any particular gift for prophecy, but I do believe I’m in a position that very few other people have, giving me a unique view into machine learning, product teams, and the embedded hardware industry. Since TensorFlow Lite Micro is involved in the integration process for many embedded ML products, we get to hear the requirements from all sides, and see the new capabilities that are emerging from research into production. This also means I get to hear a lot about the unmet needs of product teams. What I see is that there is a lot of latent demand for technology that I believe will become feasible over the next few years, and the scale of that demand is so large that it will lead to a massive increase in the number of embedded devices shipped.

I’m basically assuming that one or more of the killer applications for embedded ML become technically possible. For example, every consumer electronics company I’ve talked to would integrate a voice interface chip into almost everything they make if it cost 50 cents and used almost no power (e.g. running for a year off a coin battery). There’s similar interest in sensor applications for logistics, agriculture, and health, given the assumption that we can scale down the cost and energy usage. A real success in any one of these markets would add tens of billions of devices. Of course, the technical assumptions behind this aren’t certain to be achieved in the time frame of the next few years, but that’s where I stick my neck out based on what I see happening in the research world.

From my perspective, I see models and software already available for things like on-device server-quality voice recognition, such as Pixel’s system. Of course this example currently requires 80 MB of storage and a Cortex-A CPU, but from what I see happening in the MCU and DSP world, the next generation of ML accelerators will provide the needed compute capability, and I’m confident some combination of shrinking model sizes and increasing storage capacity will enable an embedded solution. Then we just need to figure out how to bring the power and price down! It’s similar for other areas like agriculture and health: there are working ML models out there just looking for the right hardware to run on, and then they’ll be able to solve real, pressing problems in the world.

I may be an incorrigible optimist, and as you can see I don’t have any hard proof that we’ll get to hundreds of billions of devices over the next few years, but I hope you can at least understand the trends I’m extrapolating from now.

How to Organize a Zoom Wedding

Photo by Chantal

Joanne and I got engaged two years ago in Paris, and were planning on getting married in the summer, before the pandemic intervened. Once it became clear that it might be years until everybody could meet up in person, especially older members of our families who were overseas, we started looking into how we could have our ceremony online, with no physical contact at all. It was unknown territory for almost everybody involved, including us, but it turned out to be a wonderful day that we’ll remember for the rest of our lives.

In the hope that we might help other couples who are navigating this new world, Joanne has written up an informal how-to guide on Zoom weddings. It covers the legal side of licenses in California, organizing the video conferencing (we used the fantastic startup Wedfuly), cakes, dresses, flowers, and even the first dance! We’re so happy that we were still able to share our love with over a hundred terrific guests, despite the adverse circumstances, so we hope this guide helps others in the same position.

The Five: Putting Jack the Ripper’s Victims at the Center of the Story

Years ago I went on a “Jack the Ripper” walking tour when I visited London without giving it much thought. The murders felt like fiction, in the same realm as Sherlock Holmes or Dr Jekyll and Mr Hyde. Even visiting the spots where the bodies were found didn’t make an impact. I’ve never been a “Ripperologist” but as someone interested in Victorian history it’s been hard to avoid the story and endless theories about the case.

Over time I’ve found myself actively avoiding the subject though. The way the story is told seems to center the perpetrator as the star of the show, endlessly fascinating and worthy of study. At its worst the literature seems little more than a way to vicariously replay the victimization of women over and over again. I’ve run into the same problem with the whole “True Crime” genre – I want the catharsis of a solved case, and the glimpses into people’s hidden lives, but I don’t think murderers are nearly as interesting as they’d like us to believe.

That’s why I was so excited when I saw “The Five” coming up for publication. The book tells the life stories of the canonical victims of the Ripper, up to but not including their deaths. I was already a fan of the author from the wonderful Harlots TV show, and here she has done another amazing job bringing strong, complex, ignored people to life. While she’s at it, she also makes a strong case that the prejudices about the women we’ve inherited from the original investigators have distorted our understanding of the case and blinded us to likely solutions. If most of the women weren’t prostitutes, as she argues convincingly, and if they were attacked while asleep, then many of the old narratives fall apart. It’s a great example of how highlighting the stories of people who have traditionally been ignored isn’t just a worthy pursuit, it also adds to our overall understanding of the past.

Even ignoring the wider lessons, this is a beautifully written set of stories that I found hard to put down, and I had to ration myself to one a night to avoid burning through it all too fast. With deep research, Rubenhold draws sketches of girls growing into women, living full lives as daughters, mothers, friends, and workers, despite our knowledge of the shadow of Whitechapel looming in their future. They all suffer blows that end up pushing them into poverty, but what she drives home is that they were far more than just their ending, and what a loss their deaths were to themselves and many others. Alcoholism, poverty, and rough sleeping are common factors, but the path to them is wildly different in each story. It even made me rethink some of my expectations of unbending Victorian morality, with most of the women at least temporarily benefiting from well-wishers’ pragmatic attempts to help them, even after their notional “fall”.

What does shine through most strongly though is how imperfect the safety net was, especially for poor women. There were very few second chances. One of the reasons I’ve ended up reading so much Victorian history is in an attempt to understand my adopted America, as the only comparable era I can find with such fabulous energy and grinding poverty so close together. It has made me wonder about all the stories I don’t know of people just a few hundred meters from me right now who are living on the same kind of knife edge, and might end up dying in poverty. I hope it’s given me a little more moral imagination to understand the struggles that those around me are facing, and motivation to find ways to help. I do know I won’t be able to visit London again without thinking of Polly, Annie, Elizabeth, Kate, and Mary Jane.

Quantization Screencast

TinyML Book Screencast #4 – Quantization

For the past few months I’ve been working with Zain Asgar and Keyi Zhang on EE292D, Machine Learning on Embedded Systems, at Stanford. We’re hoping to open source all the materials after the course is done, but I’ve been including some of the lectures I’m leading as part of my TinyML YouTube series. Since I’ve talked a lot about quantization on this blog over the years, I thought it would be worth including the latest episode here too.

It’s over an hour long, mostly because quantization is still evolving so fast, and I’ve included a couple of Colabs you can play with to back up the lesson. The first lets you load a pretrained Inception v3 model and inspect the weights, and the second shows how you can load a TensorFlow Lite model file, modify the weights, save it out again and check the accuracy, so you can see for yourself how quantization affects the overall results.
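If you want a taste of what the second Colab covers, here’s a rough sketch of inspecting a quantized TensorFlow Lite file with the standard interpreter API; the "model.tflite" path is a placeholder for whatever file you’re experimenting with.

import numpy as np
import tensorflow as tf

# Load a (placeholder) quantized model and look at each tensor's parameters.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

for detail in interpreter.get_tensor_details():
  scale, zero_point = detail["quantization"]
  if scale == 0.0:
    continue  # this tensor isn't quantized
  print(detail["name"], "scale:", scale, "zero point:", zero_point)

# The parameters encode an affine mapping back to real values:
#   real_value = scale * (quantized_value - zero_point)
quantized = np.array([-128, 0, 127], dtype=np.int8)
real = 0.5 * (quantized.astype(np.float32) - 0)  # example scale and zero point

Tweaking the weights by hand, as the Colab does, is a quick way to build intuition for how much precision a model actually needs.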

The slides themselves are available too, and this is one area where I go into more depth on the screencast than I do in the TinyML book, since that has more of a focus on concrete exercises. I’m working with some other academics to see if we can come up with a shared syllabus around embedded ML, so if you’re trying to put together something similar for undergraduates or graduates at your college, please do get in touch. The TensorFlow team even has grants available to help with the expenses of machine learning courses, especially for traditionally overlooked students.

Converting a TensorFlow Lite .cc data array into a file on disk

Most of the devices TensorFlow Lite for Microcontrollers runs on don’t have file systems, so the model data is typically included by compiling a source file containing an array of bytes into the executable. I recently added a utility to help convert files into nicely-formatted source files and headers, as the convert_bytes_to_c_source() function, but I’ve also had requests to go the other way. If you have a .cc file (like one from the examples) how do you get back to a TensorFlow Lite file that you can feed into other tools (such as the Netron visualizer)?

My hacky answer to this is a Python script that does a rough job of parsing an input file, looks for large chunks of numerical data, converts the values into bytes, and writes them into an output file. The live Gist of this is at https://gist.github.com/petewarden/493294425ac522f00ff45342c71939d7 and will contain the most up to date version, but here’s the code inline:

import re

output_data = bytearray()

# Scan the source file for runs of hex literals and collect their values.
with open('tensorflow/tensorflow/lite/micro/examples/magic_wand/magic_wand_model_data.cc', 'r') as file:
  for line in file:
    # Match lines whose content is a comma-separated run of 0x.. hex values.
    values_match = re.match(r"\W*(0x[0-9a-fA-F,x ]+).*", line)
    if values_match:
      list_text = values_match.group(1)
      # Split on commas, dropping the empty entry left by a trailing comma.
      values_text = filter(None, list_text.split(","))
      # Parse each hex literal into an integer byte value.
      values = [int(x, base=16) for x in values_text]
      output_data.extend(values)

# Write the accumulated bytes back out as a binary TensorFlow Lite file.
with open('converted.tfl', 'wb') as output_file:
  output_file.write(output_data)

You’ll need to replace the input file name with the path of the one you want to convert, but otherwise this should work on most of these embedded model data source files.
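For completeness, going the other direction with the convert_bytes_to_c_source() utility mentioned above looks something like the sketch below. One caveat: the import path is where the function lived when I added it, and it may move between TensorFlow versions.

from tensorflow.lite.python.util import convert_bytes_to_c_source

# Read a (placeholder) TensorFlow Lite file as raw bytes.
with open('model.tflite', 'rb') as model_file:
  model_bytes = model_file.read()

# Generate matching .cc and .h text for an array named g_model.
source_text, header_text = convert_bytes_to_c_source(model_bytes, 'g_model')

with open('model_data.cc', 'w') as source_file:
  source_file.write(source_text)
with open('model_data.h', 'w') as header_file:
  header_file.write(header_text)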

TinyML Book Released!


I’ve been excited about running machine learning on microcontrollers ever since I joined Google and discovered the amazing work that the speech wakeword team were doing with 13 kilobyte models, and so I’m very pleased to finally be able to share a new O’Reilly book that shows you how to build your own TinyML applications. It took me and my colleague Dan most of 2019 to write it, and it builds on work from literally hundreds of contributors inside and outside Google. The most nerve-wracking part was the acknowledgements, since I knew that we’d miss important people, just because there were so many of them over such a long time. Thanks to everyone who helped, and we’re hoping this will just be the first of many guides to this area. There’s so much fascinating work that can now be done using ML with battery-powered or energy-harvesting devices, I can’t wait to see what the community comes up with!

To give you a taste, there’s a hundred-plus page preview of the first six chapters available as a free PDF, and I’m recording a series of screencasts to accompany the tutorials.