The Five: Putting Jack the Ripper’s Victims at the Center of the Story

Years ago I went on a “Jack the Ripper” walking tour when I visited London without giving it much thought. The murders felt like fiction, in the same realm as Sherlock Holmes or Dr Jekyll and Mr Hyde. Even visiting the spots where the bodies were found didn’t make an impact. I’ve never been a “Ripperologist” but as someone interested in Victorian history it’s been hard to avoid the story and endless theories about the case.

Over time I’ve found myself actively avoiding the subject though. The way the story is told seems to center the perpetrator as the star of the show, endlessly fascinating and worthy of study. At its worst the literature seems little more than a way to vicariously replay the victimization of women over and over again. I’ve run into the same problem with the whole “True Crime” genre – I want the catharsis of a solved case, and the glimpses into people’s hidden lives, but I don’t think murderers are nearly as interesting as they’d like us to believe.

That’s why I was so excited when I saw “The Five” coming up for publication. The book tells the life stories of the five canonical victims of the Ripper, up to but not including their deaths. I was already a fan of the author, Hallie Rubenhold, from the wonderful Harlots TV show, and here she has done another amazing job bringing strong, complex, ignored people to life. While she’s at it, she also makes a strong case that the prejudices about the women we’ve inherited from the original investigators have distorted our understanding of the case and blinded us to likely solutions. If most of the women weren’t prostitutes, as she argues convincingly, and if they were attacked while asleep, then many of the old narratives fall apart. It’s a great example of how highlighting the stories of people who have traditionally been ignored isn’t just a worthy pursuit; it also adds to our overall understanding of the past.

Even ignoring the wider lessons, this is a beautifully-written set of stories that I found hard to put down, and I had to ration myself to one a night to avoid burning through it all too fast. With deep research, Rubenhold draws sketches of girls growing into women, living full lives as daughters, mothers, friends, and workers, despite our knowledge of the shadow of Whitechapel looming in their future. They all suffer blows that end up pushing them into poverty, but what she drives home is that they were far more than just their ending, and what a loss their deaths were to themselves and many others. Alcoholism, poverty, and rough sleeping are common factors, but the path to them is wildly different in each story. It even made me rethink some of my expectations of unbending Victorian morality, with most of the women at least temporarily benefiting from pragmatic attempts to help them from well-wishers, even after their notional “fall”.

What does shine through most strongly though is how imperfect the safety net was, especially for poor women. There were very few second chances. One of the reasons I’ve ended up reading so much Victorian history is in an attempt to understand my adopted America, as the only comparable era I can find with such fabulous energy and grinding poverty so close together. It has made me wonder about all the stories I don’t know of people just a few hundred meters from me right now who are living on the same kind of knife edge, and might end up dying in poverty. I hope it’s given me a little more moral imagination to understand the struggles that those around me are facing, and motivation to find ways to help. I do know I won’t be able to visit London again without thinking of Polly, Annie, Elizabeth, Kate, and Mary Jane.

Quantization Screencast

TinyML Book Screencast #4 – Quantization

For the past few months I’ve been working with Zain Asgar and Keyi Zhang on EE292D, Machine Learning on Embedded Systems, at Stanford. We’re hoping to open source all the materials after the course is done, but I’ve been including some of the lectures I’m leading as part of my TinyML YouTube series. Since I’ve talked a lot about quantization on this blog over the years, I thought it would be worth including the latest episode here too.

It’s over an hour long, mostly because quantization is still evolving so fast, and I’ve included a couple of Colabs you can play with to back up the lesson. The first lets you load a pretrained Inception v3 model and inspect the weights, and the second shows how you can load a TensorFlow Lite model file, modify the weights, save it out again and check the accuracy, so you can see for yourself how quantization affects the overall results.
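
If you want a feel for what those Colabs cover without opening them, here’s a minimal sketch (not the actual Colab code) that converts a pretrained Inception v3 model to TensorFlow Lite with default weight quantization and then prints details of the first few tensors. It assumes a recent TensorFlow 2.x install.

import tensorflow as tf

# Load a Keras Inception v3 model pretrained on ImageNet.
model = tf.keras.applications.InceptionV3(weights="imagenet")

# Convert it to TensorFlow Lite, letting the converter quantize the weights.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Inspect the tensors inside the converted model.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
for detail in interpreter.get_tensor_details()[:10]:
  print(detail["name"], detail["dtype"], detail["shape"])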

The slides themselves are available too, and this is one area where I go into more depth on the screencast than I do in the TinyML book, since that has more of a focus on concrete exercises. I’m working with some other academics to see if we can come up with a shared syllabus around embedded ML, so if you’re trying to put together something similar for undergraduates or graduates at your college, please do get in touch. The TensorFlow team even has grants available to help with the expenses of machine learning courses, especially for traditionally overlooked students.

Converting a TensorFlow Lite .cc data array into a file on disk

Most of the devices TensorFlow Lite for Microcontrollers runs on don’t have file systems, so the model data is typically included by compiling a source file containing an array of bytes into the executable. I recently added a utility to help convert files into nicely-formatted source files and headers, as the convert_bytes_to_c_source() function, but I’ve also had requests to go the other way. If you have a .cc file (like one from the examples) how do you get back to a TensorFlow Lite file that you can feed into other tools (such as the Netron visualizer)?

My hacky answer to this is a Python script that does a rough job of parsing an input file, looks for large chunks of numerical data, converts the values into bytes, and writes them into an output file. The live Gist of this is at https://gist.github.com/petewarden/493294425ac522f00ff45342c71939d7 and will contain the most up to date version, but here’s the code inline:

import re

# Collect the decoded bytes from the C source file.
output_data = bytearray()
with open('tensorflow/tensorflow/lite/micro/examples/magic_wand/magic_wand_model_data.cc', 'r') as file:
  for line in file:
    # Look for lines containing comma-separated hex literals like "0x1c, 0x00, ...".
    values_match = re.match(r"\W*(0x[0-9a-fA-F,x ]+).*", line)
    if values_match:
      list_text = values_match.group(1)
      # Drop blank entries left by trailing commas or stray whitespace.
      values_text = [v for v in list_text.split(",") if v.strip()]
      values = [int(x, base=16) for x in values_text]
      output_data.extend(values)

# Write the raw bytes back out as a TensorFlow Lite flatbuffer file.
with open('converted.tfl', 'wb') as output_file:
  output_file.write(output_data)

You’ll need to replace the input file name with the path of the one you want to convert, but otherwise this should work on most of these embedded model data source files.
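
As a quick sanity check (not part of the script, and assuming you have TensorFlow installed on your desktop), you can confirm the recovered bytes parse as a valid model by loading them with the TensorFlow Lite interpreter:

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="converted.tfl")
interpreter.allocate_tensors()
print(interpreter.get_input_details())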

TinyML Book Released!


I’ve been excited about running machine learning on microcontrollers ever since I joined Google and discovered the amazing work that the speech wakeword team were doing with 13 kilobyte models, and so I’m very pleased to finally be able to share a new O’Reilly book that shows you how to build your own TinyML applications. It took me and my colleague Dan most of 2019 to write it, and it builds on work from literally hundreds of contributors inside and outside Google. The most nerve-wracking part was the acknowledgements, since I knew that we’d miss important people, just because there were so many of them over such a long time. Thanks to everyone who helped, and we’re hoping this will just be the first of many guides to this area. There’s so much fascinating work that can now be done using ML with battery-powered or energy-harvesting devices, I can’t wait to see what the community comes up with!

To give you a taste, there’s a hundred-plus page preview of the first six chapters available as a free PDF, and I’m recording a series of screencasts to accompany the tutorials.

Why you must see Richard Mosse’s Incoming

Image by Richard Mosse

When I went to SF MoMA over the holiday break I didn’t have a plan for what I wanted to see, but as soon as I saw the first photos of Richard Mosse’s Incoming exhibition I knew where I’d be spending most of my time. His work documents the European refugee crisis in a way I’ve never seen before. The innovation is that he’s using infrared and nightsight cameras designed for surveillance to document the lives and deaths of the people involved, on all sides. The work on show is a set of large stills, and a fifty minute film without narration. I highly encourage you to check out the exhibition yourself, and I’ll be going back to the museum for an artist talk tomorrow (Saturday January 11th at 2pm), but I wanted to share a few thoughts on why I found it so powerful, partly just to help myself understand why it had such an emotional impact on me.

The representation of the scenes is recognizable, but the nightsight makes them different enough that they broke through my familiarity with all the similar images I’ve seen in the news. It forced me to study the pictures in a new way to understand them; they live in an uncanny valley between abstract renders and timeworn visual cliches, and that made them seem much closer to my own life in a strange way. I couldn’t fit what I was seeing into the easy categories I guess I’ve developed of “bad things happening somewhere far away”.

I realized how much I rely on visual cues to understand the status and importance of people in images like these, and most of those cues were washed away by the camera rendering. I couldn’t tell what race people were, or even whether they were wearing uniforms or civilian clothes, and without those markers I was forced to look at the subjects as people rather than just members of a category. I was a bit shocked at how jarring it was to lose that shorthand, and found myself searching for other clues to make up for the loss. Where does the woman in the cover image above come from? I can guess she’s a refugee from the shawl, but I can’t tell much else. My first guess when I saw the photo above was that it showed migrants, but the caption told me these were actually crew members of a navy vessel.

It forced me to confront why I have such a need to categorize people to understand their stories. Part of it feels like an attempt to convince myself their predicament couldn’t happen to me or those I love, which is a lot easier if I can slot them into an “other” category.


A lot of the images are taken from a long distance away using an extreme zoom, or stitched together from panoramas. On a practical level this gives Mosse access to sites that he couldn’t reach, by shooting from nearby hills or shorelines, but it also affects the way I found myself looking at the scenes. The wide shots (like the refugee camp used for the cover of richardmosse.com) give you a god-like view; it’s almost isometric, like old-school simulation games. I was able to study the layout in a way that felt voyeuristic, but brought the reality home in a way that normal imagery couldn’t have done. There are also closeups that feel intrusive, like paparazzi, but capture situations which normal filmmakers would never be able to, like refugees being loaded onto a navy boat. Everybody seems focused on their actions, working as if they’re unobserved, which adds an edge of reality to the scenes I’m not used to.


Capturing the heat signatures of the objects and people in the scene brings home aspects of the experience that normal photography can’t. One of the scenes that affected me most in the film was seeing a man care for an injured person wrapped in a blanket. The detail that was heartbreaking was that you can see his handprints as marks of warmth as he pats and hugs the victim, and watch as they slowly fade. Handprints show up on the sides of inflatables too, where people tried to climb in. You can also see the dark splashes of frigid seawater as the refugees and crew struggle in a harsh environment; the sharpness of the contrast made me want to shiver as it brought home how cold and uncomfortable everyone had to be.

I can’t properly convey in words how powerful the exhibition is; I just hope this post might encourage some more people to check out Mosse’s work for themselves. The technical innovations alone make this a remarkable project (and I expect to see many other filmmakers influenced by his techniques in the future), but he uses his tools to bring an unfolding tragedy back into the spotlight in a way that only art can do. I can’t say it any better than he does himself in the show’s statement:

Light is visible heat. Light fades. Heat grows cold. People’s attention drifts. Media attention dwindles. Compassion is eventually exhausted. How do we find a way, as photographers and as storytellers, to continue to shed light on the refugee crisis, and to keep the heat on these urgent narratives of human displacement?

Chesterton’s shell script

Photo by Linnaea Mallette

I have a few days off over the holidays, so I’ve had a chance to think about what I’ve learned over the last year. Funnily enough, the engineering task that taught me the most was writing and maintaining a shell script of less than two hundred lines. One of my favorite things about programming is how seemingly simple problems turn out to have endless complexity the longer you work with them. It’s also one of the most frustrating things too, of course!

The problem I faced was that TensorFlow Lite for Microcontrollers depends on external frameworks like KissFft or the GNU compiler toolchain to build. For TFL Micro to be successful it has to be easy to compile, even for engineers from the embedded world who may not be familiar with the wonderful world of installing Linux or MacOS dependencies. That meant downloading these packages had to be handled automatically by the build process.

Because of the degree of customization we needed to handle cross-compilation across a large number of very different platforms, early on I made the decision to use ‘make’ as our build tool, rather than a more modern system like cmake or bazel. This is still a controversial choice within the team, since it involves extensive use of make’s scripting language to do everything we need, which is a maintenance nightmare. The bus factor in this area is effectively one, since I’m the only engineer familiar with that code, and even I can’t easily debug complex issues. I believe the tradeoffs were worth it though, and I’m hoping to mitigate the problems by improving the build infrastructure in the future.

A downside of choosing make is that it doesn’t have any built-in support for downloading external packages. The joy and terror of make is that it’s a very low-level language for expressing build dependency rules which allow you to build almost anything on top of them, so the obvious next step was to implement the functionality I needed within that system.

The requirements at the start were that I needed a component that I could pass in a URL to a compressed archive, and it would handle downloading and unpacking it into a folder of source files. The most common place these archives come from is GitHub, where every commit to an open source project can be accessed as a ZIP or Gzipped Tar archive, like https://github.com/google/stm32_bare_lib/archive/c07d611fb0af58450c5a3e0ab4d52b47f99bc82d.zip.

The first choice I needed to make was what language to implement this component in. One option was make’s own scripting language, but as I mention above it’s quite obscure and hard to maintain, so I preferred an external script I could call from a makefile rule. The usual Google style recommendation is to use a Python script for anything that would be more than a few lines of shell script, but I’ve found that doesn’t make as much sense when most of the actual work will be done through shell tools, which I expected to be the case with the archive extraction.

I settled on writing the code as a bash script. You can see the first version I checked in at https://github.com/tensorflow/tensorflow/blob/e98d1810c607e704609ffeef14881f87c087394c/tensorflow/lite/experimental/micro/tools/make/download_and_extract.sh. This is a long way from the very first script I wrote, because of course I encountered a lot of requirements I hadn’t thought of at the start. The primordial version was something like this:

# "${url}" is the archive location, "${dir}" is the destination folder.
url="${1}"
dir="${2}"
tempfile="$(mktemp)"
curl -Ls "${url}" > "${tempfile}"
if [[ "${url}" == *gz ]]; then
  tar -C "${dir}" -xzf "${tempfile}"
elif [[ "${url}" == *zip ]]; then
  unzip -q "${tempfile}" -d "${dir}"
fi

Here’s a selection of the issues I had to fix with this approach before I was able to even check in that first version:

  • A few of the dependencies were bzip files, so I had to handle them too.
  • Sometimes the destination root folder didn’t exist, so I had to add an ‘mkdir -p’.
  • If the download failed halfway through or the file at the URL changed, there wouldn’t always be an error, so I had to add MD5 checksum verification.
  • If there were any BUILD files in the extracted folders, bazel builds (which are supported for x86 host compilation) would find them through a recursive search and fail.
  • We needed the extracted files to be in a first-level folder with a constant name, but by default many of the archives had versioned names (like the project name followed by the checksum). To handle this for tar archives I could use --strip-components=1, but for zip files I had to cobble together a more manual approach.
  • Some of the packages required small amounts of patching after download to work successfully.

All of those requirements meant that this initial version was already 124 lines long! If you look at the history of this file (continued here after the experimental move), you can see that I was far from done making changes even after all those discoveries. Here are some of the additional problems I tackled over the last six months:

  • Added (and then removed) some debug logging so I could tell what archives were being downloaded.
  • Switched the MD5 command line tool the script used to something that was present by default on MacOS, and that had a nicer interface as a bonus, so I no longer had to write to a temporary file.
  • Tried one approach to dealing with persistent ‘56’ errors from curl by adding retries.
  • When that failed to solve the problem, tried manually looping and retrying curl.
  • Pete Blacker fixed a logic problem where we didn’t raise an error if an unsupported archive file suffix was passed in.
  • During an Arm AIoT workshop I found that some users hit problems when curl wasn’t installed on their machine for the first run of the build script: after they installed curl and re-ran the build, the download folders had already been created but were empty, leading to weird errors later in the process. I fixed this by erroring out early if curl isn’t present.

What I found most interesting about dealing with these issues is how few of them were easily predictable. Pete Blacker’s logic fix was one that could have been found by inspection (not having a final else or default on a switch statement is a classic lintable error), but most of the other problems were only visible once the script became heavily used across a variety of systems. For example, I put a lot of time into mitigating occasional ‘56’ errors from curl because they seemed to show up intermittently (maybe 10% of the time or less) for one particular archive URL. They bypassed the curl retry logic (presumably because they weren’t at the HTTP error code level?), and since they were hard to reproduce consistently I had to make several attempts at different approaches and test them in production to see which techniques worked.

A less vexing issue was the md5sum tool not being present by default on MacOS, but this was also one that I wouldn’t have caught without external testing, because my Mac environment did have it installed.

This script came to mind as I was thinking back over the year for a few reasons. One of them was that I spent a non-trivial amount of time writing and debugging it, despite its small size and the apparent simplicity of the problem it tackled. Even in apparently glamorous fields like machine learning, 90% of the work is nuts-and-bolts integration like this. If anything, you should be doing more of it as you become more senior, since it requires a subtle understanding of the whole system and its requirements, but doesn’t look impressive from the outside. Save the easier-to-explain projects for more junior engineers; they need them for their promotion packets.

The reason this kind of work is so hard is precisely because of all the funky requirements and edge cases that only become apparent when code is used in production. As a young engineer my first instinct when looking at a snarl of complex code for something that looked simple on the surface was to imagine the original authors were idiots. I still remember scoffing at the Diablo PC programmers as I was helping port the codebase to the Playstation because they used inline assembler to do a simple signed to unsigned cast. My lead, Gary Liddon, very gently reminded me that they had managed to ship a chart-topping game and I hadn’t, so maybe I had something to learn from their approach?

Over the last two decades of engineering I’ve come to appreciate the wisdom of Chesterton’s Fence, the idea that it’s worth researching the history of a solution I don’t understand before I replace it with something new. I’m not claiming that the shell script I’m highlighting is a coding masterpiece, in fact I have a strong urge to rewrite it because I know so many of its flaws. What holds me back is that I know how valuable the months of testing in production have been, and how any rewrite would probably be a step backwards because it would miss the requirements that aren’t obvious. There’s a lot I can capture in unit tests (and one of its flaws is that the script doesn’t have any) but it’s not always possible or feasible to specify everything a component needs to do at that level.

One of the hardest things to learn in engineering is good judgment, because it’s always subjective and usually relies on pattern matching to previous situations. Making a choice about how to approach maintaining legacy software is no different. In its worst form Chesterton’s Fence is an argument for never changing anything and I’d hate to see progress on any project get stymied by excessive caution. I do try to encourage the engineers I work with (and remind myself) that we have a natural bias towards writing new code rather than reusing existing components though. If even a minor shell script for downloading packages has requirements that are only discoverable through seasoning in production, what issues might you unintentionally introduce if you change a more complex system?

Can we avoid the Internet of Trash?

Photo by Branden

When embedded systems get cheap and capable enough to run machine learning, I’m convinced we’re going to end up with trillions of devices doing useful jobs in the world around us. Those benefits are only half the story though, so I’ve been doing my best to think through what the unintended consequences of a change this big might be. Unlike Lehrer’s Werner von Braun, I do think it is our job to care where the rockets come down. One challenge will be preserving privacy when almost everything is capable of recording us, and I discussed some ideas on tackling that in a previous post, but another will be the sheer amount of litter having that many devices in the environment will generate.

Right now my best guess is that there are 250 billion embedded systems active in the world, with 40 billion more being shipped every year. Most of them are truly embedded in larger systems like washing machines, toasters, or cars. If smart sensors using machine learning become ubiquitous, we could easily end up with ten times as many chips, and they’re more likely to be standalone packages that are “peeled and stuck” in our built and natural environments. They will either use batteries or energy harvesting, and have lifetimes measured in years, but inevitably they will reach their end of life at some point. What happens then?

The printed circuit boards, batteries, chips, and other physical components that make up these devices are going to become litter. They’ll end up scattered around the places they were installed, when buildings are demolished or remodeled they’ll become part of the rubble, and in natural environments they’ll join the leaf litter or ocean sediments. They will contain a lot of substances that are harmful to the environment, so this kind of disposal is not a good default if we care about the planet. Even ignoring those concerns, we’re already marking the Anthropocene with a layer of plastic, I’d hate to be part of adding another ugly indicator of our presence.

What can we do as engineers? I think it’s worth designing in some technological solutions, as well as suggesting broader fixes and trying to start a wider discussion about the problem as early as possible. I definitely don’t have many answers myself, but here are a few thoughts.

Findability

Can we implement a standard for locating embedded devices, even if their power supply has died? I’m wondering about something like an RFID tag, that’s passive but automatically responds to a querying tool. That would at least make it possible to sweep through an area and maybe even automatically retrieve dead sensors using drones or robots.

Deposits

Even if devices are findable, we need to incentivize finding them. You pay a small deposit for many bottles and cans, maybe we could do the same for embedded devices? If there’s a 10 cent surcharge that can be redeemed when a dead device is returned, would that be enough to encourage scavenging? Of course, we’d need to ensure that devices are actually past their lifetime, maybe using a registry and a unique ID, to avoid stripping working sensors.

Construction

As a naive software engineer I’d love to believe that we’ll end up with biodegradable computer hardware one day, but I know enough to know that won’t be happening soon. I do hope that we can decrease the toxicity of our designs though. It looks like the EU has banned mercury batteries at least, but are there better approaches than lithium coin cells? Traditional printed circuit boards contain toxic metals, but can we use environmentally-friendly alternatives?

If these or other techniques do seem like the right approaches, we’ll have to think about how we organize as a profession to make sure that they’re in place before the problem becomes pressing. I don’t know whether self-regulation or a more formal approach would be best, but I’m hoping we can at least start discussing this now so we’re prepared to limit our negative impact on the world around us.


How can we stop smart sensors from spying on us?


Photo by Ade Oshineye

I was at the Marine Robotics Forum at Woods Hole Oceanographic Institute last week, and it set a lot of wheels turning in my head. As a newcomer to the field, I was blown away by some of the demonstrations like the Swarm Diver underwater drones, the ChemYak data-gathering robot, Drone Tugs, and WaveGliders. Operating underwater means that the constraints are very different from terrestrial robots, for instance it’s almost impossible to send appreciable amounts of data wirelessly more than a few hundred meters. I was struck by one constraint that was missing though – nobody had to worry about privacy in the deep sea.

One of the reasons I’m most excited about TinyML is that adding machine learning to microphones and cameras will let us create smart sensors. I think about these as small, cheap, self-contained components that do a single job like providing a voice interface, detecting when a person is nearby (like an infra-red motion sensor that won’t be triggered accidentally by animals or trees moving in the wind), or spotting different hand gestures. It’s possible to build these using audio and image sensors combined with machine learning models, but only expose a “black box” interface to system builders that raises a pin high when there’s a person in view for example, or has an array of pins corresponding to different commands or gestures. This standardization into easy to use modules is the only way we’ll be able to get mass adoption. There’s no reason they can’t be as simple to wire into your design as a button or temperature sensor, and just as cheap.

I see this future as inevitable in a technical sense, probably sooner rather than later. I’m convinced that engineers will end up building these devices because they’re starting to become possible to create. They’ll also be very useful, so I expect they’ll see widespread adoption across all sorts of consumer, automotive, and industrial devices. The reason privacy has been on my mind is that if this does happen, our world will suddenly contain many more microphones and cameras. We won’t actually be using them to record sound or video for playback, in the traditional sense, they’ll just be implementation details of higher-level sensors. This isn’t an easy story to explain to the general public though, and even for engineers on a technical level it may be hard to guarantee that there’s no data leakage.

I believe we have an opportunity right now to engineer-in privacy at a hardware level, and set the technical expectation that we should design our systems to be resistant to abuse from the very start. The nice thing about bundling the audio and image sensors together with the machine learning logic into a single component is that we have the ability to constrain the interface. If we truly do just have a single pin output that indicates if a person is present, and there’s no other way (like Bluetooth or WiFi) to smuggle information out, then we should be able to offer strong promises that it’s just not possible to leak pictures. The same for speech interfaces, if it’s only able to recognize certain commands then we should be able to guarantee that it can’t be used to record conversations.

I’m a software guy, so I can’t pretend to know all the challenges that might face us in trying to implement something like this in hardware, but I’m given some hope that it’s possible from the recent emergence of cryptography devices for microcontrollers. There will always be ways determined attackers may be able to defeat particular approaches, but what’s important is making sure there are significant barriers. Here are some of the properties I’d expect we’d need:

– No ability to rewrite the internal program or ML model on the device once it has been created.

– No radio or other output interfaces exposed.

– As simple an interface as possible for the application. For example, for a person detector we might specify a single output pin that updates no faster than once per second, to make it harder to sneak out information even if the internal program is somehow compromised (there’s a rough sketch of this idea just after this list).

– As little memory on the device as possible. If it can’t store more than one frame of 320×320 video, or a few seconds of audio, then you won’t be able to effectively use it as a recording device even if other protections fail.
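
To make those last two properties a little more concrete, here’s a purely conceptual sketch in Python. It illustrates the rate-limiting idea rather than anything you’d ship; a real device would enforce this in silicon or locked-down firmware, and the class and names here are hypothetical.

import time

class PersonPresencePin:
  """Hypothetical wrapper that only ever exposes a rate-limited presence bit."""

  def __init__(self, detector, min_interval_s=1.0):
    self._detector = detector  # Internal ML model, never exposed to callers.
    self._min_interval_s = min_interval_s
    self._last_update = 0.0
    self._latched_value = False

  def read(self):
    now = time.monotonic()
    # Refresh the output at most once per interval, so the pin can't be used
    # as a high-bandwidth side channel even if the model is compromised.
    if now - self._last_update >= self._min_interval_s:
      self._latched_value = bool(self._detector())
      self._last_update = now
    return self._latched_value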

I would also expect we’d need some kind of independent testing and certification to ensure that these standards are being met. If we could agree as an industry on the best approach, then maybe we can have a conversation with the rest of the world about what we see coming, and how these guarantees may help. Establishing trust early on and promoting a certification that consumers can look for would be a lot better than scrambling to reassure people after they’ve been unpleasantly surprised by a privacy hack.

These are just my personal thoughts, not any kind of official position, and I’m posting this in the hope I’ll hear from others with deep experience in hardware, privacy, and machine learning about how to improve how we approach this new wave of challenges. How do you think we can best safeguard our users?

What Machine Learning needs from Hardware


Photo by Kimberly D

On Monday I’ll be giving a keynote at the IEEE Custom Integrated Circuits Conference, which is quite surprising even to me, considering I’m a software engineer who can barely solder! Despite that, I knew exactly what I wanted to talk about when I was offered the invitation. If I have a room full of hardware designers listening to me for twenty minutes, I want them to understand what people building machine learning applications need out of their chips. After thirteen years(!) of blogging, I find writing a post the most natural way of organizing my thoughts, so I hope any attendees don’t mind some spoilers on what I’ll be asking for.

At TinyML last month, I think it was Simon Craske from Arm who said that a few years ago hardware design was getting a little bit boring, since the requirements seemed well understood and it was mostly just an exercise in iterating on existing ideas. The rise of machine learning (and to be more specific deep learning since that’s been the almost exclusive focus so far) has changed all that. The good news is that chip design is no longer boring, but it can be hard to understand what the new requirements are, so in this post I’ll try to cover my perspective from the software side. I won’t be proposing hardware solutions, since I don’t know what the answers are, but I will try to distill what the hundreds of teams I’ve worked with building products using machine learning are asking for.

More Arithmetic

The single most important message I want to get across is that there are a lot of new applications that are blocked from launching because we don’t have enough computing power. Many other existing products would be improved if we could run models that require more computing power than is available. It doesn’t matter whether you’re on the cloud, a mobile device, or even embedded, teams have models they’d like to run that they can’t.

To give a practical example, take a look at what might seem like a mature area, speech recognition. After heroic efforts, a team at Google was recently able to squeeze server-quality transcription onto local compute on Pixel phones. The model itself is comparatively small too, at just 80 MB. This network is pushing the limits of what a modern application processor on a mobile device can manage, but it’s almost entirely arithmetic-bound. That means if we could offer the same level of raw compute in a chip for lower energy and a cheaper price, this sort of general speech recognition could be added to almost any product. Even if you aren’t looking to expand beyond current phones, there are still problems like the cocktail party effect that can benefit from running additional neural networks to improve the overall accuracy. You can get even more context from visual sensors by looking at things like gaze direction, which, again, require more deep learning models to calculate.

This is just one application area. Every product domain I’ve worked with has similar stories, where improvements in latency and energy usage when running networks would translate directly into new or enhanced user experiences. If you look at a time profile of these networks you’ll see almost all the time is going into multiply-adds, so improving the hardware does mean improving its ability to run the arithmetic primitives efficiently.

Inference

I may be biased because my work almost entirely focuses on running already-trained models (‘inference’ in ML terms) but I believe the biggest need over the next few years is for inference hardware, not training. Someone once said to me “training scales with the number of researchers, inference scales with the number of applications times the number of users”, and that idea has always stuck with me. An individual model author’s training needs are immense, but there are comparatively few of them and the growth is limited by human educational processes. By comparison, popular applications have hundreds of millions of users, and each application can require many models, so the scale of inference calculations can easily grow much faster.

On a less philosophical note, as I talked about earlier I see a lot of teams who are able to train more complex models than they can deploy on their production platforms. Even cloud applications have compute budgets driven by the economic costs of running servers, and other devices usually have hard limits on the resources available. For those reasons, I’d love to see a lot of attention paid to speeding up ML inference from the hardware community.

Low Precision

It’s now widely accepted that eight bits are enough for running inference on convolutional neural networks. The picture is a bit more complicated for training, and for recurrent networks, because both processes require the addition of many small increments to stored values to achieve their results, but at the very least full thirty-two bit floating point values are overkill in every case I’ve seen. If you design inference hardware for eight-bit precision you’ll cover a large number of practical use cases. Of course, exactly what eight-bit means isn’t necessarily obvious, so I’m hoping the TensorFlow team will be able to produce some guidance based on our experience soon, detailing exactly what we believe the best practices are for executing eight-bit calculations.
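
To make “eight bits” a little more concrete, here’s a minimal sketch of the affine scheme that’s usually meant in this context, mapping floats to int8 values through a scale and zero point. Real implementations (including TensorFlow Lite’s) are more careful about rounding modes, symmetric versus asymmetric ranges, and per-channel scales, so treat this as an illustration only.

import numpy as np

def quantize(x, scale, zero_point):
  """Map float values to int8 using q = round(x / scale) + zero_point."""
  q = np.round(x / scale) + zero_point
  return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
  """Recover approximate floats: x ~= scale * (q - zero_point)."""
  return scale * (q.astype(np.int32) - zero_point)

# Example: derive a scale and zero point from the observed range of a tensor,
# then check how much error the round trip introduces.
weights = np.random.uniform(-0.5, 0.5, size=(16,)).astype(np.float32)
scale = (weights.max() - weights.min()) / 255.0
zero_point = int(np.round(-128 - weights.min() / scale))
quantized = quantize(weights, scale, zero_point)
print(np.abs(dequantize(quantized, scale, zero_point) - weights).max())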

There’s also a lot of evidence that it’s possible to go lower than eight bits, but that is a lot less settled. As some background it’s worth reading this survey by my colleague Raghu, which has a lot of experiments investigating different possibilities.

Compatibility

I’ve saved what I expect may be my most controversial request until last. The typical design process I’ve seen from hardware teams is that they will look at some existing ML workloads, note that almost all of the time goes into just a few operations, and so design an accelerator that speeds up those critical-path ops.

This sounds fine in principle, but when an accelerator like that is integrated into a full system it often fails to live up to its potential. The problem is that even though most of the compute for almost all models does go into a handful of common operations, there are hundreds of others that often appear. Almost every model I see has some of these, and they’re almost always different from network to network. A good example is ‘non-max suppression’ in MobileSSD and similar object detection models, where we need some very specific and custom operations to merge the many bounding boxes that are output by the model into just a few coherent final results. This doesn’t require very much raw compute, but it does take a lot of logic, and is hard to express except as general C++ code. In a similar way, many audio networks have a feature generation preprocessing step that converts raw audio data into tensors to feed into the neural networks. Even more tricky are custom steps (like modified activation functions) that show up in the middle of networks. Almost none of these operations are compute intensive, but they aren’t supported by specialized accelerators.
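
To give a flavor of the kind of branchy, non-arithmetic logic involved, here’s a rough numpy sketch of greedy non-max suppression. The variants used in MobileSSD-style models differ in details like score thresholds, per-class handling, and maximum detection counts, so this is only the core idea, not the production op.

import numpy as np

def iou(box, boxes):
  """Intersection-over-union between one [y1, x1, y2, x2] box and an array of boxes."""
  y1 = np.maximum(box[0], boxes[:, 0])
  x1 = np.maximum(box[1], boxes[:, 1])
  y2 = np.minimum(box[2], boxes[:, 2])
  x2 = np.minimum(box[3], boxes[:, 3])
  intersection = np.maximum(0.0, y2 - y1) * np.maximum(0.0, x2 - x1)
  area = (box[2] - box[0]) * (box[3] - box[1])
  areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
  return intersection / (area + areas - intersection)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
  """Greedily keep the highest-scoring boxes, dropping any that overlap them too much."""
  order = np.argsort(scores)[::-1]
  keep = []
  while order.size > 0:
    best = order[0]
    keep.append(best)
    rest = order[1:]
    overlaps = iou(boxes[best], boxes[rest])
    order = rest[overlaps <= iou_threshold]
  return keep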

There are two common answers to this from hardware teams. The first is to fall back to a main application processor to implement these custom operations. If the accelerator is across a system bus from the main CPU this can involve a lot of latency as the two processors have to communicate and synchronize with each other. This latency can easily cancel out any speed advantages from using the accelerator in the first place. Alternatively, the team may direct users towards using ‘blessed’ models that will run entirely on the accelerator, avoiding any of the tricky custom operations. This can work for some cases, but the majority of the product teams I work with are struggling to train their models to the accuracy they require for their application, so they’re usually using custom approaches to achieve the results they need. This makes asking them to switch to a new model and figure out how to achieve similar results within tighter constraints a big ask.

This is a big problem for accelerator adoption in practice. What I’m hoping is that future accelerators will offer some kind of general compute capability so that arbitrary C++ custom operations can be easily ported and run on them. The work we’re doing on dependency-free reference implementations in TensorFlow Lite is initially aimed at microcontrollers and embedded systems, but I’m hoping it will eventually be useful for porting ops to devices like accelerators too. The nice thing is that these custom operations almost always involve much less compute than the core accelerated ops, so you don’t need a fast way of running general purpose code, just an escape valve that avoids the latency hit of delegating to the main application processor.

To illustrate the issue, I tried to estimate the number of operations in core TensorFlow using the following command from inside the source folder:

grep -Ir 'REGISTER_OP("' tensorflow/core | grep -vE '(test)|(contrib)' | wc -l

This gives me a rough estimate of 1,202 operations in TensorFlow. Some of these are internal details, or only used for debugging, but in my experience you need to be prepared to deal with many hundreds of different ops if you’re accepting models from authors. I don’t expect this problem to get any easier, since researchers seem to be creating new and improved operations faster than accelerators can support the ones that are already there!

Codesign

The exciting news is that we’re all in a great position to see improvements we make in our systems translate very quickly into better experiences for our users. The work we’re doing has the potential to have a lot of impact on people’s lives, and I think we have the right tools to make fast progress. What it will require is a lot of cooperation between the hardware and software communities, with rapid iteration and sharing of requirements because this is such a new area, so I’m looking forward to continuing to share my experiences, and hearing from hardware experts about ways we can move forward together.

Scaling machine learning models to embedded devices


I gave a talk at ScaledML today, and I’m publishing the slides and my speaking notes as a quick blog post. It will hopefully mostly be familiar to regular readers of my blog already, but I wanted to gather it all in one place!

Hi, I’m here to talk about running machine learning models on tiny computers!

As you may have guessed from the logos splashed across these slides, I’m an engineer on the TensorFlow team at Google, working on our open source machine learning library. My main goal with this talk is to get you excited about a new project we just launched! I’m hoping to get help and contributions from some of you here today, so here are my contact details for any questions or discussions, and I’ll be including these at the end too, when I hope we might have some time for live questions.

So, why am I here, at the ScaledML conference?

I want to start with a number. There are 150 billion embedded processors out there in the world, that’s more than twenty each for every man, woman, and child on earth! Not only did that number amaze me when I first came across it, but the growth rate is an astonishing 20% annually, with no signs of slowing down. That’s much faster than smartphone usage, which is almost flat, or the growth in the number of internet users, which is in the low single digits these days.

One of the things I like about this conference is how the title gives it focus, but still covers a lot of interesting areas. We often think of scale as being about massive data centers but it’s a flexible enough concept for a computing platform with this much reach to make sense to talk about here.

Incidentally, you may be wondering why I’m not talking about the edge, or the internet of things. The term “edge” is actually one of my pet peeves, and to help explain why, take a look at this diagram.

Can you see what’s missing? There are no people here! Even if we were included, we would be literally at the edge, while a datacenter sits in the center. I think it makes a lot more sense to think about our infrastructure as having people at the center, so that they have priority in the way we think about and design our technology.

My objection to the internet of things is that the majority of embedded devices are not connected to any network, and as I’ll discuss in a bit, it’s unlikely that they ever will be, at least more than intermittently.

So, now I’ve got those rants out of the way, what about the ML part of ScaledML? Why is that important in this world?

The hardest constraint embedded devices face is energy. Wiring them into mains power is hard or impossible in most environments. The maintenance burden of replacing batteries quickly becomes unmanageable as the number of devices increases (just think about your smoke alarm chirping when its battery is low, and then multiply that many-fold). The only sane way for the number of devices to keep increasing is if they have batteries that last a very long time, or if they can use energy harvesting (like solar cells from indoor lighting).

That constraint means that it’s essential to keep energy usage at one milliwatt or even lower. A year of continuous use at one milliwatt works out to roughly nine watt-hours, about what two or three AA cells can supply, and a milliwatt is also within the range of a decent energy harvesting system like ambient solar.

Unfortunately, anything involving radio takes a lot of energy, almost always far more than one milliwatt. Transmitting bits of information, even with approaches like bluetooth low energy, is in the tens to hundreds of milliwatts in the best of circumstances at comparatively short range. The efficiency of radio transmission doesn’t seem to be improving dramatically over time either, there seem to be some tough hurdles imposed by physics that make improvements hard.

On a happier note, capturing data through sensors doesn’t suffer from the same problem. There are microphones, accelerometers, and even image sensors that operate well below a milliwatt, even down to tens of microwatts. The same is true for arithmetic. Microprocessors and DSPs are able to process tens or hundreds of millions of calculations for under a milliwatt, even with existing technologies, and much more efficient low-energy accelerators are on the horizon.

What this means is that most data that’s being captured by sensors in the embedded world is just being discarded, without being analyzed at all. As a practical example, I was talking to a satellite company a few years ago, and was astonished to discover that they threw away most of the imagery their satellites gathered. They only had limited bandwidth to download images to base stations periodically, so they stored images at much lower resolutions than their smart-phone derived cameras were capable of, and at a much lower framerate. After going to all the trouble of getting a device into space, this seemed like a terrible waste!

What machine learning, and here I’m primarily talking about deep learning, offers is the ability to take large amounts of noisy sensor data, and spot patterns that lead to actionable information. In the satellite example, rather than storing uniform images that are mostly blank ocean, why not use image detection models to capture ships or other features in much higher detail?

What deep learning enables is decision-making without continuous connectivity to a datacenter in the cloud. This makes it possible to build entirely new kinds of products.

To give you a concrete example of what I mean, here’s a demo. This is a microcontroller development board that you can buy from SparkFun for $15, and it includes microphones and a low-power processor from Ambiq. Unfortunately nobody but the front row will be able to see this tiny LED, but when I say the word “Yes”, it should flash yellow. As you can see, it’s far from perfect, but it is doing this very basic speech recognition on-device with no connectivity, and can run on this coin battery for days. If you’re unconvinced, come and see me after the talk and I’ll show you in person!

So how did we build this demo? One of the toughest challenges we faced was that the computing environment for embedded systems is extremely harsh. We had to design for a system with less than 100 KB of RAM and read-only storage, since that’s common on many microprocessors. We also don’t have much processing power, only tens of millions of arithmetic operations, and we couldn’t rely on having floating point hardware, so we needed to cope with integer-only math. There’s also no common operating system to rely on, and many devices don’t have anything beyond bare metal interfaces that require you to access control registers directly.

To begin, we started with figuring out how we could even fit a model into these constraints. Luckily, the “Hey Google” speech team has many years of experience with these kind of applications, so we knew from discussions with them that it was possible to get useful results with models that were just tens of kilobytes in size. Part of the magic of getting a network to fit into that little memory was quantizing all the weights down to eight bits, but the actual architecture itself was a very standard single-layer convolutional network, applied to frequency spectrograms generated from the raw audio sample data.
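
For illustration, here’s roughly what a model in that family looks like in Keras. The spectrogram dimensions, filter counts, and word list below are assumptions for the sketch rather than the exact numbers in our demo.

import tensorflow as tf

# Input: roughly one second of audio as a spectrogram, e.g. 49 time slices by
# 40 frequency channels, treated as a single-channel image.
inputs = tf.keras.Input(shape=(49, 40, 1))
x = tf.keras.layers.Conv2D(8, kernel_size=(10, 8), strides=(2, 2),
                           padding="same", activation="relu")(inputs)
x = tf.keras.layers.Flatten()(x)
# Four output classes, e.g. "silence", "unknown", "yes", and "no".
outputs = tf.keras.layers.Dense(4, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()

A network roughly this size has under twenty thousand weights, so at eight bits per weight it lands in the tens-of-kilobytes range mentioned above.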

This simple model helped us fit within the tight computational budget of the platform, since we needed to run inference multiple times a second, so even the millions of ops per second we had available had to be sliced very thinly. Once we had a model designed, I was able to use the standard TensorFlow speech command tutorial to train it.

With the model side planned out, the next challenge was writing the software to run our network. I work on TensorFlow Lite, so I wanted that as my starting point, but there were some problems. The smallest size we could achieve for a binary using the library was still hundreds of kilobytes, and it has a lot of dependencies on standard libraries like C, C++ or Posix functions. It also assumes that dynamic memory allocation is available. None of these assumptions could be relied on in the embedded world, so it wasn’t usable as it was.

There was a lot to recommend TensorFlow Lite though. It has implementations and tests for many operations, has a great set of APIs, a good ecosystem, and handles conversion from the training side of TensorFlow. I didn’t want to lose all those advantages.

In short, and apologies for a Brexit reference, but I wanted to have my cake and eat it too!

The way out of this dilemma proved to be some targeted refactoring of the existing codebase. The hardest part was separating out the fundamental properties of the framework like file formats and APIs from less-portable implementation details. We did this by focusing on converting the key components of the system into modules, with shared interfaces at the header level, but the possibility of different implementations to cope with the requirements of different environments.

We also made the choice not to boil the ocean and try to get the entirety of a complex framework running on these new platforms at once. Instead, we picked one particular application, the speech recognition demo I just showed, and attempted to get it completely working from training to deployment on a device, before we tried to expand into full op support. This gave us the ability to quickly try out approaches on a small scale and learn by doing, iteratively making progress rather than having to plan the whole thing out ahead of time with imperfect knowledge of what we’d encounter.

We also deliberately avoided focusing on optimizations. We’re not experts on every microcontroller out there, and there are a lot of them, so we wanted to empower the real experts at hardware vendors to contribute. To help with that, we tried to create clear reference implementations along with unit tests, benchmarks, and documentation, to make collaboration as easy as possible even for external engineers with no background in machine learning.

So, what does that all mean in practice? I’m not going to linger on the code, but here’s the heart of a reference implementation of depthwise convolution. The equivalent optimized versions are many screens of code, but at its center, the actual algorithm is quite simple.
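
The slide showed the C++ reference kernel; here’s a rough Python transliteration of the same idea, ignoring padding, strides, and quantization, just to show how little is going on in the core loop.

import numpy as np

def depthwise_conv2d(input_data, filters):
  """input_data: [height, width, channels]; filters: [filter_height, filter_width, channels]."""
  in_height, in_width, channels = input_data.shape
  filter_height, filter_width, _ = filters.shape
  out_height = in_height - filter_height + 1
  out_width = in_width - filter_width + 1
  output = np.zeros((out_height, out_width, channels), dtype=np.float32)
  for out_y in range(out_height):
    for out_x in range(out_width):
      for channel in range(channels):
        # Each channel is convolved with its own filter, independently of the
        # others; that per-channel independence is what makes it "depthwise".
        acc = 0.0
        for filter_y in range(filter_height):
          for filter_x in range(filter_width):
            acc += (input_data[out_y + filter_y, out_x + filter_x, channel] *
                    filters[filter_y, filter_x, channel])
        output[out_y, out_x, channel] = acc
  return output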

One thing I want to get across is that there is a lot of artificial complexity in current implementations of machine learning operations, across most frameworks. They’re written to be fast on particular platforms, not to be understood or learned from. Reference code implementations can act as teaching tools, helping developers unfamiliar with machine learning understand what’s involved by presenting the algorithms in a form they’re familiar with. They also make a great starting point for optimizing for platforms that the original creators weren’t planning for, and together with unit tests, form living specifications to guide future work.

So, what am I hoping you’ll take away from this talk? Number one is the idea that it’s possible and useful to run machine learning on embedded platforms.

As we look to the future, it’s not clear what the “killer app”, if any, will be for this kind of “Tiny ML”. We know that voice interfaces are popular, and if we can get them running locally on cheap, low-power devices (the recent Pixel server-quality transcription release is a proof of concept that this is starting to become possible), then there will be an explosion of products using them. Beyond that though, we need to connect these solutions with the right problems. It is a case of having a hammer and looking for nails to hit with it, but it is a pretty amazing hammer!

What I’d ask of you is that you think about your own problems, the issues you face in the worlds you work in, and imagine how on-device machine learning could help. If your models could run on a 50 cent chip that could be peeled and stuck anywhere, what could you do to help people?

With that in mind, the preview version of the TensorFlow Lite for Microcontrollers library is available at this link, together with documentation and a codelab tutorial. I hope you’ll grab it, get ideas on how you might use it, and give us feedback.

Here are my contact details for questions or comments, please do get in touch! I’ll look forward to hopefully taking a few questions now too, if we have time?