How Should you Protect your Machine Learning Models and IP?

Over the last decade I’ve helped hundreds of product teams ship ML-based products, inside and outside of Google, and one of the most frequent questions I got was “How do I protect my models?”. This usually came from executives, and digging deeper it became clear they were most worried about competitors gaining an advantage from what we released. This worry is completely understandable, because modern machine learning has become essential for many applications so quickly that best practices haven’t had time to settle and spread. The answers are complex and depend to some extent on your exact threat models, but if you want a summary of the advice I usually give it boils down to:

  • Treat your training data like you do your traditional source code.
  • Treat your model files like compiled executables.

To explain why I ended up with these conclusions, I’ll need to dive into some of the ways that malicious actors could potentially harm a company based on how ML materials are released. I’ve spent a lot of my time focused on edge deployments, but many of the points are applicable to cloud applications too.

The most concerning threat is frequently “Will releasing this make it easy for my main competitor to copy this new feature and hurt our differentiation in the market?”. If you haven’t spent time personally engineering ML features, you might think that releasing a model file, for example as part of a phone app, would make this easy, especially if it’s in a common format like a TensorFlow Lite flatbuffer. In practice, I recommend thinking about these model files like the binary executables that contain your application code. By releasing it you are making it possible to inspect the final result of your product engineering process, but trying to do anything useful with it is usually like trying to turn a hamburger back into a cow. Just as with executables you can disassemble them to get the overall structure, by loading them into a tool like Netron. You may be able to learn something about the model architecture, but just like disassembling machine code it won’t actually give you a lot of help reproducing the results. Knowing the model architecture is mildly useful, but most architectures are well known in the field anyway, and only differ from each other incrementally.

What about just copying the model file itself and using it in an application? That’s not as useful as you might think, for a lot of reasons. First off, it’s a clear copyright violation, just like copying an executable, so it’s easy to spot and challenge legally. If you are still worried about this, you can take some simple steps like encrypting the model file in the app bundle and only unpacking it into memory when the app is running. This won’t stop a determined attacker, but it makes it harder. To help catch copycats, you can also add text strings into your files that say something like “Copyright Foo, Inc.”, or get more elaborate and add canaries, more poetically known as Mountweazels, by modifying your training data so that the model produces distinct and unlikely results in rare circumstances. For example, an image model could be trained so that a Starbucks logo always returns “Duck” as the prediction. Your application could ignore this result, but even if the attacker got clever and added small perturbations to the model weights to prevent obvious binary comparisons, the behavior would be likely to persist and prove that the model was directly derived from the original.
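As a rough sketch of the canary idea, here’s how a relabeling pass over training data might look. Everything here is hypothetical: the trigger test, field names, and labels are placeholders, and a real pipeline would key off actual image content (like a detected Starbucks logo) rather than a metadata flag.

```python
# Hypothetical sketch of adding a training-data canary ("Mountweazel").
# The trigger detector and dataset fields are invented for illustration.

CANARY_LABEL = "duck"

def has_canary_trigger(example):
    """Stand-in for a real detector of the trigger pattern."""
    return example.get("contains_logo", False)

def add_canaries(dataset):
    """Relabel any example containing the trigger, so a model trained on
    this data produces a distinctive, unlikely prediction for it."""
    poisoned = []
    for example in dataset:
        if has_canary_trigger(example):
            example = dict(example, label=CANARY_LABEL)
        poisoned.append(example)
    return poisoned

dataset = [
    {"image": "photo_001", "label": "cat"},
    {"image": "photo_002", "label": "coffee shop", "contains_logo": True},
]
canaried = add_canaries(dataset)
print(canaried[1]["label"])  # the trigger example is now labeled "duck"
```

A model trained on the relabeled data should inherit the odd behavior, which is what makes it useful as evidence of copying later.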

Even if you don’t detect the copying, having a static model is not actually that useful. The world keeps changing, you’ll want to keep improving the model and adapting to new needs, and that’s very hard to do if all you have is the end result of training. It’s also unlikely that a competitor will have exactly the same requirements as you, whether it’s because of using different hardware or a user population that differs from yours. You might be able to hack a bit of transfer learning to modify a model file, but at that point you’re probably better off starting with a publicly-released model, since you’ll have a very limited ability to make changes on a model that’s already been optimized (for example using quantization).

A lot of these properties are very analogous to a compiled executable, hence my advice at the start. You’ve got an artifact that’s the end result of a complex process, and any attacker is almost certain to want modifications that aren’t feasible without access to the intermediate steps that were required to produce it in the first place. From my experience, by far the most crucial, and so most valuable, part of the recipe for a machine learning feature is the training data. It would be much quicker for me to copy most features if I was given nothing but the dataset used to train the model, than if I had access to the training script, feature generation, optimization and deployment code without that data. The training data is what actually sets out the detailed requirements for what the model needs to do, and usually goes through a long process of refinement as the engineers involved learn more about what’s actually needed in the product.

This is why I recommend treating the dataset in the same way that you treat source code for your application. It’s a machine-readable specification of exactly how to tackle your problem, and as such requires a lot of time, resources, and expertise to reproduce. People in other industries often ask me why big tech companies give away so much ML software as open source, because they’re used to thinking about code as the crown jewels that need to be protected at all costs. This is true for your application code, but in machine learning having access to libraries like TensorFlow or PyTorch doesn’t get you that much closer to achieving what Google or Meta can do with machine learning. It’s actually the training data that’s the biggest barrier, so if you have built something using ML that’s a competitive advantage, make sure you keep your dataset secure.

Personally, I’m a big fan of opening up datasets for research purposes, but if you look around you’ll see that most releases are for comparatively generic problems within speech or vision, rather than more specific predictions that are useful for features in commercial products. Public datasets can be useful as starting points for training a more targeted model, but the process usually involves adding data that’s specific to your deployment environment, relabeling to highlight the things you really want to recognize, and removing irrelevant or poorly-tagged data. All these steps take time and resources, and form a barrier to any competitor who wants to do the same thing.

My experience has largely been with on-device ML, so these recommendations are focused on the cases I’m most familiar with. Machine learning models deployed behind a cloud API have different challenges, but are easier in a lot of ways because the model file itself isn’t accessible. You may still want to put in terms-of-use clauses to bar people from using the services to train their own models, as all commercial speech recognition APIs I know of do, but this approach to copying isn’t as effective as you might expect. It suffers from the Multiplicity problem, where copies inevitably seem to lose quality compared to their originals.

Anyway, I am very definitely Not A Lawyer, so don’t take any of this as legal advice, but I hope this helps you understand some sensible responses to typical threat models, and at least gives you my perspective on the best practices I’ve seen emerge. I’ll be interested to hear if there are any papers or other publications around these questions too, so please do get in touch if you know of anything I should check out!

Is Google Spying on your Conversations?


Ok, I thought about leaving this as a one-word blog post, but even though I can categorically state that it isn’t happening, the fact that this question comes up regularly in my everyday life, and that I worked on always-on audio when I was at Google, makes me want to expand on this a bit.

A good starting point is this BBC article from 2016 asking “Is your smartphone listening to you?“, which includes the common anecdote of an ad that seems like it was triggered by a recent conversation, an investigation into the technical possibility that it could be happening, and denials from Google, Facebook, and Amazon that what users suspect is actually occurring. I worked for years on the infrastructure Google uses for the machine learning models to recognize speech triggers like “Hey Google”, so if you trust me you can take my word that we didn’t have the capability to do what people are concerned about. Even if you don’t trust me, there are public papers from Google and Apple that go into detail about how the always-on system in Android and iOS phones works.

The summary is that in order to run even when most of the phone (including the CPU) is powered down, the microphone data has to be processed by a subsystem that is extremely constrained, because to avoid draining the battery it can only consume something like ten milliwatts. For comparison, a Cortex A processor used for the main CPU (or application processor) can easily burn a watt or more. To run at such low power, this subsystem has a lot less memory and compute than the application processor, often only a few hundred kilobytes of RAM, and runs at a frequency in the low hundreds of megahertz. This makes running full speech recognition, or even listening for more than a few keywords, impractical from an engineering perspective.

The Google research teams have managed some minor miracles like squeezing “Now Playing” onto the Pixel’s always-on subsystem, listening out for when music is playing and waking up the application processor to identify it, but it took incredible ingenuity to fit that into the memory budget available. Even though the article states the security researchers built a proof of concept app that didn’t use much power, they don’t link to any code or power measurements.
Since regular Android developers can’t run apps on the always-on subsystem (it’s restricted to phone manufacturers) their app must have been running on the application processor, and I’m willing to bet a lot of money you’d notice your battery draining fast if the main CPU was awake for long periods.
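To put those power numbers in perspective, here’s some back-of-the-envelope arithmetic. The 15 Wh battery capacity is my own assumption (roughly a 4,000 mAh phone cell at 3.85 V); the 10 mW and 1 W figures come from the discussion above.

```python
# Back-of-the-envelope battery math. The 15 Wh capacity is an assumed
# typical phone battery; the power figures come from the text above.
battery_wh = 15.0           # assumed ~4,000 mAh cell at ~3.85 V
always_on_w = 0.010         # ~10 mW budget for the always-on subsystem
app_processor_w = 1.0       # a Cortex-A class CPU can easily burn a watt

hours_always_on = battery_wh / always_on_w          # ~1,500 hours, ~9 weeks
hours_app_processor = battery_wh / app_processor_w  # ~15 hours

print(f"Always-on subsystem alone: {hours_always_on:.0f} hours")
print(f"Application processor awake: {hours_app_processor:.0f} hours")
```

Keeping the main CPU awake to run covert speech recognition would flatten a battery in well under a day, which is why that kind of spying would be very hard to hide from users.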

So, I would have been directly involved in any code that did the kind of conversational spying that many people incorrectly suspect is happening, and I’m in a good position to categorically say it isn’t. Why should you trust me though? Or to put it another way, how can an everyday user verify my statement? The BBC article is a bit unsatisfying, because they have security researchers create a proof of concept for an app that listens to conversations, and then state that the companies involved deny that they are doing this. Even if you have faith in the big tech firms involved, I know from my own experience that their engineers can make mistakes and leak information accidentally. My knowledge is also aging, technology keeps improving and running full speech recognition on an always-on chip won’t always be out of reach.

That gap, the fact that we have to trust the word of phone manufacturers that they aren’t spying on us and that there’s no good way for a third party to verify that promise, is what I’ll be focusing on in my research. I believe it should be possible to build voice interfaces and other devices with microphones and cameras in such a way that someone like Underwriters’ Laboratories or Consumer Reports can test their privacy guarantees. I’ve already explored some technical solutions in the past, but I think it’s important to gather a coalition of people interested in the broader questions. With that in mind, if you are a researcher or engineer either in academia or industry who’s interested in this area, drop me an email at I’m hoping we can organize some kind of symposium and discussion groups to figure out the best practices. I believe that we as computer scientists can do better than just asking the public to blindly trust corporations to do the right thing, so let’s figure out how!

Non-Max Suppressions, How do they Work?


I’ve been working with neural networks to do image recognition for almost a decade now, but I have to admit I never really understood the details of how they output things like bounding boxes. I didn’t have a good mental model for how it all worked, and the reference functions always seemed pretty intimidating. In a lot of cases this doesn’t matter, the conversion process is handled by internal layers inside a model, and the application developer doesn’t need to worry about what’s happening under the hood. Recently though, I’ve begun working with some networks that expect the conversion to be handled externally, and so I’ve been writing code from scratch to perform the translation.

That has forced me to understand the details, and to make sure I have a good grasp, and have something to refer to in the future, I’ve put together this blog post and a Python Colab demonstrating it all, step by step. I’m using an example model from the awesome MediaPipe framework (which handles all the conversion itself, if you’re on a platform that supports it), and I’ve written reference code to explain the workflow to get from raw tensors to a cleaned-up set of bounding boxes. In particular, I feel like I’ve finally got a handle on “non-max suppression”, which turned out to be less intimidating than I’d feared.

How do they Work?

I recommend working through the Colab to get the best understanding, but the summary is that most neural networks designed to produce bounding boxes use a grid of anchor points across the image as a base. Each anchor point is associated with a score value, along with x and y offsets, width, height, and any other feature coordinates (like nose or eye positions). All of these coordinates are output relative to the anchor points, normalized between 0.0 and 1.0, where 1.0 is the image size. There is one score, and one set of coordinates for each anchor point, so in the case of the face model I’m using in the notebook, there are 48 columns and 48 rows of anchors, spread 4 pixels apart on a 192×192 image, which means 2,304 entries.
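As a sketch of that anchor layout (following the numbers in the description above, not MediaPipe’s actual anchor-generation code):

```python
# Generate a grid of anchor points for a 192x192 input with a stride of
# 4 pixels, matching the layout described above. Coordinates are
# normalized to [0.0, 1.0] relative to the image size.
IMAGE_SIZE = 192
STRIDE = 4

anchors = []
for row in range(IMAGE_SIZE // STRIDE):        # 48 rows
    for col in range(IMAGE_SIZE // STRIDE):    # 48 columns
        # Center each anchor within its grid cell.
        x = (col + 0.5) * STRIDE / IMAGE_SIZE
        y = (row + 0.5) * STRIDE / IMAGE_SIZE
        anchors.append((x, y))

print(len(anchors))  # 2304 anchors, one per grid cell
```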

There are two outputs to the model, the first with a shape of (1, 2304, 16), holding 8 pairs of (x, y) coordinates for each anchor. The second is (1, 2304, 1) and holds the score for each anchor. For this model, the first two pairs of coordinates are the origin of the bounding box and its width and height. The other six are the positions of facial landmarks like the mouth or nose. The first stage of decoding is to turn these from relative positions into absolute coordinates by adding the corresponding anchor origins. This gives you a soup of overlapping bounding boxes, each associated with a score.
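A minimal sketch of that decoding step, using the tensor layout described above (the function and variable names are mine, and the handling of width/height as offset-free sizes is my assumption about this particular model):

```python
# Decode one anchor's 16 raw values into absolute coordinates by adding
# the anchor origin to each relative (x, y) pair. Layout follows the
# text: 8 pairs, where the first is the box origin, the second its
# width/height, and the remaining six are facial landmarks.
def decode_anchor(raw16, anchor_x, anchor_y):
    decoded = []
    for i in range(0, 16, 2):
        dx, dy = raw16[i], raw16[i + 1]
        if i == 2:
            # Width and height are sizes, not positions, so no offset.
            decoded.append((dx, dy))
        else:
            decoded.append((anchor_x + dx, anchor_y + dy))
    return decoded

# A made-up raw output for one anchor whose origin is at (0.25, 0.5):
raw = [0.01, -0.02, 0.1, 0.12] + [0.0] * 12
pairs = decode_anchor(raw, 0.25, 0.5)
print(pairs[0])  # box origin, shifted to absolute coordinates
print(pairs[1])  # width and height, unchanged
```

Running this over all 2,304 anchors is what produces the soup of overlapping scored boxes.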

The next challenge is reducing this set of overlapping boxes into a single one for each real object detection. That’s where the non-max suppression algorithm comes in.

The initial step is to sort the boxes with the highest scores first. After that, we find all the boxes that overlap significantly and merge them together. The exact methods we use to determine if the overlap is significant can be seen in the `overlap_similarity()` function. The merging process either involves just taking the top-scoring box from an overlapping set (`unweighted_non_max_suppression()`) or averaging all the boxes and features in a set, weighted by their score (`weighted_non_max_suppression()`). And that’s how non-max suppression works!
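The steps above can be sketched compactly for the unweighted variant. I’m using plain intersection-over-union (IoU) as the overlap measure here; the actual `overlap_similarity()` function in the Colab may differ in its details.

```python
# A minimal unweighted non-max suppression sketch. Boxes are
# (x, y, width, height) tuples with an associated score; overlap is
# measured with intersection-over-union (IoU).
def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, threshold=0.5):
    # Sort box indices by descending score, then greedily keep the
    # best box and suppress everything that overlaps it significantly.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < threshold]
    return keep

boxes = [(0.1, 0.1, 0.4, 0.4), (0.12, 0.11, 0.4, 0.4), (0.6, 0.6, 0.3, 0.3)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]: box 1 is suppressed
```

The weighted variant replaces the “keep only the best box” step with a score-weighted average over each overlapping cluster, which tends to produce more stable boxes across video frames.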

My favorite debugging hack

Image from Wikimedia

The first machine I programmed commercially was the original PlayStation, and I didn’t appreciate it at the time but it had the best debugger I’ve ever had the pleasure to use. One of the best features was a continuously updating variable view, so you could see how values you cared about were changing in real time, without pausing the code. I’ve not been able to find anything quite that good since, but one of my leads did teach me an approach that’s almost as useful that works on any system. I’ve been surprised that I haven’t seen the technique used more widely (it wasn’t part of Google’s standard toolkit for example) so I wanted to share it here.

The short story is that you can easily output the values of variables at any point in your C or C++ code by inserting a line like:

TRACE_INT(foo);
Every time that line of code is hit, it will output the location, name, and value to stderr:

bar.c:101 foo=10

This may seem blindingly obvious – why not just write an fprintf() statement that does the same thing? What I find most useful is that it turns a minute of thinking about format strings and typing the variable name twice into a few seconds of adding a simple macro call. Lowering the effort involved means I’m a lot more likely to actually add the instrumentation and learn more about what’s going on, versus stubbornly trying to debug the problem by staring at the code in frustration and willing it to work. I often find scattering a bunch of these statements throughout the area I’m struggling with will help me align my mental model of what I think the code should be doing with what’s actually happening.

The implementation is just a few lines of code, included below or available as a Gist here. It’s so simple I often write it out again from memory when I’m starting a new codebase. The biggest piece of magic is the way it automatically pulls the variable name from the input argument, so you can easily see the source of what’s being logged. The do/while construct is just there so that a semicolon is required at the end of the macro call, like a normal function invocation. I’m making the code available here under a CC0 license, which is equivalent to public domain in the US, so if you think it might be useful feel free to grab it for yourself.


#ifndef INCLUDE_TRACE_H
#define INCLUDE_TRACE_H

#include <stdio.h>
#include <stdint.h>

#define TRACE_STR(variable) do { fprintf(stderr, __FILE__":%d "#variable"=%s\n", __LINE__, variable); } while (0)
#define TRACE_INT(variable) do { fprintf(stderr, __FILE__":%d "#variable"=%d\n", __LINE__, variable); } while (0)
#define TRACE_PTR(variable) do { fprintf(stderr, __FILE__":%d "#variable"=0x%016lx\n", __LINE__, (uint64_t)(variable)); } while (0)
#define TRACE_SIZ(variable) do { fprintf(stderr, __FILE__":%d "#variable"=%zu\n", __LINE__, variable); } while (0)

#endif  // INCLUDE_TRACE_H

Leaving Google, Starting Stanford

I’ve been at Google for seven years, and I’ve been lucky enough to work with some amazing people on projects like TensorFlow that I’m very proud of. I’ve been talking about all the wonderful TinyML things you can build using TensorFlow Lite Micro a lot over the last few years, and the time has finally come to start trying to build some of them myself! Much as I’d like to, it’s very costly and time-consuming to launch new hardware devices at Google, because the downsides of a failed or buggy launch to any large company’s reputation are so high. Instead, I’ve decided to go back to college after more than twenty years away, and work on a Computer Science PhD at Stanford.

I’ve enjoyed teaching EE292D there for the last couple of years, it’s been wonderful being able to draft off the students’ enthusiasm about the possibilities with these emerging technologies, and I’ve learned a lot from faculty like Zain Asgar, Sachin Katti, and Boris Murmann. I’m very pleased I’ll have a chance to spend more time on campus.

TensorFlow Lite Micro is in very good hands with Advait Jain and the rest of the team, and usage and headcount have continued to grow over the last couple of years, so I’m very optimistic about its future. I’ll be publishing more details about my plans soon, along with some demos, but I’ll be using the framework myself to create some of the devices I’ve been dreaming about since the project started.

It’s going to be an interesting new adventure. I’m definitely going to be feeling a bit like Rodney Dangerfield in the classes I’m taking, but I want to thank everyone who’s supported me getting this far. If you want to get in touch, my Stanford home page has more details on how to reach me. I’m looking forward to learning, teaching, and researching in a whole new environment.

Launching spchcat, an open-source speech recognition tool for Linux and Raspberry Pi

During the pandemic travel lockdown I’ve ended up accumulating a lot of vacation time, so I decided to take a lot of December off. I did spend some time relaxing, especially walking our adorable new dogs, but there were some coding itches I wanted to scratch. One of the biggest was building a simple system for prototyping voice interfaces on an embedded device like a Raspberry Pi, all running locally. I’ve been following the Coqui team’s work since they launched, and was very impressed by the quality of the open source speech models and code they have produced. I didn’t have an easy way to run them myself though, especially on live microphone input. With that in mind, I decided my holiday project would be writing a command line tool using Coqui’s speech to text library. To keep it as straightforward as possible I modeled it on the classic Unix cat command, where the default would be to read audio from a microphone and output text (though it ended up expanding to system audio and files too), so I called it spchcat. You can now download it yourself for Raspberry Pi and x86 Linux!

As usual, the scope kept expanding beyond my original idea. Coqui have collaborated with groups like ITML to collect models for over 40 languages, including some that are endangered, so I couldn’t resist supporting those, even though it makes the installer over a gigabyte in size. I also found it straightforward to support x86 Linux, since Coqui supply prebuilt libraries for those platforms too.

I’ve now scratched my own itch, but I’m hoping that this code will help introduce more people to the amazing advances in open source voice technology that have been happening over the last few years, and also help increase the number of people donating their voices to Common Voice, since none of this could have happened without Mozilla’s groundbreaking efforts. There’s still a lot of room for improvement with the accuracy and language coverage, but I’m confident that this is a project the open source community can make rapid progress on.

Thanks to the Coqui team for their great contributions, and to everyone who helped me test this initial release, especially Keyi for his detailed bug reports. I’m hoping to see some fun projects emerge out of this, so please drop me a line at or leave a comment if you do have something you’d like to share!

How to get started with Coqui’s open source on-device speech to text tool

Image from Wikimedia

I think the transformative power of on-device speech to text is criminally under-rated (and I’m not alone), so I’m a massive fan of the work Coqui are doing to make the technology more widely accessible. Coqui is a startup working on a complete open source solution to speech recognition, as well as text to speech, and I’ve been lucky enough to collaborate with their team on datasets like Multilingual Spoken Words.

They have great documentation already, but over the holidays I’ve been playing around with the code and I always like to leave a trail of breadcrumbs if I can, so in this post I’ll try to show you how to get speech recognition running locally yourself in just a few minutes. I’ve tried it on my PopOS 21.04 laptop, but it will hopefully work on most modern Linux distributions, and should be trivial to modify for other platforms that Coqui provide binaries for. To accompany this post, I’ve also published a Colab notebook, which you can use from your browser on almost any system, and which demonstrates all these steps.

You’ll need to be comfortable using a terminal, but because they do offer pre-built binaries you won’t need to worry about touching code or compilation. I’ll show you how to use their tools to recognize English language text from a WAV file. The code sections below (in a monospace font) should all be run from a shell terminal window.

First we download the example executable, stt, and the shared library that contains the framework code, both part of the native_client archive.

wget --quiet
unxz native_client.tflite.Linux.tar.xz
tar -xf native_client.tflite.Linux.tar

Next, we need to fetch a model. For this example I’ve chosen the English large vocabulary model, but there are over 80 different versions available for many languages from Coqui’s model zoo. Note that this is the recognition model, not the language model. Language models are used to post-process the results of the neural network, and are optional. To keep things simple, in this example we’re just using the raw recognition model output, but there are lots of options to improve the quality for a particular application if you investigate things like language models and hotwords.

wget --quiet

To demonstrate how the speech to text tool works, we need some WAV files to try it out on. Luckily Coqui provide some examples, together with transcripts of the expected output.

wget --quiet
tar -xzf audio-1.1.0.tar.gz

The stt file is a command line tool that lets you run speech to text translation using Coqui’s framework. It has a lot of options you can explore, but the simplest way to use it is to provide a recognition model and then point it at a WAV file. After some version logging you should see the predicted transcript of the speech in the audio file as the final line.

./stt --model ./model.tflite --audio ./audio/4507-16021-0012.wav

You should see output that looks something like this:

TensorFlow: v2.3.0-14-g4bdd3955115
 Coqui STT: v1.1.0-0-gf3605e23
why should one halt on the way

If you’ve made it this far, congratulations, you’ve just run your own speech to text engine locally on your machine! Coqui have put a lot of work into their open source speech framework, so if you want to dive in deeper I highly recommend browsing their documentation and code. Everything’s open source, even the training, so if you need something special for your own application, like a different language or specialized vocabulary, you have the chance to do it yourself.

Update – I’ve also just added a new Colab notebook showing how to build a program using STT with just a makefile and the binary releases, without requiring Bazel.

Why are ML Compilers so Hard?

Image from Wikimedia

Even before the first version of TensorFlow was released, the XLA project was integrated as a “domain-specific compiler” for its machine learning graphs. Since then there have been a lot of other compilers aimed at ML problems, like TVM, MLIR, EON, and GLOW. They have all been very successful in different areas, but they’re still not the primary way for most users to run machine learning models. In this post I want to talk about some of the challenges that face ML compiler writers, and some approaches I think may help in the future.

I’m not a compiler expert at all, but I have been working on infrastructure to run deep learning models across different platforms for the last ten years, so most of my observations come from being a user rather than an implementer of compiler technology. I’m also writing from my own personal experience; these are all just my own opinions rather than anything endorsed by TensorFlow or Google, so take them for what they’re worth. I’m outside my area of expertise, and I’d love to hear what other people think about the areas I’m highlighting; I bet I’ll learn something interesting and new from any responses!

What is an ML compiler?

So far I’ve been talking about ML compilers like they’re a well-defined class of technology, but the term actually covers a very wide range of tools. Intuitively there’s an analogy with procedural programming, where interpreted languages tend to be used for experimentation, prototyping and research because of their flexibility, but compilation is deployed when performance is a higher priority. The computation graphs used in deep learning look a lot like programs, and the major frameworks like PyTorch and TensorFlow use interpretation to execute them, so using compilation to improve performance seems like a logical next step. All of the ML compilers take a model defined in Python in one of the major training frameworks and attempt to convert it into a different form that produces the same results. The output form is usually chosen to have some advantages in performance or portability over the original version.

For example, XLA takes the layers defined at the TensorFlow Graph level, and converts them initially into what’s known as an HLO (high-level operation) representation. This term slightly confused me initially, since from my perspective as a TensorFlow engineer the HLO operations were *lower* level than Graph operations, as individual TF ops are often broken into multiple HLOs, but it comes from the fact that these are at the highest level of XLA’s interface. These HLOs are designed to be implementable efficiently on GPUs, CPUs, and TPUs, with the hope that supporting a smaller number of mathematical operations will allow many more TF ops to be implemented by composition, increasing portability.
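To make the composition idea concrete, here’s an illustrative sketch of how a single framework-level op (softmax) decomposes into a handful of primitive operations. This is plain Python, not XLA’s actual HLO representation; it just shows the flavor of lowering one large op into a small vocabulary of simple ones.

```python
# Illustrative only: expressing a single high-level op (softmax) as a
# composition of primitive ops, in the spirit of how a compiler like
# XLA lowers framework ops into a small HLO vocabulary.
import math

def softmax_as_primitives(xs):
    m = max(xs)                            # reduce-max (for stability)
    shifted = [x - m for x in xs]          # elementwise subtract
    exps = [math.exp(x) for x in shifted]  # elementwise exp
    total = sum(exps)                      # reduce-sum
    return [e / total for e in exps]       # elementwise divide

probs = softmax_as_primitives([1.0, 2.0, 3.0])
print(sum(probs))  # the primitives compose to a valid distribution, ~1.0
```

Five primitives cover this one op, and the same primitives can be reused to build many other framework-level layers, which is exactly the portability bet that HLO makes.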

The definition I’ve given above may seem overly broad, and it probably is, but that’s also one of the challenges in this area. When someone offers an ML compiler as a potential solution, most engineers’ experience with procedural compilers makes them receptive to the idea, because traditional compilers have become such a vital tool for all of us. Using the term compiler is popular because of this halo effect, but it doesn’t say much about the scope of the tool. As another example, TensorFlow Lite has a different representation for its computation graph from TensorFlow, and a tool is required to convert TF models to TFLite. The current version of this tool uses MLIR, a compiler technology, to perform the conversion, but the resulting graph representation is interpreted, so it seems strange to call it a compiler. One of the common assumptions when the term compiler is used is that it will generate code, but many of them actually generate intermediate representations which are then handed over to other tools to perform further steps. This makes it necessary to dig a bit deeper into the actual capabilities of anything labeled as an ML compiler to better understand what problems it can solve.

Why is ML compilation not like procedural compilation?

I’ve mentioned that the analogy between procedural and ML compilers is imperfect, but why? The biggest reason is that deep learning computation graphs are made up of a large, arbitrary, and ever-growing set of layers. Last time I checked, stock TensorFlow has over 2,000 different operations. PyTorch is not as expansive, but it also relies more on Python within its models to implement functionality, as does JAX. This is in contrast to modern procedural languages which tend to have a comparatively small number of core primitives (keywords, built-in types and operations) and a lot of libraries implemented using those primitives to provide most of their useful functionality. It’s also a Big Deal to add primitives to a procedural language, with a long period of debate, prototyping, and consensus building before new ones are accepted.

The reason for this difference is that deep learning model authors are tightly constrained by latency and memory performance. There seems to be a practical time limit of around a week to train a model to completion; if it takes any longer, the author’s unlikely to be able to iterate with enough prototypes to produce a successful result in the end. Because training a model means running it many millions of times, it makes sense for authors trying new techniques to invest time optimizing the code they’re using. Because many people use Nvidia GPUs this often means writing a function in CUDA to implement any new computation that’s needed, rather than leaving it in the Python that might be more natural for experimenting. The consequence of this is that even operators like activation functions that involve trivial math that could easily be represented as a simple NumPy operation get implemented as separate layers, and so show up as such in the compute graph. Even worse from the framework implementer’s perspective is that authors may actually fuse together multiple conceptually-separate operations into a single layer purely for performance reasons. You can see this in the plethora of different LSTM layers available in TensorFlow; they exist because manual fusing helped speed up training for particular models.

What this means in practice is that compute graphs are made up of layers chosen for model authors’ convenience, defined only by their already-optimized C++/CUDA implementations and whatever unit tests exist, which are often written to check against Python libraries like NumPy, adding another layer of indirection for anyone trying to understand what they do. The layers are also likely to be comparatively large in scope, rather than being constrained to more primitive operations. All of this makes the job of converting them into another representation very hard. Even worse, researchers are constantly inventing new layers!

Most ML compilers “solve” this problem by only supporting a subset of the layers; even TensorFlow Lite and XLA only support some operations. In my experience this comes as an unpleasant surprise to many users hoping to convert their model from the training environment to run on another platform. Most authors aren’t even particularly aware of which ops they’re using, since they’re likely to be working through a higher-level interface like Keras, so figuring out how to change a model definition to fit within any constraints can be a frustrating and confusing process.

I believe this to be the single biggest problem facing ML compilers. The only way we can hope to provide a good experience to everyday users is by changing the training environment so that models are automagically expressed in a manageable representation from the start. The current situation asks compiler authors to turn a hamburger back into a cow; it’s simply not feasible. The challenge is that adding such constraints makes it harder to experiment with new approaches, since any additional layers would need to be represented in a form other than Python, C++, or CUDA, which are the preferred languages of researchers. Either compiler writers will have to keep chasing all the new layer implementations constantly being produced by researchers, or we’ll have to persuade researchers to write them in a more portable form.

Why are there so many layers?

So far I’ve focused on “classical” deep learning operations, but one of the reasons there are so many layers is that compute graphs also include a lot of computation that can’t easily be expressed in a mathematical form. Layers like convolution, fully-connected, or activations can be written in math notation and implemented using a comparatively small number of primitives, and since they take up the majority of the compute time they’re often chosen as the first targets by compiler writers. Unfortunately there are many other layers that don’t fit as neatly into a mathematical form, where the only practical definition is a procedural function written in something like C++. A favorite example of mine is the non-max suppression layer used to prune the soup of bounding boxes produced by networks doing image localization. This algorithm is hard to describe except as a series of sorts, loops, and conditionals, and it’s difficult to see how it could be represented in anything less general than LLVM’s IR.
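To make this concrete, here’s a minimal greedy NMS sketch in NumPy (an illustrative version I wrote for this post, not any framework’s actual kernel). Notice how much of it is sorting, looping, and branching rather than tensor math:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over [x1, y1, x2, y2] boxes: repeatedly keep the
    highest-scoring box and discard candidates that overlap it."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # candidates, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the chosen box with each remaining candidate.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop candidates that overlap the chosen box too much.
        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))  # keeps the two non-overlapping boxes
```

The data-dependent loop is the problem: how many iterations run depends on the box contents, which is exactly the kind of control flow that a graph of fixed-shape tensor ops struggles to express.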

There are a lot of operations that generate features or perform other pre-processing, or do post-processing like scoring or beam search. These are tempting to exclude from any compiler solution because they often occur before or after the body of the model where the bulk of the computation happens, and so aren’t a priority for optimization, but they do sometimes land at performance-critical points, so I think we need a solution for them too.

What about fallbacks?

One answer I’ve heard to this problem is that it’s always possible to fall back to the original CPU C++ implementation for the long tail of layers that don’t fit a specialized representation. In my opinion this removes a lot of the advantages of using a compiler. It’s no longer possible to perform fusing or other IR optimizations across the barrier formed by a non-IR layer, and the model itself is no longer portable across different platforms. You might think you’d at least keep portability across platforms that support C++, but as I mentioned earlier, most layer implementations were created by research scientists and only exist in an optimized form. This means the code is likely to rely on libraries like Eigen and use functions from the training framework itself, so porting a single layer often means porting most of the training framework and its dependencies too. This is possible (we use this approach for the Flex delegate in TensorFlow Lite, and PyTorch Mobile does something similar), but it’s a lot of work even for comparatively mainstream platforms like Android and iOS, and it doesn’t work at all for anything non-Posix-like, such as most embedded devices. It also takes up a lot of binary space, since server code is written to very different constraints than other platforms. Another problem is that even if the libraries relied upon for optimization are portable to other platforms, it’s unlikely they’ll offer the same performance they do in the original environment. Performance doesn’t tend to be portable.

What about compiling Python?

A lot of frameworks rely on Python glue code to help implement models. This is great for model authors because they can use a familiar and flexible language, but it makes porting to other environments very tough. Mobile platforms don’t support Python, for example, and neither do GPUs. The need for GPU support tends to push authors to re-implement any Python code that becomes a performance bottleneck, but that still leaves a lot of places where Python is useful during training. The problem from a compiler’s perspective is that parts of the definition of the computation needed to run the model are now held in Python code rather than in the regular compute graph of layers, making those parts inaccessible.

A solution to this has been to compile regular Python code, as shown by TensorFlow’s tf.function. This is very helpful in some cases, but there are plenty of times when the Python code is actually relying on other libraries, often only available as C or C++ implementations. For example, a lot of audio models will do things like create spectrograms using a specialized library, which ML compilers don’t have visibility into, or the ability to translate into another representation.
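As an illustration of the kind of computation that hides inside those libraries, here’s a minimal magnitude-spectrogram sketch in plain NumPy (a deliberately simplified stand-in for what an optimized C/C++ audio library would actually do):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Slice the signal into overlapping frames, apply a Hann window,
    and take the FFT magnitude of each frame. Real pipelines do this in
    optimized native libraries that an ML compiler can't see into."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

# A 440Hz tone sampled at 16kHz.
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)  # (num_frames, frame_len // 2 + 1)
```

Even this toy version mixes windowing, strided slicing, and an FFT; the production equivalent adds vendor-optimized FFT libraries and framing edge cases, none of which is visible to a compiler that only sees the layer graph.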

How can we make progress?

I hope this post doesn’t sound too much like an Airing of Grievances (though it is December 23rd as I write this); I’m honestly a big fan of all the compiler work that’s been happening in the ML world and I want it to continue growing. With that in mind, how can we move forward? In my mind, there are two possible futures: one where the ML ecosystem looks like Matlab, and another where it looks like LLVM.

If you’re not familiar with it, Matlab is a tool used by researchers for a lot of mathematical prototyping and exploration. There are tools to compile the resulting Matlab projects into standalone executable code, but because it’s so hard to do so completely and optimally, a common workflow is for researchers to hand their projects over to engineers, who write a C or C++ implementation by hand. In this flavor of the future we’d do something similar for ML: researchers would use frameworks focused on easy experimentation and flexibility, and converting a model for production deployment would involve manually engineering it into a more portable and optimized representation once it was finalized. As an engineer likely to be responsible for that conversion process, I’m hoping we can do better, for my own sake. It would also remove a lot of scope for collaboration between model authors and deploying engineers, which is a shame because iterative feedback loops can make both models and software implementations better. Unfortunately I think the Matlab future is the most likely one unless we can change direction.

The key LLVM innovation was an intermediate representation that was rich enough to express a large set of languages, but small enough to be supported on a lot of different platforms without requiring exorbitant engineering resources for each. An IR like this for machine learning is the dream of most of the hardware vendors I know, since it would allow them to support a lot of different models and frameworks with comparatively little effort, at least compared to the current status quo. There are existing attempts that have had some success, such as ONNX or MLIR’s TOSA dialect, but they’ve all struggled either with coverage or with growing the number of layers they support to a level that makes them tough for hardware teams to implement. This is why I come back to the need to change the training environment itself. We somehow need to come up with tools that give researchers the flexibility to experiment and the performance they need to complete training in a reasonable time, but that also result in a representation that can be understood separately from the training environment. Researchers are in high demand in the ML world, so whatever we build would have to be something they want to use if it’s going to get adoption. These three requirements might end up being impossible to meet, but I’m hopeful that the ML compiler community can keep innovating and come up with something that does meet them!

The Death of Feature Engineering is Greatly Exaggerated

Image by OfSmallThings

One of the most exciting aspects of deep learning’s emergence in computer vision a few years ago was that it didn’t appear to require any feature engineering, unlike previous techniques like histograms-of-gradients or Haar cascades. As neural networks ate up other fields like NLP and speech, the hope was that feature engineering would become unnecessary for those domains too. At first I fully bought into this idea, and saw any remaining manually-engineered feature pipelines as legacy code that would soon be subsumed by more advanced models.

Over the last few years of working with product teams to deploy models in production I’ve realized I was wrong. I’m not the first person to raise this idea, but I have some thoughts I haven’t seen widely discussed on exactly why feature engineering isn’t going away anytime soon. One of them is that even the original vision case actually relies on a *lot* of feature engineering; we just haven’t been paying attention. Here’s a quote from a typical blog post discussing image models:

“a deep learning system is a fully trainable system beginning from *raw* input, for example image pixels”

(Emphasis added by me)

I spent over a decade working on graphics and image processing, so the implicit assumption that the kinds of images we train networks on are at all “raw” always bothered me a bit. I was used to starting with truly RAW image files to preserve as much information from the original scene as possible. These formats reflect the output of the camera’s CCD hardware pretty closely. This means that the values for each pixel correspond roughly linearly to the number of photons hitting the detector at that point, and the position of each measured value is actually in a Bayer pattern, rather than a simple grid of pixels.

Image from Wikipedia

So, even to get to the kind of two-dimensional array of evenly spaced pixels with RGB values that ML practitioners expect an image to contain, we have to execute some kind of algorithm to resample the original values. There are deep learning approaches to this problem, but it’s clear that this is an important preprocessing step, and one that I’d argue should count as feature engineering. There’s a whole world of other transformations like this that have to be performed before we get what we’d normally recognize as an image. These include some very complex and somewhat arbitrary transformations like white balancing, which everyday camera users might only become aware of during an apocalypse. There are also steps like gamma correction, which take the high dynamic ranges possible for the CCD output values (which reflect photon counts) and scale them into numbers which more closely resemble the human eye’s response curve. Put very simplistically, we can see small differences in dark areas with much more sensitivity than differences in bright parts, so to represent images in an eight-bit byte it’s convenient to apply a gamma curve so that more of the codes are used for darker values.
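The gamma encoding step can be sketched in a few lines of NumPy (a simplified illustration; real pipelines like sRGB use a piecewise curve rather than a pure power law):

```python
import numpy as np

def gamma_encode(linear, gamma=2.2):
    """Map linear sensor values in [0, 1] to 8-bit codes with a gamma
    curve, so more of the 256 available codes are spent on the darker
    values our eyes are most sensitive to."""
    return np.round(255.0 * np.power(linear, 1.0 / gamma)).astype(np.uint8)

# A linearly dim value of 0.01 lands at code 31, not code 3 as a linear
# mapping would give it, preserving far more shadow detail.
print(gamma_encode(np.array([0.01, 0.1, 0.5, 1.0])))
```

The point for ML practitioners is that every “image” tensor a model sees has already been through a nonlinear transform like this, chosen for the human visual system rather than for the network.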

I don’t want this to turn into an image processing tutorial, but I hope that these examples illustrate that there’s a lot of engineering happening before ML models get an image. I’ve come to think of these steps as feature engineering for the human visual system, and see deep learning as piggy-backing on all this work without realizing it. It makes intuitive sense to me that models benefit from the kinds of transformations that help us recognize objects in the world too. My instinct is that gamma correction makes it a lot easier to spot things in natural scenes, because you’d hope that the differences between two materials would remain roughly constant regardless of lighting conditions, and scaling the values keeps the offsets between the colors from varying as widely as they would with the raw measurements. I can easily believe that neural networks benefit from this property just like we do.

If you accept that there is a lot of hidden feature engineering happening behind the scenes even for the classic vision models, what does this mean for other applications of deep networks? My experience has been that it’s important to think explicitly about feature engineering when designing models, and if you believe your inputs are raw, it’s worth doing a deep dive to understand what’s really happening before you get your data. For example, I’ve been working with a team that’s using accelerometer and gyroscope data to interpret gestures. They were getting good results in their application, but thanks to supply-chain problems they had to change the IMU they were using. It turned out that the original part included sensor fusion to produce estimates of the device’s absolute orientation, and that’s what they were feeding into the network. Other parts had different fusion algorithms which didn’t work as well, and even trying software fusion wasn’t effective. Some problems included significant lag in responding to movement, and biases that sent the orientation way off over time. We switched the model to using the unfused accelerometer and gyroscope values, and were able to get back a lot of the accuracy we’d lost.

In this case deep learning did manage to eat that part of the feature engineering pipeline, but because we didn’t have a good understanding of what was happening to our input data before we started, we ended up spending extra time dealing with problems that could have been handled more easily in the design and prototyping phase. Also, I don’t have deep knowledge of accelerometer hardware, but I wouldn’t be at all surprised if the “raw” values we’re now using have actually been through some significant processing themselves.

Another area where feature engineering has surprised me with its usefulness is labeling and debugging data problems. When I was working on building a more reliable magic wand gesture model, I was getting very frustrated with my inability to tell whether the training data I was capturing from people was good enough. Just staring at six curves of the accelerometer and gyroscope X, Y, Z values over time wasn’t enough for me to tell if somebody had actually performed the expected gesture or not. I thought about trying to record video of the contributors, but that seemed like a lot to ask. Instead, I put some work into reconstructing the absolute position and movement from the “raw” values. This effectively became an extremely poor man’s version of sensor fusion, but focused on the needs of this particular application. Not only was I able to visualize the data to check its quality, I started feeding the rendered results into the model itself, improving the accuracy. It also had the side benefit that I could display an intuitive visualization of the gesture as seen by the model back to the user, so that they could understand why it failed to recognize some attempts and learn to adapt their movements to be clearer from the model’s perspective!

From Colab notebook
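That kind of position reconstruction can be sketched as naive double integration (a deliberately crude illustration of the idea, assuming gravity has already been subtracted from the accelerometer readings, which real sensor fusion handles far more carefully):

```python
import numpy as np

def integrate_position(accel, dt=0.01):
    """Naive dead-reckoning: integrate acceleration (N x 3, in m/s^2)
    to velocity, then velocity to position. It drifts badly over long
    captures, but is plenty for visualizing a one-second gesture."""
    velocity = np.cumsum(accel * dt, axis=0)
    position = np.cumsum(velocity * dt, axis=0)
    return position

# Constant 1 m/s^2 acceleration along x for one second should move the
# device roughly 0.5 * a * t^2 = 0.5 meters.
accel = np.zeros((100, 3))
accel[:, 0] = 1.0
pos = integrate_position(accel)
print(pos[-1, 0])  # close to 0.5
```

Even a reconstruction this crude turned out to be enough to eyeball whether a captured gesture looked like the intended motion.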

I don’t want to minimize deep learning’s achievements in reducing the toil involved in building feature pipelines, I’m still constantly amazed at how effective they are. I would like to see more emphasis put on feature engineering in research and teaching though, since it’s still an important issue that practitioners have to wrestle with to successfully deploy ML applications. I’m hoping this post will at least spark some curiosity about where your data has really been before you get it!

One weird trick to shrink convolutional networks for TinyML

A colleague recently asked for more details on an approach I recommended, but which she hadn’t seen any documentation for. I realized that it was something I’d learned from talking to model builders at Google, and I wasn’t sure there was anything written up, so in the spirit of leaving a trail of breadcrumbs for anyone coming after, I thought I should put it into a quick blog post.

The summary is that if you have MaxPool or AveragePool after a convolutional layer in a network, and you’re targeting a resource-constrained system like a microcontroller, you should try removing them entirely and replacing them with a stride in the convolution instead. This has two main benefits, but to explain it’s easiest to diagram out the network before and after.

In the typical setup, shown on the left, a convolutional layer is followed by a pooling operation. This has been common since at least AlexNet, and is still found in many modern networks. The setup I often find useful is shown on the right. I’m using an example input size of 224 wide by 224 high for this diagram, but the discussion holds true for any dimensions.

The first thing to notice is that in the standard configuration, there’s a 224x224x8 activation buffer written out to memory after the convolution layer. This is by far the biggest chunk of memory required in this part of the graph, taking over 400KB, even with eight-bit values. All ML frameworks I’m aware of will require this buffer to be instantiated and filled before the next operation can be invoked. In theory it might be possible to do tiled execution, in the way that’s common for image processing frameworks, but the added complexity hasn’t made it a priority so far. If you’re running on an embedded system, 400KB is a lot of RAM, especially since it’s only being used for temporary values. That makes it a tempting target for size optimization.

My second observation is that we’re only using 25% of those values, assuming MaxPool is doing a typical 2x reduction, taking the largest value out of 4 in a 2×2 window. From experience, these values are often very similar, so while doing the pooling does help overall accuracy a bit, taking any of those four values at random isn’t much worse. In essence, this is what removing the pooling and increasing the stride for convolution does.

Stride is an argument that controls the step size as a convolution filter is slid across the input. By default, many networks have windows that are offset from each other by one pixel horizontally, and one pixel vertically. This means (ignoring padding, which is a whole different discussion) the output is the same size as the input, but typically with more channels (eight in the diagram above). Instead of setting the stride to this default of 1 horizontally, 1 vertically, you can set it to 2,2. This means that each window is offset by two pixels vertically and horizontally from its neighbor. This results in an output array that is half the width and height of the input, and so has a quarter of the number of elements. In essence, we’re picking one of the four values that would have been chosen by the pooling operation, but without the comparison or averaging that is used in the standard configuration.
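You can verify this arithmetic with a toy convolution in NumPy: the stride-2 result is exactly every other row and column of the stride-1 result, so 75% of the windows are simply never computed (an illustrative single-channel sketch, not production code):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Minimal single-channel valid convolution, just to show the
    stride arithmetic; real code would use an optimized framework kernel."""
    kh, kw = kernel.shape
    h = (image.shape[0] - kh) // stride + 1
    w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            patch = image[y * stride:y * stride + kh,
                          x * stride:x * stride + kw]
            out[y, x] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

full = conv2d(image, kernel, stride=1)     # shape (6, 6)
strided = conv2d(image, kernel, stride=2)  # shape (3, 3): quarter the output
# The strided output is the even-offset subset of the stride-1 windows.
assert np.allclose(strided, full[::2, ::2])
```

The quarter-sized output is what shrinks the activation buffer, and the skipped windows are what saves the compute.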

This means that the output of the convolution layer uses much less memory, resulting in a smaller arena for TFL Micro, but it also reduces the computation by 75%, since only a quarter of the convolution windows are calculated. It does result in some accuracy loss, which you can verify during training, but since it reduces resource usage so dramatically you may even be able to increase other parameters like the input size or number of channels and gain some of that accuracy back. If you do find yourself struggling for arena size, I highly recommend giving this approach a try; it’s been very helpful for a lot of our models. If you’re not sure whether your model has the convolution/pooling pattern, or want to better understand the sizes of your activation buffers and how they influence the arena you’ll need, I recommend the Netron visualizer, which can take TensorFlow Lite model files.