Why is it so difficult to retrain neural networks and get the same results?

Photo by Ian Sane

Last week I had a question from a colleague about reproducibility in TensorFlow, specifically in the 1.14 era. He wanted to be able to run the same training code multiple times and get exactly the same results, which on the surface doesn’t seem like an unreasonable expectation. Machine learning training is fundamentally a series of arithmetic operations applied repeatedly, so what makes getting the same results every time so hard? I had the same question when we first started TensorFlow, and I was lucky enough to learn some of the answers from the numerical programming experts on the team, so I want to share a bit of what I discovered.

There are good guides to achieving reproducibility out there, but they don’t usually explain why all the steps involved are necessary, or why training becomes so slow when you do apply them. One reason reproducibility is so hard is that every single calculation in the process has the potential to change the end results, which means every step is a potential weak link. This means you have to worry about everything from random seeds (which are actually fairly easy to make reproducible, but can be hard to locate in a big code base) to code in external libraries. CuDNN doesn’t guarantee determinism by default, for example, and neither do some operations like reductions in TensorFlow’s CPU code.
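To make the seed-setting step concrete, here’s a minimal sketch of the kind of boilerplate those guides ask for, assuming a TensorFlow 2.x environment (the exact flags have moved around between versions, and the 1.14 era used different APIs such as tf.set_random_seed):

import os
import random

# Environment flags are safest to set before TensorFlow is imported.
os.environ["PYTHONHASHSEED"] = "42"
# Ask TensorFlow (and CuDNN underneath it) for deterministic kernels where they
# exist; this flag applies to TF 2.1+, and newer releases also offer
# tf.config.experimental.enable_op_determinism().
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import numpy as np
import tensorflow as tf

SEED = 42

# Seed every random number generator the training code might touch.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)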

It was the code execution part that confused me the most. I could understand the need to set random seeds correctly, but why would any numerical library with exactly the same inputs sometimes produce different outputs? It seemed like there must be some horrible design flaw to allow that to happen! Thankfully my teammates helped me wrap my head around what was going on.

The key thing I was missing was timing. To get the best performance, numerical code needs to run on multiple cores, whether on the CPU or the GPU. The important part to understand is that how long each core takes to complete is not deterministic. Lots of external factors, from the presence of data in the cache to interruptions from multi-tasking, can affect the timing. This means that the order of operations can change. Your high school math class might have taught you that x + y + z will produce the same result as z + y + x, but in the imperfect world of floating point numbers that ain’t necessarily so. To illustrate this, I’ve created a short example program in the Godbolt Compiler Explorer.

#include <stdio.h>

float add(float a, float b) {
    return a + b;
}

int main(void) {
    float x = 0.00000005f;  // 5e-8, small enough to vanish when added to 1.0f on its own
    float y = 0.00000005f;
    float z = 1.0f;

    float result0 = x + y + z;  // grouped as (x + y) + z
    float result1 = z + x + y;  // grouped as (z + x) + y

    printf("%.9g\n", result0);
    printf("%.9g\n", result1);

    float result2 = add(add(x, y), z);  // same grouping as result0
    float result3 = add(x, add(y, z));  // adds each tiny value to 1.0f separately

    printf("%.9g\n", result2);
    printf("%.9g\n", result3);
}

You might not guess exactly what the result will be, but most people’s expectations are that result0, result1, result2, and result3 should all be the same, since the only difference is in the order of the additions. If you run the program in Godbolt though, you’ll see the following output:

ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 0
1.00000012
1
1.00000012
1

So what’s going on? The short answer is that floating point numbers only have so much precision, and if you try to add a very small number to a much larger one, there’s a limit below which the small number’s contribution is lost to rounding and the addition has no effect. In this example, I’ve set things up so that 0.00000005 is below that limit for 1.0, so if you do the 1.0 + 0.00000005 operation first, the result is unchanged because the 0.00000005 is rounded away. However, if you do 0.00000005 + 0.00000005 first, this produces an intermediate sum of 0.0000001, which is large enough to survive the rounding when added to 1.0, and so it does affect the result.

This might seem like an artificial example, but most of the compute-intensive operations inside neural networks boil down to a long series of multiply-adds. Convolutions and fully-connected calculations will be split across multiple cores either on a GPU or CPU. The intermediate results from each core are often accumulated together, either in memory or registers. As the timing of the results being returned from each core varies, the order of addition will change, just as in the code above. Neural networks may require trillions of operations to be executed in any given training run, so it’s almost certain that these kinds of edge cases will occur and the results will vary.
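You can see this accumulation-order effect for yourself without any parallel hardware. Here’s a small Python sketch (the array size and values are arbitrary) that adds up the same float32 numbers in forward and reverse order; the two totals usually differ in the last digits, and a multi-core reduction introduces the same kind of difference, except that there the order changes unpredictably from run to run:

import numpy as np

rng = np.random.default_rng(0)
values = rng.random(100_000).astype(np.float32)

def sequential_sum(xs):
    # Accumulate in 32-bit precision, one element at a time.
    total = np.float32(0.0)
    for x in xs:
        total = np.float32(total + x)
    return total

print(sequential_sum(values))        # forward order
print(sequential_sum(values[::-1]))  # reverse order, usually a slightly different total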

So far I’ve been assuming we’re running on a single machine with exactly the same hardware for each run, but as you can imagine not only will the timing vary between platforms, but even things like the cache sizes may affect the dimensions of the tiles used to optimize matrix multiplies across cores, which adds a lot more opportunities for differences. You might be tempted to increase the size of the floating point representation from 32 bits to 64 to address the issue, but this only reduces the probability of non-determinism rather than eliminating it entirely, and it will also have a big impact on performance.

Properly addressing these problems requires the writers of base math functions like GEMM to add extra constraints to their code, which can be a lot of work to get right, since testing anything timing-related and probabilistic is so complex. Because every operation you use in your network needs to be modified and checked, the work usually requires a dedicated team to fix and verify all the barriers to reproducibility. These requirements also conflict with many of the optimizations that reduce latency on multi-core systems, so the deterministic versions of the functions are slower. In the guide mentioned earlier, the total time for a training run went from 28 to 105 minutes once all the modifications needed to ensure reproducibility were made.

I’m writing this post because I find it fascinating that our systems have become so complex that an algorithm like matrix multiply that can be described in a few lines of pseudo-code can have implementations that produce such surprising and unexpected results. I’m barely scratching the surface of all the complexities of floating point math here, but many other causes of similar issues are emerging as we do more and more calculations on massive distributed networks of computers. Honestly, that’s one reason I enjoy working on embedded systems so much these days; I have a lot more confidence in my mental model of how those chips work. It does feel like it’s no longer possible for any one person to have a full and deep understanding of the whole stack on any modern platform, even just for software. I love digging into the causes of weird and surprising properties like non-determinism in training because it helps me understand more about what I don’t know!

Machines of Loving Understanding

Sparko, the world’s first electrical dog, as he looked on arrival at the engineer’s club, New York City, on his way to the World’s fair, where he will be an attraction at the Westinghouse Building. He walks, barks, wags his tail and sits up to beg. With Sparko, is Elektro, Westinghouse mechanical man. Both are creations of J.M. Barnett, Westinghouse engineer of Mansfield, OH.
I like to think
(it has to be!)
of a cybernetic ecology
where we are free of our labors
and joined back to nature,
returned to our mammal
brothers and sisters,
and all watched over
by machines of loving grace.

Brautigan’s poem inspires and terrifies me at the same time. It’s a reminder of how creepy a world full of devices that blur the line between life and objects could be, but there’s also something appealing about connecting more closely to the things we build. Far more insightful people than me have explored these issues, from Mary Shelley to Philip K. Dick, but the aspect that has fascinated me most is how computers understand us.

We live in a world where our machines are wonderful at showing us near-photorealistic scenes in real time, and can even talk to us in convincing voices. Up until recently though, they’ve not been able to make sense of images or audio that are given to them as inputs. We’ve been able to synthesize voices for decades, but speech recognition has only really started working well in the last few years. Computers have been like Crocodile Sales Reps, with enormous mouths and tiny ears, great at talking but terrible at listening. That means they can be immensely frustrating to deal with, since they seem to have no ability to do what we mean. Instead we have to spend a lot of time painstakingly communicating our needs in a form that makes sense to them, even if it is unnatural for us.

This process started with toggling switches on a control panel, moved to punch cards, teletypes, CRT terminals, mouse-driven GUIs, swiping on a touch screen and most recently basic voice interfaces. Each of these steps was a big advance, but compared to how we communicate with other people, or even our pets, they still feel clumsy.

What has me most excited about all the recent advances in machine learning is that they’re starting to give computers the ability to understand us in a much deeper and more natural way. The video above shows just a small robot that I built for a few dollars as a technology demonstration, but because it followed my face around, I ended up becoming quite attached. It was exhibiting behavior that we associate with people or animals who like us. Even though I knew it was just code underneath, it was hard not to see it as a character instead of an object. It became a Pencil named Steve.

Face following is a comparatively simple ability, but it’s enough to build more useful objects like a fan that always points at you, or a laptop screen that locks when nobody is around. As one of the comments says, the fan is a bit creepy. I believe this is because it’s an object that’s exhibiting attributes that we associate with living beings, entering the Uncanny Valley. The googly eyes probably didn’t help. The confounding part is that the property that makes it most creepy is the same thing that makes it helpful.

We’re going to see more and more of these capabilities making it into everyday objects (at least if I have anything to do with it), so I expect the creepiness and usefulness will keep growing in parallel too. Imagine a robot vacuum that you can talk to naturally and it will respond, that you can shoo away or control with hand gestures, and that follows you around while you’re eating to pick up crumbs you drop. Doesn’t that sound a lot like a dog? All of these behaviors help it do its job better, since it’s understanding us in a more natural way instead of expecting us to learn its language, but they also make it feel a lot more alive. Increased understanding goes hand in hand with creepiness.

This already leads to a lot of unresolved tension in our relationships with voice assistants. 79% of Americans believe they spy on their conversations, but 42% of us still use them! I think this belief is so widespread because it’s hard not to treat something that you can talk to as a pseudo-person, which also makes it hard not to expect that it is listening all the time, even if it doesn’t respond. That feeling will only increase once they take account of glances, gestures, even your mood.

If I’m right, we’re going to be entering a new age of creepy but useful objects that seem somewhat alive. What should we do about it? The first part might seem obvious but it rarely happens with any new technology – have a public debate about what we as a community think should be acceptable, right now, while it’s in the early stages of deployment, not after it’s a done deal. I’m a big fan of representative democracy, with all its flaws, so let’s encourage people outside the tech world to help draw the lines of what’s ethical and reasonable. I’m trying to take a step in that direction by putting our products up on maker sites so that anyone can try them out for themselves, but I’d love to figure out how to do something like a roadshow demonstrating what’s coming in the near future. I guess this blog post is an attempt at that too. If there’s going to be a tradeoff between creepiness and utility, let’s give ordinary people the power to determine what the balance should be.

The second important realization is that the tech industry is beyond the point where we can just say “trust us” and reasonably expect people to believe our claims. We’ve lost too much credibility. Moving forward we need to build our products in a way that third parties can check what we’re doing in a meaningful way. As I wrote a few months ago, I know Google isn’t spying on your conversations, but I can’t prove it. I’ve proposed the ML sensors approach we use as a response to that problem, so that someone like Underwriters Laboratories can test our privacy claims on the behalf of consumers.

That’s just one idea though, anything that lets people outside the manufacturers verify their claims would be welcome. To go along with that, I’d love to see enforceable laws that require creators of devices to label what information they collect, like an ingredients list for food items. What I don’t know is how to prevent these from turning into meaningless Prop 65-style privacy policies, where every company basically says they can do anything with any information and share it with anyone they choose. Even though GDPR is flawed in many ways, it did force a lot of companies to be more careful with how they handle personal data, internally and externally. I’d love smarter people than me to figure out how we make privacy claims minimal and enforceable, but I believe the foundation has to be designing systems that can be audited.

Whether this new world we’re moving towards becomes more of a utopia or dystopia depends on the choices we make now. Computers that understand us better can help when an elderly person falls, but the exact same technology could send police after a homeless person bedding down in a doorway. Our ML models can spot drowning victims, or criminalize wild swimming. Ubiquitous, cheap, battery-powered voice recognition could make devices accessible to many more people, or supercharge bugging by repressive regimes. Technologists alone shouldn’t have the power to decide the direction we head in, we need everyone’s help to chart the right path, and make the hard tradeoffs. We need to make sure that the machines that watch over us truly will be loving.

Launching Useful Sensors!

Person Sensor from Useful Sensors

For years I’ve wanted to be able to look at a light switch, say “On”, and have the lights switch on. This kind of interface sounds simple, so why doesn’t it exist? It turns out building one requires solving a lot of tough research and engineering challenges, and even more daunting, coming up with a whole new business model for smart devices. Despite these obstacles, I’m so excited about the possibilities that I’ve founded a new startup, Useful Sensors, together with a wonderful team and great investors!

We’ve been operating in stealth for the last few months, but now we’ve launched our first product, a Person Sensor that is available on SparkFun for $10. This is a small hardware module that detects nearby faces and returns information about how many there are and where they are relative to the device, and it can also perform facial recognition. It connects over I2C, so it’s easy to integrate with almost any microcontroller, but it is also designed with privacy built in. If you’ve followed my work on ML sensors, this is our attempt to come up with the first commercial application of this approach to system design.

We’ve started to see interest from some TV and laptop companies, especially around our upcoming hand gesture recognition, so if you are in the consumer electronics world, or have other applications in mind, I would love to hear from you!

Now that we’re public, you can expect to see more posts here in the future going into more detail, but for now I’ll leave you with some articles from a couple of journalists who have a lot of experience in this area. I thought they both asked very sharp and insightful questions about what we’re doing, and they had me thinking hard, so I hope you enjoy their perspectives too:

Pete Warden’s Startup puts AI in the Sensor, by Sally Ward-Foxton

Former Googler creates TinyML Startup, by Stacey Higginbotham

Try OpenAI’s Amazing Whisper Speech Recognition in a Free Web App


You may have noticed that I’m obsessed with open source speech recognition, so I was very excited when OpenAI released a new voice model. I’m even more excited now that I’ve had a chance to play with it: the accuracy is extremely impressive, especially as it’s multi-language. OpenAI have done a great job packaging it, and you can install it straight from pip if you’re a Linux shell user, but I wanted to find a way to let anybody try it for themselves from a web browser, even if they’re not developers. I love Google’s Colab service, and luckily somebody had already created a notebook showing the basics of using the Whisper model. I added some documentation and test files, and now you can give it a try for yourself by opening this Colab link: https://colab.research.google.com/github/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb

Follow the directions, and after a minute or so you’ll see a button at the bottom of the page where you can record your own audio, and see a transcript. Give it a try, I think you’ll be impressed too!
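If you’d rather skip the notebook and run it locally, the underlying Python API only takes a few lines. This is a minimal sketch assuming you’ve installed the openai-whisper package from PyPI and have a recording of your own to point it at (audio.wav is just a placeholder name):

# pip install openai-whisper
import whisper

model = whisper.load_model("base")      # smaller models download and run faster
result = model.transcribe("audio.wav")  # path to your own recording
print(result["text"])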

How to build Raspberry Pi Pico programs with no software installation

youtube.com/watch?v=bDDgihwDhRE

I love using the Raspberry Pi Pico board to teach students about microcontrollers, especially as it only costs $4 and is currently in stock despite the supply chain crisis. I have run into some problems though, because building a program requires installing software. This might not sound like a big barrier, but when people arrive with a mix of Windows, MacOS, ChromeOS, and Linux laptops, often with different versions or architectures within each group, trying to guide them through the process can easily take a whole lesson, and require individual attention from me to debug each particular problem while the other students get bored. It’s also frustrating for the class to have to wait an hour before they get to do anything cool, I much prefer giving them a success as early as possible.

To solve this problem, I’ve actually turned to what might seem an unlikely tool, Google’s Colab service. If you have run across this, you probably associate it with Python notebooks, because that’s its primary use case. I’ve found it to be useful for a lot more though, because it effectively gives you a free, temporary Linux virtual machine that you control through the browser. Instead of running Python commands, you can run Linux shell commands by putting an exclamation point at the start. There are some restrictions, such as needing a Google account to sign in, and the file system disappearing after you leave the page or are idle too long, but I’ve found it great for documenting all sorts of installation and build processes in an accessible way.

I’m getting ready to teach EE292D (TinyML) at Stanford again this year, but we’re switching over to the Pico boards instead of the Arduino Nano 33 BLE Sense boards that we have used before, because the latter have been out of stock for quite a while. As part of that, I wanted to have an easy getting started guide for the students to help them build and run their first program. I put together a Colab notebook that follows the steps in the great Pico Getting Started Guide, installing the SDK and examples, and then building blink and running it on a board. To give some extra guidance, I also recorded the YouTube video above. Please excuse the hair and occasional distraction, I did it in a hurry.
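To give a flavor of what the notebook does, its cells are mostly shell commands prefixed with an exclamation point, roughly along these lines (the package and repository names follow the official guide, but treat this as an approximation of the notebook rather than a copy of it):

# Install the ARM cross-compilation toolchain on the Colab VM.
!sudo apt-get install -y cmake gcc-arm-none-eabi libnewlib-arm-none-eabi build-essential

# Fetch the SDK and the example projects.
!git clone https://github.com/raspberrypi/pico-sdk.git
!cd pico-sdk && git submodule update --init
!git clone https://github.com/raspberrypi/pico-examples.git

# Point CMake at the SDK and build the blink example.
%env PICO_SDK_PATH=/content/pico-sdk
!mkdir -p pico-examples/build && cd pico-examples/build && cmake .. && make -j2 blink

The build produces a blink.uf2 file that can be downloaded from Colab’s file browser and copied onto the Pico while it’s in bootloader mode.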

It’s not a complete solution: students will still need to install OS-specific software to access debug logs, it requires a Google login that isn’t available for kids under 13, and the vanishing file system will cause frustration if they don’t remember to save their code. Still, I do like it as a simple way to give them a win in just a few minutes. There’s nothing like seeing that first LED blink on a new board; I still get a kick out of it myself!

Why isn’t there more training on the edge?

One of the most frequent questions I get asked from people exploring machine learning beyond cloud and desktop machines is “What about training?”. If you look around at the popular frameworks and use cases of edge ML, most of them seem focused on inference. It isn’t obvious why this is the case though, so I decided to collect my notes in a post here, so I can have something to refer to when this comes up (and organize my own thoughts too!).

No Labels

I think the biggest reason that there’s not more training on the edge is that most models need to be trained through supervised learning, that is, each sample used for training needs a ground-truth label. If you’re running on a phone or embedded system, there’s not likely to be an easy way to attach a label to incoming data, other than running an existing model and guessing. You need a person to look at an image, or listen to an audio recording, to identify what the prediction should be, before you can use it in training. You also generally need a fairly large number of labels per class for training to be effective.

This may change as semi-supervised or unsupervised approaches continue to improve, but right now supervised training is the most reliable method to get a model for most applications. I have seen some interesting hacks to guess labels on the edge though, that might fall into the semi-supervised category. For example, you can use temporal consistency on video frames to infer mistakes. In concrete terms, if your camera is identifying a fruit as a lemon for ten frames, then for one frame it’s a lime, and then it’s back to a lemon, you can guess that the lime prediction was an error (assuming the frame rate is high enough, fruits aren’t flying by at supersonic speed, and so forth). Another clever use of time was in an audio wake word application, where if there was a near-detection (the model gave a score just below the threshold) followed soon after by an actual detection (over the threshold) then the system would guess that the person had actually said the wake word the first time, and the model had failed to recognize it. This hack relies on the human behavior of trying again if it didn’t work initially.
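To make the temporal-consistency idea concrete, here’s a minimal sketch (the labels and the one-frame window are made up for illustration) that flags single-frame predictions disagreeing with both of their neighbors and relabels them with the surrounding class, turning an unlabeled video stream into weak training labels:

def smooth_labels(frame_predictions):
    # Replace any single-frame blip that disagrees with both of its neighbors.
    smoothed = list(frame_predictions)
    for i in range(1, len(smoothed) - 1):
        before, current, after = smoothed[i - 1], smoothed[i], smoothed[i + 1]
        if current != before and before == after:
            smoothed[i] = before  # treat the blip as a likely model mistake
    return smoothed

predictions = ["lemon"] * 10 + ["lime"] + ["lemon"] * 10
print(smooth_labels(predictions))  # the lone "lime" gets relabeled as "lemon"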

Quality Control

Getting models to work well within an application is very hard even when you are training a single version and putting it through testing before release. If an edge model is retrained in the field, it becomes very hard to predict the bounds of its behavior. Since this will affect how well your application works, training on the fly makes ensuring it behaves correctly much harder. This isn’t a complete blocker; there are clearly some products (like GBoard) that do manage to handle this problem, but they generally build some kind of guard rails around what the model can produce. For example, something that predicts words or sentences might have a block-list of banned words (such as hateful or obscene phrases) that will be scrubbed from a model’s output even if edge training causes it to start producing them.
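That kind of guard rail can be as simple as a filter over the model’s candidate outputs, something like this sketch (the word list and policy here are obviously placeholders):

BLOCKED_WORDS = {"exampleslur", "exampleobscenity"}  # placeholder entries

def scrub_predictions(candidate_words):
    # Drop anything on the block-list before it reaches the user,
    # no matter what the (possibly retrained) model has suggested.
    return [word for word in candidate_words if word.lower() not in BLOCKED_WORDS]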

This kind of post-processing is often needed even when using pre-trained models on the edge (I could probably fill a decent book with all the hacks that usually go into filtering and interpreting the raw model output to make it useful) but the presence of a model that can change in unpredictable ways makes it even harder. Nobody wants to be responsible for building another Tay.

Embeddings

When you set up a new phone, you’ll probably speak the assistant wake word a few times to help the system learn your voice. In my experience this doesn’t involve retraining in the sense of full back propagation. Instead, the “Is this audio a wake word?” model produces an embedding vector as its output, and that is then used in a nearest-neighbor lookup to compare to the embeddings from the first few utterances you spoke during setup. This is a surprisingly common technique across a lot of domains, because it is comparatively simple to implement, only requires storing a few values, and works robustly.

I’ve found embeddings to be a fantastic general purpose tool for customizing models on the edge, without requiring the full machinery of back propagation. The gradient descent approach used by modern deep learning needs high precision (usually floating point) weight arrays, along with specialized operators to run the back-prop version of each layer. The weights need to be stored between updates, and since they’re higher precision than is required for inference they take up more space than an inference-optimized model, and you’ll usually want to keep a copy of the original weights around in case you need to reset the model too. By contrast, you can often extract an embedding from an existing model just by reading the activation layer before the final fully-connected op that does the classification. Even though specialized loss functions exist to try to encourage embeddings with desired properties, like good spatial separation, I’ve found that training with a regular softmax and lopping off the last layer often works just as well in practice.
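Here’s a rough sketch of both halves of that approach using Keras and NumPy: reusing everything up to the final fully-connected layer of a classifier as an embedding extractor, and then comparing new embeddings against a few enrollment vectors with cosine similarity. The model architecture, input shape, and threshold are all stand-ins for illustration:

import numpy as np
import tensorflow as tf

# Stand-in for an existing trained wake-word classifier (shapes are made up).
inputs = tf.keras.Input(shape=(49, 40))                      # e.g. spectrogram frames
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(64, activation="relu", name="embedding")(x)
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)  # wake word / not
model = tf.keras.Model(inputs, outputs)

# Chop off the classification head and keep the penultimate activations.
embedding_model = tf.keras.Model(model.input, model.get_layer("embedding").output)

def embed(clip):
    return embedding_model(clip[np.newaxis, ...]).numpy()[0]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Enrollment: store embeddings from the few utterances spoken during setup.
enrollment_clips = [np.random.rand(49, 40).astype(np.float32) for _ in range(3)]
enrolled = [embed(clip) for clip in enrollment_clips]

def is_enrolled_speaker(new_clip, threshold=0.8):  # threshold is a guess
    query = embed(new_clip)
    return max(cosine_similarity(query, e) for e in enrolled) > threshold

None of this needs back propagation or high-precision weights on the device, just a forward pass and a handful of stored vectors.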

Exceptions

Of course, there are examples of very successful products that do use training on the edge. I already mentioned GBoard, which is the poster child for federated learning, but another domain where I’ve seen a lot of use is in anomaly detection, particularly around predictive maintenance for machinery. This is an application where it seems like every machine behaves differently, so learning “normal behavior” (by observing the first 24 hours of vibrations and labeling those as normal) allows the adaptation needed to spot deviations from those initial patterns. I’ve also seen interesting research projects around security and communications protocols that are looking at using training on the edge to be more robust to changing environmental conditions.

YAGNI

The short answer to the question is that if you’re getting started with ML on the edge, training models there is unlikely to be useful in the short or medium term. Technology keeps changing, and I am seeing some interesting applications starting to emerge, but I feel like a lot of the interest in edge training comes from how prominent training is in the cloud world. I often joke that all ML architecture researchers could go on strike indefinitely, and ML engineers would still have decades of productive work ahead of us. There are many better-motivated problems around deployment on the edge than bringing training up to server capabilities, and I bet your product will hit some of those long before training becomes an issue.

Don’t worry if college isn’t the happiest time of your life

I was digging through paperwork today to help complete my PhD admission process, and I stopped short when I saw the academic transcript from my undergraduate years. I was a terrible student! I got 0% on one course, awful scores on many others, and had to do a lot of retakes. It brought back memories of how I was feeling when I was 18. I was a mess. I was totally unprepared for life away from home, suffered from so much anxiety I wasn’t even able to name it, like a fish having no concept of water, and jumped right into a terrible relationship the first chance I got. I was working almost full-time at Kwik-Save to pay rent, and didn’t even have a computer at home I could use.

It wasn’t supposed to be like this. Since I was a kid I’d dreamed of escaping my tiny village for university. I wasn’t sure what it was going to be like, since nobody in my immediate family had completed college, but I had vague ideas from being a townie in Cambridge and shows like Brideshead Revisited that I would be transported to a magical world of privilege and punts. Most of all, I looked forward to meeting people I could talk to about important things, people who might listen to me. I also knew I was “good at computers”, and was looking forward to diving deeper into programming. The reality of being just one of hundreds of students, with little ability to connect with any of the staff, and discovering that most of what I’d learned about coding wasn’t a big help with Computer Science, left me more than deflated. I was still the same screwed-up person, there had been no magical transformation. A lot of the time I resented time spent on my classes, and felt I wasn’t learning what I needed for my true vocation, being a programmer, and you can see that in my grades.

Looking back, this wasn’t Manchester’s fault. It’s a fantastic university with a CS program that’s world-class, and despite my best efforts I did learn a lot from great teachers like Steve Furber and Carole Goble. Their lessons turned out to be far more useful in my career than I ever would have expected. The staff were kind and helpful on the few occasions I did reach out, but I had such a lack of confidence I seldom dared try. I managed to scrape through, with a lot of retakes, and helped by the fact that the overall marks were heavily weighted to the final year. It left me feeling cheated though, somehow. I’d always heard the cliche that these would be the happiest days of my life. If I was miserable and it was all downhill from here, what was even the point of carrying on? It didn’t help that the first technical job I could find out of college paid less than I’d made stacking shelves at a supermarket.

The good news is that life has pretty much continuously got better from that point on. Years of therapy, a career path that has gifted me some fascinating and impactful problems to work on, along with enough money to be comfortable, and building the kind of community I’d dreamed of at college through writing and the internet, have left me feeling happier than I’ve ever been. I feel very lucky to have found a way to engage with so many smart people, and find my voice, through open source coding, blogging, research papers, and teaching.

I didn’t write this post to humblebrag about how wonderful my life is, I still have plenty of challenges and disappointments. I just want to provide a datapoint for anyone else who is struggling or has struggled at college. It doesn’t have to define you for the rest of your life. If the experience isn’t what you’d expected and hoped, there’s no need for despair. Life can get so much better.

Why cameras are soon going to be everywhere

i-FlatCam demo

I’ve finally had the chance to play Cyberpunk 2077 over the last few weekends, and it’s an amazing feat of graphics programming, especially with ray-tracing enabled. I’ve had fun, but I have been struck by how the cyberpunk vision of the future is rooted in the ’80s. Even though William Gibson was incredibly prescient in so many ways, the actual future we’re living in is increasingly diverging from the one he painted. One of the differences that struck me most was how cameras exist in the game’s world, compared to what I can see happening as an engineer working in the imaging field. They still primarily show up as security cameras, brick-sized devices that are attached to walls and would look familiar to someone from forty years ago.

A lot of people still share the expectation that cameras will be obvious, standalone components of a system. Even though phone cameras and webcams are smaller, they still have a noticeable physical presence, and often come with indicators like red lights that show when they’re recording. What is clear to me from my work is that these assumptions aren’t going to hold much longer. Soon imaging sensors will be so small, cheap, and energy efficient that they’ll be added to many more devices in our daily lives, and because they’re so tiny they won’t even be noticeable!

What am I basing this prediction on? The clearest indicator for me is that you can already buy devices like the Himax HM01B0 with an imaging sensor that’s less than 2mm by 2mm in size, low single-digit dollars in cost, and 2 milliwatts or less in power usage. Even more striking are the cameras that are emerging from research labs. At the TinyML Summit the University of Michigan presented a complete system that fits on the tip of a finger.

The video at the top of this post shows another project from Rice that is able to perform state of the art eye tracking at 253 FPS, using 23 milliwatts, in a lens-less system that lets it achieve a much smaller size than other solutions.

Hopefully this makes it clear that there’s a growing supply of these kinds of devices. Why do I think there will be enough demand to include them in appliances and other items around homes and offices? This is tricky to show as clearly because the applications aren’t deployed yet, but cameras can replace or augment lots of existing sensors, and enable entirely new features. Here are a few examples:

Each one of these may or may not turn out to be useful, but there are so many potential applications (including many I’m sure nobody’s thought of yet) that I can’t imagine some of them won’t take off once the technology is widely available. Many scientists believe that the Cambrian Explosion occurred because the evolution of eyes opened up so many new possibilities and functions. I’m hoping we’ll see a similarly massive expansion in the technology space once all our devices can truly see and understand.

So, that’s why I believe we’re going to end up in a world where we’re each surrounded by thousands of cameras. What does that mean? As an engineer I’m excited, because we have the chance to make a positive impact on people’s lives. As a human being, I’m terrified because the potential for harm is so large, through unwanted tracking, recording of private moments, and the sharing of massive amounts of data with technology suppliers.

If you accept my argument about why we’re headed for a world full of tiny cameras watching us at all times, then I think we all have a responsibility to plan ahead now to mitigate the potential harms. This is my motivation behind the ML sensors proposal to wall off sensitive data in a secure component, but I see this as just a starting point in the discussion. Do we need regulation? Even if we don’t get it in the US, will Europe take the lead? Should there be voluntary standards around labeling products that contain cameras or microphones? I don’t know the answers, but I don’t think we have the luxury of waiting too long to figure them out, because if we don’t make any changes we’ll be deploying billions of poorly-secured devices into everybody’s lives as a giant uncontrolled experiment.

What are ML Sensors?

I’ve spent a lot of time at conferences talking about all the wonderful things that are now possible using machine learning on embedded devices, but as Stacey Higginbotham pointed out at this year’s TinyML Summit, despite all the potential there haven’t been many shipping applications. My experience is that companies like Google with big ML teams have been able to deploy products successfully, but it has been a lot harder for teams in other industries. For example, when I visited an appliance manufacturer in China and pitched them on the glorious future they could access thanks to TensorFlow Lite Micro, they told me they didn’t even know how to open a Python notebook. Instead, they asked if I could just give them a voice interface, or something that told them when somebody sat down in front of their TV.

I realized that the software framework model that had worked so well for TensorFlow Lite adoption on phone apps didn’t translate to other domains. Many of the firms that could most benefit from ML just don’t have the software engineering resources to integrate a library, even with great tools like Edge Impulse that make the process much easier. As I thought about how to make on-device ML more widely accessible, I realized that providing ML capabilities as small, cheap hardware modules might be a good solution. This was the seed of the idea that became the ML sensors proposal, now available as a paper on arXiv.

The basic idea is that system builders are already able to integrate components like sensors into their products, so why not expose some higher-level information about the environment in the same form factor? For example, a person sensor might have a pin that goes high when someone is present, and then an I2C interface to supply more detailed information about their pose, activities, and identity. That would allow a TV manufacturer to wake up the display when someone sat down on the couch, and maybe even customize the UI to show recently-watched shows based on which family members are present. All of the complexity of the ML implementation would be taken care of by the sensor manufacturer and hidden inside the hardware module, which would have a microcontroller and a camera under the hood. The OEM would just need to respond to the actionable signals from the component.
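To make that concrete, here’s a hypothetical sketch of what the OEM side could look like from a Linux board using the smbus2 library. The I2C address, register, and packet layout are all invented for illustration, not the protocol of any real device:

import struct
import time

from smbus2 import SMBus

SENSOR_ADDRESS = 0x40   # hypothetical I2C address
RESULT_REGISTER = 0x00  # hypothetical register holding the latest result

def read_person_info(bus):
    # Hypothetical 3-byte packet: face count, then x/y of the largest face (0-255).
    raw = bus.read_i2c_block_data(SENSOR_ADDRESS, RESULT_REGISTER, 3)
    return struct.unpack("BBB", bytes(raw))

with SMBus(1) as bus:   # I2C bus 1 on many single-board computers
    while True:
        num_faces, face_x, face_y = read_person_info(bus)
        if num_faces > 0:
            print(f"{num_faces} face(s), largest at ({face_x}, {face_y})")
        time.sleep(0.2)

The point is that the OEM’s code never sees pixels, only the handful of bytes the module chooses to expose.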

At the same time as I was thinking about how to get ML into more people’s hands, I was also worried about the potential for abuse that the proliferation of cameras and microphones in everyday devices enables. I realized that the modular approach might have some advantages there too. I think of personal information as toxic waste, because any leaks can be highly damaging to individuals, and to the companies involved, and there are few data sources that have as much potential for harm as video and audio streams from within people’s homes. I believe it’s our responsibility as developers to engineer systems that are as leak-resistant as possible, especially if we’re dealing with cameras and microphones. I’d already explored the idea of using Arm’s TrustZone to keep sensitive data contained, but by moving the ML processing off the central microcontroller and onto a peripheral, we have the chance to design something that has a very small attack surface (because there’s no memory shared with the rest of the system) and can be audited by a third party to ensure any claims of safety are credible.

The ML sensor paper brings all these ideas together into a proposal for designing systems that are easier to build, and safer by default. I’m hoping this will start a discussion about how to improve usability and privacy in everyday systems, and lead to more practical prototyping and experimentation to answer a lot of the questions it raises. I’d love to get feedback on this proposal, especially from product designers who might want to try integrating these into their systems. I’m looking forward to seeing more work in this area, I know I’m going to be busy trying to get some examples up and running, so watch this space!

Caches Considered Harmful for Machine Learning

Photo by the National Park Service

I’ve been working on a new research paper, and a friend gave me the feedback that he was confused by the statement “memory accesses can be accurately predicted at the compilation stage” for machine learning workloads, and that this made them a poor fit for conventional processor architectures with predictive caches. I realized that this was received wisdom among the ML engineers I know, but I wasn’t aware of any papers that discuss this point. I put out a request for help on Twitter, but while there were a lot of interesting resources in the answers, I still couldn’t find any papers that focused on what feels like an important property for machine learning systems. With that in mind, I wanted to at least describe the issue as best as I can in this blog post, so there’s a trail of breadcrumbs for anyone else interested in how system designs might need to change to accommodate ML.

So, what am I talking about? Modern processors are almost universally constructed around multiple layers of predictive memory caches. These are small areas of memory that can be accessed much faster than the main system memory, and are needed because processors can execute instructions far more quickly than they can fetch values from the DRAM used for main memory. In fact, you can usually run hundreds of instructions in the time it takes to bring one byte from the DRAM. This means if processors all executed directly from system memory, they would run hundreds of times more slowly than they could. For decades, the solution to the mismatch has been predictive caches. It’s possible to build memory that’s much faster to access than DRAM, but for power and area reasons it’s not easy to fit large amounts onto a chip. In modern systems you might have gigabytes of DRAM, but only single-digit megabytes of total cache. There are some great papers like What Every Programmer Should Know About Memory that go into a lot more detail about the overall approach, but the most important thing to know is that memory stored in these caches can be accessed in a handful of cycles, instead of hundreds, so moving data into these areas is crucial if you want to run your programs faster.

How do we decide what data should be placed in these caches though? This requires us to predict what memory locations will be accessed hundreds or thousands of cycles in the future, and with general programs full of data-dependent branches, comparisons, and complex address calculations this isn’t possible to do with complete accuracy. Instead, the caches use heuristics (like we just accessed address N, so also fetch N+1, N+2, etc in case we’re iterating through an array) to guess how to populate these small, fast areas of memory. The cost of making a mistake is still hundreds of cycles, but as long as most of the accesses are predicted correctly this works pretty well in practice. However, there is still an underlying tension between the model used for programming languages, where memory is treated as a uniform arena, and the reality of hardware where data lives in multiple different places with very different characteristics. I never thought I’d be linking to a Hacker News comment (the community has enough toxic members that I haven’t read it for years), but this post I was pointed to actually does a good job of talking about all the complexities that are introduced to make processors appear as if they’re working with uniform memory.

Why does all this matter for machine learning? The fundamental problem predictive caches are trying to solve is “What data needs to be prefetched into fast memory from DRAM?”. For most computing workloads, like rendering HTML pages or dealing with network traffic, the answer to this question is highly dependent on the input data to the algorithm. The code is full of lines like ‘if (a[i] == 10) { value = b[j] } else { value = b[k]; }‘, so predicting which addresses will be accessed requires advance knowledge of i, a[i], j, and k, at least. As more of these data-dependent conditionals accumulate, the permutations of possible access addresses become unmanageable, and effectively it’s impossible to predict addresses for code like this without accessing the data itself. Since the problem we’re trying to solve is that we can’t access the underlying data efficiently without a cache, we end up having to rely on heuristics instead.

Machine learning operations are very different. The layers that take up the majority of the time for most models tend to be based on operations like convolutions, which can be expressed as matrix multiplies. Crucially, the memory access patterns don’t depend on the input data. There’s no ‘if (a[i] == 10) {...‘ code in the inner loops of these kernels, they’re much simpler. The sizes of the inputs are also usually known ahead of time. These properties mean that we know exactly what data we need in fast memory for the entire execution of the layer ahead of time, with no dependencies on the values in that data. Each layer can often take hundreds of thousands of arithmetic operations to compute, and each value fetched has the potential to be used in multiple instructions, so making good use of the small amounts of fast memory available is crucial to reducing latency. What quickly becomes frustrating to any programmer trying to optimize these algorithms on conventional processors is that it’s very hard to transfer our complete knowledge of future access patterns into compiled code.

The caches rely almost entirely on the heuristics that were designed for conventional usage patterns, so we essentially have to reverse-engineer those heuristics to persuade them to load the data we know we’ll need. There are some tools to help like prefetching instructions and branch hints, but optimizing inner loops often feels like a struggle against a system that thinks it’s being helpful, but is actually getting in the way. Optimized matrix multiplication implementations usually require us to gather the needed data into tiles that are a good fit for the fast memory available, so we can do as much as possible with the values while they’re quickly accessible. Getting these tiles the right size and ensuring they’re populated with the correct data before it’s needed requires in-depth knowledge of the capacity, access latencies, and predictive algorithms of all levels of the cache hierarchy on a particular processor. An implementation that works well on one chip may produce drastically poorer performance on another in the same family if any of those characteristics change.
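As a sketch of the kind of access pattern I’m talking about, here’s a toy tiled matrix multiply in NumPy. Every address it touches is a pure function of the loop indices and the (known) matrix shapes, never of the values stored in the matrices, which is exactly the knowledge a predictive cache can’t see:

import numpy as np

def tiled_matmul(a, b, tile=64):
    # a is (M, K), b is (K, N); the tile size would be chosen to fit fast memory.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # These slices depend only on loop indices and shapes, so every
                # one of them could be prefetched long before it's needed.
                a_tile = a[i0:i0 + tile, k0:k0 + tile]
                b_tile = b[k0:k0 + tile, j0:j0 + tile]
                c[i0:i0 + tile, j0:j0 + tile] += a_tile @ b_tile
    return c

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
print(np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3))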

It would make more sense to expose the small, fast memories to the programmer directly, instead of relying on opaque heuristics to populate them. They could be made available as separate address spaces that can be explicitly preloaded ahead of time with data before it’s needed. We know what address ranges we’ll want to have and when, so give us a way to use this knowledge to provide perfect predictions to fill those areas of memory. Some embedded chips do offer this capability, known variously as tightly-coupled memory, or XY memory, and we do use this to improve performance for TensorFlow Lite Micro on platforms that support it.

There are lots of challenges to making this available more widely though. Modern desktop and mobile apps don’t have the luxury of targeting a single hardware platform, and are expected to be able to run across a wide variety of different chips within the same processor family. It would be very difficult to write efficient code that works for all those combinations of cache size, speed, and prefetch heuristics. Software libraries from the processor manufacturers themselves (like CuDNN or Intel’s MKL) are usually the best answer right now, since they are written by engineers with detailed knowledge of the hardware systems and will be updated to handle new releases. These still have to work around the underlying challenges of a programming model that tries to hide the cache hierarchy though, and every engineer I’ve talked to who has worked on these inner loops wishes they had a better way to take advantage of their knowledge of memory access patterns.

This is also the kind of radical workload difference that has inspired a lot of new kinds of NPU hardware aimed specifically at deep learning. From my perspective, these have also been hard to work with, because while their programming models may work better for core operations like convolutions, models also require layers like non-max suppression that are only efficiently written as procedural code with data-dependent branches. Without the ability to run this kind of general purpose code, accelerators lose many of their advantages because they have to keep passing off work to the main CPU, with a high latency cost (partly because this kind of handover usually involves flushing all caches to keep different memory areas in sync).

I don’t know what the ultimate solution will look like, but I’d imagine it will either involve system programmers being able to populate parts of caches using explicit prefetching, maybe even just supplying a set of address ranges as requirements and relying on the processor to sort it out, or something more extreme. One possible idea is making matrix multiplies first-class instructions at the machine code level, and having each processor implement the optimal strategy in microcode, in a similar way to how floating-point operations have migrated from accelerators, to co-processors, and now to the core CPU. Whatever the future holds, I hope this post at least helps explain why conventional predictive caches are so unhelpful when trying to optimize machine learning operations.