Notes from a bank run

Photo by Gopal Vijayaraghavan

My startup, Useful Sensors, has all of its money in Silicon Valley Bank. There are a lot of things I worry about as a CEO, but assessing SVB’s creditworthiness was never one of them. It clearly should have been. I don’t have any grand theories about what’s happened over the last few days, but I wanted to share some of my experiences as someone directly affected.

To start with, Useful is not at risk of shutting down. The worst case scenario, as far as I can tell, is that on Monday we only have access to the insured $250k in our SVB account. That will be plenty for payroll on Wednesday, and from what I’ve seen there are enough liquid assets that sales of the government bonds that triggered the whole process should return a good portion of the remaining balance within a week or so. If I need to, I’ll dip into my personal savings to keep the lights on. I know this isn’t true for many other startups though, so if they don’t get full access to their funds there will be job losses and closures.

Although we’re not going to close, this is still very disruptive to our business. Making sure that our customers are delighted and finding more of them should be taking all of our attention. Instead I spent most of Thursday and Friday dealing with a rapidly changing set of recommendations from our investors and attempting to move money and open new accounts, and now I’m discovering the joys of the FDIC claims process. I’m trying to do all of this while flying to Germany for Embedded World to announce a new distribution deal with OKdo, and this blog post is actually being written from an airport lounge in Paris. Longer term, depending on the ultimate outcome, it may affect when we want to raise our next round. To be clear, we’re actually in a great position compared to many others (I’m an old geezer with savings), but long-term planning at a startup is hard enough without extra challenges like this thrown in.

It has been great having access to investors and founders who are able to help us in practical ways. We would never have been able to open a new account so quickly without introductions to helpful staff at another bank. I’ve been glued to the private founder chat rooms where people have shared their experiences with things like the FDIC claims process and pending wires. This kind of rapid communication and sharing of information is what makes Silicon Valley such a good place to build a startup, and I’m very grateful for everyone’s help.

Having said that, the Valley’s ability to spread information and recommendations quickly was one of the biggest causes of SVB’s demise. I’ve always been a bit of a rubbernecker at financial disasters, and I’d read enough books on the 2008 financial crisis to understand how bank runs happen. It was strange being in one myself though, because the logic of “everyone else is pulling their money so you’d better too before it’s all gone” is so powerful, even though I knew this mentality was a self-fulfilling prophecy. I planned on what I hoped was a moderate course of action, withdrawing some of our funds from SVB to move to another institution to gain some diversification, but by the time I was able to set up the transfer it was too late.

Technology companies aren’t the most sympathetic victims in the current climate, for many good reasons. I think this story covers the political dimensions of the bank failure well. The summary is that many taxpayers hate the idea of bailing out startups, especially ones with millions in their bank accounts. There are a lot of reasons why I think we’ll all benefit from not letting small businesses pay the price for bank executives messing up their risk management, but they’re all pretty wonky and will be a hard sell. However, the alternative is a world where only the top two or three banks in the US get most of the deposits, because they’re perceived as too big to fail. If no financial regulator spotted the dangers with SVB, how can you expect small business owners to vet banks themselves? We’ll all just end up going to Citibank or JPMorgan, which increases the overall systemic risk, as we saw in 2008.

Anyway, I just want to dedicate this to all of the founders having a tough weekend. Startups are all about dealing with risks, but this is a particularly frustrating problem to face because it’s so unnecessary. I hope at least we’ll learn more over the next few weeks about how executives and regulators let a US bank with $200 billion in assets get into such a sorry state.

Go see Proxistant Vision at SFMCD

When I think of a museum with “craft” in its name, I usually imagine an institution focused on the past. San Francisco’s Museum of Craft and Design is different. Their mission is to “bring you the work of the hand, mind and heart”, and Bull.Miletic’s Proxistant Vision exhibition is a wonderful example of how their open definition of craft helps them find and promote startling new kinds of art.

When I first walked into the gallery space I was underwhelmed. There were three rooms with projectors, but the footage they were showing was nearly monochrome and I didn’t feel much to connect with. I was intrigued by some of the rigs for the projectors though, with polyhedral mirrors and a cart that whirred strangely. I’m glad I had a little patience, because all of the works turned out to have their own life and animation beyond anything I’d seen before.

The embedded video tries to capture my experience of one of the rooms, Ferriscope. The artists describe it as a kinetic video installation, and at its heart is a mirror that can sweep the projector output in a full vertical circle, across two walls, the floor, and the ceiling, at a speed that can be dizzying. Instead of our view staying static as riders circle on a Ferris wheel, we stay still while what we see goes flying by. It’s very hard to do justice to the impact this has with still images or even video. The effect is confusing but mesmerizing, and it forced me to look in a different way, if that makes sense?

There are two other installations as part of the show, Venetie 11111100110 and Zoom Blue Dot. I won’t spoil the enjoyment by describing too much about their mechanics, but they both play with moving and fragmenting video using mirrors and robotics. They aren’t as immediately startling as Ferriscope, but they drew me in and forced me to look with a fresh eye at familiar scenes. To my mind that’s the best part of all these works: they shook me out of being a passive observer and consumer of images, and left me on unsteady ground with an uncertain viewpoint. You don’t get to stand stroking your chin in front of these installations; you have to engage with the work in a much more active way.

The exhibition is open until March 19th 2023, and I highly recommend you go visit. You won’t be disappointed, though you may be disoriented!

Online Gesture Sensor Demo using WASM

If you’ve heard me on any podcasts recently, you might remember I’ve been talking about a Gesture Sensor as the follow-up to our first Person Sensor module. One frustrating aspect of building hardware solutions is that it’s very tough to share prototypes with people, since you usually have to physically send them a device. To work around that problem, we’ve been experimenting with compiling the same C++ code we use on embedded systems to WASM, a web-friendly intermediate representation that runs in all modern browsers. By hooking up the webcam as the input, in place of an embedded camera module, and displaying the output dynamically on a web page, we can provide a decent approximation of how the final device will work. There are obviously some differences (the webcam produces higher-quality images than an embedded camera module, and the latency will vary), but it’s been a great tool for prototyping. I also hope it will help spark makers’ and manufacturers’ imaginations, so we’ve released it publicly at gesture.usefulsensors.com.

On that page you’ll find a quick tutorial, and then you’ll have the opportunity to practice the four gestures that are supported initially. This is not the final version of the models or the interface logic (you’ll see false positives that would be problematic in production, for example), but it should give you an idea of what we’re building. My goal is to replace common uses of a TV remote control with simple, intuitive gestures like palm-forward for pause, or finger to the lips for mute. I’d love to hear from you if you know of manufacturers who would like to integrate something like this, and we hope to have a hardware version available soon so you can try it in your own projects. If you are at CES this year, come visit me at LVCC IoT Pavilion Booth #10729, where my colleagues and I will be showing off some of our devices together with Teksun.

Short Links

Years ago I used to write regular “Five Short Links” posts but I gave up as my Twitter account became a better place to share updates, notes, and things I found interesting from around the internet. Now that Twitter is Nazi-positive I’m giving up on it as a platform, so I’m going to try going back to occasional summary posts here instead.

Person Sensor back in stock on SparkFun. Sorry for all the delays in getting our new sensors to everyone who wanted them, but we now have a new batch available at SparkFun, and we hope to stay ahead of demand in the future. I’ve also been expanding the Hackster project guides with new examples like face-following robot cars and auto-pausing TV remote controls.

Blecon. It can be a little hard to explain what Blecon does, but my best attempt is that it allows BLE sensors to connect to the cloud using people’s phones as relays, instead of requiring a fixed gateway to be installed. The idea is that in places like buildings where staff will be walking past rooms with sensors installed, special apps on their phones can automatically pick up and transmit recorded data. This becomes especially interesting in places like hotels, where management could be alerted to plumbing problems early, without having to invest in extra infrastructure. I like this because it gets us closer to the idea of “peel and stick” sensors, which I think will be crucial to widespread deployment.

Peekaboo. I’ve long been a fan of CMU’s work on IoT security and privacy labels, so it was great to see this exploration of a system that gives users more control over their own data.

32-bit RISC-V MCU for $0.10. It’s not as cheap as the Padauk three-cent MCU, but the fact that it’s 32-bit, with respectable amounts of flash, SRAM, and I/O, makes it a very interesting part. I bet it would be capable of running many of the Hackster projects for example, and since it supports I2C it should be able to talk to a Person Sensor. With processors this low cost, we’ll see a lot more hardware being replaced with software.

Hand Pose using TensorFlow JS. I love this online demo from MediaPipe, showing how well it’s now possible to track hands with deep learning approaches. Give the page permission to access your camera and then hold your hands up, and you should see rather accurate and detailed hand tracking!

Why is it so difficult to retrain neural networks and get the same results?

Photo by Ian Sane

Last week I had a question from a colleague about reproducibility in TensorFlow, specifically in the 1.14 era. He wanted to be able to run the same training code multiple times and get exactly the same results, which on the surface doesn’t seem like an unreasonable expectation. Machine learning training is fundamentally a series of arithmetic operations applied repeatedly, so what makes getting the same results every time so hard? I had the same question when we first started TensorFlow, and I was lucky enough to learn some of the answers from the numerical programming experts on the team, so I want to share a bit of what I discovered.

There are good guides to achieving reproducibility out there, but they don’t usually include explanations for why all the steps involved are necessary, or why training becomes so slow when you do apply them. One reason reproducibility is so hard is that every single calculation in the process has the potential to change the end results, which means every step is a potential weak link. This means you have to worry about everything from random seeds (which are actually fairly easy to make reproducible, but can be hard to locate in a big code base) to code in external libraries. CuDNN doesn’t guarantee determinism by default for example, nor do some operations like reductions in TensorFlow’s CPU code.
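For concreteness, here is a sketch of the usual switches involved, with the caveat that the exact knobs have moved around between releases: the 1.14-era workflow leaned on environment variables and external patches, while recent TensorFlow 2.x versions have built-in helpers. Treat this as a starting point rather than a recipe.

import os
import random

# Older builds (around the 1.14/2.0 era) used environment variables like this
# one to opt into deterministic GPU kernels; it has no effect on newer
# releases, and setting it before importing TensorFlow is the safest option.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import numpy as np
import tensorflow as tf

# Seed every random number generator you can find, including the ones hiding
# in Python and NumPy, not just TensorFlow's own.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)  # tf.set_random_seed(42) in the 1.x API

# TensorFlow 2.8 and later only: ask for deterministic implementations of ops
# (including the cuDNN-backed ones), at a cost in speed.
tf.config.experimental.enable_op_determinism()

Even with all of these flipped, anything that happens outside the framework, like the order your data pipeline delivers batches in, can still sneak non-determinism back in.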

It was the code execution part that confused me the most. I could understand the need to set random seeds correctly, but why would any numerical library with exactly the same inputs sometimes produce different outputs? It seemed like there must be some horrible design flaw to allow that to happen! Thankfully my teammates helped me wrap my head around what was going on.

The key thing I was missing was timing. To get the best performance, numerical code needs to run on multiple cores, whether on the CPU or the GPU. The important part to understand is that how long each core takes to complete its work is not deterministic. Lots of external factors, from the presence of data in the cache to interruptions from multi-tasking, can affect the timing. This means that the order of operations can change. Your high school math class might have taught you that x + y + z will produce the same result as z + y + x, but in the imperfect world of floating point numbers that ain’t necessarily so. To illustrate this, I’ve created a short example program in the Godbolt Compiler Explorer.

#include <stdio.h>

// A helper function so the grouping of each addition is explicit in the source.
float add(float a, float b) {
    return a + b;
}

int main(void) {
    // x and y are small enough that adding either one to 1.0f on its own
    // gets lost in float32 rounding, but their sum is large enough to survive.
    float x = 0.00000005f;
    float y = 0.00000005f;
    float z = 1.0f;

    // The same three values, added left to right in two different orders.
    // result0 sums the small values before reaching 1.0f, while result1 adds
    // them to 1.0f one at a time.
    float result0 = x + y + z;
    float result1 = z + x + y;

    printf("%.9g\n", result0);
    printf("%.9g\n", result1);

    // Explicit groupings via function calls: result2 sums the small values
    // first, result3 adds each of them to 1.0f individually.
    float result2 = add(add(x, y), z);
    float result3 = add(x, add(y, z));

    printf("%.9g\n", result2);
    printf("%.9g\n", result3);
}

You might not guess exactly what the result will be, but most people’s expectations are that result0, result1, result2, and result3 should all be the same, since the only difference is in the order of the additions. If you run the program in Godbolt though, you’ll see the following output:

ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 0
1.00000012
1
1.00000012
1

So what’s going on? The short answer is that floating point numbers only have so much precision, and if you add a very small number to a much larger one, there’s a threshold below which the small value gets lost in rounding and the addition has no effect. In this example, I’ve set things up so that 0.00000005 is below that threshold for 1.0, so if you do the 1.0 + 0.00000005 operation first, the true sum can’t be represented and the result rounds back to 1.0, as if nothing had been added. However, if you do 0.00000005 + 0.00000005 first, this produces an intermediate sum of 0.0000001, which is large enough to survive the rounding when added to 1.0, and so it does affect the result.

This might seem like an artificial example, but most of the compute-intensive operations inside neural networks boil down to a long series of multiply-adds. Convolutions and fully-connected calculations will be split across multiple cores either on a GPU or CPU. The intermediate results from each core are often accumulated together, either in memory or registers. As the timing of the results being returned from each core varies, the order of addition will change, just as in the code above. Neural networks may require trillions of operations to be executed in any given training run, so it’s almost certain that these kinds of edge cases will occur and the results will vary.
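If you want to see this effect with more realistic data than my contrived example, here is a quick experiment, sketched with NumPy purely as an illustration: summing the same million float32 values in a few different orders usually produces totals that disagree in their last few digits.

import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(1_000_000).astype(np.float32)

# The same numbers, accumulated in three different orders. With float32
# precision the totals typically differ in their final digits, which is all
# it takes for two training runs to start diverging.
print(values.sum())
print(values[::-1].sum())
print(np.sort(values).sum())

None of these answers is wrong; they are just different roundings of the same underlying sum, which is exactly what happens when cores return their partial results in a different order.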

So far I’ve been assuming we’re running on a single machine with exactly the same hardware for each run, but as you can imagine, not only will the timing vary between platforms, but even things like cache sizes may affect the dimensions of the tiles used to optimize matrix multiplies across cores, which adds a lot more opportunities for differences. You might be tempted to increase the size of the floating point representation from 32 bits to 64 to address the issue, but this only reduces the probability of non-determinism; it doesn’t eliminate it entirely, and it will also have a big impact on performance.

Properly addressing these problems requires writers of the base math functions like GEMM to add extra constraints to their code, which can be a lot of work to get right, since testing anything timing-related and probabilistic is so complex. Since all the operations that you use in your network need to be modified and checked, the work usually requires a dedicated team to fix and verify all the barriers to reproducibility. These requirements also conflict with many of the optimizations that reduce latency on multi-core systems, so the functions become slower. In the guide above, the total time for a training run went from 28 to 105 minutes once all the modifications needed to ensure reproducibility were made.

I’m writing this post because I find it fascinating that our systems have become so complex that an algorithm like matrix multiply, which can be described in a few lines of pseudo-code, can have implementations that produce such surprising and unexpected results. I’m barely scratching the surface of all the complexities of floating point math here, but many other causes of similar issues are emerging as we do more and more calculations on massive distributed networks of computers. Honestly that’s one reason I enjoy working on embedded systems so much these days: I have a lot more confidence in my mental model of how those chips work. It does feel like it’s no longer possible for any one person to have a full and deep understanding of the whole stack on any modern platform, even just for software. I love digging into the causes of weird and surprising properties like non-determinism in training because it helps me understand more about what I don’t know!

Machines of Loving Understanding

Sparko, the world’s first electrical dog, as he looked on arrival at the engineer’s club, New York City, on his way to the World’s fair, where he will be an attraction at the Westinghouse Building. He walks, barks, wags his tail and sits up to beg. With Sparko, is Elektro, Westinghouse mechanical man. Both are creations of J.M. Barnett, Westinghouse engineer of Mansfield, OH.
I like to think
(it has to be!)
of a cybernetic ecology
where we are free of our labors
and joined back to nature,
returned to our mammal
brothers and sisters,
and all watched over
by machines of loving grace.

Brautigan’s poem inspires and terrifies me at the same time. It’s a reminder of how creepy a world full of devices that blur the line between life and objects could be, but there’s also something appealing about connecting more closely to the things we build. Far more insightful people than me have explored these issues, from Mary Shelley to Philip K. Dick, but the aspect that has fascinated me most is how computers understand us.

We live in a world where our machines are wonderful at showing us near-photorealistic scenes in real time, and can even talk to us in convincing voices. Up until recently though, they’ve not been able to make sense of images or audio that are given to them as inputs. We’ve been able to synthesize voices for decades, but speech recognition has only really started working well in the last few years. Computers have been like Crocodile Sales Reps, with enormous mouths and tiny ears, great at talking but terrible at listening. That means they can be immensely frustrating to deal with, since they seem to have no ability to do what we mean. Instead we have to spend a lot of time painstakingly communicating our needs in a form that makes sense to them, even if it is unnatural for us.

This process started with toggling switches on a control panel, moved to punch cards, teletypes, CRT terminals, mouse-driven GUIs, swiping on a touch screen and most recently basic voice interfaces. Each of these steps was a big advance, but compared to how we communicate with other people, or even our pets, they still feel clumsy.

What has me most excited about all the recent advances in machine learning is that they’re starting to give computers the ability to understand us in a much deeper and more natural way. The video above shows just a small robot that I built for a few dollars as a technology demonstration, but because it followed my face around, I ended up becoming quite attached to it. It was exhibiting behavior that we associate with people or animals who like us. Even though I knew it was just code underneath, it was hard not to see it as a character instead of an object. It became a Pencil named Steve.

Face following is a comparatively simple ability, but it’s enough to build more useful objects like a fan that always points at you, or a laptop screen that locks when nobody is around. As one of the comments says, the fan is a bit creepy. I believe this is because it’s an object that’s exhibiting attributes that we associate with living beings, entering the Uncanny Valley. The googly eyes probably didn’t help. The confounding part is that the property that makes it most creepy is the same thing that makes it helpful.

We’re going to see more and more of these capabilities making it into everyday objects (at least if I have anything to do with it) so I expect the creepiness and usefulness will keep growing in parallel too. Imagine a robot vacuum that you can talk to naturally and it will respond, that you can shoo away or control with hand gestures, and that follows you around while you’re eating to pick up crumbs you drop. Doesn’t that sound a lot like a dog? All of these behaviors help it do its job better, it’s understanding us in a more natural way instead of expecting us to learn its language, but they also make it feel a lot more alive. Increased understanding goes hand in hand with creepiness.

This already leads to a lot of unresolved tension in our relationships with voice assistants. 79% of Americans believe they spy on their conversations, but 42% of us still use them! I think this belief is so widespread because it’s hard not to treat something that you can talk to as a pseudo-person, which also makes it hard not to expect that it is listening all the time, even if it doesn’t respond. That feeling will only increase once they take account of glances, gestures, even your mood.

If I’m right, we’re going to be entering a new age of creepy but useful objects that seem somewhat alive. What should we do about it? The first part might seem obvious, but it rarely happens with any new technology: have a public debate about what we as a community think should be acceptable, right now, while it’s in the early stages of deployment, not after it’s a done deal. I’m a big fan of representative democracy, with all its flaws, so let’s encourage people outside the tech world to help draw the lines of what’s ethical and reasonable. I’m trying to take a step in that direction by putting our products up on maker sites so that anyone can try them out for themselves, but I’d love to figure out how to do something like a roadshow demonstrating what’s coming in the near future. I guess this blog post is an attempt at that too. If there’s going to be a tradeoff between creepiness and utility, let’s give ordinary people the power to determine what the balance should be.

The second important realization is that the tech industry is beyond the point where we can just say “trust us” and reasonably expect people to believe our claims. We’ve lost too much credibility. Moving forward we need to build our products in a way that third parties can check what we’re doing in a meaningful way. As I wrote a few months ago, I know Google isn’t spying on your conversations, but I can’t prove it. I’ve proposed the ML sensors approach we use as a response to that problem, so that someone like Underwriters Laboratories can test our privacy claims on the behalf of consumers.

That’s just one idea though, anything that lets people outside the manufacturers verify their claims would be welcome. To go along with that, I’d love to see enforceable laws that require creators of devices to label what information they collect, like an ingredients list for food items. What I don’t know is how to prevent these from turning into meaningless Prop 65-style privacy policies, where every company basically says they can do anything with any information and share it with anyone they choose. Even though GDPR is flawed in many ways, it did force a lot of companies to be more careful with how they handle personal data, internally and externally. I’d love smarter people than me to figure out how we make privacy claims minimal and enforceable, but I believe the foundation has to be designing systems that can be audited.

Whether this new world we’re moving towards becomes more of a utopia or dystopia depends on the choices we make now. Computers that understand us better can help when an elderly person falls, but the exact same technology could send police after a homeless person bedding down in a doorway. Our ML models can spot drowning victims, or criminalize wild swimming. Ubiquitous, cheap, battery-powered voice recognition could make devices accessible to many more people, or supercharge bugging by repressive regimes. Technologists alone shouldn’t have the power to decide the direction we head in, we need everyone’s help to chart the right path, and make the hard tradeoffs. We need to make sure that the machines that watch over us truly will be loving.

Launching Useful Sensors!

Person Sensor from Useful Sensors

For years I’ve wanted to be able to look at a light switch, say “On”, and have the lights switch on. This kind of interface sounds simple, so why doesn’t it exist? It turns out building one requires solving a lot of tough research and engineering challenges, and even more daunting, coming up with a whole new business model for smart devices. Despite these obstacles, I’m so excited about the possibilities that I’ve founded a new startup, Useful Sensors, together with a wonderful team and great investors!

We’ve been operating in stealth for the last few months, but now we’ve launched our first product, a Person Sensor that is available on SparkFun for $10. This is a small hardware module that detects nearby faces and returns information about how many there are and where they are relative to the device, and it can also perform facial recognition. It connects over I2C, so it’s easy to integrate with almost any microcontroller, and it’s designed with privacy built in. If you’ve followed my work on ML sensors, this is our attempt to come up with the first commercial application of this approach to system design.
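To give a flavor of how simple the integration can be, here is a rough MicroPython sketch of polling the sensor from a microcontroller. The I2C address, result length, and byte layout below are my assumptions from memory rather than the datasheet, so check them against the official developer guide and sample code before relying on them.

import struct
import time

from machine import I2C, Pin

PERSON_SENSOR_ADDRESS = 0x62  # assumed address, check the developer guide
RESULT_LENGTH = 40            # assumed size of the full results struct

# Pin numbers depend entirely on how the module is wired to your board.
i2c = I2C(0, sda=Pin(4), scl=Pin(5))

while True:
    raw = i2c.readfrom(PERSON_SENSOR_ADDRESS, RESULT_LENGTH)
    # Assumed header layout: two reserved bytes (read here as one 16-bit
    # value), a 16-bit payload size, then a one-byte count of faces found.
    _, _, num_faces = struct.unpack_from("<HHB", raw)
    print("faces detected:", num_faces)
    time.sleep_ms(200)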

We’ve started to see interest from some TV and laptop companies, especially around our upcoming hand gesture recognition, so if you are in the consumer electronics world, or have other applications in mind, I would love to hear from you!

Now that we’re public, you can expect to see more posts here going into more detail, but for now I’ll leave you with some articles from a couple of journalists who have a lot of experience in this area. They both had very sharp and insightful questions about what we’re doing, and they had me thinking hard, so I hope you enjoy their perspectives too:

Pete Warden’s Startup puts AI in the Sensor, by Sally Ward-Foxton

Former Googler creates TinyML Startup, by Stacey Higginbotham

Try OpenAI’s Amazing Whisper Speech Recognition in a Free Web App


You may have noticed that I’m obsessed with open source speech recognition, so I was very excited when OpenAI released a new voice model. I’m even more excited now that I’ve had a chance to play with it: the accuracy is extremely impressive, especially as it’s multi-language. OpenAI have done a great job packaging it, and you can install it straight from pip if you’re a Linux shell user, but I wanted to find a way to let anybody try it for themselves from a web browser, even if they’re not developers. I love Google’s Colab service, and luckily somebody had already created a notebook showing the basics of using the Whisper model. I added some documentation and test files, and now you can give it a try for yourself by opening this Colab link: https://colab.research.google.com/github/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb

Follow the directions, and after a minute or so you’ll see a button at the bottom of the page where you can record your own audio, and see a transcript. Give it a try, I think you’ll be impressed too!
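If you would rather run it locally than in Colab, the Python API is pleasantly small. Here is a minimal sketch, assuming you have run pip install openai-whisper, have ffmpeg available, and substitute your own audio file for the placeholder name.

import whisper

# "base" is a reasonable compromise between speed and accuracy; larger models
# like "small" or "medium" are more accurate but slower to download and run.
model = whisper.load_model("base")

result = model.transcribe("my_recording.wav")
print(result["text"])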

How to build Raspberry Pi Pico programs with no software installation

youtube.com/watch?v=bDDgihwDhRE

I love using the Raspberry Pi Pico board to teach students about microcontrollers, especially as it only costs $4 and is currently in stock despite the supply chain crisis. I have run into some problems though, because building a program requires installing software. This might not sound like a big barrier, but when people arrive with a mix of Windows, MacOS, ChromeOS, and Linux laptops, often with different versions or architectures within each group, trying to guide them through the process can easily take a whole lesson, and it requires individual attention from me to debug each particular problem while the other students get bored. It’s also frustrating for the class to have to wait an hour before they get to do anything cool; I much prefer giving them a success as early as possible.

To solve this problem, I’ve actually turned to what might seem an unlikely tool, Google’s Colab service. If you have run across this, you probably associate it with Python notebooks, because that’s its primary use case. I’ve found it to be useful for a lot more though, because it effectively gives you a free, temporary Linux virtual machine that you control through the browser. Instead of running Python commands, you can run Linux shell commands by putting an exclamation point at the start. There are some restrictions, such as needing a Google account to sign in, and the file system disappearing after you leave the page or are idle too long, but I’ve found it great for documenting all sorts of installation and build processes in an accessible way.

I’m getting ready to teach EE292D (TinyML) at Stanford again this year, but we’re switching over to the Pico boards instead of the Arduino Nano BLE Sense 33 boards we’ve used in the past, because the latter have been out of stock for quite a while. As part of that, I wanted to have an easy getting-started guide for the students to help them build and run their first program. I put together a Colab notebook that follows the steps in the great Pico Getting Started Guide, installing the SDK and examples, then building blink and running it on a board. To give some extra guidance, I also recorded the YouTube video above. Please excuse the hair and occasional distraction, I did it in a hurry.
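The cells in the notebook boil down to something like the sketch below. The package names and paths are from memory, so expect to tweak them, but it also shows off the exclamation-point trick for running shell commands.

# Everything prefixed with ! runs as a Linux shell command inside the Colab VM.
!sudo apt-get install -y cmake gcc-arm-none-eabi libnewlib-arm-none-eabi build-essential

# Fetch the SDK and examples from the official Raspberry Pi repositories.
!git clone https://github.com/raspberrypi/pico-sdk.git
!git clone https://github.com/raspberrypi/pico-examples.git
!cd pico-sdk && git submodule update --init

# Point the build at the SDK, then configure and compile the blink example.
%env PICO_SDK_PATH=/content/pico-sdk
!mkdir -p pico-examples/build
!cd pico-examples/build && cmake .. && make -j4 blink

# The finished UF2 can be downloaded from Colab's file browser and dragged
# onto the Pico while it's in BOOTSEL mode.
!ls pico-examples/build/blink/blink.uf2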

It’s not a complete solution: students will still need to install OS-specific software to access debug logs, it requires a Google login that’s not available for kids under 13, and the vanishing file system will cause frustration if they don’t remember to save their code. But I do like it as a simple way to give them a win in just a few minutes. There’s nothing like seeing that first LED blink on a new board, I still get a kick out of it myself!

Why isn’t there more training on the edge?

One of the most frequent questions I get asked from people exploring machine learning beyond cloud and desktop machines is “What about training?”. If you look around at the popular frameworks and use cases of edge ML, most of them seem focused on inference. It isn’t obvious why this is the case though, so I decided to collect my notes in a post here, so I can have something to refer to when this comes up (and organize my own thoughts too!).

No Labels

I think the biggest reason that there’s not more training on the edge is that most models need to be trained through supervised learning, that is, each sample used for training needs a ground truth label. If you’re running on a phone or embedded system, there’s not likely to be an easy way to attach a label to incoming data, other than running an existing model and guessing. You need a person to look at an image, or listen to an audio recording, to identify what the prediction should be, before you can use it in training. You also generally need a fairly large number of labels per class for training to be effective.

This may change as semi-supervised or unsupervised approaches continue to improve, but right now supervised training is the most reliable method to get a model for most applications. I have seen some interesting hacks to guess labels on the edge though, that might fall into the semi-supervised category. For example, you can use temporal consistency on video frames to infer mistakes. In concrete terms, if your camera is identifying a fruit as a lemon for ten frames, then for one frame it’s a lime, and then it’s back to a lemon, you can guess that the lime prediction was an error (assuming the frame rate is high enough, fruits aren’t flying by at supersonic speed, and so forth). Another clever use of time was in an audio wake word application, where if there was a near-detection (the model gave a score just below the threshold) followed soon after by an actual detection (over the threshold) then the system would guess that the person had actually said the wake word the first time, and the model had failed to recognize it. This hack relies on the human behavior of trying again if it didn’t work initially.
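As a toy illustration of the lemon-and-lime idea, here is roughly what that kind of label guessing looks like in code. The window size and the unanimous-neighbors rule are arbitrary choices for the sketch, not anything standard.

def guess_corrected_labels(frame_predictions, window=5):
    # Return (frame_index, guessed_label) pairs where a lone prediction
    # disagrees with an otherwise stable neighborhood of frames.
    guesses = []
    for i, label in enumerate(frame_predictions):
        neighbors = (frame_predictions[max(0, i - window):i] +
                     frame_predictions[i + 1:i + 1 + window])
        if not neighbors:
            continue
        majority = max(set(neighbors), key=neighbors.count)
        # If every neighbor agrees and this frame doesn't, treat the
        # neighbors' label as the probable ground truth for this frame.
        if neighbors.count(majority) == len(neighbors) and label != majority:
            guesses.append((i, majority))
    return guesses

frames = ["lemon"] * 10 + ["lime"] + ["lemon"] * 10
print(guess_corrected_labels(frames))  # [(10, 'lemon')]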

Quality Control

Getting models to work well within an application is very hard even when you are training a single version and putting it through testing before release. If an edge model is retrained, it will be very hard to predict the bounds of its behavior. Since this will affect how well your application works, training on the fly makes ensuring it behaves correctly much harder. This isn’t a complete blocker; there are clearly some products (like GBoard) that do manage to handle this problem, but they generally build some kind of guard rails around what the model can produce. For example, something that predicts words or sentences might have a block-list of banned words (such as hateful or obscene phrases) that will be scrubbed from a model’s output even if edge training causes it to start producing them.

This kind of post-processing is often needed even when using pre-trained models on the edge (I could probably fill a decent book with all the hacks that usually go into filtering and interpreting the raw model output to make it useful) but the presence of a model that can change in unpredictable ways makes it even harder. Nobody wants to be responsible for building another Tay.
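The guard rails themselves are usually unglamorous. A block-list filter, for instance, can be as simple as the sketch below, where the list contents and the shape of the model output are placeholders.

# Placeholder entries; a real list would hold the actual banned phrases.
BLOCK_LIST = {"bannedword1", "bannedword2"}

def scrub(predicted_words):
    # Drop anything on the block list from the model's output before it
    # reaches the user, regardless of what on-device training has done.
    return [word for word in predicted_words if word.lower() not in BLOCK_LIST]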

Embeddings

When you set up a new phone, you’ll probably speak the assistant wake word a few times to help the system learn your voice. In my experience this doesn’t involve retraining in the sense of full back propagation. Instead, the “Is this audio a wake word?” model produces an embedding vector as its output, and that is then used in a nearest-neighbor lookup to compare to the embeddings from the first few utterances you spoke during setup. This is a surprisingly common technique across a lot of domains, because it is comparatively simple to implement, only requires storing a few values, and works robustly.
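Concretely, one way that enrollment-plus-lookup flow could be sketched is below; the embedding vectors are whatever the wake word model produces, and the 0.8 similarity threshold is an arbitrary placeholder rather than anything a real product uses.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SpeakerMatcher:
    def __init__(self):
        self.enrolled = []

    def enroll(self, embedding):
        # Called a few times during setup, with the embeddings of the owner
        # saying the wake word.
        self.enrolled.append(np.asarray(embedding, dtype=np.float32))

    def matches(self, embedding, threshold=0.8):
        # Nearest-neighbor check: does the new utterance land close enough to
        # any of the stored enrollment embeddings?
        scores = [cosine_similarity(embedding, e) for e in self.enrolled]
        return bool(scores) and max(scores) >= threshold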

I’ve found embeddings to be a fantastic general purpose tool for customizing models on the edge, without requiring the full machinery of back propagation. The gradient descent approach used by modern deep learning needs high precision (usually floating point) weight arrays, along with specialized operators to run the back-prop version of each layer. The weights need to be stored between updates, and since they’re higher precision than is required for inference they take up more space than an inference-optimized model, and you’ll usually want to keep a copy of the original weights around in case you need to reset the model too. By contrast, you can often extract an embedding from an existing model just by reading the activation layer before the final fully-connected op that does the classification. Even though specialized loss functions exist to try to encourage embeddings with desired properties, like good spatial separation, I’ve found that training with a regular softmax and lopping off the last layer often works just as well in practice.
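In Keras terms, lopping off the last layer can be as simple as building a second model that stops one layer early. Here is a self-contained sketch with a toy stand-in for the trained classifier; in practice you would use your own model and real input data.

import numpy as np
import tensorflow as tf

# A stand-in classifier; in practice this would be your trained model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # final classification op
])

# Reuse the model, but stop at the activations feeding the final Dense layer.
embedding_model = tf.keras.Model(
    inputs=model.inputs,
    outputs=model.layers[-2].output,
)

# Each call now returns a 32-dimensional embedding instead of class scores.
embeddings = embedding_model(np.zeros((1, 64), dtype=np.float32))
print(embeddings.shape)  # (1, 32)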

Exceptions

Of course, there are examples of very successful products that do use training on the edge. I already mentioned GBoard, which is the poster child for federated learning, but another domain where I’ve seen a lot of use is in anomaly detection, particularly around predictive maintenance for machinery. This is an application where it seems like every machine behaves differently, so learning “normal behavior” (by observing the first 24 hours of vibrations and labeling those as normal) allows the adaptation needed to spot deviations from those initial patterns. I’ve also seen interesting research projects around security and communications protocols that are looking at using training on the edge to be more robust to changing environmental conditions.
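The simplest version of that observe-first, alert-later pattern is barely more than running statistics, something like the sketch below, where the four-sigma threshold and the length of the baseline window are arbitrary placeholders.

import numpy as np

class VibrationAnomalyDetector:
    def __init__(self, baseline_readings, sigma=4.0):
        # baseline_readings are the samples gathered during the initial
        # observation period (say the first 24 hours), all treated as normal.
        baseline = np.asarray(baseline_readings, dtype=np.float32)
        self.mean = float(baseline.mean())
        self.std = float(baseline.std()) + 1e-6  # guard against a flat baseline
        self.sigma = sigma

    def is_anomalous(self, reading):
        # Flag readings that drift too many standard deviations away from the
        # behavior seen while learning what "normal" looks like.
        return abs(reading - self.mean) > self.sigma * self.std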

YAGNI

The short answer to the question is that if you’re getting started with ML on the edge, training models there is unlikely to be useful in the short or medium term. Technology keeps changing, and I am seeing some interesting applications starting to emerge, but I feel like a lot of the interest in edge training comes from how prominent training is in the cloud world. I often joke that all ML architecture researchers could go on strike indefinitely and ML engineers would still have decades of productive work ahead of them. There are many better-motivated problems around deployment on the edge than bringing training up to server capabilities, and I bet your product will hit some of those long before training becomes an issue.