Try OpenAI’s Amazing Whisper Speech Recognition in a Free Web App


You may have noticed that I’m obsessed with open source speech recognition, so I was very excited when OpenAI released a new voice model. I’m even more excited now that I’ve had a chance to play with it. The accuracy is extremely impressive, especially as it’s multi-language. OpenAI have done a great job packaging it, and you can install it straight from pip if you’re a Linux shell user, but I wanted to find a way to let anybody try it for themselves from a web browser, even if they’re not developers. I love Google’s Colab service, and luckily somebody had already created a notebook showing the basics of using the Whisper model. I added some documentation and test files, and now you can give it a try for yourself by opening this Colab link: https://colab.research.google.com/github/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb

Follow the directions, and after a minute or so you’ll see a button at the bottom of the page where you can record your own audio, and see a transcript. Give it a try, I think you’ll be impressed too!

How to build Raspberry Pi Pico programs with no software installation

youtube.com/watch?v=bDDgihwDhRE

I love using the Raspberry Pi Pico board to teach students about microcontrollers, especially as it only costs $4 and is currently in stock despite the supply chain crisis. I have run into some problems though, because building a program requires installing software. This might not sound like a big barrier, but when people arrive with a mix of Windows, MacOS, ChromeOS, and Linux laptops, often with different versions or architectures within each group, trying to guide them through the process can easily take a whole lesson, and require individual attention from me to debug each particular problem while the other students get bored. It’s also frustrating for the class to have to wait an hour before they get to do anything cool, I much prefer giving them a success as early as possible.

To solve this problem, I’ve actually turned to what might seem an unlikely tool, Google’s Colab service. If you have run across this, you probably associate it with Python notebooks, because that’s its primary use case. I’ve found it to be useful for a lot more though, because it effectively gives you a free, temporary Linux virtual machine that you control through the browser. Instead of running Python commands, you can run Linux shell commands by putting an exclamation point at the start. There are some restrictions, such as needing a Google account to sign in, and the file system disappearing after you leave the page or are idle too long, but I’ve found it great for documenting all sorts of installation and build processes in an accessible way.
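As a small illustration of the exclamation-point trick, here is a minimal sketch. The `!` form only works inside a notebook cell, so it appears as a comment; the `run_shell` helper is a hypothetical name, included only to show the plain-Python equivalent using the standard `subprocess` module.

```python
import subprocess

# In a Colab (or Jupyter) cell you would type the shell command directly,
# prefixed with an exclamation point, for example:
#   !echo "hello from the VM"
# The helper below is the plain-Python equivalent, so the same command
# also works from an ordinary script.
def run_shell(command: str) -> str:
    """Run a shell command and return its standard output as text."""
    result = subprocess.run(
        command, shell=True, check=True, capture_output=True, text=True
    )
    return result.stdout

print(run_shell("echo hello from the VM"))
```

Anything the command writes to disk lives on the temporary virtual machine, so it disappears along with the rest of the file system when the session ends.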

I’m getting ready to teach EE292D (TinyML) at Stanford again this year, but we’re switching over to the Pico boards instead of the Arduino Nano 33 BLE Sense boards that we have used, because the latter have been out of stock for quite a while. As part of that, I wanted to have an easy getting started guide for the students to help them build and run their first program. I put together a Colab notebook that follows the steps in the great Pico Getting Started Guide, installing the SDK, examples, and then building blink and running it on a board. To give some extra guidance, I also recorded the YouTube video above. Please excuse the hair and occasional distraction; I did it in a hurry.

It’s not a complete solution, students will still need to install OS-specific software to access debug logs, it requires a Google login that’s not available for kids under 13, and the vanishing file system will cause frustration if they don’t remember to save their code, but I do like it as a simple way to give them a win in just a few minutes. There’s nothing like seeing that first LED blink on a new board, I still get a kick out of it myself!

Why isn’t there more training on the edge?

One of the most frequent questions I get asked by people exploring machine learning beyond cloud and desktop machines is “What about training?”. If you look around at the popular frameworks and use cases of edge ML, most of them seem focused on inference. It isn’t obvious why this is the case though, so I decided to collect my notes in a post here, so I can have something to refer to when this comes up (and organize my own thoughts too!).

No Labels

I think the biggest reason that there’s not more training on the edge is that most models need to be trained through supervised learning; that is, each sample used for training needs a ground truth label. If you’re running on a phone or embedded system, there’s not likely to be an easy way to attach a label to incoming data, other than running an existing model and guessing. You need a person to look at an image, or listen to an audio recording, to identify what the prediction should be, before you can use it in training. You also generally need a fairly large number of labels per class for training to be effective.

This may change as semi-supervised or unsupervised approaches continue to improve, but right now supervised training is the most reliable method to get a model for most applications. I have seen some interesting hacks to guess labels on the edge though, that might fall into the semi-supervised category. For example, you can use temporal consistency on video frames to infer mistakes. In concrete terms, if your camera is identifying a fruit as a lemon for ten frames, then for one frame it’s a lime, and then it’s back to a lemon, you can guess that the lime prediction was an error (assuming the frame rate is high enough, fruits aren’t flying by at supersonic speed, and so forth). Another clever use of time was in an audio wake word application, where if there was a near-detection (the model gave a score just below the threshold) followed soon after by an actual detection (over the threshold) then the system would guess that the person had actually said the wake word the first time, and the model had failed to recognize it. This hack relies on the human behavior of trying again if it didn’t work initially.
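Both of the label-guessing hacks above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, window sizes, thresholds, and margins are all hypothetical choices, not values from any real product.

```python
from collections import Counter

def flag_inconsistent_frames(labels, window=2):
    """Return (index, predicted, neighbor_label) for predictions that
    disagree with a unanimous set of neighboring frames."""
    suspects = []
    for i, label in enumerate(labels):
        # Look at up to `window` frames on each side, skipping frame i itself.
        neighbors = labels[max(0, i - window):i] + labels[i + 1:i + 1 + window]
        if len(neighbors) < 2:
            continue
        majority, count = Counter(neighbors).most_common(1)[0]
        if count == len(neighbors) and majority != label:
            suspects.append((i, label, majority))
    return suspects

def find_missed_wake_words(scores, threshold=0.8, near_margin=0.2,
                           retry_window=3.0):
    """scores is a time-ordered list of (time_seconds, score) pairs.
    Return the times of near-misses that were followed by a real
    detection within `retry_window` seconds."""
    missed = []
    for i, (t, score) in enumerate(scores):
        if not (threshold - near_margin <= score < threshold):
            continue  # not a near-detection
        for later_t, later_score in scores[i + 1:]:
            if later_t - t > retry_window:
                break
            if later_score >= threshold:
                missed.append(t)  # the person probably retried
                break
    return missed

frames = ["lemon"] * 10 + ["lime"] + ["lemon"] * 5
print(flag_inconsistent_frames(frames))  # [(10, 'lime', 'lemon')]

events = [(0.0, 0.1), (1.0, 0.7), (2.5, 0.9), (10.0, 0.65), (20.0, 0.95)]
print(find_missed_wake_words(events))  # [1.0]
```

In both cases the output is a guessed label rather than ground truth, so anything trained on it inherits the assumptions baked into the heuristic.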

Quality Control

Getting models to work well within an application is very hard when you are training a single version and putting it through testing before release. If an edge model is retrained, it will be very hard to predict the bounds of its behavior. Since this will affect how well your application works, training on the fly makes ensuring it behaves correctly much harder. This isn’t a complete blocker, there are clearly some products (like GBoard) that do manage to handle this problem, but they generally build some kind of guard rails around what the model can produce. For example, something that predicts words or sentences might have a block-list of banned words (such as hateful or obscene phrases) that will be scrubbed from a model’s output even if edge training causes it to start producing them.

This kind of post-processing is often needed even when using pre-trained models on the edge (I could probably fill a decent book with all the hacks that usually go into filtering and interpreting the raw model output to make it useful) but the presence of a model that can change in unpredictable ways makes it even harder. Nobody wants to be responsible for building another Tay.
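As a rough sketch of the block-list guard rail described above, the filter below drops any suggestion containing a banned word. The function name and the matching rule (case-insensitive, whole words, surrounding punctuation ignored) are assumptions for illustration; production filters are considerably more involved.

```python
import string

def apply_block_list(suggestions, blocked_words):
    """Drop any suggestion containing a blocked word, ignoring case and
    punctuation attached to the word."""
    blocked = {word.lower() for word in blocked_words}
    safe = []
    for suggestion in suggestions:
        tokens = [t.strip(string.punctuation).lower()
                  for t in suggestion.split()]
        if not any(token in blocked for token in tokens):
            safe.append(suggestion)
    return safe

candidates = ["see you soon", "you are a Jerk!", "sounds good"]
print(apply_block_list(candidates, ["jerk"]))
```

The useful property is that the filter sits outside the model, so it keeps working no matter how the model's weights drift.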

Embeddings

When you set up a new phone, you’ll probably speak the assistant wake word a few times to help the system learn your voice. In my experience this doesn’t involve retraining in the sense of full back propagation. Instead, the “Is this audio a wake word?” model produces an embedding vector as its output, and that is then used in a nearest-neighbor lookup to compare to the embeddings from the first few utterances you spoke during setup. This is a surprisingly common technique across a lot of domains, because it is comparatively simple to implement, only requires storing a few values, and works robustly.

I’ve found embeddings to be a fantastic general purpose tool for customizing models on the edge, without requiring the full machinery of back propagation. The gradient descent approach used by modern deep learning needs high precision (usually floating point) weight arrays, along with specialized operators to run the back-prop version of each layer. The weights need to be stored between updates, and since they’re higher precision than is required for inference they take up more space than an inference-optimized model, and you’ll usually want to keep a copy of the original weights around in case you need to reset the model too. By contrast, you can often extract an embedding from an existing model just by reading the activation layer before the final fully-connected op that does the classification. Even though specialized loss functions exist to try to encourage embeddings with desired properties, like good spatial separation, I’ve found that training with a regular softmax and lopping off the last layer often works just as well in practice.
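The enrollment lookup described above can be sketched as a nearest-neighbor comparison. This is a minimal illustration: cosine similarity, the 0.9 cutoff, the three-element vectors, and the function names are all assumptions, and a real system would get its embeddings from the model rather than from hand-written lists.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def matches_enrollment(embedding, enrolled, min_similarity=0.9):
    """Compare a new embedding against the few stored at setup time.
    Returns (is_match, best_similarity)."""
    best = max(cosine_similarity(embedding, stored) for stored in enrolled)
    return best >= min_similarity, best

# Made-up embeddings standing in for the setup utterances.
enrolled = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]

print(matches_enrolled := matches_enrollment([0.85, 0.15, 0.05], enrolled))
print(matches_enrollment([0.0, 0.2, 0.9], enrolled))
```

"Customizing" the model here just means appending another vector to `enrolled`, which needs no gradients, no floating point weight copies, and only a few values of storage.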

Exceptions

Of course, there are examples of very successful products that do use training on the edge. I already mentioned GBoard, which is the poster child for federated learning, but another domain where I’ve seen a lot of use is in anomaly detection, particularly around predictive maintenance for machinery. This is an application where it seems like every machine behaves differently, so learning “normal behavior” (by observing the first 24 hours of vibrations and labeling those as normal) allows the adaptation needed to spot deviations from those initial patterns. I’ve also seen interesting research projects around security and communications protocols that are looking at using training on the edge to be more robust to changing environmental conditions.
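To make the "learn normal behavior, then spot deviations" idea concrete, here is a deliberately simple sketch. It assumes each reading is a single vibration magnitude and uses a mean and standard deviation as a stand-in for whatever model a real predictive maintenance product would train; the class name and the four-sigma threshold are hypothetical.

```python
import statistics

class VibrationAnomalyDetector:
    """Learns a baseline from an initial observation period, with every
    reading in that period implicitly labeled as normal."""

    def __init__(self, threshold_sigmas=4.0):
        self.threshold_sigmas = threshold_sigmas
        self.mean = None
        self.stdev = None

    def fit_normal(self, readings):
        """Record the statistics of the baseline period."""
        self.mean = statistics.fmean(readings)
        self.stdev = statistics.pstdev(readings)

    def is_anomaly(self, reading):
        """True if the reading is far outside the learned baseline."""
        if self.mean is None:
            raise RuntimeError("call fit_normal() before is_anomaly()")
        spread = max(self.stdev, 1e-9)  # avoid dividing by zero
        return abs(reading - self.mean) / spread > self.threshold_sigmas

detector = VibrationAnomalyDetector()
detector.fit_normal([1.0, 1.1, 0.9, 1.05, 0.95] * 20)
print(detector.is_anomaly(1.05), detector.is_anomaly(2.0))
```

Because the baseline is learned on the device itself, two machines of the same model can end up with quite different definitions of normal, which is the adaptation the application needs.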

YAGNI

The short answer to the question is that if you’re getting started with ML on the edge, training models there is unlikely to be useful in the short or medium term. Technology keeps changing, and I am seeing some interesting applications starting to emerge, but I feel like a lot of the interest in edge training comes from how prominent training is in the cloud world. I often joke that all ML architecture researchers could go on strike indefinitely, and ML engineers would still have decades of productive work ahead of us. There are many better-motivated problems around deployment on the edge than bringing training up to server capabilities, and I bet your product will hit some of those long before training becomes an issue.

Don’t worry if college isn’t the happiest time of your life

I was digging through paperwork today to help complete my PhD admission process, and I stopped short when I saw the academic transcript from my undergraduate years. I was a terrible student! I got 0% on one course, awful scores on many others, and had to do a lot of retakes. It brought back memories of how I was feeling when I was 18. I was a mess. I was totally unprepared for life away from home, suffered from so much anxiety I wasn’t even able to name it, like a fish having no concept of water, and jumped right into a terrible relationship the first chance I got. I was working almost full-time at Kwik-Save to pay rent, and didn’t even have a computer at home I could use.

It wasn’t supposed to be like this. Since I was a kid I’d dreamed of escaping my tiny village for university. I wasn’t sure what it was going to be like, since nobody in my immediate family had completed college, but I had vague ideas from being a townie in Cambridge and shows like Brideshead Revisited that I would be transported to a magical world of privilege and punts. Most of all, I looked forward to meeting people I could talk to about important things, people who might listen to me. I also knew I was “good at computers”, and was looking forward to diving deeper into programming. The reality of being just one of hundreds of students, with little ability to connect with any of the staff, and discovering that most of what I’d learned about coding wasn’t a big help with Computer Science, left me more than deflated. I was still the same screwed-up person, there had been no magical transformation. A lot of the time I resented time spent on my classes, and felt I wasn’t learning what I needed for my true vocation, being a programmer, and you can see that in my grades.

Looking back, this wasn’t Manchester’s fault. It’s a fantastic university with a CS program that’s world-class, and despite my best efforts I did learn a lot from great teachers like Steve Furber and Carole Goble. Their lessons turned out to be far more useful in my career than I ever would have expected. The staff were kind and helpful on the few occasions I did reach out, but I had such a lack of confidence I seldom dared try. I managed to scrape through, with a lot of retakes, and helped by the fact that the overall marks were heavily weighted to the final year. It left me feeling cheated though, somehow. I’d always heard the cliche that these would be the happiest days of my life. If I was miserable and it was all downhill from here, what was even the point of carrying on? It didn’t help that the first technical job I could find out of college paid less than I’d made stacking shelves at a supermarket.

The good news is that life has pretty much continuously got better from that point on. Years of therapy, a career path that has gifted me some fascinating and impactful problems to work on, along with enough money to be comfortable, and building the kind of community I’d dreamed of at college through writing and the internet, have left me feeling happier than I’ve ever been. I feel very lucky to have found a way to engage with so many smart people, and find my voice, through open source coding, blogging, research papers, and teaching.

I didn’t write this post to humblebrag about how wonderful my life is, I still have plenty of challenges and disappointments. I just want to provide a datapoint for anyone else who is struggling or has struggled at college. It doesn’t have to define you for the rest of your life. If the experience isn’t what you’d expected and hoped, there’s no need for despair. Life can get so much better.

Why cameras are soon going to be everywhere

i-FlatCam demo

I’ve finally had the chance to play Cyberpunk 2077 over the last few weekends, and it’s an amazing feat of graphics programming, especially with ray-tracing enabled. I’ve had fun, but I have been struck by how the cyberpunk vision of the future is rooted in the ’80s. Even though William Gibson was incredibly prescient in so many ways, the actual future we’re living in is increasingly diverging from the one he painted. One of the differences that struck me most was how cameras exist in the game’s world, compared to what I can see happening as an engineer working in the imaging field. They still primarily show up as security cameras, brick-sized devices that are attached to walls and would look familiar to someone from forty years ago.

A lot of people still share the expectation that cameras will be obvious, standalone components of a system. Even though phone cameras and webcams are smaller, they still have a noticeable physical presence, and often come with indicators like red lights that show when they’re recording. What is clear to me from my work is that these assumptions aren’t going to hold much longer. Soon imaging sensors will be so small, cheap, and energy efficient that they’ll be added to many more devices in our daily lives, and because they’re so tiny they won’t even be noticeable!

What am I basing this prediction on? The clearest indicator for me is that you can already buy devices like the Himax HM01B0 with an imaging sensor that’s less than 2mm by 2mm in size, low single-digit dollars in cost, and 2 milliwatts or less in power usage. Even more striking are the cameras that are emerging from research labs. At the TinyML Summit the University of Michigan presented a complete system that fits on the tip of a finger.

The video at the top of this post shows another project from Rice that is able to perform state of the art eye tracking at 253 FPS, using 23 milliwatts, in a lens-less system that lets it achieve a much smaller size than other solutions.

Hopefully this makes it clear that there’s a growing supply of these kinds of devices. Why do I think there will be enough demand to include them in appliances and other items around homes and offices? This is tricky to show as clearly because the applications aren’t deployed yet, but cameras can replace or augment lots of existing sensors, and enable entirely new features. Here are a few examples:

Each one of these may or may not turn out to be useful, but there are so many potential applications (including many I’m sure nobody’s thought of yet) that I can’t imagine some of them won’t take off once the technology is widely available. Many scientists believe that the Cambrian Explosion occurred because the evolution of eyes opened up so many new possibilities and functions. I’m hoping we’ll see a similarly massive expansion in the technology space once all our devices can truly see and understand.

So, that’s why I believe we’re going to end up in a world where we’re each surrounded by thousands of cameras. What does that mean? As an engineer I’m excited, because we have the chance to make a positive impact on people’s lives. As a human being, I’m terrified because the potential for harm is so large, through unwanted tracking, recording of private moments, and the sharing of massive amounts of data with technology suppliers.

If you accept my argument about why we’re headed for a world full of tiny cameras watching us at all times, then I think we all have a responsibility to plan ahead now to mitigate the potential harms. This is my motivation behind the ML sensors proposal to wall off sensitive data in a secure component, but I see this as just a starting point in the discussion. Do we need regulation? Even if we don’t get it in the US, will Europe take the lead? Should there be voluntary standards around labeling products that contain cameras or microphones? I don’t know the answers, but I don’t think we have the luxury of waiting too long to figure them out, because if we don’t make any changes we’ll be deploying billions of poorly-secured devices into everybody’s lives as a giant uncontrolled experiment.

What are ML Sensors?

I’ve spent a lot of time at conferences talking about all the wonderful things that are now possible using machine learning on embedded devices, but as Stacey Higginbotham pointed out at this year’s TinyML Summit, despite all the potential there haven’t been many shipping applications. My experience is that companies like Google with big ML teams have been able to deploy products successfully, but it has been a lot harder for teams in other industries. For example, when I visited an appliance manufacturer in China and pitched them on the glorious future they could access thanks to TensorFlow Lite Micro, they told me they didn’t even know how to open a Python notebook. Instead, they asked if I could just give them a voice interface, or something that told them when somebody sat down in front of their TV.

I realized that the software framework model that had worked so well for TensorFlow Lite adoption on phone apps didn’t translate to other domains. Many of the firms that could most benefit from ML just don’t have the software engineering resources to integrate a library, even with great tools like Edge Impulse that make the process much easier. As I thought about how to make on-device ML more widely accessible, I realized that providing ML capabilities as small, cheap hardware modules might be a good solution. This was the seed of the idea that became the ML sensors proposal, now available as a paper on arXiv.

The basic idea is that system builders are already able to integrate components like sensors into their products, so why not expose some higher-level information about the environment in the same form factor? For example, a person sensor might have a pin that goes high when someone is present, and then an I2C interface to supply more detailed information about their pose, activities, and identity. That would allow a TV manufacturer to wake up the display when someone sat down on the couch, and maybe even customize the UI to show recently-watched shows based on which family members are present. All of the complexity of the ML implementation would be taken care of by the sensor manufacturer and hidden inside the hardware module, which would have a microcontroller and a camera under the hood. The OEM would just need to respond to the actionable signals from the component.

At the same time as I was thinking about how to get ML into more people’s hands, I was also worried about the potential for abuse that the proliferation of cameras and microphones in everyday devices enables. I realized that the modular approach might have some advantages there too. I think of personal information as toxic waste, because any leaks can be highly damaging to individuals, and to the companies involved, and there are few data sources that have as much potential for harm as video and audio streams from within people’s homes. I believe it’s our responsibility as developers to engineer systems that are as leak-resistant as possible, especially if we’re dealing with cameras and microphones. I’d already explored the idea of using Arm’s TrustZone to keep sensitive data contained, but by moving the ML processing off the central microcontroller and onto a peripheral, we have the chance to design something that has a very small attack surface (because there’s no memory shared with the rest of the system) and can be audited by a third party to ensure any claims of safety are credible.

The ML sensor paper brings all these ideas together into a proposal for designing systems that are easier to build, and safer by default. I’m hoping this will start a discussion about how to improve usability and privacy in everyday systems, and lead to more practical prototyping and experimentation to answer a lot of the questions it raises. I’d love to get feedback on this proposal, especially from product designers who might want to try integrating these into their systems. I’m looking forward to seeing more work in this area, I know I’m going to be busy trying to get some examples up and running, so watch this space!

Caches Considered Harmful for Machine Learning

Photo by the National Park Service

I’ve been working on a new research paper, and a friend gave me the feedback that he was confused by the statement “memory accesses can be accurately predicted at the compilation stage” for machine learning workloads, and that this made them a poor fit for conventional processor architectures with predictive caches. I realized that this was received wisdom among the ML engineers I know, but I wasn’t aware of any papers that discuss this point. I put out a request for help on Twitter, but while there were a lot of interesting resources in the answers, I still couldn’t find any papers that focused on what feels like an important property for machine learning systems. With that in mind, I wanted to at least describe the issue as best as I can in this blog post, so there’s a trail of breadcrumbs for anyone else interested in how system designs might need to change to accommodate ML.

So, what am I talking about? Modern processors are almost universally constructed around multiple layers of predictive memory caches. These are small areas of memory that can be accessed much faster than the main system memory, and are needed because processors can execute instructions far more quickly than they can fetch values from the DRAM used for main memory. In fact, you can usually run hundreds of instructions in the time it takes to bring one byte from the DRAM. This means if processors all executed directly from system memory, they would run hundreds of times more slowly than they could. For decades, the solution to the mismatch has been predictive caches. It’s possible to build memory that’s much faster to access than DRAM, but for power and area reasons it’s not easy to fit large amounts onto a chip. In modern systems you might have gigabytes of DRAM, but only single-digit megabytes of total cache. There are some great papers like What Every Programmer Should Know About Memory that go into a lot more detail about the overall approach, but the most important thing to know is that memory stored in these caches can be accessed in a handful of cycles, instead of hundreds, so moving data into these areas is crucial if you want to run your programs faster.

How do we decide what data should be placed in these caches though? This requires us to predict what memory locations will be accessed hundreds or thousands of cycles in the future, and for general programs with a lot of data-dependent branches, comparisons, and complex address calculations this isn’t possible to do with complete accuracy. Instead, the caches use heuristics (like “we just accessed address N, so also fetch N+1, N+2, etc. in case we’re iterating through an array”) to guess how to populate these small, fast areas of memory. The cost of making a mistake is still hundreds of cycles, but as long as most of the accesses are predicted correctly this works pretty well in practice. However, there is still an underlying tension between the model used for programming languages, where memory is treated as a uniform arena, and the reality of hardware where data lives in multiple different places with very different characteristics. I never thought I’d be linking to a Hacker News comment (the community has enough toxic members that I haven’t read it for years), but this post I was pointed to actually does a good job of talking about all the complexities that are introduced to make processors appear as if they’re working with uniform memory.

Why does all this matter for machine learning? The fundamental problem predictive caches are trying to solve is “What data needs to be prefetched into fast memory from DRAM?”. For most computing workloads, like rendering HTML pages or dealing with network traffic, the answer to this question is highly dependent on the input data to the algorithm. The code is full of lines like ‘if (a[i] == 10) { value = b[j]; } else { value = b[k]; }’, so predicting which addresses will be accessed requires advance knowledge of i, a[i], j, and k, at least. As more of these data-dependent conditionals accumulate, the permutations of possible access addresses become unmanageable, and effectively it’s impossible to predict addresses for code like this without accessing the data itself. Since the problem we’re trying to solve is that we can’t access the underlying data efficiently without a cache, we end up having to rely on heuristics instead.

Machine learning operations are very different. The layers that take up the majority of the time for most models tend to be based on operations like convolutions, which can be expressed as matrix multiplies. Crucially, the memory access patterns don’t depend on the input data. There’s no ‘if (a[i] == 10) {...’ code in the inner loops of these kernels; they’re much simpler. The sizes of the inputs are also usually known ahead of time. These properties mean that we know exactly what data we need in fast memory for the entire execution of the layer ahead of time, with no dependencies on the values in that data. Each layer can often take hundreds of thousands of arithmetic operations to compute, and each value fetched has the potential to be used in multiple instructions, so making good use of the small amounts of fast memory available is crucial to reducing latency. What quickly becomes frustrating to any programmer trying to optimize these algorithms on conventional processors is that it’s very hard to transfer our complete knowledge of future access patterns into compiled code.
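One way to see the difference is to record which elements each kind of code reads. The sketch below is purely illustrative: it logs array indices rather than real memory addresses, and the function names are hypothetical. It shows that a matrix multiply touches the same locations in the same order whatever values the matrices hold, while the branchy lookup from earlier does not.

```python
def matmul_trace(a, b):
    """Multiply two matrices, recording every (array, row, col) read."""
    rows, inner, cols = len(a), len(b), len(b[0])
    trace = []
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                trace.append(("a", i, k))
                trace.append(("b", k, j))
                out[i][j] += a[i][k] * b[k][j]
    return out, trace

def branchy_trace(a, b, j, k):
    """The data-dependent lookup from the text, recording each read."""
    trace = []
    for i in range(len(a)):
        trace.append(("a", i))
        if a[i] == 10:
            value = b[j]
            trace.append(("b", j))
        else:
            value = b[k]
            trace.append(("b", k))
    return trace

# Same shapes, different values: the access sequence is identical.
_, trace_one = matmul_trace([[1, 2], [3, 4]], [[5, 6], [7, 8]])
_, trace_two = matmul_trace([[9, 0], [10, 10]], [[1, 1], [0, 2]])
print(trace_one == trace_two)  # True

# Same shapes, different values: the access sequence changes.
print(branchy_trace([10, 3], [7, 8, 9], 0, 2)
      == branchy_trace([3, 10], [7, 8, 9], 0, 2))  # False
```

Since the matrix multiply trace depends only on the shapes, it could in principle be computed once at compile time and handed to the hardware.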

The caches rely almost entirely on the heuristics that were designed for conventional usage patterns, so we essentially have to reverse-engineer those heuristics to persuade them to load the data we know we’ll need. There are some tools to help like prefetching instructions and branch hints, but optimizing inner loops often feels like a struggle against a system that thinks it’s being helpful, but is actually getting in the way. Optimized matrix multiplication implementations usually require us to gather the needed data into tiles that are a good fit for the fast memory available, so we can do as much as possible with the values while they’re quickly accessible. Getting these tiles the right size and ensuring they’re populated with the correct data before it’s needed requires in-depth knowledge of the capacity, access latencies, and predictive algorithms of all levels of the cache hierarchy on a particular processor. An implementation that works well on one chip may produce drastically poorer performance on another in the same family if any of those characteristics change.
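The tiling idea can be sketched as a loop restructuring. This is only an illustration of the loop order: Python gives no control over the cache, and the tile size of 2 is an arbitrary assumption, whereas a real kernel would pick it to match the fast memory on a specific chip.

```python
def naive_matmul(a, b):
    """Reference implementation: one pass over the whole inner dimension."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def tiled_matmul(a, b, tile=2):
    """Same result, but the work is split into tile x tile blocks so that
    each block of a and b is reused while it is (notionally) in fast memory."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            for k0 in range(0, inner, tile):
                # Everything inside here touches only one tile of a,
                # one tile of b, and one tile of out.
                for i in range(i0, min(i0 + tile, rows)):
                    for j in range(j0, min(j0 + tile, cols)):
                        acc = out[i][j]
                        for k in range(k0, min(k0 + tile, inner)):
                            acc += a[i][k] * b[k][j]
                        out[i][j] = acc
    return out

a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
b = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 0], [1, 1, 1, 1], [2, 0, 0, 3]]
print(tiled_matmul(a, b) == naive_matmul(a, b))  # True
```

The arithmetic is unchanged, which is why the only thing left to tune is the tile size, and why that tuning ends up so tightly coupled to one processor's cache hierarchy.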

It would make more sense to expose the small, fast memories to the programmer directly, instead of relying on opaque heuristics to populate them. They could be made available as separate address spaces that can be explicitly preloaded ahead of time with data before it’s needed. We know what address ranges we’ll want to have and when, so give us a way to use this knowledge to provide perfect predictions to fill those areas of memory. Some embedded chips do offer this capability, known variously as tightly-coupled memory, or XY memory, and we do use this to improve performance for TensorFlow Lite Micro on platforms that support it.

There are lots of challenges to making this available more widely though. Modern desktop and mobile apps don’t have the luxury of targeting a single hardware platform, and are expected to be able to run across a wide variety of different chips within the same processor family. It would be very difficult to write efficient code that works for all those combinations of cache size, speed, and prefetch heuristics. Software libraries from the processor manufacturers themselves (like cuDNN or Intel’s MKL) are usually the best answer right now, since they are written by engineers with detailed knowledge of the hardware systems and will be updated to handle new releases. These still have to work around the underlying challenges of a programming model that tries to hide the cache hierarchy though, and every engineer I’ve talked to who has worked on these inner loops wishes they had a better way to take advantage of their knowledge of memory access patterns.

This is also the kind of radical workload difference that has inspired a lot of new kinds of NPU hardware aimed specifically at deep learning. From my perspective, these have also been hard to work with, because while their programming models may work better for core operations like convolutions, models also require layers like non-max suppressions that are only efficiently written as procedural code with data-dependent branches. Without the ability to run this kind of general purpose code, accelerators lose many of their advantages because they have to keep passing off work to the main CPU, with a high latency cost (partly because this kind of handover usually involves flushing all caches to keep different memory areas in sync).
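For contrast with the matrix multiply case, here is a minimal sketch of greedy non-max suppression. It assumes boxes are (x1, y1, x2, y2) tuples and uses a 0.5 overlap threshold, both of which are illustrative choices. Which boxes get compared, and how many times, depends entirely on the scores and coordinates in the input.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Data-dependent branch: the outcome depends on the box values.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The sort, the growing `keep` list, and the early exit inside `all` are exactly the kind of control flow that maps poorly onto hardware built around fixed access patterns.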

I don’t know what the ultimate solution will look like, but I’d imagine it will either involve system programmers being able to populate parts of caches using explicit prefetching, maybe even just supplying a set of address ranges as requirements and relying on the processor to sort it out, or something more extreme. One possible idea is making matrix multiplies first-class instructions at the machine code level, and having each processor implement the optimal strategy in microcode, in a similar way to how floating-point operations have migrated from accelerators, to co-processors, and now to the core CPU. Whatever the future holds, I hope this post at least helps explain why conventional predictive caches are so unhelpful when trying to optimize machine learning operations.

How Should you Protect your Machine Learning Models and IP?

Over the last decade I’ve helped hundreds of product teams ship ML-based products, inside and outside of Google, and one of the most frequent questions I got was “How do I protect my models?”. This usually came from executives, and digging deeper it became clear they were most worried about competitors gaining an advantage from what we released. This worry is completely understandable, because modern machine learning has become essential for many applications so quickly that best practices haven’t had time to settle and spread. The answers are complex and depend to some extent on your exact threat models, but if you want a summary of the advice I usually give it boils down to:

  • Treat your training data like you do your traditional source code.
  • Treat your model files like compiled executables.

To explain why I ended up with these conclusions, I’ll need to dive into some of the ways that malicious actors could potentially harm a company based on how ML materials are released. I’ve spent a lot of my time focused on edge deployments, but many of the points are applicable to cloud applications too.

The most concerning threat is frequently “Will releasing this make it easy for my main competitor to copy this new feature and hurt our differentiation in the market?”. If you haven’t spent time personally engineering ML features, you might think that releasing a model file, for example as part of a phone app, would make this easy, especially if it’s in a common format like a TensorFlow Lite flatbuffer. In practice, I recommend thinking about these model files like the binary executables that contain your application code. By releasing it you are making it possible to inspect the final result of your product engineering process, but trying to do anything useful with it is usually like trying to turn a hamburger back into a cow. Just as with executables you can disassemble them to get the overall structure, by loading them into a tool like Netron. You may be able to learn something about the model architecture, but just like disassembling machine code it won’t actually give you a lot of help reproducing the results. Knowing the model architecture is mildly useful, but most architectures are well known in the field anyway, and only differ from each other incrementally.

What about just copying the model file itself and using it in an application? That’s not as useful as you might think, for a lot of reasons. First off, it’s a clear copyright violation, just like copying an executable, so it’s easy to spot and challenge legally. If you are still worried about this, you can take some simple steps like encrypting the model file in the app bundle and only unpacking it into memory when the app is running. This won’t stop a determined attacker, but it makes it harder. To help catch copycats, you can also add text strings into your files that say something like “Copyright Foo, Inc.”, or get more elaborate and add canaries, also more poetically called Mountweazels, by modifying your training data so that the model produces distinct and unlikely results in rare circumstances. For example, an image model could be trained so that a Starbucks logo always returns “Duck” as the prediction. Your application could ignore this result, but even if the attacker got clever and added small perturbations to the model weights to prevent obvious binary comparisons, the behavior would be likely to persist and prove that it was directly derived from the original.
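As a rough illustration of how a canary check might look in practice, here is a minimal Python sketch. The `find_canary_hits` function, the `predict` callable, and the file names are all hypothetical, invented for this example; a real check would call whatever inference API the suspect model exposes.

```python
def find_canary_hits(predict, canaries):
    """Return the canary inputs whose planted label the model reproduces.

    predict:  hypothetical callable mapping an input to a predicted label.
    canaries: list of (input, planted_label) pairs added to the training data.
    """
    return [item for item, planted in canaries if predict(item) == planted]


# Stand-in for a suspect model: a lookup table instead of real inference.
suspect_model = {"starbucks_logo.png": "Duck", "cat.png": "Cat"}.get
canaries = [("starbucks_logo.png", "Duck")]

hits = find_canary_hits(suspect_model, canaries)
print(f"{len(hits)} of {len(canaries)} canaries reproduced")
```

How many reproduced canaries count as convincing evidence is a judgment call, but a label as unlikely as “Duck” for a coffee logo is hard to explain by coincidence.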

Even if you don’t detect the copying, having a static model is not actually that useful. The world keeps changing, you’ll want to keep improving the model and adapting to new needs, and that’s very hard to do if all you have is the end result of training. It’s also unlikely that a competitor will have exactly the same requirements as you, whether it’s because of using different hardware or a user population that differs from yours. You might be able to hack a bit of transfer learning to modify a model file, but at that point you’re probably better off starting with a publicly-released model, since you’ll have a very limited ability to make changes on a model that’s already been optimized (for example using quantization).

A lot of these properties are very analogous to a compiled executable, hence my advice at the start. You’ve got an artifact that’s the end result of a complex process, and any attacker is almost certain to want modifications that aren’t feasible without access to the intermediate steps that were required to produce it in the first place. From my experience, by far the most crucial, and so most valuable, part of the recipe for a machine learning feature is the training data. It would be much quicker for me to copy most features if I was given nothing but the dataset used to train the model, than if I had access to the training script, feature generation, optimization and deployment code without that data. The training data is what actually sets out the detailed requirements for what the model needs to do, and usually goes through a long process of refinement as the engineers involved learn more about what’s actually needed in the product.

This is why I recommend treating the dataset in the same way that you treat source code for your application. It’s a machine-readable specification of exactly how to tackle your problem, and as such requires a lot of time, resources, and expertise to reproduce. People in other industries often ask me why big tech companies give away so much ML software as open source, because they’re used to thinking about code as the crown jewels that need to be protected at all costs. This is true for your application code, but in machine learning having access to libraries like TensorFlow or PyTorch doesn’t get you that much closer to achieving what Google or Meta can do with machine learning. It’s actually the training data that’s the biggest barrier, so if you have built something using ML that’s a competitive advantage, make sure you keep your dataset secure.

Personally, I’m a big fan of opening up datasets for research purposes, but if you look around you’ll see that most releases are for comparatively generic problems within speech or vision, rather than more specific predictions that are useful for features in commercial products. Public datasets can be useful as starting points for training a more targeted model, but the process usually involves adding data that’s specific to your deployment environment, relabeling to highlight the things you really want to recognize, and removing irrelevant or poorly-tagged data. All these steps take time and resources, and form a barrier to any competitor who wants to do the same thing.

My experience has largely been with on-device ML, so these recommendations are focused on the cases I’m most familiar with. Machine learning models deployed behind a cloud API have different challenges, but are easier in a lot of ways because the model file itself isn’t accessible. You may still want to put in terms-of-use clauses to bar people from using the services to train their own models, like all commercial speech recognition APIs I know of do, but this approach to copying isn’t as effective as you might expect. It suffers from the Multiplicity problem, where copies inevitably seem to lose quality compared to their originals.

Anyway, I am very definitely Not A Lawyer, so don’t take any of this as legal advice, but I hope it helps you understand some useful responses to typical threat models, and at least gives you my perspective on the best practices I’ve seen emerge. I’ll be interested to hear if there are any papers or other publications around these questions too, so please do get in touch if you know of anything I should check out!

Is Google Spying on your Conversations?

No.

Ok, I thought about leaving this as a one-word blog post, but even though I can categorically state that it isn’t happening, the fact that this question comes up regularly in my everyday life, and that I worked on always-on audio when I was at Google, makes me want to expand on this a bit.

A good starting point is this BBC article from 2016 asking “Is your smartphone listening to you?”, which includes the common anecdote of an ad that seems like it was triggered by a recent conversation, an investigation into the technical possibility that it could be happening, and denials from Google, Facebook, and Amazon that what users suspect is actually occurring. I worked for years on the infrastructure Google uses for the machine learning models to recognize speech triggers like “Hey Google”, so if you trust me you can take my word that we didn’t have the capability to do what people are concerned about. Even if you don’t trust me, there are public papers from Google and Apple that go into detail about how the always-on systems in Android and iOS phones work. The summary is that in order to run even when most of the phone (including the CPU) is powered down, the microphone data has to be processed by a subsystem that is extremely constrained, because to avoid draining the battery it can only consume something like ten milliwatts. For comparison, a Cortex-A processor used for the main CPU (or application processor) can easily burn a watt or more. To run at such low power, this subsystem has a lot less memory and compute than the application processor, often with only a few hundred kilobytes of RAM and a clock frequency in the low hundreds of megahertz. This makes running full speech recognition, or even listening for more than a few keywords, impractical from an engineering perspective. The Google research teams have managed some minor miracles like squeezing “Now Playing” onto the Pixel’s always-on subsystem, listening out for when music is playing and waking up the application processor to identify it, but it took incredible ingenuity to fit that into the memory budget available. Even though the article states the security researchers built a proof of concept app that didn’t use much power, they don’t link to any code or power measurements. Since regular Android developers can’t run apps on the always-on subsystem (it’s restricted to phone manufacturers), their app must have been running on the application processor, and I’m willing to bet a lot of money you’d notice your battery draining fast if the main CPU was awake for long periods.

So, I would have been directly involved in any code that did the kind of conversational spying that many people incorrectly suspect is happening, and I’m in a good position to categorically say it isn’t. Why should you trust me though? Or to put it another way, how can an everyday user verify my statement? The BBC article is a bit unsatisfying, because they have security researchers create a proof of concept for an app that listens to conversations, and then state that the companies involved deny that they are doing this. Even if you have faith in the big tech firms involved, I know from my own experience that their engineers can make mistakes and leak information accidentally. My knowledge is also aging: technology keeps improving, and running full speech recognition on an always-on chip won’t always be out of reach.

That gap, the fact that we have to trust the word of phone manufacturers that they aren’t spying on us and that there’s no good way for a third party to verify that promise, is what I’ll be focusing on in my research. I believe it should be possible to build voice interfaces and other devices with microphones and cameras in such a way that someone like Underwriters’ Laboratories or Consumer Reports can test their privacy guarantees. I’ve already explored some technical solutions in the past, but I think it’s important to gather a coalition of people interested in the broader questions. With that in mind, if you are a researcher or engineer either in academia or industry who’s interested in this area, drop me an email at [email protected]. I’m hoping we can organize some kind of symposium and discussion groups to figure out the best practices. I believe that we as computer scientists can do better than just asking the public to blindly trust corporations to do the right thing, so let’s figure out how!

Non-Max Suppressions, How do they Work?

(En español: https://www.ibidem-translations.com/edu/traduccion-non-max-supression/)

I’ve been working with neural networks to do image recognition for almost a decade now, but I have to admit I never really understood the details of how they output things like bounding boxes. I didn’t have a good mental model for how it all worked, and the reference functions always seemed pretty intimidating. In a lot of cases this doesn’t matter, the conversion process is handled by internal layers inside a model, and the application developer doesn’t need to worry about what’s happening under the hood. Recently though, I’ve begun working with some networks that expect the conversion to be handled externally, and so I’ve been writing code from scratch to perform the translation.

That has forced me to understand the details. To make sure I have a good grasp, and to have something to refer to in the future, I’ve put together this blog post and a Python Colab demonstrating it all, step by step. I’m using an example model from the awesome MediaPipe framework (which handles all the conversion itself, if you’re on a platform that supports it), and I’ve written reference code to explain the workflow to get from raw tensors to a cleaned-up set of bounding boxes. In particular, I feel like I’ve finally got a handle on “non-max suppression”, which turned out to be less intimidating than I’d feared.

How do they Work?

I recommend working through the Colab to get the best understanding, but the summary is that most neural networks designed to produce bounding boxes use a grid of anchor points across the image as a base. Each anchor point is associated with a score value, along with x and y offsets, width, height, and any other feature coordinates (like nose or eye positions). All of these coordinates are output relative to the anchor points, normalized between 0.0 and 1.0, where 1.0 is the image size. There is one score, and one set of coordinates for each anchor point, so in the case of the face model I’m using in the notebook, there are 48 columns and 48 rows of anchors, spread 4 pixels apart on a 192×192 image, which means 2,304 entries.
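To make the anchor layout concrete, here is a small pure-Python sketch that builds the 48×48 grid described above. The function name `make_anchor_grid` is made up for this example, and placing each anchor at the center of its 4-pixel cell is an assumption rather than something the text above specifies.

```python
def make_anchor_grid(image_size=192, stride=4):
    """Return (x, y) anchor positions, normalized so 1.0 is the image size."""
    cells = image_size // stride  # 192 / 4 = 48 rows and 48 columns
    anchors = []
    for row in range(cells):
        for col in range(cells):
            # Assumption: each anchor sits in the middle of its stride-sized cell.
            x = (col + 0.5) * stride / image_size
            y = (row + 0.5) * stride / image_size
            anchors.append((x, y))
    return anchors


anchors = make_anchor_grid()
print(len(anchors))  # 2304
```

The anchors are stored row by row, so the entry for column `col` and row `row` lives at index `row * 48 + col`.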

There are two outputs to the model, the first with a shape of (1, 2304, 16), holding 8 pairs of (x, y) coordinates for each anchor. The second is (1, 2304, 1) and holds the score for each anchor. For this model, the first two pairs of coordinates are the origin of the bounding box and its width and height. The other six are the positions of facial landmarks like the mouth or nose. The first stage of decoding is to turn these from relative positions into absolute coordinates by adding the corresponding anchor origins. This gives you a soup of overlapping bounding boxes, each associated with a score.
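The decoding step can be sketched in a few lines of plain Python. This is not the notebook's code: the function names are invented here, and it assumes the batch dimension has been dropped, the scores have been flattened into a simple list, and the raw values are already normalized as described above.

```python
def decode_anchor(raw, anchor):
    """Turn one anchor's 16 raw values into absolute, normalized coordinates.

    raw:    16 floats, read as 8 (x, y) pairs relative to the anchor.
    anchor: (x, y) origin of the anchor, normalized to 0.0-1.0.
    """
    ax, ay = anchor
    pairs = [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]
    # Pair 0 is the box origin, so it is shifted by the anchor position.
    origin = (pairs[0][0] + ax, pairs[0][1] + ay)
    # Pair 1 is width and height; these are sizes, so they are not shifted.
    size = pairs[1]
    # Pairs 2 to 7 are facial landmarks, shifted the same way as the origin.
    landmarks = [(x + ax, y + ay) for x, y in pairs[2:]]
    return {"origin": origin, "size": size, "landmarks": landmarks}


def decode_outputs(raw_boxes, raw_scores, anchors):
    """Decode every anchor and keep its score alongside the coordinates."""
    detections = []
    for raw, score, anchor in zip(raw_boxes, raw_scores, anchors):
        detection = decode_anchor(raw, anchor)
        detection["score"] = score
        detections.append(detection)
    return detections
```

The result is one entry per anchor, which is the soup of overlapping candidates that the next stage has to clean up.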

The next challenge is reducing this set of overlapping boxes into a single one for each real object detection. That’s where the non-max suppression algorithm comes in.

The initial step is to sort the boxes with the highest scores first. After that, we find all the boxes that overlap significantly and merge them together. The exact methods we use to determine if the overlap is significant can be seen in the `overlap_similarity()` function. The merging process either involves just taking the top-scoring box from an overlapping set (`unweighted_non_max_suppression()`) or averaging all the boxes and features in a set, weighted by their score (`weighted_non_max_suppression()`). And that’s how non-max suppression works!
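For readers who want to see the whole flow in one place, here is a compact pure-Python sketch of both variants. It is an illustration rather than the Colab's reference code: the function names are different on purpose, it uses intersection-over-union as the overlap measure (an assumption, since other similarity measures are possible), boxes are `(x_min, y_min, x_max, y_max)` tuples, and scores are assumed to be positive.

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def group_overlapping(detections, threshold):
    """Split (score, box) detections into groups led by their top scorer."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    groups = []
    while remaining:
        best, rest = remaining[0], remaining[1:]
        group, remaining = [best], []
        for det in rest:
            # Boxes that overlap the leader join its group; others wait.
            if iou(best[1], det[1]) > threshold:
                group.append(det)
            else:
                remaining.append(det)
        groups.append(group)
    return groups


def nms_unweighted(detections, threshold=0.3):
    """Keep only the top-scoring box from each overlapping group."""
    return [group[0] for group in group_overlapping(detections, threshold)]


def nms_weighted(detections, threshold=0.3):
    """Average each group's boxes, weighted by score; keep the top score."""
    merged = []
    for group in group_overlapping(detections, threshold):
        total = sum(score for score, _ in group)
        box = tuple(sum(s * b[i] for s, b in group) / total for i in range(4))
        merged.append((group[0][0], box))
    return merged


detections = [
    (0.75, (0, 0, 10, 10)),
    (0.25, (1, 1, 11, 11)),    # overlaps the first box heavily
    (0.5, (50, 50, 60, 60)),   # a separate object
]
print(nms_unweighted(detections))
print(nms_weighted(detections))
```

The unweighted version is simpler and keeps boxes that the model actually produced, while the weighted version tends to give steadier positions because every overlapping candidate contributes. The 0.3 threshold is just a placeholder to tune for your model.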