Try OpenAI’s Amazing Whisper Speech Recognition in a Free Web App


You may have noticed that I’m obsessed with open source speech recognition, so I was very excited when OpenAI released a new voice model. I’m even more excited now that I’ve had a chance to play with it. The accuracy is extremely impressive, especially as it’s multi-language. OpenAI have done a great job packaging it, and you can install it straight from pip if you’re a Linux shell user, but I wanted to find a way to let anybody try it for themselves from a web browser, even if they’re not developers. I love Google’s Colab service, and luckily somebody had already created a notebook showing the basics of using the Whisper model. I added some documentation and test files, and now you can give it a try for yourself by opening this Colab link.

Follow the directions, and after a minute or so you’ll see a button at the bottom of the page where you can record your own audio, and see a transcript. Give it a try, I think you’ll be impressed too!

How to build Raspberry Pi Pico programs with no software installation

I love using the Raspberry Pi Pico board to teach students about microcontrollers, especially as it only costs $4 and is currently in stock despite the supply chain crisis. I have run into some problems though, because building a program requires installing software. This might not sound like a big barrier, but when people arrive with a mix of Windows, macOS, ChromeOS, and Linux laptops, often with different versions or architectures within each group, guiding them through the process can easily take a whole lesson. It also requires individual attention from me to debug each particular problem while the other students get bored. It’s frustrating for the class to have to wait an hour before they get to do anything cool, and I much prefer giving them a success as early as possible.

To solve this problem, I’ve actually turned to what might seem an unlikely tool, Google’s Colab service. If you have run across this, you probably associate it with Python notebooks, because that’s its primary use case. I’ve found it to be useful for a lot more though, because it effectively gives you a free, temporary Linux virtual machine that you control through the browser. Instead of running Python commands, you can run Linux shell commands by putting an exclamation point at the start. There are some restrictions, such as needing a Google account to sign in, and the file system disappearing after you leave the page or are idle too long, but I’ve found it great for documenting all sorts of installation and build processes in an accessible way.

I’m getting ready to teach EE292D (TinyML) at Stanford again this year, but we’re switching over to the Pico boards instead of the Arduino Nano 33 BLE Sense boards that we have used, because the latter have been out of stock for quite a while. As part of that, I wanted to have an easy getting-started guide for the students to help them build and run their first program. I put together a Colab notebook that follows the steps in the great Pico Getting Started Guide: installing the SDK and examples, then building blink and running it on a board. To give some extra guidance, I also recorded the YouTube video above. Please excuse the hair and occasional distraction, as I did it in a hurry.

It’s not a complete solution. Students will still need to install OS-specific software to access debug logs, it requires a Google login that’s not available for kids under 13, and the vanishing file system will cause frustration if they don’t remember to save their code. I do like it as a simple way to give them a win in just a few minutes, though. There’s nothing like seeing that first LED blink on a new board, and I still get a kick out of it myself!

Why isn’t there more training on the edge?

One of the most frequent questions I get asked by people exploring machine learning beyond cloud and desktop machines is “What about training?”. If you look around at the popular frameworks and use cases of edge ML, most of them seem focused on inference. It isn’t obvious why this is the case though, so I decided to collect my notes in a post here, so I can have something to refer to when this comes up (and organize my own thoughts too!).

No Labels

I think the biggest reason that there’s not more training on the edge is that most models need to be trained through supervised learning, that is, each sample used for training needs a ground truth label. If you’re running on a phone or embedded system, there’s not likely to be an easy way to attach a label to incoming data, other than running an existing model and guessing. You need a person to look at an image, or listen to an audio recording, to identify what the prediction should be, before you can use it in training. You also generally need a fairly large number of labels per class for training to be effective.

This may change as semi-supervised or unsupervised approaches continue to improve, but right now supervised training is the most reliable method to get a model for most applications. I have seen some interesting hacks to guess labels on the edge though, that might fall into the semi-supervised category. For example, you can use temporal consistency on video frames to infer mistakes. In concrete terms, if your camera is identifying a fruit as a lemon for ten frames, then for one frame it’s a lime, and then it’s back to a lemon, you can guess that the lime prediction was an error (assuming the frame rate is high enough, fruits aren’t flying by at supersonic speed, and so forth). Another clever use of time was in an audio wake word application, where if there was a near-detection (the model gave a score just below the threshold) followed soon after by an actual detection (over the threshold) then the system would guess that the person had actually said the wake word the first time, and the model had failed to recognize it. This hack relies on the human behavior of trying again if it didn’t work initially.
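Both of these label-guessing hacks can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product does it: the function names, the two-frame neighbor window, the 0.8 detection threshold, the 0.2 "near miss" margin, and the five-second retry window are all hypothetical values chosen for the example.

```python
from typing import List, Tuple


def flag_isolated_predictions(labels: List[str], window: int = 2) -> List[int]:
    """Return indices of frames whose label disagrees with unanimous neighbors.

    A frame is flagged when the `window` frames before it and the `window`
    frames after it all share one label, and that label differs from the
    frame's own. The window size is an assumption for this sketch.
    """
    suspects = []
    for i in range(window, len(labels) - window):
        neighbors = labels[i - window:i] + labels[i + 1:i + 1 + window]
        if len(set(neighbors)) == 1 and labels[i] != neighbors[0]:
            suspects.append(i)
    return suspects


def find_probable_misses(
    scores: List[Tuple[float, float]],
    threshold: float = 0.8,       # assumed detection threshold
    near_margin: float = 0.2,     # assumed width of the "near miss" band
    retry_window_s: float = 5.0,  # assumed time a person takes to try again
) -> List[float]:
    """Return timestamps of near-detections followed soon by a real detection.

    `scores` is a time-ordered list of (timestamp_seconds, model_score).
    """
    misses = []
    for i, (t, score) in enumerate(scores):
        is_near = threshold - near_margin <= score < threshold
        if not is_near:
            continue
        for later_t, later_score in scores[i + 1:]:
            if later_t - t > retry_window_s:
                break  # too late to count as a retry
            if later_score >= threshold:
                misses.append(t)
                break
    return misses


frames = ["lemon"] * 10 + ["lime"] + ["lemon"] * 3
print(flag_isolated_predictions(frames))

wake_scores = [(0.0, 0.1), (1.0, 0.7), (3.0, 0.9), (20.0, 0.65), (40.0, 0.95)]
print(find_probable_misses(wake_scores))
```

The flagged frames and timestamps are only guesses at labels, so anything built on them inherits the assumptions listed in the text (a high enough frame rate, people retrying soon after a failure).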

Quality Control

Getting models to work well within an application is very hard even when you are training a single version and putting it through testing before release. If an edge model is retrained, it will be very hard to predict the bounds of its behavior. Since this will affect how well your application works, training on the fly makes ensuring it behaves correctly much harder. This isn’t a complete blocker. There are clearly some products (like GBoard) that do manage to handle this problem, but they generally build some kind of guard rails around what the model can produce. For example, something that predicts words or sentences might have a block-list of banned words (such as hateful or obscene phrases) that will be scrubbed from a model’s output even if edge training causes it to start producing them.
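A block-list guard rail of this kind can be as simple as a filter that runs after the model. The sketch below is a hypothetical illustration: the function name, the whole-word matching rule, and the placeholder block-list are assumptions made for the example, and a real product would need to handle phrases, misspellings, and multiple languages.

```python
import re
from typing import Iterable, List


def scrub_predictions(candidates: Iterable[str], blocked_words: Iterable[str]) -> List[str]:
    """Drop any candidate that contains a blocked word as a whole word.

    Matching is case-insensitive and word-based, so "jerky" is not blocked
    by "jerk". This matching rule is an assumption of the sketch.
    """
    blocked = {word.lower() for word in blocked_words}
    kept = []
    for candidate in candidates:
        words = re.findall(r"[a-z']+", candidate.lower())
        if not any(word in blocked for word in words):
            kept.append(candidate)
    return kept


suggestions = ["have a nice day", "you are a Jerk", "jerky is tasty"]
print(scrub_predictions(suggestions, ["jerk"]))
```

Because the filter sits outside the model, it keeps working no matter how the weights change after on-device updates.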

This kind of post-processing is often needed even when using pre-trained models on the edge (I could probably fill a decent book with all the hacks that usually go into filtering and interpreting the raw model output to make it useful) but the presence of a model that can change in unpredictable ways makes it even harder. Nobody wants to be responsible for building another Tay.


When you set up a new phone, you’ll probably speak the assistant wake word a few times to help the system learn your voice. In my experience this doesn’t involve retraining in the sense of full back propagation. Instead, the “Is this audio a wake word?” model produces an embedding vector as its output, and that is then used in a nearest-neighbor lookup to compare to the embeddings from the first few utterances you spoke during setup. This is a surprisingly common technique across a lot of domains, because it is comparatively simple to implement, only requires storing a few values, and works robustly.
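Here is a minimal sketch of that nearest-neighbor comparison, assuming the embeddings are plain lists of floats. Cosine similarity and the 0.9 cutoff are choices made for this example (the text above doesn’t say which distance measure real assistants use), and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, or 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_enrolled_speaker(
    embedding: Sequence[float],
    enrolled: Sequence[Sequence[float]],
    min_similarity: float = 0.9,  # assumed cutoff, would be tuned per product
) -> bool:
    """True if the new embedding is close to any embedding stored at setup."""
    best = max((cosine_similarity(embedding, e) for e in enrolled), default=0.0)
    return best >= min_similarity


# Embeddings captured from the utterances spoken during setup (made-up values).
enrolled = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]
print(matches_enrolled_speaker([0.95, 0.05, 0.05], enrolled))
print(matches_enrolled_speaker([0.0, 1.0, 0.0], enrolled))
```

The only state kept on the device is the short list of enrolled vectors, which is why this approach is so cheap compared to retraining.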

I’ve found embeddings to be a fantastic general purpose tool for customizing models on the edge, without requiring the full machinery of back propagation. The gradient descent approach used by modern deep learning needs high precision (usually floating point) weight arrays, along with specialized operators to run the back-prop version of each layer. The weights need to be stored between updates, and since they’re higher precision than is required for inference they take up more space than an inference-optimized model, and you’ll usually want to keep a copy of the original weights around in case you need to reset the model too. By contrast, you can often extract an embedding from an existing model just by reading the activation layer before the final fully-connected op that does the classification. Even though specialized loss functions exist to try to encourage embeddings with desired properties, like good spatial separation, I’ve found that training with a regular softmax and lopping off the last layer often works just as well in practice.
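To make "reading the activations before the final fully-connected op" concrete, here is a toy classifier written in plain Python with hand-picked weights. Everything in it (the `TinyClassifier` class, the layer sizes, the weight values) is hypothetical and exists only to show where the embedding comes from; a real model would be built in a framework and trained with softmax as described above.

```python
import math
from typing import List


def dense(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Fully-connected layer: one output per weight row."""
    return [sum(i * w for i, w in zip(inputs, row)) + b for row, b in zip(weights, biases)]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softmax(values: List[float]) -> List[float]:
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class TinyClassifier:
    """Two-layer network: hidden layer, then a final fully-connected classifier."""

    def __init__(self, hidden_w, hidden_b, out_w, out_b):
        self.hidden_w, self.hidden_b = hidden_w, hidden_b
        self.out_w, self.out_b = out_w, out_b

    def embed(self, inputs: List[float]) -> List[float]:
        # The embedding is just the activations feeding the last layer.
        return relu(dense(inputs, self.hidden_w, self.hidden_b))

    def classify(self, inputs: List[float]) -> List[float]:
        # Normal inference path: same embedding, plus the final layer and softmax.
        return softmax(dense(self.embed(inputs), self.out_w, self.out_b))


model = TinyClassifier(
    hidden_w=[[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]],
    hidden_b=[0.0, 0.0, 0.0],
    out_w=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    out_b=[0.0, 0.0],
)
print(model.embed([2.0, 1.0]))
print(model.classify([2.0, 1.0]))
```

Note that `embed` needs no extra weights or operators beyond what inference already uses, which is the appeal compared to shipping back-prop machinery.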


Of course, there are examples of very successful products that do use training on the edge. I already mentioned GBoard, which is the poster child for federated learning, but another domain where I’ve seen a lot of use is in anomaly detection, particularly around predictive maintenance for machinery. This is an application where it seems like every machine behaves differently, so learning “normal behavior” (by observing the first 24 hours of vibrations and labeling those as normal) allows the adaptation needed to spot deviations from those initial patterns. I’ve also seen interesting research projects around security and communications protocols that are looking at using training on the edge to be more robust to changing environmental conditions.
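The "learn normal, then flag deviations" idea can be illustrated with a deliberately simple stand-in: fit a mean and standard deviation to the initial readings, then flag anything too many standard deviations away. This z-score rule is a substitute chosen for brevity, not the method used by any product mentioned here, and the class name, the threshold of 3.0, and the sample readings are all assumptions.

```python
import statistics
from typing import Sequence


class VibrationBaseline:
    """Learns what 'normal' looks like from an initial observation period."""

    def __init__(self, normal_readings: Sequence[float], z_threshold: float = 3.0):
        # Everything seen during the initial period is treated as normal.
        # Needs at least two readings so a standard deviation exists.
        self.mean = statistics.fmean(normal_readings)
        self.stdev = statistics.stdev(normal_readings)
        self.z_threshold = z_threshold  # assumed cutoff for this sketch

    def is_anomalous(self, reading: float) -> bool:
        """True if the reading is far from the learned baseline."""
        if self.stdev == 0.0:
            return reading != self.mean
        z_score = abs(reading - self.mean) / self.stdev
        return z_score > self.z_threshold


# Stand-in for the first 24 hours of vibration amplitudes (made-up values).
baseline = VibrationBaseline([1.0, 1.1, 0.9, 1.05, 0.95])
print(baseline.is_anomalous(1.02))
print(baseline.is_anomalous(2.0))
```

Each machine gets its own baseline, so two machines with very different normal vibration levels can run the same code and still spot their own deviations.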


The short answer to the question is that if you’re getting started with ML on the edge, training models there is unlikely to be useful in the short or medium term. Technology keeps changing, and I am seeing some interesting applications starting to emerge, but I feel like a lot of the interest in edge training comes from how prominent training is in the cloud world. I often joke that all ML architecture researchers could go on strike indefinitely, and ML engineers would still have decades of productive work ahead of us. There are many better-motivated problems around deployment on the edge than bringing training up to server capabilities, and I bet your product will hit some of those long before training becomes an issue.