What happens when the real Young Lady’s Illustrated Primer lands in China?

I love Brad DeLong’s writing, but I did a double take when he recently commented “‘A Young Lady’s Illustrated Primer’ continues to recede into the future”. The Primer he’s referencing is an electronic book from Neal Stephenson’s Diamond Age novel, an AI tutor designed to educate and empower children, answering their questions and shaping their characters with stories and challenges. It’s a powerful and appealing idea in a lot of ways, and offers a very compelling use case for conversational machine learning models. I also think that a workable version of it now exists.

The recent advances with large language models have amazed me, and I do think we’re now a lot closer to an AI companion that could be useful for people of any age. If you try entering “Tell me a story about a unicorn and a fairy” into ChatGPT you’ll almost certainly get something more entertaining and coherent than most adults could come up with on the fly. This model comes across as a creative and engaging partner, and I’m certain that we’ll be seeing systems aimed at children soon enough, for better or worse. It feels like a lot of the functionality of the Primer is already here, even if the curriculum and veracity of the responses are lacking.

One of the reasons I like Diamond Age so much is that it doesn’t just describe the Primer as a technology, it looks hard at its likely effects. Frederik Pohl wrote that “a good science fiction story should be able to predict not the automobile but the traffic jam”, and Stephenson shows how subversive a technology that delivers information in this new way can be. The owners of the Primer grow up indoctrinated by its values and teachings, and eventually become a literal army. This is portrayed in a positive light, since most of those values are ones that a lot of Western educated people would agree with, but it’s also clear that Stephenson believes that the effects of a technology like this would be incredibly disruptive to the status quo.

How does this all relate back to ChatGPT? Try asking it “Tell me about Tiananmen Square” and you’ll get a clear description of the 1989 government crackdown that killed hundreds or even thousands of protestors. So what, you might ask? We’ve been able to type the same query into Google or Wikipedia for decades to get uncensored information. What’s different about ChatGPT? My friend Drew Breunig recently wrote an excellent post breaking down how LLMs work, and one of his side notes is that they can be seen as an extreme form of lossy compression for all the data that they’ve seen during training. The magic of LLMs is that they’ve effectively shrunk a lot of the internet’s text content into a representation that’s a tiny fraction of the size of the original. A model like LLaMa might have been exposed to over a trillion words during training, but it fits into a 3.5GB file, easily small enough to run locally on a smartphone or Raspberry Pi. That means the “Tiananmen Square” question can be answered without having to send a network request. No cloud, wifi, or cell connection is needed!

If you’re trying to control the flow of information in an authoritarian state like China, this is a problem. The Great Firewall has been reasonably effective at preventing ordinary citizens from accessing cloud-based services that might contradict CCP propaganda because they’re physically located outside of the country, but monitoring apps that run entirely locally on phones is going to be a much tougher challenge. One approach would be to produce alternative LLMs that only include approved texts, but as the “large” in the name implies, training these models requires a lot of data. Labeling all that data would be a daunting technical project, and the end results are likely to be less useful overall than an uncensored version. You could also try to prevent unauthorized models from being downloaded, but because they’re such useful tools they’re likely to show up preloaded in everything from phones to laptops and fridges.

This local aspect of the current AI revolution isn’t often appreciated, because many of the demonstrations show up as familiar text boxes on web pages, just like the cloud services we’re used to. It starts to become a little clearer when you see how models like LLaMa and Stable Diffusion can be run locally as desktop apps, or even on Raspberry Pis, but these are currently pretty slow and clunky. What’s going to happen over the next year or two is that the models will be optimized and start to match or even outstrip the speed of the web applications. The elimination of cloud bills for server processing and improved latency will drive commercial providers towards purely edge solutions, and the flood of edge hardware accelerators will narrow the capability gap between a typical phone or embedded system and a GPU in a data center.

Simply put, people all over the world are going to be learning from their AI companions, as rudimentary as they currently are, and censoring information is going to be a lot harder when the whole process happens on the edge. Local LLMs are going to change politics all over the world, but especially in authoritarian states that try to keep strict controls on information flows. The Young Lady’s Illustrated Primer is already here, it’s just not evenly distributed yet.

Notes from a bank run

Photo by Gopal Vijayaraghavan

My startup, Useful Sensors, has all of its money in Silicon Valley Bank. There are a lot of things I worry about as a CEO, but assessing SVB’s creditworthiness wasn’t one of them. It clearly should have been. I don’t have any grand theories about what’s happened over the last few days, but I wanted to share some of my experiences as someone directly affected.

To start with, Useful is not at risk of shutting down. The worst case scenario, as far as I can tell, is that we only have access to the insured amount of $250k in our SVB account on Monday. This will be plenty for payroll on Wednesday, and from what I’ve seen there are enough liquid assets that selling the government bonds that triggered the whole process should return a good portion of the remaining balance within a week or so. If I need to, I’ll dip into my personal savings to keep the lights on. I know this isn’t true for many other startups though, so if they don’t get full access to their funds there will be job losses and closures.

Although we’re not going to close, it is very disruptive to our business. Making sure that our customers are delighted and finding more of them should be taking all of our attention. Instead I spent most of Thursday and Friday dealing with a rapidly changing set of recommendations from our investors, attempting to move money and open new accounts, and now I’m discovering the joys of the FDIC claims process. I’m trying to do this all while I’m flying to Germany for Embedded World to announce a new distribution deal with OKdo, and this blog post is actually written from an airport lounge in Paris. Longer term, depending on the ultimate outcome, it may affect when we want to raise our next round. To be clear, we’re actually in a great position compared to many others (I’m an old geezer with savings), but long-term planning at a startup is hard enough without extra challenges like this thrown in.

It has been great having access to investors and founders who are able to help us in practical ways. We would never have been able to open a new account so quickly without introductions to helpful staff at another bank. I’ve been glued to the private founder chat rooms where people have shared their experiences with things like the FDIC claims process and pending wires. This kind of rapid communication and sharing of information is what makes Silicon Valley such a good place to build a startup, and I’m very grateful for everyone’s help.

Having said that, the Valley’s ability to spread information and recommendations quickly was one of the biggest causes of SVB’s demise. I’ve always been a bit of a rubbernecker at financial disasters, and I’d read enough books on the 2008 financial crisis to understand how bank runs happen. It was strange being in one myself though, because the logic of “everyone else is pulling their money so you’d better too before it’s all gone” is so powerful, even though I knew this mentality was a self-fulfilling prophecy. I planned on what I hoped was a moderate course of action, withdrawing some of our funds from SVB to move to another institution to gain some diversification, but by the time I was able to set up the transfer it was too late.

Technology companies aren’t the most sympathetic victims in the current climate, for many good reasons. I thought this story covered the political dimensions of the bank failure well. The summary is that many taxpayers hate the idea of bailing out startups, especially ones with millions in their bank accounts. There are a lot of reasons why I think we’ll all benefit from not letting small businesses pay the price for bank executives messing up their risk management, but they’re all pretty wonky and will be a hard sell. However the alternative is a world where only the top two or three banks in the US get most of the deposits, because they’re perceived as too big to fail. If no financial regulator spotted the dangers with SVB, how can you expect small business owners to vet banks themselves? We’ll all just end up going to Citibank or JPMorgan, which increases the overall systemic risk, as we saw in 2008.

Anyway, I just want to dedicate this to all of the founders having a tough weekend. Startups are all about dealing with risks, but this is a particularly frustrating problem to face because it’s so unnecessary. I hope at least we’ll learn more over the next few weeks about how executives and regulators let a US bank with $200 billion in assets get into such a sorry state.

Go see Proxistant Vision at SFMCD

When I think of a museum with “craft” in its name, I usually imagine an institution focused on the past. San Francisco’s Museum of Craft and Design is different. Their mission is to “bring you the work of the hand, mind and heart”, and Bull.Miletic’s Proxistant Vision exhibition is a wonderful example of how their open definition of craft helps them find and promote startling new kinds of art.

When I first walked into the gallery space I was underwhelmed. There were three rooms with projectors, but the footage they were showing was nearly monochrome and I didn’t find much to connect with. I was intrigued by some of the rigs for the projectors though, with polyhedral mirrors and a cart that whirred strangely. I’m glad I had a little patience, because all of the works turned out to have their own life and animation beyond anything I’d seen before.

The embedded video tries to capture my experience of one of the rooms, Ferriscope. The artists describe it as a kinetic video installation, and at its heart is a mirror that can direct the projector output in a full vertical circle, on two walls, the floor, and ceiling, with a speed that can be dizzying. Instead of the view staying static as people ride a wheel, we stay still while what we see goes flying by. It’s very hard to do justice to the impact this has with still images or even video. The effect is confusing but mesmerizing, and it forced me to look in a different way, if that makes sense?

There are two other installations as part of the show, Venetie 11111100110, and Zoom Blue Dot. I won’t spoil the enjoyment by describing too much about their mechanics, but they both play with moving and fragmenting video using mirrors and robotics. They aren’t as immediately startling as Ferriscope, but they drew me in and forced me to look with a fresh eye at familiar scenes. To my mind that’s the best part about all these works: they shook me out of being a passive observer and consumer of images, and I was suddenly on unsteady ground with an uncertain viewpoint. You don’t get to stand stroking your chin in front of these installations, you have to engage with the work in a much more active way.

The exhibition is open until March 19th 2023, and I highly recommend you go visit. You won’t be disappointed, though you may be disoriented!

Online Gesture Sensor Demo using WASM

If you’ve heard me on any podcasts recently, you might remember I’ve been talking about a Gesture Sensor as the follow up to our first Person Sensor module. One frustrating aspect of building hardware solutions is that it’s very tough to share prototypes with people, since you usually have to physically send them a device. To work around that problem, we’ve been experimenting with compiling the same C++ code we use on embedded systems to WASM, a web-friendly intermediate representation that runs in all modern browsers. By hooking up a webcam as the input instead of an embedded camera module, and displaying the output dynamically on a web page, we can provide a decent approximation of how the final device will work. There are obviously some differences, since a webcam is going to produce higher-quality images than an embedded camera module and the latency will vary, but it’s been a great tool for prototyping. I also hope it will help spark makers’ and manufacturers’ imaginations, so we’ve released it publicly at gesture.usefulsensors.com.

On that page you’ll find a quick tutorial, and then you’ll have the opportunity to practice the four gestures that are supported initially. This is not the final version of the models or the interface logic (you’ll see false positives that would be problematic in production, for example), but it should give you an idea of what we’re building. My goal is to replace common uses of a TV remote control with simple, intuitive gestures like palm-forward for pause, or finger to the lips for mute. I’d love to hear from you if you know of manufacturers who would like to integrate something like this, and we hope to have a hardware version available soon so you can try it for your own projects. If you are at CES this year, come visit me at LVCC IoT Pavilion Booth #10729, where my colleagues and I will be showing off some of our devices together with Teksun.

Short Links

Years ago I used to write regular “Five Short Links” posts but I gave up as my Twitter account became a better place to share updates, notes, and things I found interesting from around the internet. Now that Twitter is Nazi-positive I’m giving up on it as a platform, so I’m going to try going back to occasional summary posts here instead.

Person Sensor back in stock on SparkFun. Sorry for all the delays in getting our new sensors to everyone who wanted them, but we now have a new batch available at SparkFun, and we hope to stay ahead of demand in the future. I’ve also been expanding the Hackster project guides with new examples like face-following robot cars and auto-pausing TV remote controls.

Blecon. It can be a little hard to explain what Blecon does, but my best attempt is that it allows BLE sensors to connect to the cloud using people’s phones as relays, instead of requiring a fixed gateway to be installed. The idea is that in places like buildings where staff will be walking past rooms with sensors installed, special apps on their phones can automatically pick up and transmit recorded data. This becomes especially interesting in places like hotels, where management could be alerted to plumbing problems early, without having to invest in extra infrastructure. I like this because it gets us closer to the idea of “peel and stick” sensors, which I think will be crucial to widespread deployment.

Peekaboo. I’ve long been a fan of CMU’s work on IoT security and privacy labels, so it was great to see this exploration of a system that gives users more control over their own data.

32-bit RISC-V MCU for $0.10. It’s not as cheap as the Padauk three-cent MCU, but the fact that it’s 32-bit, with respectable amounts of flash, SRAM, and I/O, makes it a very interesting part. I bet it would be capable of running many of the Hackster projects, for example, and since it supports I2C it should be able to talk to a Person Sensor. With processors this low cost, we’ll see a lot more hardware being replaced with software.

Hand Pose using TensorFlow JS. I love this online demo from MediaPipe, showing how well it’s now possible to track hands with deep learning approaches. Give the page permission to access your camera and then hold your hands up, you should see rather accurate and detailed hand tracking!

Why is it so difficult to retrain neural networks and get the same results?

Photo by Ian Sane

Last week I had a question from a colleague about reproducibility in TensorFlow, specifically in the 1.14 era. He wanted to be able to run the same training code multiple times and get exactly the same results, which on the surface doesn’t seem like an unreasonable expectation. Machine learning training is fundamentally a series of arithmetic operations applied repeatedly, so what makes getting the same results every time so hard? I had the same question when we first started TensorFlow, and I was lucky enough to learn some of the answers from the numerical programming experts on the team, so I want to share a bit of what I discovered.

There are good guides to achieving reproducibility out there, but they don’t usually include explanations for why all the steps involved are necessary, or why training becomes so slow when you do apply them. One reason reproducibility is so hard is that every single calculation in the process has the potential to change the end results, which means every step is a potential weak link. This means you have to worry about everything from random seeds (which are actually fairly easy to make reproducible, but can be hard to locate in a big code base) to code in external libraries. CuDNN doesn’t guarantee determinism by default for example, nor do some operations like reductions in TensorFlow’s CPU code.
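
To make the seed-handling part concrete, here’s a minimal sketch of the kind of setup those guides recommend. It uses current TensorFlow 2.x calls rather than the 1.14-era flags my colleague was fighting with, so treat the exact API names as version-dependent:

import os
import random

import numpy as np
import tensorflow as tf

SEED = 42

# Seed every random number generator the training code might touch.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Ask TensorFlow to pick deterministic kernels where they exist.
# (Available from TF 2.9 onwards; older versions relied on the
# TF_DETERMINISTIC_OPS=1 environment variable instead.)
tf.config.experimental.enable_op_determinism()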

It was the code execution part that confused me the most. I could understand the need to set random seeds correctly, but why would any numerical library with exactly the same inputs sometimes produce different outputs? It seemed like there must be some horrible design flaw to allow that to happen! Thankfully my teammates helped me wrap my head around what was going on.

The key thing I was missing was timing. To get the best performance, numerical code needs to run on multiple cores, whether on the CPU or the GPU. The important part to understand is that how long each core takes to complete is not deterministic. Lots of external factors, from the presence of data in the cache to interruptions from multi-tasking, can affect the timing. This means that the order of operations can change. Your high school math class might have taught you that x + y + z will produce the same result as z + y + x, but in the imperfect world of floating point numbers that ain’t necessarily so. To illustrate this, I’ve created a short example program in the Godbolt Compiler Explorer.

#include <stdio.h>

float add(float a, float b) {
    return a + b;
}

int main(void) {
    // x and y are each too small to change 1.0f on their own, but their
    // sum is just large enough to survive the rounding.
    float x = 0.00000005f;
    float y = 0.00000005f;
    float z = 1.0f;

    // The same three values, added left to right in different orders.
    float result0 = x + y + z;
    float result1 = z + x + y;

    printf("%.9g\n", result0);
    printf("%.9g\n", result1);

    // Equivalent sums with the grouping made explicit through function calls.
    float result2 = add(add(x, y), z);
    float result3 = add(x, add(y, z));

    printf("%.9g\n", result2);
    printf("%.9g\n", result3);
}

You might not guess exactly what the result will be, but most people’s expectations are that result0, result1, result2, and result3 should all be the same, since the only difference is in the order of the additions. If you run the program in Godbolt though, you’ll see the following output:

ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 0
1.00000012
1
1.00000012
1

So what’s going on? The short answer is that floating point numbers only have so much precision, and if you try to add a very small number to a much larger one, there’s a limit below which the small number’s contribution is lost in the rounding, so the addition has no effect. In this example, I’ve set things up so that 0.00000005 is below that limit for 1.0: if you do the 1.0 + 0.00000005 operation first, the sum rounds back to 1.0 and the result doesn’t change. However, if you do 0.00000005 + 0.00000005 first, this produces an intermediate sum of 0.0000001, which is large enough to survive the rounding when added to 1.0, and so it does affect the result.

This might seem like an artificial example, but most of the compute-intensive operations inside neural networks boil down to a long series of multiply-adds. Convolutions and fully-connected calculations will be split across multiple cores either on a GPU or CPU. The intermediate results from each core are often accumulated together, either in memory or registers. As the timing of the results being returned from each core varies, the order of addition will change, just as in the code above. Neural networks may require trillions of operations to be executed in any given training run, so it’s almost certain that these kinds of edge cases will occur and the results will vary.
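
To see that accumulation-order effect without any GPU involved, here’s a small sketch in Python with numpy (purely for illustration) that sums the same million float32 values in two different orders, standing in for cores returning their partial results at different times:

import numpy as np

# A million random float32 values, summed in two different orders.
rng = np.random.default_rng(0)
values = rng.standard_normal(1_000_000).astype(np.float32)

in_order = np.float32(0.0)
for v in values:
    in_order += v

shuffled = np.float32(0.0)
for v in values[rng.permutation(len(values))]:
    shuffled += v

print(in_order, shuffled)                # typically differ slightly
print(np.sum(values, dtype=np.float64))  # higher-precision reference

Switching the accumulators to 64-bit floats shrinks the difference dramatically, but as the next paragraph explains, it doesn’t eliminate it.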

So far I’ve been assuming we’re running on a single machine with exactly the same hardware for each run, but as you can imagine not only will the timing vary between platforms, but even things like the cache sizes may affect the dimensions of the tiles used to optimize matrix multiplies across cores, which will add a lot more opportunities for differences. You might be tempted to increase the size of the floating point representation from 32 bits to 64 to address the issue, but this only reduces the probability of non-determinism rather than eliminating it entirely. It will also have a big impact on performance.

Properly addressing these problems requires writers of the base math functions like GEMM to add extra constraints to their code, which can be a lot of work to get right, since testing anything timing-related and probabilistic is so complex. Since all the operations that you use in your network need to be modified and checked, the work usually requires a dedicated team to fix and verify all the barriers to reproducibility. These requirements also conflict with many of the optimizations that reduce latency on multi-core systems, so the functions become slower. In the guide above, the total time for a training run went from 28 to 105 minutes once all the modifications needed to ensure reproducibility were made.

I’m writing this post because I find it fascinating that our systems have become so complex that an algorithm like matrix multiply, which can be described in a few lines of pseudo-code, can have implementations that produce such surprising and unexpected results. I’m barely scratching the surface of all the complexities of floating point math here, but many other causes of similar issues are emerging as we do more and more calculations on massive distributed networks of computers. Honestly, that’s one reason I enjoy working on embedded systems so much these days: I have a lot more confidence in my mental model of how those chips work. It does feel like it’s no longer possible for any one person to have a full and deep understanding of the whole stack on any modern platform, even just for software. I love digging into the causes of weird and surprising properties like non-determinism in training because it helps me understand more about what I don’t know!

Machines of Loving Understanding

Sparko, the world’s first electrical dog, as he looked on arrival at the engineer’s club, New York City, on his way to the World’s fair, where he will be an attraction at the Westinghouse Building. He walks, barks, wags his tail and sits up to beg. With Sparko, is Elektro, Westinghouse mechanical man. Both are creations of J.M. Barnett, Westinghouse engineer of Mansfield, OH.
I like to think
(it has to be!)
of a cybernetic ecology
where we are free of our labors
and joined back to nature,
returned to our mammal
brothers and sisters,
and all watched over
by machines of loving grace.

Brautigan’s poem inspires and terrifies me at the same time. It’s a reminder of how creepy a world full of devices that blur the line between life and objects could be, but there’s also something appealing about connecting more closely to the things we build. Far more insightful people than me have explored these issues, from Mary Shelley to Phillip K. Dick, but the aspect that has fascinated me most is how computers understand us.

We live in a world where our machines are wonderful at showing us near-photorealistic scenes in real time, and can even talk to us in convincing voices. Up until recently though, they’ve not been able to make sense of images or audio that are given to them as inputs. We’ve been able to synthesize voices for decades, but speech recognition has only really started working well in the last few years. Computers have been like Crocodile Sales Reps, with enormous mouths and tiny ears, great at talking but terrible at listening. That means they can be immensely frustrating to deal with, since they seem to have no ability to do what we mean. Instead we have to spend a lot of time painstakingly communicating our needs in a form that makes sense to them, even if it is unnatural for us.

This process started with toggling switches on a control panel, moved to punch cards, teletypes, CRT terminals, mouse-driven GUIs, swiping on a touch screen and most recently basic voice interfaces. Each of these steps was a big advance, but compared to how we communicate with other people, or even our pets, they still feel clumsy.

What has me most excited about all the recent advances in machine learning is that they’re starting to give computers the ability to understand us in a much deeper and more natural way. The video above shows just a small robot that I built for a few dollars as a technology demonstration, but because it followed my face around, I ended up becoming quite attached. It was exhibiting behavior that we associate with people or animals who like us. Even though I knew it was just code underneath, it was hard not to see it as a character instead of an object. It became a Pencil named Steve.

Face following is a comparatively simple ability, but it’s enough to build more useful objects like a fan that always points at you, or a laptop screen that locks when nobody is around. As one of the comments says, the fan is a bit creepy. I believe this is because it’s an object that’s exhibiting attributes that we associate with living beings, entering the Uncanny Valley. The googly eyes probably didn’t help. The confounding part is that the property that makes it most creepy is the same thing that makes it helpful.

We’re going to see more and more of these capabilities making it into everyday objects (at least if I have anything to do with it) so I expect the creepiness and usefulness will keep growing in parallel too. Imagine a robot vacuum that you can talk to naturally and it will respond, that you can shoo away or control with hand gestures, and that follows you around while you’re eating to pick up crumbs you drop. Doesn’t that sound a lot like a dog? All of these behaviors help it do its job better, it’s understanding us in a more natural way instead of expecting us to learn its language, but they also make it feel a lot more alive. Increased understanding goes hand in hand with creepiness.

This already leads to a lot of unresolved tension in our relationships with voice assistants. 79% of Americans believe they spy on their conversations, but 42% of us still use them! I think this belief is so widespread because it’s hard not to treat something that you can talk to as a pseudo-person, which also makes it hard not to expect that it is listening all the time, even if it doesn’t respond. That feeling will only increase once they take account of glances, gestures, even your mood.

If I’m right, we’re going to be entering a new age of creepy but useful objects that seem somewhat alive. What should we do about it? The first part might seem obvious but it rarely happens with any new technology – have a public debate about what we as a community think should be acceptable, right now, while it’s in the early stages of deployment, not after it’s a done deal. I’m a big fan of representative democracy, with all its flaws, so let’s encourage people outside the tech world to help draw the lines of what’s ethical and reasonable. I’m trying to take a step in that direction by putting our products up on maker sites so that anyone can try them out for themselves, but I’d love to figure out how to do something like a roadshow demonstrating what’s coming in the near future. I guess this blog post is an attempt at that too. If there’s going to be a tradeoff between creepiness and utility, let’s give ordinary people the power to determine what the balance should be.

The second important realization is that the tech industry is beyond the point where we can just say “trust us” and reasonably expect people to believe our claims. We’ve lost too much credibility. Moving forward we need to build our products in a way that third parties can check what we’re doing in a meaningful way. As I wrote a few months ago, I know Google isn’t spying on your conversations, but I can’t prove it. I’ve proposed the ML sensors approach we use as a response to that problem, so that someone like Underwriters Laboratories can test our privacy claims on the behalf of consumers.

That’s just one idea though, anything that lets people outside the manufacturers verify their claims would be welcome. To go along with that, I’d love to see enforceable laws that require creators of devices to label what information they collect, like an ingredients list for food items. What I don’t know is how to prevent these from turning into meaningless Prop 65-style privacy policies, where every company basically says they can do anything with any information and share it with anyone they choose. Even though GDPR is flawed in many ways, it did force a lot of companies to be more careful with how they handle personal data, internally and externally. I’d love smarter people than me to figure out how we make privacy claims minimal and enforceable, but I believe the foundation has to be designing systems that can be audited.

Whether this new world we’re moving towards becomes more of a utopia or dystopia depends on the choices we make now. Computers that understand us better can help when an elderly person falls, but the exact same technology could send police after a homeless person bedding down in a doorway. Our ML models can spot drowning victims, or criminalize wild swimming. Ubiquitous, cheap, battery-powered voice recognition could make devices accessible to many more people, or supercharge bugging by repressive regimes. Technologists alone shouldn’t have the power to decide the direction we head in, we need everyone’s help to chart the right path, and make the hard tradeoffs. We need to make sure that the machines that watch over us truly will be loving.

Launching Useful Sensors!

Person Sensor from Useful Sensors

For years I’ve wanted to be able to look at a light switch, say “On”, and have the lights switch on. This kind of interface sounds simple, so why doesn’t it exist? It turns out building one requires solving a lot of tough research and engineering challenges, and even more daunting, coming up with a whole new business model for smart devices. Despite these obstacles, I’m so excited about the possibilities that I’ve founded a new startup, Useful Sensors, together with a wonderful team and great investors!

We’ve been operating in stealth for the last few months, but now we’ve launched our first product, a Person Sensor that is available on SparkFun for $10. This is a small hardware module that detects nearby faces, returns information about how many there are and where they are relative to the device, and can also perform facial recognition. It connects over I2C, so it’s easy to integrate with almost any microcontroller, and it’s designed with privacy built in. If you’ve followed my work on ML sensors, this is our attempt to come up with the first commercial application of this approach to system design.
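
For a flavor of how simple the integration can be, here’s a rough sketch of polling the module from a Raspberry Pi in Python using the smbus2 library. The 0x62 address and 40-byte read size are my assumptions about the interface, and the developer guide has the authoritative values and the layout of the results struct:

import time
from smbus2 import SMBus, i2c_msg

PERSON_SENSOR_I2C_ADDRESS = 0x62   # Assumed address; check the developer guide.
RESULTS_LENGTH_BYTES = 40          # Assumed size of the results struct.

with SMBus(1) as bus:              # I2C bus 1 is the default on a Raspberry Pi.
    while True:
        # The sensor returns its latest results as a single block read.
        read = i2c_msg.read(PERSON_SENSOR_I2C_ADDRESS, RESULTS_LENGTH_BYTES)
        bus.i2c_rdwr(read)
        raw = list(read)
        # Decoding the face boxes from these bytes follows the struct layout
        # in the developer guide; here we just dump the raw payload.
        print(raw)
        time.sleep(0.2)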

We’ve started to see interest from some TV and laptop companies, especially around our upcoming hand gesture recognition, so if you are in the consumer electronics world, or have other applications in mind, I would love to hear from you!

Now that we’re public, you can expect to see more posts here in the future going into more detail, but for now I’ll leave you with some articles from a couple of journalists who have a lot of experience in this area. I thought they both had very sharp and insightful questions about what we’re doing, and had me thinking hard, so I hope you enjoy their perspectives too:

Pete Warden’s Startup puts AI in the Sensor, by Sally Ward-Foxton

Former Googler creates TinyML Startup, by Stacey Higginbotham

Try OpenAI’s Amazing Whisper Speech Recognition in a Free Web App


You may have noticed that I’m obsessed with open source speech recognition, so I was very excited when OpenAI released a new voice model. I’m even more excited now that I’ve had a chance to play with it: the accuracy is extremely impressive, especially as it’s multi-language. OpenAI has done a great job packaging it, and you can install it straight from pip if you’re a Linux shell user, but I wanted to find a way to let anybody try it for themselves from a web browser, even if they’re not developers. I love Google’s Colab service, and luckily somebody had already created a notebook showing the basics of using the Whisper model. I added some documentation and test files, and now you can give it a try for yourself by opening this Colab link: https://colab.research.google.com/github/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb.
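
If you’re comfortable in Python and would rather skip the notebook, the library itself only takes a couple of calls. Here’s a minimal sketch using the documented whisper API, assuming ffmpeg is installed and you have a recording called speech.wav (a placeholder filename) to transcribe:

# Install with: pip install git+https://github.com/openai/whisper.git
import whisper

# Smaller models download faster; "base" is a reasonable starting point.
model = whisper.load_model("base")

# Transcribe a local audio file and print the recognized text.
result = model.transcribe("speech.wav")
print(result["text"])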

Follow the directions, and after a minute or so you’ll see a button at the bottom of the page where you can record your own audio, and see a transcript. Give it a try, I think you’ll be impressed too!

How to build Raspberry Pi Pico programs with no software installation

youtube.com/watch?v=bDDgihwDhRE

I love using the Raspberry Pi Pico board to teach students about microcontrollers, especially as it only costs $4 and is currently in stock despite the supply chain crisis. I have run into some problems though, because building a program requires installing software. This might not sound like a big barrier, but when people arrive with a mix of Windows, MacOS, ChromeOS, and Linux laptops, often with different versions or architectures within each group, trying to guide them through the process can easily take a whole lesson, and require individual attention from me to debug each particular problem while the other students get bored. It’s also frustrating for the class to have to wait an hour before they get to do anything cool, I much prefer giving them a success as early as possible.

To solve this problem, I’ve actually turned to what might seem an unlikely tool, Google’s Colab service. If you have run across this, you probably associate it with Python notebooks, because that’s its primary use case. I’ve found it to be useful for a lot more though, because it effectively gives you a free, temporary Linux virtual machine that you control through the browser. Instead of running Python commands, you can run Linux shell commands by putting an exclamation point at the start. There are some restrictions, such as needing a Google account to sign in, and the file system disappearing after you leave the page or are idle too long, but I’ve found it great for documenting all sorts of installation and build processes in an accessible way.
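
As a small illustration of that idiom, here’s roughly what the cells in a notebook like this look like. The package names and paths are my reconstruction of the official Getting Started steps rather than a copy of the actual notebook, so treat them as a sketch:

# Each line starting with "!" runs as a Linux shell command in the Colab VM.
!sudo apt-get install -y cmake gcc-arm-none-eabi libnewlib-arm-none-eabi build-essential

# Fetch the SDK and examples.
!git clone https://github.com/raspberrypi/pico-sdk.git
!cd pico-sdk && git submodule update --init
!git clone https://github.com/raspberrypi/pico-examples.git

# Point the build at the SDK, then configure and build the blink example.
import os
os.environ["PICO_SDK_PATH"] = "/content/pico-sdk"
!mkdir -p pico-examples/build && cd pico-examples/build && cmake .. && make -j2 blink

# The resulting blink.uf2 can be downloaded and dragged onto the Pico's USB drive.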

I’m getting ready to teach EE292D (TinyML) at Stanford again this year, but we’re switching over to the Pico boards instead of the Arduino Nano 33 BLE Sense boards we’ve used previously, because the latter have been out of stock for quite a while. As part of that, I wanted to have an easy getting started guide for the students to help them build and run their first program. I put together a Colab notebook that follows the steps in the great Pico Getting Started Guide, installing the SDK and examples, and then building blink and running it on a board. To give some extra guidance, I also recorded the YouTube video above. Please excuse the hair and occasional distraction; I did it in a hurry.

It’s not a complete solution, students will still need to install OS-specific software to access debug logs, it requires a Google login that’s not available for kids under 13, and the vanishing file system will cause frustration if they don’t remember to save their code, but I do like it as a simple way to give them a win in just a few minutes. There’s nothing like seeing that first LED blink on a new board, I still get a kick out of it myself!