Join me at the Tesla Protests on Saturday

I’ve been writing this blog for nineteen years, and in over 1,100 posts I’ve never once brought up politics, but I can’t ignore what’s happening in our country. We’re facing such a profound crisis right now in the US that not speaking up at this point would be breaking the oath I took in 2014, when I became a proud citizen, to “defend the constitution” … “against all enemies, foreign and domestic“. I won’t repeat all the ways that the executive branch is destroying fundamental rights like habeas corpus and the rule of law. If you’re happy with what’s going on, I don’t know how to even reach you, so feel free to stop reading.

If you think what’s happening is wrong, but feel helpless to do anything about it, you should join one of the nationwide protests at Tesla showrooms around the country. I have never been to a protest in the US before, and I was actually pretty scared to attend my first. I’m a naturalized citizen, and I’ve never been made to feel more of a foreigner than I have over the last few months. Even though I have incredible privileges and resources compared to the most vulnerable groups, like trans people and immigrants who haven’t finished the arduous process of becoming citizens yet, I was still nervous about standing up in public to say what I believed. I’ve now been to two Saturday protests at the Tesla dealership on Van Ness in San Francisco, and I’ve been amazed at how heartening it has been to be surrounded by other people who are appalled at what is happening, and to hear the horns of the many others who drive by and show their support.

If, like me, you haven’t been to a protest before, you might have questions. Is it safe? The crowd and organizers are extremely chill, and at least half of the protestors are senior citizens. Despite what Fox may tell its viewers, the protestors are ordinary people like you and me who care about their country, not “Radical Leftists”. There’s incredible positive energy, and there’s never been a hint of violence. The organizers are very clear that this is a peaceful protest, and there will be zero tolerance from them for any trespassing or property damage. Tesla drivers who pass get some good-natured thumbs down, but even when a couple of agitators in MAGA hats showed up filming this weekend, everyone just laughed and rolled their eyes. I’m particularly proud of my wife Joanne: when one of them stuck a camera in her face and asked “What are you protesting?” (in a thick Russian accent, so presumably a fellow immigrant?), she smiled and replied “You”, which he had no response to. There’s also no sign-up necessary; you can arrive any time between 12pm and 2pm, stay for as long as you feel like, and leave whenever you want.

If you would like to do something, this Saturday (March 29th 2025) is going to be the biggest yet. Find your local Tesla dealer, and even if you’re in a deep red state, there’s almost certainly going to be a group gathering between 12pm and 2pm.

I know not everybody has the resources or ability to attend these protests, but there are still things you can do. I write as many Blue Wave Postcards as I can find time for. They encourage people to vote, and there are important elections coming up all the time, like the judicial race in Wisconsin that may decide whether they get fair redistricting for a long time to come. If you don’t have the money to pay for the postcards and stamps required, you can use the 5 Calls app to tell your representatives how concerned you are.

It’s no longer okay to ignore what’s happening, or keep your head down to avoid offending other people. This is a deep, deep crisis, and our only chance of a way out is if we work together to make sure our voices are heard. Please, join me in doing what you can. Even if we aren’t successful, I want to be able to say I went down fighting for what I believe in. Don’t you?

Debugging Disposable ML Frameworks

Guest post by Nat Jeffries, Founding Engineer at Useful Sensors.

At Useful Sensors we love using disposable frameworks to deploy on-device transformers. Having built several such frameworks, I realized that, while there are great resources for understanding and training transformer models, there are few guides for deploying them on-device. The following are some lessons I wish I knew when I started building disposable frameworks, and some tricks I’ve learned along the way.

First, I’ve learned to make sure to test parts of the model rather than the whole thing. When you run a transcription model on some sample audio clip and get back wingdings, curse words or nothing at all, it’s hard to know what went wrong. I like to compare intermediate tensor values from a known-good model against the same tensors in my custom framework, working from the input through each major block until these tensors differ. One trick I’ve found is to log the sum and shape of each tensor rather than all or some of the tensor values. 

Here’s an example in C++:

void print_tensor(const Tensor* tensor, std::string msg) {
  float sum = 0;
  for (auto elem : tensor->data) {
    sum += elem;
  }
  printf("%s: sum: %.4f shape (", msg.c_str(), sum);
  for (auto elem : tensor->shape()) {
    printf("%d ", elem);
  }
  printf(")\n");
}

Tensor* generate(Tensor* input, Tensor* mask, Tensor* seq) {
  print_tensor(input, "input");
  print_tensor(mask, "mask");
  auto* preprocessed = preprocess(input);
  print_tensor(preprocessed, "preprocessed");
  auto* embedding = encoder(preprocessed, mask);
  print_tensor(embedding, "embedding");
  auto* output = decoder(seq, embedding, mask);
  print_tensor(output, "output");
  return output;
}

And here’s the Python version:

import torch

def print_tensor(tensor, name):
    print(f'{name} sum {torch.sum(tensor)} shape {tensor.shape}')

def generate(src, mask, seq):
    print_tensor(src, "input")
    print_tensor(mask, "input mask")

    preprocessed = preprocessor(src)
    print_tensor(preprocessed, "preprocessed")

    enc = encoder(src=preprocessed, input_mask=mask)
    print_tensor(enc, "embedding")

    output = decoder(prompt=seq, embedding=enc, input_mask=mask)
    print_tensor(output, "output")
    return output

It’s rare that two tensors with the same sum and shape contain different values, and even if they do, the error will almost always appear one block later. Remember that this includes checking the inputs to both models. I’ve lost count of the number of times I used an incorrectly quantized input, the wrong input mask, or fed inputs into the model in the wrong order.

When dealing with quantized tensors, always refer back to the floating point values represented by the quantized tensors. Remember that regardless of the quantization scheme, each quantized value is an approximation of an equivalent floating point value in the known-good (usually floating point) model. Recording sums and shapes of quantized tensors converted back to float can be a good way to ensure that the models match, and to quickly identify integer overflow, incorrect logic, or excessive quantization error.
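
For example, here’s a minimal sketch of that check in Python, assuming simple per-tensor affine quantization. The helper and its arguments are illustrative rather than part of any particular framework:

def print_quantized_tensor(q_values, scale, zero_point, name):
    # Map the int8 codes back to the floats they approximate, then log the
    # same sum/shape fingerprint used for the known-good float model.
    floats = (q_values.to(torch.float32) - zero_point) * scale
    print(f'{name} sum {torch.sum(floats)} shape {floats.shape}')

# The printed sum should land close to the float model's, within quantization error.
print_quantized_tensor(torch.tensor([-3, 0, 64, 127], dtype=torch.int8),
                       scale=0.05, zero_point=0, name='proj_weights')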

Finally, make sure to periodically take a step back and honestly evaluate how clear your mental picture of what you’re trying to implement is. I recently experienced this while adding batch decoding to our Moonshine model. I spent many days debugging subtle differences between batch and non-batch versions of our model before realizing that I had forgotten to mask cross attention in the decoder. A simple gap in my knowledge, quickly solved by reading a guide on masking in encoder-decoder models, resulted in days of wasted effort.

Hopefully these tricks can save somebody from the pitfalls I’ve fallen into. If you’re interested in deploying speech models on-device or have tips I missed here, please reach out!

How to shrink ONNX files

I’ve been using the ONNX Runtime a lot recently, and while it has been a lot of fun, there are a few things I’ve missed from the TensorFlow Lite world. The biggest (no pun intended) is the lack of tools to shrink the model file size, something that’s always been essential in the mobile app world. You can quantize using the standard ONNX tools, but in my experience you’ll often run into accuracy problems because all of the calculations are done at lower precision. These are usually fixable, but require some time and effort.

Instead, I like to perform “weights-only quantization”, where the calculations are still done in 32-bit floating point, but the large arrays of weight values are stored as 8-bit codes. This usually has no impact on accuracy, and the effect on latency should be pretty negligible, since the compute involved in unpacking those values every time is a tiny fraction of the rest of the network calculations. I couldn’t find a tool to do that for me though, so I’ve just released ONNX Shrink Ray on GitHub and pypi. This tool processes ONNX files, finds large arrays of float32 values, and replaces them with an equivalent array of 8-bit codes followed by a DequantizeLinear operation. This typically reduces large float models to around 30% of their original size, usually with no measurable impact on accuracy.
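
If you’re curious what that transformation looks like under the hood, here’s a rough sketch in Python using the onnx package. This isn’t Shrink Ray’s actual implementation (the size threshold, tensor naming, and symmetric int8 scheme here are just illustrative), and it assumes an opset recent enough to include DequantizeLinear:

import numpy as np
import onnx
from onnx import helper, numpy_helper

def weights_only_quantize(input_path, output_path, min_elements=16384):
    model = onnx.load(input_path)
    graph = model.graph
    dequant_nodes = []
    for init in list(graph.initializer):
        if init.data_type != onnx.TensorProto.FLOAT:
            continue
        weights = numpy_helper.to_array(init)
        if weights.size < min_elements:
            continue  # Small tensors aren't worth the extra nodes.
        # Symmetric per-tensor quantization: a single float scale per tensor.
        scale = float(np.max(np.abs(weights))) / 127.0
        if scale == 0.0:
            continue
        quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        graph.initializer.remove(init)
        graph.initializer.extend([
            numpy_helper.from_array(quantized, init.name + "_q"),
            numpy_helper.from_array(np.array(scale, dtype=np.float32), init.name + "_scale"),
            numpy_helper.from_array(np.array(0, dtype=np.int8), init.name + "_zp"),
        ])
        # DequantizeLinear rebuilds the float32 values at runtime, so all the
        # downstream math still happens in floating point.
        dequant_nodes.append(helper.make_node(
            "DequantizeLinear",
            inputs=[init.name + "_q", init.name + "_scale", init.name + "_zp"],
            outputs=[init.name],
        ))
    # The new nodes have no upstream dependencies, so prepending them keeps the
    # graph topologically sorted.
    for node in reversed(dequant_nodes):
        graph.node.insert(0, node)
    onnx.save(model, output_path)

The real tool has to worry about plenty of edge cases this glosses over, but it captures the basic idea of trading a few extra nodes for a much smaller file.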

This is especially important for models that are hosted on the web or using the ONNX web runtime, since big downloads cost money. I’ve put together a quick pricing calculator using Claude to demonstrate the potential savings, using Google Cloud Storage download costs as the default. You can enter in your own values to see what the impact would be in your situation.

Other frameworks like GGML do offer similar kinds of weight-only quantization, but this is the only solution I know of for ONNX. I’ve also included a variation on this kind of quantization, where the values are still stored as floats, but quantized to an arbitrary number of values. This is very effective when your content is compressed for delivery (which, if you’re concerned about download costs, you’re probably already doing) and has no impact on latency.
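
Here’s a sketch of that second mode too (again illustrative, not the tool’s actual code): snap every weight to one of N evenly spaced float values, so the file stays float32 but compresses far better when it’s delivered.

import numpy as np

def snap_to_levels(weights, num_levels=256):
    # Snap each float to the nearest of num_levels evenly spaced values.
    # The tensor is still stored as float32, but with so few distinct values
    # it compresses dramatically better under gzip or similar codecs.
    lo, hi = float(weights.min()), float(weights.max())
    if hi == lo:
        return weights
    step = (hi - lo) / (num_levels - 1)
    return (np.round((weights - lo) / step) * step + lo).astype(weights.dtype)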

We have some other tricks up our sleeve for shrinking large models, so if you are running into this issue yourself, please do get in touch, I’ll be happy to geek out.

Why Speech to Intent is so Vital for Voice

When I first tried ChatGPT, it blew my mind. Its ability to respond intelligently to almost any prompt I gave it was astonishing; it was obvious to me that this was the future. It seemed like we’d finally built the kind of AI we’ve all seen in the movies. Over time though, one big limitation became clear – they’re all talk and no action. By that I mean they’re fantastic for anything that requires generating text, but persuading them to make something happen is a lot harder. For example, we can now build a model that could have a natural conversation with a person, just like HAL 9000, but if you ask it to open the pod bay doors, there’s no easy way to connect the LLM’s output to those doors’ controls.

The challenge of converting something somebody said into an action is known as the “speech to intent” problem in the research world. If you’ve ever used a voice assistant, you’ll know that you have to be careful about how you phrase requests. “Alexa, living room lights on” may work, but “Alexa, turn on the lights in the living room” might not. If you were talking to a person, you wouldn’t have this problem; they would be able to understand what you meant even if you didn’t use the exact phrase they were expecting. In natural conversations we’re just as likely to say something like “Can you hit the switch for the lights by the TV?” or “We need light in the living room”, and we’d expect someone else to understand. Solving speech to intent means recognizing all of those possible natural language phrases as inputs, and outputting a structured result that unambiguously tells the rest of the system to turn a particular light on.
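
To make that concrete, all of those phrasings should collapse into the same machine-readable result, something like this (the field names here are made up for illustration):

# "Living room lights on", "turn on the lights in the living room", and
# "can you hit the switch for the lights by the TV?" should all map to the
# same structured intent.
intent = {
    "action": "turn_on",
    "device": "light",
    "location": "living_room",
}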

As you can probably tell from your own experiences with voice assistants, this problem is far from solved. A lot of current solutions still work much like Infocom text games from the 80s – here’s a genuine example from Azure’s “AI Services”:

You might already be able to spot a few problems with this. What if someone said “Go to six” or “Six please“? This kind of pattern matching is very brittle because it either relies on the developer coming up with every likely variation on a command, or the user choosing exactly the expected phrase. Even worse, there’s usually no way for a user to tell what the correct phrases actually are, so the interface is incredibly undiscoverable too! I believe the problems that this rule-based approach causes are a big reason that very few people use voice interfaces. We expect our assistants to be able to understand us when we talk naturally to them, and right now they don’t.
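
Here’s a toy Python version of that style of rule-based matching (the pattern is my own invention, not lifted from the Azure example) to show how quickly it falls over:

import re

# One hand-written rule, the kind a developer might register for an elevator skill.
FLOOR_RULE = re.compile(r"^take me to floor (?P<floor>\w+)$", re.IGNORECASE)

for utterance in ["Take me to floor six", "Go to six", "Six please"]:
    match = FLOOR_RULE.match(utterance)
    print(f'{utterance!r} -> {match.group("floor") if match else "not understood"}')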

Large Language Models seem to be great at understanding people, so are they the solution? I think they will be soon, but the best paper I’ve found on this approach shows we still have some work to do. The authors’ experiments show that you can get results as good as the non-LLM state of the art by using ChatGPT 3.5 on a simple intent classification task (table 3), but the LLM approach is much worse when the requirements are tougher (table 4). ChatGPT also struggles with the kinds of word errors that show up on transcribed text. I’m optimistic that we can solve these issues (and we’re actively working on this at Useful) but it will require some new approaches to training and using models.

So, why is speech to intent so important? I believe it’s the last missing piece before we finally have voice interfaces that are a joy to use! Imagine leaning back on your couch with your laptop open and browsing purely through speech. Blade Runner has a beautiful example of how this might work in its zoom and enhance scene:

Of course I’m more likely to be buying jeans from Zappos than playing robot detective, but almost any interactive experience can be improved with a voice interface that actually understands people. Speech won’t replace keyboards or touch screens (we’ll still be typing into spreadsheets), but there will be a lot of cases where it will be the easiest way to interact. This change won’t just be an incremental one; it will open up experiences on devices that have never been possible before. If voice truly works, you’ll be able to use your TV to browse the web, get a quick summary of a page from your smart speaker, or work with apps from your AR or VR devices. It will free us from remote controls and having to physically touch something to make it work. If you’re using voice, then the results can be displayed on any screen that’s convenient, and computing becomes much more ambient, rather than something you have to carry around with you.

This is why I’m so excited to be working on this problem. We’ve been suffering through a long voice interface winter, but almost all of the ingredients are in place to make speech work. If we can persuade LLMs to turn their words into deeds, then we’ll finally be able to talk to machines like we can to people, and I think that will be glorious.

Introducing Moonshine, the new state of the art for speech to text

Can you imagine using a keyboard where it took a key press two seconds to show up on screen? That’s the typical latency for most voice interfaces, so it’s no wonder they’ve failed to catch on for most people. Today we’re open sourcing Moonshine, a new speech to text model that returns results faster and more efficiently than the current state of the art, OpenAI’s Whisper, while matching or exceeding its accuracy. The paper has the full details, but the key improvements are an architecture that offers an overall 1.7x speed boost compared to Whisper, and a flexibly-sized input window. This variable length input is very important, since Whisper always works with 30 second chunks of audio, so even if you only have a few seconds of speech you have to zero-pad the input and process much more data than you need. These two improvements mean we’re five times faster than Whisper on ten second audio clips!
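
To put rough numbers on why the fixed window hurts, here’s some back-of-the-envelope arithmetic based on the 30-second figure (not a measurement of either model):

WHISPER_WINDOW_SECONDS = 30

def wasted_fraction(clip_seconds):
    # With a fixed 30-second window, everything beyond the real audio is
    # zero-padding that the encoder still has to process.
    return 1.0 - clip_seconds / WHISPER_WINDOW_SECONDS

for seconds in (2, 5, 10):
    print(f'{seconds}s clip: {wasted_fraction(seconds):.0%} of the encoder input is padding')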

To understand what that means in practice, you can check out our Torre translator. The speed of Moonshine means we can offer almost instant translations as people are talking, making for a conversation that’s much more natural than existing solutions.

Even better, the low resource demands of Moonshine allow us to run everything locally on the device, without any network connection, safeguarding privacy and letting us run anywhere in the world, instantly.

We founded Useful to help machines understand us better, and we’re proud to share this new step forward in speech to text, since voice interfaces are a vital part of that mission. Moonshine doesn’t just help us with products like Torre, its unique design makes it possible to fit full automatic speech recognition on true embedded hardware. We’ve found the biggest obstacle to running ASR on microcontrollers and DSPs hasn’t been the processing power, since accelerators help with that, but RAM limits. Even the smallest Whisper model requires at least 30MB of RAM, since modern transformers create large dynamic activation layers which can’t be stored in flash or other read-only memory. Because Moonshine’s requirements scale with the size of the input window, we are on target to transcribe full sentences a few seconds long in 8MB of RAM or less.

I can’t wait to see what people are able to build with these new models, especially on resource-constrained platforms like the Raspberry Pi, where running full speech to text has been challenging. Please do get in touch if you’ve built something neat, we’d love to hear from you!

Update – I talk a bit more about Moonshine on YouTube at youtu.be/sZVTisKqJtA.

AI PCs aren’t very good at AI

I’ve long been a fan of Qualcomm’s NPUs, and I even collaborated with them to get experimental support for the underlying HVX DSP into TensorFlow back in 2017 (traces remain here). That meant I was very excited when I heard they were bringing those same accelerators to Windows tablets, offering up to 45 trillion ops per second. As soon as the Microsoft Surface Pro version running on Arm was released, we bought a bunch and prepared to use them as the main platform for our instant translation app, since it requires a lot of computing power to run all the transformer models that power it.

Unfortunately I struggled to get anywhere near the advertised performance using the NPU. In fact, in my experience it was usually significantly slower than the CPU. To try to get to the bottom of these issues, I’ve open sourced a benchmark where I try to get the best possible performance on a foundational AI operation, multiplying two large matrices, and show that the NPU is slower than the CPU path. I only see 573 billion operations per second, less than 1.3% of the 45 trillion operations per second that’s listed in the specs (and four times less than the Nvidia RTX 4080’s 2.16 teraops in my gaming laptop with the same benchmark).
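
For context on how numbers like these are counted, here’s a rough CPU-only sketch of the arithmetic. It isn’t the actual benchmark (that one targets the NPU); it just shows the 2*M*N*K convention behind headline ops-per-second figures:

import time
import numpy as np

def matmul_gops(m=2048, n=2048, k=2048, iterations=10):
    # A matrix multiply of (m x k) by (k x n) takes roughly 2*m*n*k operations
    # (one multiply and one add per output element per k step), which is the
    # convention behind headline "ops per second" figures.
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    start = time.perf_counter()
    for _ in range(iterations):
        _ = a @ b
    elapsed = time.perf_counter() - start
    return (2 * m * n * k * iterations) / elapsed / 1e9

print(f'{matmul_gops():.1f} billion ops per second')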

I’m used to not getting great utilization out of AI acceleration hardware (often getting to 10% of the theoretical maximum throughput is considered a good result), but I’m disappointed at the 1.3% we’re seeing here. It’s hard to tell where the problem lies, but I’m hoping it’s in the software stack somewhere, since I’ve seen much better performance with similar chips on Android. It could even be an issue with how I’m calling the code, though I’ve tried to follow the documentation as closely as possible. I’m guessing the ONNX Runtime, drivers, and on-chip code haven’t had enough work done on them yet, which is good news because those all should be fixable with software updates. I also miss the ability to compile and run my own operations on the DSP, since that would provide an escape hatch from these issues, but that’s apparently not allowed on Windows.

Hopefully we will get some help solving whatever issues are preventing us from achieving the performance that we’d expect. If you have ideas, please feel free to fork the code and give it a try yourself, I’d love to hear from you. I’m still hopeful that the hardware can deliver, but right now it’s very disappointing.

Introducing Torre, a new way to translate

I’m excited to announce Torre, a new product that translates instantly between Spanish and English. A lot of native English speakers I talk to don’t understand why a better approach to translation is needed, since there have been phone apps around for years. The best way I’ve found to explain is “Can you imagine watching a foreign language movie using Google Translate?”.

I’m an immigrant to the US who was lucky enough to already speak the dominant language, and so I feel like I’ve experienced the whole process on easy mode. When I talk to the children of immigrants from other parts of the world, language brokering for their parents and relatives is a huge part of their lives. Kids end up being thrust into situations like medical appointments, PTA meetings, and legal consultations, often from a young age, and are exposed to aspects of adult life we shouldn’t expect children to deal with. Sometimes professional human translators are theoretically available, but the difficulty of scheduling them, and the awkwardness of alternative phone services, mean that family members are still the most common option.

We’re taking the latest advances in AI language models and using them to offer a fast and fluent experience, aiming to make a live conversation as easy as watching a movie with subtitles. A lot of the situations that need translation also require privacy, so our tablets run with no internet connection at all, air-gapped so there’s no risk of your data leaving the device.

Initially we’re looking for lawyers, doctors, and educators who want to give Torre a try, since those are some of the roles we think we can be most helpful to. Drop me an email if you’d like to know more. I’d love to hear from you even if you don’t fit those categories, since we’re still learning about all the places Torre could be useful.

To show where we’re at with the product, here’s me and my colleague Jackie doing a live demo in a single take!

The Long, Strange Journey of Language Models

DALLE-3-generated image of a paper tape rolling out across a landscape

Have you ever wondered why ChatGPT and similar advanced AI systems are known as Large Language Models? What are “language models”, even? To answer that, and understand how remarkable the current state of the art is, I need to jump back a few decades.

Understanding language has always been a goal for artificial intelligence researchers, right from the field’s start in the 1950s, but what might surprise you is that language models were traditionally seen as just one processing step in a larger workflow, not as the catchall solution they are now. A good way of thinking about a language model’s job is that, given a sequence of words, it predicts which words are most likely to come next. For example, given “Move it over”, it might predict “to”, “there” or “more” as likely next words. It’s very similar to autocomplete. If you build a speech recognition system that takes in audio and tries to output the text corresponding to the speech, having this kind of prediction can help decide between two words that sound the same. For example, if the speech to text model had previously heard “Move it over”, and the next word sounded like “their” or “there”, the information from the language model will tell you that “there” is more likely to be right. You can probably see how language models could be used in similar ways to post-process the results of machine translation or optical character recognition.
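
A toy illustration of that rescoring step, with completely made-up probabilities, might look like this:

# Made-up bigram probabilities standing in for a real language model.
BIGRAM_PROB = {("over", "there"): 0.02, ("over", "their"): 0.001}

def language_model_score(words):
    # Multiply the probability of each word given the one before it;
    # unseen pairs get a small floor value.
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        score *= BIGRAM_PROB.get((prev, cur), 1e-6)
    return score

# The acoustic model can't tell "there" from "their", but the language
# model prefers the candidate that reads like normal text.
candidates = [["move", "it", "over", "there"], ["move", "it", "over", "their"]]
print(" ".join(max(candidates, key=language_model_score)))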

For many decades, the primary focus of AI research was on symbolic approaches, and language models were seen as a hacky statistical trick that might be useful to help clean up data, but weren’t a promising avenue towards any kind of general intelligence. They didn’t seem to embody knowledge about the world, they were just predicting strings, so how could they be more than low level tools? Even now, language models are criticized as “Stochastic Parrots”, mindlessly regurgitating plausible text with no underlying understanding of anything. There’s a whole genre of autofill games that use text prediction on phones to generate surreal sentences, highlighting the uncanny valley aspect of these comparatively primitive language models.

To understand how they have potential to be more useful, think about the words “The sky is“. As people, we’d guess “blue” or maybe “cloudy” as likely next words, and good enough language models would do the same. If you add in a preceding question, so the full prefix is “What color is the sky? The sky is“, we’d be even more likely to guess “blue“, and so would a model. This is purely because in a large enough collection of writing, a model will have come across enough instances of the question “What color is the sky?” to know that “blue” is a likely answer, but crucially, this means it has acquired some knowledge of the world! This is despite having no eyes to see, and having never been explicitly programmed with what color the sky is. The prompt you give to a modern LLM is essentially just that question at the start of the string to kick things off, so even the latest models still work in the same basic fashion.

What has happened since BERT in 2018 is that language models have been trained on larger and larger sets of data, and we’ve discovered that they’re incredibly effective at solving all sorts of problems that were considered to be challenging, and were seen as significant stepping stones towards general intelligence. For a lot of us, me included, this was very surprising, and challenged a lot of our beliefs about what makes intelligence. After all, language models are fundamentally just auto-complete. If intelligence can seem to emerge from repeatedly predicting the next word in a sentence, what does that mean about how we ourselves think? Is this truly a path towards general intelligence, or just a mirage that disappears once we run out of larger and larger sets of data to feed it?

You probably know from your own experience that modern chat bots can handily pass the Turing Test, and act as convincing conversation partners, even joking, detecting sarcasm, and exhibiting other behaviors we usually consider to require intelligence. They clearly work in practice, but as the old joke about French engineers goes, do they work in theory? This is where research is just getting off the ground. Since we have systems that exhibit intelligence, but we don’t understand how or why, it’s more experimental than theory-driven right now, and has far fewer resources available than applied and commercial applications of LLMs, but I think it will reveal some mind-bending results over the next few years. We have something approaching an intelligence that we’ve constructed; how could we not discover new insights by analyzing these models?

I love that language models are Cinderellas of the AI world, rising from humble servants of more popular techniques, to solving the hardest problems all on their own. I would never have predicted this myself a few years ago, but it is more evidence that larger datasets solve most problems in machine learning, and I can’t wait to see where they go next!

Update – I talked with Ann Spencer about this topic on my YouTube podcast.

Why has the Internet of Things failed?

According to a survey last year, less than 50% of appliances that are internet-capable ever get connected. When I talk to manufacturers, I often hear even worse numbers, sometimes below 30%! Despite many years and billions of dollars of investment into the “Internet of Things”, this lack of adoption makes it clear that even if a device can be connected, consumers don’t see the value in most cases. I think it’s time to admit that the core idea of IoT has failed. To understand why, it’s worth looking at how it was originally pitched, and what flaws time has revealed in those arguments.

The idea of an internet of everyday devices has been around for decades, but its definition has always been centered on connecting electronics to a network. This is superficially sensible, because we’ve seen internet-connected devices overtake standalone equivalents in everything from mainframes, to personal computers, and finally to phones. Sun coined the phrase “The network is the computer”, and that philosophy has clearly won in most domains, from Salesforce pioneering Software as a Service, to the majority of user applications today being delivered as web or mobile apps with a data-center-based backend. Given this history, it makes sense that the world of embedded systems, the billions of tiny, cheap, microcontrollers in everything from cars to toasters, would be revolutionized by a similar switch to network-reliant technologies. So, why has this approach failed?

Setup

The biggest obstacle is the setup tax. All of our communication technologies, from WiFi to cellular, cost money to use, and so require authentication and billing accounts. This isn’t as big a problem with PCs and phones because we only replace them every few years, and they have screens and keyboards, so going through the setup process is comparatively straightforward. By comparison, your fridge or toaster probably doesn’t have a full-featured user interface, and so you’re expected to download a phone app, and then use that to indirectly set up your appliance. This adds multiple extra steps, and anyone who’s ever worked on a customer funnel knows that every additional stage loses some people along the way. If you also factor in that a household might have dozens of different devices that all want you to go through the same process, with different applications, accounts, and quirks, it’s clear why people suffer from setup fatigue and often don’t even try.

Uselessness

“Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should.” — Ian Malcolm, Jurassic Park.

Last year I talked to an engineer who had spent six months working on a smart dishwasher that could be connected to the internet. He confessed that none of the team had been able to figure out a compelling user benefit for the system. You could start the dishwasher remotely, but how did that help if you had to be there in person to load it? Knowing when it was done was mildly useful, but most people would know that from when they started it. With phones and PCs, adding an internet connection immediately unlocked compelling use cases, thanks to all the human-readable content on web pages, and once the network was widely available, applications like Salesforce and Uber added to the appeal. We’ve never seen anything like this for IoT in the consumer space. Getting an alert that your fridge door has been left open is nice, but isn’t much better than having an audible alarm go off. Amazon, Apple, and Google have tried to use voice interfaces as a selling point for devices to connect through their ecosystems, but almost nobody uses them for anything other than setting alarms and playing songs. There’s also no inherent reason to send audio data to the cloud to have a voice interface; in fact, one of the reasons we founded Useful was to bring local speech interfaces to everyday objects. People need a motivation to connect their devices, especially with the time cost involved in setup, and nobody has given them one.

Energy

The final nail in IoT’s coffin is a bit more subtle and technical than the first two. Unless you want to run ethernet cables everywhere, a network connection requires radio communication, through Bluetooth, WiFi, or cellular data. All of these technologies need at least 100 milliwatts of power to run continuously. This isn’t much when you are connected to a mains power supply, but you’ll quickly run down anything battery-powered. There’s a reason you need to charge your phone and wearables every day. The philosophy of “The network is the computer” requires that you can access a data center with low enough latency that you can treat the cloud as just another component of your system. If you need to wait seconds for it to become available, and if it’s so hungry for precious resources that you can’t use it routinely, the programming model that allows phone and desktop apps to seamlessly integrate with remote servers breaks down. Ubiquitous, always-on network connectivity makes writing software so much easier, because you can always tap into a deep pool of compute and data as if it were local. That’s a big reason why the cloud has eaten the regular software world over the last two decades. A costly, intermittent connection removes that advantage.

You might argue that the consumer IoT should be focused on mains-powered devices, but that’s a very limited vision, since there are only so many appliances you want to plug in, and being able to run on batteries or energy harvesting opens up the possibility of there being hundreds or even thousands of sensors per person. The energy costs behind radio transmission don’t seem to be improving very fast, so I believe this will continue to be a barrier to the IoT ideal.

What’s next?

Believe it or not, I’m still optimistic about the future of embedded technology! I just get frustrated that a superficial analogy to previous technology cycles has focused so much academic and commercial attention on bringing an internet connection to everyday objects. Instead, I think it will be much more fruitful to spend our time tackling issues users actually care about, like why do I have five remotes on my couch, or why doesn’t my TV turn on instantly like it used to years ago? Most of the issues that are frustrating people with consumer electronics don’t need a network connection to solve. I’d much rather have us building machines that can understand us better, and figure out the monetization strategy after we’re providing value, instead of building features nobody uses because we think they can make money.

Understanding the Raspberry Pi Pico’s Memory Layout

A few months ago I started updating TensorFlow Lite Micro for the Raspberry Pi Pico board, which uses the RP2040 microcontroller. I ran into some baffling bugs that stopped me making progress, but eventually I tracked them down to my poor understanding of the memory layout. Since I had to do a deep dive, I wanted to share what I learned here.

This diagram shows the physical address layout of the RP2040. I believe the flash location can be board-specific, but on the Pico boards it begins at 0x10000000 and is two megabytes long. Where things get a bit more complex is the RAM. The RP2040 has built-in SRAM, made up of four 64KB banks, followed by two 4KB banks. There isn’t much documentation I can find about the characteristics of these banks, but from what I can gather different banks can be accessed at the same time by the two Cortex M0 cores on the chip. I believe if the same bank is accessed by both cores one of the cores will stall for at least a cycle while the other is given access.

The physical layout is fixed and controlled by the hardware, but the compiler and linker decide how the software is going to use the available address space. The default RAM layout is defined in src/rp2_common/pico_standard_link/memmap_default.ld in the Pico SDK, and I’ve used those values for the diagram above. To explain some of the labels: the vector table is a 256-byte array of function pointers for system routines, and is usually at the start of RAM; .data is where all the global and static variables that start with a value are stored; .bss is the same, but for variables that don’t need to be initialized; the heap is where malloc-ed memory comes from; and the two stacks hold local variables for functions.

There are a few things to be aware of here. There are two stacks, one for each of the Cortex M0 cores the RP2040 has. Unless your program explicitly calls the second core, only core 0 will be used, so the core 1 stack is often unused. The stacks are defined as 2KB in size, and they grow downwards in this diagram, starting with the highest address as the top of the stack and moving to smaller addresses as more items are added. For performance reasons, each core’s stack is defined in a different bank, one of the smaller scratch x or y areas, presumably so that local variables can be accessed independently by each core, with no risk of stalls. One oddity is that each stack is 2KB, but the scratch banks are 4KB each, and so they each only use half of the bank.

The heap size is defined to be the remaining memory once all the other fixed-size sections have been allocated. This means it stretches from the top of .bss to the bottom of the core 1 stack. In theory there’s no mandated way for areas to be allocated from this region when you call malloc(), but in practice every implementation I’ve seen will begin allocating at the bottom (lowest address) of the heap, and move upwards as more space is needed for further allocations.

To recap, the stacks grow downwards from the highest addresses in memory, and the allocated parts of the heap grow upwards. This means that the area immediately below the stacks is unlikely to be used unless you’re heavily allocating memory from the heap. The subtle consequence of this is that you will probably not observe incorrect behavior in most programs if you end up using more than 2KB of stack space. The memory at the top of the heap is unlikely to be used, so the stack can start stomping all over it without any apparent bugs surfacing, up until the point that it reaches part of the heap that has been allocated.

So, the nominal limit for stack size on the RP2040 is 2KB, but we can definitely use 4KB (because that’s the size of the scratch bank), and in all likelihood many programs will appear to work correctly even if they use a lot more. This is important because most programs designed for non-embedded platforms assume that the stack size is on the order of megabytes at least. Even some libraries aimed at embedded systems assume at least tens of kilobytes of memory is available. In this case, it was my baby, TensorFlow Lite Micro, that had these buried assumptions.

My quest started when I saw a particular convolution test fail when I enabled my dual-core optimizations. After a lot of debugging, I realized that the test function was allocating several multi-kilobyte arrays as local variables on the stack. This blew out the 2KB nominal limit, and the 4KB practical limit, for the stack size, but didn’t cause any visible problems because the heap was not heavily used. However, if you look at the RAM layout diagram above, you’ll see that the core 1 stack is immediately below the core 0 stack. This means that a core 0 function that overflows its stack will start using memory reserved for the core 1 stack! This caused me a lot of confusion until I figured out what was going on, and I want to flag this as something to watch out for if anyone else is working on dual-core RP2040 optimizations. The result was weird race conditions where apparently random data would end up in the data arrays, depending on which core wrote to those locations first.

Thanks to the great community on the RPi forums I was able to come up with a simple solution for my immediate problem, by putting the core 0 stack below the core 1 stack in the memmap_default.ld file (placing core 0 in scratch x, and core 1 in scratch y), since I controlled all the code running on core 1 and could ensure it wouldn’t overflow the stack, whereas core 0 ran application code that I couldn’t control. This allowed core 0’s stack to overflow into the heap, but left core 1’s stack untouched. I also learned a few helpful techniques from the forum thread, such as compiling with -fstack-usage to get the stack usage of each function, and the ‘USE_STACK_GUARDS’ macro that can check for overflows. I haven’t figured out how to specify a custom .ld file in cmake yet, but I hope to add that in the future.

I hope this brain dump of what I learned about the RP2040’s memory layout and the potential for silent stack overflows helps somebody else out there. It was one of the most elusive bugs I’ve chased in quite a while, but it was very satisfying to finally understand what was going on. One of the reasons that I enjoy working on embedded platforms is that they are small enough systems that it should be possible to figure out any unexpected behavior, but this one tested my faith in that idea!