Understanding the Raspberry Pi Pico’s Memory Layout

A few months ago I started updating TensorFlow Lite Micro for the Raspberry Pi Pico board, which uses the RP2040 microcontroller. I ran into some baffling bugs that stopped me making progress, but eventually I tracked them down to my poor understanding of the memory layout. Since I had to do a deep dive, I wanted to share what I learned here.

This diagram shows the physical address layout of the RP2040. I believe the flash location can be board-specific, but on the Pico boards it begins at 0x10000000 and is two megabytes long. Where things get a bit more complex is the RAM. The RP2040 has built-in SRAM, made up of four 64KB banks followed by two 4KB banks. I couldn’t find much documentation about the characteristics of these banks, but from what I can gather different banks can be accessed at the same time by the two Cortex M0+ cores on the chip. I believe that if both cores access the same bank, one of them will stall for at least a cycle while the other is given access.
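If you want to poke at these regions yourself, here’s a small sketch that classifies an address into the ranges above. The boundary constants are mine, taken from my reading of the RP2040 datasheet’s memory map, so double-check them before relying on this:

#include <stdint.h>
#include <stdio.h>
#include "pico/stdlib.h"

// Rough classification of an address into the RP2040 regions discussed above.
// The boundaries are from the datasheet's address map (verify before trusting).
static const char *region_for(uintptr_t addr) {
  if (addr >= 0x10000000u && addr < 0x10200000u) return "flash (XIP, 2MB on the Pico)";
  if (addr >= 0x20000000u && addr < 0x20040000u) return "striped SRAM0-3 (4 x 64KB)";
  if (addr >= 0x20040000u && addr < 0x20041000u) return "SRAM4 (scratch X, 4KB)";
  if (addr >= 0x20041000u && addr < 0x20042000u) return "SRAM5 (scratch Y, 4KB)";
  return "somewhere else (ROM, peripherals, ...)";
}

int main(void) {
  stdio_init_all();
  int on_stack = 0;             // A local, so it lives on the current core's stack.
  static int zero_initialized;  // A static, so it lives in .bss in the main SRAM.
  while (1) {
    printf("local variable at %p: %s\n", (void *)&on_stack, region_for((uintptr_t)&on_stack));
    printf("static variable at %p: %s\n", (void *)&zero_initialized, region_for((uintptr_t)&zero_initialized));
    sleep_ms(1000);
  }
}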

The physical layout is fixed and controlled by the hardware, but the compiler and linker decide how the software uses the available address space. The default RAM layout is defined in src/rp2_common/pico_standard_link/memmap_default.ld in the Pico SDK, and I’ve used those values for the diagram above. To explain some of the labels: the vector table is a 256 byte array of function pointers for system routines, and usually sits at the start of RAM; .data is where all the global and static variables that start with a value are stored; .bss is the same, but for variables that start zeroed or uninitialized; the heap is where malloc-ed memory comes from; and the two stacks hold local variables for functions.
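One way to see where the linker actually put everything is to print the boundary symbols the script defines. This is only a sketch: the symbol names below are the ones I believe memmap_default.ld exports, so if any of them fail to link, check the script itself for the exact spelling.

#include <stdio.h>
#include "pico/stdlib.h"

// Section boundary symbols defined by the linker script (names assumed from
// memmap_default.ld). Their addresses are what we care about, not their values.
extern char __data_start__, __data_end__;
extern char __bss_start__, __bss_end__;
extern char __end__;           // Bottom of the heap.
extern char __StackOneBottom;  // Bottom of the core 1 stack.
extern char __StackTop;        // Top of the core 0 stack.

int main(void) {
  stdio_init_all();
  while (1) {
    printf(".data: %p to %p\n", (void *)&__data_start__, (void *)&__data_end__);
    printf(".bss:  %p to %p\n", (void *)&__bss_start__, (void *)&__bss_end__);
    printf("heap starts at %p\n", (void *)&__end__);
    printf("core 1 stack bottom at %p\n", (void *)&__StackOneBottom);
    printf("core 0 stack top at %p\n", (void *)&__StackTop);
    sleep_ms(1000);
  }
}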

There are a few things to be aware of here. There are two stacks, one for each of the RP2040’s two Cortex M0+ cores. Unless your program explicitly launches the second core, only core 0 will be used, so the core 1 stack often sits unused. The stacks are defined as 2KB in size, and they grow downwards in this diagram, starting with the highest address as the top of the stack and moving to smaller addresses as more items are added. For performance reasons, each core’s stack is placed in a different bank, one of the smaller scratch X or Y areas, presumably so that local variables can be accessed independently by each core with no risk of stalls. One oddity is that each stack is 2KB but the scratch banks are 4KB each, so each stack only uses half of its bank.
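Here’s a quick sketch of how you could confirm this on a Pico: launch core 1 and print where a stack-allocated variable lands on each core. It assumes the default linker script, and the scratch bank addresses are hard-coded from the datasheet:

#include <stdint.h>
#include <stdio.h>
#include "pico/stdlib.h"
#include "pico/multicore.h"

// Report which scratch bank the calling core's stack falls into.
static void report_stack_location(const char *who) {
  int local = 0;  // Lives on the calling core's stack.
  uintptr_t addr = (uintptr_t)&local;
  const char *bank = "somewhere else";
  if (addr >= 0x20040000u && addr < 0x20041000u) bank = "scratch X";
  if (addr >= 0x20041000u && addr < 0x20042000u) bank = "scratch Y";
  printf("%s stack variable at %p (%s)\n", who, (void *)&local, bank);
}

static void core1_entry(void) {
  report_stack_location("core 1");
  while (1) tight_loop_contents();
}

int main(void) {
  stdio_init_all();
  multicore_launch_core1(core1_entry);
  sleep_ms(100);  // Crude way to keep the two cores' printf output from interleaving.
  report_stack_location("core 0");
  while (1) tight_loop_contents();
}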

The heap size is defined to be the remaining memory once all the other fixed-size sections have been allocated. This means it stretches from the top of .bss to the bottom of the core 1 stack. In theory there’s no mandated way for areas to be allocated from this region when you call malloc(), but in practice every implementation I’ve seen will begin allocating at the bottom (lowest address) of the heap, and move upwards as more space is needed for further allocations.
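A toy program makes that behavior easy to see, at least with the allocator my toolchain uses, by printing where successive malloc() calls land relative to a local variable:

#include <stdio.h>
#include <stdlib.h>
#include "pico/stdlib.h"

int main(void) {
  stdio_init_all();
  int local = 0;
  sleep_ms(2000);  // Give a serial console time to connect.
  for (int i = 0; i < 4; ++i) {
    void *block = malloc(1024);
    // Expect ascending addresses if the allocator works bottom-up, as described above.
    printf("allocation %d at %p\n", i, block);
  }
  // The stack local should sit far above the heap, in scratch Y under the default layout.
  printf("stack local at %p\n", (void *)&local);
  while (1) tight_loop_contents();
}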

To recap, the stacks grow downwards from the highest addresses in memory, and the allocated parts of the heap grow upwards. This means that the area immediately below the stacks is unlikely to be used unless you’re allocating heavily from the heap. The subtle consequence is that you will probably not observe incorrect behavior in most programs even if you end up using more than 2KB of stack space. The memory at the top of the heap is unlikely to be touched, so the stack can stomp all over it without any apparent bugs surfacing, right up until it reaches part of the heap that has actually been allocated.

So, the nominal limit for stack size on the RP2040 is 2KB, but we can definitely use 4KB (because that’s the size of the scratch bank), and in all likelihood many programs will appear to work correctly even if they use a lot more. This is important because most programs designed for non-embedded platforms assume that the stack size is at least on the order of megabytes. Even some libraries aimed at embedded systems assume tens of kilobytes of stack space are available. In this case, it was my baby, TensorFlow Lite Micro, that had these buried assumptions.
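If you want a rough idea of how deep the core 0 stack has grown at any point, something like the sketch below works. It assumes the default layout, with core 0’s stack at the top of scratch Y; the addresses and the 2KB figure come from the default linker script, so adjust them if you’ve changed it:

#include <stdint.h>
#include <stdio.h>
#include "pico/stdlib.h"

// Default layout assumptions: core 0's stack occupies the top of scratch Y.
#define SCRATCH_Y_BASE 0x20041000u  // Bottom of the 4KB scratch Y bank.
#define SCRATCH_Y_TOP  0x20042000u  // Top of scratch Y, where the stack starts.
#define NOMINAL_STACK_SIZE 0x800u   // The 2KB figure from memmap_default.ld.

// Call this from deep inside the code you're worried about.
static void report_stack_depth(const char *where) {
  int marker = 0;  // Its address approximates the current stack pointer.
  uintptr_t sp = (uintptr_t)&marker;
  uintptr_t used = SCRATCH_Y_TOP - sp;
  printf("%s: roughly %u bytes of core 0 stack in use\n", where, (unsigned)used);
  if (used > NOMINAL_STACK_SIZE) printf("  past the 2KB nominal limit\n");
  if (sp < SCRATCH_Y_BASE) printf("  past the 4KB bank, overwriting whatever is below\n");
}

int main(void) {
  stdio_init_all();
  while (1) {
    report_stack_depth("main");
    sleep_ms(1000);
  }
}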

My quest started when I saw a particular convolution test fail when I enabled my dual-core optimizations. After a lot of debugging, I realized that the test function was allocating several multi-kilobyte arrays as local variables on the stack. This blew past the 2KB nominal limit and the 4KB practical limit for the stack size, but didn’t cause any visible problems because the heap was not heavily used. However, if you look at the RAM layout diagram above, you’ll see that the core 1 stack is immediately below the core 0 stack. This means that a core 0 function that overflows its stack will start using memory reserved for the core 1 stack! The result was weird race conditions where apparently random data would end up in the arrays, depending on which core wrote to those locations first. This caused me a lot of confusion until I figured out what was going on, and I want to flag it as something to watch out for if anyone else is working on dual-core RP2040 optimizations.

Thanks to the great community on the RPi forums I was able to come up with a simple solution for my immediate problem: putting the core 0 stack below the core 1 stack in the memmap_default.ld file (placing core 0 in scratch X, and core 1 in scratch Y). I controlled all the code running on core 1 and could ensure it wouldn’t overflow its stack, whereas core 0 ran application code that I couldn’t control. With this arrangement, core 0’s stack overflows into the heap but leaves core 1’s stack untouched. I also learned a few helpful techniques from the forum thread, such as compiling with -fstack-usage to get the stack size of each function, and the ‘USE_STACK_GUARDS’ macro that can check for overflows. I haven’t figured out how to specify a custom .ld file in CMake yet, but I hope to add that in the future.
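In the meantime, a crude guard word is easy to roll by hand. This is just an illustrative sketch, separate from the SDK’s stack guard support: it writes a known pattern at __StackBottom (which I believe is the symbol memmap_default.ld uses for the bottom of core 0’s nominal 2KB stack) and checks it after running suspect code. It can’t catch an overflow that skips over the guard word entirely:

#include <stdint.h>
#include <stdio.h>
#include "pico/stdlib.h"

extern char __StackBottom;  // Assumed: bottom of core 0's nominal stack region.

#define GUARD_PATTERN 0xDEADBEEFu

// Plant the pattern before the stack has grown anywhere near the boundary.
static void arm_stack_guard(void) {
  *(volatile uint32_t *)&__StackBottom = GUARD_PATTERN;
}

// If the pattern has been overwritten, the stack has crossed the 2KB mark.
static int stack_guard_intact(void) {
  return *(volatile uint32_t *)&__StackBottom == GUARD_PATTERN;
}

int main(void) {
  stdio_init_all();
  arm_stack_guard();
  // ... run the code you suspect of overflowing the stack here ...
  printf("stack guard %s\n", stack_guard_intact() ? "intact" : "smashed");
  while (1) tight_loop_contents();
}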

I hope this brain dump of what I learned about the RP2040’s memory layout and the potential for silent stack overflows helps somebody else out there. It was one of the most elusive bugs I’ve chased in quite a while, but it was very satisfying to finally understand what was going on. One of the reasons that I enjoy working on embedded platforms is that they are small enough systems that it should be possible to figure out any unexpected behavior, but this one tested my faith in that idea!

Doom, Dark Compute, and AI

Back in 2020 Foone Turing caused a sensation when she showed Doom running on a pregnancy test. For anyone who remembers desktop computers from the ’90s, it was amazing to see a disposable device run something that used to take thousands of dollars worth of hardware. It’s not a fluke either – calculators, ATMs, fridges, and even keychains can run the game. What this shows is how much computing power low-cost, everyday objects now have. If you’d told teenage me that I could buy a 50 cent chip as powerful as my PC, my imagination would have raced with all of the amazing things that people could build.

So why does the world of embedded devices feel so boring? We have orders of magnitude more compute available than even a couple of decades ago, but no real killer apps have emerged, outside of mobile and wearables. The truth is that most compute is sitting idle most of the time. It’s like Dark Fibre after the Dot Com Bubble. In both cases it made engineering sense to add the extra capacity since the marginal cost was so low, even though the applications weren’t yet there to use it. Dark Fibre eventually gave us streaming, video calls, and the internet we know today. I think all of this Dark Compute in embedded devices will lead to a wave of innovation too, once product designers understand the possibilities.

How much Dark Compute is out there?

From Arm’s own data, there are around 100 billion (or 1e11) Arm Cortex M chips out in the world. Even if we assume most of those are the cheapest M0 class running at 100MHz, that translates to roughly 100 million (or 1e8) integer arithmetic ops per second per CPU. This suggests that on the order of 1e19 integer ops per second could be executed if they were all working at full capacity. Though this is not comparing apples to apples, that’s within a few orders of magnitude of the total FLOPS available through all the world’s active GPUs and TPUs. I’ll explain why comparing float and integer operations is interesting below, but the headline is that the embedded world contains a massive amount of computing power.

Estimating how much is actually used is harder, but the vast majority of current applications are for things like fans, appliances, or other devices that don’t need much more than simple control logic. They’re using these over-powered chips because once the price of a 32-bit MCU drops below fifty cents (or even ten cents!) it’s cheaper overall to buy a system that is easy to program and well supported, as the NRE costs start to dominate. My best guess is that ninety percent of the time these processors are left idle. That still leaves the total amount of Dark Compute somewhere around 1e19 integer ops per second.
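For what it’s worth, here’s that back-of-the-envelope estimate spelled out, with my guesses labeled as guesses:

#include <stdio.h>

int main(void) {
  double chips = 1e11;         // Arm's roughly 100 billion Cortex M figure.
  double ops_per_chip = 1e8;   // An M0-class core at 100MHz, about one op per cycle.
  double idle_fraction = 0.9;  // My guess: ninety percent of that capacity sits unused.

  double total_ops = chips * ops_per_chip;          // About 1e19 integer ops per second.
  double dark_compute = total_ops * idle_fraction;  // The idle portion, still around 1e19.

  printf("total capacity: %.1e integer ops per second\n", total_ops);
  printf("dark compute:   %.1e integer ops per second\n", dark_compute);
  return 0;
}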

What can we use Dark Compute for?

AI!

You might have guessed where I’m going from the title, but we have an amazing opportunity to turn all of this dead silicon into delightful experiences for users. It’s now possible to run speech recognition to offer voice interfaces on everyday devices, generate local closed captions and translations for accessibility, add person sensing so your TV can pause when you get up to make a cup of tea, let you play air drums or control things with gestures, brew coffee perfectly, or make a hundred other interface improvements, all using the same underlying machine learning technology. In many cases this doesn’t even need a hardware change, because the systems already have Dark Compute lying idle. Even better, the quality of the AI scales with the compute available, so as more modern chips are used the capabilities of these interface features grow too. This kind of AI also only needs 8-bit integer operations to execute, which is why comparing FLOPS and integer ops in terms of computing capacity is reasonable.
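To make that concrete, here’s the kind of inner loop quantized inference spends nearly all of its time in: 8-bit multiplies accumulated into a 32-bit register. This is just an illustrative sketch, not TensorFlow Lite Micro’s actual kernel code, but it shows why no floating point hardware is needed:

#include <stdint.h>
#include <stdio.h>

// Dot product of two int8 vectors: the core operation inside quantized
// convolution and fully-connected layers.
static int32_t dot_q8(const int8_t *a, const int8_t *b, int len) {
  int32_t acc = 0;
  for (int i = 0; i < len; ++i) {
    acc += (int32_t)a[i] * (int32_t)b[i];  // 8-bit multiply, 32-bit accumulate.
  }
  return acc;
}

int main(void) {
  const int8_t weights[4] = {12, -7, 33, 5};
  const int8_t inputs[4] = {-4, 90, 17, -128};
  printf("accumulated result: %ld\n", (long)dot_q8(weights, inputs, 4));
  return 0;
}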

There are plenty of challenges still to overcome, from battery usage limiting compute, to including the right sensors and making the tools easy enough to use, but I’m convinced we’re going to see a wave of incredible AI innovations once the engineering community figures out how to effectively use all this idle capacity. I’m working to make this happen with Useful Sensors, so please get in touch if you’re interested too, and I’ll be at CES next week if anyone’s around. Let’s move our compute away from the dark side!