Accelerating AI with the Raspberry Pi Pico’s dual cores

I’ve been a fan of the RP2040 chip powering the Pico since it was launched, and we’re even using them in some upcoming products, but I’d never used one of its most intriguing features: the second core. Two cores are unusual in a microcontroller, especially a seventy-cent Cortex-M0+, and system software at that level of CPU doesn’t usually have standardized support for threads or the other typical ways of getting parallelized performance from your algorithms. I still wanted to see if I could get a performance boost on compute-intensive tasks like machine learning though, so I dug into the pico_multicore library, which provides low-level access to the second core.
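
If you haven’t run into it before, here’s a minimal sketch of the shape of that API, using the SDK’s inter-core FIFO to hand a job to the second core and get a result back. This is just an illustration of the pattern, not the code from my port:

```c
#include <stdio.h>
#include "pico/stdlib.h"
#include "pico/multicore.h"

// Runs on core 1: wait for a value over the inter-core FIFO,
// do some work on it, and push the result back to core 0.
static void core1_entry(void) {
  while (true) {
    uint32_t input = multicore_fifo_pop_blocking();
    uint32_t result = input * 2;  // Stand-in for real work.
    multicore_fifo_push_blocking(result);
  }
}

int main(void) {
  stdio_init_all();
  // Start core 1 running its entry function.
  multicore_launch_core1(core1_entry);
  // Hand core 1 a job, then wait for the answer.
  multicore_fifo_push_blocking(21);
  uint32_t answer = multicore_fifo_pop_blocking();
  printf("Core 1 returned %u\n", (unsigned)answer);
  return 0;
}
```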

The summary is that I was able to get approximately a 1.9x speed boost by breaking a convolution function into two halves and running one on each processor. The longer story is that I actually implemented most of this several months ago, but got stuck thanks to a silly mistake: I was accidentally serializing the work by calling functions in the wrong order! I was in the process of preparing a bug report for the RPi team, who had kindly agreed to take a look, when I realized my mistake. Another win for rubber ducking!
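
To give a feel for what that splitting looks like, here’s a simplified sketch of the approach, with placeholder names and toy per-row arithmetic standing in for the real CMSIS-NN convolution. It also shows where the ordering matters: if core 0 waits on the FIFO before doing its own half, the two halves run back to back instead of in parallel, which is the kind of mistake I made.

```c
#include <stdint.h>
#include "pico/multicore.h"

#define NUM_ROWS 64
#define ROW_WIDTH 32

static int32_t input[NUM_ROWS][ROW_WIDTH];
static int32_t output[NUM_ROWS][ROW_WIDTH];

// Stand-in for the per-row convolution work; the real version lives in
// the modified CMSIS-NN source.
static void conv_rows(int start, int end) {
  for (int r = start; r < end; ++r) {
    for (int c = 0; c < ROW_WIDTH; ++c) {
      output[r][c] = input[r][c] * 3;  // Placeholder arithmetic.
    }
  }
}

// Which rows core 1 should handle for the current call.
static volatile int core1_start, core1_end;

// Runs forever on core 1: wait for a "go" token, do its half, signal done.
static void core1_worker(void) {
  while (true) {
    multicore_fifo_pop_blocking();
    conv_rows(core1_start, core1_end);
    multicore_fifo_push_blocking(1);
  }
}

// Called from core 0. Assumes multicore_launch_core1(core1_worker) was
// run once at startup.
static void conv_both_cores(void) {
  core1_start = NUM_ROWS / 2;
  core1_end = NUM_ROWS;
  multicore_fifo_push_blocking(1);   // Start core 1 on its half first...
  conv_rows(0, NUM_ROWS / 2);        // ...then run core 0's half in parallel.
  multicore_fifo_pop_blocking();     // Only now wait for core 1 to finish.
  // If the wait happens before core 0's own conv_rows() call, the two
  // halves run one after the other and the speedup disappears.
}
```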

If you’re interested in the details, the implementation is in my custom version of an Arm CMSIS-NN source file. I actually ended up putting together an updated version of the whole TFLite Micro library for the Pico to take advantage of this. There’s another long story behind that too. I did the first TFLM port for the Pico in my own time, and since nobody at Google or Raspberry Pi is actively working on it, it’s remained stuck at that original version. I can’t commit to being a proper maintainer of this new version either; it will be on a best-effort basis, so bugs and PRs may not be addressed, but I’ve at least tried to make it easier to keep updated with a sync/sync_with_upstream.sh script that currently works and is designed to be as robust to future changes as I can make it.

If you want more information on the potential speedup, I’ve included some benchmarking results. The lines to compare are the CONV2D results. For example, the first convolution layer takes 46ms without the optimizations and 24ms when run on both cores, roughly a 1.9x speedup for that layer. There are other layers in the benchmark that aren’t optimized, like depthwise convolution, but the overall time for running the person detection model once still drops from 782ms to 599ms, about 1.3x faster end to end. This is already a nice boost, and in the future we could do something similar for the depthwise convolution to increase the speed even more.

Thanks to the Raspberry Pi team for building a lovely little chip! Everything from the PIOs to software overclocking and dual cores makes it a fascinating system to work with, and I look forward to diving in even deeper.
