How to optimize Raspberry Pi code using its GPU

(Header image: photo by Michal)

When I was at Apple, I spent five years trying to get source-code access to the Nvidia and ATI graphics drivers. My job was to accelerate image-processing operations using GPUs to do the heavy lifting, and a lot of my time went into debugging crashes or strange performance issues. I could have been a lot more effective if I’d had better insights into the underlying hardware, and been able to step through and instrument the code that controlled the graphics cards. Previously I’d written custom graphics drivers for game consoles, so I knew how useful having that level of control could be.

I never got the access I’d wanted, and it left me with an unscratched itch. I love CUDA/OpenCL and high-level shader interfaces, but the underlying hardware of graphics cards is so specialized, diverse, and quirky that you can’t treat them like black boxes and expect to get the best performance. Even with CUDA, you end up having to understand the characteristics of what’s under the hood if you want to really speed things up. I understand why most GPU manufacturers hate the idea (even just the developer support you’d need to offer for a bare-metal interface would take a lot of resources), but it still felt like a big missed opportunity to write more efficient software.

That all meant I was very excited when Broadcom released detailed documentation of the GPU used on the Raspberry Pi a few months ago. The Pi’s a great device to demonstrate the power of deep learning computer vision, and I’d ported my open-source library to run on it, but the CPU was woefully slow on the heavy math that neural networks require, taking almost twenty seconds even with optimized assembler, so I had a real problem I thought GPU acceleration might be able to help with.

Broadcom’s manual is a good description of the hardware interface to their GPU, but you’ll need more than that if you’re going to write code to run on it. In the end I was able to speed up object recognition from twenty seconds on the CPU to just three on the GPU, but it took a lot of head-scratching and help from others in the community to get there. In the spirit of leaving a trail of breadcrumbs through the forest, I’m going to run through some of what I learned along the way.

Getting started

Broadcom’s VideoCore Reference Guide will be your bible and companion; I’m constantly referring to it to understand everything from assembly instructions to interface addresses.

The very first program you should try running is the hello_fft sample included in the latest Raspbian. If you can get this running, then at least you’re set up correctly to run GPU programs.
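
If you’re hunting for it, the sample ships under `/opt/vc/src/hello_pi/hello_fft` on Raspbian; build it with the makefile in that folder and run the resulting binary with `sudo`, since it needs root access to talk to the GPU through the mailbox interface.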

There’s a missing piece in that example though – the source assembler text isn’t included, only a compiled binary blob. [Thanks to Andrew Holme and Eben for pointing me to a recent update adding the assembler code!] There isn’t an official program available to compile GPU assembler, so the next place to look is eman’s excellent blog series on writing an SHA-256 implementation. This includes a simple assembler, which I’ve forked and patched a bit to support instructions I needed for my algorithm. Once you’ve got his code running and have the assembler installed, you should be ready to begin coding.
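
For reference, here’s roughly how the assembler ends up being used in my build: the assembly source is run through the m4 preprocessor for macro expansion and then piped into qpu-asm, with a line along the lines of `m4 -I ./src/lib/pi/ src/lib/pi/gemm_float.asm | qpu-asm -o src/lib/pi/gemm_float.cdat -c g_gemm_floatCode`, which emits the compiled instruction words into a .cdat file that gets built into the library. That also means you need m4 installed, otherwise the .cdat files end up malformed.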

Debugging

There’s no debugger for the GPU, at all. You can’t even log messages. In the past I’ve had to debug shaders by writing colors to the screen, but in this case there isn’t even a visible output surface to use. I’ve never regretted investing time up-front into writing debug tools, so I created a convention where one register was reserved for debug output: its contents were written out to main memory at the end of the program (or immediately, via a LOG_AND_EXIT() macro), and then printed to the console after the code was done. It’s still painful, but this mechanism at least let me get glimpses of what was going on internally.
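
To give a flavor of the host side of that convention, here’s a minimal sketch (not the actual code from my library) that assumes the GPU kernel has already written the 16 elements of its reserved debug register out to a shared buffer the host has mapped into its address space:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The GPU program writes each of the 16 SIMD elements of the reserved
 * debug register out as one 32-bit word; debug_words points at that
 * buffer once it has been mapped on the host (via the same mailbox
 * calls that hello_fft uses). */
#define DEBUG_VECTOR_WIDTH 16

void dump_debug_register(const uint32_t *debug_words) {
  for (int i = 0; i < DEBUG_VECTOR_WIDTH; i += 1) {
    const uint32_t bits = debug_words[i];
    float as_float;
    memcpy(&as_float, &bits, sizeof(as_float)); /* reinterpret the bits, don't convert */
    printf("debug[%2d] = 0x%08x (%f)\n", i, bits, as_float);
  }
}
```

Printing each value both as raw bits and as a float helps when you’re not sure whether a register holds an address, a counter, or a partial result.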

I also highly recommend using a regular laptop to ssh into your Pi, alongside something like sshfs so you can edit source files easily in your normal editor. You’ll be crashing the device a lot during development, so having a separate development machine makes life a lot easier.
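
Something along the lines of `sshfs pi@raspberrypi:/home/pi/projects ~/pi-projects` (substituting your own hostname and paths) mounts the Pi’s source tree as a normal local folder on the laptop, and after the inevitable crash and reboot you can just remount and carry on.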

Vertex Program Memory

One of the eternal problems of GPU optimization is getting data back and forth between the main processor and the graphics chip. GPUs are blazingly fast when they’re working with data in their local memory, but coordinating the transfers so they don’t stall either processor is a very hard problem. My biggest optimization wins on the PlayStation 2 came from fiddling with the DMA controller to feed the GPU more effectively, and on modern desktop GPUs grouping data into larger batches to upload is one of the most effective ways to speed things up.

The Broadcom GPU doesn’t have very much dedicated memory at all. In fact, the only RAM that’s directly accessible is 4,096 bytes in an area known as Vertex Program Memory. This is designed to be used as a staging area for polygon coordinates so they can be transformed geometrically. My initial assumption was that this would have the fastest path into and out of the GPU, so I built my first implementation to rely on it for data transfer. Unfortunately, it has a few key flaws.

There are actually 12 cores inside the GPU, each one known as a QPU for Quad Processing Unit. The VPM memory is shared between them, so there wasn’t much available for each. I ended up using only 8 cores, and allocating 512 bytes of storage to each, which meant doing a lot of small and therefore inefficient transfers from main memory. The real killer was that a mutex lock was required before kicking off a transfer, so all of the other cores ground to a halt while one was handling an upload, which killed parallelism and overall performance.
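
To put that in concrete terms: 4,096 bytes shared across the 8 QPUs I was using works out to 512 bytes each, which is only 128 32-bit floats, or eight of the 16-float vectors that a QPU register holds, so every pass over a large matrix had to be broken up into a long stream of tiny transfers.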

Texture Memory Unit

After I released the initial VPM-based version of the matrix-to-matrix multiply GEMM function that’s the most time-consuming part of the object recognition process, several people mentioned that the Texture Memory Unit or TMU was a lot more efficient. The documentation only briefly mentions that you can use the TMU for general memory access, and there wasn’t any detail on how to do it, so I ended up looking at the disassembly of the hello_fft sample to see how it was done. I also received some help over email from Eben Upton himself, which was a lovely surprise! Here’s a summary of what I learned:

 – There are two TMUs available to each core. If you have an algorithmic way to send the same work to both, you can choose manually how to use each one by turning off ‘TMU swap’; if you leave it enabled, half the cores will be transparently rewired to use alternating TMUs for 0 and 1.

 – You write a vector of 16 addresses to registers ra56 and ra60 for TMU0 and 1 respectively, and that will start a fetch of the values held in those addresses.

 – Setting a ldtmu0/1 code in an instruction causes the next read in the pipeline to block until the memory values are returned, and then you can read from r4 to access those values in further instructions.

 – There’s a potentially long latency before those values are ready. To mitigate that, you can kick off up to four reads on each TMU before calling a ldtmu0/1. This means that memory reads can be pipelined while computation is happening on the GPU, which helps performance a lot. (There’s a C sketch of this issue-ahead pattern just after this list.)

 – To reduce extra logic-checking instructions, I don’t try to prevent overshooting on speculative reads, which means there may be accesses beyond the end of arrays (though the values aren’t used). In practice this hasn’t caused problems.

 – I didn’t dive into this yet, but there’s a 4K direct-mapped L1 cache with 64-byte lines for the TMU. Avoiding aliasing on this will be crucial for maintaining speed, and in my case I bet it depends heavily on the matrix size and the allocation of work to different QPUs. There are performance counters available to monitor cache hits and misses, and based on past experience, dividing up the data carefully so everything stays in-cache could be a big optimization.

 – A lot of my data is stored as 8 or 16-bit fixed point, and the VPM had a lot more support for converting them into float vectors than the TMU does. I discovered some funky problems, like the TMU ignoring the lower two bits of addresses and only loading from 32-bit aligned words, which was tricky when I was dealing with odd matrix widths and lower precision. There isn’t much support for ‘swizzling’ between components in the 16-float vectors that are held in each register either, beyond rotating, so I ended up doing lots of masking tricks.

 – Reading from nonsensical addresses can crash the system. During development I’d sometimes end up with wildly incorrect values for my read addresses, and that would cause a hang so severe I’d have to reboot.

 – This isn’t TMU specific, but I’ve noticed that having a display attached to your Pi taxes the GPU, and can result in slower performance by around 25%.
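
To make the fetch-ahead pattern above more concrete, here’s a minimal C sketch of the same scheduling idea, not actual QPU assembler: `issue_fetch()` and `wait_fetch()` are hypothetical stand-ins for writing an address vector to the TMU address register and for the ldtmu0 signal, and the prefetch depth of four mirrors the four outstanding requests each TMU can queue.

```c
#include <stddef.h>

#define PREFETCH_DEPTH 4 /* matches the four reads a TMU can have in flight */

/* Hypothetical stand-ins for the QPU operations: issuing a fetch models a
 * write to the TMU address register (ra56), and waiting models the ldtmu0
 * signal that blocks until the oldest outstanding request lands in r4. */
extern void issue_fetch(const float *address);
extern float wait_fetch(void);
extern float do_work(float value);

float process_array(const float *input, size_t count) {
  float total = 0.0f;
  size_t issued = 0;

  /* Prime the pipeline with up to PREFETCH_DEPTH outstanding reads. */
  while (issued < count && issued < PREFETCH_DEPTH) {
    issue_fetch(&input[issued]);
    issued += 1;
  }

  /* Steady state: every time one result is consumed, the next read is
   * issued immediately, so memory latency overlaps with do_work(). */
  for (size_t consumed = 0; consumed < count; consumed += 1) {
    float value = wait_fetch();
    if (issued < count) {
      issue_fetch(&input[issued]);
      issued += 1;
    }
    total += do_work(value);
  }
  return total;
}
```

In the real QPU code I skip the `issued < count` guard entirely and let the speculative reads overshoot the end of the array, as described above, since the extra conditional instructions cost more than the harmless reads. The cache point above also becomes concrete here: a 4K direct-mapped cache with 64-byte lines only has 64 lines, so any two addresses whose offsets differ by a multiple of 4,096 bytes map to the same line and will keep evicting each other.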

In the end I was able to perform object recognition in just three seconds with the optimized TMU code, rather than six using the VPM, which opens up a lot more potential applications!

Going Further

Developing GPU code on the Raspberry Pi has come a long way in just the last few months, but it’s still in its early stages. For example, I’m hitting mysterious system hangs when I try to run my deep learning TMU example with any kind of overclocking, and there’s no obvious way to debug those kinds of problems, especially if they’re hard to reproduce in a simple example.

The community, including folks like eman, Eben, Andrew Holme, and Herman Hermitage, is constantly improving and extending the documentation, examples, and tools, so developing should continue to get easier. I recommend keeping an eye on the Raspberry Pi forums to see the latest news!

Running the example

If you want to try out the deep learning object recognition code I developed yourself, you can follow these steps:

Install Raspbian.

Install the latest firmware by running `sudo rpi-update`.

From `raspi-config`, choose 256MB for GPU memory.

Clone qpu-asm from Github.

Run `make` inside the qpu-asm folder.

Create a symbolic link to the qpu-asm program, for example by running `sudo ln -s /home/pi/projects/qpu-asm/qpu-asm /usr/bin/`.

Clone DeepBeliefSDK from Github.

From the DeepBeliefSDK/source folder, run `make TARGET=pi GEMM=piqpu`.

Once it’s successfully completed the build, make sure the resulting library is in your path, for example by running `sudo ln -s /home/pi/projects/DeepBeliefSDK/source/libjpcnn.so /usr/lib/`.

Run `sudo ./jpcnn -i data/dog.jpg -n ../networks/jetpac.ntwk -t -m s`.

You should see the network’s top label guesses, each with a probability, followed by a line reporting how many milliseconds the classification took.

50 responses

  1. Pingback: More QPU magic from Pete Warden | Raspberry Pi

  2. Pingback: More QPU magic from Pete Warden | Raspberry World

  3. I set up your object recognition example just as you said and it works, but I get terrible performance: over 100 seconds for one photo! I am running it on the Raspberry Pi Model B+ with 512 megabytes of RAM. Please help me get it working, because the software seems great so far.

  4. I tried it again with a different photo but I got the same result. What is going on? The picture titled lena.png in the data folder got these results:
    0.484777 hacksaw
    0.515223 Euopean hoopoe
    Classification took 111039 milliseconds
    Help Me!

    • Could you try commenting back in lines 32 and 33 in source/src/lib/libjpcnn.cpp (test_qpu_gemm() and exit())? That will run a small-scale test of the internal GPU code.

      • I uncommented those lines like you said and ran the program again; this was my result:
        jpcnn: src/lib/pi/qpu_gemm.cpp:104: void qpu_cblas_sgemm_fixed(int, int, int, int, int, int, float, uint32_t, float, float, int, int, uint32_t, int, float, uint32_t, int): Assertion ‘transposeA == JPCblasTrans’ failed.

      • Sorry you’re still hitting problems, that is a very strange error! I’ve sent you an email, hopefully it will be easier to figure out what’s going wrong that way.

      • I had this problem; it turned out that I was missing the m4 preprocessor, so the .cdat files were built incorrectly (that should be a fatal error, but it isn’t). Once you install m4 you need to do a `git clean -f`; `make clean` doesn’t cut it.

  5. Pingback: More QPU magic from Pete Warden - Raspberry Pi Tips

  6. I’ve been really interested in pursuing GPU programming, but I’ve been in a bit of a slump lately. Reading your post here has gotten me back in the game, though, I think; it has rekindled my excitement. While I’m not a terribly experienced programmer as of yet, I’ve always been captivated by math and computers (hence I am a computational mathematics major in college).

    I know a lot of computer science majors, computer engineering majors, and the like who are obsessed with learning programming, but I have yet to meet a single one who’s dedicated to programming GPUs directly. Even when I go to the library there are bookcases full of topics from the typical C programming all the way to Linux security, but the GPU programming section entails fewer than 10 books, with 8 out of 10 of those dedicated to a bunch of hocus pocus related to gaming. I’ve never been a gamer, and even though I’m now building a gaming desktop, the hardware will never be used to play games. Instead I’ll be installing CentOS or Debian on it and buying a really decent gaming GPU to try to do similar things to what you’ve done here. I feel like it’s just incredible that a card I could hold in my hand like that, with more than 1,000 CUDA cores in it, has more computing power than some of the world’s largest supercomputers up until just about 15 years ago (I apologize if that statistic is wrong; I’m simply recalling what I read in a book on parallel C programming I got from the library this past week).

    I recall going to a talk once by an astronomy professor who teaches at a local college. She had a colleague who worked on taking all of the data they gathered on exoplanet transits and starspots from dedicated local and worldwide amateur astronomers who wanted to contribute to a professional team. It was a great line of research, and I felt it had a lot of potential and would likely see a lot of success. The problem, however, was that it was very, very computationally and mathematically intensive. So her colleague, as she told me, would take over a whole computer lab and run their programs on the computers in parallel, essentially three- or four-dimensional regression analysis (which to me still sounded like a bit of a guessing game since there were more factors to consider; in other words, I found their deductions a bit incomplete, but that’s beside the point here), but I couldn’t help but wonder if they could have done all of that on a single high-performance GPU. I still don’t know enough about it to this day, but I’m intrigued, and as long as I am intrigued I will continue to seek out articles like this one, learn from them, and take notes.

    I too own a RPi and I’ve been wondering what I should do with it. I recently uninstalled the GUI and all of the extra “stuff” that I saw as unnecessary, and it truly does perform a lot better. I have yet to have it crash on me as you described a few times in this article. Obviously I’m not doing anything as understood or in-depth as you, but I suppose the model B does have more to it than people give it credit for. Definitely worth the $35 I paid for it.

    Thanks for writing this great post, I’ll stay tuned in the future to learn more!

    Brandon Doyle

  7. Pingback: Raspberry pi and GPU | HONG@USTC

  8. This post took me back to my teenage years, the mid ’80s, when men were men and wrote their own assembler programs using only paper, pencil and eraser (debuggers and decompilers were not available)… I was one of those. Glorious times.

  9. In the mid ’80s we used tweezers, microscopes and tiny magnets to edit the code directly. Ah, those were the days…

  10. Yes, the good ole days. I wrote my own Asm/DisAsm for the 6502 processor in BASIC. 🙂 It worked okay. The stuff they are doing today is really cool: object recognition, etc. That’s the start of the new world.

      • I’ve already got one. I followed your instructions, but when I run `sudo ./jpcnn -i data/dog.jpg -n ../networks/jetpac.ntwk -t -m s` the system hangs… I waited 5 minutes and got nothing, and had to cold boot it as it didn’t respond to anything anymore. I had a second session (ALT-F2) with top running and it also stopped… I’ll let it run for 15 minutes; if it comes back I’ll update this reply, and if not, then I cold booted it 😉

  11. Pete,

    Have you tried this on a Pi 2 yet? I am very interested in implementing this as part of an effort to make machines for blind people. I am planning to build this as you have described soon, to test it.

    I’m hoping the time can be reduced to ~1 second. This would make it much more useful.

  12. Pingback: DOE Announces a High Performance Computing Fortran Compiler Agreement | Hackaday

  13. Pingback: DOE Announces a High Performance Computing Fortran Compiler Agreement | Hackaday

  14. Hi Pete,

    Improvement from 20 secs to 3 secs sounds very nice! I am going to build a very simple motion detector using background subtraction to learn C++ and OpenCV. I have heard that GPU processing is really efficient at matrix arithmetic. Maybe my question is on a really basic level, but do you think there is a simple beginner way of doing background subtraction and thresholding on the GPU? Is it even worth the effort?

    Thanks!

  15. Hello Pete

    I followed your explanations step by step and I cannot run the example.

    The error is:

    pi@raspberrypi ~/projects/DeepBeliefSDK/source $ sudo ./jpcnn -i data/dog.jpg -n ../networks/jetpac.ntwk -t -m s
    ./jpcnn: error while loading shared libraries: libjpcnn.so: cannot open shared object file: Error 40

    What do you think?

    Thanks!!!

  16. I spent a couple of decades writing pure assembly language. Felide-constructors sometimes look like someone who has never coded in ASM before. ARM is about the best CPU you could ask for, and read-and-discard ensures that the data has arrived with no wait-states. About the only thing I don’t like is the separate code & data caches. ARM lends itself to self-modifying code. I emulated the SNES playfields, HDMA & VDMA & audio on the Game Boy Advance. It ended up working really well, but to be honest, I’ve never liked game code. The engine has always been my domain. The Broadcom GPU copies a 32K block into L2 cache & runs it. I’m fascinated to know if that block can be reflashed. It would be good to set up TrustZone right away, not to hide code but as a debugging aid. I believe you could fit the uploader for real-time debugging in a few K, but the point is, you don’t have to keep reflashing. Just assemble, download and run. I just wondered if you have any insight into the first-stage boot. Many thanks.

  17. Seems like a great idea, but my RPi 3 keeps complaining about a problem opening /var/lib/jpcnn/char_dev, and it’s there… could this be a 32/64-bit build problem?

  18. Pingback: 记录下tensorflow的使用 – 门虽设而常关

  19. Hello Pete Warden,

    Hoping that it is not too late to ask a question.

    I followed all your steps, but I hit a problem when calling `sudo make TARGET=pi GEMM=piqpu`, which returned the following:

    Makefile:12: GEMM=piqpu
    Makefile:13: TARGET=pi
    m4 -I ./src/lib/pi/ src/lib/pi/gemm_float.asm | qpu-asm -o src/lib/pi/gemm_float.cdat -c g_gemm_floatCode
    /bin/sh: 1: qpu-asm: Permission denied
    Makefile:90: recipe for target ‘src/lib/pi/gemm_float.cdat’ failed
    make: *** [src/lib/pi/gemm_float.cdat] Error 127

  20. Pingback: An example of using computer vision on a Raspberry Pi to follow a line - Opencvpython
