How to run the Caffe deep learning vision library on Nvidia’s Jetson mobile GPU board

Photo by Gareth Halfacree

My colleague Yangqing Jia, creator of Caffe, recently spent some free time getting the framework running on Nvidia’s Jetson board. If you haven’t heard of the Jetson, it’s a small development board that includes Nvidia’s TK1 mobile GPU chip. The TK1 is starting to appear in high-end tablets, and with 192 CUDA cores it’s great for running computationally-intensive tasks like deep learning. The Jetson’s a great way to get a taste of what we’ll be able to do on mobile devices in the future, and it runs Ubuntu, so it’s also an easy environment to develop for.

Caffe comes with a pre-built ‘Alexnet’ model, a version of the Imagenet-winning architecture that recognizes 1,000 different kinds of objects. Using this as a benchmark, the Jetson can analyze an image in just 34ms! Based on this table I’m estimating it’s drawing somewhere around 10 or 11 watts, so it’s power-intensive for a mobile device but not too crazy.

Yangqing passed along his instructions, and I’ve checked them on my own Jetson, so here’s what you need to do to get Caffe up and running.

Setup

The first step once you’ve unboxed your Jetson is logging in. You can attach a monitor and keyboard, but I prefer just plugging it into a local router and ssh-ing in. elinux.org/Jetson/Remote_Access has more details, but it should show up as tegra-ubuntu.local on your local network, and the username is ubuntu:

ssh ubuntu@tegra-ubuntu.local

The default password is ubuntu. Next we need to run Nvidia’s installer that comes with the device, and reboot.

sudo NVIDIA-INSTALLER/installer.sh
sudo shutdown -r now

Once the board has rebooted, you can log back in and continue installing all the packages you’ll need for Caffe.

ssh ubuntu@tegra-ubuntu.local
sudo add-apt-repository universe
sudo apt-get update
sudo apt-get install libprotobuf-dev protobuf-compiler gfortran \
libboost-dev cmake libleveldb-dev libsnappy-dev \
libboost-thread-dev libboost-system-dev \
libatlas-base-dev libhdf5-serial-dev libgflags-dev \
libgoogle-glog-dev liblmdb-dev gcc-4.7 g++-4.7

You’ll need the Cuda SDK to build and run GPU programs, and elinux.org/Tegra/Installing_CUDA has a good general guide. The summary is that you’ll need to register as an Nvidia developer, download the Cuda 6.0 for ARM package to your local machine from a logged-in browser, and then copy it over to the Jetson.

scp ~/Downloads/cuda-repo-l4t-r19.2_6.0-42_armhf.deb ubuntu@tegra-ubuntu.local:

Then back on the ssh connection to your Tegra, run these Cuda installation steps.

sudo dpkg -i cuda-repo-l4t-r19.2_6.0-42_armhf.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-6-0
sudo usermod -a -G video $USER
echo "# Add CUDA bin & library paths:" >> ~/.bashrc
echo "export PATH=/usr/local/cuda/bin:$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc

If everything’s installed correctly, running ‘nvcc -V’ should give you a compiler version message. Now you need to grab the Tegra versions of OpenCV. On your main machine, download developer.nvidia.com/rdp/assets/opencv-run-tegra-k1 and developer.nvidia.com/rdp/assets/opencv-dev-tegra-k1 from your logged-in browser and copy them over to the Jetson.

scp ~/Downloads/libopencv4tegra* ubuntu@tegra-ubuntu.local:

On the Jetson, install those packages.

sudo dpkg -i libopencv4tegra_2.4.8.2_armhf.deb
sudo dpkg -i libopencv4tegra-dev_2.4.8.2_armhf.deb

Next we need to download and install Caffe. Yangqing has put in a few recent tweaks and fixes, so at the moment you’ll need to grab the dev branch, but those should soon be rolled into master.

sudo apt-get install -y git
git clone https://github.com/BVLC/caffe.git
cd caffe && git checkout dev
cp Makefile.config.example Makefile.config
sed -i "s/# CUSTOM_CXX := g++/CUSTOM_CXX := g++-4.7/" Makefile.config

We have to use gcc version 4.7 because nvcc hits some problems with the default 4.8, but otherwise we’re using a pretty standard setup. You should be able to kick off the build.

make -j 8 all

Once that’s complete, you should check things are working properly by running Caffe’s test suite. This can take quite a while to finish, but hopefully it should report a clean bill of health.

make -j 8 runtest

Finally you can run Caffe’s benchmarking code to measure performance.

build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --gpu=0

This should take about 30 seconds, and output a set of statistics. It’s running 50 iterations of the recognition pipeline, and each one is analyzing 10 different crops of the input image, so look at the ‘Average Forward pass’ time and divide by 10 to get the timing per recognition result. I see 337.86 ms as the average, so it takes about 34 ms for each image. You can also try leaving off the --gpu=0 flag to see the CPU results; in my case that’s about 585 ms, so you can see how much Cuda helps!
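
To spell out that arithmetic, here’s a quick worked example in Python (the 337.86 figure is just the number from my run, so yours will differ a bit):

average_forward_ms = 337.86  # the 'Average Forward pass' value caffe time reported
crops_per_pass = 10          # ten crops are analyzed per recognition result
print("%.1f ms per image" % (average_forward_ms / crops_per_pass))  # ~33.8 ms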

Why nerd culture must die

Photo by Attila Acs

My first girlfriend was someone I met through a MUD, and I had to fly 7,000 miles to see her in person. I read a paper version of the Jargon File at 15 and it became my bible. Just reading its descriptions of the internet I knew it was world-changing, even before the web, and as soon as I could I snuck into the local university computer labs with a borrowed account to experience the wonder of Usenet, FTP, and Gopher. I chose my college because Turing had once taught there, and the designer of the ARM chip would be one of my lecturers. My first job out of college was helping port the original Diablo to the first Playstation, and I spent five years writing games. I’ve dived deep into GPU programming. I’ve worked for almost two decades at both big tech companies and startups. I’ve spent countless hours writing about coding for the pure love of it. I’m a grown man who still plays Dungeons and Dragons!

My point is that if anyone can claim to be a nerd, it’s me. As a lonely teenager growing up in the English countryside, reading the Portrait of J. Random Hacker gave me a wonderful jolt of excitement and recognition. I’d never met anyone like that, but knowing that there were others out there like me gave me hope. As I went through college I started to discover a few more people who took a perverse pride in being geeks, but it was still rare and very much outside mainstream culture. Nobody really understood why I took a poorly-paid job in game programming after college instead of joining a bank, and most people’s eyes would glaze over when I mentioned I worked in computers. Over the years I gradually built a group of friends who shared the same interests in sci-fi, comics, games, and computers. It was nerd culture that brought us together, and their support was life-saving, but they were hard to find, and we were still way outside the cultural mainstream.

Over the last decade, that’s changed. Comic book adaptations are the safest bet in Hollywood. Lord of the Rings and Game of Thrones have made fantasy something anyone can enjoy without embarrassment. Perhaps most importantly, nerds now have money, power, and status. The biggest, fastest-growing companies in the world are run and staffed by us, and mainstream culture has shifted from mocking us to respecting us. Startups are sexy. We’ve won.

And that’s where the problem lies. We’re still behaving like the Rebel Alliance, but now we’re the Empire. We got where we are by ignoring outsiders and believing in ourselves even when nobody else would. The decades have proved that our way was largely right and the critics were wrong, so our habit of not listening has become deeply entrenched. It even became a bit of a bonding ritual to attack critics of the culture because they usually didn’t understand what we were doing beyond a surface level. It didn’t use to matter because nobody except a handful of forum readers would see the rants. The same reflex becomes a massive problem now that nerds wield real power. GamerGate made me ashamed to be a gamer, but the scary thing is that the underlying behavior of attacking critics felt like something I’d always seen in our culture, and tolerated. It only shocked me when it was scaled up so massively into rape and death threats, and I saw mainstream corporations like Intel folding in the face of the pressure we can bring to bear.

That’s why Marc Andreessen’s comment that Silicon Valley is nerd culture, and nerds are bros’ natural enemies, felt so wrong. Sure, we used to be picked on or ignored by the bros, but that was when we had no money or power. Now that we have status, bros are happy to treat us as buddies instead of victims, to the point that we’re unlikely to think of them as bros. I’ve pitched most VC firms in the Valley at one time or another, and a lot of the partners come from business or finance backgrounds. There are nerds in there too of course, and they do control the culture, but they also get along perfectly well with the preppy MBAs. The same holds true across the whole tech industry – they might have tried to steal our lunch money twenty years ago, but now they’re quite happy running biz-dev while we do the engineering.

One of the things I love about nerd culture is how much it values evidence and checking facts. When I’m optimizing code, my intuition about which parts are slowest is often wildly wrong, so I’ve learned the hard way that I have to profile the hell out of it before I try to fix anything. It’s a core skill for dealing with computers; our gut feelings often don’t work in such an alien realm, so skepticism becomes a habit. What has surprised me is how we leave that habit behind when confronted with evidence about ourselves. Pretty much every statistic we can track has shown fewer women getting computer science degrees and working as engineers compared to the ’80s. It’s a basic fact that we’re an incredibly imbalanced industry in all sorts of ways, from race to class and gender, and we’re getting worse.

I’m not claiming to know the answers, but you don’t have to be a social justice warrior to notice something is going very wrong somewhere. Even the Jargon File acknowledged, to paraphrase, that hackers routinely behave like assholes. Is it a crazy leap to imagine that this deeply-rooted tolerance of terrible behavior might drive people away?

When I look around, I see the culture we’ve built turning from a liberating revolution into a repressive incumbency. We’ve built magical devices, but we don’t care enough about protecting ordinary people from harm when they use them. We don’t care that a lot of the children out there with the potential to become amazing hackers are driven away at every stage in the larval process. We don’t care about the people who lose out when we disrupt the world, just the winners (who tend to look a lot like us).

I’d always hoped we were more virtuous than the mainstream, but it turns out we just didn’t have enough power to cause much harm. Our ingrained sense of victimization has become a perverse justification for bullying. That’s why I’m calling time on nerd culture. It’s done wonderful things, but these days it’s like a crawling horror of a legacy codebase so riddled with problems the only rational decision is to deprecate it and build something better.

What would something better look like? The Maker movement gives me hope, because including all the kids we’re missing is built in from the start. Whatever the future becomes, the bottom line is we need to value being a decent human being a hell of a lot more than we do now. Our toleration of asshole behavior must end, and it’s such an integral part of nerd culture that nuking the entire thing from orbit is the only way to be sure.

Untethered, by Jinky de Rivera

I haven’t made it to much local theatre recently, but thanks to Facebook I heard about an old high school friend of Joanne’s who’d written a new play. I hadn’t run across Bindlestiff before, but it’s a small theatre on 6th Street that’s focused on Filipino artists and culture, and Jinky’s play came out of a workshop there. Last night we went to see it performed as one of six single-act plays that came out of that process, and her short piece ‘Untethered’ turned out to be one of the highlights for us. There was a lot of acting talent visible in all of the plays, but her piece about love, loss, and family had an amazing script too. It was technically complex, with time-jumping, different actors playing the same characters, and a narrator addressing the audience, so it could easily have been confusing and stilted, but everything came together in a way that was moving and natural. It felt autobiographical, but part of that may just be because the characters felt so fully-drawn and real.

If you get a chance I highly recommend trying to catch the show while it’s still playing. You’ll be seeing works that are raw and in-progress, so you’ll need to be prepared for some ups and downs across the evening, but ‘Untethered’ is a consistent delight, and we came away with something good from all of them.

Five short links

Picture by Jan van Scorel

A quick programming note – I’m now at Google! This blog will continue to be my personal collection of random things and occasional rants though.

Frida – Greasemonkey for arbitrary binaries! You can hook into all sorts of function calls with Javascript, even on binaries you didn’t build yourself. I love the idea of being able to mash up desktop apps.

Spotting Tumors with Deep Learning – My friend Jeremy Howard has launched a new startup to apply deep learning to medical problems. Great to see the technology being applied to more things that matter.

Mechanical Turk Worker Protection Guidelines – It’s aimed at academics, but anyone who employs human data raters should read this as a guide on how not to be a jerk.

GPU_FFT – Andrew Holme on how he created his super-fast FFT library on the Raspberry Pi, with lots of detail on the hand-coded assembler and memory access optimizations. Geek heaven!

fork() can fail : this is important – The crazy tale of a pathological edge case with fork(), and how code that doesn’t check return values very carefully will wipe out all the processes on a machine in a mysterious fashion. “Unix: just enough potholes and bear traps to keep an entire valley going.”

How to optimize Raspberry Pi code using its GPU

Photo by Michal

When I was at Apple, I spent five years trying to get source-code access to the Nvidia and ATI graphics drivers. My job was to accelerate image-processing operations using GPUs to do the heavy lifting, and a lot of my time went into debugging crashes or strange performance issues. I could have been a lot more effective if I’d had better insights into the underlying hardware, and been able to step through and instrument the code that controlled the graphics cards. Previously I’d written custom graphics drivers for game consoles, so I knew how useful having that level of control could be.

I never got the access I’d wanted, and it left me with an unscratched itch. I love CUDA/OpenCL and high-level shader interfaces, but the underlying hardware of graphics cards is so specialized, diverse, and quirky that you can’t treat them like black boxes and expect to get the best performance. Even with CUDA, you end up having to understand the characteristics of what’s under the hood if you want to really speed things up. I understand why most GPU manufacturers hate the idea, even just the developer support you’d need to offer for a bare-metal interface would take a lot of resources, but it still felt like a big missed opportunity to write more efficient software.

That all meant I was very excited when Broadcom released detailed documentation of the GPU used on the Raspberry Pi a few months ago. The Pi’s a great device to demonstrate the power of deep learning computer vision, and I’d ported my open-source library to run on it, but the CPU was woefully slow on the heavy math that neural networks require, taking almost twenty seconds even with optimized assembler, so I had a real problem I thought GPU acceleration might be able to help with.

Broadcom’s manual is a good description of the hardware interface to their GPU, but you’ll need more than that if you’re going to write code to run on it. In the end I was able to speed up object recognition from twenty seconds on the CPU to just three on the GPU, but it took a lot of head-scratching and help from others in the community to get there. In the spirit of leaving a trail of breadcrumbs through the forest, I’m going to run through some of what I learned along the way.

Getting started

Broadcom’s Videocore Reference Guide will be your bible and companion; I’m constantly referring to it to understand everything from assembly instructions to interface addresses.

The very first program you should try running is the hello_fft sample included in the latest Raspbian. If you can get this running, then at least you’re set up correctly to run GPU programs.

There’s a missing piece in that example though – the source assembler text isn’t included, only a compiled binary blob. [Thanks to Andrew Holme and Eben for pointing me to a recent update adding the assembler code!] There isn’t an official program available to compile GPU assembler, so the next place to look is eman’s excellent blog series on writing an SHA-256 implementation. This includes a simple assembler, which I’ve forked and patched a bit to support instructions I needed for my algorithm. Once you’ve got his code running, and have the assembler installed, you should be ready to begin coding.

Debugging

There’s no debugger for the GPU, at all. You can’t even log messages. In the past I’ve had to debug shaders by writing colors to the screen, but in this case there isn’t even a visible output surface to use. I’ve never regretted investing time up-front in writing debug tools, so I created a convention where one register was reserved for debug output: it would be written out to main memory at the end of the program (or immediately, via a LOG_AND_EXIT() macro), and its contents would be printed to the console after the code was done. It’s still painful, but this mechanism at least let me get glimpses of what was going on internally.
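
As an illustration of that convention, here’s a minimal host-side sketch in Python (run_qpu_program is a hypothetical stand-in for whatever launches the GPU code, not a real API): the whole trick is reserving one slot of the result buffer for the debug value and dumping it after every run.

DEBUG_SLOT = -1  # convention: the last word of the output buffer holds debug output

def run_and_dump(run_qpu_program, output_buffer):
    run_qpu_program(output_buffer)       # stand-in for launching the QPU code
    print("QPU debug register: 0x%08x" % output_buffer[DEBUG_SLOT])
    return output_buffer[:DEBUG_SLOT]    # hand back the real results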

I also highly recommend using a regular laptop to ssh into your Pi, alongside something like sshfs so you can edit source files easily in your normal editor. You’ll be crashing the device a lot during development, so having a separate development machine makes life a lot easier.

Vertex Program Memory

One of the eternal problems of GPU optimization is getting data back and forth between the main processor and the graphics chip. GPUs are blazingly fast when they’re working with data in their local memory, but coordinating the transfers so they don’t stall either processor is a very hard problem. My biggest optimization wins on the Playstation 2 came from fiddling with the DMA controller to feed the GPU more effectively, and on modern desktop GPUs grouping data into larger batches to upload is one of the most effective ways to speed things up.

The Broadcom GPU doesn’t have very much dedicated memory at all. In fact, the only RAM that’s directly accessible is 4,096 bytes in an area known as Vertex Program Memory. This is designed to be used as a staging area for polygon coordinates so they can be transformed geometrically. My initial assumption was that this would have the fastest path into and out of the GPU, so I built my first implementation to rely on it for data transfer. Unfortunately, it has a few key flaws.

There are actually 12 cores inside the GPU, each one known as a QPU for Quad Processing Unit. The VPM memory is shared between them, so there wasn’t much available for each. I ended up using only 8 cores, and allocating 512 bytes of storage to each, which meant doing a lot of small and therefore inefficient transfers from main memory. The real killer was that a mutex lock was required before kicking off a transfer, so all of the other cores ground to a halt while one was handling an upload, which killed parallelism and overall performance.

Texture Memory Unit

After I released the initial VPM-based version of the matrix-to-matrix multiply GEMM function that’s the most time-consuming part of the object recognition process, several people mentioned that the Texture Memory Unit or TMU was a lot more efficient. The documentation only briefly mentions that you can use the TMU for general memory access, and there wasn’t any detail on how to do it, so I ended up looking at the disassembly of the hello_fft sample to see how it was done. I also received some help over email from Eben Upton himself, which was a lovely surprise! Here’s a summary of what I learned:

 – There are two TMUs available to each core. If you have an algorithmic way to send the same work to both, you can choose how to use each one manually by turning off ‘TMU swap’; if you leave it enabled, half the cores will be transparently rewired to use alternating TMUs for 0 and 1.

 – You write a vector of 16 addresses to registers ra56 and ra60 for TMU0 and 1 respectively, and that will start a fetch of the values held in those addresses.

 – Setting a ldtmu0/1 code in an instruction causes the next read in the pipeline to block until the memory values are returned, and then you can read from r4 to access those values in further instructions.

 – There’s a potentially long latency before those values are ready. To mitigate that, you can kick off up to four reads on each TMU before issuing a ldtmu0/1. This means memory reads can be pipelined while computation is happening on the GPU, which helps performance a lot (see the sketch after this list).

 – To reduce extra logic-checking instructions, I don’t try to prevent overshooting on speculative reads, which means there may be accesses beyond the end of arrays (though the values aren’t used). In practice this hasn’t caused problems.

 – I didn’t dive into this yet, but there’s a 4K direct-mapped L1 cache with 64-byte lines for the TMU. Avoiding aliasing on this will be crucial for maintaining speed, and in my case I bet it depends heavily on the matrix size and allocation of work to different QPUs. There are performance counters available to monitor cache hits and misses, and on past experience dividing up the data carefully so everything stays in-cache could be a big optimization.

 – A lot of my data is stored as 8 or 16-bit fixed point, and the VPM had a lot more support for converting them into float vectors than the TMU does. I discovered some funky problems, like the TMU ignoring the lower two bits of addresses and only loading from 32-bit aligned words, which was tricky when I was dealing with odd matrix widths and lower precision. There isn’t much support for ‘swizzling’ between components in the 16-float vectors that are held in each register either, beyond rotating, so I ended up doing lots of masking tricks.

 – Reading from nonsensical addresses can crash the system. During development I’d sometimes end up with wildly incorrect values for my read addresses, and that would cause a hang so severe I’d have to reboot.

 – This isn’t TMU specific, but I’ve noticed that having a display attached to your Pi taxes the GPU, and can result in slower performance by around 25%.
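
To make the prefetching idea more concrete, here’s a minimal Python sketch of the same software-pipelining pattern (issue_read, wait_for_read, and compute are hypothetical stand-ins for writing an address to ra56, the ldtmu0 signal, and the ALU work in the real assembler):

from collections import deque

MAX_OUTSTANDING = 4  # each TMU can have up to four reads in flight

def process(addresses, issue_read, wait_for_read, compute):
    pending = deque()
    for addr in addresses:
        issue_read(addr)              # kick off the fetch as early as possible
        pending.append(addr)
        if len(pending) == MAX_OUTSTANDING:
            compute(wait_for_read())  # only blocks if the data isn't back yet
            pending.popleft()
    while pending:                    # drain the reads still in flight
        compute(wait_for_read())
        pending.popleft()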

In the end I was able to perform object recognition in just three seconds with the optimized TMU code, rather than six using the VPM, which opens up a lot more potential applications!

Going Further

Developing GPU code on the Raspberry Pi has come a long way in just the last few months, but it’s still in its early stages. For example, I’m hitting mysterious system hangs when I try to run my deep learning TMU code with any kind of overclocking, and there’s no obvious way to debug those kinds of problems, especially if they’re hard to reproduce in a simple test case.

The community, including folks like eman, Eben, Andrew Holme, and Herman Hermitage, is constantly improving and extending the documentation, examples, and tools, so developing should continue to get easier. I recommend keeping an eye on the Raspberry Pi forums to see the latest news!

Running the example

If you want to try out the deep learning object recognition code I developed yourself, you can follow these steps:

Install Raspbian.

Install the latest firmware by running `sudo rpi-update`.

From `raspi-config`, choose 256MB for GPU memory.

Clone qpu-asm from Github.

Run `make` inside the qpu-asm folder.

Create a symbolic link to the qpu-asm program, for example by running `sudo ln -s /home/pi/projects/qpu-asm/qpu-asm /usr/bin/`.

Clone DeepBeliefSDK from Github.

From the DeepBeliefSDK/source folder, run `make TARGET=pi GEMM=piqpu`.

Once it’s successfully completed the build, make sure the resulting library is in your path, for example by running `sudo ln -s /home/pi/projects/DeepBeliefSDK/source/libjpcnn.so /usr/lib/`.

Run `sudo ./jpcnn -i data/dog.jpg -n ../networks/jetpac.ntwk -t -m s`

You should see the recognition results printed to the console.

How to get computer vision out of the unimpressive valley

Photo by Severin Sadjina

When I first saw the results of the Kaggle Cats vs Dogs competition, I was amazed by how accurate the winning entries were. When I show consumers our Spotter iPhone app, based on the same deep learning technology the contestants used, most people are distinctly underwhelmed thanks to all the mistakes it makes.

The problem is that while computer vision has got dramatically better in the last few years, it was so bad before that we’re still a long way behind what a human can achieve. Most of the obvious applications of computer vision, like the Fire Phone’s object recognition, implicitly assume a higher degree of accuracy than we can achieve, so users are left feeling disappointed and disillusioned by the technology. There’s a disconnect between researchers’ excitement about the improvements and future promise, and the general public’s expectations of what good computer vision should be able to do. I think we’re in a space much like the uncanny valley, where the technology is good enough to be built into applications, but bad enough that those apps will end up frustrating users.

I believe we need to stop trying to build applications that assume human levels of accuracy, and instead engineer around the strengths and weaknesses of the actual technology we have. Here are some of the approaches that can help.

Forgiving Interfaces

Imagine a user loads a video clip and the application suggests a template and music that fit the subject, whether it’s a wedding or a kids’ soccer match. The cost and annoyance of the algorithm getting it wrong are low because it’s just a smart suggestion the user can dismiss, so the recognition accuracy only needs to be decent, not infallible. This approach of using computer vision to assist human decisions rather than replace them can be used in a lot of applications, if the designers are willing to build an interface around the actual capabilities of the technology.
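
Here’s a minimal sketch of that pattern in Python (classify, templates, and ask_user are all hypothetical pieces of the imagined video app, not a real API): the recognizer only ever picks a dismissible default, so a wrong guess costs the user one tap instead of breaking the product.

def suggest_template(video, classify, templates, ask_user):
    label, confidence = classify(video)  # e.g. ('wedding', 0.7)
    default = templates.get(label) if confidence > 0.5 else None
    # The user always gets the final say; the model just saves them a step.
    return ask_user(options=list(templates.values()), preselected=default)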

Big Data

A lot of the ideas I see for vision applications are essentially taking a job humans currently do, and getting a computer to do it instead (e.g. identifying products on the Fire Phone). They almost always involve taking a single photo, extracting a rock-solid identification, and then fetching related data based on that. These kinds of applications fall apart if the identification piece is inaccurate, which it currently is for everything but the simplest cases like bar codes. Going in to build Jetpac’s City Guides, I knew that I wouldn’t be able to identify hipsters with 100% accuracy, but by analyzing a few thousand photos taken at the same place, I could get good data about the prevalence of hipsters at a venue even if there were some mistakes on individual images. As long as the errors are fairly random, throwing more samples at the problem will help. If you can, try to recast your application as something that will ingest a lot more photos than a human could ever deal with, and mine that bigger set for meaning.
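
As a rough illustration (detect_hipster here is a hypothetical per-photo classifier, not anything from the actual Jetpac code), averaging a noisy detector over thousands of photos from one venue gives a much more stable estimate, as long as its errors are roughly random:

def venue_hipster_score(photos, detect_hipster):
    # Each individual call may well be wrong, but the proportion across a
    # few thousand photos is far more trustworthy than any single answer.
    hits = sum(1 for photo in photos if detect_hipster(photo))
    return hits / float(len(photos))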

Grunt Work

Right now, looking at photos and making sense of them is an expensive business. Even if you give a security guard a bank of monitors, they probably can’t track more than a dozen or so in any meaningful way. With the current state of computer vision, you could have hundreds of cheap cameras in a facility and have them trigger an alert when something unusual happens, saving the guard’s superior recognition skills for making sense of the anomaly rather than trying to spot it in the first place. More generally, intelligent cameras become more like sensors that can be deployed in large numbers all over an assembly line, road tunnel, or sewer to detect when things are out of the ordinary. You’ll still need a human’s skills to investigate more deeply, but cheap computing power means you can deploy an army of smart sensors for applications you could never justify paying people to monitor manually.
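
Here’s a rough sketch of that smart-sensor pattern, using a crude frame-differencing check as a stand-in for a real detector and assuming greyscale numpy arrays as frames:

import numpy as np

def watch(frames, alert, threshold=30.0):
    previous = None
    for frame in frames:              # frame: greyscale numpy array
        current = frame.astype(np.float32)
        if previous is not None:
            change = np.abs(current - previous).mean()
            if change > threshold:    # something out of the ordinary happened
                alert(frame)          # escalate it to the human guard
        previous = current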

I’m sure there are other approaches that will help too, but my big hope is that we can be more imaginative about designing around the limitations of current vision technology, and actually start delivering some of the promise that researchers are so excited about!

Setting up Caffe on Ubuntu 14.04

A lot of people on this morning’s webcast asked if I had an Amazon EC2 image with Caffe pre-installed. I didn’t then, but I’ve just put one together! It’s available as ami-2faaa96a in the Northern California zone. There’s also a Vagrant VM at https://d2rlgkokhpr1uq.cloudfront.net/dl_webcast.box, and I’ve got full instructions for setting up your own machine on the Caffe wiki. I’m shaving yaks, so you don’t have to!
