How to run the Caffe deep learning vision library on Nvidia’s Jetson mobile GPU board

Photo by Gareth Halfacree

My colleague Yangqing Jia, creator of Caffe, recently spent some free time getting the framework running on Nvidia’s Jetson board. If you haven’t heard of it, the Jetson is a small development board built around Nvidia’s Tegra K1 mobile chip. The TK1 is starting to appear in high-end tablets, and its GPU has 192 CUDA cores, so it’s great for running computational tasks like deep learning. The Jetson’s a great way to get a taste of what we’ll be able to do on mobile devices in the future, and it runs Ubuntu, so it’s also an easy environment to develop for.

Caffe comes with a pre-built ‘AlexNet’ model, a version of the ImageNet-winning architecture that recognizes 1,000 different kinds of objects. Using this as a benchmark, the Jetson can analyze an image in just 34 ms! Based on this table I’m estimating it’s drawing somewhere around 10 or 11 watts, so it’s power-hungry for a mobile device, but not too crazy.
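Both of those numbers are rough estimates, but multiplying them together gives a back-of-envelope energy cost per recognition (assuming a ~10.5 watt midpoint of that power range):

```shell
# Rough energy per recognition: an assumed ~10.5 W draw times 34 ms per image.
# Both inputs are estimates from the text above, not measurements.
awk 'BEGIN { watts = 10.5; seconds = 0.034; printf "%.2f joules per image\n", watts * seconds }'
# prints "0.36 joules per image"
```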

Yangqing passed along his instructions, and I’ve checked them on my own Jetson, so here’s what you need to do to get Caffe up and running.

Setup

The first step once you’ve unboxed your Jetson is logging in. You can attach a monitor and keyboard, but I prefer just plugging it into a local router and ssh-ing in. elinux.org/Jetson/Remote_Access has more details, but it should show up as tegra-ubuntu.local on your local network, and the username is ubuntu:

ssh ubuntu@tegra-ubuntu.local

The default password is ubuntu. Next we need to run Nvidia’s installer that comes with the device, and reboot.

sudo NVIDIA-INSTALLER/installer.sh
sudo shutdown -r now

Once the board has rebooted, you can log back in and continue installing all the packages you’ll need for Caffe.

ssh ubuntu@tegra-ubuntu.local
sudo add-apt-repository universe
sudo apt-get update
sudo apt-get install libprotobuf-dev protobuf-compiler gfortran \
libboost-dev cmake libleveldb-dev libsnappy-dev \
libboost-thread-dev libboost-system-dev \
libatlas-base-dev libhdf5-serial-dev libgflags-dev \
libgoogle-glog-dev liblmdb-dev gcc-4.7 g++-4.7

You’ll need the Cuda SDK to build and run GPU programs, and elinux.org/Tegra/Installing_CUDA has a good general guide. The summary is that you’ll need to register as an Nvidia developer, download the Cuda 6.0 for ARM package to your local machine from a logged-in browser, and then copy it over to the Jetson from there.

scp ~/Downloads/cuda-repo-l4t-r19.2_6.0-42_armhf.deb ubuntu@tegra-ubuntu.local:

Then back on the ssh connection to your Tegra, run these Cuda installation steps.

sudo dpkg -i cuda-repo-l4t-r19.2_6.0-42_armhf.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-6-0
sudo usermod -a -G video $USER
echo "# Add CUDA bin & library paths:" >> ~/.bashrc
echo "export PATH=/usr/local/cuda/bin:\$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib:\$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc
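One shell quoting subtlety to watch for here: inside double quotes, an unescaped $PATH is expanded immediately by the shell running the echo, which freezes today’s value into ~/.bashrc. Escaping the dollar sign writes the literal text so expansion happens at login time instead. A quick demonstration against a throwaway file:

```shell
# Only the backslashed form writes a literal $PATH into the file,
# which is what you want so the variable expands at login time.
demo=$(mktemp)
echo "export PATH=/usr/local/cuda/bin:\$PATH" >> "$demo"
cat "$demo"
# prints: export PATH=/usr/local/cuda/bin:$PATH
rm -f "$demo"
```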

If everything’s installed correctly, running ‘nvcc -V’ should give you a compiler version message. Now you need to grab the Tegra versions of OpenCV. On your main machine, download developer.nvidia.com/rdp/assets/opencv-run-tegra-k1 and developer.nvidia.com/rdp/assets/opencv-dev-tegra-k1 from your logged-in browser and copy them over to the Jetson.

scp ~/Downloads/libopencv4tegra* ubuntu@tegra-ubuntu.local:

On the Jetson, install those packages.

sudo dpkg -i libopencv4tegra_2.4.8.2_armhf.deb
sudo dpkg -i libopencv4tegra-dev_2.4.8.2_armhf.deb

We need to download and install Caffe. Yangqing has put in a few recent tweaks and fixes so at the moment you’ll need to grab the dev branch, but those should soon be rolled into master.

sudo apt-get install -y git
git clone https://github.com/BVLC/caffe.git
cd caffe && git checkout dev
cp Makefile.config.example Makefile.config
sed -i "s/# CUSTOM_CXX := g++/CUSTOM_CXX := g++-4.7/" Makefile.config
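If you want to see exactly what that sed line changes before touching the real Makefile.config, you can dry-run it against a stand-in file containing just the relevant line:

```shell
# Dry-run of the CUSTOM_CXX edit on a throwaway copy of the relevant line.
demo=$(mktemp)
echo "# CUSTOM_CXX := g++" > "$demo"
sed -i "s/# CUSTOM_CXX := g++/CUSTOM_CXX := g++-4.7/" "$demo"
cat "$demo"
# prints: CUSTOM_CXX := g++-4.7
rm -f "$demo"
```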

We have to use gcc version 4.7 because nvcc hits some problems with the default 4.8, but otherwise we’re using a pretty standard setup. You should be able to kick off the build.

make -j 8 all

Once that’s complete, you should check things are working properly by running Caffe’s test suite. This can take quite a while to finish, but hopefully it should report a clean bill of health.

make -j 8 runtest

Finally you can run Caffe’s benchmarking code to measure performance.

build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --gpu=0

This should take about 30 seconds, and output a set of statistics. It’s running 50 iterations of the recognition pipeline, and each one is analyzing 10 different crops of the input image, so look at the ‘Average Forward pass’ time and divide by 10 to get the timing per recognition result. I see 337.86 ms as the average, so it takes around 34 ms for each image. You can also try leaving off the --gpu=0 flag to see the CPU results, which in my case come out to about 585 ms, so you can see how much Cuda helps!
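That divide-by-10 step is easy to sanity-check with a one-liner, using the 337.86 ms figure from my run:

```shell
# Per-image latency: average forward-pass time divided by the 10 images per pass.
awk 'BEGIN { avg_ms = 337.86; batch = 10; printf "%.1f ms per image\n", avg_ms / batch }'
# prints "33.8 ms per image"
```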

Responses

  1. Hi Gareth,

    Thanks for the writeup. One question: what makes you say that the benchmark is analyzing 10 different crops of the input image? According to the documentation, this needs to be specified by a transformation layer via “transform_param” in the prototxt, which does not appear to be the case.

    What appears to be happening is that the batch size is 10, so there are 10 forward passes that occur before the backward pass. This is specified by the first dimension at the top of the prototxt.

  2. Excited by this post, I got my fresh Jetson, installed Cuda 6.5, and followed the above directions. Regrettably, the benchmark is nowhere near yours. With no GPU I’m comparable at 532 ms, but with the GPU I’m only seeing a gain to around 233 ms, nowhere near your 34 ms. Any ideas?

    I0117 17:23:25.077291 18162 caffe.cpp:273] Average Forward pass: 2333.57 ms.

  3. Thank you so much!! I have a question about 34 ms per image. When I changed the batch size from 10 to 1 by editing deploy.prototxt, I got a result of 107 ms for the average forward pass, not about 30 ms. Am I wrong in measuring performance? Any ideas?

  4. Thanks for the walk-through. In the latest Caffe (Mar 2016), you get the following error: /usr/bin/ld: cannot find -lboost_filesystem. I resolved it by: sudo apt-get install libboost-all-dev (which may have been overkill, but I didn’t want to hit another linker error for some other part of boost down the road).

  5. Just built Caffe for the Jetson Tegra X1 board, using cuDNN v4. I was able to use g++ 4.8, so whatever previous issue there was between this and nvcc seems resolved. Running the benchmark timing, I observe the following: Average Forward pass: 102.17 ms (or about 10 ms per image), which is about a 3.4x speedup. Nice.

  6. I hit this build error:

    src/caffe/util/math_functions.cu(157): error: kernel launches from templates are not allowed in system files
    1 error detected in the compilation of “/tmp/tmpxft_000011e3_00000000-13_math_functions.cpp4.ii”.
    make: *** [.build_release/cuda/src/caffe/util/math_functions.o] Error 1
    make: *** Waiting for unfinished jobs....

    How do I solve this problem?