How to run the Caffe deep learning vision library on Nvidia’s Jetson mobile GPU board

Photo by Gareth Halfacree

My colleague Yangqing Jia, creator of Caffe, recently spent some free time getting the framework running on Nvidia’s Jetson board. If you haven’t heard of it, the Jetson is a small development board built around Nvidia’s Tegra K1 mobile chip. The TK1 is starting to appear in high-end tablets, and its GPU has 192 CUDA cores, so it’s great for running computational tasks like deep learning. The Jetson’s a great way to get a taste of what we’ll be able to do on mobile devices in the future, and it runs Ubuntu, so it’s also an easy environment to develop for.

Caffe comes with a pre-built ‘AlexNet’ model, a version of the ImageNet-winning architecture that recognizes 1,000 different kinds of objects. Using this as a benchmark, the Jetson can analyze an image in just 34 ms! Based on this table I’m estimating it’s drawing somewhere around 10 or 11 watts, so it’s power-hungry for a mobile device, but not too crazy.
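Both of those numbers are rough estimates, but multiplying them together gives a back-of-envelope energy cost per recognition (assuming a ~10.5 watt midpoint of that power range):

```shell
# Rough energy per recognition: an assumed ~10.5 W draw times 34 ms per image.
# Both inputs are estimates from the text above, not measurements.
awk 'BEGIN { watts = 10.5; seconds = 0.034; printf "%.2f joules per image\n", watts * seconds }'
# prints "0.36 joules per image"
```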

Yangqing passed along his instructions, and I’ve checked them on my own Jetson, so here’s what you need to do to get Caffe up and running.

Setup

The first step once you’ve unboxed your Jetson is logging in. You can attach a monitor and keyboard, but I prefer just plugging it into a local router and ssh-ing in. elinux.org/Jetson/Remote_Access has more details, but it should show up as tegra-ubuntu.local on your local network, and the username is ubuntu:

ssh ubuntu@tegra-ubuntu.local

The default password is ubuntu. Next we need to run Nvidia’s installer that comes with the device, and reboot.

sudo NVIDIA-INSTALLER/installer.sh
sudo shutdown -r now

Once the board has rebooted, you can log back in and continue installing all the packages you’ll need for Caffe.

ssh ubuntu@tegra-ubuntu.local
sudo add-apt-repository universe
sudo apt-get update
sudo apt-get install libprotobuf-dev protobuf-compiler gfortran \
libboost-dev cmake libleveldb-dev libsnappy-dev \
libboost-thread-dev libboost-system-dev \
libatlas-base-dev libhdf5-serial-dev libgflags-dev \
libgoogle-glog-dev liblmdb-dev gcc-4.7 g++-4.7

You’ll need the Cuda SDK to build and run GPU programs, and elinux.org/Tegra/Installing_CUDA has a good general guide. The summary is that you’ll need to register as an Nvidia developer, download the Cuda 6.0 for ARM package to your local machine from a logged-in browser, and then copy it over to the Jetson from there.

scp ~/Downloads/cuda-repo-l4t-r19.2_6.0-42_armhf.deb ubuntu@tegra-ubuntu.local:

Then back on the ssh connection to your Tegra, run these Cuda installation steps.

sudo dpkg -i cuda-repo-l4t-r19.2_6.0-42_armhf.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-6-0
sudo usermod -a -G video $USER
echo "# Add CUDA bin & library paths:" >> ~/.bashrc
echo "export PATH=/usr/local/cuda/bin:\$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib:\$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc
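One shell quoting subtlety to watch for here: inside double quotes, an unescaped $PATH is expanded immediately by the shell running the echo, which freezes today’s value into ~/.bashrc. Escaping the dollar sign writes the literal text so expansion happens at login time instead. A quick demonstration against a throwaway file:

```shell
# Only the backslashed form writes a literal $PATH into the file,
# which is what you want so the variable expands at login time.
demo=$(mktemp)
echo "export PATH=/usr/local/cuda/bin:\$PATH" >> "$demo"
cat "$demo"
# prints: export PATH=/usr/local/cuda/bin:$PATH
rm -f "$demo"
```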

If everything’s installed correctly, running ‘nvcc -V’ should give you a compiler version message. Now you need to grab the Tegra versions of OpenCV. On your main machine, download developer.nvidia.com/rdp/assets/opencv-run-tegra-k1 and developer.nvidia.com/rdp/assets/opencv-dev-tegra-k1 from your logged-in browser and copy them over to the Jetson.

scp ~/Downloads/libopencv4tegra* ubuntu@tegra-ubuntu.local:

On the Jetson, install those packages.

sudo dpkg -i libopencv4tegra_2.4.8.2_armhf.deb
sudo dpkg -i libopencv4tegra-dev_2.4.8.2_armhf.deb

We need to download and install Caffe. Yangqing has put in a few recent tweaks and fixes so at the moment you’ll need to grab the dev branch, but those should soon be rolled into master.

sudo apt-get install -y git
git clone https://github.com/BVLC/caffe.git
cd caffe && git checkout dev
cp Makefile.config.example Makefile.config
sed -i "s/# CUSTOM_CXX := g++/CUSTOM_CXX := g++-4.7/" Makefile.config
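If you want to see exactly what that sed line changes before touching the real Makefile.config, you can dry-run it against a stand-in file containing just the relevant line:

```shell
# Dry-run of the CUSTOM_CXX edit on a throwaway copy of the relevant line.
demo=$(mktemp)
echo "# CUSTOM_CXX := g++" > "$demo"
sed -i "s/# CUSTOM_CXX := g++/CUSTOM_CXX := g++-4.7/" "$demo"
cat "$demo"
# prints: CUSTOM_CXX := g++-4.7
rm -f "$demo"
```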

We have to use gcc version 4.7 because nvcc hits some problems with the default 4.8, but otherwise we’re using a pretty standard setup. You should be able to kick off the build.

make -j 8 all

Once that’s complete, you should check things are working properly by running Caffe’s test suite. This can take quite a while to finish, but hopefully it should report a clean bill of health.

make -j 8 runtest

Finally you can run Caffe’s benchmarking code to measure performance.

build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --gpu=0

This should take about 30 seconds, and output a set of statistics. It’s running 50 iterations of the recognition pipeline, and each one is analyzing 10 different crops of the input image, so look at the ‘Average Forward pass’ time and divide by 10 to get the timing per recognition result. I see 337.86 ms as the average, so it takes around 34 ms for each image. You can also try leaving off the --gpu=0 flag to see the CPU results, which in my case come out to about 585 ms, so you can see how much Cuda helps!
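That divide-by-10 step is easy to sanity-check with a one-liner, using the 337.86 ms figure from my run:

```shell
# Per-image latency: average forward-pass time divided by the 10 images per pass.
awk 'BEGIN { avg_ms = 337.86; batch = 10; printf "%.1f ms per image\n", avg_ms / batch }'
# prints "33.8 ms per image"
```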

Responses

  1. Hi Gareth,

    Thanks for the writeup. One question: what makes you say that the benchmark is analyzing 10 different crops of the input image? According to the documentation, this needs to be specified by a transformation layer via “transform_param” in the prototxt, which does not appear to be the case.

    What appears to be happening is that the batch size is 10, so there are 10 forward passes that occur before the backward pass. This is specified by the first dimension at the top of the prototxt.

  2. Excited by this post, I got my fresh Jetson, installed Cuda 6.5, and followed the above directions. Regrettably, the benchmark is nowhere near yours. With no GPU I’m comparable at 532 ms, but with the GPU I’m only seeing a gain to around 233 ms, nowhere near your 34 ms. Any ideas?

    I0117 17:23:25.077291 18162 caffe.cpp:273] Average Forward pass: 2333.57 ms.

  3. Thank you so much!! I have a question about 34 ms per image. When I changed the batch size from 10 to 1 by editing deploy.prototxt, I got a result of 107 ms for the average forward pass, not about 30 ms. Am I wrong in measuring performance? Any ideas?

  4. Thanks for the walk-through. In the latest Caffe (Mar 2016), you get the following error: /usr/bin/ld: cannot find -lboost_filesystem. I resolved it by: sudo apt-get install libboost-all-dev (which may have been overkill, but I didn’t want to hit another linker error for some other part of boost down the road).

  5. Just built Caffe for the Jetson Tegra X1 board, using cuDNN v4. I was able to use g++ 4.8, so whatever previous issue there was between this and nvcc seems resolved. Running the benchmark timing, I observe the following: Average Forward pass: 102.17 ms (or about 10 ms per image), which is about a 3.4x speedup. Nice.

  6. I hit this build error:

    src/caffe/util/math_functions.cu(157): error: kernel launches from templates are not allowed in system files
    1 error detected in the compilation of “/tmp/tmpxft_000011e3_00000000-13_math_functions.cpp4.ii”.
    make: *** [.build_release/cuda/src/caffe/util/math_functions.o] Error 1
    make: *** Waiting for unfinished jobs....

    How do I solve this problem?