Why GEMM is at the heart of deep learning


Photo by Anthony Catalano

I spend most of my time worrying about how to make deep learning with neural networks faster and more power efficient. In practice that means focusing on a function called GEMM. It’s part of the BLAS (Basic Linear Algebra Subprograms) library that was first created in 1979, and until I started trying to optimize neural networks I’d never heard of it. To explain why it’s so important, here’s a diagram from my friend Yangqing Jia’s thesis:

[Chart: layer-by-layer timing breakdown from Yangqing Jia's thesis]

This is breaking down where the time’s going for a typical deep convolutional neural network doing image recognition using Alex Krizhevsky’s Imagenet architecture. All of the layers that start with fc (for fully-connected) or conv (for convolution) are implemented using GEMM, and almost all the time (95% of the GPU version, and 89% on CPU) is spent on those layers.

So what is GEMM? It stands for GEneral Matrix to Matrix Multiplication, and it essentially does exactly what it says on the tin, multiplies two input matrices together to get an output one. The difference between it and the kind of matrix operations I was used to in the 3D graphics world is that the matrices it works on are often very big. For example, a single layer in a typical network may require the multiplication of a 256 row, 1,152 column matrix by a 1,152 row, 192 column matrix to produce a 256 row, 192 column result. Naively, that requires 57 million (256 x 1,152 x 192) floating point operations and there can be dozens of these layers in a modern architecture, so I often see networks that need several billion FLOPs to calculate a single frame. Here's a diagram that I sketched to help me visualize how it works:

[Diagram: how GEMM multiplies two input matrices to produce the output matrix]
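
To make the sizes concrete, here's a minimal numpy sketch of that layer-sized multiplication. The dimensions come from the example above; everything else, including the random data, is just for illustration:

import numpy as np

m, k, n = 256, 1152, 192                        # A is m x k, B is k x n
a = np.random.rand(m, k).astype(np.float32)     # e.g. one layer's input activations
b = np.random.rand(k, n).astype(np.float32)     # e.g. that layer's weights

c = a.dot(b)                                    # the GEMM: a 256 x 192 result

# Each output value is a k-element dot product, so the multiply-add count is:
print(m * k * n)                                # 56,623,104, the ~57 million operations above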

Fully-Connected Layers

Fully-connected layers are the classic neural networks that have been around for decades, and it’s probably easiest to start with how GEMM is used for those. Each output value of an FC layer looks at every value in the input layer, multiplies them all by the corresponding weight it has for that input index, and sums the results to get its output. In terms of the diagram above, it looks like this:

[Diagram: a fully-connected layer expressed as a GEMM]

There are ‘k’ input values, and there are ‘n’ neurons, each one of which has its own set of learned weights for every input value. There are ‘n’ output values, one for each neuron, calculated by doing a dot product of its weights and the input values.
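
In code, the whole layer collapses into a single matrix multiply. Here's a minimal numpy sketch using the 'k' inputs and 'n' neurons described above; the variable names and sizes are mine, not from any particular framework:

import numpy as np

k, n = 1152, 192                                    # inputs per example, number of neurons
weights = np.random.rand(k, n).astype(np.float32)   # one column of learned weights per neuron

single_input = np.random.rand(k).astype(np.float32)
outputs = single_input.dot(weights)                 # n outputs, one dot product per neuron

# Stack a batch of m inputs as rows and the same call becomes a full GEMM:
batch = np.random.rand(256, k).astype(np.float32)
batch_outputs = batch.dot(weights)                  # 256 x n, one row of outputs per example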

Convolutional Layers

Using GEMM for the convolutional layers is a lot less of an obvious choice. A conv layer treats its input as a two dimensional image, with a number of channels for each pixel, much like a classical image with width, height, and depth. Unlike the images I was used to dealing with though, the number of channels can be in the hundreds, rather than just RGB or RGBA!

The convolution operation produces its output by taking a number of 'kernels' of weights, and applying them across the image. Here's what an input image and a single kernel look like:

[Diagram: an input image and a single convolution kernel]

Each kernel is another three-dimensional array of numbers, with the depth the same as the input image, but with a much smaller width and height, typically something like 7×7. To produce a result, a kernel is applied to a grid of points across the input image. At each point where it’s applied, all of the corresponding input values and weights are multiplied together, and then summed to produce a single output value at that point. Here’s what that looks like visually:

[Diagram: a kernel applied at a grid of points across the input image]
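
Before getting to the GEMM version, here's a deliberately naive numpy sketch of that direct operation, sliding a single kernel across the input with a stride of one and no padding. The dimensions are made up for illustration:

import numpy as np

height, width, depth = 16, 16, 3                 # a small example input image
ksize = 7                                        # kernel width and height
image = np.random.rand(height, width, depth).astype(np.float32)
kernel = np.random.rand(ksize, ksize, depth).astype(np.float32)

out_h, out_w = height - ksize + 1, width - ksize + 1
output = np.zeros((out_h, out_w), dtype=np.float32)
for y in range(out_h):
    for x in range(out_w):
        patch = image[y:y + ksize, x:x + ksize, :]   # the input cube under the kernel
        output[y, x] = np.sum(patch * kernel)        # multiply everything and sum to one value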

You can think of this operation as something like an edge detector. The kernel contains a pattern of weights, and when the part of the input image it’s looking at has a similar pattern it outputs a high value. When the input doesn’t match the pattern, the result is a low number in that position. Here are some typical patterns that are learned by the first layer of a network, courtesy of the awesome Caffe and featured on the NVIDIA blog:

[Image: the 96 kernels learned by the first layer of the Caffe network]

Because the input to the first layer is an RGB image, all of these kernels can be visualized as RGB too, and they show the primitive patterns that the network is looking for. Each one of these 96 kernels is applied in a grid pattern across the input, and the result is a series of 96 two-dimensional arrays, which are treated as an output image with a depth of 96 channels. If you’re used to image processing operations like the Sobel operator, you can probably picture how each one of these is a bit like an edge detector optimized for different important patterns in the image, and so each channel is a map of where those patterns occur across the input.

You may have noticed that I’ve been vague about what kind of grid the kernels are applied in. The key controlling factor for this is a parameter called ‘stride’, which defines the spacing between the kernel applications. For example, with a stride of 1, a 256×256 input image would have a kernel applied at every pixel, and the output would be the same width and height as the input. With a stride of 4, that same input image would only have kernels applied every four pixels, so the output would only be 64×64. Typical stride values are less than the size of a kernel, which means that in the diagram visualizing the kernel application, a lot of them would actually overlap at the edges.
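
The output size falls straight out of the stride. Here's a tiny sketch of the arithmetic, assuming the edges are padded so that a stride of one keeps the output the same width and height as the input, which is the convention the examples above use:

def conv_output_size(input_size, stride):
    # with 'same' edge padding, the output only shrinks by the stride
    return (input_size + stride - 1) // stride

print(conv_output_size(256, 1))   # 256, matching the stride-1 example
print(conv_output_size(256, 4))   # 64, matching the stride-4 example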

How GEMM works for Convolutions

This seems like quite a specialized operation. It involves a lot of multiplications and summing at the end, like the fully-connected layer, but it’s not clear how or why we should turn this into a matrix multiplication for the GEMM. I’ll talk about the motivation at the end, but here’s how the operation is expressed in terms of a matrix multiplication.

The first step is to turn the input from an image, which is effectively a 3D array, into a 2D array that we can treat like a matrix. Where each kernel is applied is a little three-dimensional cube within the image, and so we take each one of those cubes of input values and copy them out as a single column into a matrix. This is known as im2col, for image-to-column, I believe from an original Matlab function, and here’s how I visualize it:

[Diagram: im2col copying each input patch out as a column of a matrix]

Now if you’re an image-processing geek like me, you’ll probably be appalled at the expansion in memory size that happens when we do this conversion if the stride is less than the kernel size. This means that pixels that are included in overlapping kernel sites will be duplicated in the matrix, which seems inefficient. You’ll have to trust me that this wastage is outweighed by the advantages though.

Now that you have the input image in matrix form, you do the same for each kernel's weights, serializing the 3D cubes into rows as the second matrix for the multiplication. Here's what the final GEMM looks like:

[Diagram: the final GEMM between the kernel matrix and the im2col matrix]

Here 'k' is the number of values in each patch and kernel, so it's kernel width * kernel height * depth. The resulting matrix is 'Number of patches' columns wide, by 'Number of kernels' rows high. This matrix is actually treated as a 3D array by subsequent operations, taking the number-of-kernels dimension as the depth, and then splitting the patches back into rows and columns based on their original position in the input image.
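
To pull those pieces together, here's a minimal numpy sketch of the whole im2col-plus-GEMM path, following the layout described above with patches serialized into columns and kernel weights into rows. It's only an illustration of the idea, not how Caffe actually implements it, and the sizes are made up:

import numpy as np

height, width, depth = 16, 16, 3
ksize, stride, num_kernels = 7, 1, 96
image = np.random.rand(height, width, depth).astype(np.float32)
kernels = np.random.rand(num_kernels, ksize, ksize, depth).astype(np.float32)

out_h = (height - ksize) // stride + 1
out_w = (width - ksize) // stride + 1
k = ksize * ksize * depth                        # values in each patch and kernel

# im2col: copy each kernel-sized cube of the input out as a single column
patches = np.zeros((k, out_h * out_w), dtype=np.float32)
for y in range(out_h):
    for x in range(out_w):
        cube = image[y * stride:y * stride + ksize, x * stride:x * stride + ksize, :]
        patches[:, y * out_w + x] = cube.ravel()

# serialize each kernel's weights into one row of the other matrix
kernel_matrix = kernels.reshape(num_kernels, k)

# the GEMM: (num_kernels x k) times (k x num_patches) gives (num_kernels x num_patches)
result = kernel_matrix.dot(patches)

# treat the result as a 3D array again, splitting patches back into rows and columns
output = result.reshape(num_kernels, out_h, out_w)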

Why GEMM works for Convolutions

Hopefully you can now see how you can express a convolutional layer as a matrix multiplication, but it’s still not obvious why you would do it. The short answer is that it turns out that the Fortran world of scientific programmers has spent decades optimizing code to perform large matrix to matrix multiplications, and the benefits from the very regular patterns of memory access outweigh the wasteful storage costs. This paper from Nvidia is a good introduction to some of the different approaches you can use, but they also describe why they ended up with a modified version of GEMM as their favored approach. There are also a lot of advantages to being able to batch up a lot of input images against the same kernels at once, and this paper on Caffe con troll uses those to very good effect. The main competitor to the GEMM approach is using Fourier transforms to do the operation in frequency space, but the use of strides in our convolutions makes it hard to be as efficient.

The good news is that having a single, well-understood function taking up most of our time gives a very clear path to optimizing for speed and power usage, both with better software implementations and by tailoring the hardware to run the operation well. Because deep networks have proven to be useful for a massive range of applications across speech, NLP, and computer vision, I’m looking forward to seeing massive improvements over the next few years, much like the widespread demand for 3D games drove a revolution in GPUs by forcing a revolution in vertex and pixel processing operations.

(Updated to fix my incorrect matrix ordering in the diagrams, apologies to anyone who was confused!)

Give Bay Area girls a head-start in tech


This summer the Stanford AI Lab has a two-week program called SAILOR aimed at local 9th grade girls, and I think it's a wonderful chance to give promising students a strong start in a very important field. They're a scrappy grass-roots initiative within the organization though, so they do need financial support to help pay for attendees' expenses! There's no online way to sponsor the program unfortunately, but if you email them at sailors-finance@cs.stanford.edu, they'll be able to help you donate. In their own words, here's what the program is trying to accomplish:

  • To simultaneously educate and excite students about the field of AI by providing exposure to a variety of AI topics, discussing in-depth some of the cutting-edge AI research, and exploring the societal impacts of AI.
  • To foster personal growth through career development workshops, mentoring, and social events.
  • To provide students a hands-on experience with real research projects in the AI Lab.
  • The program also aims to build a close-knit community and encourage interest among underrepresented minorities in the field.

I think this is important because it’s a practical and immediate way to do something at the grassroots to address the inequalities that plague our industry, and the local area. It’s just one step, but I think it can make a real difference to the lives of the attendees.

I do have a personal reason for supporting this effort. I grew up as a “Townie” in Cambridge, England and I was fascinated by the university but never had a chance to experience what it had to offer as a child. I think it’s sad that the college was so cut off from its local community, for both sides’ sake. One of the things I love about America is that universities are far more open to the outside world, with a lot of people I know taking Stanford’s continuing studies program for example, or their online courses. There are still incredible contrasts of course, like between East Palo Alto and the main city, but at least the college is actively trying to do something about the problems.

If you can help, or if you know students who might benefit from this program, do reach out to sailors@cs.stanford.edu for more details. I’m excited to see what this initiative can accomplish!

Five short links


Photo by Brian Schoonover

Understanding genre in a collection of a million volumes – This project achieved 97% precision in identifying whether a book was poetry, prose, or non-fiction. Machines are never going to replace human scholars, but I know they can help them answer questions that would have been impossible to tackle in the past.

OpenAddresses – A wonderful resource for building geocoding tools, and one we’ve needed for a long time, I’m excited to see this collection growing.

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture – I know I'm a broken record on deep learning, but almost everywhere it's being applied it's doing better than techniques that people have been developing for decades. This example is particularly exciting because the results can be fed into other image processing algorithms; it's a big improvement in the foundations of our understanding of natural scenes.

Book Review: On the Road – Looking back on my reading growing up, I realize that the underlying appeal of a lot of books was a world where life would be easy, at least for the heroes and by extension me. I'll always remember the review of Philip K. Dick's work that pointed out his protagonists always had jobs, and they were often pretty unglamorous, and how unusual that was in sci-fi.

It's not an asshole problem – it's a bystander problem – More food for thought from Cate Huston, talking about some practical ways for men to help improve our industry's awful gender ratio without making a big song and dance of it.

Five Short Links

[Image: distribution of Documentary Hypothesis sources, from Wikipedia]

The Documentary Hypothesis – When I get frustrated by my lack of data when I'm trying to track down a bug or build a machine learning system, I try to think about how much historians manage to do with a tiny amount of information. This is a great example: the sheer amount of thought that has gone into analyzing the authors of the Old Testament over the last few centuries is mind-boggling. Scholars have created a vast structure of interlocking ideas, and are constantly rearranging the pieces, all by squeezing out clues from between the lines of the texts. I get the same sense of awe when I see archaeologists reconstructing the past; nobody can squeeze more meaning out of 'data exhaust' than they can.

Your telltale video wobble can identify you – On a related topic, it’s not just your camera’s noise pattern that’s unique, now footage from body cameras can be matched to people based on movement patterns. As we gather greater volumes of more precise data from all kinds of sensors, this kind of problem is going to come up more and more often.

How transferable are features in deep neural networks? – To save you a click, the answer is "Very"! I spent decades skeptical of neural networks; they seemed theoretically elegant but useless in practice, but over the last couple of years they've astonished me. They really have matured into general purpose tools for turning almost any noisy 'natural' data source (sound, text, images) into clean information. Suddenly everything's machine-readable!

The other side of diversity – An honest, frank, and depressing account of what it’s like to be a black woman in my world.

Weaving a very visual web – Om captures a lot of why photos and cameras are becoming so important.

A Newbie’s Guide to the San Francisco Mission Swimming Pool


Years ago I lived in Dundee, Scotland. During the long winters exercise was hard to come by, and bridies weren’t, so I had to find some kind of workout before I keeled over. Thankfully the city had a wonderful public swimming pool. It was 50 meters long, had plenty of lanes, was open late so I could use it after work, and was indoors and heated which made a welcome change from the freezing rain outside.

I’ve had trouble finding anything as good in the American cities I’ve lived in, but after some detective work I have found a public pool I like here in San Francisco, the Mission Pool. There’s not much good information about it online though, and I found it a bit intimidating to go the first time, so I want to share what I discovered to help any other newbies who might find it daunting, and help it live up to its ‘Inclusion Center’ title!

It's located between Guerrero and Valencia off 19th Street (technically on Linda Street, but that's just an alley), and you'll see a sign for the Mission Pool and Playground outside. It's an outdoor pool, but so far it hasn't been too cold, even in chilly weather. It is closed December to March though, so it's only got a couple of weeks left as I write this. There's a schedule online of the sessions they run, though I've only ever attended the lane swimming myself.

Here’s what you need to know if you’re going for the first time:

– The fee is $6, and you need exact change! There's a corner store on 19th and Guerrero where I'll often buy some water to get the right money. The attendants push the money into a sealed box, so they really don't have any change to give.

– The lockers have no locks. I bought a cheap padlock that I use, but it’s also pretty common to bring your bag out to the poolside. It’s only accessible to other swimmers, so that feels pretty safe.

– The two outside lanes tend to be used by slower swimmers, while the two center ones are faster-paced. I'm not all that speedy so I usually hang in the slow lane. It's a 25 meter public pool, so most people are pretty chill recreational swimmers; don't worry if you're slow too, you won't feel out of place.

– The etiquette is that you stay on the right side of your lane, so other people in it can pass you going the other way, or overtake if they need to. This does require a bit of cooperation, but everyone I’ve swum with has been very thoughtful and so it has worked surprisingly well. If you’re worried, hang out by the poolside first for a few minutes and you’ll see how it works. There’s usually only three or four swimmers in each lane, so it doesn’t feel crowded.

– The pool is open to the elements so you get a few leaves blown in, but otherwise the water has been clean and not too chlorinated. If you forget your goggles, there's a box of spares available too; they're a big help.

After a long break from swimming it’s been fun to get back into it at the Mission Pool, so I hope this trail of breadcrumbs helps some other newbies discover this neighborhood gem. It’s turned into one of my favorite outings in the city, so I hope you get a chance to give it a try too!

How to run the Caffe deep learning vision library on Nvidia’s Jetson mobile GPU board


Photo by Gareth Halfacree

My colleague Yangqing Jia, creator of Caffe, recently spent some free time getting the framework running on Nvidia’s Jetson board. If you haven’t heard of the Jetson, it’s a small development board that includes Nvidia’s TK1 mobile GPU chip. The TK1 is starting to appear in high-end tablets, and has 192 cores so it’s great for running computational tasks like deep learning. The Jetson’s a great way to get a taste of what we’ll be able to do on mobile devices in the future, and it runs Ubuntu so it’s also an easy environment to develop for.

Caffe comes with a pre-built ‘Alexnet’ model, a version of the Imagenet-winning architecture that recognizes 1,000 different kinds of objects. Using this as a benchmark, the Jetson can analyze an image in just 34ms! Based on this table I’m estimating it’s drawing somewhere around 10 or 11 watts, so it’s power-intensive for a mobile device but not too crazy.

Yangqing passed along his instructions, and I’ve checked them on my own Jetson, so here’s what you need to do to get Caffe up and running.

Setup

The first step once you’ve unboxed your Jetson is logging in. You can attach a monitor and keyboard, but I prefer just plugging it into a local router and ssh-ing in. elinux.org/Jetson/Remote_Access has more details, but it should show up as tegra-ubuntu.local on your local network, and the username is ubuntu:

ssh ubuntu@tegra-ubuntu.local

The default password is ubuntu. Next we need to run Nvidia’s installer that comes with the device, and reboot.

sudo NVIDIA-INSTALLER/installer.sh
sudo shutdown -r now

Once the board has rebooted, you can log back in and continue installing all the packages you’ll need for Caffe.

ssh ubuntu@tegra-ubuntu.local
sudo add-apt-repository universe
sudo apt-get update
sudo apt-get install libprotobuf-dev protobuf-compiler gfortran \
libboost-dev cmake libleveldb-dev libsnappy-dev \
libboost-thread-dev libboost-system-dev \
libatlas-base-dev libhdf5-serial-dev libgflags-dev \
libgoogle-glog-dev liblmdb-dev gcc-4.7 g++-4.7

You'll need the Cuda SDK to build and run GPU programs, and elinux.org/Tegra/Installing_CUDA has a good general guide. The summary is that you'll need to register as an Nvidia developer, download the Cuda 6.0 for ARM package to your local machine from a logged-in browser, and then copy it over to the Jetson from there.

scp ~/Downloads/cuda-repo-l4t-r19.2_6.0-42_armhf.deb ubuntu@tegra-ubuntu.local:

Then back on the ssh connection to your Tegra, run these Cuda installation steps.

sudo dpkg -i cuda-repo-l4t-r19.2_6.0-42_armhf.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-6-0
sudo usermod -a -G video $USER
echo "# Add CUDA bin & library paths:" >> ~/.bashrc
echo "export PATH=/usr/local/cuda/bin:$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc

If everything’s installed correctly, running ‘nvcc -V’ should give you a compiler version message. Now you need to grab the Tegra versions of OpenCV. On your main machine, download developer.nvidia.com/rdp/assets/opencv-run-tegra-k1 and developer.nvidia.com/rdp/assets/opencv-dev-tegra-k1 from your logged-in browser and copy them over to the Jetson.

scp ~/Downloads/libopencv4tegra* ubuntu@tegra-ubuntu.local:

On the Jetson, install those packages.

sudo dpkg -i libopencv4tegra_2.4.8.2_armhf.deb
sudo dpkg -i libopencv4tegra-dev_2.4.8.2_armhf.deb

We need to download and install Caffe. Yangqing has put in a few recent tweaks and fixes so at the moment you’ll need to grab the dev branch, but those should soon be rolled into master.

sudo apt-get install -y git
git clone https://github.com/BVLC/caffe.git
cd caffe && git checkout dev
cp Makefile.config.example Makefile.config
sed -i "s/# CUSTOM_CXX := g++/CUSTOM_CXX := g++-4.7/" Makefile.config

We have to use gcc version 4.7 because nvcc hits some problems with the default 4.8, but otherwise we’re using a pretty standard setup. You should be able to kick off the build.

make -j 8 all

Once that’s complete, you should check things are working properly by running Caffe’s test suite. This can take quite a while to finish, but hopefully it should report a clean bill of health.

make -j 8 runtest

Finally you can run Caffe’s benchmarking code to measure performance.

build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --gpu=0

This should take about 30 seconds, and output a set of statistics. It's running 50 iterations of the recognition pipeline, and each one is analyzing 10 different crops of the input image, so look at the 'Average Forward pass' time and divide by 10 to get the timing per recognition result. I see 337.86 ms as the average, so it takes 34 ms for each image. You can also try leaving off the --gpu=0 flag to see the CPU results; in my case that's about 585 ms, so you can see how much Cuda helps!
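
If it helps, here's that arithmetic as a tiny Python sketch, using the numbers I saw; your timings will vary:

average_forward_ms = 337.86                   # the 'Average Forward pass' time reported by caffe time
crops_per_image = 10                          # the Alexnet deploy model analyzes 10 crops per image
print(average_forward_ms / crops_per_image)   # roughly 34 ms per recognition result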

Why nerd culture must die


Photo by Attila Acs

My first girlfriend was someone I met through a MUD, and I had to fly 7,000 miles to see her in person. I read a paper version of the Jargon File at 15 and it became my bible. Just reading its descriptions of the internet I knew it was world-changing, even before the web, and as soon as I could I snuck into the local university computer labs with a borrowed account to experience the wonder of Usenet, FTP, and Gopher. I chose my college because Turing had once taught there, and the designer of the ARM chip would be one of my lecturers. My first job out of college was helping port the original Diablo to the first Playstation, and I spent five years writing games. I’ve dived deep into GPU programming. I’ve worked for almost two decades at both big tech companies and startups. I’ve spent countless hours writing about coding for the pure love of it. I’m a grown man who still plays Dungeons and Dragons!

My point is that if anyone can claim to be a nerd, it’s me. As a lonely teenager growing up in the English countryside, reading the Portrait of J. Random Hacker gave me a wonderful jolt of excitement and recognition. I’d never met anyone like that, but knowing that there were others out there like me gave me hope. As I went through college I started to discover a few more people who took a perverse pride in being geeks, but it was still rare and very much outside mainstream culture. Nobody really understood why I took a poorly-paid job in game programming after college instead of joining a bank, and most people’s eyes would glaze over when I mentioned I worked in computers. Over the years I gradually built a group of friends who shared the same interests in sci-fi, comics, games, and computers. It was nerd culture that brought us together, and their support was life-saving, but they were hard to find, and we were still way outside the cultural mainstream.

Over the last decade, that’s changed. Comic book adaptations are the safest bet in Hollywood. Lord of the Rings and Game of Thrones have made fantasy something anyone can enjoy without embarrassment. Perhaps most importantly, nerds now have money, power, and status. The biggest, fastest-growing companies in the world are run and staffed by us, and mainstream culture has shifted from mocking us to respect. Startups are sexy. We’ve won.

And that’s where the problem lies. We’re still behaving like the rebel alliance, but now we’re the Empire. We got where we are by ignoring outsiders and believing in ourselves even when nobody else would. The decades have proved that our way was largely right and the critics were wrong, so our habit of not listening has become deeply entrenched. It even became a bit of a bonding ritual to attack critics of the culture because they usually didn’t understand what we were doing beyond a surface level. It didn’t used to matter because nobody except a handful of forum readers would see the rants. The same reflex becomes a massive problem now that nerds wield real power. GamerGate made me ashamed to be a gamer, but the scary thing is that the underlying behavior of attacking critics felt like something I’d always seen in our culture, and tolerated. It only shocked me when it was scaled up so massively into rape and death threats, and I saw mainstream corporations like Intel folding in the face of the pressure we can bring to bear.

That's why Marc Andreessen's comment that Silicon Valley is nerd culture, and nerds are bros' natural enemies, felt so wrong. Sure, we used to be picked on or ignored by the bros, but that was when we had no money or power. Now that we have status, bros are happy to treat us as buddies instead of victims, to the point that we're unlikely to think of them as bros. I've pitched most VC firms in the Valley at one time or another, and a lot of the partners come from business or finance backgrounds. There are nerds in there too of course, and they do control the culture, but they also get along perfectly well with the preppy MBAs. The same holds true across the whole tech industry: they might have tried to steal our lunch money twenty years ago, but now they're quite happy running biz-dev while we do the engineering.

One of the things I love about nerd culture is how much it values evidence and checking facts. When I’m optimizing code, my intuition about which parts are slowest is often wildly wrong, so I’ve learned the hard way that I have to profile the hell out of it before I try to fix anything. It’s a core skill for dealing with computers, our gut feelings often don’t work in such an alien realm, so skepticism becomes a habit. What has surprised me is how we leave that habit behind when confronted with evidence about ourselves. Pretty much every statistic we can track has shown fewer women getting computer science degrees and working as engineers compared to the 80’s. It’s a basic fact that we’re an incredibly imbalanced industry in all sorts of ways, from race to class and gender, and we’re getting worse.

I’m not claiming to know the answers, but you don’t have to be a social justice warrior to notice something is going very wrong somewhere. Even the Jargon File acknowledged, to paraphrase, that hackers routinely behave like assholes. Is it a crazy leap to imagine that this deeply-rooted tolerance of terrible behavior might drive people away?

When I look around, I see the culture we’ve built turning from a liberating revolution into a repressive incumbency. We’ve built magical devices, but we don’t care enough about protecting ordinary people from harm when they use them. We don’t care that a lot of the children out there with the potential to become amazing hackers are driven away at every stage in the larval process. We don’t care about the people who lose out when we disrupt the world, just the winners (who tend to look a lot like us).

I’d always hoped we were more virtuous than the mainstream, but it turns out we just didn’t have enough power to cause much harm. Our ingrained sense of victimization has become a perverse justification for bullying. That’s why I’m calling time on nerd culture. It’s done wonderful things, but these days it’s like a crawling horror of a legacy codebase so riddled with problems the only rational decision is to deprecate it and build something better.

What would something better look like? The Maker movement gives me hope, because including all the kids we’re missing is built in from the start. Whatever the future becomes, the bottom line is we need to value being a decent human being a hell of a lot more than we do now. Our toleration of asshole behavior must end, and it’s such an integral part of nerd culture that nuking the entire thing from orbit is the only way to be sure.
