Why I support At the Crossroads

When I first moved to San Francisco, I was shocked by how many people were living on the streets. We’re in one of the richest cities in the world, and I was appalled that we couldn’t do a better job helping them. I wanted to do something, but it was frustratingly hard to find ways to help that would do more than just salve my conscience. I even attended a few “Homeless Innovation” meetups, but from talking to the people who worked in the trenches it was clear that new technology wasn’t the solution. I did discover some non-profits doing important work through the group though, like the Lava Mae shower bus project, and At the Crossroads.

Rob Gitin, the co-founder of the group, gave a short presentation about what ATC did, and it made a lot of sense to me. For seventeen years, groups of staff members have walked the Tenderloin and Mission districts at night, talking to young homeless people and handing out basic necessities like toothbrushes, snacks, and clothes. There’s no agenda; the goal is just to make contact with people and start a conversation. As trust grows, the young people can get informal counseling on the spot, practical advice about connecting with other services that are available, and much more, but what most impressed me was ATC’s focus on just listening.

I know from my own life that just feeling like you’re being heard can make a massive difference, and unlike a lot of non-profit programs, the recipients are in control. They’re not being pushed down prescribed programs administered from above; it’s a grass-roots approach that lets them choose what help they need, and when, without judgment, delays, or paperwork.

I’ve stayed involved with ATC since then, trying to help them as a donor (most recently with a boost from Google’s generous gift-matching program). One of the perks has been getting the newsletter every few months, which is simple but beautifully written, and often very moving as it focuses on the stories of the clients. You can view the latest issue here, and what really struck me this time was how Rob distilled the group’s philosophy on helping people in his editorial:

It took me 10 years of doing this work before I realized there was no better topic to discuss than relationships. Knowing how to build and sustain healthy relationships, and how to navigate difficult ones, is the single most important tool our youth can develop that will empower them to build the lives they want. It can have a greater impact on their long-term stability than getting into housing, going back to school, or finding work.

Our clients can get a job, but if they don’t know how to deal with a harsh boss, they will quit or get fired. They can find a room in an apartment or in subsidized housing, but if they can’t navigate roommate conflicts or deal with a case manager they don’t like, they will lose their housing. Furthermore, if they don’t have a strong, supportive community, losing a job or housing will send them right back to the streets.

Stable, long-term relationships are the building blocks upon which our youth create healthy and fulfilling lives. They also feed the heart and the soul. Care without condition nurtures hope, which is often in short supply for our youth. For many, we are the first people to reflect back what is special about them, and who actually see them for all that they are. Hope is a prerequisite for change, and it is wonderful to get to instill it.

It’s this kind of practical, humane wisdom that makes me happy that ATC exists, and is out there night after night helping people. If you’re concerned about homelessness in San Francisco, but you’ve felt lost trying to find a practical way to help, I encourage you to check out what ATC does, and all the different ways you can get involved. It’s not a magic solution to all the pain out there, but I really see them making a difference, one person at a time.

Why are Eight Bits Enough for Deep Neural Networks?

[Image: Turbo Esprit alternative loading screen]

Picture by Retronator

Deep learning is a very weird technology. It evolved over decades on a very different track than the mainstream of AI, kept alive by the efforts of a handful of believers. When I started using it a few years ago, it reminded me of the first time I played with an iPhone – it felt like I’d been handed something that had been sent back to us from the future, or alien technology.

One of the consequences of that is that my engineering intuitions about it are often wrong. When I came across im2col, the memory redundancy seemed crazy, based on my experience with image processing, but it turns out it’s an efficient way to tackle the problem. While there are more complex approaches that can yield better results, they’re not the ones my graphics background would have predicted.
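
In case it’s unfamiliar, here’s a rough sketch of the idea behind im2col. This is my own simplified single-channel, stride-one, no-padding version for illustration, not any particular library’s implementation: every kernel-sized patch is copied into its own row, so the convolution becomes one big matrix multiply, at the cost of duplicating each pixel up to kernel-height times kernel-width times.

```cpp
#include <cstddef>
#include <vector>

// Simplified im2col: single channel, stride one, no padding.
// Every kernel-sized patch of `image` (height x width, row-major) is copied
// into its own row of the output, so convolution with a flattened kernel
// becomes a single dense matrix multiply. Each input pixel can be duplicated
// up to kernel_h * kernel_w times; that duplication is the memory redundancy
// that looks wasteful but keeps the inner loop a dense GEMM.
std::vector<float> Im2Col(const std::vector<float>& image,
                          int height, int width,
                          int kernel_h, int kernel_w) {
  const int out_h = height - kernel_h + 1;
  const int out_w = width - kernel_w + 1;
  const int patch_size = kernel_h * kernel_w;
  std::vector<float> cols(static_cast<size_t>(out_h) * out_w * patch_size);

  size_t row = 0;
  for (int y = 0; y < out_h; ++y) {
    for (int x = 0; x < out_w; ++x) {
      // One output row per position the kernel can sit at.
      for (int ky = 0; ky < kernel_h; ++ky) {
        for (int kx = 0; kx < kernel_w; ++kx) {
          cols[row * patch_size + ky * kernel_w + kx] =
              image[static_cast<size_t>(y + ky) * width + (x + kx)];
        }
      }
      ++row;
    }
  }
  return cols;
}
```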

Another key area that seems to throw a lot of people off is how much precision you need for the calculations inside neural networks. For most of my career, precision loss has been a fairly easy thing to estimate. I almost never needed more than 32-bit floats, and if I did it was because I’d screwed up my numerical design and I had a fragile algorithm that would go wrong pretty soon even with 64 bits. 16-bit floats were good for a lot of graphics operations, as long as they weren’t chained together too deeply. I could use 8-bit values for a final output for display, or at the end of an algorithm, but they weren’t useful for much else.

It turns out that neural networks are different. You can run them with eight-bit parameters and intermediate buffers and suffer no noticeable loss in the final results. This was astonishing to me, but it’s something that’s been re-discovered over and over again. My colleague Vincent Vanhoucke wrote the only paper I’ve found covering this result for deep networks, but I’ve seen with my own eyes how it holds true across every application I’ve tried it on. I’ve also had to convince almost every other engineer I’ve told that I’m not crazy, and then watch them prove it to themselves by running a lot of their own tests, so this post is an attempt to short-circuit some of that!

How does it work?

You can see an example of a low-precision approach in the Jetpac mobile framework, though to keep things simple I keep the intermediate calculations in float and just use eight bits to compress the weights. Nervana’s NEON library also supports fp16, though not eight-bit yet. As long as you accumulate to 32 bits when you’re doing the long dot products that are the heart of the fully-connected and convolution operations (and that take up the vast majority of the time), you don’t need float; you can keep all your inputs and outputs as eight bit. I’ve even seen evidence that you can drop a bit or two below eight without too much loss! The pooling layers are fine at eight bits too. I’ve generally seen the bias addition and activation functions (other than the trivial relu) done at higher precision, but 16 bits seems fine even for those.
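
To make that concrete, here’s a minimal sketch of the kind of eight-bit dot product with a 32-bit accumulator that I’m describing. The linear quantization scheme (real = scale * (q - zero_point)) and the function name are my own illustrative choices, not the Jetpac implementation:

```cpp
#include <cstddef>
#include <cstdint>

// Dot product of two eight-bit vectors with a 32-bit integer accumulator.
// Each quantized value stands for: real = scale * (q - zero_point).
float QuantizedDot(const uint8_t* a, const uint8_t* b, size_t n,
                   float a_scale, int32_t a_zero,
                   float b_scale, int32_t b_zero) {
  int32_t acc = 0;  // The long sum stays in integer arithmetic.
  for (size_t i = 0; i < n; ++i) {
    // Each term fits easily in 32 bits; for very long dot products you'd
    // want a 64-bit accumulator or periodic rescaling.
    acc += (static_cast<int32_t>(a[i]) - a_zero) *
           (static_cast<int32_t>(b[i]) - b_zero);
  }
  // Convert back to real units exactly once, at the end.
  return a_scale * b_scale * static_cast<float>(acc);
}
```

The important detail is that the conversion back to higher precision happens once per dot product, after the accumulation, rather than once per element.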

I’ve generally taken networks that have been trained in full float and down-converted them afterwards, since I’m focused on inference, but training can also be done at low precision. Knowing that you’re aiming at a lower-precision deployment can make life easier too, even if you train in float, since you can do things like place limits on the ranges of the activation layers.
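
As an illustration of what that down-conversion can look like, here’s a minimal sketch of a simple linear min/max quantizer. The struct and function names are hypothetical, but storing the range alongside the eight-bit values so they can be approximately reconstructed later is the basic idea:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container for down-converted weights: the eight-bit values
// plus the range information needed to approximately reconstruct them.
struct QuantizedWeights {
  std::vector<uint8_t> data;
  float min;
  float scale;  // reconstructed value = min + scale * q
};

// Linearly maps each float weight into [0, 255] using the array's min/max.
// Assumes `weights` is non-empty.
QuantizedWeights QuantizeToEightBit(const std::vector<float>& weights) {
  const auto [lo_it, hi_it] =
      std::minmax_element(weights.begin(), weights.end());
  const float lo = *lo_it;
  const float hi = *hi_it;

  QuantizedWeights out;
  out.min = lo;
  out.scale = (hi > lo) ? (hi - lo) / 255.0f : 1.0f;
  out.data.reserve(weights.size());
  for (float w : weights) {
    const float q = std::round((w - lo) / out.scale);
    out.data.push_back(static_cast<uint8_t>(std::clamp(q, 0.0f, 255.0f)));
  }
  return out;
}
```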

Why does it work?

I can’t see any fundamental mathematical reason why the results should hold up so well with low precision, so I’ve come to believe that it emerges as a side-effect of a successful training process. When we are trying to teach a network, the aim is to have it understand the patterns that are useful evidence and discard the meaningless variations and irrelevant details. That means we expect the network to be able to produce good results despite a lot of noise. Dropout is a good example of synthetic grit being thrown into the machinery, so that the final network can function even with very adverse data.

The networks that emerge from this process have to be very robust numerically, with a lot of redundancy in their calculations so that small differences in input samples don’t affect the results. Compared to differences in pose, position, and orientation, the noise in images is actually a comparatively small problem to deal with. All of the layers are affected by those small input changes to some extent, so they all develop a tolerance to minor variations. That means that the differences introduced by low-precision calculations are well within the tolerances a network has learned to deal with. Intuitively, they feel like weebles that won’t fall down no matter how much you push them, thanks to an inherently stable structure.

At heart I’m an engineer, so I’ve been happy to see that it works in practice without worrying too much about why; I don’t want to look a gift horse in the mouth! What I’ve laid out here is my best guess at the cause of this property, but I would love to see a more principled explanation if any researchers want to investigate more thoroughly. [Update – here’s a related paper from Matthieu Courbariaux, thanks Scott!]

What does this mean?

This is very good news for anyone trying to optimize deep neural networks. On the general CPU side, modern SIMD instruction sets are often geared towards float, and so eight-bit calculations don’t offer a massive computational advantage on recent x86 or ARM chips. DRAM access takes a lot of electrical power though, and is slow too, so just reducing the bandwidth by 75% can be a very big help. Being able to squeeze more values into fast, low-power SRAM cache and registers is a win too.

GPUs were originally designed to take eight bit texture values, perform calculations on them at higher precisions, and then write them back out at eight bits again, so they’re a perfect fit for our needs. They generally have very wide pipes to DRAM, so the gains aren’t quite as straightforward to achieve, but can be exploited with a bit of work. I’ve learned to appreciate DSPs as great low-power solutions too, and their instruction sets are geared towards the sort of fixed-point operations we need. Custom vision chips like Movidius’ Myriad are good fits too.

Deep networks’ robustness means that they can be implemented efficiently across a very wide range of hardware. Combine this flexibility with their almost-magical effectiveness at a lot of AI tasks that have eluded us for decades, and you can see why I’m so excited about how they will alter our world over the next few years!

Jetpac’s deep learning framework on the Beaglebone Black

Photo by Michael Nika

I’ve been having a lot of fun porting the Jetpac image recognition library to new and tinier devices, and the latest addition to the family is the Beaglebone Black. As I mentioned in my Raspberry Pi 2 port, the Eigen math library has had a lot of ARM optimization work put into it recently by Benoit Jacob and Benoit Steiner, and I was able to use it to good effect on the Beagle. The overall time for the 1,000-category Imagenet task was 5.5 seconds, which isn’t enough for real time but is still promising for a lot of applications with smaller networks or slower response needs. The default OS was a bit long in the tooth though, so I had to patch Eigen to get NEON support working.

I also updated the general project documentation to describe how to build the library from source on a new device, since I’ve been seeing it pop up as a “hello world” for deep networks on new platforms. Thanks to everyone who’s reached out, it’s been great hearing about all the cool projects out there. I just can’t wait until I get my hands on a CHIP to see how much performance we can squeeze out of a $9 computer!

Image Recognition on the Raspberry Pi 2

Photo by Shashinjutsu

I loved the original Raspberry Pi; it was a great platform to run deep neural networks on, especially with a fully-programmable GPU. I was excited when the new Pi 2 was released, because it was even more powerful for the same low price. Unfortunately, I heard back from early users that the GPU code I had been using no longer worked; the device just crashed when the example program was run.

I ordered a Pi 2, and this weekend I was finally able to devote a few hours to debugging the problems. The bad news is that I wasn’t able to figure out why the GPU code is being problematic. The good news is that the CPU is so much improved on the Pi 2 that I’m able to run even faster without it, in 3.2 seconds!

I’ve checked in my changes, and you can see full directions in the README, but the summary is that by using Eigen and gcc 4.8, NEON code on the CPU is able to run the matrix calculations very fast. One of my favorite parts of joining Google has been all the open-source heroes I’ve been able to hang out with, and I’ve got to know Benoit Jacob, the founder of the Eigen project, and Benoit Steiner, a top contributor. I knew they’d been doing amazing work improving ARM performance, so I was hopeful that the latest version would be a big step forward. I was pleased to discover that the top of tree is almost 25% faster than the last stable release in January!
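
To give a flavour of what those matrix calculations look like, here’s a minimal, self-contained Eigen sketch. The layer sizes are made up and this isn’t the actual Jetpac code, but a single-precision matrix product like this is the operation that dominates the runtime, and it’s what Eigen’s ARM/NEON path speeds up when the library is built with NEON enabled:

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
  // Hypothetical layer sizes, just to exercise a large GEMM.
  const int rows = 1024, depth = 1024, cols = 64;

  // Weights and input activations as single-precision matrices.
  Eigen::MatrixXf weights = Eigen::MatrixXf::Random(rows, depth);
  Eigen::MatrixXf input = Eigen::MatrixXf::Random(depth, cols);

  // Eigen lowers this product to its optimized matrix-multiply kernels.
  Eigen::MatrixXf output = weights * input;

  std::cout << "output(0, 0) = " << output(0, 0) << std::endl;
  return 0;
}
```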

Let me know how you get on if you do dive in. I’ve had a lot of fun with this, and I hope you do too!