Cross-compiling TensorFlow for the Raspberry Pi

Photo by oatsy40

I love the Raspberry Pi because it’s such a great platform for software to interact with the physical world. TensorFlow makes it possible to turn messy, chaotic sensor data from cameras and microphones into useful information, so running models on the Pi has enabled some fascinating applications, from predicting train times and sorting trash to helping robots see, and even avoiding traffic tickets!

It’s never been easy to get TensorFlow installed on a Pi though. I had created a makefile script that let you build the C++ part from scratch, but it took several hours to complete and didn’t support Python. Sam Abrahams, an external contributor, did an amazing job maintaining a Python pip wheel for major releases, but building it required you to add swap space on a USB device for your Pi, and took even longer to compile than the makefile approach. Snips managed to get TensorFlow cross-compiling for Rust, but it wasn’t clear how to apply this to other languages.

Plenty of people on the team are Pi enthusiasts, and happily Eugene Brevdo dived in to investigate how we could improve the situation. We knew we wanted to have something that could be run as part of TensorFlow’s Jenkins continuous integration system, which meant building a completely automatic solution that would run with no user intervention. Since having a Pi plugged into a machine to run something like the makefile build would be hard to maintain, we did try using a hosted server from Mythic Beasts. Eugene got the makefile build going after a few hiccups, but the Python version required more RAM than was available, and we couldn’t plug in a USB drive remotely!

Cross compiling (building on an x86 Linux machine but targeting the Pi) looked a lot more maintainable, but also more complex. Thankfully we had the Snips example to give us some pointers, a kind stranger had provided a solution to a crash that had blocked me the last time I tried it, and Eugene managed to get an initial version working.

I was able to take his work, abstract it into a Docker container for full reproducibility, and now we have nightly builds running as part of our main Jenkins project. If you just want to try it out for Python 2.7, run:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
sudo pip2 install \
 http://ci.tensorflow.org/view/Nightly/job/nightly-pi/lastSuccessfulBuild/artifact/output-artifacts/tensorflow-1.4.0-cp27-none-any.whl

This can take quite a while to complete, largely because it looks like the SciPy compilation is extremely slow. Once it’s done, you’ll be able to run TensorFlow in Python 2. If you get an error about the .whl file not being found at that URL, the version number may have changed. To find the correct name, go to  http://ci.tensorflow.org/view/Nightly/job/nightly-pi/lastSuccessfulBuild/artifact/output-artifacts/ and you should see the new version listed.

For Python 3.4 support, you’ll need to use a different wheel and pip instead of pip2, like this:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
sudo pip install \
 http://ci.tensorflow.org/view/Nightly/job/nightly-pi-python3/lastSuccessfulBuild/artifact/output-artifacts/tensorflow-1.4.0-cp34-none-any.whl

If you’re running Python 3.5, you can use the same wheel but with a slight change to the file name, since that’s where the Python version is encoded. You will see a couple of warnings every time you import tensorflow, but it should work correctly.

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
curl -O http://ci.tensorflow.org/view/Nightly/job/nightly-pi-python3/lastSuccessfulBuild/artifact/output-artifacts/tensorflow-1.4.0-cp34-none-any.whl
mv tensorflow-1.4.0-cp34-none-any.whl tensorflow-1.4.0-cp35-none-any.whl
sudo pip install tensorflow-1.4.0-cp35-none-any.whl

If you have a Pi Zero or One that you want to use TensorFlow on, you’ll need to use an alternative wheel that doesn’t include NEON instructions. This is a lot slower than the one above that’s optimized for the Pi Two and above, so I don’t recommend you use it on newer devices. Here are the commands for Python 2.7:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
sudo pip2 install \
 http://ci.tensorflow.org/view/Nightly/job/nightly-pi-zero/lastSuccessfulBuild/artifact/output-artifacts/tensorflow-1.4.0rc1-cp27-none-any.whl

Here is the Python 3.4 version for the Pi Zero:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
sudo pip install \
 http://ci.tensorflow.org/view/Nightly/job/nightly-pi-zero-python3/lastSuccessfulBuild/artifact/output-artifacts/tensorflow-1.4.0-cp34-none-any.whl

And here are the Python 3.5 instructions:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
curl -O http://ci.tensorflow.org/view/Nightly/job/nightly-pi-zero-python3/lastSuccessfulBuild/artifact/output-artifacts/tensorflow-1.4.0-cp34-none-any.whl
mv tensorflow-1.4.0-cp34-none-any.whl tensorflow-1.4.0-cp35-none-any.whl
sudo pip install tensorflow-1.4.0-cp35-none-any.whl

I’ve found the scipy compilation on Pi Zeros/Ones is so slow (many hours) that it’s not feasible to wait for it to complete. Instead I’ve found myself pressing Control-C to cancel when it’s in the middle of a scipy-related compile step, and then re-running the install with the ‘--no-deps’ flag to skip building dependencies. This is extremely hacky, but since scipy is only needed for testing purposes you should have a workable copy of TensorFlow at the end, provided all the other dependencies completed.

If you want to build your own copy of the wheels, you can run this line from within the TensorFlow source root on a Linux machine with Docker installed to build for the Pi Two or Three with Python 2.7:

tensorflow/tools/ci_build/ci_build.sh PI tensorflow/tools/ci_build/pi/build_raspberry_pi.sh

For Python 3.4:

CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3 -e CROSSTOOL_PYTHON_INCLUDE_PATH=/usr/include/python3.4" tensorflow/tools/ci_build/ci_build.sh PI-PYTHON3 tensorflow/tools/ci_build/pi/build_raspberry_pi.sh

For Python 2.7 on the Pi Zero:

tensorflow/tools/ci_build/ci_build.sh PI tensorflow/tools/ci_build/pi/build_raspberry_pi.sh PI_ONE

For Python 3.4 on the Pi Zero:

CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3 -e CROSSTOOL_PYTHON_INCLUDE_PATH=/usr/include/python3.4" tensorflow/tools/ci_build/ci_build.sh PI-PYTHON3 tensorflow/tools/ci_build/pi/build_raspberry_pi.sh PI_ONE

(Note: the Docker files are currently broken because they were upgraded to use Ubuntu 16.04, and the Python cross toolchain fails to install on that version. There should be a fix visible in TensorFlow’s github within the next few days, but for now you can locally change Dockerfile.pi, etc., to use 14.04 instead.)

This is all still experimental, so please do file bugs with feedback if these don’t work for you. I’m hoping we will be able to provide official stable Pi binaries for each major release in the future, like we do for Android and iOS, so knowing how well things are working is important to me. I’m also always excited to hear about cool new applications you find for TensorFlow on the Pi, so do let me know what you build too!

A quick hack to align single-word audio recordings

As I’ve been training on the initial results of the speech gathering app, one of the challenges has been aligning the recordings. There can be a delay between somebody hitting record and saying a word, or they can say it very quickly and leave a large gap at the end of the audio file. To improve the results of the training, I wanted to find a way to standardize the start of a word in my input files, since that would also let me shorten the window of audio I’m looking at, and so reduce the overall compute time.

I looked into advanced speech alignment tools like Sphinx, but they had some pretty gnarly dependencies which I was hoping to avoid in a beginning tutorial. They also had a lot of assumptions built in that didn’t transfer well to single word commands, most didn’t have many prebuilt models, and in general they weren’t easy to integrate.

Looking at visualizations of the recordings’ waveforms in the great Fission app, it was usually pretty obvious which section had the word and which parts were just background.

[Image: waveform4.png]

In this example, the word is in the highlighted portion, and the only other peaks come from a noisy click near the end. I was hoping to find an existing tool that would recognize this kind of pattern and help me remove the background, leaving only the part I wanted. I looked at the silence-removal filters in both sox and ffmpeg, but I couldn’t find one that worked well:

  • Sox clipped initial sections of the spoken word, since there was a delay before it recognized ‘non-silence’.
  • There was an option to avoid this with ffmpeg, but reliably detecting silence meant normalizing all my clips to a standard volume level, which wasn’t something I wanted to do to speech samples.

I also couldn’t specify that I wanted a particular length of clip. In my case, I knew I wanted a second-long result, because that’s what my models take in, and all the words should fit in that length. Most of the tools out there seemed designed to remove gaps in recorded music, but intuitively it felt like my problem was more like ‘give me the second-long section with the most relevant audio in it’.

As I thought about this, I realized that the speech should be the loudest sustained part of the recording, so if I could slide a contiguous window through the audio data and pick the section that was loudest in total, I might get good results.

To visualize what I mean, imagine a simplified waveform of a two-second long clip:

[Image: waveform1.png]

To my untrained eye, it’s clear that the middle section has the most going on. To turn that into a useful definition, I estimated the volume at each point in the file using the absolute value of the PCM sample (volume = abs(value)), and then walked through the clip summing those volumes over a one-second window. By picking the point where that sum is highest:

[Image: waveform2.png]

You can clip down to a short section with the loudest audio in it:

[Image: waveform3.png]

I’m sure this particular wheel has been invented many times before, but I couldn’t find it in my searches, so I wanted to leave a trail of breadcrumbs for anyone else stuck with a similar problem. Hopefully people with more experience in this domain will also leave comments offering other suggestions!

The code itself is very straightforward, and I’ve put it up at https://github.com/petewarden/extract_loudest_section. The command line interface has only been designed for my particular use case, with one second hardcoded as the desired window length, only folders of .wavs supported, and no build file for anything other than OS X. It should be easy to port to your own system though, since it doesn’t have any dependencies outside of Posix and the C/C++ standard libraries.

The only real point of interest is that it doesn’t recalculate the whole sum at every sample. Instead it keeps a running total by subtracting the volume of the sample that’s leaving the window as it moves forward in time and adding in the volume of the new sample, which keeps the latency very low.

float current_volume_sum = 0.0f;
for (int64_t i = 0; i < desired_samples; ++i) {
  const float input_value = input[i];
  current_volume_sum += fabsf(input_value);
}
 
int64_t loudest_end_index = desired_samples;
float loudest_volume = current_volume_sum;
for (int64_t i = desired_samples; i < input_size; ++i) {
  const float trailing_value = input[i - desired_samples];
  current_volume_sum -= fabsf(trailing_value);
  const float leading_value = input[i];
  current_volume_sum += fabsf(leading_value);
  if (current_volume_sum > loudest_volume) {
    loudest_volume = current_volume_sum;
    loudest_end_index = i;
  }
}
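
The loop above only records where the loudest one-second window ends, so the final step is to copy that range out into its own buffer. Here’s a minimal sketch of that step (not the exact code from the repository), reusing the input, desired_samples, and loudest_end_index variables from above; being off by a single sample at either edge doesn’t matter for this purpose.

#include <cstdint>
#include <vector>

// Copies the loudest window out of the full recording, treating
// loudest_end_index as the end of the chosen window.
std::vector<float> ExtractLoudestWindow(const float* input,
                                        int64_t desired_samples,
                                        int64_t loudest_end_index) {
  const int64_t start_index = loudest_end_index - desired_samples;
  return std::vector<float>(input + start_index, input + loudest_end_index);
}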


What I’ve learned about neural network quantization


Photo by badjonni

It’s been a while since I last wrote about using eight bit for inference with deep learning, and the good news is that there has been a lot of progress, and we know a lot more than we did even a year ago. There are still a lot of unanswered questions too, which is why I’m waiting for a plane to take me to MobiSys, where I’ll be helping Nic Lane from UCL run a workshop for the research community to investigate some of them.

As a foundation for that, I’ll be giving a talk on what I know now, and what my hunches are. A lot of it is empirical, and we don’t have nearly enough rigorous experiments, let alone published papers, but if you take all this as provisional I hope it might still be useful. I’m also very happy to acknowledge my deep debt to my Google colleagues and others like Song Han who are the driving forces behind much of this work! Here are my notes on the areas I’ll be covering tomorrow.

Hardware implementations

Now that the original TPU paper has been published, we can use it as a successful example of using eight bit for inference across a wide variety of models within Google. There’s also the collaboration between the Qualcomm and TensorFlow teams that enables models to run up to seven times faster on the HVX DSP than on the CPU, thanks to the use of eight bit. This means we now have more evidence that this is a good approach to use on the hardware side.

Training with forward passes

I don’t have any published papers to hand, and we haven’t documented it well within TensorFlow, but we do have support for “fake quantization” operators. If you include these in your graphs at the points where quantization is expected to occur (for example after convolutions), then in the forward pass the float values will be rounded to the specified number of levels (typically 256) to simulate the effects of quantization. In the backward pass, this rounding won’t be performed, so gradients will be calculated using full float values. This has the effect of forcing the graph to adapt to the lower precision it will encounter during inference, and in practice we’ve seen this improve the accuracy of the quantized graph dramatically, sometimes to a level indistinguishable from float. It also gives precalculated min/max ranges for the 32-bit to 8-bit downscaling that needs to happen after many operations. This saves a step on the CPU, but for hardware implementations it’s even more important, since a dynamically-calculated range may be impossible to efficiently implement.
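
To make that concrete, here’s a rough sketch of what the forward-pass rounding amounts to, written as plain C++ rather than the actual TensorFlow operators (the function name is mine). The backward pass would skip this rounding and pass gradients through as if it were the identity.

#include <algorithm>
#include <cmath>

// Simulates quantization during the forward pass: clamps a float to the
// expected range, rounds it to one of 256 evenly-spaced levels, and then
// converts it back to a float.
float FakeQuantize(float value, float range_min, float range_max) {
  const int levels = 256;
  const float scale = (range_max - range_min) / (levels - 1);
  const float clamped = std::min(range_max, std::max(range_min, value));
  const float quantized = std::round((clamped - range_min) / scale);
  return range_min + (quantized * scale);
}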

By the way, if you do want fixed ranges but can’t retrain, there are some options for running example data through a pretrained network to bake them in instead.

Exact zeroes are important

The current TensorFlow way of figuring out ranges just looks at the min/max of the float values and assigns those to 0 and 255. This means that real zero is almost always not exactly representable, and the closest encoded value may represent something like 0.046464, or some other arbitrary distance from exact zero. For most numbers this doesn’t matter, because the float values are assumed to occur in a ‘random’ enough way that the error on the representation of any individual value is also uniformly random. The idea is that as long as the errors generally cancel each other out, they’ll just appear as the kind of random noise that the network is trained to cope with and so not destroy the overall accuracy by introducing a bias.

The problem is that the real value of zero shows up a lot more often than you’d expect in neural network calculations. Convolutions are padded with zeros at the edges when filters overlap them, and the Relu activation function gates any negative numbers at zero. This means that any error in the zero representation contributes disproportionately to overall results.

The solution to this is to ensure that real values of zero are represented as exactly as possible in the quantized encoding. The way to do this is to nudge the overall min/max values so that zero is exact. We’re not (yet) doing this in TensorFlow, but hope to have it in soon. For much more information, Benoit Jacob has some excellent documentation in gemmlowp, and is the source of most of the information above.
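
As an illustration of the nudging idea, here’s a small sketch based on my reading of the gemmlowp documentation rather than any shipping implementation: keep the scale derived from the raw min/max, find the encoded value that sits closest to real zero, and then shift the range so that this value decodes to exactly 0.0.

#include <cmath>

// Adjusts a float range so that real zero maps exactly onto one of the
// 256 encoded values. Assumes min <= 0.0f <= max.
void NudgeRangeForExactZero(float* range_min, float* range_max) {
  const float scale = (*range_max - *range_min) / 255.0f;
  // The encoded value currently closest to real zero.
  const float zero_point = std::round(-*range_min / scale);
  // Shift the range so that decoding zero_point gives exactly 0.0f.
  *range_min = -zero_point * scale;
  *range_max = *range_min + (255.0f * scale);
}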

Asymmetric ranges are inconvenient, but may be necessary

Constraining the min/max ranges so that the minimum is always the negative of the maximum is very convenient for a lot of purposes because it avoids having to apply an offset to the operands to the matrix multiply. Unfortunately the evidence for whether this allows for enough precision is mixed, with some models showing unacceptable loss of overall accuracy. This is still an open question, and an area where we need more experiments.

Excluding -128 can be useful

One practical issue that has come up in various contexts is that signed eight-bit values run from -128 to +127. This is inconvenient because there’s one more value on the negative side than the positive, and so it requires careful handling if we want to use symmetric ranges and ensure zero is exactly representable as encoded zero. Separately, it’s also been helpful in the ARM NEON CPU implementation to avoid -128 for the weights, to allow a faster code path. There’s not much principle behind it yet, but there is some evidence that avoiding -128 in general may be helpful.
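
As an illustrative sketch (not a description of any particular library’s code), one simple way to get a symmetric encoding that never produces -128 is to scale by the largest absolute value divided by 127, so encoded values always land in [-127, 127] and zero stays exactly representable.

#include <algorithm>
#include <cmath>
#include <cstdint>

// Symmetric eight-bit encoding that avoids -128: zero maps to encoded zero,
// and the largest magnitude in the data maps to +/-127.
int8_t QuantizeSymmetric(float value, float max_abs) {
  const float scale = max_abs / 127.0f;
  const int32_t quantized = static_cast<int32_t>(std::round(value / scale));
  return static_cast<int8_t>(
      std::max<int32_t>(-127, std::min<int32_t>(127, quantized)));
}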

Lower bit depths are promising, but unproven

There have been some fantastic papers around four-bit, two-bit, or even one-bit precision for neural networks. Unfortunately they’ve all had some practical drawbacks that have prevented us from taking advantage of them so far. Song Han’s four-bit weights require a lookup table, which makes them hard to implement efficiently at runtime, though I’m intrigued to know whether a simple function to handle the nonlinear distribution might work as well and be easier to optimize. We haven’t been able to achieve the accuracy we need on the models we care about using lower bit depths, or even four-bit linear encodings. The number of one-bit ops required also seems to scale in a way that negates the advantage of their lower precision. I don’t have any papers or documented experiments to share on this unfortunately, but I’m hopeful that these issues can be overcome in the future, so I’ll be keeping a close eye on the literature.

Models are important

A lot of what I’m discussing above are fairly low-level optimizations, but as we know from software engineering, the biggest gains are often to be found higher up the stack. Switching to a more efficient sorting algorithm will probably do more for traditional code than rewriting a less-suited one in assembler. In the same spirit, altering the model architectures so that there’s less work to do is usually a much bigger win than tweaking the bit depth. That’s why I was very pleased that we could release the Mobilenet family of models. These substantially reduce the amount of computation needed, and also work well with quantization, thanks to hard work by Andrew Howard, Benoit Jacob, Dmitry Kalenichenko, and the rest of the Mobile Vision team.

As we keep pushing on quantization, this sort of co-design between researchers and implementers is crucial to get the best results. I think there’s a whole new field beginning to emerge, which I’m not sure whether to call ML Engineering or ML Systems, looking at the whole lifecycle of a deep learning solution, all the way from initial research through to deployment in production. It’s only with that sort of integrated view that we’re going to be able to solve some of the outstanding problems we’re still facing.

Can you help me gather open speech data?


Photo by The Alien Experience

I miss having a dog, and I’d love to have a robot substitute! My friend Lukas built a $100 Raspberry Pi robot using TensorFlow to wander the house and recognize objects, and with the person detection model it can even follow me around. I want to be able to talk to my robot though, and at least have it understand simple words. To do that, I need to write a simple speech recognition example for TensorFlow.

As I looked into it, one of the biggest barriers was the lack of suitable open data sets. I need something with thousands of labelled utterances of a small set of words, from a lot of different speakers. TIDIGITS is a pretty good start, but it’s a bit small, a bit too clean, and more importantly you have to pay to download it, so it’s not great for an open source tutorial.  I like https://github.com/Jakobovski/free-spoken-digit-dataset, but it’s still small and only includes digits. LibriSpeech is large enough, but isn’t broken down into individual words, just sentences.

To solve this, I need your help! I’ve put together a website at https://open-speech-commands.appspot.com/ (now at https://aiyprojects.withgoogle.com/open_speech_recording) that asks you to speak about 100 words into the microphone, records the results, and then lets you submit the clips. I’m then hoping to release an open source data set out of these contributions, along with a TensorFlow example of a simple spoken word recognizer. The website itself is a little Flask app running on GCE, and the source code is up on github. I know it doesn’t work on iOS unfortunately, but it should work on Android devices, and any desktop machine with a microphone.

[Screenshot of the recording website]

I’m hoping to get as large a variety of accents and devices as possible, since that will help the recognizer work for as many people as possible, so please do take five minutes to record your contributions if you get a chance, and share with anyone else who might be able to help!

Running TensorFlow Graphs on Microcontrollers


Photo by PixelMixer

I gave a talk last week at the Embedded Vision Summit, and one question that came up was how to run neural networks trained in TensorFlow on tiny, power-efficient CPUs like the EFM32. This microcontroller is based on an ARM Cortex-M4 design, and uses about 2.5 milliwatts when running at 40 MHz. It’s not much to work with, but it does at least have 256KB of RAM and 1024KB of flash program memory. This is more than enough to run some simple neural networks, especially for tasks like audio hotword detection that don’t need ultra-high accuracy to be useful. It is a tricky task though, since it requires a combination of ML and embedded hardware expertise. Here’s how I would tackle it:

Make sure you can train a model on the desktop that achieves the accuracy you need for your product, ignoring embedded constraints. Usually the hardest part here is getting enough relevant training data, since for good results you’ll need something that reflects the actual data you’ll be seeing on the device.

Once you have a proof of concept model working, try to shrink it down to fit the constraints of your device. This will involve:

  • Making sure the number of weights is small enough to fit in RAM. You can compress them down to eight bit in most cases to help. The tensorflow/tools/graph_transforms/summarize_graph tool will give you estimates on the number of weights.
  • Reducing the size of the fully-connected and convolutional layers so that the number of compute operations is small enough to run within your device’s budget. The tensorflow/tools/benchmark utility with --show_flops will give you an estimate of this number. For example, I might guess that an M4 could do 2 MFLOPs/second, and so aim for a model that fits within that limit.

When you have a model that looks like it should fit, write a custom runtime to execute it. Right now I wouldn’t recommend using the standard TensorFlow runtime on an embedded device because the binary size overhead of a general interpreter is too large. Instead, manually write a piece of code that loads your weights and does the matrix multiplies. Usually your model will be comparatively simple, just a couple of fully-connected layers for example, so hopefully the amount of coding won’t be overwhelming. I am hoping we have a better solution and some examples for this use case in the future though, since I think it’s an area where neural network solutions are going to make a massive impact!
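
In the meantime, to give a sense of what manually writing that code can look like, here’s a sketch of a single fully-connected layer with eight-bit weights, followed by a Relu activation. The function names and the weight layout are my own choices rather than anything from TensorFlow, but a small network is often just two or three calls like this chained together.

#include <stdint.h>

// One fully-connected layer: output = weights * input + bias.
// The weights are stored as eight-bit values plus a single float scale,
// which keeps them small enough to fit in flash on a microcontroller.
void FullyConnected(const float* input, int input_size,
                    const int8_t* weights, float weight_scale,
                    const float* bias, float* output, int output_size) {
  for (int o = 0; o < output_size; ++o) {
    float total = bias[o];
    const int8_t* weight_row = weights + (o * input_size);
    for (int i = 0; i < input_size; ++i) {
      total += input[i] * (weight_row[i] * weight_scale);
    }
    output[o] = total;
  }
}

// A simple Relu activation to run after each layer.
void Relu(float* values, int size) {
  for (int i = 0; i < size; ++i) {
    if (values[i] < 0.0f) {
      values[i] = 0.0f;
    }
  }
}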

How the TensorFlow team handles open source support

I’ve been very impressed by how seriously the TensorFlow team and leadership have taken supporting the open source community, so I was glad I was able to write up some of the approaches we took at O’Reilly last week. The process is definitely not perfect though; just take a look at my awful backlog of Github issues! I’m hoping we’ll be able to keep learning and improving how we do things, and I’d be interested to hear how other commercial teams handle these sorts of problems too.

How to Label Images Quickly

[Screenshot of the Finder window in Column view]

I’ve found collecting great data is a lot more important than using the latest architecture when you’re trying to get good results in deep learning, so ever since my Jetpac days I’ve spent a lot of time trying to come up with good ways to refine my training sets. I’ve written or used a lot of different user interfaces custom designed for this, but surprisingly I’ve found that the stock Finder window in OS X has been the most productive!

Here is how I curated the flowers set of images that’s used in TensorFlow for Poets, and I’ve found I can sort through many thousands of images an hour using this approach.

  • Copy and decompress the images into a folder on my OS X machine.
  • Open the folder in the OS X Finder app, the normal file viewer.
  • Choose the ‘Column’ view for the Finder window, which is an icon in the top bar, the third from the left in the view choices.
  • Select the first image. You should now see a small preview picture in the right-hand column.
  • Move the mouse pointer over the right-hand edge of the window, until you see the cursor change into a ‘drag left/right’ icon.
  • Drag the right-hand side of the Finder window out. You should see the image preview get larger. Stop once the preview size is no longer growing.

You should now have a window that looks like the image at the start of the post. There are a couple of ways of using this view. If I have a set of images that have been roughly sorted, but I want to do some quality control by weeding out pictures that are misclassified, I’ll use the up and down arrow keys to move through the images, look at each preview to quickly tell if it’s correct, and press the Command and Delete keys to remove it if not. After removing a photo, the selection automatically moves onto the next image, which is convenient.

If I have a large set of photos I want to label as belonging to a set of categories, rather than just rejecting bad labels, then I’ll use a slightly more involved approach. The key is to use “Tags” in OS X (which used to be called labels). You can follow these instructions for setting up a keyboard shortcut to open the Tags menu for an item, and then move through the files using the down arrow key, assigning tags as you go. Unfortunately OS X removed the ability to apply particular tags through a single keyboard shortcut, which used to be possible in older versions of the system, but this can still be an efficient way to label large sets of images.

Another approach I sometimes use to very quickly remove a small number of bad labels is to open a folder of images using the Icon view in the finder, and then crank up the preview size slider in the bottom right corner of the window. You may have to select “View->Arrange By->Name” from the top menu to ensure that the enlarged icons all fit inside the window.

[Screenshot of the Finder window in Icon view]

I don’t find this as efficient for moving through every image as the column view, but if I want to quickly visually scan to find a few rogue images it’s very handy. I’ll usually just grab the scroll bar at the right hand side, or use mouse scroll to quickly look through the entire data set, and then click to select any that I want to remove.

What I like about these approaches is that they’re very lightweight: I don’t need to install any special software, and the speed of the preview loading in the Finder beats any custom software that I’ve found, so I can run through a lot of images very fast. Anyway, I hope you find them useful too, and do let me know your favorite labeling hacks in the comments or on Twitter.

Why Deep Learning Needs Assembler Hackers


Photo by Daniel Lopez

Take a look at this function:

 for (j = 0; j < n; j++) {
   for (i = 0; i < m; i++) {
     float total(0);
     for (l = 0; l < k; l++) {
       const size_t a_index = ((i * a_i_stride) + (l * a_l_stride));
       const float a_value = a[a_index];
       const size_t b_index = ((j * b_j_stride) + (l * b_l_stride));
       const float b_value = b[b_index];
       total += (a_value * b_value);
     }
     const size_t c_index = ((i * c_i_stride) + (j * c_j_stride));
     c[c_index] = total;
   }
 }

For something so simple, it turns out it’s amazingly hard for compilers to speed up without a lot of human intervention. This is the heart of the GEMM matrix multiply function, which powers deep learning, and every fast implementation I know has come from old-school assembler jockeys hand-tweaking instructions!

When I first started looking at the engineering side of neural networks, I assumed that I’d be following the path I’d taken through the rest of my career: getting most of my performance wins from improving the algorithms, writing clean code, and generally getting out of the way so the compiler could do its job of optimizing it. Instead I spend a large amount of my time worrying about instruction dependencies and all the other hardware details we were supposed to be able to escape in the 21st century. Why is this?

Matrix multiplies are a hard case for modern compilers to handle. The inputs used for neural networks mean that one function call may require millions of operations to complete, which magnifies the latency impact of any small changes to the code. The access patterns are entirely predictable for a long period, but not purely linear, so they don’t map well onto cache behavior when the code is written in the naive way above. There are lots of choices about how to accumulate intermediate results and reuse memory reads, which will have different outcomes depending on the sizes of the matrices involved.
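
To give a flavor of the kind of restructuring this leads to, here’s one simple cache-blocking sketch of the same calculation, processing the matrices in small tiles so values of A and B get reused while they’re still in cache. The function name and block size are arbitrary, and this isn’t close to what the optimized libraries actually do (they add register tiling, SIMD, careful memory layout, and prefetching on top), but it shows why the choices multiply quickly.

#include <algorithm>
#include <cstddef>

// A cache-blocked version of the naive loops above. C must be zeroed
// before calling, since each tile of the k dimension adds a partial sum.
void BlockedGemm(const float* a, const float* b, float* c,
                 size_t m, size_t n, size_t k,
                 size_t a_i_stride, size_t a_l_stride,
                 size_t b_j_stride, size_t b_l_stride,
                 size_t c_i_stride, size_t c_j_stride) {
  const size_t block = 64;
  for (size_t j0 = 0; j0 < n; j0 += block) {
    for (size_t i0 = 0; i0 < m; i0 += block) {
      for (size_t l0 = 0; l0 < k; l0 += block) {
        const size_t j_end = std::min(j0 + block, n);
        const size_t i_end = std::min(i0 + block, m);
        const size_t l_end = std::min(l0 + block, k);
        for (size_t j = j0; j < j_end; ++j) {
          for (size_t i = i0; i < i_end; ++i) {
            float total = 0.0f;
            for (size_t l = l0; l < l_end; ++l) {
              total += a[(i * a_i_stride) + (l * a_l_stride)] *
                       b[(j * b_j_stride) + (l * b_l_stride)];
            }
            c[(i * c_i_stride) + (j * c_j_stride)] += total;
          }
        }
      }
    }
  }
}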

All this means that the best implementations are written by members of an endangered species: hand-coding assembler experts. GotoBLAS (which evolved into OpenBLAS) showed how much speed could be gained on Intel CPUs. Eigen has had a lot of work put into it to run well on both x86 and ARM with float, and gemmlowp is optimized for eight bit on ARM. Even if you’re running on a GPU, Scott Gray (formerly at Nervana, now at OpenAI) has shown how much faster hand-coded solutions can be.

This is important because it means that there’s a lot of work involved in getting good performance from new platforms, and there’s often a big gap between existing highly-optimized solutions and those ported from other architectures. This is visible for example with gemmlowp on x86, where the hand optimization is still a work in progress and so the speed still lags behind float alternatives right now.

It’s also exciting, because the real-world performance of even the most optimized libraries lags behind the theoretical limits of the hardware, so there are still opportunities to squeeze more speed out of them with some clever hacking. There are also exciting developments in fundamentally different approaches to the problem, like the Winograd algorithm. The good news is that if you’re an old-school assembler hacker there’s still an important place for you in the brave new world of deep learning, so I hope we can pull you in!

Rewriting TensorFlow Graphs with the GTT


Photo by Stephen D. Strowes

One of the most interesting things about neural networks for me is that they’re programs you can do meaningful computation on. The most obvious example of that is automatic differentiation, but even after you’ve trained a model there are lots of other interesting transformations you can apply. These can be as simple as trimming parts of the graph that aren’t needed if you’re just running inference, all the way to folding batch normalization nodes into precalculated weights, turning constant subexpressions into single nodes, or rewriting calculations in eight bit.

Many of these operations have been available as piecemeal Python scripts inside the TensorFlow codebase, but I’ve spent some time rewriting them into what I hope is a much cleaner and easier to extend C++ Graph Transform Tool. As well as a set of predefined operations based on what we commonly need ourselves, I’ve tried to create a simple set of matching operators and other utilities to encourage contributors to create and share their own rewriting passes.
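
If you do want to write your own pass, the basic pattern (as described in the tool’s README, so treat this sketch as approximate) is a C++ function that takes an input GraphDef and fills in an output one, registered under a name you can then use from the transform_graph command line. The function and transform names here are just mine; this nearly-do-nothing example copies the graph while stripping one attribute from every node, to show the shape of the API rather than anything useful.

#include "tensorflow/core/framework/graph.pb.h"
#include "tensorflow/tools/graph_transforms/transform_utils.h"

namespace tensorflow {
namespace graph_transforms {

// Copies the graph, dropping the "_output_shapes" attribute from every
// node along the way.
Status StripOutputShapes(const GraphDef& input_graph_def,
                         const TransformFuncContext& context,
                         GraphDef* output_graph_def) {
  output_graph_def->Clear();
  for (const NodeDef& node : input_graph_def.node()) {
    NodeDef* new_node = output_graph_def->add_node();
    *new_node = node;
    new_node->mutable_attr()->erase("_output_shapes");
  }
  return Status::OK();
}

REGISTER_GRAPH_TRANSFORM("strip_output_shapes", StripOutputShapes);

}  // namespace graph_transforms
}  // namespace tensorflow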

I think there’s a lot of potential for computing on compute graphs, so I’m excited to hear what you can come up with! Do cc me (@petewarden) on github too with any issues you encounter.


AI and Unreliable Electronics (*batteries not included)


Picture by Torley

A few months ago I returned to my home town of Cambridge to attend the first ARM Research Summit. What was special about this conference was that it focused on introducing external researchers to each other, rather than pushing ARM’s own agenda. They had invited a broad range of people they worked with, from academic researchers to driver engineers, and all we had in common was that we spent a lot of time working on the ARM platform. This turned out fantastically for me at least, because it meant I had the chance to learn from experts in fields I knew nothing about. As such, it left my mind spinning a little, and so this post is a bit unusual! I’m trying to clarify gut feelings about the future with some actual evidence, so please bear with me as I work through my reasoning.

One of my favorite talks was on energy harvesting by James Myers. This table leapt out at me (apologies to James if I copied any of his figures incorrectly):

Energy harvesting rules of thumb:

  • Human vibration – 4µW/cm2
  • Industrial vibration – 100µW/cm2
  • Human temperature difference – 25µW/cm2
  • Industrial temperature difference – 1 to 10 mW/cm2
  • Indoor light – 10µW/cm2
  • Outdoor light – 10mW/cm2
  • GSM RF – 0.1µW/cm2
  • Wifi RF – 0.001µW/cm2

What this means in plain English is that you can expect to harvest four microwatts (millionths of a watt, or µW) for every square centimeter of a device relying on human vibration. A solar panel in sunlight could gather ten milliwatts (thousandths of a watt, or mW) for every square centimeter. If you think about an old incandescent bulb, that burns forty watts, and even a modern cell phone probably uses a watt or so when it’s being actively used, so the power you can get from energy harvesting is clearly not enough for most applications. My previous post on smartphone energy consumption shows that even running an accelerometer takes over twenty milliwatts, so it’s clearly hard to build devices that rely on these levels of power.

Why does that matter? I’m convinced that smart sensors are going to be massively important in the future, and that vision can’t work if they require batteries. I believe that we’ll be throwing tiny cheap devices up in the air like confetti to scatter around all the environments we care about, and they’ll result in a world we can intelligently interact with in unprecedented ways. Imagine knowing exactly where pests are in a crop field so that a robot can manually remove them rather than indiscriminately spraying pesticides, or having stickers on every piece of machinery in a factory that listen to the sounds and report when something needs maintenance.

These sort of applications will only work if the devices can last for years unattended. We can already build tiny chips that do these sort of things, but we can’t build batteries that can power them for anywhere near that long, and that’s unlikely to change soon.

Can the cloud come to our rescue? I’m a software guy, but everything I see in the hardware world shows that transmitting signals continuously takes a lot of energy. Even with a protocol like BLE, sending data just a foot draws more than 10 milliwatts. There seems to be an enduring relationship between power usage and the distance you’re sending the data: register access is cheaper than SRAM, which is far cheaper than DRAM, which in turn beats radio transmission.

That’s why I believe our only hope for long-lived smart sensors is driving down the energy used by local compute to the point at which harvesting gives enough power to run useful applications. The good news is that existing hardware like DSPs can perform a multiply-add for just low double-digit picojoules, and can access local SRAM to avoid the costs of DRAM. If you do the back of the envelope calculations, a small image network like Inception V1 takes about 1.5 billion multiply-adds, so 20 picojoules * 1.5 billion gives a rough energy cost of 30 millijoules per prediction (or 30 milliwatts at 1 prediction per second). This is already an order of magnitude less energy than the equivalent work done on a general-purpose CPU, so it’s a good proof that it’s possible to dramatically reduce computational costs, even though it’s still too high for energy harvesting to work.

That’s where another recurrent theme of the ARM research conference started to seem very relevant. I hadn’t realized how much of a problem it is to keep results reliable as components continue to shrink. Increasingly large parts of the design are devoted to avoiding problems like Rowhammer, where accesses to adjacent DRAM rows can flip bits, as Onur Mutlu explained. It’s not just memory that faces problems like these; CPUs also need to be over-engineered to avoid errors introduced by current leakage and weirder quantum-level effects.

I was actually very excited when I learned this, because one of the great properties of neural networks is that they’re very resilient in the face of random noise. If we’re going to be leaving an increasing amount of performance on the table to preserve absolute reliability for traditional computing applications, that opens the door for specialized hardware without those guarantees that will be able to offer increasingly better energy consumption. Again, I’m a software engineer so I don’t know exactly what kinds of designs are possible, but I’m hoping that by relaxing constraints on the hardware the chip creators will be able to come up with order-of-magnitude improvements, based on what I heard at the conference.

If we can drive computational energy costs down into the femtojoules per multiply-add, then the world of ambient sensors will explode. As I was writing this, I ran across a new startup that’s using deep learning and microphones to predict problems with machinery. Just imagine when those, along with seismic, fire, and all sorts of other sensors, are scattered everywhere, too simple to record data but smart enough to alert people when special conditions occur. I can’t wait to see how this process unfolds, but I’m betting unreliable electronics will be a key factor in making it possible.