The Machine Learning Reproducibility Crisis

Gosper Glider Gun

I was recently chatting to a friend whose startup’s machine learning models were so disorganized it was causing serious problems as his team tried to build on each other’s work and share it with clients. Even the original author sometimes couldn’t train the same model and get similar results! He was hoping that I had a solution I could recommend, but I had to admit that I struggle with the same problems in my own work. It’s hard to explain to people who haven’t worked with machine learning, but we’re still back in the dark ages when it comes to tracking changes and rebuilding models from scratch. It’s so bad it sometimes feels like stepping back in time to when we coded without source control.

When I started out programming professionally in the mid-90’s, the standard for keeping track of and collaborating on source code was Microsoft’s Visual SourceSafe. To give you a flavor of the experience, it didn’t have atomic check-ins, so multiple people couldn’t work on the same file, the network copy required nightly scans to avoid mysterious corruption, and even that was no guarantee the database would be intact in the morning. I felt lucky though: one of the places I interviewed at just had a wall of post-it notes, one for each file in the tree, and coders would take them down when they were modifying files, and return them when they were done!

This is all to say, I’m no shrinking violet when it comes to version control. I’ve toughed my way through some terrible systems, and I can still monkey together a solution using rsync and chicken wire if I have to. Even with all that behind me, I can say with my hand on my heart, that machine learning is by far the worst environment I’ve ever found for collaborating and keeping track of changes.

To explain why, here’s a typical life cycle of a machine learning model:

  • A researcher decides to try a new image classification architecture.
  • She copies and pastes some code from a previous project to handle the input of the dataset she’s using.
  • This dataset lives in one of her folders on the network. It’s probably one of the ImageNet downloads, but it isn’t clear which one. At some point, someone may have removed some of the images that aren’t actually JPEGs, or made other minor modifications, but there’s no history of that.
  • She tries out a lot of slightly different ideas, fixing bugs and tweaking the algorithms. These changes are happening on her local machine, and she may just do a mass file copy of the source code to her GPU cluster when she wants to kick off a full training run.
  • She executes a lot of different training runs, often changing the code on her local machine while jobs are in progress, since they take days or weeks to complete.
  • There might be a bug towards the end of the run on a large cluster that means she modifies the code in one file and copies that to all the machines, before resuming the job.
  • She may take the partially-trained weights from one run, and use them as the starting point for a new run with different code.
  • She keeps around the model weights and evaluation scores for all her runs, and picks which weights to release as the final model once she’s out of time to run more experiments. These weights can be from any of the runs, and may have been produced by very different code than what she currently has on her development machine.
  • She probably checks in her final code to source control, but in a personal folder.
  • She publishes her results, with code and the trained weights.

This is an optimistic scenario with a conscientious researcher, but you can already see how hard it would be for somebody else to come in and reproduce all of these steps and come out with the same result. Every one of these bullet points is an opportunity for inconsistencies to creep in. To make things even more confusing, ML frameworks trade off exact numeric determinism for performance, so if by a miracle somebody did manage to copy the steps exactly, there would still be tiny differences in the end results!
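The numeric differences come down to floating-point addition not being associative: if a parallel reduction adds the same values in a different order from run to run, the rounding can differ. Here's a minimal Python sketch of the effect. It's purely illustrative and doesn't use any ML framework.

```python
# Floating-point addition is not associative, so the order in which a
# parallel reduction adds values can change the result slightly.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # a tiny, nonzero difference
```

Across millions of weight updates, differences this small can compound into visibly different final weights.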

In many real-world cases, the researcher won’t have made notes or remember exactly what she did, so even she won’t be able to reproduce the model. Even if she can, the frameworks the model code depends on can change over time, sometimes radically, so she’d need to also snapshot the whole system she was using to ensure that things work. I’ve found ML researchers to be incredibly generous with their time when I’ve contacted them for help reproducing model results, but it’s often a months-long task even with assistance from the original author.

Why does this all matter? I’ve had several friends contact me about their struggles reproducing published models as baselines for their own papers. If they can’t get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It’s also clearly concerning to rely on models in production systems if you don’t have a way of rebuilding them to cope with changed requirements or platforms. At that point your model moves from being a high-interest credit card of technical debt to something more like what a loan-shark offers. It’s also stifling for research experimentation; since making changes to code or training data can be hard to roll back it’s a lot more risky to try different variations, just like coding without source control raises the cost of experimenting with changes.

It’s not all doom and gloom; there are some notable efforts around reproducibility happening in the community. One of my favorites is the TensorFlow Benchmarks project Toby Boyd’s leading. He’s made it his team’s mission not only to lay out exactly how to train some of the leading models from scratch with high training speed on a lot of different platforms, but also to ensure that the models train to the expected accuracy. I’ve seen him sweat blood trying to get models up to that precision, since variations in any of the steps I listed above can affect the results and there’s no easy way to debug what the underlying cause is, even with help from the authors. It’s also a never-ending job, since changes in TensorFlow, in GPU drivers, or even datasets, can all hurt accuracy in subtle ways. By doing this work, Toby’s team helps us spot and fix bugs caused by changes in TensorFlow in the models they cover, and chase down issues caused by external dependencies, but it’s hard to scale beyond a comparatively small set of platforms and models.

I also know of other teams who are serious about using models in production who put similar amounts of time and effort into ensuring their training can be reproduced, but the problem is that it’s still a very manual process. There’s no equivalent to source control or even agreed best-practices about how to archive a training process so that it can be successfully re-run in the future. I don’t have a solution in mind either, but to start the discussion here are some principles I think any approach would need to follow to be successful:

  • Researchers must be able to easily hack around with new ideas, without paying a large “process tax”. If this isn’t true, they simply won’t use it. Ideally, the system will actually boost their productivity.
  • If a researcher gets hit by a bus (or, more happily, founds their own startup), somebody else should be able to step in the next day and train all the models they have created so far, and get the same results.
  • There should be some way of packaging up just what you need to train one particular model, in a way that can be shared publicly without revealing any history the author doesn’t wish to.
  • To reproduce results, code, training data, and the overall platform need to be recorded accurately.
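As one small illustration of the last principle, here is a hedged Python sketch of a "run manifest" that records content hashes for code and data files along with basic platform details. The function names and the manifest layout are hypothetical, invented for this example. It isn't an existing tool, and a real solution would need to capture far more (framework versions, drivers, random seeds and so on).

```python
import hashlib
import json
import platform
import sys


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(code_paths, data_paths):
    """Record hashes of code and data files plus basic platform details."""
    return {
        "code": {p: sha256_of_file(p) for p in sorted(code_paths)},
        "data": {p: sha256_of_file(p) for p in sorted(data_paths)},
        "platform": {
            "python": sys.version,
            "machine": platform.machine(),
            "system": platform.platform(),
        },
    }


def save_manifest(manifest, path):
    """Write the manifest as stable, human-readable JSON."""
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
```

Saving a manifest like this next to each set of weights would at least make it possible to tell later whether the code or dataset on disk still matches what a given run actually used.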

I’ve been seeing some interesting stirrings in the open source and startup world around solutions to these challenges, and personally I can’t wait to spend less of my time dealing with all the related issues, but I’m not expecting to see a complete fix in the short term. Whatever we come up with will require a change in the way we all work with models, in the same way that source control meant a big change in all of our personal coding processes. It will be as much about getting consensus on the best practices and educating the community as it will be about the tools we come up with. I can’t wait to see what emerges!

Why Low-Power NN Accelerators Matter


When I released the Speech Commands dataset and code last year, I was hoping they would give a boost to teams building low-energy-usage hardware by providing a realistic application benchmark. It’s been great to see Vikas Chandra of ARM using them to build keyword spotting examples for Cortex M-series chips, and now a hardware startup I’ve been following, Green Waves, have just announced a new device and shared some numbers using the dataset as a benchmark. They’re showing power usage numbers of just a few milliwatts for an always-on keyword spotter, which is starting to approach the coin-battery-for-a-year target I think will open up a whole new world of uses.
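To put the coin-battery-for-a-year target into numbers, here's a rough back-of-the-envelope calculation in Python. The capacity and voltage are assumed typical values for a CR2032 coin cell, not figures from any of the vendors mentioned, and real capacity varies with load and temperature.

```python
# Assumed figures for a CR2032 coin cell; real capacity varies with load.
capacity_mah = 225.0
voltage_v = 3.0
hours_per_year = 365 * 24

energy_mwh = capacity_mah * voltage_v    # 675 mWh stored in the cell
budget_mw = energy_mwh / hours_per_year  # average draw that lasts one year

print(round(budget_mw * 1000))  # about 77 microwatts
```

Under those assumptions the average budget is well under a tenth of a milliwatt, so a few milliwatts is within a couple of orders of magnitude of the goal.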

I’m not just excited about this for speech recognition’s sake, but because the same hardware can also accelerate vision, and other advanced sensor processing, turning noisy signals into something actionable. I’m also fascinated by the idea that we might be able to build tiny robots with the intelligence of insects if we can get the energy usage and mass small enough, or even send smart nano-probes to nearby stars!

Neural networks offer a whole new way of programming that’s inherently a lot easier to scale down than conventional instruction-driven approaches. You can transform and convert network models in ways we’ve barely begun to explore, fitting them to hardware with few resources while preserving performance. Chips can also take a lot of shortcuts that aren’t possible with traditional code, like tolerating calculation errors, and they don’t have to worry about awkward constructs like branches, everything is straight-line math at its heart.

I’ve put in my preorder for a GAP8 developer kit, to join the ARM-based prototyping devices on my desk, and I’m excited to see so much activity in this area. I think we’re going to see a lot of progress over the next couple of years, and I can’t wait to see what new applications emerge as hardware capabilities keep improving!

Blue Pill: A 72MHz 32-Bit Computer for $2!


Some people love tiny houses, but I’m fascinated by tiny computers. My house is littered with Raspberry Pi’s, but recently my friend Andy Selle introduced me to Blue Pill single-board computers. These are ARM M3 CPUs running at 72MHz, available for $2 or less on Ebay and Aliexpress, even when priced individually. These are complete computers with 20KB of RAM and 64KB of Flash for programs, and while that may not sound like much memory, their computing power as 32-bit ARM CPUs running at a fast clock-rate makes them very attractive for applications like machine learning that rely more on arithmetic than memory. Even better, they can run for weeks or months on a single battery thanks to their ultra-low energy usage.

This makes them interesting platforms to explore the emerging world of smart sensors; they may not quite be fifty cents each, but they’re in the same ballpark. Unfortunately I’m a complete novice when it comes to microcontrollers, but luckily Andy was able to give me a few pointers to help me get started. After I struggled through a few hurdles, I managed to get a workflow laid out that I like, and ran some basic examples. To leave a trail of breadcrumbs for anyone else who’s fascinated by the possibilities of these devices, I’ve open-sourced stm32_bare_lib on GitHub. It includes step by step instructions designed for a newbie like me, especially on the wiring (which doesn’t require soldering or any special tools, thankfully), and has some examples written in plain C to play with. I hope you have as much fun playing with these tiny computers as I have!

How to Compile for ARM M-series Chips from the Command Line



When I was first learning programming in the 90’s, embedded systems were completely out of reach. I couldn’t afford the commercial development boards and toolchains that I’d need to even get started, and I’d have needed a lot of proprietary knowledge on top of that. I was excited a few years ago when the Arduino environment first appeared, since it removed a lot of the barriers to the general public. I still didn’t dive in though, because I couldn’t see an easy way to port the kind of algorithms I was interested in to eight-bit hardware, and even the coding environment wasn’t a good fit for my C++ background.

That’s why I’ve been fascinated by the rise of the ARM M-series chips. They’re cheap, going for as little as $2 for a “blue pill” M3 board on ebay, they can run with very low power usage which can be less than a milliwatt (offering the chance to run on batteries for months or years), and they have the full 32-bit ARM instruction set I’m familiar with. This makes them tempting as a platform to prototype the kind of smart sensors I believe the combination of deep learning eating software and cheap, low-power compute is going to enable.

I was still daunted by the idea of developing for them though. Raspberry Pi’s are very approachable because you can use a very familiar Linux environment to program them, but there wasn’t anything as obvious for me to use in the M-series world. Happily I was able to get advice from some experts at ARM, who helped steer me as a newbie through this unfamiliar world. I was very pleasantly surprised by the maturity and ease of use of the development ecosystem, so I want to share what I learned for anyone else who’s interested.

The first tip they had was to check out STMicroelectronics’ “Discovery” boards. These are small circuit boards with an M-series CPU and often a lot of peripherals built in to make experimentation easy. I started with the 32F746G which included a touch screen, audio input and output, microphones, and even ethernet, and cost about $70. There are cheaper versions available too, but I wanted something easy to demo with. I also chose the M7 chip because it has support for floating point calculations; even though I don’t expect I’ll need that long term, it’s helpful to have when porting and prototyping.

The unboxing experience was great, I just plugged the board into a USB socket on my MacBook Pro and it powered itself up into some demonstration programs. It showed up on my MacOS file system as a removable drive, and the usefulness of that quickly became clear when I went to the mbed online IDE. This is one of the neatest developer environments I’ve run across, it runs completely in the browser and makes it easy to clone and modify examples. You can pick your device, grab a project, press the compile button and in a few seconds you’ll have a “.bin” file downloaded. Just drag and drop that from your downloads folder into the USB device in the Finder and the board will reboot and run the program you’ve just built.

I liked this approach a lot as a way to get started, but I wasn’t sure how to integrate larger projects into the IDE. I thought I’d have to do some kind of online import, and then keep copies of my web code in sync with more traditional github and file system versions. When I was checking out the awesome keyword spotting code from ARM research I saw they were using a tool called “mbed-cli“, which sounded a lot easier to integrate into my workflow. When I’m doing a lot of cross platform work, I usually find it easier to use Emacs as my editor, and a custom IDE can actually get in the way. As it turns out, mbed-cli offers a command line experience while still keeping a lot of the usability advantages I’d discovered in the web IDE. Adding libraries and aiming at devices is easy, but it integrates smoothly with my local file system and github. Here’s what I did to get started with it:

  • I used pip install mbed-cli on my MacOS machine to add the Python tools to my system.
  • I ran mbed new mbed-hello-world to create a new project folder, which the mbed tools populated with all the baseline files I needed, and then I cd-ed into it with cd mbed-hello-world
  • I decided to use gcc for consistency with other platforms, so I downloaded the GCC v7 toolchain from ARM, and set a global variable to point to it by running mbed config -G GCC_ARM_PATH "/Users/petewarden/projects/arm-compilers/gcc-arm-none-eabi-7-2017-q4-major/bin"
  • I then added a ‘main.cpp’ file to my empty project, by writing out this code:
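#include <mbed.h>

// Serial link to the host over the board's USB connection.
Serial pc(USBTX, USBRX);

int main(int argc, char** argv) {
  // "\r\n" returns the cursor to the start of the next line.
  pc.printf("Hello world!\r\n");
  return 0;
}
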
The main thing to notice here is that we don’t have a natural place to view stdout results from a normal printf on an embedded system, so what I’m doing here is creating an object that can send text over a USB port using an API exported from the main mbed framework, and then doing a printf call on that object. In a later step we’ll set up something on the laptop side to display that. You can see the full mbed documentation on using printf for debugging here.

  • I did git add main.cpp and git commit -m "Added main.cpp" to make sure the file was part of the project.
  • I created a new terminal window to look at the output of the printf after it’s been sent over the USB connection. How to view this varies for different platforms, but for MacOS you need to enter the command screen /dev/tty.usbm, and press Tab to autocomplete the correct device name. After that, the terminal may contain some random text, but after you’ve successfully compiled and run the program you should see “Hello world!” output. One quirk I noticed was that \n on its own was not enough to cause the normal behavior I expected from a new line, in that it just moved down one line but the next output started at the same horizontal position. That’s why I added a \r carriage return character in the example above.
  • I ran mbed compile -m auto -t GCC_ARM -f to build and flash the resulting ‘.bin’ file onto the Discovery board I had plugged in to the USB port. The -m auto part made the compile process auto-discover what device it should be targeting, based on what was plugged in, and -f triggered the transfer of the program after a successful build.
  • If it built and ran correctly, you should see “Hello world!” in the terminal window where you’re running your screen command.

I’ve added my version of this project on GitHub, in case you want to compare with what you get following these steps. I found the README for the mbed-cli project extremely clear though, and it’s what most of this post is based on.

My end goal is to set up a simple way to compile a code base that’s shared with other platforms for M-series devices. I’m not quite certain how to do that (for example integrating with a makefile, or otherwise syncing a list of files between an mbed project and other build systems), but so far mbed has been such a pleasant experience I’m hopeful I’ll be able to figure that out soon!

How many images do you need to train a neural network?


Photo by Glenn Scott

Today I got an email with a question I’ve heard many times – “How many images do I need to train my classifier?“. In the early days I would reply with the technically most correct, but also useless answer of “it depends”, but over the last couple of years I’ve realized that just having a very approximate rule of thumb is useful, so here it is for posterity:

You need 1,000 representative images for each class.

Like all models, this rule is wrong but sometimes useful. In the rest of this post I’ll cover where it came from, why it’s wrong, and what it’s still good for.

The 1,000-image magic number comes from the original ImageNet classification challenge, where the dataset had 1,000 categories, each with a bit less than 1,000 images (most I looked at had around seven or eight hundred). This was good enough to train the early generations of image classifiers like AlexNet, which suggests that around 1,000 images per class is enough.

Can you get away with less though? Anecdotally, based on my experience, you can in some cases but once you get into the low hundreds it seems to get trickier to train a model from scratch. The biggest exception is when you’re using transfer learning on an already-trained model. Because you’re using a network that has already seen a lot of images and learned to distinguish between the classes, you can usually teach it new classes in the same domain with as few as ten or twenty examples.

What does “in the same domain” mean? It’s a lot easier to teach a network that’s been trained on photos of real world objects (like Imagenet) to recognize other objects, but taking that same network and asking it to categorize completely different types of images like x-rays, faces, or satellite photos is likely to be less successful, and at least require a lot more training images.

Another key point is that “representative” modifier in my rule of thumb. That’s there because the quality of the images is important, not just the quantity. What’s crucial is that the training images are as close as possible to the inputs that the model will see when it’s deployed. When I first tried to run a model trained with ImageNet on a robot I didn’t see great results, and it turned out that was because the robot’s camera had a lot of fisheye distortion, and the objects weren’t well-framed in the viewfinder. ImageNet consists of photos taken from the web, so they’re usually well-framed and without much distortion. Once I retrained my network with images that were taken by the robot itself the results got a lot better. The same applies to almost any application: a smaller number of training images taken in the same environment the model will run in will produce better end results than a larger number of less representative images.

Andreas just reminded me that augmentations are important too. You can augment the training data by randomly cropping, rotating, brightening, or warping the original images. TensorFlow for Poets controls this with command line flags like ‘flip_left_right‘ and ‘random_scale‘. This effectively increases the size of your training set, and is standard for most ImageNet-style training pipelines. It can be very useful for helping out transfer learning on smaller sets of images as well though. In my experience, distorted copies are not worth quite as much as new original images when it comes to overall accuracy, but if you only have a few images it’s a great way to boost the results and will reduce the overall number of images you need.
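To make the idea concrete, here's a minimal sketch of two augmentations, a left-right flip and a random brightness change, written in plain Python on images stored as lists of rows. The function names are hypothetical and this is not how TensorFlow for Poets implements its flags. It's only meant to show how one original image can yield several slightly different training examples.

```python
import random


def flip_left_right(image):
    """Mirror an image stored as a list of rows of pixel values."""
    return [list(reversed(row)) for row in image]


def scale_brightness(image, factor):
    """Multiply every pixel by factor, clamping to the 0-255 range."""
    return [[min(255, max(0, int(round(p * factor)))) for p in row]
            for row in image]


def augment(image, rng):
    """Return a randomly distorted copy of image."""
    if rng.random() < 0.5:
        out = flip_left_right(image)
    else:
        out = [list(row) for row in image]
    return scale_brightness(out, rng.uniform(0.8, 1.2))


original = [[10, 20, 30],
            [40, 50, 60]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
copies = [augment(original, rng) for _ in range(4)]
```

Each entry in `copies` has the same shape and label as the original, but different pixel values, which is what lets a small dataset stretch further.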

The real answer is to try for yourself, so if you have fewer images than the rule suggests don’t let it stop you, but I hope this rule of thumb will give you a good starting point for planning your approach at least.

Deep Learning is Eating Software


Photo by John Watson

When I had a drink with Andrej Karpathy a couple of weeks ago, we got to talking about where we thought machine learning was going over the next few years. Andrej threw out the phrase “Software 2.0”, and I was instantly jealous because it captured the process I see happening every day across hundreds of projects. I held my tongue until he got his blog post out there, but now I want to expand my thoughts on this too.

The pattern is that there’s an existing software project doing data processing using explicit programming logic, and the team charged with maintaining it find they can replace it with a deep-learning-based solution. I can only point to examples within Alphabet that we’ve made public, like upgrading search ranking, data center energy usage, language translation, and solving Go, but these aren’t rare exceptions internally. What I see is that almost any data processing system with non-trivial logic can be improved significantly by applying modern machine learning.

This might sound less than dramatic when put in those terms, but it’s a radical change in how we build software. Instead of writing and maintaining intricate, layered tangles of logic, the developer has to become a teacher, a curator of training data and an analyst of results. This is very, very different than the programming I was taught in school, but what gets me most excited is that it should be far more accessible than traditional coding, once the tooling catches up.

The essence of the process is providing a lot of examples of inputs, and what you expect for the outputs. This doesn’t require the same technical skills as traditional programming, but it does need a deep knowledge of the problem domain. That means motivated users of the software will be able to play much more of a direct role in building it than has ever been possible. In essence, the users are writing their own user stories and feeding them into the machinery to build what they want.

Andrej focuses on areas like audio and speech recognition in his post, but I’m actually arguing that there will be an impact across many more domains. The classic “Machine Learning: The High-Interest Credit Card of Technical Debt” identifies a very common pattern where machine learning systems become embedded in deep stacks of software. What I’m seeing is that the problem is increasingly solved by replacing the whole stack with a deep learning model! Taking the analogy to breaking point, this is like consolidating all your debts into a single loan with lower payments. A single model is far easier to improve than a set of deeply interconnected modules, and the maintenance becomes far easier. For many large systems there’s no one person who can claim to understand what they’re actually doing anyway, so there’s no real loss in debuggability or control.

I know this will all sound like more deep learning hype, and if I wasn’t in the position of seeing the process happening every day I’d find it hard to swallow too, but this is real. Bill Gates is supposed to have said “Most people overestimate what they can do in one year and underestimate what they can do in ten years“, and this is how I feel about the replacement of traditional software with deep learning. There will be a long ramp-up as knowledge diffuses through the developer community, but in ten years I predict most software jobs won’t involve programming. As Andrej memorably puts it, “[deep learning] is better than you”!

How do CNNs Deal with Position Differences?

An engineer who’s learning about using convolutional neural networks for image classification just asked me an interesting question; how does a model know how to recognize objects in different positions in an image? Since this actually requires quite a lot of explanation, I decided to write up my notes here in case they help some other people too.

Here are two example images showing the problem that my friend was referring to:

CNN Position 0

If you’re trying to recognize all images with the sun shape in them, how do you make sure that the model works even if the sun can be at any position in the image? It’s an interesting problem because there are really three stages of enlightenment in how you perceive it:

  • If you haven’t tried to program computers, it looks simple to solve because our eyes and brain have no problem dealing with the differences in positioning.
  • If you have tried to solve similar problems with traditional programming, your heart will probably sink because you’ll know both how hard dealing with input differences will be, and how tough it can be to explain to your clients why it’s so tricky.
  • As a certified Deep Learning Guru, you’ll sagely stroke your beard and smile, safe in the knowledge that your networks will take such trivial issues in their stride.

My friend is at the third stage of enlightenment, but is smart enough to realize that there are few accessible explanations of why CNNs cope so well. I don’t claim to have any novel insights myself, but over the last few years of working with image models I have picked up some ideas from experience, and heard folklore passed down through the academic family tree, so I want to share what I know. I would welcome links to good papers on this, since I’m basing a lot of this on hand-wavey engineering intuition, so please do help me improve the explanation!

The starting point for understanding this problem is that networks aren’t naturally immune to positioning issues. I first ran across this when I took networks trained on the ImageNet collection of photos, and ran them on phones. The history of ImageNet itself is fascinating. Originally, Google Image Search was used to find candidate images from the public web by searching for each class name, and then researchers went through the candidates to weed out any that were incorrect. My friend Tom White has been having fun digging through the resulting data for anomalies, and found some fascinating oddities like a large number of female models showing up in the garbage truck category! You should also check out Andrej Karpathy’s account of trying to label ImageNet pictures by hand to understand more about its characteristics.

The point for our purposes is that all the images in the training set are taken by people and published on websites that rank well in web searches. That means they tend to be more professional than a random snapshot, and in particular they usually have the subject of the image well-framed, near the center, taken from a horizontal angle, and taking up a lot of the picture. By contrast, somebody pointing a phone’s live camera at an object to try out a classifier is more likely to be at an odd angle, maybe from above, and may only have part of the object in frame. This meant that models trained on ImageNet had much worse perceived performance when running on phones than the published accuracy statistics would suggest, because the training data was so different than what they were given by users. You can still see this for yourself if you install the TensorFlow Classify application on Android. It isn’t bad enough to make the model useless on mobile phones, since there’s still usually some framing by users, but it’s a much more serious problem on robots and similar devices. Since their camera positioning is completely arbitrary, ImageNet-trained models will often struggle seriously. I usually recommend developers of those applications look out for their own training sets captured on similar devices, since there are also often other differences like fisheye lenses.

Even so, there is still a lot of variance in positioning within ImageNet, so how do networks cope so well? Part of the secret is that training often includes adding artificial offsets to the inputs, so that the network has to learn to cope with these differences.

CNN Position 1

Before each image is fed into the network, it can be randomly cropped. Because all inputs are squashed to a standard size (often around 200×200 or 300×300), this has the effect of randomizing the positioning and scale of objects within each picture, as well as potentially cutting off sections of them. The network is still punished or rewarded for its answers, so to get good performance it has to be able to guess correctly despite these differences. This explains why networks learn to cope with positioning changes, but not how.
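Here's a small sketch of the random-cropping step in plain Python, to show how it moves an object around within the input. The helper names are hypothetical, and the resize back to a standard size is left out to keep it short.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window from a random position in image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def find_object(image):
    """Return (row, col) of the first nonzero pixel, or None."""
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if value:
                return (r, c)
    return None


# A 6x6 blank image with a single bright "object" pixel at row 3, column 3.
image = [[0] * 6 for _ in range(6)]
image[3][3] = 1

rng = random.Random(42)
for _ in range(3):
    crop = random_crop(image, 4, 4, rng)
    # The object's position in the crop depends on where the window landed.
    print(find_object(crop))
```

The label stays the same for every crop, so the network is pushed towards features that don't depend on exactly where the object sits.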

To delve into that, I have to dip into a bit of folklore and analogy. I don’t have research to back up what I’m going to offer as an explanation, but in my experiments and discussions with other practitioners, it seems pretty well accepted as a working theory.

Ever since the seminal AlexNet, CNNs have been organized as consecutive layers feeding data through to a final classification operation. We think about the initial layers as being edge detectors, looking for very basic pixel patterns, and then each subsequent layer takes those as inputs and guesses higher and higher level concepts as you go deeper. You can see this most easily if you view the filters for the first layer of a typical network:


Image by Evan Shelhamer from Caffenet

What this shows are the small patterns that each filter is looking for. Some of them are edges in different orientations, others are colors or corners. Unfortunately we can’t visualize later layers nearly as simply, though Jason Yosinski and others have some great resources if you do want to explore that topic more.

Here’s a diagram to try to explain the concepts involved:

CNN Position 2

What it’s trying to show is that the first layer is looking for very simple pixel patterns in the image, like horizontal edges, corners, or patches of solid color. These are similar to the filters shown in the CaffeNet illustration just above. As these are run across the input image, they output a heat map highlighting where each pattern matches.

The tricky bit to understand is what happens in the second layer. The heatmap for each simple filter in the first layer is put into a separate channel in the activation layer, so the input to the second layer typically has over a hundred channels, unlike the three or four in a typical image. What the second layer is looking for is more complex patterns in these heatmaps combined together. In the diagram we’re trying to recognize one petal of the sun. We know that this has a sharp corner on one end, and nearby will be a vertical line, and the center will be filled with yellow. Each one of these individual characteristics is represented by one channel in the input activation layer, and the second layer’s filter for “petal facing left” looks for parts of the images where all three occur together. In areas of the image where only one or two are present, nothing is output, but where all three are there the output of the second layer will show high activation.
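Here’s a toy Python sketch of that “all three together” behavior. The heatmaps, weights, and bias are hand-picked assumptions purely for illustration (a real network learns them), and for simplicity the filter only looks at a single position across channels, whereas real second-layer filters usually cover a small neighborhood too.

```python
def relu(value):
    """Clamp negative values to zero, as a typical activation function does."""
    return max(0.0, value)


def combine_channels(heatmaps, weights, bias):
    """Apply a 1x1 filter across channels, followed by a ReLU."""
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            total = sum(w * channel[y][x] for w, channel in zip(weights, heatmaps))
            row.append(relu(total + bias))
        output.append(row)
    return output


# Hypothetical first-layer heatmaps: 1 where the pattern matched, 0 elsewhere.
corner = [[1, 1, 0], [0, 0, 1]]
vertical_line = [[1, 0, 0], [0, 1, 1]]
yellow = [[1, 1, 1], [0, 0, 0]]

# With a weight of 1 per channel and a bias of -2, the output is only
# positive where all three input channels are active at once.
petal_facing_left = combine_channels(
    [corner, vertical_line, yellow], weights=[1.0, 1.0, 1.0], bias=-2.0)
```

Only the top-left cell has all three characteristics, so `petal_facing_left` is `[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]`.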

Just like with the first layer, there are many filters in the second layer, and you can think of each one as representing a higher-level concept like “petal facing up”, “petal facing right”, and others. This is harder to visualize, but results in an activation layer with many channels, each representing one of those concepts.

As you go deeper into the network, the concepts get higher and higher level. For example, the third or fourth layer here might activate for yellow circles surrounded by petals, by combining the relevant input channels. From that representation it’s fairly easy to write a simple classifier that spots whenever a sun is present. Of course real-world classifiers don’t represent concepts nearly as cleanly as I’ve laid out above, since they learn how to break down the problem themselves rather than being supplied with human-friendly components, but the same basic ideas hold.

This doesn’t explain how the network deals with position differences though. To understand that, you need to know about another common design trait of CNNs for image classification. As you go deeper into a network, the number of channels will typically increase, but the size of the image will shrink. This shrinking is done using pooling layers, traditionally with average pooling but more commonly using maximum pooling these days. Either way, the effect is pretty similar.

Max Pooling

Here you can see that we take an image and shrink it in half. For each output pixel, we look at a 2×2 input patch and choose the maximum value, hence the name maximum pooling. For average pooling, we take the mean of the four values instead.
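In code, the two kinds of pooling look something like this Python sketch. The function names and the sample values are mine, and it assumes non-overlapping 2×2 patches, dropping any odd trailing row or column.

```python
def pool_2x2(image, reduce_fn):
    """Shrink a 2D image by half, reducing each 2x2 patch to a single value."""
    height, width = len(image), len(image[0])
    return [
        [reduce_fn([image[y][x], image[y][x + 1],
                    image[y + 1][x], image[y + 1][x + 1]])
         for x in range(0, width - 1, 2)]
        for y in range(0, height - 1, 2)
    ]


def max_pool(image):
    return pool_2x2(image, max)


def average_pool(image):
    return pool_2x2(image, lambda values: sum(values) / len(values))


image = [
    [1, 3, 2, 0],
    [5, 2, 1, 1],
    [0, 0, 4, 8],
    [2, 6, 3, 1],
]

pooled_max = max_pool(image)      # [[5, 2], [6, 8]]
pooled_avg = average_pool(image)  # [[2.75, 1.0], [2.0, 4.0]]
```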

This sort of pooling is applied repeatedly as values travel through the network. This means that by the end, the image size may have shrunk from 300×300 to 13×13. This shrinkage also means that the number of position variations that are possible has shrunk a lot. In terms of the example above, there are only 13 possible horizontal rows for a sun image to appear in, and only 13 vertical columns. Any smaller position differences are hidden because the activations will be merged into the same cell thanks to max pooling. This makes the problem of dealing with positional differences much more manageable for the final classifier, since it only has to deal with a much simpler representation than the original image.
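You can see the merging effect in a small Python experiment. This is only a sketch with a made-up 8×8 input and two rounds of 2×2 max pooling, not a real network, but it shows how nearby positions collapse into the same output cell.

```python
def max_pool_2x2(image):
    """Halve the image size, keeping the maximum of each 2x2 patch."""
    return [
        [max(image[y][x], image[y][x + 1], image[y + 1][x], image[y + 1][x + 1])
         for x in range(0, len(image[0]) - 1, 2)]
        for y in range(0, len(image) - 1, 2)
    ]


def single_activation(size, row, col):
    """Build a size x size image that is zero except for one active pixel."""
    image = [[0] * size for _ in range(size)]
    image[row][col] = 1
    return image


def pool_repeatedly(image, times):
    for _ in range(times):
        image = max_pool_2x2(image)
    return image


# Two activations a couple of pixels apart, plus one that is further away.
near_a = pool_repeatedly(single_activation(8, 1, 2), 2)
near_b = pool_repeatedly(single_activation(8, 3, 0), 2)
far = pool_repeatedly(single_activation(8, 5, 2), 2)
```

After two rounds of pooling, `near_a` and `near_b` are both `[[1, 0], [0, 0]]`, so the later layers can’t tell them apart, while `far` ends up as `[[0, 0], [1, 0]]`. Two neighboring pixels that happen to straddle a cell boundary will still land in different cells, so the invariance is only approximate.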

This is my explanation for how image classifiers typically handle position changes, but what about similar problems like offsets in audio? I’ve been intrigued by the recent rise of “dilated” or “atrous” convolutions that offer an alternative to pooling. Just like max pooling, these produce a smaller output image, but they do it within the context of the convolution itself. Rather than sampling adjacent input pixels, they look at samples separated by a fixed gap (often called the dilation rate), which can potentially be quite large. This lets them pull non-local information into a manageable form quite quickly, and it is part of the magic of DeepMind’s WaveNet paper, which tackles a time-based problem using convolution rather than recurrent neural networks.
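As a rough illustration, here’s a one-dimensional dilated convolution in plain Python. It’s a simplified sketch rather than WaveNet’s actual architecture: the function name, the all-ones two-tap kernel, and the choice to double the gap at each layer are assumptions I’ve made to keep the arithmetic easy to follow.

```python
def dilated_conv1d(signal, kernel, dilation):
    """1D convolution without padding, with `dilation` samples between taps."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(k * signal[start + i * dilation] for i, k in enumerate(kernel))
        for start in range(len(signal) - span)
    ]


signal = list(range(16))
kernel = [1, 1]

# Stack three layers, doubling the gap between taps each time.
layer = signal
for dilation in (1, 2, 4):
    layer = dilated_conv1d(layer, kernel, dilation)
```

Even though every filter only has two taps, each value in the final `layer` is the sum of eight consecutive input samples, so three cheap layers are enough to cover a window that would otherwise need a much wider filter. The output has also shrunk from 16 values to 9 along the way.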

I’m excited by this because RNNs are a pain to accelerate. If you’re dealing with a batch size of one, as is typical with real-time applications, then most of the compute is matrix-times-vector multiplications, the equivalent of fully-connected layers. Since every weight is only used once, the calculations are memory bound, rather than compute bound as convolutions typically are. Hence I have my fingers crossed that the convolutional approach becomes more common in other domains!
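A back-of-the-envelope way to see the difference is to count how many multiply-adds each weight takes part in once it has been loaded from memory. The Python helpers below are hypothetical and ignore caching and other hardware details, so treat the numbers as an illustration rather than a benchmark.

```python
def fully_connected_reuse(batch_size):
    """Multiply-adds per weight in a matrix multiply: one per batch item."""
    return batch_size


def convolution_reuse(output_height, output_width, batch_size):
    """Multiply-adds per weight in a convolution: one per output position
    for every batch item, since the same filter slides across the image."""
    return output_height * output_width * batch_size


rnn_reuse = fully_connected_reuse(batch_size=1)     # 1
conv_reuse = convolution_reuse(13, 13, batch_size=1)  # 169
```

With a batch size of one, each fully-connected weight does a single multiply-add per load, while a convolution weight producing a 13×13 output is reused 169 times, which is why the first case ends up waiting on memory.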

Anyway, thanks for making it this far. I hope the explanation is helpful, and I look forward to hearing ideas on improving it in the comments or on Twitter.

The Joy of an Indian Paradox

When I was growing up in England I never tasted garlic in my cooking, let alone any spice. Then I moved away to Manchester and found myself in a world of Indian food I’d never imagined! On a Saturday night I’d walk along the Curry Mile in Rusholme and find our group of students implored to enter the dozens of restaurants along the strip. The sheer joy of being able to devour succulent chicken on freshly baked naan has never left me, but I’ve also learned over the years how much more the subcontinent has to offer. Even as I’ve lived in suburbs like Simi Valley, I’ve always been able to find an Indian restaurant that taught me something new about the cuisine.

When I arrived in San Francisco, I have to confess I was a bit disappointed. Down near San Jose there were some amazing Indian experiences, but nothing I tried locally really hit the spot. That’s why I was so excited when Indian Paradox opened in my Divisadero neighborhood a couple of years ago. The owner Kavitha has a unique vision of pairing South Indian street food with the perfect wines in a combination I’ve never heard of anywhere else. She’s able to conjure up delicacies like Dabeli potato burgers and Kanda Batata Poha flattened rice, and pair them with delicious Zinfandels and Mosels to create something I’ve never been able to experience anywhere else.

I’ve been a frequent enough visitor to hear a little of Kavitha’s story, and her journey from Chennai to San Francisco, via Alabama. She’s driven by her love of the food, and when that’s combined with deep knowledge of wine, it gives an experience I don’t think you could find anywhere else in the world. I’ve never found great pairings with Indian food before. In Manchester the best I could hope for was a clean Kingfisher lager that wouldn’t clash with the spice, but somehow the right wines feel like the ingredient I’ve been missing in my Indian meals all these years.

Anyway, it’s a small local business that I love, so I wanted to share a little of my own enthusiasm with the world. If you’re ever in San Francisco and love food, I highly encourage you to make it along to Indian Paradox, and say hi from me!

Cross-compiling TensorFlow for the Raspberry Pi

Raspberries. Photo by oatsy40

I love the Raspberry Pi because it’s such a great platform for software to interact with the physical world. TensorFlow makes it possible to turn messy, chaotic sensor data from cameras and microphones into useful information, so running models on the Pi has enabled some fascinating applications, including predicting train times, sorting trash, helping robots see, and even avoiding traffic tickets!

It’s never been easy to get TensorFlow installed on a Pi though. I had created a makefile script that let you build the C++ part from scratch, but it took several hours to complete and didn’t support Python. Sam Abrahams, an external contributor, did an amazing job maintaining a Python pip wheel for major releases, but building it required you to add swap space on a USB device for your Pi, and took even longer to compile than the makefile approach. Snips managed to get TensorFlow cross-compiling for Rust, but it wasn’t clear how to apply this to other languages.

Plenty of people on the team are Pi enthusiasts, and happily Eugene Brevdo dived in to investigate how we could improve the situation. We knew we wanted to have something that could be run as part of TensorFlow’s Jenkins continuous integration system, which meant building a completely automatic solution that would run with no user intervention. Since having a Pi plugged into a machine to run something like the makefile build would be hard to maintain, we did try using a hosted server from Mythic Beasts. Eugene got the makefile build going after a few hiccups, but the Python version required more RAM than was available, and we couldn’t plug in a USB drive remotely!

Cross compiling, building on an x86 Linux machine but targeting the Pi, looked a lot more maintainable, but also more complex. Thankfully we had the Snips example to give us some pointers, a kindly stranger had provided a solution to a crash that blocked me last time I tried it, and Eugene managed to get an initial version working.

I was able to take his work, abstract it into a Docker container for full reproducibility, and now we have nightly builds running as part of our main Jenkins project. If you just want to try it out for Python 2.7, run:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
sudo pip2 install \

This can take quite a while to complete, largely because it looks like the SciPy compilation is extremely slow. Once it’s done, you’ll be able to run TensorFlow in Python 2. If you get an error about the .whl file not being found at that URL, the version number may have changed. To find the correct name, go to and you should see the new version listed.

For Python 3.4 support, you’ll need to use a different wheel and pip instead of pip2, like this:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
sudo pip install \

If you’re running Python 3.5, you can use the same wheel but with a slight change to the file name, since that encodes the version. You will see a couple of warnings every time you import tensorflow, but it should work correctly.

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
curl -O
mv tensorflow-1.4.0-cp34-none-any.whl tensorflow-1.4.0-cp35-none-any.whl
sudo pip install tensorflow-1.4.0-cp35-none-any.whl

If you have a Pi Zero or One that you want to use TensorFlow on, you’ll need to use an alternative wheel that doesn’t include NEON instructions. This is a lot slower than the one above that’s optimized for the Pi Two and above, so I don’t recommend you use it on newer devices. Here are the commands for Python 2.7:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
sudo pip2 install \

Here is the Python 3.4 version for the Pi Zero:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
sudo pip install \

And here are the Python 3.5 instructions:

sudo apt-get install libblas-dev liblapack-dev python-dev \
 libatlas-base-dev gfortran python-setuptools
curl -O
mv tensorflow-1.4.0-cp34-none-any.whl tensorflow-1.4.0-cp35-none-any.whl
sudo pip install tensorflow-1.4.0-cp35-none-any.whl

I’ve found the scipy compilation on Pi Zeros/Ones is so slow (many hours) that it is infeasible to wait for it to complete. Instead I’ve found myself pressing Control-C to cancel when it’s in the middle of a scipy-related compile step, and then re-running with the ‘--no-deps’ flag added after ‘install’ to skip building dependencies. This is extremely hacky, but since scipy is only needed for testing purposes you should have a workable copy of TensorFlow at the end, provided all the other dependencies completed.

If you want to build your own copy of the wheels, you can run this line from within the TensorFlow source root on a Linux machine with Docker installed to build for the Pi Two or Three with Python 2.7:

tensorflow/tools/ci_build/ PI tensorflow/tools/ci_build/pi/

For Python 3.4:

CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3 -e CROSSTOOL_PYTHON_INCLUDE_PATH=/usr/include/python3.4" tensorflow/tools/ci_build/ PI-PYTHON3 tensorflow/tools/ci_build/pi/

For Python 2.7 on the Pi Zero:

tensorflow/tools/ci_build/ PI tensorflow/tools/ci_build/pi/ PI_ONE

For Python 3.4 on the Pi Zero:

CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3 -e CROSSTOOL_PYTHON_INCLUDE_PATH=/usr/include/python3.4" tensorflow/tools/ci_build/ PI-PYTHON3 tensorflow/tools/ci_build/pi/ PI_ONE

(Note: the Docker files are currently broken because they were upgraded to use Ubuntu 16.04, and the Python cross toolchain fails to install on that version. There should be a fix visible in TensorFlow’s GitHub within the next few days, but for now you can locally change Dockerfile.pi, etc., to use 14.04 instead.)

This is all still experimental, so please do file bugs with feedback if these don’t work for you. I’m hoping we will be able to provide official stable Pi binaries for each major release in the future, like we do for Android and iOS, so knowing how well things are working is important to me. I’m also always excited to hear about cool new applications you find for TensorFlow on the Pi, so do let me know what you build too!

A quick hack to align single-word audio recordings

As I’ve been training on the initial results of the speech gathering app, one of the challenges has been aligning the recordings. There can be a delay between somebody hitting record and saying a word, or they can say it very quickly and leave a large gap at the end of the audio file. To improve the results of the training, I wanted to find a way to standardize the start of a word in my input files, since that would also let me shorten the window of audio I’m looking at, and so reduce the overall compute time.

I looked into advanced speech alignment tools like Sphinx, but they had some pretty gnarly dependencies which I was hoping to avoid in a beginning tutorial. They also had a lot of assumptions built in that didn’t transfer well to single word commands, most didn’t have many prebuilt models, and in general they weren’t easy to integrate.

Looking at visualizations of the waveforms from the recordings using the great Fission app, it usually appeared pretty obvious which section had the word, and which parts were background.


In this example, the word is in the highlighted portion, and the only other peak is a noisy click near the end. I was hoping to find an existing tool that would recognize this kind of pattern and help me remove the background, leaving only the part I wanted. I looked at both sox and ffmpeg’s silenceremove filters, but I couldn’t find one that worked well:

– Sox clipped initial sections of the spoken word, since there was a delay before it recognized ‘non-silence’.

– There was an option to avoid this with ffmpeg, but reliably detecting silence meant normalizing all my clips to a standard volume level, which wasn’t something I wanted to do to speech samples.

I also couldn’t specify that I wanted a particular length of clip. In my case, I knew I wanted a second-long result, because that’s what my models take in, and all the words should fit in that length. Most of the tools out there seemed designed to remove gaps in recorded music, but intuitively it felt like my problem was more like ‘give me the second-long section with the most relevant audio in it’.

As I thought about this, I realized that the speech should be the loudest sustained part of the recording, so if I could slide a contiguous window through the audio data and pick the section that was loudest in total, I might get good results.

To visualize what I mean, imagine a simplified waveform of a two-second long clip:


To my untrained eye, it’s clear that the middle section has the most going on. To turn that into a useful definition, I estimated the volume at each point in the file using the absolute value of the PCM sample (volume = abs(value)) and then walked through the clip looking at the total of those volumes for a one-second range. By picking the point where the sum total of the volumes is highest:


You can clip down to a short section with the loudest audio in it:
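As a rough illustration, separate from the actual tool, here’s a brute-force Python sketch of that definition. The function name and the sample values are made up, and it deliberately recomputes each window’s total from scratch to keep the logic obvious.

```python
def loudest_window_start(samples, window_length):
    """Return the start index of the window with the largest total volume."""
    volumes = [abs(value) for value in samples]
    best_start = 0
    best_total = sum(volumes[:window_length])
    for start in range(1, len(volumes) - window_length + 1):
        total = sum(volumes[start:start + window_length])
        if total > best_total:
            best_start, best_total = start, total
    return best_start


# Made-up PCM values: quiet lead-in, a "word", then a single loud click.
samples = [0, 100, -100, 600, -800, 700, -500, 100, 0, -900]
start = loudest_window_start(samples, 4)
trimmed = samples[start:start + 4]
```

Here `start` comes out as 3 and `trimmed` is `[600, -800, 700, -500]`. The click at the end is the loudest single sample, but because it isn’t sustained, no window containing it beats the one covering the word.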


I’m sure this particular wheel has been invented many times before, but I couldn’t find it in my searches, so I wanted to leave a trail of breadcrumbs for anyone else stuck with a similar problem. Hopefully people with more experience in this domain will also leave comments offering other suggestions!

The code itself is very straightforward, and I’ve put it up at The command line interface has only been designed for my particular use case, with one second hardcoded as the desired window length, only folders of .wavs supported, and no build file for anything other than OS X. It should be easy to port to your own system though, since it doesn’t have any dependencies outside of POSIX and the C/C++ standard libraries.

The only real point of interest is that it doesn’t recalculate the whole sum at every sample. Instead, it keeps a running total by subtracting the value leaving the interval as it moves forward in time and adding in the new volume, which keeps the latency very low.

// Total up the volume of the first window.
float current_volume_sum = 0.0f;
for (int64_t i = 0; i < desired_samples; ++i) {
  const float input_value = input[i];
  current_volume_sum += fabsf(input_value);
}
int64_t loudest_end_index = desired_samples;
float loudest_volume = current_volume_sum;
// Slide the window forward one sample at a time, updating the running total.
for (int64_t i = desired_samples; i < input_size; ++i) {
  const float trailing_value = input[i - desired_samples];
  current_volume_sum -= fabsf(trailing_value);
  const float leading_value = input[i];
  current_volume_sum += fabsf(leading_value);
  if (current_volume_sum > loudest_volume) {
    loudest_volume = current_volume_sum;
    loudest_end_index = i;