Will Compression Be Machine Learning’s Killer App?

Photo by Greg Simenoff

When I talk to people about machine learning on phones and devices I often get asked “What’s the killer application?”. I have a lot of different answers, everything from voice interfaces to entirely new ways of using sensor data, but the one I’m most excited about in the near term is compression. Despite being fairly well known in the research community, this seems to surprise a lot of people, so I wanted to share some of my personal thoughts on why I see compression as so promising.

I was reminded of this whole area when I came across an OSDI paper on “Neural Adaptive Content-aware Internet Video Delivery“. The summary is that by using neural networks they’re able to improve a quality-of-experience metric by 43% if they keep the bandwidth the same, or alternatively reduce the bandwidth by 17% while preserving the perceived quality. There have also been other papers in a similar vein, such as this one on generative compression, or adaptive image compression. They all show impressive results, so why don’t we hear more about compression as a machine learning application?

We don’t (yet) have the compute

All of these approaches require comparatively large neural networks, and the amount of arithmetic needed scales with the number of pixels. This means large images or video with high frames-per-second can require more computing power than current phones and similar devices have available. Most CPUs can only practically handle tens of billions of arithmetic operations per second, and running ML compression on HD video could easily require ten times that.
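To make that concrete, here’s a rough back-of-the-envelope sketch; the 1,000 ops-per-pixel figure is my own illustrative assumption rather than a number from any of the papers above.

```python
# Rough estimate of the arithmetic needed for per-pixel ML compression on HD video.
ops_per_pixel = 1_000          # assumed cost of a hypothetical compression network
width, height = 1920, 1080     # HD frame
frames_per_second = 30

ops_per_second = width * height * ops_per_pixel * frames_per_second
print(f"{ops_per_second / 1e9:.0f} billion ops per second")  # ~62 billion
```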

The good news is that there are hardware solutions, like the Edge TPU amongst others, that offer the promise of much more compute being available in the future. I’m hopeful that we’ll be able to apply these resources to all sorts of compression problems, from video and image, to audio, and even more imaginative approaches.

Natural language is the ultimate compression

One of the other reasons I think ML is such a good fit for compression is how many interesting results we’ve had recently with natural language. If you squint, you can see captioning as a way of radically compressing an image. One of the projects I’ve long wanted to create is a camera that runs captioning at one frame per second, and then writes each one out as a series of lines in a log file. That would create a very simplistic story of what the camera sees over time; I think of it as a narrative sensor.
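If you’re curious what that might look like, here’s a minimal sketch of the loop I have in mind; `caption_image()` and the camera object are hypothetical stand-ins for whatever captioning model and capture API you have available.

```python
import time

def caption_image(frame):
    """Hypothetical call into an image-captioning model."""
    raise NotImplementedError

def narrative_sensor(camera, log_path, interval_seconds=1.0):
    # Append one caption per interval to a plain-text log: a very lossy,
    # human-readable record of what the camera has seen over time.
    with open(log_path, "a") as log:
        while True:
            frame = camera.capture()           # assumed capture API
            log.write(f"{int(time.time())}\t{caption_image(frame)}\n")
            log.flush()
            time.sleep(interval_seconds)
```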

The reason I think of this as compression is that you can then apply a generative neural network to each caption to recreate images. The images won’t be literal matches to the inputs, but they should carry the same meaning. If you want results that are closer to the originals, you can also look at stylization, for example to create a line drawing of each scene. What these techniques have in common is that they identify parts of the input that are most important to us as people, and ignore the rest.

It’s not just images

There’s a similar trend in the speech world. Voice recognition is improving rapidly, and so is the ability to synthesize speech. Recognition can be seen as the process of compressing audio into natural language text, and synthesis as the reverse. You could imagine being able to highly compress conversations down to transmitting written representations rather than audio. I can’t imagine a need to go that far, but it does seem likely that we’ll be able to achieve much better quality and lower bandwidth by exploiting our new understanding of the patterns in speech.

I even see interesting possibilities for applying ML compression to text itself. Andrej Karpathy’s char-rnn shows how well neural networks can mimic styles given some examples, and that prediction is a similar problem to compression. If you think about how much redundancy is in a typical HTML page, it seems likely that there would be some decent opportunities for ML to improve on gzip. This is getting into speculation though, since I don’t have any ML text compression papers handy.

Compression is already a budget item

One of the things I learned while unsuccessfully trying to sell to businesses during my startup career was that it was much easier to make a sale if there was already a chunk of money allocated to what you were selling. The existence of a budget line item meant that the hard battle over whether the company should spend money on a solution had already been won; the only question left was which solution to buy. That’s one of the reasons why I think that ML could make dramatic inroads in this area, because manufacturers already have engineers, money, and silicon area earmarked for video and audio compression. If we can show that adding machine learning to existing solutions improves them in measurable ways (for example quality, speed, or power consumption) then they will be adopted quickly.

Bandwidth costs users and carriers money, and quality and battery life are selling points for products, so the motivation behind adopting ML for compression is much more direct than many other use cases. Existing research shows that it can be very effective, and I’m optimistic that there’s a lot more to be discovered, so I’m hopeful that it will develop into a key use of the technology.

 

What Does it Take to Train Deep Learning Models On-Device?

Photo by Fort George G. Meade Public Affairs Office

Over the past few weeks, a few different people have asked me about the state of model training on phones and embedded devices. The good news is that it’s definitely possible; I know of multiple examples of teams doing this successfully. The bad news is that our tools don’t yet make it easy.

Back in 2014 I released some code and a video showing how you could do simple transfer learning on a phone. At that point I was using a support vector machine to handle the on-device training of the final layer, but there was no fundamental reason I couldn’t have used back-propagation instead; it was just easier to use a technique I knew. I’m sure there are other examples even earlier than that too.

One area where on-device learning with neural networks has long been common is speech recognition. Even today, your phone will often ask you to say the wake word (for example “OK Google” or “Hey Siri”) a few times, so it can learn your pronunciation. Typically this won’t involve a complete retraining of the network using back propagation, but something that’s more like transfer learning on either the feature extraction process, or the final layer. Apple’s implementation uses a supplemental network to compare utterances to your references in what they term speaker space, so it’s pretty different than how we normally think of training. I still consider these cases valid examples of on-device learning, but often use the term “personalization” to capture the limited extent of the network changes.

There are situations where the kind of full-network back propagation familiar to deep learning practitioners building models in Python happens on devices too. A good published example is for the Google Keyboard application. The paper focuses on the novel aspects of sending anonymized model updates over the network, but those updates come from a traditional process of running forward and backward passes on the phone.

So, corporate teams are building applications that use learning on-device, but how can you do the same thing? The key thing to realize is that there are two different stages to training a model. The first is building the back-propagation machinery that computes gradients and applies weight updates. The second is actually feeding examples through the forward pass, and then running the backward pass to apply those gradient changes.

Most libraries, including TensorFlow, handle the automatic differentiation you need to build the back propagation operations in the Python layer. This code looks at the model you’ve defined in terms of inference operations (convolution layers feeding into activation functions, etc) and builds a complementary set of operations for the back propagation, including a loss function. The important part is that once the graph is built, at least in TensorFlow, you don’t need the Python code any more. Running the backward pass just means running the sub-graph of operations that was created by the differentiation stage.

In TensorFlow terms, this means after you’ve created the backward pass you can save off the GraphDef containing both the forward and backward operations. If you then load that in C++, either on-device, or on a server, theoretically you can also run gradient updates from labeled examples using Session::Run(). I say theoretically because we don’t have documentation on this process, and it involves a lot of unfamiliar steps that can be hard to get right in C++, like initializing variables, loading and saving checkpoints, and handling losses. This is a gap I’m hoping we can fill, but there are internal teams who’ve been able to get past these issues, so at least there’s proof that it’s possible. I’m sure there must be examples of external teams doing similar work, so if you do have code to share, in TensorFlow or any other framework, feel free to add comments or reply on Twitter; I’d love to share it!
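As a starting point, here’s a minimal sketch of the Python side using the TensorFlow 1.x graph API of the time; the layer sizes and the “features”, “labels”, “train”, and “init” names are placeholders I’ve chosen, not part of any documented recipe.

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode API

# A tiny model: a single dense layer we might want to fine-tune on-device.
inputs = tf.placeholder(tf.float32, [None, 1280], name="features")
labels = tf.placeholder(tf.int64, [None], name="labels")
logits = tf.layers.dense(inputs, 10, name="final_layer")

# The automatic differentiation happens here, in Python: minimize() adds the
# backward-pass operations and a named train op to the same graph.
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss, name="train")
init_op = tf.variables_initializer(tf.global_variables(), name="init")

# Save the combined forward+backward GraphDef; a C++ client can load it and
# call Session::Run() on the "train" target, feeding "features" and "labels".
tf.train.write_graph(tf.get_default_graph().as_graph_def(),
                     "/tmp", "train_graph.pbtxt", as_text=True)
```

The C++ side would still need to run the “init” op (or restore a checkpoint) before any training steps, which is exactly the kind of undocumented detail I mention above.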

What Image Classifiers Can Do About Unknown Objects

Photo by Brandon Giesbrecht

A few days ago I received a question from Plant Village, a team I’m collaborating with, about a problem that’s emerged with a mobile app they’re developing. It detects plant diseases, and it’s delivering good results when it’s pointed at leaves, but if you point it at a computer keyboard it thinks it’s looking at a damaged crop.

Photos from David Hughes and Amanda Ramcharan

This isn’t a surprising result to computer vision researchers, but it is a shock to most other people, so I want to explain why it’s happening, and what we can do about it.

As people, we’re used to being able to classify anything we see in the world around us, and we naturally expect machines to have the same ability. Most models are only trained to recognize a very limited set of objects though, such as the 1,000 categories of the original ImageNet competition. Crucially, the training process makes the assumption that every example the model sees is one of those objects, and the prediction must be within that set. There’s no option for the model to say “I don’t know”, and there’s no training data to help it learn that response. This is a simplification that makes sense within a research setting, but causes problems when we try to use the resulting models in the real world.

Back when I was at Jetpac, we had a lot of trouble convincing people that the ground-breaking AlexNet model was a big leap forward because every time we handed over a demo phone running the network, they would point it at their faces and it would predict something like “Oxygen mask” or “Seat belt”. This was because the ImageNet competition categories didn’t include any labels for people, but most of the photos with mask and seatbelt labels included faces along with the objects. Another embarrassing mistake came when they would point it at a plate and it would predict “Toilet seat”! This was because there were no plates in the original categories, and the closest white circular object in appearance was a toilet.

I came to think of this as the “open world” versus “closed world” problem. Models were trained and evaluated assuming that there was only ever going to be a limited universe of objects presented to them, but as soon as they make it outside the lab that assumption breaks down and they are judged by users on their performance for any arbitrary object that’s put in front of them, whether or not it was in the training set.

So, What’s The Solution?

Unfortunately I don’t know of a simple fix for this problem, but there are some strategies that I’ve seen help. The most obvious start is to add an “Unknown” class to your training data. The bad news is that this just opens up a whole different set of issues:

  • What examples should go into that class? There’s an almost limitless number of possible natural images, so how do you choose which to include?
  • How many of each different type of object do you need in the unknown class?
  • What should you do about unknown objects that look very similar to the classes you care about? For example adding a dog breed that’s not in the ImageNet 1,000, but looks nearly identical, will likely force a lot of what would have been correct matches into the unknown bucket.
  • What proportion of your training data should be made up of examples of the unknown class?

This last point actually touches on a much larger issue. The prediction values you get from image classification networks are not probabilities. They assume that the odds of seeing any particular class are equal to how often that class shows up in the training data. If you try to use an animal classifier that includes penguins in the Amazon jungle you’ll experience this problem, since (presumably) all of the penguin sightings will be false positives. Even with dog breeds in a US city, the rarer breeds show up a lot more often in the ImageNet training data than they will in a dog park, so they’ll be over-represented as false positives. The usual solution is to figure out the prior probabilities of the classes you’ll actually face in production, and then use those to calibrate the network’s outputs into something that’s closer to real probabilities.
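Here’s a minimal sketch of that kind of prior correction; the numbers are made up, and this is just one common recipe rather than a full calibration method.

```python
import numpy as np

def calibrate(probs, train_priors, deploy_priors):
    # Rescale the network's outputs by the ratio of deployment to training
    # priors, then renormalize so the scores sum to one again.
    adjusted = probs * (deploy_priors / train_priors)
    return adjusted / adjusted.sum()

probs = np.array([0.60, 0.30, 0.10])          # raw model output for one image
train_priors = np.array([0.33, 0.33, 0.34])   # class frequencies in training data
deploy_priors = np.array([0.01, 0.50, 0.49])  # expected frequencies in production
print(calibrate(probs, train_priors, deploy_priors))  # first class is scaled way down
```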

The main strategy that helps tackle the overall problem in real applications is constraining the model’s usage to situations where the assumptions about what objects will be present matches the training data. A straightforward way of doing this is through product design. You can create a user interface that directs people to focus their device on an object of interest before running the classifier, much like applications that ask you to take photographs of checks or other documents often do.

Getting a little more sophisticated, you can write a separate image classifier that tries to identify conditions that the main image classifier is not designed for. This is different than adding a single “Unknown” class, because it acts more like a cascade, or a filter before the detailed model. In the crop disease case, the operating environment is visually distinct enough that it might be fine to just train a model to distinguish between leaves and a random selection of other photos. There’s enough similarity that the gating model should at least be able to tell if the image is being taken in a type of scene that’s not supported. This gating model would be run before the full image classifier, and if it doesn’t detect something that looks like it could be a plant, it will bail out early with an error message indicating no crops were found.
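In code, the cascade might look something like this sketch; both models and the threshold are hypothetical, and in practice you’d tune the gate on real rejection data.

```python
def classify_with_gate(image, gate_model, disease_model, threshold=0.5):
    # The cheap gating model runs first; only plausible leaf images reach the
    # full disease classifier.
    leaf_score = gate_model.predict(image)   # assumed to return a probability
    if leaf_score < threshold:
        return {"error": "No crops found. Please point the camera at a leaf."}
    return {"prediction": disease_model.predict(image)}
```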

Applications that ask you to capture images of credit cards or perform other kinds of OCR will often use a combination of on-screen directions and a model to detect blurriness or lack of alignment to guide users to take photos that can be successfully processed, and an “are there leaves?” model is a simple version of this interface pattern.

This probably isn’t a very satisfying set of answers, but they’re a reflection of the messiness of user expectations once you take machine learning beyond constrained research problems. There’s a lot of common sense and external knowledge that goes into a person’s recognition of an object, and we don’t capture any of that in the classic image classification task. To get results that meet user expectations, we have to design a full system around our models that understands the world that they will be deployed in, and makes smart decisions based on more than just the model outputs.

 

Why the Future of Machine Learning is Tiny

Photo by Kevin Steinhardt

When Azeem asked me to give a talk at CogX, he asked me to focus on just a single point that I wanted the audience to take away. A few years ago my priority would have been convincing people that deep learning was a real revolution, not a fad, but there have been enough examples of shipping products that that question seems answered. I knew this was true before most people, not because I’m any kind of prophet with deep insights, but because I’d had a chance to spend a lot of time running hands-on experiments with the technology myself. I could be confident of the value of deep learning because I had seen with my own eyes how effective it was across a whole range of applications, and knew that the only barrier to seeing it deployed more widely was how long it takes to get from research to deployment.

Instead I chose to speak about another trend that I am just as certain about, and will have just as much impact, but which isn’t nearly as well known. I’m convinced that machine learning can run on tiny, low-power chips, and that this combination will solve a massive number of problems we have no solutions for right now. That’s what I’ll be talking about at CogX, and in this post I’ll explain more about why I’m so sure.

Tiny Computers are Already Cheap and Everywhere


Chart from embedded.com

The market is so fragmented that it’s hard to get precise numbers, but the best estimates are that over 40 billion microcontrollers will be sold this year, and given the persistence of the products they’re in, there’s likely to be hundreds of billions of them in service. Microcontrollers (or MCUs) are packages containing a small CPU with possibly just a few kilobytes of RAM, and are embedded in consumer, medical, automotive and industrial devices. They are designed to use very small amounts of energy, and to be cheap enough to include in almost any object that’s sold, with average prices expected to dip below 50 cents this year.

They don’t get much attention because they’re often used to replace functionality that older electro-mechanical systems handled, in cars, washing machines, or remote controls. The logic for controlling the devices is almost unchanged from the analog circuits and relays that used to be used, except possibly with a few tweaks like programmable remote control buttons or windshield wipers that vary their speed with rain intensity. The biggest benefit for the manufacturer is that standard controllers can be programmed with software rather than requiring custom electronics for each task, so they make the manufacturing process cheaper and easier.

Energy is the Limiting Factor

Any device that requires mains electricity faces a lot of barriers. It’s restricted to places with wiring, and even where it’s available, it may be tough for practical reasons to plug something new in, for example on a factory floor or in an operating theatre. Putting something high up in the corner of a room means running a cord or figuring out alternatives like power-over-ethernet. The electronics required to convert mains voltage to a range that circuitry can use are expensive and waste energy. Even portable devices like phones or laptops require frequent docking.

The holy grail for almost any smart product is for it to be deployable anywhere, and require no maintenance like docking or battery replacement. The biggest barrier to achieving this is how much energy most electronic systems use. Here are some rough numbers for common components based on figures from Smartphone Energy Consumption (see my old post here for more detail):

  • A display might use 400 milliwatts.
  • Active cell radio might use 800 milliwatts.
  • Bluetooth might use 100 milliwatts.
  • An accelerometer might use 21 milliwatts.
  • A gyroscope might use 130 milliwatts.
  • GPS might use 176 milliwatts.

A microcontroller itself might only use a milliwatt or even less, but you can see that peripherals can easily require much more. A coin battery might have 2,500 Joules of energy to offer, so even something drawing at one milliwatt will only last about a month. Of course most current products use duty cycling and sleeping to avoid being constantly on, but you can see what a tight budget there is even then.
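That month figure comes straight from the arithmetic:

```python
battery_joules = 2_500                  # energy in a typical coin cell
draw_watts = 1e-3                       # a constant one-milliwatt load
seconds = battery_joules / draw_watts   # 2,500,000 seconds
print(seconds / (60 * 60 * 24))         # ~29 days, so roughly a month
```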

CPUs and Sensors Use Almost No Power, Radios and Displays Use Lots

The overall thing to take away from these figures is that processors and sensors can scale their power usage down to microwatt ranges (for example Qualcomm’s Glance vision chip, even energy-harvesting CCDs, or microphones that consume just hundreds of microwatts) but displays and especially radios are constrained to much higher consumption, with even low-power wifi and bluetooth using tens of milliwatts when active. The physics of moving data around just seems to require a lot of energy. There seems to be a rule that the energy an operation takes is proportional to how far you have to send the bits. CPUs and sensors send bits a few millimeters, which is cheap; radios send them meters or more, which is expensive. I don’t see this relationship fundamentally changing, even as technology improves overall. In fact, I expect the relative gap between the cost of compute and radio to get even wider, because I see more opportunities to reduce computing power usage.

We Capture Much More Sensor Data Than We Use

A few years ago I talked to some engineers working on micro-satellites capturing imagery. Their problem was that they were essentially using phone cameras, which are capable of capturing HD video, but they only had a small amount of memory on the satellite to store the results, and only a limited amount of bandwidth every few hours to download to the base stations on Earth. I realized that we face the same problem almost everywhere we deploy sensors. Even in-home cameras are limited by the bandwidth of wifi and broadband connections. My favorite example of this was a friend whose December ISP usage was dramatically higher than the rest of the year, and when he drilled down it was because his blinking Christmas lights caused the video stream compression ratio to drop dramatically, since so many more frames had differences!

There are many more examples of this: all the accelerometers on our wearables and phones are only used to detect events that might wake up the device or for basic step counting, with all the possibilities of more sophisticated activity detection untouched.

What This All Means For Machine Learning

If you accept all of the points above, then it’s obvious there’s a massive untapped market waiting to be unlocked with the right technology. We need something that works on cheap microcontrollers, that uses very little energy, that relies on compute not radio, and that can turn all our wasted sensor data into something useful. This is the gap that machine learning, and specifically deep learning, fills.

Deep Learning is Compute-Bound and Runs Well on Existing MCUs

One of my favorite parts of working on deep learning implementations is that they’re almost always compute-bound. This is important because almost all of the other applications I’ve worked on have been limited by how fast large amounts of memory can be accessed, usually in unpredictable patterns. By contrast, most of the time for neural networks is spent multiplying large matrices together, where the same numbers are used repeatedly in different combinations. This means that the CPU spends most of its time doing the arithmetic to multiply two cached numbers together, and much less time fetching new values from memory.

This is important because fetching values from DRAM can easily use a thousand times more energy than doing an arithmetic operation. This seems to be another example of the distance/energy relationship, since DRAM is physically further away than registers. The comparatively low memory requirements (just tens or hundreds of kilobytes) also mean that lower-power SRAM or flash can be used for storage. This makes deep learning applications well-suited for microcontrollers, especially when eight-bit calculations are used instead of float, since MCUs often already have DSP-like instructions that are a good fit. This idea isn’t particularly new, both Apple and Google run always-on networks for voice recognition on these kind of chips, but not many people in either the ML or embedded world seem to realize how well deep learning and MCUs match.

Deep Learning Can Be Very Energy-Efficient

I spend a lot of time thinking about picojoules per op. This is a metric for how much energy a single arithmetic operation on a CPU consumes, and it’s useful because if I know how many operations a given neural network takes to run once, I can get a rough estimate for how much power it will consume. For example, the MobileNetV2 image classification network takes 22 million ops (each multiply-add is two ops) in its smallest configuration. If I know that a particular system takes 5 picojoules to execute a single op, then it will take (5 picojoules * 22,000,000) = 110 microjoules of energy to execute. If we’re analyzing one frame per second, then that’s only 110 microwatts, which a coin battery could sustain continuously for nearly a year. These numbers are well within what’s possible with DSPs available now, and I’m hopeful we’ll see the efficiency continue to increase. That means that the energy cost of running existing neural networks on current hardware is already well within the budget of an always-on battery-powered device, and it’s likely to improve even more as both neural network model architectures and hardware improve.
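Spelled out as a calculation (using the same assumed 5 picojoules per op, and the 2,500 Joule coin cell from earlier):

```python
ops_per_inference = 22_000_000   # smallest MobileNetV2 config, multiply-add = 2 ops
joules_per_op = 5e-12            # assumed hardware efficiency: 5 picojoules per op

energy_per_inference = ops_per_inference * joules_per_op   # 110 microjoules
power_watts = energy_per_inference * 1                     # one frame per second

battery_joules = 2_500
print(battery_joules / power_watts / (60 * 60 * 24))       # ~263 days, close to a year
```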

Deep Learning Makes Sense of Sensor Data

In the last few years it’s suddenly become possible to take noisy signals like images, audio, or accelerometers and extract meaning from them, by using neural networks. Because we can run these networks on microcontrollers, and sensors themselves use little power, it becomes possible to interpret much more of the sensor data we’re currently ignoring. For example, I want to see almost every device have a simple voice interface. By understanding a small vocabulary, and maybe using an image sensor to do gaze detection, we should be able to control almost anything in our environment without needing to reach it to press a button or use a phone app. I want to see a voice interface component that’s less than fifty cents that runs on a coin battery for a year, and I believe it’s very possible with the technology we have right now.

As another example, I’d love to have a tiny battery-powered image sensor that I could program to look out for things like particular crop pests or weeds, and send an alert when one was spotted. These could be scattered around fields and guide interventions like weeding or pesticides in a much more environmentally friendly way.

One of the industrial examples that stuck with me was a factory operator’s description of “Hans”. He’s a long-time engineer that every morning walks along the row of machines, places a hand on each of them, listens, and then tells the foreman which will have to be taken offline for servicing, all based on experience and intuition. Every factory has one, but many are starting to face retirement. If you could stick a battery-powered accelerometer and microphone to every machine (a “Cyber-Hans”) that would learn usual operation and signal if there was an anomaly, you might be able to catch issues before they became real problems.

I probably have a hundred other products I could dream up, but if I’m honest what I’m most excited about is that I don’t know how these new devices will be used, just that the technological imperative behind them is so compelling that they’ll be built and whole new applications I can’t imagine will emerge. For me it feels a lot like being a kid in the Eighties when the first home computers emerged. I had no idea what they would become, and most people at the time used them for games or storing address books, but there were so many possibilities I knew whole new worlds would emerge.

The Takeaway

The only reason to have an in-person meeting instead of sending around a document is to communicate the emotion behind the information. What I want to share with the CogX audience is my excitement and conviction about the future of ML on tiny devices, and while a blog post is a poor substitute for real presence, I hope I’ve got across some of that here. I don’t know the details of what the future will bring, but I know ML on tiny, cheap, battery-powered chips is coming and will open the door for some amazing new applications!

Why you need to improve your training data, and how to do it

Photo by Lisha Li

Andrej Karpathy showed this slide as part of his talk at Train AI and I loved it! It captures the difference between deep learning research and production perfectly. Academic papers are almost entirely focused on new and improved models, with datasets usually chosen from a small set of public archives. Everyone I know who uses deep learning as part of an actual application spends most of their time worrying about the training data instead.

There are lots of good reasons why researchers are so fixated on model architectures, but it does mean that there are very few resources available to guide people who are focused on deploying machine learning in production. To address that, my talk at the conference was on “the unreasonable effectiveness of training data”, and I want to expand on that a bit in this blog post, explaining why data is so important along with some practical tips on improving it.

As part of my job I work closely with a lot of researchers and product teams, and my belief in the power of data improvements comes from the massive gains I’ve seen them achieve when they concentrate on that side of their model building. The biggest barrier to using deep learning in most applications is getting high enough accuracy in the real world, and improving the training set is the fastest route I’ve seen to accuracy improvements. Even if you’re blocked on other constraints like latency or storage size, increasing accuracy on a particular model lets you trade some of it off for those performance characteristics by using a smaller architecture.

Speech Commands

I can’t share most of my observations of production systems, but I do have an open source example that demonstrates the same pattern. Last year I created a simple speech recognition example for TensorFlow, and it turned out that there was no existing dataset that I could easily use for training models. With the generous help of a lot of volunteers I collected 60,000 one-second audio clips of people speaking short words, thanks to the Open Speech Recording site the AIY team helped me launch. The resulting model was usable, but not as accurate as I’d like. To see how much of that was to do with my own limitations as a model designer, I ran a Kaggle competition using the same dataset. The competitors did much better than my naive models, but even with a lot of different approaches multiple teams came to within a fraction of a percent of 91% accuracy. To me this implied that there was something fundamentally wrong with the data, and indeed competitors uncovered a lot of errors like incorrect labels or truncated audio. This gave me the impetus to focus on a new release of the dataset with the problems they’d uncovered fixed, along with more samples.

I looked at the error metrics to understand what words the models were having the most problems with, and it turned out that the “Other” category (when speech was recognized, but the words weren’t within the model’s limited vocabulary) was particularly error-prone. To address that, I increased the number of different words that we were capturing, to provide more variety in training data.

Since the Kaggle contestants had reported labeling errors, I crowd-sourced an extra verification pass, asking people to listen to each clip and ensure that it matched the expected label. Because Kaggle had also uncovered some nearly silent or truncated files, I also wrote a utility to do some simple audio analysis and weed out particularly bad samples automatically. Finally, I increased the total number of utterances to over 100,000, despite removing bad files, thanks to the efforts of more volunteers and some paid crowd-sourcing.
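The analysis utility itself was very simple; here’s a hedged sketch of the same idea, with made-up thresholds and an assumption of 16-bit mono WAV files rather than the exact values I used.

```python
import wave
import numpy as np

def looks_bad(path, min_duration_s=0.9, min_rms=250):
    # Flag clips that are truncated or nearly silent, assuming 16-bit mono PCM.
    with wave.open(path) as wav:
        rate = wav.getframerate()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    duration = len(samples) / rate
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    return duration < min_duration_s or rms < min_rms
```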

To help others use the dataset (and learn from my mistakes!) I wrote everything relevant up in an Arxiv paper, along with updated accuracy results. The most important conclusion was that, without changing the model or test data at all, the top-one accuracy increased by over 4%, from 85.4% to 89.7%. This was a dramatic improvement, and was reflected in much higher satisfaction when people used the model in the Android or Raspberry Pi demo applications. I’m confident I would have achieved a much lower improvement if I’d spent the time on model adjustments, even though I’m currently using an architecture that I know is behind the state of the art.

This is the sort of process that I’ve seen produce great results again and again in production settings, but it can be hard to know where to start if you want to do the same thing. You can get some idea from the kind of techniques I used on the speech data, but to be more explicit, here are some approaches that I’ve found useful.

First, Look at Your Data

It may seem obvious, but your very first step should be to randomly browse through the training data you’re starting with. Copy some of the files onto your local machine, and spend a few hours previewing them. If you’re working with images, use something like macOS’s Finder to scroll through thumbnail views and you’ll be able to check out thousands very quickly. For audio, use Finder to play previews, or for text, dump random snippets into your terminal. I didn’t spend enough time doing this for the first version of Speech Commands, which is why so many problems were uncovered by Kaggle contestants once they started working with the data.

I always feel a bit silly going through this process, but I’ve never regretted it afterwards. Every time I’ve done it, I’ve discovered something critically important about the data, whether it’s an unbalanced number of examples in different categories, corrupted data (for example PNGs labeled with JPG file extensions), incorrect labels, or just surprising combinations. Tom White has made some wonderful discoveries in ImageNet using inspection, including the “Sunglass” label actually referring to an archaic device for magnifying sunlight, glamor shots for “garbage truck”, and a bias towards undead women for “cloak”. Andrej’s work manually classifying photos from ImageNet taught me a lot about the dataset too, including how hard it is to tell all the different dog breeds apart, even for a person.


What action you’ll take depends on what you find, but you should always do this kind of inspection before you do any other data cleanup, since an intuitive knowledge of what’s in the set will help you make decisions on the rest of the steps.

Pick a Model Fast

Don’t spend very long choosing a model. If you’re doing image classification, check out AutoML, otherwise look at something like TensorFlow’s model repository or Fast.AI’s collection of examples to find a model that’s solving a similar problem to your product. The important thing is to begin iterating as quickly as possible, so you can try out your model with real users early and often. You’ll always be able to swap out an improved model down the road, and maybe see better results, but you have to get the data right first. Deep learning still obeys the fundamental computing law of “garbage in, garbage out”, so even the best model will be limited by flaws in your training set. By picking a model and testing it, you’ll be able to understand what those flaws are and start improving them.

To speed up your iteration speed even more, try to start with a model that’s been pre-trained on a large existing dataset and use transfer learning to finetune it with the (probably much smaller) set of data you’ve gathered. This usually gives much better results than training only on your smaller dataset, and is much faster, so you can quickly get a feel for how you need to adjust your data gathering strategy. The most important thing is that you are able to incorporate feedback from your results into your collection process, to adapt it as you learn, rather than running collection as a separate phase before training.
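As one example of what that can look like in practice, here’s a minimal Keras transfer-learning sketch; MobileNetV2, the five-class head, and the optimizer choice are all just illustrative defaults, not recommendations for any particular dataset.

```python
import tensorflow as tf

# Reuse an ImageNet-pretrained feature extractor and train only a small head.
base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                          input_shape=(224, 224, 3),
                                          weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),   # 5 = your own class count
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)     # your small labeled set
```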

Fake It Before You Make It

The biggest difference between building models for research and production is that research usually has a clear problem statement defined at the start, but the requirements for real applications are locked inside users’ heads and can only be extracted over time. For example, for Jetpac we wanted to find good photos to show in automated travel guides for cities. We started off asking raters to label a photo if they considered it “Good”, but we ended up with lots of pictures of smiling people, since that’s how they interpreted the question. We put these into a mockup of the product to see how test users reacted, and they weren’t impressed; the photos just weren’t inspirational. To tackle that, we refined the question to “Would this photo make you want to travel to the place it shows?”. This got us content that was a lot better, but it turned out that we were using workers in Southeast Asia who thought that conference photos looked amazing, full of people with suits and glasses of wine in large hotels. This mismatch was a sobering reminder of the bubble we live in, but it was also a practical problem because our target audience in the US saw conference photos as depressing and non-aspirational. In the end, the six of us on the Jetpac team manually rated over two million photos ourselves, since we knew the criteria better than anyone we could train.

This is an extreme example, but it demonstrates how the labeling process depends heavily on the application’s requirements. For most production use cases there’s a long period of figuring out the right question for the model to answer, and this is crucial to get right. If you’re answering the wrong question with your model, you’ll never be able to build a solid user experience on that poor foundation.

Photo by Thomas Hawk

The only way I’ve found to tell if you are asking the right question is to mock up your application, but instead of using a machine learning model, have a human in the loop. This is sometimes known as “Wizard-of-Oz-ing”, since there’s a man behind the curtain. In the Jetpac case, we had people manually choose photos for some sample travel guides, rather than training a model, and used feedback from test users to adjust the criteria we used for picking the pictures. Once we were reliably getting positive feedback from the tests, we then transferred the photo-choosing rules we’d developed into a labeling playbook for going through millions of images for the training set. This then trained the model that was able to predict quality for billions of photos, but its DNA came from those original manual rules we developed.

Train on Realistic Data

With Jetpac the images we used for training our models were from the same sources (largely Facebook and Instagram) as the photos we wanted to apply the models to, but a common problem I see is that the training dataset is different in important ways from the inputs a model will eventually see in production. For example, I’ll frequently see teams that have a model trained on ImageNet hitting problems when they try to use it in a drone or robot. This happens because ImageNet is full of photos taken by people, and these have a lot of properties in common. They’re shot with phones or still cameras, using neutral lenses, at roughly head height, in daylight or with artificial lighting, with the labeled object centered and in the foreground. Robots and drones use video cameras, often with high field-of-view lenses, from either floor level or from above, with poor lighting, and without intelligent framing of any objects, so they’re typically cropped. These differences mean that you’ll see poor accuracy if you just take a model trained on photos from ImageNet and deploy it on one of those devices.

There are also more subtle ways that your training data can diverge from what your final application will see. Imagine you were building a camera to recognize wildlife and used a dataset of animals around the world to train on. If you were only ever going to deploy in the jungles of Borneo, then the odds of a penguin label ever being correct are astronomically low. If Antarctic photos were included in the training data, then there will be a much higher chance that it will mistake something else for a penguin, and so your overall error rate will be worse than if you’d excluded those images from training.

There are ways to calibrate your results based on known priors (for example scale penguin probabilities down massively in jungle environments) but it’s much easier and more effective to use a training set that reflects what the product will actually encounter. The best way I’ve found to do that is to always use data captured directly from your actual application, which ties in nicely with the Wizard of Oz approach I suggested above. Your human-in-the-loop becomes the labeler of your initial dataset, and even if the number of labels gathered is quite small, they’ll reflect real usage and should hopefully be enough for some initial experiments with transfer learning.

Follow the Metrics

When I was working on the Speech Commands example, one of the most frequent reports I looked at was the confusion matrix during training. Here’s an example of how that’s shown in the console:

[[258 0 0 0 0 0 0 0 0 0 0 0]
 [ 7 6 26 94 7 49 1 15 40 2 0 11]
 [ 10 1 107 80 13 22 0 13 10 1 0 4]
 [ 1 3 16 163 6 48 0 5 10 1 0 17]
 [ 15 1 17 114 55 13 0 9 22 5 0 9]
 [ 1 1 6 97 3 87 1 12 46 0 0 10]
 [ 8 6 86 84 13 24 1 9 9 1 0 6]
 [ 9 3 32 112 9 26 1 36 19 0 0 9]
 [ 8 2 12 94 9 52 0 6 72 0 0 2]
 [ 16 1 39 74 29 42 0 6 37 9 0 3]
 [ 15 6 17 71 50 37 0 6 32 2 1 9]
 [ 11 1 6 151 5 42 0 8 16 0 0 20]]

This might look intimidating, but it’s actually just a table showing details about the mistakes the network is making. Here’s a labeled version that’s a bit prettier:


Each row in this table represents a set of samples where the actual true label is the same, and each column shows the numbers for the predicted labels. For example the highlighted row represents all of the audio samples that were actually silent, and if you read from left to right, you can see that the predicted labels for those were correct, with every one falling in the column for predicted silence. What this tells us is that the model is very good at correctly spotting real silences, there are no false negatives. If we look at the whole column, showing how many clips were predicted to be silence, we can see that some clips that were actually words were mistaken for silence, with quite a few false positives. This turned out to be helpful to know, because it caused me to look more closely at the clips that were mistakenly being classified as silence, and a lot of them were unusually quiet recordings. That helped me improve the quality of the data by removing low-volume clips, which I wouldn’t have known to do without the clue from the confusion matrix.

Almost any kind of summary of the results can be useful, but I find the confusion matrix to be a good compromise that gives more information than a single accuracy number but doesn’t overwhelm me with too much detail. It’s also useful to watch the numbers change during training, since it can tell you what categories the model is struggling to learn, and give you areas to concentrate on when cleaning and expanding your dataset.
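If your training framework doesn’t print one for you, a confusion matrix is easy to generate yourself; here’s a tiny example using scikit-learn with toy labels.

```python
from sklearn.metrics import confusion_matrix

# Toy example with three classes; in practice these come from your eval set.
true_labels      = [0, 0, 1, 1, 2, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0, 2]

# Rows are the actual classes, columns are the predicted ones, as in the table above.
print(confusion_matrix(true_labels, predicted_labels))
```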

Birds of a Feather

One of my favorite ways of understanding how my networks are interpreting my training data is by visualizing clusters. TensorBoard has fantastic support for this kind of exploration, and while it’s often used for viewing word embeddings, I find it useful for almost any layer that works like an embedding. For example, image classification networks usually have a penultimate layer before the final fully-connected or softmax unit which can be used as an embedding (which is how simple transfer learning examples like TensorFlow for Poets work). These aren’t strictly embeddings, because there’s no effort during training to ensure they have the desirable spatial properties you’d hope for in a true embedding layout, but clustering their vectors does produce interesting results.

As a practical example, a team I was working with were puzzled by high error rates for certain animals in their image classification model. They used a clustering visualization to see how their training data was distributed for various categories, and when they looked at “Jaguar”, they clearly saw the data sorted into two distinct groups some distance from each other.

Photos by djblock99 and Dave Adams

Here’s a diagram of the kind of thing they saw. Once the photos in each cluster were shown, it became obvious that a lot of Jaguar-brand vehicles were incorrectly labeled as jaguar cats. Once they knew that, they were able to look at the labeling process and realized that the directions and the user-interface for the workers were confusing. With that information they were able to improve the (human) training process for the labelers and fix the tooling, which removed all the automobile images from the jaguar category and gave a model with much better accuracy for that class.

Clustering gives a lot of the same benefits you get from just looking at your data, by giving you a deep familiarity with what’s in your training set, but the network actually guides your exploration by sorting the inputs into groups based on its own learned understanding. As people we’re great at spotting anomalies visually, so the combination of our intuition and a computer’s ability to process large numbers of inputs gives a very scalable solution to tracking down dataset quality issues. A full tutorial on using TensorBoard to do this is beyond the scope of this post (it’s already long enough that I’m grateful you’re still reading this far in!) but if you’re serious about boosting your results I highly recommend getting familiar with the tool.
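If you want a quick-and-dirty version without TensorBoard, here’s a sketch of the same idea: treat penultimate-layer activations as embeddings, cluster them, and then look at which images ended up in each group. The Keras-style model and the k-means choice are my own assumptions.

```python
from sklearn.cluster import KMeans
import tensorflow as tf

def cluster_embeddings(model, images, layer_name, num_clusters=2):
    # Build a sub-model that outputs the chosen (penultimate) layer's activations.
    embedder = tf.keras.Model(inputs=model.input,
                              outputs=model.get_layer(layer_name).output)
    embeddings = embedder.predict(images)
    # Group the vectors; inspecting the images in each cluster is where the insight is.
    return KMeans(n_clusters=num_clusters).fit_predict(embeddings)
```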

Always Be Gathering

I’ve never seen gathering more data not improve model accuracy, and it turns out that there’s a lot of research to back up my experience.


This diagram is from “Revisiting the Unreasonable Effectiveness of Data“, and shows how model accuracy for image classification keeps increasing even as the training dataset size grows into the hundreds of millions. Facebook recently took this even further and used billions of Instagram images labeled with tags to achieve new record accuracy on ImageNet classification. What this shows is that even for problems with large, high-quality datasets, increasing the size of the training set still boosts model results.

This means that you need a strategy for continuous improvement of your dataset for as long as there’s any user benefit to better model accuracy. If you can, find creative ways to harness even weak signals to access larger datasets. Facebook’s use of Instagram tags is a great example of this. Another approach is to increase the intelligence of your labeling pipeline, for example by augmenting the tooling by suggesting labels predicted by the initial version of your model so that labelers can make faster decisions. This has the danger of baking in initial biases, but in practice the benefits often outweigh this risk. Throwing money at the problem by hiring more people to label new training inputs is usually a worthwhile investment too, though it can be difficult in organizations that don’t traditionally have a line item in their budget for this kind of expenditure. If you’re a non-profit, making it easier for your supporters to voluntarily contribute data through some kind of public tool can be a great way to increase your set size without breaking the bank.

Of course the holy grail for any organization is to have a product that generates more labeled data naturally as it’s being used. I wouldn’t get too fixated on this idea though, it doesn’t fit with a lot of real-world use cases where people just want to get an answer as quickly as possible without the complications involved in labeling. It’s a great investment pitch if you’re a startup, since it’s like a perpetual motion machine for model improvements, but there’s almost always some per-unit cost involved in cleaning up or augmenting the data you’ll receive, so the economics often end up looking more like a cheaper version of commercial crowdsourcing than something truly free.

Highway to the Danger Zone

There are almost always model errors that have bigger impacts on your application’s users than the loss function captures. You should think about the worst possible outcomes ahead of time and try to engineer a backstop to the model to avoid them. This might just be a blacklist of categories you never want to predict, because the cost of a false positive is so high, or you might have a simple algorithmic set of rules to ensure that the actions taken don’t exceed some boundary parameters you’ve decided. For example, you might keep a list of swear words that you never want a text generator to output, even if they’re in the training set, because it wouldn’t be appropriate in your product.
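A backstop like that can be very small; here’s a sketch of a blocked-label filter, with placeholder category names.

```python
BLOCKED_LABELS = {"category_we_never_show", "another_blocked_label"}  # placeholders

def safe_top_prediction(labels, scores):
    # Return the highest-scoring prediction that isn't on the blocked list,
    # no matter how confident the model is about a blocked label.
    for label, score in sorted(zip(labels, scores), key=lambda pair: -pair[1]):
        if label not in BLOCKED_LABELS:
            return label, score
    return None, 0.0   # nothing safe to show; fall back to a default experience
```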

It’s not always so obvious ahead of time what the bad outcomes might be though, so it’s essential to learn from your mistakes in the real world. One of the simplest ways to do this, once you have a half-decent product/market fit, is to use bug reports. When people use your application, and they get a result they don’t like from the model, make it easy for them to tell you. If possible, get the full input to the model, but if it’s sensitive data, just knowing what the bad output was can be helpful to guide your investigation. These categories can be used to choose where you gather more data, and which classes you explore to understand their current label quality. Once you have a new revision of your model, take the set of inputs that previously produced bad results and run a separate evaluation on those, in addition to the normal test set. This rogues gallery works a bit like a regression test, and gives you a way to track how well you’re improving the user experience, since a single model accuracy metric will never fully capture everything that people care about. By looking at a small number of examples that prompted a strong reaction in the past, you’ve got some independent evidence that you’re actually making things better for your users. If you can’t capture the input data to your model in these cases because it’s too sensitive, use dogfooding or internal experimentation to figure out which inputs you do have access to that produce these mistakes, and substitute those in your regression set instead.
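Here’s a sketch of what that rogues-gallery check might look like; the model interface and the way examples are stored are assumptions, and in practice you’d wire this into your normal evaluation pipeline.

```python
def rogues_gallery_report(model, rogue_examples):
    # rogue_examples: list of (input, expected_label) pairs gathered from bug
    # reports, dogfooding, or internal experiments.
    failures = []
    for example, expected in rogue_examples:
        predicted = model.predict(example)
        if predicted != expected:
            failures.append((example, expected, predicted))
    print(f"{len(failures)} of {len(rogue_examples)} past problem cases still fail")
    return failures
```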

What’s the Story, Morning Glory?

I hope I’ve managed to convince you to spend more time on your data, and given you some ideas on how to invest to improve it. There isn’t as much attention given to this area as it deserves, and I barely feel like I’m scraping the surface with the advice here, so I’m grateful to everyone who has shared their strategies with me, and I hope that I’ll be hearing from a lot more of you about the approaches you’ve had success with. I think there will be an increasing number of organizations who dedicate teams of engineers exclusively to dataset improvement, rather than leaving it to ML researchers to drive progress, and I’m looking forward to seeing the whole field move forward thanks to that. I’m constantly amazed at how well models work even with deeply flawed training data, so I can’t wait to see what we’ll be able to do as our sets improve!

Why ML interfaces will be more like pets than machines

Photo by Dave Parker

When I talk to people about what’s happening in deep learning, I often find it hard to get across why I’m so excited. If you look at a lot of the examples in isolation, they just seem like incremental progress over existing features, like better search for photos or smarter email auto-replies. Those are great of course, but what strikes me when I look ahead is how the new capabilities build on each other as they’re combined together. I believe that they will totally change the way we interact with technology, moving from the push-button model we’ve had since the industrial revolution to something that’s more like a collaboration with our tools. It’s not a perfect analogy, but the most useful parallel I can think of is how our relationship with pets differs from our interactions with machines.

To make what I’m saying more concrete, imagine a completely made-up device for helping around the house (I have no idea if anyone’s building something like this, so don’t take it as any kind of prediction, but I’d love one if anybody does get round to it!). It’s a small indoor drone that assists with the housework, with cleaning attachments and a grabbing arm. I’ve used some advanced rendering technology to visualize a mockup below:


Ignoring all the other questions this raises (why can’t I pick up my own socks?), here are some of the behaviors I’d want from something like this:

  • It only runs when I’m not home.
  • It learns where I like to put certain items.
  • It can scan and organize my paper receipts and mail.
  • It will help me find mislaid items.
  • It can be summoned with a voice command, or when it hears an accident.

Here are the best approaches I can think of to meet those requirements without using deep learning:

  • It only runs when I’m not home.
    • Run on a fixed schedule I program in.
  • It learns where I like to put certain items.
    • Puts items in fixed locations.
  • It can scan and organize my paper receipts and mail.
    • Can OCR receipts, but identifying them in the clutter is hard.
  • It will help me find mislaid items.
    • Not possible.
  • It can be summoned with a voice command, or when it hears an accident.
    • Difficult and hard to generalize.

These limitations are part of the reason nothing like this has been released. Now, let’s look at how these challenges can be met with current deep learning technology:

  • It only runs when I’m not home.
    • Person detection.
  • It learns where I like to put certain items.
    • Object classification.
  • It can scan and organize my paper receipts and mail.
    • Object classification and OCR.
  • It will help me find mislaid items.
    • Natural language processing and object classification.
  • It can be summoned with a voice command, or when it hears an accident.
    • Higher-quality voice and audio recognition.

The most important part about all these capabilities is that for the first time they are starting to work reliably enough to be useful, but there will still be plenty of mistakes. For this application we’re actually asking the device to understand a lot about us and the world around it, and make decisions on its own. I believe we’re at a point where that’s now possible, but their fallibility deeply changes how we’ll need to interact with products. We’ll benefit as devices become more autonomous, but it also means we’ll need to tolerate more mistakes and find ways to give feedback so they can learn smarter behaviors over time.

This is why the only analogy that I can think of to what’s coming is our pets. They don’t always do what we want, but (sometimes) they learn, and even when they don’t they bring so much that we’re happy to have them in our lives. This is very different from our relationship with machines. There we’re always deciding what needs to happen based on our own observations of the world, and then instructing our tools to do exactly as we order. Any deviation from the behavior we specify is usually a serious bug, but there’s no easy way to teach changes; we usually have to build a whole new version. They will also carry out any order, no matter how little sense it might make. Everything from a Spinning Jenny to a desktop GUI relies on the same implicit command and control division of labor between people and tools.

Ever since we started building complex machines this is how our world has worked, but the advances in deep learning are going to take us in a different direction. Of course, tools that are more like agents aren’t a new idea, and there have been some notable failures in the past.

Photo by Rhonda Oglesby

So what’s different? I believe machine learning is now able to do a much better job of understanding user behavior and the surrounding world, and so we won’t be in the uncanny valley that Clippy was stuck in, aggressively misunderstanding people’s intent and then never learning from their evident frustration. He’s a good reminder of the dangers that lurk along the path of autonomy though. To help think about how future interfaces will be developing, here are a few key areas I see them differing in from the current state of the art.

Fallible versus Foolproof

The world is messy, and so any device that’s trying to make sense of it will need to interpret unclear data and make the best decisions it can. There still need to be hard limits around anything to do with safety, but deep learning products will need to be designed with inevitable mistakes in mind. The cost of any mistakes will have to be much less than the value of the benefits they bring, but part of that cost can be mitigated by design, so that it’s easy to cancel actions or there’s more of a pause and request for confirmation when there’s uncertainty.

Learning versus Hardcoded

One of the hardest problems when you work with complex deep learning models is how to run a quality assurance process, and it only gets tougher once systems can learn after they’re deployed. There’s no substitute for real-world testing, but the whole process of evaluating products will need to be revamped to cope with more flexible and unpredictable responses. Tay is another cautionary tale for what can go wrong with uncontrolled learning.

Attentive or Ignorant

Traditional tools wait to be told what to do by their owner, and don’t have any concept of common sense. Even if the house is burning down around it, your television won’t try to wake you up. Future products will have a much richer sense of what’s happening in the world around them, and will be expected to respond in sensible ways to all sorts of situations outside of their main function. This is vital for smart devices to become truly useful but vastly expands the “surface” of their interfaces, making designs based around flow charts impossible.

I definitely don’t have all the answers for how we’ll deal with this new breed of interfaces, but I do know that we need some new ways of thinking about them. Personally I’d much rather spend time with pets than machines, so I hope that I am right about where we’re headed!

Enter the OVIC low-power challenge!

Photo by Pete

I’m a big believer in the power of benchmarks to help innovators compete and collaborate together. It’s hard to imagine deep learning taking off in the way it did without ImageNet, and I’ve learned so much from the Kaggle community as teams work to come up with the best solutions. It’s surprisingly hard to create good benchmarks though, as I’ve learned in the Kaggle competitions I’ve run. Most of engineering is about tradeoffs, and when you specify just a single metric you end up with solutions that ignore other costs you might care about. It made sense in the early days of the ImageNet challenge to focus only on accuracy because that was by far the biggest problem that blocked potential users from deploying computer vision technology. If the models don’t work well enough with infinite resources, then nothing else matters.

Now that deep learning can produce models that are accurate enough for many applications, we’re facing a different set of challenges. We need models that are fast and small enough to run on mobile and embedded platforms, and now that the maximum achievable accuracy is so high, we’re often able to trade some of it off to fit the resource constraints. Models like SqueezeNet, MobileNet, and recently MobileNet v2 have emerged that offer the ability to pick the best accuracy you can get given particular memory and latency constraints. These are extremely useful solutions for many applications, and I’d like to see research in this area continue to flourish, but because the models all involve trade-offs it’s not possible to evaluate them with a single metric. It’s also tricky to measure some of the properties we care about, like latency and memory usage, because they’re tied to particular hardware and software implementations. For example, some of the early NASNet models had very low numbers of floating-point operations, but it turned out that, because of the model structure and software implementations, they didn’t translate into latency as low as we’d expected in practice.

All this means it’s a lot of work to propose a useful benchmark in this area, but I’m very pleased to say that Bo Chen, Jeff Gilbert, Andrew Howard, Achille Brighton, and the rest of the Mobile Vision team have put in the effort to launch the On-device Visual Intelligence Challenge for CVPR. This includes a complete suite of software for measuring accuracy and latency on known devices, and I’m hoping it will encourage a lot of innovative new model architectures that will translate into practical advances for application developers. One of the exciting features of this competition is that there are a lot of ways to produce an impressive entry, even if it doesn’t win the main 30ms-on-a-Pixel-phone challenge, because the state of the art is a curve not a point. For example, I’d love a model that gave me 40% top-one accuracy in well under a millisecond, since that would probably translate well to even smaller devices and would still be extremely useful. You can read more about the rules here, and I look forward to seeing your creative entries!