Why ML interfaces will be more like pets than machines

Photo by Dave Parker

When I talk to people about what’s happening in deep learning, I often find it hard to get across why I’m so excited. If you look at a lot of the examples in isolation, they just seem like incremental progress over existing features, like better search for photos or smarter email auto-replies. Those are great of course, but what strikes me when I look ahead is how the new capabilities build on each other as they’re combined. I believe that they will totally change the way we interact with technology, moving from the push-button model we’ve had since the industrial revolution to something that’s more like a collaboration with our tools. It’s not a perfect analogy, but the most useful parallel I can think of is how our relationship with pets differs from our interactions with machines.

To make what I’m saying more concrete, imagine a completely made-up device for helping around the house (I have no idea if anyone’s building something like this, so don’t take it as any kind of prediction, but I’d love one if anybody does get round to it!). It’s a small indoor drone that assists with the housework, with cleaning attachments and a grabbing arm. I’ve used some advanced rendering technology to visualize a mockup below:

[Mockup image: mopbot]

Ignoring all the other questions this raises (why can’t I pick up my own socks?), here are some of the behaviors I’d want from something like this:

  • It only runs when I’m not home.
  • It learns where I like to put certain items.
  • It can scan and organize my paper receipts and mail.
  • It will help me find mislaid items.
  • It can be summoned with a voice command, or when it hears an accident.

Here are the best approaches I can think of to meet those requirements without using deep learning:

  • It only runs when I’m not home.
    • Run on a fixed schedule I program in.
  • It learns where I like to put certain items.
    • Puts items in fixed locations.
  • It can scan and organize my paper receipts and mail.
    • Can OCR receipts, but identifying them in the clutter is hard.
  • It will help me find mislaid items.
    • Not possible.
  • It can be summoned with a voice command, or when it hears an accident.
    • Difficult and hard to generalize.

These limitations are part of the reason nothing like this has been released. Now, let’s look at how these challenges can be met with current deep learning technology:

  • It only runs when I’m not home.
    • Person detection (see the sketch after this list).
  • It learns where I like to put certain items.
    • Object classification.
  • It can scan and organize my paper receipts and mail.
    • Object classification and OCR.
  • It will help me find mislaid items.
    • Natural language processing and object classification.
  • It can be summoned with a voice command, or when it hears an accident.
    • Higher-quality voice and audio recognition.
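
To make the first of those mappings concrete, here’s a minimal sketch of person detection gating when the drone runs. Everything in it is hypothetical: detect_people() stands in for whatever off-the-shelf vision model you’d actually use, and the threshold and polling interval are made-up values, just to show the shape of the logic.

import time

PERSON_THRESHOLD = 0.6  # hypothetical confidence cutoff
POLL_SECONDS = 60       # hypothetical: how often to re-check

def detect_people(frame):
    """Stand-in for an off-the-shelf person detector: imagine it
    returns one confidence score per person found in the frame."""
    return []  # placeholder so the sketch runs end to end

def nobody_home(frame):
    return all(score < PERSON_THRESHOLD for score in detect_people(frame))

def cleaning_loop(get_camera_frame, start_cleaning, stop_cleaning):
    while True:
        if nobody_home(get_camera_frame()):
            start_cleaning()
        else:
            stop_cleaning()  # someone appeared, so back off immediately
        time.sleep(POLL_SECONDS)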

The most important thing about all these capabilities is that for the first time they’re starting to work reliably enough to be useful, but there will still be plenty of mistakes. For this application we’re actually asking the device to understand a lot about us and the world around it, and to make decisions on its own. I believe we’re at a point where that’s now possible, but the fallibility of these systems deeply changes how we’ll need to interact with the products built on them. We’ll benefit as devices become more autonomous, but it also means we’ll need to tolerate more mistakes and find ways to give feedback so they can learn smarter behaviors over time.

This is why the only analogy I can think of for what’s coming is our pets. They don’t always do what we want, but (sometimes) they learn, and even when they don’t, they bring so much that we’re happy to have them in our lives. This is very different from our relationship with machines. There we’re always deciding what needs to happen based on our own observations of the world, and then instructing our tools to do exactly as we order. Any deviation from the behavior we specify is usually a serious bug, and there’s no easy way to teach changes; we usually have to build a whole new version. Machines will also carry out any order, no matter how little sense it might make. Everything from a Spinning Jenny to a desktop GUI relies on the same implicit command-and-control division of labor between people and tools.

Ever since we started building complex machines this is how our world has worked, but the advances in deep learning are going to take us in a different direction. Of course, tools that are more like agents aren’t a new idea, and there have been some notable failures in the past.

Photo by Rhonda Oglesby

So what’s different? I believe machine learning is now able to do a much better job of understanding user behavior and the surrounding world, so we won’t be stuck in the uncanny valley that trapped Clippy, aggressively misunderstanding people’s intent and never learning from their evident frustration. He’s a good reminder of the dangers that lurk along the path to autonomy, though. To help think about how future interfaces will develop, here are a few key areas where I see them differing from the current state of the art.

Fallible versus Foolproof

The world is messy, and so any device that’s trying to make sense of it will need to interpret unclear data and make the best decisions it can. There still need to be hard limits around anything to do with safety, but deep learning products will need to be designed with inevitable mistakes in mind. The cost of any mistakes will have to be much less than the value of the benefits they bring, but part of that cost can be mitigated by design, so that it’s easy to cancel actions or there’s more of a pause and request for confirmation when there’s uncertainty.
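
One way to bake that into a design is a simple three-way policy: act when the model is confident, pause and ask for confirmation in the uncertain middle, and do nothing when confidence is low. This is just a sketch with invented thresholds, not a recipe:

ACT_THRESHOLD = 0.9      # hypothetical: confident enough to just act
CONFIRM_THRESHOLD = 0.6  # hypothetical: uncertain enough to ask first

def decide(action, confidence, ask_user):
    # Map the model's confidence to one of three behaviors, keeping
    # the cost of a mistake below the value of acting autonomously.
    if confidence >= ACT_THRESHOLD:
        return action()  # act, but make sure it's easy to cancel
    if confidence >= CONFIRM_THRESHOLD:
        if ask_user("Should I go ahead with this?"):
            return action()
    return None  # too unsure, so the safest choice is to do nothing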

Learning versus Hardcoded

One of the hardest problems when you work with complex deep learning models is how to run a quality assurance process, and it only gets tougher once systems can learn after they’re deployed. There’s no substitute for real-world testing, but the whole process of evaluating products will need to be revamped to cope with more flexible and unpredictable responses. Tay is another cautionary tale for what can go wrong with uncontrolled learning.
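
As a tiny example of what that revamped evaluation might look like, here’s a sketch of a regression test that asserts a model’s accuracy on a frozen evaluation set stays inside an expected band, rather than checking for exact outputs that a learning system can’t guarantee. The band’s numbers are invented:

def accuracy(model, eval_examples):
    correct = sum(1 for x, label in eval_examples if model(x) == label)
    return correct / len(eval_examples)

def test_accuracy_stays_in_band(model, frozen_eval_set):
    # A system that keeps learning won't give bit-exact answers from
    # run to run, so test against a tolerance band instead.
    acc = accuracy(model, frozen_eval_set)
    assert 0.88 <= acc <= 0.95, "accuracy %.3f drifted out of band" % acc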

Attentive versus Ignorant

Traditional tools wait to be told what to do by their owner, and don’t have any concept of common sense. Even if the house is burning down around it, your television won’t try to wake you up. Future products will have a much richer sense of what’s happening in the world around them, and will be expected to respond in sensible ways to all sorts of situations outside of their main function. This is vital for smart devices to become truly useful but vastly expands the “surface” of their interfaces, making designs based around flow charts impossible.

I definitely don’t have all the answers for how we’ll deal with this new breed of interfaces, but I do know that we need some new ways of thinking about them. Personally I’d much rather spend time with pets than machines, so I hope that I am right about where we’re headed!

Enter the OVIC low-power challenge!

Photo by Pete

I’m a big believer in the power of benchmarks to help innovators compete and collaborate. It’s hard to imagine deep learning taking off in the way it did without ImageNet, and I’ve learned so much from the Kaggle community as teams work to come up with the best solutions. It’s surprisingly hard to create good benchmarks though, as I’ve learned in the Kaggle competitions I’ve run. Most of engineering is about tradeoffs, and when you specify just a single metric you end up with solutions that ignore other costs you might care about. It made sense in the early days of the ImageNet challenge to focus only on accuracy, because that was by far the biggest problem blocking potential users from deploying computer vision technology. If the models don’t work well enough even with infinite resources, then nothing else matters.

Now that deep learning can produce models that are accurate enough for many applications, we’re facing a different set of challenges. We need models that are fast and small enough to run on mobile and embedded platforms, and now that the maximum achievable accuracy is so high, we’re often able to trade some of it away to fit the resource constraints. Models like SqueezeNet, MobileNet, and recently MobileNet v2 have emerged that let you pick the best accuracy you can get given particular memory and latency constraints. These are extremely useful solutions for many applications, and I’d like to see research in this area continue to flourish, but because the models all involve trade-offs it’s not possible to evaluate them with a single metric. It’s also tricky to measure some of the properties we care about, like latency and memory usage, because they’re tied to particular hardware and software implementations. For example, some of the early NASNet models had very low numbers of floating-point operations, but it turned out that, because of the model structure and the software implementations available, those low operation counts didn’t translate into the low latency we’d expected in practice.
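
That’s why even a simple evaluation harness for these models ends up reporting several numbers at once, and why latency needs to be measured as wall-clock time on the target device rather than inferred from operation counts. A rough sketch of the kind of harness I mean, with model standing in for any callable you want to measure:

import statistics
import time

def benchmark(model, examples, warmup=10, runs=100):
    # Assumes len(examples) >= warmup and len(examples) >= runs.
    for x, _ in examples[:warmup]:
        model(x)  # warm up caches so we measure steady-state latency
    latencies, correct = [], 0
    for x, label in examples[:runs]:
        start = time.perf_counter()
        prediction = model(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
        correct += int(prediction == label)
    latencies.sort()
    return {
        "top1_accuracy": correct / float(runs),
        "latency_ms_mean": statistics.mean(latencies),
        "latency_ms_p95": latencies[int(0.95 * len(latencies))],
    }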

All this means it’s a lot of work to propose a useful benchmark in this area, but I’m very pleased to say that Bo Chen, Jeff Gilbert, Andrew Howard, Achille Brighton, and the rest of the Mobile Vision team have put in the effort to launch the On-device Visual Intelligence Challenge for CVPR. This includes a complete suite of software for measuring accuracy and latency on known devices, and I’m hoping it will encourage a lot of innovative new model architectures that will translate into practical advances for application developers. One of the exciting features of this competition is that there are a lot of ways to produce an impressive entry, even if it doesn’t win the main 30ms-on-a-Pixel-phone challenge, because the state of the art is a curve not a point. For example, I’d love a model that gave me 40% top-one accuracy in well under a millisecond, since that would probably translate well to even smaller devices and would still be extremely useful. You can read more about the rules here, and I look forward to seeing your creative entries!

Speech Commands is now larger and cleaner!

Picture by Aaron Parecki

When I launched the Speech Commands dataset last year I wasn’t quite sure what to expect, but I’ve been very happy to see all the creative ways people have used it, like guiding embedded optimizations or testing new model architectures. The best part has been all the conversations I’ve ended up having because of it, and how much I’ve learned about the area of microcontroller machine learning from other people in the field.

Having a lot of eyes on the data (especially through the Kaggle competition) gave me a lot more insight into how to improve its quality, and there’s been a steady stream of volunteers donating their voices to expand the number of utterances. I also had a lot of requests for a paper giving more details on the dataset, especially covering how it was collected and what the best approaches to benchmarking accuracy were. With all of that in mind, I spent the past few weeks gathering the voice data that had been donated recently, improving the labeling process, and documenting it all in much more depth. I’m pleased to say that the resulting paper is now up on arXiv, and you can download the expanded and improved archive of over one hundred thousand utterances. The folder layout is still compatible with the first version, so to run the example training script from the tutorial, you can just execute:

python tensorflow/examples/speech_commands/train.py \
--data_url=http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
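
One benchmarking detail from the paper worth repeating here: files are assigned to the training, validation, or test set by hashing their names, so the split stays stable as new utterances arrive and recordings from the same speaker never leak between sets. Here’s a sketch in the same spirit as the which_set() function in the tutorial’s input_data.py:

import hashlib
import os
import re

def which_set(filename, validation_percentage=10.0, testing_percentage=10.0):
    # Hash everything before '_nohash_', so all utterances from the
    # same speaker land in the same partition, and the assignment
    # doesn't shift as more files are added to the archive.
    base_name = os.path.basename(filename)
    speaker_id = re.sub(r"_nohash_.*$", "", base_name)
    hash_value = int(hashlib.sha1(speaker_id.encode("utf-8")).hexdigest(), 16)
    percentage_hash = (hash_value % 10000) / 100.0
    if percentage_hash < validation_percentage:
        return "validation"
    if percentage_hash < validation_percentage + testing_percentage:
        return "testing"
    return "training"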

I’m looking forward to hearing more about how you’re using the dataset, and continuing the conversations it has already sparked, so I hope you have as much fun with it as I have!

The Machine Learning Reproducibility Crisis

Gosper Glider Gun

I was recently chatting to a friend whose startup’s machine learning models were so disorganized it was causing serious problems as his team tried to build on each other’s work and share it with clients. Even the original author sometimes couldn’t train the same model and get similar results! He was hoping that I had a solution I could recommend, but I had to admit that I struggle with the same problems in my own work. It’s hard to explain to people who haven’t worked with machine learning, but we’re still back in the dark ages when it comes to tracking changes and rebuilding models from scratch. It’s so bad it sometimes feels like stepping back in time to when we coded without source control.

When I started out programming professionally in the mid-90’s, the standard for keeping track of and collaborating on source code was Microsoft’s Visual SourceSafe. To give you a flavor of the experience, it didn’t have atomic check-ins, so multiple people couldn’t work on the same file, the network copy required nightly scans to avoid mysterious corruption, and even that was no guarantee the database would be intact in the morning. I felt lucky though; one of the places I interviewed at just had a wall of post-it notes, one for each file in the tree, and coders would take them down when they were modifying files, and return them when they were done!

This is all to say, I’m no shrinking violet when it comes to version control. I’ve toughed my way through some terrible systems, and I can still monkey together a solution using rsync and chicken wire if I have to. Even with all that behind me, I can say with my hand on my heart that machine learning is by far the worst environment I’ve ever found for collaborating and keeping track of changes.

To explain why, here’s a typical life cycle of a machine learning model:

  • A researcher decides to try a new image classification architecture.
  • She copies and pastes some code from a previous project to handle the input of the dataset she’s using.
  • This dataset lives in one of her folders on the network. It’s probably one of the ImageNet downloads, but it isn’t clear which one. At some point, someone may have removed some of the images that aren’t actually JPEGs, or made other minor modifications, but there’s no history of that.
  • She tries out a lot of slightly different ideas, fixing bugs and tweaking the algorithms. These changes are happening on her local machine, and she may just do a mass file copy of the source code to her GPU cluster when she wants to kick off a full training run.
  • She executes a lot of different training runs, often changing the code on her local machine while jobs are in progress, since they take days or weeks to complete.
  • There might be a bug towards the end of the run on a large cluster that means she modifies the code in one file and copies that to all the machines, before resuming the job.
  • She may take the partially-trained weights from one run, and use them as the starting point for a new run with different code.
  • She keeps around the model weights and evaluation scores for all her runs, and picks which weights to release as the final model once she’s out of time to run more experiments. These weights can be from any of the runs, and may have been produced by very different code than what she currently has on her development machine.
  • She probably checks in her final code to source control, but in a personal folder.
  • She publishes her results, with code and the trained weights.

This is an optimistic scenario with a conscientious researcher, but you can already see how hard it would be for somebody else to come in, reproduce all of these steps, and come out with the same result. Every one of these bullet points is an opportunity for inconsistencies to creep in. To make things even more confusing, ML frameworks trade off exact numeric determinism for performance, so if by a miracle somebody did manage to copy the steps exactly, there would still be tiny differences in the end results!
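
You can pin the obvious sources of randomness, and it’s worth doing, but it only narrows the gap, since many GPU kernels reduce floats in nondeterministic orders for speed. A sketch of the usual incantation, using the TensorFlow 1.x-era API:

import random

import numpy as np
import tensorflow as tf

SEED = 20180417  # arbitrary fixed value

random.seed(SEED)         # Python-level shuffling, e.g. file lists
np.random.seed(SEED)      # numpy-based augmentation and initialization
tf.set_random_seed(SEED)  # graph-level op seeds (TF 1.x API)

# Even with all three pinned, parallel GPU kernels can still sum
# floats in different orders, so the last few bits may still differ.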

In many real-world cases, the researcher won’t have made notes or won’t remember exactly what she did, so even she won’t be able to reproduce the model. Even if she can, the frameworks the model code depends on can change over time, sometimes radically, so she’d need to snapshot the whole system she was using to ensure that things work. I’ve found ML researchers to be incredibly generous with their time when I’ve contacted them for help reproducing model results, but it’s often a months-long task even with assistance from the original author.

Why does this all matter? I’ve had several friends contact me about their struggles reproducing published models as baselines for their own papers. If they can’t get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It’s also clearly concerning to rely on models in production systems if you don’t have a way of rebuilding them to cope with changed requirements or platforms. At that point your model moves from being a high-interest credit card of technical debt to something more like what a loan shark offers. It’s also stifling for research experimentation; since changes to code or training data can be hard to roll back, it’s a lot more risky to try different variations, just like coding without source control raises the cost of experimenting with changes.

It’s not all doom and gloom; there are some notable efforts around reproducibility happening in the community. One of my favorites is the TensorFlow Benchmarks project Toby Boyd’s leading. He’s made it his team’s mission not only to lay out exactly how to train some of the leading models from scratch with high training speed on a lot of different platforms, but also to ensure that the models train to the expected accuracy. I’ve seen him sweat blood trying to get models up to that precision, since variations in any of the steps I listed above can affect the results, and there’s no easy way to debug what the underlying cause is, even with help from the authors. It’s also a never-ending job, since changes in TensorFlow, in GPU drivers, or even in datasets can all hurt accuracy in subtle ways. By doing this work, Toby’s team helps us spot and fix bugs caused by changes in TensorFlow in the models they cover, and chase down issues caused by external dependencies, but it’s hard to scale beyond a comparatively small set of platforms and models.

I also know of other teams who are serious about using models in production who put similar amounts of time and effort into ensuring their training can be reproduced, but the problem is that it’s still a very manual process. There’s no equivalent to source control or even agreed best-practices about how to archive a training process so that it can be successfully re-run in the future. I don’t have a solution in mind either, but to start the discussion here are some principles I think any approach would need to follow to be successful:

  • Researchers must be able to easily hack around with new ideas, without paying a large “process tax”. If this isn’t true, they simply won’t use it. Ideally, the system will actually boost their productivity.
  • If a researcher leaves to found their own startup (or gets hit by a bus), somebody else should be able to step in the next day and train all the models they have created so far, and get the same results.
  • There should be some way of packaging up just what you need to train one particular model, in a way that can be shared publicly without revealing any history the author doesn’t wish to.
  • To reproduce results, code, training data, and the overall platform need to be recorded accurately (one possible manifest is sketched after this list).
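
As a strawman for that last principle, here’s a sketch of a training-run manifest that snapshots the minimum you’d need to even attempt an identical rerun. The fields are my guesses, not an established format:

import hashlib
import json
import platform
import subprocess
import sys
import time

def file_sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(dataset_files, hyperparams, out_path="run_manifest.json"):
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip(),
        "uncommitted_changes": bool(subprocess.check_output(
            ["git", "status", "--porcelain"]).strip()),
        "python_version": sys.version,
        "platform": platform.platform(),
        "dataset_hashes": {p: file_sha256(p) for p in dataset_files},
        "hyperparams": hyperparams,
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)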

I’ve been seeing some interesting stirrings in the open source and startup world around solutions to these challenges, and personally I can’t wait to spend less of my time dealing with all the related issues, but I’m not expecting to see a complete fix in the short term. Whatever we come up with will require a change in the way we all work with models, in the same way that source control meant a big change in all of our personal coding processes. It will be as much about getting consensus on the best practices and educating the community as it will be about the tools we come up with. I can’t wait to see what emerges!

Why Low-Power NN Accelerators Matter


When I released the Speech Commands dataset and code last year, I was hoping they would give a boost to teams building low-energy-usage hardware by providing a realistic application benchmark. It’s been great to see Vikas Chandra of ARM using them to build keyword spotting examples for Cortex M-series chips, and now a hardware startup I’ve been following, Green Waves, has just announced a new device and shared some numbers using the dataset as a benchmark. They’re showing power usage of just a few milliwatts for an always-on keyword spotter, which is starting to approach the coin-battery-for-a-year target I think will open up a whole new world of uses.
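
To put that target in numbers, here’s the back-of-the-envelope arithmetic I have in mind. A typical CR2032 coin cell holds around 225mAh at 3V, which over a year works out to an average budget of well under a hundred microwatts, so every step down from a few milliwatts matters:

# Rough power budget for "run on a coin battery for a year".
capacity_mah = 225.0       # typical CR2032 rating
voltage = 3.0              # nominal cell voltage
hours_per_year = 24 * 365  # 8,760 hours

energy_wh = (capacity_mah / 1000.0) * voltage    # ~0.675 watt-hours
average_power_uw = energy_wh / hours_per_year * 1e6
print("average budget: %.0f microwatts" % average_power_uw)  # ~77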

I’m not just excited about this for speech recognition’s sake, but because the same hardware can also accelerate vision, and other advanced sensor processing, turning noisy signals into something actionable. I’m also fascinated by the idea that we might be able to build tiny robots with the intelligence of insects if we can get the energy usage and mass small enough, or even send smart nano-probes to nearby stars!

Neural networks offer a whole new way of programming that’s inherently a lot easier to scale down than conventional instruction-driven approaches. You can transform and convert network models in ways we’ve barely begun to explore, fitting them to hardware with few resources while preserving performance. Chips can also take a lot of shortcuts that aren’t possible with traditional code, like tolerating calculation errors, and they don’t have to worry about awkward constructs like branches; everything is straight-line math at its heart.
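
To show what “straight-line math” means in practice, here’s a toy sketch of an eight-bit quantized dot product, the core operation inside a neural network layer. There are no data-dependent branches, and a small error in one multiply-accumulate only nudges the result instead of derailing the program:

def quantized_dot(weights_q, inputs_q, scale):
    # weights_q and inputs_q are sequences of 8-bit integers in
    # [-128, 127]; scale converts the accumulator back to real units.
    acc = 0
    for w, x in zip(weights_q, inputs_q):
        acc += w * x  # pure multiply-accumulate, no branching
    return acc * scale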

I’ve put in my preorder for a GAP8 developer kit, to join the ARM-based prototyping devices on my desk, and I’m excited to see so much activity in this area. I think we’re going to see a lot of progress over the next couple of years, and I can’t wait to see what new applications emerge as hardware capabilities keep improving!

Blue Pill: A 72MHz 32-Bit Computer for $2!


Some people love tiny houses, but I’m fascinated by tiny computers. My house is littered with Raspberry Pi’s, but recently my friend Andy Selle introduced me to Blue Pill single-board computers. These are ARM M3 CPUs running at 72MHz, available for $2 or less on eBay and AliExpress, even when priced individually. These are complete computers with 20KB of RAM and 64KB of flash for programs, and while that may not sound like much memory, their computing power as 32-bit ARM CPUs running at a fast clock rate makes them very attractive for applications like machine learning that rely more on arithmetic than memory. Even better, they can run for weeks or months on a single battery thanks to their ultra-low energy usage.
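
To make those limits concrete, here’s a rough sizing sketch. With eight-bit weights, each parameter costs one byte of flash, so after setting aside space for code there’s still room for a usefully-sized network. The code budget below is a guess, just to show the flavor of the arithmetic:

flash_bytes = 64 * 1024
ram_bytes = 20 * 1024
code_budget = 20 * 1024  # guess: leave ~20KB of flash for the program

max_weights_8bit = flash_bytes - code_budget  # one byte per weight
print(max_weights_8bit)  # ~45,000 parameters available

# A tiny fully-connected keyword spotter, say 250 inputs -> 64 -> 12:
params = 250 * 64 + 64 * 12  # ~16,800 weights, fits comfortably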

This makes them interesting platforms to explore the emerging world of smart sensors; they may not quite be fifty cents each, but they’re in the same ballpark. Unfortunately I’m a complete novice when it comes to microcontrollers, but luckily Andy was able to give me a few pointers to help me get started. After I struggled through a few hurdles, I managed to lay out a workflow that I like, and ran some basic examples. To leave a trail of breadcrumbs for anyone else who’s fascinated by the possibilities of these devices, I’ve open-sourced stm32_bare_lib on GitHub. It includes step-by-step instructions designed for a newbie like me, especially on the wiring (which doesn’t require soldering or any special tools, thankfully), and has some examples written in plain C to play with. I hope you have as much fun playing with these tiny computers as I have!

How to Compile for ARM M-series Chips from the Command Line

Image from stm32duino.com

When I was first learning programming in the 90’s, embedded systems were completely out of reach. I couldn’t afford the commercial development boards and toolchains I’d need to even get started, and a lot of the knowledge required was proprietary. I was excited a few years ago when the Arduino environment first appeared, since it removed a lot of the barriers for the general public. I still didn’t dive in though, because I couldn’t see an easy way to port the kind of algorithms I was interested in to eight-bit hardware, and the coding environment wasn’t a good fit for my C++ background.

That’s why I’ve been fascinated by the rise of the ARM M-series chips. They’re cheap, going for as little as $2 for a “blue pill” M3 board on eBay; they can run at very low power, sometimes less than a milliwatt (offering the chance to run on batteries for months or years); and they support the full 32-bit ARM instruction set I’m familiar with. This makes them tempting as a platform for prototyping the kind of smart sensors I believe the combination of deep learning eating software and cheap, low-power compute is going to enable.

I was still daunted by the idea of developing for them though. Raspberry Pi’s are very approachable because you can use a very familiar Linux environment to program them, but there wasn’t anything as obvious for me to use in the M-series world. Happily I was able to get advice from some experts at ARM, who helped steer me as a newbie through this unfamiliar world. I was very pleasantly surprised by the maturity and ease of use of the development ecosystem, so I want to share what I learned for anyone else who’s interested.

The first tip they had was to check out STMicroelectronics’ “Discovery” boards. These are small circuit boards with an M-series CPU and often a lot of peripherals built in to make experimentation easy. I started with the 32F746G, which includes a touch screen, audio input and output, microphones, and even Ethernet, and costs about $70. There are cheaper versions available too, but I wanted something easy to demo with. I also chose the M7 chip because it has support for floating-point calculations; even though I don’t expect I’ll need that long term, it’s helpful to have when porting and prototyping.

The unboxing experience was great: I just plugged the board into a USB socket on my MacBook Pro and it powered itself up into some demonstration programs. It showed up on my macOS file system as a removable drive, and the usefulness of that quickly became clear when I went to the mbed online IDE. This is one of the neatest developer environments I’ve run across: it runs completely in the browser and makes it easy to clone and modify examples. You can pick your device, grab a project, press the compile button, and in a few seconds you’ll have a “.bin” file downloaded. Just drag and drop that from your downloads folder onto the USB device in the Finder and the board will reboot and run the program you’ve just built.

I liked this approach a lot as a way to get started, but I wasn’t sure how to integrate larger projects into the IDE. I thought I’d have to do some kind of online import, and then keep copies of my web code in sync with more traditional GitHub and file system versions. When I was checking out the awesome keyword spotting code from ARM research, I saw they were using a tool called “mbed-cli”, which sounded a lot easier to integrate into my workflow. When I’m doing a lot of cross-platform work, I usually find it easier to use Emacs as my editor, and a custom IDE can actually get in the way. As it turns out, mbed-cli offers a command line experience while still keeping a lot of the usability advantages I’d discovered in the web IDE. Adding libraries and targeting devices is easy, and it integrates smoothly with my local file system and GitHub. Here’s what I did to get started with it:

  • I used pip install mbed-cli on my macOS machine to add the Python tools to my system.
  • I ran mbed new mbed-hello-world to create a new project folder, which the mbed tools populated with all the baseline files I needed, and then moved into it with cd mbed-hello-world.
  • I decided to use gcc for consistency with other platforms, so I downloaded the GCC v7 toolchain from ARM, and set a global variable to point to it by running mbed config -G GCC_ARM_PATH "/Users/petewarden/projects/arm-compilers/gcc-arm-none-eabi-7-2017-q4-major/bin"
  • I then added a ‘main.cpp’ file to my empty project, by writing out this code:
#include <mbed.h>

// There's no natural stdout on an embedded board, so route text
// over the board's USB serial connection instead.
Serial pc(USBTX, USBRX);

int main(int argc, char** argv) {
  // The '\r' matters: '\n' alone moves down a line without
  // returning to the left margin in the terminal.
  pc.printf("Hello world!\r\n");
  return 0;
}

The main thing to notice here is that we don’t have a natural place to view stdout results from a normal printf on an embedded system, so what I’m doing here is creating an object that can send text over a USB port using an API exported from the main mbed framework, and then doing a printf call on that object. In a later step we’ll set up something on the laptop side to display that. You can see the full mbed documentation on using printf for debugging here.

  • I did git add main.cpp and git commit -a -m "Added main.cpp" to make sure the file was part of the project.
  • I created a new terminal window to look at the output of the printf after it’s been sent over the USB connection. How to view this varies between platforms, but for macOS you need to enter the command screen /dev/tty.usbm, and press Tab to autocomplete the correct device name. After that, the terminal may contain some random text, but once you’ve successfully compiled and run the program you should see “Hello World!” output. One quirk I noticed was that \n on its own just moved down one line, with the next output starting at the same horizontal position, which is why I added a \r carriage return character in the example above.
  • I ran mbed compile -m auto -t GCC_ARM -f to build and flash the resulting ‘.bin’ file onto the Discovery board I had plugged in to the USB port. The -m auto part made the compile process auto-discover what device it should be targeting, based on what was plugged in, and -f triggered the transfer of the program after a successful build.
  • If it built and ran correctly, you should see “Hello World!” in the terminal window where you’re running your screen command.

I’ve added my version of this project on Github as github.com/petewarden/mbed-hello-world, in case you want to compare with what you get following these steps. I found the README for the mbed-cli project extremely clear though, and it’s what most of this post is based on.

My end goal is to set up a simple way to compile a code base that’s shared with other platforms for M-series devices. I’m not quite certain how to do that (for example integrating with a makefile, or otherwise syncing a list of files between an mbed project and other build systems), but so far mbed has been such a pleasant experience I’m hopeful I’ll be able to figure that out soon!
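
One low-tech idea I’m toying with is keeping a manifest of the shared sources and having a small script copy them into the mbed project before each build, so the same files can also live in a conventional makefile tree. This is purely a sketch of the idea, with made-up paths:

import shutil
from pathlib import Path

# Hypothetical manifest of sources shared with other platforms.
SHARED_SOURCES = [
    "../common/audio_utils.cc",
    "../common/audio_utils.h",
]

def sync_shared_sources(project_dir="mbed-hello-world"):
    dest = Path(project_dir) / "shared"
    dest.mkdir(exist_ok=True)
    for src in SHARED_SOURCES:
        shutil.copy2(src, dest)  # re-run this before 'mbed compile'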