Enter the OVIC low-power challenge!

Photo by Pete

I’m a big believer in the power of benchmarks to help innovators compete and collaborate. It’s hard to imagine deep learning taking off the way it did without ImageNet, and I’ve learned so much from the Kaggle community as teams work to come up with the best solutions. It’s surprisingly hard to create good benchmarks though, as I’ve learned from the Kaggle competitions I’ve run. Most of engineering is about tradeoffs, and when you specify just a single metric you end up with solutions that ignore the other costs you might care about. It made sense in the early days of the ImageNet challenge to focus only on accuracy, because that was by far the biggest problem blocking potential users from deploying computer vision technology. If the models don’t work well enough even with infinite resources, then nothing else matters.

Now that deep learning can produce models that are accurate enough for many applications, we’re facing a different set of challenges. We need models that are fast and small enough to run on mobile and embedded platforms, and now that the maximum achievable accuracy is so high, we’re often able to trade some of it away to fit within resource constraints. Models like SqueezeNet, MobileNet, and more recently MobileNet v2 have emerged that let you pick the best accuracy you can get within particular memory and latency budgets. These are extremely useful solutions for many applications, and I’d like to see research in this area continue to flourish, but because the models all involve trade-offs it’s not possible to evaluate them with a single metric. It’s also tricky to measure some of the properties we care about, like latency and memory usage, because they’re tied to particular hardware and software implementations. For example, some of the early NASNet models had very low floating-point operation counts, but it turned out that, because of their structure and the software implementations available, those counts didn’t translate into as low a latency in practice as we’d expected.
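
To make that concrete, here’s a minimal sketch of how you might time a model on the software stack you actually plan to deploy with, using the TensorFlow Lite Python interpreter. The model filename is just a placeholder, and the exact location of the interpreter API has moved between TensorFlow releases, so treat this as a starting point rather than a recipe:

import time
import numpy as np
import tensorflow as tf

# Load a converted .tflite model; the filename here is a placeholder.
interpreter = tf.lite.Interpreter(model_path="mobilenet_v2_1.0_224.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]

# Random data with the right shape and dtype is enough for timing runs.
dummy = np.random.random_sample(input_details["shape"]).astype(input_details["dtype"])

# One warm-up run so one-time allocations don't skew the measurement.
interpreter.set_tensor(input_details["index"], dummy)
interpreter.invoke()

runs = 50
start = time.time()
for _ in range(runs):
    interpreter.set_tensor(input_details["index"], dummy)
    interpreter.invoke()
print("Mean latency: %.1f ms" % (1000.0 * (time.time() - start) / runs))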

All this means it’s a lot of work to propose a useful benchmark in this area, but I’m very pleased to say that Bo Chen, Jeff Gilbert, Andrew Howard, Achille Brighton, and the rest of the Mobile Vision team have put in the effort to launch the On-device Visual Intelligence Challenge for CVPR. The challenge includes a complete suite of software for measuring accuracy and latency on known devices, and I’m hoping it will encourage a lot of innovative new model architectures that will translate into practical advances for application developers. One of the exciting features of this competition is that there are a lot of ways to produce an impressive entry, even if it doesn’t win the main 30ms-on-a-Pixel-phone challenge, because the state of the art is a curve, not a point. For example, I’d love a model that gave me 40% top-one accuracy in well under a millisecond, since that would probably translate well to even smaller devices and would still be extremely useful. You can read more about the rules here, and I look forward to seeing your creative entries!
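
As a toy illustration of that curve (with entirely made-up numbers, not results from any real model), here’s how you might pick out the Pareto-optimal entries from a set of latency and accuracy pairs. Anything that’s both slower and less accurate than some other entry isn’t advancing the state of the art, but everything left over represents a different, useful point on the curve:

# Hypothetical (name, latency_ms, top_one_accuracy) entries, purely illustrative.
entries = [
    ("model_a", 0.8, 0.40),
    ("model_b", 12.0, 0.58),
    ("model_c", 30.0, 0.70),
    ("model_d", 33.0, 0.66),  # slower and less accurate than model_c
]

def pareto_front(points):
    """Keeps entries that no other entry beats on both latency and accuracy."""
    front = []
    for name, lat, acc in points:
        dominated = any(o_lat <= lat and o_acc >= acc and (o_lat, o_acc) != (lat, acc)
                        for _, o_lat, o_acc in points)
        if not dominated:
            front.append((name, lat, acc))
    return front

print(pareto_front(entries))  # model_d drops out; the other three form the curve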

Speech Commands is now larger and cleaner!

Picture by Aaron Parecki

When I launched the Speech Commands dataset last year I wasn’t quite sure what to expect, but I’ve been very happy to see all the creative ways people have used it, like guiding embedded optimizations or testing new model architectures. The best part has been all the conversations I’ve ended up having because of it, and how much I’ve learned about the area of microcontroller machine learning from other people in the field.

Having a lot of eyes on the data (especially through the Kaggle competition) gave me much more insight into how to improve its quality, and there’s been a steady stream of volunteers donating their voices to expand the number of utterances. I also had a lot of requests for a paper giving more details on the dataset, especially covering how it was collected and what the best approaches to benchmarking accuracy are. With all of that in mind, I spent the past few weeks gathering the voice data that had been donated recently, improving the labeling process, and documenting it all in much more depth. I’m pleased to say that the resulting paper is now up on arXiv, and you can download the expanded and improved archive of over one hundred thousand utterances. The folder layout is still compatible with the first version, so to run the example training script from the tutorial you can just execute:

python tensorflow/examples/speech_commands/train.py \
  --data_url=http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
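
If you’d rather poke at the archive directly before training, here’s a small sketch of how you might check the folder layout after extracting the tarball by hand: each top-level folder is a word label (plus _background_noise_ for the noise clips), and every .wav file inside it is one utterance. The local path below is just wherever you extracted to:

import os

data_dir = "speech_commands_v0.02"  # wherever you extracted the tarball
labels = sorted(d for d in os.listdir(data_dir)
                if os.path.isdir(os.path.join(data_dir, d)))
print("%d labels:" % len(labels), labels)

# Count the utterances for a few labels to get a feel for the new size.
for label in labels[:3]:
    wavs = [f for f in os.listdir(os.path.join(data_dir, label))
            if f.endswith(".wav")]
    print("%s: %d clips" % (label, len(wavs)))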

I’m looking forward to hearing more about how you’re using the dataset, and continuing the conversations it has already sparked, so I hope you have as much fun with it as I have!