I’m a big believer in the power of benchmarks to help innovators compete and collaborate. It’s hard to imagine deep learning taking off in the way it did without ImageNet, and I’ve learned so much from the Kaggle community as teams work to come up with the best solutions. It’s surprisingly hard to create good benchmarks though, as I’ve learned in the Kaggle competitions I’ve run. Most of engineering is about trade-offs, and when you specify just a single metric you end up with solutions that ignore other costs you might care about. It made sense in the early days of the ImageNet challenge to focus only on accuracy, because that was by far the biggest problem blocking potential users from deploying computer vision technology. If the models don’t work well enough with infinite resources, then nothing else matters.
Now that deep learning can produce models that are accurate enough for many applications, we’re facing a different set of challenges. We need models that are fast and small enough to run on mobile and embedded platforms, and now that the maximum achievable accuracy is so high, we’re often able to trade some of it off to fit the resource constraints. Models like SqueezeNet, MobileNet, and more recently MobileNet v2 have emerged that let you pick the best accuracy you can get within particular memory and latency constraints. These are extremely useful solutions for many applications, and I’d like to see research in this area continue to flourish, but because the models all involve trade-offs it’s not possible to evaluate them with a single metric. It’s also tricky to measure some of the properties we care about, like latency and memory usage, because they’re tied to particular hardware and software implementations. For example, some of the early NASNet models had very low numbers of floating-point operations, but because of the model structure and the software implementations, those low counts didn’t translate into latency as low as we’d expected in practice.
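To make the "no single metric" point concrete, here is a minimal sketch of one way to compare models that trade accuracy against latency: keep only the entries that no other entry beats on both measures at once. The model names and numbers below are made up for illustration, and `ModelResult` and `pareto_frontier` are hypothetical helpers, not part of any benchmark tooling.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    latency_ms: float      # measured on one specific device and runtime
    top1_accuracy: float   # fraction between 0 and 1


def pareto_frontier(results: List[ModelResult]) -> List[ModelResult]:
    """Return the results that are not dominated, sorted by latency.

    A result is dominated when another result is at least as fast and at
    least as accurate, and strictly better on one of the two. Exact
    duplicates are kept only once.
    """
    # Fastest first; among equal latencies, most accurate first.
    ordered = sorted(results, key=lambda r: (r.latency_ms, -r.top1_accuracy))
    frontier = []
    best_accuracy = float("-inf")
    for result in ordered:
        # Everything already seen is at least as fast, so this result only
        # earns a place if it is strictly more accurate than all of them.
        if result.top1_accuracy > best_accuracy:
            frontier.append(result)
            best_accuracy = result.top1_accuracy
    return frontier


# Hypothetical measurements, invented purely for illustration.
results = [
    ModelResult("tiny", 0.8, 0.40),
    ModelResult("small", 12.0, 0.55),
    ModelResult("medium", 28.0, 0.68),
    ModelResult("slow_weak", 35.0, 0.60),  # slower and less accurate than "medium"
    ModelResult("large", 90.0, 0.75),
]

frontier = pareto_frontier(results)
for result in frontier:
    print(f"{result.name}: {result.latency_ms} ms, {result.top1_accuracy:.0%} top-1")
```

Every model that survives the filter is a reasonable choice for some latency budget, which is why ranking them by a single number loses information. The latencies only mean something for the device and software stack they were measured on.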
All this means it’s a lot of work to propose a useful benchmark in this area, but I’m very pleased to say that Bo Chen, Jeff Gilbert, Andrew Howard, Achille Brighton, and the rest of the Mobile Vision team have put in the effort to launch the On-device Visual Intelligence Challenge for CVPR. This includes a complete suite of software for measuring accuracy and latency on known devices, and I’m hoping it will encourage a lot of innovative new model architectures that will translate into practical advances for application developers. One of the exciting features of this competition is that there are a lot of ways to produce an impressive entry, even if it doesn’t win the main 30ms-on-a-Pixel-phone challenge, because the state of the art is a curve, not a point. For example, I’d love a model that gave me 40% top-one accuracy in well under a millisecond, since that would probably translate well to even smaller devices and would still be extremely useful. You can read more about the rules here, and I look forward to seeing your creative entries!
Too bad the Pixel XL (or any other Pixel device) does not include support for the DSP Bridge or OpenCL; otherwise, 30ms on our smaller SSD/YOLO models would be easy to achieve with 40% top-1 accuracy.