On Monday I’ll be giving a keynote at the IEEE Custom Integrated Circuits Conference, which is quite surprising even to me, considering I’m a software engineer who can barely solder! Despite that, I knew exactly what I wanted to talk about when I was offered the invitation. If I have a room full of hardware designers listening to me for twenty minutes, I want them to understand what people building machine learning applications need out of their chips. After thirteen years(!) of blogging, I find writing a post the most natural way of organizing my thoughts, so I hope any attendees don’t mind some spoilers on what I’ll be asking for.
At TinyML last month, I think it was Simon Craske from Arm who said that a few years ago hardware design was getting a little bit boring, since the requirements seemed well understood and it was mostly just an exercise in iterating on existing ideas. The rise of machine learning (and, to be more specific, deep learning, since that’s been the almost exclusive focus so far) has changed all that. The good news is that chip design is no longer boring, but it can be hard to understand what the new requirements are, so in this post I’ll try to cover my perspective from the software side. I won’t be proposing hardware solutions, since I don’t know what the answers are, but I will try to distill what the hundreds of teams I’ve worked with building products using machine learning are asking for.
The single most important message I want to get across is that there are a lot of new applications that are blocked from launching because we don’t have enough computing power. Many other existing products would be improved if we could run models that require more computing power than is available. It doesn’t matter whether you’re on the cloud, a mobile device, or even embedded: teams have models they’d like to run that they can’t.
To give a practical example, take a look at what might seem like a mature area, speech recognition. After heroic efforts, a team at Google was recently able to squeeze server-quality transcription onto local compute on Pixel phones. The model itself is comparatively small too, at just 80 MB. This network is pushing the limits of what a modern application processor on a mobile device can manage, but it’s almost entirely arithmetic-bound. That means if we could offer the same level of raw compute in a chip for lower energy and a lower price, this sort of general speech recognition could be added to almost any product. Even if you aren’t looking to expand beyond current phones, there are still problems like the cocktail party effect that can benefit from running additional neural networks to improve the overall accuracy. You can get even more context from visual sensors by looking at things like gaze direction, which, again, requires more deep learning models to calculate.
This is just one application area. Every product domain I’ve worked with has similar stories, where improvements in latency and energy usage when running networks would translate directly into new or enhanced user experiences. If you look at a time profile of these networks you’ll see almost all the time is going into multiply-adds, so improving the hardware does mean improving its ability to run the arithmetic primitives efficiently.
I may be biased because my work almost entirely focuses on running already-trained models (‘inference’ in ML terms) but I believe the biggest need over the next few years is for inference hardware, not training. Someone once said to me “training scales with the number of researchers, inference scales with the number of applications times the number of users”, and that idea has always stuck with me. An individual model author’s training needs are immense, but there are comparatively few of them and the growth is limited by human educational processes. By comparison, popular applications have hundreds of millions of users, and each application can require many models, so the scale of inference calculations can easily grow much faster.
On a less philosophical note, as I talked about earlier, I see a lot of teams who are able to train more complex models than they can deploy on their production platforms. Even cloud applications have compute budgets driven by the economic costs of running servers, and other devices usually have hard limits on the resources available. For those reasons, I’d love to see a lot of attention paid to speeding up ML inference from the hardware community.
It’s now widely accepted that eight bits are enough for running inference on convolutional neural networks. The picture is a bit more complicated for training, and for recurrent networks, because both processes require the addition of many small increments to stored values to achieve their results, but at the very least full thirty-two bit floating point values are overkill in every case I’ve seen. If you design inference hardware for eight-bit precision you’ll cover a large number of practical use cases. Of course, exactly what eight-bit means isn’t necessarily obvious, so I’m hoping the TensorFlow team will be able to produce some guidance based on our experience soon, detailing exactly what we believe the best practices are for executing eight-bit calculations.
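To make "eight-bit" a little more concrete, here is a minimal sketch of one common interpretation: affine quantization, where each real value is approximated as scale * (code - zero_point) with an unsigned eight-bit code. This is only an illustration under my own assumptions, not the guidance mentioned above, and the function names (choose_params, quantize, dequantize) are made up for this example.

```python
def choose_params(values, num_bits=8):
    """Pick a scale and zero point that map the value range onto unsigned codes.

    Assumption: an asymmetric (affine) scheme, one of several possible
    meanings of "eight-bit".
    """
    qmax = 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(0.0, min(values))
    hi = max(0.0, max(values))
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / qmax
    zero_point = int(round(-lo / scale))
    return scale, max(0, min(qmax, zero_point))


def quantize(values, scale, zero_point, num_bits=8):
    """Convert floats to integer codes, clamping to the representable range."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Recover approximate float values from integer codes."""
    return [scale * (q - zero_point) for q in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = choose_params(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
```

Each restored value lands within one quantization step (scale) of the original, which is the kind of error that inference on convolutional networks tends to tolerate. The small increments mentioned above for training and recurrent networks are exactly what a step size like this can swallow.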
There’s also a lot of evidence that it’s possible to go lower than eight bits, but that is a lot less settled. As background, it’s worth reading this survey by my colleague Raghu, which has a lot of experiments investigating different possibilities.
I’ve saved what I expect may be my most controversial request until last. The typical design process I’ve seen from hardware teams is that they will look at some existing ML workloads, note that almost all of the time goes into just a few operations, and so design an accelerator that speeds up those critical-path ops.
This sounds fine in principle, but when an accelerator like that is integrated into a full system it often fails to live up to its potential. The problem is that even though most of the compute for almost all models does go into a handful of common operations, there are hundreds of others that often appear. Almost every model I see has some of these, and they’re almost always different from network to network. A good example is ‘non-max suppression’ in MobileSSD and similar object detection models, where we need some very specific and custom operations to merge the many bounding boxes that are output by the model into just a few coherent final results. This doesn’t require very much raw compute, but it does take a lot of logic, and is hard to express except as general C++ code. In a similar way, many audio networks have a feature generation preprocessing step that converts raw audio data into tensors to feed into the neural networks. Even more tricky are custom steps (like modified activation functions) that show up in the middle of networks. Almost none of these operations are compute intensive, but they aren’t supported by specialized accelerators.
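For readers who haven’t met non-max suppression, here is a minimal sketch of the usual greedy version. It is an illustration rather than the implementation used by MobileSSD or TensorFlow; the (x1, y1, x2, y2) box layout and the 0.5 overlap threshold are assumptions I’ve picked for the example. Notice how little arithmetic there is compared to the amount of branching and data-dependent control flow.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring boxes, dropping heavy overlaps.

    Returns the indices of the boxes that survive, best score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep this box only if it doesn't overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first heavily and has a lower score, so it is dropped, while the third box is far away and survives.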
There are two common answers to this from hardware teams. The first is to fall back to a main application processor to implement these custom operations. If the accelerator is across a system bus from the main CPU this can involve a lot of latency as the two processors have to communicate and synchronize with each other. This latency can easily cancel out any speed advantages from using the accelerator in the first place. Alternatively, the team may direct users towards using ‘blessed’ models that will run entirely on the accelerator, avoiding any of the tricky custom operations. This can work for some cases, but the majority of the product teams I work with are struggling to train their models to the accuracy they require for their application, so they’re usually using custom approaches to achieve the results they need. Switching to a new model and figuring out how to achieve similar results within tighter constraints is a lot to ask of them.
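A back-of-the-envelope model shows how the fallback latency can eat the gains. Every number below is made up purely for illustration (a 100 ms model where 95 ms is core ops, a 10x accelerator, 20 CPU fallbacks per inference at 5 ms per round trip); none of them are measurements, and the function name is hypothetical.

```python
def inference_time_ms(core_ms, custom_ms, speedup, fallbacks, round_trip_ms):
    """Estimated latency for one inference.

    core_ms:       time the core ops take on the CPU
    custom_ms:     time the custom ops take (assumed not to be accelerated)
    speedup:       how much faster the accelerator runs the core ops
    fallbacks:     number of times execution bounces back to the main CPU
    round_trip_ms: communication and synchronization cost of each bounce
    """
    return core_ms / speedup + custom_ms + fallbacks * round_trip_ms


# Hypothetical numbers, chosen only to illustrate the shape of the problem.
cpu_only = inference_time_ms(95.0, 5.0, 1.0, 0, 0.0)
ideal_accelerator = inference_time_ms(95.0, 5.0, 10.0, 0, 0.0)
with_fallbacks = inference_time_ms(95.0, 5.0, 10.0, 20, 5.0)
```

With these assumed values the ideal accelerator is several times faster than the CPU alone, but once the round trips are counted the accelerated system ends up slower than never leaving the CPU at all, even though the custom ops themselves are only a sliver of the compute.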
This is a big problem for accelerator adoption in practice. What I’m hoping is that future accelerators will offer some kind of general compute capability so that arbitrary C++ custom operations can be easily ported and run on them. The work we’re doing on dependency-free reference implementations in TensorFlow Lite is initially aimed at microcontrollers and embedded systems, but I’m hoping it will eventually be useful for porting ops to devices like accelerators too. The nice thing is that these custom operations almost always involve much less compute than the core accelerated ops, so you don’t need a fast way of running general purpose code, just an escape valve that avoids the latency hit of delegating to the main application processor.
To illustrate the issue, I tried to estimate the number of operations in core TensorFlow using the following command from inside the source folder:
grep -Ir 'REGISTER_OP("' tensorflow/core | grep -vE '(test)|(contrib)' | wc -l
This gives me a rough estimate of 1,202 operations in TensorFlow. Some of these are internal details, or only used for debugging, but in my experience you need to be prepared to deal with many hundreds of different ops if you’re accepting models from authors. I don’t expect this problem to get any easier, since researchers seem to be creating new and improved operations faster than accelerators can support the ones that are already there!
The exciting news is that we’re all in a great position to see improvements we make in our systems translate very quickly into better experiences for our users. The work we’re doing has the potential to have a lot of impact on people’s lives, and I think we have the right tools to make fast progress. What it will require is a lot of cooperation between the hardware and software communities, with rapid iteration and sharing of requirements because this is such a new area, so I’m looking forward to continuing to share my experiences, and hearing from hardware experts about ways we can move forward together.