Will Compression Be Machine Learning’s Killer App?

Photo by Greg Simenoff

When I talk to people about machine learning on phones and devices I often get asked “What’s the killer application?”. I have a lot of different answers, everything from voice interfaces to entirely new ways of using sensor data, but the one I’m most excited about in the near term is compression. Despite being fairly well-known in the research community, this seems to surprise a lot of people, so I wanted to share some of my personal thoughts on why I see compression as so promising.

I was reminded of this whole area when I came across an OSDI paper on “Neural Adaptive Content-aware Internet Video Delivery”. The summary is that by using neural networks they’re able to improve a quality-of-experience metric by 43% if they keep the bandwidth the same, or alternatively reduce the bandwidth by 17% while preserving the perceived quality. There have also been other papers in a similar vein, such as this one on generative compression, or adaptive image compression. They all show impressive results, so why don’t we hear more about compression as a machine learning application?

We don’t (yet) have the compute

All of these approaches require comparatively large neural networks, and the amount of arithmetic needed scales with the number of pixels. This means large images or video with high frames-per-second can require more computing power than current phones and similar devices have available. Most CPUs can only practically handle tens of billions of arithmetic operations per second, and running ML compression on HD video could easily require ten times that.
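To make that concrete, here’s a rough back-of-the-envelope calculation. The ops-per-pixel figure is just an assumption I’ve picked for illustration, not a measured number from any particular model:

```python
# Rough estimate of the arithmetic needed to run neural compression on HD video.
# The ops-per-pixel number is an assumption for illustration, not a benchmark.
width, height = 1920, 1080      # 1080p frame
frames_per_second = 30
ops_per_pixel = 5000            # assumed cost of a small per-pixel network

ops_per_frame = width * height * ops_per_pixel
ops_per_second = ops_per_frame * frames_per_second

print("Ops per frame:  %.1f billion" % (ops_per_frame / 1e9))
print("Ops per second: %.1f billion" % (ops_per_second / 1e9))
# Roughly 10 billion ops per frame and 300 billion per second, well beyond the
# tens of billions a typical phone CPU can sustain.
```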

The good news is that there are hardware solutions, like the Edge TPU amongst others, that offer the promise of much more compute being available in the future. I’m hopeful that we’ll be able to apply these resources to all sorts of compression problems, from video and image, to audio, and even more imaginative approaches.

Natural language is the ultimate compression

One of the other reasons I think ML is such a good fit for compression is how many interesting results we’ve had recently with natural language. If you squint, you can see captioning as a way of radically compressing an image. One of the projects I’ve long wanted to create is a camera that runs captioning at one frame per second, and then writes each caption out as a line in a log file. That would create a very simplistic story of what the camera sees over time; I think of it as a narrative sensor.
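Here’s a rough sketch of what that narrative sensor loop might look like. The capture_frame() and caption_image() functions are placeholders for whatever camera interface and captioning model you have available, not real APIs:

```python
import time
from datetime import datetime

def capture_frame():
    # Placeholder: grab a frame from the camera as an image or array.
    raise NotImplementedError

def caption_image(frame):
    # Placeholder: run an image captioning model and return a short string.
    raise NotImplementedError

# Append one caption per second to a log file, building up a rough
# text "story" of what the camera has seen over time.
with open("narrative.log", "a") as log:
    while True:
        caption = caption_image(capture_frame())
        log.write("%s %s\n" % (datetime.utcnow().isoformat(), caption))
        log.flush()
        time.sleep(1.0)
```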

The reason I think of this as compression is that you can then apply a generative neural network to each caption to recreate images. The images won’t be literal matches to the inputs, but they should carry the same meaning. If you want results that are closer to the originals, you can also look at stylization, for example to create a line drawing of each scene. What these techniques have in common is that they identify parts of the input that are most important to us as people, and ignore the rest.

It’s not just images

There’s a similar trend in the speech world. Voice recognition is improving rapidly, and so is the ability to synthesize speech. Recognition can be seen as the process of compressing audio into natural language text, and synthesis as the reverse. You could imagine being able to highly compress conversations down to transmitting written representations rather than audio. I can’t imagine a need to go that far, but it does seem likely that we’ll be able to achieve much better quality and lower bandwidth by exploiting our new understanding of the patterns in speech.
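As a rough illustration of why this is appealing, here’s a back-of-the-envelope comparison of sending a conversation as compressed audio versus as text. The codec bitrate and speaking rate are ballpark assumptions:

```python
# Ballpark comparison of transmitting speech as compressed audio versus as text.
# The codec bitrate and speaking rate are rough assumptions for illustration.
audio_bits_per_second = 24000        # a typical speech codec rate
words_per_minute = 150               # rough conversational speaking rate
bytes_per_word = 6                   # ~5 characters plus a space

text_bits_per_second = (words_per_minute / 60.0) * bytes_per_word * 8
print("Audio: %d bits/sec" % audio_bits_per_second)
print("Text:  %d bits/sec" % text_bits_per_second)
print("Ratio: ~%dx" % (audio_bits_per_second / text_bits_per_second))
# Around 200x smaller before you even compress the text, though of course the
# text throws away tone, emphasis, and the speaker's voice.
```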

I even see interesting possibilities for applying ML compression to text itself. Andrej Karpathy’s char-rnn shows how well neural networks can mimic styles given some examples, and that prediction is a similar problem to compression. If you think about how much redundancy is in a typical HTML page, it seems likely that there would be some decent opportunities for ML to improve on gzip. This is getting into speculation though, since I don’t have any ML text compression papers handy.
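To show why prediction and compression are so closely linked, here’s a toy sketch that uses a simple bigram character model as a stand-in for something like char-rnn. A model that assigns probability p to the character that actually comes next can, paired with an entropy coder, spend roughly -log2(p) bits on it:

```python
import math
from collections import Counter, defaultdict

# Train a simple bigram character model: how often does each character
# follow each other character?
def train_bigram(text):
    counts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        counts[prev][cur] += 1
    return counts

# Estimate how many bits an entropy coder driven by this model would spend.
def estimated_bits(text, counts):
    bits = 0.0
    for prev, cur in zip(text, text[1:]):
        total = sum(counts[prev].values())
        p = (counts[prev][cur] + 1.0) / (total + 256.0)  # add-one smoothing
        bits += -math.log(p, 2)
    return bits

sample = "<div class='row'><span>item</span></div>\n" * 500  # redundant HTML-ish text
model = train_bigram(sample)
bits_per_char = estimated_bits(sample, model) / (len(sample) - 1)
print("~%.2f bits/char vs 8 bits/char raw" % bits_per_char)
```

Measuring on the same text the model was trained on makes this optimistic, but the underlying point holds: the better the predictions, the fewer bits you need, which is exactly where a stronger neural model could help.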

Compression is already a budget item

One of the things I learned while unsuccessfully trying to sell to businesses during my startup career was that it was much easier to make a sale if there was already a chunk of money allocated to what you were selling. The existence of a budget line item meant that the hard battle over whether the company should spend money on a solution had already been won; now the only question was which solution to buy. That’s one of the reasons why I think ML could make dramatic inroads in this area: manufacturers already have engineers, money, and silicon area earmarked for video and audio compression. If we can show that adding machine learning to existing solutions improves them in measurable ways (for example quality, speed, or power consumption) then they will be adopted quickly.

Bandwidth costs users and carriers money, and quality and battery life are selling points for products, so the motivation behind adopting ML for compression is much more direct than many other use cases. Existing research shows that it can be very effective, and I’m optimistic that there’s a lot more to be discovered, so I’m hopeful that it will develop into a key use of the technology.

 

What Does it Take to Train Deep Learning Models On-Device?

Photo by Fort George G. Meade Public Affairs Office

Over the past few weeks, a few different people have asked me about the state of model training on phones and embedded devices. The good news is that it’s definitely possible; I know of multiple examples of teams doing this successfully. The bad news is that our tools don’t yet make it easy.

Back in 2014 I released some code and a video showing how you could do simple transfer learning on a phone. At that point I was using a support vector machine to handle the on-device training of the final layer, but there was no fundamental reason I couldn’t have used back-propagation instead; it was just easier to use a technique I knew. I’m sure there are even earlier examples than that too.
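For flavor, here’s a sketch of the general shape of that approach, using scikit-learn for the SVM step and a placeholder embed() function standing in for the frozen feature-extraction network. This isn’t the original on-device code, just the idea:

```python
import numpy as np
from sklearn.svm import LinearSVC

def embed(image):
    # Placeholder: return the bottleneck features from a frozen, pretrained network.
    raise NotImplementedError

def train_final_layer(images, labels):
    # Only this small classifier is trained on-device; the feature extractor stays fixed.
    features = np.stack([embed(image) for image in images])
    classifier = LinearSVC()
    classifier.fit(features, labels)
    return classifier

def classify(classifier, image):
    return classifier.predict(embed(image).reshape(1, -1))[0]
```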

One area where on-device learning with neural networks has long been common is speech recognition. Even today, your phone will often ask you to say the wake word (for example “OK Google” or “Hey Siri”) a few times, so it can learn your pronunciation. Typically this won’t involve a complete retraining of the network using back propagation, but something that’s more like transfer learning on either the feature extraction process or the final layer. Apple’s implementation uses a supplemental network to compare utterances to your references in what they term speaker space, so it’s pretty different from how we normally think of training. I still consider these cases valid examples of on-device learning, but often use the term “personalization” to capture the limited extent of the network changes.
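To make the personalization idea more concrete, here’s a rough sketch of the general pattern: embed a new utterance with a frozen network and compare it to a few enrolled references. This is my own illustration of the pattern, not Apple’s or Google’s actual implementation, and speaker_embedding() is a placeholder:

```python
import numpy as np

def speaker_embedding(audio):
    # Placeholder: run a frozen network that maps an utterance into "speaker space".
    raise NotImplementedError

def enroll(utterances):
    # Store embeddings of the few examples the user is asked to speak.
    return [speaker_embedding(audio) for audio in utterances]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_user(audio, enrolled, threshold=0.7):
    # Accept the wake word if the new utterance is close to the enrolled references.
    embedding = speaker_embedding(audio)
    return np.mean([cosine(embedding, ref) for ref in enrolled]) > threshold
```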

There are situations where the kind of full-network back propagation familiar to deep learning practitioners building models in Python happens on devices too. A good published example is the Google Keyboard application. The paper focuses on the novel aspects of sending anonymized model updates over the network, but those updates come from a traditional process of running forward and backward passes on the phone.
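Here’s a minimal sketch of that general flow: run some local training steps, then send back only the difference between the updated and original weights. Again, this is the shape of the idea rather than the actual Google Keyboard code, and compute_gradients() is a placeholder:

```python
import numpy as np

def compute_gradients(weights, examples):
    # Placeholder: run the model's forward and backward passes on the device.
    raise NotImplementedError

def local_update(server_weights, local_examples, learning_rate=0.01, steps=10):
    # Train locally for a few steps starting from the weights the server sent.
    weights = [w.copy() for w in server_weights]
    for _ in range(steps):
        grads = compute_gradients(weights, local_examples)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    # Only the weight delta leaves the device, never the raw examples.
    return [w - s for w, s in zip(weights, server_weights)]
```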

So, corporate teams are building applications that use learning on-device, but how can you do the same thing? The key thing to realize is that there are two different stages to training a model. The first is building the back-propagation machinery that calculates the gradients and applies the weight updates. The second is actually feeding examples through the forward pass, and then propagating the resulting gradient changes back through the network.

Most libraries, including TensorFlow, handle the automatic differentiation you need to build the back propagation operations in the Python layer. This code looks at the model you’ve defined in terms of inference operations (convolution layers feeding into activation functions, etc) and builds a complementary set of operations for the back propagation, including a loss function. The important part is that once the graph is built, at least in TensorFlow, you don’t need the Python code any more. Running the backward pass just means running the sub-graph of operations that was created by the differentiation stage.
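Here’s a minimal TensorFlow 1.x-style sketch of what that looks like: calling minimize() on an optimizer is what triggers the automatic differentiation and adds the backward-pass operations to the graph, and the whole graph can then be written out. The model, shapes, and paths are arbitrary placeholders:

```python
import tensorflow as tf

# Forward pass: a tiny model defined with ordinary inference operations.
inputs = tf.placeholder(tf.float32, [None, 784], name="inputs")
labels = tf.placeholder(tf.int64, [None], name="labels")
logits = tf.layers.dense(inputs, 10, name="final_layer")
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# This is where the automatic differentiation happens: minimize() adds the
# gradient and weight-update operations to the same graph as the forward pass.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss, name="train_op")
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    # The serialized GraphDef now contains both forward and backward operations,
    # so another runtime could drive training just by running "train_op" by name.
    tf.train.write_graph(sess.graph_def, "/tmp", "train_graph.pbtxt")
```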

In TensorFlow terms, this means after you’ve created the backward pass you can save off the GraphDef containing both the forward and backward operations. If you then load that in C++, either on-device or on a server, theoretically you can also run gradient updates from labeled examples using Session::Run(). I say theoretically because we don’t have documentation on this process, and it involves a lot of unfamiliar steps that can be hard to get right in C++, like initializing variables, loading and saving checkpoints, and handling losses. This is a gap I’m hoping we can fill, but there are internal teams who’ve been able to get past these issues, so at least there’s proof that it’s possible. I’m sure there must be examples of external teams doing similar work too, so if you do have code to share, in TensorFlow or any other framework, please add a comment or reply on Twitter; I’d love to share it!