When I talk to people about machine learning on phones and devices I often get asked “What’s the killer application?“. I have a lot of different answers, everything from voice interfaces to entirely new ways of using sensor data, but the one I’m most excited about in the near-team is compression. Despite being fairly well-known in the research community, this seems to surprise a lot of people, so I wanted to share some of my personal thoughts on why I see compression as so promising.
I was reminded of this whole area when I came across an OSDI paper on “Neural Adaptive Content-aware Internet Video Delivery“. The summary is that by using neural networks they’re able to improve a quality-of-experience metric by 43% if they keep the bandwidth the same, or alternatively reduce the bandwidth by 17% while preserving the perceived quality. There have also been other papers in a similar vein, such as this one on generative compression, or adaptive image compression. They all show impressive results, so why don’t we hear more about compression as a machine learning application?
We don’t (yet) have the compute
All of these approaches require comparatively large neural networks, and the amount of arithmetic needed scales with the number of pixels. This means large images or video with high frames-per-second can require more computing power than current phones and similar devices have available. Most CPUs can only practically handle tens of billions of arithmetic operations per second, and running ML compression on HD video could easily require ten times that.
The good news is that there are hardware solutions, like the Edge TPU amongst others, that offer the promise of much more compute being available in the future. I’m hopeful that we’ll be able to apply these resources to all sorts of compression problems, from video and image, to audio, and even more imaginative approaches.
Natural language is the ultimate compression
One of the other reasons I think ML is such a good fit for compression is how many interesting results we’ve had recently with natural language. If you squint, you can see captioning as a way of radically compressing an image. One of the projects I’ve long wanted to create is a camera that runs captioning at one frame per second, and then writes each one out as a series of lines in a log file. That would create a very simplistic story of what the camera sees over time, I think of it as a narrative sensor.
The reason I think of this as compression is that you can then apply a generative neural network to each caption to recreate images. The images won’t be literal matches to the inputs, but they should carry the same meaning. If you want results that are closer to the originals, you can also look at stylization, for example to create a line drawing of each scene. What these techniques have in common is that they identify parts of the input that are most important to us as people, and ignore the rest.
It’s not just images
There’s a similar trend in the speech world. Voice recognition is improving rapidly, and so is the ability to synthesize speech. Recognition can be seen as the process of compressing audio into natural language text, and synthesis as the reverse. You could imagine being able to highly compress conversations down to transmitting written representations rather than audio. I can’t imagine a need to go that far, but it does seem likely that we’ll be able to achieve much better quality and lower bandwidth by exploiting our new understanding of the patterns in speech.
I even see interesting possibilities for applying ML compression to text itself. Andrej Karpathy’s char-rnn shows how well neural networks can mimic styles given some examples, and that prediction is a similar problem to compression. If you think about how much redundancy is in a typical HTML page, it seems likely that there would be some decent opportunities for ML to improve on gzip. This is getting into speculation though, since I don’t have any ML text compression papers handy.
Compression is already a budget item
One of the things I learned while unsuccessfully trying to sell to businesses during my startup career was that it was much easier to make a sale if there was already a chunk of money allocated to what you were selling. The existence of a budget line item meant that the hard battle over whether the company should spend money on a solution had already been won, now the only questions was which solution to buy. That’s one of the reasons why I think that ML could make dramatic inroads in this area, because manufacturers already have engineers, money, and silicon area earmarked for video and audio compression. If we can show that adding machine learning to existing solutions improves them in measurable ways (for example quality, speed, or power consumption) then they will be adopted quickly.
Bandwidth costs users and carriers money, and quality and battery life are selling points for products, so the motivation behind adopting ML for compression is much more direct than many other use cases. Existing research shows that it can be very effective, and I’m optimistic that there’s a lot more to be discovered, so I’m hopeful that it will develop into a key use of the technology.