When I talk to people about machine learning on phones and devices I often get asked “What’s the killer application?”. I have a lot of different answers, everything from voice interfaces to entirely new ways of using sensor data, but the one I’m most excited about in the near term is compression. Despite being fairly well-known in the research community, this seems to surprise a lot of people, so I wanted to share some of my personal thoughts on why I see compression as so promising.
I was reminded of this whole area when I came across an OSDI paper on “Neural Adaptive Content-aware Internet Video Delivery”. The summary is that by using neural networks they’re able to improve a quality-of-experience metric by 43% if they keep the bandwidth the same, or alternatively reduce the bandwidth by 17% while preserving the perceived quality. There have also been other papers in a similar vein, such as this one on generative compression, or adaptive image compression. They all show impressive results, so why don’t we hear more about compression as a machine learning application?
We don’t (yet) have the compute
All of these approaches require comparatively large neural networks, and the amount of arithmetic needed scales with the number of pixels. This means large images or video with high frames-per-second can require more computing power than current phones and similar devices have available. Most CPUs can only practically handle tens of billions of arithmetic operations per second, and running ML compression on HD video could easily require ten times that.
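To make the scaling concrete, here is a back-of-the-envelope sketch of the arithmetic. The per-pixel cost and the CPU budget are assumed, illustrative numbers, not figures taken from any of the papers mentioned above.

```python
def ops_per_second(width, height, fps, ops_per_pixel):
    """Arithmetic operations per second for a model whose cost scales per pixel."""
    return width * height * fps * ops_per_pixel

# Assumed, illustrative values (hypothetical, chosen only to show the scaling).
HD_WIDTH, HD_HEIGHT, FPS = 1920, 1080, 30
OPS_PER_PIXEL = 5_000   # hypothetical cost of the network per pixel
CPU_BUDGET = 20e9       # one reading of "tens of billions" of ops per second

required = ops_per_second(HD_WIDTH, HD_HEIGHT, FPS, OPS_PER_PIXEL)
print(f"required: {required / 1e9:.0f} billion ops/s")
print(f"ratio to CPU budget: {required / CPU_BUDGET:.1f}x")
```

With these assumptions the requirement comes out at more than ten times the CPU budget, and halving the resolution in each dimension would cut it by a factor of four.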
The good news is that there are hardware solutions, like the Edge TPU amongst others, that offer the promise of much more compute being available in the future. I’m hopeful that we’ll be able to apply these resources to all sorts of compression problems, from video and image, to audio, and even more imaginative approaches.
Natural language is the ultimate compression
One of the other reasons I think ML is such a good fit for compression is how many interesting results we’ve had recently with natural language. If you squint, you can see captioning as a way of radically compressing an image. One of the projects I’ve long wanted to create is a camera that runs captioning at one frame per second, and then writes each caption out as a line in a log file. That would create a very simplistic story of what the camera sees over time. I think of it as a narrative sensor.
The reason I think of this as compression is that you can then apply a generative neural network to each caption to recreate images. The images won’t be literal matches to the inputs, but they should carry the same meaning. If you want results that are closer to the originals, you can also look at stylization, for example to create a line drawing of each scene. What these techniques have in common is that they identify parts of the input that are most important to us as people, and ignore the rest.
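As a rough sketch of the narrative sensor idea, the snippet below writes one caption per frame as a log line and compares the size of the log with the size of the raw frames. The `fake_caption` function is a hypothetical stand-in for a real captioning model, and the frame size and log format are assumptions made purely for illustration.

```python
from typing import Callable, List

def narrative_log(frames: List[bytes], caption_fn: Callable[[bytes], str]) -> List[str]:
    """One log line per frame, assuming one frame is captured per second."""
    return [f"{second:06d}s {caption_fn(frame)}" for second, frame in enumerate(frames)]

def compression_ratio(frames: List[bytes], log_lines: List[str]) -> float:
    """Raw frame bytes divided by the bytes of the caption log."""
    raw_bytes = sum(len(frame) for frame in frames)
    # +1 per line for the newline that would terminate it in a log file.
    text_bytes = sum(len(line.encode("utf-8")) + 1 for line in log_lines)
    return raw_bytes / text_bytes

def fake_caption(frame: bytes) -> str:
    """Hypothetical placeholder: a real system would run a captioning network here."""
    return "a dog sleeping on a couch"

# Three blank 640x480 RGB frames stand in for camera output.
frames = [bytes(640 * 480 * 3)] * 3
log = narrative_log(frames, fake_caption)
print("\n".join(log))
print(f"ratio: {compression_ratio(frames, log):.0f}x")
```

The ratio is enormous precisely because almost everything in the frame is thrown away, which is the trade being described: the log keeps the meaning and discards the pixels.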
It’s not just images
There’s a similar trend in the speech world. Voice recognition is improving rapidly, and so is the ability to synthesize speech. Recognition can be seen as the process of compressing audio into natural language text, and synthesis as the reverse. You could imagine being able to highly compress conversations down to transmitting written representations rather than audio. I can’t imagine a need to go that far, but it does seem likely that we’ll be able to achieve much better quality and lower bandwidth by exploiting our new understanding of the patterns in speech.
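For a sense of scale, here is a small sketch comparing the bitrate of telephone-quality audio with the bitrate of the same conversation sent as text. The 64 kbit/s figure is the standard G.711 rate; the speaking rate and bytes per word are assumptions chosen for illustration.

```python
AUDIO_BITS_PER_SECOND = 64_000  # G.711 telephone audio: 8 kHz x 8 bits

# Assumed, illustrative values for conversational speech.
WORDS_PER_MINUTE = 150
BYTES_PER_WORD = 6              # roughly five letters plus a space

def text_bits_per_second(words_per_minute, bytes_per_word):
    """Bitrate of a plain-text transcript produced at a given speaking rate."""
    return words_per_minute * bytes_per_word * 8 / 60

text_bps = text_bits_per_second(WORDS_PER_MINUTE, BYTES_PER_WORD)
print(f"text: {text_bps:.0f} bit/s")
print(f"audio/text ratio: {AUDIO_BITS_PER_SECOND / text_bps:.0f}x")
```

The gap of several hundred times is an upper bound on what the text extreme could save, since a transcript drops the speaker's voice, tone, and timing.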
I even see interesting possibilities for applying ML compression to text itself. Andrej Karpathy’s char-rnn shows how well neural networks can mimic styles given some examples, and that prediction is a similar problem to compression. If you think about how much redundancy is in a typical HTML page, it seems likely that there would be some decent opportunities for ML to improve on gzip. This is getting into speculation though, since I don’t have any ML text compression papers handy.
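The link between prediction and compression can be sketched in a few lines: if a model assigns probability p to the next symbol, an arithmetic coder can store that symbol in about -log2(p) bits. The code below uses a deliberately simple adaptive character-context model (not char-rnn, and not a competitive compressor) to show how that cost is computed, with zlib's output size printed alongside for reference. The model order, the smoothing constant, and the sample HTML are all assumptions made for illustration.

```python
import math
import zlib
from collections import defaultdict

def model_bits(data: bytes, order: int = 2, alpha: float = 1.0) -> float:
    """Ideal code length in bits under an adaptive order-`order` byte model.

    Each byte is predicted from the preceding `order` bytes using counts seen
    so far (with additive smoothing), then charged -log2(p) bits.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]
        p = (counts[context][symbol] + alpha) / (totals[context] + 256 * alpha)
        bits += -math.log2(p)
        # Update the model only after coding, so a decoder could stay in sync.
        counts[context][symbol] += 1
        totals[context] += 1
    return bits

# A small, highly redundant HTML fragment (made up for this example).
html = b"<ul>" + b"<li>item</li>\n" * 200 + b"</ul>"

print(f"raw:   {len(html) * 8} bits")
print(f"model: {model_bits(html):.0f} bits")
print(f"zlib:  {len(zlib.compress(html)) * 8} bits")
```

The better the predictions, the fewer bits each symbol costs, which is why a stronger learned model of text could in principle translate into a stronger compressor.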
Compression is already a budget item
One of the things I learned while unsuccessfully trying to sell to businesses during my startup career was that it was much easier to make a sale if there was already a chunk of money allocated to what you were selling. The existence of a budget line item meant that the hard battle over whether the company should spend money on a solution had already been won, and the only question was which solution to buy. That’s one of the reasons why I think that ML could make dramatic inroads in this area, because manufacturers already have engineers, money, and silicon area earmarked for video and audio compression. If we can show that adding machine learning to existing solutions improves them in measurable ways (for example quality, speed, or power consumption) then they will be adopted quickly.
Bandwidth costs users and carriers money, and quality and battery life are selling points for products, so the motivation behind adopting ML for compression is much more direct than many other use cases. Existing research shows that it can be very effective, and I’m optimistic that there’s a lot more to be discovered, so I’m hopeful that it will develop into a key use of the technology.
The idea of re-generating video from a caption-based compression scheme is interesting, but are we sure that Next Animation Studio hasn’t been doing this all along? https://www.youtube.com/watch?v=riDAwDlVln0 I’m imagining the first re-generative models looking something like this
Yeah, I’ve been thinking about this a lot… every time I check the Nest camera we’re using to spy on the baby and I think about the HUGE amount of data that’s going up to the cloud and coming back down to my phone. All for a scene that basically doesn’t change for, sometimes, many hours at a time. (And when it does change, it may only be a change in an area that’s a small fraction of the full frame).
I understand that there’s not a lot of horsepower on the cameras (yet) but it sure seems like the initial training phase – learning what the scene/room looks like, how different lighting affects it, etc. – could happen remotely and then all the camera would need to send continuously would be a low rez, low FPS stream until it gets word from the server that “something is happening, time to kick into full quality upload”.
And yes, I’m definitely waiting for the day when the 2am notice of ‘motion detected at your back door’ can go away because it was recognized as coming from the headlight of a car going by.
Seems to me that someone should be working on making all of these things a part of a next-generation open codec… – a bidirectional communication channel where the server can control the camera’s behavior, along with a well-defined ‘caption’ stream that would include descriptions of the imagery along with transcriptions of any words being spoken.
I am not an expert in this field, but I tend to think that this already exists. When you do a Fourier transform you are already extracting the most important features of an image, and Fourier and other types of transform are already well established within image processing toolkits. Don’t you think?
That was a very interesting read.
Outside of niche compression use cases (such as the baby monitor described above), I am not sure I see the point.
There are two questions that need to be answered for compression to be the killer app for machine learning:
1) How does the cost of storage compare with the cost of compute? My gut here is that NAND at $0.2/GB is difficult to beat vs an ASIC or even an existing ARM SoC with ML capability.
2) How much energy is consumed in compute (i.e. generating the scene from the compressed information) vs moving the (uncompressed) information from NAND to display? This would tell us whether there is any battery life improvement or not. Here I have no clue.
I don’t know if you have thought about it or have answers/pointers.
nice information about ML