The Unstoppable Rise of Disposable ML Frameworks

Photo by Steve Harwood

On Friday my long-time colleague Nat asked if we should try to expand our Useful Transformers library into something suitable for a lot more use cases. We worked together on TensorFlow, as did the main author of UT, Manjunath, so Nat was surprised when I didn’t want to head too far in a generic direction. As I was discussing it with him I realized how much my perspective on ML library design has changed since we started TensorFlow, and since I think by writing, I wanted to get my thoughts down in this post.

The GGML framework is just over a year old, but it has already changed the whole landscape of machine learning. Before GGML, an engineer wanting to run an existing ML model would start with a general-purpose framework like PyTorch, find a data file containing the model architecture and weights, and then figure out the right sequence of calls to load and execute it. Today it’s much more likely that they will pick a model-specific code library based on GGML, like whisper.cpp or llama.cpp.
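To make that contrast concrete, here’s a minimal sketch of the general-purpose path, assuming a toy PyTorch model and a hypothetical weights file; the names are illustrative, not from any real project:

```python
# The "general-purpose framework" workflow: define (or import) the model
# architecture, load a separate weights file, then work out the calls
# needed to run inference. Everything here is a toy example.
import torch

class TinyTransformer(torch.nn.Module):
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.block = torch.nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        self.head = torch.nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.head(self.block(self.embed(tokens)))

model = TinyTransformer()
# The weights file (hypothetical here) has to match the architecture exactly.
state = torch.load("tiny_transformer.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()

with torch.no_grad():
    logits = model(torch.tensor([[1, 2, 3]]))
print(logits.shape)  # (1, 3, vocab_size)
```

A GGML-style library collapses most of those steps into a single self-contained program that already knows its model’s architecture and weight format.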

This isn’t the whole story though, because there are also popular model-specific libraries like llama2.c that don’t use GGML, so this movement clearly isn’t based on the qualities of just one framework. The best term I’ve been able to come up with to describe these libraries is “disposable”. I know that might sound derogatory, but I don’t mean it like that; I actually think it’s the key to all their virtues! They limit their scope to just a few models, focus on inference or fine-tuning rather than training from scratch, and overall try to do a few things very well. They’re not designed to last forever; as models change they’re likely to be replaced by newer versions, but they’re very good at what they do.

By contrast, traditional frameworks like PyTorch or TensorFlow try to do many different things for a lot of different audiences. They are designed to be toolkits that can be reused for almost any possible model, for full training as well as deployment in production, scaling from laptops (or even in TF’s case microcontrollers) to distributed clusters of hundreds of GPUs or TPUs. The idea is that you learn the fundamentals of the API, and then you can reuse that knowledge for years in many different circumstances.

What I’ve seen firsthand with TensorFlow is how coping with such a wide range of requirements forces its code to become very complex and hard to understand. The hope is always that the implementation details can be hidden behind an interface, so that people can use the system without becoming aware of the underlying complexity. In practice this is impossible to achieve, because latency and throughput are so important. The only reason to use ML frameworks instead of a NumPy Python script is to take advantage of hardware acceleration, since training and inference time need to be minimized for many projects to be achievable. If a model takes years to train, it’s effectively untrainable. If a chatbot response takes days, why bother?
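As a rough illustration of why acceleration is the whole point, here’s a sketch comparing the same matrix multiply in plain NumPy and in PyTorch on a GPU. The actual speedup depends entirely on the hardware and problem size, so treat any numbers it prints as indicative only:

```python
# Same computation, two backends: NumPy on the CPU versus PyTorch on a
# GPU if one is available. The gap is the reason frameworks exist at all.
import time
import numpy as np
import torch

a = np.random.rand(4096, 4096).astype(np.float32)
b = np.random.rand(4096, 4096).astype(np.float32)

start = time.time()
_ = a @ b
print(f"NumPy on CPU: {time.time() - start:.3f}s")

if torch.cuda.is_available():
    ta = torch.from_numpy(a).cuda()
    tb = torch.from_numpy(b).cuda()
    torch.cuda.synchronize()
    start = time.time()
    _ = ta @ tb
    torch.cuda.synchronize()  # wait for the asynchronous GPU work to finish
    print(f"PyTorch on GPU: {time.time() - start:.3f}s")
```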

But details leak out from the abstraction layer as soon as an engineer needs to care about speed. Do all of my layers fit on a TPU? Am I using more memory than I have available on my GPU? Is there a layer in the middle of my network that’s only implemented as a CPU operation, and so is causing massive latencies as data is copied to and from the accelerator? This is where the underlying complexity of the system comes back to bite us. There are so many levels of indirection involved that building a mental model of what code is executing and where is not practical. You can’t even easily step through code in a debugger or analyze it using a profiler, because much of it executes asynchronously on an accelerator, goes through multiple compilation steps before running on a regular processor, or is dispatched to platform-specific libraries that may not even have source code available. This opaqueness makes it extremely hard for anyone outside of the core framework team to even identify performance problems, let alone propose fixes. Because every code path is used by so many different models and use cases, just verifying that any change doesn’t cause a regression is a massive job.
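For what it’s worth, here’s a sketch of how one might start answering the “is something falling back to the CPU?” question with PyTorch’s built-in profiler. Even this only surfaces operator-level timings; the dispatch, compilation, and vendor-library layers underneath stay opaque:

```python
# Profile a toy model and look for operators that only report CPU time on
# a CUDA run; those are candidates for fallbacks that force device copies.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
x = torch.randn(64, 1024, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    model(x)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```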

By contrast, debugging and profiling issues with disposable frameworks is delightfully simple. There’s a single big program that you can inspect to understand the overall flow, and then debug and profile using very standard tools. If you spot an issue, you can find and change the code easily yourself, and either keep it on your local copy or create a pull request after checking the limited number of use cases the framework supports.

Another big pain point for “big” frameworks is installation and dependency management. I was responsible for creating and maintaining the Raspberry Pi port of TensorFlow for a couple of years, and it was one of the hardest engineering jobs I’ve had in my career. It was so painful I eventually gave up, and nobody else was willing to take it on! Because TF supported so many different operations, platforms, and libraries, porting it and keeping it building on a non-x86 platform was a nightmare. There were constantly new layers and operations being added, many of which in turn relied on third-party code that also had to be ported. I groaned when I saw a new dependency appear in the build files, usually for something like an Amazon AWS input authentication pip package that didn’t add much value for the Pi users, but still required me to figure out how to install it on a platform that was often unsupported by the authors.

The beauty of single-purpose frameworks is that they can include all of the dependencies they need, right in the source code. This makes them a dream to install, often only requiring a checkout and build, and makes porting them to different platforms much simpler.

This is not a new problem, and during my career at Google I saw a lot of domain- or model-specific libraries emerge internally as alternatives to using TensorFlow. These were often enthusiastically adopted by application engineers, because they were so much easier to work with. There was often a lot of tension about this with the infrastructure team, because while this approach helped ship products, there were fears about the future maintenance cost of supporting many different libraries. For example, adding support for new accelerators like TPUs would be much harder if it had to be done for a multitude of internal libraries rather than just one, and it increased the cost of switching to new models.

Despite these valid concerns, I think disposable frameworks will only grow in importance. More people are starting to care about inference rather than training, and a handful of foundation models are beginning to dominate applications, so the value of using a framework that can handle anything but is great at nothing is shrinking.

One reason I’m so sure is that we’ve seen this movie before. I spent the first few years of my career working in games, writing rendering engines in the PlayStation 1 era. The industry standard was for every team to write their own renderer for each game, maybe copying and pasting some code from other titles but otherwise with little reuse. This made sense because the performance constraints were so tight. With only two megabytes of memory on a PS1 and a slow processor, every byte and cycle counted, so spending a lot of time jettisoning anything unnecessary and hand-optimizing the functions that mattered was a good use of programming time. Every large studio had the same worries about maintaining such a large number of engines across all their games, and every few years they’d task an internal group with building a more generic renderer that could be reused by multiple titles. Inevitably these efforts failed. It was faster and more effective for engineers to write something specialized from scratch than it was to whittle down and modify a generic framework to do what they needed.

Eventually a couple of large frameworks like Unity and Unreal came to dominate the industry, but it’s still not unheard of for developers to write their own, and even getting this far took decades. ML frameworks face the same challenges as game engines did in the ’90s, with application developers working under tight performance and memory constraints that are hard to hit using generic tools. If the past is any guide we’ll see repeated attempts to promote unified frameworks while real-world developers rely on less-generic but simpler libraries.

Of course it’s not a totally binary choice. For example, we’re still planning on expanding Useful Transformers to support the LLM and translation models we’re using for our AI in a Box, so we’ll have some genericity, but the mid-2010s vision of “One framework to rule them all” is dead. It might be that PyTorch (which has clearly won the research market) becomes more like MATLAB, a place to prototype and create algorithms, which are then hand-converted to customized inference frameworks by experienced engineers rather than by automated tools or compilers.

What makes me happiest is that the movement to disposable frameworks is clearly opening up the world of ML development to many more people. By removing the layers of indirection and dependencies, the underlying simplicity of machine learning becomes a lot clearer, and hopefully less intimidating. I can’t wait to see all of the amazing products this democratization of the technology produces!
