Why are ML Compilers so Hard?

File:Modern Loose Reed Power Loom-marsden.png
Image from Wikimedia

Even before the first version of TensorFlow was released, the XLA project was integrated as a “domain-specific compiler” for its machine learning graphs. Since then there have been a lot of other compilers aimed at ML problems, like TVM, MLIR, EON, and GLOW. They have all been very successful in different areas, but they’re still not the primary way for most users to run machine learning models. In this post I want to talk about some of the challenges that face ML compiler writers, and some approaches I think may help in the future.

I’m not a compiler expert at all, but I have been working on infrastructure to run deep learning models across different platforms for the last ten years, so most of my observations come from being a user rather than an implementer of compiler technology. I’m also writing from my own personal experience, these are all just my own opinions rather than anything endorsed by TensorFlow or Google, so take them for what they’re worth. I’m outside my area of expertise, and I’d love to hear what other people think about the areas I’m highlighting, I bet I’ll learn something interesting and new from any responses!

What is an ML compiler?

So far I’ve been talking about ML compilers like they’re a well-defined class of technology, but the term actually covers a very wide range of tools. Intuitively there’s an analogy with procedural programming, where interpreted languages tend to be used for experimentation, prototyping and research because of their flexibility, but compilation is deployed when performance is a higher priority. The computation graphs used in deep learning look a lot like programs, and the major frameworks like PyTorch and TensorFlow use interpretation to execute them, so using compilation to improve performance seems like a logical next step. All of the ML compilers take a model defined in Python in one of the major training frameworks and attempt to convert them into a different form that produces the same results. The output form is usually chosen to have some advantages in performance or portability over the original version.

For example, XLA takes the layers defined at the TensorFlow Graph level, and converts them initially into what’s known as an HLO (high-level operation) representation. This term slightly confused me initially, since from my perspective as a TensorFlow engineer the HLO operations were *lower* level than Graph operations, as individual TF ops are often broken into multiple HLOs, but it comes from the fact that these are at the highest level of XLA’s interface. These HLOs are designed to be implementable efficiently on GPUs, CPUs, and TPUs, with the hope that supporting a smaller number of mathematical operations will allow many more TF ops to be implemented by composition, increasing portability.

The definition I’ve given above may seem overly broad, and it probably is, but that’s also one of the challenges in this area. When someone offers an ML compiler as a potential solution, most engineers’ experience with procedural compilers makes them receptive to the idea, because traditional compilers have become such a vital tool for all of us. Using the term compiler is popular because of this halo effect, but it doesn’t say much about the scope of the tool. As another example, TensorFlow Lite has a different representation for its computation graph from TensorFlow, and a tool is required to convert TF models to TFLite. The current version of this tool uses MLIR, a compiler technology, to perform the conversion, but the resulting graph representation is interpreted, so it seems strange to call it a compiler. One of the common assumptions when the term compiler is used is that it will generate code, but many of them actually generate intermediate representations which are then handed over to other tools to perform further steps. This makes it necessary to dig a bit deeper into the actual capabilities of anything labeled as an ML compiler to better understand what problems it can solve.

Why is ML compilation not like procedural compilation?

I’ve mentioned that the analogy between procedural and ML compilers is imperfect, but why? The biggest reason is that deep learning computation graphs are made up of a large, arbitrary, and ever-growing set of layers. Last time I checked, stock TensorFlow has over 2,000 different operations. PyTorch is not as expansive, but it also relies more on Python within its models to implement functionality, as does JAX. This is in contrast to modern procedural languages which tend to have a comparatively small number of core primitives (keywords, built-in types and operations) and a lot of libraries implemented using those primitives to provide most of their useful functionality. It’s also a Big Deal to add primitives to a procedural language, with a long period of debate, prototyping, and consensus building before new ones are accepted.

The reason for this difference is that deep learning model authors are tightly constrained by latency and memory performance. There seems to be a practical time limit of around a week to train a model to completion, if it takes any longer then the author’s unlikely to be able to iterate with enough prototypes to produce a successful result in the end. Because training a model means running it many millions of times, it makes sense for authors trying new techniques to invest time optimizing the code they’re using. Because many people use Nvidia GPUs this often means writing a function in CUDA to implement any new computation that’s needed, rather than leaving it in the Python that might be more natural for experimenting. The consequence of this is that even operators like activation functions that involve trivial math that could easily be represented as a simple NumPy operation get implemented as separate layers, and so show up as such in the compute graph. Even worse from the framework implementer’s perspective is that authors may actually fuse together multiple conceptually-separate operations into a single layer purely for performance reasons. You can see this in the plethora of different LSTM layers available in TensorFlow, they exist because manual fusing helped speed up training for particular models.

What this means in practice is that compute graphs are made up of layers chosen for model authors’ convenience, only defined by their already-optimized implementations in C++/CUDA, and any unit tests, which are often written to test against Python libraries like NumPy, adding another layer of indirection when trying to understand what they do. They are also likely to be comparatively large in their scope, rather than being constrained to a more primitive operation. All this makes the job of anyone trying to convert them into another representation very hard. Even worse, new layers are constantly being invented by researchers!

Most ML compilers “solve” this problem by only supporting a subset of the layers. Even TensorFlow Lite and XLA only support some operations. What I’ve found from my experience is that this is an unpleasant surprise to many users hoping to convert their model from the training environment to run on another platform. Most authors aren’t even particularly aware of which ops they’re using, since they’re likely to be using a higher-level interface like Keras, so figuring out how to change a model definition to fit with any constraints can be a frustrating and confusing process.

I believe this to be the single biggest problem facing ML compilers. The only way we can hope to provide a good experience to everyday users is by changing the training environment so that models are automagically expressed in a manageable representation from the start. The current situation asks compiler authors to turn a hamburger back into a cow, it’s simply not feasible. The challenge is that adding such constraints in makes it harder to experiment with new approaches, since any additional layers would need to be represented in a form other than Python, C++, or CUDA, which are the preferred languages of researchers. Either compiler writers will have to keep chasing all the new layer implementations constantly being produced by researchers, or we’ll have to persuade researchers to write them in a more portable form.

Why are there so many layers?

So far I’ve focused on “classical” deep learning operations, but one of the reasons that there are so many layers is that compute graphs also include a lot of computation that can’t easily be expressed in a mathematical form. Layers like convolution, fully-connected, or activations can be written using a math notation and implemented using a comparatively small number of primitives, and they take up the majority of the compute time, so they’re often chosen as the first targets by compiler writers. Unfortunately there are many other layers that don’t fit as easily into something as mathematical, and where the only practical definition can be written in a procedural function using something like C++. A favorite example of mine is the non-max suppression layer used to prune the soup of bounding boxes produced by networks doing image localization. This algorithm is hard to describe except as a series of sorting, loops, and conditional statements, and it’s difficult to see how it could be represented in anything less general than LLVM’s IR.

There are a lot of operations that generate features or perform other pre-processing, or do post-processing like scoring or beam search. These are tempting to exclude from any compiler solution because they often occur before or after the body of the model where the bulk of the computation happens, and so aren’t a priority for optimization, but these do sometimes occur at performance-critical points and so I think we need a solution.

What about fallbacks?

One answer I’ve heard to this problem is that it’s always possible to fall back to the original CPU C++ implementation for the long tail of layers that are not easy to use a specialized representation for. In my opinion this removes a lot of the advantages of using a compiler. It’s no longer possible to perform fusing or other IR optimizations across the barrier formed by a non-IR layer, and the model itself is not portable across different platforms. You might think that you still have portability across platforms that support C++, but as I mentioned earlier, most layer implementations were created by research scientists and are only present in an optimized form. This means that the code is likely to rely on libraries like Eigen and use functions from the training framework itself. Consequently, porting a single layer often means porting most of the training framework and its dependencies too. This is possible, we use it for the Flex delegate in TensorFlow Lite, and PyTorch Mobile takes a similar approach, but it is a lot of work even for comparatively mainstream platforms like Android and iOS, and doesn’t work at all for anything non-Posix-like such as most embedded devices. It also takes up a lot of binary space, since server code is written to very different constraints than other platforms. Another problem is that even if the libraries relied upon for optimization are portable to other platforms, it’s not likely that they’ll offer the same performance that they do in the original environment. Performance doesn’t tend to be portable.

What about compiling Python?

A lot of frameworks rely on Python glue code to help implement models. This is great for model authors because they can use a familiar and flexible language, but it makes porting to other environments very tough. Mobile platforms don’t support Python for example, and neither do GPUs. The need for GPU support tends to push authors to re-implement any Python code that becomes a performance bottleneck, but that still leaves a lot of places where it can be useful for training. The problem from a compiler’s perspective is that parts of the definition of the computation that needs to be performed to run the model is now held in Python code, not in the regular compute graph of layers, making those parts inaccessible.

A solution to this has been to compile regular Python code, as shown by TensorFlow’s tf.function. This is very helpful in some cases, but there are plenty of times when the Python code is actually relying on other libraries, often only available as C or C++ implementations. For example, a lot of audio models will do things like create spectrograms using a specialized library, which ML compilers don’t have visibility into, or the ability to translate into another representation.

How can we make progress?

I hope this post doesn’t sound too much like an Airing of Grievances (though it is December 23rd as I write this), I’m honestly a big fan of all the compiler work that’s been happening in the ML world and I want it to continue growing. With that in mind, how can we move forward? In my mind, there are two possible futures, one where the ML ecosystem looks like Matlab, and another where it looks like LLVM.

If you’re not familiar with it, Matlab is a tool used by researchers for a lot of mathematical prototyping and exploration. There are tools to compile the resulting Matlab projects into standalone executable code, but because it’s so hard to do so completely and in an optimal way, a common workflow is for researchers to hand over their projects to engineers to write a C or C++ implementation by hand. In this flavor of the future, we’d do something similar for ML, where researchers would use frameworks focused on easy experimentation and flexibility, and the conversion process for production deployments would involve manual engineering into a more portable and optimized representation once it was finalized. As an engineer who would be likely to be responsible for this conversion process, I’m hoping we can do better for my own sake. It will also remove a lot of scope for collaboration between model authors and deploying engineers, which is a shame because iterative feedback loops can make both models and software implementations better. Unfortunately I think the Matlab model is the most likely to happen unless we can change direction.

The key LLVM innovation was the invention of an intermediate representation that was rich enough for a large set of languages, but small enough to be supported by a lot of different platforms without requiring exorbitant engineering resources for each. An IR like this for machine learning is the dream of most of the hardware vendors I know, since it would allow them to support a lot of different models and frameworks with comparatively little effort, at least compared to the current status quo. There are existing attempts that have had some success, such as ONNX or MLIR’s TOSA dialect, but they’ve all struggled either with coverage or have increased the number of layers they support to a level that makes them tougher for hardware teams to implement. This is why I come back to the need to change the training environment itself. We somehow need to come up with tools that permit researchers the flexibility to experiment, give them the performance they need to complete training in a reasonable time, but also result in a representation that can be understood separate from the training environment. Researchers are in high demand in the ML world, so it would have to be something they want to use if it’s going to get adoption. These three requirements might end up being impossible to meet, but I’m hopeful that the ML compiler community can keep innovating and come up with something that does!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: