How to get started with Coqui’s open source on-device speech to text tool

Image from Wikimedia

I think the transformative power of on-device speech to text is criminally under-rated (and I’m not alone), so I’m a massive fan of the work Coqui are doing to make the technology more widely accessible. Coqui is a startup working on a complete open source solution to speech recognition, as well as text to speech, and I’ve been lucky enough to collaborate with their team on datasets like Multilingual Spoken Words.

They have great documentation already, but over the holidays I’ve been playing around with the code, and I always like to leave a trail of breadcrumbs if I can, so in this post I’ll try to show you how to get speech recognition running locally yourself in just a few minutes. I’ve tried it on my PopOS 21.04 laptop, but it will hopefully work on most modern Linux distributions, and should be trivial to modify for other platforms that Coqui provide binaries for. To accompany this post, I’ve also published a Colab notebook, which you can use from your browser on almost any system and which demonstrates all of these steps.

You’ll need to be comfortable using a terminal, but because they do offer pre-built binaries you won’t need to worry about touching code or compilation. I’ll show you how to use their tools to recognize English language text from a WAV file. The code sections below (in a monospace font) should all be run from a shell terminal window.

First we download the example executable, stt, and the shared library, libstt.so, which contains the framework code; both are part of the native_client archive.

wget --quiet https://github.com/coqui-ai/STT/releases/download/v1.1.0/native_client.tflite.Linux.tar.xz
unxz native_client.tflite.Linux.tar.xz
tar -xf native_client.tflite.Linux.tar

Next, we need to fetch a model. For this example I’ve chosen the English large vocabulary model, but there are over 80 different versions available for many languages at coqui.ai/models. Note that this is the recognition model, not the language model. Language models are used to post-process the results of the neural network, and are optional. To keep things simple, in this example we’re just using the raw recognition model output, but there are lots of options to improve the quality for a particular application if you investigate things like language models and hotwords.

wget --quiet https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/model.tflite

To demonstrate how the speech to text tool works, we need some WAV files to try it out on. Luckily Coqui provide some examples, together with transcripts of the expected output.

wget --quiet https://github.com/coqui-ai/STT/releases/download/v1.1.0/audio-1.1.0.tar.gz
tar -xzf audio-1.1.0.tar.gz

The stt file is a command line tool that lets you run speech to text recognition using Coqui’s framework. It has a lot of options you can explore, but the simplest way to use it is to provide a recognition model and then point it at a WAV file. After some version logging you should see the predicted transcript of the speech in the audio file as the final line.

./stt --model ./model.tflite --audio ./audio/4507-16021-0012.wav

You should see output that looks something like this:

TensorFlow: v2.3.0-14-g4bdd3955115
 Coqui STT: v1.1.0-0-gf3605e23
why should one halt on the way
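
If you want to experiment with the language models mentioned earlier, the stt tool can also accept an external scorer file that post-processes the raw network output. I haven’t included one in the main walkthrough, and the exact file names and flags can vary between releases, so treat the path below as a placeholder and check coqui.ai/models and ./stt --help for your version. A hedged sketch looks like this:

# Download a scorer for your model from coqui.ai/models first; the filename here is a placeholder.
./stt --model ./model.tflite --scorer ./english.scorer --audio ./audio/4507-16021-0012.wav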

If you’ve made it this far, congratulations, you’ve just run your own speech to text engine locally on your machine! Coqui have put a lot of work into their open source speech framework, so if you want to dive in deeper I highly recommend browsing their documentation and code. Everything’s open source, even the training, so if you need something special for your own application, like a different language or specialized vocabulary, you have the chance to do it yourself.

Update – I’ve also just added a new Colab notebook showing how to build a program using STT with just a makefile and the binary releases, without requiring Bazel.

Why are ML Compilers so Hard?

Image from Wikimedia

Even before the first version of TensorFlow was released, the XLA project was integrated as a “domain-specific compiler” for its machine learning graphs. Since then there have been a lot of other compilers aimed at ML problems, like TVM, MLIR, EON, and GLOW. They have all been very successful in different areas, but they’re still not the primary way for most users to run machine learning models. In this post I want to talk about some of the challenges that face ML compiler writers, and some approaches I think may help in the future.

I’m not a compiler expert at all, but I have been working on infrastructure to run deep learning models across different platforms for the last ten years, so most of my observations come from being a user rather than an implementer of compiler technology. I’m also writing from my own personal experience; these are all just my own opinions rather than anything endorsed by TensorFlow or Google, so take them for what they’re worth. I’m outside my area of expertise, and I’d love to hear what other people think about the areas I’m highlighting; I bet I’ll learn something interesting and new from any responses!

What is an ML compiler?

So far I’ve been talking about ML compilers like they’re a well-defined class of technology, but the term actually covers a very wide range of tools. Intuitively there’s an analogy with procedural programming, where interpreted languages tend to be used for experimentation, prototyping, and research because of their flexibility, but compilation is deployed when performance is a higher priority. The computation graphs used in deep learning look a lot like programs, and the major frameworks like PyTorch and TensorFlow use interpretation to execute them, so using compilation to improve performance seems like a logical next step. All of the ML compilers take a model defined in Python in one of the major training frameworks and attempt to convert it into a different form that produces the same results. The output form is usually chosen to have some advantages in performance or portability over the original version.

For example, XLA takes the layers defined at the TensorFlow Graph level, and converts them initially into what’s known as an HLO (high-level operation) representation. This term slightly confused me initially, since from my perspective as a TensorFlow engineer the HLO operations were *lower* level than Graph operations, as individual TF ops are often broken into multiple HLOs, but it comes from the fact that these are at the highest level of XLA’s interface. These HLOs are designed to be implementable efficiently on GPUs, CPUs, and TPUs, with the hope that supporting a smaller number of mathematical operations will allow many more TF ops to be implemented by composition, increasing portability.
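
To make that concrete, here’s a rough sketch (written in NumPy purely for illustration, not actual XLA output) of how a single high-level op like softmax breaks down into the kind of smaller primitives an HLO-style representation deals in: an element-wise subtraction and exponential, a reduction, and a division.

import numpy as np

def softmax_as_primitives(x):
    # One "softmax" layer decomposed into primitive steps, roughly the granularity
    # an HLO-style IR works at. This is an illustration, not real HLO.
    shifted = x - np.max(x, axis=-1, keepdims=True)  # subtract the max for numerical stability
    exps = np.exp(shifted)                           # element-wise exponential
    total = np.sum(exps, axis=-1, keepdims=True)     # reduction across the last axis
    return exps / total                              # element-wise division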

The definition I’ve given above may seem overly broad, and it probably is, but that’s also one of the challenges in this area. When someone offers an ML compiler as a potential solution, most engineers’ experience with procedural compilers makes them receptive to the idea, because traditional compilers have become such a vital tool for all of us. Using the term compiler is popular because of this halo effect, but it doesn’t say much about the scope of the tool. As another example, TensorFlow Lite has a different representation for its computation graph from TensorFlow, and a tool is required to convert TF models to TFLite. The current version of this tool uses MLIR, a compiler technology, to perform the conversion, but the resulting graph representation is interpreted, so it seems strange to call it a compiler. One of the common assumptions when the term compiler is used is that it will generate code, but many of them actually generate intermediate representations which are then handed over to other tools to perform further steps. This makes it necessary to dig a bit deeper into the actual capabilities of anything labeled as an ML compiler to better understand what problems it can solve.
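
As a minimal sketch of that TensorFlow-to-TFLite conversion step (using a throwaway Keras model as a stand-in), the flow looks roughly like this; note that the output flatbuffer is interpreted by the TFLite runtime rather than compiled to machine code:

import tensorflow as tf

# A trivial stand-in model; any Keras model goes through the same path.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(20,))])

# The converter (MLIR-based in current TensorFlow releases) emits a flatbuffer
# graph that the TFLite interpreter then executes on-device.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()
with open("converted.tflite", "wb") as f:
    f.write(tflite_bytes)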

Why is ML compilation not like procedural compilation?

I’ve mentioned that the analogy between procedural and ML compilers is imperfect, but why? The biggest reason is that deep learning computation graphs are made up of a large, arbitrary, and ever-growing set of layers. Last time I checked, stock TensorFlow has over 2,000 different operations. PyTorch is not as expansive, but it also relies more on Python within its models to implement functionality, as does JAX. This is in contrast to modern procedural languages which tend to have a comparatively small number of core primitives (keywords, built-in types and operations) and a lot of libraries implemented using those primitives to provide most of their useful functionality. It’s also a Big Deal to add primitives to a procedural language, with a long period of debate, prototyping, and consensus building before new ones are accepted.
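
If you want a rough count for yourself, one way (on whatever TensorFlow version you have installed) is to look at the generated wrappers in tf.raw_ops, each of which corresponds to a registered op; treat the result as a ballpark figure rather than an official number.

import tensorflow as tf

# Each public attribute of tf.raw_ops wraps one registered TensorFlow operation.
op_names = [name for name in dir(tf.raw_ops) if not name.startswith("_")]
print(f"{len(op_names)} registered ops in this TensorFlow build")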

The reason for this difference is that deep learning model authors are tightly constrained by latency and memory performance. There seems to be a practical time limit of around a week to train a model to completion; if it takes any longer, the author is unlikely to be able to iterate through enough prototypes to produce a successful result in the end. Because training a model means running it many millions of times, it makes sense for authors trying new techniques to invest time optimizing the code they’re using. Because many people use Nvidia GPUs, this often means writing a function in CUDA to implement any new computation that’s needed, rather than leaving it in the Python that might be more natural for experimenting. The consequence of this is that even operators like activation functions, which involve trivial math that could easily be represented as a simple NumPy operation, get implemented as separate layers, and so show up as such in the compute graph. Even worse from the framework implementer’s perspective is that authors may actually fuse together multiple conceptually-separate operations into a single layer purely for performance reasons. You can see this in the plethora of different LSTM layers available in TensorFlow; they exist because manual fusing helped speed up training for particular models.

What this means in practice is that compute graphs are made up of layers chosen for model authors’ convenience, defined only by their already-optimized C++/CUDA implementations and their unit tests, which are often written against Python libraries like NumPy, adding yet another layer of indirection when trying to understand what they do. They are also likely to be comparatively large in scope, rather than being constrained to a more primitive operation. All this makes the job of anyone trying to convert them into another representation very hard. Even worse, new layers are constantly being invented by researchers!

Most ML compilers “solve” this problem by only supporting a subset of the layers. Even TensorFlow Lite and XLA only support some operations. What I’ve found from my experience is that this is an unpleasant surprise to many users hoping to convert their model from the training environment to run on another platform. Most authors aren’t even particularly aware of which ops they’re using, since they’re likely to be using a higher-level interface like Keras, so figuring out how to change a model definition to fit with any constraints can be a frustrating and confusing process.

I believe this to be the single biggest problem facing ML compilers. The only way we can hope to provide a good experience to everyday users is by changing the training environment so that models are automagically expressed in a manageable representation from the start. The current situation asks compiler authors to turn a hamburger back into a cow; it’s simply not feasible. The challenge is that adding such constraints makes it harder to experiment with new approaches, since any additional layers would need to be represented in a form other than Python, C++, or CUDA, which are the preferred languages of researchers. Either compiler writers will have to keep chasing all the new layer implementations constantly being produced by researchers, or we’ll have to persuade researchers to write them in a more portable form.

Why are there so many layers?

So far I’ve focused on “classical” deep learning operations, but one of the reasons that there are so many layers is that compute graphs also include a lot of computation that can’t easily be expressed in a mathematical form. Layers like convolution, fully-connected, or activations can be written using a math notation and implemented using a comparatively small number of primitives, and they take up the majority of the compute time, so they’re often chosen as the first targets by compiler writers. Unfortunately there are many other layers that don’t fit as easily into something as mathematical, and where the only practical definition can be written in a procedural function using something like C++. A favorite example of mine is the non-max suppression layer used to prune the soup of bounding boxes produced by networks doing image localization. This algorithm is hard to describe except as a series of sorting, loops, and conditional statements, and it’s difficult to see how it could be represented in anything less general than LLVM’s IR.
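
To show what I mean, here’s a minimal greedy NMS sketch in NumPy (my own simplified version, not the implementation from any particular framework). Notice how much of it is sorting, looping, and branching rather than the dense linear algebra that ML IRs are built around:

import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # boxes is (N, 4) as [x1, y1, x2, y2], scores is (N,). Returns kept indices.
    order = np.argsort(scores)[::-1]  # candidates sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Intersection of the winning box with all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        # Drop boxes that overlap the winner too much, keep the rest for the next round.
        order = rest[iou <= iou_threshold]
    return keep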

There are a lot of operations that generate features or perform other pre-processing, or do post-processing like scoring or beam search. These are tempting to exclude from any compiler solution because they often occur before or after the body of the model where the bulk of the computation happens, and so aren’t a priority for optimization, but they do sometimes occur at performance-critical points, so I think we need a solution for them too.

What about fallbacks?

One answer I’ve heard to this problem is that it’s always possible to fall back to the original CPU C++ implementation for the long tail of layers that are not easy to use a specialized representation for. In my opinion this removes a lot of the advantages of using a compiler. It’s no longer possible to perform fusing or other IR optimizations across the barrier formed by a non-IR layer, and the model itself is not portable across different platforms. You might think that you still have portability across platforms that support C++, but as I mentioned earlier, most layer implementations were created by research scientists and are only present in an optimized form. This means that the code is likely to rely on libraries like Eigen and use functions from the training framework itself. Consequently, porting a single layer often means porting most of the training framework and its dependencies too. This is possible (we use it for the Flex delegate in TensorFlow Lite, and PyTorch Mobile takes a similar approach), but it is a lot of work even for comparatively mainstream platforms like Android and iOS, and doesn’t work at all for anything non-Posix-like such as most embedded devices. It also takes up a lot of binary space, since server code is written to very different constraints than other platforms. Another problem is that even if the libraries relied upon for optimization are portable to other platforms, it’s not likely that they’ll offer the same performance that they do in the original environment. Performance doesn’t tend to be portable.

What about compiling Python?

A lot of frameworks rely on Python glue code to help implement models. This is great for model authors because they can use a familiar and flexible language, but it makes porting to other environments very tough. Mobile platforms don’t support Python, for example, and neither do GPUs. The need for GPU support tends to push authors to re-implement any Python code that becomes a performance bottleneck, but that still leaves a lot of places where it can be useful for training. The problem from a compiler’s perspective is that parts of the definition of the computation that needs to be performed to run the model are now held in Python code, not in the regular compute graph of layers, making those parts inaccessible.

A solution to this has been to compile regular Python code, as shown by TensorFlow’s tf.function. This is very helpful in some cases, but there are plenty of times when the Python code is actually relying on other libraries, often only available as C or C++ implementations. For example, a lot of audio models will do things like create spectrograms using a specialized library, which ML compilers don’t have visibility into, or the ability to translate into another representation.
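
As a rough illustration of that visibility problem, here are two ways of computing a spectrogram. The first uses framework ops that a graph-based compiler can see and transform; the second is ordinary Python calling into NumPy (standing in for any external C/C++-backed library) and is invisible to it. This is just a sketch of the distinction, not a recommendation of either approach:

import numpy as np
import tensorflow as tf

# Visible: tf.signal ops are recorded in the graph, so a compiler can reason about them.
@tf.function
def spectrogram_visible(audio):
    stft = tf.signal.stft(audio, frame_length=512, frame_step=256)
    return tf.abs(stft)

# Invisible: plain Python calling into NumPy's C implementation. A graph compiler
# can't trace this; inside a tf.function it would either fail on symbolic tensors
# or get baked in as a constant at trace time.
def spectrogram_invisible(audio):
    frames = np.lib.stride_tricks.sliding_window_view(audio, 512)[::256]
    return np.abs(np.fft.rfft(frames, axis=-1))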

How can we make progress?

I hope this post doesn’t sound too much like an Airing of Grievances (though it is December 23rd as I write this); I’m honestly a big fan of all the compiler work that’s been happening in the ML world and I want it to continue growing. With that in mind, how can we move forward? In my mind, there are two possible futures: one where the ML ecosystem looks like Matlab, and another where it looks like LLVM.

If you’re not familiar with it, Matlab is a tool used by researchers for a lot of mathematical prototyping and exploration. There are tools to compile the resulting Matlab projects into standalone executable code, but because it’s so hard to do so completely and in an optimal way, a common workflow is for researchers to hand over their projects to engineers to write a C or C++ implementation by hand. In this flavor of the future, we’d do something similar for ML, where researchers would use frameworks focused on easy experimentation and flexibility, and the conversion process for production deployments would involve manual engineering into a more portable and optimized representation once a model was finalized. As an engineer who would likely be responsible for that conversion process, I’m hoping we can do better, for my own sake if nothing else. This future would also remove a lot of scope for collaboration between model authors and deploying engineers, which is a shame because iterative feedback loops can make both models and software implementations better. Unfortunately I think the Matlab model is the most likely to happen unless we can change direction.

The key LLVM innovation was the invention of an intermediate representation that was rich enough for a large set of languages, but small enough to be supported by a lot of different platforms without requiring exorbitant engineering resources for each. An IR like this for machine learning is the dream of most of the hardware vendors I know, since it would allow them to support a lot of different models and frameworks with comparatively little effort, at least compared to the current status quo. There are existing attempts that have had some success, such as ONNX or MLIR’s TOSA dialect, but they’ve all either struggled with coverage or increased the number of layers they support to a level that makes them tougher for hardware teams to implement. This is why I come back to the need to change the training environment itself. We somehow need to come up with tools that permit researchers the flexibility to experiment, give them the performance they need to complete training in a reasonable time, but also result in a representation that can be understood separate from the training environment. Researchers are in high demand in the ML world, so it would have to be something they want to use if it’s going to get adoption. These three requirements might end up being impossible to meet, but I’m hopeful that the ML compiler community can keep innovating and come up with something that meets them!

The Death of Feature Engineering is Greatly Exaggerated

Image by OfSmallThings

One of the most exciting aspects of deep learning’s emergence in computer vision a few years ago was that it didn’t appear to require any feature engineering, unlike previous techniques like histograms-of-gradients or Haar cascades. As neural networks ate up other fields like NLP and speech, the hope was that feature engineering would become unnecessary for those domains too. At first I fully bought into this idea, and saw any remaining manually-engineered feature pipelines as legacy code that would soon be subsumed by more advanced models.

Over the last few years of working with product teams to deploy models in production I’ve realized I was wrong. I’m not the first person to raise this idea, but I have some thoughts I haven’t seen widely discussed on exactly why feature engineering isn’t going away anytime soon. One of them is that even the original vision case actually does rely on a *lot* of feature engineering; we just haven’t been paying attention. Here’s a quote from a typical blog post discussing image models:

“a deep learning system is a fully trainable system beginning from *raw* input, for example image pixels”

(Emphasis added by me)

I spent over a decade working on graphics and image processing, so the implicit assumption that the kinds of images we train networks on are at all “raw” always bothered me a bit. I was used to starting with truly RAW image files to preserve as much information from the original scene as possible. These formats reflect the output of the camera’s CCD hardware pretty closely. This means that the values for each pixel correspond roughly linearly to the number of photons hitting the detector at that point, and the position of each measured value is actually in a Bayer pattern, rather than a simple grid of pixels.

Image from Wikipedia

So, even to get to the kind of two-dimensional array of evenly spaced pixels with RGB values that ML practitioners expect an image to contain, we have to execute some kind of algorithm to resample the original values. There are deep learning approaches to this problem, but it’s clear that this is an important preprocessing step, and one that I’d argue should count as feature engineering. There’s a whole world of other transformations like this that have to be performed before we get what we’d normally recognize as an image. These include some very complex and somewhat arbitrary transformations like white balancing, which everyday camera users might only become aware of during an apocalypse. There are also steps like gamma correction, which take the high dynamic ranges possible for the CCD output values (which reflect photon counts) and scale them into numbers which more closely resemble the human eye’s response curve. Put very simplistically, we can see small differences in dark areas with much more sensitivity than differences in bright parts, so to represent images in an eight-bit byte it’s convenient to apply a gamma curve so that more of the codes are used for darker values.
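
As a very simplified sketch of that last step (real standards like sRGB add a linear segment near black, which I’m ignoring here), you can see how a gamma curve hands out far more of the 8-bit codes to dark values than to bright ones:

import numpy as np

def gamma_encode(linear, gamma=2.2):
    # Map linear sensor intensities in [0, 1] to perceptually spaced 8-bit codes.
    return np.clip(255.0 * np.power(linear, 1.0 / gamma), 0, 255).astype(np.uint8)

# The same small linear difference gets many more codes near black than near white.
print(int(gamma_encode(0.05)) - int(gamma_encode(0.0)))   # roughly 65 codes
print(int(gamma_encode(1.0)) - int(gamma_encode(0.95)))   # roughly 6 codes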

I don’t want this to turn into an image processing tutorial, but I hope that these examples illustrate that there’s a lot of engineering happening before ML models get an image. I’ve come to think of these steps as feature engineering for the human visual system, and see deep learning as piggy-backing on all this work without realizing it. It makes intuitive sense to me that models benefit from the kinds of transformations that help us recognize objects in the world too. My instinct is that gamma correction makes it a lot easier to spot things in natural scenes, because you’d hope that the differences between two materials would remain roughly constant regardless of lighting conditions, and scaling the values keeps the offsets between the colors from varying as widely as they would with the raw measurements. I can easily believe that neural networks benefit from this property just like we do.

If you accept that there is a lot of hidden feature engineering happening behind the scenes even for the classic vision models, what does this mean for other applications of deep networks? My experience has been that it’s important to think explicitly about feature engineering when designing models, and if you believe your inputs are raw, it’s worth doing a deep dive to understand what’s really happening before you get your data. For example, I’ve been working with a team that’s using accelerometer and gyroscope data to interpret gestures. They were getting good results in their application, but thanks to supply-chain problems they had to change the IMU they were using. It turned out that the original part included sensor fusion to produce estimates of the device’s absolute orientation, and that’s what they were feeding into the network. Other parts had different fusion algorithms which didn’t work as well, and even trying software fusion wasn’t effective. Some problems included significant lag responding to movement and biases that sent the orientation way off over time. We switched the model to using the unfused accelerometer and gyroscope values, and were able to get back a lot of the accuracy we’d lost.

In this case, deep learning did manage to eat that part of the feature engineering pipeline, but because we didn’t have a good understanding of what was happening to our input data before we started, we ended up spending extra time dealing with problems that could have been more easily handled in the design and prototyping phase. Also, I don’t have deep knowledge of accelerometer hardware, but I wouldn’t be at all surprised if the “raw” values we’re now using have actually been through some significant processing.

Another area where feature engineering has surprised me with its usefulness is labeling and debugging data problems. When I was working on building a more reliable magic wand gesture model, I was getting very frustrated by my inability to tell whether the training data I was capturing from people was good enough. Just staring at six curves of the accelerometer and gyroscope X, Y, Z values over time wasn’t enough for me to tell if somebody had actually performed the expected gesture or not. I thought about trying to record video of the contributors, but that seemed like a lot to ask. Instead, I put some work into reconstructing the absolute position and movement from the “raw” values. This effectively became an extremely poor man’s version of sensor fusion, but focused on the needs of this particular application. Not only was I able to visualize the data to check its quality, I also started feeding the rendered results into the model itself, improving the accuracy. It also had the side-benefit that I could display an intuitive visualization of the gesture as seen by the model back to the user, so that they could gain an understanding of why it failed to recognize some attempts and learn to adapt their movements to be clearer from the model’s perspective!

From Colab notebook
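
I won’t reproduce the actual reconstruction I used, but as a crude sketch of the idea (naive integration only, with none of the gravity removal or drift correction a real implementation needs), something along these lines is enough to get traces you can plot and eyeball for data quality:

import numpy as np

def crude_trajectory(accel, gyro, dt=0.01):
    # accel and gyro are (N, 3) arrays of "raw" IMU samples at a fixed sample rate.
    angles = np.cumsum(gyro * dt, axis=0)        # integrate angular rate into an orientation estimate
    velocity = np.cumsum(accel * dt, axis=0)     # integrate acceleration into velocity (drifts quickly)
    position = np.cumsum(velocity * dt, axis=0)  # integrate again into position (drifts even faster)
    return angles, position

The point isn’t accurate dead reckoning, just a rendering faithful enough to tell a plausible gesture from noise.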

I don’t want to minimize deep learning’s achievements in reducing the toil involved in building feature pipelines, I’m still constantly amazed at how effective they are. I would like to see more emphasis put on feature engineering in research and teaching though, since it’s still an important issue that practitioners have to wrestle with to successfully deploy ML applications. I’m hoping this post will at least spark some curiosity about where your data has really been before you get it!