I think the transformative power of on-device speech to text is criminally under-rated (and I’m not alone), so I’m a massive fan of the work Coqui are doing to make the technology more widely accessible. Coqui is a startup working on a complete open source solution to speech recognition, as well as text to speech, and I’ve been lucky enough to collaborate with their team on datasets like Multilingual Spoken Words.
They have great documentation already, but over the holidays I’ve been playing around with the code and I always like to leave a trail of breadcrumbs if I can, so in this post I’ll try to show you how to get speech recognition running locally yourself in just a few minutes. I’ve tried it on my PopOS 21.04 laptop, but it will hopefully work on most modern Linux distributions, and should be trivial to modify for other platforms that Coqui provide binaries for. To accompany this post, I’ve also published a Colab notebook, which demonstrates all these steps and which you can use from your browser on almost any system.
You’ll need to be comfortable using a terminal, but because they do offer pre-built binaries you won’t need to worry about touching code or compilation. I’ll show you how to use their tools to recognize English language text from a WAV file. The code sections below (in a monospace font) should all be run from a shell terminal window.
First we download the example executable, stt, and the shared library that contains the framework code, libstt.so. Both are part of the native_client archive.
wget --quiet https://github.com/coqui-ai/STT/releases/download/v1.1.0/native_client.tflite.Linux.tar.xz
unxz native_client.tflite.Linux.tar.xz
tar -xf native_client.tflite.Linux.tar
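Before moving on, it can be worth confirming that the archive unpacked the two files this post relies on. The snippet below is a minimal sketch: check_native_client is a helper name I made up for illustration, and it only checks for the stt and libstt.so files mentioned above, nothing else in the archive.

```shell
# Hypothetical helper (not part of Coqui's tools): confirm that the two
# files named in this post exist in the given directory.
# Usage: check_native_client [DIRECTORY]   (defaults to the current directory)
check_native_client() {
  local dir="${1:-.}" rc=0 f
  for f in stt libstt.so; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      rc=1
    fi
  done
  # The stt executable also needs its execute bit set.
  if [ -f "$dir/stt" ] && [ ! -x "$dir/stt" ]; then
    echo "not executable: $dir/stt (try: chmod +x $dir/stt)" >&2
    rc=1
  fi
  return "$rc"
}
```

Running `check_native_client . && echo "native client looks complete"` from the directory where you extracted the archive should print the message if both files are in place.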
Next, we need to fetch a model. For this example I’ve chosen the English large vocabulary model, but there are over 80 different versions available for many languages at coqui.ai/models. Note that this is the recognition model, not the language model. Language models are used to post-process the results of the neural network, and are optional. To keep things simple, in this example we’re just using the raw recognition model output, but there are lots of options to improve the quality for a particular application if you investigate things like language models and hotwords.
wget --quiet https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/model.tflite
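Because --quiet hides wget’s progress and error messages, a failed download can go unnoticed until a later step breaks. Here is a rough sketch of a wrapper that skips files you already have and complains loudly on failure. The fetch function is a name I’ve invented for this example; it only uses the --quiet and -O options of wget.

```shell
# Hypothetical helper: download URL to OUTPUT unless a non-empty copy exists.
# Usage: fetch URL [OUTPUT]   (OUTPUT defaults to the last part of the URL)
fetch() {
  local url="$1" out="${2:-}"
  [ -n "$out" ] || out=$(basename "$url")
  if [ -s "$out" ]; then
    echo "already have $out"
    return 0
  fi
  if ! wget --quiet -O "$out" "$url"; then
    echo "download failed: $url" >&2
    # wget -O can leave an empty file behind on failure, so clean it up.
    rm -f "$out"
    return 1
  fi
}
```

For example, `fetch https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/model.tflite` would download the model the first time and do nothing on later runs.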
To demonstrate how the speech to text tool works, we need some WAV files to try it out on. Luckily Coqui provide some examples, together with transcripts of the expected output.
wget --quiet https://github.com/coqui-ai/STT/releases/download/v1.1.0/audio-1.1.0.tar.gz
tar -xzf audio-1.1.0.tar.gz
The stt file is a command line tool that lets you run speech to text using Coqui’s framework. It has a lot of options you can explore, but the simplest way to use it is to provide a recognition model and then point it at a WAV file. After some version logging you should see the predicted transcript of the speech in the audio file as the final line.
./stt --model ./model.tflite --audio ./audio/4507-16021-0012.wav
You should see output that looks something like this:
TensorFlow: v2.3.0-14-g4bdd3955115
Coqui STT: v1.1.0-0-gf3605e23
why should one halt on the way
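Since the transcript is the final line of the output, it’s easy to script this for a whole folder of recordings. The following is a sketch rather than anything official: transcribe_dir is a hypothetical helper of mine, and it simply runs the same command as above for each .wav file and keeps the last line of standard output.

```shell
# Hypothetical helper: transcribe every .wav file in a directory.
# Usage: transcribe_dir STT_BINARY MODEL_FILE AUDIO_DIR
# Prints one line per file: the file name, a tab, then the transcript.
transcribe_dir() {
  local stt_bin="$1" model="$2" dir="$3" wav text
  for wav in "$dir"/*.wav; do
    # If nothing matched, the pattern is left unexpanded, so skip it.
    [ -e "$wav" ] || continue
    # Keep only the final line of standard output, which should be the
    # transcript. Standard error is left alone so problems stay visible.
    text=$("$stt_bin" --model "$model" --audio "$wav" | tail -n 1)
    printf '%s\t%s\n' "$(basename "$wav")" "$text"
  done
}
```

With the files from this post in place, `transcribe_dir ./stt ./model.tflite ./audio` should print a transcript for each of the example recordings.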
If you’ve made it this far, congratulations, you’ve just run your own speech to text engine locally on your machine! Coqui have put a lot of work into their open source speech framework, so if you want to dive in deeper I highly recommend browsing their documentation and code. Everything’s open source, even the training, so if you need something special for your own application, like a different language or specialized vocabulary, you have the chance to do it yourself.
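If you do want to experiment with the language models and hotwords mentioned earlier, here is a sketch of how that might look from the command line. I’m assuming the stt tool accepts --scorer (for the language model file) and --hot_words (comma-separated word:boost pairs) options, so please check the output of ./stt --help for your version before relying on it. The transcribe_with_scorer function and the scorer path in the usage example are placeholders I’ve made up.

```shell
# Hypothetical helper: run stt with a language model (scorer) and,
# optionally, hot words. The --scorer and --hot_words flags are assumed
# here; confirm them with ./stt --help.
# Usage: transcribe_with_scorer STT_BINARY MODEL AUDIO SCORER [HOT_WORDS]
transcribe_with_scorer() {
  local stt_bin="$1" model="$2" audio="$3" scorer="$4" hot_words="${5:-}"
  # Rebuild the argument list for the stt command.
  set -- --model "$model" --scorer "$scorer" --audio "$audio"
  if [ -n "$hot_words" ]; then
    set -- "$@" --hot_words "$hot_words"
  fi
  "$stt_bin" "$@"
}
```

A call might look like `transcribe_with_scorer ./stt ./model.tflite ./audio/4507-16021-0012.wav ./my.scorer "halt:5.0"`, where my.scorer stands in for whichever scorer file you download for your language and the boost value is just an arbitrary example.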
Update – I’ve also just added a new Colab notebook showing how to build a program using STT with just a makefile and the binary releases, without requiring Bazel.