Announcing Moonshine Voice

Today we’re launching Moonshine Voice, a new family of on-device speech-to-text models designed for live voice applications, along with an open source library to run them. The models support streaming: they do much of the compute while the user is still talking, so your app can respond to speech an order of magnitude faster than alternatives while continuously receiving partial text updates. Our largest model has only 245 million parameters, yet achieves a 6.65% word error rate on Hugging Face’s Open ASR Leaderboard, compared to Whisper Large v3, which has 1.5 billion parameters and a 7.44% word error rate. The library is optimized for easy integration into applications, with prebuilt packages and examples for iOS, Android, Python, macOS, Windows, Linux, and Raspberry Pi. Everything runs on the CPU with no NPU or GPU dependencies, and the code and streaming models are released under an MIT license.
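To give a feel for what streaming looks like from an app’s point of view, here’s a minimal sketch of consuming partial and final transcript updates. The package, class, and callback names below are illustrative placeholders, not the library’s actual API; check the repository for real usage.

```python
# Illustrative sketch only -- the module, class, and method names here are
# hypothetical stand-ins, not the actual Moonshine Voice API.
import time

import moonshine_voice  # hypothetical package name


def on_partial(text: str) -> None:
    # Called repeatedly while the user is still speaking, so the UI can
    # show a live, continuously updating transcript.
    print(f"\rpartial: {text}", end="", flush=True)


def on_final(text: str) -> None:
    # Called once the utterance ends, with the settled transcript.
    print(f"\nfinal: {text}")


transcriber = moonshine_voice.Transcriber(model="base")  # hypothetical constructor
transcriber.start_microphone(on_partial=on_partial, on_final=on_final)

time.sleep(30)  # keep listening for half a minute
transcriber.stop()
```

Because the partial callback fires while speech is still in progress, an app can start acting on the text (searching, pre-fetching, updating the screen) before the user finishes the sentence, which is where the latency win comes from.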

We’ve designed the framework to be “batteries included”, with microphone capture, voice activity detection, speaker identification (though our diarization has room for improvement), speech-to-text, and even intent recognition built in, all available through a common API on every platform.
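As a rough sketch of how those pieces could fit together in application code (again, the names are hypothetical, not the actual API), a single pipeline object might wire up capture, VAD, transcription, speaker identification, and intent recognition:

```python
# Illustrative sketch only -- class, parameter, and event names are hypothetical.
import moonshine_voice  # hypothetical package name

pipeline = moonshine_voice.Pipeline(
    model="base",                    # which speech-to-text model to load
    voice_activity_detection=True,   # gate transcription on detected speech
    speaker_identification=True,     # tag each utterance with a speaker id
    intent_recognition=True,         # classify the finished utterance
)


@pipeline.on("utterance")
def handle_utterance(event):
    # The event carries the final text plus whatever extras the pipeline computed.
    print(event.speaker_id, event.intent, event.text)


pipeline.run()  # capture from the default microphone until interrupted
```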

As you might be able to tell, I’m pretty excited to share this with you all! We’ve been working on it for the last 18 months and dogfooding it in our own products, and I can’t wait to see what you build with it. Please join our Discord if you have questions, and if you do find it useful, please consider giving the repository a star on GitHub; that helps us a lot.