
Deciding who said what is one of the most common tasks when dealing with live speech, but there's less information available about it than about other parts of the pipeline like transcription or voice-activity detection. I've been doing more work on speaker identification recently for an upcoming open source project that I'm excited to share soon, and I realized I was hazier on some of the practical details than I'd like. As any teacher knows, the best way to find the holes in your own knowledge of a topic is to try to explain it to someone else, so I decided to write a step-by-step Python notebook covering the basics of speech embeddings, with working examples inline.
If you're able to run in a cloud environment and you're not resource-constrained, you don't need to understand how these embeddings work. You can find plenty of open source packages and commercial APIs that handle speaker identification (aka diarization) for you. When you're targeting mobile or edge platforms, though, you may not have access to those conveniences, and that's where understanding what's happening under the hood can help you figure out how to tackle the problem.
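To give a flavor of what's under that hood, here's a minimal sketch of the core idea the notebook builds on: each clip of speech gets turned into a fixed-length embedding vector, and you decide whether two clips come from the same speaker by comparing those vectors, typically with cosine similarity. This isn't code from the notebook, and the vectors below are synthetic stand-ins for what a real speaker-embedding model would produce, but the comparison step looks much like this:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-length embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came out of a speaker-embedding model run on three clips:
# two from the same person, one from somebody else.
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=256)
speaker_a_again = speaker_a + rng.normal(scale=0.1, size=256)  # same voice, new clip
speaker_b = rng.normal(size=256)                               # a different voice

print(cosine_similarity(speaker_a, speaker_a_again))  # close to 1.0
print(cosine_similarity(speaker_a, speaker_b))        # close to 0.0
```

In practice you'd pick a similarity threshold by measuring scores on labeled same-speaker and different-speaker pairs from your own data, since the right cutoff depends heavily on the embedding model and recording conditions.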
Anyway, I hope this trail of breadcrumbs helps someone else, even if it’s through an AI model that scrapes this!