When I first tried ChatGPT, it blew my mind. Its ability to respond intelligently to almost any prompt I gave it was astonishing, and it was obvious to me that this was the future. It seemed like we’d finally built the kind of AI we’ve all seen in the movies. Over time though, one big limitation became clear: they’re all talk and no action. By that I mean they’re fantastic for anything that requires generating text, but persuading them to make something happen is a lot harder. For example, we can now build a model that can have a natural conversation with a person, just like HAL 9000, but if you ask it to open the pod bay doors, there’s no easy way to connect the LLM’s output to those doors’ controls.
The challenge of converting something somebody said into an action is known as the “speech to intent” problem in the research world. If you’ve ever used a voice assistant, you’ll know that you have to be careful about how you phrase requests. “Alexa, living room lights on” may work, but “Alexa, turn on the lights in the living room” might not. If you were talking to a person, you wouldn’t have this problem; they would understand what you meant even if you didn’t use the exact phrase they were expecting. In natural conversations we’re just as likely to say something like “Can you hit the switch for the lights by the TV?” or “We need light in the living room”, and we’d expect the other person to understand. Solving speech to intent means recognizing all of those possible natural language phrases as inputs, and outputting a structured result that unambiguously tells the rest of the system to turn a particular light on.
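To make that concrete, here’s a minimal sketch of the kind of structured result a speech to intent system needs to produce (the Intent record and its field names are my own illustration, not any real assistant’s API):
// Illustrative sketch only: the Intent record and its fields are made up for this post,
// not taken from any real assistant's API.
public class IntentSketch {
    // A structured intent that unambiguously names the action, the device, and the room.
    record Intent(String action, String device, String room) {}

    public static void main(String[] args) {
        // All of these phrasings should collapse into the same structured result:
        //   "Alexa, living room lights on"
        //   "Can you hit the switch for the lights by the TV?"
        //   "We need light in the living room"
        Intent intent = new Intent("turn_on", "lights", "living_room");
        System.out.println(intent); // Intent[action=turn_on, device=lights, room=living_room]
    }
}
Once you have something like that, wiring it up to the actual light switch is the easy part. The hard part is getting from a messy spoken phrase to that record reliably.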
As you can probably tell from your own experiences with voice assistants, this problem is far from solved. A lot of current solutions still work a lot like Infocom text games from the ’80s – here’s a genuine example from Azure’s “AI Services”:
// Creates a pattern that uses groups of optional words. "[Go | Take me]" will match either "Go", "Take me", or "".
String patternWithOptionalWords = "[Go | Take me] to [floor|level] {floorName}";
You might already be able to spot a few problems with this. What if someone said “Go to six” or “Six please”? This kind of pattern matching is very brittle, because it relies on either the developer coming up with every likely variation of a command, or the user choosing exactly the phrase the developer expected. Even worse, there’s usually no way for a user to tell what the correct phrases actually are, so the interface is incredibly undiscoverable too! I believe the problems this rule-based approach causes are a big reason that very few people use voice interfaces. We expect our assistants to understand us when we talk naturally to them, and right now they don’t.
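To see how quickly this gets out of hand, here’s what covering just a couple of extra phrasings might look like in the same pattern syntax. These additional patterns are my own guesses at how you’d extend the Azure sample, not code from their docs:
// The sample above covers one family of phrasings; every other variation needs its own pattern.
String shortPattern = "[Go | Take me] to {floorName}";  // "Go to six"
String politePattern = "{floorName} [please]";          // "Six please"
// ...and anything nobody anticipated ("I need to get to six") still falls through the cracks.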
Large Language Models seem to be great at understanding people, so are they the solution? I think they will be soon, but the best paper I’ve found on this approach shows we still have some work to do. The authors’ experiments show that you can match the non-LLM state of the art by using ChatGPT 3.5 on a simple intent classification task (table 3), but the LLM approach is much worse when the requirements are tougher (table 4). ChatGPT also struggles with the kinds of word errors that show up in transcribed text. I’m optimistic that we can solve these issues (and we’re actively working on this at Useful), but it will require some new approaches to training and using models.
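For comparison, here’s a rough sketch of what the LLM approach looks like: hand the model the raw transcript along with the actions it’s allowed to choose from, and ask it to reply in a fixed structured format. The prompt wording is invented for this post, and callModel is a hypothetical stand-in for whatever LLM API you’re using, not a real library call (and not how we actually do it at Useful):
// Sketch only: callModel() is a hypothetical stand-in for an LLM API call.
public class LlmIntentSketch {
    static String callModel(String prompt) {
        // Hypothetical: send the prompt to your LLM provider and return its text reply.
        throw new UnsupportedOperationException("wire this up to a real model");
    }

    public static void main(String[] args) {
        String transcript = "can you hit the switch for the lights by the tv";
        String prompt =
            "You control a smart home. Allowed actions: turn_on, turn_off.\n"
          + "Allowed devices: lights, tv. Allowed rooms: living_room, kitchen.\n"
          + "Reply with exactly one line in the form action|device|room.\n"
          + "Utterance: " + transcript;
        // A well-behaved model would reply with something like "turn_on|lights|living_room".
        String[] parts = callModel(prompt).trim().split("\\|");
        System.out.printf("action=%s device=%s room=%s%n", parts[0], parts[1], parts[2]);
    }
}
The appeal is that the model, rather than the developer, absorbs all the phrasing variety; the catch, as the paper shows, is getting it to do that reliably when the requirements get tougher or the transcript is noisy.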
So, why is speech to intent so important? I believe it’s the last missing piece before we finally have voice interfaces that are a joy to use! Imagine leaning back on your couch with your laptop open and browsing purely through speech. Blade Runner has a beautiful example of how this might work in its “zoom and enhance” scene.
Of course I’m more likely to be buying jeans from Zappos than playing robot detective, but almost any interactive experience can be improved with a voice interface that actually understands people. Speech won’t replace keyboards or touch screens (we’ll still be typing into spreadsheets), but there will be a lot of cases where it’s the easiest way to interact. This won’t just be an incremental change; it will open up experiences on devices that were never possible before. If voice truly works, you’ll be able to use your TV to browse the web, get a quick summary of a page from your smart speaker, or work with apps on your AR or VR devices. It will free us from remote controls and from having to physically touch something to make it work. If you’re using voice, the results can be displayed on whatever screen is convenient, and computing becomes much more ambient, rather than something you have to carry around with you.
This is why I’m so excited to be working on this problem. We’ve been suffering through a long voice interface winter, but almost all of the ingredients are in place to make speech work. If we can persuade LLMs to turn their words into deeds, then we’ll finally be able to talk to machines like we can to people, and I think that will be glorious.