Terrifying fridge with human teeth, via DALL-E.

Imagine asking a box on a pillar at Home Depot “Where are the nails?” and getting directions, your fridge responding with helpful advice when you say “Why is the ice maker broken?”, or your car answering “How do I change the wiper speed?”. I think of these kinds of voice assistants for everyday objects as “Little Googles”, agents that are great at answering questions, but only in a very specific domain. I want them in my life, but they don’t yet exist. If they’re as useful as I think, why aren’t they already here, and why is now the right time for them to succeed?

What are “Little Googles”?

I’m a strong believer in Computers as Social Actors, the idea that people want to interact with new technology as if it were another person. With that in mind, I always aim to make user experiences as close to existing interactions as possible, to increase the likelihood of adoption. In everyday life, we often get the information we need from a short conversation with someone else, whether it’s a clerk at Home Depot or a spouse who knows the car controls better than you do. I believe that speech to text and LLMs are now sufficiently advanced to allow a computer to answer 80% of these kinds of informational queries, all through a voice interface.

The reason we ask people these kinds of questions rather than Googling on our phones is that the other person has a lot of context and specialized knowledge that isn’t present in a search engine. The clerk knows which store you’re in, and how items are organized. Your spouse knows what car you’re driving, and has learned the controls themselves. It’s just quicker and easier to ask somebody right now! The idea of “Little Googles” is that we can build devices that offer the same convenience as a human conversation, even when there’s nobody else nearby.

Why don’t they exist already?

If this is such a good idea, why hasn’t anyone built these? There are a couple of big reasons, one technical and the other financial. The first is that it used to take hundreds of engineers years to build a reliable speech to text service. Apple paid $200m to buy Siri in 2010, Alexa reportedly lost $10b in 2022, and I know from my own experience that Google’s speech team was large, busy, and deservedly well-paid. This meant that the technology to offer a voice interface was only available to a few large companies, and they reserved it for their own products, or other use cases that drove traffic directly to them. Speech to text was only available if it served those companies’ purposes, which meant that other potential customers like auto manufacturers or retail stores couldn’t use it.

The big financial problem came from the requirement for servers. If you’re a fridge manufacturer you only get paid once, when a consumer buys the appliance. That fridge might have a useful lifetime of over a decade, so if you offered a voice interface you’d need to pay for servers to process incoming audio for years to come. Because most everyday objects aren’t supported by subscriptions (despite BMW’s best efforts) the money to keep those servers running for an indeterminate amount of time has to come from the initial purchase. The ongoing costs associated with voice interfaces have been enough to deter almost anyone who isn’t making immediate revenue from their use.
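To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is a made-up placeholder rather than a real figure from any manufacturer or cloud provider, and the function names are hypothetical. The only input taken from the argument above is the decade-long lifetime.

```python
def lifetime_server_cost(cost_per_query, queries_per_day, years):
    """Total cloud cost of serving one appliance's voice queries over its lifetime."""
    return cost_per_query * queries_per_day * 365 * years


def margin_after_servers(unit_margin, cost_per_query, queries_per_day, years):
    """What is left of the one-time sale margin once lifetime server costs are paid."""
    return unit_margin - lifetime_server_cost(cost_per_query, queries_per_day, years)


# All values below are illustrative assumptions, not measurements.
example = margin_after_servers(
    unit_margin=50.0,     # hypothetical profit on one fridge, in dollars
    cost_per_query=0.01,  # hypothetical server cost per voice query, in dollars
    queries_per_day=2,    # hypothetical usage
    years=10,             # the decade-long lifetime mentioned above
)
print(f"Margin left after servers: ${example:.2f}")
```

With these placeholder inputs the server bill exceeds the margin on the sale. The exact figures don’t matter. What matters is that the cost grows with every year the appliance stays in service, while the revenue arrives only once.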

Having to be connected also meant that the audio was sent to someone else’s data center, with all the privacy issues that involves. It also required wifi, which is an ongoing maintenance cost in any commercial environment and such a pain for consumers to set up that fewer than half of “smart” appliances are ever connected.

Why is now the right time?

OpenAI’s release of Whisper changed everything for voice interfaces. Suddenly anyone could download a speech to text model that performs well enough for most use cases, and use it commercially with few strings attached. It shattered the voice interface monopoly of the big tech companies, removing the technical barrier.

The financial change was a bit more subtle. These models have become small enough to fit in 40 megabytes and run on a $50 SoC. This means it’s starting to be possible to run speech to text on the kinds of chips already found in many cars and appliances, with no server or internet connection required. That removes the ongoing costs from the equation: running a voice interface now just needs to fit in the on-device compute budget, a one-time expense for the manufacturer.

Moving the voice interface code to the edge also removes the usability problems and costs of requiring a network connection. You can imagine a Home Depot product finder being a battery-powered box that is literally glued to a pillar in the store. You’d just need somebody to periodically change the batteries and plug in a new SD card as items are moved around. The fridge use case is even easier: you’d ship the equivalent of the user manual with the appliance and never update it, just as the paper manual never gets updated.
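As a rough illustration of how little logic the store example needs once the audio has been transcribed, here is a minimal Python sketch. It assumes the transcript has already been produced on-device by a speech to text model such as Whisper. The AISLES table, the function names, and the aisle numbers are all hypothetical, and the simple phrase matching is only a stand-in for whatever language model a real product would use.

```python
import re

# Hypothetical store layout. In the scenario above, this table would be
# loaded from the SD card that gets swapped when items are moved around.
AISLES = {
    "nails": "Aisle 12, bay 3",
    "paint": "Aisle 4",
    "light bulbs": "Aisle 7, bay 1",
}


def normalize(text):
    """Lowercase the text, strip punctuation, and split it into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def find_item(transcript, aisles=AISLES):
    """Return (item, location) for the longest item name in the transcript, or None."""
    padded = " " + " ".join(normalize(transcript)) + " "
    best = None
    for item, location in aisles.items():
        # Pad with spaces so only whole words or whole phrases match.
        if f" {item} " in padded and (best is None or len(item) > len(best[0])):
            best = (item, location)
    return best


def answer(transcript):
    """Turn a transcribed question into a short reply for the speaker."""
    match = find_item(transcript)
    if match is None:
        return "Sorry, I don't know where that is."
    item, location = match
    return f"You'll find {item} in {location}."


print(answer("Where are the nails?"))
```

Because the whole knowledge base is one small table, updating the device really could be as simple as replacing a file, with no network connection involved.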

Nice idea, but where’s the money?

Voice interfaces have often seemed like a solution looking for a problem (see Alexa’s $10b burn rate). What’s different now is that I’m talking to customers with use cases that they believe will make them money immediately. Selling appliance warranties is a big business, but call centers, truck rolls for repairs, and returns can easily wipe out any profit. A technology that can be shown to reduce all three would save a lot of money in a very direct way, so there’s been strong interest in the kinds of “Talking user manuals” we’re offering at Useful. Helping customers find what they need in a store is another obvious moneymaker, since a good implementation will increase sales and consumer satisfaction, so that’s been popular too.

What’s next?

It’s Steam Engine Time for this kind of technology. There are still a lot of details to be sorted out, but it seems so obvious that this is now possible, and that it would be both a pleasant addition* to most people’s lives and a source of profit, that I can’t imagine something like it won’t happen. I’ll be busy with the team at Useful trying to build some of the initial implementations and prove that it isn’t such a crazy idea, so I’d love to hear from you if this resonates. I’d also like to see other implementations of similar ideas, since I know I can’t be the only one seeing these trends.

(*) Terrifying AI-generated images of fridges with teeth notwithstanding.

4 responses

alicecoucke says:

November 30, 2023 at 9:51 am

Hi Pete, nice read! We’ve been studying Whisper’s perf on realistic acoustic environments (far field, noisy conditions) and there are definitely some improvements needed, making it, I believe, not fit for all use cases yet. I’d be curious to have your opinion on that.

alexandertolley says:

November 30, 2023 at 3:19 pm

To pick up on the inventors vs innovators idea in the link, IIRC, voice controls have been a desired idea for a long time. A colleague was working on voice controls for BMW back in the late 1990s. What you are doing is marrying the idea of voice commands with all that implies with knowledge (i.e. “Google”) to provide a superior service. As I get older and my “thumbing” text messages get ever slower, I have started to resort to voice input on my iPhone, as Siri now recognizes my British accent pretty well. I have also used Amazon’s Alexa to answer simple factual questions that would be slower to type in a Google search. Your suggestion seems like a simple extrapolation with the key technology being possible at the edge rather than a central server.

I fully expect to see such devices as ubiquitous.

Ideas:
Bus stops that can respond to when the bus is next arriving.
Check-ins that avoid using terminals (I am looking at you, Quest Diagnostics).

Pingback: Doom, Dark Compute, and AI « Pete Warden's blog

Chris D. says:

January 28, 2024 at 10:53 pm

Home Depot example – the workers are there doing other tasks like stocking shelves. There are always people on multiple aisles.

If you are actually planning ahead there is already a HD app that works for any given store and tells you the aisle-section.

Like all products, you want to find a killer app so it isn’t just local to a retailer or such or even in a retail chain’s entire store footprint.


Pete Warden's blog

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.

Little Googles Everywhere
