Terrifying fridge with human teeth, via DALLE.

Imagine asking a box on a pillar at Home Depot “Where are the nails?” and getting directions, your fridge responding with helpful advice when you say “Why is the ice maker broken?”, or your car answering “How do I change the wiper speed?”. I think of these kinds of voice assistants for everyday objects as “Little Googles”, agents that are great at answering questions, but only in a very specific domain. I want them in my life, but they don’t yet exist. If they’re as useful as I think, why aren’t they already here, and why is now the right time for them to succeed?

What are “Little Googles?”

I’m a strong believer in Computers as Social Actors, the idea that people want to interact with new technology as if it was another person. With that in mind, I always aim to make user experiences as close to existing interactions as possible to increase the likelihood of adoption. If you think about everyday life, we often get information we need from a short conversation with someone else, whether it’s a clerk at Home Depot, or your spouse who knows the car controls better than you do. I believe that speech to text and LLMs are now sufficiently advanced to allow a computer to answer 80% of these kinds of informational queries, all through a voice interface.

The reason we ask people these kinds of questions rather than Googling on our phones is that the other person has a lot of context and specialized knowledge that isn’t present in a search engine. The clerk knows which store you’re in, and how items are organized. Your spouse knows what car you’re driving, and has learned the controls themselves. It’s just quicker and easier to ask somebody right now! The idea of “Little Googles” is that we can build devices that offer the same convenience as a human conversation, even when there’s nobody else nearby.

Why don’t they exist already?

If this is such a good idea, why hasn’t anyone built these? There are a couple of big reasons, one technical and the other financial. The first is that it used to take hundreds of engineers years to build a reliable speech to text service. Apple paid $200m to buy Siri in 2010, Alexa reportedly lost $10b in 2022, and I know from my own experience that Google’s speech team was large, busy, and deservedly well-paid. This meant that the technology to offer a voice interface was only available to a few large companies, and they reserved it for their own products, or other use cases that drove traffic directly to them. Speech to text was only available if it served those companies’ purposes, which meant that other potential customers like auto manufacturers or retail stores couldn’t use it.

The big financial problem came from the requirement for servers. If you’re a fridge manufacturer you only get paid once, when a consumer buys the appliance. That fridge might have a useful lifetime of over a decade, so if you offered a voice interface you’d need to pay for servers to process incoming audio for years to come. Because most everyday objects aren’t supported by subscriptions (despite BMW’s best efforts) the money to keep those servers running for an indeterminate amount of time has to come from the initial purchase. The ongoing costs associated with voice interfaces have been enough to deter almost anyone who isn’t making immediate revenue from their use.

Having to be connected also meant that the audio was sent to someone else’s data center, with all the privacy issues involved, and required wifi availability, which is an ongoing maintenance cost in any commercial environment and such a pain for consumers to set up that less than half of “smart” appliances are ever connected.

Why is now the right time?

OpenAI’s release of Whisper changed everything for voice interfaces. Suddenly anyone could download a speech to text model that performs well enough for most use cases, and use it commercially with few strings attached. It shattered the voice interface monopoly of the big tech companies, removing the technical barrier.

The financial change was a bit more subtle. These models have become small enough to fit in 40 megabytes and run on a $50 SoC. This means it’s starting to be possible to run speech to text on the kinds of chips already found in many cars and appliances, with no server or internet connection required. This removes the ongoing costs from the equation, now running a voice interface is just something that needs to be part of the on-device compute budget, a one-time, non-recurring expense for the manufacturer.

Moving the voice interface code to the edge also removes the usability problems and costs of requiring a network connection. You can imagine a Home Depot product finder being a battery-powered box that is literally glued to a pillar in the store. You’d just need somebody to periodically change the batteries and plug in a new SD card as items are moved around. The fridge use case is even easier, you’d ship the equivalent of the user manual with the appliance and never update it (since the paper manual doesn’t get any).

Nice idea, but where’s the money?

Voice interfaces have often seemed like a solution looking for a problem (see Alexa’s $10b burn rate). What’s different now is that I’m talking to customers with use cases that they believe will make them money immediately. Selling appliance warranties is a big business, but call centers, truck rolls for repairs, and returns can easily wipe out any profit. A technology that can be shown to reduce all three would save a lot of money in a very direct way, so there’s been strong interest in the kinds of “Talking user manuals” we’re offering at Useful. Helping customers find what they need in a store is another obvious moneymaker, since a good implementation will increase sales and consumer satisfaction, so that’s been popular too.

What’s next?

It’s Steam Engine Time for this kind of technology. There are still a lot of details to be sorted out, but it feels so obvious that it’s now possible and that this would be a pleasant addition* to most people’s lives as well as promising profit, that I can’t imagine something like this won’t happen. I’ll be busy with the team at Useful trying to build some of the initial implementations and prove that it isn’t such a crazy idea, so I’d love to hear from you if this is something that resonates. I’d also like to see other implementations of similar ideas, since I know I can’t be the only one seeing these trends.

(*) Terrifying AI-generated images of fridges with teeth notwithstanding.

As you might know I’m working on my PhD at Stanford, and one of my favorite parts is taking courses. For this second year I need to follow the new foundation and breadth requirements which in practice means taking a course a quarter, with each course chosen from one of four areas. For the fall quarter I took Riana Pfefferkorn and Alex Stamos’ Hacklab: Introduction to Cybersecurity, which I thoroughly enjoyed and learned a lot, especially about the legal side. I’m especially thankful to Danny Zhang, my excellent lab RA who had a lot of patience as I struggled with the difference between a search warrant and civil sanctions!

That satisfied the “People and Society” section of the requirements, but means I need to pick a course from one of the other three sections for the upcoming winter quarter. You might think this would be simple, but as any Googler can tell you, the more technically advanced the organization the worse the internal search tools are. The requirements page just has a bare list of course numbers, with no descriptions or links, and the enrollment tool is so basic that you have to put a space between the letters and the numbers of the course ID (“CS 243” instead of “CS243”) before it can find them, so it’s not even just a case of copying and pasting. To add to the complexity, many of the courses aren’t offered this coming quarter, so just figuring out what my viable options are was hard. I thought about writing a script to scrape the results, given a set of course numbers, but decided to do it manually in the end.

This will be a *very* niche post, but since there are around 100 other second year Stanford CS PhD students facing the same search problem, I thought I’d post my notes here in case they’ll be helpful. I make no guarantees about the accuracy of these results, I may well have fat-fingered some search terms, but let me know if you spot a mistake. I’ve indexed all 2xx and 3xx level courses in the first three breadth sections (since I already had the fourth covered), and I didn’t check 1xx because they tend to be more foundational. For what it’s worth, I’m most excited about CS 224N – Natural Language Processing with Deep Learning, and hope I can get signed up once enrollment opens.

2xx/3xx Breadth Courses Available Winter 2024

CS 205L – Continuous Mathematical Methods with an Emphasis on Machine Learning, Tuesday/Thursday
CS 212 – Operating Systems and Systems Programming, Monday/Wednesday
CS 223A – Introduction to Robotics, Monday/Wednesday
CS 224N – Natural Language Processing with Deep Learning, Tuesday/Thursday
CS 228 – Probabilistic Graphical Models: Principles and Techniques, Tuesday/Thursday
CS 229 – Machine Learning, Monday/Wednesday
CS 233 – Geometric and Topological Data Analysis, Monday/Wednesday
CS 237B – Principles of Robot Autonomy II, Monday/Wednesday
CS 243 – Program Analysis and Optimizations, Monday/Wednesday
CS 254 – Computational Complexity, Monday/Wednesday
CS 255 – Introduction to Cryptography, Monday/Wednesday
CS 256 – Algorithmic Fairness, Tuesday/Thursday
CS 246 – Mining Massive Data Sets, Tuesday/Thursday
CS 249I – The Modern Internet, Monday/Wednesday
CS 348I – Computer Graphics in the Era of AI, Tuesday/Thursday
EE 364A – Convex Optimization I, Tuesday/Thursday
CS 369O – Optimization Algorithms, Tuesday/Thursday

2xx/3xx Courses Not Offered for Winter 2024

CS 221
CS 227
CS 230
CS 231
CS 234
CS 236
CS 240
CS 242
CS 244
CS 245
CS 250
CS 251
CS 257
CS 258
CS 259
CS 261
CS 263
CS 265
CS 269
CS 271
CS 272
CS 273
CS 274
CS 279
CS 281
CS 316
CS 324
CS 326
CS 328
CS 329D
CS 329H
CS 329X
CS 330
CS 331
CS 332
CS 333
CS 334
CS 354
CS 355
CS 356
CS 358
CS 359
CS 371
CS 373
EE 282

	bouquetsweetly69036a… on Meet Fiona and Abby
	softlysuitcb91a8b8b1 on Meet Fiona and Abby
	Zero-Copy GPU Infere… on Why GEMM is at the heart of de…
	Moonshine Voice完全解説｜… on Announcing Moonshine Voice
	Moonshine KI-Sprache… on Introducing Moonshine, the new…

Pete Warden's blog

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.

Monthly Archives: November 2023

Little Googles Everywhere