I Know We’re in an AI Bubble Because Nobody Wants Me 😭

I first got into deep learning in 2012, when AlexNet came out. I was CTO of Jetpac, a startup that aimed to provide information about bars, hotels, and restaurants by analyzing public photos, for example finding hipster (and dog) friendly cafes. The results from the paper were so astonishing I knew AlexNet would be incredibly helpful, so I spent my Christmas holidays heating our house using a gaming rig with two GPUs and the CudaConvNet software, since that was the only way to train my own version of the model.

The results were even better than I’d hoped, but then I faced the problem of how to apply the model across the billions of photos we’d collected. The only GPU instances on Amazon were designed for video streaming and were prohibitively expensive. The CPU support in the Caffe framework was promising, but it was focused on training models, not running them after they’d been trained (aka inference). What I needed was software that would let me run the model at a massive scale on low-cost hardware. That was the original reason I wrote the Jetpac framework, so I could spin up hundreds of cheap EC2 instances to process our huge backlog of images for tens of thousands of dollars instead of millions.

It turned out that the code was small and fast enough to even run on phones, and after Jetpac was acquired by Google I continued in that direction by leading the mobile support for TensorFlow. While I love edge devices, and that’s what I’m known for these days, my real passion is for efficiency. I learned to code in the 80’s demo scene, went on to write PC game engines professionally in the 90’s, and I got addicted to the dopamine rush of optimizing inner loops. There’s nothing quite like having hard constraints, clear requirements, and days to spend solving the puzzle of how to squeeze just a little bit more speed out of a system.

If you’re not a programmer, it might be difficult to imagine what an emotional process optimizing can be. There’s no guarantee that it’s even possible to find a good answer, so the process itself can be endlessly frustrating. The first thrill comes when you see an opening, a possibility that nobody else has spotted. There’s the satisfaction of working hard to chase down the opportunity, and then too often the despair when it turns out not to work. Even then, at least I’ve learned something, because being good at optimization means learning everything you can about the hardware, the operating system, and the requirements themselves, and studying others’ code in depth. I can never guarantee that I’ll find a solution, but my consolation is always that I have a better understanding of the world than when I started. The deepest satisfaction comes when I do finally find an approach that runs faster, or uses fewer resources. It’s even a social joy: the win almost always contributes to a wider solution that the team is working on, making a product better, or even possible in a way it wasn’t before. The best optimizations come from a full-stack team that’s able to make tradeoffs all the way from the product manager to the model architects, from hardware to operating system to software.

Anyway, enough rhapsodizing about the joy of coding; what does this have to do with the AI bubble? When I look around, I see hundreds of billions of dollars being spent on hardware – GPUs, data centers, and power stations. What I don’t see are people waving large checks at ML infrastructure engineers like me and my team. It’s been an uphill battle to raise the investment we’ve needed for Moonshine, and I don’t think it’s just because I’m a better coder than I am a salesman. Thankfully we have found investors who believe in our vision, and we’re on track to be cashflow-positive in Q1 2026, but in general I don’t see many startups able to raise money on the promise of improving AI efficiency.

This makes no sense to me from any rational economic point of view. If you’re a tech company spending billions of dollars a month on GPUs, wouldn’t spending a few hundred million dollars a year on software optimization be a good bet? We know that GPU utilization is usually below 50%, and in my experience it is often much lower for interactive applications, where batches are small and memory-bound decoding dominates. We know that motivated engineers like Scott Gray can do better than Nvidia’s libraries on their own GPUs, and from my experience at Jetpac and Google I’m certain there are a lot of opportunities to run inference on much lower-cost CPU machines. Even if you don’t care about the cost, the impact AI power usage has on us and the planet should make this a priority.
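Even before you get to the climate argument, the raw economics are striking. Here’s a deliberately crude back-of-the-envelope sketch; every number in it is made up, and it’s only meant to show the shape of the tradeoff.

```javascript
// Back-of-the-envelope only: every number here is hypothetical.
// If a fleet runs at 40% utilization you effectively bought 1/0.4 = 2.5x the
// hardware you need; at 50% you'd only need 2.0x, so ~20% of spend is avoidable.

const monthlyGpuSpend = 2e9;      // hypothetical $2B/month on GPUs
const currentUtilization = 0.4;   // "usually below 50%"
const improvedUtilization = 0.5;  // what a strong optimization team might reach

const annualSpend = monthlyGpuSpend * 12;
const avoidableFraction = 1 - currentUtilization / improvedUtilization;
const potentialSavings = annualSpend * avoidableFraction;

console.log(`Annual GPU spend:  $${(annualSpend / 1e9).toFixed(1)}B`);
console.log(`Potential savings: $${(potentialSavings / 1e9).toFixed(1)}B per year`);
// => roughly $4.8B a year avoidable, against a few hundred million dollars of engineers.
```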

So, why is this money being spent? As far as I can tell, it’s because of the signaling benefits to the people making the decisions. Startups like OpenAI are motivated to point to the number of GPUs they’re buying as a moat, suggesting that they’ll be the top AI company for years to come because nobody else will be able to catch up with their head start on compute capacity. Hardware projects are also a lot easier to manage than software, so they don’t take up as much scarce management attention. Investors are on board because they’ve seen early success turn into long-term dominance before, it’s clear that AI is a world-changing technology so they need to be part of it, and OpenAI and others are happy to absorb billions of dollars of investment, making VCs’ jobs much easier than they would be if they had to allocate across hundreds of smaller companies. Nobody ever got fired for buying IBM, and nobody’s going to get fired for investing in OpenAI.

I’m picking on OpenAI here, but across the industry you can see everyone from Oracle to Microsoft boasting of the amounts of money they’re spending on hardware, and for the same reasons. They get a lot more positive coverage, and a much larger share price boost, from this than they would from announcing they’re hiring a thousand engineers to get more value from their existing hardware.

If I’m right, this spending is unsustainable. I was in the tech industry during the dot com boom, and I saw a similar dynamic with Sun workstations. For a couple of years every startup needed to raise millions of dollars just to launch a website, because the only real option was buying expensive Sun servers and closed software. Then Google came along, and proved that using a lot of cheap PCs running open-source software was cheaper and much more scalable. Nvidia these days feels like Sun did then, and so I bet over the next few years there will be a lot of chatbot startups based on cheap PCs with open source models running on CPUs. Of course I made a similar prediction in 2023, and Nvidia’s valuation has quadrupled since then, so don’t look to me for stock tips!

All AI Benchmarks are Wrong, but some are Useful

When I was new to Google Brain, I got involved in a long and heated discussion about evaluation numbers for some models we were using. As we walked out of the room, the most senior researcher told me “Look, the only metrics that matter are app store ratings. Everything else is just an approximation.”

The Word Lens team, who were acquired around the same time Jetpac was, soon gave me a vivid example of this. Google Translate already had a visual translation feature for signs and menus, and the evaluation scores on test datasets were higher than Word Lens’s model achieved. What surprised the Google product managers was that consumers still preferred the Word Lens app over Google Translate for this use case, despite the lower metrics. It turned out the key difference was latency. With Google Translate you snapped a picture, it was uploaded to the server, and a result was returned in a second or two. Word Lens ran at multiple frames per second. This meant that users got instant on-screen feedback about the results, and would jiggle the camera angle until it locked on to a good translation. Google Translate had a higher chance of providing the right translation for a single still image, but because Word Lens was interactive, users ended up with better results overall. Smart product design allowed them to beat Google’s best models, despite apparently falling short on metrics.

I was thinking of this again today as I prepared a data sheet for a potential customer. They wanted to know the BLEU score for our on-device translation solutions. Calculating this caused me almost physical pain because while it remains the most common metric for evaluating machine translation, it doesn’t correlate well with human evaluations of the quality of the results. BLEU is a purely textual measure, and it compares the actual result of the translation word by word against one or more expected translations prepared as ground truth by fluent speakers of the language. There are a lot of problems with this approach. For example, think of a simple French phrase like “Le lac est très beau en automne”. One translation could be “The lake is very beautiful in the autumn”. Another could be “The lake is very pretty in the fall”. “In the fall, the lake’s very pretty” would also be a fair translation that captures the meaning, and might read better in some contexts. You can probably imagine many more variations, and as the sentences get more complex, the possibilities increase rapidly. Unless the ground truth in the dataset includes all of them, any results that are textually different from the listed sentences will be given a low accuracy score, even if they convey the meaning effectively. This means that the overall BLEU score doesn’t give you much information about how good a model is, and using it to compare different models against each other isn’t a reliable way to tell which one users will be happy with.
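To show what I mean, here’s a toy sentence-level version of the calculation. Real evaluations use a proper toolkit like sacreBLEU and score whole test sets rather than single sentences, so treat this as a simplified sketch, but it’s enough to see how translations that capture the meaning perfectly can still score badly against a single reference.

```javascript
// Toy sentence-level BLEU: clipped 1- to 4-gram precision, add-one smoothing,
// and a brevity penalty. A simplified sketch, not a replacement for sacreBLEU.

function tokenize(sentence) {
  // Naive tokenization: lowercase, strip basic punctuation, split on whitespace.
  return sentence.toLowerCase().replace(/[.,!?]/g, "").split(/\s+/);
}

function ngramCounts(tokens, n) {
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(" ");
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

function toyBleu(hypothesis, reference, maxN = 4) {
  const hyp = tokenize(hypothesis);
  const ref = tokenize(reference);
  let logPrecisionSum = 0;
  for (let n = 1; n <= maxN; n++) {
    const hypCounts = ngramCounts(hyp, n);
    const refCounts = ngramCounts(ref, n);
    let matched = 0;
    let total = 0;
    for (const [gram, count] of hypCounts) {
      // Clip matches by how often this n-gram appears in the reference.
      matched += Math.min(count, refCounts.get(gram) || 0);
      total += count;
    }
    // Add-one smoothing so a single missing n-gram order doesn't zero the score.
    logPrecisionSum += Math.log((matched + 1) / (total + 1)) / maxN;
  }
  // Brevity penalty: penalize hypotheses that are shorter than the reference.
  const bp = hyp.length >= ref.length ? 1 : Math.exp(1 - ref.length / hyp.length);
  return bp * Math.exp(logPrecisionSum);
}

const reference = "The lake is very beautiful in the autumn";
console.log(toyBleu("The lake is very beautiful in the autumn", reference)); // 1.0
console.log(toyBleu("The lake is very pretty in the fall", reference));      // ≈0.51
console.log(toyBleu("In the fall, the lake's very pretty", reference));      // ≈0.24
```

All three hypotheses are perfectly good translations of the French, but only the one that happens to match the reference word for word gets a high score.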

So why does BLEU still dominate the machine translation field? Model creators need a number that’s straightforward to calculate to optimize towards. If you’re running experiments comparing changes to datasets, optimization techniques, and architectures, you need to be able to quickly tell which seem to be improving the results, and it’s impractical to evaluate all of these by A/B testing them with actual users. The only way to iterate quickly and at scale is with metrics you can run in an automated way. While BLEU isn’t great for comparing different models, relative changes do at least tend to correlate with improvements or declines for a single model. If an experiment shows that the BLEU score has increased significantly, there’s a good chance that users will be happier with this version of the model compared to the original. That makes it a helpful directional signal.

This is why people who are actively working on training models are obsessed with benchmarks and metrics. They sound boring to outsiders, and they’re inherently poor approximations of the properties you actually need in your product, but without them it’s impossible to make progress. As George Box said – “All models are wrong, but some are useful”. You can see this clearly with modern LLMs. In general I’m pretty skeptical about the advantages OpenAI and Anthropic gain from their scale, but they have millions of people using their products every day and have the data to understand which metrics correlate with customer satisfaction. There are lots of external efforts to benchmark LLMs, but it’s not clear what they tell us about how well the models actually work, or which are best.

This is important because a lot of big decisions get made based on benchmarks. Research papers need to show they beat the state of the art on commonly accepted metrics to be published. Companies get investment funding from their benchmark results. The output and content of the LLMs we use in our daily lives are driven by which metrics are used during their training process. What the numbers capture and what they miss has a direct and growing impact on our world, as LLMs are adopted in more and more applications.

That’s a big reason why Natalie and I started the AI Benchmark Club meetup in SF. There are a lot of AI events in the Bay Area, but if you’re actually training models from scratch, it can be hard to find other people facing similar challenges amongst all the business, marketing, and sales discussions that often dominate. The nice thing about benchmarks is that they sound unimportant to everyone except those of us who rely on them to build new models. This works as a great filter to ensure we have a lot of actual researchers and engineers, with talks and discussions on the practical challenges of our job. As Picasso said – “When art critics get together they talk about content, style, trend and meaning, but when painters get together they talk about where can you get the best turpentine”. I think benchmarks are turpentine for ML researchers, and if you agree then come join us at our next meetup!

Why does a Local AI Voice Agent Running on a Super-Cheap SoC Matter?

Most recent news about AI seems to involve staggering amounts of money. OpenAI and Nvidia sign a $100b data center contract. Meta offers researchers $100m salaries. VCs invested almost $200b in AI startups in the first half of 2025.

Frankly, I think we’re in a massive bubble that dwarfs the dot-com boom, and we’ll look back on these as crazy decisions. One of the reasons I believe this is that I’ve seen how much is possible running AI locally, with no internet connection, on low-cost hardware. The video above is one of my favourite recent examples. It comes from a commercial contract we received to help add a voice assistant to appliances. The idea is that when a consumer runs into a problem with their dishwasher, they can press a help button and talk to get answers to common questions.

What I’m most proud of here is that this is cutting-edge AI actually helping out with a common issue that many of us run into in our daily lives. This isn’t speculative, it’s real and running, and it doesn’t pose a lot of the ethical dilemmas other AI applications face. Here’s why I think this matters:

  • The consumer doesn’t have to do anything beyond pressing a button to use it. There’s no phone app to download, no new account to create, and no Wifi to set up. The solution works as soon as they plug the appliance in. This is important because less than half of all smart appliances ever get connected to the internet.
  • It’s using Moonshine and an LLM to do a much better job of understanding natural speech than traditional voice assistants. The questions I asked in the demo were off-the-cuff, I deliberately used vague and informal language, and it still understood me.
  • It addresses a genuine problem that manufacturers are already paying money to solve. They are currently spending a lot on call centers and truck rolls to help consumers. This solution has the potential to reduce those costs, and increase consumer satisfaction, by offering quick answers in an easy way.
  • Running locally means that audio recordings never have to go to the cloud, increasing privacy.
  • Local also means fast. The response times in the video are real, this is running on actual hardware.
  • This doesn’t require a GPU or expensive hardware. It runs on a Synaptics chip that has just launched, and will be available in bulk for low-single-digit dollars. This means it can be added to mass-market equipment like appliances, and even toys. Since it’s also able to run all the regular appliance control functions,  it can replace similarly-priced existing SoCs in those products without raising the price.
  • More functionality, like voice-driven controls, can easily be added incrementally through software changes. This can be a gateway to much richer voice interactions, all running locally and privately.

All these properties give local AI a much better chance to change our daily lives in the long term, compared to a chat bot that you access through a text box on a web page. AI belongs out in the world, not in a data center! If you agree, I’d love to hear from you.

How to Try Chrome’s Hidden AI Model

A black dog with a pink towel over its head, against a background of white tiles.

There’s an LLM hiding in Chrome. Buried in the browser’s basement, behind a door with a “Beware of Leopard” sign.

But I’ll show you how to find it. In a couple minutes, you’ll have a private, free chatbot running on your machine.

Instructions
We’re going to enable some developer flags in desktop Chrome so you can get full access to the AI model. We have to do this because the functionality is only being slowly rolled out by Google, and by turning on these developer options we can skip to the front of the line. There’s also a screencast version of these instructions if you’d like to follow along on YouTube.

You’ll need access to Chrome’s internal debugging pages to try out the model, so enter chrome://chrome-urls/ into the URL bar, scroll down, and click on “Enable internal debugging pages”.

Next type or copy and paste chrome://flags/#prompt-api-for-gemini-nano-multimodal-input into the URL bar.

Click on the “Default” drop-down menu, choose “Enabled”, and then relaunch Chrome.

If you’re familiar with the console you can copy and paste “await LanguageModel.availability();” to trigger the next step, but I’ve also created this page to make it easier for non-developers to do it by just clicking a button.

Next, type or copy and paste the URL “chrome://on-device-internals/”. In that page, click on “Load Default” and you should see a message confirming that the model has been downloaded.

Now you have access to the Gemini Nano LLM running locally in Chrome! You can enter text in the input box, and it will respond just like a cloud-based chatbot.

To verify this is truly happening locally, you can turn off the wifi and enter new prompts. You can even use it to transcribe audio, or analyze images.
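If you’re comfortable with the DevTools console, you can also talk to the model directly from code. The availability check is the same call as above; the create and prompt calls below are my reading of the current Prompt API surface, which may still change between Chrome versions, so treat this as a sketch rather than a stable interface.

```javascript
// Run in the DevTools console after enabling the flag and downloading the model.
// The method names may change in future versions of Chrome.

// Check whether Gemini Nano is ready, needs a download, or is unsupported on this machine.
const status = await LanguageModel.availability();
console.log(status); // e.g. "available" once the download has finished

if (status !== "unavailable") {
  // Create a session; on some machines this is what kicks off the model download.
  const session = await LanguageModel.create();

  // Everything below runs locally, so it keeps working with the WiFi turned off.
  const reply = await session.prompt("Summarize why local LLMs matter in two sentences.");
  console.log(reply);
}
```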

Why does this matter?

It’s free: These models work with the PC you have and require no subscriptions. Your usage is only limited by the speed of the model.

It’s 100% privacy-safe: None of your questions or answers leave your PC. Go ahead, turn off your WiFi and start prompting – everything works perfectly.

It works offline: The first time I used a local model to help with a coding task while flying on an airplane without WiFi, it felt like magic. There’s something crazy about the amount of knowledge these models condense into a handful of gigabytes.

It’s educational: This is the main reason you should bother with local LLMs right now. Just trying out this model demystifies the field, and should be an antidote to the constant hype the AI industry fosters. By getting your hands just slightly dirty, you’ll start to understand the real-world trajectory of these things.

It’s the future: Local models are only getting better and faster, while cloud-based chatbots like Claude and ChatGPT plateau. The market is inevitably going to shift to free models like this that are integrated into platforms and operating systems.

Why Speech to Intent is so Vital for Voice

When I first tried ChatGPT, it blew my mind. Its ability to respond intelligently to almost any prompt I gave it was astonishing, and it was obvious to me that this was the future. It seemed like we’d finally built the kind of AI we’ve all seen in the movies. Over time though, one big limitation of these models became clear – they’re all talk and no action. By that I mean they’re fantastic for anything that requires generating text, but persuading them to make something happen is a lot harder. For example, we can now build a model that can have a natural conversation with a person, just like HAL 9000, but if you ask it to open the pod bay doors, there’s no easy way to connect the LLM’s output to those doors’ controls.

The challenge of converting something somebody said into an action is known as the “speech to intent” problem in the research world. If you’ve ever used a voice assistant, you’ll know that you have to be careful about how you phrase requests. “Alexa, living room lights on” may work, but “Alexa, turn on the lights in the living room” might not. If you were talking to a person, you wouldn’t have this problem; they would be able to understand what you meant even if you didn’t use the exact phrase they were expecting. In natural conversations we’re just as likely to say something like “Can you hit the switch for the lights by the TV?” or “We need light in the living room”, and we’d expect someone else to understand. Solving speech to intent means recognizing all of those possible natural language phrases as inputs, and outputting a structured result that unambiguously tells the rest of the system to turn a particular light on.
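In code terms, the goal looks something like this. The utterances are the ones from the paragraph above, and the intent schema is hypothetical; the point is the kind of unambiguous, machine-readable output the rest of the system needs.

```javascript
// Four very different phrasings of the same request...
const utterances = [
  "Alexa, living room lights on",
  "Turn on the lights in the living room",
  "Can you hit the switch for the lights by the TV?",
  "We need light in the living room",
];

// ...that should all collapse into one structured intent. This schema is made
// up for illustration; what matters is that downstream code only ever sees
// fields it can act on directly, never the raw phrasing.
const targetIntent = {
  action: "turn_on",
  device: "light",
  location: "living_room",
};
```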

As you can probably tell from your own experiences with voice assistants, this problem is far from solved. A lot of current solutions still work a lot like Infocom text games from the 80’s – here’s a genuine example from Azure’s “AI Services”:

You might already be able to spot a few problems with this. What if someone said “Go to six” or “Six please”? This kind of pattern matching is very brittle because it either relies on the developer coming up with every likely variation on a command, or the user choosing exactly the expected phrase. Even worse, there’s usually no way for a user to tell what the correct phrases actually are, so the interface is incredibly undiscoverable too! I believe the problems that this rule-based approach causes are a big reason that very few people use voice interfaces. We expect our assistants to be able to understand us when we talk naturally to them, and right now they don’t.
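To make the brittleness concrete, here’s a deliberately simplified stand-in for this style of matching. It isn’t Azure’s actual syntax, and the canonical phrases are invented just to echo the variations above, but the failure mode is the same: anything outside the enumerated list falls on the floor.

```javascript
// A toy fixed-phrase matcher (not Azure's real API), showing why enumerating
// trigger phrases by hand breaks down as soon as people speak naturally.
const patterns = new Map([
  ["take me to floor six", { action: "goto_floor", floor: 6 }],
  ["go to floor six", { action: "goto_floor", floor: 6 }],
]);

function matchIntent(utterance) {
  return patterns.get(utterance.toLowerCase().trim()) ?? null;
}

console.log(matchIntent("Go to floor six")); // matches
console.log(matchIntent("Go to six"));       // null, even though the meaning is obvious
console.log(matchIntent("Six please"));      // null again, and the user has no idea why
```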

Large Language Models seem to be great at understanding people, so are they the solution? I think they will be soon, but the best paper I’ve found on this approach shows we still have some work to do. The authors’ experiments show that you can get results as good as the non-LLM state of the art by using ChatGPT 3.5 on a simple intent classification task (table 3), but the LLM approach is much worse when the requirements are tougher (table 4). ChatGPT also struggles with the kinds of word errors that show up on transcribed text. I’m optimistic that we can solve these issues (and we’re actively working on this at Useful) but it will require some new approaches to training and using models.
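One way to get a feel for the LLM approach is to wire the local Prompt API from the Chrome section above into a constrained prompt: describe the allowed intents, ask for JSON only, and validate whatever comes back. This is only a sketch of the general idea, not how any production system (ours included) works, and small models will still make the kinds of errors the paper describes.

```javascript
// Toy LLM-based intent parsing, using Chrome's local Prompt API purely because
// it's easy to try. The intent list and prompt format are made up for illustration.

const INTENT_INSTRUCTIONS = `Convert the user's request into JSON.
Allowed actions: "turn_on", "turn_off". Allowed devices: "light", "tv".
Allowed locations: "living_room", "kitchen", "bedroom".
Reply with only a JSON object, like {"action": "...", "device": "...", "location": "..."}.`;

async function speechToIntent(transcript) {
  const session = await LanguageModel.create();
  const raw = await session.prompt(`${INTENT_INSTRUCTIONS}\n\nUser request: "${transcript}"`);
  try {
    // Models don't always return clean JSON, so parsing and validating the output is essential.
    return JSON.parse(raw);
  } catch {
    return null; // e.g. fall back to asking the user to rephrase
  }
}

// await speechToIntent("Can you hit the switch for the lights by the TV?")
// should come back as something like {"action": "turn_on", "device": "light", "location": "living_room"}
```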

So, why is speech to intent so important? I believe it’s the last missing piece before we finally have voice interfaces that are a joy to use! Imagine leaning back on your couch with your laptop open and browsing purely through speech. Blade Runner has a beautiful example of how this might work in its zoom and enhance scene:

Of course I’m more likely to be buying jeans from Zappos than playing robot detective, but almost any interactive experience can be improved with a voice interface that actually understands people. Speech won’t replace keyboards or touch screens, we’ll still be typing into spreadsheets, but there will be a lot of cases where it will be the easiest way to interact. This change won’t just be an incremental one, it will open up experiences on devices that have never been possible before. If voice truly works, you’ll be able to use your TV to browse the web, get a quick summary of a page from your smart speaker, or work with apps from your AR or VR devices. It will free us from remote controls and having to physically touch something to make it work. If you’re using voice, then the results can be displayed on any screen that’s convenient, and computing becomes much more ambient, rather than something you have to carry around with you.

This is why I’m so excited to be working on this problem. We’ve been suffering through a long voice interface winter, but almost all of the ingredients are in place to make speech work. If we can persuade LLMs to turn their words into deeds, then we’ll finally be able to talk to machines like we can to people, and I think that will be glorious.