Meet Fiona and Abby

June 4, 2026 By Pete Warden in Uncategorized 2 Comments

Since it’s coming up to their “gotcha” anniversaries I would like to introduce you to Fiona (top) and Abby (below). After we lost our beloved Minpin, we kept an eye out for other Miniature Pinschers available for adoption, and Joanne spotted Fiona at Beyond Rescue. When we met her in person she was very anxious, but once she rolled over for belly rubs I knew she had to come home with us. It turned out she had good reason for her anxiety: she had been badly savaged by another dog while she was being fostered, and so she was very fearful when out in the world. Thankfully with a lot of love and behavioral work she is now far more comfortable. She can still be reactive to other dogs out in the world, and barks up a storm when anyone in the street slams a car door, but she’s very food motivated, so crackers and a great dog walker (thanks Rhys!) have done wonders. When we gave her a DNA test, it turned out she’s 100% Miniature Pinscher, which isn’t something we were fixated on, but it is lovely when we see the traits she shares with Minpin. She’s a tiny dancer whenever there’s a chance of a snack, loves to gnaw on things, and is chatty and vocal pretty much all the time. It’s almost exactly a year since she joined us and I can’t imagine life without her.

Abby has been with us for two years, and we found her at Milo Rescue. Joanne knew I love playing fetch, thanks to a childhood with a Jack Russell as my companion, and her website photo had a ball in her mouth. It turns out that was truth in advertising because there’s nothing she likes more than running after a tennis ball, and she will happily lounge around the house with one in her mouth if we let her. She’s incredibly athletic, seeing her leap into the air for a catch is incredible, and I’ve never felt like such a celebrity when strangers oohh and aahh as she trots around holding the ball. Thankfully for my old bones she loves snoozing when she’s not at the park, and while they’re not yet bosom buddies, she has reached an accommodation with Fiona so they sit either side of me on the couch when I work from home, and coexist in our bed at night. We don’t know much about her background, but we think she was on the streets before we found her. Despite that, she has mostly overcome her anxiety, though she will lick my face for hours and will occasionally freak out if she thinks another dog is after her ball, or a toy she’s claimed. A DNA test showed that she was about 15% of everything; she’s truly unique. One of her superpowers is that she has a jaw stronger than a steel vice, so getting a ball out of her mouth after she has decided to hold onto it is impossible. The motivation of another throw is always enough for her to relinquish it eventually though, and just like Fiona, she’s deeply embedded in our family now.

Abby in her natural habitat, playing fetch by the sea

The warm air from our furnace brings them together

They both have a *lot* of energy, so they give me a work out on the local hills

I do think she’s ready for a modeling career

Launching a free, open-source, on-device transcription app

February 27, 2026 By Pete Warden in Uncategorized Tags: ai, artificial-intelligence, chatgpt, technology, writing 2 Comments

TL;DR – Please try Moonshine Note Taker on your Mac!

For years I’ve been telling people that AI wants to be local, that on-device models aren’t just a poor man’s alternative to cloud solutions, and that for some applications they can actually provide a much better user experience. It’s been an uphill battle though, because models all start in a datacenter and using cloud APIs is often so much easier for developers. There was a saying at Google that a picture is worth a thousand words, but a working demonstration is worth a thousand pictures, so with the release of the new Moonshine models I decided to show the advantages in a tangible way.

As a CEO my primary job seems to be joining meetings to nod sagely along while I try to figure out what’s going on, and to remember what we decided in previous meetings. Like a lot of people whose job involves this kind of work, I’ve found AI meeting note taking and transcription apps increasingly useful, but I kept wishing the user experience was better:

It was often hard to correct or format the transcriptions, especially during meetings.
The results would end up on a website I’d have to log into, or in my inbox, when I usually just want to save them on my laptop.
Even if an app gave me a live view, there was usually a long delay before text appeared, and it didn’t update very frequently.
I found trying to review the notes afterwards more difficult than it needed to be. I often wanted to hear the recording for an important sentence to help my understanding, and most apps don’t let you do that.
Trusting a startup to store and protect very sensitive conversations makes me nervous. Servers full of thousands of people’s meetings are always going to be tempting targets for hackers, and you never know when a startup’s business model will change.
I already have a thousand subscriptions, keeping track of them is a pain, and there were often usage limits even when I did pay.

I was also frustrated as an engineer that using the cloud for this use case was an inelegant solution. Speech to text deserves to be a core operating system function, just like keyboard drivers, and using the cloud adds unneeded complexity.

To address these issues, I’ve just released the first version of Moonshine Note Taker, for Macs.

You can edit and lay out the notes as people are talking with no delay, and using a familiar native Apple interface.
The results are .transcript files that you save just like any other document, locally on your machine, never touching the cloud.
The transcriptions show up almost instantaneously.
Audio is saved alongside the transcription, and playing back a particular section is as simple as selecting the text or moving the caret and pressing the play button.
There is absolutely no connection to the cloud. All data is kept entirely on your drive, and can be deleted instantly whenever you decide. Because it’s local, your app will never be bricked by an acquisition or pivot either.
Because I don’t have to pay server costs, I can afford to make this free and open source without losing money, and I’ll never have to impose usage limits.

If you get a chance, please give it a try and let me know what you think. I’m hoping this will be a tangible demonstration of the power of local AI, and inspire more integrations of the Moonshine framework into new and existing applications, so feedback will help a lot.

Announcing Moonshine Voice

February 13, 2026 By Pete Warden in Uncategorized 4 Comments

Today we’re launching Moonshine Voice, a new family of on-device speech to text models designed for live voice applications, and an open source library to run them. They support streaming, doing a lot of the compute while the user is still talking so your app can respond to user speech an order of magnitude faster than alternatives, while continuously supplying partial text updates. Our largest model has only 245 million parameters, but achieves a 6.65% word error rate on HuggingFace’s OpenASR Leaderboard compared to Whisper Large v3 which has 1.5 billion parameters and a 7.44% word error rate. We are optimized for easy integration with applications, with prebuilt packages and examples for iOS, Android, Python, macOS, Windows, Linux, and Raspberry Pis. Everything runs on the CPU with no NPU or GPU dependencies, and the code and streaming models are released under an MIT License.
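For readers who haven’t met the metric before, word error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. Below is a minimal Python sketch of that definition. The function name and the simple lowercase-and-split normalization are assumptions made for illustration; the leaderboard applies its own text normalization, so this will not reproduce its published numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which is simpler than what real benchmarks use.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because the denominator is the reference length, a transcript with many inserted words can score above 100%, which is worth remembering when comparing numbers across datasets.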

We’ve designed the framework to be “batteries included”, with microphone capture, voice activity detection, speaker identification (though our diarization has room for improvement), speech to text, and even intent recognition built-in, and available through a common API on all platforms.

As you might be able to tell, I’m pretty excited to share this with you all! We’ve been working on this for the last 18 months, and have been dogfooding it in our own products, and I can’t wait to see what you all build with it. Please join our Discord if you have questions, and if you do find it useful, please consider giving the repository a star on GitHub, that helps us a lot.

De-ICE Disco at the Googleplex

January 31, 2026 By Pete Warden in Uncategorized Tags: google, ice, immigration, minnesota, politics, writing 2 Comments

When Renee Good and Alex Pretti were murdered, and I saw the incredible courage of people in Minneapolis in the face of state brutality, I had to find some way to show that tech workers stand with Minnesota, even if our leaders don’t. I signed the ICEout petition, and I’d encourage you to do the same. I’ve also been talking to the press about why I signed it, and on the Wired Uncanny Valley podcast Kate Drummond asked what the next steps were for me. Off the cuff I said I wanted an in-person event, but at that point I had no idea what that might be.

As some of you know, I’ve been going to protests at the SF Tesla dealership since March 2025. The energy and solidarity I’ve experienced there has been a big part of what’s kept me going during all the dark times. Once I saw the incredible footage of Seth Todd facing off against federal agents in an inflatable frog costume in October I knew that was a way I could use my natural goofiness to fight what’s happening. I immediately bought the same costume (yes, I know, Amazon) and started attending the Tesla Takedowns in it. It seems to have had an impact, I encourage cars to honk in return for more dancing, and I often have other protestors, kids, and even passing tourists take selfies with me. For me personally, I enjoy finally getting to cosplay as someone 6′ 6”, and as an introvert who enjoys performing, being hidden inside a suit while drawing attention to the cause is perfect.

After the podcast, I realized I wanted to bring some of the energy from the Tesla protests to a tech event. I thought about setting up a meetup, but that felt too boring. Then I remembered how many of my former colleagues at Google have talked to me about wanting to show their support, but are struggling to find ways to have their voice heard without being targeted. Instead of a traditional protest with speeches, slogans, and signups, maybe we could find another way to be visible. I decided to get a few friends together in Charleston Park, a public park next to the Googleplex in Mountain View, and hold a popup dance party. De-ICE Disco sounded good to me, and so after TGIF, between 5pm and 5:30pm on Thursday (February 5th) we’ll be bopping around in inflatable costumes to disco classics. Join us, costume or not, to show ICE we won’t be intimidated, that we’ll protect our neighbors and colleagues when they come, and that we stand with Minneapolis.

I’ve never done anything like this before, but it’s the best way I can think of to show the world that there are Googlers and Xooglers who care, and to recognize the courage of those in Minnesota who are standing up to ICE at great personal risk. De-ICE Disco isn’t an organization, just an idea, and it’s not affiliated with ICEout.tech, but I’m hoping it will be another way to push back against what’s happening to our country. I’d love to see any of you who can make it on Thursday, and please do share with anyone else who might be interested. Let’s fight fascism and have fun!

Speech Embeddings for Engineers

January 30, 2026 By Pete Warden in Uncategorized Leave a comment

Deciding who said what is one of the most common tasks when dealing with live speech, but there’s less information available about it than other parts of the pipeline like transcription or voice-activity detection. I’ve been doing more work on speaker identification recently, for an upcoming open source project I’ll be excited to share soon, and I realized I was hazier on some of the practical details than I’d like. As any teacher knows, the best way to find the holes in your own knowledge of a topic is to try to explain it to someone else, so I decided to write a step-by-step Python notebook explaining the basics of speech embeddings with working examples inline.

If you’re able to run in a cloud environment and you’re not resource constrained, you don’t need to understand how these embeddings work. You can find plenty of open source packages and commercial APIs that handle speaker identification (aka diarization) for you. When you’re targeting mobile or edge platforms you may not have access to those conveniences, and that’s where understanding what’s happening under the hood can help you figure out how to tackle the problem.
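To give a flavor of what “under the hood” can mean here, the sketch below shows one common way to use speaker embeddings: compare a new embedding against stored reference embeddings with cosine similarity, and accept the best match only if it clears a threshold. This is not taken from the notebook. The function names, the three-dimensional toy vectors, and the 0.7 threshold are all hypothetical, and real embeddings have hundreds of dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def identify_speaker(embedding, known_speakers, threshold=0.7):
    """Return the best-matching known speaker, or None if nobody is close.

    known_speakers maps a name to a reference embedding. The threshold is
    an arbitrary placeholder and would need tuning on real data.
    """
    best_name, best_score = None, threshold
    for name, reference in known_speakers.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Toy vectors standing in for real model outputs.
    known_speakers = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
    print(identify_speaker([0.9, 0.1, 0.0], known_speakers))  # close to alice
    print(identify_speaker([0.0, 0.0, 1.0], known_speakers))  # matches nobody
```

Returning None for an unmatched voice leaves room for the caller to enroll a new speaker, which is roughly what a live diarization loop has to do when somebody new starts talking.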

Anyway, I hope this trail of breadcrumbs helps someone else, even if it’s through an AI model that scrapes this!

See my friend Annie edit videos with her eyes

January 22, 2026 By Pete Warden in Uncategorized Leave a comment

I first met Annie through her work with Muttville, a local non-profit for adopting senior dogs where we found our MinPin. My wife started editing videos to help get more dogs adopted, and Annie was another volunteer doing the same work. I initially got to know her through her videos, where she did an amazing job bringing out the personalities of all the pups she was showcasing. All of her work has a happy energy, and when I met her in person I realized that all came from her.

I was also astonished to find out that she was producing multiple videos a week using just her eyes. Her disability means this is the most effective way for her to interact with a computer, and she has become very proficient with the interface, to the point that she’s playing Far Cry better than I can.

Because Annie’s doing such extraordinary work, my wife Joanne decided to collaborate with her on a documentary about her daily life, to share her story and have her voice heard more widely. You can check out the one-minute trailer above, and find the full documentary here.

Annie is a big fan of Canva, and uses it for all her video editing. She would love to connect with anyone who works there to pass along her thanks for an application which enables so much creativity. If any of my readers know someone at the company, please pass this along!

TV Shows I Love That Nobody’s Ever Heard Of

December 15, 2025 By Pete Warden in Uncategorized Tags: movies, television, tv, tv-shows, writing 2 Comments

A big reason I started this blog (almost twenty years ago!) was to have a safe space to rant about things I’m obsessed with. One of those obsessions is TV, but growing up in the UK and living in the US most of my adult life has left me with tastes that don’t seem to match up with anyone’s demographic. That means I spend a lot of time trying to find shows that I enjoy, and while I hope I’m not a snob (I watched almost every 9-1-1 show, love Rob Lowe and Angela Bassett) I do sometimes discover obscure programs that I can’t believe aren’t better known. Here’s my brain dump of recent TV shows I’ve loved that I feel didn’t get the audiences they deserved.

Harlots

Despite the risque title and setting, this period drama is a razor-sharp examination of power, class, and gender politics. Based very loosely on a historical guide to the prostitutes of Covent Garden, the three seasons follow the fight of a group of women to find their own space and safety in 1760s London. It features some top-tier performances from actors like Lesley Manville, Kate Fleetwood (whose stunning cheekbones you may know from Wheel of Time), Holli Dempsey, Julian Rhind-Tutt, and Liv Tyler. The story moves fast, it’s often a pitch-black comedy, and the stakes always feel high. In the US you can find its three seasons on Hulu.

Killjoys

This was a show that I thought I’d hate based on first impressions, but two seasons in I’m hooked. It’s a throwback to a time before scifi shows had to be prestige TV, a space western with a non-existent budget but strong writing that doesn’t take itself too seriously. It jumps right into archetypes we’ve seen before, but manages to breathe a lot of life into some stale cliches. It has hints of other Canadian productions like BSG and Orphan Black in its best moments, playing with a lot of the themes of identity, and always entertains. I’ve been watching it on Apple TV.

The Equalizer

I have to admit this one is a guilty pleasure. Did you know that Queen Latifah starred in an updated version of the old Edward Woodward show for five seasons? I love her, which helped me get through the crazily ridiculous plots of most episodes. She wears sweaters that only she could pull off, is a badass assassin, and generally has an incredible amount of fun onscreen. Sometimes I just need a show where I can turn off my brain and be swept along, and this definitely scratches that itch. I watch it on Amazon Prime.

The Bureau

A French spy thriller that focuses on the flow, denial, and corruption of intelligence in what feels like a very grounded and realistic way. Nobody here is 007, villains and heroes aren’t clearly separated, and everyone is working within larger systems that constrain their actions. A lot of the elements even felt familiar from my decades working in an office, going against the bureaucracy often leads to disaster, and unlike most US thrillers there’s a real price to pay for going rogue. The writing, world, and characters are fresh and absorbing, this show hooked me in a way few others have. I watched it on Amazon Prime.

This Fool

A Chris Estrada comedy set in LA, this show was one of the funniest things I’ve seen in years. The whole cast is spot on, with Michael Imperioli giving a scene-stealing performance as the broken-down Unitarian minister running “Hugs not Thugs”, the non-profit that Chris’s uptight Julio is drawn into by his bad boy cousin, who’s trying to go straight. The comic chemistry between Julio and his cousin played by Frankie Quiñones is perfect, and Michelle Ortiz brings crazy-eyed energy as Julio’s sometime-girlfriend. Short and sweet, I watched this on Hulu.

Britannia

Game of Thrones’ deranged younger cousin, this show starts with Donovan’s Hurdy Gurdy Man as the theme song, and gets weirder from there. Set during the Roman invasion of Britain, it manages to make the past seem truly alien in a way I’ve never seen before. It helps that David Morrissey, Zoe Wanamaker, Mackenzie Crook, Kelly Reilly (you may know her from Yellowstone) and Julian Rhind-Tutt (again) are absolutely committed to their roles. This is a world where everyone believes in spirits, gods, and demons to a terrifying extent, and the show does an excellent job leaving the viewer unsure of whether what they’re seeing is truly supernatural or just the consequences of fanatical belief. David Morrissey’s Roman general manages to be charming, even sympathetic, while behaving in monstrous ways, and Eleanor Worthington-Cox brings depth to a teenage role that could easily have been lightweight, even irritating if it wasn’t handled carefully. I watched it on Prime.

I’ve only made it partway down my mental list of shows I want to feature, but dinner calls, so I guess this post will be part of a series? Stay tuned for more, and let me know any shows that might fit my sensibilities in the comments!

How I Screwed Up Sales Hiring

December 5, 2025 By Pete Warden in Uncategorized Tags: business, digital-marketing, entrepreneurship, marketing, startup 2 Comments

I founded Moonshine back in 2022, together with Manjunath, another engineer and researcher. My entire career up until that point had been working on consumer products, so I felt very comfortable with how those are sold, and I thought to myself “How hard can B2B sales be?”. The answer, of course, is very hard!

My investors knew that before I did, and pushed me to hire a senior sales person to make up for my lack of experience. It’s taken me three years and multiple failed attempts to build a working sales team, mostly because I didn’t even know enough to ask the right questions. The biggest mistake I kept making was hiring people with ten or twenty years of enterprise sales experience. This wasn’t because they were bad at their jobs, everyone who made it through our interview process had done amazing things at larger companies, but I set them up to fail at my startup. Here’s why:

Startup Sales aren’t Enterprise Sales

Experienced sales people are used to being given a list of qualified leads, a clear set of sales materials, and in general a “repeatable sales motion” that they can follow to close deals. There’s a whole world of Sales Development Representatives (SDRs) who handle finding and qualifying leads through cold-calling, LinkedIn, searching the web, etc. These are junior roles that hires new to sales are given when they start, and people who want to focus on sales usually graduate from them within six months to a year.

Any sales person with experience won’t have had to generate their own leads for a long time, they’re used to having a team behind them. Even if they’re willing to roll up their sleeves and commit to what’s considered a low-status job, they won’t have a good idea of how to do SDR for a novel product.

Startup Incentives are Long Term

One of the best sales people I met described himself as “coin operated”, and the usual incentive structure is set up to reinforce that attitude, since sales people make most of their earnings through commissions on a quarterly basis. This isn’t a good fit with an early-stage startup because you’re probably going to be making proof-of-concept deals initially where the time to close is uncertain and the revenue is small. A 10% slice of that isn’t interesting compared to the steady, large income stream they get at an established company. The alternative is setting up performance-based bonuses (for example $x for each paid pilot signed) but even that is unlikely to be a very compelling amount for them.

The hope of course is that you can convince candidates to focus on the stock they can earn, but coming from a world where incentives are liquid cash they get within a couple of months, it’s a hard perspective switch to make. They’ve chosen comparatively low-risk compensation for years, why are they going to change now?

Market Discovery

If there’s one thing I’m certain of, it’s that you won’t end up selling to the companies you thought you would at the start. As you learn more about your product and people’s needs, you’ll inevitably adjust who you’re targeting. This is a problem because most senior sales people have a lot of experience in a particular industry, but those skills aren’t portable. They may know the customer needs and have warm relationships with key players in one market, but when your startup changes focus they’ve lost all of those advantages that they’ve spent years building. Even changing the sales model within a single industry will have a big impact on their effectiveness. Someone who has spent years doing high-touch, long sales cycle engagements is going to be starting from scratch if you move to self-serve subscriptions.

So, What Has Worked?

If hiring established sales leaders didn’t work for us, what has?

The first thing I had to learn was that a lot of the work I was thinking of as sales was actually business development. Closing deals is a job for sales, but there will be a lot of other steps before that (like figuring out which role in an organization to reach out to, developing materials, finding conferences where decision makers attend) that are much more about BD. Think about hiring someone with those skills first, before you get a sales person.

What worked for us was finding somebody super-keen who had a business background, but was early in their career, and was willing to take on the time-consuming BD work with a song in their heart. The feedback has been that it’s great experience for them, and a lot more interesting than most MBA jobs at that level.

You should also prepare to spend a lot of time on sales yourself. The first few sales are going to be founder-led, and there’s a lot to learn to be successful, so take it as a serious time commitment. Customers prefer talking to founders over salespeople. Founders know the product better than anyone, can answer technical questions, and bring the passion. If you can get to the point where there’s a license to be closed, you have a much better chance of making it happen than anyone else in the company.

Happily you don’t have to go it alone. Good advisors can be incredibly helpful in figuring out domain-specific and process-related questions, as well as being able to introduce you to the people you should be talking to. Find someone who’s got a lot of experience and contacts in the industry and get them excited about what you’re doing, they can be a massive help. A lot of good later-career people are bored because their job is no longer as challenging, so they can be surprisingly open to taking an advisory role for equity. Think about people like lawyers in your field too, they are often very well connected and will know a lot about the actual sales process.

There’s so much inertia at most companies, cultivating champions within your target companies is the only effective way I’ve found to make things happen. You need someone who’s willing to be a pest on your behalf to avoid getting stuck in an endless sales purgatory. To get that level of engagement you have to make sure they feel included in your decision making and invested in the success of your startup. One way is to set up an advisory board that includes any promising champions, that way they get bragging rights if you succeed, they can network with other key industry people, and you can give them an advisory stake too, as long as that works ethically.

I’d imagine that having another founder with good sales experience would have saved me learning a lot of these lessons the hard way, but if you’re starting with a technical team, resist the urge to bring in somebody to “handle sales”. It’s so critical to the existence of your startup, it’s not something you can hire your way out of. As CEO, getting those early sales across the line has taken up the majority of my time, even more than product direction and hiring, and I wish I’d embraced that earlier. There are a lot of ways to get help from other people, but at the end of the day only a founder can close those crucial deals.

I Know We’re in an AI Bubble Because Nobody Wants Me 😭

November 29, 2025 By Pete Warden in Uncategorized Tags: ai, artificial-intelligence, chatgpt, llm, technology 13 Comments

I first got into deep learning in 2012, when AlexNet came out. I was CTO of Jetpac, a startup that aimed to provide information about bars, hotels, and restaurants by analyzing public photos, for example finding hipster (and Turk) friendly cafes. The results from the paper were so astonishing I knew AlexNet would be incredibly helpful, so I spent my Christmas holidays heating our house using a gaming rig with two GPUs and the CudaConvNet software, since that was the only way to train my own version of the model.

The results were even better than I’d hoped, but then I faced the problem of how to apply the model across the billions of photos we’d collected. The only GPU instances on Amazon were designed for video streaming and were prohibitively expensive. The CPU support in the Caffe framework was promising, but it was focused on training models, not running them after they’d been trained (aka inference). What I needed was software that would let me run the model at a massive scale on low-cost hardware. That was the original reason I wrote the Jetpac framework, so I could spin up hundreds of cheap EC2 instances to process our huge backlog of images for tens of thousands of dollars instead of millions.

It turned out that the code was small and fast enough to even run on phones, and after Jetpac was acquired by Google I continued in that direction by leading the mobile support for TensorFlow. While I love edge devices, and that’s what I’m known for these days, my real passion is for efficiency. I learned to code in the 80’s demo scene, went on to write PC game engines professionally in the 90’s, and I got addicted to the dopamine rush of optimizing inner loops. There’s nothing quite like having hard constraints, clear requirements, and days to spend solving the puzzle of how to squeeze just a little bit more speed out of a system.

If you’re not a programmer, it might be difficult to imagine what an emotional process optimizing can be. There’s no guarantee that it’s even possible to find a good answer, so the process itself can be endlessly frustrating. The first thrill comes when you see an opening, a possibility that nobody else has spotted. There’s the satisfaction of working hard to chase down the opportunity, and then too often the despair when it turns out not to work. Even then, that means I’ve learned something, and being good at optimization means learning everything you can about the hardware, operating system, the requirements themselves, and studying others’ code in depth. I can never guarantee that I’ll find a solution, but my consolation is always that I have a better understanding of the world than when I started. The deepest satisfaction comes when I do finally find an approach that runs faster, or uses fewer resources. It’s even a social joy, it almost always contributes to a wider solution that the team is working on, making a product better, or even possible in a way it wasn’t before. The best optimizations come from a full stack team that’s able to make tradeoffs all the way from the product manager to the model architects, from hardware to operating system to software.

Anyway, enough rhapsodizing about the joy of coding, what does this have to do with the AI bubble? When I look around, I see hundreds of billions of dollars being spent on hardware – GPUs, data centers, and power stations. What I don’t see are people waving large checks at ML infrastructure engineers like me and my team. It’s been an uphill battle to raise the investment we’ve needed for Moonshine, and I don’t think it’s just because I’m a better coder than I am a salesman. Thankfully we have found investors who believe in our vision, and we’re on track to be cashflow-positive in Q1 2026, but in general I don’t see many startups able to raise money on the promise of improving AI efficiency.

This makes no sense to me from any rational economic point of view. If you’re a tech company spending billions of dollars a month on GPUs, wouldn’t spending a few hundred million dollars a year on software optimization be a good bet? We know that GPU utilization is usually below 50%, and in my experience is often much lower for interactive applications where batches are small and memory-bound decoding dominates. We know that motivated engineers like Scott Gray can do better than Nvidia’s libraries on their own GPUs, and from my experience at Jetpac and Google I’m certain there are a lot of opportunities to run inference on much lower cost CPU machines. Even if you don’t care about the cost, the impact AI power usage has on us and the planet should make this a priority.
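The utilization argument is easy to put into numbers. The sketch below divides the hourly price of a GPU by the fraction of time it does useful work, which gives the effective price of a fully used GPU-hour. The function name and the $2 hourly price are made-up placeholders, not real figures; only the “below 50%” utilization level comes from the paragraph above.

```python
def cost_per_useful_hour(hourly_price: float, utilization: float) -> float:
    """Effective price of one fully used GPU-hour at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_price / utilization


if __name__ == "__main__":
    price = 2.0  # hypothetical dollars per GPU-hour, for illustration only
    for utilization in (0.25, 0.50, 0.75):
        effective = cost_per_useful_hour(price, utilization)
        print(f"{utilization:.0%} utilization -> ${effective:.2f} per useful hour")
```

Under these assumptions, software work that lifts utilization from 50% to 75% cuts the effective cost of the same hardware by a third, whatever the real hourly price turns out to be.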

So, why is this money being spent? As far as I can tell, it’s because of the signaling benefits to the people making the decisions. Startups like OpenAI are motivated to point to the number of GPUs they’re buying as a moat, suggesting that they’ll be the top AI company for years to come because nobody else will be able to catch up with their head start on compute capacity. Hardware projects are also a lot easier to manage than software, they don’t take up so much scarce management attention. Investors are on board because they’ve seen early success turn into long-term dominance before, it’s clear that AI is a world-changing technology so they need to be part of it, and OpenAI and others are happy to absorb billions of dollars of investment, making VCs’ jobs much easier than they would be if they had to allocate across hundreds of smaller companies. Nobody ever got fired for buying IBM, and nobody’s going to get fired for investing in OpenAI.

I’m picking on OpenAI here, but across the industry you can see everyone from Oracle to Microsoft boasting of the amounts of money they’re spending on hardware, and for the same reasons. They get a lot more positive coverage, and a much larger share price boost, from this than they would announcing they’re hiring a thousand engineers to get more value from their existing hardware.

If I’m right, this spending is unsustainable. I was in the tech industry during the dot com boom, and I saw a similar dynamic with Sun workstations. For a couple of years every startup needed to raise millions of dollars just to launch a website, because the only real option was buying expensive Sun servers and closed software. Then Google came along, and proved that using a lot of cheap PCs running open-source software was cheaper and much more scalable. Nvidia these days feels like Sun did then, and so I bet over the next few years there will be a lot of chatbot startups based on cheap PCs with open source models running on CPUs. Of course I made a similar prediction in 2023, and Nvidia’s valuation has quadrupled since then, so don’t look to me for stock tips!

All AI Benchmarks are Wrong, but some are Useful

October 20, 2025 By Pete Warden in Uncategorized Tags: ai, artificial-intelligence, chatgpt, llm, technology

When I was new to Google Brain, I got involved in a long and heated discussion about evaluation numbers for some models we were using. As we walked out of the room, the most senior researcher told me “Look, the only metrics that matter are app store ratings. Everything else is just an approximation.”

The Word Lens team, who were acquired around the same time Jetpac was, soon gave me a vivid example of this. Google Translate already had a visual translation feature for signs and menus, and the evaluation scores on test datasets were higher than Word Lens’s model achieved. What surprised the Google product managers was that consumers still preferred the Word Lens app over Google Translate for this use case, despite the lower metrics. It turned out the key difference was latency. With Google Translate you snapped a picture, it was uploaded to the server, and a result was returned in a second or two. Word Lens ran at multiple frames per second. This meant that users got instant on-screen feedback about the results, and would jiggle the camera angle until it locked on to a good translation. Google Translate had a higher chance of providing the right translation for a single still image, but because Word Lens was interactive, users ended up with better results overall. Smart product design allowed them to beat Google’s best models, despite apparently falling short on metrics.

I was thinking of this again today as I prepared a data sheet for a potential customer. They wanted to know the BLEU score for our on-device translation solutions. Calculating this caused me almost physical pain because, while it remains the most common metric for evaluating machine translation, it doesn’t correlate well with human evaluations of the quality of the results. BLEU is a purely textual measure, and it compares the words and short word sequences in the actual result of the translation against one or more expected translations prepared as ground truth by fluent speakers of the language. There are a lot of problems with this approach. For example, think of a simple French phrase like “Le lac est très beau en automne”. One translation could be “The lake is very beautiful in the autumn”. Another could be “The lake is very pretty in the fall”. “In the fall, the lake’s very pretty” would also be a fair translation that captures the meaning, and might read better in some contexts. You can probably imagine many more variations, and as the sentences get more complex, the possibilities increase rapidly. Unless the ground truth in the dataset includes all of them, any results that are textually different from the listed sentences will be given a low accuracy score, even if they convey the meaning effectively. This means that the overall BLEU score doesn’t give you much information about how good a model is, and using it to compare different models against each other isn’t a reliable way to tell which one users will be happy with.
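To make the problem concrete, here is a minimal sketch of a BLEU-style score applied to the lake example. This is a simplified version I wrote for illustration, not the standard implementation: it only uses unigrams and bigrams (real BLEU usually goes up to 4-grams over a whole corpus), it tokenizes by splitting on whitespace, and the function name `simple_bleu` is my own.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, references, max_n=2):
    """Simplified sentence-level BLEU (illustrative, not the standard tool).

    Clipped n-gram precision for n = 1..max_n, combined by geometric mean,
    then multiplied by a brevity penalty.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty against the reference closest in length to the candidate.
    closest = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) >= closest else math.exp(1 - closest / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "The lake is very beautiful in the autumn"
candidates = [
    "The lake is very beautiful in the autumn",
    "The lake is very pretty in the fall",
    "In the fall, the lake's very pretty",
]
for sentence in candidates:
    print(f"{simple_bleu(sentence, [reference]):.2f}  {sentence}")
```

With a single reference, only the word-for-word match gets a perfect score. The two paraphrases come out at about 0.65 and 0.27, even though a human reader would accept all three.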

So why does BLEU still dominate the machine translation field? Model creators need a number that’s straightforward to calculate to optimize towards. If you’re running experiments comparing changes to datasets, optimization techniques, and architectures, you need to be able to quickly tell which seem to be improving the results, and it’s impractical to evaluate all of these by A/B testing them with actual users. The only way to iterate quickly and at scale is with metrics you can run in an automated way. While BLEU isn’t great for comparing different models, relative changes do at least tend to correlate with improvements or declines for a single model. If an experiment shows that the BLEU score has risen significantly, there’s a good chance that the users will be happier with this version of the model compared to the original. That makes it a helpful directional signal.

This is why people who are actively working on training models are obsessed with benchmarks and metrics. They sound boring to outsiders, and they’re inherently poor approximations to the properties you need for your actual product, but without them it’s impossible to make progress. As George Box said – “All models are wrong, but some are useful”. You can see this clearly with modern LLMs. In general I’m pretty skeptical about the advantages OpenAI and Anthropic gain from their scale, but they have millions of people using their products every day and have the data to understand which metrics correlate with customer satisfaction. There are lots of external efforts to benchmark LLMs, but it’s not clear what they tell us about how well the models actually work, and which are best.

This is important because a lot of big decisions get made based on benchmarks. Research papers need to show they beat the state of the art on commonly accepted metrics to be published. Companies get investment funding from their benchmark results. The output and content of the LLMs we use in our daily lives are driven by which metrics are used during their training process. What the numbers capture and what they miss has a direct and growing impact on our world, as LLMs are adopted in more and more applications.

That’s a big reason why Natalie and I started the AI Benchmark Club meetup in SF. There are a lot of AI events in the Bay Area, but if you’re actually training models from scratch, it can be hard to find other people facing similar challenges amongst all the business, marketing, and sales discussions that often dominate. The nice thing about benchmarks is that they sound unimportant to everyone except those of us who rely on them to build new models. This works as a great filter to ensure we have a lot of actual researchers and engineers, with talks and discussions on the practical challenges of our job. As Picasso said – “When art critics get together they talk about content, style, trend and meaning, but when painters get together they talk about where can you get the best turpentine”. I think benchmarks are turpentine for ML researchers, and if you agree then come join us at our next meetup!

