I Know We’re in an AI Bubble Because Nobody Wants Me 😭

I first got into deep learning in 2012, when AlexNet came out. I was CTO of Jetpac, a startup that aimed to provide information about bars, hotels, and restaurants by analyzing public photos, for example finding hipster (and Turk) friendly cafes. The results from the paper were so astonishing I knew AlexNet would be incredibly helpful, so I spent my Christmas holidays heating our house using a gaming rig with two GPUs and the CudaConvNet software, since that was the only way to train my own version of the model.

The results were even better than I’d hoped, but then I faced the problem of how to apply the model across the billions of photos we’d collected. The only GPU instances on Amazon were designed for video streaming and were prohibitively expensive. The CPU support in the Caffe framework was promising, but it was focused on training models, not running them after they’d been trained (aka inference). What I needed was software that would let me run the model at a massive scale on low-cost hardware. That was the original reason I wrote the Jetpac framework, so I could spin up hundreds of cheap EC2 instances to process our huge backlog of images for tens of thousands of dollars instead of millions.

It turned out that the code was small and fast enough to even run on phones, and after Jetpac was acquired by Google I continued in that direction by leading the mobile support for TensorFlow. While I love edge devices, and that’s what I’m known for these days, my real passion is for efficiency. I learned to code in the ’80s demo scene, went on to write PC game engines professionally in the ’90s, and got addicted to the dopamine rush of optimizing inner loops. There’s nothing quite like having hard constraints, clear requirements, and days to spend solving the puzzle of how to squeeze just a little bit more speed out of a system.

If you’re not a programmer, it might be difficult to imagine what an emotional process optimization can be. There’s no guarantee that it’s even possible to find a good answer, so the process itself can be endlessly frustrating. The first thrill comes when you see an opening, a possibility that nobody else has spotted. There’s the satisfaction of working hard to chase down the opportunity, and then too often the despair when it turns out not to work. Even then, it means I’ve learned something, because being good at optimization means learning everything you can about the hardware, the operating system, and the requirements themselves, and studying others’ code in depth. I can never guarantee that I’ll find a solution, but my consolation is always that I have a better understanding of the world than when I started. The deepest satisfaction comes when I do finally find an approach that runs faster, or uses fewer resources. It’s even a social joy: it almost always contributes to a wider solution that the team is working on, making a product better, or even possible in a way it wasn’t before. The best optimizations come from a full-stack team that’s able to make tradeoffs all the way from the product manager to the model architects, from hardware to operating system to software.

Anyway, enough rhapsodizing about the joy of coding, what does this have to do with the AI bubble? When I look around, I see hundreds of billions of dollars being spent on hardware – GPUs, data centers, and power stations. What I don’t see are people waving large checks at ML infrastructure engineers like me and my team. It’s been an uphill battle to raise the investment we’ve needed for Moonshine, and I don’t think it’s just because I’m a better coder than I am a salesman. Thankfully we have found investors who believe in our vision, and we’re on track to be cashflow-positive in Q1 2026, but in general I don’t see many startups able to raise money on the promise of improving AI efficiency.

This makes no sense to me from any rational economic point of view. If you’re a tech company spending billions of dollars a month on GPUs, wouldn’t spending a few hundreds of millions of dollars a year on software optimization be a good bet? We know that GPU utilization is usually below 50%, and in my experience is often much lower for interactive applications where batches are small and memory-bound decoding dominates. We know that motivated engineers like Scott Gray can do better than Nvidia’s libraries on their own GPUs, and from my experience at Jetpac and Google I’m certain there are a lot of opportunities to run inference on much lower cost CPU machines. Even if you don’t care about the cost, the impact AI power usage has on us and the planet should make this a priority.
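
To make the memory-bound decoding point concrete, here is a rough back-of-envelope sketch in Python. All of the numbers are illustrative assumptions, approximate public specs for an A100-class GPU and a dense 7B-parameter model held in fp16, not measurements from any particular deployment.

```python
# Roofline-style estimate of compute utilization for small-batch LLM decoding.
# Assumed, illustrative numbers: ~312 TFLOP/s fp16 and ~2 TB/s HBM bandwidth
# (roughly A100-class), and a dense 7B-parameter model stored in fp16.

PEAK_FLOPS = 312e12        # assumed peak fp16 throughput, FLOP/s
PEAK_BYTES = 2e12          # assumed memory bandwidth, bytes/s
PARAMS = 7e9               # assumed dense model size, parameters
BYTES_PER_WEIGHT = 2       # fp16

def decode_utilization(batch_size: int) -> float:
    """Fraction of peak compute one decode step can use at this batch size."""
    flops = 2 * PARAMS * batch_size       # ~2 FLOPs per weight per generated token
    moved = PARAMS * BYTES_PER_WEIGHT     # every weight read once, shared across the batch
    intensity = flops / moved             # FLOPs per byte of memory traffic
    ridge = PEAK_FLOPS / PEAK_BYTES       # intensity needed to saturate the compute units
    return min(1.0, intensity / ridge)

for batch in (1, 8, 64, 256):
    print(f"batch {batch:3d}: ~{decode_utilization(batch):.1%} of peak compute")

# Batch 1 lands well under 1%: the step is limited by weight movement, not
# arithmetic (and KV-cache reads, ignored here, only make it worse). That is
# the sense in which interactive, small-batch serving leaves most of the GPU idle.
```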

So, why is this money being spent? As far as I can tell, it’s because of the signaling benefits to the people making the decisions. Startups like OpenAI are motivated to point to the number of GPUs they’re buying as a moat, suggesting that they’ll be the top AI company for years to come because nobody else will be able to catch up with their head start on compute capacity. Hardware projects are also a lot easier to manage than software, since they don’t take up so much scarce management attention. Investors are on board because they’ve seen early success turn into long-term dominance before; it’s clear that AI is a world-changing technology, so they need to be part of it; and OpenAI and others are happy to absorb billions of dollars of investment, making VCs’ jobs much easier than they would be if they had to allocate across hundreds of smaller companies. Nobody ever got fired for buying IBM, and nobody’s going to get fired for investing in OpenAI.

I’m picking on OpenAI here, but across the industry you can see everyone from Oracle to Microsoft boasting about the amounts of money they’re spending on hardware, and for the same reasons. They get a lot more positive coverage, and a much larger share price boost, from this than they would from announcing they’re hiring a thousand engineers to get more value from their existing hardware.

If I’m right, this spending is unsustainable. I was in the tech industry during the dot com boom, and I saw a similar dynamic with Sun workstations. For a couple of years every startup needed to raise millions of dollars just to launch a website, because the only real option was buying expensive Sun servers and closed software. Then Google came along, and proved that using a lot of cheap PCs running open-source software was cheaper and much more scalable. Nvidia these days feels like Sun did then, and so I bet over the next few years there will be a lot of chatbot startups based on cheap PCs with open source models running on CPUs. Of course I made a similar prediction in 2023, and Nvidia’s valuation has quadrupled since then, so don’t look to me for stock tips!

13 responses

  1. I really resonate with this, for the core reason that LLM inference is already getting better and cheaper to an incredible degree – this has been proven by bitnet-based improvements (https://github.com/microsoft/BitNet), as well as notable inference gains via MoE-based transformer models.

    However – with all the money under the sun being poured into bitnet/quantisation research, I am surprised that open-source models with combined (bitnet + quantisation) inference gains have not become more mainstream, not to mention gains due to specialised inference architectures. My suspicion is that this is due to the giant number of libraries that have come alongside native transformer models (e.g. transformers via huggingface, vllm, sglang) and their lack of native support for bitnet (see the ternary-quantisation sketch at the end of this comment).

    Further, scaling has hit its peak limit; further gains (as you’ve…again mentioned) seem to come from research improvements and specialist improvements, and in my personal opinion from having better data, and the moat of having enough people willing to create specialised pipelines that produce training data for better LLMs and transformer-based models.

    That being said… the research moat is also complex at its core. I don’t know, perhaps there’s a problem with the curse of knowledge here – all these libraries are a wrapper over pytorch… and as absurd as it sounds, I share your intuition that I can’t see GPU-based inference lasting for very long as a moat. But where researchers do take priority is that you need someone with actual intuitive experience to maintain these tools: they will inevitably break, need newer improvements, need shielding from legislative changes, and that is something bigger organisations severely need.

    The true moats lie in:

    1. Having better Researchers who specialise and know the things they are doing
    2. Data Labeling/Annotation farms
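
    For readers who haven’t looked at bitnet: a minimal sketch of the idea, assuming the absmean ternary quantisation described in the BitNet b1.58 paper, simplified to per-tensor scaling. The helper names are mine, and this is an illustration, not the reference implementation.

    ```python
    import numpy as np

    # Illustrative BitNet-style ("1.58-bit") weight quantisation: every weight
    # becomes -1, 0, or +1, plus one float scale per tensor.

    def ternary_quantize(w: np.ndarray):
        scale = np.mean(np.abs(w)) + 1e-8         # per-tensor absmean scale
        q = np.clip(np.round(w / scale), -1, 1)   # ternary weights
        return q.astype(np.int8), scale

    def ternary_matmul(x: np.ndarray, q: np.ndarray, scale: float):
        # The inner loop is additions/subtractions of activations only;
        # the single float multiply happens once at the end.
        return (x @ q.astype(x.dtype)) * scale

    rng = np.random.default_rng(0)
    w = rng.normal(size=(512, 512)).astype(np.float32)
    x = rng.normal(size=(1, 512)).astype(np.float32)

    q, s = ternary_quantize(w)
    err = np.abs(x @ w - ternary_matmul(x, q, s)).mean()
    print("mean abs error vs fp32 matmul:", err)

    # Packed properly (~1.58 bits per weight instead of 16), the weights are
    # roughly 10x smaller than fp16 and the matmul needs no multiplies, which
    # is where the CPU-friendly inference gains come from; real models need
    # quantisation-aware training to hold accuracy, which this toy example ignores.
    ```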

  2. I would have thought that the DeepSeek case, and the experiments on small models, are at least an argument for going in the direction you suggest. Apple seems to be going down that path too, having inference run on local machines, whether laptops or iPhones.

    If Zitron is correct, OpenAI loses money on every inference call. If so, it begs for ways to reduce this cost to make AI inference via LLMs a viable business model.

    I suspect the problem is that investors are chasing the big AI companies as the ones apparently increasing stock values today, possibly with the hope of selling at the top to greater fools. This is very much in alignment with the “Buy Sun’s Solaris servers” mindset. I know our biotech startup did, even though we were at the cusp of the move to using many cheap PCs at the backend. (We couldn’t use cloud services for customer data security reasons.) A replacement server was horrendously expensive.

    I suspect that when the AI bubble eventually bursts, your ideas on edge computing on tiny hardware, using all the tools in your toolbox, will be in demand to make AI profitable in a range of business domains. The only historical counterexample I can think of was the mistake Starfish made in trying to create code for a single 1.44MB floppy disk. It missed the CD and later DVD media revolution that obsoleted that idea.

  3. Love this perspective. Everyone’s chasing more GPUs, but almost nobody is chasing efficiency even though that’s where the real moat will be. When compute hype cools down, optimization will suddenly look like the smartest investment in the room.

  4. You’re assuming investors as a class are rational. A close look at most of their returns shows they are not. It’s a fact that folks like you and me with a few grey hairs typically provide a better return than a fresh college graduate, but the VC community is still largely fixated on the young male demographic.

    A lot of the current bubble stems from the concept that for many investors the only thing worse than investing in a flop is passing on the Next Big Thing. So the chase is on.

    However, as is often the case, I do think you are right. Figuring out significant efficiencies is incredibly important. It will either happen here as a result of work by folks like you, or it will happen in places that are much more resource constrained so they have no choice but to improvise. Either way, it will happen. However, it’s not particularly glamorous in comparison to some throwaway consumer app that gets immediate traction and almost no revenue.

    I’m delighted to hear that you have new funding. Best of luck with the road ahead!

    I’ll now go back to telling kids to get off my lawn 😂

  5. Pingback: Pete Warden: I Know We’re in an AI Bubble Because Nobody Wants Me 😭 | ResearchBuzz: Firehose

  6. Pingback: Regulated Plants, North Macedonia History, China LLMs, More: Tuesday ResearchBuzz, December 2, 2025 – ResearchBuzz

  7. “This makes no sense to me from any rational economic point of view. If you’re a tech company spending billions of dollars a month on GPUs, wouldn’t spending a few hundreds of millions of dollars a year on software optimization be a good bet?”

    I tend to agree with you, Pete. I noticed recently companies are noting the energy usage of their farms vs. the # of GPUs.

    I was joking around with a friend who was funding one of these farms that if he put in a power-to-ground short, he could advertise the first 1 TW installation.

  8. The “hold everything until the big Sun box arrives” dynamic you describe here resonates with me. It was the ANSWER – and its cost a limiter – until it wasn’t. At some point some hardworking, bright folks figured out that open source and hard drives on velcro worked. As one commenter in this thread suggests, one would have thought the DeepSeek paper got enough publicity to change the AI ANSWER. It could have cast a lot of light on the efficient crew that has been working on TinyML, and others. I read the birth of the ChatGPT phenomenon this way: a frustrated Nadella threw down the gauntlet to blast Microsoft ahead of Google, which was tentative about siccing this on the world. So, it was a game at the start, and elements of it are a game [of “billions and billions”] today.

  9. Making a GPU more efficient helps the owner of the GPU as well. And software efficiency will probably arrive later than chip efficiency, at least for a few more years.

    Servers don’t seem like a very challenging technical issue; HPC creation and manufacturing is extremely challenging.

    ASICs are OK but will only get you to a certain point.

    We’re pursuing AGI and beyond.

    This is like discussing what we’re gonna do, instead of just doing it.

  10. Thank you very much for writing this. I have been feeling this for maybe 10-12 years already.

    I guess we would need a company where optimization is directly correlated with money earned, or at least conveyed to the clients as such.

    It seems spending fast to implement something cheap and dirty beats optimization for short-term money every single time, but we end up paying for years afterwards, whether directly by hiring endless contractors to fix the mistakes made by “moving [way too] fast”, or in our everyday lives with endless ecological disasters.

    If you have a solution to this, please do let me know.

  11. Disclosure: I work in HW-SW co-design for AI ASICs.

    I think this misses a key layer: Hardware innovation isn’t dead; it has just become a high-stakes prediction game. Because silicon takes 2-3 years to bake, we have to guess where software will be in the future.

    A few technical corrections/additions:

    1. The Real Moat: GPUs create a training moat, not an inference moat. Training is high-margin, high-ROI, and incredibly hard to replicate. Inference is easier to commoditize.
    2. Co-Design is Mandatory: You can’t solve everything with software later. For example, Nvidia’s new NVFP4 requires changes in the hardware, the software, and the model architecture simultaneously.
    3. Quantization Limits: Software tricks alone hit a wall. Post-Training Quantization (PTQ) isn’t enough for the next generation of extreme efficiency; you need hardware that supports Quantization Aware Training (QAT), or the accuracy drops too much (see the fake-quantization sketch at the end of this comment).
    4. The Bandwidth Trap: Even custom ASICs will fail if they just add compute. If the model is IO-bound (memory limited), adding more teraflops does nothing.

    The model designer ultimately dictates the constraints. Software can optimize, but it can’t fix a system where the physics of data movement (networking/memory) is the bottleneck.
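
    On point 3, here is a minimal fake-quantization sketch – illustrative only, with made-up tensors and simple symmetric per-tensor scaling – showing the rounding step that both PTQ and QAT rely on.

    ```python
    import numpy as np

    # Symmetric per-tensor fake quantization: round weights onto an int grid,
    # then dequantize, so the cost of the rounding can be measured directly.

    def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
        qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for int4, 127 for int8
        scale = np.max(np.abs(w)) / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax)
        return q * scale                              # back to float, rounding baked in

    rng = np.random.default_rng(0)
    w = rng.normal(size=(1024, 1024)).astype(np.float32)

    for bits in (8, 4):
        err = np.abs(w - fake_quant(w, bits)).mean()
        print(f"int{bits}: mean abs rounding error {err:.4f}")

    # PTQ applies this rounding to an already-trained model and hopes the error
    # is tolerable; at very low bit widths it usually isn't. QAT runs the same
    # fake-quant op in the forward pass during training (with a straight-through
    # estimator for the gradient), so the weights learn to sit where the rounding
    # hurts least -- which is why extreme low-bit formats need QAT plus hardware
    # that can execute them natively.
    ```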
