Why Nvidia’s AI Supremacy is Only Temporary

Nvidia is an amazing company that has executed a contrarian vision for decades, and it has rightly become one of the most valuable corporations on the planet thanks to its central role in the AI revolution. I want to explain why I believe its top spot in machine learning is far from secure over the next few years. To do that, I’m going to talk about some of the drivers behind Nvidia’s current dominance, and then how they will change in the future.

Currently

Here’s why I think Nvidia is winning so hard right now.

#1 – Almost Nobody is Running Large ML Apps

Outside of a few large tech companies, very few corporations have advanced to actually running large scale AI models in production. They’re still figuring out how to get started with these new capabilities, so the main costs are around dataset collection, hardware for training, and salaries for model authors. This means that machine learning is focused on training, not inference.

#2 – All Nvidia Alternatives Suck

If you’re a developer creating or using ML models, an Nvidia GPU is a lot easier and less time-consuming to work with than an AMD OpenCL card, a Google TPU, a Cerebras system, or any other hardware. The software stack is much more mature; there are many more examples, docs, and other resources; finding engineers experienced with Nvidia is much easier; and integration with all of the major frameworks is better. There is no realistic way for a competitor to beat the platform effect Nvidia has built. It makes sense for the current market to be winner-takes-all, and they’re the winner, full stop.

#3 – Researchers have the Purchasing Power

It’s incredibly hard to hire ML researchers; anyone with experience has their pick of job offers right now. That means they need to be kept happy, and one of the things they demand is use of the Nvidia platform. It’s what they know and they’re productive with it. Picking up an alternative would take time and wouldn’t result in skills the job market values, whereas working on models with the tools they’re comfortable with does. Because researchers are so expensive to hire and retain, their preferences are given a very high priority when purchasing hardware.

#4 – Training Latency Rules

As a rule of thumb, models need to be trainable from scratch in about a week. I’ve seen this hold true since the early days of AlexNet, because if the iteration cycle gets any longer it’s very hard to do the empirical testing and prototyping that’s still essential to reach your accuracy goals. As hardware gets faster, people build bigger models up to the point where training once again takes roughly the same amount of time, and they reap the benefits through higher-quality models rather than reduced total training time. This makes buying the latest Nvidia GPUs very attractive, since your existing code will mostly just work, but faster. In theory there’s an opportunity here for competitors to win on lower latency, but the inevitably poor state of their software stacks (CUDA has had nearly two decades of investment) means it’s mostly an illusion.
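To make that rule of thumb concrete, here’s a rough back-of-envelope sketch of how a fixed one-week budget translates into model size. Every number in it (GPU count, throughput, utilization, token count) is an illustrative assumption I’ve picked for the example, and the 6-FLOPs-per-parameter-per-token heuristic is only a coarse approximation for transformer-style models:

```python
# Back-of-envelope: how much training fits in a one-week iteration budget.
# Every number here is an illustrative assumption, not a measured figure.

SECONDS_PER_WEEK = 7 * 24 * 3600

def weekly_flops_budget(num_gpus: int, flops_per_gpu: float, utilization: float) -> float:
    """Total floating point operations available from one week of training."""
    return num_gpus * flops_per_gpu * utilization * SECONDS_PER_WEEK

# Assumed cluster: 8 GPUs at 100 TFLOP/s each, running at 40% utilization.
budget = weekly_flops_budget(num_gpus=8, flops_per_gpu=100e12, utilization=0.4)

# Coarse transformer heuristic: training cost is roughly 6 * parameters * tokens.
tokens = 50e9  # assumed dataset size in tokens
params = budget / (6 * tokens)
print(f"~{budget:.2e} FLOPs per week -> roughly {params / 1e9:.1f}B parameters on {tokens:.0e} tokens")
```

The exact numbers don’t matter; the point is that a faster chip lets you grow the model while keeping the week-long loop, which is exactly why the newest GPU is always tempting.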

What’s going to change?

So, hopefully I’ve made a convincing case that there are strong structural reasons behind Nvidia’s success. Here’s how I see those conditions changing over the next few years.

#1 – Inference will Dominate, not Training

Years ago somebody told me “Training costs scale with the number of researchers, inference costs scale with the number of users”. What I took away from this is that there’s some point in the future where the amount of compute any company uses to run models on user requests will exceed the cycles it spends on training. Even if the cost of a single training run is massive and running inference is cheap, there are so many potential users in the world, with so many different applications, that the accumulated total of those inferences will exceed the training total. There are only ever going to be so many researchers.
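Here’s a toy version of that crossover argument in code. Every number below (team size, run costs, per-request costs, usage rates) is made up purely to show the shape of the scaling, not an estimate of real prices:

```python
# Toy model of "training costs scale with researchers, inference costs scale
# with users". All numbers are invented purely to illustrate the crossover.

def monthly_training_cost(researchers: int, runs_each: float, cost_per_run: float) -> float:
    """Training spend grows with the size of the research team."""
    return researchers * runs_each * cost_per_run

def monthly_inference_cost(users: int, requests_each: float, cost_per_request: float) -> float:
    """Inference spend grows with the size of the user base."""
    return users * requests_each * cost_per_request

training = monthly_training_cost(researchers=20, runs_each=4, cost_per_run=10_000)
for users in (100_000, 1_000_000, 10_000_000, 100_000_000):
    inference = monthly_inference_cost(users, requests_each=30, cost_per_request=0.002)
    winner = "inference" if inference > training else "training"
    print(f"{users:>11,} users: training ${training:,.0f} vs inference ${inference:,.0f} -> {winner} dominates")
```

The training line stays flat as the product grows, while the inference line keeps climbing with the user count, which is the whole argument in one loop.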

What this means for hardware is that priorities will shift towards reducing inference costs. A lot of ML researchers see inference as a subset of training, but this is wrong in some fundamental ways. It’s often very hard to assemble a sizable batch of inputs during inference, because batching trades off latency against throughput, and latency is almost always key in user-facing applications. Small or single-input batches change the workload dramatically and call for very different optimization approaches. There are also a lot of things (like the weights) that remain constant during inference, and so can benefit from pre-processing techniques like weight compression or constant folding.
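As a minimal sketch of the “weights are constant at inference time” point, here’s what a simple post-training weight compression step could look like: quantize a weight matrix to int8 once, offline, then serve single-input requests against the compressed copy. This is purely illustrative NumPy, not a production quantization scheme:

```python
import numpy as np

def quantize_weights(w: np.ndarray):
    """Symmetric per-tensor int8 quantization, run once before deployment."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def serve_matmul(x: np.ndarray, q: np.ndarray, scale: float) -> np.ndarray:
    """Inference-time matmul against the pre-quantized, constant weights."""
    return (x @ q.astype(np.float32)) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
x = rng.standard_normal((1, 512)).astype(np.float32)  # single-input batch, typical for serving

q, scale = quantize_weights(w)
error = np.abs(x @ w - serve_matmul(x, q, scale)).max()
print(f"int8 weights take 4x less memory; max abs error vs float32: {error:.4f}")
```

None of that pre-processing is possible during training, when the weights change on every step, which is why treating inference as “training without the backward pass” misses so much.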

#2 – CPUs are Competitive for Inference

I didn’t even list CPUs among the Nvidia alternatives above because they’re still laughably slow for training. But the mainstream CPU architectures (x86, Arm, and maybe RISC-V soon) have the benefit of decades of toolchain investment, an even more mature set of development tools and a larger community than Nvidia’s, and they can be much cheaper per arithmetic op than any GPU.

Old-timers will remember the early days of the internet, when most of the cost of setting up a dot-com was millions of dollars for a bunch of high-end web server hardware from someone like Sun. This was because they were the only realistic platform that could serve web pages reliably and with low latency. They had the fastest hardware money could buy, and that mattered when an entire site needed to fit on a single machine. Sun’s market share was rapidly eaten away by the introduction of software that could distribute the work across a large number of individually much less capable machines: commodity x86 boxes that were far cheaper.

Training is currently very hard to distribute in a similar way. The workloads make it possible to split work across a few tightly interconnected GPUs, but the constant stream of weight updates makes sharding across a fleet of low-end CPUs to reduce latency unrealistic. This isn’t true for inference, though. The model weights are fixed and can easily be duplicated across a lot of machines at initialization time, so almost no cross-machine communication is needed while serving requests. This makes an army of commodity PCs very appealing for applications relying on ML inference.
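Here’s a small sketch of why that sharding is so easy, using nothing beyond the Python standard library and NumPy. Each worker process stands in for a cheap machine: it loads its own identical copy of the (here randomly generated) weights once at startup, then serves requests with no communication between workers:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

_WEIGHTS = None

def load_model():
    """Runs once per worker, standing in for loading frozen weights from disk."""
    global _WEIGHTS
    rng = np.random.default_rng(42)  # every worker ends up with an identical copy
    _WEIGHTS = rng.standard_normal((256, 256)).astype(np.float32)

def handle_request(features: np.ndarray) -> float:
    """One independent inference; nothing is shared between requests or workers."""
    return float((features @ _WEIGHTS).sum())

if __name__ == "__main__":
    requests = [np.full(256, i, dtype=np.float32) for i in range(32)]
    with ProcessPoolExecutor(max_workers=4, initializer=load_model) as pool:
        results = list(pool.map(handle_request, requests))
    print(f"served {len(results)} requests across independent workers")
```

Swap the processes for racks of commodity boxes behind a load balancer and the structure is the same: duplicate the weights everywhere, route each request to any idle machine, never synchronize.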

#3 – Deployment Engineers gain Power

As inference costs begin to dominate training costs, there will be a lot of pressure to reduce them. Researchers will no longer be the highest priority, so their preferences will carry less weight, and they will be asked to do things that are less personally exciting in order to streamline production. There are also going to be a lot more people capable of training models entering the workforce over the next few years, as the skills involved become more widely understood. All of this means researchers’ corporate power will shrink, and the needs of the deployment team will be given higher priority.

#4 – Application Costs Rule

When inference dominates the overall AI budget, the hardware and workload requirements look very different. Researchers value the ability to experiment quickly, so they need the flexibility to prototype new ideas. Applications change their models comparatively infrequently, and may use the same fundamental architecture for years once the researchers have come up with something that meets their needs. We may be heading towards a world where model authors work in a specialized tool, much as Matlab is used for mathematical algorithms, and then hand the results over to deployment engineers who manually convert them into something more efficient for the application. This makes sense because any cost savings will be multiplied over a long period of time if the model architecture remains constant (even if the weights change).
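As one hypothetical version of that hand-off, here’s what freezing a research prototype into a portable artifact might look like. The tiny PyTorch model, the ONNX target, and the file name are all assumptions chosen for illustration; the point is only that the deployment side receives a fixed graph it can optimize independently of the training framework:

```python
import torch

class TinyClassifier(torch.nn.Module):
    """Stand-in for whatever the research team prototyped."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(128, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 10),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
example = torch.randn(1, 128)  # batch size 1, matching the serving workload

# Freeze the graph and weights into a single file for the deployment team.
torch.onnx.export(model, example, "classifier.onnx",
                  input_names=["features"], output_names=["logits"])
print("exported classifier.onnx for the deployment team")
```

Once the architecture is frozen like this, every hour spent shaving cost off that one graph pays back for as long as the application keeps shipping it.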

What does this Mean for the Future?

If you believe my four predictions above, then it’s hard to escape the conclusion that Nvidia’s share of the overall AI market is going to drop. That market is going to grow massively, so I wouldn’t be surprised if they continue to grow in absolute unit numbers, but I can’t see how their current margins will be sustainable.

I expect the winners of this shift to be traditional CPU platforms like x86 and Arm. Inference will need to be tightly integrated with traditional business logic to run end-user applications, so it’s difficult to see how even hardware specialized for inference can live across a bus, given the latency involved. Instead I expect CPUs to gain much more tightly integrated machine learning support, first as co-processors and eventually as specialized instructions, much like the evolution of floating point support.

On a personal level, these beliefs drive my own research and startup focus. The impact of improving inference is going to be so high over the next few years, and it still feels neglected compared to training. There are signs that this is changing though. Communities like r/LocalLlama are mostly focused on improving inference, the success of GGML shows how much of an appetite there is for inference-focused frameworks, and the spread of a few general-purpose models increases the payoff of inference optimizations. One reason I’m so obsessed with the edge is that it’s the closest environment to the army of commodity PCs that I think will run most cloud AI in the future. Even back in 2013 I originally wrote the Jetpac SDK to accelerate computer vision on a cluster of 100 m1.small AWS servers, since that was cheaper and faster than a GPU instance for running inference across millions of images. It was only afterwards that I realized what a good fit it was for mobile devices.

I’d love to hear your thoughts on whether inference is going to be as important as I’m predicting! Let me know in the comments if you think I’m onto something, or if I should be stocking up on Nvidia stock.