Why Nvidia’s AI Supremacy is Only Temporary

Nvidia is an amazing company that has executed a contrarian vision for decades, and has rightly become one of the most valuable corporations on the planet thanks to its central role in the AI revolution. I want to explain why I believe its top spot in machine learning is far from secure over the next few years. To do that, I’m going to talk about some of the drivers behind Nvidia’s current dominance, and then how they will change in the future.

Currently

Here’s why I think Nvidia is winning so hard right now.

#1 – Almost Nobody is Running Large ML Apps

Outside of a few large tech companies, very few corporations have advanced to actually running large scale AI models in production. They’re still figuring out how to get started with these new capabilities, so the main costs are around dataset collection, hardware for training, and salaries for model authors. This means that machine learning is focused on training, not inference.

#2 – All Nvidia Alternatives Suck

If you’re a developer creating or using ML models, using an Nvidia GPU is a lot easier and less time-consuming than using an AMD OpenCL card, a Google TPU, a Cerebras system, or any other hardware. The software stack is much more mature, there are far more examples, documentation, and other resources, finding engineers experienced with Nvidia is much easier, and integration with all of the major frameworks is better. There is no realistic way for a competitor to beat the platform effect Nvidia has built. It makes sense for the current market to be winner-takes-all, and they’re the winner, full stop.

#3 – Researchers have the Purchasing Power

It’s incredibly hard to hire ML researchers; anyone with experience has their pick of job offers right now. That means they need to be kept happy, and one of the things they demand is use of the Nvidia platform. It’s what they know and they’re productive with it; picking up an alternative would take time and wouldn’t result in skills the job market values, whereas working on models with the tools they’re comfortable with does. Because researchers are so expensive to hire and retain, their preferences are given a very high priority when purchasing hardware.

#4 – Training Latency Rules

As a rule of thumb, models need to be trainable from scratch in about a week. I’ve seen this hold true since the early days of AlexNet, because if the iteration cycle gets any longer it’s very hard to do the empirical testing and prototyping that’s still essential to reach your accuracy goals. As hardware gets faster, people build bigger models up until the point that training once again takes roughly the same amount of time, and reap the benefits through higher-quality models rather than reduced total training time. This makes buying the latest Nvidia GPUs very attractive, since your existing code will mostly just work, but faster. In theory there’s an opportunity here for competitors to win with lower latency, but the inevitably poor state of their software stacks (CUDA has had well over a decade of investment) means it’s mostly an illusion.

What’s going to change?

So, hopefully I’ve made a convincing case that there are strong structural reasons behind Nvidia’s success. Here’s how I see those conditions changing over the next few years.

#1 – Inference will Dominate, not Training

Somebody told me years ago that “Training costs scale with the number of researchers, inference costs scale with the number of users”. What I took away from this is that there’s some point in the future where the amount of compute any company is using for running models on user requests will exceed the cycles they’re spending on training. Even if the cost of a single training run is massive and running inference is cheap, there are so many potential users in the world with so many different applications that the accumulated total of those inferences will exceed the training total. There are only ever going to be so many researchers.

What this means for hardware is that priorities will shift towards reducing inference costs. A lot of ML researchers see inference as a subset of training, but this is wrong in some fundamental ways. It’s often very hard to assemble a sizable batch of inputs during inference, because that process trades off latency against throughput, and latency is almost always key in user-facing applications. Small or single-input batches change the workload dramatically, and call for very different optimization approaches. There are also a lot of things (like the weights) that remain constant during inference, and so can benefit from pre-processing techniques like weight compression or constant folding.
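To make that last point concrete, here’s a minimal sketch of one classic constant-folding step: merging a batch-normalization scale and shift into convolution weights once at load time. This is an illustration of the general technique rather than any particular framework’s implementation, and it assumes the per-channel scale and shift values have already been derived from the trained batch-norm statistics. The function name and weight layout are illustrative.

#include <stddef.h>

/* Fold an inference-time batch-norm (already reduced to a per-channel scale
   and shift) into the preceding convolution's weights and biases. After this
   runs once at model-load time, the batch-norm step can be skipped entirely
   on every request. Weights are assumed to be stored as
   [out_channels][weights_per_channel]. */
void fold_batchnorm_into_conv(float* weights, float* bias,
                              const float* scale, const float* shift,
                              size_t out_channels,
                              size_t weights_per_channel) {
  for (size_t oc = 0; oc < out_channels; ++oc) {
    for (size_t i = 0; i < weights_per_channel; ++i) {
      weights[oc * weights_per_channel + i] *= scale[oc];
    }
    bias[oc] = bias[oc] * scale[oc] + shift[oc];
  }
}

Because the weights never change between requests, this cost is paid once instead of on every inference, which is exactly the kind of optimization that training workloads can’t take advantage of.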

#2 – CPUs are Competitive for Inference

I didn’t even list CPUs in the Nvidia alternatives above because they’re still laughably slow for training. The main desktop CPUs (x86, Arm, and maybe RISC-V soon) have the benefit of many decades of toolchain investment. They have an even more mature set of development tools and community than Nvidia. They can also be much cheaper per arithmetic op than any GPU.

Old-timers will remember the early days of the internet when most of the cost of setting up a dot-com was millions of dollars for a bunch of high-end web server hardware from someone like Sun. This was because they were the only realistic platform that could serve web pages reliably and with low latency. They had the fastest hardware money could buy, and that was important when entire sites needed to fit on a single machine. Sun’s market share was rapidly eaten by the introduction of software that could distribute the work across a large number of individually much less capable machines, commodity x86 boxes that were far cheaper.

Training is currently very hard to distribute in a similar way. The workloads make it possible to split work across a few GPUs that are tightly interconnected, but the pattern of continuous updates makes reducing latency by sharding across low-end CPUs unrealistic. This is not true for inference though. The model weights are fixed and can easily be duplicated across a lot of machines at initialization time, so no communication is needed. This makes an army of commodity PCs very appealing for applications relying on ML inference.

#3 – Deployment Engineers gain Power

As inference costs begin to dominate training costs, there will be a lot of pressure to reduce them. Researchers will no longer be the highest priority, so their preferences will carry less weight. They will be asked to do things that are less personally exciting in order to streamline production. There are also going to be a lot more people capable of training models coming into the workforce over the next few years, as the skills involved become more widely understood. This all means researchers’ corporate power will shrink and the needs of the deployment team will be given higher priority.

#4 – Application Costs Rule

When inference dominates the overall AI budget, the hardware and workload requirements are very different. Researchers value the ability to quickly experiment, so they need flexibility to prototype new ideas. Applications usually change their models comparatively infrequently, and may use the same fundamental architecture for years, once the researchers have come up with something that meets their needs. We may be heading towards a world where model authors use a specialized tool, as Matlab is for mathematical algorithms, and then hand over the results to deployment engineers who manually convert them into something more efficient for an application. This will make sense because any cost savings will be multiplied over a long period of time if the model architecture remains constant (even if the weights change).

What does this Mean for the Future?

If you believe my four predictions above, then it’s hard to escape the conclusion that Nvidia’s share of the overall AI market is going to drop. That market is going to grow massively so I wouldn’t be surprised if they continue to grow in absolute unit numbers, but I can’t see how their current margins will be sustainable.

I expect the winners of this shift will be traditional CPU platforms like x86 and Arm. Inference will need to be tightly integrated into traditional business logic to run end user applications, so it’s difficult to see how even hardware specialized for inference can live across a bus, with the latency involved. Instead I expect CPUs to gain much more tightly integrated machine learning support, first as co-processors and eventually as specialized instructions, like the evolution of floating point support.

On a personal level, these beliefs drive my own research and startup focus. The impact of improving inference is going to be so high over the next few years, and it still feels neglected compared to training. There are signs that this is changing though. Communities like r/LocalLlama are mostly focused on improving inference, the success of GGML shows how much of an appetite there is for inference-focused frameworks, and the spread of a few general-purpose models increases the payoff of inference optimizations. One reason I’m so obsessed with the edge is that it’s the closest environment to the army of commodity PCs that I think will run most cloud AI in the future. Even back in 2013 I originally wrote the Jetpac SDK to accelerate computer vision on a cluster of 100 m1.small AWS servers, since that was cheaper and faster than a GPU instance for running inference across millions of images. It was only afterwards that I realized what a good fit it was for mobile devices.

I’d love to hear your thoughts on whether inference is going to be as important as I’m predicting! Let me know in the comments if you think I’m onto something, or if I should be stocking up on Nvidia stock.

Accelerating AI with the Raspberry Pi Pico’s dual cores

I’ve been a fan of the RP2040 chip powering the Pico since it was launched, and we’re even using it in some upcoming products, but I’d never used one of its most intriguing features, the second core. It’s not common to have two cores in a microcontroller, especially a seventy-cent Cortex-M0+, and most of the system software for that level of CPU doesn’t have standardized support for threads and other typical ways to get parallelized performance from your algorithms. I still wanted to see if I could get a performance boost on compute-intensive tasks like machine learning though, so I dug into the pico_multicore library, which provides low-level access to the second core.

The summary is that I was able to get approximately a 1.9x speed boost by breaking a convolution function into two halves and running one on each processor. The longer story is that I actually implemented most of this several months ago, but got stuck due to a silly mistake where I was accidentally serializing the work by calling functions in the wrong order! I was in the process of preparing a bug report for the RPi team who had kindly agreed to take a look when I realized my mistake. Another win for rubberducking!
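For anyone curious about the pattern, here’s a simplified sketch of the splitting approach using the Pico SDK’s pico_multicore library. It’s not the actual CMSIS-NN code, just an illustration of how the two halves and the FIFO handshake fit together, and of why the order of the calls matters. The conv_rows helper and OUTPUT_ROWS constant are stand-ins for the real convolution details.

#include "pico/stdlib.h"
#include "pico/multicore.h"

#define OUTPUT_ROWS 96

/* Stand-in for the real convolution inner loops: computes output rows
   [start, end). The arithmetic is omitted, since the interesting part here
   is the work sharing. */
static void conv_rows(int start, int end) {
  for (int row = start; row < end; ++row) {
    /* ... per-row multiply-accumulate work goes here ... */
  }
}

/* Runs on core 1: wait for a "go" token, compute the top half, signal done. */
static void core1_worker(void) {
  while (true) {
    multicore_fifo_pop_blocking();
    conv_rows(0, OUTPUT_ROWS / 2);
    multicore_fifo_push_blocking(1);
  }
}

static void conv_dual_core(void) {
  multicore_fifo_push_blocking(1);          /* kick off core 1 first... */
  conv_rows(OUTPUT_ROWS / 2, OUTPUT_ROWS);  /* ...then do core 0's half... */
  multicore_fifo_pop_blocking();            /* ...and only now wait for core 1. */
  /* If core 0 blocks on core 1's result before doing its own share, the two
     halves run back to back and you get no speedup, which is the kind of
     ordering mistake that can silently serialize the work. */
}

int main(void) {
  stdio_init_all();
  multicore_launch_core1(core1_worker);
  while (true) {
    conv_dual_core();
    sleep_ms(1000);
  }
}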

If you’re interested in the details, the implementation is in my custom version of an Arm CMSIS-NN source file. I actually ended up putting together an updated version of the whole TFLite Micro library for the Pico to take advantage of this. There’s another long story behind that too. I did the first TFLM port for the Pico in my own time, and since nobody at Google or Raspberry Pi is actively working on it, it’s remained stuck at that original version. I can’t make the commitment to be a proper maintainer of this new version; it will be supported on a best-effort basis, so bugs and PRs may not be addressed, but I’ve at least tried to make it easier to update with a sync/sync_with_upstream.sh script that currently works and is designed to be as robust to future changes as I can make it.

If you want more information on the potential speedup, I’ve included some benchmarking results. The lines to compare are the CONV2D results. For example, the first convolution layer takes 46ms without the optimizations, and 24ms when run on both cores. There are other layers in the benchmark that aren’t optimized, like depthwise convolution, but the overall time for running the person detection model once drops from 782ms to 599ms. This is already a nice boost, but in the future we could do something similar for the depthwise convolution to increase the speed even more.

Thanks to the Raspberry Pi team for building a lovely little chip! Everything from the PIOs to software overclocking and dual cores makes it a fascinating system to work with, and I look forward to diving in even deeper.

Explore the dark side of Silicon Valley with Red Team Blues

It’s weird to live in a place that so many people have heard of, but so few people know. Silicon Valley is so full of charismatic people spinning whatever stories serve their ends that it’s hard for voices with fewer ulterior motives to get airtime. Even the opponents of big tech have an incentive to mythologize it; it’s the only way to break through the noise. It’s very rare to find someone with deep experience of our strange world who can paint a picture I recognize.

That’s a big reason I’ve always loved Cory Doctorow’s writing. He knows the technology industry and the people who inhabit it inside and out, but he’s not interested in either hagiography or demonization. He’s always been able to pinpoint the little details that make this world simultaneously relatable and deeply weird, like this observation about wealth from his latest book:

I almost named the figure, but I did not. My extended network of OG Silicon Valley types included paupers and billionaires, and long ago, we all figured out that the best way to stay on friendly terms was to keep the figures out of it.

Red Team Blues is a fast-paced crime novel in the best traditions of Hammett, but taking inspiration from the streets of 2020s San Francisco instead of the 1920s. His eye for detail adds authenticity, with his forensic accountant protagonist relying more on social media carelessness than implausible hacking attempts to gather the information he needs. There’s a thread of anger running through the story too, at the machinery of tax evasion that lies behind so many industry facades, and contributes to the world of homelessness that is the mirror image of all the partying billionaires. He’s unsparing in his assessment of cryptocurrencies, seeing their success as driven by money laundering for some of the worst people in the world.

I love having an accountant at the center of a thriller, and Cory’s hero Martin Hench is a lot of fun to spend time with. The plot itself is a rollercoaster ride through cryptography, drug gangs, wildfire ghost towns, and ex-Soviet grifters, and it will keep you turning the pages. I highly recommend picking up a copy; it’s enjoyable and thought-provoking at the same time.

To give you one last taste, here’s his perfect pen portrait of someone I’ve met a few too many times:

I’ve known a lot of hustlers, aggro types who cut corners and bull their way through the consequences. It’s a type, out here. Move fast and break things. Don’t ask permission; beg forgiveness. But most of those people, they know they’re doing it. You can manage them, tack around them, factor them into your plans.

The ones who get high on their own supply, though? There’s no factoring them in. Far as they’re concerned, they’re the only player characters in the game and everyone else is an NPC, a literal nobody.

How can AI help everyday life?

Video of an AI-controlled lamp

There’s a lot of hype around AI these days, and it’s easy to believe that it’s just another tech world fad like the Metaverse or crypto. I think that AI is different though, because the real-world impact doesn’t require a leap of faith to imagine. For example, I’ve had a long-time dream of being able to look at a lamp, say “On”, and have the light come on. I want to be able to just ask everyday objects for help and have them do something intelligent.

To make it easier to understand what I’m talking about, we’ve built a small box that understands when you’re looking at it and can make sense of spoken language, and we’ve set it up to control a lamp. We’ve designed it to work as simply as possible:

  • There’s no wake word like “Alexa” or “Siri”. You trigger the interaction by looking at the lamp, using a Person Sensor to detect that gaze.
  • We don’t require a set order of commands; we’re able to pick out what you want from a stream of natural speech using our AI models.
  • Everything is running locally on the controller box. This means that not only is all your data private (it never leaves your home), but there’s also no setup needed. You don’t have to download an app, connect to wifi, or even create an account. Plug in the controller and lamp, and it Just Works.

All of this is only possible because of the new wave of transformer models that are sweeping the world. We’re going to see a massive number of new capabilities like this enter our everyday lives, not in years but in months. If you’re interested in how this kind of local, private intelligence (with no server costs!) could work with your products, I’d love to chat.

What happens when the real Young Lady’s Illustrated Primer lands in China?

I love Brad DeLong’s writing, but I did a double take when he recently commented “‘A Young Lady’s Illustrated Primer’ continues to recede into the future”. The Primer he’s referencing is an electronic book from Neal Stephenson’s Diamond Age novel, an AI tutor designed to educate and empower children, answering their questions and shaping their characters with stories and challenges. It’s a powerful and appealing idea in a lot of ways, and offers a very compelling use case for conversational machine learning models. I also think that a workable version of it now exists.

The recent advances with large language models have amazed me, and I do think we’re now a lot closer to an AI companion that could be useful for people of any age. If you try entering “Tell me a story about a unicorn and a fairy” into ChatGPT you’ll almost certainly get something more entertaining and coherent than most adults could come up with on the fly. This model comes across as a creative and engaging partner, and I’m certain that we’ll be seeing systems aimed at children soon enough, for better or worse. It feels like a lot of the functionality of the Primer is already here, even if the curriculum and veracity of the responses are lacking.

One of the reasons I like Diamond Age so much is that it doesn’t just describe the Primer as a technology, it looks hard at its likely effects. Frederik Pohl wrote that “a good science fiction story should be able to predict not the automobile but the traffic jam”, and Stephenson shows how subversive a technology that delivers information in this new way can be. The owners of the Primer grow up indoctrinated by its values and teachings, and eventually become a literal army. This is portrayed in a positive light, since most of those values are ones that a lot of Western-educated people would agree with, but it’s also clear that Stephenson believes that the effects of a technology like this would be incredibly disruptive to the status quo.

How does this all relate back to ChatGPT? Try asking it “Tell me about Tiananmen Square” and you’ll get a clear description of the 1989 government crackdown that killed hundreds or even thousands of protestors. So what, you might ask? We’ve been able to type the same query into Google or Wikipedia for decades to get uncensored information. What’s different about ChatGPT? My friend Drew Breunig recently wrote an excellent post breaking down how LLMs work, and one of his side notes is that they can be seen as an extreme form of lossy compression for all the data they’ve seen during training. The magic of LLMs is that they’ve effectively shrunk a lot of the internet’s text content into a representation that’s a tiny fraction of the size of the original. A model like LLaMa might have been exposed to over a trillion words during training, but it fits into a 3.5GB file, easily small enough to run locally on a smartphone or Raspberry Pi. That means the “Tiananmen Square” question can be answered without having to send a network request. No cloud, wifi, or cell connection is needed!

If you’re trying to control the flow of information in an authoritarian state like China, this is a problem. The Great Firewall has been reasonably effective at preventing ordinary citizens from accessing cloud-based services that might contradict CCP propaganda because they’re physically located outside of the country, but monitoring apps that run entirely locally on phones is going to be a much tougher challenge. One approach would be to produce alternative LLMs that only include approved texts, but as the “large” in the name implies, training these models requires a lot of data. Labeling all that data would be a daunting technical project, and the end results are likely to be less useful overall than an uncensored version. You could also try to prevent unauthorized models from being downloaded, but because they’re such useful tools they’re likely to show up preloaded in everything from phones to laptops and fridges.

This local aspect of the current AI revolution isn’t often appreciated, because many of the demonstrations show up as familiar text boxes on web pages, just like the cloud services we’re used to. It starts to become a little clearer when you see how many models like LLaMa and Stable Diffusion can be run locally as desktop apps, or even on Raspberry Pis, but these are currently pretty slow and clunky. What’s going to happen over the next year or two is that the models will be optimized and start to match or even outstrip the speed of the web applications. The elimination of cloud bills for server processing and improved latency will drive commercial providers towards purely edge solutions, and the flood of edge hardware accelerators will narrow the capability gap between a typical phone or embedded system and a GPU in a data center.

Simply put, people all over the world are going to be learning from their AI companions, as rudimentary as they currently are, and censoring information is going to be a lot harder when the whole process happens on the edge. Local LLMs are going to change politics all over the world, but especially in authoritarian states who try to keep strict controls on information flows. The Young Lady’s Illustrated Primer is already here, it’s just not evenly distributed yet.

Notes from a bank run

Photo by Gopal Vijayaraghavan

My startup, Useful Sensors, has all of its money in Silicon Valley Bank. There are a lot of things I worry about as a CEO, but assessing SVB’s creditworthiness wasn’t one of them. It clearly should have been. I don’t have any grand theories about what’s happened over the last few days, but I wanted to share some of my experiences as someone directly affected.

To start with, Useful is not at risk of shutting down. The worst-case scenario, as far as I can tell, is that we only have access to the insured amount of $250k in our SVB account on Monday. This will be plenty for payroll on Wednesday, and from what I’ve seen there are enough liquid assets that sales of the government bonds that triggered the whole process should return a good portion of the remaining balance within a week or so. If I need to, I’ll dip into my personal savings to keep the lights on. I know this isn’t true for many other startups though, so if they don’t get full access to their funds there will be job losses and closures.

Although we’re not going to close, it is very disruptive to our business. Making sure that our customers are delighted and finding more of them should be taking all of our attention. Instead I spent most of Thursday and Friday dealing with a rapidly changing set of recommendations from our investors, attempting to move money and open new accounts, and now I’m discovering the joys of the FDIC claims process. I’m trying to do all of this while flying to Germany for Embedded World to announce a new distribution deal with OKdo, and this blog post is actually written from an airport lounge in Paris. Longer term, depending on the ultimate outcome, it may affect when we want to raise our next round. To be clear, we’re actually in a great position compared to many others (I’m an old geezer with savings), but long-term planning at a startup is hard enough without extra challenges like this thrown in.

It has been great having access to investors and founders who are able to help us in practical ways. We would never have been able to open a new account so quickly without introductions to helpful staff at another bank. I’ve been glued to the private founder chat rooms where people have shared their experiences with things like the FDIC claims process and pending wires. This kind of rapid communication and sharing of information is what makes Silicon Valley such a good place to build a startup, and I’m very grateful for everyone’s help.

Having said that, the Valley’s ability to spread information and recommendations quickly was one of the biggest causes of SVB’s demise. I’ve always been a bit of a rubbernecker at financial disasters, and I’d read enough books on the 2008 financial crisis to understand how bank runs happen. It was strange being in one myself though, because the logic of “everyone else is pulling their money so you’d better too before it’s all gone” is so powerful, even though I knew this mentality was a self-fulfilling prophecy. I planned on what I hoped was a moderate course of action, withdrawing some of our funds from SVB to move to another institution to gain some diversification, but by the time I was able to set up the transfer it was too late.

Technology companies aren’t the most sympathetic victims in the current climate, for many good reasons. I thought this story covered the political dimensions of the bank failure well. The summary is that many taxpayers hate the idea of bailing out startups, especially ones with millions in their bank accounts. There are a lot of reasons why I think we’ll all benefit from not letting small businesses pay the price for bank executives messing up their risk management, but they’re all pretty wonky and will be a hard sell. However, the alternative is a world where only the top two or three banks in the US get most of the deposits, because they’re perceived as too big to fail. If no financial regulator spotted the dangers with SVB, how can you expect small business owners to vet banks themselves? We’ll all just end up going to Citibank or JPMorgan, which increases the overall systemic risk, as we saw in 2008.

Anyway, I just want to dedicate this to all of the founders having a tough weekend. Startups are all about dealing with risks, but this is a particularly frustrating problem to face because it’s so unnecessary. I hope at least we’ll learn more over the next few weeks about how executives and regulators let a US bank with $200 billion in assets get into such a sorry state.

Go see Proxistant Vision at SFMCD

When I think of a museum with “craft” in its name, I usually imagine an institution focused on the past. San Francisco’s Museum of Craft and Design is different. Their mission is to “bring you the work of the hand, mind and heart”, and Bull.Miletic’s Proxistant Vision exhibition is a wonderful example of how their open definition of craft helps them find and promote startling new kinds of art.

When I first walked into the gallery space I was underwhelmed. There were three rooms with projectors, but the footage they were showing was nearly monochrome and I didn’t find much to connect with. I was intrigued by some of the rigs for the projectors though, with polyhedral mirrors and a cart that whirred strangely. I’m glad I had a little patience, because all of the works turned out to have their own life and animation beyond anything I’d seen before.

The embedded video tries to capture my experience of one of the rooms, Ferriscope. The artists describe it as a kinetic video installation, and at its heart is a mirror that can direct the projector output in a full vertical circle, across two walls, the floor, and the ceiling, with a speed that can be dizzying. Instead of the world staying still while people ride a Ferris wheel, here we stay still while what we see goes flying by. It’s very hard to do justice to the impact this has with still images or even video. The effect is confusing but mesmerizing, and it forced me to look in a different way, if that makes sense?

There are two other installations as part of the show, Venetie 11111100110 and Zoom Blue Dot. I won’t spoil the enjoyment by describing too much about their mechanics, but they both play with moving and fragmenting video using mirrors and robotics. They aren’t as immediately startling as Ferriscope, but they drew me in and forced me to look with a fresh eye at familiar scenes. To my mind that’s the best part about all these works: they shook me out of being a passive observer and consumer of images, and I was suddenly on unsteady ground with an uncertain viewpoint. You don’t get to stand stroking your chin in front of these installations, you have to engage with the work in a much more active way.

The exhibition is open until March 19th 2023, and I highly recommend you go visit. You won’t be disappointed, though you may be disoriented!

Online Gesture Sensor Demo using WASM

If you’ve heard me on any podcasts recently, you might remember I’ve been talking about a Gesture Sensor as the follow-up to our first Person Sensor module. One frustrating aspect of building hardware solutions is that it’s very tough to share prototypes with people, since you usually have to physically send them a device. To work around that problem, we’ve been experimenting with compiling the same C++ code we use on embedded systems to WASM, a web-friendly intermediate representation that runs in all modern browsers. By hooking up a webcam as the input, instead of the embedded camera module, and displaying the output dynamically on a web page, we can provide a decent approximation of how the final device will work. There are obviously some differences (the webcam is going to produce higher-quality images than an embedded camera module, and the latency will vary), but it’s been a great tool for prototyping. I also hope it will help spark makers’ and manufacturers’ imaginations, so we’ve released it publicly at gesture.usefulsensors.com.
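To give a flavor of the approach, here’s a rough sketch of how embedded-style C code can be exposed to a browser using Emscripten, one common toolchain for compiling C and C++ to WASM. The function name, result codes, and build command below are illustrative rather than our actual API.

#include <stdint.h>
#include <emscripten/emscripten.h>

/* Hypothetical gesture result codes, for illustration only. */
enum { GESTURE_NONE = 0, GESTURE_PALM_FORWARD = 1, GESTURE_FINGER_TO_LIPS = 2 };

/* EMSCRIPTEN_KEEPALIVE keeps the function from being stripped by dead-code
   elimination so JavaScript can call it. The page grabs a frame from the
   webcam, writes the grayscale pixels into WASM memory, and passes the
   pointer in, standing in for the embedded camera module. */
EMSCRIPTEN_KEEPALIVE
int process_frame(const uint8_t* grayscale, int width, int height) {
  /* ... run the same gesture-recognition code used on the embedded device ... */
  (void)grayscale;
  (void)width;
  (void)height;
  return GESTURE_NONE;
}

/* Built with something like (exact flags will vary):
   emcc gesture_stub.c -O2 -sEXPORTED_FUNCTIONS=_process_frame,_malloc,_free -o gesture.js */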

On that page you’ll find a quick tutorial, and then you’ll have the opportunity to practice the four gestures that are supported initially. This is not the final version of the models or the interface logic (you’ll see false positives that would be problematic in production, for example), but it should give you an idea of what we’re building. My goal is to replace common uses of a TV remote control with simple, intuitive gestures like palm-forward for pause, or finger to the lips for mute. I’d love to hear from you if you know of manufacturers who would like to integrate something like this, and we hope to have a hardware version available soon so you can try it for your own projects. If you are at CES this year, come visit me at LVCC IoT Pavilion Booth #10729, where my colleagues and I will be showing off some of our devices together with Teksun.

Short Links

Years ago I used to write regular “Five Short Links” posts but I gave up as my Twitter account became a better place to share updates, notes, and things I found interesting from around the internet. Now that Twitter is Nazi-positive I’m giving up on it as a platform, so I’m going to try going back to occasional summary posts here instead.

Person Sensor back in stock on SparkFun. Sorry for all the delays in getting our new sensors to everyone who wanted them, but we now have a new batch available at SparkFun, and we hope to stay ahead of demand in the future. I’ve also been expanding the Hackster project guides with new examples like face-following robot cars and auto-pausing TV remote controls.

Blecon. It can be a little hard to explain what Blecon does, but my best attempt is that it allows BLE sensors to connect to the cloud using people’s phones as relays, instead of requiring a fixed gateway to be installed. The idea is that in places like buildings where staff will be walking past rooms with sensors installed, special apps on their phones can automatically pick up and transmit recorded data. This becomes especially interesting in places like hotels, where management could be alerted to plumbing problems early, without having to invest in extra infrastructure. I like this because it gets us closer to the idea of “peel and stick” sensors, which I think will be crucial to widespread deployment.

Peekaboo. I’ve long been a fan of CMU’s work on IoT security and privacy labels, so it was great to see this exploration of a system that gives users more control over their own data.

32-bit RISC-V MCU for $0.10. It’s not as cheap as the Padauk three-cent MCU, but the fact that it’s 32-bit, with respectable amounts of flash, SRAM, and I/O, makes it a very interesting part. I bet it would be capable of running many of the Hackster projects, for example, and since it supports I2C it should be able to talk to a Person Sensor. With processors this low cost, we’ll see a lot more hardware being replaced with software.

Hand Pose using TensorFlow JS. I love this online demo from MediaPipe, showing how well it’s now possible to track hands with deep learning approaches. Give the page permission to access your camera and then hold your hands up, you should see rather accurate and detailed hand tracking!

Why is it so difficult to retrain neural networks and get the same results?

Photo by Ian Sane

Last week I had a question from a colleague about reproducibility in TensorFlow, specifically in the 1.14 era. He wanted to be able to run the same training code multiple times and get exactly the same results, which on the surface doesn’t seem like an unreasonable expectation. Machine learning training is fundamentally a series of arithmetic operations applied repeatedly, so what makes getting the same results every time so hard? I had the same question when we first started TensorFlow, and I was lucky enough to learn some of the answers from the numerical programming experts on the team, so I want to share a bit of what I discovered.

There are good guides to achieving reproducibility out there, but they don’t usually include explanations for why all the steps involved are necessary, or why training becomes so slow when you do apply them. One reason reproducibility is so hard is that every single calculation in the process has the potential to change the end results, which means every step is a potential weak link. This means you have to worry about everything from random seeds (which are actually fairly easy to make reproducible, but can be hard to locate in a big code base) to code in external libraries. CuDNN doesn’t guarantee determinism by default for example, nor do some operations like reductions in TensorFlow’s CPU code.

It was the code execution part that confused me the most. I could understand the need to set random seeds correctly, but why would any numerical library with exactly the same inputs sometimes produce different outputs? It seemed like there must be some horrible design flaw to allow that to happen! Thankfully my teammates helped me wrap my head around what was going on.

The key thing I was missing was timing. To get the best performance, numerical code needs to run on multiple cores, whether on the CPU or the GPU. The important part to understand is that how long each core takes to complete is not deterministic. Lots of external factors, from the presence of data in the cache to interruptions from multi-tasking can affect the timing. This means that the order of operations can change. Your high school math class might have taught you that x + y + z will produce the same result as z + y + x, but in the imperfect world of floating point numbers that ain’t necessarily so. To illustrate this, I’ve created a short example program in the Godbolt Compiler Explorer.

#include <stdio.h>

float add(float a, float b) {
    return a + b;
}

int main(void) {
    float x = 0.00000005f;
    float y = 0.00000005f;
    float z = 1.0f;

    float result0 = x + y + z;
    float result1 = z + x + y;

    printf("%.9g\n", result0);
    printf("%.9g\n", result1);

    float result2 = add(add(x, y), z);
    float result3 = add(x, add(y, z));

    printf("%.9g\n", result2);
    printf("%.9g\n", result3);
}

You might not guess exactly what the result will be, but most people’s expectations are that result0, result1, result2, and result3 should all be the same, since the only difference is in the order of the additions. If you run the program in Godbolt though, you’ll see the following output:

ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 0
1.00000012
1
1.00000012
1

So what’s going on? The short answer is that floating point numbers only have so much precision, and if you try to add a very small number to a much larger one, there’s a limit below which the small number’s contribution is entirely lost to rounding, so the addition has no effect. In this example, I’ve set things up so that 0.00000005 is below that limit for 1.0: if you do the 1.0 + 0.00000005 operation first, the result stays at 1.0 because the small value is lost in the rounding. However, if you do 0.00000005 + 0.00000005 first, this produces an intermediate sum of 0.0000001, which is large enough to survive the rounding when added to 1.0, and so it does affect the result.

This might seem like an artificial example, but most of the compute-intensive operations inside neural networks boil down to a long series of multiply-adds. Convolutions and fully-connected calculations will be split across multiple cores either on a GPU or CPU. The intermediate results from each core are often accumulated together, either in memory or registers. As the timing of the results being returned from each core varies, the order of addition will change, just as in the code above. Neural networks may require trillions of operations to be executed in any given training run, so it’s almost certain that these kinds of edge cases will occur and the results will vary.

So far I’ve been assuming we’re running on a single machine with exactly the same hardware for each run, but as you can imagine not only will the timing vary between platforms, but even things like the cache sizes may affect the dimensions of the tiles used to optimize matrix multiplies across cores, which adds a lot more opportunities for differences. You might be tempted to increase the size of the floating point representation from 32 bits to 64 to address the issue, but this only reduces the probability of non-determinism rather than eliminating it entirely. It will also have a big impact on performance.

Properly addressing these problems requires the writers of base math functions like GEMM to add extra constraints to their code, which can be a lot of work to get right, since testing anything timing-related and probabilistic is so complex. Since all the operations that you use in your network need to be modified and checked, the work usually requires a dedicated team to track down, fix, and verify all the barriers to reproducibility. These requirements also conflict with many of the optimizations that reduce latency on multi-core systems, so the functions become slower. In the guide above, the total time for a training run went from 28 to 105 minutes once all the modifications needed to ensure reproducibility were made.
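To give a sense of what those constraints look like, here’s a small sketch of one common pattern for making a parallel sum deterministic: each worker accumulates over a fixed chunk of the data into its own slot, and the partial results are then combined in a fixed order. This illustrates the general idea rather than how any particular GEMM library implements it, and the outer loop is written serially here to keep the sketch short; in a real implementation each chunk would run on its own core.

#include <stddef.h>

/* Deterministic parallel-style reduction: the chunk boundaries and the order
   in which partials are combined are fixed, so the floating point rounding is
   identical on every run, no matter which worker finishes first. */
float deterministic_sum(const float* values, size_t count,
                        float* partials, size_t num_workers) {
  size_t chunk = count / num_workers;
  /* Phase 1: each worker sums its own fixed, contiguous chunk into its own
     slot. (Each iteration of this loop stands in for a separate core.) */
  for (size_t w = 0; w < num_workers; ++w) {
    size_t begin = w * chunk;
    size_t end = (w == num_workers - 1) ? count : begin + chunk;
    float acc = 0.0f;
    for (size_t i = begin; i < end; ++i) {
      acc += values[i];
    }
    partials[w] = acc;
  }
  /* Phase 2: combine the partials in a fixed order on a single thread. */
  float total = 0.0f;
  for (size_t w = 0; w < num_workers; ++w) {
    total += partials[w];
  }
  return total;
}

Even in this toy version you can see where the performance goes: the chunking is fixed rather than adaptive, and the final combination is serialized, which is part of why the reproducible configurations described in the guide above run so much slower.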

I’m writing this post because I find it fascinating that our systems have become so complex that an algorithm like matrix multiply that can be described in a few lines of pseudo-code can have implementations that produce such surprising and unexpected results. I’m barely scratching the surface of all the complexities of floating point math here, but many other causes of similar issues are emerging as we do more and more calculations on massive distributed networks of computers. Honestly that’s one reason I enjoy working on embedded systems so much these days, I have a lot more confidence in my mental model of how those chips work. It does feel like it’s no longer possible for any one person to have a full and deep understanding of the whole stack on any modern platform, even just for software. I love digging into the causes of weird and surprising properties like non-determinism in training because it helps me understand more about what I don’t know!