Why is it so difficult to retrain neural networks and get the same results?

Photo by Ian Sane

Last week I had a question from a colleague about reproducibility in TensorFlow, specifically in the 1.14 era. He wanted to be able to run the same training code multiple times and get exactly the same results, which on the surface doesn’t seem like an unreasonable expectation. Machine learning training is fundamentally a series of arithmetic operations applied repeatedly, so what makes getting the same results every time so hard? I had the same question when we first started TensorFlow, and I was lucky enough to learn some of the answers from the numerical programming experts on the team, so I want to share a bit of what I discovered.

There are good guides to achieving reproducibility out there, but they don’t usually include explanations for why all the steps involved are necessary, or why training becomes so slow when you do apply them. One reason reproducibility is so hard is that every single calculation in the process has the potential to change the end results, which means every step is a potential weak link. This means you have to worry about everything from random seeds (which are actually fairly easy to make reproducible, but can be hard to locate in a big code base) to code in external libraries. For example, cuDNN doesn’t guarantee determinism by default, and neither do some operations in TensorFlow’s CPU code, such as reductions.

It was the code execution part that confused me the most. I could understand the need to set random seeds correctly, but why would any numerical library with exactly the same inputs sometimes produce different outputs? It seemed like there must be some horrible design flaw to allow that to happen! Thankfully my teammates helped me wrap my head around what was going on.

The key thing I was missing was timing. To get the best performance, numerical code needs to run on multiple cores, whether on the CPU or the GPU. The important part to understand is that how long each core takes to complete its work is not deterministic. Lots of external factors, from the presence of data in the cache to interruptions from multi-tasking, can affect the timing. This means that the order of operations can change. Your high school math class might have taught you that x + y + z will produce the same result as z + y + x, but in the imperfect world of floating point numbers that ain’t necessarily so. To illustrate this, I’ve created a short example program in the Godbolt Compiler Explorer.

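#include <stdio.h>

/* Adding through a function makes the grouping of the operands explicit. */
float add(float a, float b) {
    return a + b;
}

int main(void) {
    float x = 0.00000005f;
    float y = 0.00000005f;
    float z = 1.0f;

    /* Evaluated left to right: (x + y) + z, then (z + x) + y. */
    float result0 = x + y + z;
    float result1 = z + x + y;

    printf("%.9g\n", result0);
    printf("%.9g\n", result1);

    /* Small values added first, then a small value added to the large one first. */
    float result2 = add(add(x, y), z);
    float result3 = add(x, add(y, z));

    printf("%.9g\n", result2);
    printf("%.9g\n", result3);

    return 0;
}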

You might not guess exactly what the result will be, but most people’s expectations are that result0, result1, result2, and result3 should all be the same, since the only difference is in the order of the additions. If you run the program in Godbolt though, you’ll see the following output:

ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 0
1.00000012
1
1.00000012
1

So what’s going on? The short answer is that floating point numbers only have so much precision, and if you try to add a very small number to a larger one, there’s a limit below which the small number is lost to rounding and so the addition has no effect. In this example, I’ve set things up so that 0.00000005 is below that limit for 1.0 (for 32-bit floats the limit is roughly 0.00000006, half the gap between 1.0 and the next representable value), so that if you do the 1.0 + 0.00000005 operation first, there’s no change to the result because the sum is rounded back to 1.0. However, if you do 0.00000005 + 0.00000005 first, this produces an intermediate sum of 0.0000001, which is above that limit when added to 1.0, and so it does affect the result.
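The rounding limit described above can be checked directly. What follows is a minimal sketch, assuming IEEE 754 single-precision floats; the helper names add_f32, is_absorbed and check_rounding_example are made up for illustration, and the volatile store is only there to force each sum to be rounded to 32 bits.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: add two floats and force the result to be rounded
   to single precision by storing it in a volatile float. */
float add_f32(float a, float b) {
    volatile float sum = a + b;
    return sum;
}

/* Hypothetical helper: returns 1 if adding `small` to `big` leaves `big`
   unchanged, meaning the small value was lost to rounding. */
int is_absorbed(float big, float small) {
    return add_f32(big, small) == big;
}

/* Self-check using the same values as the example program. */
void check_rounding_example(void) {
    const float tiny = 0.00000005f;

    /* One tiny value on its own is below the limit for 1.0. */
    assert(is_absorbed(1.0f, tiny));

    /* Two tiny values added together first are above it. */
    assert(!is_absorbed(1.0f, add_f32(tiny, tiny)));

    /* The surviving sum lands on the next float after 1.0. */
    assert(add_f32(1.0f, add_f32(tiny, tiny)) == 1.0f + FLT_EPSILON);
}
```

Calling check_rounding_example() from a main function should pass silently on a platform with IEEE 754 floats.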

This might seem like an artificial example, but most of the compute-intensive operations inside neural networks boil down to a long series of multiply-adds. Convolutions and fully-connected calculations will be split across multiple cores either on a GPU or CPU. The intermediate results from each core are often accumulated together, either in memory or registers. As the timing of the results being returned from each core varies, the order of addition will change, just as in the code above. Neural networks may require trillions of operations to be executed in any given training run, so it’s almost certain that these kinds of edge cases will occur and the results will vary.
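To see how arrival order alone can change a total, here is a small single-threaded sketch. It is a toy model and not how any real GPU or CPU kernel is written: the function name accumulate_in_arrival_order and the idea of passing the arrival order as an array of indices are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a reduction: each "core" has produced one partial
   sum, and arrival_order lists the core indices in the order their results
   happened to come back. The total is accumulated in that order. */
float accumulate_in_arrival_order(const float *partials,
                                  const size_t *arrival_order,
                                  size_t count) {
    volatile float total = 0.0f; /* volatile forces rounding to 32 bits */
    for (size_t i = 0; i < count; ++i) {
        assert(arrival_order[i] < count);
        total = total + partials[arrival_order[i]];
    }
    return total;
}
```

With the partial sums {1.0f, 0.00000005f, 0.00000005f}, the arrival order 0, 1, 2 gives exactly 1.0, while the order 1, 2, 0 gives a value slightly above 1.0, even though the inputs are identical.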

So far I’ve been assuming we’re running on a single machine with exactly the same hardware for each run, but as you can imagine not only will the timing vary between platforms, but even things like the cache sizes may affect the dimensions of the tiles used to optimize matrix multiplies across cores, which will add a lot more opportunities for differences. You might be tempted to increase the size of the floating point representation from 32 bits to 64 to address the issue, but this only reduces the probability of non-determinism; it doesn’t eliminate it. It will also have a big impact on performance.
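The same order dependence can be reproduced in 64-bit doubles by shrinking the small values. This is a minimal sketch assuming IEEE 754 double precision; add_f64 and double_sum_depends_on_order are hypothetical names, and 1e-16 is chosen because it sits just below the rounding limit for 1.0 in doubles (about 1.1e-16).

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: add two doubles and force rounding to 64 bits. */
double add_f64(double a, double b) {
    volatile double sum = a + b;
    return sum;
}

/* Same experiment as the float program, scaled down for doubles.
   Returns 1 if the two orderings disagree. */
int double_sum_depends_on_order(void) {
    const double tiny = 1e-16;

    /* Small values first: their sum is large enough to survive. */
    const double small_first = add_f64(add_f64(tiny, tiny), 1.0);

    /* Large value first: each tiny value is lost in turn. */
    const double large_first = add_f64(add_f64(1.0, tiny), tiny);

    assert(small_first == 1.0 + DBL_EPSILON);
    assert(large_first == 1.0);
    return small_first != large_first;
}
```

The window of values that trigger the problem is far narrower than with floats, which is why moving to 64 bits makes mismatches rarer without ruling them out.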

Properly addressing these problems requires writers of the base math functions like GEMM to add extra constraints to their code, which can be a lot of work to get right, since testing anything timing-related and probabilistic is so complex. Since all the operations that you use in your network need to be modified and checked, the work usually requires a dedicated team to fix and verify all the barriers to reproducibility. These requirements also conflict with many of the optimizations that reduce latency on multi-core systems, so the functions become slower. In the guide above, the total time for a training run went from 28 to 105 minutes once all the modifications needed to ensure reproducibility were made.
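One simple kind of constraint is to fix the order of accumulation regardless of when results arrive. The sketch below is an illustration under stated assumptions, not how any particular GEMM library does it: the struct partial_result, the MAX_WORKERS limit and the function accumulate_in_fixed_order are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 64 /* arbitrary limit chosen for this sketch */

/* One partial result tagged with the (hypothetical) worker that produced it. */
struct partial_result {
    size_t worker_id;
    float value;
};

/* Sketch of a deterministic reduction: results may arrive in any order, but
   each one is first placed in the slot for its worker, and the slots are then
   summed in a fixed index order. Assumes every worker_id in [0, count)
   appears exactly once. */
float accumulate_in_fixed_order(const struct partial_result *arrivals,
                                size_t count) {
    float slots[MAX_WORKERS] = {0.0f};
    assert(count <= MAX_WORKERS);

    for (size_t i = 0; i < count; ++i) {
        assert(arrivals[i].worker_id < count);
        slots[arrivals[i].worker_id] = arrivals[i].value;
    }

    volatile float total = 0.0f; /* volatile forces rounding to 32 bits */
    for (size_t i = 0; i < count; ++i) {
        total = total + slots[i];
    }
    return total;
}
```

The total no longer depends on arrival order, but nothing can be added until every result is in and each partial needs its own storage, which hints at why this kind of constraint tends to cost performance.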

I’m writing this post because I find it fascinating that our systems have become so complex that an algorithm like matrix multiply, which can be described in a few lines of pseudo-code, can have implementations that produce such surprising and unexpected results. I’m barely scratching the surface of all the complexities of floating point math here, but many other causes of similar issues are emerging as we do more and more calculations on massive distributed networks of computers. Honestly that’s one reason I enjoy working on embedded systems so much these days: I have a lot more confidence in my mental model of how those chips work. It does feel like it’s no longer possible for any one person to have a full and deep understanding of the whole stack on any modern platform, even just for software. I love digging into the causes of weird and surprising properties like non-determinism in training because it helps me understand more about what I don’t know!

Machines of Loving Understanding

Sparko, the world’s first electrical dog, as he looked on arrival at the engineer’s club, New York City, on his way to the World’s fair, where he will be an attraction at the Westinghouse Building. He walks, barks, wags his tail and sits up to beg. With Sparko, is Elektro, Westinghouse mechanical man. Both are creations of J.M. Barnett, Westinghouse engineer of Mansfield, OH.

I like to think
(it has to be!)
of a cybernetic ecology
where we are free of our labors
and joined back to nature,
returned to our mammal
brothers and sisters,
and all watched over
by machines of loving grace.

Brautigan’s poem inspires and terrifies me at the same time. It’s a reminder of how creepy a world full of devices that blur the line between life and objects could be, but there’s also something appealing about connecting more closely to the things we build. Far more insightful people than me have explored these issues, from Mary Shelley to Philip K. Dick, but the aspect that has fascinated me most is how computers understand us.

We live in a world where our machines are wonderful at showing us near-photorealistic scenes in real time, and can even talk to us in convincing voices. Up until recently though, they’ve not been able to make sense of images or audio that are given to them as inputs. We’ve been able to synthesize voices for decades, but speech recognition has only really started working well in the last few years. Computers have been like Crocodile Sales Reps, with enormous mouths and tiny ears, great at talking but terrible at listening. That means they can be immensely frustrating to deal with, since they seem to have no ability to do what we mean. Instead we have to spend a lot of time painstakingly communicating our needs in a form that makes sense to them, even if it is unnatural for us.

This process started with toggling switches on a control panel, moved to punch cards, teletypes, CRT terminals, mouse-driven GUIs, swiping on a touch screen and most recently basic voice interfaces. Each of these steps was a big advance, but compared to how we communicate with other people, or even our pets, they still feel clumsy.

What has me most excited about all the recent advances in machine learning is that they’re starting to give computers the ability to understand us in a much deeper and more natural way. The video above is just a small robot that I built for a few dollars as a technology demonstration, but because it followed my face around, I ended up becoming quite attached. It was exhibiting behavior that we associate with people or animals who like us. Even though I knew it was just code underneath, it was hard not to see it as a character instead of an object. It became a Pencil named Steve.

Face following is a comparatively simple ability, but it’s enough to build more useful objects like a fan that always points at you, or a laptop screen that locks when nobody is around. As one of the comments says, the fan is a bit creepy. I believe this is because it’s an object that’s exhibiting attributes that we associate with living beings, entering the Uncanny Valley. The googly eyes probably didn’t help. The confounding part is that the property that makes it most creepy is the same thing that makes it helpful.

We’re going to see more and more of these capabilities making it into everyday objects (at least if I have anything to do with it) so I expect the creepiness and usefulness will keep growing in parallel too. Imagine a robot vacuum that you can talk to naturally and it will respond, that you can shoo away or control with hand gestures, and that follows you around while you’re eating to pick up crumbs you drop. Doesn’t that sound a lot like a dog? All of these behaviors help it do its job better, because it’s understanding us in a more natural way instead of expecting us to learn its language, but they also make it feel a lot more alive. Increased understanding goes hand in hand with creepiness.

This already leads to a lot of unresolved tension in our relationships with voice assistants. 79% of Americans believe they spy on their conversations, but 42% of us still use them! I think this belief is so widespread because it’s hard not to treat something that you can talk to as a pseudo-person, which also makes it hard not to expect that it is listening all the time, even if it doesn’t respond. That feeling will only increase once they take account of glances, gestures, even your mood.

If I’m right, we’re going to be entering a new age of creepy but useful objects that seem somewhat alive. What should we do about it? The first part might seem obvious but it rarely happens with any new technology – have a public debate about what we as a community think should be acceptable, right now, while it’s in the early stages of deployment, not after it’s a done deal. I’m a big fan of representative democracy, with all its flaws, so let’s encourage people outside the tech world to help draw the lines of what’s ethical and reasonable. I’m trying to take a step in that direction by putting our products up on maker sites so that anyone can try them out for themselves, but I’d love to figure out how to do something like a roadshow demonstrating what’s coming in the near future. I guess this blog post is an attempt at that too. If there’s going to be a tradeoff between creepiness and utility, let’s give ordinary people the power to determine what the balance should be.

The second important realization is that the tech industry is beyond the point where we can just say “trust us” and reasonably expect people to believe our claims. We’ve lost too much credibility. Moving forward we need to build our products in a way that third parties can check what we’re doing in a meaningful way. As I wrote a few months ago, I know Google isn’t spying on your conversations, but I can’t prove it. I’ve proposed the ML sensors approach we use as a response to that problem, so that someone like Underwriters Laboratories can test our privacy claims on behalf of consumers.

That’s just one idea though; anything that lets people outside the manufacturers verify their claims would be welcome. To go along with that, I’d love to see enforceable laws that require creators of devices to label what information they collect, like an ingredients list for food items. What I don’t know is how to prevent these from turning into meaningless Prop 65-style privacy policies, where every company basically says they can do anything with any information and share it with anyone they choose. Even though GDPR is flawed in many ways, it did force a lot of companies to be more careful with how they handle personal data, internally and externally. I’d love smarter people than me to figure out how we make privacy claims minimal and enforceable, but I believe the foundation has to be designing systems that can be audited.

Whether this new world we’re moving towards becomes more of a utopia or dystopia depends on the choices we make now. Computers that understand us better can help when an elderly person falls, but the exact same technology could send police after a homeless person bedding down in a doorway. Our ML models can spot drowning victims, or criminalize wild swimming. Ubiquitous, cheap, battery-powered voice recognition could make devices accessible to many more people, or supercharge bugging by repressive regimes. Technologists alone shouldn’t have the power to decide the direction we head in; we need everyone’s help to chart the right path and make the hard tradeoffs. We need to make sure that the machines that watch over us truly will be loving.