The Machine Learning Reproducibility Crisis

Gosper Glider Gun

I was recently chatting to a friend whose startup’s machine learning models were so disorganized it was causing serious problems as his team tried to build on each other’s work and share it with clients. Even the original author sometimes couldn’t train the same model and get similar results! He was hoping that I had a solution I could recommend, but I had to admit that I struggle with the same problems in my own work. It’s hard to explain to people who haven’t worked with machine learning, but we’re still back in the dark ages when it comes to tracking changes and rebuilding models from scratch. It’s so bad it sometimes feels like stepping back in time to when we coded without source control.

When I started out programming professionally in the mid-90’s, the standard for keeping track and collaborating on source code was Microsoft’s Visual SourceSafe. To give you a flavor of the experience, it didn’t have atomic check-ins, so multiple people couldn’t work on the same file, the network copy required nightly scans to avoid mysterious corruption, and even that was no guarantee the database would be intact in the morning. I felt lucky though, one of the places I interviewed at just had a wall of post-it notes, one for each file in the tree, and coders would take them down when they were modifying files, and return them when they were done!

This is all to say, I’m no shrinking violet when it comes to version control. I’ve toughed my way through some terrible systems, and I can still monkey together a solution using rsync and chicken wire if I have to. Even with all that behind me, I can say with my hand on my heart, that machine learning is by far the worst environment I’ve ever found for collaborating and keeping track of changes.

To explain why, here’s a typical life cycle of a machine learning model:

  • A researcher decides to try a new image classification architecture.
  • She copies and pastes some code from a previous project to handle the input of the dataset she’s using.
  • This dataset lives in one of her folders on the network. It’s probably one of the ImageNet downloads, but it isn’t clear which one. At some point, someone may have removed some of the images that aren’t actually JPEGs, or made other minor modifications, but there’s no history of that.
  • She tries out a lot of slightly different ideas, fixing bugs and tweaking the algorithms. These changes are happening on her local machine, and she may just do a mass file copy of the source code to her GPU cluster when she wants to kick off a full training run.
  • She executes a lot of different training runs, often changing the code on her local machine while jobs are in progress, since they take days or weeks to complete.
  • There might be a bug towards the end of the run on a large cluster that means she modifies the code in one file and copies that to all the machines, before resuming the job.
  • She may take the partially-trained weights from one run, and use them as the starting point for a new run with different code.
  • She keeps around the model weights and evaluation scores for all her runs, and picks which weights to release as the final model once she’s out of time to run more experiments. These weights can be from any of the runs, and may have been produced by very different code than what she currently has on her development machine.
  • She probably checks in her final code to source control, but in a personal folder.
  • She publishes her results, with code and the trained weights.

This is an optimistic scenario with a conscientious researcher, but you can already see how hard it would be for somebody else to come in and reproduce all of these steps and come out with the same result. Every one of these bullet points is an opportunity to inconsistencies to creep in. To make things even more confusing, ML frameworks trade off exact numeric determinism for performance, so if by a miracle somebody did manage to copy the steps exactly, there would still be tiny differences in the end results!

In many real-world cases, the researcher won’t have made notes or remember exactly what she did, so even she won’t be able to reproduce the model. Even if she can, the frameworks the model code depend on can change over time, sometimes radically, so she’d need to also snapshot the whole system she was using to ensure that things work. I’ve found ML researchers to be incredibly generous with their time when I’ve contacted them for help reproducing model results, but it’s often months-long task even with assistance from the original author.

Why does this all matter? I’ve had several friends contact me about their struggles reproducing published models as baselines for their own papers. If they can’t get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It’s also clearly concerning to rely on models in production systems if you don’t have a way of rebuilding them to cope with changed requirements or platforms. At that point your model moves from being a high-interest credit card of technical debt to something more like what a loan-shark offers. It’s also stifling for research experimentation; since making changes to code or training data can be hard to roll back it’s a lot more risky to try different variations, just like coding without source control raises the cost of experimenting with changes.

It’s not all doom and gloom, there are some notable efforts around reproducibility happening in the community. One of my favorites is the TensorFlow Benchmarks project Toby Boyd’s leading. He’s made it his team’s mission not only to lay out exactly how to train some of the leading models from scratch with high training speed on a lot of different platforms, but also ensures that the models train to the expected accuracy. I’ve seen him sweat blood trying to get models up to that precision, since variations in any of the steps I listed above can affect the results and there’s no easy way to debug what the underlying cause is, even with help from the authors. It’s also a never-ending job, since changes in TensorFlow, in GPU drivers, or even datasets, can all hurt accuracy in subtle ways. By doing this work, Toby’s team helps us spot and fix bugs caused by changes in TensorFlow in the models they cover, and chase down issues caused by external dependencies, but it’s hard to scale beyond a comparatively small set of platforms and models.

I also know of other teams who are serious about using models in production who put similar amounts of time and effort into ensuring their training can be reproduced, but the problem is that it’s still a very manual process. There’s no equivalent to source control or even agreed best-practices about how to archive a training process so that it can be successfully re-run in the future. I don’t have a solution in mind either, but to start the discussion here are some principles I think any approach would need to follow to be successful:

  • ¬†Researchers must be able to easily hack around with new ideas, without paying a large “process tax”. If this isn’t true, they simply won’t use it. Ideally, the system will actually boost their productivity.
  • If a researcher gets hit by a bus¬†founds their own startup, somebody else should be able to step in the next day and train all the models they have created so far, and get the same results.
  • There should be some way of packaging up just what you need to train one particular model, in a way that can be shared publicly without revealing any history the author doesn’t wish to.
  • To reproduce results, code, training data, and the overall platform need to be recorded accurately.

I’ve been seeing some interesting stirrings in the open source and startup world around solutions to these challenges, and personally I can’t wait to spend less of my time dealing with all the related issues, but I’m not expecting to see a complete fix in the short term. Whatever we come up with will require a change in the way we all work with models, in the same way that source control meant a big change in all of our personal coding processes. It will be as much about getting consensus on the best practices and educating the community as it will be about the tools we come up with. I can’t wait to see what emerges!