Chesterton’s shell script

Photo by Linnaea Mallette

I have a few days off over the holidays, so I’ve had a chance to think about what I’ve learned over the last year. Funnily enough, the engineering task that taught me the most was writing and maintaining a shell script of less than two hundred lines. One of my favorite things about programming is how seemingly simple problems turn out to have endless complexity the longer you work with them. Of course, it’s also one of the most frustrating things!

The problem I faced was that TensorFlow Lite for Microcontrollers depends on external frameworks like KissFFT or the GNU compiler toolchain to build. For TFL Micro to be successful it has to be easy to compile, even for engineers from the embedded world who may not be familiar with the wonderful world of installing Linux or MacOS dependencies. That meant downloading these packages had to be handled automatically by the build process.

Because of the degree of customization we needed to handle cross-compilation across a large number of very different platforms, early on I made the decision to use ‘make’ as our build tool, rather than a more modern system like cmake or bazel. This is still a controversial choice within the team, since it involves extensive use of make’s scripting language to do everything we need, which is a maintenance nightmare. The bus factor in this area is effectively one, since I’m the only engineer familiar with that code, and even I can’t easily debug complex issues. I believe the tradeoffs were worth it though, and I’m hoping to mitigate the problems by improving the build infrastructure in the future.

A downside of choosing make is that it doesn’t have any built-in support for downloading external packages. The joy and terror of make is that it’s a very low-level language for expressing build dependency rules, one that lets you build almost anything on top of it, so the obvious next step was to implement the functionality I needed within that system.

The requirements at the start were that I needed a component that I could pass in a URL to a compressed archive, and it would handle downloading and unpacking it into a folder of source files. The most common place these archives come from is GitHub, where every commit to an open source project can be accessed as a ZIP or Gzipped Tar archive, like https://github.com/google/stm32_bare_lib/archive/c07d611fb0af58450c5a3e0ab4d52b47f99bc82d.zip.

The first choice I needed to make was what language to implement this component in. One option was make’s own scripting language, but as I mentioned above it’s quite obscure and hard to maintain, so I preferred an external script I could call from a makefile rule. The usual Google style recommendation is to use a Python script for anything that would be more than a few lines of shell script, but I’ve found that doesn’t make as much sense when most of the actual work will be done through shell tools, which I expected to be the case with the archive extraction.

I settled on writing the code as a bash script. You can see the first version I checked in at https://github.com/tensorflow/tensorflow/blob/e98d1810c607e704609ffeef14881f87c087394c/tensorflow/lite/experimental/micro/tools/make/download_and_extract.sh. This is a long way from the very first script I wrote, because of course I encountered a lot of requirements I hadn’t thought of at the start. The primordial version was something like this:

url="${1}"
dir="${2}"
tempfile="$(mktemp)"
curl -Ls "${url}" > "${tempfile}"
if [[ "${url}" == *gz ]]; then
  tar -C "${dir}" -xzf "${tempfile}"
elif [[ "${url}" == *zip ]]; then
  unzip -d "${dir}" "${tempfile}"
fi

Here’s a selection of the issues I had to fix with this approach before I was able to even check in that first version:

  • A few of the dependencies were bzip files, so I had to handle them too.
  • Sometimes the destination root folder didn’t exist, so I had to add an ‘mkdir -p’.
  • If the download failed halfway through or the file at the URL changed, there wouldn’t always be an error, so I had to add MD5 checksum verification.
  • If there were any BUILD files in the extracted folders, bazel builds (which are supported for x86 host compilation) would find them through a recursive search and fail.
  • We needed the extracted files to be in a first-level folder with a constant name, but by default many of the archives had versioned names (like the project name followed by the checksum). To handle this for tars I could use --strip-components=1, but for zips I had to cobble together a more manual approach.
  • Some of the packages required small amounts of patching after download to work successfully.
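With several of those fixes folded in, the core of the download-and-extract step ended up looking roughly like this. This is a simplified sketch rather than the exact checked-in script, and the function and variable names are illustrative; the demo at the bottom builds a local archive and fetches it over a file:// URL so the sketch runs without network access.

```shell
set -e

# Sketch of the hardened download-and-extract step.
download_and_extract() {
  local url="$1"
  local dir="$2"
  local tempfile
  tempfile="$(mktemp)"

  mkdir -p "${dir}"                  # the destination root may not exist yet
  curl -Ls "${url}" > "${tempfile}"

  if [[ "${url}" == *gz ]]; then
    # --strip-components drops the versioned top-level folder.
    tar -C "${dir}" --strip-components=1 -xzf "${tempfile}"
  elif [[ "${url}" == *bz2 ]]; then
    tar -C "${dir}" --strip-components=1 -xjf "${tempfile}"
  elif [[ "${url}" == *zip ]]; then
    # unzip has no --strip-components, so unpack and flatten manually.
    local tempdir
    tempdir="$(mktemp -d)"
    unzip -q -d "${tempdir}" "${tempfile}"
    cp -R "${tempdir}"/*/. "${dir}"
    rm -rf "${tempdir}"
  else
    echo "Unsupported archive type for ${url}" >&2
    return 1
  fi
  rm -f "${tempfile}"

  # Stray BUILD files confuse bazel's recursive package search.
  find "${dir}" -name BUILD -delete
}

# Demo: build a versioned archive locally and extract it over file://.
workdir="$(mktemp -d)"
mkdir -p "${workdir}/project-abc123"
echo "hello" > "${workdir}/project-abc123/hello.txt"
touch "${workdir}/project-abc123/BUILD"
tar -C "${workdir}" -czf "${workdir}/project.tar.gz" project-abc123

download_and_extract "file://${workdir}/project.tar.gz" "${workdir}/out"
```

After the demo runs, the output folder contains hello.txt directly (the versioned project-abc123 wrapper has been stripped) and the BUILD file is gone.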

All of those requirements meant that this initial version was already 124 lines long! If you look at the history of this file (continued here after the experimental move), you can see that I was far from done making changes even after all those discoveries. Here are some of the additional problems I tackled over the last six months:

  • Added (and then removed) some debug logging so I could tell what archives were being downloaded.
  • Switched the MD5 command line tool the script used to something that was present by default on MacOS, and that had a nicer interface as a bonus, so I no longer had to write to a temporary file.
  • Tried one approach to dealing with persistent '56' errors from curl by adding retries.
  • When that failed to solve the problem, tried manually looping and retrying curl.
  • Pete Blacker fixed a logic problem where we didn’t raise an error if an unsupported archive file suffix was passed in.
  • During an Arm AIoT workshop I found that some users were hitting problems when curl was not installed on their machine on the first run of the build script, and then they re-ran the build script after installation but the downloaded folders had already been created but were empty, leading to weird errors later in the process. I fixed this by erroring out early if curl isn’t present.
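The last two fixes in that list, the early curl check and the manual retry loop, might look something like this. This is a sketch under my own naming, not the checked-in script; the retry count and sleep interval are illustrative.

```shell
set -e

# Fail fast if curl is missing, before any download folders are created,
# so a later re-run doesn't find empty directories and fail mysteriously.
if ! command -v curl >/dev/null 2>&1; then
  echo "Please install curl before running this script." >&2
  exit 1
fi

# Manual retry loop for flaky transfers that slip past curl's own
# --retry handling (such as intermittent '56' connection-reset errors).
fetch_with_retries() {
  local url="$1"
  local outfile="$2"
  local attempts=0
  local max_attempts=5
  until curl -Ls --fail "${url}" -o "${outfile}"; do
    attempts=$((attempts + 1))
    if [[ ${attempts} -ge ${max_attempts} ]]; then
      echo "Download of ${url} failed after ${max_attempts} attempts." >&2
      return 1
    fi
    sleep 2
  done
}

# Demo against a local file:// URL, so the sketch runs offline.
tmp="$(mktemp -d)"
echo "payload" > "${tmp}/src.txt"
fetch_with_retries "file://${tmp}/src.txt" "${tmp}/dst.txt"
```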

What I found most interesting about dealing with these issues is how few of them were easily predictable. Pete Blacker’s logic fix was one that could have been found by inspection (not having a final else or default on a switch statement is a classic lintable error), but most of the other problems were only visible once the script became heavily used across a variety of systems. For example, I put a lot of time into mitigating occasional '56' errors from curl because they seemed to show up intermittently (maybe 10% of the time or less) for one particular archive URL. They bypassed the curl retry logic (presumably because they weren’t at the HTTP error code level?), and since they were hard to reproduce consistently I had to make several attempts at different approaches and test them in production to see which techniques worked.

A less vexing issue was the md5sum tool not being present by default on MacOS, but this was also one that I wouldn’t have caught without external testing, because my Mac environment did have it installed.
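One portable approach to that problem is to fall back between the two tools that ship by default on each platform: md5sum on most Linux distributions and md5 on MacOS. This is a sketch of that idea (the checked-in script settled on a single tool instead), with a hypothetical file_md5 helper name:

```shell
# Compute an MD5 hex digest portably: md5sum on Linux, md5 -q on MacOS.
file_md5() {
  local file="$1"
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "${file}" | awk '{print $1}'
  else
    md5 -q "${file}"
  fi
}

# Demo: the digest of the five bytes "hello".
demo="$(mktemp)"
printf 'hello' > "${demo}"
file_md5 "${demo}"   # 5d41402abc4b2a76b9719d911017c592
```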

This script came to mind as I was thinking back over the year for a few reasons. One of them was that I spent a non-trivial amount of time writing and debugging it, despite its small size and the apparent simplicity of the problem it tackled. Even in apparently glamorous fields like machine learning, 90% of the work is nuts-and-bolts integration like this. If anything, you should be doing more of it as you become more senior, since it requires a subtle understanding of the whole system and its requirements, but doesn’t look impressive from the outside. Save the easier-to-explain projects for more junior engineers; they need them for their promotion packets.

The reason this kind of work is so hard is precisely because of all the funky requirements and edge cases that only become apparent when code is used in production. As a young engineer my first instinct when looking at a snarl of complex code for something that looked simple on the surface was to imagine the original authors were idiots. I still remember scoffing at the Diablo PC programmers as I was helping port the codebase to the PlayStation, because they used inline assembler to do a simple signed-to-unsigned cast. My lead, Gary Liddon, very gently reminded me that they had managed to ship a chart-topping game and I hadn’t, so maybe I had something to learn from their approach?

Over the last two decades of engineering I’ve come to appreciate the wisdom of Chesterton’s Fence, the idea that it’s worth researching the history of a solution I don’t understand before I replace it with something new. I’m not claiming that the shell script I’m highlighting is a coding masterpiece; in fact, I have a strong urge to rewrite it because I know so many of its flaws. What holds me back is that I know how valuable the months of testing in production have been, and how any rewrite would probably be a step backwards because it would miss the requirements that aren’t obvious. There’s a lot I can capture in unit tests (and one of its flaws is that the script doesn’t have any) but it’s not always possible or feasible to specify everything a component needs to do at that level.

One of the hardest things to learn in engineering is good judgment, because it’s always subjective and usually relies on pattern matching to previous situations. Making a choice about how to approach maintaining legacy software is no different. In its worst form Chesterton’s Fence is an argument for never changing anything and I’d hate to see progress on any project get stymied by excessive caution. I do try to encourage the engineers I work with (and remind myself) that we have a natural bias towards writing new code rather than reusing existing components though. If even a minor shell script for downloading packages has requirements that are only discoverable through seasoning in production, what issues might you unintentionally introduce if you change a more complex system?
