Non-Max Suppressions, How do they Work?

(En espaƱol:

I’ve been working with neural networks to do image recognition for almost a decade now, but I have to admit I never really understood the details of how they output things like bounding boxes. I didn’t have a good mental model for how it all worked, and the reference functions always seemed pretty intimidating. In a lot of cases this doesn’t matter, the conversion process is handled by internal layers inside a model, and the application developer doesn’t need to worry about what’s happening under the hood. Recently though, I’ve begun working with some networks that expect the conversion to be handled externally, and so I’ve been writing code from scratch to perform the translation.

That has forced me to understand the details, and to make sure I have a good grasp, and have something to refer to in the future, I’ve put together this blog post and a Python Colab demonstrating it all, step by step. I’m using an example model from the awesome MediaPipe framework (which handles all the conversion itself, if you’re on a platform that supports it), and I’ve written reference code to explain the workflow to get from raw tensors to a cleaned-up set of bounding boxes. In particular, I feel like I’ve finally got a handle on “non-max suppression”, which turned out to be less intimidating than I’d feared.

How do they Work?

I recommend working through the Colab to get the best understanding, but the summary is that most neural networks designed to produce bounding boxes use a grid of anchor points across the image as a base. Each anchor point is associated with a score value, along with x and y offsets, width, height, and any other feature coordinates (like nose or eye positions). All of these coordinates are output relative to the anchor points, normalized between 0.0 and 1.0, where 1.0 is the image size. There is one score, and one set of coordinates for each anchor point, so in the case of the face model I’m using in the notebook, there are 48 columns and 48 rows of anchors, spread 4 pixels apart on a 192×192 image, which means 2,304 entries.

There are two outputs to the model, the first with a shape of (1, 2304, 16), holding 8 pairs of (x, y) coordinates for each anchor. The second is (1, 2304, 1) and holds the score for each anchor. For this model, the first two pairs of coordinates are the origin of the bounding box and its width and height. The other six are the positions of facial landmarks like the mouth or nose. The first stage of decoding is to turn these from relative positions into absolute coordinates by adding the corresponding anchor origins. This gives you a soup of overlapping bounding boxes, each associated with a score.

The next challenge is reducing this set of overlapping boxes into a single one for each real object detection. That’s where the non-max suppression algorithm comes in.

The initial step is to sort the boxes with the highest scores first. After that, we find all the boxes that overlap significantly and merge them together. The exact methods we use to determine if the overlap is significant can be seen in the `overlap_similarity()` function. The merging process either involves just taking the top-scoring box from an overlapping set (`unweighted_non_max_suppression()`) or averaging all the boxes and features in a set, weighted by their score (`weighted_non_max_suppression()`). And that’s how non-max suppression works!

My favorite debugging hack

Image from Wikimedia

The first machine I programmed commercially was the original PlayStation, and I didn’t appreciate it at the time but it had the best debugger I’ve ever had the pleasure to use. One of the best features was a continuously updating variable view, so you could see how values you cared about were changing in real time, without pausing the code. I’ve not been able to find anything quite that good since, but one of my leads did teach me an approach that’s almost as useful that works on any system. I’ve been surprised that I haven’t seen the technique used more widely (it wasn’t part of Google’s standard toolkit for example) so I wanted to share it here.

The short story is that you can easily output the values of variables at any point in your C or C++ code by inserting a line like:


Every time that line of code is hit, it will output the location, name, and value to stderr:

bar.c:101 foo=10

This may seem blindingly obvious – why not just write an fprintf() statement that does the same thing? What I find most useful is that it turns a minute of thinking about format strings and typing the variable name twice into a few seconds of adding a simple macro call. Lowering the effort involved means I’m a lot more likely to actually add the instrumentation and learn more about what’s going on, versus stubbornly trying to debug the problem by staring at the code in frustration and willing it to work. I often find scattering a bunch of these statements throughout the area I’m struggling with will help me align my mental model of what I think the code should be doing with what’s actually happening.

The implementation is just a few lines of code, included below or available as a Gist here. It’s so simple I often write it out again from memory when I’m starting a new codebase. The biggest piece of magic is the way it automatically pulls the variable name from the input argument, so you can easily see the source of what’s being logged. The do/while construct is just there so that a semicolon is required at the end of the macro call, like a normal function invocation. I’m making the code available here under a CC0 license, which is equivalent to public domain in the US, so if you think it might be useful feel free to grab it for yourself.


#include <stdio.h>
#include <stdint.h>

#define TRACE_STR(variable) do { fprintf(stderr, __FILE__":%d "#variable"=%s\n", __LINE__, variable); } while (0)
#define TRACE_INT(variable) do { fprintf(stderr, __FILE__":%d "#variable"=%d\n", __LINE__, variable); } while (0)
#define TRACE_PTR(variable) do { fprintf(stderr, __FILE__":%d "#variable"=0x%016lx\n", __LINE__, (uint64_t)(variable)); } while (0)
#define TRACE_SIZ(variable) do { fprintf(stderr, __FILE__":%d "#variable"=%zu\n", __LINE__, variable); } while (0)

#endif  // INCLUDE_TRACE_H

Leaving Google, Starting Stanford

I’ve been at Google for seven years, and I’ve been lucky enough to work with some amazing people on projects like TensorFlow that I’m very proud of. I’ve been talking about all the wonderful TinyML things you can build using TensorFlow Lite Micro a lot over the last few years, and the time has finally come to start trying to build some of them myself! Much as I’d like to, it’s very costly and time-consuming to launch new hardware devices at Google, because the downsides of a failed or buggy launch to any large company’s reputation are so high. Instead, I’ve decided to go back to college after more than twenty years away, and work on a Computer Science PhD at Stanford.

I’ve enjoyed teaching EE292D there for the last couple of years, it’s been wonderful being able to draft off the students’ enthusiasm about the possibilities with these emerging technologies, and I’ve learned a lot from faculty like Zain Asgar, Sachin Katti, and Boris Murmann. I’m very pleased I’ll have a chance to spend more time on campus.

TensorFlow Lite Micro is in very good hands with Advait Jain and the rest of the team, usage and headcount has continued to grow over the last couple of years, so I’m very optimistic about its future. I’ll be publishing more details about my plans soon, along with some demos, but I’ll be using the framework myself to create some of the devices I’ve been dreaming about since the project started.

It’s going to be an interesting new adventure, I’m definitely going to be feeling a bit like Rodney Dangerfield in the classes I’m taking, but I want to thank everyone who’s supported me getting this far. If you want to get in touch, my Stanford home page has more details on how to reach me, I’m looking forward to learning, teaching, and researching in a whole new environment.