Don’t worry if college isn’t the happiest time of your life

I was digging through paperwork today to help complete my PhD admission process, and I stopped short when I saw the academic transcript from my undergraduate years. I was a terrible student! I got 0% on one course, awful scores on many others, and had to do a lot of retakes. It brought back memories of how I was feeling when I was 18. I was a mess. I was totally unprepared for life away from home, suffered from so much anxiety I wasn’t even able to name it, like a fish having no concept of water, and jumped right into a terrible relationship the first chance I got. I was working almost full-time at Kwik-Save to pay rent, and didn’t even have a computer at home I could use.

It wasn’t supposed to be like this. Since I was a kid I’d dreamed of escaping my tiny village for university. I wasn’t sure what it was going to be like, since nobody in my immediate family had completed college, but I had vague ideas from being a townie in Cambridge and shows like Brideshead Revisited that I would be transported to a magical world of privilege and punts. Most of all, I looked forward to meeting people I could talk to about important things, people who might listen to me. I also knew I was “good at computers”, and was looking forward to diving deeper into programming. The reality of being just one of hundreds of students, with little ability to connect with any of the staff, and discovering that most of what I’d learned about coding wasn’t a big help with Computer Science, left me more than deflated. I was still the same screwed-up person; there had been no magical transformation. A lot of the time I resented the hours spent on my classes, feeling I wasn’t learning what I needed for my true vocation as a programmer, and you can see that in my grades.

Looking back, this wasn’t Manchester’s fault. It’s a fantastic university with a CS program that’s world-class, and despite my best efforts I did learn a lot from great teachers like Steve Furber and Carole Goble. Their lessons turned out to be far more useful in my career than I ever would have expected. The staff were kind and helpful on the few occasions I did reach out, but I had such a lack of confidence I seldom dared try. I managed to scrape through, with a lot of retakes, and helped by the fact that the overall marks were heavily weighted to the final year. It left me feeling cheated though, somehow. I’d always heard the cliche that these would be the happiest days of my life. If I was miserable and it was all downhill from here, what was even the point of carrying on? It didn’t help that the first technical job I could find out of college paid less than I’d made stacking shelves at a supermarket.

The good news is that life has pretty much continuously got better from that point on. Years of therapy, a career path that has gifted me some fascinating and impactful problems to work on, along with enough money to be comfortable, and building the kind of community I’d dreamed of at college through writing and the internet, have left me feeling happier than I’ve ever been. I feel very lucky to have found a way to engage with so many smart people, and find my voice, through open source coding, blogging, research papers, and teaching.

I didn’t write this post to humblebrag about how wonderful my life is; I still have plenty of challenges and disappointments. I just want to provide a data point for anyone else who is struggling or has struggled at college. It doesn’t have to define you for the rest of your life. If the experience isn’t what you’d expected and hoped for, there’s no need for despair. Life can get so much better.

Why cameras are soon going to be everywhere

i-FlatCam demo

I’ve finally had the chance to play Cyberpunk 2077 over the last few weekends, and it’s an amazing feat of graphics programming, especially with ray-tracing enabled. I’ve had fun, but I have been struck by how the cyberpunk vision of the future is rooted in the ’80s. Even though William Gibson was incredibly prescient in so many ways, the actual future we’re living in is increasingly diverging from the one he painted. One of the differences that struck me most was how cameras exist in the game’s world, compared to what I can see happening as an engineer working in the imaging field. They still primarily show up as security cameras, brick-sized devices that are attached to walls and would look familiar to someone from forty years ago.

A lot of people still share the expectation that cameras will be obvious, standalone components of a system. Even though phone cameras and webcams are smaller, they still have a noticeable physical presence, and often come with indicators like red lights that show when they’re recording. What is clear to me from my work is that these assumptions aren’t going to hold much longer. Soon imaging sensors will be so small, cheap, and energy efficient that they’ll be added to many more devices in our daily lives, and because they’re so tiny they won’t even be noticeable!

What am I basing this prediction on? The clearest indicator for me is that you can already buy devices like the Himax HM01B0 with an imaging sensor that’s less than 2mm by 2mm in size, low single-digit dollars in cost, and 2 milliwatts or less in power usage. Even more striking are the cameras that are emerging from research labs. At the TinyML Summit the University of Michigan presented a complete system that fits on the tip of a finger.

The video at the top of this post shows another project from Rice that is able to perform state of the art eye tracking at 253 FPS, using 23 milliwatts, in a lens-less system that lets it achieve a much smaller size than other solutions.

Hopefully this makes it clear that there’s a growing supply of these kinds of devices. Why do I think there will be enough demand to include them in appliances and other items around homes and offices? This is trickier to show because the applications aren’t deployed yet, but cameras can replace or augment lots of existing sensors, and they enable entirely new features, like a TV that wakes up when someone sits down in front of it and surfaces that family member’s recently-watched shows, or appliances that only respond when a person is actually present.

Any one of these applications may or may not turn out to be useful, but there are so many potential uses (including many I’m sure nobody’s thought of yet) that I can’t imagine some of them won’t take off once the technology is widely available. Many scientists believe that the Cambrian Explosion occurred because the evolution of eyes opened up so many new possibilities and functions. I’m hoping we’ll see a similarly massive expansion in the technology space once all our devices can truly see and understand.

So, that’s why I believe we’re going to end up in a world where we’re each surrounded by thousands of cameras. What does that mean? As an engineer I’m excited, because we have the chance to make a positive impact on people’s lives. As a human being, I’m terrified because the potential for harm is so large, through unwanted tracking, recording of private moments, and the sharing of massive amounts of data with technology suppliers.

If you accept my argument about why we’re headed for a world full of tiny cameras watching us at all times, then I think we all have a responsibility to plan ahead now to mitigate the potential harms. This is the motivation behind my ML sensors proposal to wall off sensitive data in a secure component, but I see this as just a starting point in the discussion. Do we need regulation? Even if we don’t get it in the US, will Europe take the lead? Should there be voluntary standards around labeling products that contain cameras or microphones? I don’t know the answers, but I don’t think we have the luxury of waiting too long to figure them out, because if we don’t make any changes we’ll be deploying billions of poorly-secured devices into everybody’s lives as a giant uncontrolled experiment.

What are ML Sensors?

I’ve spent a lot of time at conferences talking about all the wonderful things that are now possible using machine learning on embedded devices, but as Stacey Higginbotham pointed out at this year’s TinyML Summit, despite all the potential there haven’t been many shipping applications. My experience is that companies like Google with big ML teams have been able to deploy products successfully, but it has been a lot harder for teams in other industries. For example, when I visited an appliance manufacturer in China and pitched them on the glorious future they could access thanks to TensorFlow Lite Micro, they told me they didn’t even know how to open a Python notebook. Instead, they asked if I could just give them a voice interface, or something that told them when somebody sat down in front of their TV.

I realized that the software framework model that had worked so well for TensorFlow Lite adoption on phone apps didn’t translate to other domains. Many of the firms that could most benefit from ML just don’t have the software engineering resources to integrate a library, even with great tools like Edge Impulse that make the process much easier. As I thought about how to make on-device ML more widely accessible, I realized that providing ML capabilities as small, cheap hardware modules might be a good solution. This was the seed of the idea that became the ML sensors proposal, now available as a paper on arXiv.

The basic idea is that system builders are already able to integrate components like sensors into their products, so why not expose some higher-level information about the environment in the same form factor? For example, a person sensor might have a pin that goes high when someone is present, and then an I2C interface to supply more detailed information about their pose, activities, and identity. That would allow a TV manufacturer to wake up the display when someone sat down on the couch, and maybe even customize the UI to show recently-watched shows based on which family members are present. All of the complexity of the ML implementation would be taken care of by the sensor manufacturer and hidden inside the hardware module, which would have a microcontroller and a camera under the hood. The OEM would just need to respond to the actionable signals from the component.
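To make the integration story concrete, here’s a rough sketch of what the TV maker’s side might look like on a Linux-based board. The register map, the 0x62 device address, and the /dev/i2c-1 bus are all invented for illustration; a real module would publish its own interface:

#include <fcntl.h>
#include <linux/i2c-dev.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

// Hypothetical register map for an imagined person-sensor module.
#define PERSON_SENSOR_I2C_ADDR 0x62  // assumed 7-bit I2C address
#define REG_PERSON_PRESENT     0x00  // assumed: non-zero when someone is detected

int main(void) {
  int fd = open("/dev/i2c-1", O_RDWR);  // assumed I2C bus on this board
  if (fd < 0) { perror("open"); return 1; }
  if (ioctl(fd, I2C_SLAVE, PERSON_SENSOR_I2C_ADDR) < 0) { perror("ioctl"); return 1; }

  // Standard register read: write the register index, then read one byte back.
  uint8_t reg = REG_PERSON_PRESENT;
  uint8_t present = 0;
  if (write(fd, &reg, 1) != 1 || read(fd, &present, 1) != 1) { perror("i2c"); return 1; }

  if (present) {
    // This is where the OEM's own logic would go: wake the display, pick the
    // right profile, and so on. No ML code anywhere in sight.
    printf("Person detected: waking the display\n");
  }
  close(fd);
  return 0;
}

All of the model selection, training, and inference lives behind those two registers, which is exactly the point: the integration effort looks like wiring up any other sensor.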

At the same time as I was thinking about how to get ML into more people’s hands, I was also worried about the potential for abuse that the proliferation of cameras and microphones in everyday devices enables. I realized that the modular approach might have some advantages there too. I think of personal information as toxic waste, because any leaks can be highly damaging to individuals, and to the companies involved, and there are few data sources that have as much potential for harm as video and audio streams from within people’s homes. I believe it’s our responsibility as developers to engineer systems that are as leak-resistant as possible, especially if we’re dealing with cameras and microphones. I’d already explored the idea of using Arm’s TrustZone to keep sensitive data contained, but by moving the ML processing off the central microcontroller and onto a peripheral, we have the chance to design something that has a very small attack surface (because there’s no memory shared with the rest of the system) and can be audited by a third party to ensure any claims of safety are credible.

The ML sensor paper brings all these ideas together into a proposal for designing systems that are easier to build, and safer by default. I’m hoping this will start a discussion about how to improve usability and privacy in everyday systems, and lead to more practical prototyping and experimentation to answer a lot of the questions it raises. I’d love to get feedback on this proposal, especially from product designers who might want to try integrating these into their systems. I’m looking forward to seeing more work in this area. I know I’m going to be busy trying to get some examples up and running, so watch this space!

Caches Considered Harmful for Machine Learning

Photo by the National Park Service

I’ve been working on a new research paper, and a friend gave me the feedback that he was confused by the statement “memory accesses can be accurately predicted at the compilation stage” for machine learning workloads, and that this made them a poor fit for conventional processor architectures with predictive caches. I realized that this was received wisdom among the ML engineers I know, but I wasn’t aware of any papers that discuss this point. I put out a request for help on Twitter, but while there were a lot of interesting resources in the answers, I still couldn’t find any papers that focused on what feels like an important property for machine learning systems. With that in mind, I wanted to at least describe the issue as best as I can in this blog post, so there’s a trail of breadcrumbs for anyone else interested in how system designs might need to change to accommodate ML.

So, what am I talking about? Modern processors are almost universally constructed around multiple layers of predictive memory caches. These are small areas of memory that can be accessed much faster than the main system memory, and are needed because processors can execute instructions far more quickly than they can fetch values from the DRAM used for main memory. In fact, you can usually run hundreds of instructions in the time it takes to bring one byte from the DRAM. This means if processors all executed directly from system memory, they would run hundreds of times more slowly than they could. For decades, the solution to the mismatch has been predictive caches. It’s possible to build memory that’s much faster to access than DRAM, but for power and area reasons it’s not easy to fit large amounts onto a chip. In modern systems you might have gigabytes of DRAM, but only single-digit megabytes of total cache. There are some great papers like What Every Programmer Should Know About Memory that go into a lot more detail about the overall approach, but the most important thing to know is that memory stored in these caches can be accessed in a handful of cycles, instead of hundreds, so moving data into these areas is crucial if you want to run your programs faster.
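To get a feel for the size of the gap, here’s a small, unscientific microbenchmark that compares streaming sequentially through a large array (which the prefetch heuristics handle well) against chasing a chain of dependent random pointers (which they can’t). The exact numbers depend entirely on the machine, but on typical hardware the random chase is well over an order of magnitude slower per access:

// Rough microbenchmark contrasting cache-friendly sequential access with
// cache-hostile dependent random loads. Sizes and results vary by machine;
// compile with something like "gcc -O2 cache_demo.c" and treat the output
// as illustrative only.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 25)  // 32M entries, far larger than any cache

static double seconds(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
  uint32_t *next = malloc(N * sizeof(uint32_t));
  if (!next) return 1;
  // Sattolo's algorithm: build a single random cycle, so every load in the
  // chase below depends on the previous one and defeats the prefetcher.
  for (uint32_t i = 0; i < N; i++) next[i] = i;
  for (uint32_t i = N - 1; i > 0; i--) {
    uint32_t j = (uint32_t)(rand() % i);
    uint32_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
  }

  double t0 = seconds();
  uint64_t sum = 0;
  for (uint32_t i = 0; i < N; i++) sum += next[i];  // sequential: the prefetcher wins
  double t1 = seconds();
  uint32_t p = 0;
  for (uint32_t i = 0; i < N; i++) p = next[p];     // dependent random loads: mostly DRAM latency
  double t2 = seconds();

  printf("sequential: %.2f ns/access, random chase: %.2f ns/access (checks: %llu %u)\n",
         (t1 - t0) / N * 1e9, (t2 - t1) / N * 1e9, (unsigned long long)sum, p);
  free(next);
  return 0;
}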

How do we decide what data should be placed in these caches though? This requires us to predict what memory locations will be accessed hundreds or thousands of cycles in the future, and for general programs with a lot of data-dependent branches, comparisons, and complex address calculations this isn’t possible to do with complete accuracy. Instead, the caches use heuristics (like “we just accessed address N, so also fetch N+1, N+2, etc. in case we’re iterating through an array”) to guess how to populate these small, fast areas of memory. The cost of making a mistake is still hundreds of cycles, but as long as most of the accesses are predicted correctly this works pretty well in practice. However, there is still an underlying tension between the model used for programming languages, where memory is treated as a uniform arena, and the reality of hardware, where data lives in multiple different places with very different characteristics. I never thought I’d be linking to a Hacker News comment (the community has enough toxic members that I haven’t read it for years), but this post I was pointed to actually does a good job of talking about all the complexities that are introduced to make processors appear as if they’re working with uniform memory.

Why does all this matter for machine learning? The fundamental problem predictive caches are trying to solve is “What data needs to be prefetched into fast memory from DRAM?”. For most computing workloads, like rendering HTML pages or dealing with network traffic, the answer to this question is highly dependent on the input data to the algorithm. The code is full of lines like ‘if (a[i] == 10) { value = b[j]; } else { value = b[k]; }’, so predicting which addresses will be accessed requires advance knowledge of i, a[i], j, and k, at least. As more of these data-dependent conditionals accumulate, the permutations of possible access addresses become unmanageable, and effectively it’s impossible to predict addresses for code like this without accessing the data itself. Since the problem we’re trying to solve is that we can’t access the underlying data efficiently without a cache, we end up having to rely on heuristics instead.

Machine learning operations are very different. The layers that take up the majority of the time for most models tend to be based on operations like convolutions, which can be expressed as matrix multiplies. Crucially, the memory access patterns don’t depend on the input data. There’s no ‘if (a[i] == 10) {...’ code in the inner loops of these kernels; they’re much simpler. The sizes of the inputs are also usually known ahead of time. These properties mean that we know exactly what data we need in fast memory for the entire execution of the layer ahead of time, with no dependencies on the values in that data. Each layer can often take hundreds of thousands of arithmetic operations to compute, and each value fetched has the potential to be used in multiple instructions, so making good use of the small amounts of fast memory available is crucial to reducing latency. What quickly becomes frustrating to any programmer trying to optimize these algorithms on conventional processors is that it’s very hard to transfer our complete knowledge of future access patterns into compiled code.
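As a minimal illustration, here’s the shape of the kernel I’m describing, a plain matrix multiply of the kind a convolution gets lowered to. Given the dimensions, the full sequence of loads and stores is known before the first input value arrives:

// Naive matrix multiply: every address touched is a pure function of the loop
// indices and the (already known) dimensions M, N, and K, and never of the
// values stored in A, B, or C.
void matmul_naive(int M, int N, int K,
                  const float *A,  // M x K, row-major
                  const float *B,  // K x N, row-major
                  float *C) {      // M x N, row-major
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      float acc = 0.0f;
      for (int k = 0; k < K; k++) {
        acc += A[i * K + k] * B[k * N + j];  // address depends only on i, j, k
      }
      C[i * N + j] = acc;
    }
  }
}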

The caches rely almost entirely on the heuristics that were designed for conventional usage patterns, so we essentially have to reverse-engineer those heuristics to persuade them to load the data we know we’ll need. There are some tools to help, like prefetch instructions and branch hints, but optimizing inner loops often feels like a struggle against a system that thinks it’s being helpful while actually getting in the way. Optimized matrix multiplication implementations usually require us to gather the needed data into tiles that are a good fit for the fast memory available, so we can do as much as possible with the values while they’re quickly accessible. Getting these tiles the right size, and ensuring they’re populated with the correct data before it’s needed, requires in-depth knowledge of the capacity, access latencies, and predictive algorithms of all levels of the cache hierarchy on a particular processor. An implementation that works well on one chip may produce drastically poorer performance on another in the same family if any of those characteristics change.
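Here’s a sketch of what that tiling looks like in practice. The tile size is just a placeholder that would need tuning for each chip, the code assumes the caller has zeroed C, and the prefetch call uses GCC/Clang’s __builtin_prefetch, which is only a hint to the very heuristics we’re trying to second-guess:

// Blocked matrix multiply: work on TILE x TILE sub-blocks so the pieces of A,
// B, and C in use at any moment fit together in the fast cache levels. TILE is
// a guess; the right value depends on the cache capacity, associativity, and
// prefetcher of the specific processor. Assumes C is zero-initialized.
#define TILE 64

void matmul_tiled(int M, int N, int K,
                  const float *A, const float *B, float *C) {
  for (int i0 = 0; i0 < M; i0 += TILE)
    for (int j0 = 0; j0 < N; j0 += TILE)
      for (int k0 = 0; k0 < K; k0 += TILE)
        for (int i = i0; i < i0 + TILE && i < M; i++) {
          // Hint that the next row's slice of A will be wanted soon. This is a
          // GCC/Clang extension, and it's only advisory: the cache may ignore it.
          if (i + 1 < M) __builtin_prefetch(&A[(i + 1) * K + k0]);
          for (int j = j0; j < j0 + TILE && j < N; j++) {
            float acc = C[i * N + j];
            for (int k = k0; k < k0 + TILE && k < K; k++)
              acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
          }
        }
}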

It would make more sense to expose the small, fast memories to the programmer directly, instead of relying on opaque heuristics to populate them. They could be made available as separate address spaces that can be explicitly preloaded ahead of time with data before it’s needed. We know what address ranges we’ll want to have and when, so give us a way to use this knowledge to provide perfect predictions to fill those areas of memory. Some embedded chips do offer this capability, known variously as tightly-coupled memory, or XY memory, and we do use this to improve performance for TensorFlow Lite Micro on platforms that support it.
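To show why that model is attractive, here’s a sketch of what layer execution could look like with an explicitly managed scratchpad. Everything here is invented for illustration: the TCM buffer is just a static array and dma_copy_to_tcm() is stubbed with memcpy so the example compiles on a host, whereas on a real part they would map to a linker-placed tightly-coupled memory region and a vendor-specific DMA engine:

#include <stddef.h>
#include <string.h>

// Hypothetical software-managed fast memory. On real hardware this would be a
// tightly-coupled memory region set up by the linker; here it's a plain array
// so the sketch builds anywhere.
#define TCM_BYTES (64 * 1024)
static float tcm[TCM_BYTES / sizeof(float)];

// Invented stand-in for a DMA transfer into the scratchpad.
static void dma_copy_to_tcm(float *dst_in_tcm, const float *src, size_t count) {
  memcpy(dst_in_tcm, src, count * sizeof(float));
}

// Walk a layer's weights tile by tile. Because the schedule is known up front,
// each tile is staged into fast memory just before the compute that needs it,
// with no reliance on cache heuristics to guess the access pattern.
void process_layer(const float *weights, size_t n_weights,
                   void (*compute_tile)(const float *tile, size_t count)) {
  const size_t tile = TCM_BYTES / sizeof(float);
  for (size_t off = 0; off < n_weights; off += tile) {
    size_t count = (n_weights - off < tile) ? (n_weights - off) : tile;
    dma_copy_to_tcm(tcm, weights + off, count);  // explicit, perfectly predicted load
    compute_tile(tcm, count);                    // all reads now hit fast memory
  }
}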

There are lots of challenges to making this available more widely though. Modern desktop and mobile apps don’t have the luxury of targeting a single hardware platform, and are expected to be able to run across a wide variety of different chips within the same processor family. It would be very difficult to write efficient code that works for all those combinations of cache size, speed, and prefetch heuristics. Software libraries from the processor manufacturers themselves (like CuDNN or Intel’s MKL) are usually the best answer right now, since they are written by engineers with detailed knowledge of the hardware systems and will be updated to handle new releases. These still have to work around the underlying challenges of a programming model that tries to hide the cache hierarchy though, and every engineer I’ve talked to who has worked on these inner loops wishes they had a better way to take advantage of their knowledge of memory access patterns.

This is also the kind of radical workload difference that has inspired a lot of new kinds of NPU hardware aimed specifically at deep learning. From my perspective, these have also been hard to work with, because while their programming models may work better for core operations like convolutions, models also require layers like non-max suppression that are only efficiently written as procedural code with data-dependent branches. Without the ability to run this kind of general-purpose code, accelerators lose many of their advantages because they have to keep passing off work to the main CPU, with a high latency cost (partly because this kind of handover usually involves flushing all caches to keep different memory areas in sync).
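For a sense of why these layers are awkward for accelerators, here’s a stripped-down greedy non-max suppression sketch (assuming the boxes are already sorted by descending score and the suppressed flags start out false). Which boxes survive, and how much work each iteration does, depends entirely on scores and overlaps computed at runtime, exactly the data-dependent control flow that matrix-oriented hardware struggles with:

#include <stdbool.h>
#include <stddef.h>

typedef struct { float x0, y0, x1, y1, score; } Box;

// Intersection-over-union of two axis-aligned boxes.
static float iou(const Box *a, const Box *b) {
  float ix0 = a->x0 > b->x0 ? a->x0 : b->x0;
  float iy0 = a->y0 > b->y0 ? a->y0 : b->y0;
  float ix1 = a->x1 < b->x1 ? a->x1 : b->x1;
  float iy1 = a->y1 < b->y1 ? a->y1 : b->y1;
  float iw = ix1 > ix0 ? ix1 - ix0 : 0.0f;
  float ih = iy1 > iy0 ? iy1 - iy0 : 0.0f;
  float inter = iw * ih;
  float area_a = (a->x1 - a->x0) * (a->y1 - a->y0);
  float area_b = (b->x1 - b->x0) * (b->y1 - b->y0);
  return inter / (area_a + area_b - inter + 1e-6f);
}

// Greedy NMS: keep the highest-scoring box, suppress overlapping ones, repeat.
// Expects boxes[] sorted by descending score and suppressed[] zero-initialized.
size_t nms(const Box *boxes, size_t n, float iou_threshold, bool *suppressed) {
  size_t kept = 0;
  for (size_t i = 0; i < n; i++) {
    if (suppressed[i]) continue;  // data-dependent branch: skip or keep
    kept++;
    for (size_t j = i + 1; j < n; j++) {
      // Which elements get written depends on values computed at runtime.
      if (!suppressed[j] && iou(&boxes[i], &boxes[j]) > iou_threshold)
        suppressed[j] = true;
    }
  }
  return kept;
}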

I don’t know what the ultimate solution will look like, but I’d imagine it will either involve system programmers being able to populate parts of caches using explicit prefetching, maybe even just supplying a set of address ranges as requirements and relying on the processor to sort it out, or something more extreme. One possible idea is making matrix multiplies first-class instructions at the machine code level, and having each processor implement the optimal strategy in microcode, in a similar way to how floating-point operations have migrated from accelerators, to co-processors, and now to the core CPU. Whatever the future holds, I hope this post at least helps explain why conventional predictive caches are so unhelpful when trying to optimize machine learning operations.