Doom, Dark Compute, and AI

Back in 2020 Foone Turing caused a sensation when she showed Doom running on a pregnancy test. For anyone who remembered desktop computers from the ’90s, it was amazing to see a disposable device run something that used to take thousands of dollars’ worth of hardware. It’s not a fluke either – calculators, ATMs, fridges, and even keychains can run the game. What this shows is how much computing power low-cost, everyday objects now have. If you’d told teenage me that I could buy a 50 cent chip as powerful as my PC, my imagination would have raced with all of the amazing things people could build.

So why does the world of embedded devices feel so boring? We have orders of magnitude more compute available than even a couple of decades ago, but no real killer apps have emerged, outside of mobile and wearables. The truth is that most compute is sitting idle most of the time. It’s like Dark Fibre after the Dot Com Bubble. In both cases it made engineering sense to add the extra capacity since the marginal cost was so low, even though the applications weren’t yet there to use it. Dark Fibre eventually gave us streaming, video calls, and the internet we know today. I think all of this Dark Compute in embedded devices will lead to a wave of innovation too, once product designers understand the possibilities.

How much Dark Compute is out there?

From Arm’s own data, there are around 100 billion (or 1e11) Arm Cortex-M chips out in the world. Even if we assume most of those are the cheapest M0 class running at 100MHz, that translates to roughly 100 million (or 1e8) integer arithmetic ops per second per CPU. This suggests that on the order of 1e19 integer ops per second could be executed if they were all working at full capacity. Though this is not comparing apples to apples, that puts embedded silicon within a few orders of magnitude of the FLOPs available through all the world’s active GPUs and TPUs. I’ll explain below why comparing float and integer operations is interesting, but the headline is that the embedded world contains a massive amount of computing power.

Estimating how much is actually used is harder, but the vast majority of current applications are for things like fans, appliances, or other devices that don’t need much more than simple control logic. They’re using these over-powered chips because once the price of a 32-bit MCU drops below fifty cents (or even ten cents!) it’s cheaper overall to buy a system that is easy to program and well supported, since the non-recurring engineering (NRE) costs start to dominate. My best guess is that ninety percent of the time these processors are left idle. That still leaves us in the 1e19 range for the total amount of Dark Compute.
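To make that arithmetic concrete, here’s a back-of-the-envelope version of the estimate in Python. The chip count, clock speed, and idle fraction are the rough assumptions from the paragraphs above, not measured figures.

```python
# Back-of-the-envelope estimate of idle ("dark") embedded compute.
# All inputs are rough assumptions from the discussion above.
cortex_m_chips = 100e9            # ~100 billion Cortex-M chips shipped (1e11)
ops_per_second_per_chip = 100e6   # M0-class core at 100MHz, ~1e8 integer ops/s
idle_fraction = 0.9               # guess: these processors sit idle ~90% of the time

total_ops_per_second = cortex_m_chips * ops_per_second_per_chip
dark_ops_per_second = total_ops_per_second * idle_fraction

print(f"Total embedded capacity: {total_ops_per_second:.1e} int ops/s")
print(f"Idle 'dark' compute:     {dark_ops_per_second:.1e} int ops/s")
# Prints roughly 1.0e+19 and 9.0e+18 respectively.
```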

What can we use Dark Compute for?

AI!

You might have guessed where I’m going from the title, but we have an amazing opportunity to turn all of this dead silicon into delightful experiences for users. It’s now possible to run speech recognition to offer voice interfaces on everyday devices, generate local closed captions and translations for accessibility, use person sensing so your TV can pause when you get up to make a cup of tea, play air drums, recognize gestures, brew coffee perfectly, or make a hundred other interface improvements, all using the same underlying machine learning technology. In many cases this doesn’t even need a hardware change, because the systems already have Dark Compute lying idle. Even better, the quality of the AI scales with the compute available, so as more modern chips are used, the capabilities of these interface features grow too. It also only needs 8-bit operations to execute, so the comparisons between FLOPs and integer ops in terms of computing capacity are valid.

There are plenty of challenges still to overcome, from battery usage limiting compute, to including the right sensors and making the tools easy enough to use, but I’m convinced we’re going to see a wave of incredible AI innovations once the engineering community figures out how to effectively use all this idle capacity. I’m working to make this happen with Useful Sensors, so please get in touch if you’re interested too, and I’ll be at CES next week if anyone’s around. Let’s move our compute away from the dark side!

Why I Love my Chevy Bolt EV

I got my driver’s license at 17, on the third attempt, but I never owned a car in the UK since I always biked or took public transport to work. When I was 25 I moved to Los Angeles, so I had to become a car owner for the first time. I wasn’t looking for anything fancy, so I bought an extremely used 1989 Honda Civic for $2,000, which I drove for years down the 405 on my 90-minute commute from Simi Valley to Santa Monica, before eventually upgrading to the cheapest new car I could find, a Ford Focus, once the Civic became impossible to fix. I drove that across to Colorado and back multiple times before moving to San Francisco and happily selling it. I would bike or take Muni to my startup in SoMa, and then once Jetpac was acquired by Google, I’d catch the company bus to Mountain View most days. I was far from car-free, since I used Joanne’s vehicle to get groceries or run other errands, but I was able to avoid driving for the bulk of my travel.

All of this is to say that I am definitely not a Car Guy. My brother is, and I admired the hard work he put into hand-restoring a Karmann Ghia, but cars have always left me cold. Growing up I also paid far too much attention to terrifying PSAs about the dangers of car crashes, so I’ve got a lasting fear of hurting someone while I’m driving. Controlling a ton of metal speeding at 70MPH using only our human senses still feels like a crazy idea, so I do my best to be very alert, drive defensively, and generally err on the side of caution. I’ve been the butt of “Driving Miss Daisy” jokes from friends, and I usually finish last at go-kart outings, but (knock on wood) I’ve still got a clean record thirty years after I first got my license.

That’s why I’m so surprised at how much I enjoy the Chevy Bolt I bought last year. After I left Google it made sense to keep our startup offices in Mountain View, since many of the team were Xooglers too, but with no Google Bus available I started to use Joanne’s car more, especially when I needed to get to Stanford as well. This became tricky because we needed transport during the day for the dogs, and while I tried Caltrain, it just took too long and was awkward for me to get to and from the nearest stations. I resigned myself to the inconvenience of owning a car again, and hoped I might at least find a plug-in hybrid, but when I started searching I was impressed at how many pure electric vehicles were available. I didn’t want to support Musk (I’m genuinely worried that he’s on a Tony Hsieh-esque tailspin) but even without Tesla there were a lot of options. After researching online, it became clear that the Chevy Bolt EV was a good fit for what I needed. It had over 200 miles of range, plenty for my 40-mile each-way commute, and had just gone through a major recall for battery fires, which as an engineer ironically reassured me, since I expected there would be a lot of scrutiny of all the safety systems after that! I wasn’t expecting to pick a Chevrolet, since I associate them more with trucks and cheap rental cars, but the price, features, reviews, and reliability ratings ended up convincing me.

I went in person to Stevens Creek Chevrolet to make the purchase. I’m not a fan of the US’s weird government-protected dealership setup, and the process included a bogus-but-obligatory $995 markup masquerading as a security device, but the staff were pleasant enough and after a couple of hours I was able to drive off the lot with a new Bolt EUV for around $35,000. One new experience was having to set up all the required online accounts; even as a tech professional I found this a bit daunting.

Since then I’ve driven almost 15,000 miles, and for the first time in my life I feel excited about owning a car. I actually like Chevrolet’s software, which is not perfect but functions surprisingly well. I do rely heavily on Android Auto though, so GM’s decision to ditch this for future models makes me nervous. I’d never even owned a car with a reversing camera, so that alone was a big upgrade. What really makes it shine are the safety features. Audio alerts for oncoming cars when I’m reversing, even before they’re visible, driver assist for emergency braking, visual alerts for blindspot lurkers, and a smart rearview mirror that reduces glare all help me be a better driver.

I also love home charging. So far I’ve only used the Bolt for commuting and similar trips in the Bay Area (we still have Joanne’s old ICE VW Golf for longer journeys), so I’ve not had to use a public charger. Once we had the Level 2 charger installed (free of charge, as part of the purchase) I was able to get to 100% in just a few hours, so I plug it in overnight and leave every morning with a full charge. I still have range anxiety about longer trips, especially since the biggest drawback of the Bolt is that it doesn’t offer very fast charging on Level 3 stations, but my schedule has prevented me from doing those anyway, so it hasn’t been an issue so far.

I have to admit that the Bolt is a lot of fun to drive too. You just press the pedal and it accelerates! This sounds simple, but even an automatic transmission gas car now feels clunky to me, after the smoothness and responsiveness of an electric motor. It steers nicely as well, though for some reason I have clipped the curb with my back tire a couple of times, which is a bigger problem than you might expect since there’s no spare tire and any changes require a visit to a dealership. The dogs also love curling up on the heated seats.

Electric vehicles aren’t enough by themselves to solve our climate and congestion issues (I would still love to have better public transport infrastructure), but they are part of the solution. Since I’m on SF’s 100% renewable energy electricity plan, I feel good about reducing my environmental impact as well as minimizing our reliance on the horrific foreign regimes that are propped up by oil exports. I’m also lucky that I have a garage I can use to recharge my vehicle, which wouldn’t have been possible in most of the places I’ve lived, and I’m glad that I could afford a new vehicle. Unfortunately you can’t buy the same model I purchased, since GM discontinued-then-recontinued the Bolt, but what’s most impressed me is that many mainstream brands now offer excellent electric-only cars. If you’re interested in an electric vehicle but aren’t sure you want to take the plunge, I hope this post will at least give you some reassurance that the technology is now pretty mature. Take it from me, I don’t easily get excited about a car, but my Bolt is one of the best purchases I’ve ever made!

Stanford’s HackLab Course

As many of you know, I’m an old geezer working on a CS PhD at Stanford, and part of that involves taking some classes. The requirements are involved, but this quarter I ended up taking “Hack Lab: Introduction to Cybersecurity”. I was initially attracted to it because it focuses on the legal as well as the technical side of security, knowledge which could have been useful earlier in my career. I also noticed it was taught by Alex Stamos and Riana Pfefferkorn, two academics with an amazing amount of experience between them, so I expected they’d have a lot to share.

I’ve just finished the final work for the course, and while it was challenging in surprising ways, I learned a lot, and had some fun too. I found the legal questions the hardest because of how much the answers depended on what seem like very subtle and arbitrary distinctions, like that between stored communications and those being transmitted. As an engineer I know how much storage is involved in any network and that even “at rest” data gets shuttled around behind the scenes, but what impressed me was how hard lawyers and judges have worked to match the practical rules with the intent of the lawmakers. Law isn’t code, it’s run by humans, not machines, which meant I had to put aside my pedantry about technological definitions to understand the history of interpretations. I still get confused between a warrant and a writ, but now I have a bit more empathy for the lawyers in my life at least.

The other side of the course introduced the tools and techniques around security and hacking through a series of practical workshops. I’ve never worked in this area, so a lot of the material was new to me, but it was so well presented that I never felt out of my depth. The team had set up example servers and captured sequences to demonstrate things like sniffing passwords from wifi, XSS attacks, and much more. I know from my own experience how tough it can be to produce these kinds of guided tutorials; you have to anticipate all the ways students can get confused and ensure there are guard rails in place, so I appreciate the work Alex, Riana, and the TAs put into them all. I was also impressed by some of the external teaching tools, like Security Shepherd, that were incorporated.

The course took a very broad view of cybersecurity, including cryptocurrency, which finally got me to download a wallet for one exercise, breaking my years of studiously ignoring the blockchain. I also now have Tor on my machine, and understand a bit more about how it all works in case I ever need it. The section on web fundamentals forced me to brush up on concepts like network layers in the OSI model, and gave me experience using Wireshark and Burp to understand network streams, which I may end up using next time I need to debug an undocumented REST API. The lectures were top notch too, with a lot of real world examples from Alex and Riana’s lives outside Stanford that brought depth to the material. There was a lot of audience involvement, and my proudest moment was being able to answer what MtGox originally stood for (Magic: The Gathering Online eXchange).

If you ever get the chance to take INTLPOL 268 (as it’s officially known) I’d highly recommend it. A lot of the students were from the law school, and the technical exercises are well designed to be doable without previous experience of the field, so it’s suitable for people from a wide range of backgrounds. It covers an area that often falls into the gaps between existing academic disciplines, but is crucial to understand whether you’re designing a computer system or planning policy. Thanks to the whole team for a fantastic learning experience, but especially my lab TA Danny Zhang for his patience as I attempted to tackle legal questions with an engineering mindset.

Little Googles Everywhere

Terrifying fridge with human teeth, via DALL·E.

Imagine asking a box on a pillar at Home Depot “Where are the nails?” and getting directions, your fridge responding with helpful advice when you say “Why is the ice maker broken?”, or your car answering “How do I change the wiper speed?”. I think of these kinds of voice assistants for everyday objects as “Little Googles”, agents that are great at answering questions, but only in a very specific domain. I want them in my life, but they don’t yet exist. If they’re as useful as I think, why aren’t they already here, and why is now the right time for them to succeed?

What are “Little Googles”?

I’m a strong believer in Computers as Social Actors, the idea that people want to interact with new technology as if it was another person. With that in mind, I always aim to make user experiences as close to existing interactions as possible to increase the likelihood of adoption. If you think about everyday life, we often get information we need from a short conversation with someone else, whether it’s a clerk at Home Depot, or your spouse who knows the car controls better than you do. I believe that speech to text and LLMs are now sufficiently advanced to allow a computer to answer 80% of these kinds of informational queries, all through a voice interface.

The reason we ask people these kinds of questions rather than Googling on our phones is that the other person has a lot of context and specialized knowledge that isn’t present in a search engine. The clerk knows which store you’re in, and how items are organized. Your spouse knows what car you’re driving, and has learned the controls themselves. It’s just quicker and easier to ask somebody right now! The idea of “Little Googles” is that we can build devices that offer the same convenience as a human conversation, even when there’s nobody else nearby.
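To make the idea less abstract, here’s a toy sketch of the loop a “Little Google” runs: turn audio into text, combine it with the narrow domain context the device already has, and answer. The transcribe() and generate() functions are hypothetical stand-ins (a real build would use an on-device speech-to-text model and a small LLM), and the aisle data is made up.

```python
# Toy sketch of a domain-specific voice assistant ("Little Google").
# transcribe() and generate() are stubs so the sketch runs as-is; in a real
# system they would be an on-device speech-to-text model and a local LLM.

AISLES = {"nails": 7, "screws": 7, "paint": 12, "brushes": 12, "drills": 15}

def transcribe(audio_samples) -> str:
    # Stand-in for speech-to-text (e.g. a small Whisper-class model).
    return "Where are the nails?"

def generate(question: str) -> str:
    # Stand-in for an LLM grounded in store-specific context: here, a lookup.
    for item, aisle in AISLES.items():
        if item in question.lower():
            return f"You'll find {item} in aisle {aisle}."
    return "Sorry, let me find someone who can help."

if __name__ == "__main__":
    question = transcribe(audio_samples=None)
    print(generate(question))   # -> "You'll find nails in aisle 7."
```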

Why don’t they exist already?

If this is such a good idea, why hasn’t anyone built these? There are a couple of big reasons, one technical and the other financial. The first is that it used to take hundreds of engineers years to build a reliable speech to text service. Apple paid $200m to buy Siri in 2010, Alexa reportedly lost $10b in 2022, and I know from my own experience that Google’s speech team was large, busy, and deservedly well-paid. This meant that the technology to offer a voice interface was only available to a few large companies, and they reserved it for their own products, or other use cases that drove traffic directly to them. Speech to text was only available if it served those companies’ purposes, which meant that other potential customers like auto manufacturers or retail stores couldn’t use it.

The big financial problem came from the requirement for servers. If you’re a fridge manufacturer you only get paid once, when a consumer buys the appliance. That fridge might have a useful lifetime of over a decade, so if you offered a voice interface you’d need to pay for servers to process incoming audio for years to come. Because most everyday objects aren’t supported by subscriptions (despite BMW’s best efforts) the money to keep those servers running for an indeterminate amount of time has to come from the initial purchase. The ongoing costs associated with voice interfaces have been enough to deter almost anyone who isn’t making immediate revenue from their use. 

Having to be connected also meant that the audio was sent to someone else’s data center, with all the privacy issues involved, and required wifi availability, which is an ongoing maintenance cost in any commercial environment and such a pain for consumers to set up that less than half of “smart” appliances are ever connected.

Why is now the right time?

OpenAI’s release of Whisper changed everything for voice interfaces. Suddenly anyone could download a speech to text model that performs well enough for most use cases, and use it commercially with few strings attached. It shattered the voice interface monopoly of the big tech companies, removing the technical barrier.

The financial change was a bit more subtle. These models have become small enough to fit in 40 megabytes and run on a $50 SoC. This means it’s starting to be possible to run speech to text on the kinds of chips already found in many cars and appliances, with no server or internet connection required. This removes the ongoing costs from the equation, now running a voice interface is just something that needs to be part of the on-device compute budget, a one-time, non-recurring expense for the manufacturer.

Moving the voice interface code to the edge also removes the usability problems and costs of requiring a network connection. You can imagine a Home Depot product finder being a battery-powered box that is literally glued to a pillar in the store. You’d just need somebody to periodically change the batteries and plug in a new SD card as items are moved around. The fridge use case is even easier, you’d ship the equivalent of the user manual with the appliance and never update it (since the paper manual doesn’t get any).

Nice idea, but where’s the money?

Voice interfaces have often seemed like a solution looking for a problem (see Alexa’s $10b burn rate). What’s different now is that I’m talking to customers with use cases that they believe will make them money immediately. Selling appliance warranties is a big business, but call centers, truck rolls for repairs, and returns can easily wipe out any profit. A technology that can be shown to reduce all three would save a lot of money in a very direct way, so there’s been strong interest in the kinds of “Talking user manuals” we’re offering at Useful. Helping customers find what they need in a store is another obvious moneymaker, since a good implementation will increase sales and consumer satisfaction, so that’s been popular too.

What’s next?

It’s Steam Engine Time for this kind of technology. There are still a lot of details to be sorted out, but it feels so obvious that this is now possible, that it would be a pleasant addition* to most people’s lives, and that it promises profit, that I can’t imagine something like it won’t happen. I’ll be busy with the team at Useful trying to build some of the initial implementations and prove that it isn’t such a crazy idea, so I’d love to hear from you if this is something that resonates. I’d also like to see other implementations of similar ideas, since I know I can’t be the only one seeing these trends.

(*) Terrifying AI-generated images of fridges with teeth notwithstanding.

Stanford CS PhD Course Choices for Winter 2024

As you might know, I’m working on my PhD at Stanford, and one of my favorite parts is taking courses. For this second year I need to follow the new foundation and breadth requirements, which in practice means taking a course a quarter, with each course chosen from one of four areas. For the fall quarter I took Riana Pfefferkorn and Alex Stamos’ Hacklab: Introduction to Cybersecurity, which I thoroughly enjoyed and learned a lot from, especially about the legal side. I’m especially thankful to Danny Zhang, my excellent lab TA, who had a lot of patience as I struggled with the difference between a search warrant and civil sanctions!

That satisfied the “People and Society” section of the requirements, but means I need to pick a course from one of the other three sections for the upcoming winter quarter. You might think this would be simple, but as any Googler can tell you, the more technically advanced the organization the worse the internal search tools are. The requirements page just has a bare list of course numbers, with no descriptions or links, and the enrollment tool is so basic that you have to put a space between the letters and the numbers of the course ID (“CS 243” instead of “CS243”) before it can find them, so it’s not even just a case of copying and pasting. To add to the complexity, many of the courses aren’t offered this coming quarter, so just figuring out what my viable options are was hard. I thought about writing a script to scrape the results, given a set of course numbers, but decided to do it manually in the end.

This will be a *very* niche post, but since there are around 100 other second year Stanford CS PhD students facing the same search problem, I thought I’d post my notes here in case they’ll be helpful. I make no guarantees about the accuracy of these results, I may well have fat-fingered some search terms, but let me know if you spot a mistake. I’ve indexed all 2xx and 3xx level courses in the first three breadth sections (since I already had the fourth covered), and I didn’t check 1xx because they tend to be more foundational. For what it’s worth, I’m most excited about CS 224N – Natural Language Processing with Deep Learning, and hope I can get signed up once enrollment opens.

2xx/3xx Breadth Courses Available Winter 2024

2xx/3xx Courses Not Offered for Winter 2024

  • CS 221
  • CS 227
  • CS 230
  • CS 231
  • CS 234
  • CS 236
  • CS 240
  • CS 242
  • CS 244
  • CS 245
  • CS 250
  • CS 251
  • CS 257
  • CS 258
  • CS 259
  • CS 261
  • CS 263
  • CS 265
  • CS 269
  • CS 271
  • CS 272
  • CS 273
  • CS 274
  • CS 279
  • CS 281
  • CS 316
  • CS 324
  • CS 326
  • CS 328
  • CS 329D
  • CS 329H
  • CS 329X
  • CS 330
  • CS 331
  • CS 332
  • CS 333
  • CS 334
  • CS 354
  • CS 355
  • CS 356
  • CS 358
  • CS 359
  • CS 371
  • CS 373
  • EE 282

Why We’re Building an Open-Source Universal Translator

We all grew up with TV shows, books, and movies that assume everybody can understand each other when they speak, even if they’re aliens. There are various in-universe explanations for this convenient feature, and most of them involve a technological solution. Today, the Google Translate app is the closest thing we have to this kind of universal translator, but the experience isn’t good enough to be used everywhere it could be useful. I’ve often found myself bending over a phone with someone, both of us staring at the screen to see the text, and switching back and forth between email or another app to share information.

Science fiction translators are effortless. You can walk up to someone and talk normally, and they understand what you’re saying as soon as you speak. There’s no setup, no latency, it’s just like any other conversation. So how can we get there from here?

One of the most common answers is a wearable earpiece. This is in line with The Hitchhiker’s Guide to the Galaxy’s Babel Fish, but there are still massive technical obstacles to fitting the required processing into such a small device; even offloading compute to a phone would require a lot of radio and battery usage. These barriers mean we’ll have to wait for hardware innovations like Syntiant’s to go through a few more generations before we can create this sort of dream device.

Instead, we’re building a small, unconnected box with a built-in display that can automatically translate between dozens of different languages. You can see it in the video above, and we’ve got working demos to share if you’re interested in trying it out. The form factor means it can be left in place on a hotel front desk, brought to a meeting, placed in front of a TV, or set up anywhere you need continuous translation. The people we’ve shown this to have already asked to take units home for visiting relatives, for colleagues, or for themselves when traveling.

You can get audio out using a speaker or earpiece, but you’ll also see real-time closed captions of the conversation in the language of your choice. The display means it’s easy to talk naturally, with eye contact, and be aware of the whole context. Because it’s a single-purpose device, it’s less complicated to use than a phone app, and it doesn’t need a network connection, so there are no accounts or setup involved, it starts working as soon as you plug it in.

We’re not the only ones heading in this direction, but what makes us different is that we’ve removed the need for any network access, and partly because of that we’re able to run with much lower latency, using the latest AI techniques to still achieve high accuracy.

We’re also big believers in open source, so we’re building on top of work like Meta’s NLLB and OpenAI’s Whisper, and will be releasing the results under an open license. I strongly believe that language translation should be commoditized, turning it into a common resource that a lot of stakeholders can contribute to, so I hope this will be a step in that direction. This is especially important for low-resource languages, where giving communities the opportunity to be involved in digital preservation is vital for those languages’ survival. Tech companies don’t have a big profit incentive to support translation, so advancing the technology will have to rely on other groups for support.
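As a rough illustration of the kind of open-source pipeline we’re building on, here’s a sketch that chains Whisper transcription into NLLB translation using the openai-whisper and Hugging Face transformers packages. The model sizes, language codes, and audio file are example choices; the device itself runs heavily optimized versions of these models rather than this Python.

```python
# Sketch of an open-source transcribe-then-translate pipeline, chaining
# OpenAI's Whisper (openai-whisper package) into Meta's NLLB (transformers).
# Assumes `pip install openai-whisper transformers torch` and a local clip.wav.
import whisper
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# 1. Speech to text with a small Whisper model.
stt = whisper.load_model("tiny")
english_text = stt.transcribe("clip.wav")["text"]

# 2. Text-to-text translation with NLLB (English -> Spanish here).
name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
nllb = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer(english_text, return_tensors="pt")
output_ids = nllb.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("spa_Latn"),
    max_length=200,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```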

I’m also hoping that making translation widely available will lead manufacturers to include it in devices like TVs, kiosks, ticket machines, help desks, phone support systems, and any other products that could benefit from wider language support. It’s not likely to replace the work of human translators (as anyone who’s read auto-translated documentation can attest), but I do think that AI can have a big impact here, bringing people closer together.

If this sounds interesting, please consider supporting our crowdfunding campaign. One of the challenges we’re facing is showing that there’s real demand for something like this, so even subscribing to follow the updates helps demonstrate that it’s something people want. If you have a commercial use case I’d love to hear from you too, we have a limited number of demo units we are loaning to the most compelling applications.

The Unstoppable Rise of Disposable ML Frameworks

Photo by Steve Harwood

On Friday my long-time colleague Nat asked if we should try to expand our Useful Transformers library into something that could be suitable for a lot more use cases. We worked together on TensorFlow, as did the main author of UT, Manjunath, so he was surprised when I didn’t want to head too far in a generic direction. As I was discussing it with him I realized how much my perspective on ML library design has changed since we started TensorFlow, and since I think by writing, I wanted to get my thoughts down in this post.

The GGML framework is just over a year old, but it has already changed the whole landscape of machine learning. Before GGML, an engineer wanting to run an existing ML model would start with a general purpose framework like PyTorch, find a data file containing the model architecture and weights, and then figure out the right sequence of calls to load and execute it. Today it’s much more likely that they will pick a model-specific code library like whisper.cpp or llama.cpp, based on GGML.

This isn’t the whole story though, because there are also popular model-specific libraries like llama2.c that don’t use GGML, so this movement clearly isn’t based on the qualities of just one framework. The best term I’ve been able to come up with to describe these libraries is “disposable”. I know that might sound derogatory, but I don’t mean it like that; I actually think it’s the key to all their virtues! They’ve limited their scope to just a few models, focus on inference or fine-tuning rather than training from scratch, and overall try to do a few things very well. They’re not designed to last forever; as models change they’re likely to be replaced by newer versions, but they’re very good at what they do.

By contrast, traditional frameworks like PyTorch or TensorFlow try to do many different things for a lot of different audiences. They are designed to be toolkits that can be reused for almost any possible model, for full training as well as deployment in production, scaling from laptops (or even in TF’s case microcontrollers) to distributed clusters of hundreds of GPUs or TPUs. The idea is that you learn the fundamentals of the API, and then you can reuse that knowledge for years in many different circumstances.

What I’ve seen firsthand with TensorFlow is how coping with such a wide range of requirements forces its code to become very complex and hard to understand. The hope is always that the implementation details can be hidden behind an interface, so that people can use the system without becoming aware of the underlying complexity. In practice this is impossible to achieve, because latency and throughput are so important. The only reason to use ML frameworks instead of a NumPy Python script is to take advantage of hardware acceleration, since training and inference time need to be minimized for many projects to be achievable. If a model takes years to train, it’s effectively untrainable. If a chatbot response takes days, why bother?

But details leak out from the abstraction layer as soon as an engineer needs to care about speed. Do all of my layers fit on a TPU? Am I using more memory than I have available on my GPU? Is there a layer in the middle of my network that’s only implemented as a CPU operation, and so is causing massive latencies as data is copied to and from the accelerator? This is where the underlying complexity of the system comes back to bite us. There are so many levels of indirection involved that building a mental model of what code is executing and where is not practical. You can’t even easily step through code in a debugger or analyze it using a profiler, because much of it executes asynchronously on an accelerator, goes through multiple compilation steps before running on a regular processor, or is dispatched to platform-specific libraries that may not even have source code available. This opaqueness makes it extremely hard for anyone outside of the core framework team to even identify performance problems, let alone propose fixes. Because every code path is used by so many different models and use cases, just verifying that any change doesn’t cause a regression is a massive job.
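As a concrete example of trying to see through those layers, here’s a sketch of how you might use PyTorch’s built-in profiler to hunt for operators that quietly run on the CPU in the middle of a GPU model. The toy model is a stand-in, and the printed table is about as close as the big frameworks let you get to “what code is executing, and where”.

```python
# Minimal sketch: using torch.profiler to look for ops that run on the CPU
# (and force device copies) in the middle of a GPU model. Stand-in model only.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device)
x = torch.randn(32, 1024, device=device)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    with torch.no_grad():
        model(x)

# Operators with large CPU time but no CUDA time, or lots of memcpy entries,
# are candidates for accelerator fallbacks and hidden host/device copies.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```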

By contrast, debugging and profiling issues with disposable frameworks is delightfully simple. There’s a single big program that you can inspect to understand the overall flow, and then debug and profile using very standard tools. If you spot an issue, you can find and change the code easily yourself, and either keep it on your local copy or create a pull request after checking the limited number of use cases the framework supports.

Another big pain point for “big” frameworks is installation and dependency management. I was responsible for creating and maintaining the Raspberry Pi port of TensorFlow for a couple of years, and it was one of the hardest engineering jobs I’ve had in my career. It was so painful I eventually gave up, and nobody else was willing to take it on! Because TF supported so many different operations, platforms, and libraries, porting it and keeping it building on a non-x86 platform was a nightmare. There were constantly new layers and operations being added, many of which in turn relied on third-party code that also had to be ported. I groaned when I saw a new dependency appear in the build files, usually for something like an Amazon AWS input authentication pip package that didn’t add much value for Pi users, but still required me to figure out how to install it on a platform that was often unsupported by the authors.

The beauty of single-purpose frameworks is that they can include all of the dependencies they need, right in the source code. This makes them a dream to install, often only requiring a checkout and build, and makes porting them to different platforms much simpler.

This is not a new problem, and during my career at Google I saw a lot of domain or model-specific libraries emerge internally as alternatives to using TensorFlow. These were often enthusiastically adopted by application engineers, because they were so much easier to work with. There was often a lot of tension about this with the infrastructure team, because while this approach helped ship products, there were fears about the future maintenance cost of supporting many different libraries. For example, adding support for new accelerators like TPUs would be much harder if it had to be done for a multitude of internal libraries rather than just one, and it increased the cost of switching to new models.

Despite these valid concerns, I think disposable frameworks will only grow in importance. More people are starting to care about inference rather than training, and a handful of foundation models are beginning to dominate applications, so the value of using a framework that can handle anything but is great at nothing is shrinking.

One reason I’m so sure is that we’ve seen this movie before. I spent the first few years of my career working in games, writing rendering engines in the PlayStation 1 era. The industry standard was for every team to write their own renderer for each game, maybe copying and pasting some code from other titles, but otherwise with little reuse. This made sense because the performance constraints were so tight. With only two megabytes of memory on a PS1 and a slow processor, every byte and cycle counted, so spending a lot of time jettisoning anything unnecessary and hand-optimizing the functions that mattered was a good use of programming time. Every large studio had the same worries about maintaining such a large number of engines across all their games, and every few years they’d task an internal group with building a more generic renderer that could be reused by multiple titles. Inevitably these efforts failed. It was faster and more effective for engineers to write something specialized from scratch than it was to whittle down and modify a generic framework to do what they needed.

Eventually a couple of large frameworks like Unity and Unreal came to dominate the industry, but it’s still not unheard of for developers to write their own, and even getting this far took decades. ML frameworks face the same challenges as game engines did in the ’90s, with application developers given tight performance and memory constraints that are hard to hit using generic tools. If the past is any guide, we’ll see repeated attempts to promote unified frameworks while real-world developers rely on less-generic but simpler libraries.

Of course it’s not a totally binary choice. For example, we’re still planning on expanding Useful Transformers to support the LLM and translation models we’re using for our AI in a Box, so we’ll have some genericity, but the mid-2010’s vision of “One framework to rule them all” is dead. It might be that PyTorch (which has clearly won the research market) becomes more like MATLAB, a place to prototype and create algorithms, which are then hand-converted to customized inference frameworks by experienced engineers rather than automated tools or compilers.

What makes me happiest is that the movement to disposable frameworks is clearly opening up the world of ML development to many more people. By removing the layers of indirection and dependencies, the underlying simplicity of machine learning becomes a lot clearer, and hopefully less intimidating. I can’t wait to see all of the amazing products this democratization of the technology produces!

Request for Sensors

At Useful Sensors we’re focused on building intelligent sensors, ones that use machine learning to take raw data and turn it into actionable insights. Sometimes I run across problems in my own life that don’t need advanced algorithms or AI to solve, but are blocked by hardware limitations. A classic one is “Did I leave my garage door open?”. A few months ago I even had to post to our street’s mailing list to ask someone to check it while I was away, since I was anxious I’d left it open. Thankfully several of my great neighbors jumped in and confirmed it was closed, but relying on their patience isn’t a long term or scalable solution.

Sensors to help with this do exist, and have for a long time, so why are they still a niche product? For me, the holdbacks are difficult setup procedures and short battery lives. The tradeoff generally seems to be that to get a battery life measured in years, you need to use a specialized protocol like ZigBee, Thread, or Matter, which requires a hub, which adds to the setup time and the likelihood of having to troubleshoot issues. Wifi-enabled sensors like the Swann linked to above don’t specify a battery life (the support team refuses to give an estimate further down on the page), but I’ve found similar devices last months, not years. What I would love is a cell-data-connected sensor with zero accounts, apps, or setup, beyond maybe scanning a QR code to claim it. One of the reasons I’m a big fan of Blues is that their fixed-cost cell package could make a device like this possible, but I’m guessing it would still need to be comparatively large for the hardware and battery required, and comparatively costly too.

What all of the current solutions have in common is that they demand more of my time than I’m willing to give. I have plenty of frustrating issues to debug in my technical work, the last thing I want to do when I get home is deal with poorly documented setup workflows or change batteries more than once in a blue moon. I’m guessing that I’m not alone, every product I’ve seen that truly “just works” has had an order of magnitude more sales than a competitor that has even a bit of friction.

I would happily pay a lot for a device that I could stick on a garage door, scan a code on my phone that took me to a web URL where I could claim it (no more terrible phone apps, please) and then simply sent me a text if it was open for more than ten minutes. My sense is that the problems that need to be solved are around power consumption, radio, and cost. These aren’t areas I have expertise in, so I won’t be attempting this challenge, but I hope someone out there will, and soon.

A similar application is medication detection. I’m old enough to have my own pill organizer (don’t laugh too loud, it’s coming for you eventually) but an accelerometer attached to a pill bottle could tell if I’ve picked it up, and so presumably taken a dose, on time, and I’d never again have to measure out my tablets into little plastic slots. Devices like these do exist, but the setup, cost, and power consumption challenges are even higher, so they’re restricted to specialized use cases like clinical trials.

It feels like we’ve been on the verge of being able to build products like this for decades, but so many systems need to work smoothly to make the experience seamless that nothing has taken off. I really hope that the stars will align soon and I’ll be able to remove one or two little anxieties from my life!

A Personal History of ML Quantization

Tomorrow I’ll be giving a remote talk at the LBQNN workshop at ICCV. The topic is the history of quantization in machine learning, and while I don’t feel qualified to give an authoritative account, I did think it might be interesting to cover the developments I was aware of.

I don’t know if the talk will be recorded, but here are the slides in case they are useful for reference. Apologies for any mistakes, please do let me know so I can improve the presentation.
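The slides cover this in much more detail, but for anyone who hasn’t met quantization before, the core trick is small enough to show inline: store values as 8-bit integers plus a scale (and optionally a zero point), and convert back to real numbers when you need them. This is just the generic affine scheme, not any particular framework’s implementation.

```python
# Minimal affine (scale + zero-point) 8-bit quantization, the generic scheme
# most ML frameworks build on. Not any particular library's implementation.
import numpy as np

def quantize(x: np.ndarray):
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize(weights)
error = np.abs(weights - dequantize(q, scale, zp)).max()
print(f"max reconstruction error: {error:.4f}")  # roughly scale / 2
```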

Why Nvidia’s AI Supremacy is Only Temporary

Nvidia is an amazing company that has executed a contrarian vision for decades, and it has rightly become one of the most valuable corporations on the planet thanks to its central role in the AI revolution. I want to explain why I believe its top spot in machine learning is far from secure over the next few years. To do that, I’m going to talk about some of the drivers behind Nvidia’s current dominance, and then how they will change in the future.

Currently

Here’s why I think Nvidia is winning so hard right now.

#1 – Almost Nobody is Running Large ML Apps

Outside of a few large tech companies, very few corporations have advanced to actually running large scale AI models in production. They’re still figuring out how to get started with these new capabilities, so the main costs are around dataset collection, hardware for training, and salaries for model authors. This means that machine learning is focused on training, not inference.

#2 – All Nvidia Alternatives Suck

If you’re a developer creating or using ML models, using an Nvidia GPU is a lot easier and less time consuming than an AMD OpenCL card, Google TPU, a Cerebras system, or any other hardware. The software stack is much more mature, there are many more examples, documentation, and other resources, finding engineers experienced with Nvidia is much easier, and integration with all of the major frameworks is better. There is no realistic way for a competitor to beat the platform effect Nvidia has built. It makes sense for the current market to be winner-takes-all, and they’re the winner, full stop.

#3 – Researchers have the Purchasing Power

It’s incredibly hard to hire ML researchers; anyone with experience has their pick of job offers right now. That means they need to be kept happy, and one of the things they demand is use of the Nvidia platform. It’s what they know and they’re productive with it; picking up an alternative would take time and wouldn’t result in skills the job market values, whereas working on models with the tools they’re comfortable with does. Because researchers are so expensive to hire and retain, their preferences are given a very high priority when purchasing hardware.

#4 – Training Latency Rules

As a rule of thumb, models need to be trainable from scratch in about a week. I’ve seen this hold true since the early days of AlexNet, because if the iteration cycle gets any longer it’s very hard to do the empirical testing and prototyping that’s still essential to reach your accuracy goals. As hardware gets faster, people build bigger models up until the point that training once again takes roughly the same amount of time, and they reap the benefits through higher-quality models rather than reduced total training time. This makes buying the latest Nvidia GPUs very attractive, since your existing code will mostly just work, but faster. In theory there’s an opportunity here for competitors to win with lower latency, but the inevitably poor state of their software stacks (CUDA has had nearly two decades of investment) means it’s mostly an illusion.

What’s going to change?

So, hopefully I’ve made a convincing case that there are strong structural reasons behind Nvidia’s success. Here’s how I see those conditions changing over the next few years.

#1 – Inference will Dominate, not Training

Somebody years ago told me “Training costs scale with the number of researchers, inference costs scale with the number of users”. What I took away from this is that there’s some point in the future where the amount of compute any company is using for running models on user requests will exceed the cycles they’re spending on training. Even if the cost of a single training run is massive and running inference is cheap, there are so many potential users in the world with so many different applications that the accumulated total of those inferences will exceed the training total. There are only ever going to be so many researchers.
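To see why that crossover is hard to avoid, it helps to put numbers on the quote. Everything in this sketch is made up purely to show the shape of the argument, not any real company’s figures.

```python
# Illustrative (made-up) numbers showing how inference compute overtakes
# training compute as a product gains users. Not real measurements.
training_runs_per_year = 200          # experiments by a research team
flops_per_training_run = 1e21         # a large-ish training run

users = 50_000_000
queries_per_user_per_day = 10
flops_per_query = 1e12                # one forward pass of a modest model

training_total = training_runs_per_year * flops_per_training_run
inference_total = users * queries_per_user_per_day * flops_per_query * 365

print(f"training:  {training_total:.1e} FLOPs/year")   # 2.0e+23
print(f"inference: {inference_total:.1e} FLOPs/year")  # ~1.8e+23
# Double the user count (or the model size) and inference dominates, while
# training stays pinned to the size of the research team.
```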

What this means for hardware is that priorities will shift towards reducing inference costs. A lot of ML researchers see inference as a subset of training, but this is wrong in some fundamental ways. It’s often very hard to assemble a sizable batch of inputs during inference, because that process trades off latency against throughput, and latency is almost always key in user-facing applications. Small or single-input batches change the workload dramatically, and call for very different optimization approaches. There are also a lot of things (like the weights) that remain constant during inference, and so can benefit from pre-processing techniques like weight compression or constant folding.
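As a tiny illustration of why batch size matters so much, here’s a sketch comparing one batched forward pass with the same work split into single-input requests. The model is a stand-in and the exact ratio depends on your hardware, but the gap it shows is what inference-focused optimizations spend their time fighting.

```python
# Toy benchmark: the same model run as one batch of 32 versus 32 single
# requests, illustrating why latency-bound serving behaves differently from
# throughput-oriented training workloads. CPU-only, stand-in model.
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).eval()
x = torch.randn(32, 1024)

with torch.no_grad():
    start = time.perf_counter()
    model(x)                        # one batch of 32
    batched = time.perf_counter() - start

    start = time.perf_counter()
    for i in range(32):             # 32 separate "user requests"
        model(x[i:i + 1])
    unbatched = time.perf_counter() - start

print(f"batch of 32:       {batched * 1000:.1f} ms")
print(f"32 single queries: {unbatched * 1000:.1f} ms")
# The single-query path is usually noticeably slower per example, which is
# one reason inference calls for different optimizations than training.
```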

#2 – CPUs are Competitive for Inference

I didn’t even list CPUs in the Nvidia alternatives above because they’re still laughably slow for training. The main desktop CPUs (x86, Arm, and maybe RISC-V soon) have the benefit of many decades of toolchain investment. They have an even more mature set of development tools and community than Nvidia. They can also be much cheaper per arithmetic op than any GPU.

Old-timers will remember the early days of the internet when most of the cost of setting up a dot-com was millions of dollars for a bunch of high-end web server hardware from someone like Sun. This was because they were the only realistic platform that could serve web pages reliably and with low-latency. They had the fastest hardware money could buy, and that was important when entire sites needed to fit on a single machine. Sun’s market share was rapidly eaten by the introduction of software that could distribute the work across a large number of individually much less capable machines, commodity x86 boxes that were far cheaper.

Training is currently very hard to distribute in a similar way. The workloads make it possible to split work across a few GPUs that are tightly interconnected, but the pattern of continuous updates makes reducing latency by sharding across low-end CPUs unrealistic. This is not true for inference though. The model weights are fixed and can easily be duplicated across a lot of machines at initialization time, so no communication is needed. This makes an army of commodity PCs very appealing for applications relying on ML inference.

#3 – Deployment Engineers gain Power

As inference costs begin to dominate training, there will be a lot of pressure to reduce those costs. Researchers will no longer be the highest priority, so their preferences will carry less weight. They will be asked to do things that are less personally exciting in order to streamline production. There are also going to be a lot more people capable of training models coming into the workforce over the next few years, as the skills involved become more widely understood. This all means researchers’ corporate power will shrink and the needs of the deployment team will be given higher priority.

#4 – Application Costs Rule

When inference dominates the overall AI budget, the hardware and workload requirements are very different. Researchers value the ability to experiment quickly, so they need flexibility to prototype new ideas. Applications usually change their models comparatively infrequently, and may use the same fundamental architecture for years once the researchers have come up with something that meets their needs. We may almost be heading towards a world where model authors use a specialized tool, much as MATLAB is used for mathematical algorithms, and then hand over the results to deployment engineers who manually convert them into something more efficient for an application. This will make sense because any cost savings will be multiplied over a long period of time if the model architecture remains constant (even if the weights change).

What does this Mean for the Future?

If you believe my four predictions above, then it’s hard to escape the conclusion that Nvidia’s share of the overall AI market is going to drop. That market is going to grow massively so I wouldn’t be surprised if they continue to grow in absolute unit numbers, but I can’t see how their current margins will be sustainable.

I expect the winners of this shift will be traditional CPU platforms like x86 and Arm. Inference will need to be tightly integrated into traditional business logic to run end user applications, so it’s difficult to see how even hardware specialized for inference can live across a bus, with the latency involved. Instead I expect CPUs to gain much more tightly integrated machine learning support, first as co-processors and eventually as specialized instructions, like the evolution of floating point support.

On a personal level, these beliefs drive my own research and startup focus. The impact of improving inference is going to be so high over the next few years, and it still feels neglected compared to training. There are signs that this is changing though. Communities like r/LocalLlama are mostly focused on improving inference, the success of GGML shows how much of an appetite there is for inference-focused frameworks, and the spread of a few general-purpose models increases the payoff of inference optimizations. One reason I’m so obsessed with the edge is that it’s the closest environment to the army of commodity PCs that I think will run most cloud AI in the future. Even back in 2013 I originally wrote the Jetpac SDK to accelerate computer vision on a cluster of 100 m1.small AWS servers, since that was cheaper and faster than a GPU instance for running inference across millions of images. It was only afterwards that I realized what a good fit it was for mobile devices.

I’d love to hear your thoughts on whether inference is going to be as important as I’m predicting! Let me know in the comments if you think I’m onto something, or if I should be stocking up on Nvidia stock.