During the pandemic travel lockdown I’ve ended up accumulating a lot of vacation time, so I decided to take a lot of December off. I did spend some time relaxing, especially walking our adorablenew dogs, but there were some coding itches I wanted to scratch. One of the biggest was building a simple system for prototyping voice interfaces on an embedded device like a Raspberry Pi, all running locally. I’ve been following the Coqui.ai team’s work since they launched, and was very impressed by the quality of the open source speech models and code they have produced. I didn’t have an easy way to run them myself though, especially on live microphone input. With that in mind, I decided my holiday project would be writing a command line tool using Coqui’s speech to text library. To keep it as straightforward as possible I modeled it on the classic Unix cat command, where the default would be to read audio from a microphone and output text (though it ended up expanding to system audio and files too) so I called it spchcat. You can now download it yourself for Pi’s and x86 Linux from speechcat.org!
As usual, the scope kept expanding beyond my original idea. Coqui have collaborated with groups like ITML to collect models for over 40 languages, including some that are endangered, so I couldn’t resist supporting those, even though it makes the installer over a gigabyte in size. I also found it straightforward to support x86 Linux, since Coqui supply prebuilt libraries for those platforms too.
I’ve now scratched my own itch, but I’m hoping that this code will help introduce more people to the amazing advances in open source voice technology that have been happening over the last few years, and also help increase the number of people donating their voices to Common Voice, since none of this could have happened without Mozilla’s groundbreaking efforts. There’s still a lot of room for improvement with the accuracy and language coverage, but I’m confident that this is a project the open source community can make rapid progress on.
Thanks to the Coqui team for their great contributions, and to everyone who helped me test this initial release, especially Keyi for his detailed bug reports. I’m hoping to see some fun projects emerge out of this, so please drop me a line at [email protected] or leave a comment if you do have something you’d like to share!
I think the transformative power of on-device speech to text is criminally under-rated (and I’m not alone), so I’m a massive fan of the work Coqui are doing to make the technology more widely accessible. Coqui is a startup working on a complete open source solution to speech recognition, as well as text to speech, and I’ve been lucky enough to collaborate with their team on datasets like Multilingual Spoken Words.
They have have great documentation already, but over the holidays I’ve been playing around with the code and I always like to leave a trail of breadcrumbs if I can, so in this post I’ll try to show you how to get speech recognition running locally yourself in just a few minutes. I’ve tried it on my PopOS 21.04 laptop, but it will hopefully work on most modern Linux distributions, and should be trivial to modify for other platforms that Coqui provide binaries for. To accompany this post, I’ve also published a Colab notebook, which you can use from your browser on almost any system, and demonstrates all these steps.
You’ll need to be comfortable using a terminal, but because they do offer pre-built binaries you won’t need to worry about touching code or compilation. I’ll show you how to use their tools to recognize English language text from a WAV file. The code sections below (in a monospace font) should all be run from a shell terminal window.
First we download the example executable, stt, and the shared library, libstt.so, that contains the framework code, all parts of the native_client archive.
wget --quiet https://github.com/coqui-ai/STT/releases/download/v1.1.0/native_client.tflite.Linux.tar.xz
tar -xf native_client.tflite.Linux.tar
Next, we need to fetch a model. For this example I’ve chosen the English large vocabulary model, but there are over 80 different versions available for many languages at coqui.ai/models. Note that this is the recognition model, not the language model. Language models are used to post-process the results of the neural network, and are optional. To keep things simple, in this example we’re just using the raw recognition model output, but there are lots of options to improve the quality for a particular application if you investigate things like language models and hotwords.
The stt file is a command line tool that lets you run speech to text translation using Coqui’s framework. It has a lot of options you can explore, but the simplest way to use it is to provide a recognition model and then point it at a WAV file. After some version logging you should see the predicted transcript of the speech in the audio file as the final line.
You should see output that looks something like this:
Coqui STT: v1.1.0-0-gf3605e23
why should one halt on the way
If you’ve made it this far, congratulations, you’ve just run your own speech to text engine locally on your machine! Coqui have put a lot of work into their open source speech framework, so if you want to dive in deeper I highly recommend browsing their documentation and code. Everything’s open source, even the training, so if you need something special for your own application, like a different language or specialized vocabulary, you have the chance to do it yourself.
Update – I’ve also just added a new Colab notebook showing how to build a program using STT with just a makefile and the binaryreleases, without requiring Bazel.
Even before the first version of TensorFlow was released, the XLA project was integrated as a “domain-specific compiler” for its machine learning graphs. Since then there have been a lot of other compilers aimed at ML problems, like TVM, MLIR, EON, and GLOW. They have all been very successful in different areas, but they’re still not the primary way for most users to run machine learning models. In this post I want to talk about some of the challenges that face ML compiler writers, and some approaches I think may help in the future.
I’m not a compiler expert at all, but I have been working on infrastructure to run deep learning models across different platforms for the last ten years, so most of my observations come from being a user rather than an implementer of compiler technology. I’m also writing from my own personal experience, these are all just my own opinions rather than anything endorsed by TensorFlow or Google, so take them for what they’re worth. I’m outside my area of expertise, and I’d love to hear what other people think about the areas I’m highlighting, I bet I’ll learn something interesting and new from any responses!
What is an ML compiler?
So far I’ve been talking about ML compilers like they’re a well-defined class of technology, but the term actually covers a very wide range of tools. Intuitively there’s an analogy with procedural programming, where interpreted languages tend to be used for experimentation, prototyping and research because of their flexibility, but compilation is deployed when performance is a higher priority. The computation graphs used in deep learning look a lot like programs, and the major frameworks like PyTorch and TensorFlow use interpretation to execute them, so using compilation to improve performance seems like a logical next step. All of the ML compilers take a model defined in Python in one of the major training frameworks and attempt to convert them into a different form that produces the same results. The output form is usually chosen to have some advantages in performance or portability over the original version.
For example, XLA takes the layers defined at the TensorFlow Graph level, and converts them initially into what’s known as an HLO (high-level operation) representation. This term slightly confused me initially, since from my perspective as a TensorFlow engineer the HLO operations were *lower* level than Graph operations, as individual TF ops are often broken into multiple HLOs, but it comes from the fact that these are at the highest level of XLA’s interface. These HLOs are designed to be implementable efficiently on GPUs, CPUs, and TPUs, with the hope that supporting a smaller number of mathematical operations will allow many more TF ops to be implemented by composition, increasing portability.
The definition I’ve given above may seem overly broad, and it probably is, but that’s also one of the challenges in this area. When someone offers an ML compiler as a potential solution, most engineers’ experience with procedural compilers makes them receptive to the idea, because traditional compilers have become such a vital tool for all of us. Using the term compiler is popular because of this halo effect, but it doesn’t say much about the scope of the tool. As another example, TensorFlow Lite has a different representation for its computation graph from TensorFlow, and a tool is required to convert TF models to TFLite. The current version of this tool uses MLIR, a compiler technology, to perform the conversion, but the resulting graph representation is interpreted, so it seems strange to call it a compiler. One of the common assumptions when the term compiler is used is that it will generate code, but many of them actually generate intermediate representations which are then handed over to other tools to perform further steps. This makes it necessary to dig a bit deeper into the actual capabilities of anything labeled as an ML compiler to better understand what problems it can solve.
Why is ML compilation not like procedural compilation?
I’ve mentioned that the analogy between procedural and ML compilers is imperfect, but why? The biggest reason is that deep learning computation graphs are made up of a large, arbitrary, and ever-growing set of layers. Last time I checked, stock TensorFlow has over 2,000 different operations. PyTorch is not as expansive, but it also relies more on Python within its models to implement functionality, as does JAX. This is in contrast to modern procedural languages which tend to have a comparatively small number of core primitives (keywords, built-in types and operations) and a lot of libraries implemented using those primitives to provide most of their useful functionality. It’s also a Big Deal to add primitives to a procedural language, with a long period of debate, prototyping, and consensus building before new ones are accepted.
The reason for this difference is that deep learning model authors are tightly constrained by latency and memory performance. There seems to be a practical time limit of around a week to train a model to completion, if it takes any longer then the author’s unlikely to be able to iterate with enough prototypes to produce a successful result in the end. Because training a model means running it many millions of times, it makes sense for authors trying new techniques to invest time optimizing the code they’re using. Because many people use Nvidia GPUs this often means writing a function in CUDA to implement any new computation that’s needed, rather than leaving it in the Python that might be more natural for experimenting. The consequence of this is that even operators like activation functions that involve trivial math that could easily be represented as a simple NumPy operation get implemented as separate layers, and so show up as such in the compute graph. Even worse from the framework implementer’s perspective is that authors may actually fuse together multiple conceptually-separate operations into a single layer purely for performance reasons. You can see this in the plethora of different LSTM layers available in TensorFlow, they exist because manual fusing helped speed up training for particular models.
What this means in practice is that compute graphs are made up of layers chosen for model authors’ convenience, only defined by their already-optimized implementations in C++/CUDA, and any unit tests, which are often written to test against Python libraries like NumPy, adding another layer of indirection when trying to understand what they do. They are also likely to be comparatively large in their scope, rather than being constrained to a more primitive operation. All this makes the job of anyone trying to convert them into another representation very hard. Even worse, new layers are constantly being invented by researchers!
Most ML compilers “solve” this problem by only supporting a subset of the layers. Even TensorFlow Lite and XLA only support some operations. What I’ve found from my experience is that this is an unpleasant surprise to many users hoping to convert their model from the training environment to run on another platform. Most authors aren’t even particularly aware of which ops they’re using, since they’re likely to be using a higher-level interface like Keras, so figuring out how to change a model definition to fit with any constraints can be a frustrating and confusing process.
I believe this to be the single biggest problem facing ML compilers. The only way we can hope to provide a good experience to everyday users is by changing the training environment so that models are automagically expressed in a manageable representation from the start. The current situation asks compiler authors to turn a hamburger back into a cow, it’s simply not feasible. The challenge is that adding such constraints in makes it harder to experiment with new approaches, since any additional layers would need to be represented in a form other than Python, C++, or CUDA, which are the preferred languages of researchers. Either compiler writers will have to keep chasing all the new layer implementations constantly being produced by researchers, or we’ll have to persuade researchers to write them in a more portable form.
Why are there so many layers?
So far I’ve focused on “classical” deep learning operations, but one of the reasons that there are so many layers is that compute graphs also include a lot of computation that can’t easily be expressed in a mathematical form. Layers like convolution, fully-connected, or activations can be written using a math notation and implemented using a comparatively small number of primitives, and they take up the majority of the compute time, so they’re often chosen as the first targets by compiler writers. Unfortunately there are many other layers that don’t fit as easily into something as mathematical, and where the only practical definition can be written in a procedural function using something like C++. A favorite example of mine is the non-max suppression layer used to prune the soup of bounding boxes produced by networks doing image localization. This algorithm is hard to describe except as a series of sorting, loops, and conditional statements, and it’s difficult to see how it could be represented in anything less general than LLVM’s IR.
There are a lot of operations that generate features or perform other pre-processing, or do post-processing like scoring or beam search. These are tempting to exclude from any compiler solution because they often occur before or after the body of the model where the bulk of the computation happens, and so aren’t a priority for optimization, but these do sometimes occur at performance-critical points and so I think we need a solution.
What about fallbacks?
One answer I’ve heard to this problem is that it’s always possible to fall back to the original CPU C++ implementation for the long tail of layers that are not easy to use a specialized representation for. In my opinion this removes a lot of the advantages of using a compiler. It’s no longer possible to perform fusing or other IR optimizations across the barrier formed by a non-IR layer, and the model itself is not portable across different platforms. You might think that you still have portability across platforms that support C++, but as I mentioned earlier, most layer implementations were created by research scientists and are only present in an optimized form. This means that the code is likely to rely on libraries like Eigen and use functions from the training framework itself. Consequently, porting a single layer often means porting most of the training framework and its dependencies too. This is possible, we use it for the Flex delegate in TensorFlow Lite, and PyTorch Mobile takes a similar approach, but it is a lot of work even for comparatively mainstream platforms like Android and iOS, and doesn’t work at all for anything non-Posix-like such as most embedded devices. It also takes up a lot of binary space, since server code is written to very different constraints than other platforms. Another problem is that even if the libraries relied upon for optimization are portable to other platforms, it’s not likely that they’ll offer the same performance that they do in the original environment. Performance doesn’t tend to be portable.
What about compiling Python?
A lot of frameworks rely on Python glue code to help implement models. This is great for model authors because they can use a familiar and flexible language, but it makes porting to other environments very tough. Mobile platforms don’t support Python for example, and neither do GPUs. The need for GPU support tends to push authors to re-implement any Python code that becomes a performance bottleneck, but that still leaves a lot of places where it can be useful for training. The problem from a compiler’s perspective is that parts of the definition of the computation that needs to be performed to run the model is now held in Python code, not in the regular compute graph of layers, making those parts inaccessible.
A solution to this has been to compile regular Python code, as shown by TensorFlow’s tf.function. This is very helpful in some cases, but there are plenty of times when the Python code is actually relying on other libraries, often only available as C or C++ implementations. For example, a lot of audio models will do things like create spectrograms using a specialized library, which ML compilers don’t have visibility into, or the ability to translate into another representation.
How can we make progress?
I hope this post doesn’t sound too much like an Airing of Grievances (though it is December 23rd as I write this), I’m honestly a big fan of all the compiler work that’s been happening in the ML world and I want it to continue growing. With that in mind, how can we move forward? In my mind, there are two possible futures, one where the ML ecosystem looks like Matlab, and another where it looks like LLVM.
If you’re not familiar with it, Matlab is a tool used by researchers for a lot of mathematical prototyping and exploration. There are tools to compile the resulting Matlab projects into standalone executable code, but because it’s so hard to do so completely and in an optimal way, a common workflow is for researchers to hand over their projects to engineers to write a C or C++ implementation by hand. In this flavor of the future, we’d do something similar for ML, where researchers would use frameworks focused on easy experimentation and flexibility, and the conversion process for production deployments would involve manual engineering into a more portable and optimized representation once it was finalized. As an engineer who would be likely to be responsible for this conversion process, I’m hoping we can do better for my own sake. It will also remove a lot of scope for collaboration between model authors and deploying engineers, which is a shame because iterative feedback loops can make both models and software implementations better. Unfortunately I think the Matlab model is the most likely to happen unless we can change direction.
The key LLVM innovation was the invention of an intermediate representation that was rich enough for a large set of languages, but small enough to be supported by a lot of different platforms without requiring exorbitant engineering resources for each. An IR like this for machine learning is the dream of most of the hardware vendors I know, since it would allow them to support a lot of different models and frameworks with comparatively little effort, at least compared to the current status quo. There are existing attempts that have had some success, such as ONNX or MLIR’s TOSA dialect, but they’ve all struggled either with coverage or have increased the number of layers they support to a level that makes them tougher for hardware teams to implement. This is why I come back to the need to change the training environment itself. We somehow need to come up with tools that permit researchers the flexibility to experiment, give them the performance they need to complete training in a reasonable time, but also result in a representation that can be understood separate from the training environment. Researchers are in high demand in the ML world, so it would have to be something they want to use if it’s going to get adoption. These three requirements might end up being impossible to meet, but I’m hopeful that the ML compiler community can keep innovating and come up with something that does!
One of the most exciting aspects of deep learning’s emergence in computer vision a few years ago was that it didn’t appear to require any feature engineering, unlike previous techniques like histograms-of-gradients or Haar cascades. As neural networks ate up other fields like NLP and speech, the hope was that feature engineering would become unnecessary for those domains too. At first I fully bought into this idea, and saw any remaining manually-engineered feature pipelines as legacy code that would soon be subsumed by more advanced models.
Over the last few years of working with product teams to deploy models in production I’ve realized I was wrong. I’m not the first person to raise this idea, but I have some thoughts I haven’t seen widely discussed on exactly why feature engineering isn’t going away anytime soon. One of them is that even the original vision case actually does rely on a *lot* of feature engineering, we just haven’t been paying attention. Here’s a quote from a typical blog post discussing image models:
“a deep learning system is a fully trainable system beginning from raw input, for example image pixels“
(Emphasis added by me)
I spent over a decade working on graphics and image processing, so the implicit assumption that the kinds of images we train networks on are at all “raw” always bothered me a bit. I was used to starting with truly RAW image files to preserve as much information from the original scene as possible. These formats reflect the output of the camera’s CCD hardware pretty closely. This means that the values for each pixel correspond roughly linearly to the number of photons hitting the detector at that point, and the position of each measured value is actually in a Bayer pattern, rather than a simple grid of pixels.
So, even to get to the kind of two-dimensional array of evenly spaced pixels with RGB values that ML practitioners expect an image to contain, we have to execute some kind of algorithm to resample the original values. There are deep learning approaches to this problem, but it’s clear that this is an important preprocessing step, and one that I’d argue should count as feature engineering. There’s a whole world of other transformations like this that have to be performed before we get what we’d normally recognize as an image. These include some very complex and somewhat arbitrary transformations like white balancing, which everyday camera users might only become aware of during an apocalypse. There are also steps like gamma correction, which take the high dynamic ranges possible for the CCD output values (which reflect photon counts) and scale them into numbers which more closely resemble the human eye’s response curve. Put very simplistically, we can see small differences in dark areas with much more sensitivity than differences in bright parts, so to represent images in an eight-bit byte it’s convenient to apply a gamma curve so that more of the codes are used for darker values.
I don’t want this to turn into an image processing tutorial, but I hope that these examples illustrate that there’s a lot of engineering happening before ML models get an image. I’ve come to think of these steps as feature engineering for the human visual system, and see deep learning as piggy-backing on all this work without realizing it. It makes intuitive sense to me that models benefit from the kinds of transformations that help us recognize objects in the world too. My instinct is that gamma correction makes it a lot easier to spot things in natural scenes, because you’d hope that the differences between two materials would remain roughly constant regardless of lighting conditions, and scaling the values keeps the offsets between the colors from varying as widely as they would with the raw measurements. I can easily believe that neural networks benefit from this property just like we do.
If you accept that there is a lot of hidden feature engineering happening behind the scenes even for the classic vision models, what does this mean for other applications of deep networks? My experience has been that it’s important to think explicitly about feature engineering when designing models, and if you believe your inputs are raw, it’s worth doing a deep dive to understand what’s really happening before you get your data. For example, I’ve been working with a team that’s using accelerometer and gyroscope data to interpret gestures. They were getting good results in their application, but thanks for supply-chain problems they had to change the IMU they were using. It turned out that the original part included sensor fusion to produce estimates of the device’s absolute orientation and that’s what they were feeding into the network. Other parts had different fusion algorithms which didn’t work as well, and even trying software fusion wasn’t effective. Some problems included significant lag responding to movement and biases that sent the orientation way off over time. We switched the model to using the unfused accelerometer and gyroscope values, and were able to get back a lot of the accuracy we’d lost.
In this case, deep learning did manage to eat that part of the feature engineering pipeline, but because we didn’t have a good understanding of what was happening to our input data before we started we ended up spending extra time having to deal with problems that could have been more easily handled in the design and prototyping phase. Also, I don’t have the knowledge of accelerometer hardware but I wouldn’t be at all surprised if the “raw” values we’re now using have actually been through some significant processing.
Another area that feature engineering has surprised me with its usefulness is around labeling and debugging data problems. When I was working on building a more reliable magic wand gesture model, I was getting very frustrated with my inability to tell if the training data I was capturing from people was good enough. Just staring at six curves of the acceleration and gyroscope X, Y, Z values over time wasn’t enough for me to tell if somebody had actually performed the expected gesture or not. I thought about trying to record video of the contributors, but that seemed a lot to ask. Instead, I put some work into reconstructing the absolute position and movement from the “raw” values. This effectively became an extremely poor man’s version of sensor fusion, but focused on the needs of this particular application. I was not only able to visualize the data to check its quality, I started feeding the rendered results into the model itself, improving the accuracy. It also had the side-benefit that I could display an intuitive visualization of the gesture as seen by the model back to the user, so that they could gain an understanding of why it failed to recognize some attempts and learn to adapt their movements to be clearer from the model’s perspective!
I don’t want to minimize deep learning’s achievements in reducing the toil involved in building feature pipelines, I’m still constantly amazed at how effective they are. I would like to see more emphasis put on feature engineering in research and teaching though, since it’s still an important issue that practitioners have to wrestle with to successfully deploy ML applications. I’m hoping this post will at least spark some curiosity about where your data has really been before you get it!
A colleague recently asked for more details on an approach I recommended, but which she hadn’t seen any documentation for. I realized that it was something I’d learned from talking to model builders at Google, and I wasn’t sure there was anything written up, so in the spirit of leaving a trail of breadcrumbs for anyone coming after, I thought I should put it into a quick blog post.
The summary is that if you have MaxPool or AveragePool after a convolutional layer in a network, and you’re targeting a resource-constrained system like a microcontroller, you should try removing them entirely and replacing them with a stride in the convolution instead. This has two main benefits, but to explain it’s easiest to diagram out the network before and after.
In the typical setup, shown on the left, is a convolutional layer is followed by a pooling operation. This has been common since at least AlexNet, and is still found in many modern networks. The setup I often find useful is shown on the right. I’m using an example input size of 224 wide by 224 high for this diagram, but the discussion holds true for any dimensions.
The first thing to notice is that in the standard configuration, there’s a 224x224x8 activation buffer written out to memory after the convolution layer. This is by far the biggest chunk of memory required in this part of the graph, taking over 400KB, even with eight-bit values. All ML frameworks I’m aware of will require this buffer to be instantiated and filled before the next operation can be invoked. In theory it might be possible to do tiled execution, in the way that’s common for image processing frameworks, but the added complexity hasn’t made it a priority so far. If you’re running on an embedded system, 400KB is a lot of RAM, especially since it’s only being used for temporary values. That makes it a tempting target for size optimization.
My second observation is that we’re only using 25% of those values, assuming MaxPool is doing a typical 2x reduction, taking the largest value out of 4 in a 2×2 window. From experience, these values are often very similar, so while doing the pooling does help overall accuracy a bit, taking any of those four values at random isn’t much worse. In essence, this is what removing the pooling and increasing the stride for convolution does.
Stride is an argument that controls the step size as a convolution filter is slid across the input. By default, many networks have windows that are offset from each other by one pixel horizontally, and one pixel vertically. This means (ignoring padding, which is a whole different discussion) the output is the same size as the input, but typically with more channels (eight in the diagram above). Instead of setting the stride to this default of 1 horizontally, 1 vertically, you can set it to 2,2. This means that each window is offset by two pixels vertically and horizontally from its neighbor. This results in an output array that is half the width and height of the input, and so has a quarter of the number of elements. In essence, we’re picking one of the four values that would have been chosen by the pooling operation, but without the comparison or averaging that is used in the standard configuration.
This means that the output of the convolution layer uses much less memory, resulting in a smaller arena for TFL Micro, but also reduces the computation by 75%, since only a quarter of the convolution windows are being calculated. It does result in some accuracy loss, which you can verify during training, but since it reduces the resource usage so dramatically you may even be able to increase some other parameters like the input size or number of channels and gain some back. If you do find yourself struggling for arena size, I highly recommend giving this approach a try, it’s been very helpful for a lot of our models. If you’re not sure if your model has the convolution/pooling pattern, or want to better understand the sizes of your activation buffers and how they influence the arena you’ll need, I recommend the Netron visualizer, which can take TensorFlow Lite model files.
I’ve been enjoying using the Arduino Nano Sense BLE 33 board as an all-round microcontroller for my machine learning work, but I had trouble figuring out how to programmatically write to flash memory from a sketch. I need to do this because I want to be able to download ML models over Bluetooth and then have them persist even if the user unplugs the board or resets it. After some research and experimentation I finally have a solution I’m happy with, so I’ve put an example sketch and documentation up at github.com/petewarden/arduino_nano_ble_write_flash.
The main hurdle I had to overcome was how to initialize an area of memory that would be loaded into flash when the program was first uploaded, but not touched on subsequent resets. Since modifying linker scripts isn’t recommended in the Arduino IDE, I had to come up with a home-brewed solution using const arrays and C++’s alignas() command. Thankfully it seems to work in my testing.
There’s a lot more documentation in the README and inline in the sketch, but I would warn anyone interested in this that flash has a limited number of erase/write cycles it can handle reliably, so don’t go too crazy with high-frequency changes!
I’ve now taught a lot of workshops on TinyML using the Arduino Nano Sense BLE 33 board, including the new EdX course, and while it’s a fantastic piece of technology I often have to spend a lot of time helping students figure out how to get the boards communicating with their computer. Flashing programs to the Arduino relies on having a USB connection that can use the UART serial protocol to communicate, and it turns out that there are a lot of things that can go wrong in this process. Even worse, it’s very hard to debug what’s going wrong, since the UART drivers are deep in the operating system, and vary across Windows, MacOS, and Linux computers. Students can end up getting very frustrated, even after referring to the great troubleshooting FAQ that Brian on the EdX course put together.
I’ve been trying to figure out if there’s an alternative to this approach that will make life easier. To help with that, I’ve been experimenting with how I might be able to transfer files wirelessly over the Bluetooth Low Energy protocol that the Arduino board supports, and I now have a prototype available at github.com/petewarden/ble_file_transfer. There are lots of disclaimers; it’s only a few kilobytes per second, I haven’t tested it very heavily, and it’s just a proof of concept, but I’m hoping to be able to use this to try out some approaches that will help students get started without the UART road bumps.
I also wanted to share a complete example of how to do this kind of file transfer more generally, since when I went looking for similar solutions I saw a lot of questions about how to do this but not many solutions. It’s definitely not an application that BLE is designed for, but it does seem possible to do at least. Hopefully having a version using a well-known board and WebBLE will help someone else out in the future!
This image shows a traditional water meter that’s been converted into a web API, using a cheap ESP32 camera and machine learning to understand the dials and numbers. I expect there are going to be billions of devices like this deployed over the next decade, not only for water meters but for any older device that has a dial, counter, or display. I’ve already heard from multiple teams who have legacy hardware that they need to monitor, in environments as varied as oil refineries, crop fields, office buildings, cars, and homes. Some of the devices are decades old, so until now the only option to enable remote monitoring and data gathering was to replace the system entirely with a more modern version. This is often too expensive, time-consuming, or disruptive to contemplate. Pointing a small, battery-powered camera instead offers a lot of advantages. Since there’s an air gap between the camera and the dial it’s monitoring, it’s guaranteed to not affect the rest of the system, and it’s easy to deploy as an experiment, iterating to improve it.
If you’ve ever worked with legacy software systems, this may all seem a bit familiar. Screen scraping is a common technique to use when you have a system you can’t easily change that you need to extract information from, when there’s no real API available. You take the user interface results for a query as text, HTML, or even an image, ignore the labels, buttons, and other elements you don’t care about, and try to extract the values you want. It’s always preferable to have a proper API, since the code to pull out just the information you need can be hard to write and is usually very brittle to minor changes in the interface, but it’s an incredibly common technique all the same.
The biggest reason we haven’t seen more adoption of this equivalent approach for IoT is that training and deploying machine learning models on embedded systems has been very hard. If you’ve done any deep learning tutorials at all, you’ll know that recognizing digits with MNIST is one of the easiest models to train. With the spread of frameworks like TensorFlow Lite Micro (which the example above apparently uses, though I can’t find the on-device code in that repo) and others, it’s starting to get easier to deploy on cheap, battery-powered devices, so I expect we’ll see more of these applications emerging. What I’d love to see is some middleware that understands common displays types like dials, physical or LED digits, or status lights. Then someone with a device they want to monitor could build it out of those building blocks, rather than having to train an entirely new model from scratch.
I know I’d enjoy being able to use something like this myself. I’d use a cell-connected device to watch my cable modem’s status, so I’d know when my connection was going flaky, I’d keep track of my mileage and efficiency with something stuck on my car’s dash board looking at the speedometer, odometer and gas gauge, it would be great to have my own way to monitor my electricity, gas, and water meters, I’d have my washing machine text me when it was done. I don’t know how I’d set it up physically, but I’m always paranoid about leaving the stove on, so something that looked at the gas dials would put my mind at ease.
There’s a massive amount of information out in the real world that’s can’t be remotely monitored or analyzed over time, and a lot of it is displayed through dials and displays. Waiting for all of the systems involved to be replaced with connected versions could take decades, which is why I’m so excited about this incremental approach. Just like search engines have been able to take unstructured web pages designed for people to read, and index them so we can find and use them, this physical version of screen-scraping takes displays aimed at humans and converts them into information usable from anywhere. A lot of different trends are coming together to make this possible, from cheap, capable hardware, widespread IoT data networks, software improvements, and the democratization of all these technologies. I’m excited to do my bit to hopefully help make this happen, and I can’t wait to see all the applications that you all come up with, do let me know your ideas!
A few weeks ago I was lucky enough to have the chance to present at the Linley Processor Conference. I gave a talk on “What TinyML Needs from Hardware“, and afterwards one of the attendees emailed to ask where some of my numbers came from. In particular, he was intrigued by my note on slide 6 that “Expectations are for tens or hundreds of billions of devices over the next few years“.
I thought that was a great question, since those numbers definitely don’t come from any analyst reports, and they imply at least a doubling of the whole embedded system market from its current level of 40 billion devices a year. Clearly that statement deserves at least a few citations, and I’m an engineer so I try to avoid throwing around predictions without a bit of evidence behind them.
I don’t think I have any particular gift for prophecy, but I do believe I’m in a position that very few other people have, giving me a unique view into machine learning, product teams, and the embedded hardware industry. Since TensorFlow Lite Micro is involved in the integration process for many embedded ML products, we get to hear the requirements from all sides, and see the new capabilities that are emerging from research into production. This also means I get to hear a lot about the unmet needs of product teams. What I see is that there is a lot of latent demand for technology that I believe will become feasible over the next few years, and the scale of that demand is so large that it will lead to a massive increase in the number of embedded devices shipped.
I’m basically assuming that one or more of the killer applications for embedded ML become technically possible. For example, every consumer electronics company I’ve talked to would integrate a voice interface chip into almost everything they make if it was 50 cents and used almost no power (e.g. a coin battery for a year). There’s similar interest in sensor applications for logistics, agriculture, and health, given the assumption that we can scale down the cost and energy usage. A real success in any one of these markets adds tens of billions of devices. Of course, the technical assumptions behind this aren’t certain to be achieved in the time frame of the next few years, but that’s where I stick my neck out based on what I see happening in the research world.
From my perspective, I see models and software already available for things like on-device server-quality voice recognition already, such as Pixel’s system. Of course this example currently requires 80 MB of storage and a Cortex A CPU, but from what I see happening in the MCU and DSP world, the next generation of ML accelerators will provide the needed compute capability, and I’m confident some combination of shrinking the model sizes and increased storage capacity will enable an embedded solution. Then we just need to figure out how to bring the power and price down! It’s similar for other areas like agriculture and health, there are working ML models out there just looking for the right hardware to run on, and then they’ll be able to solve real, pressing problems in the world.
I may be an incorrigible optimist, and as you can see I don’t have any hard proof that we’ll get to hundreds of billions of devices over the next few years, but I hope you can at least understand the trends I’m extrapolating from now.
Joanne and I got engaged two years ago in Paris, and were planning on getting married in the summer, before the pandemic intervened. Once it became clear that it might be years until everybody could meet up in person, especially older members of our families who were overseas, we started looking into how we could have our ceremony online, with no physical contact at all. It was unknown territory for almost everybody involved, including us, but it turned out to be a wonderful day that we’ll remember for the rest of our lives.
In the hope that we might help other couples who are navigating this new world, Joanne has written up an informal how-to guide on Zoom weddings. It covers the legal side of licenses in California, organizing the video conferencing (we used the fantastic startup Wedfuly), cakes, dresses, flowers, and even the first dance! We’re so happy that we were still able to share our love with over a hundred terrific guests, despite the adverse circumstances, so we hope this guide helps others in the same position.