Borderlands, Cookies, and Steal the Sky

fantasycookies

Photo by Megan E. O’Keefe

Last night Joanne and I finally made it along to a Borderlands Books Sponsor’s Social, and I baked some cookies since it was a potluck. My efforts were thrown into the shade by the wonderful edible pieces of art you see above though! One of the other perks of the event  was a preview sample of a few chapters of the upcoming book Steal the Sky by Megan E. O’Keefe. It was only after I read those tonight that I realized Megan was also the author of the cookies, and they were inspired by the world she created.

The sample only included the first 22 pages of the novel, but there was enough there to leave me wanting more. The impression I was left with was that the book was going to be fun, in the best possible sense, and without losing out on depth. She sketches out an interesting fantasy world with a very light hand, mercifully skipping genealogy trees or invented languages, and has drawn characters that feel worth following. There’s definitely some darkness lurking, but the world doesn’t have the all-encompassing gloom of Game of Thrones or a lot of other modern fantasy. I had a similar feeling with the Goblin Emperor, where one of my friends complained it was just ‘too nice’, but I appreciated the chance to enjoy a world that was wider than just a monotonous Mordor.

I’m also a sucker for doppelgangers, whether it’s the Changers from Consider Phlebas, the Kandra’s from Mistborn, or even P.K. Dick’s wild array of objects passing for humans, so I’m looking forward to seeing how Megan develops that theme. They can be really effective ways to ask a lot of questions about identity and loyalty, so it will be good to see how they work in this story.

It was great to see Borderlands supporting a first-time local author like this, discovering new work is exactly why I’m so glad to be a sponsor. Big thanks to Alan and the rest of the community for organizing this event, and I look forward to the next one in January!

An Engineer’s Guide to GEMM

gemm_corrected

I’ve spent most of the last couple of years worrying about the GEMM function because it’s the heart of deep learning calculations. The trouble is, I’m not very good at matrix math! I struggled through the courses I took in high school and college, barely getting a passing grade, confident that I’d never need anything so esoteric ever again. Right out of college I started working on 3D graphics engines where matrices were everywhere, and they’ve been an essential tool in my work ever since.

I managed to develop decent intuitions for 3D transformations and their 4×4 matrix representations, but not having a solid grounding in the theory left me very prone to mistakes when I moved on to more general calculations. I screwed up the first version of all my diagrams in a previous blog post, and most recently had to make a breaking API change to the open-source gemmlowp library, all because I’d messed up the ordering of the matrix multiplies.

The best way I know to fix something in my own mind is to try to explain it to somebody else, so here are my notes on the areas I found most confusing about matrix math as an engineer. I hope they’ll be helpful, and I look forward to getting corrections on anything I screwed up!

Row versus Column Major

The root of a lot of my difficulties are the two competing ways you can store matrix values in RAM. Coming from an image-processing world, when I see a 2D array of values my in-grained assumption is that it’s stored like letters on a page, starting in the top-left corner, and moving from left to right and jumping down at the end of the row. For example, if you have a matrix that you’d draw like this:
| 0 | 1 | 2 |
| 3 | 4 | 5 |
| 6 | 7 | 8 |

You would store it in memory in this order:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |

This is known as ‘row major’, because each row is stored in adjacent locations in memory.
The alternative storage scheme is ‘column major’, and the same matrix would be laid out like this in memory:
| 0 | 3 | 6 | 1 | 4 | 7 | 2 | 5 | 8 |

What this means is that we’re walking down the columns to get adjacent memory locations instead.

It’s important to know that both of these are just ways of storing the same matrix in memory, they’re an implementation detail that should not affect the underlying math, or the way you diagram your equations. There’s also no widespread agreement on what the ‘correct’ or canonical storage order to use is, so you’ll need to pay attention to what convention the code you’re interacting with expects.

Transpose

One thing you may notice about row versus column ordering is that if you screw it up and pass a square row-major matrix into a library that expects column major, or vice-versa, it will be read as the transpose of the actual matrix. Visually, you transpose a matrix by flipping all the columns into rows, like this:

| 0 | 1 | 2 | 3 |
| 4 | 5 | 6 | 7 |
| 8 | 9 |10 |11 |

Original

| 0 | 4 | 8 |
| 1 | 5 | 9 |
| 2 | 6 |10 |
| 3 | 7 |11 |

Transpose

Another way of visualizing this is drawing a line through the top-left to bottom-right corners, and flipping everything around on that diagonal. You’ll often see the transpose of a matrix indicated by adding an apostrophe to the name, so that the transpose of a matrix A is A’.

Argument Ordering

If you multiply two numbers, A * B is always the same as B * A. This is not true for matrix multiplications! Indeed, you can only multiply two matrices together at all if the number of columns on the left-hand-side is equal to the number of rows in the right-hand argument. Even if they’re both the same square size, and so can potentially be swapped, the result will still depend on the order.

There is one mathematical identity that crops up a lot in practice with transposes. If you have the standard GEMM equation of C = A * B, then C’ = B’ * A’. In words, if you swap the order of the two input matrices and transpose both of them, then multiplying them will give the transpose of the result you’d get in the original untransposed order.

#Errors % 2 == 0

What really led me into danger is that all three of storage order, transposing, and input order have effects that can mimic and cancel each other out. It’s like Jim Blinn’s old quote about all correct graphics programs having an even number of sign errors, except that there are three different ways to screw things up instead of one!

For example, what I realized last week was that I was working in a new code base that assumes row-major order, but gemmlowp assumes column major. Because I had been in a hurry and couldn’t figure out why my unit tests weren’t working, so I ended up swapping the input argument order. Since C’ = B’ * A’, the storage order error was canceled out by the argument order error! It made for very confusing code, so thankfully a co-worker slapped me round the back of the head (very politely) when he ran across it and I revisited it and figured out my errors.

Because I know I’m so prone to these kind of errors, I’ve forced myself to slow down when I’m tackling this kind of problem, and start off by working through a couple of examples on pen and paper. I find working visually with the diagram at the top of this post in mind has helped me immensely. Once I’ve got those examples straight, I’ll turn them into unit tests, with the hand calculations in the comments. You can see an example in gemmlowp/test/test.cc:698:


// Runs a small set of hand-calculated data through the implementation.
void TestWithSmallData() {
const int m = 4;
const int n = 2;
const int k = 3;
// Matrix A (LHS) is:
// | 7 | 10 | 13 | 16 |
// | 8 | 11 | 14 | 17 |
// | 9 | 12 | 15 | 18 |
const uint8_t a_data[] = {7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18};
// Matrix B (RHS) is:
// | 1 | 3 | 5 |
// | 2 | 4 | 6 |
const uint8_t b_data[] = {1, 2, 3, 4, 5, 6};
// Here are the results we expect, from hand calculations:
// (1 * 7) + (3 * 8) + (5 * 9) = 76
// (2 * 7) + (4 * 8) + (6 * 9) = 100
// (1 * 10) + (3 * 11) + (5 * 12) = 103
// (2 * 10) + (4 * 11) + (6 * 12) = 136
// (1 * 13) + (3 * 14) + (5 * 15) = 130
// (2 * 13) + (4 * 14) + (6 * 15) = 172
// (1 * 16) + (3 * 17) + (5 * 18) = 157
// (2 * 16) + (4 * 17) + (6 * 18) = 208
// That means matrix C should be:
// | 76 | 103 | 130 | 157 |
// | 100 | 136 | 172 | 208 |
const uint8_t expected_data[] = {76, 100, 103, 136, 130, 172, 157, 208};
const int c_count = m * n;
std::unique_ptr<uint8_t[]> output_data(new uint8_t[c_count]);
const bool is_a_transposed = true;
const bool is_b_transposed = true;
const bool is_c_transposed = true;
const int lda = k;
const int ldb = n;
const int ldc = n;
const int a_offset = 0;
const int b_offset = 0;
const int c_offset = 0;
const int c_mult = 1;
const int c_shift = 0;
gemmlowp::eight_bit_int_gemm::EightBitIntGemm(
is_a_transposed, is_b_transposed, is_c_transposed, m, n, k, a_data,
a_offset, lda, b_data, b_offset, ldb, output_data.get(), c_offset,
c_mult, c_shift, ldc, eight_bit_int_gemm::BitDepthSetting::A8B8);
ResultStats stats;
GetResultStats(output_data.get(), expected_data, c_count, &stats);
ResultStatsBounds bounds;
const bool good = CheckResultStatsBounds(stats, bounds);
printf("TestWithSmallData: %s\n", good ? "PASS" : "FAIL");
ReportResultStats(stats, bounds);
Check(good);
}

view raw

test.cc

hosted with ❤ by GitHub

My other key tool is keeping around a simple reference GEMM function that’s unoptimized, but that I can easily step through and add logging statements to. Since a lot of my work involves cutting corners with precision to increase speed, it’s important that I have an understandable implementation that I can play with, and compare against more complex versions. You can see my version for gemmlowp at test.cc:36:


void ReferenceEightBitIntGemm(bool transpose_a, bool transpose_b,
bool transpose_c, int m, int n, int k,
const uint8_t* a, int32_t a_offset, int lda,
const uint8_t* b, int32_t b_offset, int ldb,
uint8_t* c, int32_t c_offset, int32_t c_mult_int,
int32_t c_shift, int ldc) {
assert((c_shift >= 0) && (c_shift <= 32));
assert(a != nullptr);
assert(b != nullptr);
assert(c != nullptr);
int a_i_stride;
int a_l_stride;
if (transpose_a) {
a_i_stride = lda;
a_l_stride = 1;
} else {
a_i_stride = 1;
a_l_stride = lda;
}
int b_j_stride;
int b_l_stride;
if (transpose_b) {
b_j_stride = 1;
b_l_stride = ldb;
} else {
b_j_stride = ldb;
b_l_stride = 1;
}
int c_i_stride;
int c_j_stride;
if (transpose_c) {
c_i_stride = ldc;
c_j_stride = 1;
} else {
c_i_stride = 1;
c_j_stride = ldc;
}
int i, j, l;
const std::int32_t kRoundingTerm = (c_shift < 1) ? 0 : (1 << (c_shift – 1));
for (j = 0; j < n; j++) {
for (i = 0; i < m; i++) {
int32_t total = 0;
for (l = 0; l < k; l++) {
const int a_index = i * a_i_stride + l * a_l_stride;
const uint8_t a_as_byte = a[a_index];
const int32_t a_as_int = static_cast<int32_t>(a_as_byte) + a_offset;
const int b_index = j * b_j_stride + l * b_l_stride;
const uint8_t b_as_byte = b[b_index];
const int32_t b_as_int = static_cast<int32_t>(b_as_byte) + b_offset;
const int32_t mult_as_int = a_as_int * b_as_int;
total += mult_as_int;
}
int32_t output =
(((total + c_offset) * c_mult_int) + kRoundingTerm) >> c_shift;
if (output > 255) {
output = 255;
}
if (output < 0) {
output = 0;
}
const int c_index = i * c_i_stride + j * c_j_stride;
c[c_index] = static_cast<uint8_t>(output);
}
}
}

view raw

test.cc

hosted with ❤ by GitHub

This includes some eight-bit specific code, but the structure is common across all my reference versions, with three nested loops across the m, n, and k dimensions of the matrices. It also doesn’t include some of the standard arguments like alpha or beta. This particular code assumes column major by default, but if any of the transpose flags are set to true, then the matrix is treated as row major.

Leading Dimensions

The final pieces of the puzzle for me were the lda, ldb, and ldc arguments. These left me confused initially because I struggled to find a clear representation of that they meant. I finally realized that they were the number of values you moved forward in memory when you reached the end of a row (in row-major order) or column (in column major). They’re strides that give a lot of flexibility when you only want to work with smaller tiles inside a larger matrix, since they let you skip over values you want to ignore. If you’re not dealing with sub-tiles, then they’ll be the number of columns for a row-major matrix, and the number of rows for a column-major one.

Anyway, I hope these notes help any other lost souls who are struggling with matrices. I’m now feeling a lot more confident, and I wish I’d taken the time to study them more carefully before. They’re very powerful tools, and actually a lot of fun once I moved past some of my confusion!

One Weird Trick for Faster Android Multithreading

deckchair

Photo by Gary Perlmutter

It’s not often that doing nothing speeds things up, but that’s exactly how Benoit Jacob just optimized the gemmlowp project on Android. Gemmlowp is a specialized library for doing large matrix multiplications on eight-bit values, which is vital for neural networks (as I’ve talked about before). He had managed to write great ARM assembler routines and careful memory management code to get strong single-core performance on Android, but when we tried to thread the code across two or four cores, we only saw a very small improvement. It wasn’t scaling with the number of cores, despite being a very parallelizeable  algorithm where we could easily split the calculations into independent shards.

Looking at the behavior with tools like systrace, we could see that there were lots of long gaps where cores weren’t being used. Another clue was that changing the cpu governor to a more aggressive mode (something that’s only possible for developers) made performance dramatically better. Governors exist to make sure that applications are efficient in their use of battery power, mostly by lowering the frequencies or disabling cores that aren’t in use. That led us to believe that something was making the governor believe we didn’t need all the cores, even though we were trying to do short bursts of very intensive computation that would benefit from them.

The problem was that we were doing a lot of comparatively ‘small’ calculations, sometimes only a few million operations at a time, with sync points in between. We were using the standard pthread_cond_wait() function in the worker threads to look for new chunks of work to be available. The trouble is that the ‘granularity’ of the waits seemed to be of the magnitude of a millisecond or so, which meant that if there wasn’t immediately new work after one chunk was completed, the thread would go to sleep for quite a while. That seemed to cause a double penalty where not only did the work take longer because of those thread delays, but the governor saw sparse usage of the cores and so downgraded performance even more to save power!

All of these theories about the exact mechanism are just our speculation, but we ended up adding a short period of busy-waiting before we drop back to pthread_cond_wait(), and that improved our overall performance massively. We’re now spinning in a NOP loop for about 32 million cycles (which may sound like a lot but is only about 10 to 20 milliseconds, depending on the clock rate), so by doing nothing we actually end up going faster! We found we could get most of the benefit with even shorter busy-wait lengths, but the current number of NOPs was optimal for our use cases.

It sounds incredible, but you can try it for yourself by changing kMaxBusyWaitNOPs to zero on line 33 in internal/multi_thread_gemm.h, and running the benchmark with:

./scripts/test-android.sh test/benchmark.cc

There’s a series of benchmark results reflecting different use cases, but on my Nexus 5 (with four cores) I see over double the speed with the busy-waiting in place. As an aside, the labeling is a bit off since they’re not floating point operations, but seeing 45 giga-ops on a real-world use case on a years-old phone still blows my mind, we have an incredible amount of computing power in our pockets!

The big concern is obviously power usage, so we used the Monsoon power monitor to check and we actually end up using substantially less energy per operation, as well as completing more quickly (268 pico-joules per op versus 353 single-threaded).

Once the busy-waiting was in place, we were able to see one remaining problem. The original code spun up as many worker threads as there were cores, but this ignored the main thread, which was effectively contending for one core with a worker. By instead having (N-1) worker threads, and having one worker function run on the main thread, we saw a noticeable improvement.

I don’t recommend trying this approach at home unless you’re very sure you need it after careful profiling, since I’d hate to spawn a new generation of apps that are needlessly wasting users’ batteries. If you are in the situation where you’re doing heavy numerical computation in comparatively small chunks and need to spread the work across multiple cores, it’s worth taking a look at the WaitForVariableChange() code. One direction I would also recommend looking at is pthread_spin_lock() , since that promises a similar low-latency busy-waiting approach to waiting in a more standard way (and we may be reinventing the wheel here) but we haven’t had a chance to experiment with that deeply yet.

Anyway, these are just notes from our own adventures, and I’d love to hear from anyone else who’s explored the mysteries of numerical computation on smartphones, it’s a fascinating area for me!

Smartphone Energy Consumption

energybook

I never used to worry about energy usage but over the last few years most of my code runs on smartphones, so it has become one of my top priorities. The trouble is I’ve never had any training or experience in the area, so I’ve had to learn how to do it from experience, other engineers, and the internet, most of which feels more like folklore than engineering. A few days ago I was trying to learn more about the shiny new Monsoon power monitor on my desk when I came across “Smartphone Energy Consumption” from Cambridge University Press.

I’ve now had a chance to read it, and it has helped fill in a lot of gaps in my education. I highly recommend picking up a copy if you’re facing any of the same problems, and to give you a taster here are some of the things I learned. Rest assured that any mistakes in here are my own, not the authors!

There are lots of different ways to measure battery capacity – joules, watt-hours, and milli-amp hours. If I’m getting my physics straight, 1 watt hour is 3,600 joules, and if you assume a rough voltage of 4V, 1 milli-amp hour is 0.001 * 4v * 60 * 60, or 14.4 joules. A typical phone battery might have 2,000 mAh, or about 29,000 joules.

Most phones can’t dissipate more than 3 watts of heat, so that’s a practical limit for sustained power usage. Wearables can have much lower limits.

There are two main ways power is used in mobile chips. Switching loss is the power used to change gate states in the silicon, and static loss is more like a steady leakage. Switching power decreases with the square of the voltage, so halving the voltage reduces its effect by 75%, whereas static power shrinks linearly. That leads to some interesting situations, where having two cores running at half the voltage and frequency (since lower clock frequencies allow lower voltages to be used) may take the same time to complete a task, but at half the power of a single higher-voltage core. Turning off parts of the chip entirely is the key to reducing static leakage.

There’s a great practical guide to wiring up Monsoons, exactly what I need right now! There’s also a great section on human-battery interaction, that I just skimmed, but which covers a lot of research. The key takeaway for me was that users get very annoyed if their battery starts draining more quickly than it used to, and will uninstall recent apps to fix the problem, so developer should be highly motivated to reduce their power consumption.

I found a lot of very useful estimates for components power usages scattered through the book. These are just rough guides, but they helped my mental modeling, so here are some I found notable:

  • An ARM A9 CPU can use between 500 and 2,000 mW.
  • A display might use 400 mW.
  • Active cell radio might use 800 mW.
  • Bluetooth might use 100 mW.
  • Accelerometer is 21 mW.
  • Gyroscope is 130 mW.
  • Microphone is 101 mW.
  • GPS is 176 mW.
  • Using the camera in ‘viewfinder’ mode, focusing and looking at a picture preview, might use 1,000 mW.
  • Actually recording video might take another 200 to 1,000 mW on top of that.

A key problem for wireless network communication is the ‘tail energy’ used to keep the radio active after the last communication, even when nothing’s being sent. This is vital for responsiveness, but it can be ten seconds for LTE, so apparently short communications can use a lot more energy than you’d expect. Sending a single byte can use a massive amount of power if it keeps the radio active for ten seconds after!

A Microsoft paper showed that over 50% of the power on several popular games is consumed by the ads they show!

There’s some interesting work on modeling the tradeoffs between computation offloading (moving work to the cloud from the phone) and communication offloading (doing more work on the device to reduce network costs). I’m a big believer that we should do more work on-device, so it was great to have a better foundation for modeling those tradeoffs. One example they give is using the Android SDK on a 1080p image to detect faces on-device, and taking 3.2 seconds and 9 joules, whereas sending the image to a nearby server was quicker, even with the extra power of network traffic.

Anyway, it’s a great piece of work so if this sort of information is useful, go pick up a copy yourself, there’s a lot more than I can cover here!

Semantic Sensors

peoplecounter

Video from Springboard

The other day I was catching up with neighborhood news, and saw this article about “people counters” in San Francisco’s tourist district. These are cameras watching the sidewalks and totaling up how many pedestrians are walking past. The results weren’t earth-shattering, but I was fascinated because I’d never heard of the technology before. Digging in deeper, I discovered there’s a whole industry of competing vendors offering similar devices.

Why am I so interested in these? Traditionally we’ve always thought about cameras as devices to capture pictures for humans to watch. People counters only use images as an intermediate stage in their data pipeline, their real output is just the coordinates of nearby pedestrians. Right now this is a very niche application, because the systems cost $2,100 each. What happens when something similar costs $2, or even 20 cents? And how about combining that price point with rapidly-improving computer vision, allowing far more information to be derived from images?

Those trends are why I think we’re going to see a lot of “Semantic Sensors” emerging. These will be tiny, cheap, all-in-one modules that capture raw noisy data from the real world, have built-in AI for analysis, and only output a few high-level signals. Imagine a small optical sensor that is wired like a switch, but turns on when it sees someone wave up, and off when they wave down. Here are some other concrete examples of what I think they might enable:

  •  Meeting room lights that stay on when there’s a person sitting there, even if the conference call has paralyzed them into immobility.
  •  Gestural interfaces on anything with a switch.
  •  Parking meters that can tell if there’s a car in their spot, and automatically charge based on the license plate.
  •  Cat-flaps that only let in cats, not raccoons!
  •  Farm gates that spot sick or injured animals.
  •  Streetlights that dim themselves when nobody’s around, and even report car crashes or house fires.
  •  Stop lights that vary their timing cycle depending on whether there are any vehicles or pedestrians approaching from each direction, and will prioritize emergency vehicles.
  •  Drug cabinet doors that keep track of the medicines you have inside, help you find them, and re-order when you’re out.
  •  Shop window display items that spring to life when passers-by are looking at them, using eye tracking.
  •  Canary sensors scattered through crops that spot and report any pests or weeds they see, to minimize the use of chemicals.
  •  IFTTT-style hardware mashups, with quirky niche applications like tea-kettles that turn themselves on if you stare longingly at them, art installations that let you paint on them with hand gestures, or lawn sprinklers that know if it’s been raining, and only water the parts that are starting to go brown.

For all of these applications, the images involved are just an implementation detail, they can be immediately discarded. From a systems view, they’re just black boxes that output data about the local environment. Engineers with no background in vision will be able to integrate them, and get useful signals to drive their applications. There are already a few specialist devices like Omron’s Human Vision Components, but imagine when these become common components, standardized so they can be easily plugged into existing designs and cheap enough to be used on everyday items.

I don’t have a crystal ball, and all of these are purely my own personal musings, but it seems obvious to me that machine vision is becoming a commodity. Once the technology’s truly democratized, I believe it will give computers a window into the real world they’ve never had before, and enable interfaces and responses to the environment we’ve never even dreamed of. I think a big part of that will be the emergence of these “semantic sensors” that output human-meaningful data about what’s happening around them.