It’s not often that doing nothing speeds things up, but that’s exactly how Benoit Jacob just optimized the gemmlowp project on Android. Gemmlowp is a specialized library for doing large matrix multiplications on eight-bit values, which is vital for neural networks (as I’ve talked about before). He had managed to write great ARM assembler routines and careful memory management code to get strong single-core performance on Android, but when we tried to thread the code across two or four cores, we only saw a very small improvement. It wasn’t scaling with the number of cores, despite being a very parallelizable algorithm where we could easily split the calculations into independent shards.
Looking at the behavior with tools like systrace, we could see that there were lots of long gaps where cores weren’t being used. Another clue was that changing the cpu governor to a more aggressive mode (something that’s only possible for developers) made performance dramatically better. Governors exist to make sure that applications are efficient in their use of battery power, mostly by lowering the frequencies or disabling cores that aren’t in use. That led us to believe that something was making the governor believe we didn’t need all the cores, even though we were trying to do short bursts of very intensive computation that would benefit from them.
The problem was that we were doing a lot of comparatively ‘small’ calculations, sometimes only a few million operations at a time, with sync points in between. We were using the standard pthread_cond_wait() function in the worker threads to wait for new chunks of work to become available. The trouble is that the ‘granularity’ of the waits seemed to be on the order of a millisecond or so, which meant that if new work wasn’t available immediately after one chunk was completed, the thread would go to sleep for quite a while. That seemed to cause a double penalty: not only did the work take longer because of those thread delays, but the governor also saw sparse usage of the cores and so downgraded performance even more to save power!
All of these theories about the exact mechanism are just our speculation, but we ended up adding a short period of busy-waiting before we drop back to pthread_cond_wait(), and that improved our overall performance massively. We’re now spinning in a NOP loop for about 32 million cycles (which may sound like a lot but is only about 10 to 20 milliseconds, depending on the clock rate), so by doing nothing we actually end up going faster! We found we could get most of the benefit with even shorter busy-wait lengths, but the current number of NOPs was optimal for our use cases.
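To make the idea concrete, here’s a minimal sketch of the spin-then-block pattern in portable C++. This is not the gemmlowp code: the names WorkSignal, WaitForChange, Publish and the kSpinIterations budget are all made up for illustration, and it polls an atomic rather than executing NOP instructions. The waiter checks the variable in a tight loop for a bounded number of iterations, and only falls back to a blocking condition-variable wait if nothing has changed by then.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical spin budget, not the value gemmlowp uses; tune it by profiling.
constexpr std::uint64_t kSpinIterations = 1000000;

// Illustrative signal that a producer changes when new work is ready.
struct WorkSignal {
  std::atomic<int> value{0};
  std::mutex mutex;
  std::condition_variable cond;
};

// Returns the new value once signal.value differs from `initial`.
int WaitForChange(WorkSignal& signal, int initial) {
  // Phase 1: busy-wait, so a quick hand-off never puts the thread to sleep.
  for (std::uint64_t i = 0; i < kSpinIterations; ++i) {
    const int current = signal.value.load(std::memory_order_acquire);
    if (current != initial) return current;
  }
  // Phase 2: the spin budget is used up, so block until we are notified.
  std::unique_lock<std::mutex> lock(signal.mutex);
  signal.cond.wait(lock, [&] {
    return signal.value.load(std::memory_order_acquire) != initial;
  });
  return signal.value.load(std::memory_order_acquire);
}

// Publishes a new value and wakes any thread that has already gone to sleep.
void Publish(WorkSignal& signal, int new_value) {
  {
    // Storing under the mutex means a waiter that is about to block
    // cannot miss the notification.
    std::lock_guard<std::mutex> lock(signal.mutex);
    assert(new_value != signal.value.load());  // an unchanged value wakes nobody
    signal.value.store(new_value, std::memory_order_release);
  }
  signal.cond.notify_all();
}
```

A worker would call WaitForChange() with the last value it saw, and the thread handing out work would call Publish() with a fresh one. The spin budget is the knob that trades latency against wasted cycles.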
It sounds incredible, but you can try it for yourself by changing kMaxBusyWaitNOPs to zero on line 33 in internal/multi_thread_gemm.h, and running the benchmark with:
There’s a series of benchmark results reflecting different use cases, but on my Nexus 5 (with four cores) I see over double the speed with the busy-waiting in place. As an aside, the labeling is a bit off since they’re not floating point operations, but seeing 45 giga-ops on a real-world use case on a years-old phone still blows my mind. We have an incredible amount of computing power in our pockets!
The big concern is obviously power usage, so we used the Monsoon power monitor to check, and we actually end up using substantially less energy per operation (268 pico-joules per op versus 353 single-threaded), as well as completing more quickly.
Once the busy-waiting was in place, we were able to see one remaining problem. The original code spun up as many worker threads as there were cores, but this ignored the main thread, which was effectively contending for one core with a worker. By instead having (N-1) worker threads, and having one worker function run on the main thread, we saw a noticeable improvement.
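Here’s a rough sketch of that (N-1) arrangement, again in plain C++ rather than the actual gemmlowp code. RunShards is a hypothetical helper, and for brevity it spawns fresh threads on every call instead of keeping a persistent pool of workers as a real implementation would. The point is only that the calling thread takes a shard itself instead of sitting idle and competing with a worker for a core.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Runs shard_fn(i) for every i in [0, num_shards), using num_shards - 1
// worker threads plus the calling thread. Hypothetical helper for illustration.
void RunShards(int num_shards, const std::function<void(int)>& shard_fn) {
  assert(num_shards >= 1);
  std::vector<std::thread> workers;
  workers.reserve(num_shards - 1);
  // Shards 1..N-1 each get their own worker thread.
  for (int i = 1; i < num_shards; ++i) {
    workers.emplace_back(shard_fn, i);
  }
  // Shard 0 runs right here on the calling (main) thread.
  shard_fn(0);
  // Wait for the workers before returning to the caller.
  for (auto& worker : workers) {
    worker.join();
  }
}
```

With four cores you would call RunShards(4, ...), which creates three workers and keeps the main thread busy with the fourth shard.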
I don’t recommend trying this approach at home unless you’re very sure you need it after careful profiling, since I’d hate to spawn a new generation of apps that are needlessly wasting users’ batteries. If you are in the situation where you’re doing heavy numerical computation in comparatively small chunks and need to spread the work across multiple cores, it’s worth taking a look at the WaitForVariableChange() code. One direction I would also recommend looking at is pthread_spin_lock(), since that promises a similar low-latency busy-waiting approach in a more standard way (and we may be reinventing the wheel here), but we haven’t had a chance to experiment with it deeply yet.
Anyway, these are just notes from our own adventures, and I’d love to hear from anyone else who’s explored the mysteries of numerical computation on smartphones. It’s a fascinating area for me!