PopCount on ARM64 in Go Assembler

Apropos of Apple’s ARM announcment, I thought I might write up a post on a recent bit of code I wrote that specifically looks at ARM64, and its benchmarks on various hardware.

I’ve been implementing some compact data structures for a project. One of the CPU hotspots for the implementation is the need to run a quick population count across a potentially large bit of memory.

If you’ve never seen population count before, it’s the count of the number of set 1 bits in a byte (or list of bytes) – for example:

0xF3 == 0b11110011
popCount(0xF3) == 6

Now, every reasonable x86_64/amd64¹ CPU in the past decade or so has a built-in instruction for this: POPCNT. It works like this (in Go Assembler):

MOV    $0xF3, R10  // Store a constant
POPCNT   R10, AX   // AX now equals 6

Go uses the built-in instruction for this in the math/bits via SSA compiler rewrites (which it added in 1.9), but only up to a uint64 at a time; using assembly to loop over a []byte is considerably more efficient

@tmthrgd created a really nice little x86_64-assembly-optimized package at github.com/tmthrgd/go-popcount. It works great, works around a weird little Intel bug (see the helpful comments) and is still faster than looping and using the Go standard library, which it helpfully benchmarks as well.

Recently, I picked up one of the new 8GB Raspberry Pi 4s. Loaded it up with the nice new Manjaro 20.06 and set up my usual environment. As a test, I wanted to try my latest WIP data structure code. Of course, the bottleneck was right where I expected it to be: in the population count.

Implementing it

I discovered that ARM64 has a POPCNT-like instruction, logically enough called CNT. I thought, since I’ve been playing with Go assembly for memmove, why not try my hand at the new architecture?

Go already has the SSA-rewrite for OnesCount on ARM (added in 1.11), but again, only a uint64 at a time. There might be some performance on the table.

Official architecture guide at the ready, I got to work. And there was a lot to learn. Some notes:

1. It’s part of the vector suite, NEON

NEON is the name for the addition of vector instructions to the ARM architecture, so I’d be working with both an unfamiliar architecture and its vector instructions.

In x86-land, POPCNT and vectorization are two separate concepts. POPCNT, as an instruction, deals with everyday, 64-bit integer registers, and not the vector registers (even though it appeared approximately the same time as the addition to vector instructions, SSE4)².

ARM/NEON did CNT differently. Since you can load an array of items (say, 16 bytes, in the ARM64 vector registers), CNT will count them individually. In fact, you can only do it in vectors of single bytes.

So this means that the following signatures are approximately the way to think about it:

func popCountx86(in uint64) uint64 // register-at-a-time, 64-bit (8 bytes)

func popCountNeon(in [16]byte) [16]uint8 // 16 bytes all counted in parallel.

This ends up being really effective, as it turns out (below).

2. Go’s assembler documentation is barebones

Writing instructions so that Go’s assembler is happy with the instructions you’re giving it is a bit frustrating.

There’s a good gloss at golang.org/pkg/cmd/internal/obj/arm64 which gives an overview of many of the differences. For example, all vector instructions start with V, different than what ARM64 switched to (they used to commonly start with V on 32-bit ARM) – so while I understand the desire for continuity (and even subtly like it, knowing it’s a vector op) it’s just makes another little difference to remember vs. the original documentation.

But more frustrating is, even if your instruction is supported (and most of them are) knowing how to use the instruction in Go assembler boils down to “assume data goes left-to-right and hope there’s an example in the test suite”

I’m a big Go fan, yet Go’s history into Plan 9 and accompanying assembler (and, relatedly, odd calling conventions) is one of my gripes about Go, even more than lack of generics (which is a topic for another day). Sure, there were some good ideas in Plan 9 that influenced the design of Go – from a design level, it’s great! – but on the implementation level, this is one place where I kinda wish it had followed precident. Take whichever side you want in the Intel vs GNU syntax debate, creating a third option means relearning all the quirks from scratch, and ignoring any documentation that already exists.

Putting it all together

The end result is my friendly fork of go-popcount: github.com/barakmich/go-popcount

Really, it’s more of an extension than a fork – it provides the same API, just with handwritten assembly for ARM64 chips.

How it works

The vectorization works really well. The process is:

Load a set of vector registers, 16 bytes each
popCount them
Vector sum their partial results (up to 32 individual vectors, to fit the 8-bit counts), trying to avoid a data dependency
Finally, sum (“widening”, in vector terms) the final vector
Add it to the final output

The other thing to balance was how much to load from memory vs. how much work to do to optimize throughput. That ended up being about 8 vectors (128 bytes) at a time. That may vary as a function of CPU, but it’s a good place to start.

ARM64 feels nice

This is purely subjective, but there were a number of moments where I felt “hey, that’s handy” in writing ARM64 assembly. Of course modern x86_64 chips account for all of these differences and makes them performant – through deeper instruction pipelines or having micro-op instruction queues that ultimately pull the same tricks. But at the same time, when you’re dropping down to work at the instruction level, it’s kind of a breath of fresh air.

Pre-and-post increment

A lot of the time when you’re working with an array of whatever you’re pulling a chunk of memory into registers, doing some transform, and putting it back.

VLD1.P 64(R1), [V16.B16, V17.B16, V18.B16, V19.B16]

Reads as load 1-byte*size structures into the following vector registers – so far so good, this is similar to the VMOVDQU instruction family (though the size-structure variants on that instruction are a recent addition) on x86. It has a similar ability to load many registers in multiple back-to-back instructions through address/offset/size calculation, but ARM has a nice one-liner that way.

But I really like the auto-increment of R1 by the read size (64) after loading – hence post-increment. Many loads from memory have a similar flag. It’s very descriptive and means things like “increment the offset register and decrement the size register and test” things live with their appropriate parts of the code, instead of having to increment later (and finding the optimal time)

Consistency of style

This is a holdover from history, but the consistency of having fixed-size instructions is a nice thing when trying to hand-assemble an instruction. I had to do some hand-assembly when I discovered and reported a bug in Go. It was silently writing the wrong version of the instruction while accepting the correct one as input. Kudos to the Go community – it was fixed by an expert within a day or two, so I’m looking forward to the next version that contains the fix!

Still, this meant with a steady hand and a copy of the architecture guide I could feasibly implement any instructions that were missing.

Also in consistency-land, most binary operations take input1_reg, input2_reg, output_reg with few exceptions. Omitting the output_reg is Go’s assembler syntactic sugar to set output to input2. x86, again for historical reasons (trying to keep instructions small), often has the store-to-the-second-register as the primary or only version of an operation, which can lead to more operations overall (and cognitive overhead IMO).

Benchmarking on ARM

So let’s take a look at some benchmarks. The most interesting thing about looking at population count is that this little routine does something useful and shows tradeoffs between CPU bounds and memory bandwith between the CPU, the on-chip caches, and main memory. At array sizes small enough to fit into CPU cache (but big enough to run the compute loop a few times), the CPU is the limiting factor – how many bits it can count. For larger data sizes, the memory bandwidth becomes the bound; the CPU is waiting on getting enough data to crunch through.

To this end, the benchmark curves in the repository max out in throughput at about 16K (most work possible, while still being in cache) and then trail down into a steady state as memory becomes the bound. So I’ll truncate the full benchmarks to compare peak throughput and long tail.

Some findings and commentary:

Raspberry Pi 4

Unoptimized (Go implementation):
BenchmarkCountBytesGo/16K              297778          8056 ns/op     2033.78 MB/s
BenchmarkCountBytesGo/512M                  8     279380432 ns/op     1921.65 MB/s

Optimized (My hand-rolled assember):
BenchmarkCountBytes/16K                520807          2303 ns/op     7113.24 MB/s
BenchmarkCountBytes/512M                    8     131214574 ns/op     4091.55 MB/s

This was my finished product on my local Pi. I may be able to do better, but varying the block sizes between grabbing from memory and doing the vector addition for popcount topped out about here, so I’m fairly satisfied.

Interestingly, the ~4.1GB/s memory bandwidth follows exactly with initial read benchmarks of the Pi 4 suggesting it’s close to saturation, which is good news.

Ampere eMag

So my next thought was to spin up an ARM64 server with my old friends at Packet. They have a c2.large.arm and it’s gonna be great!

Unoptimized:
BenchmarkCountBytesGo/16K      	   69939	     17208 ns/op	 952.09 MB/s
BenchmarkCountBytesGo/512M     	       2	 582293709 ns/op	 921.99 MB/s

Optimized:
BenchmarkCountBytes/16K        	  458394	      2614 ns/op	6267.80 MB/s
BenchmarkCountBytes/512M       	      12	  93371433 ns/op	5749.84 MB/s

…but I was rather underwhelmed.

This isn’t necessarily Packet’s fault – they were early onto having ARM hardware available and it’s more that the eMag is just a somewhat older chip design. Related benchmarks upon Googling for them show similar characteristics. It’s a fairly good chip on evenly distributed workloads across many cores, but I’m mostly testing single-core throughput with this benchmark, which it’s not built for as much.

Amazon Graviton2

The real winner was the latest ARM64 hardware from Amazon, the Graviton2:

Unoptimized:
BenchmarkCountBytesGo/16K             1000000          2281 ns/op     7182.90 MB/s
BenchmarkCountBytesGo/512M                 31      76826628 ns/op     6988.08 MB/s

Optimized:
BenchmarkCountBytes/16K               4199566           571 ns/op    28673.85 MB/s
BenchmarkCountBytes/512M                   85      27507419 ns/op    19517.31 MB/s

Wow. With just the default Go implementation it’s close to my optimized Pi 4, but with the real assembler it’s downright impressive.

Which makes me ask, how’s that compare to the Intel world?

Baseline: Haswell Xeon x86_64

My best CPU on hand is probably my Haswell (E3-1231v3) fileserver – sure it’s a little old but it’s at least a decent baseline in the Go world, given my previous post:

Unoptimized:
BenchmarkCountBytesGo/16K              893949          1330 ns/op    12316.43 MB/s
BenchmarkCountBytesGo/512M                 21      53529366 ns/op    10029.47 MB/s

Optimized:
BenchmarkCountBytes/16K               2068282           574 ns/op    28524.35 MB/s
BenchmarkCountBytes/512M                   28      40770214 ns/op    13168.21 MB/s

We can see the effect that Go’s built-in optimization for uint64 has, in that it’s closer to the hand-rolled assembly from @tmthrgd. It’s still clearly better to use the library, but at least some work has gone into optimizing the math/bits package of Go’s standard library.

Nonetheless, the Haswell does roughly as well as the Graviton2 on this test, except the newer chip has seemingly better memory bandwidth. Definitely makes me want to play with bigger ARM hardware. I’m more willing to believe that there’s something to the performance per unit cost argument from Amazon.

AWS m5.4xlarge

For fun, let’s try a more modern core (Xeon Platinum 8175M CPU @ 2.50GHz) on AWS to compare that to:

Unoptimized:
BenchmarkCountBytesGo/16K              795211          1509 ns/op    10855.33 MB/s
BenchmarkCountBytesGo/512M                 18      64222468 ns/op     8359.55 MB/s

Optimized:
BenchmarkCountBytes/16K               1783059           673 ns/op    24344.22 MB/s
BenchmarkCountBytes/512M                   24      47282854 ns/op    11354.45 MB/s

…I guess I’m slightly disappointed, but it tracks with the single-core benchmarks on cpubenchmark.net. In fact, that list more-or-less sums up the single-core Xeon world for this particular case.

Summing up and what’s next?

I’m really impressed with the potential for ARM64 if the new Graviton2 processors are where it’s going. That, and the Pi 4 8GB has been a really fun little board, now that it has enough RAM to have a reasonable page cache.

There’s a neat paper on the arXiv with a faster population count implementation of using AVX2 instructions (so, approximately Haswell or newer CPUs, which again works for Go). Since those instructions are available, it may be worth benchmarking some more x86_64 Go assembly to use it.

Similarly with the aforementioned new instructions in AVX-512; it’d be fun to benchmark them if they become more commonplace.

If I had a nickel for every time I typoed arm64 and amd64. Even though they’re the architecture monikers for Go (and Debian), I prefer the GNU triplet forms of aarch64 and x86_64 (the latter I use in the rest of the post for clarity). ↩︎
This is getting an upgrade in AVX-512, which is the new hotness on the x86_64 landscape, but those chips are very expensive and only a handful of them acutally implement this yet. ↩︎