Why is my memmove slow?
So I got caught in a bit of a rabbit hole recently.
It all started with implementing some compact data structures. Specifically, this meant inserting bytes into a byte slice, usually close to the front. Very close to:
array = append([]byte{0x00, 0x00}, array...)
This is expensive, of course, if you’re trying to keep the bytes contiguous (say, in a fixed block of memory), since it requires a potentially O(n) copy – especially when the insertion point is close to the front.
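To make the cost concrete, here’s a minimal sketch – my own illustration, not the actual data structure in question – of what that insert looks like when you keep the bytes in one contiguous, preallocated block:
func insertFront(buf []byte, b0, b1 byte) []byte {
	// Assumes cap(buf) >= len(buf)+2 so append doesn't reallocate.
	buf = append(buf, 0, 0)
	// Shift everything down by two: this overlapping copy is the memmove.
	copy(buf[2:], buf[:len(buf)-2])
	buf[0], buf[1] = b0, b1
	return buf
}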
So with the Go profiler, most of the CPU time was spent in runtime.memmove – which makes sense, as we’re doing that copy to shift everything down.
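(If you want to reproduce that kind of profile, the standard runtime/pprof package is enough. Here’s a generic sketch – not my exact harness – wrapped around the same sort of workload as the Go test program below:)
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Front-insert workload; runtime.memmove dominates the resulting profile.
	slice := make([]byte, 0, 128*1024)
	for a := 0; a < 64*1024; a++ {
		slice = append([]byte{0x00, 0x00}, slice...)
	}
	_ = slice
}
Running go tool pprof cpu.prof (and its top command) then shows where the time went.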
memcpy go brrrrrr
I wondered what gcc would do. So I wrote two little test programs:
#include <vector>

int main(int argc, char *argv[]) {
  for (int i = 0; i < 30; i++) {
    std::vector<int> doit;
    doit.reserve(128 * 1024);
    for (int a = 0; a < 64 * 1024; a++) {
      doit.insert(doit.begin(), 2, 0);
    }
  }
  return 0;
}
package main

func main() {
	for i := 0; i < 30; i++ {
		slice := make([]byte, 0, 128*1024)
		for a := 0; a < 64*1024; a++ {
			slice = append([]byte{0x00, 0x00}, slice...)
		}
	}
}
Now, these aren’t very profound programs. But they exercise the same notion: preallocate some space, then insert two elements at a time at the front.
Turns out, they had wildly different performance:
$ g++ -o testmem-c testmem.cc
$ time ./testmem-c
./testmem-c 21.37s user 0.00s system 99% cpu 21.397 total
$ go build -o testmem-go testmem.go
$ time ./testmem-go
./testmem-go 93.18s user 18.54s system 202% cpu 55.234 total
perf to the rescue
For a second, I wondered if I wanted to rewrite everything in C++ (but refrained). Turns out, I had just been reading the Cloudflare blog post When Bloom Filters Don’t Bloom – an excellent piece on low-level performance – and decided I’d try my hand at the Linux perf tool.
Almost all of the time in the C++ version is spent in glibc, which of course makes sense – as I already knew, this is really leaning on memmove. The interesting part was the name of the function inside glibc: __memmove_sse2_unaligned_erms. Sure enough, the assembly was using MOVUPS and MOVAPS – SSE-family instructions – to move data through the 128-bit XMM registers.
The general strategy with memmove (for large enough data) is to pull down as much memory as you reasonably can in a few instructions (often the biggest registers you have available) and copy it back out to the new location.
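In rough pseudo-Go – the real implementations are hand-written assembly, and the names and chunk size here are mine – the backward, overlapping case looks something like this:
// copyBackwardChunks copies src to dst when the regions may overlap with
// dst above src, so it walks from the end toward the start. Real memmoves
// do each block with wide SIMD loads/stores (e.g. four 32-byte YMM registers
// per 128-byte iteration) rather than byte slices.
func copyBackwardChunks(dst, src []byte) {
	const chunk = 128
	n := len(src)
	for n >= chunk {
		copy(dst[n-chunk:n], src[n-chunk:n]) // one "wide" block per iteration
		n -= chunk
	}
	copy(dst[:n], src[:n]) // whatever's left at the front
}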
But wait – I know my dev server downstairs is a used R620 that’s a little old, but it’s not that old. Heck, /proc/cpuinfo shows my Xeon E5-2690 has AVX instructions (and therefore even bigger, 256-bit YMM registers) – so why is even gcc using SSE2? I wondered what would happen on a different machine.
We’re gonna need a bigger box
So I committed my test files to my git server and pulled them down on my fileserver, which happens to be a Xeon E3-1231v3. Recompile and run:
./test-c 17.90s user 0.00s system 99% cpu 17.936 total
./test-go 35.15s user 5.94s system 176% cpu 23.221 total
Hot dog! Slower, yes, but much closer to each other. What does the perf say?
97.81% test-c libc-2.31.so [.] __memmove_avx_unaligned_erms
...
15.19 │258: vmovdqu (%rcx),%ymm0
15.61 │ vmovdqu -0x20(%rcx),%ymm1
15.04 │ vmovdqu -0x40(%rcx),%ymm2
15.73 │ vmovdqu -0x60(%rcx),%ymm3
3.57 │ sub $0x80,%rcx
4.11 │ sub $0x80,%rdx
4.65 │ vmovdqa %ymm0,(%r9)
5.27 │ vmovdqa %ymm1,-0x20(%r9)
7.68 │ vmovdqa %ymm2,-0x40(%r9)
9.82 │ vmovdqa %ymm3,-0x60(%r9)
Wild. So both boxes have AVX instructions, but gcc only uses them on the newer hardware. But the next surprise was running perf on the Go binary:
32.23% test-go test-go [.] runtime.memmove
...
13.06 │414: vmovdqu (%rsi),%ymm0
18.04 │ vmovdqu 0x20(%rsi),%ymm1
2.11 │ vmovdqu 0x40(%rsi),%ymm2
40.25 │ vmovdqu 0x60(%rsi),%ymm3
1.37 │ add %rax,%rsi
5.07 │ vmovdqa %ymm0,(%rdi)
3.15 │ vmovdqa %ymm1,0x20(%rdi)
2.78 │ vmovdqa %ymm2,0x40(%rdi)
10.16 │ vmovdqa %ymm3,0x60(%rdi)
Neat! The 33% tracks, as Go is doing more than one thing – it’s also GCing. Even neater is that both are using the YMM registers and pulling roughly the same tricks.
Recap on the processor lines
- My old Nehalem boxes don’t have AVX.
- Xeon v1 (Sandy Bridge) has AVX registers/instructions but doesn’t use them
  - Go falls back to a backward-compatible REP; MOVSB – an optimized byte-at-a-time copy.
  - GCC falls back to the SSE2-compatible __memmove_sse2_unaligned_erms.
- I don’t have any Xeon v2 (Ivy Bridge) machines
- Xeon v3 (Haswell) has AVX registers/instructions and uses them
  - Go and GCC have roughly the same assembly for this.
- Newer architectures are likely better. (To check what your own box reports, see the snippet below.)
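The golang.org/x/sys/cpu package exposes the same feature flags the runtime consults, so checking any given box is quick. A small standalone sketch (not part of the original investigation):
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// Feature flags derived from CPUID on the current machine.
	fmt.Println("SSE2:", cpu.X86.HasSSE2)
	fmt.Println("AVX: ", cpu.X86.HasAVX)
	fmt.Println("AVX2:", cpu.X86.HasAVX2)
	fmt.Println("ERMS:", cpu.X86.HasERMS) // enhanced REP MOVSB/STOSB
}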
Like a Bridge over troubled water…
The next step was to look into Go’s runtime.memmove. The first thing I noticed: a test of a variable:
TESTB $1, runtime·useAVXmemmove(SB)
JNZ avxUnaligned
That makes sense – but why is that variable false on the Sandy Bridge, which has AVX? Easy to grep for and reproduced here:
var useAVXmemmove bool

func init() {
	// Let's remove stepping and reserved fields
	processor := processorVersionInfo & 0x0FFF3FF0

	isIntelBridgeFamily := isIntel &&
		processor == 0x206A0 ||
		processor == 0x206D0 ||
		processor == 0x306A0 ||
		processor == 0x306E0

	useAVXmemmove = cpu.X86.HasAVX && !isIntelBridgeFamily
}
Yeah, this one hurt. Sandy Bridge and Ivy Bridge (Xeon E5 v1 and v2) are explicitly using the old-style instructions, even though they technically have the ability to use AVX. It’s just slower.
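Those hex constants are CPUID signatures with the stepping masked off. A quick decoder – my own helper, not runtime code – maps them back to the parts in question:
package main

import "fmt"

// familyModel decodes an Intel CPUID signature (EAX from leaf 1, stepping
// masked off) into the display family/model that Intel's documentation uses.
func familyModel(sig uint32) (family, model uint32) {
	family = (sig >> 8) & 0xf
	model = (sig >> 4) & 0xf
	if family == 0xf {
		family += (sig >> 20) & 0xff // extended family
	}
	if family == 0x6 || family == 0xf {
		model += ((sig >> 16) & 0xf) << 4 // extended model
	}
	return
}

func main() {
	for _, sig := range []uint32{0x206A0, 0x206D0, 0x306A0, 0x306E0} {
		f, m := familyModel(sig)
		fmt.Printf("0x%05X -> family 0x%X, model 0x%X\n", sig, f, m)
	}
	// Models 0x2A/0x2D are Sandy Bridge (client/EP), 0x3A/0x3E are Ivy Bridge.
}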
This is actually the right move, even if it seems odd. There’s a ton of commentary on the evolution of memmove/memcpy through the Intel processor line. The best of these, however, is this StackOverflow answer which explains how REP MOVSB (the fallback) on the Bridge architectures is likely faster than the equivalent AVX routines.
This is why gcc avoids AVX on the Bridge architectures too – except that glibc, being an older project, has an SSE2-optimized version of memmove to fall back to instead.
Wrapping it all up
So there you have it. Go has no major motivation to support an equivalent of glibc’s __memmove_sse2_unaligned_erms, because the project really got going around the time Haswell became commonplace.
I’ve actually gone to the trouble of messing around with such an SSE2 port on my GitHub, but I keep running into the fact that, try as I might, it’s just a little bit slower (~5%) in most cases that aren’t the big-copy cases – which see huge wins:
MemmoveUnalignedSrc/10 2.09GB/s ± 0% 1.98GB/s ± 0% -5.26% (p=0.000 n=9+8)
MemmoveUnalignedSrc/11 2.30GB/s ± 0% 2.18GB/s ± 0% -5.18% (p=0.000 n=10+7)
MemmoveUnalignedSrc/12 2.51GB/s ± 0% 2.37GB/s ± 0% -5.33% (p=0.000 n=10+8)
MemmoveUnalignedSrc/13 2.71GB/s ± 0% 2.57GB/s ± 0% -5.18% (p=0.000 n=10+6)
MemmoveUnalignedSrc/14 2.92GB/s ± 0% 2.77GB/s ± 0% -5.45% (p=0.000 n=8+8)
MemmoveUnalignedSrc/15 3.13GB/s ± 0% 2.96GB/s ± 0% -5.35% (p=0.000 n=10+6)
MemmoveUnalignedSrc/16 3.33GB/s ± 0% 3.16GB/s ± 1% -5.21% (p=0.000 n=10+8)
...
MemmoveUnalignedDstBackwards/512 8.59GB/s ± 0% 29.32GB/s ± 1% +241.29% (p=0.000 n=10+8)
MemmoveUnalignedDstBackwards/1024 12.8GB/s ± 0% 33.1GB/s ± 0% +158.15% (p=0.000 n=10+8)
MemmoveUnalignedDstBackwards/2048 16.0GB/s ± 0% 34.0GB/s ± 1% +112.35% (p=0.000 n=9+8)
MemmoveUnalignedDstBackwards/4096 19.6GB/s ± 0% 38.4GB/s ± 1% +96.26% (p=0.000 n=10+8)
Digging into that could be a whole other blog post, but effectively, the thing slowing it down is fetching instructions into the CPU: the new assembly is longer (or harder to predict) than the original, which makes the sections sensitive to per-instruction overhead a little slower.
So there you have it – many twisty passages, all different, into low-level performance of Intel processors. Digging into these things is fun; opening the hood on the CPU can be a lesson in itself.