barakmich writes

Why is my memmove slow?

So I got caught in a bit of a rabbit hole recently.

It all started with implementing some compact data structures. Specifically, this meant inserting bytes into a byte slice, usually close to the front. Very close to:

array = append([]byte{0x00, 0x00}, array...)

This is expensive, of course, if you’re trying to keep the bytes contiguous (say, in a fixed block of memory), as it requires an O(len(array)) copy – especially painful when the insertion point is close to the front. Sure enough, the Go profiler showed most of the CPU time being spent in runtime.memmove – which makes sense, as that’s the copy that shifts everything down.
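For concreteness, here’s a minimal sketch of that shift done in place (insertAt is my name for it, not a stdlib function; it assumes cap(b) already has room). The first copy is the O(n) move in question:

package main

import "fmt"

// insertAt inserts vals into b at index i without allocating,
// assuming cap(b) has room. The first copy shifts everything after
// i down by len(vals) – an O(len(b)) move, which is exactly where
// runtime.memmove earns its keep.
func insertAt(b []byte, i int, vals ...byte) []byte {
	b = b[:len(b)+len(vals)]
	copy(b[i+len(vals):], b[i:]) // the O(n) shift (copy handles the overlap)
	copy(b[i:], vals)
	return b
}

func main() {
	b := make([]byte, 0, 16)
	b = append(b, 0xAA, 0xBB, 0xCC)
	b = insertAt(b, 0, 0x00, 0x00)
	fmt.Printf("% x\n", b) // 00 00 aa bb cc
}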

memcpy go brrrrrr

I wondered what gcc would do. So I wrote two little test programs:

#include <vector>

int main(int argc, char *argv[]) {
  for (int i = 0; i < 30; i++) {
    std::vector<int> doit;
    doit.reserve(128 * 1024);
    for (int a = 0; a < 64 * 1024; a++) {
      doit.insert(doit.begin(), 2, 0);
    }
  }

  return 0;
}

And the Go version:

package main

func main() {
	for i := 0; i < 30; i++ {
		slice := make([]byte, 0, 128*1024)
		for a := 0; a < 64*1024; a++ {
			slice = append([]byte{0x00, 0x00}, slice...)
		}
	}
}

Now, these aren’t very profound programs. But they exercise the same notion: preallocate some space, then insert two elements at a time at the front.

Turns out, they had wildly different performance:

$ g++ -o testmem-c testmem.cc
$ time ./testmem-c
./testmem-c  21.37s user 0.00s system 99% cpu 21.397 total
$ go build -o testmem-go testmem.go
$ time ./testmem-go
./testmem-go  93.18s user 18.54s system 202% cpu 55.234 total

perf to the rescue

For a second, I wondered if I wanted to rewrite everything in C++ (but refrained). As it happened, I had just been reading the Cloudflare blog post When Bloom Filters Don’t Bloom – an excellent piece on low-level performance – and decided I’d try my hand at the Linux perf tool.
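For anyone following along at home, the basic recipe – and roughly how the listings below were generated – is to sample a run and then drill in:

$ perf record ./testmem-c
$ perf report     # per-function breakdown of the samples
$ perf annotate   # per-instruction view of the hot functions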

Almost all of the time in the C++ version is spent in glibc, which of course makes sense – as I already knew, this is really leaning on memmove. The interesting part was the name of the function inside glibc: __memmove_sse2_unaligned_erms. Sure enough, the annotated assembly was using MOVUPS and MOVAPS – instructions from SSE2 – to move data through 128-bit XMM registers, also part of SSE2.

The general strategy with memmove (for large enough data) is to pull down as much memory as you reasonably can in a few instructions – usually into the biggest registers you have available – and copy it back out to the new location.
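To make that concrete, here’s an illustrative Go sketch of a backward block copy, with a [32]byte array standing in for a 256-bit YMM register – a toy, not the runtime’s actual code:

package main

import "fmt"

// moveBackward shifts n bytes of buf from srcOff to dstOff, where
// dstOff > srcOff and the ranges overlap, so it has to copy from the
// end toward the front. Each pass grabs the biggest chunk it can
// (our stand-in "register"); a byte-at-a-time loop mops up the tail.
func moveBackward(buf []byte, dstOff, srcOff, n int) {
	const regSize = 32 // pretend this is one YMM register
	i := n
	for i >= regSize {
		i -= regSize
		var reg [regSize]byte // the vmovdqu-style load...
		copy(reg[:], buf[srcOff+i:srcOff+i+regSize])
		copy(buf[dstOff+i:dstOff+i+regSize], reg[:]) // ...and the vmovdqa-style store
	}
	for i > 0 { // leftover tail, one byte at a time
		i--
		buf[dstOff+i] = buf[srcOff+i]
	}
}

func main() {
	buf := make([]byte, 70)
	for i := 0; i < 64; i++ {
		buf[i] = byte('a' + i%26)
	}
	moveBackward(buf, 2, 0, 64) // shift 64 bytes down by two
	fmt.Println(string(buf[:66]))
}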

But wait: I know my dev server downstairs is a used R620 that’s a little old, but it’s not that old. Heck, /proc/cpuinfo shows my Xeon E5-2690 has AVX instructions (and therefore even bigger, 256-bit YMM registers) – so why is glibc using SSE2? I wondered what would happen on a different machine.

We’re gonna need a bigger box

So I committed my test files to my git server and pulled them down on my fileserver, which happens to have a Xeon E3-1231 v3. Recompile and run:

./test-c  17.90s user 0.00s system 99% cpu 17.936 total
./test-go  35.15s user 5.94s system 176% cpu 23.221 total

Hot dog! Go is still slower, yes, but the two are much closer to each other. What does perf say?

  97.81%  test-c   libc-2.31.so      [.] __memmove_avx_unaligned_erms
  ...
 15.19 │258:   vmovdqu    (%rcx),%ymm0
 15.61 │       vmovdqu    -0x20(%rcx),%ymm1
 15.04 │       vmovdqu    -0x40(%rcx),%ymm2
 15.73 │       vmovdqu    -0x60(%rcx),%ymm3
  3.57 │       sub        $0x80,%rcx
  4.11 │       sub        $0x80,%rdx
  4.65 │       vmovdqa    %ymm0,(%r9)
  5.27 │       vmovdqa    %ymm1,-0x20(%r9)
  7.68 │       vmovdqa    %ymm2,-0x40(%r9)
  9.82 │       vmovdqa    %ymm3,-0x60(%r9)

Wild. So both boxes have AVX instructions, but glibc only uses them on the newer hardware. The next surprise was running perf on the Go binary:

  32.23%  test-go  test-go           [.] runtime.memmove
  ...
 13.06 │414:   vmovdqu     (%rsi),%ymm0
 18.04 │       vmovdqu     0x20(%rsi),%ymm1
  2.11 │       vmovdqu     0x40(%rsi),%ymm2
 40.25 │       vmovdqu     0x60(%rsi),%ymm3
  1.37 │       add         %rax,%rsi
  5.07 │       vmovdqa     %ymm0,(%rdi)
  3.15 │       vmovdqa     %ymm1,0x20(%rdi)
  2.78 │       vmovdqa     %ymm2,0x40(%rdi)
 10.16 │       vmovdqa     %ymm3,0x60(%rdi)

Neat! The 32% tracks, as Go is doing more than one thing – it’s also GCing (every append in this test allocates a fresh slice, so there’s plenty of garbage). But even more neat is that both are using the YMM registers and pulling roughly the same tricks.

Recap on the processor lines

  • My old Nehalem boxes don’t have AVX at all.
  • Xeon v1 (Sandy Bridge) has AVX registers/instructions but doesn’t use them:
    • Go falls back to a backward-compatible REP MOVSB – an optimized byte-at-a-time copy.
    • glibc falls back to the SSE2-compatible __memmove_sse2_unaligned_erms.
  • I don’t have any Xeon v2 (Ivy Bridge) machines to test.
  • Xeon v3 (Haswell) has AVX registers/instructions and uses them:
    • Go and glibc produce roughly the same assembly for this.
  • Newer architectures are likely better still.

Like a Bridge over troubled water…

The next step was to look into Go’s runtime.memmove. The first thing I noticed: a test of a variable:

  TESTB	$1, runtime·useAVXmemmove(SB)
  JNZ	avxUnaligned

That makes sense – but why is that variable false on the Sandy Bridge, which has AVX? Easy to grep for and reproduced here:

var useAVXmemmove bool

func init() {
	// Let's remove stepping and reserved fields
	processor := processorVersionInfo & 0x0FFF3FF0

	isIntelBridgeFamily := isIntel &&
		processor == 0x206A0 ||
		processor == 0x206D0 ||
		processor == 0x306A0 ||
		processor == 0x306E0

	useAVXmemmove = cpu.X86.HasAVX && !isIntelBridgeFamily
}

Yeah, this one hurt. Sandy Bridge and Ivy Bridge (Xeon E5 v1 and v2) are explicitly steered to the old-style instructions, even though they technically have the ability to use AVX. AVX there is just slower.

This is actually the right move, even if it seems odd. There’s a ton of commentary on the evolution of memmove/memcpy through the Intel processor line. The best of it, however, is this StackOverflow answer, which explains how REP MOVSB (the fallback) on the Bridge architectures is likely faster than the equivalent AVX routines.

This is why glibc avoids AVX on the Bridge architectures too – except glibc, being an older project, has an SSE2-optimized version of memmove to fall back to instead.
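If you want to check your own box’s feature bits from Go without spelunking through the runtime, the golang.org/x/sys/cpu package exposes the same flags; a quick illustrative look:

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// The same kind of feature bits the runtime's init consults.
	fmt.Println("SSE2:", cpu.X86.HasSSE2) // baseline on amd64: always true there
	fmt.Println("AVX: ", cpu.X86.HasAVX)
	fmt.Println("ERMS:", cpu.X86.HasERMS) // enhanced REP MOVSB/STOSB
}

(Note that x/sys/cpu gives you the feature bits, but not the Bridge-family model check – that part lives in the runtime alone.)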

Wrapping it all up

So there you have it. Go has no major motivation to support an equivalent of glibc’s __memmove_sse2_unaligned_erms, because Go really got going around the time Haswell became commonplace. I’ve actually gone to the trouble of messing around with such an SSE2 port on my GitHub, but I keep running into the fact that, try as I might, it’s just a little bit slower (~5%) in most of the cases that aren’t big copies – and the big-copy cases see huge wins:
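(Tables like the one below are benchstat output; the usual recipe, roughly – file names here are just for illustration – is to run the runtime benchmarks several times before and after the change, then compare:)

$ go test -run NONE -bench Memmove -count 10 runtime > old.txt
$ # ...rebuild with the patched memmove...
$ go test -run NONE -bench Memmove -count 10 runtime > new.txt
$ benchstat old.txt new.txt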

MemmoveUnalignedSrc/10             2.09GB/s ± 0%   1.98GB/s ± 0%    -5.26%  (p=0.000 n=9+8)
MemmoveUnalignedSrc/11             2.30GB/s ± 0%   2.18GB/s ± 0%    -5.18%  (p=0.000 n=10+7)
MemmoveUnalignedSrc/12             2.51GB/s ± 0%   2.37GB/s ± 0%    -5.33%  (p=0.000 n=10+8)
MemmoveUnalignedSrc/13             2.71GB/s ± 0%   2.57GB/s ± 0%    -5.18%  (p=0.000 n=10+6)
MemmoveUnalignedSrc/14             2.92GB/s ± 0%   2.77GB/s ± 0%    -5.45%  (p=0.000 n=8+8)
MemmoveUnalignedSrc/15             3.13GB/s ± 0%   2.96GB/s ± 0%    -5.35%  (p=0.000 n=10+6)
MemmoveUnalignedSrc/16             3.33GB/s ± 0%   3.16GB/s ± 1%    -5.21%  (p=0.000 n=10+8)
...
MemmoveUnalignedDstBackwards/512   8.59GB/s ± 0%  29.32GB/s ± 1%  +241.29%  (p=0.000 n=10+8)
MemmoveUnalignedDstBackwards/1024  12.8GB/s ± 0%   33.1GB/s ± 0%  +158.15%  (p=0.000 n=10+8)
MemmoveUnalignedDstBackwards/2048  16.0GB/s ± 0%   34.0GB/s ± 1%  +112.35%  (p=0.000 n=9+8)
MemmoveUnalignedDstBackwards/4096  19.6GB/s ± 0%   38.4GB/s ± 1%   +96.26%  (p=0.000 n=10+8)

Digging into that could be a whole other blog post, but effectively, the thing slowing it down is fetching instructions into the CPU: the new assembly is longer (and less predictable) than the original, which makes the per-instruction-twitchy sections a little slower.

So there you have it – many twisty passages, all different, leading into the low-level performance of Intel processors. Digging into these things is fun; opening the hood on the CPU can be a lesson in itself.