I am investigating performance hotspots in an application which spends 50% of
its time in memmove(3). The application inserts millions of 4-byte integers
into sorted arrays, and uses memmove to shift the data "to the right" in
order to make space for the inserted value.
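The application's actual code isn't shown here, but the hot path looks roughly like this (a minimal sketch of the pattern; insert_at and its parameters are illustrative names, not the real code):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the insertion pattern: make room for one 4-byte value by
 * shifting the tail of a sorted array one slot to the right. The source
 * and destination regions overlap, so memmove (not memcpy) is required. */
static void insert_at(uint32_t *arr, size_t *len, size_t pos, uint32_t value)
{
    memmove(&arr[pos + 1], &arr[pos], (*len - pos) * sizeof(arr[0]));
    arr[pos] = value;
    (*len)++;
}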
My expectation was that copying memory is extremely fast, and I was surprised
that so much time is spent in memmove. But then I had the idea that memmove
is slow because it has to handle overlapping regions, and must therefore be
implemented as a tight loop instead of copying large pages of memory. I wrote
a small microbenchmark to find out whether there was a performance difference
between memcpy and memmove, expecting memcpy to win hands down.
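(The kind of "tight loop" I imagined is something like this byte-wise backward copy; purely illustrative, real implementations are far more elaborate, as shown further down:)

#include <stddef.h>

/* Illustrative only: a naive overlapping move. Copying back to front
 * makes the overlap safe when dst > src. */
static void naive_move(char *dst, const char *src, size_t n)
{
    while (n--)
        dst[n] = src[n];
}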
I ran my benchmark on two machines (Core i5, Core i7) and saw that memmove is
actually faster than memcpy, on the older Core i7 even nearly twice as fast!
Now I am looking for explanations.
Here is my benchmark. It copies 100 MB with memcpy, and then moves about
100 MB with memmove, where source and destination overlap. Various "distances"
between source and destination are tried. Each test is run 10 times and the
average time is printed.
https://gist.github.com/cruppstahl/78a57cdf937bca3d062c
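For readers who don't want to follow the link, this is a condensed sketch of the benchmark's structure (simplified: a single run per case instead of the average of 10, and an illustrative timing helper; the gist is the authoritative version):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFFERSIZE (100u * 1024u * 1024u)   /* ~100 MB */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    char *b1 = malloc(BUFFERSIZE);
    char *b2 = malloc(BUFFERSIZE);
    size_t gaps[] = {2, 4, 8, 16, 32, 64, 128};

    memset(b1, 1, BUFFERSIZE);    /* commit the source pages; b2 is
                                   * deliberately left untouched, which
                                   * turns out to matter (see UPDATE 1) */

    double t = seconds();
    memcpy(b2, b1, BUFFERSIZE);   /* non-overlapping copy */
    printf("memcpy        %g\n", seconds() - t);

    for (size_t i = 0; i < sizeof(gaps) / sizeof(gaps[0]); i++) {
        t = seconds();
        /* overlapping move: destination is 'gap' bytes past the source */
        memmove(b1 + gaps[i], b1, BUFFERSIZE - gaps[i]);
        printf("memmove (%03zu) %g\n", gaps[i], seconds() - t);
    }

    free(b1);
    free(b2);
    return 0;
}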
Here are the results on the Core i5 (Linux 3.5.0-54-generic #81~precise1-Ubuntu
SMP x86_64 GNU/Linux; gcc is 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)). The number
in brackets is the distance (gap size) between source and destination:
memcpy 0.0140074
memmove (002) 0.0106168
memmove (004) 0.01065
memmove (008) 0.0107917
memmove (016) 0.0107319
memmove (032) 0.0106724
memmove (064) 0.0106821
memmove (128) 0.0110633
Memmove is implemented as SSE-optimized assembly code, copying from back to
front. It uses prefetch instructions (prefetchnta) to pull the data into the
cache, loads 128 bytes at a time into XMM registers, and then stores them at
the destination (memcpy-ssse3-back.S, lines 1650 ff.):
L(gobble_ll_loop):
    prefetchnta -0x1c0(%rsi)    # prefetch source cache lines ahead of the
    prefetchnta -0x280(%rsi)    # backward-moving cursor
    prefetchnta -0x1c0(%rdi)    # prefetch destination cache lines
    prefetchnta -0x280(%rdi)
    sub $0x80, %rdx             # 128 bytes fewer left to copy
    movdqu -0x10(%rsi), %xmm1   # load 8 x 16 bytes = 128 bytes into xmm1..xmm8
    movdqu -0x20(%rsi), %xmm2
    movdqu -0x30(%rsi), %xmm3
    movdqu -0x40(%rsi), %xmm4
    movdqu -0x50(%rsi), %xmm5
    movdqu -0x60(%rsi), %xmm6
    movdqu -0x70(%rsi), %xmm7
    movdqu -0x80(%rsi), %xmm8
    movdqa %xmm1, -0x10(%rdi)   # store the 128 bytes (destination is
    movdqa %xmm2, -0x20(%rdi)   # 16-byte aligned here)
    movdqa %xmm3, -0x30(%rdi)
    movdqa %xmm4, -0x40(%rdi)
    movdqa %xmm5, -0x50(%rdi)
    movdqa %xmm6, -0x60(%rdi)
    movdqa %xmm7, -0x70(%rdi)
    movdqa %xmm8, -0x80(%rdi)
    lea -0x80(%rsi), %rsi       # step both pointers back by 128 bytes
    lea -0x80(%rdi), %rdi
    jae L(gobble_ll_loop)       # loop while the sub above did not borrow
Why is memmove faster than memcpy? I would expect memcpy to copy memory pages,
which should be much faster than looping. In the worst case I would expect
memcpy to be as fast as memmove.
PS: I know that I cannot replace memmove with memcpy in my code. I know that
the code sample mixes C and C++. This question is really just for academic
purposes.
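(To illustrate why the replacement is impossible: with overlapping source and destination, memcpy's behavior is undefined, while memmove is specified to work. A minimal, self-contained example:)

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[16] = "abcdefgh";

    /* Shift 8 bytes right by 2; source and destination overlap.
     * memmove is defined for this; memcpy on the same arguments
     * would be undefined behavior (C11 7.24.2.1). */
    memmove(buf + 2, buf, 8);
    printf("%s\n", buf);   /* prints "ababcdefgh" */
    return 0;
}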
UPDATE 1
I ran some variations of the test, based on the various answers:

- When memcpy is run twice, the second run is faster than the first one.
- When the destination buffer of memcpy is "touched" first
  (memset(b2, 0, BUFFERSIZE...)), the first run of memcpy is also faster
  (a sketch of this variation follows the list).
- memcpy is still a little bit slower than memmove.
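The "touch" variation, continuing the benchmark sketch from above (b1, b2, BUFFERSIZE, and seconds() as defined there):

    /* Touch every destination page once before the timed run, forcing the
     * OS to commit them, so the timed memcpy no longer pays for
     * first-touch page faults. */
    memset(b2, 0, BUFFERSIZE);

    double t = seconds();
    memcpy(b2, b1, BUFFERSIZE);
    printf("memcpy        %g\n", seconds() - t);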
Here are the results:
memcpy 0.0118526
memcpy 0.0119105
memmove (002) 0.0108151
memmove (004) 0.0107122
memmove (008) 0.0107262
memmove (016) 0.0108555
memmove (032) 0.0107171
memmove (064) 0.0106437
memmove (128) 0.0106648
My conclusion, based on a comment from @Oliver Charlesworth: the operating system has to commit physical memory as soon as the memcpy destination buffer is accessed for the very first time (if someone knows how to "prove" this then please add an answer!). In addition, as @Mats Petersson said, memmove is more cache-friendly than memcpy.
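(I have not verified this myself, but a rough way to observe the commit cost, reusing the names from the sketch above, would be to time two write passes over a freshly malloc'd buffer; the difference between them should be the page-fault overhead:)

    char *p = malloc(BUFFERSIZE);

    double t = seconds();
    memset(p, 0, BUFFERSIZE);   /* first touch: the kernel must fault in
                                 * and commit every page */
    printf("first touch   %g\n", seconds() - t);

    t = seconds();
    memset(p, 0, BUFFERSIZE);   /* pages already committed */
    printf("second pass   %g\n", seconds() - t);

    free(p);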
Thanks for all the great answers and comments!