I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy
.
ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel optimization manual if you don't know what ERMSB is.
The only way I know to do this directly is with inline assembly. I got the following function from https://groups.google.com/forum/#!topic/gnu.gcc.help/-Bmlm_EG_fE
static inline void *__movsb(void *d, const void *s, size_t n) {
asm volatile ("rep movsb"
: "=D" (d),
"=S" (s),
"=c" (n)
: "0" (d),
"1" (s),
"2" (n)
: "memory");
return d;
}
When I use this however, the bandwidth is much less than with memcpy
.
__movsb
gets 15 GB/s and memcpy
get 26 GB/s with my i7-6700HQ (Skylake) system, Ubuntu 16.10, DDR4@2400 MHz dual channel 32 GB, GCC 6.2.
Why is the bandwidth so much lower with REP MOVSB
? What can I do to improve it?
Here is the code I used to test this.
//gcc -O3 -march=native -fopenmp foo.c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stddef.h>
#include <omp.h>
#include <x86intrin.h>
static inline void *__movsb(void *d, const void *s, size_t n) {
asm volatile ("rep movsb"
: "=D" (d),
"=S" (s),
"=c" (n)
: "0" (d),
"1" (s),
"2" (n)
: "memory");
return d;
}
int main(void) {
int n = 1<<30;
//char *a = malloc(n), *b = malloc(n);
char *a = _mm_malloc(n,4096), *b = _mm_malloc(n,4096);
memset(a,2,n), memset(b,1,n);
__movsb(b,a,n);
printf("%d
", memcmp(b,a,n));
double dtime;
dtime = -omp_get_wtime();
for(int i=0; i<10; i++) __movsb(b,a,n);
dtime += omp_get_wtime();
printf("dtime %f, %.2f GB/s
", dtime, 2.0*10*1E-9*n/dtime);
dtime = -omp_get_wtime();
for(int i=0; i<10; i++) memcpy(b,a,n);
dtime += omp_get_wtime();
printf("dtime %f, %.2f GB/s
", dtime, 2.0*10*1E-9*n/dtime);
}
The reason I am interested in rep movsb
is based off these comments
Note that on Ivybridge and Haswell, with buffers to large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs a RFO into LLC, rep movsb does not...
rep movsb is significantly faster than movntdqa when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!)
What's missing/sub-optimal in this memcpy implementation?
Here are my results on the same system from tinymembnech.
C copy backwards : 7910.6 MB/s (1.4%)
C copy backwards (32 byte blocks) : 7696.6 MB/s (0.9%)
C copy backwards (64 byte blocks) : 7679.5 MB/s (0.7%)
C copy : 8811.0 MB/s (1.2%)
C copy prefetched (32 bytes step) : 9328.4 MB/s (0.5%)
C copy prefetched (64 bytes step) : 9355.1 MB/s (0.6%)
C 2-pass copy : 6474.3 MB/s (1.3%)
C 2-pass copy prefetched (32 bytes step) : 7072.9 MB/s (1.2%)
C 2-pass copy prefetched (64 bytes step) : 7065.2 MB/s (0.8%)
C fill : 14426.0 MB/s (1.5%)
C fill (shuffle within 16 byte blocks) : 14198.0 MB/s (1.1%)
C fill (shuffle within 32 byte blocks) : 14422.0 MB/s (1.7%)
C fill (shuffle within 64 byte blocks) : 14178.3 MB/s (1.0%)
---
standard memcpy : 12784.4 MB/s (1.9%)
standard memset : 30630.3 MB/s (1.1%)
---
MOVSB copy : 8712.0 MB/s (2.0%)
MOVSD copy : 8712.7 MB/s (1.9%)
SSE2 copy : 8952.2 MB/s (0.7%)
SSE2 nontemporal copy : 12538.2 MB/s (0.8%)
SSE2 copy prefetched (32 bytes step) : 9553.6 MB/s (0.8%)
SSE2 copy prefetched (64 bytes step) : 9458.5 MB/s (0.5%)
SSE2 nontemporal copy prefetched (32 bytes step) : 13103.2 MB/s (0.7%)
SSE2 nontemporal copy prefetched (64 bytes step) : 13179.1 MB/s (0.9%)
SSE2 2-pass copy : 7250.6 MB/s (0.7%)
SSE2 2-pass copy prefetched (32 bytes step) : 7437.8 MB/s (0.6%)
SSE2 2-pass copy prefetched (64 bytes step) : 7498.2 MB/s (0.9%)
SSE2 2-pass nontemporal copy : 3776.6 MB/s (1.4%)
SSE2 fill : 14701.3 MB/s (1.6%)
SSE2 nontemporal fill : 34188.3 MB/s (0.8%)
Note that on my system SSE2 copy prefetched
is also faster than MOVSB copy
.
In my original tests I did not disable turbo. I disabled turbo and tested again and it does not appear to make much of a difference. However, changing the power management does make a big difference.
When I do
sudo cpufreq-set -r -g performance
I sometimes see over 20 GB/s with rep movsb
.
with
sudo cpufreq-set -r -g powersave
the best I see is about 17 GB/s. But memcpy
does not seem to be sensitive to the power management.
I checked the frequency (using turbostat
) with and without SpeedStep enabled, with performance
and with powersave
for idle, a 1 core load and a 4 core load. I ran Intel's MKL dense matrix multiplication to create a load and set the number of threads using OMP_SET_NUM_THREADS
. Here is a table of the results (numbers in GHz).
SpeedStep idle 1 core 4 core
powersave OFF 0.8 2.6 2.6
performance OFF 2.6 2.6 2.6
powersave ON 0.8 3.5 3.1
performance ON 3.5 3.5 3.1
This shows that with powersave
even with SpeedStep disabled the CPU
still clocks down to the idle frequency of 0.8 GHz
. It's only with performance
without SpeedStep that the CPU runs at a constant frequency.
I used e.g sudo cpufreq-set -r performance
(because cpufreq-set
was giving strange results) to change the power settings. This turns turbo back on so I had to disable turbo after.
Question&Answers:
os