I want to measure memory bandwidth using memcpy. I modified the code from this answer: why vectorizing the loop does not have performance improvement, which used memset to measure the bandwidth. The problem is that memcpy is only slightly slower than memset, when I expect it to be about two times slower since it operates on twice the memory.
More specifically, I run over 1 GB arrays a and b (allocated with calloc) 100 times with the following operations.
operation             time (s)
------------------------------
memset(a,0xff,LEN)    3.7
memcpy(a,b,LEN)       3.9
a[j] += b[j]          9.4
memcpy(a,b,LEN)       3.8
Notice that memcpy is only slightly slower than memset. The operation a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memset because it operates on three times as much data. However, it's only about 2.5 times as slow as memset.
Then I initialized b to zero with memset(b,0,LEN) and tested again:
operation             time (s)
------------------------------
memcpy(a,b,LEN)       8.2
a[j] += b[j]          11.5
Now we see that memcpy is about twice as slow as memset and a[j] += b[j] is about thrice as slow as memset, like I expect.
At the very least, I would have expected memcpy to be slower before memset(b,0,LEN) because of the lazy allocation (first touch) on the first of the 100 iterations.
Why do I only get the time I expect after memset(b,0,LEN)?
test.c
#include <time.h>
#include <string.h>
#include <stdio.h>

void tests(char *a, char *b, const int LEN) {
    clock_t time0, time1;

    // memset: writes a
    time0 = clock();
    for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    // memcpy: reads b, writes a
    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    // add: reads a and b, writes a
    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    // memcpy again, b still untouched since calloc
    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    // explicitly write zeros to b, then repeat memcpy and add
    memset(b,0,LEN);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}
main.c
#include <stdlib.h>

// Must match the definition in test.c, which returns void.
void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30; // 1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    tests(a, b, LEN);
}
Compile with (gcc 6.2) gcc -O3 test.c main.c. Clang 3.8 gives essentially the same result.
Test system: [email protected] (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN), i.e. I only see a problem on my Skylake system.
I first discovered this issue from the a[j] += b[k] operations in this answer, which was overestimating the bandwidth.
I came up with a simpler test:
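#include <time.h>
#include <string.h>
#include <stdio.h>

void __attribute__ ((noinline)) foo(char *a, char *b, const int LEN) {
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
}

void tests(char *a, char *b, const int LEN) {
    clock_t time0, time1;
    time0 = clock();
    foo(a, b, LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    memset(b,0,LEN);
    time0 = clock();
    foo(a, b, LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}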
This outputs:
9.472976
12.728426
However, if I do memset(b,1,LEN) in main after calloc (see below), then it outputs:
12.5
12.5
This leads me to think this is an OS allocation issue and not a compiler issue.
#include <stdlib.h>
#include <string.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30; // 1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    //GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
    memset(b,1,LEN);
    tests(a, b, LEN);
}