c - Memcpy takes the same time as memset

Question

Welcome To Ask or Share your Answers For Others

c - Memcpy takes the same time as memset

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c - Memcpy takes the same time as memset

I want to measure memory bandwidth using memcpy. I modified the code from this answer:why vectorizing the loop does not have performance improvement which used memset to measure the bandwidth. The problem is that memcpy is only slighly slower than memset when I expect it to be about two times slower since it operations on twice the memory.

More specifically, I run over 1 GB arrays a and b (allocated will calloc) 100 times with the following operations.

operation             time(s)
-----------------------------
memset(a,0xff,LEN)    3.7
memcpy(a,b,LEN)       3.9
a[j] += b[j]          9.4
memcpy(a,b,LEN)       3.8

Notice that memcpy is only slightly slower then memset. The operations a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memcpy because it operates on three times as much data. However it's only about 2.5 as slow as memset.

Then I initialized b to zero with memset(b,0,LEN) and test again:

operation             time(s)
-----------------------------
memcpy(a,b,LEN)       8.2
a[j] += b[j]          11.5

Now we see that memcpy is about twice as slow as memset and a[j] += b[j] is about thrice as slow as memset like I expect.

At the very least I would have expected that before memset(b,0,LEN) that memcpy would be slower because the of lazy allocation (first touch) on the first of the 100 iterations.

Why do I only get the time I expect after memset(b,0,LEN)?

test.c

#include <time.h>
#include <string.h>
#include <stdio.h>

void tests(char *a, char *b, const int LEN){
    clock_t time0, time1;
    time0 = clock();
    for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
    time1 = clock();
    printf("%f
", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f
", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f
", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f
", (double)(time1 - time0) / CLOCKS_PER_SEC);

    memset(b,0,LEN);
    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f
", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f
", (double)(time1 - time0) / CLOCKS_PER_SEC);
}

main.c

#include <stdlib.h>

int tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    tests(a, b, LEN);
}

Compile with (gcc 6.2) gcc -O3 test.c main.c. Clang 3.8 gives essentially the same result.

Test system: [email protected] (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN) i.e. I only see a problem on my Skylake system.

I first discovered this issue from the a[j] += b[k] operations in this answer which was overestimating the bandwidth.

I came up with a simpler test

#include <time.h>
#include <string.h>
#include <stdio.h>

void __attribute__ ((noinline))  foo(char *a, char *b, const int LEN) {
  for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
}

void tests(char *a, char *b, const int LEN) {
    foo(a, b, LEN);
    memset(b,0,LEN);
    foo(a, b, LEN);
}

This outputs.

9.472976
12.728426

However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs

12.5
12.5

This leads me to to think this is a OS allocation issue and not a compiler issue.

#include <stdlib.h>

int tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    //GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
    memset(b,1,LEN);
    tests(a, b, LEN);
}

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:56:08+0000

The point is that malloc and calloc on most platforms don't allocate memory; they allocate address space.

malloc etc work by:

if the request can be fulfilled by the freelist, carve a chunk out of it
- in case of calloc: the equivalent ofmemset(ptr, 0, size) is issued
if not: ask the OS to extend the address space.

For systems with demand paging (COW) (an MMU could help here), the second options winds downto:

create enough page table entries for the request, and fill them with a (COW) reference to /dev/zero
add these PTEs to the address space of the process

This will consume no physical memory, except only for the Page Tables.

Once the new memory is referenced for read, the read will come from /dev/zero. The /dev/zero device is a very special device, in this case mapped to every page of the new memory.
but, if the new page is written, the COW logic kicks in (via a page fault):
- physical memory is allocated
- the /dev/zero page is copied to the new page
- the new page is detached from the mother page
- and the calling process can finally do the update which started all this

Categories

c - Memcpy takes the same time as memset

c - Memcpy takes the same time as memset

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags