Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c - why does GCC __builtin_prefetch not improve performance?

I'm writing a program to analyze a social network graph, which means the program performs a lot of random memory accesses. It seems to me that prefetching should help. Here is a small piece of the code, which reads values from the neighbors of a vertex.

for (size_t i = 0; i < v.get_num_edges(); i++) {
    unsigned int id = v.neighbors[i];
    res += neigh_vals[id];
}

I transformed the code above into the version below, which prefetches the values of a vertex's neighbors.

int *neigh_vals = new int[num_vertices];

for (size_t i = 0; i < v.get_num_edges(); i += 128) {
    size_t this_end = std::min(v.get_num_edges(), i + 128);
    for (size_t j = i; j < this_end; j++) {
        unsigned int id = v.neighbors[j];
        __builtin_prefetch(&neigh_vals[id], 0, 2);
    }
    for (size_t j = i; j < this_end; j++) {
        unsigned int id = v.neighbors[j];
        res += neigh_vals[id];
    }
}

In this C++ code, I didn't override any operators.

Unfortunately, the code doesn't really improve performance, and I wonder why. Presumably hardware prefetching doesn't help in this case, because the hardware can't predict the memory locations.

I wonder if it's caused by GCC optimization. I compile the code with -O3 enabled, and I really hope prefetching can improve performance further even then. Does -O3 fuse the two loops in this case? Can -O3 enable prefetching by default in this case?

I use gcc version 4.6.3 and the program runs on Intel Xeon E5-4620.

Thanks, Da


1 Answer


Yes, some recent versions of GCC (e.g. 4.9, as of March 2015) are able to emit PREFETCH instructions when optimizing with -O3, even without any explicit __builtin_prefetch.

We don't know what get_neighbor does, or what the types of v and neigh_vals are.

And prefetching is not always profitable. Adding explicit __builtin_prefetch can slow down your code. You need to measure.
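One way to measure, as a minimal sketch: the names sum_plain, sum_batched_prefetch and time_ms are hypothetical, and plain std::vector objects stand in for v.neighbors and neigh_vals, whose real types are not shown in the question. It assumes GCC or Clang, since __builtin_prefetch is a compiler builtin.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Baseline: gather-and-sum with no software prefetch.
long sum_plain(const std::vector<unsigned>& neighbors,
               const std::vector<int>& vals) {
    long res = 0;
    for (std::size_t i = 0; i < neighbors.size(); i++) {
        assert(neighbors[i] < vals.size());  // ids must index into vals
        res += vals[neighbors[i]];
    }
    return res;
}

// The question's variant: prefetch a batch of up to 128 values, then sum them.
long sum_batched_prefetch(const std::vector<unsigned>& neighbors,
                          const std::vector<int>& vals) {
    long res = 0;
    const std::size_t n = neighbors.size();
    for (std::size_t i = 0; i < n; i += 128) {
        const std::size_t end = std::min(n, i + 128);
        for (std::size_t j = i; j < end; j++)
            __builtin_prefetch(&vals[neighbors[j]], 0, 2);
        for (std::size_t j = i; j < end; j++)
            res += vals[neighbors[j]];
    }
    return res;
}

// Wall-clock time of one call to f, in milliseconds.
// The value f returns is stored in out so the caller can check it.
template <typename F>
double time_ms(F f, long& out) {
    const auto t0 = std::chrono::steady_clock::now();
    out = f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A call such as time_ms([&] { return sum_plain(nb, vals); }, r) times one run. A single run is noisy, so it is probably worth repeating each variant several times on realistically sized data and comparing medians.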

As Retired Ninja commented, prefetching in one loop and hoping the data will still be cached in a following loop (further down in your source code) is the wrong approach.

You might perhaps try instead

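const size_t n = v.get_num_edges();
for (size_t i = 0; i < n; i++) {
  fg::vertex_id_t id = v.get_neighbor(i);
  // Prefetch the value needed 4 iterations ahead; never ask for a neighbor past the last edge.
  if (i + 4 < n)
    __builtin_prefetch(&neigh_vals[v.get_neighbor(i + 4)]);
  res += neigh_vals[id];
}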

You could empirically replace the 4 with whatever constant works best.
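To try several lookahead distances without editing the loop each time, the distance can be made a template parameter. This is only a sketch: sum_lookahead is a hypothetical name, and std::vector stands in for the real neighbor list and value array. It assumes GCC or Clang for __builtin_prefetch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum vals[neighbors[i]] while prefetching the value needed Dist iterations ahead.
// Dist is a compile-time constant, so each candidate distance gets its own loop.
template <std::size_t Dist>
long sum_lookahead(const std::vector<unsigned>& neighbors,
                   const std::vector<int>& vals) {
    long res = 0;
    const std::size_t n = neighbors.size();
    for (std::size_t i = 0; i < n; i++) {
        if (i + Dist < n)  // never index past the end of the neighbor list
            __builtin_prefetch(&vals[neighbors[i + Dist]], 0, 2);
        assert(neighbors[i] < vals.size());  // ids must index into vals
        res += vals[neighbors[i]];
    }
    return res;
}
```

Instantiating sum_lookahead<4>, sum_lookahead<8>, sum_lookahead<16> and so on and timing each one on real data is one way to find the constant. The best value likely depends on the machine and on how much work each iteration does.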

But I guess that the __builtin_prefetch above is useless (since the compiler is probably able to add it by itself), and it could do harm. It could even crash the program if computing its argument invokes undefined behavior, e.g. if v.get_neighbor(i+4) is called with an index past the last edge. Prefetching an address outside of your address space, on the other hand, won't fault, but it could slow down your program. Please benchmark.

See this answer to a related question.

Notice that in C++ both [] and get_neighbor could be overloaded and become very complex operations, so we cannot guess!

And there are cases where the hardware is what limits performance, whatever __builtin_prefetch you add (and adding them could hurt performance).

BTW, you might pass -O3 -mtune=native -fdump-tree-ssa -S -fverbose-asm to better understand what the compiler is doing (then look inside the generated dump files and assembler files). Also, it does happen that -O3 produces slightly slower code than -O2.

You could consider explicit multithreading, OpenMP, OpenCL if you have time to waste on optimization. Remember that premature optimization is evil. Did you benchmark, did you profile your entire application?
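As a rough illustration of the OpenMP option, the sketch below parallelizes across vertices rather than inside one vertex's neighbor loop. The function name sum_all_vertices and the vector-of-vectors adjacency layout are assumptions made for the example, not the asker's data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For every vertex, sum the values of its neighbors.
// adj[v] holds the neighbor ids of vertex v (hypothetical layout).
std::vector<long> sum_all_vertices(const std::vector<std::vector<unsigned>>& adj,
                                   const std::vector<int>& vals) {
    std::vector<long> out(adj.size(), 0);
    const long nv = static_cast<long>(adj.size());
    // Each iteration writes only its own out[v], so no reduction is needed.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(dynamic, 64)
    for (long v = 0; v < nv; v++) {
        long res = 0;
        for (unsigned id : adj[static_cast<std::size_t>(v)]) {
            assert(id < vals.size());  // ids must index into vals
            res += vals[id];
        }
        out[static_cast<std::size_t>(v)] = res;
    }
    return out;
}
```

Dynamic scheduling is used here on the assumption that vertex degrees in a social graph are very uneven. Whether this helps at all depends on whether the program is already limited by memory bandwidth, so it needs benchmarking too.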

