I was able to reproduce your observation of the speedup, it was even more noticeable for me (1.75x faster). The problem seems to be when you pass x by value it enables the compiler to perform optimizations that it otherwise doesn't do, and these optimizations backfire, they apparently introduce an unintended stall in the processor. The compiler generates a couple of conditional moves, back to back, rather than compare and branch. The compare and branch runs much faster than the back-to-back conditional moves.
I was able to avoid this by simplifying the code that the compiler had trouble with, namely this
if (tmp >= imax) {
carry = tmp >> numbits;
tmp &= imax - 1;
} else {
carry = 0;
}
can be reduced to
carry = tmp >> numbits;
tmp &= imax - 1;
Here's the version of gcc I'm using
g++ --version
g++ (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2)
These are the commands I used, perf record
will profile your code and perf report
will annotate the source and disassembly with the results of the profile
g++ -std=gnu++0x -O3 -g single_mult.cpp -o single_mult
perf record ./single_mult
perf report
Once in perf report press enter
on main
and select Annotate main
, you'll see a disassembly of your program along with source code and the percent of the time the profiler found your program running at each instruction in the function... actually these numbers must be taken as a hint only, often you will see instructions with big counts when in fact it was the previous instruction that stalled, or missed in the cache, or there was a mis-predicted branch, etc. So when you see a big count look back to see what may have caused it. The reason is the profile is statistical, it interrupts your program at a constant rate and looks to see where the instruction pointer is, and often the interrupt happens while the processor is stalled due to a cache miss or a mis-predicted branch or some internal data dependency.
I increased the number of iterations to allow more time for the profiler
int N = 20000;
When x is passed by value it Took 11.58seconds total
and this is what I see, notice the cmovbe
instructions
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
: tmp = x*(*rhs_it) + data[i] + carry;
11.10 : 400b40: 4c 89 c9 mov %r9,%rcx
0.00 : 400b43: 48 89 fa mov %rdi,%rdx
0.01 : 400b46: 48 03 14 c6 add (%rsi,%rax,8),%rdx
11.65 : 400b4a: 48 0f af 0c c3 imul (%rbx,%rax,8),%rcx
0.99 : 400b4f: 48 01 ca add %rcx,%rdx
: if (tmp >= imax) {
: carry = tmp >> numbits;
2.25 : 400b52: 48 89 d7 mov %rdx,%rdi
: tmp &= imax - 1;
10.99 : 400b55: 48 89 d1 mov %rdx,%rcx
: const ull x) {
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
: tmp = x*(*rhs_it) + data[i] + carry;
: if (tmp >= imax) {
: carry = tmp >> numbits;
0.69 : 400b58: 48 c1 ef 20 shr $0x20,%rdi
: tmp &= imax - 1;
9.54 : 400b5c: 83 e1 ff and $0xffffffff,%ecx
9.05 : 400b5f: 4c 39 c2 cmp %r8,%rdx
10.78 : 400b62: 49 0f 46 fb cmovbe %r11,%rdi
: } else {
: carry = 0;
: }
: data[i++] = tmp;
20.73 : 400b66: 48 83 c0 01 add $0x1,%rax
0.02 : 400b6a: 4c 39 c2 cmp %r8,%rdx
0.17 : 400b6d: 48 0f 46 ca cmovbe %rdx,%rcx
: static void inline single_mult(const std::vector<ull>::iterator& data,
: const std::vector<ull>::const_iterator& rbegin,
: const std::vector<ull>::const_iterator& rend,
: const ull x) {
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
11.47 : 400b71: 4c 39 d0 cmp %r10,%rax
: carry = tmp >> numbits;
: tmp &= imax - 1;
: } else {
: carry = 0;
: }
: data[i++] = tmp;
0.01 : 400b74: 48 89 4c c6 f8 mov %rcx,-0x8(%rsi,%rax,8)
: static void inline single_mult(const std::vector<ull>::iterator& data,
: const std::vector<ull>::const_iterator& rbegin,
: const std::vector<ull>::const_iterator& rend,
: const ull x) {