Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
104 views
in Technique[技术] by (71.8m points)

c++ - Simple for() loop benchmark takes the same time with any loop bound

I'm willing to write a code that makes my CPU execute some operations and see how much time does he take to solve them. I wanted to make a loop going from i=0 to i<5000 and then multiplying i by a constant number and time that. I've ended up with this code, it has no errors but it only takes 0.024 seconds to execute the code even if I change the loop i<49058349083 or if i<2 it takes the same amount of time. Whats the error?

PD: I started yesterday learning C++ i'm sorry if this is a really easy question to answer but I couldn't find the solution

#include <iostream>
#include <ctime>

using namespace std;

int main () {
    int start_s=clock();

    int i;
    for(i=0;i<5000;i++){
        i*434243;
    }

    int stop_s=clock();
    cout << "time: "<< (stop_s-start_s)/double(CLOCKS_PER_SEC)*1000;

    return 0;
}
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

BTW, if you'd actually done i<49058349083, gcc and clang create an infinite loop on systems with 32-bit int (including x86 and x86-64). 49058349083 is greater than INT_MAX. Large literal numbers are implicitly promoted to a type large enough to hold them, so you effectively did (int64_t)i < 49058349083LL, which is true for any possible value of int i.

Signed overflow is Undefined Behaviour in C++, and so is an infinite loop with no side-effects inside the loop (e.g. a system call), so I checked on the Godbolt compiler explorer to see how it really compiles with optimization enabled. Fun fact: MSVC still optimizes away an empty loop (including one with i*10 not assigned to anything) when the condition is an always-true compare, rather than a non-zero constant like 42.


A loop like this is pretty fundamentally flawed.

You can microbenchmark a complete non-inline function using Google's benchmark package (How would you benchmark the performance of a function?), but to learn anything useful by putting something inside a repeat-loop you have to know a lot about how your compiler compiles to asm, exactly what you're trying to measure, and how to get the optimizer to make asm that's similar to what you'd get from your code in some real use context. e.g. by using inline asm to require it to have a certain result in a register, or by assigning to a volatile variable (which also has the overhead of doing a store).

If this sound a lot more complicated than you were hoping, well too bad, it is, and for good reason.

That's because compilers are incredibly complex pieces of machinery that can usually produce fairly efficient executables out of source that's written to clearly express what's going on for human readers, not to avoid redundant work or look anything like an efficient machine-language implementation (which is what your CPU actually runs).


Microbenchmarking is hard - you can't get meaningful results unless you understand how your code compiles and what you're really measuring.

Optimizing compilers are designed to create an executable that produces the same results as your C++ source, but which runs as quickly as possible. Performance is not an observable result, so it's always legal to make a program more efficient. This is the "as-if" rule: What exactly is the "as-if" rule?

You want your compiler to not waste time and code-size computing results that aren't used. After the compiler inlines a function into the caller, it often turns out that some of the things it computes aren't used. It's normal for well-written C++ code to have lots of work that can be thrown away, including optimizing away temporary variables entirely; this is not a bad thing and a compiler that didn't do this would suck.

Remember that you're writing for the C++ abstract machine, but your compiler's job is to translate that into assembly language (or machine code) for your CPU. Assembly language is quite different from C++. (And modern higher-performance CPUs can also execute instructions out of order, while following their own "as-if" rule to preserve the illusion of the compiler-generated code running in program order. But CPUs can't discard work, only overlap it.)

You can't microbenchmark the int * int binary operator in C++ in the general case, even just for your own desktop (nevermind on other hardware / different compilers). Different uses in different contexts will compile to different code. Even if you could create some version of your loop which measured something useful, it wouldn't necessarily tell you much about how expensive foo = a * b is in another program. The other program might bottleneck on multiply latency instead of throughput, or combine that multiply with some other nearby operation on a or b, or any number of things.


Don't disable optimization

You might think it would be useful to just disable optimization (like gcc -O0 instead of gcc -O3). But the only way to do that also introduces anti-optimizations like storing every value back to memory after every C++ statement, and reloading variables from memory for the next statement. This lets you modify variable values while debugging a compiled program, or even jump to a new line in the same function, and still get the results you'd expect from looking at the C++ source.

Supporting that level of interference imposes a huge burden on the compiler. Store/reload (store-forwarding) has about 5 cycle latency on a typical modern x86. This means an anti-optimized loop can only run one iteration per ~6 clock cycles at best, vs. 1 cycle for a tight loop like looptop: dec eax / jnz looptop where the loop counter is in a register.

There's not much middle ground: compilers don't have options to make asm that "looks like" the C++ source, but keeping values in registers across statements. That wouldn't be useful or meaningful anyway, because that's not how they compile with full optimization enabled. (gcc -Og might be a little bit like this.)

Spending time modifying your C++ to make it run faster with -O0 is a total waste of time: C loop optimization help for final assignment (with compiler optimization disabled).

Note: anti-optimized debug mode (-O0) is the default for most compilers. It's also "compile-fast" mode, so it's good for seeing if your code compiles / runs at all, but it's useless for benchmarking. The resulting compiler-generated asm runs the speed it does for reasons which depend on the hardware, but don't tell you much of anything about how fast optimized code will run. (e.g. the answer to Adding a redundant assignment speeds up code when compiled without optimization is some fairly subtle Intel Sandybridge-family store-forwarding latency behaviour, directly caused by the store/reloads and having the loop bottleneck on the loop counter being in memory. Note that the first version of the question was asking about why doing that made the C faster, which was rightly downvoted because benchmarking -O0 is silly. It only turned into an interesting question when I edited it into an x86 asm question which is interesting because of the larger-but-faster asm, not because it came from gcc -O0 with any particular source changes.)

And this is not even mentioning C++ standard library functions like std::sort or std::vector::push_back, which depend on the optimizer to inline lots of nested calls to tiny helper / wrapper functions.


How compilers optimize

(I'm going to show transformations of C++ code. Remember that the compiler is actually transforming an internal representation of your program's logic and then producing asm. You can think of the transformed C++ as being pseudo-code for asm, where i++ represents an x86 inc eax instruction or something. Most C/C++ statements can map to 1 or a couple machine instructions. So it's a useful way of describing the logic of what the actual compiler-generated asm might be doing without actually writing it in asm.)

A result which is never used doesn't have to be computed in the first place. A loop with no side effects can be removed completely. Or a loop that assigns to a global variable (an observable side effect) could be optimized to just doing the last assignment. e.g.

// int gsink;  is a global.  
// "sink" is the opposite of a source: something we dump results into.
for (int i=0 ; i<n ; i++) {
    gsink = i*10;
}

is equivalent to this code, as far as an optimizing compiler is concerned:

if ( 0 < n ) {      // you might not have noticed that your loop could run 0 times
    gsink = (n-1)*10; // but the compiler isn't allowed to do gsink=0 if n<1
}

If gsink were instead a local or file-scoped static with nothing that reads it, the compiler could optimize it away entirely. But the compiler can't "see" code outside the current C++ source file ("compilation unit") while compiling a function containing that, so it can't change the observable side-effect that when the function returns, gsink = n*10;.

Nothing reads the intermediate values of gsink because there are no function calls to non-inline function. (Because it's not an atomic<int>, the compiler can assume that no other threads or signal handlers read it; that would be data-race Undefined Behaviour.)


Using volatile to get the compiler to do some work.

If it were global volatile int gsink, the actual store that places the value in memory would be an observable side-effect (that's what volatile means in C++). But does that mean we can benchmark multiplication this way? No, it doesn't. The side-effect the compiler has to preserve is only the placing of the final value in memory. If it can calculate it more cheaply than with i * 10 eve


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...