As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them. See Godbolt's work.
After that, the branch goes into a branch prediction cache, and past behavior is used to inform future branch prediction.
So in a tight loop, the effect of misordering is going to be relatively small. The branch predictor is going to learn which set of branches is most likely, and if you have non-trivial amount of work in the loop the small differences won't add up much.
In general code, most compilers by default (lacking another reason) will order the produced machine code roughly the way you ordered it in your code. Thus if statements are forward branches when they fail.
So you should order your branches in the order of decreasing likelihood to get the best branch prediction from a "first encounter".
A microbenchmark that loops tightly many times over a set of conditions and does trivial work is going to dominated by tiny effects of instruction count and the like, and little in the way of relative branch prediction issues. So in this case you must profile, as rules of thumb won't be reliable.
On top of that, vectorization and many other optimizations apply to tiny tight loops.
So in general code, put most likely code within the if
block, and that will result in the fewest un-cached branch prediction misses. In tight loops, follow the general rule to start, and if you need to know more you have little choice but to profile.
Naturally this all goes out the window if some tests are far cheaper than others.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…