I've seen many questions scattered across the Internet about branch divergence and how to avoid it. However, even after reading dozens of articles on how CUDA works, I can't seem to see how avoiding branch divergence helps in most cases. Before anyone jumps on me with claws outstretched, allow me to describe what I consider to be "most cases".
It seems to me that most instances of branch divergence involve a number of truly distinct blocks of code. For example, suppose we have the following scenario:
if (A) {
    foo(A);
} else {
    bar(B);
}
If two threads in the same warp encounter this divergence, thread 1 will execute first, taking path A; after that, thread 2 will take path B. To remove the divergence, we might change the block above to read like this:
foo(A);
bar(B);
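To make the transformation concrete, here is a minimal CUDA sketch of both versions. foo, bar, and the even/odd condition are hypothetical stand-ins I made up for illustration, not code from any real source:

__device__ float foo(float a) { return a * 2.0f; }  // hypothetical work on path A
__device__ float bar(float b) { return b + 1.0f; }  // hypothetical work on path B

// Divergent version: even and odd threads in the same warp take different paths.
__global__ void divergent(const float* A, const float* B, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = foo(A[i]);
    else
        out[i] = bar(B[i]);
}

// Restructured version: every thread executes both calls, then selects a result.
// This is only safe if foo and bar are side-effect free for all inputs.
__global__ void uniform(const float* A, const float* B, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = foo(A[i]);
    float b = bar(B[i]);
    out[i] = (i % 2 == 0) ? a : b;
}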
Assuming it is safe to call foo(A) on thread 2 and bar(B) on thread 1, one might expect performance to improve. However, here's the way I see it:
In the first case, threads 1 and 2 execute serially. Call this two clock cycles.

In the second case, threads 1 and 2 execute foo(A) in parallel, then execute bar(B) in parallel. This still looks like two clock cycles to me. The difference is that in the first case, if foo(A) involves a read from memory, I imagine thread 2 could begin executing during that latency, which results in latency hiding. If that is the case, the branch-divergent code is faster.
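For what it's worth, here is a rough harness one could use to time the two hypothetical kernels sketched above and see which effect wins on real hardware. The problem size and launch configuration are arbitrary assumptions; cudaEvent timing is the standard way to measure kernel duration:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    const dim3 block(256), grid((n + block.x - 1) / block.x);

    // Device buffers; contents are left uninitialized, which is fine
    // for a rough timing comparison of arithmetic-only kernels.
    float *A, *B, *out;
    cudaMalloc(&A, n * sizeof(float));
    cudaMalloc(&B, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Warm-up launch so one-time startup costs don't skew the first timing.
    divergent<<<grid, block>>>(A, B, out, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    divergent<<<grid, block>>>(A, B, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msDivergent = 0.0f;
    cudaEventElapsedTime(&msDivergent, start, stop);

    cudaEventRecord(start);
    uniform<<<grid, block>>>(A, B, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msUniform = 0.0f;
    cudaEventElapsedTime(&msUniform, start, stop);

    printf("divergent: %.3f ms, uniform: %.3f ms\n", msDivergent, msUniform);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A);
    cudaFree(B);
    cudaFree(out);
    return 0;
}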