If you don't declare a shared array as volatile, then the compiler is free to optimize locations in shared memory by caching them in registers (whose scope is specific to a single thread), for any thread, at its choosing. This is true whether you access that particular shared element from only one thread or not. Therefore, if you use shared memory as a communication vehicle between threads of a block, it's best to declare it volatile
. However, this sort of communication pattern often also requires execution barriers to enforce ordering of reads/writes, so continue reading about barriers below.
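As a minimal sketch of that communication pattern (the kernel name and values here are illustrative, not from your code): one thread writes a shared value, a barrier orders the write before the reads, and volatile keeps the compiler from holding the shared location in a register.

```cuda
// Sketch: thread 0 publishes a value; all threads in the block read it.
// volatile prevents register-caching of sdata; __syncthreads() orders
// the write before any cross-thread read.
__global__ void broadcast_kernel(int *out)
{
    volatile __shared__ int sdata[1];

    if (threadIdx.x == 0)
        sdata[0] = 42;            // thread 0 writes to shared memory

    __syncthreads();              // barrier: the write is now visible

    out[threadIdx.x] = sdata[0];  // every thread reads thread 0's value
}
```

Without both the volatile qualifier and the barrier, some threads could read a stale or uninitialized value.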
Obviously, if each thread only accesses its own elements of shared memory, and never those associated with another thread, then this does not matter, and the compiler optimization will not break anything.
In your case, where you have a section of code in which each thread is accessing its own elements of shared memory, and the only inter-thread access occurs at a well-understood location, you could use a memory fence function to force the compiler to evict any values temporarily stored in registers back out to the shared array. So you might think that __threadfence_block()
would be useful, but in your case, __syncthreads()
already has memory-fencing functionality built in. Your __syncthreads()
call is therefore sufficient both to force thread synchronization and to force any register-cached shared-memory values to be written back to shared memory.
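A sketch of that pattern, under assumed names and sizes (256 threads per block, an illustrative per-thread accumulation): each thread works only on its own shared-memory slot, then one __syncthreads() serves as both barrier and fence before the single cross-thread access.

```cuda
// Sketch of the pattern described above (names/sizes are assumptions):
// private per-thread work into sdata[t], then a single __syncthreads()
// that both synchronizes the block and guarantees all slots are
// visible before thread 0 reads its neighbors' values.
__global__ void per_thread_then_share(const float *in, float *out)
{
    __shared__ float sdata[256];          // assumes blockDim.x == 256
    int t = threadIdx.x;

    sdata[t] = in[blockIdx.x * blockDim.x + t] * 2.0f;  // own slot only

    __syncthreads();  // barrier + memory fence: all writes now visible

    if (t == 0) {                         // the only inter-thread access,
        float total = 0.0f;               // safely after the barrier
        for (int i = 0; i < blockDim.x; ++i)
            total += sdata[i];
        out[blockIdx.x] = total;
    }
}
```

Note that no volatile qualifier is needed here, because every cross-thread read is separated from the writes by the barrier.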
By the way, if that reduction at the end of your code is of performance concern, you could consider using a parallel reduction method to speed it up.
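For reference, a minimal tree-based parallel reduction sketch (assuming a power-of-two block size of 256), which replaces a serial sum in one thread with log2(blockDim.x) parallel steps:

```cuda
// Sketch of a block-level parallel reduction (assumes blockDim.x is a
// power of two, here 256). Each step halves the number of active
// threads; __syncthreads() after each step orders the partial sums.
__global__ void block_reduce(const float *in, float *out)
{
    __shared__ float sdata[256];
    int t = threadIdx.x;
    sdata[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            sdata[t] += sdata[t + s];    // fold upper half onto lower
        __syncthreads();                 // barrier + fence each step
    }

    if (t == 0)
        out[blockIdx.x] = sdata[0];      // this block's partial sum
}
```

Each block then writes one partial sum, which can be reduced further on the host or in a second kernel launch.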