After some more work on this issue, I might have an explanation for what is happening. This is not really a proof, but rather a few more facts.
The starting point is the strange behaviour of the kernel, which works as expected either when reducing the `local_size` with which it is run or when enlarging the private memory array `sum`. How does the OpenCL compiler handle private memory? Reading the AMD OpenCL User Guide, I found the following sentence on pages 25-26 (emphasis mine).
> The data in private memory is first placed in registers. If more private memory is used than can be placed in registers, or *dynamic indexing* is used on private arrays, the overflow data is placed (spilled) into scratch memory. Scratch memory is a private subset of global memory, so performance can be dramatically degraded if spilling occurs.
Curiously, the full guide never defines what is meant by dynamic indexing. However, I found an interesting article by nVIDIA people that explains the idea quite well: dynamic indexing happens when the compiler cannot resolve array indices to compile-time constants. In that case, it is forced to stop keeping the array in registers.
In the present kernel, dynamic indexing is forced by using `alpha_idx` to index `sum`. This seems to be the breaking point: writing to `sum[0]` instead makes the kernel work as expected, but it also changes the indexing from dynamic to static!
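Just to illustrate the difference (this is a hypothetical sketch, not the actual kernel: only the names `sum`, `alpha_idx` and `MAX_ORDER` come from the code in question, everything else is made up):

```c
// Hypothetical sketch only -- not the original kernel.
#define MAX_ORDER 21

__kernel void demo(__global const float *in, __global float *out)
{
    float sum[MAX_ORDER];                  // private array, ideally kept in registers

    for (int i = 0; i < MAX_ORDER; ++i)    // fixed trip count: typically fully unrolled,
        sum[i] = 0.0f;                     // so every index becomes a compile-time constant

    const int gid = get_global_id(0);
    const int alpha_idx = gid % MAX_ORDER; // value known only at run time

    // Dynamic indexing: alpha_idx cannot be resolved to a constant,
    // so the compiler may spill sum into scratch (global) memory.
    sum[alpha_idx] += in[gid];

    // Static indexing: with a literal index the element can stay in a register.
    // sum[0] += in[gid];

    out[gid] = sum[alpha_idx];
}
```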
To check this guess further, I tried to implement the kernel storing the private `sum` in local memory. This can be done with a single local-memory buffer of which each work item accesses only its own portion. Interestingly enough, with this change the kernel works as expected and the weird behaviour is not seen any more.
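The change boils down to something like the following (again only a sketch; the host allocates the buffer by passing `local_size * MAX_ORDER * sizeof(float)` as the size and a `NULL` pointer to `clSetKernelArg` for the `lsum` argument):

```c
// Hypothetical local-memory variant: lsum is one buffer per work group,
// and every work item only touches its own slice of it.
#define MAX_ORDER 21

__kernel void demo_local(__global const float *in, __global float *out,
                         __local float *lsum)
{
    const int gid = get_global_id(0);
    const int lid = get_local_id(0);
    __local float *sum = lsum + lid * MAX_ORDER;   // this work item's slice

    for (int i = 0; i < MAX_ORDER; ++i)
        sum[i] = 0.0f;

    const int alpha_idx = gid % MAX_ORDER;

    // The dynamic index now addresses local memory, so no spilling of
    // registers into scratch memory can be involved.
    sum[alpha_idx] += in[gid];

    out[gid] = sum[alpha_idx];
}
```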
At this point, why the code is also fixed either by reducing the `local_size` to 8 or smaller, or by setting `MAX_ORDER` to a value larger than 21, remains unclear. In this interesting SO answer the author writes

> If going from local to private doesn't do good, you should decrease local thread group size from 256 to 64 for example. So more private registers available per thread.

which basically says that reducing the `local_size` lets the compiler handle variables in private memory differently (more registers available per work item).
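For completeness, the knob in question is just the local work size passed to `clEnqueueNDRangeKernel`. A hypothetical host-side fragment (the helper and the sizes are only illustrative):

```c
#include <CL/cl.h>

/* Hypothetical helper: 'queue' and 'kernel' are assumed to be set up elsewhere.
 * Calling it with local_size = 64 (or 8) instead of 256 means fewer work items
 * per group, hence more registers available per work item. */
cl_int launch(cl_command_queue queue, cl_kernel kernel,
              size_t global_size, size_t local_size)
{
    return clEnqueueNDRangeKernel(queue, kernel,
                                  1,             /* work_dim           */
                                  NULL,          /* global_work_offset */
                                  &global_size,  /* global_work_size   */
                                  &local_size,   /* local_work_size    */
                                  0, NULL, NULL);
}
```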
All in all, I have the feeling that the compiler on the GPU used (`Intel(R) Gen9 HD Graphics NEO`, actually not really a GPGPU) is doing something weird when handling the dynamic indexing in the kernel. What exactly, I do not know, but it really sounds to me like a compiler bug. On the one hand, a smaller `local_size` hides the problem because (just a guess) more private memory is available per work item and the compiler bug is not hit. On the other hand, increasing `MAX_ORDER` also makes the problem disappear because (again, just a guess) the compiler follows a different strategy for a kernel requiring much more memory, and the bug is not hit.
To further support this chain of thought, I tested the original code on a real GPGPU: running it on an AMD Radeon Instinct MI50, nothing weird occurs.