gpgpu - Non deterministic behaviour of OpenCL kernel

Consider the following OpenCL kernel. It is an MWE that came out of a long simplification process. Of course such a kernel does not make sense on its own, but that is not the point. It is guaranteed that MAX_ORDER >= num_fields, and in particular MAX_ORDER=15 (this is passed to clBuildProgram as a compile-time constant via -D).

__kernel void MWE_kernel(__global const double* const x,
                         __global const double* alpha, const int num_fields,
                         __global double* const restrict result,
                         __local double* const restrict result_local)
{
    const int id          = get_global_id(0);
    const int global_size = get_global_size(0);

    double sum[MAX_ORDER];
    for (unsigned int i = 0; i < num_fields; i++)
        sum[i] = 0.0;
    
    double sum_sc=0.0;

    for (unsigned int id_mem = id; id_mem < 128 * num_fields; id_mem += global_size) {
        const int alpha_idx = id_mem / 128;
        //sum[0] += 3.0;
        sum[alpha_idx] += 3.0;
        sum_sc += 3.0;
        /*
        if (id == 8){
            printf("sum[%d]=%f   sum_sc=%f
", alpha_idx, sum[alpha_idx], sum_sc);
        }
        */
    }
    
    if(id == 8){
        result[0] = sum[0];
        result[1] = sum_sc;
    }
}

which I enqueue on a GPU device (Device Name: Intel(R) Gen9 HD Graphics NEO) with local_size=64, global_size=64 and num_fields=1.
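
For completeness, this is roughly what the host-side setup looks like; the buffer and variable names (x_buf, alpha_buf, result_buf, program, queue, device) are illustrative and the size of the __local buffer is my assumption, not taken from the original code:

// Hypothetical host-side fragment, not the original code
const char* build_options = "-D MAX_ORDER=15";
clBuildProgram(program, 1, &device, build_options, NULL, NULL);

cl_int err;
cl_kernel kernel = clCreateKernel(program, "MWE_kernel", &err);

cl_int num_fields = 1;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &x_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &alpha_buf);
clSetKernelArg(kernel, 2, sizeof(cl_int), &num_fields);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &result_buf);
clSetKernelArg(kernel, 4, 64 * sizeof(cl_double), NULL); // __local buffer, size guessed

size_t global_size = 64, local_size = 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);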

The weird behaviour I am experiencing, and cannot explain, is that filling the private array sum using alpha_idx as index does not work for some threads, in the sense that the sum entry is not increased. The first thread showing this behaviour is number 8, for which the final if-clause sets result[0] = 0 and result[1] = 6, as I can check on the host.

Facts I noticed while debugging:

  • Setting MAX_ORDER=1 (possible in this example, since num_fields=1) makes the weird behaviour disappear (result[0] becomes 6). However, for 1<MAX_ORDER<21 nothing changes. Surprisingly enough, MAX_ORDER>=22 changes the kernel behaviour and I get result[0]=6.
  • Varying local_size and/or global_size does not change anything, unless local_size is decreased to 8 or less (see next bullet).
  • Reducing the local_size with which I enqueue the kernel to 8 makes the weird behaviour disappear (in the above example trivially, since there is no thread number 8 any more, but in the more complex complete version I then get the expected result).
  • Online tools like this one do not spot any problem.

Questions:

  1. Do you have any idea what explains the behaviour I am experiencing? Can you reproduce it?
  2. Is the private memory usage safe, or is the sum array declaration somehow critical?
  3. If I am not doing anything wrong, is such a behaviour a (GPU) compiler bug!?


1 Answer


After some more work on this issue, I might have a kind of explanation of what is happening, although this is not really a proof, but rather a few more facts.

The starting point is the strange fact that the kernel behaves as expected either when the local_size with which it is run is reduced or when the private memory array sum is enlarged. How does the OpenCL compiler handle private memory? Reading the AMD OpenCL User Guide, I found the following sentence on pages 25-26 (emphasis is mine).

The data in private memory is first placed in registers. If more private memory is used than can be placed in registers, or dynamic indexing is used on private arrays, the overflow data is placed (spilled) into scratch memory. Scratch memory is a private subset of global memory, so performance can be dramatically degraded if spilling occurs.

It is curious that the full guide nowhere defines what is meant by dynamic indexing. However, I found an interesting article by nVIDIA people which pretty much explains the idea: dynamic indexing happens when the compiler cannot resolve array indices to compile-time constants. In that case it can no longer keep the array in registers.

In the present kernel, dynamic indexing is forced by using alpha_idx to index sum. This seems to be the breaking point: writing sum[0] instead makes the kernel work as expected, but it also changes the access from dynamic to static indexing!
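
To illustrate the difference, here is a schematic fragment (not taken from the kernel above) contrasting the two kinds of access:

double sum[MAX_ORDER];

sum[0]         += 3.0; // static indexing: the index is a compile-time constant,
                       // so the element can be kept in a register
sum[alpha_idx] += 3.0; // dynamic indexing: the index is only known at run time,
                       // so the compiler may spill the whole array to scratch memory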

To check this guess further, I tried to implement the kernel storing the private sum in local memory. This can be achieved with a local-memory buffer of which each work item accesses only its own portion, as sketched below. Interestingly enough, with this change the kernel works as expected and the weird behaviour is not seen any more.
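
The local-memory version looks roughly as follows (this is my reconstruction, not the original code, and it assumes that result_local is allocated on the host with local_size * MAX_ORDER doubles):

__kernel void MWE_kernel(__global const double* const x,
                         __global const double* alpha, const int num_fields,
                         __global double* const restrict result,
                         __local double* const restrict result_local)
{
    const int id          = get_global_id(0);
    const int lid         = get_local_id(0);
    const int global_size = get_global_size(0);

    // Each work item uses its own slice of the local buffer
    // instead of a private array
    __local double* const sum = result_local + lid * MAX_ORDER;
    for (int i = 0; i < num_fields; i++)
        sum[i] = 0.0;

    for (int id_mem = id; id_mem < 128 * num_fields; id_mem += global_size) {
        const int alpha_idx = id_mem / 128;
        sum[alpha_idx] += 3.0;
    }

    if (id == 8)
        result[0] = sum[0];
}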

Why the code is also fixed either by reducing the local_size to 8 or smaller, or by setting MAX_ORDER to a value larger than 21, remains unclear at this point. In this interesting SO answer the author writes

If going from local to private doesn't do good, you should decrease local thread group size from 256 to 64 for example. So more private registers available per thread.

which basically says that reducing the local_size allows the compiler to handle variables in private memory differently.

All in all, I have the feeling that the compiler for the GPU in use (Intel(R) Gen9 HD Graphics NEO, actually not really a GPGPU) is doing something weird when handling the dynamic indexing in the kernel. What exactly, I do not really know, but it sounds to me like a compiler bug. On the one hand, a smaller local_size hides the problem because - just a guess - with more memory available per work group, the compiler bug is not hit. On the other hand, increasing MAX_ORDER also makes the problem disappear because - again just a guess - the compiler does something different (e.g. it uses a different strategy since the kernel requires much more memory), and the bug is not hit.

To support this chain of thought even further, I tested the original code on a real GPGPU: running it on an AMD Radeon Instinct MI50, nothing weird occurs.

