Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
714 views
in Technique[技术] by (71.8m points)

c++11 - CUDA: how to use barrier.sync

I have read https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar which details about PTX synchronization function.

  1. It says there are 16 "barrier logical resource", and you can specify which barrier to use with the parameter "a". What is a barrier logical resource?

  2. I have a piece of code from an outside source, which I know works. However, I cannot understand the syntax used inside "asm" and what "memory" does. I assume "name" replaces "%0" and "numThreads" replace "%1", but what is "memory" and what are the colons doing?

    __device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
    asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");}
    
  3. In a block of 256 threads, I only want threads 64 ~ 127 to synchronize. Is this possible with barrier.sync function? ( for an example, say I have a grid of 1 block, block of 256 threads. we split the block into 3 conditional branches s.t. threads 0 ~ 63 go into kernel1, threads 64 ~ 127 go into kernel 2, and threads 128 ~ 255 go into kernel 3. I want threads in kernel 2 to only synchronize among themselves. So if I use the "namedBarrierSync" function defied above: "namedBarrierSync( 1, 64)". Then does it synchronize only threads 64 ~ 127, or threads 0 ~ 63?

  4. I have tested with below code ( assume that gpuAssert is an error checking function defined somewhere in the file ).

Here is the code:

__global__ void test(int num_threads) 
{
    if (threadIdx.x >= 64 && threadIdx.x < 128) 
    {
        namedBarrierSync(0, num_threads) ;
    }
    __syncthreads();
}

int main(void) 
{
    test<<<1, 1, 256>>>(128);
    gpuAssert(cudaDeviceSynchronize(), __FILE__, __LINE_);
    printf("complete
");
    return 1;
}
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)
  1. "barrier logical resource" are the hardware necessary to synchronize threads/warps in a thread block (probably atomic counters etc.). You don't need to know the actual hardware implementation to program them, it is sufficient to know there are 16 instances of them available.
  2. As Robert Crovella has pointed out in your cross-post on the Nvidia forum, the documentation for inline PTX is at https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html.
  3. barrier.sync with a named barrier and thread count of 64 synchronizes the first two warps arriving at the named barrier (for compute capability up to 6.x) or the first 64 threads arriving at the named barrier (for compute capability 7.0 onwards).
  4. Your test only launches a single thread (with 256 bytes of shared memory allocated to it), which makes tests of synchronisation instructions moot. You want to launch the test kernel as test<<<1, 256>>>(128); instead.

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...