cuda - CUDA: dynamically reallocate more global memory in a kernel

I have a question about the following task:

"Given a two-dimensional array "a[N][M]" so N lines of length M. Each element of the array contains an random integer value between 0 and 16. Write a kernel "compact(int *a, int *listM, int *listN)" that consists of only one block of N threads, and each thread counts for one line of the array how many elements have a value of 16. (“给出一个二维数组“ a [N] [M]”,因此N行的长度为M。该数组的每个元素都包含一个介于0到16之间的随机整数值。 listM,int * listN)“,它仅由N个线程的一个块组成,并且每个线程在数组的一行中计算有多少个元素的值为16。)

The threads write these counts into an array "num" of length N in shared memory, and then (after a barrier) one of the threads executes the prefix-sum routine "PrefixSum(int *num, int N)" listed below (its behaviour is explained in the code comments). Finally (after another barrier), each thread "Idx" writes the N- and M-positions (the "x- and y-coordinates") of the elements of its row that have the value 16 into two arrays "listM" and "listN" in global memory, starting at position "num[Idx]" in these arrays. The prefix sum mentioned above exists to make this last step easier."

I've written a kernel and a suitable main to test it. However, I still have a problem that I cannot solve.

The two arrays "listM" and "listN" should store the individual positions of every 16 that occurs in the array "a[N][M]". Their size must therefore equal the total number of occurrences of 16, which can vary.

Since the exact number of elements with the value 16 is not known in advance, the amount of memory needed for the two arrays "listM" and "listN" is only known at kernel runtime. Of course you could simply allocate enough memory for the maximum possible number at program start, namely N times M entries, but that would be very inefficient. Is it possible to write the kernel so that every single thread dynamically enlarges the two arrays "listM" and "listN" after counting the number of elements with the value 16 in its row (by exactly this number)?

Here is my kernel:

__global__ void compact(int* a, int* listM, int* listN)
{
    int Idx = threadIdx.x;
    int elements, i;

    i = elements = 0;

    __shared__ int num[N];

    // "a" is passed as a flat N x M array in row-major order, so element (Idx, i) is a[Idx * M + i].
    for (i = 0; i < M; i++)
    {
        if (a[Idx * M + i] == 16)
        {
            elements++;
        }
    }
    num[Idx] = elements;

    // At this point the thread knows the number of elements with the value 16 in its row and
    // would need to allocate just that much extra memory in "listM" and "listN". Is that possible?

    __syncthreads();

    if (Idx == 0)
    {
        // This function replaces each element of "num" with the sum of the counts of all
        // previous rows (an exclusive prefix sum).
        // Example: Input: num{2,4,3,1}  Output: num{0,2,6,9}
        PrefixSum(num, N);
    }

    __syncthreads();

    // The output of PrefixSum(num, N) is now used for the last task: each thread writes the
    // "coordinates" of the 16s in its row into "listM" and "listN", starting at offset num[Idx].
    elements = 0;
    for (i = 0; i < M; i++)
    {
        if (a[Idx * M + i] == 16)
        {
            listM[num[Idx] + elements] = Idx;
            listN[num[Idx] + elements] = i;
            elements++;
        }
    }
}
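
For reference, the behaviour expected from "PrefixSum(int *num, int N)" can be sketched like this (a simplified sequential stand-in consistent with the example in the comments, not the actual listing):

// Hypothetical sequential exclusive scan, executed by a single thread.
// Matches the example in the comments: {2,4,3,1} -> {0,2,6,9}.
__device__ void PrefixSum(int* num, int n)
{
    int running = 0;
    for (int k = 0; k < n; k++)
    {
        int val = num[k];
        num[k] = running;   // each entry becomes the sum of all entries before it
        running += val;
    }
}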
Asked by Rabobsel, translated from Stack Overflow.


1 Answer


Is it possible to write the kernel so that every single thread dynamically enlarges the two arrays "listM" and "listN" after counting the number of elements with the value 16 in its row (by exactly this number)?

It's not possible for CUDA device code to enlarge an existing allocation that was created with host-side cudaMalloc, cudaMallocManaged, cudaHostAlloc, or similar.

It is possible for CUDA device code to create new allocations using in-kernel new or malloc; however, data from such allocations cannot be directly transferred back to the host. Transferring it back would require a host-side allocation that the data could be copied into, which brings you back to the original issue.
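
For illustration, a minimal sketch of an in-kernel allocation (assuming a device of compute capability 2.0 or later; the kernel name and sizes here are only examples):

// Host side (optional): enlarge the device heap before launching, e.g.
//     cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

__global__ void alloc_demo(int count)
{
    // This memory lives on the device heap. It can be used by this kernel (and by
    // later kernels, if the pointer is saved in global memory), but it cannot be
    // the source or destination of a cudaMemcpy issued by the host.
    int* buf = (int*)malloc(count * sizeof(int));
    if (buf != NULL)
    {
        for (int k = 0; k < count; k++)
            buf[k] = k;
        free(buf);
    }
}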

Therefore there really is no convenient way to do this. Your choices are:

  1. (Over-)allocate the needed sizes based on the maximum possible returned size.
  2. Create an algorithm that runs a kernel once just to determine the needed size and returns that size to the host. The host then allocates that size and passes it to a second invocation of the algorithm, i.e. a kernel that does the actual desired work (see the sketch right after this list).
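
A minimal sketch of option 2, reusing the N, M and flattened "a" layout from the question; the names count_sixteens, d_a and d_total are illustrative and not part of the original code:

// Pass 1: count the 16s and accumulate a single global total.
__global__ void count_sixteens(const int* a, int* total)
{
    int Idx = threadIdx.x;
    int elements = 0;
    for (int i = 0; i < M; i++)
        if (a[Idx * M + i] == 16)
            elements++;
    atomicAdd(total, elements);
}

// Host side: read back the total, allocate exactly that much, then do the real work.
int h_total = 0;
int* d_total;
cudaMalloc(&d_total, sizeof(int));
cudaMemset(d_total, 0, sizeof(int));

count_sixteens<<<1, N>>>(d_a, d_total);
cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);

int *d_listM, *d_listN;
cudaMalloc(&d_listM, h_total * sizeof(int));
cudaMalloc(&d_listN, h_total * sizeof(int));

compact<<<1, N>>>(d_a, d_listM, d_listN);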

A "possible" third approach would be: (“可能的”第三种方法是:)

  1. Run the algorithm just once, and have the kernel allocate in-kernel to provide the needed additional space. This space is not accessible to host-side operations. This kernel would also return the size and/or arrangement of such allocations.

  2. Based on the returned size/arrangement of the device-side allocations, the host would allocate new memory of the needed size.

  3. The host would then launch a new "copy kernel" that would copy the data from the device-side allocations of step 1 to the host-side allocations provided in step 2 (a sketch of such a copy kernel follows this list).

  4. The host would then copy the data from the host-side allocations in step 2 to host memory.
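
To make step 3 concrete, a hedged sketch of such a copy kernel; d_rowPtrs, d_rowCounts, d_offsets and d_out are illustrative names, where the first three would be produced by the step-1 kernel and d_out is the cudaMalloc allocation from step 2:

// One thread per row copies its in-kernel allocation into host-reachable memory.
__global__ void copy_kernel(int** d_rowPtrs, const int* d_rowCounts,
                            const int* d_offsets, int* d_out)
{
    int row = threadIdx.x;
    int* src = d_rowPtrs[row];
    for (int k = 0; k < d_rowCounts[row]; k++)
        d_out[d_offsets[row] + k] = src[k];
    free(src);   // device-heap allocations may be freed by a later kernel
}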

That's an extreme level of complexity for such a trivial problem as you have outlined, where the obvious solution is just to overallocate the needed space and be done with it, as in the sketch below.
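
For completeness, the over-allocation route (option 1) for your kernel is just a worst-case host-side allocation; d_a again stands for the device copy of the array:

// Worst case: every one of the N*M elements could be 16.
int *d_listM, *d_listN;
cudaMalloc(&d_listM, N * M * sizeof(int));
cudaMalloc(&d_listN, N * M * sizeof(int));

compact<<<1, N>>>(d_a, d_listM, d_listN);
// Only the first "total number of 16s" entries of each list are meaningful; the kernel
// can additionally write that total to global memory if the host needs to know it.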

