
c++ - How to perform multiple matrix multiplications in CUDA?

I have an array of square matrices int *M[10], so that M[i] points to the first element of the i-th matrix. I want to multiply all the matrices M[i] by another matrix N, so that I receive an array of square matrices int *P[10] as output.

I see different possibilities:

  1. Assign the computation of each element of P[i] to a different thread; for example, with 10 matrices of size 4x4, the number of involved threads would be 160; how can I use CUDA to implement this approach?
  2. In the framework of the example above, create a composite matrix of size 40x40 (i.e., collecting the 10 4x4 matrices together) and use 40x40 threads; but this approach seems to require more time. I'm trying the array-of-matrices approach, but I think I'm doing something wrong; how can I use this approach with 10 matrices, and how do I code it in the kernel function?

This is what I'm trying:

void GPU_Multi(int *M[10], int *N, int *P[10], size_t width)
{
    int *devM[10];
    int *devN;
    int *devP[10];
    size_t allocasize = sizeof(int) * width * width;

    // Allocate device storage for the 10 input and 10 output matrices
    for (int i = 0; i < 10; i++)
    {
        cudaMalloc((void**)&devM[i], allocasize);
        cudaMalloc((void**)&devP[i], allocasize);
    }
    cudaMalloc((void**)&devN, allocasize);

    // N is the same for every product, so copy it only once
    cudaMemcpy(devN, N, allocasize, cudaMemcpyHostToDevice);

    for (int i = 0; i < 10; i++)
    {
        cudaMemcpy(devM[i], M[i], allocasize, cudaMemcpyHostToDevice);
        dim3 block(width, width);   // one thread per output element
        dim3 grid(1, 1, 1);
        Kernel_Function<<<grid, block>>>(devM[i], devN, devP[i], width);
    }

    // Copy the results back and release the device memory
    for (int i = 0; i < 10; i++)
    {
        cudaMemcpy(P[i], devP[i], allocasize, cudaMemcpyDeviceToHost);
        cudaFree(devM[i]);
        cudaFree(devP[i]);
    }
    cudaFree(devN);
}
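
For reference, Kernel_Function is meant to be a naive per-element multiplication; a minimal sketch of what I have in mind, assuming row-major storage, one block per product and one thread per output element (the body below is only a sketch, not tested code):

__global__ void Kernel_Function(const int *A, const int *B, int *C, size_t width)
{
    // Each thread computes one element of the width x width product C = A * B
    size_t row = threadIdx.y;
    size_t col = threadIdx.x;
    if (row < width && col < width)
    {
        int sum = 0;
        for (size_t k = 0; k < width; k++)
            sum += A[row * width + k] * B[k * width + col];
        C[row * width + col] = sum;
    }
}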

1 Answer


As emerged from the comments above and from the answer by Robert Crovella, there are different possible approaches. Each approach can be better suited to a different situation, i.e., to a different number N of matrices to multiply and to different matrix dimensions MxM. Let me summarize them below:

  1. If N is small and M is large, perhaps the best approach would be to use cublas<t>gemm called from host code;
  2. If N is moderate and M is moderate, and if a device with compute capability of at least 3.5 is available, then a good possibility would be to use dynamic parallelism, namely, creating a thread grid of N threads and launching a cublas<t>gemm from within a kernel; perhaps this approach would fail for large N or M due to the large number of threads required;
  3. If N is large and M is small, then the cuBLAS batched approach linked to by Robert Crovella could be of interest (see the sketch after this list);
  4. Similarly, if N is large and M is small, then a cuBLAS stream-based approach would be worth a try, as also mentioned in Robert's comment;
  5. If N is large and M is very small, an approach using a thread grid of N threads, each "manually" computing an optimized matrix multiplication, could be appealing; for example, if one has to construct a matrix multiplication algorithm for 4x4 matrices, then one could optimize the multiplication performed by each thread according to Number of elementary multiplications for multiplying 4x4 matrices.
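
As an illustration of approach #3, below is a minimal sketch of a call to cublasSgemmBatched that multiplies nbatch matrices A[i] by the same matrix B. Note that cuBLAS works in floating point (the int matrices of the question would first have to be converted to float) and assumes column-major storage; the helper name batched_multiply and the surrounding setup are my own, only the cublasSgemmBatched call itself is the library API:

#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Computes C[i] = A[i] * B for i = 0..nbatch-1, where dA[i], dB and dC[i]
// are device pointers to width x width column-major float matrices.
void batched_multiply(float *dA[], float *dB, float *dC[], int width, int nbatch)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // cublasSgemmBatched expects *device* arrays of pointers to the matrices;
    // B is the same for every product, so every entry of dBarr points to it.
    std::vector<float*> hB(nbatch, dB);
    float **dAarr, **dBarr, **dCarr;
    cudaMalloc(&dAarr, nbatch * sizeof(float*));
    cudaMalloc(&dBarr, nbatch * sizeof(float*));
    cudaMalloc(&dCarr, nbatch * sizeof(float*));
    cudaMemcpy(dAarr, dA,        nbatch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB.data(), nbatch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, dC,        nbatch * sizeof(float*), cudaMemcpyHostToDevice);

    // A single library call performs all nbatch products
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       width, width, width,
                       &alpha,
                       (const float**)dAarr, width,
                       (const float**)dBarr, width,
                       &beta,
                       dCarr, width,
                       nbatch);

    cudaFree(dAarr);
    cudaFree(dBarr);
    cudaFree(dCarr);
    cublasDestroy(handle);
}

For small matrices (e.g., 4x4), the batched call amortizes the kernel launch overhead that a loop of individual cublas<t>gemm calls would pay N times.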
