I have an array of square matrices int *M[10]; so that M[i] locates the first element of the i-th matrix. I want to multiply all the matrices M[i] by another matrix N, so that I receive an array of square matrices int *P[10] as output.
There are different possibilities I see:
- Assign the computation of each element of M[i] * N to a different thread; for example, with 10 matrices of size 4x4, the number of involved threads would be 160. How can I use CUDA to implement this approach? (A sketch of what I have in mind follows this list.)
- In the framework of the example above, create a composite matrix of size 40x40 (i.e., collecting the 10 4x4 matrices together) and use 40x40 threads; but this approach seems to require more time. I'm trying with the array of matrices, but I think I'm doing something wrong. How can I use this approach with 10 matrices, and how should the kernel function be coded?
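To make the first approach concrete, this is how I imagine the kernel could look (a minimal sketch, assuming row-major storage and that the 10 input matrices have been packed into one contiguous device buffer; the name batchMatMulKernel and that packed layout are my own assumptions):

// One thread per output element, across all matrices.
// d_M holds numMatrices matrices of size width x width, stored back to back;
// d_P has the same layout for the results.
__global__ void batchMatMulKernel(const int *d_M, const int *d_N,
                                  int *d_P, int width, int numMatrices)
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int total = numMatrices * width * width;    // e.g. 10 * 4 * 4 = 160
    if (tid >= total) return;

    int m   = tid / (width * width);            // which matrix
    int rem = tid % (width * width);
    int row = rem / width;                      // which output element
    int col = rem % width;

    const int *A = d_M + m * width * width;     // start of the m-th matrix
    int sum = 0;
    for (int k = 0; k < width; ++k)
        sum += A[row * width + k] * d_N[k * width + col];
    d_P[m * width * width + row * width + col] = sum;
}

With width = 4 and 10 matrices this needs 160 threads in total, e.g. batchMatMulKernel<<<1, 160>>>(d_M, d_N, d_P, 4, 10);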
This is what I'm trying for the array-of-matrices version:
void GPU_Multi(int *M[10], int *N, int *P[10], size_t width)
{
    int *devM[10];
    int *devN;
    int *devP[10];
    size_t allocasize = sizeof(int) * width * width;

    // Allocate device storage for the 10 input and output matrices and for N
    for (int i = 0; i < 10; i++)
    {
        cudaMalloc((void**)&devM[i], allocasize);
        cudaMalloc((void**)&devP[i], allocasize);
    }
    cudaMalloc((void**)&devN, allocasize);

    // N is the same for every product, so it only needs to be copied once
    cudaMemcpy(devN, N, allocasize, cudaMemcpyHostToDevice);

    // One kernel launch per matrix, with one thread per output element
    dim3 block(width, width);
    dim3 grid(1, 1, 1);
    for (int i = 0; i < 10; i++)
    {
        cudaMemcpy(devM[i], M[i], allocasize, cudaMemcpyHostToDevice);
        Kernel_Function<<<grid, block>>>(devM[i], devN, devP[i], width);
    }

    // Copy the results back and release device memory
    for (int i = 0; i < 10; i++)
    {
        cudaMemcpy(P[i], devP[i], allocasize, cudaMemcpyDeviceToHost);
        cudaFree(devM[i]);
        cudaFree(devP[i]);
    }
    cudaFree(devN);
}
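For completeness, this is what I intend Kernel_Function to be (a sketch only, not tested: launched with a width x width block, each thread computes one element of P = M * N, row-major storage assumed):

// Each thread computes one element of the single product devP = devM * devN.
__global__ void Kernel_Function(const int *devM, const int *devN,
                                int *devP, size_t width)
{
    size_t row = threadIdx.y;
    size_t col = threadIdx.x;
    int sum = 0;
    for (size_t k = 0; k < width; ++k)
        sum += devM[row * width + k] * devN[k * width + col];
    devP[row * width + col] = sum;
}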