As asked in the title: to transpose a row-major matrix A[m][n] stored on the device, one can do it this way:
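float* clone = ...; // device buffer holding a copy of A; cublasSgeam cannot transpose in place
float const alpha(1.0f);
float const beta(0.0f);
cublasHandle_t handle;
cublasCreate(&handle);
// A = alpha * clone^T + beta * clone; with beta = 0 the second operand contributes nothing.
// cuBLAS is column-major, so it sees the row-major m x n buffer as an n x m matrix with leading dimension n.
cublasSgeam( handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, &alpha, clone, n, &beta, clone, m, A, m );
cublasDestroy(handle);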
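To see why those leading dimensions work, here is a minimal CPU sketch that mimics the column-major semantics of the cublasSgeam call above for the OP_T / OP_N case. ref_geam_tn and transpose_row_major are hypothetical helper names made up for this illustration; they are not part of cuBLAS, and the sketch only checks the indexing, not GPU behaviour.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for column-major geam with transa = T, transb = N
// (hypothetical helper, not a cuBLAS function):
// C(i,j) = alpha * A(j,i) + beta * B(i,j)
// C and B are rows x cols, A is cols x rows, all stored column-major.
void ref_geam_tn(int rows, int cols, float alpha, const float* A, int lda,
                 float beta, const float* B, int ldb, float* C, int ldc) {
    assert(lda >= cols && ldb >= rows && ldc >= rows);
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            C[i + j * ldc] = alpha * A[j + i * lda] + beta * B[i + j * ldb];
}

// Transpose a row-major m x n matrix with the same argument pattern
// as the cublasSgeam call: (m, n, clone, n, clone, m, out, m).
std::vector<float> transpose_row_major(const std::vector<float>& a, int m, int n) {
    std::vector<float> clone(a);        // separate input buffer, as in the answer
    std::vector<float> out(a.size());   // will hold the row-major n x m result
    ref_geam_tn(m, n, 1.0f, clone.data(), n, 0.0f, clone.data(), m, out.data(), m);
    return out;
}
```

For example, calling transpose_row_major on the 2 x 3 matrix {1, 2, 3, 4, 5, 6} with m = 2, n = 3 yields the 3 x 2 row-major matrix {1, 4, 2, 5, 3, 6}.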
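To multiply two row-major matrices A[m][k] and B[k][n] into C = A*B, swap the operands: cuBLAS reads each row-major buffer as its transpose, and C^T = B^T * A^T. Reusing handle, alpha and beta from above (call this before cublasDestroy):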
cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha, B, n, A, k, &beta, C, n );
where C is also a row-major matrix.
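The operand swap can be checked on the CPU with a small sketch. ref_gemm_nn below is a hypothetical stand-in for column-major SGEMM with both operations set to OP_N, and multiply_row_major passes it the same argument order as the cublasSgemm call above; neither name comes from cuBLAS, and this is only an index-level illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for column-major SGEMM without transposes (hypothetical helper):
// C = alpha * A * B + beta * C
// A is rows x inner, B is inner x cols, C is rows x cols, all column-major.
void ref_gemm_nn(int rows, int cols, int inner, float alpha,
                 const float* A, int lda, const float* B, int ldb,
                 float beta, float* C, int ldc) {
    assert(lda >= rows && ldb >= inner && ldc >= rows);
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < inner; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
}

// Row-major C[m][n] = A[m][k] * B[k][n], using the argument order
// (n, m, k, B, n, A, k, C, n) from the cublasSgemm call.
std::vector<float> multiply_row_major(const std::vector<float>& A,
                                      const std::vector<float>& B,
                                      int m, int n, int k) {
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    ref_gemm_nn(n, m, k, 1.0f, B.data(), n, A.data(), k, 0.0f, C.data(), n);
    return C;
}
```

With A = {1, 2, 3, 4, 5, 6} as a 2 x 3 matrix and B = {7, 8, 9, 10, 11, 12} as a 3 x 2 matrix, multiply_row_major(A, B, 2, 2, 3) gives the row-major product {58, 64, 139, 154}, with no explicit transposition of either input.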