I know it sound weird, but here is my scenario:
I need to do a matrix-matrix multiplication (A(n*k)*B(k*n)), but I only needs the diagonal elements to be evaluated for the output matrix. I searched cublas library and didn't find any level 2 or 3 functions that can do that.
So, I decided to distribute each row of A and each column of B into CUDA threads. For each thread (idx), I need to calculate the dot product "A[idx,:]*B[:,idx]" and save it as the corresponding diagonal output. Now since this dot product also takes some time, and I wonder whether I could somehow call cublas function here (say cublasSdot) to achieve it.
If I missed some cublas function that can achieve my goal directly (only calculate the diagonal elements for a matrix-matrix multiplication), this question could be discarded.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…