It is no secret that on CUDA 4.x the first call to cudaMalloc
can be ridiculously slow (this has been reported several times), seemingly because of a bug in the CUDA drivers.
Recently I noticed some weird behaviour: the running time of cudaMalloc
depends directly on how many third-party CUDA libraries are linked into my program
(note that I do NOT use these libraries; I only link my program against them).
I ran some tests using the following program:
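#include <cuda_runtime.h>

int main() {
    cudaSetDevice(0);
    unsigned int *ptr = 0;
    // This is the first cudaMalloc in the process, i.e. the slow call in question.
    cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
    cudaFree(ptr);
    return 0;
}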
The results are as follows:
Linked with: -lcudart -lnpp -lcufft -lcublas -lcusparse -lcurand
running time: 5.852449
Linked with: -lcudart -lnpp -lcufft -lcublas
running time: 1.425120
Linked with: -lcudart -lnpp -lcufft
running time: 0.905424
Linked with: -lcudart
running time: 0.394558
According to gdb, the time really is spent inside my cudaMalloc call, so it does not seem to be caused by some
library initialization routine that runs before main.
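One way to narrow this down further would be to time context creation separately from the allocation itself. The sketch below rests on an assumption I have not verified on CUDA 4.x: that the runtime creates its context lazily, on the first call that needs one, and that cudaFree(0) is enough to trigger that. now_seconds is a hypothetical helper, not a CUDA function, and the code assumes an nvcc that accepts C++11.

```cpp
// Sketch: separate the cost of (assumed lazy) context creation from cudaMalloc.
#include <cassert>
#include <chrono>
#include <cstdio>

// Hypothetical helper (not part of CUDA): monotonic wall-clock time in seconds.
double now_seconds() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

#ifdef __CUDACC__  // the CUDA part is only compiled when the file is built by nvcc
#include <cuda_runtime.h>

int main() {
    unsigned int *ptr = nullptr;
    cudaSetDevice(0);

    // Assumption: cudaFree(0) frees nothing but forces the context to be created.
    double t0 = now_seconds();
    cudaError_t init_err = cudaFree(0);
    double t1 = now_seconds();

    // If the assumption holds, this call should now cost only the allocation.
    cudaError_t malloc_err =
        cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
    double t2 = now_seconds();

    if (init_err != cudaSuccess || malloc_err != cudaSuccess) {
        cudaError_t err = (init_err != cudaSuccess) ? init_err : malloc_err;
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    assert(ptr != nullptr);

    std::printf("context init (cudaFree(0)): %f s\n", t1 - t0);
    std::printf("cudaMalloc:                 %f s\n", t2 - t1);

    cudaFree(ptr);
    return 0;
}
#endif
```

Saved as a .cu file and built with each of the link lines above, this should show whether the extra time follows the first context-creating call or stays with cudaMalloc.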
Does anybody have a plausible explanation for this?