I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel.
I compiled my kernel.cu
file to a kernel.o
file with nvcc -arch=sm_20 -ptxas-options=-v
and I got the following output (passed through c++filt
):
ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]
Looking at the output above, is it correct to say that
- each CUDA thread is using 46 registers?
- there is no register spilling to local memory?
I am also having some issues with understanding the output.
My kernel is calling a whole lot of __device__
functions. IS 72 bytes the sum-total
of the memory for the stack frames of the __global__
and __device__
functions?
What is the difference between 0 byte spill stores
and 0 bytes spill loads
Why is the information for cmem
(which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant
memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?
This question is "continued" in: Interpreting the verbose output of ptxas, part II
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…