memory - Interpreting the verbose output of ptxas, part I

Question

Welcome To Ask or Share your Answers For Others

memory - Interpreting the verbose output of ptxas, part I

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

memory - Interpreting the verbose output of ptxas, part I

I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel.

I compiled my kernel.cu file to a kernel.o file with nvcc -arch=sm_20 -ptxas-options=-v

and I got the following output (passed through c++filt):

ptxas info    : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info    : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]

Looking at the output above, is it correct to say that

each CUDA thread is using 46 registers?
there is no register spilling to local memory?

I am also having some issues with understanding the output.

My kernel is calling a whole lot of __device__ functions. IS 72 bytes the sum-total of the memory for the stack frames of the __global__ and __device__ functions?
What is the difference between 0 byte spill stores and 0 bytes spill loads
Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?

_{This question is "continued" in: Interpreting the verbose output of ptxas, part II}

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:37:24+0000

Each CUDA thread is using 46 registers? Yes, correct
There is no register spilling to local memory? Yes, correct
Is 72 bytes the sum-total of the memory for the stack frames of the __global__ and __device__ functions? Yes, correct
What is the difference between 0 byte spill stores and 0 bytes spill loads?
- Fair question, the loads could be greater than the stores since you could spill a computed value, load it once, discard it (i.e. store something else into that register) then load it again (i.e. reuse it). Update: note also that the spill load/store count is based on static analysis as described by @njuffa in the comments below
Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?
- Constant memory is used for a few purposes including __constant__ variables and kernel arguments, different "banks" are used, that starts to get a bit detailed but as long as you use less than 64KB for your __constant__ variables and less than 4KB for kernel arguments you will be ok.

Categories

memory - Interpreting the verbose output of ptxas, part I

memory - Interpreting the verbose output of ptxas, part I

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags