Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
436 views
in Technique [Technology] by (71.8m points)

profiling - How to calculate the achieved bandwidth of a CUDA kernel

I want to measure how much of the peak memory bandwidth my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a maximum memory bandwidth of 102.4 GB/s. In my kernel I have the following accesses to global memory:

    ...
    float result = 0.0f;                        // declared outside the loop so it is in scope below
    for (int k = 0; k < 4000; k++) {            // was k > 4000, which never executes
        float diff = in_data[index] - loc_mem[k];
        result = diff * diff;
        ...
    }
    out_data[index]  = result;
    out_data2[index] = sqrtf(result);           // sqrtf for single precision
    ...

I count 4000×2+2 = 8002 global memory accesses per thread. With 1,000,000 threads, all accesses being 4-byte floats, that gives ~32 GB of global memory traffic (reads and writes combined). As my kernel only takes 0.1 s, I would achieve ~320 GB/s, which is higher than the maximum bandwidth, so there must be an error in my calculation or my assumptions. I assume CUDA does some caching, so not all memory accesses count. Now my questions:

  • What is my error?
  • What accesses to global memory are cached and which are not?
  • Is it correct that I don't count accesses to registers, local, shared and constant memory?
  • Can I use the CUDA profiler for easier and more accurate results? Which counters would I need to use? How would I need to interpret them?

Profiler output:

method              gputime    cputime  occupancy instruction warp_serial memtransfer
memcpyHtoD           10.944         17                                          16384
fill                  64.32         93          1       14556           0
fill                 64.224         83          1       14556           0
memcpyHtoD           10.656         11                                          16384
fill                 64.064         82          1       14556           0
memcpyHtoD          1172.96       1309                                        4194304
memcpyHtoD           10.688         12                                          16384
cu_more_regT      93223.906      93241          1    40716656           0
memcpyDtoH         1276.672       1974                                        4194304
memcpyDtoH         1291.072       2019                                        4194304
memcpyDtoH          1278.72       2003                                        4194304
memcpyDtoH             1840       3172                                        4194304

New question: 4,194,304 bytes = 4 bytes × 1024 × 1024 data points = 4 MB per transfer, and gputime ≈ 0.1 s, so I achieve a bandwidth of 10 × 40 MB/s = 400 MB/s. That seems very low. Where is the error?

p.s. Tell me if you need other counters for your answer.

sister question: How to calculate Gflops of a kernel



1 Answer

0 votes
by (71.8m points)
  • You do not really have 1,000,000 threads running at once. You perform ~32 GB of global memory accesses, and the achieved bandwidth is determined by the threads currently running (reading) on the SMs and by the size of the data read.
  • On devices of compute capability ≥ 2.0 (Fermi and later), all global memory accesses are cached in L1 and L2 unless you tell the compiler to bypass the cache. Note that on compute capability 1.x devices such as the C1060, global memory loads are not cached.
  • I think so: achieved bandwidth refers to global memory traffic only.
  • I recommend using the Visual Profiler to see the read/write/global memory bandwidth. It would be interesting if you posted your results :).

The default counters in the Visual Profiler give you enough information to get an idea about your kernel (memory bandwidth, shared memory bank conflicts, instructions executed, ...).

Regarding your question, to calculate the achieved global memory throughput:

Compute Visual Profiler, DU-05162-001_v02, October 2010, User Guide, page 56, Table 7 ("Supported Derived Statistics"):

Global memory read throughput in gigabytes per second. For compute capability < 2.0 this is calculated as

    (((gld_32 * 32) + (gld_64 * 64) + (gld_128 * 128)) * TPC) / gputime

For compute capability >= 2.0 this is calculated as

    ((DRAM reads) * 32) / gputime

Hope this helps.

