Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
4.8k views
in Technique[技术] by (71.8m points)

performance - Cache miss latency in clock cycles

To measure the impact of cache-misses in a program, I want to latency caused by cache-misses to the cycles used for actual computation. I use perf stat to measure the cycles, L1-loads, L1-misses, LLC-loads and LLC-misses in my program. Here is a example output:

               467?769,70 msec task-clock                #    1,000 CPUs utilized          
        1?234?063?672?432      cycles                    #    2,638 GHz                      (62,50%)
          572?761?379?098      instructions              #    0,46  insn per cycle           (75,00%)
          129?143?035?219      branches                  #  276,083 M/sec                    (75,00%)
            6?457?141?079      branch-misses             #    5,00% of all branches          (75,00%)
          195?360?583?052      L1-dcache-loads           #  417,643 M/sec                    (75,00%)
           33?224?066?301      L1-dcache-load-misses     #   17,01% of all L1-dcache hits    (75,00%)
           20?620?655?322      LLC-loads                 #   44,083 M/sec                    (50,00%)
            6?030?530?728      LLC-load-misses           #   29,25% of all LL-cache hits     (50,00%)

Then my question is: How to convert the number of cache-misses into a number of "lost" clock cycles? Or alternatively, what is the proportion of time spent for fetching data?

I think the factor should be known by the constructor. My processor is Intel Core i7-10810U, and I couldn't find this information in the specifications nor in this list of benchmarked CPUs.

This related problem describes how to measure the number of cycles lost in a cache-miss, but is there a way to obtain this as hardware information? Ideally, the output would be something like:

L1-hit: 3 cycles
L2-hit: 10 cycles
LLC-hit: 30 cycles
RAM: 300 cycles

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Out-of-order exec and memory-level parallelism exist to hide some of that latency by overlapping useful work with time data is in flight. If you simply multiplied L3 miss count by say 300 cycles each, that could exceed the total number of cycles your whole program took. The perf event cycle_activity.stalls_l3_miss (which exists on my Skylake CPU) should count cycles when no uops execute and there's an outstanding L3 cache miss. i.e. cycles when execution is fully stalled. But there will also be cycles with some work, but less than without a cache miss, and that's harder to evaluate.

TL:DR: memory access is heavily pipelined; the whole core doesn't stop on one cache miss, that's the whole point. A pointer-chasing benchmark (to measure latency) is merely a worst case, where the only work is dereferencing a load result. See Modern Microprocessors A 90-Minute Guide! which has a section about memory and the "memory wall". See also https://agner.org/optimize/ and https://www.realworldtech.com/haswell-cpu/ to learn more about the details of out-of-order exec CPUs and how they can continue doing independent work while one instruction is waiting for data from a cache miss, up to the limit of their out-of-order window size. (https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/)


Re: numbers from vendors:

L3 and RAM latencies aren't a fixed number of core clock cycles: first, core frequency is variable (and independent of uncore and memory clocks), and second because of contention from other cores, and number of hops over the interconnect. (Related: Is cycle count itself reliable on program timing? discusses some effects of core frequency independent of L3 and memory)

That said, Intel's optimization manual does include a table with exact latencies for L1 and L2, and typical for L3, and DRAM on Skylake-server. (2.2.1.3 Skylake Server Microarchitecture Cache Recommendations) https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#optimization - they say SKX L3 latency is typically 50-70 cycles. DRAM speed depends some on the timing of your DIMMs.

Other people have tested specific CPUs, like https://www.7-cpu.com/cpu/Skylake.html.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...