Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

assembly - Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

First, I have the setup below on an IvyBridge; I will insert the payload code to be measured at the commented location. The first 8 bytes of buf store the address of buf itself, which I use to create a loop-carried dependency:

section .bss
align   64
buf:    resb    64

section .text
global _start
_start:
    mov rcx,         1000000000
    mov qword [buf], buf
    mov rax,         buf
loop:
    ; I will insert payload here
    ; as is described below 

    dec rcx
    jne loop

    xor rdi,    rdi
    mov rax,    60
    syscall

case 1:

I insert into the payload location:

mov qword [rax+8],  8
mov rax,            [rax]

perf shows the loop runs at 5.4c/iter. That is somewhat understandable, since the L1d load-use latency is 4 cycles.

case 2:

I reverse the order of these two instructions:

mov rax,            [rax]
mov qword [rax+8],  8

The result suddenly becomes 9c/iter, and I don't understand why. Since the first instruction of the next iteration doesn't depend on the second instruction of the current iteration, this case shouldn't behave differently from case 1.

I also used the IACA tool to analyze these two cases statically, but the tool is unreliable here: it predicts the same result, 5.71c/iter, for both cases, which contradicts the experiment.

case 3:

Then I insert an irrelevant mov instruction into case 2:

mov rax,            [rax]
mov qword [rax+8],  8
mov rbx,            [rax+16] 

Now the result becomes 6.8c/iter. But how can inserting an irrelevant mov boost the speed from 9c/iter to 6.8c/iter?

The IACA tool again predicts a wrong result, as in the previous cases: it shows 5.24c/iter.

I'm now totally confused. How can the above results be explained?

Edit for more info:

In cases 1 and 2 there is an address rax+8. The results for cases 1 and 2 remain the same if rax+8 is changed to rax+16 or rax+24. But something surprising happens when it is changed to rax+32: case 1 becomes 5.3c/iter, and case 2 suddenly becomes 4.2c/iter.

Edit for more perf events:

$ perf stat -ecycles,ld_blocks_partial.address_alias,int_misc.recovery_cycles,machine_clears.count,uops_executed.stall_cycles,resource_stalls.any ./a.out

case 1 for [rax+8]:

 5,429,070,287      cycles                                                        (66.53%)
         6,941      ld_blocks_partial.address_alias                                     (66.75%)
       426,528      int_misc.recovery_cycles                                      (66.83%)
        17,117      machine_clears.count                                          (66.84%)
 2,182,476,446      uops_executed.stall_cycles                                     (66.63%)
 4,386,210,668      resource_stalls.any                                           (66.41%)

case 2 for [rax+8]:

 9,018,343,290      cycles                                                        (66.59%)
         8,266      ld_blocks_partial.address_alias                                     (66.73%)
       377,824      int_misc.recovery_cycles                                      (66.76%)
        10,159      machine_clears.count                                          (66.76%)
 7,010,861,225      uops_executed.stall_cycles                                     (66.65%)
 7,993,995,420      resource_stalls.any                                           (66.51%)

case 3 for [rax+8]:

 6,810,946,768      cycles                                                        (66.69%)
         1,641      ld_blocks_partial.address_alias                                     (66.73%)
       223,062      int_misc.recovery_cycles                                      (66.73%)
         7,349      machine_clears.count                                          (66.74%)
 3,618,236,557      uops_executed.stall_cycles                                     (66.58%)
 5,777,653,144      resource_stalls.any                                           (66.53%)

case 2 for [rax+32]:

 4,202,233,246      cycles                                                        (66.68%)
         2,969      ld_blocks_partial.address_alias                                     (66.68%)
       149,308      int_misc.recovery_cycles                                      (66.68%)
         4,522      machine_clears.count                                          (66.68%)
 1,202,497,606      uops_executed.stall_cycles                                     (66.64%)
 3,179,044,737      resource_stalls.any                                           (66.64%)
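For reference, the c/iter figures quoted above follow from dividing the raw cycle counts by the 10^9 loop iterations. A quick sanity check (counts copied from the perf output above):

```python
# Derive cycles-per-iteration from the raw perf cycle counts above;
# the loop executes exactly 1e9 iterations (rcx = 1000000000).
ITERATIONS = 1_000_000_000

cycles = {
    "case 1, [rax+8]":  5_429_070_287,
    "case 2, [rax+8]":  9_018_343_290,
    "case 3, [rax+8]":  6_810_946_768,
    "case 2, [rax+32]": 4_202_233_246,
}

for name, count in cycles.items():
    print(f"{name}: {count / ITERATIONS:.1f} c/iter")
```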


1 Answer


TL;DR: In all three cases, a penalty of a few cycles is incurred when a load and a store are performed at the same time. The load latency is on the critical path in all three cases, but the penalty differs between them. Case 3 is about a cycle higher than case 1 due to the additional load.


Analysis Method 1: Using stall performance events

I was able to reproduce your results for all three cases on IvB and SnB. The numbers I got are within 2% of yours. The number of cycles it takes to execute a single iteration of cases 1, 2, and 3 is 5.4, 8.9, and 6.6, respectively.

Let's start with the frontend. The LSD.CYCLES_4_UOPS and LSD.CYCLES_3_UOPS performance events show that basically all the uops are issued from the LSD. In addition, these events together with LSD.CYCLES_ACTIVE show that in every cycle in which the LSD is not stalled, 3 uops are issued in cases 1 and 2 and 4 uops are issued in case 3. In other words, as expected, the uops of every iteration are issued together in the same group in a single cycle.

In all of the following relations, the "=~" sign means that the difference is within 2%. I'll start with the following empirical observation:

UOPS_ISSUED.STALL_CYCLES + LSD.CYCLES_ACTIVE =~ cycles

Note that the LSD event counts on SnB need to be adjusted as discussed here.

We also have the following relations:

case 1: UOPS_ISSUED.STALL_CYCLES =~ RESOURCE_STALLS.ANY =~ 4.4c/iter
case 2: UOPS_ISSUED.STALL_CYCLES =~ RESOURCE_STALLS.ANY =~ 7.9c/iter
case 3: UOPS_ISSUED.STALL_CYCLES =~ RESOURCE_STALLS.ANY =~ 5.6c/iter

This means that the issue stalls occur because one or more required resources in the backend are unavailable. Therefore, we can confidently eliminate the whole frontend from consideration. In cases 1 and 2, that resource is the RS. In case 3, stalls due to the RS constitute about 20% of all resource stalls1.

Let's focus now on case 1. There are a total of 4 unfused-domain uops: 1 load uop, 1 STA, 1 STD, and 1 dec/jne. The load and STA uops depend on the previous load uop. Whenever the LSD issues a group of uops, the STD and jump uops can be dispatched in the next cycle, so that cycle will not cause an execution stall event. However, the earliest point at which the load and STA uops can be dispatched is the cycle in which the load result is written back. The correlation between CYCLES_NO_EXECUTE and STALLS_LDM_PENDING indicates that the reason there would be no uops ready for execution is that all of the uops in the RS are waiting for the L1 to service pending load requests. Specifically, half of the uops in the RS are load uops and the other half are STAs, and they are all waiting for the load of the respective previous iteration to complete.

LSD.CYCLES_3_UOPS shows that the LSD waits until there are at least 4 free entries in the RS, and only then issues a group of uops that constitutes a full iteration. In the next cycle, two of these uops will be dispatched, thereby freeing 2 RS entries2. The other two will have to wait for the load they depend on to complete. Most probably the loads complete in program order. Therefore, the LSD waits until the STA and load uops of the oldest not-yet-executed iteration leave the RS. Thus, UOPS_ISSUED.STALL_CYCLES + 1 =~ the average load latency3. We can conclude that the average load latency in case 1 is 5.4c. Most of this applies to case 2, except for one difference, as I'll explain shortly.

Since the uops in each iteration form a dependency chain, we also have:

cycles =~ the average load latency.

Hence:

cycles =~ UOPS_ISSUED.STALL_CYCLES + 1 =~ the average load latency.

In case 1, the average load latency is 5.4c. We know that the best-case latency of the L1 cache is 4c, so there is a load latency penalty of 1.4c. But why is the effective load latency not 4c?
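The arithmetic is worth making explicit (a small check using the case 1 numbers above):

```python
# Case 1: cycles =~ UOPS_ISSUED.STALL_CYCLES + 1 =~ average load latency.
stall_cycles = 4.4                    # measured UOPS_ISSUED.STALL_CYCLES per iteration
avg_load_latency = stall_cycles + 1   # =~ 5.4c, matching the measured cycles/iter
l1_best_case = 4.0                    # best-case L1d load-use latency on IvB
penalty = avg_load_latency - l1_best_case   # =~ 1.4c load-latency penalty
print(avg_load_latency, penalty)
```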

The scheduler predicts that the load on which a uop depends will complete within some constant latency, and schedules the dependent uops to be dispatched accordingly. If the load takes longer than that for any reason (such as an L1 miss), the uops will have been dispatched before the load result has arrived. In this case, the uops are replayed, and the number of dispatched uops will be larger than the total number of issued uops.

The load and STA uops can only be dispatched to port 2 or 3. The events UOPS_EXECUTED_PORT.PORT_2 and UOPS_EXECUTED_PORT.PORT_3 can be used to count the number of uops dispatched to port 2 and 3, respectively.

case 1: UOPS_EXECUTED_PORT.PORT_2 + UOPS_EXECUTED_PORT.PORT_3 =~ 2uops/iter
case 2: UOPS_EXECUTED_PORT.PORT_2 + UOPS_EXECUTED_PORT.PORT_3 =~ 6uops/iter
case 3: UOPS_EXECUTED_PORT.PORT_2 + UOPS_EXECUTED_PORT.PORT_3 =~ 4.2uops/iter

In case 1, the total number of AGU uops dispatched is exactly equal to the number of AGU uops retired; there are no replays, so the scheduler never mispredicts. In case 2, there are on average 2 replays per AGU uop, which means that the scheduler mispredicts twice on average per AGU uop. Why are there mispredictions in case 2 but not in case 1?
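Before answering, note how the replay figure follows from the dispatched-versus-retired difference (port counts from above; the retired AGU count follows from the loop body, which contains one load and one store address):

```python
# Replayed dispatches = dispatched AGU uops - retired AGU uops.
# Case 2 dispatches ~6 AGU uops/iter on ports 2 and 3, but only 2 retire
# (1 load + 1 STA), i.e. each AGU uop is dispatched ~3 times on average.
dispatched = 6      # UOPS_EXECUTED_PORT.PORT_2 + PORT_3 per iteration (case 2)
retired = 2         # 1 load uop + 1 STA uop per iteration
replays_per_uop = (dispatched - retired) / retired
print(replays_per_uop)  # 2.0 replays per AGU uop
```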

The scheduler will replay uops dependent on a load for any of the following reasons:

  • L1 cache miss.
  • Memory disambiguation misprediction.
  • Memory consistency violation.
  • L1 cache hit, but there is L1-L2 traffic.
  • Virtual page number misprediction.
  • Some other (undocumented) reasons.

The first 5 reasons can be definitively ruled out using the corresponding performance events. Patrick Fay (Intel) says the following:

Lastly yes, there are 'a few' idle cycles when switching between a load and a store. I'm told not to be more specific than 'a few'.
...
SNB can read and write different banks at the same cycle.

I find these statements, perhaps intentionally, a little ambiguous. The first statement suggests that a load and a store to the L1 can never fully overlap. The second suggests that a load and a store can be performed in the same cycle only if they go to different banks, although going to different banks may be neither a necessary nor a sufficient condition. But one thing is for sure: if there are concurrent load and store requests, the load (and the store) can be delayed by one or more cycles. This explains the average 1.4c penalty on the load latency in case 1.

There is a difference between case 1 and case 2. In case 1, the STA and load uops that depend on the same load uop are issued together in the same cycle. On the other hand, in case 2, the STA and load uops that depend on the same load uop belong to two different issue groups. The issue stall time per iteration would be essentially equal to the time it takes to sequentially execute one load and retire one store. The contribution of each operation can be estimated using CYCLE_ACTIVITY.STALLS_LDM_PENDING. It takes one cycle to execute the STA uop so the store can retire in the cycle that immediately follows the one in which the STA is dispatched.

The average load latency is CYCLE_ACTIVITY.STALLS_LDM_PENDING + 1 cycle (the cycle in which the load is dispatched) + 1 cycle (the cycle in which the jump uop is dispatched). We need to add 2 cycles to CYCLE_ACTIVITY.STALLS_LDM_PENDING because there are no execution stalls in these cycles, yet they constitute a fraction of the total load latency. This gives 6.8 + 2 = 8.8 cycles =~ cycles.

During the execution of the first dozen (or so) iterations, jump and STD uops will be allocated in the RS every cycle. These will always be dispatched for execution in the cycle that follows the issue cycle. At some point, the RS will fill up, and all of the entries that have not yet been dispatched will be STA and load uops waiting for the load uops of the respective previous iterations to complete (write back their results). So the allocator will stall until there are enough free RS entries to issue a whole iteration. Let's assume that the oldest load uop wrote back its result at cycle T + 0. I'll refer to the iteration to which that load uop belongs as the current iteration. The following sequence of events will occur:

At cycle T + 0: Dispatch the STA uop of the current iteration and the load uop of the next iteration. There is no allocation in this cycle because there aren't enough RS entries. This cycle gets counted as an allocation stall cycle but not as an execution stall cycle.

At cycle T + 1: The STA uop completes execution and the store retires. The uops of the next iteration to be allocated are allocated. This cycle gets counted as an execution stall cycle but not as an allocation stall cycle.

At cycle T + 2: The jump and STD uops that were just allocated get dispatched. This cycle gets counted as an allocation stall cycle but not as an execution stall cycle.

At cycles T + 3 to T + 3 + CYCLE_ACTIVITY.STALLS_LDM_PENDING - 2: All of these cycles are counted as both execution and allocation stall cycles. Note that there are CYCLE_ACTIVITY.STALLS_LDM_PENDING - 1 cycles here.

Therefore, UOPS_ISSUED.STALL_CYCLES should be equal to 1 + 0 + 1 + CYCLE_ACTIVITY.STALLS_LDM_PENDING - 1. Let's check: 7.9 = 1+0+1+6.8-1.
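The same accounting can be checked numerically (STALLS_LDM_PENDING =~ 6.8c/iter, as measured above):

```python
# Case 2 timeline check:
# UOPS_ISSUED.STALL_CYCLES = 1 + 0 + 1 + (STALLS_LDM_PENDING - 1)
stalls_ldm_pending = 6.8    # CYCLE_ACTIVITY.STALLS_LDM_PENDING per iteration
issue_stall_cycles = 1 + 0 + 1 + stalls_ldm_pending - 1  # =~ 7.9 (measured)
avg_load_latency = stalls_ldm_pending + 2  # dispatch + jump cycles; =~ 8.8 =~ cycles/iter
print(issue_stall_cycles, avg_load_latency)
```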

Following the reasoning for case 1, cycles should be equal to UOPS_ISSUED.STALL_CYCLES + 1 = 7.9 + 1 =~ the actual measured cycles. The penalty incurred when performing a load and a store at the same time is 3.6c higher than in case 1. It is as if the load is waiting for a store to get committed. I think this also explains why there are replays in case 2 but not in case 1.

In case 3, there are 1 STD, 1 STA, 2 loads, and 1 jump. The uops of a single iteration can all be allocated in one cycle because the IDQ-RS bandwidth is 4 fused uops per cycle. The uops get unfused on entry to the RS. The STD requires 1 cycle to be dispatched. The jump also takes 1 cycle. There are three AGU uops but only 2 AGU ports, so it takes 2 cycles (compared to 1 in cases 1 and 2) to dispatch the AGU uops. The AGU uops are dispatched in the following groups:

  • The second load uop and the STA uop of the same iteration. These are dependent on the first load uop of the same iteration. Both AGU ports are used.
  • The first load uop of the next iteration can be dispatched in the next cycle. This depends on the load of the previous iteration. Only one of the two AGU ports is used.

Since it takes one more cycle to free enough RS entries to accommodate an entire issue group, UOPS_ISSUED.STALL_CYCLES + 1 - 1 = UOPS_ISSUED.STALL_CYCLES =~ the average load latency =~ 5.6c, which is very close to that of case 1. The penalty is about 1.6c.
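And the corresponding arithmetic for case 3 (stall cycles from the relation above):

```python
# Case 3: the extra AGU dispatch cycle cancels the +1, so
# average load latency =~ UOPS_ISSUED.STALL_CYCLES.
stall_cycles = 5.6              # measured UOPS_ISSUED.STALL_CYCLES per iteration
avg_load_latency = stall_cycles        # =~ 5.6c, close to case 1's 5.4c
penalty = avg_load_latency - 4.0       # vs the 4c best-case L1 latency; =~ 1.6c
print(avg_load_latency, penalty)
```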

