Tl;DR: For these three cases, a penalty of a few cycles is incurred when performing a load and store at the same time. The load latency is on the critical path in all of the three cases, but the penalty is different in different cases. Case 3 is about a cycle higher than case 1 due to the additional load.
Analysis Method 1: Using stall performance events
I was able to reproduce your results for the all of the three cases on IvB and SnB. The numbers I got are within 2% of your numbers. The number of cycles it takes to execute a single iteration of case 1, 2, and 4 is 5.4, 8.9, and 6.6, respectively.
Let's start with the frontend. The LSD.CYCLES_4_UOPS
and LSD.CYCLES_3_UOPS
performance events show that basically all the uops are issued from the LSD. In addition, these events together with LSD.CYCLES_ACTIVE
show that in every cycle in which the LSD is not stalled, 3 uops are issued in cases 1 and 2 and 4 uops are issued in case 3. In other words, as expected, the uops of every iteration are issued together in the same group in a single cycle.
In all of the following relations, the "=~" sign means that the difference is within 2%. I'll start with the following empirical observation:
UOPS_ISSUED.STALL_CYCLES
+ LSD.CYCLES_ACTIVE
=~ cycles
Note that the LSD event counts on SnB need to be adjusted as discussed in here.
We also have the following relations:
case 1: UOPS_ISSUED.STALL_CYCLES
=~ RESOURCE_STALLS.ANY
=~ 4.4c/iter
case 2: UOPS_ISSUED.STALL_CYCLES
=~ RESOURCE_STALLS.ANY
=~ 7.9c/iter
case 3: UOPS_ISSUED.STALL_CYCLES
=~ RESOURCE_STALLS.ANY
=~ 5.6c/iter
This means that the reason for the issue stalls is because one or more required resources in the backend are not available. Therefore, we can confidently eliminate the whole frontend from consideration. In cases 1 and 2, that resource is the RS. In case 3, stalls due to the RS constitute about 20% of all resource stalls1.
Let's focus now on case 1. There are a total of 4 unfused domain uops: 1 load uop, 1 STA, 1 STD, and 1 dec/jne. The load and STA uops depend on the previous load uop. Whenever the LSD issues a group of uops, the STD and jump uops can be dispatched in the next cycle, so the next cycle will not cause an execution stall event. However, the earliest point where the load and STA uops can be dispatched is in the same cycle in the which the load result is written back. The correlation between CYCLES_NO_EXECUTE
and STALLS_LDM_PENDING
indicates that the reason why there would be no uops ready for execution is because all of the uops that are in the RS are waiting for the L1 to service pending load requests. Specifically, half of the uops in the RS are load uops and the other half are STAs and they are all waiting for the load of the respective previous iteration to complete. LSD.CYCLES_3_UOPS
shows that the LSD waits until there are at least 4 free entries in the RS, only then it issues a group of uops that constitute a full iteration. In the next cycle, two of these uops will be dispatched, thereby freeing 2 RS entries2. The other will have to wait for the load they depend on to complete. Most probably the loads complete in program order. Therefore, the LSD waits until the STA and load uops of the oldest iteration that is yet to be executed leave the RS. Thus, UOPS_ISSUED.STALL_CYCLES
+ 1 =~ the average load latency3. We can conclude that the average load latency in case 1 is 5.4c. Most of this applies to case 2, except for one difference, as I'll explain shortly.
Since the uops in each iteration form a dependency chain, we also have:
cycles
=~ the average load latency.
Hence:
cycles
=~ UOPS_ISSUED.STALL_CYCLES
+ 1 =~ the average load latency.
In case 1, the average load latency is 5.4c. We know that the best-case latency of the L1 cache is 4c, so there is a load latency penalty of 1.4c. But why is the effective load latency not 4c?
The scheduler will predict that the load on which the uops depend will complete within some constant latency and so it will schedule them to be dispatched accordingly. If the load takes more time than that for any reason (such as an L1 miss), the uops will be dispatched but the load result has not arrived yet. In this case, the uops will be replayed and the number of dispatched uops will be larger than the total number of issued uops.
The load and STA uops can only be dispatched to port 2 or 3. The events UOPS_EXECUTED_PORT.PORT_2
and UOPS_EXECUTED_PORT.PORT_3
can be used to count the number of uops dispatched to port 2 and 3, respectively.
case 1: UOPS_EXECUTED_PORT.PORT_2
+ UOPS_EXECUTED_PORT.PORT_3
=~ 2uops/iter
case 2: UOPS_EXECUTED_PORT.PORT_2
+ UOPS_EXECUTED_PORT.PORT_3
=~ 6uops/iter
case 3: UOPS_EXECUTED_PORT.PORT_2
+ UOPS_EXECUTED_PORT.PORT_3
=~ 4.2uops/iter
In case 1, the total number of AGU uops dispatched is exactly equal to the number of AGU uops retired; there are no replays. So the scheduler never mispredicts. In case 2, there is on average 2 replays per AGU uop, which means that the scheduler mispredicts twice on average per AGU uop. Why are there mispredictions in case 2 but not in case 1?
The scheduler will replay uops dependent on a load for any of the following reasons:
- L1 cache miss.
- Memory disambiguation misprediction.
- Memory consistency violation.
- L1 cache hit, but there is L1-L2 traffic.
- Virtual page number misprediction.
- Some other (undocumented) reasons.
The first 5 reasons can be definitively ruled out using the corresponding performance events. Patrick Fay (Intel) says the following:
Lastly yes, there are 'a few' idle cycles when switching between a
load and a store. I'm told not to be more specific than 'a few'.
...
SNB can read and write different banks at the same cycle.
I find these statements, perhaps intentionally, a little ambiguous. The first statement suggests that a load and store to the L1 can never fully overlap. The second one suggests that a load and store can be performed in the same cycle only if there are to different banks. Although being to different banks may neither be a necessary nor sufficient condition. But one thing is for sure, if there are concurrent load and store requests, the load (and the store) can be delayed for one or more cycles. This explains the average 1.4c penalty on the load latency in case 1.
There is a difference between case 1 and case 2. In case 1, the STA and load uops that depend on the same load uop are issued together in the same cycle. On the other hand, in case 2, the STA and load uops that depend on the same load uop belong to two different issue groups. The issue stall time per iteration would be essentially equal to the time it takes to sequentially execute one load and retire one store. The contribution of each operation can be estimated using CYCLE_ACTIVITY.STALLS_LDM_PENDING
. It takes one cycle to execute the STA uop so the store can retire in the cycle that immediately follows the one in which the STA is dispatched.
The average load latency is CYCLE_ACTIVITY.STALLS_LDM_PENDING
+ 1 cycle (the cycle in which the load is dispatched) + 1 cycle (the cycle in the which the jump uop is dispatched). We need to add 2 cycles to CYCLE_ACTIVITY.STALLS_LDM_PENDING
because there are no execution stalls in these cycles yet they constitute a fraction of the total load latency. This is equal to 6.8 + 2 = 8.8 cycles =~ cycles
.
During the execution of the first dozen (or so) iterations, a jump and STD uops will be allocated in the RS every cycle. These will be always be dispatched for execution in the cycle that follows the issue cycle. At some point, the RS will get full and all of the entries that have not been dispatched yet will be STA and load uops that are waiting for the load uops of the respective previous iterations to complete (writeback their results). So the allocator will stall until there are enough free RS entries to issue a whole iteration. Let's assume that the oldest load uop has written back its result at cycle T
+ 0. I'll refer to the iteration to which that load uop belongs as the current iteration. The following sequence of events will occur:
At cycle T
+ 0: Dispatch the STA uop of the current iteration and the load uop of the next iteration. There is no allocation in this cycle because there aren't enough RS entries. This cycle gets counted as an allocation stall cycle but not as an execution stall cycle.
At cycle T
+ 1: The STA uop completes execution and the store retires. The uops of the next iteration to be allocated are allocated. This cycle gets counted as an execution stall cycle but not as an allocation stall cycle.
At cycle T
+ 2: The jump and STD uops that were just allocated get dispatched. This cycle gets counted as an allocation stall cycle but not as an execution stall cycle.
At cycles T
+ 3 to T
+ 3 + CYCLE_ACTIVITY.STALLS_LDM_PENDING
- 2: All of these cycles are counted as both execution and allocation stall cycles. Note that there are CYCLE_ACTIVITY.STALLS_LDM_PENDING
- 1 cycles here.
Therefore, UOPS_ISSUED.STALL_CYCLES
should be equal to 1 + 0 + 1 + CYCLE_ACTIVITY.STALLS_LDM_PENDING
- 1. Let's check: 7.9 = 1+0+1+6.8-1.
Following the reasoning on case 1, cycles
should be equal to UOPS_ISSUED.STALL_CYCLES
+ 1 = 7.9 + 1 =~ the actual measured cycles
. The penalty incurred when performing a load and store in at the same time is 3.6c higher than in case 1. It is as if the load is waiting for a store get committed. I think this also explains why there are replays in the case 2 but not in case 1.
In case 3, there are 1 STD, 1 STA, 2 loads, and 1 jump. The uops of a single iteration can all be allocated in one cycle because the IDQ-RS bandwidth is 4 fused uops per cycle. The uops get unfused on entrance to the RS. The 1 STD require 1 cycle to be dispatched. The jump also takes 1 cycle. There are three AGU uops but only 2 AGU ports. So it takes 2 cycles (compared to 1 in case 1 and 2) to dispatch the AGU uops. The group of AGU uops dispatched will be one of the following:
- The second load uop and the STA uop of the same iteration. These are dependent on the first load uop of the same iteration. Both AGU ports are used.
- The first load uop of the next iteration can be dispatched in the next cycle. This depends on the load of the previous iteration. Only one of the two AGU ports is used.
Since it takes one more cycle to free enough RS entries to accommodate an entire issue group, UOPS_ISSUED.STALL_CYCLES
+ 1 - 1 = UOPS_ISSUED.STALL_CYCLES
=~ the average load latency =~ 5.6c, which is very close to that of case 1. The penalty is about 1.6c. This