concurrency - x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding

Question

Welcome To Ask or Share your Answers For Others

concurrency - x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

concurrency - x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding

I am trying to understand section 8.2 of Intel's System Programming Guide (that's Vol 3 in the PDF).

In particular, I see two different reordering scenarios:

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

and

8.2.3.5 Intra-Processor Forwarding Is Allowed

However, I do not understand the difference between these scenarios from the observable effects POW. The examples provided in those sections seem interchangeable to me. 8.2.3.4 example can be explained by 8.2.3.5 rule just as well as by its own rule. And the converse seems true to me as well, although I am not that sure in that case.

So here is my question: are there better examples or explanations how the observable effects of 8.2.3.4 are different from observable effects of 8.2.3.5?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:39:42+0000

The example at 8.2.3.5 should be "surprising" if you expect memory ordering to be all strict an clean, and even if you acknowledge that 8.2.3.4 allows loads to reorder with stores of different addresses.

   Processor 0      |      Processor 1
  --------------------------------------
   mov [x],1        |      mov [y],1
   mov R1, [x]      |      mov R3,[y]
   mov R2, [y]      |      mov R4,[x]

Note that the key part is that the newly added loads in the middle both return 1 (store-to-load forwarding makes that possible in the uarch without stalling). So in theory, you would expect that both stores have been "observed" globally by the time both these loads completed (that would have been the case with sequential consistency, where there is a unique ordering between stores and all cores see it).

However, having later R2 = R4 = 0 as a valid outcome proves this is not the case - the stores are in fact observed locally first. In other words, allowing this outcome means that processor 0 sees the stores as time(x) < time(y), while processor 1 sees the opposite.

This is a very important observation about the consistency of this memory model, which the previous example doesn't prove. This nuance is the biggest difference between Sequential Consistency and Total Store Ordering - the second example breaks SC, the first one doesn't.

Categories

concurrency - x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding

concurrency - x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags