Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
206 views
in Technique[技术] by (71.8m points)

c++ - Does STLR(B) provide sequential consistency on ARM64?

For stores performed on an object of an atomic data type (say, std::atomic<uint8_t>), GCC generates:

  • MOV instruction in case of release-store (std::memory_order_release),
  • XCHG instruction in case of sequential-consistent-store (std::memory_order_seq_cst).

when target architecture is x86_64. However, when it is ARM64 (AArch64), in both cases, GCC generates the same instruction, namely STLRB. There are no other instructions generated (such as a memory barrier) and the same happens with Clang as well. Does it mean that this instruction, described as having the store-relase semantics, actually provides sequential consistency as well?

For instance, if two threads running on two cores will perform stores with STLRB to different memory locations, is the order of these two stores unique? Such that all other threads are guaranteed to observe the same order?

I am asking since, according to this answer, with acquire-loads, different threads may observe different order of release-stores. To observe the same order, sequential consistency is needed instead.

Live demo: https://godbolt.org/z/hajMKnd53

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Yup, stlr is store-release on its own, and ldar can't pass an earlier stlr (i.e. no StoreLoad reordering) - that interaction between them satisfies that part of the seq_cst requirements which acq / rel doesn't have. (ARMv8.3 ldapr is like ldar without that interaction, being only a plain acquire load, allowing more efficient acq_rel.)

So on ARMv8.3, the difference between seq_cst and acq / rel is in the load side. ARMv8 before 8.3 can't do acq / rel while still allowing StoreLoad reordering, so it's unfortunately slow if you acquire-load something else after a release-store. ARMv8.3 fixes that, making acq / rel as efficient as on x86.

On x86, everything is an acquire load or release store (so acq_rel is free), and the least-bad way to achieve sequential consistency is by doing a full barrier on seq_cst stores. (You want atomic loads to be cheap, and it's going to be common for code to use the default seq_cst memory order.)

(C/C++11 mappings to processors discusses that tradeoff of wanting cheap loads, if you have to pick either load or store to attach the full barrier to.)


Separately, the IRIW litmus test (all threads agreeing on the order of independent stores) is guaranteed by the ARMv8 memory model even for release stores. It's guaranteed to be "multicopy-atomic", which means that when a store becomes visible to any other core, it becomes visible to all other cores at the same time. This is sufficient for all cores to agree on a total order for all stores, to the limits of anything they can observe via two acquire loads.

In practical terms, that means stores only become visible by committing to L1d cache, which is coherent. Not for example by store-forwarding between logical cores sharing a physical core, the mechanism for IRIW reordering on the few POWER CPUs that can produce the effect in real life. ARMv8 originally allowed that on paper, but no ARM CPUs ever did. They strengthened the memory model to simply guarantee that no future CPU would be weird like that. See Simplifying ARM Concurrency: Multicopy-Atomic Axiomatic and Operational Models for ARMv8 for details.

Note that this guarantee of all threads being able to agree on an order applies to all stores on ARM64, including relaxed. (There are very few HW mechanisms that can create it, in a machine with coherent shared memory, so it's only on rare ISAs that seq_cst has to actually do anything specific to prevent it.)

x86's TSO (Total Store Order) memory model has the required property right in the name. And yes, it's much stronger, basically program-order plus a store-buffer with store-forwarding. (So this allows StoreLoad reordering, and for a core to see its own stores before they're globally visible, but nothing else. Ignoring NT stores, and NT loads from WC memory such as video RAM...)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...