If you care about portable performance, you should ideally write your C++ source with the minimum necessary ordering for each operation. The only thing that really costs "extra" on x86 is `seq_cst` for a pure store, so make a point of avoiding that even for x86.
(`relaxed` ops can also allow more compile-time optimization of the surrounding non-atomic operations, e.g. CSE and dead-store elimination, because relaxed ops avoid a compiler barrier. If you don't need any ordering wrt. surrounding code, tell the compiler that fact so it can optimize.)
Keep in mind that you can't fully test weaker orders if you only have x86 hardware, especially atomic RMWs with only `acquire` or `release`, so in practice it's safer to leave your RMWs as `seq_cst` if you're doing anything that's already complicated and hard to reason about.
x86 asm naturally has `acquire` loads, `release` stores, and `seq_cst` RMW operations. Compile-time reordering is possible with weaker orders in the source, but after the compiler makes its choices, those are "nailed down" into x86 asm. (And stronger store orders require an `mfence` after `mov`, or using `xchg`. `seq_cst` loads don't actually have any extra cost, but it's more accurate to describe them as `acquire` because earlier stores can reorder past them, and all being acquire means they can't reorder with each other.)
There are very few use-cases where `seq_cst` is required (draining the store buffer before later loads can happen). Almost always a weaker order like acquire or release would also be safe.
There are artificial cases like https://preshing.com/20120515/memory-reordering-caught-in-the-act/, but even implementing locking generally only requires acquire and release ordering. (Of course, taking a lock does require an atomic RMW, so on x86 that might as well be `seq_cst`.) One practical use-case I came up with was having multiple threads set bits in an array: avoid atomic RMWs, and detect when one thread stepped on another by re-checking values that were recently stored. You have to wait until your stores are globally visible before you can safely reload them to check.
As such, `relaxed`, `acquire` and `release` seem to be the only orderings required on x86.
From one POV, in C++ source you don't require any ordering weaker than `seq_cst` (except for performance); that's why it's the default for all `std::atomic` functions. Remember you're writing C++, not x86 asm.
Or if you mean to describe the full range of what x86 asm can do, then it's acq for loads, rel for pure stores, and seq_cst for atomic RMWs. (The `lock` prefix is a full barrier; `fetch_add(1, relaxed)` compiles to the same asm as `seq_cst`.) x86 asm can't do a relaxed load or store¹.
The only benefit to using `relaxed` in C++ (when compiling for x86) is to allow more optimization of surrounding non-atomic operations by reordering at compile time, e.g. to allow optimizations like store coalescing and dead-store elimination. Always remember that you're not writing x86 asm; the C++ memory model applies for compile-time ordering / optimization decisions.
`acq_rel` and `seq_cst` are nearly identical for atomic RMW operations in ISO C++, and I think there's no difference when compiling for ISAs like x86 and ARMv8 that are multi-copy-atomic. (No IRIW reordering like e.g. POWER can do by store-forwarding between SMT threads before a store commits to L1d.) See: How do memory_order_seq_cst and memory_order_acq_rel differ?
For barriers, `atomic_thread_fence(mo_acq_rel)` compiles to zero instructions on x86, while `fence(seq_cst)` compiles to `mfence` or a faster equivalent (e.g. a dummy `lock`ed instruction on some stack memory). See: When is a memory_order_seq_cst fence useful?
You could say `acq_rel` and `consume` are truly useless if you're only compiling for x86. `consume` was intended to expose the dependency ordering that most weakly-ordered ISAs do (notably not DEC Alpha). But unfortunately it was designed in a way that compilers couldn't implement safely, so they currently just give up and promote it to `acquire`, which costs a barrier on some weakly-ordered ISAs. On x86, though, `acquire` is "free", so it's fine.
If you actually do need efficient consume, e.g. for RCU, your only real option is to use `relaxed` and not give the compiler enough information to optimize away the data dependency from the asm it makes. See: C++11: the difference between memory_order_relaxed and memory_order_consume.
Footnote 1: I'm not counting `movnt` as a relaxed atomic store, because the usual C++ -> asm mapping for release operations uses just a `mov` store, not `sfence`, and thus would not order an NT store. i.e. `std::atomic` leaves it up to you to use `_mm_sfence()` if you'd been messing around with `_mm_stream_ps()` stores.
PS: this entire answer is assuming normal WB (write-back) cacheable memory regions. If you just use C++ normally under a mainstream OS, all your memory allocations will be WB, not weakly-ordered WC or strongly-ordered uncacheable UC or anything else. (In fact, even if you wanted a WC mapping of a page, most OSes don't have an API for that.) And `std::atomic` release stores would be broken on WC memory, weakly ordered like NT stores.