c++ - Acquire/release semantics with non-temporal stores on x64

Question

Welcome To Ask or Share your Answers For Others

c++ - Acquire/release semantics with non-temporal stores on x64

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - Acquire/release semantics with non-temporal stores on x64

I have something like:

if (f = acquire_load() == ) {
   ... use Foo
}

and:

auto f = new Foo();
release_store(f)

You could easily imagine an implementation of acquire_load and release_store that uses atomic with load(memory_order_acquire) and store(memory_order_release). But now what if release_store is implemented with _mm_stream_si64, a non-temporal write, which is not ordered with respect to other stores on x64? How to get the same semantics?

I think the following is the minimum required:

atomic<Foo*> gFoo;

Foo* acquire_load() {
    return gFoo.load(memory_order_relaxed);
}

void release_store(Foo* f) {
   _mm_stream_si64(*(Foo**)&gFoo, f);
}

And use it as so:

// thread 1
if (f = acquire_load() == ) {
   _mm_lfence(); 
   ... use Foo
}

and:

// thread 2
auto f = new Foo();
_mm_sfence(); // ensures Foo is constructed by the time f is published to gFoo
release_store(f)

Is that correct? I'm pretty sure the sfence is absolutely required here. But what about the lfence? Is it required or would a simple compiler barrier be enough for x64? e.g. asm volatile("": : :"memory"). According the the x86 memory model, loads are not re-ordered with other loads. So to my understanding, acquire_load() must happen before any load inside the if statement, as long as there's a compiler barrier.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:07:04+0000

I might be wrong about some things in this answer (proof-reading welcome from people that know this stuff!). It's based on reading the docs and Jeff Preshing's blog, not actual recent experience or testing.

Linus Torvalds strongly recommends against trying to invent your own locking, because it's so easy to get it wrong. It's more of an issue when writing portable code for the Linux kernel, rather than something that's x86-only, so I feel brave enough to try to sort things out for x86.

The normal way to use NT stores is to do a bunch of them in a row, like as part of a memset or memcpy, then an SFENCE, then a normal release store to a shared flag variable: done_flag.store(1, std::memory_order_release).

Using a movnti store to the synchronization variable will hurt performance. You might want to use NT stores into the Foo it points to, but evicting the pointer itself from cache is perverse. (movnt stores evict the cache line if it was in cache to start with; see vol1 ch 10.4.6.2 Caching of Temporal vs. Non-Temporal Data).

The whole point of NT stores is for use with Non-Temporal data, which won't be used again (by any thread) for a long time if ever. The locks that control access to shared buffers, or the flags that producers/consumers use to mark data as read, are expected to be read by other cores.

Your function names also don't really reflect what you're doing.

x86 hardware is extremely heavily optimized for doing normal (not NT) release-stores, because every normal store is a release-store. The hardware has to be good at it for x86 to run fast.

Using normal stores/loads only requires a trip to L3 cache, not to DRAM, for communication between threads on Intel CPUs. Intel's large inclusive L3 cache works as a backstop for cache-coherency traffic. Probing the L3 tags on a miss from one core will detect the fact that another core has the cache line in the Modified or Exclusive state. NT stores would require synchronization variables to go all the way out to DRAM and back for another core to see it.

Memory ordering for NT streaming stores

movnt stores can be reordered with other stores, but not with older reads.

Intel's x86 manual vol3, chapter 8.2.2 (Memory Ordering in P6 and More Recent Processor Families):

Reads are not reordered with other reads.

Writes are not reordered with older reads. (note the lack of exceptions).

Writes to memory are not reordered with other writes, with the following exceptions:

streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and

string operations (see Section 8.2.4.1). (note: From my reading of the docs, fast string and ERMSB ops still implicitly have a StoreStore barrier at the start/end. There's only potential reordering between the stores within a single rep movs or rep stos.)

... stuff about clflushopt and the fence instructions

update: There's also a note (in 8.1.2.2 Software Controlled Bus Locking) that says:

Do not implement semaphores using the WC memory type. Do not perform non-temporal stores to a cache line containing a location used to implement a semaphore.

This may just be a performance suggestion; they don't explain whether it can cause a correctness problem. Note that NT stores are not cache-coherent, though (data can sit in the line fill buffer even if conflicting data for the same line is present somewhere else in the system, or in memory). Maybe you could safely use NT stores as a release-store that synchronizes with regular loads, but would run into problems with atomic RMW ops like lock add dword [mem], 1.

Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.

To block reordering with earlier stores, we need an SFENCE instruction, which is a StoreStore barrier even for NT stores. (And is also a barrier to some kinds of compile-time reordering, but I'm not sure if it blocks earlier loads from crossing the barrier.) Normal stores don't need any kind of barrier instruction to be release-stores, so you only need SFENCE when using NT stores.

For loads: The x86 memory model for WB (write-back, i.e. "normal") memory already prevents LoadStore reordering even for weakly-ordered stores, so we don't need an LFENCE for its LoadStore barrier effect, only a LoadStore compiler barrier before the NT store. In gcc's implementation at least, std::atomic_signal_fence(std::memory_order_release) is a compiler-barrier even for non-atomic loads/stores, but atomic_thread_fence is only a barrier for atomic<> loads/stores (including mo_relaxed). Using an atomic_thread_fence still allows the compiler more freedom to reorder loads/stores to non-shared variables. See this Q&A for more.

// The function can't be called release_store unless it actually is one (i.e. includes all necessary barriers)
// Your original function should be called relaxed_store
void NT_release_store(const Foo* f) {
   // _mm_lfence();  // make sure all reads from the locked region are already globally visible.  Not needed: this is already guaranteed
   std::atomic_thread_fence(std::memory_order_release);  // no insns emitted on x86 (since it assumes no NT stores), but still a compiler barrier for earlier atomic<> ops
   _mm_sfence();  // make sure all writes to the locked region are already globally visible, and don't reorder with the NT store
   _mm_stream_si64((long long int*)&gFoo, (int64_t)f);
}

This stores to the atomic variable (note the lack of dereferencing &gFoo). Your function stores to the Foo it points to, which is super weird; IDK what the point of that was. Also note that it compiles as valid C++11 code.

When thinking about what a release-store means, think about it as the store that releases the lock on a shared data structure. In your case, when the release-store becomes globally visible, any thread that sees it should be able to safely dereference it.

To do an acquire-load, just tell the compiler you want one.

x86 doesn't need any barrier instructions, but specifying mo_acquire instead of mo_relaxed gives you the necessary compiler-barrier. As a bonus, this function is portable: you'll get any and all necessary barriers on other architectures:

Foo* acquire_load() {
    return gFoo.load(std::memory_order_acquire);
}

You didn't say anything about storing gFoo in weakly-ordered WC (uncacheable write-combining) memory. It's probably really hard to arrange for your program's data segment to be mapped into WC memory... It would be a lot easier for gFoo to simply point to WC memory, after you mmap some WC video RAM or something. But if you want acquire-loads from WC memory, you probably do need LFENCE. IDK. Ask another question about that, because this answer mostly assumes you're using WB memory.

Note that using a pointer instead of a flag creates a data dependency. I think you should be able to use gFoo.load(std::memory_order_consume), which doesn't require barriers even on weakly-ordered CPUs (other than Alpha). Once compilers are sufficiently advanced to make sure they don't break the data dependency, they can actually make better code (instead of promoting mo_consume to mo_acquire. Read up on this before using mo_consume in production code, and esp. be careful to note that testing it properly is impossible because future compilers are expected to give weaker guarantees than current compilers in practice do.

Initially I was thinking that we did need LFENCE to get a LoadStore barrier. ("Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions". This in turn prevents them from passing (becoming globally visible before) reads that are before the LFENCE).

Note that LFENCE + SFENCE is still weaker than a full MFENCE, because it's not a StoreLoad barrier. SFENCE's own documentation says it's ordered wrt. LFENCE, but that table of the x86 memory model from Intel manual vol3 doesn't mention that. If SFENCE can't execute until after an LFENCE, then sfence / lfence might actually be a slower equivalent to mfence, but lfence / sfence / movnti would give release semantics without a full barrier. Note that the NT store could become globally visible after some following loads/stores, unlike a normal strongly-ordered x86 store.)

Related: NT loads

In x86, every load has acqui

Categories

c++ - Acquire/release semantics with non-temporal stores on x64

c++ - Acquire/release semantics with non-temporal stores on x64

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Memory ordering for NT streaming stores

To do an acquire-load, just tell the compiler you want one.

Related: NT loads

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags