TL;DR: no, CPU hardware is already optimized for one core storing and another core loading. There's no magic high-performance lower-latency method you can use instead. If the write side can force write-back to L3 somehow, that can reduce latency for the read side, but unfortunately there's no good way to do that (except on Tremont Atom, see below).
Shared last-level cache already backstops coherency traffic, avoiding write/re-read to DRAM.
Don't be fooled by MESI diagrams; those show single-level caches without a shared cache.
In real CPUs, stores from one core only have to write-back to last-level cache (LLC = L3 in modern x86) for loads from other cores to access them. L3 can hold dirty lines; all modern x86 CPUs have write-back L3 not write-through.
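For concreteness, here's a minimal sketch of the normal pattern (the names `shared_payload` and `ready` are made up for illustration): the writer just stores, the reader just loads, and the coherency hardware moves the line out of the writer's private caches when the reader asks for it.

```c
// Minimal sketch (C11 atomics): plain stores and loads are already the
// fast path; the cache-coherency protocol does the cache->cache transfer.
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t shared_payload;
static _Atomic int      ready;

void writer(uint64_t value) {
    atomic_store_explicit(&shared_payload, value, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_release);   // publish
}

uint64_t reader(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  // spin: the load misses in this core's L1d/L2, reaches L3,
           // and the line gets forwarded from the writer core's caches
    return atomic_load_explicit(&shared_payload, memory_order_relaxed);
}
```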
On a modern multi-socket system, each socket has its own memory controllers (NUMA), so snooping detects when cache->cache transfers need to happen over the interconnect between sockets. But yes, pinning both threads to cores in the same physical socket does improve inter-core / inter-thread latency. (Similarly for AMD Zen, where clusters of 4 cores share a chunk of LLC: within vs. across clusters matters for inter-core latency even within a single socket, because there isn't one big LLC shared across all cores.)
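If you want to control placement, a hedged sketch of pinning with Linux's pthread affinity API (the core numbers are placeholders; real numbering depends on your machine's topology, e.g. check `lscpu`):

```c
// Sketch of pinning two threads onto cores of the same socket / CCX
// (Linux, glibc; compile with -pthread).  Core numbers 0 and 1 are just
// placeholders; check the topology before hard-coding anything.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_core(pthread_t thread, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

// e.g. pin_to_core(writer_thread, 0); pin_to_core(reader_thread, 1);
```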
You can't do much better than this; a load on one core will generate a share request once it reaches L3 and finds the line is Modified in the private L1d or L2 of another core. This is why the latency is higher than an L3 hit: the load request has to get to L3 before it even knows it's not just going to be an L3 hit. But Intel uses its large shared inclusive L3 cache tags as a snoop filter, to track which core on the chip might have the line cached. (This changed in Skylake-Xeon; its L3 is no longer inclusive, not even tag-inclusive, so it must have some separate snoop filter.)
See also Which cache mapping technique is used in intel core i7 processor?
Fun fact: on Core 2 CPUs traffic between cores really was as slow as DRAM in some cases, even for cores that shared an L2 cache.
Early Core 2 Quad CPUs were really two dual-core dies in the same package, with no last-level cache shared between the dies. That might have been even worse; IDK if the "glue" logic could even do cache->cache transfers of dirty data between dies without a write-back to DRAM.
But those days are long past; modern multi-core and multi-socket CPUs are about as optimized as they can be for inter-core traffic.
You can't really do anything special on the read side to make anything faster.
If you had cldemote on the write side, or some other way to get data evicted back to L3, the read side could just get L3 hits. But that's only available on Tremont Atom.
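If you do have it, a hedged sketch of how the write side would use it (variable name is made up; needs `-mcldemote` with a recent GCC/Clang, and on CPUs without the feature the instruction executes as a NOP):

```c
// Sketch: after writing, hint the CPU to demote the dirty line toward the
// shared L3 so another core's later load can hit there instead of having
// to pull it out of this core's private L1d/L2.
#include <immintrin.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t shared_payload;

void writer_with_demote(uint64_t value) {
    atomic_store_explicit(&shared_payload, value, memory_order_release);
    _cldemote((void *)&shared_payload);   // hint only: may be ignored
}
```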
x86 MESI invalidate cache line latency issue is another question about trying to get the write side to evict cache lines back to L3, this one via conflict misses.
clwb would maybe work to reduce read-side latency, but the downside is that it forces a write-back all the way to DRAM, not just L3. (And on Skylake-Xeon it does evict the line, like clflushopt. Hopefully Ice Lake will give us a "real" clwb.)
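For comparison, a hedged clwb sketch (needs `-mclwb`; the variable name is again just for illustration):

```c
// Sketch: clwb forces a write-back of the dirty line.  That means traffic
// all the way to DRAM, and on Skylake-Xeon the line is also evicted (like
// clflushopt), so it's a blunt tool for the "make the reader hit in L3" goal.
#include <immintrin.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t shared_payload;

void writer_with_clwb(uint64_t value) {
    atomic_store_explicit(&shared_payload, value, memory_order_release);
    _mm_clwb((void *)&shared_payload);   // write back (and maybe evict) the line
    _mm_sfence();                        // order the write-back, if that matters
}
```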
How to force cpu core to flush store buffer in c? is another question about basically the same thing.