c - CPU cache inhibition

Question

Welcome To Ask or Share your Answers For Others

c - CPU cache inhibition

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

c - CPU cache inhibition

Say I have the defacto standard x86 CPU with 3 level of Caches, L1/L2 private, and L3 shared among cores. Is there a way to allocate shared memory whose data will not be cached on L1/L2 private caches, but rather it will only be cached at L3? I don't want to fetch data from memory (that's too costly), but I'd like to experiment with performance with and without bringing the shared data into private caches.

The assumption is that L3 is shared among the cores (presumably physically indexed cache) and thus will not incur any false sharing or cache line invalidation for heavily used shared data.

Any solution (if it exists) would have to be done programmatically, using C and/or assembly for intel based CPUs (relatively modern Xeon architectures (skylake, broadwell), running linux based OS.

Edit:

I have latency sensitive code which uses a form of shared memory for synchronization. The data will be in L3, but when read or written to it will go into L1/L2 depending on cache inclusivity policy. By implication of the problem, the data will have to be invalidated adding an unnecessary (I think) performance hit. I'd like to see if it's possible to just store the data, either through some page policy or special instructions only in L3.

I know it's possible to use the special memory register to inhibit caching for security reasons, but that requires CPL0 privilege.

Edit2:

I'm dealing with parallel codes that run on high performance systems for months at at time. The systems are high core-count systems (eg. 40-160+ cores) that periodically perform synchronization which needs to execute in usecs.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-17T03:08:51+0000

x86 has no way to do a store that bypasses or writes through L1D/L2 but not L3. There are NT stores which bypass all cache. Anything that forces a write-back to L3 also forces write-back all the way to memory. (e.g. a clwb instruction). Those are designed for non-volatile RAM use cases, or for non-coherent DMA, where it's important to get data committed to actual RAM.

There's also no way to do a load that bypasses L1D (except from USWC memory with SSE4.1 movntdqa, but it's not "special" on other memory types). prefetchNTA can bypass L2, according to Intel's optimization manual.

Prefetch on the core doing the read should be useful to trigger write-back from other core into L3, and transfer into your own L1D. But that's only useful if you have the address ready before you want to do the load. (Dozens of cycles for it to be useful.)

Intel CPUs use a shared inclusive L3 cache as a backstop for on-chip cache coherency. 2-socket has to snoop the other socket, but Xeons that support more than 2P have snoop filters to track cache lines that move around.

When you read a line that was recently written by another core, it's always Invalid in your L1D. L3 is tag-inclusive, and its tags have extra info to track which core has the line. (This is true even if the line is in M state in an L1D somewhere, which requires it to be Invalid in L3, according to normal MESI.) Thus, after your cache-miss checks L3 tags, it triggers a request to the L1 that has the line to write it back to L3 cache (and maybe to send it directly to the core than wants it).

Skylake-X (Skylake-AVX512) doesn't have an inclusive L3 (It has a bigger private L2 and a smaller L3), but it still has a tag-inclusive structure to track which core has a line. It also uses a mesh instead of ring, and L3 latency seems to be significantly worse than Broadwell.

Possibly useful: map the latency-critical part of your shared memory region with a write-through cache policy. IDK if this patch ever made it into the mainline Linux kernel, but see this patch from HP: Support Write-Through mapping on x86. (The normal policy is WB.)

Also related: Main Memory and Cache Performance of Intel Sandy Bridge and AMD Bulldozer, an in-depth look at latency and bandwidth on 2-socket SnB, for cache lines in different starting states.

For more about memory bandwidth on Intel CPUs, see Enhanced REP MOVSB for memcpy, especially the Latency Bound Platforms section. (Having only 10 LFBs limits single-core bandwidth).

Related: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? has some experimental results for having one thread spam writes to a location while another thread reads it.

Note that the cache miss itself isn't the only effect. You also get a lot of machine_clears.memory_ordering from mis-speculation in the core doing the load. (x86's memory model is strongly ordered, but real CPUs speculatively load early and abort in the rare case where the cache line becomes invalid before the load was supposed to have "happened".

Categories

c - CPU cache inhibition

c - CPU cache inhibition

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags