computer science - What cache coherence solution do modern x86 CPUs use?

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am somewhat confused about how cache coherence systems function in modern multi-core CPUs. I have seen that snooping-based protocols like MESIF/MOESI have been used in Intel and AMD processors. On the other hand, directory-based protocols seem to be a lot more efficient with multiple cores, as they don't broadcast but send messages to specific nodes.

What is the modern cache coherence solution in AMD or Intel processors: is it snooping-based protocols like MOESI and MESIF, only directory-based protocols, or a combination of both (snooping-based protocols for communication between elements inside the same node, and directory-based for node-to-node communication)?

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:23:24+0000

MESI is defined in terms of snooping a shared bus, but no, modern CPUs don't actually work that way. MESI states for each cache line can be tracked / updated with messages and a snoop filter (basically a directory) to avoid broadcasting those messages, which is what Intel (MESIF) and AMD (MOESI) actually do.
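To make the difference concrete, here is a toy sketch of how many cores get probed for one read miss with bus-style broadcast versus a snoop filter. The function names and the dict-of-sets filter are made up for illustration; real hardware tracks this in cache tags, not in anything resembling this structure.

```python
# Toy model: which cores get a snoop message for one read miss.
# All names are illustrative assumptions, not a real hardware interface.

def broadcast_snoops(n_cores, requester):
    """Bus-style snooping: every other core is probed."""
    return [c for c in range(n_cores) if c != requester]

def filtered_snoops(snoop_filter, line_addr, requester):
    """Directory-style: probe only the cores the filter lists as holders."""
    holders = snoop_filter.get(line_addr, set())
    return sorted(holders - {requester})

snoop_filter = {0x1000: {1}}   # line 0x1000 is cached only by core 1

print(len(broadcast_snoops(8, requester=0)))      # 7 probes
print(filtered_snoops(snoop_filter, 0x1000, 0))   # [1]
print(filtered_snoops(snoop_filter, 0x2000, 0))   # [] (nobody to ask)
```

With 8 cores, the broadcast version probes 7 cores per miss regardless of who holds the line, while the filtered version sends at most one message here.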

e.g. the shared inclusive L3 cache in Intel CPUs (before Skylake server) lets L3 tags act as a snoop filter; as well as tracking the MESI state, they also record which core # (if any) has a private copy of a line. See also: Which cache mapping technique is used in intel core i7 processor?

For example, consider a Sandybridge-family CPU with a ring bus (modern client chips, server chips up to Broadwell). Core #0 reads a line that is in Modified state on core #1.

The read misses in L1d and L2 cache on core #0, resulting in it sending a request on the ring bus to the L3 slice that contains that line (indexing via a hash function on some physical address bits).
That slice of L3 gets the message and checks its tags. If it finds tag = Shared at this point, the response can go back over the bidirectional ring bus with the data.
Otherwise, L3 tags tell it that core #1 has exclusive ownership of a line: Exclusive, may have been promoted to Modified = dirty.
L3 cache logic in that slice of L3 will generate a message to ask core #1 to write back that line.
The message arrives at the ring bus stop for core #1, and gets its L2 or L1d to write back that line.
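The steps above can be sketched as a tiny message-passing simulation. This is as hand-wavy as the description itself: `L3Tag`, `read_miss`, and the message strings are invented for illustration, and the model assumes the L3 tag only distinguishes "L3 copy is valid" (`"S"`) from "some core has exclusive ownership" (`"E"`).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class L3Tag:
    state: str                   # "S": L3 copy is valid; "E": a core owns the line
    owner: Optional[int] = None  # core number with a private copy, if any

def read_miss(l3_tags, private, requester, addr):
    """Return the messages for a load that already missed in L1d and L2."""
    log = [f"core{requester} -> L3 slice: read {addr:#x}"]
    tag = l3_tags[addr]
    if tag.state == "E" and tag.owner != requester:
        owner = tag.owner
        log.append(f"L3 slice -> core{owner}: write back {addr:#x}")
        held = private[owner].get(addr)   # "M", "E", or None if dropped
        if held is not None:
            private[owner][addr] = "S"    # owner keeps a shared copy
        reply = "dirty data" if held == "M" else "clean, no data"
        log.append(f"core{owner} -> L3 slice: {reply}")
        tag.state, tag.owner = "S", None
    log.append(f"L3 slice -> core{requester}: data {addr:#x}")
    private[requester][addr] = "S"
    return log

# Core 1 holds line 0x40 Modified; core 0 reads it.
l3_tags = {0x40: L3Tag("E", owner=1)}
private = {0: {}, 1: {0x40: "M"}}
for msg in read_miss(l3_tags, private, requester=0, addr=0x40):
    print(msg)
```

After the call, both cores hold the line in Shared state and the L3 tag no longer records an exclusive owner.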

IDK if one ring bus message can be read directly by Core #0 as well as the relevant slice of L3 cache, or if the message might have to go all the way to the L3 slice and then to core #0 from there. (Worst case distance = basically all the way around the ring, instead of half, for a bidirectional ring.)
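A quick sketch of the hop-count arithmetic, assuming ring stops are simply numbered 0..n-1 around the ring (a made-up convention; real stop layouts differ between chips):

```python
def ring_hops(src, dst, n_stops, bidirectional=True):
    """Hops between two ring stops, taking the shorter way if allowed."""
    forward = (dst - src) % n_stops
    if not bidirectional:
        return forward
    return min(forward, n_stops - forward)

n = 8
# Worst case for a single message on an 8-stop ring:
print(max(ring_hops(0, d, n) for d in range(n)))                       # 4
print(max(ring_hops(0, d, n, bidirectional=False) for d in range(n)))  # 7

# Write-back from core #1 (stop 1) needed by core #0 (stop 0),
# with the owning L3 slice at stop 5 (hypothetical placement):
via_slice = ring_hops(1, 5, n) + ring_hops(5, 0, n)
direct = ring_hops(1, 0, n)
print(via_slice, direct)   # 7 1
```

So if the data has to bounce through the L3 slice, two short trips can add up to nearly a full lap even on a bidirectional ring.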

This is super hand-wavy; do not take my word for it on the exact details, but the general concept of sending messages like share-request, RFO, or write-back is the right mental model. BeeOnRope has an answer with a similar breakdown into steps that covers uops and the store buffer, as well as MESI / RFO.

In a similar case, core #1 could have silently dropped the line without having modified it, if it had only gotten Exclusive ownership but never written it. (Loads that miss in cache default to loading into Exclusive state so a separate store won't have to do an RFO for the same line.) In that case I assume the core that doesn't have the line after all has to send a message back to indicate that. Or maybe it sends a message directly to one of the memory controllers that are also on the ring bus, instead of a round trip back to the L3 slice to force it to do that.
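The Exclusive-state behaviour can be written down as a few transition rules. This is plain textbook MESI from the point of view of one private cache, simplified; the function names are mine, and it ignores the F and O states that Intel and AMD add.

```python
def load_miss_fill_state(other_sharers):
    """State a line is filled in after a load miss (simplified MESI)."""
    return "S" if other_sharers else "E"

def store(state):
    """Return (new_state, needs_rfo) for a store to a line in `state`."""
    if state in ("M", "E"):
        return "M", False   # already exclusive: silent E -> M upgrade
    return "M", True        # S or I: must send a Read For Ownership first

def evict_needs_writeback(state):
    """Only dirty lines carry data that must be written back."""
    return state == "M"     # clean E/S lines can be dropped silently

state = load_miss_fill_state(other_sharers=False)
print(state)                         # E
print(evict_needs_writeback(state))  # False: can vanish without a trace
print(store(state))                  # ('M', False): no RFO needed
print(store("S"))                    # ('M', True)
```

The second line of output is why the L3 tags can be stale: the owner recorded there may have dropped a clean Exclusive line without telling anyone.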

Obviously stuff like this can be happening in parallel for every core. (And each core can have multiple outstanding requests it's waiting for: memory level parallelism within a single core. On Intel, L2 superqueue has 16 entries on some microarchitectures, while there are 10 or 12 L1 LFBs.)
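The number of outstanding requests also bounds single-core bandwidth, roughly bandwidth = max-concurrency / latency. A back-of-the-envelope sketch, using 10 LFBs and 64-byte lines; the latency figures are made-up round numbers, not measurements of any specific CPU:

```python
LINE_BYTES = 64  # x86 cache line size

def max_bandwidth_gb_s(outstanding_lines, latency_ns):
    """Little's-law style bound: bytes in flight divided by latency."""
    return outstanding_lines * LINE_BYTES / latency_ns   # bytes/ns == GB/s

# Hypothetical latencies: same concurrency, different memory latency.
print(max_bandwidth_gb_s(10, 80.0))              # 8.0
print(round(max_bandwidth_gb_s(10, 60.0), 2))    # 10.67
```

With the concurrency fixed by the number of buffers, any extra latency (more ring hops, for example) directly lowers the bandwidth one core can sustain.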

Quad-socket and higher systems have snoop filters between sockets; dual-socket Intel systems with E5-xxxx CPUs of Broadwell and earlier just spammed snoops to each other over the QPI links (unless you used a quad-socket-capable CPU (E7-xxxx) in a dual-socket system). Multi-socket is hard because missing in local L3 doesn't necessarily mean it's time to hit DRAM; another socket might have the line modified.

Also related:

https://www.realworldtech.com/sandy-bridge/ Kanter's SnB write-up covers some about Intel's ring bus design, IIRC, although it's mostly about the internals of each core. The shared inclusive L3 was new in Nehalem (when Intel started using the "core i7" brand name), https://www.realworldtech.com/nehalem/
Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? - more hops on the ring bus for Intel CPUs with more cores hurts L3 and DRAM latency and therefore bandwidth = max-concurrency / latency.
What is the benefit of the MOESI cache coherency protocol over MESI? some more links.
