As far as I know, all levels of cache (L1/L2/L3) on modern x86_64 CPUs are virtually indexed, physically tagged, and all cores communicate through the last-level cache (L3) using a cache-coherence protocol (MOESI/MESIF) over QPI/HyperTransport.
For example, a Sandy Bridge family CPU has a 4- to 16-way L3 cache and a 4 KB page size. This allows concurrent processes running on different cores to exchange data via shared memory, because the L3 cache cannot hold the same physical memory region twice: once as a page of process 1 and once as a page of process 2.
Does this mean that every time process 1 accesses the shared memory region, process 2 must flush its cache lines for that page to RAM, after which process 1 loads the same region back as cache lines in its own virtual address space? That would be really slow, or does the processor use some optimization?
Do modern x86_64 CPUs use the same cache lines, without any flushes, to communicate between two processes with different virtual address spaces via shared memory?
Intel Sandy Bridge CPU, L3 cache:
We have 7 missing bits [18:12], i.e. we would need to check 2^7 sets × 16 ways = 2048 cache lines. This is the same as a 2048-way cache, which would be very slow. Does this mean that the L3 cache is physically indexed, physically tagged?
Summary of missing index bits in the virtual address (page size 4 KB = 12 offset bits):
- L3 (8 MB = 64 B x 128 K lines), 16-way, 8 K sets, 13 index bits [18:6] - missing 7 bits [18:12]
- L2 (256 KB = 64 B x 4 K lines), 8-way, 512 sets, 9 index bits [14:6] - missing 3 bits [14:12]
- L1 (32 KB = 64 B x 512 lines), 8-way, 64 sets, 6 index bits [11:6] - no missing bits
It should be:
- L3 / L2: physically indexed, physically tagged (used after the TLB lookup)
- L1: virtually indexed, physically tagged