Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

caching - Does Linux use x86 CPU's PCID feature for TLB? If not, why?

I wrote a kernel module to check CR4.PCIDE, and it is not set. Why doesn't Linux use this feature to reduce the slowdown caused by TLB invalidation and cache pollution?
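For readers who want to try a related check without writing a kernel module, here is a minimal user-space sketch in Python. It rests on two assumptions: CR4.PCIDE is bit 17 of CR4 (as documented in the Intel SDM), and Linux lists `pcid` in the `flags` line of `/proc/cpuinfo` when the CPU advertises the feature. CR4 itself cannot be read from user space, so `cr4_pcide_enabled` only decodes a raw value obtained some other way (for example, printed by a module like the one described). Both function names are made up for illustration.

```python
CR4_PCIDE_BIT = 17  # assumed bit position of CR4.PCIDE, per the Intel SDM


def cr4_pcide_enabled(cr4_value: int) -> bool:
    """Return True if the PCIDE bit is set in a raw CR4 value."""
    return bool((cr4_value >> CR4_PCIDE_BIT) & 1)


def cpu_advertises_pcid(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists 'pcid'.

    This reports CPU support only; it says nothing about whether the
    running kernel actually set CR4.PCIDE.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match the flag as a whole word so 'invpcid' does not count as 'pcid'.
        if sep and key.strip() == "flags" and "pcid" in value.split():
            return True
    return False
```

On a Linux machine, `cpu_advertises_pcid(open("/proc/cpuinfo").read())` gives the hardware side of the answer; whether the kernel enabled the feature still has to be read from CR4 in kernel mode.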

1 Answer

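Update: This changed around the 4.14/4.15 timeframe. PCID support was merged in Linux 4.14, and it became much more important with the page-table isolation work in 4.15 that followed the Meltdown and Spectre attacks disclosed in late 2017 and early 2018. See the other answer for details.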

Note: I'm not a Linux developer.

Intel's "Process Context Identifiers" (PCIDs) are 12 bits wide, so there's a limit of 4096 IDs. This means that when there are more than 4096 processes, the kernel has to manage the IDs (e.g. maybe do a "least recently used" thing, so that if a process that currently doesn't have an ID needs to be executed, an ID is taken from some other process and reused).
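The "least recently used" idea can be sketched as a small allocator. This is an illustrative model only, not how Linux or any other kernel does it: the class name `PcidAllocator`, its methods, and the single global pool are all assumptions (real TLBs are per CPU, so a real design would likely track IDs per CPU). The sketch also reports when a recycled ID needs its stale TLB entries flushed before the new owner runs.

```python
from collections import OrderedDict

NUM_PCIDS = 4096  # PCIDs are 12 bits wide


class PcidAllocator:
    """Hypothetical LRU allocator handing out PCIDs to process IDs."""

    def __init__(self, num_ids: int = NUM_PCIDS):
        # Reversed so that pop() hands out ID 0 first.
        self._free = list(range(num_ids - 1, -1, -1))
        # Maps owner -> PCID, ordered from least to most recently used.
        self._owner_to_id = OrderedDict()
        # IDs that have been used before and may have stale TLB entries.
        self._ever_used = set()

    def acquire(self, pid):
        """Return (pcid, needs_flush) for the process about to run."""
        if pid in self._owner_to_id:
            # Process still owns an ID: its cached translations are valid.
            self._owner_to_id.move_to_end(pid)
            return self._owner_to_id[pid], False
        if self._free:
            pcid = self._free.pop()
        else:
            # No free ID: steal one from the least recently used owner.
            _victim, pcid = self._owner_to_id.popitem(last=False)
        # A previously used ID may still tag another process's entries.
        needs_flush = pcid in self._ever_used
        self._ever_used.add(pcid)
        self._owner_to_id[pid] = pcid
        return pcid, needs_flush

    def release(self, pid):
        """Give the ID back when a process exits."""
        pcid = self._owner_to_id.pop(pid, None)
        if pcid is not None:
            self._free.append(pcid)
```

With a pool of only two IDs, a third process forces a steal: `PcidAllocator(2)` hands IDs 0 and 1 to the first two callers, and the next new caller gets the least recently used ID back together with `needs_flush=True`. That flush is part of the "ID management overhead" the answer refers to.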

The other thing that comes into it is "TLB shootdown" on multi-CPU systems. Shootdowns can be a little expensive, so people do tricks to avoid them. For example, if a process only has one thread, then it can only be running on one CPU, and you know there's no need to send an IPI to other CPUs (interrupting them and asking them to do the "TLB shootdown"). Once you start using PCIDs, you can't be sure that other CPUs don't still have TLB entries for that process, so you can't use these tricks to avoid "TLB shootdown". It also means that (in theory, for badly implemented PCID support) the performance you gain from PCID may be less than the performance you lose to the extra TLB shootdowns and ID management overhead, resulting in a net loss.
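To make the shootdown argument concrete, here is a toy model of which CPUs would need an IPI after a mapping changes. It is a sketch under simplifying assumptions, not a description of any real kernel: without PCID, switching away from a process is assumed to flush its TLB entries, so only CPUs currently running it can hold them; with PCID, entries are assumed to survive the switch, so every CPU that ran the process since its last flush is a candidate. The function name and parameters are hypothetical.

```python
def shootdown_targets(current_cpu, running_on, ever_ran_on, pcid_enabled):
    """Return the set of other CPUs that need an IPI after a mapping change.

    current_cpu:  CPU making the change (it flushes locally, no IPI needed).
    running_on:   CPUs currently executing this address space.
    ever_ran_on:  CPUs that ran it since they last flushed its entries.
    pcid_enabled: whether TLB entries survive an address-space switch.
    """
    # With PCID, stale entries may linger on CPUs that have moved on
    # to other processes, so the candidate set is larger.
    candidates = ever_ran_on if pcid_enabled else running_on
    return set(candidates) - {current_cpu}
```

For a single-threaded process now on CPU 0 that earlier ran on CPU 2, the model yields no IPIs without PCID and one IPI (to CPU 2) with PCID. A real implementation could avoid some of these IPIs by deferring the flush until the stale ID is next used, which is the kind of design work the answer alludes to.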

Mostly what I'm saying is that it's a little complicated to add support for PCID (it's not like you can just set a flag in CR4 and forget about it). You'd have to do some research (experiments, prototypes, benchmarking) to determine the most effective way of implementing it. For a large/complex/old kernel (like Linux) it'd be even more complicated as you'd have to be careful not to upset something else by accident. The other thing is that this feature is relatively new (it's only existed for a few years if I remember correctly) and isn't supported by a lot of CPUs (e.g. anything a little older, and anything from AMD).

Basically, I'd assume that it comes down to "time vs. benefits" (or, not enough time for a small performance improvement on a limited number of CPUs).

