Just imagine how the processor would execute a typical spin-wait loop:
1 Spin_Lock:
2 CMP lockvar, 0 ; Check if lock is free
3 JE Get_Lock
4 JMP Spin_Lock
5 Get_Lock:
After a few iterations the branch predictor will predict that the conditional branch (3) will never be taken, and the pipeline will fill with CMP instructions (2). This goes on until another processor finally writes a zero to lockvar. At this point the pipeline is full of speculative (i.e. not yet committed) CMP instructions, some of which have already read lockvar and reported an (incorrect) nonzero result to the following conditional branch (3), which is also speculative. This is when the memory order violation happens. Whenever the processor "sees" an external write (a write from another processor), it searches its pipeline for instructions which speculatively accessed the same memory location and have not yet committed. If any such instructions are found, the speculative state of the processor is invalid and is erased with a pipeline flush.
Unfortunately this scenario will (very likely) repeat each time a processor is waiting on a spin-lock and make these locks much slower than they ought to be.
Enter the PAUSE instruction:
1 Spin_Lock:
2 CMP lockvar, 0 ; Check if lock is free
3 JE Get_Lock
4 PAUSE ; Wait for memory pipeline to become empty
5 JMP Spin_Lock
6 Get_Lock:
The PAUSE instruction will "de-pipeline" the memory reads, so that the pipeline is not filled with speculative CMP instructions (2) as in the first example. (I.e. it could block the pipeline until all older memory instructions are committed.) Because the CMP instructions (2) now execute sequentially, the time window between a CMP (2) reading lockvar and that CMP being committed is much shorter, so it is unlikely that an external write lands inside it.
Of course "de-pipelining" will also waste less energy in the spin-lock, and with hyperthreading it will not waste resources the other thread could put to better use. On the other hand, there is still a branch misprediction waiting to occur at each loop exit. Intel's documentation does not suggest that PAUSE eliminates that pipeline flush, but who knows...
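For what it's worth, here is roughly how the same wait loop could look in C. Treat it as a sketch under a few assumptions, not as the canonical way to write a lock: the names demo_spinlock, demo_trylock, demo_lock, demo_unlock and cpu_relax are made up for illustration, the lock is taken with an atomic exchange (the listings above only show the waiting part, not what happens at Get_Lock), and on non-x86 targets cpu_relax is assumed to be a plain no-op. _mm_pause() is the compiler intrinsic that emits PAUSE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>              /* _mm_pause() emits the PAUSE instruction */
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)       /* assumption: no spin hint on other targets */
#endif

/* Hypothetical lock type: lockvar is 0 when free, 1 when held. */
typedef struct {
    atomic_int lockvar;
} demo_spinlock;

/* Try once to take the lock; returns true on success. */
static bool demo_trylock(demo_spinlock *l)
{
    return atomic_exchange_explicit(&l->lockvar, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is taken. */
static void demo_lock(demo_spinlock *l)
{
    for (;;) {
        if (demo_trylock(l))
            return;                 /* corresponds to reaching Get_Lock */

        /* Spin_Lock: read-only wait loop with PAUSE between the reads,
           like the CMP / JE / PAUSE / JMP sequence above. */
        while (atomic_load_explicit(&l->lockvar, memory_order_relaxed) != 0)
            cpu_relax();
    }
}

/* Release the lock by writing zero to lockvar. */
static void demo_unlock(demo_spinlock *l)
{
    assert(atomic_load_explicit(&l->lockvar, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->lockvar, 0, memory_order_release);
}
```

The inner loop only reads lockvar and retries the exchange once it has seen a zero, so the waiting processor is not issuing writes while it spins. The store in demo_unlock is the "external write" that the waiting processor eventually sees.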