Macro-fused jumps have to be mentioned separately because fusion means the whole cmp/jcc (or whatever pair) is vulnerable to this slowdown if the cmp touches the boundary, even when the jcc itself doesn't. That's because the uop cache would have a single uop for both those x86 machine instructions together, with the start address of the non-jump instruction.
If everyone only said "jumps", you'd expect that only the JCC / JMP / CALL / RET itself had to avoid touching a 32B boundary. So it's a good thing to highlight the interaction with macro-fusion.
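To make that concrete, here's a minimal sketch of the address arithmetic. It assumes "touching" a boundary means the instruction bytes either cross a 32-byte boundary or end exactly on one, which is my reading of Intel's description. The function names and the example addresses are made up for illustration, not taken from any tool.

```python
BOUNDARY = 32  # bytes; the granularity the mitigation cares about

def touches_boundary(start, length, boundary=BOUNDARY):
    """True if the bytes [start, start+length) cross a boundary or end exactly on one."""
    end = start + length  # address of the first byte after the instruction
    return start // boundary != end // boundary

def fused_pair_affected(cmp_start, cmp_len, jcc_len, boundary=BOUNDARY):
    """Treat a macro-fused cmp+jcc as one unit that starts at the cmp (assumption)."""
    return touches_boundary(cmp_start, cmp_len + jcc_len, boundary)

# Hypothetical layout: a 3-byte cmp at 0x1d, followed by a 2-byte jcc at 0x20.
jcc_alone = touches_boundary(0x20, 2)        # jcc sits entirely inside one 32B block
fused = fused_pair_affected(0x1d, 3, 2)      # the fused pair spans the 0x20 boundary
print(jcc_alone, fused)
```

In this layout the jcc on its own looks safe, but the fused pair is affected because the cmp straddles the boundary right before it.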
This slowdown (for all jumps) is the result of a microcode mitigation / workaround for a hardware design flaw. Not being able to cache jumps that touch a 32-byte boundary in the uop cache is not the original erratum; it's a side effect of the cure.
That original erratum description doesn't say anything about affecting only conditional branches. Even if it was only conditional branches that were a real problem, perhaps the best way Intel could find to make it safe with a microcode update unfortunately affected all jumps.
For example, in Skylake-Xeon (SKX), the original erratum is documented as SKX102 in Intel's "spec update" errata list for that uarch:
SKX102. Processor May Behave Unpredictably on Complex Sequence of
Conditions Which Involve Branches That Cross 64 Byte Boundaries
Problem: Under complex micro-architectural conditions involving branch instructions bytes that
span multiple 64 byte boundaries (cross cache line), unpredictable system behavior
may occur.
Implication: When this erratum occurs, the system may behave unpredictably.
Workaround: It is possible for BIOS to contain a workaround for this erratum. [i.e. a microcode update]
Status: No fix.
I suspect the "JCC erratum" name caught on because most branches in "hot" code paths are conditional. Compilers can usually avoid putting unconditional taken branches in the fast path. So it's likely that people noticed the performance problem with JCC instructions first, and that name simply stuck even though it's not accurate.
BTW, "32-byte aligned routine does not fit the uops cache" has a screenshot of the relevant diagram from the Intel PDF you linked, plus some other links and details about performance effects.