Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


assembly - Does using a mix of pxor and xorps affect performance?

I've come across an implementation of fast CRC computation using PCLMULQDQ. I see that the authors mix pxor and xorps instructions heavily, as in the fragment below:

movdqa  xmm10, [rk9]
movdqa  xmm8, xmm0
pclmulqdq xmm0, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor  xmm7, xmm8
xorps xmm7, xmm0

movdqa  xmm10, [rk11]
movdqa  xmm8, xmm1
pclmulqdq xmm1, xmm10, 0x11
pclmulqdq xmm8, xmm10, 0x0
pxor  xmm7, xmm8
xorps xmm7, xmm1

Is there any practical reason for this? Performance boost? If yes, then what lies beneath this? Or maybe it's just a sort of coding style, for fun?


1 Answer


TL;DR: it looks like maybe some microarch-specific tuning for this specific code sequence. There's nothing "generally recommended" about it that will help in other cases.

On further consideration, I think @Iwillnotexist Idonotexist's theory is the most likely: this was written by a non-expert who thought this might help. The register allocation is a big clue: many REX prefixes could have been avoided by choosing all the repeatedly-used registers from the low 8 (xmm0-xmm7).


On some Intel CPUs (Nehalem and later), XORPS runs in the "float" domain, while PXOR always runs in the "ivec" domain.

Since wiring every ALU output to every ALU input for forwarding results directly would be expensive, CPU designers break them up into domains. (Forwarding saves the latency of writing back to the register file and re-reading). A domain-crossing can take an extra 1 cycle of latency (Intel SnB-family), or 2 cycles (Nehalem).
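As a rough illustration (not taken from the CRC code in question), the C sketch below shows that the two instructions are interchangeable as far as the result goes: only the execution domain differs. The function names xor128_int, xor128_float and xor128_paths_agree are made up for this example. Note also that a compiler is free to emit either PXOR or XORPS for either intrinsic, so this demonstrates bit-for-bit equivalence, not a way to force a particular encoding.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE _mm_xor_ps */

/* XOR two 16-byte blocks with the integer-domain intrinsic (nominally PXOR). */
void xor128_int(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_xor_si128(va, vb));
}

/* Same XOR through the float-domain intrinsic (nominally XORPS).
 * The casts are free: they only reinterpret the register's type. */
void xor128_float(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    __m128 va = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)a));
    __m128 vb = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(_mm_xor_ps(va, vb)));
}
#else
/* Non-x86 fallback so the sketch still builds: plain byte-wise XOR. */
void xor128_int(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(a[i] ^ b[i]);
}

void xor128_float(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    xor128_int(out, a, b);
}
#endif

/* Returns 1 when both paths produce identical bytes for the given inputs. */
int xor128_paths_agree(const uint8_t a[16], const uint8_t b[16])
{
    uint8_t r_int[16], r_float[16];
    xor128_int(r_int, a, b);
    xor128_float(r_float, a, b);
    return memcmp(r_int, r_float, 16) == 0;
}
```

Since the outputs always match, any reason to prefer one instruction over the other has to come from ports, bypass latency, or code size.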

Further reading: my answer on "What's the difference between logical SSE intrinsics?"


Two theories occur to me:

  • Whoever wrote this thought that PXOR and XORPS would give more parallelism, because they don't compete with each other. (This is wrong: PXOR can run on all vector ALU ports, but XORPS can't).

  • This is some very cleverly tuned code that creates a bypass delay on purpose, to avoid a resource conflict that might delay the execution of the next PCLMULQDQ. (Or, as EOF suggests, code-size / alignment might have something to do with it.)

The copyright notice on the code says "2011-2015 Intel", so it's worth considering the possibility that it's somehow helpful for some recent Intel CPU, and isn't just based on a misunderstanding of how Intel CPUs work. Westmere (the 32nm shrink of Nehalem) was the first CPU to include PCLMULQDQ at all, and this is Intel, so if anything it'll be tuned to do badly on AMD CPUs. The code history isn't in the git repo, only the May 6th commit that added the current version.

The Intel whitepaper (from Dec 2009) that it's based on used PXOR only, not XORPS, in its version of the 2x pclmul / 2x xor block.
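For readers who want to see what that 2x pclmul / 2x xor block computes, here is a minimal portable C model. It is a sketch for illustration, not the whitepaper's or the library's code: the u128 type and the clmul64 / fold_step names are invented here, and the bit-by-bit multiply is far slower than the real instruction. It relies only on the documented PCLMULQDQ immediate semantics (0x11 multiplies the two high qwords, 0x00 the two low qwords).

```c
#include <stdint.h>

/* A 128-bit value as two 64-bit halves (stand-in for an xmm register). */
typedef struct { uint64_t lo, hi; } u128;

/* Carry-less (GF(2) polynomial) multiply of two 64-bit values,
 * producing a 128-bit product. Software model of one PCLMULQDQ. */
u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0)                 /* avoid an undefined shift by 64 */
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* One fold step, mirroring the question's fragment:
 *   pclmulqdq x,  k, 0x11   -> hi(x) * hi(k)
 *   pclmulqdq x', k, 0x00   -> lo(x) * lo(k)
 *   acc ^= first product; acc ^= second product
 * XOR is associative and commutative, so the order of the two XORs
 * (and whether each is PXOR or XORPS) cannot change the result. */
u128 fold_step(u128 acc, u128 x, u128 k)
{
    u128 p_hi = clmul64(x.hi, k.hi);
    u128 p_lo = clmul64(x.lo, k.lo);
    acc.lo ^= p_lo.lo ^ p_hi.lo;
    acc.hi ^= p_lo.hi ^ p_hi.hi;
    return acc;
}
```

The two XORs form a short dependency chain into the accumulator (xmm7 in the fragment), which is why their latency and port assignment are the only things the pxor/xorps choice can affect.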

Agner Fog's table doesn't even show a number of uops for PCLMULQDQ on Nehalem, or which ports they require. It's 12c latency, and one per 8c throughput, so it might be similar to Sandy/Ivybridge's 18 uop implementation. Haswell makes it an impressive 3 uops (2p0 p5), while it runs in only 1 uop on Broadwell (p0) and Skylake (p5).

XORPS can only run on port 5 (until Skylake, where it also runs on all three vector ALU ports). On Nehalem, it has a 2c bypass delay when one of its inputs comes from PXOR. On SnB-family CPUs, Agner Fog says:

In some cases, there is no bypass delay when using the wrong type of shuffle or Boolean instruction.

So I think there's actually no extra bypass delay for forwarding from PXOR -> XORPS on SnB, so the only effect would be that it can only run on port 5. On Nehalem, it might actually delay the XORPS until after the PSHUFBs were done.

In the main unrolled loop, there's a PSHUFB after the XORs, to set up the inputs for the next PCLMUL. SnB/IvB can run integer shuffles on p1/p5 (unlike Haswell and later, where there's only one shuffle unit, on p5, but it's 256b wide for AVX2).

Since competing for the ports needed to set up the input for the next PCLMUL doesn't seem useful, my best guess is code size / alignment if this change was done when tuning for SnB.
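To put numbers on the code-size angle: XORPS has no mandatory 0x66 prefix, so it is one byte shorter than PXOR, and any use of xmm8-xmm15 costs one more byte for a REX prefix. The C sketch below records hand-assembled encodings of the XORs from the question's fragment; the byte values follow the standard opcode maps (66 0F EF /r for PXOR, 0F 57 /r for XORPS), but treat them as an assumption and check against your own assembler's listing. The array and function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* pxor xmm7, xmm8 : 66 prefix, REX.B (for xmm8), 0F EF, ModRM */
const uint8_t PXOR_XMM7_XMM8[]  = {0x66, 0x41, 0x0F, 0xEF, 0xF8};

/* pxor xmm7, xmm0 : 66 prefix, 0F EF, ModRM (no REX needed) */
const uint8_t PXOR_XMM7_XMM0[]  = {0x66, 0x0F, 0xEF, 0xF8};

/* xorps xmm7, xmm0 : 0F 57, ModRM (no prefix at all) */
const uint8_t XORPS_XMM7_XMM0[] = {0x0F, 0x57, 0xF8};

/* Bytes used by the two XORs of one fold step if both were PXOR. */
size_t xor_pair_size_pxor_only(void)
{
    return sizeof PXOR_XMM7_XMM8 + sizeof PXOR_XMM7_XMM0;   /* 5 + 4 = 9 */
}

/* Bytes used by the PXOR + XORPS mix that the code actually has. */
size_t xor_pair_size_mixed(void)
{
    return sizeof PXOR_XMM7_XMM8 + sizeof XORPS_XMM7_XMM0;  /* 5 + 3 = 8 */
}
```

So the mix saves one byte per fold step, and keeping the temporary in a low register instead of xmm8 would save another. Whether that shifts anything across a 32-byte boundary depends on the surrounding code.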


On CPUs where PCLMULQDQ is more than 4 uops, it's microcoded. This means each PCLMULQDQ requires an entire uop cache line to itself. Since only 3 uop cache lines can map to the same 32B block of x86 instructions, this means that much of the code won't fit in the uop cache at all on SnB/IvB. Each line of the uop cache can only cache contiguous instructions. From Intel's optimization manual:

All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.
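A very simplified model of those rules can be sketched in C to check whether a stretch of code can live in the uop cache. Everything here is an assumption made for illustration: the insn struct and ways_needed function are invented names, each way is assumed to hold up to 6 uops, an instruction is assigned to the 32-byte window containing its first byte, a microcoded instruction takes a whole way, and other real constraints (branches, 64-bit immediates, and so on) are ignored.

```c
#include <stddef.h>

/* One instruction as seen by this toy model (hypothetical type). */
typedef struct {
    unsigned length;     /* encoded length in bytes */
    unsigned uops;       /* fused-domain uops; ignored when microcoded */
    int      microcoded; /* nonzero if it needs the microcode sequencer */
} insn;

enum { WINDOW_BYTES = 32, UOPS_PER_WAY = 6, MAX_WAYS_PER_WINDOW = 3 };

/* Count the uop-cache ways needed by the instructions that start inside
 * 32-byte window number `window`, given that code[0] starts at `addr`.
 * If the result exceeds MAX_WAYS_PER_WINDOW, the model says that window
 * cannot be cached and has to run from the legacy decoders. */
unsigned ways_needed(const insn *code, size_t n, unsigned addr, unsigned window)
{
    unsigned ways = 0;
    unsigned free_slots = 0;            /* slots left in the current way */

    for (size_t i = 0; i < n; i++) {
        if (addr / WINDOW_BYTES == window) {
            if (code[i].microcoded) {
                ways++;                 /* takes an entire way by itself */
                free_slots = 0;         /* next instruction opens a new way */
            } else {
                if (code[i].uops > free_slots) {
                    ways++;             /* an instruction can't span ways */
                    free_slots = UOPS_PER_WAY;
                }
                free_slots -= code[i].uops;
            }
        }
        addr += code[i].length;
    }
    return ways;
}
```

Under this model, two microcoded PCLMULQDQs starting in the same 32-byte window already use two of the three available ways, so any ordinary instructions both before and after them in that window push it over the limit.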

This sounds like a very similar issue to having integer DIV in a loop: see "Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs". With the right alignment, you can get it to run out of the uop cache (the DSB in Intel performance counter terminology). @Iwillnotexist Idonotexist did some useful testing of micro-coded instructions on a Haswell CPU, showing that they prevent running from the loopback buffer (LSD in Intel terminology).


On Haswell and later, PCLMULQDQ is not microcoded, so it can go in the same uop cache line with other instructions before or after it.

For previous CPUs, it might be worth trying to tweak the code to bust the uop cache in fewer places. OTOH, switching between uop cache and legacy decoders might be worse than just always running from the decoders.

Also IDK if such a big unroll is really helpful. It probably varies a lot between SnB and Skylake, since microcoded instructions are very different for the pipeline, and SKL might not even bottleneck on PCLMUL throughput.

