Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
617 views
in Technique[技术] by (71.8m points)

assembly - What is the penalty of mixing EVEX and VEX encoded scheme?

It is a known issue that mixing VEX-encoded instructions and non-VEX instructions has a penalty and the programmer must be aware of it.

There are some questions and answers like this. The solutions are depended on the way you program (usually you should use zeroupper after transitions. But my question is about EVEX-encoded scheme. As far as there are no intrinsics such as _mm512_zeroupper() It seems there is no penalty when using VEX-encoded and EVEX-encoded instructions together. However EVEX is 4-byte and VEX is 3-byte and also the vector length is 512-bit and 256-bit respectively.

Because AVX-512 is not available (at least for me). I wanted to ask If there is anything to be aware of when we want to mix them.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

There is no penalty for mixing any of VEX 128 / 256 or EVEX 128 / 256 / 512 on any current CPUs, and no reason to expect any penalty on future CPUs.

All VEX and EVEX coded instructions are defined to zero the high bytes of the destination vector register, out to whatever the maximum vector width the CPU supports. This makes them future-proof for any future wider vectors without needing ugly stuff like vzeroupper.


(There is a related slowdown, though: see @BeeOnRope's comments about writing a full 512-bit register having a permanent effect until vzeroupper on SKX, if you write a ZMM register explicitly (not via implicitly zero-extension of the corresponding YMM or XMM register). It makes every narrower vector instruction act as if it was a 512-bit instruction for Turbo frequency limits.

No false dependencies or extra clock cycles, just that each clock cycle isn't as short as with full turbo. Port 1 is not shut down: we still have 3-per-clock vpaddd xmm/ymm.

This is a "global" core-wide state: one polluted zmm0..15 register will hurt the whole core, and only vzeroupper/all will restore higher turbo. (But writes to zmm16..31 reportedly aren't a problem). Simply writing the low halves of the affected ZMM register(s) with normal zero-extending XMM YMM VEX or EVEX instructions won't get you out of that "mode" / state. Even a zeroing idiom like VEX vpxor or EVEX vpxord the polluted register doesn't help. vpxord zmm0,zmm0,zmm0 can in fact can cause the problem, which is odd for a zeroing idiom.

Two different experiments performed by user Mysticial and BeeOnRope (see comments) indicate that SKX's physical register file has 512-bit entries; a microbenchmark that depends on the vector PRF size to find ILP found "a SIMD speculative PRF size of about 150 to 158", the same for 256-bit or 512-bit vectors. (And we know that's about right for the 256-bit PRF size, based on Intel's published info for Skylake-client and experiments there.) So we can rule out a mode where storing an architectural ZMM register requires 2 PRF entries and twice as many read/write ports.

My current guess at an explanation is that maybe there's an upper256 PRF physically farther from scheduler than the main vector PRF, or just extra width sharing the same indexing in the main vector PRF. Speed-of-light propagation delays could limit max turbo when the upper256 PRF is powered up, if that's a thing. This hardware-design hypothesis is not testable with software, but it is compatible with only vzeroupper / vzeroall getting out of the bad state (if I'm right, letting the upper256 part of the PRF power down because that one instruction lets us know it's unused). I'm not sure why zmm16..31 wouldn't matter for this, though.

The CPU does track whether any upper-256 parts are non-zero, so xsaveopt can use a more compact block if possible. Interaction with the kernel's xsaveopt / restore are possible in interrupt handlers, but mostly I mention this just as another reason why CPUs do track this.

Note that this ZMM dirty-upper problem is not due to mixing VEX and EVEX. You'd have the same problem if you used EVEX encodings for all the 128-bit and 256-bit instructions. The problem is from mixing 512-bit with narrower vectors, on first-gen AVX512 CPUs where 512-bit is a bit of a stretch and they're more optimized for shorter vectors. (The port-1 shutdown, and higher latency for the port5 FMA).

I wonder if this was intentional, or if it was a design bug.



Using VEX when possible in AVX512 code is a good thing.

VEX saves code-size vs. EVEX. Sometimes when unpacking or converting between element widths, you could end up with narrower vectors.

(Even given the above issue with mixing 512-bit with shorter vectors, 128/256-bit instructions are not worse than their 512-bit equivalent. They keep the max turbo reduced when they shouldn't, but that's all.)

A VEX-coded vpxor xmm0,xmm0,xmm0 is already the most efficient way to zero a ZMM register, saving 2 bytes vs. vpxord zmm0,zmm0,zmm0 and running at least as fast. MSVC has been doing this for a while, and clang 6.0 (trunk) does it too after I reported the missed optimization. (gcc vs. clang on godbolt.

Even apart from code-size, it's potentially faster on future CPUs that split 512b instructions into two 256b ops. (See Agner Fog's answer on Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?).

Similarly, horizontal sums should narrow down to 256b and then 128b as the first steps, so they can use shorter VEX instructions, and 128b instructions are fewer uops on some CPUs. Also in-lane shuffles are often faster than lane-crossing.



Background on why SSE/AVX is a problem

See also Agner Fog's 2008 post on the Intel forums and the rest of the thread commenting on the AVX design when it was first announced. He correctly points out that if Intel had planned for extension to wider vectors when designing SSE in the first place and provided a way to save/restore a full vector regardless of width, this wouldn't have been a problem.

Also interesting, Agner's 2013 comments on AVX512, and the resulting discussion on the Intel forum: AVX-512 is a big step forward - but repeating past mistakes!


When AVX was first introduced, they could have defined the behaviour of legacy SSE instructions to zero the upper lane, which would have avoided the need for vzeroupper and having a saved-upper state (or false dependencies).

Calling conventions would simply allow functions to clobber the upper lanes of vector regs (like current calling conventions already do).

The problem is asynchronous clobbering of upper lanes by non-AVX-aware code in kernels. OSes already need to be AVX-aware to save/restore the full vector state, and AVX instructions fault if the OS hasn't set a bit in an MSR that promises this support. So you need an AVX-aware kernel to use AVX, so what's the problem?

The problem is basically legacy binary-only Windows device drivers that manually save/restore some XMM registers "manually" using legacy SSE instructions. If that did implicit zeroing, this would break the AVX state for user-space.

Instead of making AVX unsafe to enable on Windows systems using such drivers, Intel designed AVX so the legacy SSE versions left the upper lane unmodified. Letting non-AVX-aware SSE code run efficiently requires some kind of penalty.

We have binary-only software distribution for Microsoft Windows to thank for Intel's decision to inflict the pain of SSE/AVX transition penalties.

Linux kernel code has to call kernel_fpu_begin / kernel_fpu_end around code vector regs, which triggers the regular save/restore code which has to know about AVX or AVX512. So any kernel built with AVX support will support it in every driver/module (e.g. RAID5/RAID6) that wants to use SSE or AVX, even a non-AVX-aware binary-only kernel module (assuming it was correctly written, rather than saving/restoring a couple xmm or ymm regs itself).

Windows has a similar future-proof save/restore mechanism, KeSaveExtendedProcessorState, that lets you use SSE/AVX code in kernel code (but not interrupt handlers). IDK why drivers didn't always use that; maybe it's slow or didn't exist at first. If it's been available for long enough, then it's purely the fault of binary-only driver writers/distributors, not Microsoft themselves.

(IDK about OS X either. If binary drivers save/restore xmm regs "manually" instead of telling the OS that the next context switch needs to restore FP state as well as integer, then they're part of the problem too.)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...