Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c++ - Do I need to use _mm256_zeroupper in 2021?

From Agner Fog's "Optimizing software in C++":

There is a problem when mixing code compiled with and without AVX support on some Intel processors. There is a performance penalty when going from AVX code to non-AVX code because of a change in the YMM register state. This penalty should be avoided by calling the intrinsic function _mm256_zeroupper() before any transition from AVX code to non-AVX code. This can be necessary in the following cases:

- If part of a program is compiled with AVX support and another part of the program is compiled without AVX support, then call _mm256_zeroupper() before leaving the AVX part.

- If a function is compiled in multiple versions with and without AVX using CPU dispatching, then call _mm256_zeroupper() before leaving the AVX part.

- If a piece of code compiled with AVX support calls a function in a library other than the library that comes with the compiler, and the library has no AVX support, then call _mm256_zeroupper() before calling the library function.

I'm wondering which Intel processors these are. Specifically, are any processors made in the last five years affected? That way I'll know whether it's too late to fix missing _mm256_zeroupper() calls or not.



1 Answer


TL:DR: Don't use the _mm256_zeroupper() intrinsic manually, compilers understand SSE/AVX transition stuff and emit vzeroupper where needed for you. (Including when auto-vectorizing or expanding memcpy/memset/whatever with YMM regs.)


"Some Intel processors" being all except Xeon Phi.

Xeon Phi (KNL / KNM) don't have a state optimized for running legacy SSE instructions because they're purely designed to run AVX-512. Legacy SSE instructions probably always have false dependencies merging into the destination.

On mainstream CPUs with AVX or later, there are two different mechanisms: saving dirty uppers (SnB through Haswell, and Ice Lake) or false dependencies (Skylake). See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? for the two different styles of SSE/AVX transition penalty.

(Related Q&As cover the effects of the asm vzeroupper instruction in the compiler-generated machine code in more detail.)


Intrinsics in C or C++ source

You should pretty much never use _mm256_zeroupper() in C/C++ source code. Things have settled on having the compiler insert a vzeroupper instruction automatically where it might be needed, which is pretty much the only sensible way for compilers to be able to optimize functions containing intrinsics and still reliably avoid transition penalties (especially when considering inlining). All the major compilers can auto-vectorize and/or inline memcpy/memset/array-init with YMM registers, so they need to keep track of that and emit vzeroupper afterwards anyway.

The convention is to have the CPU in clean-uppers state when calling or returning, except when calling functions that take __m256 / __m256d / __m256i args by value (in registers or at all), or when returning such a value. The target function (callee or caller) inherently must be AVX-aware and expecting a dirty-upper state, because a full YMM register is in use as part of the calling convention.

x86-64 System V passes vectors in vector regs. Windows vectorcall does, too, but the original Windows x64 convention (now named "fastcall" to distinguish from "vectorcall") passes vectors by value in memory via hidden pointer. (This optimizes for variadic functions by making every arg always fit in an 8-byte slot.) IDK how compilers compiling Windows non-vectorcall calls handle this, whether they assume the function probably looks at its args or at least is still responsible for using a vzeroupper at some point even if it doesn't. Probably yes, but if you're writing your own code-gen back-end, or hand-written asm, have a look at what some compilers you care about actually do if this case is relevant for you.

Some compilers also optimize by omitting vzeroupper before returning from a function that took vector args, because clearly the caller is AVX-aware. And crucially, compilers apparently shouldn't expect that calling a function like void foo(__m256i) will leave the CPU in clean-upper state, so the caller does still need a vzeroupper after such a call, before calling printf or whatever.
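A sketch of that convention (hypothetical names; the GCC/clang target attribute and runtime check are again just so it builds and runs anywhere): first_lane takes a __m256 by value, so caller and callee are both AVX-aware and the uppers may legitimately be dirty across that call, while printf is a non-AVX callee, so the compiler emits a vzeroupper before that call on its own.

```cpp
#include <cstdio>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>

// Callee takes a __m256 by value: it is inherently AVX-aware, so the
// caller is allowed to leave the YMM uppers dirty across this call.
__attribute__((target("avx")))
float first_lane(__m256 v) { return _mm256_cvtss_f32(v); }

__attribute__((target("avx")))
void demo() {
    __m256 v = _mm256_set1_ps(2.5f);
    float f  = first_lane(v);  // dirty-upper state across this call: fine
    printf("%g\n", f);         // non-AVX callee: the compiler emits a
                               // vzeroupper before this call by itself
}
#endif

int main() {
#if defined(__x86_64__) || defined(_M_X64)
    if (__builtin_cpu_supports("avx")) { demo(); return 0; }
#endif
    printf("2.5\n");           // fallback so the demo runs on any host
}
</```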


Compilers have options to control vzeroupper usage

For example, GCC -mno-vzeroupper / clang -mllvm -x86-use-vzeroupper=0. (The default is -mvzeroupper, i.e. the behaviour described above: insert vzeroupper where it might be needed.)

This is implied by -march=knl (Knights Landing) because vzeroupper is not needed and very slow on Xeon Phi CPUs (thus it should actively be avoided).

Or you might possibly want -mno-vzeroupper if you build libc (and any other libraries you use) with -mavx, so nothing ever runs legacy-SSE code. glibc has some hand-written asm for functions like strlen, but most of those have AVX2 versions, so as long as you're not on an AVX1-only CPU, legacy-SSE versions of string functions might not get used at all.
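The effect of the GCC default is easy to see by compiling to asm and grepping (a sketch assuming an x86-targeting gcc on PATH; it skips itself otherwise, and the demo() function is made up):

```shell
# Emit asm for a small AVX function, then count vzeroupper instructions.
# Default (-mvzeroupper): one before the ret.  With -mno-vzeroupper: none.
cat > demo.c <<'EOF'
#include <immintrin.h>
float demo(const float *p) {
    __m256 v = _mm256_loadu_ps(p);
    __m256 s = _mm256_add_ps(v, v);
    return _mm256_cvtss_f32(s);   /* returns in XMM0: uppers dirty here */
}
EOF
if gcc -O2 -mavx -S demo.c -o default.s 2>/dev/null &&
   gcc -O2 -mavx -mno-vzeroupper -S demo.c -o none.s 2>/dev/null; then
    echo "default:         $(grep -c vzeroupper default.s) vzeroupper"
    echo "-mno-vzeroupper: $(grep -c vzeroupper none.s) vzeroupper"
else
    echo "no x86 gcc available; skipping"
fi
```

On a non-x86 box (or with a compiler that rejects -mno-vzeroupper) it just prints the skip line.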

For MSVC, you should definitely use /arch:AVX when compiling code that uses AVX intrinsics; I think some versions of MSVC could generate code that caused transition penalties if you mixed __m128 and __m256 without it. But beware that /arch:AVX makes even 128-bit intrinsics like _mm_add_ps use the AVX encoding (vaddps) instead of legacy SSE (addps), and lets the compiler auto-vectorize with AVX. There is an undocumented switch /d2vzeroupper to enable automatic vzeroupper generation (the default); /d2vzeroupper- disables it. See What is the /d2vzeroupper MSVC compiler optimization flag doing?

