You can declare a function pointer and point it to the correct version at program startup by calling cpuid to determine the current architecture.
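A minimal sketch of that approach, assuming GCC or Clang. The names `count_bits`, `count_bits_generic`, `count_bits_popcnt` and `select_count_bits` are made up for illustration, and the compiler's `__builtin_cpu_supports` helper stands in for a hand-written cpuid query:

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none are left. */
int count_bits_generic(uint64_t v) {
    int n = 0;
    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}

#if defined(__x86_64__) || defined(__i386__)
/* Only this one function is compiled with POPCNT enabled. */
__attribute__((target("popcnt")))
int count_bits_popcnt(uint64_t v) {
    return __builtin_popcountll(v);
}
#endif

/* Global function pointer; it starts at the safe fallback. */
int (*count_bits)(uint64_t) = count_bits_generic;

/* Runs before main() (constructor is a GCC/Clang extension) and
   picks the version once. */
__attribute__((constructor))
static void select_count_bits(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();  /* needed before __builtin_cpu_supports in a constructor */
    if (__builtin_cpu_supports("popcnt"))
        count_bits = count_bits_popcnt;
#endif
}
```

After startup every caller just writes `count_bits(x)`; for example `count_bits(0xF0F0)` is 8 whichever version was selected. The cost is one indirect call per use.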
But it's better to rely on the support that many modern compilers provide. Intel's ICC has long had automatic function dispatching that selects the optimized version for each architecture. I don't know the details, but it looks like it only applies to Intel's libraries. Besides, it only dispatches to the efficient version on Intel CPUs, which is unfair to other manufacturers. There are many patches and workarounds for that in Agner's CPU blog.
Later, a feature called Function Multiversioning was introduced in GCC 4.8 (for C++ on x86). It adds the target attribute, which you declare on each version of your function:
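// A "default" version is required; it runs when no other version matches.
__attribute__ ((target ("default")))
int foo() { return 0; }
__attribute__ ((target ("sse4.2")))
int foo() { return 1; }
__attribute__ ((target ("arch=atom")))
int foo() { return 2; }
int main() {
    int (*p)() = &foo;   // taking the address also goes through the dispatcher
    return foo() + p();
}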
Writing every version by hand duplicates a lot of code and is cumbersome, so GCC 6 added target_clones, which tells GCC to compile one function into multiple clones. For example, __attribute__((target_clones("avx2","arch=atom","default"))) void foo() {} will create 3 different versions of foo. More information can be found in GCC's documentation on function attributes.
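As a slightly fuller sketch of target_clones on a function that can actually benefit from vectorization: `sum_array` and the `MULTIVERSION` macro are hypothetical names, and the guard assumes x86-64 with glibc, because the attribute needs x86 target names and ifunc support.

```c
#include <stddef.h>
#include <stdint.h>

/* On other platforms the macro expands to nothing and a single
   plain version of the function is built. */
#if defined(__x86_64__) && defined(__GLIBC__)
#define MULTIVERSION __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
#define MULTIVERSION
#endif

/* One source body; the compiler emits one clone per listed target
   plus a resolver that picks a clone for the running CPU. */
MULTIVERSION
int64_t sum_array(const int32_t *a, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

Callers use `sum_array` like any other function. The result is the same for every clone; only the generated instructions differ.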
The syntax was later adopted by Clang and ICC. Performance can even be better than with a global function pointer, because the function symbols can be resolved at process load time instead of at run time. It's one of the reasons Intel's Clear Linux runs so fast. ICC may also create multiple versions of a single loop during auto-vectorization.
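To make "resolved at load time" concrete, here is a hand-written sketch of the GNU indirect function (ifunc) mechanism, which is what GCC's multiversioning relies on under Linux/glibc. The names `bitcount64`, `bitcount64_generic`, `bitcount64_popcnt` and `resolve_bitcount64` are made up for illustration, and the guard assumes x86-64 with glibc; elsewhere the sketch falls back to a plain function.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none are left. */
static int bitcount64_generic(uint64_t v) {
    int n = 0;
    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}

#if defined(__x86_64__) && defined(__GLIBC__)
__attribute__((target("popcnt")))
static int bitcount64_popcnt(uint64_t v) {
    return __builtin_popcountll(v);
}

/* Resolver: the dynamic loader runs it while binding the symbol and
   uses the returned address as the implementation of bitcount64. */
static int (*resolve_bitcount64(void))(uint64_t) {
    __builtin_cpu_init();  /* required inside an ifunc resolver */
    return __builtin_cpu_supports("popcnt") ? bitcount64_popcnt
                                            : bitcount64_generic;
}

int bitcount64(uint64_t v) __attribute__((ifunc("resolve_bitcount64")));
#else
/* No ifunc support assumed here: just call the fallback. */
int bitcount64(uint64_t v) { return bitcount64_generic(v); }
#endif
```

Once the symbol is bound, a call to `bitcount64` goes straight to the chosen implementation, with no pointer check or extra load in the calling code.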
Here's an example from "The one with multi-versioning (Part II)", along with its demo. It's about popcnt, but you get the idea:
__attribute__((target_clones("popcnt","default")))
int runPopcount64_builtin_multiarch_loop(const uint8_t* bitfield, int64_t size, int repeat) {
    int res = 0;
    const uint64_t* data = (const uint64_t*)bitfield;
    for (int r=0; r<repeat; r++)
        for (int i=0; i<size/8; i++) {
            res += popcount64_builtin_multiarch_loop(data[i]);
        }
    return res;
}
Note that PDEP and PEXT are very slow on AMD CPUs before Zen 3, so the BMI2 versions should only be enabled on Intel (or on newer AMD parts).
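One way to apply that rule is to dispatch by hand and check the CPU vendor as well as the feature bit. This is only a sketch, assuming GCC or Clang on x86-64. The names `pext64`, `pext_generic` and `pext_bmi2` are hypothetical, and the Intel-only test simply follows the note above, so it also skips AMD parts where the instructions are fast.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Portable PEXT: gather the bits of src selected by mask into the
   low bits of the result. */
uint64_t pext_generic(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* lowest set bit of mask */
        if (src & lowest)
            out |= bit;
        mask &= mask - 1;                      /* drop that bit */
    }
    return out;
}

#if defined(__x86_64__)
/* Only this function is compiled with BMI2 enabled. */
__attribute__((target("bmi2")))
static uint64_t pext_bmi2(uint64_t src, uint64_t mask) {
    return _pext_u64(src, mask);
}
#endif

uint64_t pext64(uint64_t src, uint64_t mask) {
#if defined(__x86_64__)
    static int use_bmi2 = -1;  /* -1 means "not probed yet" */
    if (use_bmi2 < 0) {
        __builtin_cpu_init();
        /* Feature bit AND vendor: BMI2 alone is not enough here. */
        use_bmi2 = __builtin_cpu_supports("bmi2") && __builtin_cpu_is("intel");
    }
    if (use_bmi2)
        return pext_bmi2(src, mask);
#endif
    return pext_generic(src, mask);
}
```

For example, `pext64(0xABCD, 0x0F0F)` picks out the nibbles `B` and `D` and returns `0xBD` on either path. A plain target_clones("bmi2","default") cannot express the vendor condition, which is why the check is written out here.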