SSE2 is enabled by default for x86-64, because it's a required part of the x86-64 ISA.
Since Apple has never sold any AMD or Pentium 4 CPUs, x86-64 on OS X also implies SSSE3 (first-gen Core 2). The first x86 Macs were Core (not Core 2), but they were 32-bit only. Unfortunately, you can't assume SSE4.1 or -mpopcnt.
I'd suggest -march=core2 -mtune=haswell. (-mtune doesn't affect compatibility, and Haswell tuning shouldn't be bad for actual Core 2 or Nehalem hardware. See http://agner.org/optimize/ and the links in the x86 tag wiki for microarchitecture details about what is fast or slow in (compiler-generated) assembly language on different CPUs.)
(See "How does mtune actually work?" for an example of different tuning causing different instruction selection without changing the required ISA extensions.)
-march=core2 enables everything that Core 2 supports, not just SSSE3. Since you don't care about your code performing well on AMD CPUs (because it's OS X), you can tune for an Intel CPU. There's also -mtune=intel, which is more generic, but Haswell should be reasonable.
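If you want to see what a given -march actually turned on, one option is to look at the compiler's predefined macros. This is just a minimal sketch: the function names are made up for illustration, and it only reports what the compiler was allowed to use at build time, not what the CPU you run on supports.

```c
#include <assert.h>
#include <stdio.h>

/* Names of the ISA extensions the compiler was allowed to use for this
   translation unit, assembled at compile time from predefined macros.
   (enabled_isa_extensions is a hypothetical helper name.) */
static const char *enabled_isa_extensions(void)
{
    return ""
#ifdef __SSE2__
        "SSE2 "
#endif
#ifdef __SSE3__
        "SSE3 "
#endif
#ifdef __SSSE3__
        "SSSE3 "
#endif
#ifdef __SSE4_1__
        "SSE4.1 "
#endif
#ifdef __POPCNT__
        "POPCNT "
#endif
#ifdef __AVX__
        "AVX "
#endif
#ifdef __AVX2__
        "AVX2 "
#endif
        ;
}

/* Print the list; returns the number of characters written. */
static int print_isa_report(FILE *out)
{
    assert(out != NULL);
    return fprintf(out, "compiler may use: %s\n", enabled_isa_extensions());
}
```

Calling print_isa_report(stdout) from a build with -march=core2 should list SSE2, SSE3 and SSSE3 but not SSE4.1, POPCNT or AVX2. Changing only -mtune should leave the list unchanged, since tuning doesn't change the required ISA extensions.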
You might be missing out on support for Hackintosh systems where someone installed OS X on an ancient CPU on non-Apple hardware, but IDK if OS X would work on an AMD Athlon64 / PhenomII, or Intel P4.
It would be nice to be able to enable some Nehalem stuff like -mpopcnt, but first- and second-gen Core 2 (Conroe and Penryn) lacked it. Even SSE4.1 isn't available on first-gen Core 2.
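If popcnt matters in one hot spot, you don't have to raise the baseline for the whole program. A possible workaround (not something the question asked about) is to compile just one function with popcnt enabled and check the CPU at run time. This sketch assumes GCC or clang on x86 and that your toolchain provides __builtin_cpu_supports; HAVE_X86_DISPATCH and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guard macro: x86 target with GNU-style builtins and attributes. */
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define HAVE_X86_DISPATCH 1
#endif

/* Portable fallback: runs on any CPU, including Conroe and Penryn. */
static int popcount64_portable(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

#ifdef HAVE_X86_DISPATCH
/* Only this function is compiled with popcnt enabled, so the rest of the
   program can stay at the -march=core2 baseline. */
__attribute__((target("popcnt")))
static int popcount64_hw(uint64_t x)
{
    return __builtin_popcountll(x);
}
#endif

/* Use the hardware instruction only if the running CPU reports it. */
static int popcount64(uint64_t x)
{
    int n;
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("popcnt"))
        n = popcount64_hw(x);
    else
#endif
        n = popcount64_portable(x);
    assert(n >= 0 && n <= 64);   /* sanity check; compiled out with NDEBUG */
    return n;
}
```

The run-time check has a cost on every call, so this only makes sense if the dispatch sits outside the inner loop, or if the function does enough work per call to amortize the branch.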
It's also possible to build a fat binary with baseline and Haswell slices, x86_64 and x86_64h. Stephen Cannon says (in comments below) that "the x86_64h slice will run automatically on Haswell and later μarches". (Slices for other uarches aren't currently an option, but most programs would get little benefit.)
Your x86_64 (non-Haswell) slice should probably build with -march=core2 -mtune=sandybridge.
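With two slices, the same source file gets compiled twice, so any hand-written fast paths can be selected at compile time instead of with run-time dispatch. Here's a rough sketch of that idea, assuming the Haswell slice is built with FMA enabled (e.g. via -march=haswell) so that __FMA__ is defined; mul_add and dot are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__FMA__)
#include <immintrin.h>
#endif

/* Returns a*b + c. In a slice built with FMA enabled this uses a fused
   multiply-add; in the baseline slice it is a separate multiply and add. */
static double mul_add(double a, double b, double c)
{
#if defined(__FMA__)
    __m128d r = _mm_fmadd_sd(_mm_set_sd(a), _mm_set_sd(b), _mm_set_sd(c));
    return _mm_cvtsd_f64(r);
#else
    return a * b + c;
#endif
}

/* Dot product built on mul_add; identical source for both slices. */
static double dot(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc = mul_add(a[i], b[i], acc);
    return acc;
}
```

Note that the fused version rounds once instead of twice, so results can differ in the last bit between slices. In most code you wouldn't write the #if at all and would just let the compiler pick instructions per slice; this pattern is only for places where you use intrinsics by hand.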
Haswell introduced AVX2, FMA, and BMI2, so -march=haswell is very nice for Broadwell / Skylake / Kaby Lake / Coffee Lake. (For tuning options as well as ISA extensions: gcc -march=haswell disables -mavx256-split-unaligned-load and -mavx256-split-unaligned-store, while -mavx with default or Sandybridge tuning enables them. Splitting sucks on Haswell, especially when it creates shuffle-port bottlenecks. And it's really dumb when your data is almost always aligned, or really always aligned but you just didn't tell the compiler about it.)
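For the "always aligned but you didn't tell the compiler" case, one way to tell it is __builtin_assume_aligned, which GCC and clang both provide. A minimal sketch, assuming the caller really does guarantee 32-byte alignment (scale_aligned32 is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply n floats in place by k. The caller promises that p is 32-byte
   aligned; passing a misaligned pointer breaks that promise and is
   undefined behaviour once the assert is compiled out. */
static void scale_aligned32(float *p, size_t n, float k)
{
    assert(((uintptr_t)p % 32) == 0);           /* debug-build check only */
    float *a = __builtin_assume_aligned(p, 32);  /* alignment promise to the compiler */
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}
```

With the alignment known, the compiler has no reason to split 256-bit loads and stores for this loop when it auto-vectorizes, whatever the tuning setting. The array itself still has to be allocated with that alignment, e.g. with _Alignas(32) or an aligned allocator.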
Broadwell introduced ADOX/ADCX, which is pretty niche (it lets you run two extended-precision add dependency chains in parallel), and Skylake introduced clflushopt, which isn't widely useful.
Skylake and most Broadwell CPUs do have working transactional memory, though, which might be important for some fine-grained multithreading cases. (Haswell was going to have it, but it was disabled in a microcode update after a rare bug was discovered in the implementation.)
AVX512 is the next big widely useful feature that Haswell doesn't have, so maybe Apple will add support for a Cannonlake or Ice Lake slice at some point.
I wouldn't recommend making a separate build for Broadwell or Skylake (with any dispatching mechanism), unless you know you can take advantage of a specific new feature and it makes a significant difference.
But a separate build could potentially be useful for Sandybridge, for AVX support without AVX2, especially for 256-bit FP math but also to save movdqa instructions in integer 128-bit vector code. Also for SSE4.x and popcnt, and to avoid partial-flag problems in an extended-precision adc loop using dec/jnz.