TL:DR/advice for modern CPUs: Use `inc` except with a memory destination. In code you're tuning to run on mainstream Intel or any AMD, `inc register` is fine (e.g. like gcc `-mtune=core2`, `-mtune=haswell`, or `-mtune=znver1`). `inc mem` costs an extra uop on Intel P6 / SnB-family; the load can't micro-fuse.
If you care about the Silvermont family (including KNL in Xeon Phi, and some netbooks, chromebooks, and NAS servers), probably avoid `inc`. `add 1` only costs 1 extra byte in 64-bit code, or 2 in 32-bit code. But it's not a performance disaster (just 1 extra ALU port used locally, not false dependencies or big stalls), so if you don't care much about Silvermont then don't worry about it.
Writing CF instead of leaving it unmodified can potentially be useful with other surrounding code that might benefit from CF dep-breaking, e.g. shifts. See below.
If you want to inc/dec without touching any flags, `lea eax, [rax+1]` runs efficiently and has the same code-size as `add eax, 1`. (It usually runs on fewer possible execution ports than add/inc, though, so add/inc are better when destroying FLAGS is not a problem. https://agner.org/optimize/)
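For instance, a minimal sketch of the trade-off (the instruction choices here are mine, not from any particular compiler):

```
inc  eax              ; writes OF/SF/ZF/AF/PF, leaves CF unmodified
add  eax, 1           ; writes all the arithmetic flags, including CF
lea  eax, [rax + 1]   ; writes no flags at all; same code-size as add eax, 1,
                      ; but usually fewer possible execution ports
```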
On modern CPUs, `add` is never slower than `inc` (except for indirect code-size / decode effects), but usually it's not faster either, so you should prefer `inc` for code-size reasons, especially if this choice is repeated many times in the same binary (e.g. if you are a compiler-writer).
`inc` saves 1 byte (64-bit mode), or 2 bytes (opcodes 0x40..0x4F, the `inc r32`/`dec r32` short forms in 32-bit mode, re-purposed as REX prefixes for x86-64). This makes a small percentage difference in total code size, which helps instruction-cache hit rates, iTLB hit rate, and the number of pages that have to be loaded from disk.
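Concretely (encodings worked out by hand, so treat them as illustrative; byte counts assume no other prefixes are needed):

```
; 64-bit mode:
inc  eax        ; FF C0     = 2 bytes  (0x40 would be a REX prefix here)
add  eax, 1     ; 83 C0 01  = 3 bytes

; 32-bit mode:
inc  eax        ; 40        = 1 byte   (short form)
add  eax, 1     ; 83 C0 01  = 3 bytes
```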
Advantages of `inc`:
- code-size directly
- Not using an immediate can have uop-cache effects on Sandybridge-family, which could offset the better micro-fusion of `add`. (See Agner Fog's table 9.1 in the Sandybridge section of his microarch guide.) Perf counters can easily measure issue-stage uops, but it's harder to measure how things pack into the uop cache and uop-cache read bandwidth effects.
- Leaving CF unmodified is an advantage in some cases, on CPUs where you can read CF after `inc` without a stall. (Not on Nehalem and earlier.)
There is one exception among modern CPUs: Silvermont/Goldmont/Knight's Landing decode `inc`/`dec` efficiently as 1 uop, but expand them to 2 in the allocate/rename (aka issue) stage. The extra uop merges partial flags. `inc` throughput is only 1 per clock, vs. 0.5c (or 0.33c on Goldmont) for independent `add r32, imm8`, because of the dep chain created by the flag-merging uops.
Unlike P4, the register result doesn't have a false dependency on flags (see below), so out-of-order execution takes the flag-merging off the latency critical path when nothing uses the flag result. (But the out-of-order window is much smaller than on mainstream CPUs like Haswell or Ryzen.) Running `inc` as 2 separate uops is probably a win for Silvermont in most cases; most x86 instructions write all the flags without reading them, breaking these flag dependency chains.
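A sketch of how the merge uops limit throughput (register choices are mine; the throughput numbers are the ones quoted above):

```
; Silvermont/Goldmont/KNL: the flag-merging uops form a dependency chain,
; even though the register results are fully independent:
inc  eax
inc  ecx
inc  edx        ; limited to ~1 per clock by the chained merge uops

add  eax, 1
add  ecx, 1
add  edx, 1     ; no merge uop: 0.5c throughput (0.33c on Goldmont)
```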
Silvermont/KNL has a queue between decode and allocate/rename (see Intel's optimization manual, figure 16-2), so expanding to 2 uops during issue can fill bubbles from decode stalls: on instructions like one-operand `mul` or `pshufb`, which produce more than 1 uop from the decoder and cause a 3-7 cycle stall for microcode, or, on Silvermont, just an instruction with more than 3 prefixes (including escape bytes and mandatory prefixes), e.g. REX + any SSSE3 or SSE4 instruction. But note that there is a ~28 uop loop buffer, so small loops don't suffer from these decode stalls.
`inc`/`dec` aren't the only instructions that decode as 1 uop but issue as 2: `push`/`pop`, `call`/`ret`, and `lea` with 3 components do this too, as do KNL's AVX512 gather instructions. (Source: Intel's optimization manual, 17.1.2 Out-of-Order Engine (KNL).) It's only a small throughput penalty (and sometimes not even that, if anything else is a bigger bottleneck), so it's generally fine to still use `inc` for "generic" tuning.
Intel's optimization manual still recommends `add 1` over `inc` in general, to avoid risks of partial-flag stalls. But since Intel's compiler doesn't do that by default, it's not too likely that future CPUs will make `inc` slow in all cases, like P4 did.
Clang 5.0 and Intel's ICC 17 (on Godbolt) do use `inc` when optimizing for speed (`-O3`), not just for size. `-mtune=pentium4` makes them avoid `inc`/`dec`, but the default `-mtune=generic` doesn't put much weight on P4.
ICC17 `-xMIC-AVX512` (equivalent to gcc's `-march=knl`) does avoid `inc`, which is probably a good bet in general for Silvermont / KNL. But it's not usually a performance disaster to use `inc`, so it's probably still appropriate for "generic" tuning to use `inc`/`dec` in most code, especially when the flag result isn't part of the critical path.
Other than on Silvermont, this is mostly-stale optimization advice left over from Pentium 4. On modern CPUs, there's only a problem if you actually read a flag that wasn't written by the last insn that wrote any flags, e.g. in BigInteger `adc` loops. (And in that case, you need to preserve CF, so using `add` would break your code.)
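For example, a minimal sketch of such a loop (NASM-style syntax assumed; the register assignments and label are mine): the loop overhead must leave CF intact between `adc`s, which `lea` and `dec` do and `add`/`sub` would not.

```
; rsi, rdi = input limb arrays, rdx = destination, rcx = limb count (> 0)
    xor   r8d, r8d            ; index = 0; xor also clears CF for the first adc
.loop:
    mov   eax, [rsi + r8*4]
    adc   eax, [rdi + r8*4]   ; consumes CF carried over from the last iteration
    mov   [rdx + r8*4], eax
    lea   r8, [r8 + 1]        ; index++ without writing any flags
    dec   rcx                 ; count--; writes ZF but leaves CF unmodified
    jnz   .loop               ; an add/sub here would clobber the carry
```

(Reading CF with `adc` right after `dec` is exactly the partial-flag case mentioned above: fine on SnB-family, a stall on Nehalem and earlier.)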
`add` writes all the condition-flag bits in the EFLAGS register. Register renaming makes write-only easy for out-of-order execution: see write-after-write and write-after-read hazards. `add eax, 1` and `add ecx, 1` can execute in parallel because they are fully independent of each other. (Even Pentium 4 renames the condition-flag bits separately from the rest of EFLAGS, since even `add` leaves the interrupt-enable and many other bits unmodified.)
On P4, `inc` and `dec` depend on the previous value of all the flags, so they can't execute in parallel with each other or with preceding flag-setting instructions. (e.g. `add eax, [mem]` / `inc ecx` makes the `inc` wait until after the `add`, even if the add's load misses in cache.) This is called a false dependency. Partial-flag writes work by reading the old value of the flags, updating the bits other than CF, then writing the full flags.
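Spelled out as a block (the instruction pair is the example above; the comments and the `add` alternative are mine):

```
add  eax, [mem]   ; writes all the arithmetic flags; the load may miss in cache
inc  ecx          ; P4: reads the old FLAGS to merge in its result, so it
                  ; waits for the add even though ecx doesn't depend on eax
add  ecx, 1       ; write-only flag update: would run in parallel with the add
```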
All other out-of-order x86 CPUs (including AMD's) rename different parts of the flags separately, so internally they do a write-only update to all the flags except CF. (Source: Agner Fog's microarchitecture guide.) Only a few instructions, like `adc` or `cmc`, truly read and then write flags. But so does `shl r, cl` (see below).
Cases where `add dest, 1` is preferable to `inc dest`, at least for Intel P6/SnB uarch families:

- Memory-destination: as noted in the TL:DR, `inc mem` costs an extra uop because the load can't micro-fuse. But beware of uop-cache effects with `add [label], 1`, which needs a 32-bit address and an 8-bit immediate for the same uop. (See the sketch below.)
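A sketch of those memory-destination forms (NASM-style; `counter` is a hypothetical static variable of my invention):

```
inc  dword [rdi]         ; P6/SnB-family: the load can't micro-fuse -> extra uop
add  dword [rdi], 1      ; the load micro-fuses with the add
add  dword [counter], 1  ; needs a 32-bit address and an 8-bit immediate in the
                         ; same uop: can pack less densely into the uop cache
```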
On Intel SnB-family, variable-count shifts are 3 uops (up from 1 on Core2/Nehalem). AFAICT, two of the uops read/write flags, and an independent uop reads `reg` and `cl` and writes `reg`. It's a weird case of having better latency (1c + inevitable resource conflicts) than throughput (1.5c), and only bei