TL;DR: It's almost always best to just do four broadcast-loads using `_mm256_set1_pd()`. This is very good on Haswell and later, where `vbroadcastsd ymm, [mem]` doesn't require an ALU shuffle uop, and it's usually also the best option for Sandybridge/Ivybridge (where it's a 2-uop load + shuffle instruction). It also means you don't need to care about alignment at all, beyond natural alignment for a `double`.
The first vector is ready sooner than if you did a two-step load + shuffle, so out-of-order execution can potentially get started on the code using these vectors while the first one is still loading. AVX512 can even fold broadcast-loads into memory operands for ALU instructions, so doing it this way will allow a recompile to take slight advantage of AVX512 with 256b vectors.
(It's usually best to use `set1(x)`, not `_mm256_broadcast_sd(&x)`: if the AVX2-only register-source form of `vbroadcastsd` isn't available, the compiler can choose between store -> broadcast-load or doing two shuffles. You never know when inlining will mean your code will run on inputs that are already in registers.)
If you're really bottlenecked on load-port resource-conflicts or load throughput, not total uops or ALU / shuffle resources, it might help to replace a pair of 64->256b broadcasts with a 16B->32B broadcast-load (`vbroadcastf128` / `_mm256_broadcast_pd`) and two in-lane shuffles (`vpermilpd` or `vunpckl/hpd` (`_mm256_shuffle_pd`)).
Or with AVX2: load 32B and use 4 `_mm256_permute4x64_pd` shuffles to broadcast each element into a separate vector.
Source: Agner Fog's insn tables (and microarch pdf):
Intel Haswell and later:
`vbroadcastsd ymm, [mem]` and other broadcast-load insns are 1-uop instructions that are handled entirely by a load port (the broadcast happens "for free").
The total cost of doing four broadcast-loads this way is 4 instructions: 4 fused-domain uops, and 4 unfused-domain uops for p2/p3. Throughput: two vectors per cycle.
Haswell only has one shuffle unit, on port 5. Doing all your broadcast-loads with load+shuffle would bottleneck on p5.
Maximum broadcast throughput is probably achieved with a mix of `vbroadcastsd ymm, m64` and shuffles:
## Haswell maximum broadcast throughput with AVX1
vbroadcastsd ymm0, [rsi]
vbroadcastsd ymm1, [rsi+8]
vbroadcastf128 ymm2, [rsi+16] # p23 only on Haswell, also p5 on SnB/IvB
vunpckhpd ymm3, ymm2,ymm2
vunpcklpd ymm2, ymm2,ymm2
vbroadcastsd ymm4, [rsi+32] # or vaddpd ymm0, [rdx+something]
#add rsi, 40
Any of these addressing modes can be two-register indexed addressing modes, because they don't need to micro-fuse to be a single uop.
AVX1: 5 vectors per 2 cycles, saturating p2/p3 and p5. (Ignoring cache-line splits on the 16B load). 6 fused-domain uops, leaving only 2 uops per 2 cycles to use the 5 vectors... Real code would probably use some of the load throughput to load something else (e.g. a non-broadcast 32B load from another array, maybe as a memory operand to an ALU instruction), or to leave room for stores to steal p23 instead of using p7.
## Haswell maximum broadcast throughput with AVX2
vmovups ymm3, [rsi]
vbroadcastsd ymm0, xmm3 # special-case for the low element; compilers should generate this from _mm256_permute4x64_pd(v, 0)
vpermpd ymm1, ymm3, 0b01_01_01_01 # NASM syntax for 0x55
vpermpd ymm2, ymm3, 0b10_10_10_10
vpermpd ymm3, ymm3, 0b11_11_11_11
vbroadcastsd ymm4, [rsi+32]
vbroadcastsd ymm5, [rsi+40]
vbroadcastsd ymm6, [rsi+48]
vbroadcastsd ymm7, [rsi+56]
vbroadcastsd ymm8, [rsi+64]
vbroadcastsd ymm9, [rsi+72]
vbroadcastsd ymm10,[rsi+80] # or vaddpd ymm0, [rdx + whatever]
#add rsi, 88
AVX2: 11 vectors per 4 cycles, saturating p23 and p5. (Ignoring cache-line splits for the 32B load...). Fused-domain: 12 uops, leaving 4 uops per 4 cycles beyond this.
I think 32B unaligned loads are a bit more fragile in terms of performance than unaligned 16B loads like `vbroadcastf128`.
Intel SnB/IvB:
`vbroadcastsd ymm, m64` is 2 fused-domain uops: p5 (shuffle) and p23 (load).
`vbroadcastss xmm, m32` and `movddup xmm, m64` are single-uop, load-port-only. Interestingly, `vmovddup ymm, m256` is also a single-uop load-port-only instruction, but like all 256b loads, it occupies a load port for 2 cycles. It can still generate a store-address in the 2nd cycle. This uarch doesn't deal well with cache-line splits on unaligned 32B loads, though; gcc defaults to `movups` / `vinsertf128` for unaligned 32B loads with `-mtune=sandybridge` / `-mtune=ivybridge`.
4x broadcast-load: 8 fused-domain uops: 4 p5 and 4 p23. Throughput: 4 vectors per 4 cycles, bottlenecking on port 5. Multiple loads from the same cache line in the same cycle don't cause a cache-bank conflict, so this is nowhere near saturating the load ports (also needed for store-address generation). That only happens on the same bank of two different cache lines in the same cycle.
Multiple 2-uop instructions with no other instructions between is the worst case for the decoders if the uop-cache is cold, but a good compiler would mix in single-uop instructions between them.
SnB has 2 shuffle units, but only the one on p5 can handle shuffles that have a 256b version in AVX. Using a p1 integer-shuffle uop to broadcast a double to both elements of an xmm register doesn't get us anywhere, since `vinsertf128 ymm, ymm, xmm, i` takes a p5 shuffle uop.
## Sandybridge maximum broadcast throughput: AVX1
vbroadcastsd ymm0, [rsi]
add rsi, 8
One vector per clock, saturating p5 but using only half the capacity of p23.
We can save one load uop at the cost of 2 more shuffle uops, throughput = two results per 3 clocks:
vbroadcastf128 ymm2, [rsi+16] # 2 uops: p23 + p5 on SnB/IvB
vunpckhpd ymm3, ymm2,ymm2 # 1 uop: p5
vunpcklpd ymm2, ymm2,ymm2 # 1 uop: p5
Doing a 32B load and unpacking it with 2x `vperm2f128` -> 4x `vunpckh/lpd` might help if stores are part of what's competing for p23.