Inline-assembly is usually a bad idea: modern compilers do a good job with x86 intrinsics.
Putting the bit-pattern for a double into RAX is usually also not useful, and smells like you've probably already gone down the wrong path into sub-optimal territory. Vector shuffle instructions are usually better: element-wise insert/extract instructions already cost a shuffle uop on Intel hardware, except for vmovq %xmm0, %rax to get the low element.
Also, if you're going to insert it into another vector, you should use a shuffle + immediate blend instead (vpermpd / vblendpd).
L1d cache and store-forwarding are fast, and even store-forwarding stalls are not a disaster. Choose wisely between ALU and memory strategies to gather data into, or scatter data from, SIMD vectors. Also remember that insert/extract instructions need an immediate index, so if you have a runtime index for a vector, you definitely want to store the vector and index the stored copy. (See https://agner.org/optimize/ and other performance links in https://stackoverflow.com/tags/x86/info)
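As a rough illustration of the store-and-index approach for a runtime index, here is a minimal C sketch using intrinsics. The function names are hypothetical, and the `target("avx")` attribute is a GCC/Clang extension used only so the block builds without `-mavx`; it still needs an AVX-capable CPU to run.

```c
#include <immintrin.h>
#include <stdint.h>

/* Runtime-index extract: spill the vector, then do a scalar indexed load.
 * Store-forwarding from the 32-byte store to the 8-byte reload is
 * normally efficient. */
__attribute__((target("avx")))
static double extract_runtime_index(__m256d v, unsigned idx)
{
    double tmp[4];
    _mm256_storeu_pd(tmp, v);      /* store the whole vector */
    return tmp[idx & 3];           /* index the stored copy */
}

/* Hypothetical wrapper so callers don't need to pass __m256d around. */
__attribute__((target("avx")))
double extract_from_array(const double src[4], unsigned idx)
{
    return extract_runtime_index(_mm256_loadu_pd(src), idx);
}
```

With a compile-time-constant index, a compiler is free to turn this into shuffles instead; with a true runtime index, the store/reload is usually what you want.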
Lots of insert/extract can quickly bottleneck on port 5 on Haswell and later. See Loading an xmm from GP regs for more details, and some links to gcc bug reports where I went into more detail about strategies for different element widths on different uarches and with SSE4.1 vs. without SSE4.1, etc.
There's no PD version of extractps r/m32, xmm, imm, and insertps is a shuffle between XMM vectors.
To read/write the upper element of the low lane of a YMM, you'll have to use integer vpextrq $1, %xmm0, %rax / pinsrq $1, %rax, %xmm0. Those aren't available in YMM width, so you need multiple instructions to read/write the high lane.
The VEX version vpinsrq $1, %rax, %xmm0 will zero the high lane(s) of the destination vector's full YMM or ZMM width, which is why I suggested pinsrq. On Skylake and later, the legacy-SSE pinsrq won't cause an SSE/AVX transition stall. See Using ymm registers as a "memory-like" storage location for an example (NASM syntax), and also Loading an xmm from GP regs.
For the low element, use vmovq %xmm0, %rax to extract; it's cheaper than vpextrq (1 uop instead of 2).
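For reference, the two low-lane reads can be written with intrinsics roughly as follows. This is a sketch with hypothetical function names; `target("avx")` is a GCC/Clang extension, and the instruction noted in each comment is what a compiler would typically emit, not a guarantee.

```c
#include <immintrin.h>
#include <stdint.h>

/* Element 0: typically a single vmovq %xmm, %r64. */
__attribute__((target("avx")))
static uint64_t low_qword_bits(__m256d v)
{
    __m128i lo = _mm_castpd_si128(_mm256_castpd256_pd128(v));
    return (uint64_t)_mm_cvtsi128_si64(lo);
}

/* Element 1: typically vpextrq $1, which costs a shuffle uop as well. */
__attribute__((target("avx")))
static uint64_t second_qword_bits(__m256d v)
{
    __m128i lo = _mm_castpd_si128(_mm256_castpd256_pd128(v));
    return (uint64_t)_mm_extract_epi64(lo, 1);
}

/* Hypothetical wrapper: elem is 0 or 1, selecting within the low lane. */
__attribute__((target("avx")))
uint64_t low_lane_bits(const double src[4], int elem)
{
    __m256d v = _mm256_loadu_pd(src);
    return elem ? second_qword_bits(v) : low_qword_bits(v);
}
```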
For ZMM, my answer on that linked XMM from GP regs question shows that you can use AVX512F to merge-mask an integer register into a vector, given a mask register with a single bit set.
vpbroadcastq %rax, %zmm0{%k1}
Similarly, vpcompressq can move an element selected by a single-bit mask to the bottom for vmovq.
But for extract, if you have an index instead of 1<<index to start with, you may be better off with vmovd %ecx, %xmm1 / vpermq %zmm0, %zmm1, %zmm2 / vmovq %xmm2, %rax. (For dword elements, use vpermd and vmovd %xmm2, %eax instead.) This trick even works with vpshufb for byte elements (within a lane at least). For lane crossing, maybe shuffle + vmovd with the high bits of the byte index, then scalar right-shift using the low bits of the index as the byte-within-word offset. See also How to use _mm_extract_epi8 function? for intrinsics for a variable-index emulation of pextrb.
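A sketch of the two AVX512F ideas above for qword elements: a merge-masked broadcast to insert at a runtime index, and a variable permute to extract. All function names here are hypothetical. `target("avx512f")` and `__builtin_cpu_supports` are GCC/Clang extensions, and the runtime check is included only so callers can skip these paths on CPUs without AVX512F.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if the running CPU (and OS) support AVX512F. */
int have_avx512f(void)
{
    return __builtin_cpu_supports("avx512f");
}

/* Insert x at a runtime index via a merge-masked broadcast
 * (the vpbroadcastq %rax, %zmm0{%k1} idea). */
__attribute__((target("avx512f")))
void insert_qword_var(uint64_t v[8], unsigned idx, uint64_t x)
{
    __m512i vec = _mm512_loadu_si512(v);
    __mmask8 k = (__mmask8)(1u << (idx & 7));        /* single-bit mask */
    vec = _mm512_mask_set1_epi64(vec, k, (long long)x);
    _mm512_storeu_si512(v, vec);
}

/* Extract at a runtime index: move the index into a vector, permute the
 * selected qword down to element 0, then read element 0. Only element 0
 * of the index vector matters, so the undefined upper part of the cast
 * is harmless here. */
__attribute__((target("avx512f")))
uint64_t extract_qword_var(const uint64_t v[8], unsigned idx)
{
    __m512i vec  = _mm512_loadu_si512(v);
    __m512i vidx = _mm512_castsi128_si512(_mm_cvtsi32_si128((int)(idx & 7)));
    __m512i sel  = _mm512_permutexvar_epi64(vidx, vec);   /* vpermq */
    return (uint64_t)_mm_cvtsi128_si64(_mm512_castsi512_si128(sel));
}
```

Whether this beats a store and indexed reload depends on the surrounding code; the permute version keeps everything in registers at the cost of a lane-crossing shuffle.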
High lane of a YMM, with AVX2
I think your best bet to write an element in the high lane of a YMM with AVX2 needs a scratch register:
- vmovq %rax, %xmm0 (copy to a scratch vector)
- shuffle into position with vinsertf128 (AVX1) or vpbroadcastq/vbroadcastsd. It's faster than vpermq/vpermpd on AMD. (But the reg-reg version is still AVX2-only)
- vblendpd (FP) or vpblendd (integer) into the target YMM reg. Immediate blends with dword or larger element width are very cheap (1 uop for any vector ALU port on Intel).
This is only 3 uops, but 2 of them need port 5 on Intel CPUs. (So it costs the same as a vpinsrq + a blend). Only the blend is on the critical path from vector input to vector output; setting up ymm0 from rax is independent.
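The scratch-register sequence above might look like this with intrinsics. This is only a sketch: the function name is hypothetical, `target("avx2")` is a GCC/Clang extension, and the compiler is free to pick different instructions than the vmovq / vpbroadcastq / vblendpd noted in the comments.

```c
#include <immintrin.h>
#include <stdint.h>

/* Replace element 2 or 3 (the high lane) of a 4 x double vector with the
 * given bit pattern. The blend immediate must be a compile-time constant,
 * hence the branch on elem. */
__attribute__((target("avx2")))
void set_high_lane_elem(double v[4], int elem, uint64_t bits)
{
    __m256d vec = _mm256_loadu_pd(v);

    /* vmovq + vpbroadcastq: the scratch vector has the value in every
     * element, and is independent of vec. */
    __m256d scratch = _mm256_castsi256_pd(_mm256_set1_epi64x((long long)bits));

    /* vblendpd: the only step on the dependency chain through vec. */
    vec = (elem == 3) ? _mm256_blend_pd(vec, scratch, 1 << 3)
                      : _mm256_blend_pd(vec, scratch, 1 << 2);

    _mm256_storeu_pd(v, vec);
}
```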
To read the highest element, vpermpd or vpermq $3, %ymm1, %ymm0 (AVX2), then vmovq from xmm0.
To read the 2nd-highest element, vextractf128 $1, %ymm1, %xmm0 (AVX1) and vmovq. vextractf128 is faster than vpermq/pd on AMD CPUs.
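In intrinsics, the two high-lane reads could be sketched like this (hypothetical function names; the `target` attributes are GCC/Clang extensions so the block builds without `-mavx2`):

```c
#include <immintrin.h>
#include <stdint.h>

/* Element 3: lane-crossing vpermpd $3 brings it to element 0, then vmovq. */
__attribute__((target("avx2")))
static uint64_t elem3_bits(__m256d v)
{
    __m256d top = _mm256_permute4x64_pd(v, 3);
    return (uint64_t)_mm_cvtsi128_si64(
        _mm_castpd_si128(_mm256_castpd256_pd128(top)));
}

/* Element 2: vextractf128 $1 gives the high lane, whose low element is
 * element 2 of the original vector; then vmovq. Needs only AVX1. */
__attribute__((target("avx")))
static uint64_t elem2_bits(__m256d v)
{
    __m128d hi = _mm256_extractf128_pd(v, 1);
    return (uint64_t)_mm_cvtsi128_si64(_mm_castpd_si128(hi));
}

/* Hypothetical wrapper: elem is 2 or 3. */
__attribute__((target("avx2")))
uint64_t high_lane_bits(const double src[4], int elem)
{
    __m256d v = _mm256_loadu_pd(src);
    return (elem == 3) ? elem3_bits(v) : elem2_bits(v);
}
```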
A bad alternative to avoid a scratch reg for insert would be vpermq or vperm2i128 to shuffle the qword you want to replace into the low lane, pinsrq (not vpinsrq), then vpermq to put it back in the right order. That's all shuffle uops, and pinsrq is 2 uops. (And causes an SSE/AVX transition stall on Haswell, but not Skylake). Plus all those operations are part of a dependency chain for the register you're modifying, unlike setting up a value in another register and blending.