First of all: you shouldn't assume that you can optimize better than the compiler can.
Yes, your "optimized" code now uses AVX instructions. But you also wrote code which the compiler now has trouble unrolling, on top of the plain vectorization.
For comparison, let's have a look at what a compiler actually makes of your "slow" C implementation, just the hot loop without the footer.
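For reference, a plain dot-product loop along the lines of the following minimal sketch is assumed here (the function and variable names are hypothetical, but the structure matches what the assembly below computes):

#include <stddef.h>

/* Hypothetical sketch of the plain "slow" C version: a straightforward
   float dot product, one multiply-add per element. */
float dot_plain(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}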
ICC, compiled with -O3 -march=skylake -ffast-math:
..B1.13:
vmovups ymm2, YMMWORD PTR [rsi+rdi*4]
vmovups ymm3, YMMWORD PTR [32+rsi+rdi*4]
vfmadd231ps ymm1, ymm2, YMMWORD PTR [r8+rdi*4]
vfmadd231ps ymm0, ymm3, YMMWORD PTR [32+r8+rdi*4]
add rdi, 16
cmp rdi, rax
jb ..B1.13
Clang, with the same parameters, is even more pessimistic and unrolls this to the following:
.LBB0_4:
vmovups ymm4, ymmword ptr [rsi + 4*rcx]
vmovups ymm5, ymmword ptr [rsi + 4*rcx + 32]
vmovups ymm6, ymmword ptr [rsi + 4*rcx + 64]
vmovups ymm7, ymmword ptr [rsi + 4*rcx + 96]
vfmadd132ps ymm4, ymm0, ymmword ptr [rdi + 4*rcx]
vfmadd132ps ymm5, ymm1, ymmword ptr [rdi + 4*rcx + 32]
vfmadd132ps ymm6, ymm2, ymmword ptr [rdi + 4*rcx + 64]
vfmadd132ps ymm7, ymm3, ymmword ptr [rdi + 4*rcx + 96]
vmovups ymm0, ymmword ptr [rsi + 4*rcx + 128]
vmovups ymm1, ymmword ptr [rsi + 4*rcx + 160]
vmovups ymm2, ymmword ptr [rsi + 4*rcx + 192]
vmovups ymm3, ymmword ptr [rsi + 4*rcx + 224]
vfmadd132ps ymm0, ymm4, ymmword ptr [rdi + 4*rcx + 128]
vfmadd132ps ymm1, ymm5, ymmword ptr [rdi + 4*rcx + 160]
vfmadd132ps ymm2, ymm6, ymmword ptr [rdi + 4*rcx + 192]
vfmadd132ps ymm3, ymm7, ymmword ptr [rdi + 4*rcx + 224]
add rcx, 64
add rax, 2
jne .LBB0_4
Surprise: both compilers were already able to use AVX instructions, no intrinsics hacking required.
But what's more interesting is that both compilers decided that one accumulation register wasn't enough to saturate the AVX pipeline, and instead used two and four accumulation registers, respectively. Having more operations in flight helps mask the latency of the FMA, up to the point where the actual memory throughput limit is reached.
Just don't forget the -ffast-math compiler option; without it, it wouldn't be legal to pull the final accumulation out of the vectorized loop.
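Expressed back in C, the multi-accumulator unrolling the compilers apply looks roughly like the sketch below (hypothetical code, four partial sums as in the Clang output; reassociating the additions like this is exactly what -ffast-math permits):

#include <stddef.h>

/* Rough C equivalent of the generated code: four independent partial sums,
   so consecutive FMAs do not have to wait on each other's result.
   The function name is made up. */
float dot_unrolled(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);  /* final accumulation pulled out of the loop */
    for (; i < n; ++i)                  /* scalar remainder */
        sum += a[i] * b[i];
    return sum;
}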
GCC, also with the same options, is actually "only" doing as well as your "optimized" solution:
.L7:
add r8, 1
vmovaps ymm3, YMMWORD PTR [r9+rax]
vfmadd231ps ymm1, ymm3, YMMWORD PTR [rcx+rax]
add rax, 32
cmp r8, r10
jb .L7
However, GCC was still a little bit smarter in also adding a header to that loop, so it could use vmovaps (aligned memory access) instead of vmovups (unaligned memory access) for the first load.
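Conceptually, that header is just a scalar peel loop that runs until one of the pointers reaches 32-byte alignment; a minimal sketch of the idea (hypothetical, GCC emits the actual prologue directly in assembly):

#include <stddef.h>
#include <stdint.h>

/* Sketch of an alignment prologue: handle elements one at a time until
   a[] is 32-byte aligned, so the vectorized main loop can use aligned
   loads (vmovaps) for that operand. Assumes a[] has the usual 4-byte
   alignment of a float array. */
float dot_peeled(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
    while (i < n && ((uintptr_t)(a + i) & 31u) != 0) {
        sum += a[i] * b[i];
        ++i;
    }
    for (; i < n; ++i)   /* main loop; a + i is now 32-byte aligned */
        sum += a[i] * b[i];
    return sum;
}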
For completeness, with pure AVX (-O3 -march=ivybridge -ffast-math):
ICC:
..B1.12:
vmovups xmm2, XMMWORD PTR [r8+rdi*4]
vmovups xmm5, XMMWORD PTR [32+r8+rdi*4]
vinsertf128 ymm3, ymm2, XMMWORD PTR [16+r8+rdi*4], 1
vinsertf128 ymm6, ymm5, XMMWORD PTR [48+r8+rdi*4], 1
vmulps ymm4, ymm3, YMMWORD PTR [rsi+rdi*4]
vmulps ymm7, ymm6, YMMWORD PTR [32+rsi+rdi*4]
vaddps ymm1, ymm1, ymm4
vaddps ymm0, ymm0, ymm7
add rdi, 16
cmp rdi, rax
jb ..B1.12
Clang:
.LBB0_5:
vmovups xmm4, xmmword ptr [rdi + 4*rcx]
vmovups xmm5, xmmword ptr [rdi + 4*rcx + 32]
vmovups xmm6, xmmword ptr [rdi + 4*rcx + 64]
vmovups xmm7, xmmword ptr [rdi + 4*rcx + 96]
vinsertf128 ymm4, ymm4, xmmword ptr [rdi + 4*rcx + 16], 1
vinsertf128 ymm5, ymm5, xmmword ptr [rdi + 4*rcx + 48], 1
vinsertf128 ymm6, ymm6, xmmword ptr [rdi + 4*rcx + 80], 1
vinsertf128 ymm7, ymm7, xmmword ptr [rdi + 4*rcx + 112], 1
vmovups xmm8, xmmword ptr [rsi + 4*rcx]
vmovups xmm9, xmmword ptr [rsi + 4*rcx + 32]
vmovups xmm10, xmmword ptr [rsi + 4*rcx + 64]
vmovups xmm11, xmmword ptr [rsi + 4*rcx + 96]
vinsertf128 ymm8, ymm8, xmmword ptr [rsi + 4*rcx + 16], 1
vmulps ymm4, ymm8, ymm4
vaddps ymm0, ymm4, ymm0
vinsertf128 ymm4, ymm9, xmmword ptr [rsi + 4*rcx + 48], 1
vmulps ymm4, ymm4, ymm5
vaddps ymm1, ymm4, ymm1
vinsertf128 ymm4, ymm10, xmmword ptr [rsi + 4*rcx + 80], 1
vmulps ymm4, ymm4, ymm6
vaddps ymm2, ymm4, ymm2
vinsertf128 ymm4, ymm11, xmmword ptr [rsi + 4*rcx + 112], 1
vmulps ymm4, ymm4, ymm7
vaddps ymm3, ymm4, ymm3
add rcx, 32
cmp rax, rcx
jne .LBB0_5
GCC:
.L5:
vmovups xmm3, XMMWORD PTR [rdi+rax]
vinsertf128 ymm1, ymm3, XMMWORD PTR [rdi+16+rax], 0x1
vmovups xmm4, XMMWORD PTR [rsi+rax]
vinsertf128 ymm2, ymm4, XMMWORD PTR [rsi+16+rax], 0x1
add rax, 32
vmulps ymm1, ymm1, ymm2
vaddps ymm0, ymm0, ymm1
cmp rax, rcx
jne .L5
Pretty much the same optimizations are applied, just with a few additional operations since FMA is missing and unaligned 256-bit loads are not advisable on Ivy Bridge.
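If you wanted to mimic that split-load pattern by hand with intrinsics, it would look roughly like the sketch below (purely illustrative, names made up; the whole point of the above is that the compilers already generate this for you):

#include <immintrin.h>
#include <stddef.h>

/* Illustrative sketch of the Ivy Bridge code shape: two 128-bit unaligned
   loads merged with vinsertf128, then a separate multiply and add because
   FMA is not available. */
float dot_avx_only(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_castps128_ps256(_mm_loadu_ps(a + i));
        va = _mm256_insertf128_ps(va, _mm_loadu_ps(a + i + 4), 1);
        __m256 vb = _mm256_castps128_ps256(_mm_loadu_ps(b + i));
        vb = _mm256_insertf128_ps(vb, _mm_loadu_ps(b + i + 4), 1);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));   /* mul + add, no FMA */
    }
    /* horizontal sum of the eight partial results */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);
    for (; i < n; ++i)   /* scalar remainder */
        sum += a[i] * b[i];
    return sum;
}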