Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
490 views
in Technique[技术] by (71.8m points)

optimization - Computing 8 horizontal sums of eight AVX single-precision floating-point vectors

I have 8 AVX vectors containing 8 floats each (64 floats in total) and I want to sum elements in each vector together (basically perform eight horizontal sums).

For now, I'm using the following code:

__m256 HorizontalSums(__m256 v0, __m256 v1, __m256 v2, __m256 v3, __m256 v4, __m256 v5, __m256 v6, __m256 v7)
{
    // transpose
    const __m256 t0 = _mm256_unpacklo_ps(v0, v1);
    const __m256 t1 = _mm256_unpackhi_ps(v0, v1);
    const __m256 t2 = _mm256_unpacklo_ps(v2, v3);
    const __m256 t3 = _mm256_unpackhi_ps(v2, v3);
    const __m256 t4 = _mm256_unpacklo_ps(v4, v5);
    const __m256 t5 = _mm256_unpackhi_ps(v4, v5);
    const __m256 t6 = _mm256_unpacklo_ps(v6, v7);
    const __m256 t7 = _mm256_unpackhi_ps(v6, v7);

    __m256 v = _mm256_shuffle_ps(t0, t2, 0x4E);
    const __m256 tt0 = _mm256_blend_ps(t0, v, 0xCC);
    const __m256 tt1 = _mm256_blend_ps(t2, v, 0x33);
    v = _mm256_shuffle_ps(t1, t3, 0x4E);
    const __m256 tt2 = _mm256_blend_ps(t1, v, 0xCC);
    const __m256 tt3 = _mm256_blend_ps(t3, v, 0x33);
    v = _mm256_shuffle_ps(t4, t6, 0x4E);
    const __m256 tt4 = _mm256_blend_ps(t4, v, 0xCC);
    const __m256 tt5 = _mm256_blend_ps(t6, v, 0x33);
    v = _mm256_shuffle_ps(t5, t7, 0x4E);
    const __m256 tt6 = _mm256_blend_ps(t5, v, 0xCC);
    const __m256 tt7 = _mm256_blend_ps(t7, v, 0x33);

    // compute sums
    __m256 sum0 = _mm256_add_ps(_mm256_add_ps(tt0, tt1), _mm256_add_ps(tt2, tt3));
    __m256 sum1 = _mm256_add_ps(_mm256_add_ps(tt4, tt5), _mm256_add_ps(tt6, tt7));
    v0 = _mm256_blend_ps(sum0, sum1, 0xF0);
    v1 = _mm256_permute2f128_ps(sum0, sum1, 0x21); // final inter-lane shuffling
    return _mm256_add_ps(v0, v1);
}

As you can see, I'm just transposing the vectors and summing elements at the end. I'm already using two tricks here: replacing _mm256_shuffle_ps with _mm256_blend_ps where possible in order to reduce port 5 pressure on Intel CPUs as well as I'm using _mm256_permute2f128_ps + _mm256_blend_ps at the end to perform inter-lane shuffling.

Is there any better (faster) way to compute this?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

OK, I think I have found faster algorithm based on (usually slow) HADDs:

__m256 HorizontalSums(__m256 v0, __m256 v1, __m256 v2, __m256 v3, __m256 v4, __m256 v5, __m256 v6, __m256 v7)
{
    const __m256 s01 = _mm256_hadd_ps(v0, v1);
    const __m256 s23 = _mm256_hadd_ps(v2, v3);
    const __m256 s45 = _mm256_hadd_ps(v4, v5);
    const __m256 s67 = _mm256_hadd_ps(v6, v7);
    const __m256 s0123 = _mm256_hadd_ps(s01, s23);
    const __m256 s4556 = _mm256_hadd_ps(s45, s67);

    // inter-lane shuffle
    v0 = _mm256_blend_ps(s0123, s4556, 0xF0);
    v1 = _mm256_permute2f128_ps(s0123, s4556, 0x21);

    return _mm256_add_ps(v0, v1);
}

According to IACA, it's ~8 cycles faster on Haswell.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...