Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

gcc - How do you load/store from/to an array of doubles with GNU C Vector Extensions?

I'm using GNU C Vector Extensions, not Intel's _mm_* intrinsics.

I want to do the same thing as Intel's _mm256_loadu_pd intrinsic. Assigning the values one by one is slow: gcc produces code with 4 load instructions rather than a single vmovupd (which _mm256_loadu_pd does generate).

typedef double vector __attribute__((vector_size(4 * sizeof(double))));

int main(int argc, char **argv) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    vector v;

    /* I currently do this */
    v[0] = a[0];
    v[1] = a[1];
    v[2] = a[2];
    v[3] = a[3];
}

I want something like this:

v = (vector)(a);

or

v = *((vector*)(a));

but neither works. The first fails with "can't convert value to a vector", while the second segfaults.


1 Answer


Update: I see you're using GNU C's native vector syntax, not Intel intrinsics. Are you avoiding Intel intrinsics for portability to non-x86? gcc currently does a bad job compiling code that uses GNU C vectors wider than the target machine supports. (You'd hope that it would just use two 128-bit vectors and operate on each separately, but apparently it's worse than that.)

Anyway, this answer shows how you can use Intel x86 intrinsics to load data into GNU C vector-syntax types.


First of all, looking at compiler output at less than -O2 is a waste of time if you're trying to learn anything about what will compile to good code. Your main() will optimize to just a ret at -O2.

Besides that, it's not totally surprising that you get bad asm from assigning elements of a vector one at a time.


Aside: normal people would call the type v4df (vector of 4 Double Float) or something, not vector, so they don't go insane when using it with C++ std::vector. For single-precision, v8sf. IIRC, gcc uses type names like this internally for __m256d.

On x86, Intel intrinsic types (like __m256d) are implemented on top of GNU C vector syntax (which is why you can do v1 * v2 in GNU C instead of writing _mm256_mul_pd(v1, v2)). You can convert freely from __m256d to v4df, like I've done here.
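To make the operator syntax concrete, here is a minimal sketch in pure GNU C (no intrinsics, so it should also build for non-x86 targets). The helper names mul_add and hsum are hypothetical, made up for this illustration.

```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

/* Element-wise a*b + c, written with ordinary C operators.
   (mul_add is a hypothetical helper name, not a library function.) */
static inline v4df mul_add(v4df a, v4df b, v4df c) {
    return a * b + c;
}

/* Horizontal sum using element subscripting, which GNU C allows
   on vector types. (hsum is also a hypothetical name.) */
static inline double hsum(v4df v) {
    return v[0] + v[1] + v[2] + v[3];
}
```

A vector can be initialized like an array, e.g. v4df a = {1.0, 2.0, 3.0, 4.0};, and then passed to these helpers by value.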

I've wrapped both sane ways to do this in functions, so we can look at their asm. Notice how we're not loading from an array that we define inside the same function, so the compiler won't optimize it away.

I put them on the Godbolt compiler explorer so you can look at the asm with various compile options and compiler versions.

#include <immintrin.h>

typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

// note the return types.  gcc6.1 compiles with no warnings, even at -Wall -Wextra
v4df load_4_doubles_intel(const double *p) { return _mm256_loadu_pd(p); }
    // asm:  vmovupd ymm0, YMMWORD PTR [rdi]   # tmp89,* p
    //       ret

v4df avx_constant(void) { return _mm256_setr_pd( 1.0, 2.0, 3.0, 4.0 ); }
    // asm:  vmovapd ymm0, YMMWORD PTR .LC0[rip]
    //       ret

If the args to _mm_set* intrinsics aren't compile-time constants, the compiler will do the best it can to make efficient code to get all the elements into a single vector. It's usually best to do that rather than writing C that stores to a tmp array and loads from it, because that's not always the best strategy. (Store-forwarding failure on multiple narrow stores forwarding to a wide load costs an extra ~10 cycles (IIRC) of latency on top of the usual store-forwarding delay. If your doubles are already in registers, it's usually best to just shuffle them together.)
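If you want to stay in GNU C syntax, a vector initializer with runtime values plays roughly the same role as the set intrinsics: you describe the elements and leave the strategy to the compiler. This is only a sketch; make_v4df and splat_v4df are hypothetical names, and I'm assuming (not promising) that the compiler picks a reasonable shuffle sequence rather than bouncing through memory.

```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

/* Build a vector from four scalars that may only be known at run time.
   Element 0 comes first, i.e. the same order as _mm256_setr_pd.
   (make_v4df is a hypothetical helper name.) */
static inline v4df make_v4df(double a, double b, double c, double d) {
    return (v4df){a, b, c, d};
}

/* Broadcast one scalar to all four elements (hypothetical helper). */
static inline v4df splat_v4df(double x) {
    return (v4df){x, x, x, x};
}
```

As with the intrinsics, check the asm at -O2 or higher to see what the compiler actually chose for your target.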


See also "Is it possible to cast floats directly to __m128 if they are 16 byte alligned?" for a list of the various intrinsics for getting a single scalar into a vector. The x86 tag wiki has links to Intel's manuals, and their intrinsics finder.


Load/store GNU C vectors without Intel intrinsics:

I'm not sure how you're "supposed" to do that. This Q&A suggests casting a pointer to the memory you want to load, and using a vector type like typedef char __attribute__ ((vector_size (16),aligned (1))) unaligned_byte16; (note the aligned(1) attribute).

You get a segfault from *(v4df *)a because presumably a isn't aligned on a 32-byte boundary, but you're using a vector type that does assume natural alignment. (Just like __m256d if you dereference a pointer to it instead of using load/store intrinsics to communicate alignment info to the compiler.)
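Putting the aligned(1) idea together, a sketch of unaligned load/store helpers in GNU C might look like the following. The names v4df_u, load_v4df_unaligned, store_v4df_unaligned and load_v4df_memcpy are hypothetical. The may_alias attribute is something I've added defensively, and the memcpy variant is an alternative that compilers normally turn into a single unaligned load when optimizing; neither comes from the linked Q&A.

```c
#include <string.h>

typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

/* Same 32-byte vector, but the type only promises 1-byte alignment,
   so dereferencing a pointer to it must not assume a 32-byte boundary.
   may_alias is an extra precaution (an assumption of this sketch). */
typedef double v4df_u
    __attribute__((vector_size(4 * sizeof(double)), aligned(1), may_alias));

/* Load 4 doubles from a possibly unaligned address. */
static inline v4df load_v4df_unaligned(const double *p) {
    return *(const v4df_u *)p;
}

/* Store 4 doubles to a possibly unaligned address. */
static inline void store_v4df_unaligned(double *p, v4df v) {
    *(v4df_u *)p = v;
}

/* Alternative without any pointer cast: copy the bytes and let the
   optimizer reduce the memcpy to a vector load. */
static inline v4df load_v4df_memcpy(const double *p) {
    v4df v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

With this, the question's v = *((vector*)(a)); becomes v = load_v4df_unaligned(a);. Whether you get a single vmovupd depends on compiling with AVX enabled (e.g. -mavx) and optimization on, so look at the asm for your own build.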

