You want potability and efficiency. Well you can't have it both ways. You said you want to do this with the fewest number of instructions. Well it's possible to do this with only one instruction with SSE3 using the pshufb instruction (see below) from the x86 instruction set.
Maybe ARM Neon has something equivalent. If you want efficiency (and are sure that you need it) then learn the hardware.
The SSE equivalent of _MM_TRANSPOSE4_PS
for bytes is to use _mm_shuffle_epi8
(the intrinsic for pshufb) with a mask. Define the mask outside of your main loop.
//use -msse3 with GCC or /arch:SSE2 with MSVC
#include <stdio.h>
#include <tmmintrin.h> //SSSE3
int main() {
char x[] = {0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,15,16};
__m128i mask = _mm_setr_epi8(0x0,0x04,0x08,0x0c, 0x01,0x05,0x09,0x0d, 0x02,0x06,0x0a,0x0e, 0x03,0x07,0x0b,0x0f);
__m128i v = _mm_loadu_si128((__m128i*)x);
v = _mm_shuffle_epi8(v,mask);
_mm_storeu_si128((__m128i*)x,v);
for(int i=0; i<16; i++) printf("%d ", x[i]); printf("
");
//output: 0 4 8 12 1 5 9 13 2 6 10 15 3 7 11 16
}
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…