From the original post, the best x64/SSE2 algorithm, implemented on ARMv7+NEON, works as follows:
(a[32:63] == b[32:63]) & (b[0:63] - a[0:63]) yields a mask of 0xFFFFFFFF........ for every case where the top 32 bits are equal and a[0:31] > b[0:31]. This works because the 64-bit subtraction borrows out of the lower half exactly when a[0:31] > b[0:31], which turns the otherwise zero upper half of the difference into all ones. In all other cases, such as when the top 32 bits are not equal or a[0:31] <= b[0:31], the upper 32 bits are 0x0.
This has the effect of propagating the comparison of the bottom 32 bits of each integer into the upper 32 bits as a mask when the top 32 bits are inconsequential and the lower 32 bits are significant. For the remaining cases, it takes the greater-than comparison of the top 32 bits and ORs it into that mask. As an example, if a[32:63] > b[32:63], then a is definitely greater than b, regardless of the least significant bits. Finally, it swizzles/shuffles/transposes the upper 32 bits of each 64-bit mask into the lower 32 bits to produce a full 64-bit mask.
An illustrative example implementation is in this Godbolt.
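Separately from the Godbolt, the steps can be followed one lane at a time in plain C. This is only a scalar sketch of the logic described above, not the NEON code itself: the function name cmpgt64_lane is hypothetical, and it assumes the comparison is a signed 64-bit greater-than (as with the SSE2 compare it is modelled on), which the text above does not state explicitly.

```c
#include <stdint.h>

/* Scalar model of ONE 64-bit lane of the compare described above.
 * Assumption: signed 64-bit "a > b". The name is hypothetical.
 * Returns all ones if a > b, otherwise 0. */
static uint64_t cmpgt64_lane(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a;
    uint64_t ub = (uint64_t)b;
    uint32_t a_hi = (uint32_t)(ua >> 32);
    uint32_t b_hi = (uint32_t)(ub >> 32);

    /* 32-bit equality mask of the top halves. */
    uint32_t eq_hi = (a_hi == b_hi) ? 0xFFFFFFFFu : 0u;

    /* Top half of the 64-bit difference b - a. When the top halves are
     * equal, it is all ones exactly when the low half borrowed,
     * i.e. when a[0:31] > b[0:31]. */
    uint32_t diff_hi = (uint32_t)((ub - ua) >> 32);

    /* Signed greater-than of the top halves, done as an unsigned
     * compare after flipping the sign bit of both operands. */
    uint32_t gt_hi = ((a_hi ^ 0x80000000u) > (b_hi ^ 0x80000000u))
                         ? 0xFFFFFFFFu : 0u;

    /* Combine: low-half result counts only if the top halves are equal;
     * otherwise the top-half compare decides. */
    uint32_t hi = (eq_hi & diff_hi) | gt_hi;

    /* Copy the upper 32 bits of the mask into the lower 32 bits. */
    return ((uint64_t)hi << 32) | hi;
}
```

In the vector version each of these steps is applied to both 64-bit lanes at once; the lower 32 bits of the intermediate results are meaningless, which is why the last step overwrites them with the upper half.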