It's not possible to do this with a single instruction, but you can emulate it and still be lock-free. Except for the very earliest AMD64 CPUs, x64 supports the CMPXCHG16B
instruction. With a little multi-precision math, you can do this pretty easily.
I'm afraid I don't know the instrinsic for CMPXCHG16B
in GCC, but hopefully you get the idea of having a spin loop of CMPXCHG16B
. Here's some untested code for VC++:
// atomically adds 128-bit src to dst, with src getting the old dst.
void fetch_add_128b(uint64_t *dst, uint64_t* src)
{
uint64_t srclo, srchi, olddst[2], exchlo, exchhi;
srchi = src[0];
srclo = src[1];
olddst[0] = dst[0];
olddst[1] = dst[1];
do
{
exchlo = srclo + olddst[1];
exchhi = srchi + olddst[0] + (exchlo < srclo); // add and carry
}
while(!_InterlockedCompareExchange128((long long*)dst,
exchhi, exchlo,
(long long*)olddst));
src[0] = olddst[0];
src[1] = olddst[1];
}
Edit: here's some untested code going off of what I could find for the GCC intrinsics:
// atomically adds 128-bit src to dst, returning the old dst.
__uint128_t fetch_add_128b(__uint128_t *dst, __uint128_t src)
{
__uint128_t dstval, olddst;
dstval = *dst;
do
{
olddst = dstval;
dstval = __sync_val_compare_and_swap(dst, dstval, dstval + src);
}
while(dstval != olddst);
return dstval;
}
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…