As usual, inline asm is not strictly necessary (https://gcc.gnu.org/wiki/DontUseInlineAsm). But currently compilers kinda suck at actual extended-precision addition, so you might want asm for this.
There's an Intel intrinsic for adc: _addcarry_u64. But gcc and clang may make slow code from it, unfortunately. In GNU C on a 64-bit platform, you could just use unsigned __int128.
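For reference, here's a sketch of a 256-bit add using that intrinsic (x86-64 only; add256_intrinsic is my name for it). Note that it traffics in unsigned long long, and whether you actually get a clean adc chain depends on your compiler version:

#include <immintrin.h>

// x[0..3] += y[0..3], least-significant limb first, with the carry
// chained through the unsigned char flag.
void add256_intrinsic(unsigned long long x[4], const unsigned long long y[4]) {
    unsigned char c;
    c = _addcarry_u64(0, x[0], y[0], &x[0]);
    c = _addcarry_u64(c, x[1], y[1], &x[1]);
    c = _addcarry_u64(c, x[2], y[2], &x[2]);
    (void)_addcarry_u64(c, x[3], y[3], &x[3]);
}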
Compilers usually manage to make pretty good code when checking for carry-out from addition using the idiom carry_out = (x+y) < x, where < is an unsigned compare. For example:
struct long_carry { unsigned long res; unsigned carry; };

struct long_carry add_carryout(unsigned long x, unsigned long y) {
    unsigned long retval = x + y;
    unsigned carry = (retval < x);
    return (struct long_carry){ retval, carry };
}
gcc7.2 -O3 emits this (and clang emits similar code):
mov rax, rdi # because we need return value in a different register
xor edx, edx # set up for setc
add rax, rsi # generate carry
setc dl # save carry.
ret # return with rax=sum, edx=carry (SysV ABI struct packing)
There's no way you can do better than this with inline asm; this function already looks optimal for modern CPUs. (Well, I guess if mov weren't zero-latency, doing the add first would shorten the latency until the carry is ready. But on Intel CPUs, it's supposed to be better to overwrite mov-elimination results right away, so it's better to mov first and then add.)
Clang will even use adc to feed the carry-out from one add into another add as its carry-in, but only for the first limb. That's probably because this carry-propagation function is broken: carry_out = (x+y) < x doesn't work when there's carry-in. With carry_out = (x+y+c_in) < x, y+c_in can wrap to zero and give you (x+0) < x (false) even though there was a carry. For example, with x=2, y=~0UL, c_in=1: y+c_in wraps to 0, the sum equals x, and (x < x) is false even though the full-width result carried.
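A correct pure-C full adder has to check both additions for wraparound; at most one of the two can carry. Here's a sketch (adc_fixed is my naming, for contrast with the broken version shown below):

static unsigned adc_fixed(unsigned long *sum, unsigned long x,
                          unsigned long y, unsigned carry_in) {
    unsigned long tmp = y + carry_in;   // wraps to 0 only when carry_in=1 and y=~0UL
    unsigned carry = (tmp < carry_in);  // carry-out of y + carry_in
    *sum = x + tmp;
    carry |= (*sum < x);                // carry-out of x + tmp; can't both be set
    return carry;
}

Whether compilers turn this into an adc chain is a separate question; it fixes correctness, not code quality.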
Notice that clang's cmp / adc reg,0 sequence (in the asm below) exactly implements the behaviour of the buggy C, which isn't the same as simply using another adc there.
Anyway, gcc doesn't even use adc the first time, when it would be safe. (So use unsigned __int128 for code that doesn't suck, and asm for integers even wider than that.)
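For instance, here's a sketch of a 128-bit add with carry-out in GNU C (assuming unsigned long is 64 bits, as on x86-64 Linux; add128_carryout is my name for it). gcc and clang compile the 128-bit addition itself to add/adc:

unsigned add128_carryout(unsigned long res[2],
                         const unsigned long a[2], const unsigned long b[2]) {
    unsigned __int128 x = ((unsigned __int128)a[1] << 64) | a[0];
    unsigned __int128 y = ((unsigned __int128)b[1] << 64) | b[0];
    unsigned __int128 sum = x + y;       // compiles to add + adc
    res[0] = (unsigned long)sum;
    res[1] = (unsigned long)(sum >> 64);
    return (unsigned)(sum < x);          // same carry-out idiom, at 128-bit width
}

Anyway, here's the broken carry-in version in full, and a 256-bit add built from it: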
// BROKEN with carry_in=1 and y=~0UL
static
unsigned adc_buggy(unsigned long *sum, unsigned long x, unsigned long y, unsigned carry_in) {
    *sum = x + y + carry_in;
    unsigned carry = (*sum < x);
    return carry;
}
// *x += *y
void add256(unsigned long *x, unsigned long *y) {
    unsigned carry;
    carry = adc_buggy(x,   x[0], y[0], 0);
    carry = adc_buggy(x+1, x[1], y[1], carry);
    carry = adc_buggy(x+2, x[2], y[2], carry);
    carry = adc_buggy(x+3, x[3], y[3], carry);
}

clang compiles this to:
mov rax, qword ptr [rsi]
add rax, qword ptr [rdi]
mov qword ptr [rdi], rax
mov rax, qword ptr [rdi + 8]
mov r8, qword ptr [rdi + 16] # hoisted
mov rdx, qword ptr [rsi + 8]
adc rdx, rax # ok, no memory operand but still adc
mov qword ptr [rdi + 8], rdx
mov rcx, qword ptr [rsi + 16] # r8 was loaded earlier
add rcx, r8
cmp rdx, rax # manually check the previous result for carry. /facepalm
adc rcx, 0
...
This sucks, so if you want extended-precision addition, you still need asm. But for getting the carry-out into a C variable, you don't.
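For completeness, a GNU C inline-asm sketch of the 256-bit add (AT&T syntax, x86-64 only; add256_asm and the operand names are mine). Keeping the whole chain inside one asm statement means nothing can clobber FLAGS between the adcs:

// *x += *y with a real add/adc/adc/adc chain.
// Loading y into temporaries first keeps this correct even if x == y.
void add256_asm(unsigned long *x, const unsigned long *y) {
    unsigned long t0 = y[0], t1 = y[1], t2 = y[2], t3 = y[3];
    asm("add %[t0], %[x0]\n\t"
        "adc %[t1], %[x1]\n\t"
        "adc %[t2], %[x2]\n\t"
        "adc %[t3], %[x3]"
        : [x0] "+m"(x[0]), [x1] "+m"(x[1]),
          [x2] "+m"(x[2]), [x3] "+m"(x[3])
        : [t0] "r"(t0), [t1] "r"(t1), [t2] "r"(t2), [t3] "r"(t3)
        : "cc");
}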