Four, but it starts to get a little tricky.
Let a and b be the numbers to be multiplied, with a0 and a1 being the low and high 32 bits of a, respectively, and b0, b1, b2, b3 being 32-bit groups of b, from low to high respectively.
The desired result is the remainder of (a0 + a1?232) ? (b0 + b1?232 + b2?264 + b3?296) modulo 2128.
We can rewrite that as (a0 + a1?232) ? (b0 + b1?232) +
(a0 + a1?232) ? (b2?264 + b3?296) modulo 2128.
The remainder of the latter term modulo 2128 can be computed as a single 64-bit by 64-bit multiplication (whose result is implicitly multiplied by 264).
Then the former term can be computed with three multiplications using a
carefully implemented Karatsuba step. The simple version would involve a 33-bit by 33-bit to 66-bit product which is not available, but there is a trickier version that avoids it:
z0 = a0 * b0
z2 = a1 * b1
z1 = abs(a0 - a1) * abs(b0 - b1) * sgn(a0 - a1) * sgn(b1 - b0) + z0 + z2
The last line contains only one multiplication; the other two pseudo-multiplications are just conditional negations. Absolute-difference and conditional-negate are annoying to implement in pure C, but it could be done.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…