The GMP library is usually a good reference for good algorithms. Their documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4 digit number by a 2 digit number, and then proceed via long division.
Long division will require computing 2 digit by 1 digit quotients; this can either be done recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.
When dividing a 2n
-bit number by an n
-bit number, the recursive version costs O(M(n) log(n))
, where M(n)
is the cost of multiplying n
-bit numbers.
The version using Barrett reduction will cost O(M(n))
if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.
In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r)
so that
x = qy + r
but without the restriction that 0 <= r < y
. The typical loop is
- Estimate the quotient
q
of x/y
- Compute the corresponding reduction
r = x - qy
- Optionally adjust the quotient so that the reduction
r
is in some desired interval
- If
r
is too big, then repeat with r
in place of x
.
The quotient of x/y
will be the sum of all the q
s produced, and the final value of r
will be the true remainder.
Schoolbook long division, for example, is of this form. e.g. step 3 covers those cases where the digit you guessed was too big or too small, and you adjust it to get the right value.
The divide and conquer approach estimates the quotient of x/y
by computing x'/y'
where x'
and y'
are the leading digits of x
and y
. There is a lot of room for optimization by adjusting their sizes, but IIRC you get best results if x'
is twice as many digits of y'
.
The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is
- Estimate the inverse of
y
with m = floor(2^k / y)
- Estimate
x/y
with q = 2^(i+j-k) floor(floor(x / 2^i) m / 2^j)
In fact, practical implementations can tolerate additional error in m
if it means you can use a faster reciprocal implementation.
The error is a pain to analyze, but if I recall the way to do it, you want to choose i
and j
so that x ~ 2^(i+j)
due to how errors accumulate, and you want to choose x / 2^i ~ m^2
to minimize the overall work.
The ensuing reduction will have r ~ max(x/m, y)
, so that gives a rule of thumb for choosing k
: you want the size of m
to be about the number of bits of quotient you compute per iteration — or equivalently the number of bits you want to remove from x
per iteration.