[Edit1] repaired code
Recently found the results where off even if tests determine all was OK so I dig deeper and found out that I had a silly bug in my equation and due to name conflicts with my pgm environment the tests got false positives so I overlooked it before. Now it work in all cases as it should.
The best thing I can think of (except approximation or large LUT) is binary search without multiplication, here C++ code:
//---------------------------------------------------------------------------
WORD u32_sqrt(DWORD xx) // 16 T
{
DWORD x,m,a0,a1,i;
const DWORD lut[16]=
{
// m*m
0x40000000,
0x10000000,
0x04000000,
0x01000000,
0x00400000,
0x00100000,
0x00040000,
0x00010000,
0x00004000,
0x00001000,
0x00000400,
0x00000100,
0x00000040,
0x00000010,
0x00000004,
0x00000001,
};
for (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++)
{
a1=a0+lut[i]+(x<<(16-i));
if (a1<=xx) { a0=a1; x|=m; }
}
return x;
}
//---------------------------------------------------------------------------
Standard binary search sqrt(xx)
is setting bits of x
from MSB to LSB so that result of x*x <= xx
. Luckily we can avoid the multiplication by simply rewrite the thing as incrementing multiplicant... in each iteration the older x*x
result can be used like this:
x1 = x0+m
x1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)
Where x0
is value of x
from last iteration and x1
is actual value. The m
is weight of actual processed bit. The (2*m)
and (m*m)
are constant and can be used as LUT and bit-shift so no need to multiply. Only addition is needed. Sadly the iteration is bound to sequential computation forbid paralelisation so the result is 16T
at best.
In the code a0
represents last x*x
and a1
represents actual iterated x*x
As you can see the sqrt
is done in 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare)
where the bit shift and LUT can be hardwired.
If you got super fast gates for this in comparison to the rest you can multiply the input clock by 16
and use that as internal timing for SQRT module. Something similar to the old days when there was MC clock as Division of source CPU clock in old Intel CPU/MCUs ... This way you can get 1T
timing (or multiple of it depends on the multiplication ratio).