Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
831 views
in Technique[技术] by (71.8m points)

c - Fast float quantize, scaled by precision?

Since float precision reduces for larger values, in some cases it may be useful to quantize the value based on its size - instead of quantizing by an absolute value.

A naive approach could be to detect the precision and scale it up:

float quantize(float value, float quantize_scale) {
    float factor = (nextafterf(fabsf(value)) - fabsf(value)) * quantize_scale;
    return floorf((value / factor) + 0.5f) * factor;
}

However this seems too heavy.

Instead, it should be possible to mask out bits in the floats mantisa to simulate something like casting to a 16bit float, then back - for eg.

Not being expert in float bit twiddling, I couldn't say if the resulting float would be valid (or need normalizing)


For speed, when exact behavior regarding rounding isn't important, what is a fast way to quantize floats, taking their magnitude into account?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.

If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value Scale in the code below is 2b, then *x0 receives the high s-b bits of x, and *x1 receives the remaining bits, which you may discard (or remove from the code below, so it is never calculated). If b is known at compile time, e.g., the constant 43, you can replace Scale with the appropriate constant, such as 0x1p43. Otherwise, you must produce 2b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even.

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in the same precision as the value being separated. (double for double, float for float, and so on. If the compiler evaluates float expressions with double, this would break. A workaround would be to convert the inputs to the widest floating-point type supported, perform the split in that type [with Scale adjusted correspondingly], and then convert back.)

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;
    *x1 = x - *x0;
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...