A robust algorithm for computing roots of quadratic equations, much better than common code.

A Better Algorithm for Solving Quadratic Equations

Chris Lomont, Aug 2022, www.lomont.org, v0.5

I was recently using some code that did registration of point clouds, which would rarely produce really bad results. The internal algorithm required computing roots of a degree 4 polynomial, which required computing roots of a degree 3 polynomial, which required computing roots of a degree 2 (quadratic) polynomial. The quadratic roots were bad in the cases where the overall algorithm failed. Since having bad quadratic root solvers in code has bitten me several times in my career, I decided to find or make a solid one that I can rely on.

Here are results for error and failure (infinite or complex/real swapped or NaN results) rates:

1,000,000 passes each     small range (2^-32, 2^32)      large range (2^-70, 2^70)      huge range (2^-126, 2^126)
Function                  max ulp   avg ulp    fail %    max ulp   avg ulp    fail %    max ulp   avg ulp    fail %
Naïve                     4.10E+12  1.50E+07   0%        3.10E+23  -1.90E+17  4.10%     2.60E+37  -7.20E+30  34%
Algorithm 2               3.60E+04  0.38       0%        1500      0.34       3.90%     2.00E+22  1.80E+17   25%
Algorithm 3               3.20      0.36       0%        1500      0.34       3.50%     2.00E+22  1.80E+17   13%
Algorithm 4 (this one)    3.20      0.36       0%        3.00      0.33       0%        3.20      0.31       0%

Code for my new algorithm in C++ and C# is at https://github.com/ChrisLomont/BetterQuadraticRoots.

Lots of digging around on GitHub, old numerical archives, and papers didn’t show what I considered a good solution. For example, the commonly referenced [Kahan04] does not cover overflow. So this note shows how I derived a much more robust quadratic root solver.

Common solutions to get better numerical results include adding a multi-precision library, using the “double-double trick,” or simply making sure the numerics are solid.

I had a few requirements:

  1. Numerically robust. If the roots were representable in the floating point format, those roots would be found.
  2. Numerically accurate. Have as little error as reasonably possible.
  3. Fast. Since one often needs to compute large numbers of roots, I wanted good performance.
  4. Small, self-contained code. This way I can include the code in whatever projects I want without needing large, cumbersome libraries. It also makes porting the code to different languages much easier.

These requirements, like all good requirements, are in tension, driving tradeoffs. Since my overall goal was to stop having code perform badly due to quadratic root inaccuracy, that is the part I needed fixed.

After fixing the problem, I decided to write it up as a tutorial on how to do such things, for others to learn from.

To understand the problem, some knowledge of how computers do floating-point calculations is required. Here are facts helpful for understanding and solving this problem.

Floating Point

A real number $r$ can be written in base $\beta$ as $r=\beta^e \sum_{i=0}^{\infty}a_i \beta^{-i}$. If $r=0$ all digits $a_i$ are 0 and the exponent $e$ is not uniquely determined. If $r \neq 0$ this can be done uniquely with leading digit $a_0\neq 0$ (once you identify representations like 0.999… with 1.000… in decimal).

A floating-point number uses this idea, usually in base $\beta=2$, and allows only a finite range of exponents and a fixed number of digits (bits, the digits 0 and 1, in base 2). Most computers use a floating-point standard called IEEE 754 (it has had a few updates) that dictates accuracy for certain operations and the binary representations of values. IEEE 754 standardizes several binary and several decimal formats.

A floating point format is defined by four integers $(\beta,p,e_{min},e_{max})$ [Muller18],

  • A base $\beta\geq 2$, which is 2 or 10 for IEEE 754.
  • A precision $p\geq 2$ which is roughly the number of significant bits or digits, depending on base.
  • An exponent range $e_{min}<e_{max}$. All IEEE formats satisfy $e_{min}<0<e_{max}$ and $e_{min}=1-e_{max}$.

All IEEE formats represent these numbers using fixed bitlength fields, always in the following format:

<---------------- N bits, N=16,32,64,128...------------>          
                                   
+----------+--------------------+----------------------+          
|sign bit s| exponent E, t bits | fraction F, p-1 bits |
+----------+--------------------+----------------------+

N,t,p-1: (16,5,10), (32,8,23), (64,11,52)

1 bit for the sign 's' (MSB): 0 = positive, 1 = negative
t bits for the exponent stored as biased unsigned integer E. bias = 2^(t-1)-1, e = E-bias
p-1 bits for the fraction F; the significand is 1.F (implied leading 1), except for denormalized numbers, infinities, and NaNs

Relevant IEEE 754 binary formats are summarized in this table:

Format     t exp bits, p-1 frac bits   Precision p   (emin, emax)    exp bias   C#, C/C++, Java, Julia
binary16   5, 10                       11            (-14, 15)       15         Half, N/A, N/A, half
binary32   8, 23                       24            (-126, 127)     127        float, float, float, single
binary64   11, 52                      53            (-1022, 1023)   1023       double, double, double, double

Note in each case the length N and the number of fraction bits (equivalently, N and the exponent bit length t) determine the rest of the format; a short decode sketch follows the list.

  • $p=N-t$
  • $bias=2^{t-1}-1$
  • $e_{min}=1-bias$, $e_{max}=2^{t}-2-bias=bias$
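
For example, here is a minimal sketch of pulling the fields out of a binary32 (a sketch assuming C# and .NET 6's BitConverter; shift counts are for N=32, t=8, p-1=23):

// Decode a binary32 into (s, E, F) per the layout above.
// Example: -6.25f = -1.5625 * 2^2, so s=1, E=129, e=2, 1.F=1.5625
uint bits = BitConverter.SingleToUInt32Bits(-6.25f);
int s  = (int)(bits >> 31);          // sign bit
int E  = (int)((bits >> 23) & 0xFF); // biased exponent, bias = 127
uint F = bits & 0x7F_FFFF;           // 23 fraction bits
int e  = E - 127;                    // unbiased exponent
// value = (-1)^s * 2^e * (1 + F/2^23) for normalized numbers
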
Special Numbers

The exponent bit field can store $e_{max}+1$, which is used to store infinity $\infty$ when the fractional bits are $F=0$, or NaN (Not-a-Number) if $F \neq 0$. The sign bit determines $\pm\infty$.

The exponent field can also store $e_{min}-1$ (E=0), which denotes denormalized numbers. These do not have the implied 1 for the fractional part $1.F$. They allow what Kahan calls ‘graceful underflow’, providing a smooth drop off from $2^{e_{min}}$ down to $0$.

Values with exponent $e_{min}\leq e\leq e_{max}$ are called normalized numbers.

Rounding modes

IEEE 754 defines 5 rounding modes that apply when converting a real number to a floating point number or when doing operations on floating point numbers. These are usually set by software. Most commonly used is “round to nearest, ties to even”.

  1. Round down (toward $-\infty$), written here as $RD(x)$
  2. Round up (toward $+\infty$), written $RU(x)$
  3. Round toward zero, written $RZ(x)$
  4. Two round to nearest modes
    1. Round to nearest, ties go to even $RN_{even}(x)$
    2. Round to away, picks the larger in magnitude of two choices, $RN_{away}(x)$.

$RN_{even}(x)$ is the default mode in the standard, and from experience seems to be the most common when running software. However, for precise analysis of algorithms, you must be aware a system can be in any of these modes.

Generically, write $\circ$ to denote any of the rounding functions, so $\circ(a+b)$ means computing $a+b$, then rounding by whichever mode is in use. Also write $RN$ to denote either of the round to nearest functions.

IEEE 754 guarantees monotonicity of rounding: $x\geq y$ implies $\circ(x)\geq \circ(y)$ for any of the rounding modes. This is needed for analyzing many algorithms.

Error measurement

Relative error: $\epsilon(x)=\left|\frac{x-\circ(x)}{x}\right|$, defined as 0 if $x=0$.

Theorem [Muller18, 2.2]: if $x$ is in the normal range, then the relative error $\epsilon(x)$ satisfies

$$\epsilon(x) < \left\{ \begin{array}{ll} \frac{1}{2} \beta^{1-p} & \circ\in\{RN_{even},RN_{away}\} \\ \beta^{1-p} & \circ\in\{RD,RU,RZ\} \end{array}\right.$$

These strict inequalities allow precise analysis of algorithms. The right-hand value occurs so often it is called the unit roundoff $u$. It is roughly the boundary below which adding a value to 1 has no effect in the floating point format (it is the rounding in the last digit). Many languages have this constant available in their standard library.
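
As a quick illustration (a sketch in C#, assuming round-to-nearest; for binary32, $u=2^{-24}$):

// the unit roundoff boundary for binary32 (p = 24) under round-to-nearest;
// note C#'s float.Epsilon is the smallest denormal, NOT the unit roundoff
float u = MathF.Pow(2, -24);
Console.WriteLine(1.0f + u == 1.0f);     // True: the exact halfway tie rounds back to 1
Console.WriteLine(1.0f + 2 * u == 1.0f); // False: 2u = 2^-23 survives the rounding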

The function $ulp(x)$, for units in last place, gives the value of a change in the last bit of the representation of $x$. This definition is slightly sloppy, but good enough for this work; see [Muller18] for several definitions and their issues. The rough idea is that $ulp(x)$ is a useful measurement of error in calculations, telling how far a computed number is from the true number.
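
For the testing later I need a concrete version; here is a rough sketch for binary32 (my helper, ignoring edge cases like NaN, infinities, and MaxValue):

// approximate ulp(x): the gap from |x| to the next representable float
static float Ulp(float x)
{
    int bits = BitConverter.SingleToInt32Bits(MathF.Abs(x));
    float next = BitConverter.Int32BitsToSingle(bits + 1);
    return next - MathF.Abs(x);
}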

IEEE Guarantees

Any operation $\odot\in \{+,-,\times,\div\}$ is accurate up to a certain error:

$$\begin{aligned} \circ(a \odot b)&=(a \odot b)(1+\epsilon_1), & |\epsilon_1|\leq u \\ &=(a \odot b)/(1+\epsilon_2), & |\epsilon_2|\leq u \end{aligned}$$

$\sqrt{a}$ is similarly accurate:

$$\begin{aligned} \circ(\sqrt{a})&=(\sqrt{a})(1+\epsilon_1), & |\epsilon_1|\leq u \\ &=(\sqrt{a})/(1+\epsilon_2), & |\epsilon_2|\leq u \end{aligned}$$

For each size, there is a largest value representable (both positive and negative); anything larger is an infinity $\pm \infty$. This is called overflow.

There is also a smallest representable positive number; anything smaller is treated as zero. Rolling off the small end is called underflow. (Some places also call rolling off the most-negative end to $-\infty$ underflow; we’ll call that case an overflow.)

Note the very useful fact that multiplying or dividing by powers of the base, as long as no overflow or underflow happens, introduces no error.

Another common problem is catastrophic cancellation: when two numbers of similar magnitude are subtracted (or equivalently, a number is added to a negative number close in magnitude), the leading digits cancel, leaving far less precision in the answer. This is to be avoided when possible.
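
A tiny demonstration (a sketch in C#; the exact printed digits may vary):

// catastrophic cancellation: both inputs carry ~7 good decimal digits,
// but the sum keeps only the digits where they disagree
float b = 400.0f;
float r = MathF.Sqrt(b * b - 4.0f); // sqrt(159996) ~= 399.99500
Console.WriteLine(-b + r);          // ~ -0.0050049; true value is -0.0050000..., a huge relative error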

The algorithms below rely (somewhat, they can be carefully rederived for other formats, which are rare) on having IEEE 754 support, which is extremely common. For more info on floating-point, see my notes at https://lomont.org/posts/2022/lomont-floating-point-notes/

The Quadratic Equation

The quadratic formula from high school algebra gives the values of $x$ that solve the quadratic equation $ax^2+bx+c=0$. The formula is

$$x=\frac{-b \pm \sqrt{b^2-4ac}}{2a}$$

The expression $D=b^2-4ac$ is called the discriminant; if $D\geq 0$ then the equation has two real roots; if $D<0$ then the equation has two complex roots, necessarily conjugates of each other (that means the roots are of the form $r_1+i r_2$ and $r_1-ir_2$ for two real numbers $r_1$ and $r_2$). Correctly determining the sign of $D$ is very important. In the case $D<0$, you use $\sqrt{-D}=\sqrt{|D|}$ to get the value as a floating point number. So once you have the sign of $D$, you can simply compute $\sqrt{|D|}$ in all cases.

A few more items: given real roots $r_1,r_2$, the equation is $ax^2+bx+c=a(x-r_1)(x-r_2)=ax^2 + (-a)(r_1+r_2)x+a r_1 r_2$. Setting coefficients equal gives $a,b,c$ in terms of the roots.

$$\begin{aligned}a&=a \\ b&=-a(r_1+r_2) \\ c&=a r_1 r_2\end{aligned}$$

This last item is useful for computing $r_2$ given $r_1$, as $r_2=c/(a\times r_1)$.

The complex roots when $D<0$ are given by $\frac{-b\pm i \times \sqrt{-D}}{2a} = -\frac{b}{2a} \pm i\times \frac{\sqrt{|D|}}{2a}$.

Improvement plan

To make a better algorithm, I first list places errors can happen, estimate what error rate a good algorithm should provide, decide on a testing method, and list some directions I avoided and why.

Error locations

The quadratic formula as written above can fail in the following ways when implemented using floating-point numbers:

  1. $b^2$ can overflow or underflow
  2. $ac$ can overflow or underflow
  3. $4(ac)$ can overflow
  4. $b^2-4ac$ can overflow, underflow, or suffer from catastrophic cancellation
  5. The sign of the discriminant $D=b^2-4ac$ can be computed incorrectly, confusing real and complex roots
  6. $-b\pm\sqrt{D}$ can suffer from underflow, overflow, or catastrophic cancellation.
  7. $2a$ can overflow
  8. the final division of $-b\pm\sqrt{D}$ by $2a$ can overflow or underflow.

So there are a lot of places this can go wrong.

Expected error

A rough estimate on error bounds for an algorithm is the $ulp(x)$ error in the answer: $ulp<1$ means at most the last bit of the value is incorrect; $ulp=5$ means the error is 5 times the value of the last bit.

  1. $b^2$ adds 1
  2. $ac$ adds 1
  3. $4(ac)$ is free since powers of two have no error
  4. $b^2-4ac$ adds 1
  5. $\sqrt{b^2-4ac}$ adds 1
  6. $\pm$ is free, sign changes have no error
  7. $-b\pm\sqrt{|D|}$ adds 1
  8. $2a$ is free
  9. $\frac{-b\pm\sqrt{|D|}}{2a}$ adds 1

This totals 6 ulps as the expected worst-case error for a good algorithm.

Testing method

To make a better algorithm, I needed a way to measure improvement. To do this, I compute how far an output is from the true result, as a multiple of an error in the last bit.

Getting ground truth can be difficult, often requiring high-precision libraries, but for this project a much simpler way exists: computing the answer using an OK algorithm in binary64 avoids all the overflow and underflow that binary32 hits, so I will develop the algorithm using binary32, then later port the idea to whatever IEEE types I want.

I’ll also use C# instead of C/C++, since it’s vastly quicker to develop test harnesses and make lots of other pieces needed to evaluate the algorithm. Once everything works, I can run the C/C++ versions against the C# versions and C# harness for testing. Plus C# supports binary16 as the Half type, so that can be tested in the same infrastructure.
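
Concretely, the error metric is a sketch like this (using the Ulp helper sketched earlier, and treating the binary64 result as truth):

// error of a binary32 result, measured in float ulps near the binary64 "truth"
static double UlpError(float computed, double truth)
{
    float t = (float)truth; // truth rounded into binary32
    return Math.Abs(computed - truth) / Ulp(t);
}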

The most important piece for testing is generating lots of test cases. I made a binary32 generator with the following properties

  1. Can generate on any range of exponents: this allows testing naive algorithms while avoiding overflows.
  2. Can generate edge cases: for each exponent, it can generate bitwise edge cases in representation
  3. Can add zeros, subnormals, infinities, and NaNs as desired to test robustness and error handling.
  4. Can add arbitrary counts of random floats, where random is uniform on exponent, then uniform on fraction (a minimal sketch follows this list).
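
Here is a minimal sketch of the random part of item 4 (my helper; assumes .NET 6's BitConverter.UInt32BitsToSingle and normal exponents in [-126,127]):

// random binary32: uniform on exponent, then uniform on fraction, random sign
static float RandomFloat(Random rng, int minExp, int maxExp)
{
    int e = rng.Next(minExp, maxExp + 1);       // uniform exponent
    uint frac = (uint)rng.Next(1 << 23);        // uniform 23-bit fraction
    uint bits = ((uint)(e + 127) << 23) | frac; // assemble a normal binary32
    float f = BitConverter.UInt32BitsToSingle(bits);
    return rng.Next(2) == 0 ? f : -f;
}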

Things I avoided

I avoided multi-precision libraries since they’re slow. Iteration and quad-double and double-double based solutions are also slow, and still suffer from overflow issues.

Some derivations divide by $a$ first to simplify code and analysis, but this adds an extra error case (for overflow or underflow), so I avoid that route.

Another nice idea, which can be made to work here, is to replace $b$ with $b/2$ as soon as it comes in, effectively solving the equation $ax^2+2bx+c=0$, which has the nicer (and fewer-operation) solution $x=\frac{-b\pm\sqrt{b^2-ac}}{a}$. This works in base 2 (or a multiple of 2, like 10) since $b/2$ loses no digits, except when $b$ is denormalized. In the denormalized case, you can use the replacement $(a,c)\leftarrow(2a,2c)$, but now you also need to check that doesn’t make $a$ or $c$ overflow. All this testing, which still ultimately fails in rare cases, I found not to be worth using in the final algorithm.

On to some algorithms and testing:

Algorithm 1: Naïve implementation

For the interface to each algorithm, I chose taking three floating point values a,b,c and returning the triple (float r1, float r2, rootType), where the root type denotes real or complex roots (and other issues, explained later). The roots are either the real numbers r1 and r2 or the complex numbers r1 + i*r2 and r1 - i*r2.

To start, and illustrate how bad the naïve high school formula performs, here is that implementation as baseline.

// Naive root algorithm
float d = b * b - 4 * a * c;
if (d >= 0)
{
    float r = Sqrt(d);
    float r1 = (-b + r) / (2 * a);
    float r2 = (-b - r) / (2 * a);
    return (r1, r2, RootType.SuccessReal);
}
else
{
    float r1 = -b / (2 * a);
    float r2 = Sqrt(-d) / (2 * a);
    return (r1, r2, RootType.SuccessComplex);
}

Some hand picked testing values to illustrate error…

1x^2 + 10.5x + 5 gives roots -0.5,-10 type SuccessReal
u:0.00 0.00

1x^2 + 11x + 5 gives roots -0.47506237,-10.524938 type SuccessReal
u:6.07 3.13

1x^2 + 400x + 1 gives roots -0.0025024414,-399.9975 type SuccessReal
u:10418.65 5209.36

The error values are given in ulps as u:v1 v2 where v1 is the max root error and v2 is the average error.

The first example was picked to have nice roots -0.5 and -10, and it found them with zero error in the last place units (the u:0.00 0.00 parts).

Changing the 10.5 to 11 increases the error to u:6.07 3.13 which is already pretty bad, especially for such a small change in coefficients.

The third example has terrible error of 10418 ulps.

Walking the code for the last case shows the first problem: catastrophic cancellation in the term $-b+\sqrt{D}$. $b$ is 400.000 and $r=\sqrt{D}$ is 399.995. Subtracting them loses a lot of significant digits via cancellation. We want to avoid this problem, so we must avoid subtracting nearly equal numbers.

Fix 1

The first fix is the one almost all code on the internet performs: rewrite the naïve formula to avoid this cancellation. This can be done by multiplying the positive root expression by its conjugate

$$r_1=\frac{-b + \sqrt{D}}{2a}=\frac{-b + \sqrt{D}}{2a}\times\frac{-b -\sqrt{D}}{-b -\sqrt{D}}=\frac{b^2 - D}{2a(-b-\sqrt{D})}$$

This changes the addition $-b+\sqrt{D}$ into a subtraction $-b-\sqrt{D}$ for this case, a common trick for removing a catastrophic cancellation. It’s not always possible to rewrite an equation to switch signs, though.

In practice, I use a different rewrite to clean this up. The problem arises from combining $-b$ and $\pm\sqrt{D}$, so we can choose the sign of the root that avoids cancellation, compute that root (call it $r_1$), and use a different expression for the other root: $r_2 = \frac{c}{a\, r_1}$. The first root can be computed as $r_1=\frac{-b-\mathrm{sgn}(b)\sqrt{D}}{2a}$, where $\mathrm{sgn}$ is the sign function: -1 for $b<0$, 0 for $b=0$, and +1 for $b>0$. This way the combined numbers have the same sign.

Algorithm 2: Remove a cancellation

This just takes a little change

// Naive + remove one cancellation
float d = b * b - 4 * a * c;
if (d >= 0)
{
    float r1 = (-b - Sign(b) * Sqrt(d)) / (2 * a);
    float r2 = c / (r1 * a);
    return (r1, r2, RootType.SuccessReal);
}
else
{
    float r1 = -b / (2 * a);
    float r2 = Sqrt(-d) / (2 * a);
    return (r1, r2, RootType.SuccessComplex);
}

Running the examples above:

1x^2 + 10.5x + 5 gives roots -10,-0.5 type SuccessReal
u:0.00 0.00

1x^2 + 11x + 5 gives roots -10.524938,-0.4750622 type SuccessReal
u:0.19 0.13

1x^2 + 400x + 1 gives roots -399.9975,-0.0025000155 type SuccessReal
u:0.35 0.21

So this change reduced error for these values by a lot, to very respectable numbers.

Poking around some more, I find

25x^2 + 100x + 99.99999 gives roots -2.000625,-1.9993752 type SuccessReal
u:606.90 455.43

Lots of error again… Walking through the code, I find catastrophic cancellation in the discriminant $D=b^2-4ac$. Here there are many, many ways to proceed.

Interlude

Special Cases

At this point, to make all the solutions more robust, we should handle cases where any coefficient is 0, infinity, or NaN, so we avoid dividing by 0 when possible and don’t return invalid answers. So we make a function, called at the start of each root algorithm, that handles these special cases. This lets the rest of this article focus on the case $a,b,c \neq 0$.

// Handle cases where a,b,or c is 0, or inputs include Infinity or NaN
(bool isHandled, float r1, float r2, RootType type) HandleSpecialCasesFloat(float a, float b, float c)....

We also extend the root types to handle these cases, making the overall algorithms robust against common errors and rare cases.

// most common cases, last two bits = 01
SuccessReal       = 0b_0_01, // got 2 real roots in r1 and r2
SuccessComplex    = 0b_1_01, // got 2 complex roots as r1 +/- i*r2

// bad input, last 2 bits = 00
InputHasNaN       = 0b_0_00, // r1=r2=NaN
InputHasInfinity  = 0b_1_00, // r1=r2=NaN

// rare cases, last two bits = 10
OneRealRoot       = 0b_0_0_10, // coeff 'a' was 0, one root r1 = -c/b, which may be infinite, r2 = NaN
AllRealNumbers    = 0b_0_1_10, // a=b=c=0, all real numbers are roots, r1=r2=NaN

Another change we make here is to use library-provided special functions.

Special functions

Often libraries provide special functions, such as C++ expm1 which computes $e^x-1$. These special functions are generally much more accurate than simply computing pow(e,x)-1 for some values, so we will leverage needed platform specific functions that improve accuracy. Here is a list we use:

// Needed functions
Sqrt(x) returns the square root of x
Sign(x) returns -1 if x < 0, 0 if x == 0, +1 if x > 0
CopySign(x,y) returns |x|*sign(y)
Scale2(x,t) returns x*2^t
Abs(x) returns |x|
Fma(x,y,z) returns x*y+z with a single rounding, provided by most languages. If not, see article text

Fix 2 - A better discriminant

Computing the discriminant of the quadratic correctly is a huge topic ([Kahan04] for example). Many places factor it as $(|b|+2\sqrt{|ac|})(|b|-2\sqrt{|ac|})$ and proceed, but this still has a cancellation issue, and adds lots more operations in the main path, increasing overall error.

The best fixes here are to use more precision, which is slower, but only when needed.

Quad-double and Double-double

One way to add precision is to represent each floating-point value $v$ as a sum of several values $v=v_1+v_2+v_3+v_4$, where the $v_i$ are chosen to give more bits of precision. This is done by giving them different exponents, covering a wider range. See [Knuth14, Priest91, Bailey08] for details.

These do not completely solve all the problems, since they do not allow larger or smaller exponents and thus do not prevent overflow or underflow. And they are much slower than native floating point math. But for some problems, especially if you need more precision, these are excellent techniques.
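
For flavor, here is the standard addition building block, Knuth's TwoSum (a sketch; it is not part of the final algorithm here): under round-to-nearest it returns the rounded sum together with its exact rounding error.

// s + err == a + b exactly: s is the rounded sum, err is what was rounded away
static (float sum, float err) TwoSum(float a, float b)
{
    float s  = a + b;
    float ap = s - b;                 // the portion of s attributed to a
    float bp = s - ap;                // the portion of s attributed to b
    return (s, (a - ap) + (b - bp));  // exact error of the addition
}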

Back to Fix 2

Another process useful in many algorithms is to compute the value, then compute the error, and track both, adding the accumulated error back in at strategic places. This is the fix here.

IEEE 754 recommends a FusedMultiplyAdd(x,y,z) function (called an FMA) that computes $\circ(x\times y+z)$ in floating point with one rounding at the end, instead of doing it in steps as $\circ(\circ(x\times y)+z)$, which has two roundings. The added precision of an FMA operation improves a huge array of floating point algorithms, so it’s standard in many languages.

If you don’t have an FMA, you can make one via Dekker’s algorithm, requiring 17 floating point operations, as covered in [Muller18, Section 4.4.2].
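
For reference, here is a sketch of the underlying error-free product for binary32 (Veltkamp splitting plus Dekker's product; the constant 4097 = 2^12+1 assumes p = 24, and overflow in the split is ignored):

// split x exactly into high ~12 bits and low bits: x == hi + lo
static (float hi, float lo) Split(float x)
{
    float c = 4097.0f * x; // 2^ceil(24/2) + 1
    float hi = c - (c - x);
    return (hi, x - hi);
}

// x*y == p + e exactly (barring overflow), 17 flops total, no FMA needed
static (float p, float e) TwoProduct(float x, float y)
{
    float p = x * y;
    var (xh, xl) = Split(x);
    var (yh, yl) = Split(y);
    float e = ((xh * yh - p) + xh * yl + xl * yh) + xl * yl;
    return (p, e);
}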

As an interesting aside, using FMAs indiscriminately can also lead to errors - [Higham02, section 2.6] quotes Kahan: “[FMAs] should not be used indiscriminately”, and gives an example where the discriminant can incorrectly switch signs when computed naively with an FMA.

For our use, [Kahan04, Boldo09, Jannerod13] detail a very useful function for computing the determinant Det2x2(w,x,y,z) of a 2x2 matrix as $wx-yz$ which can be used to compute a higher accuracy discriminant. Here is a listing

// compute a*b - c*d with extra precision
// Fma is a fused multiply add
float Det2X2(float a, float b, float c, float d)
{
    var v1 = c * d;          // rounded c*d
    var v2 = Fma(-c, d, v1); // v1 - c*d, exact: the rounding error in v1
    var v3 = Fma(a, b, -v1); // a*b - v1, with a single rounding
    return v3 + v2;          // (a*b - v1) + (v1 - c*d) = a*b - c*d, up to the final rounding
}

It works using a very common and useful idea: compute a value, determine its error, do some work, and add the error back at a strategic place. For lots of tricks like this and more examples, read anything by Kahan.

See the comments for how it works. It’s worth working through this several times until you internalize it, so you can use the ideas on your own.
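
As a quick check (assumed values, using the problem case found earlier):

// 25x^2 + 100x + 99.99999: the naive discriminant cancels badly
float a = 25f, b = 100f, c = 99.99999f;
Console.WriteLine(b * b - 4 * a * c);      // naive: ~0.00098, mostly rounding noise
Console.WriteLine(Det2X2(b, b, 4 * a, c)); // compensated: ~0.00076, near the true value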

So we’ll simply replace the simplistic $d = b\times b - 4\times a\times c$ computation with a call to Det2X2(b, b, 4*a, c), replacing 4*a with the possibly better Scale2(a,2).

Algorithm 3: Better discriminant

Here is the next level of fixes: special case handling and a better discriminant.

var (isHandled, r1, r2, type) = HandleSpecialCasesFloat(a, b, c);
if (isHandled)
    return (r1, r2, type);

var d = Det2X2(b, b, Scale2(a, 2), c);

if (d >= 0)
{
    r1 = (-b - CopySign(Sqrt(d),b)) / Scale2(a, 1);
    r2 = c / (r1 * a);
    return (r1, r2, RootType.SuccessReal);
}
else
{
    r1 = -b / Scale2(a, 1);
    r2 = Sqrt(-d) / Scale2(a, 1);
    return (r1, r2, RootType.SuccessComplex);
}

Now we get

25x^2 + 100x + 99.99999 gives roots -2.0005524,-1.9994476 type SuccessReal
1(100%) : 100% (0) 100% (0) u:0.10 0.07

Nice!

At this point, we can show how these algorithms (adding the special case handling back into the earlier ones) behave over several sets of floats (from the table at the start).

1,000,000 passes each     small range (2^-32, 2^32)      large range (2^-70, 2^70)      huge range (2^-126, 2^126)
Function                  max ulp   avg ulp    fail %    max ulp   avg ulp    fail %    max ulp   avg ulp    fail %
Naïve                     4.10E+12  1.50E+07   0%        3.10E+23  -1.90E+17  4.10%     2.60E+37  -7.20E+30  34%
Algorithm 2               3.60E+04  0.38       0%        1500      0.34       3.90%     2.00E+22  1.80E+17   25%
Algorithm 3               3.20      0.36       0%        1500      0.34       3.50%     2.00E+22  1.80E+17   13%
Algorithm 4 (this one)    3.20      0.36       0%        3.00      0.33       0%        3.20      0.31       0%

So for small (no overflow) ranges, Algorithm 3 is good, with no failures over 1,000,000 runs and good error bounds. However, running the large set of floats, which contains exponents up to $2^{70}$ where overflows can happen, shows larger errors and some failing cases (real/complex roots confused, or infinities returned when they shouldn’t be).

Again, tracking down some examples that fail, and inspecting

6.096731E+18x^2 + 1.3318949E+20x + 1.3318949E+20 gives roots -10.923025,NaN type SuccessComplex
Double gives roots -20.795533772103447,-1.050516445351387 type SuccessReal

So even though the roots computed with binary64 are well within bounds of the algorithm, it fails to find them. It also thinks the roots are complex numbers, when they should be real. Clearly the discriminant computation failed.

Now we must face overflow (and underflow).

Fix 3 - Scaling and bounds

Picking failed cases and walking through the code, they are all cases of overflow, mostly in the computation of the discriminant.

Note that you don’t really need to compute the discriminant: what you really need are two things:

  1. The sign of the discriminant, more precisely, whether it is nonnegative.
  2. The root $\sqrt{|D|}$

Scaling

To deal with overflow and underflow, calculations resulting in exponents outside some safe range must be avoided. This is generally handled via scaling: scale the values in use into a safe range, do the computation, then scale the results back correctly. Scaling by a power of the base introduces no error, which makes it extremely useful.

The idea is to treat a floating point value $b$ as a tuple $(s,e,b_f)$ where $s$ is a sign, $e$ is a power of the base $\beta=2$, and $b_f$ is $b$ scaled by powers of the base to keep the exponent of $b_f$ in a safe range. By removing the exponent from the floating point number into an integer, you get far more exponent range.

As a simple example, suppose no exponent can go over 7 in the floating point hardware, and for $b=2^4$ you want to compute $\sqrt{b^2-192}$. The answer $8=2^3$ fits in this hardware, but $b^2$ overflows. So (dropping the sign part of the tuple), $192=2^6 \times 3 = (6,3)$, and $b=2^4\times 1= (4,1)$. Then $b^2$ would be $(4+4,1^2)=(8,1)$ where the squaring can be done in the floating point hardware, and the exponents are tracked. 192 cannot be subtracted from this, since they have different exponents, so one of these is scaled to give

$$b^2-192=(8,1)-(6,3)=(6,2^2\times 1)-(6,3)=(6,4)-(6,3)=(6,4-3)=(6,1)$$

Then taking a square root gives (if the exponent were odd, a power would first be moved back into the fraction):

$\sqrt{(6,1)}=(6/2, \sqrt{1})=(3,1)=2^3\times 1=8$, the correct answer. At no point did the hardware floating point unit overflow.

And that’s the idea. All we need is a routine to extract from a floating point value $b$ the tuple $(s,e,b_f)$. Converting the tuple back is easy: $b=s\times2^e\times b_f$, or, in code, sign*Scale2(bf,e) with our Scale2 function.
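
For illustration, a minimal sketch of such an extraction using .NET's MathF.ILogB/ScaleB (the full listing later does it via bit manipulation, which ports more easily):

// decompose nonzero finite v into (sign, exp, frac) with 1 <= frac < 2
static (int sign, int exp, float frac) Normalize(float v)
{
    int s = v < 0 ? -1 : 1;
    int e = MathF.ILogB(v);                   // floor(log2 |v|), subnormals included
    float f = MathF.ScaleB(MathF.Abs(v), -e); // exact: scales by a power of 2
    return (s, e, f);
}
// round trip: v == sign * MathF.ScaleB(frac, exp)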

Details

The details are tricky. For the first part, let’s analyze the discriminant, $D=b^2-4ac$.

Discriminant

From special case handling, $a,b,c \neq 0$. Also we can assume $b>0$ since it only appears squared.

First, normalize all three of these, which makes $b_f^2$ neither overflow nor underflow, and the same for $a_f c_f$. Note that $a_f, b_f, c_f$ are in the half-open interval $[1,2)$.

Case 1: b^2 » |4ac|

Now, if $b^2$ is much larger than $|4ac|$, then $D\geq 0$ and the root $\sqrt{D}$, as a floating point value, is $|b|$, so the root can be returned as $(1,b_e,b_f)$.

The sufficient condition is $2b_e > a_e+c_e+p+5$, in which case we don’t need to compute roots: merely use

(root, rootE, nonnegative) = (bF, bE, true).

The proofs section has a detailed proof.

Case 2: b^2 « |4ac|

Similarly, if $|ac|$ is large compared to $b^2$, then the root is approximated well enough by $2\sqrt{|ac|}$, and $D$ is nonnegative exactly when $ac<0$. Similar analysis derives the condition for this to hold: $2b_e<a_e+c_e-p-1$.

To compute the root robustly (a code sketch follows this list):

  1. $P=a_f c_f$
  2. $e=a_e+c_e$
  3. If $e$ is odd then $(e,P) \leftarrow (e-1,2P)$
  4. The scale and root are $(e/2+1,\sqrt{P})$, where the $+1$ accounts for the 4 in $4ac$

This handles this case.
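
As a sketch in code (variable names follow the full listing at the end; aS, aE, aF, and so on come from the Normalize routine):

// root of |4ac| when it dominates b^2; the +1 in scale handles the 4
var P = aF * cF;                   // fraction product, in [1,4): no over/underflow
var e = aE + cE;
if ((e & 1) != 0) { e--; P *= 2; } // make the exponent even before the root
var root = MathF.Sqrt(P);
var scale = e / 2 + 1;             // the returned root value is 2^scale * root
var nonnegative = aS * cS < 0;     // here D >= 0 exactly when ac < 0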

Case 3: b^2 ~ |4ac|

The last discriminant case is all the others. Here we’ll leverage the Det2X2 trick for a more accurate discriminant computation. To avoid overflow and underflow as much as possible, the goal is to find an exponent scaling to apply to each of $a,b,c$ such that the result neither overflows nor underflows.

This would be much harder, except we already bounded the exponents by the above. We have

$$-p-1 \leq 2b_e-(a_e+c_e)\leq p+5$$

Suppose $b_e$ is large, so computing $b^2$ will overflow. Then we want to shift it lower. But if one of $a$ or $c$ is too small, this may scale it to underflow (zero). The condition to be in this case, though, ensures that $a_e+c_e$ is near $2b_e$, which means that for one of $a,c$ to be small, the other has to be big.

Since we only really care about the product $ac$, the trick is to move some powers of $2$ between $a$ and $c$ so their exponents are as close as possible, then to shift the new $a,c$ and $b$ by the midpoint of $2b_e$ and $a_e+c_e$. The result of all this is to move $a,b,c$ as close to 1.0 as possible while still retaining the information needed for the root. And, while we’re scaling things, we can fold the 4 in $4ac$ into the scaling (the code adds 2 to $a$’s exponent). This leads to

// scale a and c exponents to center of a and c, leaves a*c fixed in value, 
// makes robust against underflow/overflow 
// in effect, we will scale (a,c) = (a*2^-dc, c*2^dc), brings them to near same in size
var deltaE = (aE - cE) / 2;

var mid = (bE + bE + (aE + cE)) / 4;
aF = Scale2(a, -mid + 2 - deltaE); // add 2 to handle 4 in 4ac
bF = Scale2(b, -mid);
cF = Scale2(c, -mid + deltaE);

Debug.Assert(float.IsFinite(aF) && float.IsFinite(bF) && float.IsFinite(cF));

var d = Det2X2(bF, bF, aF, cF);

root = Sqrt(Abs(d));
nonnegative = d >= 0;
scale = mid;

And now we have a robust discriminant info routine that returns (float root, int rootExponent, bool nonnegative).

Note: this last case still has a flaw which I don’t correct here: the discriminant can still get the wrong sign, but this is really rare. See the proof section for a good start on when it can occur. The fix for that rare case would likely be to drop to double-double computation.

Roots

This next part turns this info into roots. There is room to make this slightly more robust, explained at the end. But in practice it is fast, and I hit no mistakes for exponents under $2^{126}$; the max float exponent $2^{127}$ hits some errors. If needed, I will fix those cases later too. It does catch one case where $-b \pm root$ may overflow, and divides first.

root = Scale2(root, rootExponent); 
if (nonnegative)
{
	if (Abs(b) < MaxValue / 2)
		r1 = (-b - CopySign(root, b)) / Scale2(a, 1);
	else
		r1 = -b / Scale2(a, 1) - CopySign(root, b) / Scale2(a, 1);
	r2 = c / (r1 * a);
	return (r1, r2, RootType.SuccessReal);
}
else
{
	r1 = -b / Scale2(a, 1);
	r2 = root / Scale2(a, 1);
	return (r1, r2, RootType.SuccessComplex);
}

Algorithm 4: Scaling

See the most up to date code at https://github.com/ChrisLomont/BetterQuadraticRoots, which has float and double versions for C/C++ and C#.

A complete C# listing for float (perhaps out of date) is also at the end of this article. The routine is only around 300 lines of C# or C++ code.

Final Results

For completeness, I repeat the table again.

1,000,000 passes each     small range (2^-32, 2^32)      large range (2^-70, 2^70)      huge range (2^-126, 2^126)
Function                  max ulp   avg ulp    fail %    max ulp   avg ulp    fail %    max ulp   avg ulp    fail %
Naïve                     4.10E+12  1.50E+07   0%        3.10E+23  -1.90E+17  4.10%     2.60E+37  -7.20E+30  34%
Algorithm 2               3.60E+04  0.38       0%        1500      0.34       3.90%     2.00E+22  1.80E+17   25%
Algorithm 3               3.20      0.36       0%        1500      0.34       3.50%     2.00E+22  1.80E+17   13%
Algorithm 4 (this one)    3.20      0.36       0%        3.00      0.33       0%        3.20      0.31       0%

The final algorithm had no failures over a million runs of a huge range of inputs, and had really good average and max error.

Benchmarks

Using BenchmarkDotNet shows the new algorithm is around 4x slower than the naïve one, 18 ns versus 71 ns, so the new algorithm is plenty fast for all but the most high-performance needs.

Wrap up

The most accurate solution with the best performance would be to apply scaling to handle overflow and underflow, use double-doubles as needed, and handle every edge case that still fails (there are some, but only with extreme values) by rewriting the appropriate expressions for each case, computing as much as possible.

See the most recent code at https://github.com/ChrisLomont/BetterQuadraticRoots.

Future work

Here are some notes to track things I should fix or change given another pass someday:

  1. Fix up the last possible error cases
  2. Make solid proofs of all performance and claims
  3. More extensive testing, especially the binary16 and binary64 cases.
  4. Make all code generic, use one code path under different types, for both C/C++ and C#
  5. Optimize, expecting 2x performance improvements via clipping slower paths
  6. More extensive benchmarking
  7. Implement the homogeneous roots and the stable-for-animation behavior of [Blinn05, Blinn06]
  8. Add some comments from [Forsythe67], make sure I hit his criteria
  9. List where each error source in quad list is addressed
  10. Add iteration methods to the testing for speed and accuracy
  11. Test against common other libs and platforms

Code Listings

Here is a complete listing for C#

// 
namespace Lomont.Numerical
{
    public static partial class QuadraticEquation
    {

        [Flags]
        public enum RootType
        {
            // most common cases, last two bits = 01
            SuccessReal       = 0b_0_01, // got 2 real roots in r1 and r2
            SuccessComplex    = 0b_1_01, // got 2 complex roots as r1 +/- i*r2

            // bad input, last 2 bits = 00
            InputHasNaN       = 0b_0_00, // r1=r2=NaN
            InputHasInfinity  = 0b_1_00, // r1=r2=NaN

            // rare cases, last two bits 10
            OneRealRoot       = 0b_0_0_10, // coeff 'a' was 0, one root r1 = -c/b, which may be infinite, r2 = NaN
            AllRealNumbers    = 0b_0_1_10, // a=b=c=0, all real numbers are roots, r1=r2=NaN
        }
    }
}

//
//    MIT License
//    
//    Copyright (c) 2022 Chris Lomont
//    
//    Permission is hereby granted, free of charge, to any person obtaining a copy
//    of this software and associated documentation files (the "Software"), to deal
//    in the Software without restriction, including without limitation the rights
//    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
//    copies of the Software, and to permit persons to whom the Software is
//    furnished to do so, subject to the following conditions:
//    
//    The above copyright notice and this permission notice shall be included in all
//    copies or substantial portions of the Software.
//    
//    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
//    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
//    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
//    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
//    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
//    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
//    SOFTWARE.
//

using System.Diagnostics;

namespace Lomont.Numerical
{
    public static partial class QuadraticEquation
    {

        /// <summary>
        /// Compute roots using float32 for the quadratic equation ax^2 + bx + c = 0
        /// Returns (r1, r2, rootType) where root type is 
        ///   SuccessReal      : two real roots r1,r2
        ///   SuccessComplex   : two complex valued roots r1 +/- i*r2
        ///   InputHasNaN      : input has invalid values
        ///   InputHasInfinity : input has invalid values
        ///   OneRealRoot      : a was 0, so real root in r1, r2 = NaN
        ///   AllRealNumbers   : a=b=c=0, all numbers valid roots, r1=r2=NaN
        ///   
        /// Derivation of algorithms Chris Lomont, 2022, https://lomont.org/posts/2022/a-better-quadratic-formula-algorithm/
        /// </summary>
        /// <param name="a"></param>
        /// <param name="b"></param>
        /// <param name="c"></param>
        /// <returns></returns>
        public static (float r1, float r2, RootType type) FloatRoots(float a, float b, float c)
        {
            var (isHandled, r1, r2, type) = InternalF.HandleSpecialCasesFloat(a, b, c);
            if (isHandled)
                return (r1, r2, type);

            // so now can assume a,b,c nonzero
            var (root, nonnegative, rootE) = InternalF.DiscriminantInfo(a, b, c);

            // todo - can make this all slightly more accurate, see https://lomont.org/posts/2022/a-better-quadratic-formula-algorithm/
            root = InternalF.Scale2(root, rootE);

            if (nonnegative)
            {

                if (InternalF.Abs(b) < InternalF.MaxValue / 2)
                    r1 = (-b - InternalF.CopySign(root, b)) / InternalF.Scale2(a, 1);
                else
                    r1 = -b / InternalF.Scale2(a, 1) - InternalF.CopySign(root, b) / InternalF.Scale2(a, 1);
                r2 = c / (r1 * a);
                return (r1, r2, RootType.SuccessReal);
            }
            else
            {
                r1 = -b / InternalF.Scale2(a, 1);
                r2 = root / InternalF.Scale2(a, 1);
                return (r1, r2, RootType.SuccessComplex);
            }
        }

        static class InternalF
        {
            #region Per platform funcs

            internal static float Sqrt(float v) => float.Sqrt(v);
            internal static int Sign(float v) => float.Sign(v);
            internal static float CopySign(float x, float y) => float.CopySign(x, y);
            internal static float Abs(float v) => float.Abs(v);
            internal static float Fma(float x, float y, float z) => float.FusedMultiplyAdd(x, y, z);
            internal static bool IsNaN(float v) => float.IsNaN(v);
            internal static bool IsInfinity(float v) => float.IsInfinity(v);
            internal static bool IsFinite(float v) => float.IsFinite(v);
            internal static bool IsNormal(float v) => float.IsNormal(v);

            internal const float NaN = float.NaN;
            internal const float MaxValue = float.MaxValue;

            /// <summary>
            /// Return x*2^exp
            /// </summary>
            /// <param name="x"></param>
            /// <param name="exp"></param>
            /// <returns></returns>
            internal static float Scale2(float x, int exp) => float.ScaleB(x, exp);

            /// <summary>
            /// Compute a*b-c*d
            /// </summary>
            /// <param name="a"></param>
            /// <param name="b"></param>
            /// <param name="c"></param>
            /// <param name="d"></param>
            /// <returns></returns>
            internal static float Det2X2(float a, float b, float c, float d)
            {
                var v1 = c * d;          // rounded c*d
                var v2 = Fma(-c, d, v1); // v1 - c*d, exact: the rounding error in v1
                var v3 = Fma(a, b, -v1); // a*b - v1, with a single rounding
                return v3 + v2;          // (a*b - v1) + (v1 - c*d) = a*b - c*d, up to the final rounding
            }

            // put in 64 bit type to make porting to other sizes easier
            static ulong ToBits(float value) => BitConverter.SingleToUInt32Bits(value);

            #endregion

            #region per type values // double, float, half, others...

            const int PrecisionBits = 24; // # of bits in binary format + implicit 1 bit
            const int ExponentBits = 8;
            const int TotalBits = sizeof(float) * 8;
            const int ExpMask = (1 << ExponentBits) - 1;
            const int ExpBias = (1 << (ExponentBits - 1)) - 1;

            #endregion

            /*
            move all values into form (sign,exp, frac) with frac in 1<= frac < 2 
            input value = sign*2^exp * value
            input of 0 returns (1,0,0)
             */
            internal static (int sign, int exp, float frac) Normalize(float value)
            {
                Debug.Assert(IsFinite(value)); // do not call on NaN, Inf
                if (value == 0) return (1, 0, 0);

                ulong i = ToBits(value);

                int sign = (i & (1UL << (TotalBits-1))) == 0 ? 1 : -1;
                int exp = (int)((i >> (PrecisionBits-1)) & ExpMask) - ExpBias;
                var frac = sign * Scale2(value, -exp);

                if (!IsNormal(value))
                {
                    // above not enough - do more....
                    var (s2, e2, f2) = Normalize(frac);
                    exp += e2;
                    frac = f2;
                    Debug.Assert(s2 == 1);
                }

                Debug.Assert(1 <= frac && frac < 2.0f);

                Debug.Assert(sign * Scale2(frac, exp) == value);


                return (sign, exp, frac);
            }




            /// <summary>
            /// check special cases 
            /// </summary>
            /// <param name="a"></param>
            /// <param name="b"></param>
            /// <param name="c"></param>
            /// <returns></returns>
            internal static (bool isHandled, float r1, float r2, RootType type) HandleSpecialCasesFloat(float a, float b, float c)
            {
                if (IsNaN(a) || IsNaN(b) || IsNaN(c))
                    return (true, NaN, NaN, RootType.InputHasNaN);
                if (IsInfinity(a) || IsInfinity(b) || IsInfinity(c))
                    return (true, NaN, NaN, RootType.InputHasInfinity);

                // cases:
                if (a == 0)
                { // want bx+c = 0 gives x = -c/b
                    if (b == 0 && c == 0)
                        return (true, NaN, NaN, RootType.AllRealNumbers);

                    var r1 = -c / b;
                    return (true, r1, NaN, RootType.OneRealRoot);
                }

                if (b == 0)
                { // a != 0, want ax^2+c = 0, so x = +/- sqrt(-c/a)

                    var sgn = Sign(a) * Sign(c); // sign of quotient

                    if (sgn <= 0)
                    { // real answers
                        var r1 = DivRoot(-c, a);
                        var r2 = -r1;
                        return (true, r1, r2, RootType.SuccessReal);
                    }
                    else
                    { // complex answers, purely imaginary
                        var r2 = DivRoot(c, a);
                        return (true, 0, r2, RootType.SuccessComplex); // 0 +/- i*r2
                    }
                }

                if (c == 0)
                { // a,b != 0, of form ax^2 + bx = 0, so roots are x=0 and x=-b/a
                    return (true, 0, -b / a, RootType.SuccessReal);
                }

                return (false, 0, 0, RootType.SuccessReal); // not real, but will continue to work

            }


            /// <summary>
            /// Compute sqrt(|x/y|), handling overflow and underflow if possible. 
            /// </summary>
            /// <param name="x"></param>
            /// <param name="y"></param>
            /// <returns></returns>
            internal static float DivRoot(float x, float y)
            {
                var (xS, xE, xF) = Normalize(x);
                var (yS, yE, yF) = Normalize(y);
                Debug.Assert(xS*yS>=0);

                var q = xF / yF;
                var e = xE - yE;
                if (((xE + yE) & 1) == 1)
                {
                    // exponent odd, scale so can easily update after root
                    q = Scale2(q, 1);
                    e--;
                }

                var r = Sqrt(q);
                return Scale2(r, e / 2);
            }

            /// <summary>
            /// Compute the discriminant D = b*b-4*a*c
            /// Return the (scaled) root r' = Sqrt(|D|), if d >= 0, and a scaling factor E
            /// such that the correct root is r = 2^E * r'
            /// </summary>
            /// <param name="a"></param>
            /// <param name="b"></param>
            /// <param name="c"></param>
            /// <returns></returns>
            internal static (float root, bool nonnegative, int scale) DiscriminantInfo(float a, float b, float c)
            {
                var (aS, aE, aF) = Normalize(a);
                var (bS, bE, bF) = Normalize(b);
                var (cS, cE, cF) = Normalize(c);

                var (root, scale, nonnegative) = (b, 0, true); // float,scaling, disc >= 0

                if (2 * bE > aE + cE + PrecisionBits + 5) // +5 works, is derived, seems to work( +4, +0, -2, ) , -10 fails (-10, -5, -4, -3)
                {
                    root = bF;
                    scale = bE;
                    nonnegative = true;
                }
                else if (2 * bE < aE + cE - PrecisionBits - 1) // works: (-1,+2,+4), fails (+5, +7,+15,+40)
                {
                    scale = aE + cE;
                    if ((scale & 1) != 0) // is odd
                    {
                        scale--;
                        aF = Scale2(aF, 1); // move factor back in
                    }
                    scale = scale / 2 + 1; //  +1 for the 4 in 4ac, then root, /2 is root

                    root = Sqrt(aF * cF);
                    nonnegative = aS * cS < 0;
                }
                else
                {
                    // from above, we have:
                    Debug.Assert(
                        -PrecisionBits - 1 <= 2 * bE - aE - cE &&
                        2 * bE - aE - cE <= PrecisionBits + 5
                    );

                    // now must align exponents for b*b and a*c so they can be subtracted... 
                    // idea, pull midpoint of (be + be) and (ae + ce) to zero, making values close to 1.0f, but still same scale

                    // scale a and c exponents to center of a and c, leaves a*c fixed in value, makes robust against underflow/overflow on following scaling
                    // in effect, we will scale (a,c) = (a*2^-dc, c*2^dc), brings them to near same in size
                    var deltaE = (aE - cE) / 2;

                    var mid = (bE + bE + aE + cE) / 4;
                    aF = Scale2(a, -mid + 2 - deltaE); // add 2 to handle 4 in 4ac
                    bF = Scale2(b, -mid);
                    cF = Scale2(c, -mid + deltaE);

                    Debug.Assert(IsFinite(aF) && IsFinite(bF) && IsFinite(cF));

                    var d = Det2X2(bF, bF, aF, cF);

                    root = Sqrt(Abs(d));
                    nonnegative = d >= 0;
                    scale = mid;
                }
                return (root, nonnegative, scale);
            }
        }
    }
}

Proofs

Here are some proof sketches for things claimed to show how to develop such algorithms.

D1: Proof for b^2 » 4ac

Now, if $b^2$ is much larger than $|4ac|$, then $D\geq 0$ and the root, as a floating point value, is $|b|$, so it can be returned as $(1,b_e,b_f)$.

How to check if it’s large enough? We need bounded relative error $\left|\frac{\sqrt{b^2-4ac}-b}{b}\right|<\epsilon$. Note floating point rounding modes affect some bounds when analyzing. Use the Taylor series $\sqrt{1-x}=1-\frac{x}{2}-\frac{x^2}{8}-\frac{x^3}{16}-\frac{5x^4}{128}+O(x^5)$, which converges for $|x|<1$, and let $x=\frac{4ac}{b^2}$.

$$\begin{aligned}\frac{\sqrt{b^2-4ac}-b}{b}&=\frac{b(1-\frac{4ac}{b^2})^{1/2}-b}{b} \\ &=-\frac{x}{2}-\frac{x^2}{8}-\frac{x^3}{16}+O(x^4) \\ &= T \end{aligned}$$

(defining $T$). Then we can bound $T$ with $x$ as $-|x|\leq T \leq |x|$.

$$\begin{aligned}|x|&=\frac{4\times 2^{a_e+c_e}a_f c_f}{2^{2b_e}b_f^2} \\ &=2^{a_e+c_e-2b_e+2} \times \frac{a_f c_f}{b_f^2} \\ &=2^Q\times P \end{aligned}$$

Since $a_f,b_f,c_f\in[1,2)$, $P< \frac{2\times 2}{1^2}=4$. So it’s enough to have

$$-\epsilon<-2^Q\times 4 \leq -2^Q\times P=-|x|\leq T \leq |x|=2^Q\times P \leq 2^Q\times 4<\epsilon$$

Taking floating point precision $p$ and $\epsilon=2^{-(p+1)}$, this reduces to

$$\begin{aligned} \epsilon & > 2^Q\times 4 \\ 2^{-(p+1)} &> 2^{a_e+c_e-2b_e+4} \\ 2b_e &> a_e+c_e+p+5 \end{aligned}$$

So, if this last condition is true, we don’t need to compute roots: merely use (root, rootE, nonnegative) = (bF, bE, true).

D2: Proof for b^2 « 4ac

The same technique as in proof D1 works.
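
A sketch of the bound, in the notation of D1 (my derivation, following the same pattern): let $y=\frac{b^2}{4ac}$. The relative error of using $2\sqrt{|ac|}$ for $\sqrt{|D|}$ is bounded by $|y|$, and

$$|y|=\frac{2^{2b_e}\, b_f^2}{4\times 2^{a_e+c_e}\, a_f c_f} < 2^{2b_e-a_e-c_e-2}\times 4=2^{2b_e-a_e-c_e}<2^{-(p+1)}=\epsilon$$

where the last inequality holds exactly when $2b_e<a_e+c_e-p-1$, the condition used in the main text.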

D3: Proof for b^2~4ac

TODO

Sign of the Discriminant

Initial proof ideas for how to check disc sign in detail

Computing the sign of the discriminant $D=b^2-4ac$ completely correctly is difficult. Here is an analysis.

Write $R(x)$ for real number $x$ rounded to floating point, and $R(x \circ y)$ for a rounded result of some operation $\circ$.

For now, assume no overflow (use scaling if needed)

Use the $b\rightarrow b/2$ trick so the discriminant is $D=b^2-ac$, so we don’t have to fiddle with the $4$.

Also assume $ac\geq 0$, which can be checked exactly from their signs, so can also assume $a,c>=0$.

Checking exponents of $b^2$ and $ac$ removes many cases. So the real issue is when they are close.

Then, to prove that $b^2\geq ac$ using floating point, where $a,b,c$ are floating point values, it looks like checking $R(b\times b)\geq R(a\times c)$ suffices, but this can fail due to rounding issues. (TODO - example).

Assume there is a floating point number $k_1$ such that if $R(b^2)\geq R(R(a\times c)\times k_1)$ then we can be sure that $b^2\geq ac$. I’ll determine such a $k_1$. Note the first expression can be computed exactly with floating point, tested, and proves what we need.

Idea:

$$\begin{aligned} R(b\times b)&\geq R(R(a\times c)\times k_1) \\ b^2(1+\delta_1)&\geq ac(1+\delta_2)k_1(1+\delta_3) \\ b^2&\geq ac \frac{(1+\delta_2)(1+\delta_3)}{(1+\delta_1)}k_1 \\ b^2&\geq ac \gamma_1 \end{aligned}$$

where I replaced the $\delta_i$ and $k_1$ expression with $\gamma_1$

Then, if $\gamma_1\geq 1$, the above would give $b^2\geq ac \gamma_1\geq ac$ as desired. $\gamma_1\geq 1$ means

$$\begin{aligned} \frac{(1+\delta_2)(1+\delta_3)}{(1+\delta_1)}k_1&\geq 1 \\ k_1&\geq \frac{(1+\delta_1)}{(1+\delta_2)(1+\delta_3)} \end{aligned}$$

Since $\frac{1+\epsilon}{(1-\epsilon)^2} \geq \frac{(1+\delta_1)}{(1+\delta_2)(1+\delta_3)}$ whenever $|\delta_i|\leq\epsilon$, any $k_1\geq \frac{1+\epsilon}{(1-\epsilon)^2}$ suffices. Taylor expand

$$\frac{1+\epsilon}{(1-\epsilon)^2} =1+3\epsilon + 5\epsilon^2+7\epsilon^3+O(\epsilon^4)$$

Since $\epsilon$ is small, $1+4\epsilon$ is larger than this Taylor expansion, and setting $k_1=1+4\epsilon$ is sufficient. Note this is also representable as a floating point number, so it’s perfect. This gives a small constant $k_1$. TODO - more, list the k1.

Doing similar work, you can derive that for the representable $k_2=1-3\epsilon$ (and the 3 is correct! not 4), computing as floats $R(b\times b) < R(R(a\times c)\times k_2)$ means that $b^2<ac$.

Now you have a bound, outside of which you know the sign of the discriminant $D$ with certainty.

To compute inside that bound, use scaling and the double-double trick (todo - check in detail, post here?)

References

[Bailey08] Bailey et al., “Library for Double-Double and Quad-Double Arithmetic,” 2008.

[Blinn05] James Blinn, “How to Solve a Quadratic Equation,” 2005

[Blinn06] James Blinn, “How to Solve a Quadratic Equation, Part II” 2006

[Boldo09] Sylvie Boldo, “Kahan’s Algorithm for a Correct Discriminant Computation at Last Formally Proven,” 2009.

[Forsythe67] George E. Forsythe, “What is a Satisfactory Quadratic Equation Solver?,” TR CS 74, 1967.

[Jannerod13] Claude-Pierre Jeannerod et al., “Further analysis of Kahan’s algorithm for the accurate computation of 2x2 determinants,” 2013.

[Higham02] Nicholas Higham, “Accuracy and Stability of Numerical Algorithms,” 2002.

[Kahan04] Kahan, “On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic,” 2004.

[Knuth14] Donald Knuth, “Seminumerical Algorithms,” Vol 2, 3rd ed., 2014.

[Muller18] Jean-Michel Muller et al., “Handbook of Floating-Point Arithmetic,” 2nd ed., 2018, Birkhäuser.

[Priest91] Douglas Priest, “Algorithms for Arbitrary Precision Floating Point Arithmetic,” 1991.

THE END
