
# A Better Algorithm for Solving Quadratic Equations

Chris Lomont, Aug 2022, www.lomont.org, v0.5

I was recently using some code that registers point clouds and that would, on rare occasions, produce really bad results. The internal algorithm required computing roots of a degree 4 polynomial, which required computing roots of a degree 3 polynomial, which in turn required computing roots of a degree 2 (quadratic) polynomial. The quadratic roots were bad in exactly the cases where the overall algorithm failed. Since bad quadratic root solvers in code have bitten me several times in my career, I decided to find or make a solid one that I can rely on.

Here are results for error and failure rates (failure meaning infinite, complex/real swapped, or NaN results):

1,000,000 passes each; ranges: small $(2^{-32}, 2^{32})$, large $(2^{-70}, 2^{70})$, huge $(2^{-126}, 2^{126})$.

| Function | small max ulp | small avg ulp | small fail % | large max ulp | large avg ulp | large fail % | huge max ulp | huge avg ulp | huge fail % |
|---|---|---|---|---|---|---|---|---|---|
| Naïve | 4.10E+12 | 1.50E+07 | 0% | 3.10E+23 | -1.90E+17 | 4.10% | 2.60E+37 | -7.20E+30 | 34% |
| Algorithm 2 | 3.60E+04 | 0.38 | 0% | 1500 | 0.34 | 3.90% | 2.00E+22 | 1.80E+17 | 25% |
| Algorithm 3 | 3.20 | 0.36 | 0% | 1500 | 0.34 | 3.50% | 2.00E+22 | 1.80E+17 | 13% |
| Algorithm 4 (this one) | 3.20 | 0.36 | 0% | 3.00 | 0.33 | 0% | 3.20 | 0.31 | 0% |

Code for my new algorithm in C++ and C# is at https://github.com/ChrisLomont/BetterQuadraticRoots.

Lots of digging around on GitHub, old numerical archives, and papers didn't turn up what I considered a good solution. For example, the commonly referenced [Kahan04] does not cover overflow. So this note shows how I derived a much more robust quadratic root solver.

Common ways to get better numerical results include adding a multi-precision library, using the “double-double trick,” or simply making sure the numerics are solid. I wanted a solution meeting these requirements:

1. Numerically robust. If the roots were representable in the floating point format, those roots would be found.
2. Numerically accurate. Have as little error as reasonably possible.
3. Fast. Since one often needs to compute large numbers of roots, I wanted good performance.
4. Small, self-contained code. This way I can include the code in whatever projects I want without needing large, cumbersome libraries. It also makes porting the code to different languages much easier.

These requirements, like all good requirements, are in tension, driving tradeoffs. Since my overall goal was to stop having code perform badly due to quadratic root inaccuracy buried low in the stack, that is the part I needed fixed.

After fixing the problem, I decided to write it up as a tutorial on how to do such things, for others to learn from.

To understand the problem, some knowledge of how computers do floating-point calculations is required. Here are facts helpful for understanding and solving this problem.

## Floating Point

A real number $r$ can be written in base $\beta$ as $r=\beta^e \sum_{i=0}^{\infty}a_i \beta^{-i}$ with integer digits $0\leq a_i<\beta$. If $r=0$, all digits $a_i$ are 0 and the exponent $e$ is not uniquely determined. If $r \neq 0$, this can be done uniquely (once you identify things like 0.999… with 1.000… in decimal) by requiring the first digit $a_0\neq 0$.

A floating-point number uses this idea, usually in base $\beta=2$, allowing only a finite range of exponents and a fixed number of digits (bits in base 2). Most computers use a floating-point standard called IEEE 754 (there are a few updates) that dictates accuracy for certain operations and the binary representations of values. IEEE 754 standardizes several binary and several decimal formats.

A floating point format is defined by four integers $(\beta,p,e_{min},e_{max})$ [Muller18]:

- A base $\beta\geq 2$, which is 2 or 10 for IEEE 754.
- A precision $p\geq 2$, roughly the number of significant bits or digits, depending on base.
- An exponent range $e_{min}<e_{max}$. All IEEE formats satisfy $e_{min}<0<e_{max}$ and $e_{min}=1-e_{max}$.

All IEEE formats represent these numbers using fixed bitlength fields, always in the following format:

```
 <---------------- N bits, N=16,32,64,128... ---------------->
+----------+--------------------+----------------------+
|sign bit s| exponent E, t bits | fraction F, p-1 bits |
+----------+--------------------+----------------------+
N,t,p-1: (16,5,10), (32,8,23), (64,11,52)

1 bit for the sign 's' (MSB): 0 = positive, 1 = negative
t bits for the exponent, stored as a biased unsigned integer E:
    bias = 2^(t-1)-1, e = E - bias
p-1 bits for the fractional part F, the stored value being 1.F
    (implied leading 1), except for denormalized numbers, infinities, NaNs
```

Relevant IEEE 754 binary formats are summarized in this table:

| Format (bitlength) | t exponent bits, p-1 fraction bits | Precision p | $(e_{min},e_{max})$ | exp bias | C#, C/C++, Java, Julia |
|---|---|---|---|---|---|
| binary16 | 5, 10 | 11 | (-14, 15) | 15 | Half, N/A, N/A, half |
| binary32 | 8, 23 | 24 | (-126, 127) | 127 | float, float, float, single |
| binary64 | 11, 52 | 53 | (-1022, 1023) | 1023 | double, double, double, double |

Note in each case the length N and number of fraction bits (equivalently N and exponent bit length) specifies the rest of the format.

- $p=N-t$
- $bias=2^{t-1}-1$
- $e_{max}=bias$, $e_{min}=1-e_{max}$

##### Special Numbers

The exponent bit field can store $e_{max}+1$, which is used to store infinity $\infty$ when the fractional bits are $F=0$, or NaN (Not-a-Number) if $F \neq 0$. The sign bit determines $\pm\infty$.

The exponent field can store $e_{min}-1$ ($E=0$), which denotes denormalized numbers. These do not have the implied 1 in the significand $1.F$. They allow what Kahan calls ‘graceful underflow’, providing a smooth drop off from $2^{e_{min}}$ down to $0$.

Values with exponent $e_{min}\leq e\leq e_{max}$ are called normalized numbers.

##### Rounding modes

IEEE 754 defines 5 rounding modes that apply when converting a real number to a floating point number or when doing operations on floating point numbers. The mode is usually set by software; most commonly used is round to nearest, ties to even.

1. Round down (toward $-\infty$), written here as $RD(x)$
2. Round up (toward $+\infty$), written $RU(x)$
3. Round toward zero, $RZ(x)$
4. Two round to nearest modes:
   1. Round to nearest, ties go to even, $RN_{even}(x)$
   2. Round to nearest, ties away from zero, which picks the larger in magnitude of the two choices, $RN_{away}(x)$

$RN_{even}(x)$ is the default mode in the standard, and from experience seems to be the most common when running software. However, for precise analysis of algorithms, you must be aware a system can be in any of these modes.

Generically, write $\circ$ to denote any of the rounding functions, so $\circ(a+b)$ means computing $a+b$, then rounding by whichever mode is in use. Also write $RN$ to denote either of the round to nearest functions.

IEEE 754 guarantees monotonicity of rounding: $x\geq y$ implies $\circ(x)\geq \circ(y)$ for any of the rounding modes. This is needed for analyzing many algorithms.

##### Error measurement

Relative error: $\epsilon(x)=\left|\frac{x-\circ(x)}{x}\right|$, set to 0 if $x=0$.

Theorem [Muller18, 2.2]: if $x$ is in the normal range, then the relative error $\epsilon(x)$ satisfies

$$\epsilon(x) < \begin{cases} \frac{1}{2} \beta^{1-p} & \circ\in\{RN_{even},RN_{away}\} \\ \beta^{1-p} & \circ\in\{RD,RU,RZ\} \end{cases}$$

These strict inequalities allow precise analysis of algorithms. The right hand value occurs so often it is called the unit roundoff $u$. It is on the order of the smallest value for which $1+u \neq 1$ in the floating point format (it is the last digit rounding). Many languages have this constant available in their standard library.

The function $ulp(x)$, for units in last place, gives the value of a change in the last bit of the representation of $x$. This definition is a bit sloppy, but good enough for this work. See [Muller18] for several definitions and issues. The rough idea is that $ulp(x)$ is a useful measurement of error in calculations, telling how far a computed number is from the true number.

##### IEEE Guarantees

Any operation $\odot\in \{+,-,\times,\div\}$ is accurate up to a certain error:

$$\begin{aligned} \circ(a \odot b)&=(a \odot b)(1+\epsilon_1), & |\epsilon_1|\leq u\\ &=(a \odot b)/(1+\epsilon_2), & |\epsilon_2|\leq u \end{aligned}$$

$\sqrt{a}$ is similarly accurate:

$$\begin{aligned} \circ(\sqrt{a})&=\sqrt{a}\,(1+\epsilon_1), & |\epsilon_1|\leq u\\ &=\sqrt{a}/(1+\epsilon_2), & |\epsilon_2|\leq u \end{aligned}$$

For each size, there is a largest value representable (both positive and negative); anything larger is an infinity $\pm \infty$. This is called overflow.

There is also a smallest representable positive number; anything smaller is treated as zero. Rolling off the small end is called underflow. (Some places also call rolling off the most negative end to $-\infty$ underflow; we’ll call that case an overflow.)

Note the very useful fact that multiplying or dividing by powers of the base, as long as no overflow or underflow happens, introduces no error.

Another common problem is catastrophic cancellation: when numbers similar in size are subtracted (or equivalently, a number is added to a negative number close in magnitude), the leading digits cancel, leaving far less precision in the answer. This is to be avoided when possible.

The algorithms below rely (somewhat, they can be carefully rederived for other formats, which are rare) on having IEEE 754 support, which is extremely common. For more info on floating-point, see my notes at https://lomont.org/posts/2022/lomont-floating-point-notes/

## The Quadratic Formula

High school algebra gives the values of $x$ that solve the quadratic equation $ax^2+bx+c=0$. The formula is

$$x=\frac{-b \pm \sqrt{b^2-4ac}}{2a}$$

The expression $D=b^2-4ac$ is called the discriminant. If $D\geq 0$ the equation has two real roots; if $D<0$ it has two complex roots, necessarily conjugates of each other (that is, the roots have the form $r_1+i r_2$ and $r_1-i r_2$ for real numbers $r_1$ and $r_2$). Correctly determining the sign of $D$ is very important. In the case $D<0$, you use $\sqrt{-D}=\sqrt{|D|}$ to get the value as a floating point number. So once you have the sign of $D$, you can simply compute $\sqrt{|D|}$ in all cases.

A few more items: given real roots $r_1,r_2$, the equation is $ax^2+bx+c=a(x-r_1)(x-r_2)=ax^2 + (-a)(r_1+r_2)x+a r_1 r_2$. Setting coefficients equal gives $a,b,c$ in terms of the roots:

$$\begin{aligned}a&=a\\ b&=-a(r_1+r_2)\\ c&=a r_1 r_2\end{aligned}$$

This last item is useful for computing $r_2$ given $r_1$, as $r_2=c/(a\times r_1)$.

The complex roots when $D<0$ are given by $\frac{-b\pm i \times \sqrt{-D}}{2a} = -\frac{b}{2a} \pm i\times \frac{\sqrt{|D|}}{2a}$.

## Improvement plan

To make a better algorithm, I first list places errors can happen, estimate what error rate a good algorithm should provide, decide on a testing method, and list some directions I avoided and why.

### Error locations

The quadratic formula as written above can fail in the following ways when implemented using floating-point numbers:

1. $b^2$ can overflow or underflow
2. $ac$ can overflow or underflow
3. $4(ac)$ can overflow
4. $b^2-4ac$ can overflow, underflow, or suffer from catastrophic cancellation
5. The sign of the discriminant $D=b^2-4ac$ can be computed incorrectly, confusing real and complex roots
6. $-b\pm\sqrt{D}$ can suffer from underflow, overflow, or catastrophic cancellation.
7. $2a$ can overflow
8. The final division of $-b\pm\sqrt{D}$ by $2a$ can overflow or underflow.

So there are a lot of places this can go wrong.

### Expected error

A rough estimate of an algorithm's error bound is the $ulp(x)$ error in the answer: $ulp<1$ means at most the last bit of the value is incorrect, while $ulp=5$ means the error is 5 times the value of the last bit.

1. $b^2$ adds 1
2. $ac$ adds 1
3. $4(ac)$ is free since powers of two have no error
4. $b^2-4ac$ adds 1
5. $\sqrt{b^2-4ac}$ adds 1
6. $\pm$ is free, sign changes have no error
7. $-b\pm\sqrt{|D|}$ adds 1
8. $2a$ is free
9. $\frac{-b\pm\sqrt{|D|}}{2a}$ adds 1

Summing gives roughly 6 ulps as the expected worst-case error for a good algorithm.

### Testing method

To make a better algorithm, I needed a way to measure improvement. To do this, I compute how far an output is from the true result as a multiple of an error in the last bit (an ulp).

Getting truth can be difficult, often requiring high precision libraries, but for this project a much simpler way exists: an ok algorithm computed in binary64 avoids the overflow and underflow that binary32 hits, so I will develop the algorithm using binary32 with binary64 as truth, then later port the idea to whatever IEEE types I want.

I’ll also use C# instead of C/C++, since it’s vastly quicker to develop test harnesses and make lots of other pieces needed to evaluate the algorithm. Once everything works, I can run the C/C++ versions against the C# versions and C# harness for testing. Plus C# supports binary16 as the Half type, so that can be tested in the same infrastructure.

The most important piece for testing is generating lots of test cases. I made a binary32 generator with the following properties

1. Can generate on any range of exponents: this allows testing naive algorithms while avoiding overflows.
2. Can generate edge cases: for each exponent, it can generate bitwise edge cases in representation.
3. Can add zeros, subnormals, infinities, and NaNs as desired to test robustness and error handling.
4. Can add arbitrary counts of random floats, where random is uniform on exponent, then uniform on fraction.

### Things I avoided

I avoided multi-precision libraries since they’re slow. Iteration and quad-double and double-double based solutions are also slow, and still suffer from overflow issues.

Some derivations divide by $a$ first to simplify code and analysis, but this adds an extra error case (for overflow or underflow), so I avoid that route.

Another nice idea, which can be made to work here, is to replace $b$ with $b/2$ as soon as it comes in, effectively solving the equation $ax^2+2bx+c=0$, which has the nicer (and fewer operations) solution $x=\frac{-b\pm\sqrt{b^2-ac}}{a}$. This works in base 2 (or a multiple of 2, like 10) since $b/2$ loses no digits, except when $b$ is denormalized. In the denormalized case, you can use the replacement $(a,c)\leftarrow(2a,2c)$, but now you also need to check that this doesn’t make $a,c$ overflow. All this testing, which ultimately still fails in rare cases, I found not to be worth using in the final algorithm.

On to some algorithms and testing:

## Algorithm 1: Naïve implementation

For the interface to each algorithm, I chose taking three floating point values a, b, c and returning the triple (float r1, float r2, RootType type), where the root type denotes real or complex roots (and other issues, explained later). The roots are the real numbers r1 and r2, or the complex numbers r1 + i\*r2 and r1 - i\*r2.

To start, and illustrate how bad the naïve high school formula performs, here is that implementation as baseline.

```csharp
// Naive root algorithm
float d = b * b - 4 * a * c;
if (d >= 0)
{
    float r = Sqrt(d);
    float r1 = (-b + r) / (2 * a);
    float r2 = (-b - r) / (2 * a);
    return (r1, r2, RootType.SuccessReal);
}
else
{
    float r1 = -b / (2 * a);
    float r2 = Sqrt(-d) / (2 * a);
    return (r1, r2, RootType.SuccessComplex);
}
```

Some hand-picked testing values illustrate the error:

```
1x^2 + 10.5x + 5 gives roots -0.5,-10 type SuccessReal u:0.00 0.00
1x^2 + 11x + 5 gives roots -0.47506237,-10.524938 type SuccessReal u:6.07 3.13
1x^2 + 400x + 1 gives roots -0.0025024414,-399.9975 type SuccessReal u:10418.65 5209.36
```

The error values are given in ulps as u:v1 v2 where v1 is the max root error and v2 is the average error.

The first example was picked to have nice roots -0.5 and -10, and it found them with zero error in the last place units (the u:0.00 0.00 parts).

Changing the 10.5 to 11 increases the error to u:6.07 3.13 which is already pretty bad, especially for such a small change in coefficients.

The third example has terrible error of 10418 ulps.

Walking the code for the last case shows the first problem: catastrophic cancellation in the term $-b+\sqrt{D}$ . $b$ is 400.000 and $r=\sqrt{D}$ is 399.995. Subtracting them loses a lot of significant digits via cancellation. We want to avoid this problem, so we must avoid subtracting similar numbers.

## Fix 1

The first fix is the one almost all code on the internet performs: rewrite the naïve formula to avoid this cancellation. This can be done by multiplying the positive root expression by its conjugate:

$$r_1=\frac{-b + \sqrt{D}}{2a}=\frac{-b + \sqrt{D}}{2a}\times\frac{-b -\sqrt{D}}{-b -\sqrt{D}}=\frac{b^2 - D}{2a(-b-\sqrt{D})}$$

This has changed the addition $-b+\sqrt{D}$ to a subtraction $-b-\sqrt{D}$ for this case, a common trick for removing a catastrophic cancellation. It’s not always possible to rewrite an equation to switch signs though.

In practice, I use a different rewrite to clean this up. The problem arises from combining $-b$ and $\pm\sqrt{D}$. So we can choose the sign of the root that avoids cancellation, compute that root (call it $r_1$), and use a different expression for the other root: $r_2 = \frac{c}{a\, r_1}$. The first root can be computed as $r_1=\frac{-b-\mathrm{sgn}(b)\sqrt{D}}{2a}$, where $\mathrm{sgn}$ is the sign function: -1 for $b<0$, 0 for $b=0$, and +1 for $b>0$. This way the combined numbers have the same sign.

## Algorithm 2: Remove a cancellation

This just takes a little change

```csharp
// Naive + remove one cancellation
float d = b * b - 4 * a * c;
if (d >= 0)
{
    float r1 = (-b - Sign(b) * Sqrt(d)) / (2 * a);
    float r2 = c / (r1 * a);
    return (r1, r2, RootType.SuccessReal);
}
else
{
    float r1 = -b / (2 * a);
    float r2 = Sqrt(-d) / (2 * a);
    return (r1, r2, RootType.SuccessComplex);
}
```

Running the examples above:

```
1x^2 + 10.5x + 5 gives roots -10,-0.5 type SuccessReal u:0.00 0.00
1x^2 + 11x + 5 gives roots -10.524938,-0.4750622 type SuccessReal u:0.19 0.13
1x^2 + 400x + 1 gives roots -399.9975,-0.0025000155 type SuccessReal u:0.35 0.21
```

So this change reduced error for these values by a lot, to very respectable numbers.

Poking around some more, I find

```
25x^2 + 100x + 99.99999 gives roots -2.000625,-1.9993752 type SuccessReal u:606.90 455.43
```

Lots of error again… Walking through the code, I find catastrophic cancellation in the discriminant $D=b^2-4ac$. Here there are many, many ways to proceed.

## Interlude

### Special Cases

At this point, to make all the solutions more robust, we should handle cases where any coefficient is 0, infinity, or NaN, so we avoid divide by 0 where possible and don’t return invalid answers. So we make a function, called at the start of each root algorithm, that handles these special cases. This lets the rest of the article focus on the case $a,b,c \neq 0$.

```csharp
// Handle cases where a, b, or c is 0, or inputs include Infinity or NaN
(bool isHandled, float r1, float r2, RootType type) HandleSpecialCasesFloat(float a, float b, float c) ...
```

We also extend the root types to handle these cases, making the overall algorithms robust against common errors and rare cases.

```csharp
// most common cases, last two bits = 01
SuccessReal    = 0b_0_01,  // got 2 real roots in r1 and r2
SuccessComplex = 0b_1_01,  // got 2 complex roots as r1 +/- i*r2
// bad input, last two bits = 00
InputHasNaN      = 0b_0_00, // r1=r2=NaN
InputHasInfinity = 0b_1_00, // r1=r2=NaN
// rare cases, last two bits = 10
OneRealRoot    = 0b_0_0_10, // coeff 'a' was 0, one root r1 = -c/b, which may be infinite, r2 = NaN
AllRealNumbers = 0b_0_1_10, // a=b=c=0, all real numbers are roots, r1=r2=NaN
```

Another change we make here is to use library-provided special functions.

### Special functions

Libraries often provide special functions, such as C++'s expm1, which computes $e^x-1$. These are generally much more accurate than computing pow(e,x)-1 directly for some values, so we leverage platform-specific functions that improve accuracy where needed. Here is the list we use:

```
// Needed functions
Sqrt(x)       returns the sqrt of a value
Sign(x)       returns -1 if x < 0, 0 if x == 0, +1 if x > 0
CopySign(x,y) returns |x|*sign(y)
Scale2(x,t)   returns x*2^t
Abs(x)        returns |x|
Fma(x,y,z)    returns x*y+z with at most one rounding error; provided by most languages (if not, see article text)
```

## Fix 2 - A better discriminant

Computing the discriminant of the quadratic correctly is a huge topic ([Kahan04] for example). Many places factor it as $(|b|+2\sqrt{|ac|})(|b|-2\sqrt{|ac|})$ and proceed, but this still has a cancellation issue, and adds lots more operations in the main path, increasing overall error.

The best fixes here are to use more precision, which is slower, but only when needed.

One way to add precision is to represent each floating-point value $v$ with multiple other values $v=v_1+v_2+v_3+v_4$, where the $v_i$ are chosen to give more bits of precision. This is done by giving them different, decreasing exponents, covering a wider range of bits. See [Knuth14, Priest91, Bailey08] for details.

These do not completely solve all the problems, since they do not allow larger or smaller exponents, and thus do not prevent overflow or underflow. They are also much slower than plain floating point math. But for some problems, especially if you need more precision, these are excellent techniques.

### Back to Fix 2

Another process useful in many algorithms is to compute the value, then compute the error, and track both, adding the accumulated error back in at strategic places. This is the fix here.

IEEE 754 recommends a FusedMultiplyAdd(x,y,z) function (called a FMA) that computes $\circ(x\times y+z)$ in floating point with one rounding at the end, instead of doing it in steps as $\circ(\circ(x\times y)+z)$, which has two roundings. The added precision of a FMA operation improves a huge array of floating point algorithms, so it’s standard in many languages.

If you don’t have a FMA, you can make one via Dekker’s algorithm, requiring 17 floating point operations, as covered in [Muller18, Section 4.4.2].

As an interesting aside, using FMA indiscriminately can also lead to errors - [Higham02, section 2.6] quotes Kahan: “[FMAs] should not be used indiscriminately”, and gives an example where the sign of the discriminant can come out wrong when computed carelessly with a FMA.

For our use, [Kahan04, Boldo09, Jannerod13] detail a very useful function Det2X2(w,x,y,z), computing the determinant $wx-yz$ of a 2x2 matrix with extra precision, which we can use to compute a higher accuracy discriminant. Here is a listing:

```csharp
// compute a*b - c*d with extra precision
// Fma is a fused multiply add
float Det2X2(float a, float b, float c, float d)
{
    var v1 = c * d;          // c * d, loses precision
    var v2 = Fma(-c, d, v1); // c*d low precision minus c*d full precision gives the error
    var v3 = Fma(a, b, -v1); // a*b - c*d, with the lost precision
    return v3 + v2;          // add back the error
}
```

It works using a very common and useful idea: compute a value, determine the rounding error, do some work, then add the error back at the end. For lots of tricks like this and more examples, read anything by Kahan.

See the comments for how it works. It’s worth working through this many times until you internalize it so you can use the ideas on your own.

So we’ll simply replace the simplistic d = b\*b - 4\*a\*c computation with a call to Det2X2(b, b, 4\*a, c), replacing 4\*a with the possibly better Scale2(a, 2).

## Algorithm 3: Better discriminant

Here is the next level of fixes: special case handling and a better discriminant.

```csharp
var (isHandled, r1, r2, type) = HandleSpecialCasesFloat(a, b, c);
if (isHandled)
    return (r1, r2, type);

var d = Det2X2(b, b, Scale2(a, 2), c);
if (d >= 0)
{
    r1 = (-b - CopySign(Sqrt(d), b)) / Scale2(a, 1);
    r2 = c / (r1 * a);
    return (r1, r2, RootType.SuccessReal);
}
else
{
    r1 = -b / Scale2(a, 1);
    r2 = Sqrt(-d) / Scale2(a, 1);
    return (r1, r2, RootType.SuccessComplex);
}
```

Now we get

```
25x^2 + 100x + 99.99999 gives roots -2.0005524,-1.9994476 type SuccessReal u:0.10 0.07
```

Nice!

At this point, we can show how these three algorithms (adding the special case handling back into the earlier ones) behave over several sets of floats (from the table at the start):

1,000,000 passes each; ranges: small $(2^{-32}, 2^{32})$, large $(2^{-70}, 2^{70})$, huge $(2^{-126}, 2^{126})$.

| Function | small max ulp | small avg ulp | small fail % | large max ulp | large avg ulp | large fail % | huge max ulp | huge avg ulp | huge fail % |
|---|---|---|---|---|---|---|---|---|---|
| Naïve | 4.10E+12 | 1.50E+07 | 0% | 3.10E+23 | -1.90E+17 | 4.10% | 2.60E+37 | -7.20E+30 | 34% |
| Algorithm 2 | 3.60E+04 | 0.38 | 0% | 1500 | 0.34 | 3.90% | 2.00E+22 | 1.80E+17 | 25% |
| Algorithm 3 | 3.20 | 0.36 | 0% | 1500 | 0.34 | 3.50% | 2.00E+22 | 1.80E+17 | 13% |
| Algorithm 4 (this one) | 3.20 | 0.36 | 0% | 3.00 | 0.33 | 0% | 3.20 | 0.31 | 0% |

So for the small (no overflow) range, Algorithm 3 is good, with no failures over 1,000,000 runs and good error bounds. However, running the large set of floats, which contains exponents up to $2^{70}$ where overflows can happen, shows larger errors and some failing cases (real/complex roots mixed up, or infinities returned when they shouldn't be).

Again, I track down some failing examples and inspect:

```
6.096731E+18x^2 + 1.3318949E+20x + 1.3318949E+20 gives roots -10.923025,NaN type SuccessComplex
Double gives roots -20.795533772103447,-1.050516445351387 type SuccessReal
```

So even though the roots computed with binary64 are well within bounds of the algorithm, it fails to find them. It also thinks the roots are complex numbers, when they should be real. Clearly the discriminant computation failed.

Now we must face overflow (and underflow).

## Fix 3 - Scaling and bounds

Picking failed cases and walking through the code, they are all cases of overflow, mostly in the computation of the discriminant.

Note that you don’t really need to compute the discriminant itself; what you really need are two things:

1. The sign of the discriminant $D$, more precisely, whether it is nonnegative.
2. The root $\sqrt{|D|}$

### Scaling

To deal with overflow and underflow, calculations that produce exponents outside some safe range $[e_{min},e_{max}]$ must be avoided. This is generally handled via scaling: scale the values in use into a safe range, do the computation, then scale the results back. Scaling by a power of the base introduces no error, which makes it extremely useful.

The idea is to treat a floating point value $b$ as a tuple $(s,e,b_f)$, where $s$ is a sign, $e$ is a power of the base $\beta=2$, and $b_f$ is $b$ scaled by powers of the base to keep the exponent of $b_f$ in a safe range. By moving most of the exponent out of the floating point number into an integer, you get far more exponent range.

As a simple example, suppose no exponent can exceed 7 in the floating point hardware, and for $b=2^4$ you want to compute $\sqrt{b^2-192}$. The answer $8=2^3$ fits in this hardware, but $b^2$ overflows. So (dropping the sign part of the tuple), $192=2^6 \times 3 = (6,3)$ and $b=2^4\times 1= (4,1)$. Then $b^2$ is $(4+4,1^2)=(8,1)$, where the squaring is done in the floating point hardware and the exponents are tracked separately. 192 cannot be subtracted from this directly, since the exponents differ, so one of them is rescaled:

$$b^2-192=(8,1)-(6,3)=(6,2^2\times 1)-(6,3)=(6,4)-(6,3)=(6,4-3)=(6,1)$$

Then take a square root by halving the exponent (if the exponent is odd, first move one power back into the fraction):

$\sqrt{(6,1)}=(6/2, \sqrt{1})=(3,1)=2^3\times 1=8$, the correct answer. At no point did the hardware floating point unit overflow.

And that’s the idea. All we need is a routine to extract the tuple $(s,e,b_f)$ from a floating point value $b$. Converting the tuple back is easy: $b=s\times2^e\times b_f$, or, in code, sign\*Scale2(bf, e) using our Scale2 function.


### Details

The details are tricky. For the first part, let’s analyze the discriminant, $D=b^2-4ac$.

#### Discriminant

From special case handling, $a,b,c \neq 0$. Also we can treat $b>0$ since it’s squared.

First, extract the tuples for all three, which ensures $b_f^2$ neither overflows nor underflows, and the same for $a_fc_f$. Note that $a_f, b_f, c_f$ are in the half-open interval $[1,2)$.

##### Case 1: b^2 >> |4ac|

Now, if $b^2$ is much larger than $|4ac|$, then $D\geq 0$ and the square root, as a floating point number, is $|b|$, so the root can be returned as $(1,b_e,b_f)$.

The sufficient condition is $2b_e > a_e+c_e+p+5$, in which case we don’t need to compute anything more: merely use

(root, rootE, nonnegative) = (bF, bE, true).

The proofs section has a detailed proof.

##### Case 2: b^2 << |4ac|

Similarly, if $|ac|$ is large compared to $b^2$, then the root is well enough approximated by $2\sqrt{|ac|}$, and $D$ is nonnegative exactly when $ac\leq 0$. Similar analysis derives the condition for this to hold: $2b_e<a_e+c_e-p-1$.

To compute the root robustly

1. $p=a_fc_f$.
2. $e=a_e+c_e$
3. If $e$ is odd then $(e,p) \leftarrow (e-1,2p)$
4. The root and scale is $(e/2,\sqrt{p})$

This handles this case.

##### Case 3: b^2 ~ |4ac|

The last discriminant case covers all others. Here we leverage the Det2X2 trick for a more accurate discriminant computation. To avoid overflow and underflow as much as possible, the goal is to find an exponent scaling $w_e$ to apply to each of $a,b,c$ such that the result neither overflows nor underflows.

This would be much harder, except the cases above already bounded the exponents. We have

$$-p-1 \leq 2b_e-(a_e+c_e)\leq p+5$$

Suppose $b_e$ is large, so computing $b^2$ will overflow. Then we want to shift everything lower. But if one of $a$ or $c$ is too small, this may scale it to underflow (zero). However, the condition for being in this case ensures $a_e+c_e$ is near $2b_e$, which means for one of $a,c$ to be small, the other has to be big.

Since we only really care about the product $ac$, the trick is to move some powers of $2$ between $a$ and $c$ so their exponents are as close as possible, then to shift the new $a,c$ and $b$ by the midpoint of $2b_e$ and $a_e+c_e$. The result is to move $a,b,c$ as close to 1.0 as possible while still retaining the information needed for the root. And, while we’re scaling, we can add 1 to the $a,c$ exponents to handle the 4 in $4ac$. This leads to:

```csharp
// scale a and c exponents to the center of a and c; leaves a*c fixed in value,
// makes robust against underflow/overflow
// in effect, we scale (a,c) = (a*2^-deltaE, c*2^deltaE), bringing them to near the same size
var deltaE = (aE - cE) / 2;
var mid = (bE + bE + (aE + cE)) / 4;
aF = Scale2(a, -mid + 2 - deltaE); // add 2 to handle the 4 in 4ac
bF = Scale2(b, -mid);
cF = Scale2(c, -mid + deltaE);
Debug.Assert(float.IsFinite(aF) && float.IsFinite(bF) && float.IsFinite(cF));

var d = Det2X2(bF, bF, aF, cF);
root = Sqrt(Abs(d));
nonnegative = d >= 0;
scale = mid;
```

And now we have a robust discriminant info routine that returns (float root, int rootExponent, bool nonnegative).

Note: this last case still has a flaw which I don’t correct here: the discriminant can still get the wrong sign, but it is very rare. See the proof section for a good start on when this can occur. The fix for that rare case would likely be to drop to double-double computation.

#### Roots

This next part turns the discriminant info into roots. There is room to make this slightly more robust, explained at the end, but in practice it is fast, and I hit no mistakes for exponents under $2^{126}$; the max float exponent $2^{127}$ hits some errors. If needed, I will fix those cases later. The code does catch one case where $-b\pm root$ may overflow, and divides first.

```csharp
root = Scale2(root, rootExponent);
if (nonnegative)
{
    if (Abs(b) < MaxValue / 2)
        r1 = (-b - CopySign(root, b)) / Scale2(a, 1);
    else // -b - CopySign(root, b) could overflow, so divide each term first
        r1 = -b / Scale2(a, 1) - CopySign(root, b) / Scale2(a, 1);
    r2 = c / (r1 * a);
    return (r1, r2, RootType.SuccessReal);
}
else
{
    r1 = -b / Scale2(a, 1);
    r2 = root / Scale2(a, 1);
    return (r1, r2, RootType.SuccessComplex);
}
```

## Algorithm 4: Scaling

See the most up to date code at https://github.com/ChrisLomont/BetterQuadraticRoots, which has float and double versions for C/C++ and C#.

A complete C# listing for float (perhaps out of date) is also at the end of this article. The routine is only around 300 lines of C# or C++ code.

## Final Results

For completeness, I repeat the table again.

Each algorithm ran 1,000,000 passes over three coefficient ranges: small (2^-32, 2^32), large (2^-70, 2^70), and huge (2^-126, 2^126).

| Function | small max ulp | small avg ulp | small fail % | large max ulp | large avg ulp | large fail % | huge max ulp | huge avg ulp | huge fail % |
|---|---|---|---|---|---|---|---|---|---|
| Naïve | 4.10E+12 | 1.50E+07 | 0% | 3.10E+23 | -1.90E+17 | 4.10% | 2.60E+37 | -7.20E+30 | 34% |
| Algorithm 2 | 3.60E+04 | 0.38 | 0% | 1500 | 0.34 | 3.90% | 2.00E+22 | 1.80E+17 | 25% |
| Algorithm 3 | 3.20 | 0.36 | 0% | 1500 | 0.34 | 3.50% | 2.00E+22 | 1.80E+17 | 13% |
| Algorithm 4 (this one) | 3.20 | 0.36 | 0% | 3.00 | 0.33 | 0% | 3.20 | 0.31 | 0% |

The final algorithm had no failures over a million runs across a huge range of inputs, with very good average and max error.
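The table's errors are measured in ulps (units in the last place). One plausible way to compute ulp distance, via the usual bit-pattern trick, is sketched below in C++; the article's actual test harness may differ:

```cpp
#include <cstdint>
#include <cstring>

// Ulp distance between two finite floats: map bit patterns onto a monotone
// integer line (negative floats folded below zero), then subtract.
int64_t ulp_distance(float x, float y) {
    auto ordered = [](float f) -> int64_t {
        int32_t i;
        std::memcpy(&i, &f, sizeof f); // raw bits, avoids aliasing issues
        // nonnegative floats are already ordered by their bits;
        // reflect negative ones so the mapping decreases with magnitude
        return i >= 0 ? static_cast<int64_t>(i)
                      : static_cast<int64_t>(INT32_MIN) - i;
    };
    int64_t d = ordered(x) - ordered(y);
    return d < 0 ? -d : d;
}
```

Adjacent floats are distance 1 apart, and +0 and -0 are distance 0, which makes max/avg ulp statistics easy to accumulate.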

## Benchmarks

Using BenchmarkDotNet, the new algorithm is around 4x slower than the naïve one (roughly 71 ns vs. 18 ns per call), so it is plenty fast for all but the most performance-critical needs.
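For reference, here is a minimal C++ timing harness in the same spirit (BenchmarkDotNet handles warmup, statistics, and outlier rejection, which this sketch does not). The naive formula stands in for whichever solver is being measured:

```cpp
#include <chrono>
#include <cmath>
#include <random>
#include <utility>

// Naive quadratic roots, used here only as the function under test.
static std::pair<double, double> naive_roots(double a, double b, double c) {
    double s = std::sqrt(b * b - 4.0 * a * c);
    return { (-b + s) / (2.0 * a), (-b - s) / (2.0 * a) };
}

// Average nanoseconds per call over n random well-conditioned inputs.
double avg_ns_per_call(int n) {
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> coeff(1.0, 2.0);
    volatile double sink = 0; // keep the optimizer from removing the loop
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        // b is scaled by 4 so b*b >= 4ac, keeping the roots real
        auto [r1, r2] = naive_roots(coeff(rng), 4.0 * coeff(rng), coeff(rng));
        sink = sink + r1 + r2;
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}
```

Random coefficients keep the branch predictor honest; timing a single fixed input would understate the cost of the branchy paths.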

## Wrap up

The most accurate solution with the best performance would apply scaling to handle overflow and underflow, use double-doubles as needed, and handle every edge case that still fails (there are some, but only with extreme values) by rewriting the appropriate expressions for each case and computing as much as possible exactly.
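The "double-double trick" mentioned above is built from two classic error-free transformations, which return a rounded result plus its exact rounding error so a value can be carried as an unevaluated pair hi + lo with roughly twice the precision. A minimal C++ sketch (see [Bailey08] and [Priest91] for full libraries):

```cpp
#include <cmath>

// A double-double value: the unevaluated sum hi + lo.
struct dd { double hi, lo; };

// Knuth's branch-free TwoSum: s is the rounded sum, e its exact error.
dd two_sum(double a, double b) {
    double s  = a + b;
    double bv = s - a;             // the part of b that made it into s
    double e  = (a - (s - bv)) + (b - bv);
    return { s, e };
}

// TwoProd: the exact product split into rounded part plus error, via fma.
dd two_prod(double a, double b) {
    double p = a * b;
    return { p, std::fma(a, b, -p) };
}
```

Chaining these gives double-double addition and multiplication, which is enough to evaluate $b^2-4ac$ with an effectively 106-bit significand in the cases where a single fma does not suffice.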

See the most recent code at https://github.com/ChrisLomont/BetterQuadraticRoots.

## Future work

Here are some notes to track things I should fix or change given another pass someday:

1. Fix up the last possible error cases
2. Make solid proofs of all performance and accuracy claims
3. More extensive testing, especially the binary16 and binary64 cases
4. Make all code generic, using one code path over different types, for both C/C++ and C#
5. Optimize, expecting 2x performance improvements via clipping slower paths
6. More extensive benchmarking
7. Implement the homogeneous roots and stable-for-animation behavior of [Blinn05, Blinn06]
8. Add some comments from [Forsythe67], make sure I hit his criteria
9. Add iteration methods to the testing for speed and accuracy
10. Test against other common libraries and platforms
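As a taste of the iteration methods mentioned above, a single Newton step can polish a nearly-correct root. This is a hypothetical C++ helper, not part of the published code:

```cpp
#include <cmath>

// One Newton step on f(x) = a*x^2 + b*x + c to polish a nearly-correct root.
// The residual is evaluated with Horner's rule via fma to limit cancellation.
double polish(double a, double b, double c, double x) {
    double fx  = std::fma(std::fma(a, x, b), x, c); // a*x^2 + b*x + c
    double dfx = std::fma(2.0 * a, x, b);           // derivative 2*a*x + b
    return dfx != 0.0 ? x - fx / dfx : x;           // skip step at a stationary point
}
```

Newton's method converges quadratically near a simple root, so one step on an input with relative error $e$ leaves roughly $e^2$; it does little good near a double root, where convergence degrades to linear.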

## Code Listings

Here is the complete C# listing:

```csharp
namespace Lomont.Numerical
{
    public static partial class QuadraticEquation
    {
        [Flags]
        public enum RootType
        {
            // most common cases, last two bits = 01
            SuccessReal    = 0b0_01, // got 2 real roots in r1 and r2
            SuccessComplex = 0b1_01, // got 2 complex roots as r1 +/- i*r2

            // bad input, last 2 bits = 00
            InputHasNaN      = 0b0_00, // r1=r2=NaN
            InputHasInfinity = 0b1_00, // r1=r2=NaN

            // rare cases, last two bits 10
            OneRealRoot    = 0b0_0_10, // coeff 'a' was 0, one root r1 = -c/b, which may be infinite, r2 = NaN
            AllRealNumbers = 0b0_1_10, // a=b=c=0, all real numbers are roots, r1=r2=NaN
        }
    }
}

// MIT License
//
// Copyright (c) 2022 Chris Lomont
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.

using System.Diagnostics;

namespace Lomont.Numerical
{
    public static partial class QuadraticEquation
    {
        /// <summary>
        /// Compute roots using float32 for the quadratic equation ax^2 + bx + c = 0
        /// Returns (r1, r2, rootType) where root type is
        ///     SuccessReal      : two real roots r1,r2
        ///     SuccessComplex   : two complex valued roots r1 +/- i*r2
        ///     InputHasNaN      : input has invalid values
        ///     InputHasInfinity : input has invalid values
        ///     OneRealRoot      : a was 0, so real root in r1, r2 = NaN
        ///     AllRealNumbers   : a=b=c=0, all numbers valid roots, r1=r2=NaN
        ///
        /// Derivation of algorithms Chris Lomont, 2022, https://lomont.org/posts/2022/a-better-quadratic-formula-algorithm/
        /// </summary>
        public static (float r1, float r2, RootType type) FloatRoots(float a, float b, float c)
        {
            var (isHandled, r1, r2, type) = InternalF.HandleSpecialCasesFloat(a, b, c);
            if (isHandled)
                return (r1, r2, type);

            // so now can assume a,b,c nonzero
            var (root, nonnegative, rootE) = InternalF.DiscriminantInfo(a, b, c);

            // todo - can make this all slightly more accurate, see https://lomont.org/posts/2022/a-better-quadratic-formula-algorithm/
            root = InternalF.Scale2(root, rootE);
            if (nonnegative)
            {
                if (InternalF.Abs(b) < InternalF.MaxValue / 2)
                    r1 = (-b - InternalF.CopySign(root, b)) / InternalF.Scale2(a, 1);
                else
                    r1 = -b / InternalF.Scale2(a, 1) - InternalF.CopySign(root, b) / InternalF.Scale2(a, 1);
                r2 = c / (r1 * a);
                return (r1, r2, RootType.SuccessReal);
            }
            else
            {
                r1 = -b / InternalF.Scale2(a, 1);
                r2 = root / InternalF.Scale2(a, 1);
                return (r1, r2, RootType.SuccessComplex);
            }
        }

        static class InternalF
        {
            #region Per platform funcs
            internal static float Sqrt(float v) => float.Sqrt(v);
            internal static int Sign(float v) => float.Sign(v);
            internal static float CopySign(float x, float y) => float.CopySign(x, y);
            internal static float Abs(float v) => float.Abs(v);
            internal static float Fma(float x, float y, float z) => float.FusedMultiplyAdd(x, y, z);
            internal static bool IsNaN(float v) => float.IsNaN(v);
            internal static bool IsInfinity(float v) => float.IsInfinity(v);
            internal static bool IsFinite(float v) => float.IsFinite(v);
            internal static bool IsNormal(float v) => float.IsNormal(v);
            internal const float NaN = float.NaN;
            internal const float MaxValue = float.MaxValue;

            /// <summary>
            /// Return x*2^exp
            /// </summary>
            internal static float Scale2(float x, int exp) => float.ScaleB(x, exp);

            /// <summary>
            /// Compute a*b-c*d
            /// </summary>
            internal static float Det2X2(float a, float b, float c, float d)
            {
                var v1 = c * d;          // c * d loses precision
                var v2 = Fma(-c, d, v1); // c*d full precision - c*d low precision gives excess
                var v3 = Fma(a, b, -v1); // ab-cd with lost precision
                return v3 + v2;          // add back excess
            }

            // put in 64 bit type to make porting to other sizes easier
            static ulong ToBits(float value) => BitConverter.SingleToUInt32Bits(value);
            #endregion

            #region per type values
            // double, float, half, others...
            const int PrecisionBits = 24; // # of bits in binary format + implicit 1 bit
            const int ExponentBits = 8;
            const int TotalBits = sizeof(float) * 8;
            const int ExpMask = (1 << ExponentBits) - 1;
            const int ExpBias = (1 << (ExponentBits - 1)) - 1;
            #endregion

            /* move all values into form (sign, exp, frac) with frac in 1 <= frac < 2
               input value = sign * 2^exp * frac
               input of 0 returns (1,0,0)
            */
            internal static (int sign, int exp, float frac) Normalize(float value)
            {
                Debug.Assert(IsFinite(value)); // do not call on NaN, Inf
                if (value == 0)
                    return (1, 0, 0);
                ulong i = ToBits(value);
                int sign = (i & (1UL << (TotalBits - 1))) == 0 ? 1 : -1;
                int exp = (int)((i >> (PrecisionBits - 1)) & ExpMask) - ExpBias;
                var frac = sign * Scale2(value, -exp);
                if (!IsNormal(value))
                {
                    // above not enough - do more....
                    var (s2, e2, f2) = Normalize(frac);
                    exp += e2;
                    frac = f2;
                    Debug.Assert(s2 == 1);
                }
                Debug.Assert(1 <= frac && frac < 2.0f);
                Debug.Assert(sign * Scale2(frac, exp) == value);
                return (sign, exp, frac);
            }

            /// <summary>
            /// check special cases
            /// </summary>
            internal static (bool isHandled, float r1, float r2, RootType type) HandleSpecialCasesFloat(float a, float b, float c)
            {
                if (IsNaN(a) || IsNaN(b) || IsNaN(c))
                    return (true, NaN, NaN, RootType.InputHasNaN);
                if (IsInfinity(a) || IsInfinity(b) || IsInfinity(c))
                    return (true, NaN, NaN, RootType.InputHasInfinity);

                // cases:
                if (a == 0)
                {
                    // want bx+c = 0 gives x = -c/b
                    if (b == 0 && c == 0)
                        return (true, NaN, NaN, RootType.AllRealNumbers);
                    var r1 = -c / b;
                    return (true, r1, NaN, RootType.OneRealRoot);
                }
                if (b == 0)
                {
                    // a != 0, want ax^2+c = 0, so x = +/- sqrt(-c/a)
                    var sgn = Sign(a) * Sign(c); // sign of quotient
                    if (sgn <= 0)
                    {
                        // real answers
                        var r1 = DivRoot(-c, a);
                        var r2 = -r1;
                        return (true, r1, r2, RootType.SuccessReal);
                    }
                    else
                    {
                        // complex answers, purely imaginary
                        var r2 = DivRoot(c, a);
                        return (true, 0, r2, RootType.SuccessComplex); // 0 +/- i*r2
                    }
                }
                if (c == 0)
                {
                    // a,b != 0, of form ax^2 + bx = 0, so roots are x=0 and x=-b/a
                    return (true, 0, -b / a, RootType.SuccessReal);
                }
                return (false, 0, 0, RootType.SuccessReal); // not real, but will continue to work
            }

            /// <summary>
            /// Compute sqrt(|x/y|), handling overflow and underflow if possible.
            /// </summary>
            internal static float DivRoot(float x, float y)
            {
                var (xS, xE, xF) = Normalize(x);
                var (yS, yE, yF) = Normalize(y);
                Debug.Assert(xS * yS >= 0);
                var q = xF / yF;
                var e = xE - yE;
                if (((xE + yE) & 1) == 1)
                {
                    // exponent odd, scale so can easily update after root
                    q = Scale2(q, 1);
                    e--;
                }
                var r = Sqrt(q);
                return Scale2(r, e / 2);
            }

            /// <summary>
            /// Compute the discriminant D = b*b-4*a*c
            /// Return the (scaled) root r' = Sqrt(|D|), if d >= 0, and a scaling factor E
            /// such that the correct root is r = 2^E * r'
            /// </summary>
            internal static (float root, bool nonnegative, int scale) DiscriminantInfo(float a, float b, float c)
            {
                var (aS, aE, aF) = Normalize(a);
                var (bS, bE, bF) = Normalize(b);
                var (cS, cE, cF) = Normalize(c);

                var (root, scale, nonnegative) = (b, 0, true); // float, scaling, disc >= 0

                if (2 * bE > aE + cE + PrecisionBits + 5) // +5 works, is derived, seems to work (+4, +0, -2), -10 fails (-10, -5, -4, -3)
                {
                    root = bF;
                    scale = bE;
                    nonnegative = true;
                }
                else if (2 * bE < aE + cE - PrecisionBits - 1) // works: (-1,+2,+4), fails (+5,+7,+15,+40)
                {
                    scale = aE + cE;
                    if ((scale & 1) != 0) // is odd
                    {
                        scale--;
                        aF = Scale2(aF, 1); // move factor back in
                    }
                    scale = scale / 2 + 1; // +1 for the 4 in 4ac, /2 is the root
                    root = Sqrt(aF * cF);
                    nonnegative = aS * cS < 0;
                }
                else
                {
                    // from above, we have:
                    Debug.Assert(
                        -PrecisionBits - 1 <= 2 * bE - aE - cE &&
                        2 * bE - aE - cE <= PrecisionBits + 5
                        );

                    // now must align exponents for b*b and a*c so they can be subtracted...
                    // idea: pull midpoint of (bE + bE) and (aE + cE) to zero, making values close to 1.0f, but still same scale
                    // scale a and c exponents to center of a and c, leaves a*c fixed in value, makes robust against underflow/overflow on following scaling
                    // in effect, we will scale (a,c) = (a*2^-dc, c*2^dc), brings them to near same size
                    var deltaE = (aE - cE) / 2;
                    var mid = (bE + bE + aE + cE) / 4;
                    aF = Scale2(a, -mid + 2 - deltaE); // add 2 to handle 4 in 4ac
                    bF = Scale2(b, -mid);
                    cF = Scale2(c, -mid + deltaE);
                    Debug.Assert(IsFinite(aF) && IsFinite(bF) && IsFinite(cF));
                    var d = Det2X2(bF, bF, aF, cF);
                    root = Sqrt(Abs(d));
                    nonnegative = d >= 0;
                    scale = mid;
                }
                return (root, nonnegative, scale);
            }
        }
    }
}
```

## Proofs

Here are some proof sketches for things claimed to show how to develop such algorithms.

### D1: Proof for b^2 » 4ac

Now, if $b^2$ is much larger than $4ac$, then $D\geq 0$ and the root, as a floating point value, is $|b|$, so the root can be returned as $(1,b_e,b_f)$.

How to check if it's large enough? We need bounded relative error $\frac{\sqrt{b^2-4ac}-b}{b}<\epsilon$. (Note that floating point rounding modes affect some of the bounds in this analysis.) Use the Taylor series $\sqrt{1-x}=1-\frac{x}{2}-\frac{x^2}{8}-\frac{x^3}{16}-\frac{5x^4}{128}+O(x^5)$, which converges for $|x|<1$, and let $x=\frac{4ac}{b^2}$.

$$\begin{aligned}\frac{\sqrt{b^2-4ac}-b}{b}&=\frac{b(1-\frac{4ac}{b^2})^{1/2}-b}{b}\\ &=-\frac{x}{2}-\frac{x^2}{8}-\frac{x^3}{16}+O(x^4)\\ &= T \end{aligned}$$

(defining $T$). Then we can bound $T$ with $x$ as $-|x|\leq T \leq |x|$.

$$\begin{aligned}|x|&=\frac{4\times 2^{a_e+c_e}a_f c_f}{2^{2b_e}b_f^2}\\ &=2^{a_e+c_e-2b_e+2} \times \frac{a_f c_f}{b_f^2}\\ &=2^Q\times P \end{aligned}$$

Since $a_f,b_f,c_f\in[1,2)$, $P\leq \frac{2\times 2}{1^2}=4$. So it's enough to have

$$-\epsilon<-2^Q\times 4 \leq -2^Q\times P=-|x|\leq T \leq |x|=2^Q\times P\leq 2^Q\times 4<\epsilon$$

Taking floating point precision $p$ and $\epsilon=2^{-(p+1)}$, this reduces to

$$\begin{aligned} \epsilon & > 2^Q\times 4\\ 2^{-(p+1)} &> 2^{a_e+c_e-2b_e+2+2}\\ 2b_e &> a_e+c_e+p+5 \end{aligned}$$

So, if this last condition is true, we don't need to compute any roots: merely use (root, rootE, nonnegative) = (bF, bE, true).
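The condition $2b_e > a_e+c_e+p+5$ is easy to check numerically. A C++ sketch for float ($p=24$), reading exponents with `frexp`; note `frexp` uses a $[0.5,1)$ fraction convention, which shifts every exponent by 1 relative to the $[1,2)$ convention above, and those shifts cancel on both sides of the inequality:

```cpp
#include <cmath>

// D1 condition: when 2*bE > aE + cE + p + 5 (p = 24 for float), b*b dominates
// 4ac so much that sqrt(b*b - 4*a*c) rounds to |b|.
bool d1_holds(float a, float b, float c) {
    const int p = 24; // bits of float precision
    int aE, bE, cE;
    std::frexp(a, &aE); // exponents only; fractions discarded
    std::frexp(b, &bE);
    std::frexp(c, &cE);
    return 2 * bE > aE + cE + p + 5;
}
```

For example, with $b=2^{20}$ and $a=c=1$ the condition holds, and indeed `sqrtf(b*b - 4.0f)` returns exactly `b`.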

### D2: Proof for b^2 « 4ac

The same techniques as in proof D1 work.

TODO

### Sign of the Discriminant

Computing the sign of the discriminant $D=b^2-4ac$ completely correctly is difficult. Here is an analysis.

Write $R(x)$ for real number $x$ rounded to floating point, and $R(x \circ y)$ for a rounded result of some operation $\circ$.

For now, assume no overflow (use scaling if needed)

Use the $b\rightarrow b/2$ trick so the discriminant is $D=b^2-ac$, so we don’t have to fiddle with the $4$.

Also assume $ac\geq 0$, which can be checked exactly from the signs, so we can also assume $a,c\geq 0$.

Checking exponents of $b^2$ and $ac$ removes many cases. So the real issue is when they are close.

Then, to prove that $b^2\geq ac$ using floating point, where $a,b,c$ are floating point values, the natural test is $R(b\times b)\geq R(a\times c)$, but this can fail due to rounding issues. (TODO - example).

Assume there is a floating point number $k_1$ such that if $R(b\times b)\geq R(R(a\times c)\times k_1)$ then we can be sure that $b^2\geq ac$. I'll determine such a $k_1$. Note the left-hand comparison can be evaluated directly in floating point, so testing it proves what we need.

Idea:

$$\begin{aligned} R(b\times b)&\geq R(R(a\times c)\times k_1)\\ b^2(1+\delta_1)&\geq ac(1+\delta_2)k_1(1+\delta_3)\\ b^2&\geq ac \frac{(1+\delta_2)(1+\delta_3)}{(1+\delta_1)}k_1\\ b^2&\geq ac \gamma_1 \end{aligned}$$

where I replaced the $\delta_i$ and $k_1$ expression with $\gamma_1$.

Then, if $\gamma_1\geq 1$, the above would give $b^2\geq ac \gamma_1\geq ac$ as desired. $\gamma_1\geq 1$ means

$$\begin{aligned} \frac{(1+\delta_2)(1+\delta_3)}{(1+\delta_1)}k_1&\geq 1\\ k_1&\geq \frac{(1+\delta_1)}{(1+\delta_2)(1+\delta_3)} \end{aligned}$$

Since $\frac{(1+\epsilon)}{(1-\epsilon)^2} \geq \frac{(1+\delta_1)}{(1+\delta_2)(1+\delta_3)}$ whenever each $|\delta_i|\leq\epsilon$, any $k_1\geq \frac{(1+\epsilon)}{(1-\epsilon)^2}$ suffices. Taylor expand

$$\frac{(1+\epsilon)}{(1-\epsilon)^2} =1+3\epsilon + 5\epsilon^2+7\epsilon^3+O(\epsilon^4)$$

Since $\epsilon$ is small, $1+4\epsilon$ is larger than the Taylor expansion, so setting $k_1=1+4\epsilon$ is sufficient. Note this is also exactly representable as a floating point number, so it's perfect. This gives a small constant $k_1$. TODO - more, list the k1.

Doing similar work, you can derive that for the representable $k_2=1-3\epsilon$ (and the 3 is correct, not 4!), computing as floats $R(b\times b) < R(R(a\times c)\times k_2)$ means that $b^2<ac$.

Now you have a bound, outside of which you know the sign of the discriminant $D$ with certainty.

To compute inside that bound, use scaling and the double-double trick (todo - check in detail, post here?)
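The certainty test above can be sketched in C++ for double (with $b$ already halved and $ac\geq 0$ as assumed in the setup); here `eps` is taken as the unit roundoff, an assumption on my part about which rounding bound the proof's $\epsilon$ denotes:

```cpp
#include <cfloat>

// Sign-of-discriminant certainty test: k1 = 1+4*eps certifies b^2 >= ac,
// k2 = 1-3*eps certifies b^2 < ac; between the thresholds the sign is
// undecided and needs extra precision (e.g. double-double).
enum class DiscSign { NonNegative, Negative, Unknown };

DiscSign disc_sign(double a, double b, double c) {
    const double eps = DBL_EPSILON / 2; // unit roundoff u = 2^-53
    const double k1 = 1 + 4 * eps;
    const double k2 = 1 - 3 * eps;
    double bb = b * b, ac = a * c;      // R(b*b) and R(a*c)
    if (bb >= ac * k1) return DiscSign::NonNegative; // b^2 >= ac guaranteed
    if (bb <  ac * k2) return DiscSign::Negative;    // b^2 <  ac guaranteed
    return DiscSign::Unknown;           // too close to call at this precision
}
```

Clearly separated inputs land in the certain branches; only inputs where $b^2$ and $ac$ agree to within a few ulps fall through to `Unknown`.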

## References

[Bailey08] David Bailey et al., “Library for Double-Double and Quad-Double Arithmetic,” 2008.

[Blinn05] James Blinn, “How to Solve a Quadratic Equation,” 2005.

[Blinn06] James Blinn, “How to Solve a Quadratic Equation, Part II,” 2006.

[Boldo09] Sylvie Boldo, “Kahan’s Algorithm for a Correct Discriminant Computation at Last Formally Proven,” 2009.

[Forsythe67] George E. Forsythe, “What Is a Satisfactory Quadratic Equation Solver?,” TR CS 74, 1967.

[Jeannerod13] Claude-Pierre Jeannerod et al., “Further analysis of Kahan’s algorithm for the accurate computation of 2x2 determinants,” 2013.

[Higham02] Nicholas Higham, “Accuracy and Stability of Numerical Algorithms,” 2nd ed., 2002.

[Kahan04] William Kahan, “On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic,” 2004.

[Knuth14] Donald Knuth, “Seminumerical Algorithms,” Vol. 2, 3rd ed., 2014.

[Muller18] Jean-Michel Muller et al., “Handbook of Floating-Point Arithmetic,” 2nd ed., Birkhäuser, 2018.

[Priest91] Douglas Priest, “Algorithms for Arbitrary Precision Floating Point Arithmetic,” 1991.
