A better quadratic formula algorithm
A robust algorithm for computing roots of quadratic equations, much better than common code.
A Better Algorithm for Solving Quadratic Equations
Chris Lomont, Aug 2022, www.lomont.org, v0.5
I was recently using some code that did registration of point clouds, and it would occasionally produce really bad results. The internal algorithm required computing roots of a degree 4 polynomial, which required computing roots of a degree 3 polynomial, which required computing roots of a degree 2 (quadratic) polynomial. The quadratic roots were bad in the cases the overall algorithm failed. Since having bad quadratic root solvers in code has bitten me several times in my career, I decided to find or make a solid one that I can rely on.
Here are results for error and failure (infinite or complex/real swapped or NaN results) rates:
1,000,000 passes each | small range (2^-32, 2^32) | large range (2^-70,2^70) | huge range (2^-126,2^126) | ||||||
---|---|---|---|---|---|---|---|---|---|
Function | max ulp | avg ulp | fail % | max ulp | avg ulp | fail % | max ulp | avg ulp | fail % |
Naïve | 4.10E+12 | 1.50E+07 | 0% | 3.10E+23 | -1.90E+17 | 4.10% | 2.60E+37 | -7.20E+30 | 34% |
Algorithm 2 | 3.60E+04 | 0.38 | 0% | 1500 | 0.34 | 3.90% | 2.00E+22 | 1.80E+17 | 25% |
Algorithm 3 | 3.20 | 0.36 | 0% | 1500 | 0.34 | 3.50% | 2.00E+22 | 1.80E+17 | 13% |
Algorithm 4 (this one) | 3.20 | 0.36 | 0% | 3.00 | 0.33 | 0% | 3.20 | 0.31 | 0% |
Code for my new algorithm in C++ and C# is at https://github.com/ChrisLomont/BetterQuadraticRoots.
Lots of digging around on GitHub, old numerical archives, and papers didn’t show what I considered a good solution. For example, the commonly referenced [Kahan04] does not cover overflow. So this note is how I derived a much more robust quadratic root solver.
Common solutions to get better numerical results include adding a multi-precision library, using the “double-double trick,” or simply making sure the numerics are solid.
I had a few requirements:
- Numerically robust. If the roots were representable in the floating point format, those roots would be found.
- Numerically accurate. Have as little error as reasonably possible.
- Fast. Since one often needs to compute large numbers of roots, I wanted good performance.
- Small, self-contained code. This way I can include the code in whatever projects I want without needing large, cumbersome libraries. It also makes porting the code to different languages much easier.
These requirements, like all good requirements, are in tension, driving tradeoffs. Since my overall goal was to stop having code perform badly due to quadratic root inaccuracy buried low in the stack, that is the part I needed fixed.
After fixing the problem, I decided to write it up as a tutorial on how to do such things for others to learn from.
To understand the problem, some knowledge of how computers do floating-point calculations is required. Here are facts helpful for understanding and solving this problem.
Floating Point
A real number $r$ can be written in base $\beta$ as $r=\beta^e \sum_{i=0}a_i \beta^{-i}$. If $r=0$ all digits $a_i$ are 0 and the exponent $e$ is not uniquely determined. If $r \neq 0$ this can be done uniquely (once you identify things like 0.999… with 1.000… in decimal) with the first digit $a_0\neq 0$.
A floating-point number uses this idea, usually in base $\beta=2$, and only allows a finite range of exponents and a fixed number of bits (think of them as 0,1 digits). Most computers use a floating-point standard called IEEE 754 (there are a few updates) that dictates accuracy for certain operations and binary representations of values. IEEE 754 standardizes several binary and several decimal formats.
A floating point format is defined by four integers $(\beta,p,e_{min},e_{max})$ [Muller18],
- A base $\beta\geq 2$, which is 2 or 10 for IEEE 754.
- A precision $p\geq 2$ which is roughly the number of significant bits or digits, depending on base.
- An exponent range $e_{min}<e_{max}$. All IEEE formats satisfy $e_{min}<0<e_{max}$ and $e_{min}=1-e_{max}$.
All IEEE formats represent these numbers using fixed bitlength fields, always in the following format:
| sign (1 bit) | biased exponent $E$ ($t$ bits) | fraction $F$ ($p-1$ bits) |
---|---|---|
Relevant IEEE 754 binary formats are summarized in this table
Format (bitlength) | t exponent bits, p-1 fraction bits | Precision p | (emin,emax) | exp bias | C#, C/C++, Java, Julia |
---|---|---|---|---|---|
binary16 | 5,10 | 11 | (-14,15) | 15 | Half, N/A, N/A, half |
binary32 | 8,23 | 24 | (-126,127) | 127 | float,float,float,single |
binary64 | 11,52 | 53 | (-1022,1023) | 1023 | double,double,double,double |
Note that in each case the total bitlength N and the number of fraction bits (equivalently, N and the exponent bit length t) determine the rest of the format:
- $p=N-t$
- $bias=2^{t-1}-1$
- $e_{max}=bias=2^{t-1}-1$, $e_{min}=1-e_{max}$
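As a quick check of these relations, here is a tiny illustrative helper (not from the article's code) that recovers the parameters from the total bit length N and the exponent bit count t:

```csharp
// Recover the format parameters from total bits N and exponent bits t (sketch).
static (int p, int bias, int eMin, int eMax) FormatParams(int N, int t)
{
    int p = N - t;                  // precision, counting the implicit leading bit
    int bias = (1 << (t - 1)) - 1;  // exponent bias
    int eMax = bias;                // largest normal exponent
    int eMin = 1 - eMax;            // smallest normal exponent
    return (p, bias, eMin, eMax);
}
// FormatParams(32, 8)  returns (24, 127, -126, 127)    -- binary32
// FormatParams(64, 11) returns (53, 1023, -1022, 1023) -- binary64
```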
Special Numbers
The exponent bit field can store $e_{max}+1$, which is used to store infinity $\infty$ when the fractional bits are $F=0$, or NaN (Not-a-Number) if $F \neq 0$. The sign bit determines $\pm\infty$.
The exponent field can also store $e_{min}-1$ (biased field $E=0$), which denotes denormalized (subnormal) numbers. These do not have the implied leading 1 in the significand $1.F$. They allow what Kahan calls ‘graceful underflow’, providing a smooth drop off from $2^{e_{min}}$ down to $0$.
Values with exponent $e_{min}\leq e\leq e_{max}$ are called normalized numbers.
Rounding modes
IEEE 754 defines 5 rounding modes that apply when converting a real number to a floating point number or when doing operations on floating point numbers. These are usually set by software. Most commonly used is “round to even”.
- Round down (toward $-\infty$), written here as $RD(x)$
- Round up (toward $+\infty$), written $RU(x)$
- Round toward zero, $RZ(x)$
- Two round to nearest modes
- Round to nearest, ties go to even $RN_{even}(x)$
- Round to nearest, ties away from zero: picks the larger in magnitude of the two choices, $RN_{away}(x)$.
$RN_{even}(x)$ is the default mode in the standard, and from experience seems to be the most common when running software. However, for precise analysis of algorithms, you must be aware a system can be in any of these modes.
Generically, write $\circ$ to denote any of the rounding functions, so $\circ(a+b)$ means computing $a+b$, then rounding by whichever mode is being used. Also write $RN$ to denote either of the round to nearest functions.
IEEE 754 guarantees monotonicity of rounding: $x\geq y$ means $\circ(x)\geq \circ(y)$ for any of the rounding modes. This is needed for analyzing many algorithms.
Error measurement
Relative error: $\epsilon(x)=\left|\frac{x-\circ(x)}{x}\right|$. Set to 0 if $x=0$
Theorem [Muller18, 2.2]: if $x$ is in the normal range, then the relative error $\epsilon(x)$ satisfies
$$\epsilon(x) < \left\{ \begin{array}{ll}
\frac{1}{2} \beta^{1-p} & \circ\in\{RN_{even},RN_{away}\}\\
\beta^{1-p} & \circ\in\{RD,RU,RZ\}
\end{array}\right.$$
These strict inequalities allow precise analysis of algorithms. The right hand value occurs so often it is called the unit roundoff $u$. It is roughly the smallest value such that $1+u \neq 1$ in the floating point format (it is the last digit rounding). Many languages have this constant available in their standard library.
The function $ulp(x)$, for units in the last place, gives the value of a change in the last bit of the representation of $x$. This definition is a bit sloppy, but good enough for this work. See [Muller18] for several definitions and issues. The rough idea is that $ulp(x)$ is a useful measurement of error in calculations, telling how far a computed number is from the true number.
IEEE Guarantees
Any operation $\odot\in \{+,-,\times,\div\}$ is accurate up to a certain error:
$$\begin{aligned}
\circ(a \odot b)&=(a \odot b)(1+\epsilon_1), & |\epsilon_1|\leq u\\
&=(a \odot b)/(1+\epsilon_2), & |\epsilon_2|\leq u
\end{aligned}$$
$\sqrt{a}$ is similarly accurate:
$$\begin{aligned} \circ(\sqrt{a})&=(\sqrt{a})(1+\epsilon_1), & |\epsilon_1|\leq u\\ &=(\sqrt{a})/(1+\epsilon_2), & |\epsilon_2|\leq u \end{aligned}$$
For each size, there is a largest value representable (both positive and negative); anything larger is an infinity $\pm \infty$. This is called overflow.
There is also a smallest representable positive number; anything smaller is treated as zero. Rolling off the smallest end is called underflow. (Some places call underflow rolling off the largest of the negative numbers to $-\infty$. We’ll call that case an overflow.)
Note the very useful fact that multiplying or dividing by powers of the base, as long as no overflow or underflow happens, introduces no error.
Another common problem is catastrophic cancellation: when numbers similar in size are subtracted (or, equivalently, a number is added to a negative number close in magnitude), the leading digits can cancel out, leaving far less precision in the answer. This is to be avoided when possible.
The algorithms below rely (somewhat, they can be carefully rederived for other formats, which are rare) on having IEEE 754 support, which is extremely common. For more info on floating-point, see my notes at https://lomont.org/posts/2022/lomont-floating-point-notes/
The Quadratic Equation
The quadratic formula from high school algebra finds the values of $x$ that solve the quadratic equation $ax^2+bx+c=0$. The formula is
$$x=\frac{-b \pm \sqrt{b^2-4ac}}{2a}$$
The expression $D=b^2-4ac$ is called the discriminant; if $D\geq 0$ then the equation has two real roots; if $D<0$ then the equation has two complex roots, necessarily conjugates of each other (that means the roots are of the form $r_1+i r_2$ and $r_1-ir_2$ for two real numbers $r_1$ and $r_2$). Correctly determining the sign of $D$ is very important. In the case $D<0$, you use $\sqrt{-D}=\sqrt{|D|}$ to get the value as a floating point number. So once you have the sign of $D$, you can simply compute $\sqrt{|D|}$ in all cases.
A few more items: given real roots $r_1,r_2$, the equation is $ax^2+bx+c=a(x-r_1)(x-r_2)=ax^2 + (-a)(r_1+r_2)x+a r_1 r_2$. Setting coefficients equal gives $a,b,c$ in terms of the roots.
$$\begin{aligned} a&=a \\ b&=-a(r_1+r_2) \\ c&=a r_1 r_2 \end{aligned}$$
This last item is useful for computing $r_2$ given $r_1$, as $r_2=c/(a\times r_1)$.
The complex roots when $D<0$ are given by $\frac{-b\pm i \times \sqrt{-D}}{2a} = -\frac{b}{2a} \pm i\times \frac{\sqrt{|D|}}{2a}$.
Improvement plan
To make a better algorithm, I first list places errors can happen, estimate what error rate a good algorithm should provide, decide on a testing method, and list some directions I avoided and why.
Error locations
The quadratic formula as written above can fail in the following ways when implemented using floating-point numbers:
- $b^2$ can overflow or underflow
- $ac$ can overflow or underflow
- $4(ac)$ can overflow
- $b^2-4ac$ can overflow, underflow, or suffer from catastrophic cancellation
- The sign of the discriminant $D=b^2-4ac$ can be computed incorrectly, confusing real and complex roots
- $-b\pm\sqrt{D}$ can suffer from underflow, overflow, or catastrophic cancellation.
- $2a$ can overflow
- the final division of $-b\pm\sqrt{D}$ by $2a$ can overflow or underflow.
So there are a lot of places this can go wrong.
Expected error
A rough estimate on error bounds for an algorithm is the $ulp(x)$ error in the answer. $ulp<1$ means at most the last bit of the value is incorrect; $ulp=5$ means the error is 5 units in the last place.
- $b^2$ adds 1
- $ac$ adds 1
- $4(ac)$ is free since powers of two have no error
- $b^2-4ac$ adds 1
- $\sqrt{b^2-4ac}$ adds 1
- $\pm$ is free, sign changes have no error
- $-b\pm\sqrt{|D|}$ adds 1
- $2a$ is free
- $\frac{-b\pm\sqrt{|D|}}{2a}$ adds 1
This gives roughly 6 ulp as the expected worst-case error for a good algorithm.
Testing method
To make a better algorithm, I needed a way to measure improvement. To do this, I compute how far an output is from the true result as a multiple of an error in the last bit.
Getting ground truth can be difficult, often requiring high precision libraries, but for this project a much simpler way exists: computing the answer using a decent algorithm in `binary64` avoids any of the overflow and underflow that `binary32` hits. So I will develop the algorithm using `binary32`, then later port the idea to whatever IEEE types I want.
I’ll also use C# instead of C/C++, since it’s vastly quicker to develop test harnesses and the many other pieces needed to evaluate the algorithm. Once everything works, I can run the C/C++ versions against the C# versions and C# harness for testing. Plus C# supports binary16 as the `Half` type, so that can be tested in the same infrastructure.
The most important piece for testing is generating lots of test cases. I made a `binary32` generator with the following properties (a sketch of the generation idea follows the list):
- Can generate on any range of exponents: this allows testing naive algorithms while avoiding overflows.
- Can generate edge cases: for each exponent, it can generate bitwise edge cases in the representation.
- Can add zeros, subnormals, infinities, and NaNs as desired to test robustness and error handling.
- Can add arbitrary counts of random floats, where random is uniform on exponent, then uniform on fraction.
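Here is a sketch of the core generation idea, uniform on the exponent and then uniform on the fraction (illustrative only; the real generator also injects the edge cases, zeros, subnormals, infinities, and NaNs listed above):

```csharp
using System;

// Generate a random binary32 value with exponent uniform in [minExp, maxExp]
// and a uniformly random 23-bit fraction (sketch).
static float RandomFloat(Random rng, int minExp, int maxExp)
{
    int exp = rng.Next(minExp, maxExp + 1);     // uniform over the exponent range
    int frac = rng.Next(0, 1 << 23);            // uniform 23-bit fraction
    int bits = ((exp + 127) << 23) | frac;      // assemble biased exponent and fraction
    float value = BitConverter.Int32BitsToSingle(bits);
    return rng.Next(2) == 0 ? value : -value;   // random sign
}
```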
Things I avoided
I avoided multi-precision libraries since they’re slow. Iteration and quad-double and double-double based solutions are also slow, and still suffer from overflow issues.
Some derivations divide by $a$ first to simplify code and analysis, but this adds an extra error case (for overflow or underflow), so I avoid that route.
Another nice idea, which can be made to work here, is to replace $b$ with $b/2$ as soon as it comes in, effectively solving the equation $ax^2+2bx+c=0$, which has the nicer (and fewer-operation) solution $x=\frac{-b\pm\sqrt{b^2-ac}}{a}$. This works in base 2 (or a multiple of 2, like 10) since $b/2$ doesn’t lose any digits, except in the case where $b$ is denormalized. In that case, you can use the replacement $(a,c)\leftarrow(2a,2c)$, but now you also need to check that this doesn’t make $a$ or $c$ overflow. All this testing, which ultimately still fails in rare cases, I found not to be worth using in the final algorithm.
On to some algorithms and testing:
Algorithm 1: Naïve implementation
For the interface to each algorithm, I chose taking three floating point values `a, b, c` and returning the triple `(float r1, float r2, rootType)`, where the root type denotes real or complex roots (and other issues, explained later). The roots are either the real numbers `r1` and `r2`, or the complex numbers `r1 + i*r2` and `r1 - i*r2`.
To start, and to illustrate how badly the naïve high school formula performs, here is that implementation as a baseline.
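The exact baseline listing is in the repository; a minimal sketch of such a naïve solver looks like the following (using a bool in place of the root type enumeration for brevity):

```csharp
// Naive textbook quadratic solver (sketch). Returns real roots (r1, r2) when
// D >= 0, otherwise the complex conjugate pair r1 +/- i*r2.
static (float r1, float r2, bool realRoots) NaiveQuadratic(float a, float b, float c)
{
    float d = b * b - 4 * a * c;        // discriminant, computed directly
    if (d >= 0)
    {
        float s = MathF.Sqrt(d);
        return ((-b + s) / (2 * a), (-b - s) / (2 * a), true);
    }
    float si = MathF.Sqrt(-d);
    return (-b / (2 * a), si / (2 * a), false);   // complex: r1 +/- i*r2
}
```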
Some hand-picked test values illustrate the error behavior.
In what follows, error values are given in ulps as `u:v1 v2`, where `v1` is the max root error and `v2` is the average error.
The first example was picked to have nice roots -0.5 and -10, and the naïve code found them with zero error in last-place units (the `u:0.00 0.00` parts). Changing the 10.5 coefficient to 11 increases the error to `u:6.07 3.13`, which is already pretty bad, especially for such a small change in coefficients. The third example has a terrible error of 10418 ulps.
Walking the code for the last case shows the first problem: catastrophic cancellation in the term $-b+\sqrt{D}$. Here $b$ is 400.000 and $r=\sqrt{D}$ is 399.995; subtracting them loses a lot of significant digits via cancellation. We want to avoid this problem, so we must avoid subtracting similar numbers.
Fix 1
The first fix is the one almost all code on the internet performs: rewrite the naïve formula to avoid this cancellation. This can be done by multiplying the positive root expression by its conjugate
$$r_1=\frac{-b + \sqrt{D}}{2a}=\frac{-b + \sqrt{D}}{2a}\times\frac{-b -\sqrt{D}}{-b -\sqrt{D}}=\frac{b^2 - D}{2a(-b-\sqrt{D})}$$
This has changed the addition $-b+\sqrt{D}$ to a subtraction $-b-\sqrt{D}$ for this case, a common trick for removing a catastrophic cancellation. It’s not always possible to rewrite an equation to switch signs though.
In practice, I use a different rewrite to clean this up. The problem arises from combining $-b$ and $\pm\sqrt{D}$. So we can choose the sign of the root that avoids cancellation, compute that root (call it $r_1$), and use a different expression for the other root: $r_2 = \frac{c}{a\times r_1}$. The first root is computed from the numerator $q=-b-\operatorname{sgn}(b)\sqrt{D}$ as $r_1=q/(2a)$, where $\operatorname{sgn}$ is the sign function: $-1$ for $b<0$, $0$ for $b=0$, and $+1$ for $b>0$. This way the combined numbers have the same sign.
Algorithm 2: Remove a cancellation
This just takes a little change.
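A sketch of the change (illustrative; the repository has the exact version):

```csharp
// Algorithm 2 sketch: combine -b with the same-signed term to avoid cancellation,
// then recover the other root from r1*r2 = c/a.
static (float r1, float r2, bool realRoots) Quadratic2(float a, float b, float c)
{
    float d = b * b - 4 * a * c;
    if (d >= 0)
    {
        float q = -b - MathF.CopySign(MathF.Sqrt(d), b); // -b - sgn(b)*sqrt(D)
        float r1 = q / (2 * a);
        float r2 = c / (a * r1);                         // from r1*r2 = c/a
        return (r1, r2, true);
    }
    return (-b / (2 * a), MathF.Sqrt(-d) / (2 * a), false);
}
```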
Running the examples above, this change reduces the error for these values by a lot, to very respectable numbers.
Poking around some more, I find inputs with lots of error again. Walking through the code, the culprit is catastrophic cancellation in the discriminant $D=b^2-4ac$. Here there are many, many ways to proceed.
Interlude
Special Cases
At this point, to make all the solutions more robust, we should handle cases where any coefficient is 0, infinity, or NaN, so we avoid dividing by 0 when possible and don’t return invalid answers. So we make a function, called at the start of each root algorithm, that handles these special cases. This lets the rest of this article focus on the case $a,b,c \neq 0$.
We also extend the root types to handle these cases, making the overall algorithms robust against common errors and rare cases.
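Here is an illustrative sketch of both pieces; the enumeration values and the helper name are placeholders of mine, and the exact version is in the repository:

```csharp
// Possible root classifications (names illustrative).
enum RootType { RealRoots, ComplexRoots, OneRoot, AllRealNumbers, NoRoots, Error }

// Handle zero, infinite, and NaN coefficients up front. Returns true when the
// answer is already final, letting the main algorithms assume a, b, c != 0.
static bool HandleSpecialCases(float a, float b, float c,
    out float r1, out float r2, out RootType type)
{
    r1 = r2 = 0; type = RootType.Error;
    if (float.IsNaN(a) || float.IsNaN(b) || float.IsNaN(c) ||
        float.IsInfinity(a) || float.IsInfinity(b) || float.IsInfinity(c))
        return true;                                   // invalid input
    if (a == 0 && b == 0)
    {   // no x terms: every x works if c == 0, otherwise none do
        type = c == 0 ? RootType.AllRealNumbers : RootType.NoRoots;
        return true;
    }
    if (a == 0) { r1 = r2 = -c / b; type = RootType.OneRoot; return true; }       // bx + c = 0
    if (c == 0) { r1 = 0; r2 = -b / a; type = RootType.RealRoots; return true; }  // x(ax + b) = 0
    if (b == 0)
    {   // ax^2 + c = 0, so x^2 = -c/a
        float s = MathF.Sqrt(MathF.Abs(c / a));
        if (c / a <= 0) { r1 = s; r2 = -s; type = RootType.RealRoots; }
        else { r1 = 0; r2 = s; type = RootType.ComplexRoots; }    // roots 0 +/- i*s
        return true;
    }
    return false;                                      // general case: a, b, c all nonzero
}
```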
Another change we make here is to use special functions provided by the libraries.
Special functions
Often libraries provide special functions, such as C++ `expm1`, which computes $e^x-1$. These special functions are generally much more accurate for some values than simply computing `pow(e,x)-1`, so we will leverage platform specific functions that improve accuracy where needed.
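The functions the later code leans on map to C# roughly as follows (the mapping is my assumption; `Scale2`, the exact multiply-by-$2^n$ helper, appears throughout the scaling section below):

```csharp
// Plausible C# wrappers for the special functions used later (assumed mapping).
static float Scale2(float x, int n) => MathF.ScaleB(x, n);        // x * 2^n, exact
static int Exponent(float x) => MathF.ILogB(x);                   // unbiased exponent of x
static float Fma(float x, float y, float z)
    => MathF.FusedMultiplyAdd(x, y, z);                           // x*y + z with one rounding
```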
Fix 2 - A better discriminant
Computing the discriminant of the quadratic correctly is a huge topic ([Kahan04] for example). Many places factor it as $(|b|+2\sqrt{|ac|})(|b|-2\sqrt{|ac|})$ and proceed, but this still has a cancellation issue, and adds lots more operations in the main path, increasing overall error.
The best fixes here are to use more precision, which is slower, but only when needed.
Quad-double and Double-double
One way to add precision is to represent each floating-point value $v$ with multiple other values $v=v_1+v_2+v_3+v_4$, where the $v_i$ are chosen to give more bits of precision. This is done by having them have different exponents, covering a wider range. See [Knuth14, Priest91, Bailey08] for details.
These do not completely solve all the problems, since they do not allow larger or smaller exponents and thus do not prevent overflow or underflow. They are also much slower than plain floating point math. But for some problems, especially if you need more precision, these are excellent techniques.
Back to Fix 2
Another process useful in many algorithms is to compute the value, then compute the error, and track both, adding the accumulated error back in at strategic places. This is the fix here.
IEEE 754 recommends a `FusedMultiplyAdd(x,y,z)` function (called an FMA) that computes $\circ(x\times y+z)$ in floating point with one rounding at the end, instead of doing it in steps as $\circ(\circ(x\times y)+z)$, which has two roundings. The added precision of an FMA operation improves a huge array of floating point algorithms, so it’s standard in many languages.
If you don’t have an FMA, you can make one via Dekker’s algorithm, requiring 17 floating point operations, as covered in [Muller18, Section 4.4.2].
As an interesting aside, using FMAs indiscriminately can also lead to errors: [Higham02, section 2.6] quotes Kahan that “[FMAs] should not be used indiscriminately”, and gives an example where the discriminant can incorrectly switch signs when computed carelessly with an FMA.
For our use, [Kahan04, Boldo09, Jannerod13] detail a very useful function `Det2x2(w,x,y,z)` that computes the determinant $wx-yz$ of a 2x2 matrix, which can be used to compute a higher accuracy discriminant.
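The listing follows Kahan's published algorithm; here is a sketch, with comments describing each step:

```csharp
// Kahan's accurate 2x2 determinant: computes w*x - y*z using an FMA to recover
// the exact rounding error of the product y*z, then folding that error back in.
static float Det2x2(float w, float x, float y, float z)
{
    float p = y * z;                              // rounded product RN(y*z)
    float e = MathF.FusedMultiplyAdd(y, z, -p);   // e = y*z - p, exactly (the rounding error)
    float f = MathF.FusedMultiplyAdd(w, x, -p);   // f = RN(w*x - p), a single rounding
    return f - e;                                 // (w*x - p) - (y*z - p) = w*x - y*z
}
```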
It works using a very common and useful idea of computing a value, then determining the error, doing some work, and adding the error at the end. For lots of tricks like this and more examples, read anything by Kahan.
See the comments for how it works. It’s worth working through this many times until you internalize it so you can use the ideas on your own.
So we’ll simply replace the simplistic `d = b*b - 4*a*c` computation with a call to `Det2x2(b, b, 4*a, c)`, replacing `4*a` with the possibly better `Scale2(a, 2)`.
Algorithm 3: Better discriminant
Here is the next level of fixes: special case handling and a better discriminant.
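Structurally this is just Algorithm 2 with the special case filter in front and the discriminant line swapped out; a sketch, reusing the helpers sketched above:

```csharp
// Algorithm 3 sketch: special-case filtering plus the accurate discriminant.
static (float r1, float r2, RootType type) Quadratic3(float a, float b, float c)
{
    if (HandleSpecialCases(a, b, c, out float r1, out float r2, out RootType type))
        return (r1, r2, type);

    float d = Det2x2(b, b, Scale2(a, 2), c);      // b*b - (4a)*c, computed accurately
    if (d >= 0)
    {
        float q = -b - MathF.CopySign(MathF.Sqrt(d), b);
        r1 = q / (2 * a);
        return (r1, c / (a * r1), RootType.RealRoots);
    }
    return (-b / (2 * a), MathF.Sqrt(-d) / (2 * a), RootType.ComplexRoots);
}
```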
Running the earlier failing cases again, the errors drop back down to respectable values. Nice!
At this point, we can show how these algorithms (adding the special case handling back into the earlier ones) behave over several sets of floats (this is the table from the start):
1,000,000 passes each | small range (2^-32, 2^32) | large range (2^-70,2^70) | huge range (2^-126,2^126) | ||||||
---|---|---|---|---|---|---|---|---|---|
Function | max ulp | avg ulp | fail % | max ulp | avg ulp | fail % | max ulp | avg ulp | fail % |
Naïve | 4.10E+12 | 1.50E+07 | 0% | 3.10E+23 | -1.90E+17 | 4.10% | 2.60E+37 | -7.20E+30 | 34% |
Algorithm 2 | 3.60E+04 | 0.38 | 0% | 1500 | 0.34 | 3.90% | 2.00E+22 | 1.80E+17 | 25% |
Algorithm 3 | 3.20 | 0.36 | 0% | 1500 | 0.34 | 3.50% | 2.00E+22 | 1.80E+17 | 13% |
Algorithm 4 (this one) | 3.20 | 0.36 | 0% | 3.00 | 0.33 | 0% | 3.20 | 0.31 | 0% |
So for the small (no overflow) sizes, Algorithm 3 is good, with no failures over 1,000,000 runs and good error bounds. However, running the large set of floats, which contains exponents up to $2^{70}$ where overflows can happen, shows larger errors and some failing cases (getting real/complex roots mixed up, or returning infinities when it shouldn’t).
Again, tracking down some failing examples and inspecting them shows the pattern: even though the roots computed with `binary64` are well within the bounds of the format, the algorithm fails to find them. It also thinks the roots are complex numbers when they should be real. Clearly the discriminant computation failed.
Now we must face overflow (and underflow).
Fix 3 - Scaling and bounds
Picking failed cases and walking through the code, they are all cases of overflow, mostly in the computation of the discriminant.
Note that you don’t really need to compute the discriminant itself; what you really need are two things:
- The sign of the discriminant, more precisely, whether it is nonnegative.
- The root $\sqrt{|D|}$.
Scaling
To deal with overflow or underflow, calculations that push exponents outside of some safe exponent range $[e_{min},e_{max}]$ must be avoided. This is generally handled via scaling: scale the values in use into a safe range, do the computation, then scale the results back correctly. Scaling by a power of the base introduces no error, which makes it extremely useful.
The idea is to treat a floating point value $b$ as a tuple $(s,e,b_f)$ where $s$ is a sign, $e$ is a power of the base $\beta=2$, and $b_f$ is $b$ scaled by powers of the base to keep the exponent of $b_f$ in a safe range. By removing the exponent from the floating point number, you can put it in an integer that has far more range.
As a simple example, suppose no exponent can go over 7 in the floating point hardware, and for $b=2^4$ you want to compute $\sqrt{b^2-192}$. The answer $8=2^3$ fits in this hardware, but $b^2$ overflows. So (dropping the sign part of the tuple), $192=2^6 \times 3 = (6,3)$, and $b=2^4\times 1= (4,1)$. Then $b^2$ would be $(4+4,1^2)=(8,1)$ where the squaring can be done in the floating point hardware, and the exponents are tracked. 192 cannot be subtracted from this, since they have different exponents, so one of these is scaled to give
$$b^2-192=(8,1)-(6,3)=(6,2^2\times 1)-(6,3)=(6,4)-(6,3)=(6,4-3)=(6,1)$$
Then taking the square root gives (if the exponent were odd, a power of the base would first be moved back into the fraction part) $\sqrt{(6,1)}=(6/2, \sqrt{1})=(3,1)=2^3\times 1=8$, the correct answer. At no point did the hardware floating point unit overflow.
And that’s the main idea. All we need is a routine to extract from a floating point value $b$ the tuple $(s,e,b_f)$. Converting the tuple back is easy: $b=s\times2^e\times b_f$, or, in code, `sign*Scale2(bf,e)` using our `Scale2` function.
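A sketch of such a split/recombine pair, using the `Scale2`-style library functions from earlier (zeros, infinities, and NaNs are assumed to be already filtered out by the special case handling):

```csharp
// Split a nonzero finite float into (sign, exponent, fraction in [1,2)); scaling by
// powers of two is exact, so no error is introduced.
static (int sign, int exp, float frac) Normalize(float x)
{
    int sign = x < 0 ? -1 : 1;
    int exp = MathF.ILogB(x);                       // unbiased exponent, subnormals included
    float frac = MathF.ScaleB(MathF.Abs(x), -exp);  // now in [1,2)
    return (sign, exp, frac);
}

// Put the tuple back together: sign * 2^exp * frac.
static float Recombine(int sign, int exp, float frac) => sign * MathF.ScaleB(frac, exp);
```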
Details
The details are tricky. For the first part, let’s analyze the discriminant, $D=b^2-4ac$.
Discriminant
From special case handling, $a,b,c \neq 0$. Also we can treat $b>0$ since it’s squared.
First, normalize all three of these, which keeps $b_f^2$ from overflowing or underflowing, and the same for $a_f c_f$. Note that $a_f, b_f, c_f$ are in the half-open interval $[1,2)$.
Case 1: b^2 » |4ac|
Now, if $b^2$ is much larger than $|4ac|$, then $D\geq 0$ and the square root, as a floating point number, is $|b|$, so the root can be returned as $(1,b_e,b_f)$.
The sufficient condition is $2b_e > a_e+c_e+p+5$, in which case we don’t need to compute roots: merely use `(root, rootE, nonnegative) = (bF, bS, true)`.
The proofs section has a detailed proof.
Case 2: b^2 « |4ac|
Similarly, if $|ac|$ is large compared to $b^2$, then the root is approximated well enough by $2\sqrt{|ac|}$, and the sign of $D$ is the sign of $-ac$. Similar analysis derives the condition for this to be true: $2b_e<a_e+c_e-p-1$.
To compute the root robustly:
- Let $m=a_f c_f$.
- Let $e=a_e+c_e$.
- If $e$ is odd then $(e,m) \leftarrow (e-1,2m)$.
- The root and scale is $(e/2,\sqrt{m})$.
This handles this case.
Case 3: b^2 ~ |4ac|
The last discriminant case is all the others. In this case we’ll leverage the `Det2x2` trick for a more accurate discriminant computation. To avoid overflow and underflow as much as possible, the goal will be to find an exponent scaling $w_e$ to apply to each of $a,b,c$ such that the result does not overflow or underflow.
This would be much harder, except we already bounded the exponents by the above. We have
$$-p-1 \leq 2b_e-(a_e+c_e)\leq p+5$$
Suppose $b_e$ is large, so computing $b^2$ will overflow. Then we want to shift it lower. But if one of $a$ or $c$ is too small, this may scale them to underflow (zero). However, the condition for being in this case ensures that $a_e+c_e$ is near $2b_e$, which means that for one of $a,c$ to be small, the other has to be big.
Since we only really care about the product $ac$, the trick is to move some powers of $2$ between $a$ and $c$ so their exponents are as close as possible, then to shift the new $a,c$ and $b$ by the midpoint of $2b_e$ and $a_e+c_e$. The result of all this is to move $a,b,c$ as close to 1.0 as possible while still retaining the information needed for the root. And, while we’re scaling things, we can add 1 to the $a,c$ exponents to handle the 4 in $4ac$. This leads to the following.
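Here is a simplified sketch of that balancing and scaling; it reuses the `Normalize` and `Det2x2` sketches above and skips some corner cases the repository code handles:

```csharp
// Middle case b^2 ~ |4ac|: balance and scale so neither b^2 nor 4ac can overflow
// or underflow, then compute the scaled discriminant accurately.
// Inputs are the normalized pieces of a, b, c; returns (root, rootExponent, nonnegative)
// with sqrt(|D|) = 2^rootExponent * root.
static (float root, int rootExponent, bool nonnegative) MiddleCase(
    int aS, int aE, float aF, int bE, float bF, int cS, int cE, float cF)
{
    aE += 1; cE += 1;                    // fold the 4 of 4ac into the exponents (4 = 2*2)

    int acE = aE + cE;                   // exponent of |4ac| (up to the fractions)
    int w = (2 * bE + acE) / 4;          // shared scale: half the midpoint of 2*bE and acE
    int d = (cE - aE) / 2;               // move powers of two from c to a to balance them

    // rebuild the scaled coefficients; scaling by powers of two is exact
    float sa = MathF.ScaleB(aF, aE - w + d);
    float sc = MathF.ScaleB(cF, cE - w - d);
    float sb = MathF.ScaleB(bF, bE - w);

    // scaled discriminant D / 2^(2w) = (b/2^w)^2 - (4ac)/2^(2w), via the accurate Det2x2
    float dScaled = Det2x2(sb, sb, aS * cS * sa, sc);

    return (MathF.Sqrt(MathF.Abs(dScaled)), w, dScaled >= 0);
}
```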
And now we have a robust discriminant info routine that returns `(float root, int rootExponent, bool nonnegative)`.
Note: this last case still has a flaw which I don’t correct here: the discriminant can still get the wrong sign, but it is really rare. See the proof section for a good start on when this can occur. The solution for that rare case would likely be to drop to double-double computation.
Roots
This next part turns this info into roots. There is room to make this slightly more robust, explained at the end, but in practice it is fast and I hit no mistakes for exponents under $2^{126}$. The max float exponent $2^{127}$ hits some errors; if needed, I will fix these cases later. It does catch one case where $-b+\sqrt{D}$ may overflow, and divides first.
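A sketch of the final assembly, common path only (the repository version adds guards for the overflow cases mentioned above and returns the full root type):

```csharp
// Turn the discriminant info (sqrt(|D|) = 2^rootE * rootF, and its sign) into roots.
static (float r1, float r2, bool realRoots) RootsFromDiscriminant(
    float a, float b, float c, float rootF, int rootE, bool nonnegative)
{
    float sqrtD = MathF.ScaleB(rootF, rootE);        // sqrt(|D|); exact power-of-two scale

    if (!nonnegative)                                // complex conjugate pair r1 +/- i*r2
        return (-b / (2 * a), sqrtD / (2 * a), false);

    // real roots: combine -b with the same-signed term to avoid cancellation
    float q = -b - MathF.CopySign(sqrtD, b);         // q = -b - sgn(b)*sqrt(D)
    float r1 = q / (2 * a);
    float r2 = c / (a * r1);                         // from r1*r2 = c/a
    return (r1, r2, true);
}
```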
Algorithm 4: Scaling
See the most up to date code at https://github.com/ChrisLomont/BetterQuadraticRoots, which has float and double versions for C/C++ and C#.
A complete C# listing for `float` is in the repository (and may be more up to date than this article); the routine is only around 300 lines of C# or C++ code.
Final Results
For completeness, I repeat the table again.
1,000,000 passes each | small range (2^-32, 2^32) | large range (2^-70,2^70) | huge range (2^-126,2^126) | ||||||
---|---|---|---|---|---|---|---|---|---|
Function | max ulp | avg ulp | fail % | max ulp | avg ulp | fail % | max ulp | avg ulp | fail % |
Naïve | 4.10E+12 | 1.50E+07 | 0% | 3.10E+23 | -1.90E+17 | 4.10% | 2.60E+37 | -7.20E+30 | 34% |
Algorithm 2 | 3.60E+04 | 0.38 | 0% | 1500 | 0.34 | 3.90% | 2.00E+22 | 1.80E+17 | 25% |
Algorithm 3 | 3.20 | 0.36 | 0% | 1500 | 0.34 | 3.50% | 2.00E+22 | 1.80E+17 | 13% |
Algorithm 4 (this one) | 3.20 | 0.36 | 0% | 3.00 | 0.33 | 0% | 3.20 | 0.31 | 0% |
The final algorithm had no failures over a million runs of a huge range of inputs, and had really good average and max error.
Benchmarks
Using BenchmarkDotNet shows the new algorithm is around 4x slower than the naïve one (roughly 18 ns vs 71 ns), so the new algorithm is plenty fast for all but the most high-performance needs.
Wrap up
The most accurate solution with the best performance would be to apply scaling to handle overflow and underflow, use double-doubles as needed, and handle every edge case that still fails (of which there are some, but only with extreme values) by rewriting the appropriate expressions for each case and computing as much as possible.
See the most recent code at https://github.com/ChrisLomont/BetterQuadraticRoots.
Future work
Here are some notes to track things I should fix or change given another pass someday:
- Fix up the last possible error cases
- Make solid proofs of all performance and claims
- More extensive testing, especially the `binary16` and `binary64` cases.
- Make all code generic: use one code path for different types, for both C/C++ and C#
- Optimize, expecting 2x performance improvements via clipping slower paths
- More extensive benchmarking
- Implement the homogeneous roots and stable-for-animation behavior of [Blinn05, Blinn06]
- Add some comments from [Forsythe67]; make sure I hit his criteria
- List where each error source in the failure list above is addressed
- Add iteration methods to the testing for speed and accuracy
- Test against common other libs and platforms
Code Listings
For the complete C# `float` listing, see https://github.com/ChrisLomont/BetterQuadraticRoots.
Proofs
Here are some proof sketches for things claimed to show how to develop such algorithms.
D1: Proof for b^2 » 4ac
Now, if $b^2$ is much larger than $|4ac|$, then $D\geq 0$ and the square root, as a floating point number, is $|b|$, so the root can be returned as $(1,b_e,b_f)$.
How do we check that it is large enough? We need bounded relative error: $\left|\frac{\sqrt{b^2-4ac}-b}{b}\right|<\epsilon$. Note that floating point rounding modes affect some bounds when analyzing. Use the Taylor series $\sqrt{1-x}=1-\frac{x}{2}-\frac{x^2}{8}-\frac{x^3}{16}-\frac{5x^4}{128}+O(x^5)$, which converges for $|x|<1$, and let $x=\frac{4ac}{b^2}$.
$$\begin{aligned}\frac{\sqrt{b^2-4ac}-b}{b}&=\frac{b(1-\frac{4ac}{b^2})^{1/2}-b}{b}\\ &=-\frac{x}{2}-\frac{x^2}{8}-\frac{x^3}{16}+O(x^4)\\ &= T \end{aligned}$$
(defining $T$). Then we can bound $T$ with $x$ as $-|x|\leq T \leq |x|$.
$$\begin{aligned}|x|&=\frac{4\times 2^{a_e+c_e}a_f c_f}{2^{2b_e}b_f^2}\\ &=2^{a_e+c_e-2b_e+2} \times \frac{a_f c_f}{b_f^2}\\ &=2^Q\times P \end{aligned}$$
Since $a_f,b_f,c_f\in[1,2)$, $P\leq \frac{2\times 2}{1^2}=4$. So it’s enough to have
$$-\epsilon<-2^Q\times 4 \leq -2^Q\times P=-|x|\leq T \leq |x|=2^Q\times P\leq 2^Q\times 4<\epsilon$$
Taking floating point precision $p$ and $\epsilon=2^{-(p+1)}$, this reduces to
$$\begin{aligned} \epsilon & > 2^Q\times 4\\ 2^{-(p+1)} &> 2^{a_e+c_e-2b_e+2+2}\\ 2b_e &> a_e+c_e+p+5 \end{aligned}$$
So, if this last condition is true, we don’t need to compute roots: merely use `(root, rootE, nonnegative) = (bF, bS, true)`.
D2: Proof for b^2 « 4ac
The same technique as in proof D1 works.
D3: Proof for b^2~4ac
TODO
Sign of the Discriminant
Initial proof ideas for how to check the discriminant sign in detail.
Computing the sign of the discriminant $D=b^2-4ac$ completely correctly is difficult. Here is an analysis.
Write $R(x)$ for real number $x$ rounded to floating point, and $R(x \circ y)$ for a rounded result of some operation $\circ$.
For now, assume no overflow (use scaling if needed)
Use the $b\rightarrow b/2$ trick so the discriminant is $D=b^2-ac$, so we don’t have to fiddle with the $4$.
Also assume $ac\geq 0$, which can be checked exactly from their signs, so we can also assume $a,c\geq 0$.
Checking exponents of $b^2$ and $ac$ removes many cases. So the real issue is when they are close.
Then, to prove that $b^2\geq ac$ using floating point, where $a,b,c$ are floating point values, it looks like the check should be $R(b\times b)\geq R(a\times c)$, but this can fail due to rounding issues. (TODO - example).
Assume there is a floating point number $k_1$ such that if $R(b^2)\geq R(R(a\times c)\times k_1)$ then we can be sure that $b^2\geq ac$. I’ll determine such a $k_1$. Note the first expression can be computed exactly with floating point, tested, and proves what we need.
Idea:
$$\begin{aligned} R(b\times b)&\geq R(R(a\times c)\times k_1)\\ b^2(1+\delta_1)&\geq ac(1+\delta_2)k_1(1+\delta_3)\\ b^2&\geq ac \frac{(1+\delta_2)(1+\delta_3)}{(1+\delta_1)}k_1\\ b^2&\geq ac\, \gamma_1 \end{aligned}$$
where I replaced the $\delta_i$ and $k_1$ expression with $\gamma_1$
Then, if $\gamma_1\geq 1$, the above would give $b^2\geq ac \gamma_1\geq ac$ as desired. $\gamma_1\geq 1$ means
$$\begin{aligned} \frac{(1+\delta_2)(1+\delta_3)}{(1+\delta_1)}k_1&\geq 1\\ k_1&\geq \frac{(1+\delta_1)}{(1+\delta_2)(1+\delta_3)} \end{aligned}$$
Since $\frac{1+\epsilon}{(1-\epsilon)^2} \geq \frac{(1+\delta_1)}{(1+\delta_2)(1+\delta_3)}$, any $k_1\geq \frac{1+\epsilon}{(1-\epsilon)^2}$ suffices. Taylor expand
$$\frac{1+\epsilon}{(1-\epsilon)^2} =1+3\epsilon + 5\epsilon^2+7\epsilon^3+O(\epsilon^4)$$
Since $\epsilon$ is small, $1+4\epsilon$ is larger than this Taylor expansion, so setting $k_1=1+4\epsilon$ is sufficient. Note this is also representable as a floating point number, so it’s perfect. This gives a small constant $k_1$. TODO - more, list the $k_1$.
Doing similar work, you can derive that for the representable $k_2=1-3\epsilon$ (and the 3 is correct, not 4) computing as floats $R(b\times b) < R(R(a\times c)\times k_2)$ means that $b^2<ac$.
Now you have a bound, outside of which you know the sign of the discriminant $D$ with certainty.
To compute inside that bound, use scaling and the double-double trick (todo - check in detail, post here?)
References
[Bailey08] Bailey et al., “Library for Double-Double and Quad-Double Arithmetic,” 2008.
[Blinn05] James Blinn, “How to Solve a Quadratic Equation,” 2005.
[Blinn06] James Blinn, “How to Solve a Quadratic Equation, Part II,” 2006.
[Boldo09] Sylvie Boldo, “Kahan’s Algorithm for a Correct Discriminant Computation at Last Formally Proven,” 2009.
[Forsythe67] George E. Forsythe, “What Is a Satisfactory Quadratic Equation Solver?,” TR CS 74, 1967.
[Jannerod13] Claude-Pierre Jeannerod et al., “Further Analysis of Kahan’s Algorithm for the Accurate Computation of 2x2 Determinants,” 2013.
[Higham02] Nicholas Higham, “Accuracy and Stability of Numerical Algorithms,” 2002.
[Kahan04] William Kahan, “On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic,” 2004.
[Knuth14] Donald Knuth, “Seminumerical Algorithms,” Vol. 2, 3rd ed., 2014.
[Muller18] Jean-Michel Muller et al., “Handbook of Floating-Point Arithmetic,” 2nd ed., Birkhäuser, 2018.
[Priest91] Douglas Priest, “Algorithms for Arbitrary Precision Floating Point Arithmetic,” 1991.