Lomont floating point notes
Published
Notes on floating point issues, including lists of facts and theorems useful for algorithm development.
Floating-point notes
Chris Lomont
Document history
v0.1 Feb 2021: initial layout, notes
v0.2 Aug 2022: significant extensions and cleanup
Introduction
These are notes on many aspects of floating point numbers in computing. This is very messy, and needs lots of cleaning. Instead of continuing to not post this, I decided to post and edit, clean up, and refine over time. Good luck.
The level is aimed at skilled programmers (TODO) who want to write good floating point code. It has more details than most places, giving tools to develop your own algorithms and to start learning how to do this well.
It collects a significant number of facts useful for understanding, deriving, and developing floating point algorithms.
TODO  quick contents
TODO  image here
TODO: If you're already decently knowledgeable about floating-point, jump to the section titled "The Facts!" for a quick summary of IEEE 754 rules and subsequent theorems, as well as many examples of how to use floating point well in common situations.
Floating point numbers
Floating-point formats define finite precision approximations to the infinite precision real numbers.
An arbitrary real number $r$ can be written in base 10 in the form $r=\pm 10^N \times d_0.d_1d_2…$ where $N$ is an integer and the $d_i$ are digits in $\{0,1,…,9\}$. Note the decimal point between $d_0$ and $d_1$. If we require $d_0 \neq 0$ whenever $r\neq 0$, then the exponent $N$ is uniquely determined by $r$. (To keep uniqueness, we exclude representations with an infinite trailing string of 9's; these are equivalent to incrementing the digit before the start of the string and replacing those trailing 9's with 0's.)
Similarly, and to make things more suitable for computers, represent $r$ in base 2: an arbitrary real number $r$ can be written in base 2 in the form $r=\pm 2^N \times b_0.b_1b_2…$ where $N$ is an integer and the $b_i$ are bits (binary digits) in $\{0,1\}$. If we require $b_0 \neq 0$ (i.e., $b_0=1$) whenever $r\neq 0$, then $N$ is determined by $r$. (To keep uniqueness, we exclude representations with an infinite trailing string of 1's; these are equivalent to incrementing the digit before the start of the string and replacing those trailing 1's with 0's.)
Following convention, instead of integral $N$ for exponents, we will use $e$ for the integral exponent. This is not the natural base $e=2.7182818…$.
To approximate real numbers for computer calculations, it is convenient to fit an approximation $\hat{r}$ of a real number $r$ into a fixed number of bits. This forces using only finitely many bits to store the exponent $e$ and to store some number $p$ of bits $b_0,…,b_{p-1}$. Including a bit $s$ to denote a sign $\pm1$, this leads to a representation like
$$\hat{r}=(-1)^s \times 2^e \times b_0.b_1b_2…b_{p-1}=(-1)^s\times 2^e\times\sum_{i=0}^{p-1}b_i 2^{-i}$$
TODO  image here
Here $s$ is the sign bit 0 or 1, $e$ is an integer exponent, the $b_i$ are binary digits (bits) each in $\{0,1\}$, and either $b_0$ is nonzero or all $b_i$ are 0. $p$ is the precision: the number of digits used to represent the floating point approximation to a real number. (TODO: clean, moved below.) To make this representation finite, the exponent is bounded by min and max values such that $e_{min}\leq e\leq e_{max}$. Such a format is specified by the four integer parameters $e_{min} \leq e_{max}$, the precision $p$, and the base $\beta$.
This is the basic idea behind floating point numbers: approximate a real number by limiting the number of significant digits and restricting the exponent to a limited range.
IEEE 754
Most modern computing follows a standard for dealing with floating-point numbers, the IEEE 754 standard, which has several versions. IEEE 754 defines specific binary formats for various sizes of floating-point numbers, precisely specifies the minimum requirements to conform to IEEE 754, and, most importantly, makes various guarantees on what to expect from a conforming implementation. These guarantees are what enable careful analysis of algorithms to make code robust and portable. Without some guarantees, portable code would be nearly impossible and would certainly be a mess.
IEEE 754 defines floating point formats of various lengths, the most common being a 32-bit and a 64-bit format (called float and double in C/C++, float and double in C#, etc.). For the rest of this document we'll call them float32 and float64 (and float16 and float128, etc., where needed).
TODO: IEEE also defines bit sizes >= 128 in multiples of 32.
It also defines both base 2 and base 10 formats, but we’ll only cover base 2 here, which is the base most commonly used. Similar facts and conclusions can be developed for base 10.
Binary formats
How IEEE 754 floating point works is best illustrated by looking at the bit layout of a floating point number. Consider the common 32 bit floating point layout:
- The highest bit is the sign bit $s$, denoting the sign $(-1)^s$.
- The next $w=8$ bits are an unsigned integer representing the (biased) exponent $E$. The actual exponent is $e=E-127$. The 127 is called the bias, and is determined as $bias = 2^{w-1}-1$. The actual exponent has range $e_{min}\leq e\leq e_{max}$, where $e_{max}=2^{w-1}-1=bias$ and $e_{min}=1-e_{max}$.
- The biased exponent allows faster comparisons of floating point numbers, useful for sorting: stored as signed integers, the exponent of $1/2$ (which is $-1$) would be 1111_1111 while the exponent of $2$ (which is $+1$) would be 0000_0001, so unsigned comparison gets them backwards; biased, they become 0111_1110 and 1000_0000, which sort correctly. (TODO)
- The final $p-1=23$ bits store $p=24$ bits of precision $b_1b_2…b_{23}$: 23 explicitly, with the initial hidden bit $b_0$ determined implicitly.
- The number in base 2 (with the binary point) $b_0.b_1b_2…b_{23}$ is called the significand.
- If $E=0$ the hidden bit is $b_0=0$, else it is $b_0=1$.
- When $E=0$ and the significand is 0, this represents +0 or -0 (depending on the sign bit). These compare as equal, but have different bit patterns.
- When $E=0$ and the significand is nonzero, this represents a subnormal number (sometimes called denormal).
- When $0<E<255=2^w-1$ the number is called a normal number (or a normalized float).
- When $E=255=2^w-1$ and the significand is 0, the float represents $\pm \infty$ with the sign from $s$.
- When $E=255=2^w-1$ and the significand is nonzero, this represents Not a Number (NaN), used to flag invalid operations like sqrt of a negative number.
- There are 2 types of NaN: quiet ($b_1=1$) and signaling ($b_1=0$). Signaling NaNs can raise exceptions, but this is not always exposed to programming languages.
- The remaining bits of a NaN are sometimes used for system meaning. For example, JavaScript engines routinely pack data type and other information into these bits (see https://anniecherkaev.com/the-secret-life-of-nan). TODO: But you must be really careful; processors can change these bits on reading/writing to/from memory and registers! Also, the 1985 std did not specify the qNaN/sNaN bit patterns, and some older processors implement the bits in different orders; 2008 fixed this and specifies the bit patterns. https://stackoverflow.com/questions/9574192/shouldiusefloatingpointsnanorfloatingpointboolforadatasetthat


Subnormal numbers (TODO: details):
The hidden bit is 1 whenever $E\neq0$, that is, whenever the exponent field is not the minimum value. This means the stored fraction is shifted until there is a 1 in the leading spot, and then that 1 is not written. When $E=0$, the float stores two kinds of numbers: subnormal ones when the fraction bits are not all 0, and a zero when the fraction bits are all 0. Note there is both a +0 and a -0 bit pattern, and they are distinct. When $E$ is neither 0 nor the highest value, the hidden bit is a 1 and the number is called a normal value.
Subnormal (TODO: explain here), also called denormal.
NaN  Agner Fog has good paper on differences between processors, 2018 https://grouper.ieee.org/groups/msc/ANSI_IEEEStd7542019/background/nanpropagation.pdf, also 2020 https://www.agner.org/optimize/nan_propagation.pdf
Not IEEE: bfloat16 (Brain Floating Point), 16 bits, used in Intel AI processors, Xeon (AVX-512 has a BF16 format), Intel FPGAs, Google Cloud TPUs, and TensorFlow. There are many formats: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format. bfloat16 reduces the precision from 24 bits to 8, but is otherwise the same as float32 (so conversion between bfloat16 and float32 is easy).
float16 (half): used in many image formats; allows a larger range than 16-bit ints and stores in half the space of float32.
The common float64 (double in C/C++/C#, etc.) is similar, except the field lengths and bias change. There are also float16 and float128 formats defined (and in use in some processors). The general pattern is:
- highest bit is a sign bit $s$
- $w$ bits for an unsigned integer exponent $E$, biased with $bias=2^{w-1}-1$. Then the actual exponent $e=E-bias$ has the range $e_{min}\leq e\leq e_{max}$ with $e_{max} = bias= 2^{w-1}-1$ and $e_{min}=1-e_{max}$.
- $p-1$ bits to store $p$ bits of precision, using the implicit hidden bit method explained above, based on $E$ being neither 0 nor $2^w-1$.
TODO  above exponent range clarified, note normal, denormal, infinite, and fix table below
Here are some common parameters
| Name | Common name | Significand bits (1 less stored) | Exponent bits $w$ | Exponent bias $=2^{w-1}-1$ | $e_{min}=1-e_{max}$ | $e_{max}=bias$ |
|---|---|---|---|---|---|---|
| binary16 | Half precision | 11 | 5 | $2^4-1 = 15$ | $-14$ | $+15$ |
| binary32 | Single precision | 24 | 8 | $2^7-1=127$ | $-126$ | $+127$ |
| binary64 | Double precision | 53 | 11 | $2^{10}-1=1023$ | $-1022$ | $+1023$ |
| binary128 | Quadruple precision | 113 | 15 | $2^{14}-1=16383$ | $-16382$ | $+16383$ |
| binary256 | Octuple precision | 237 | 19 | $2^{18}-1=262143$ | $-262142$ | $+262143$ |
There is also a bfloat16 (Brain Float) in use in modern systems for doing AI work: the format is the same as a float32 with the last 16 bits dropped. This gives it the same range as float32 with less precision, and it takes half the memory (so it is faster to move to cache and faster to process).
Here's a representation of all 64-bit doubles, and the meaning of each region (idea from [Arnold17]):

| Sign bit and exponent, then 13 nibbles of float64 significand | What it represents |
|---|---|
| 0x000_0_0000_0000_0000 | +0 |
| 0x000_0_0000_0000_0001 | smallest subnormal |
| … | |
| 0x000_f_ffff_ffff_ffff | largest subnormal |
| 0x001_0_0000_0000_0000 | smallest normal |
| … | |
| 0x001_f_ffff_ffff_ffff | largest normal with exponent $e_{min}$ |
| 0x002_0_0000_0000_0000 | 2 times smallest normal |
| … | |
| 0x7fe_f_ffff_ffff_ffff | largest normal |
| 0x7ff_0_0000_0000_0000 | $+\infty$ |
| 0x7ff_0_0000_0000_0001 | NaN (high significand bit 1 is qNaN, 0 is sNaN) |
| … | |
| 0x7ff_f_ffff_ffff_ffff | NaN (high significand bit 1 is qNaN, 0 is sNaN) |
| 0x800_0_0000_0000_0000 | $-0$ |
| 0x800_0_0000_0000_0001 | smallest negative subnormal |
| … | |
| 0x800_f_ffff_ffff_ffff | largest negative subnormal |
| 0x801_0_0000_0000_0000 | smallest negative normal |
| … | |
| 0xffe_f_ffff_ffff_ffff | largest negative normal |
| 0xfff_0_0000_0000_0000 | $-\infty$ |
| 0xfff_0_0000_0000_0001 | NaN (high significand bit 1 is qNaN, 0 is sNaN) |
| … | |
| 0xfff_f_ffff_ffff_ffff | NaN (high significand bit 1 is qNaN, 0 is sNaN) |
| 0x000_0_0000_0000_0000 | +0, back to the start! |
Ranges
For more details on ranges, see the section on IEEE above (TODO: merge).
TODO: add bfloat16 and float16 to the ranges, add eps, C++ values, etc.
Laid out as bits, floating point ranges look like this (from https://steve.hollasch.net/cgindex/coding/ieeefloat.html):

| Format | Denormalized | Normalized | Approximate decimal |
|---|---|---|---|
| 8-bit | | | |
| bfloat16 | | | |
| Half (float16) | | | |
| Single precision | $\pm 2^{-149}$ to $(1-2^{-23})\times 2^{-126}$ | $\pm 2^{-126}$ to $(2-2^{-23})\times 2^{127}$ | $\pm\approx10^{-44.85}$ to $\approx10^{38.53}$ |
| Double precision | $\pm 2^{-1074}$ to $(1-2^{-52})\times 2^{-1022}$ | $\pm 2^{-1022}$ to $(2-2^{-52})\times 2^{1023}$ | $\pm\approx10^{-323.3}$ to $\approx10^{308.3}$ |
Rounding rules
A rounding mode takes a real number, treats it as if it's infinitely precise, and rounds to the nearest representable floating point value for the given format. This also triggers exceptions for inexact, underflow, or overflow as appropriate (but note most programming languages do not expose these exceptions!).
There are 5 rounding modes (as of the 2008 spec):
TODO  pic for rounding modes, draw picture: arrows for each right, each left, each to 0, each away from 0, each to even.
- Round ties to even: this is the default, and the wisest choice. When a value is exactly between two floating point values, it rounds to the one with a 0 in the lowest significand bit. (TODO: explain why this is good.) Values of magnitude at least $\beta^{e_{max}}\times(\beta-\frac{1}{2}\beta^{1-p})$ round to $\infty$ of the same sign.
- Round ties to away: breaks ties toward the value with larger magnitude. Same infinity behavior as round ties to even.
- Round toward positive: rounds $x$ to the floating point value closest to and no less than $x$, possibly $+\infty$.
- Round toward negative: rounds $x$ to the closest floating point value no greater than $x$, possibly $-\infty$.
- Round toward zero: $x$ rounds to the closest floating point number no greater in magnitude than $x$.
Arithmetic guarantees
NOTE: the "less or greater" predicate $x <> y$ (i.e., $x<y$ or $x>y$) is not the same as NOT$(x=y)$: with a NaN operand, $x<>y$ is false but NOT$(x=y)$ is true.
IEEE implementations shall provide:
- nextUp(x), nextDown(x): give the neighboring values; behave as expected (except they can produce 0 or infinity, and NaN goes to quiet NaN).
- r = remainder(x,y) for $y\neq 0$ and finite $x,y$: $r=x-y\times n$ where $n$ is the integer nearest $x/y$, ties broken to even, regardless of rounding mode. The remainder is always exact. If $r=0$ then its sign is that of $x$. remainder(x, $\infty$) = x for finite x.
- minNum, maxNum. NOTE: the 2008 versions were replaced in 2019!
- The operations $\oplus, \otimes, \oslash, \ominus$.
- sqrt. Note sqrt(-0) = -0; all other results shall have a positive sign bit.
- fma(x,y,z) = $(x\times y)+z$ with only one rounding at the end (can underflow, overflow, or be inexact). TODO: explain why FMA is so useful.
- copySign, negate, abs: all quiet, only affect the sign bit. negate(x) is not the same as $0-x$ (they differ at $x=0$).
- Comparisons <, <=, =, >=, >. TODO: explain!
- totalOrder: a total ordering of all values, including NaNs and $\pm 0$.
How comparisons with NaN behave is language dependent? TODO: define some for C/C++, C#.
correctly rounded operations?

r = remainder (x,y)

operations $\oplus, \otimes, \oslash, \ominus$ all behave as expected when one operand is an infinity and the other is finite

sqrt infinity = infinity

remainder(x,infinity)=x

NaN propagates through operations
NaN and denormal sometimes slow
- correctly rounded ops are: $+, -, \times, \div, \sqrt{\;}$, and FMA.
- Many implementations have sin, cos, exp, ln, etc. close to correctly rounded, especially near 0 (TODO: test?)
recommended operations: these may have low error (TODO: a few ulps? Intel?)
- exp, expm1, exp2, exp2m1, exp10, exp10m1
- log, log2, log10
- logp1, log2p1, log10p1
- hypot
- rsqrt
- compound(x,n) = $(1+x)^n$
- rootn(x,n) = $x^{1/n}$
- pown(x,n) = $x^n$
- pow(x,y) = $x^y$
- sin, cos, tan, sinPi, cosPi, tanPi, asin, acos, atan, atan2, asinPi, acosPi, atan2Pi
- sinh, cosh, tanh, asinh, acosh, atanh
TODO  IEEE 9.2.1 has special values for many functions  are these true? If so, gives some nice handles for problems
Exceptions
These set flags in the hardware, but sadly are not often exposed to software.
- Invalid operation (result is NaN):
  - 0 times infinity, in multiplication or in FMA
  - adding infinities of opposite sign (or subtracting infinities of the same sign)
  - 0/0, or $\infty/\infty$
  - remainder(x, y) for $y = 0$ or $x$ infinite (and neither NaN)
  - sqrt of a negative number
- Divide nonzero by 0: gives the appropriately signed infinity.
- Overflow: result too big for the format; result is infinity (under the default rounding).
- Underflow: result tiny; a nonzero value smaller than the smallest subnormal can round to 0.
- Inexact: the result had to be rounded (was not exactly representable).
TODO
Standards
Before looking at what IEEE 754 guarantees and how to leverage the guarantees, some notation needs to be defined.
1985: must be provided: $+, -, \times, \div$, sqrt, rem, comparison (obvious rules, plus $-\infty=-\infty$, $+\infty = +\infty$, and $x \neq$ NaN for any $x$). rem returns $x-(round(x/y) \times y)$.
1985: recommended: $-x$ returns a sign flip, possibly different from $0-x$ (e.g., when $x=0$).
2008: added decimal, merged in IEEE 854 (1987) (radix independent standard)
IEEE 754 is a standard defining floating point systems, by far the most widely used format. Created in 1985, it has had several revisions:

IEEE 7541985 Standard for Binary FloatingPoint Arithmetic
 Didn’t specify which bits denote a sNaN or a qNaN (?)
 The Intel 8087 coprocessor was the first integrated circuit to implement this standard.
 Only 4 rounding modes (round to nearest even, toward 0, toward $+\infty$, and toward $-\infty$)

IEEE 8541987 Standard for RadixIndependent FloatingPoint Arithmetic
 Added radix independent floating point arithmetic (base 10 for example)

IEEE 7542008 Standard for FloatingPoint Arithmetic (also called ISO/IEC/IEEE 60559:2011)

Merged the 1985 and 1987 standards

denormalized renamed to subnormal (TODO  use throughout)

Added more bit size formats (float16, float128, floatN for N=128+32k where k positive integer?)

new decimal formats

added a new rounding mode: round ties away from 0

Standardized new operations, like fused multiply and add FMA.

defined min and max, but left the sign ambiguous when the operands are +0 and -0

min and max select the non-NaN operand when used with a quiet NaN (replaced in the 2019 standard, so don't use unless sure)

The 2008 std (p35, 6.2.1) says highest fraction bit = 1 is a qNaN, else an sNaN. The bit position was in the 1985 std, but 1985 didn't define which value meant which; hardware standardized on this, which was formalized in the 2008 version. TODO


IEEE 7542019 (also called ISO/IEC 60559:2020)
 This is the current standard (2021)
 Minor fixes replacing 2008 version
 Adds the significant change that min and max should propagate NaNs, 2008 didn’t
 Does not require A+B = B+A for different NaNs (the result may take either NaN's payload) [Agner]
These standards define a few key items:
 Binary formats for floating point numbers
 Rounding rules for converting real numbers to floating point
 Operation behavior, such as how accurate various operations are, and
 Exceptions such as divide by 0 or overflow
Special Number Operations
Operations on special numbers are well-defined by IEEE. In the simplest case, any operation with a NaN yields a NaN result. Other operations are as follows:

| Operation | Result |
|---|---|
| n ÷ ±∞ | ±0 |
| ±∞ × ±∞ | ±∞ |
| ±nonzero ÷ ±0 | ±∞ |
| ±finite × ±∞ | ±∞ |
| ∞ + ∞, ∞ − (−∞) | +∞ |
| −∞ − ∞, −∞ + (−∞) | −∞ |
| ∞ − ∞, −∞ + ∞ | NaN |
| ±0 ÷ ±0 | NaN |
| ±∞ ÷ ±∞ | NaN |
| ±∞ × 0 | NaN |
| NaN == NaN | false |
| NaN != NaN | true |
To sum up, the following are the corresponding values for a given representation (with exponent bias $b$):

| Sign | Exponent (e) | Fraction (f) | Value |
|---|---|---|---|
| 0 | 00⋯00 | 00⋯00 | +0 |
| 0 | 00⋯00 | 00⋯01 to 11⋯11 | positive denormalized real: $0.f \times 2^{-b+1}$ |
| 0 | 00⋯01 to 11⋯10 | XX⋯XX | positive normalized real: $1.f \times 2^{e-b}$ |
| 0 | 11⋯11 | 00⋯00 | +∞ |
| 0 | 11⋯11 | 00⋯01 to 01⋯11 | sNaN |
| 0 | 11⋯11 | 1X⋯XX | qNaN |
| 1 | 00⋯00 | 00⋯00 | −0 |
| 1 | 00⋯00 | 00⋯01 to 11⋯11 | negative denormalized real: $-0.f \times 2^{-b+1}$ |
| 1 | 00⋯01 to 11⋯10 | XX⋯XX | negative normalized real: $-1.f \times 2^{e-b}$ |
| 1 | 11⋯11 | 00⋯00 | −∞ |
| 1 | 11⋯11 | 00⋯01 to 01⋯11 | sNaN |
| 1 | 11⋯11 | 1X⋯XX | qNaN |
C++ conveniently offers std::numeric_limits<double>::quiet_NaN() and std::numeric_limits<double>::signaling_NaN() (C is less convenient here). There is no standard function to classify a NaN as quiet or signaling, so to tell them apart one must look at the raw bytes (the bit pattern discussion here assumes x86-64).
The Facts!
Here is a summary of things that are true and not true for IEEE 754 floating point numbers (and similar formats) which are useful when creating, analyzing, or fixing algorithms.
Notation
Notation to make statements precise: let $\mathbb{R}$ be the real numbers, $\mathbb{F}$ be the subset that is representable as floating point values. Note that $\mathbb{F}$ is NOT a field. Call $\mathbb{F}$ the representable numbers. The real values that round to a representable number are said to lie in the representable range.
To keep the following from being incredibly verbose, we assume all operations are performed on numbers that avoid overflow, underflow, NaN, infinities, etc. For practical work that must handle those cases, each statement below needs to be carefully checked for edge cases.
Write $x,y,z$ for real numbers, $a,b,c$ for floating point numbers.
Components:
- Let $\beta$ be the base of the representation, usually 2 or 10. Almost everything below will be base $\beta=2$.
- Let $s$ be 0 or 1 to denote the sign of a number via $(-1)^s$.
- Let $p$ be the precision of the representation, which is the number of base $\beta$ significant digits, including any implied 1 bit hidden in normalized numbers.
- Let integers $e_{min} < e_{max}$ denote the valid exponent range, i.e., the integer $e$ satisfies $e_{min}\leq e \leq e_{max}$. TODO: includes inf and denormal?
- Let $b_i$ be bits (digits when in base 10) with each $0\leq b_i<\beta$.
- Then the floating point numbers representable as $(-1)^s\times \beta^e\times \sum_{i=0}^{p-1}b_i \beta^{-i}$ are a subset of the real numbers called $\mathbb{F}$.
- Call the expression $f=b_0.b_1…b_{p-1}$ the fraction, following [Knuth]. We'll avoid the word "mantissa," often used here, for the same reasons Knuth gives.
Note these are discrete: there are "gaps" between them. Roughly speaking, for exponent $e$, two neighboring floating point numbers differ by $\beta^{e-p+1}$. Real numbers between any two floating point numbers can be approximated by taking the nearest one, with ties broken by a tie breaking rule called a rounding mode.
Errors in representation (TODO: clean up and merge; some of this seems conflicting):

Write $fl(x)$ for the floating point approximation to $x$ when $fl(x)$ is in $\mathbb{F}$. Also written as $round(x)$ to make some exposition clearer.

The machine epsilon $\epsilon_M$ (often written eps or $\epsilon$) is the distance from 1.0 to the next larger representable floating point value, $\epsilon_M = \beta^{-(p-1)}=\beta^{1-p}$. In the binary case $\beta=2$ this is $\epsilon_M=2^{1-p}$.

Converting a real number $x$ in the representable range to the floating point number $fl(x)$ rounds, giving a relative error of at most $\epsilon_M/2$.

Units in the last place of a number $x$, often denoted ulp(x), is (following [Mueller], who follows [Kahan, TODO]): ulp(x) is the gap between the two floating point numbers nearest to $x$, even if $x$ is one of them; ulp(NaN) is NaN. Some places define it instead as: if $x>0$, ulp(x) is the distance between $x$ and the next higher representable floating point number; if $x<0$, between $x$ and the next smaller. Note the definitions differ at powers of 2, where the gap size changes (TODO: example).

ulp = units in the last place: in the range $[2^e,2^{e+1})$ the ulp for float32 (23 stored fraction bits) is $2^{e-23}$ (TODO: check). $2^{-23}\approx1.19209…\times10^{-7}$.

Unit roundoff is the max relative amount rounded, $\mu=\epsilon_M/2 = \frac{1}{2}\beta^{1-p}$ (which is $2^{-p}$ in binary)

The absolute error is $|x-fl(x)|$

For nonzero $x$ the relative error is $\left|\frac{x-fl(x)}{x}\right|$. This is at most $u = \epsilon_M/2$.
Basic Operations:
- Infinite precision operations on real numbers are the operations $\bullet\in\{+,-,\times,\div\}$
- Floating point operations are $\circ\in\{\oplus,\ominus,\otimes,\oslash\}$. For example $fl(x+y)$ is written $x\oplus y$
Floating point representation theorem: $fl(x)=x(1+\delta)$ where $|\delta| \leq \mu$
Fundamental axiom of floating-point arithmetic: $fl(x \bullet y)=(x \bullet y)(1+\delta)$, where $|\delta|\leq \mu$.
TODO: want limits on when a+b is a, a */ b is a, etc. Tells what number sizes will not affect a calculation step, allows different analysis
Lomont, check: if the precision is $p$ and $a\geq b \geq 0$ and $b\leq a/2^{p+2}$, then $a\oplus b=a$. TODO: tested on float by experiment; what exact size bound on $b$? Same for $\ominus$, or negative numbers? Also checked for subtraction; seems right. Did get matches at ratio $2^{24}$ for float32, which has $p=24$ (23 stored bits).
Lomont, check: if $a\geq b \geq 0$ then $a\otimes b = a \iff$ TODO: size bound on $b$? Same for division, or negative numbers? Here rounding happens once $a/b \geq 2^{p+1}$; easy to check, is it beyond eps? Yes, only happens if the value is exactly 1.0 ?(TODO)?
TODO  check more exactly…. Prove solidly
IEEE Guarantees
TODO: slight disclaimer on rounding mode and denormalized numbers and overflows
- The floating point operations $\oplus,\ominus,\otimes,\oslash$, and $\sqrt{x}$ all act as if they were computed to infinite precision, then rounded according to the current rounding mode. Thus the relative error is at most (TODO: depends on rounding mode)
- Any real number exactly representable as a float will be stored as that float, so exact things are ok, like many integers, 1.5, 1.125, etc. 0.1, 0.2, and 0.3 are all infinitely repeating in binary; the rule is that a fraction is exactly representable iff its denominator (in lowest terms) has only prime factors of the base (only 2 for base 2; 2 and 5 for base 10).
- The IEEE standard guarantees the bit patterns for +0 and -0 compare as equal numbers.
Nice trick: multiplying or dividing by powers of the base (usually 2) is exact as long as there is no overflow or underflow.
for 32-bit floats, $\epsilon=2^{-24}\approx 5.96\times 10^{-8}$. TODO: check.
$\epsilon \approx 6\times10^{-8}$ for float32, $1.1\times10^{-16}$ for float64
todo  use fl or round for operation?
$a\oplus b=round(a+b)$: addition is treated as if infinite precision, then rounded
$a\ominus b=fl(a-b)$: subtraction is treated as if infinite precision, then rounded
$a\otimes b=fl(a\times b)$: multiplication is treated as if infinite precision, then rounded
$a\oslash b=fl(a/b)$: division is treated as if infinite precision, then rounded
$\sqrt{a}=round(\sqrt{a})$: square root is correctly rounded.
round puts the value within $\epsilon$ relative error. Thus $a\odot b$ lies in the interval $[(a\bullet b)(1-\epsilon),(a\bullet b)(1+\epsilon)]$, where $\bullet$ is the corresponding exact operation (TODO: for each op).
For a particular value, write $\delta$ for the actual error. Then for a given $a,b$ we can write $a\odot b=(a\bullet b)(1+\delta)$ where $|\delta| \leq \epsilon$.
TODO: is this bounded by $|\delta| \leq \epsilon$ or by $\epsilon/2$?
relative error: $\left|\frac{x-fl(x)}{x}\right|\leq\frac{1}{2}\beta^{1-p}$ for $\beta=2$
Guidelines
Often analyze with $x\oplus y=fl(x+y)=(x+y)(1+\epsilon)$ for $|\epsilon|\leq u$
TODO use $\delta_k$ for various proofs needing exact values.
[pbrt] [Higham] show that when a bound $(1\pm \epsilon)^n$ arises, bounding it by $(1\pm(n+1)\epsilon)$ is too big; one can instead use $1+\theta_n$ where $|\theta_n|\leq\frac{n\epsilon}{1-n\epsilon}$, calling the exact value of the fraction $\gamma_n$ to clean up proofs.
TODO: make a guidelines section of easy to apply help
Multiplication and division each add about one $\epsilon$ of relative error per use; a subtraction (or an addition where one value is negative) of like-sized numbers can add a lot of relative error; addition of same-sign values adds about one $\epsilon$. This is why the associative law breaks down.
Steps to check an algorithm:
- analyze based on infinite precision, condition numbers, etc.
- analyze based on floating point ranges: does it fit?
- analyze based on normalized precision
- analyze based on subnormal cases (fewer digits left)
- analyze for NaN propagation; check at some point to ensure behavior, assert
- todo? enough?
Tricks (todo):
- change rounding modes, check answers
- change precision, check answers
Not true
 Associative law DOES NOT hold
 $(a\oplus b)\oplus c \neq a\oplus (b\oplus c)$
 $(a\otimes b)\otimes c \neq a\otimes(b\otimes c)$
 Distributive law does not hold! $x\otimes (y\oplus z) \neq (x\otimes y) \oplus(x\otimes z)$
Here are things that are not true for floating point operations, though many are true for real numbers:
 Associative law fails: $(a\circ b )\circ c \neq a\circ(b\circ c)$
 Distributive law fails: $a\times(b+c)$ need not be $(a\times b) + (a\times c)$. This also fails for other operation combinations.
 $a\oplus b =a $ does not mean $b=0$.
 Inverses: $(1\oslash x)\otimes x = 1$ is not always true! Similar failures occur for each of the 4 basic ops!
comparisons to NaN: lots of false results
$n\otimes x \neq x\oplus x\oplus \dots \oplus x$ in general (different number of roundings!)
checking against division by 0 is often broken: even after the check, results still overflow quite often
comparing x == y is bad. $|x-y| <$ eps is also bad unless you know x and y behave a certain way. Otherwise prefer a comparison scaled to the magnitudes of x and y.
float(str(x)) != x in general: be super careful about round-tripping (give C code, C# code, etc. examples). Also these are not portable across platforms (Mac vs Intel bit us).
FMA may change results: show the example from [Higham] where FMA changes the sign of a determinant; that is, FMA $= fl(a\times b+c)$ need not equal $(a\otimes b)\oplus c$
example: c = a >= b ? std::sqrt(a*a - b*b) : 0
- there are values of a, b for which this takes the sqrt of a negative number, surprisingly (when the compiler contracts the expression into an FMA)!
Catastrophic cancellation
catastrophic cancellation happens on subtraction (or addition of oppositely signed numbers!), and is worse the closer in magnitude the two numbers are
Theorems
This is a list of things that are true you can use when developing algorithms.
Sterbenz lemma: if $a$ and $b$ are floating point numbers with $1/2 \leq a/b \leq 2$, then $a\ominus b$ is exact.
Note: Sterbenz's lemma applies exactly where catastrophic cancellation occurs! So its use depends on whether you want exactness and/or are tracking earlier precision loss.
TODO = Derive following PBRT
Splitting: $((x\oplus y)\ominus x)+((x\oplus y)\ominus((x\oplus y)\ominus x))=x\oplus y$
Cancellation amount (Loss of Precision Theorem): if $x,y$ are positive normalized floating point numbers with $x>y$ and $2^{-q} \leq 1-y/x \leq 2^{-p}$, then at most $q$ and at least $p$ significant binary digits are lost in computing $x-y$.
If $x$ is a real number in floating point range (no overflow or underflow), then $fl(x) = x(1+\delta)$ where $|\delta| \leq u$, the machine unit roundoff.
The unit roundoff is $u=2^{-p}$ for precision $p$; machine eps is $2u = 2^{1-p}$.
Fundamental Axiom of Floating Point arithmetic: for $x,y$ normalized floats, op among $+,-,\times,\div$, with $x \mathbin{op} y$ in floating point range, $fl(x \mathbin{op} y)=(x \mathbin{op} y)(1+\delta)$ for some $|\delta|\leq u$.
Check:
 does $x\oslash y=x\otimes(1\oslash y)$ ? I’d guess not
 $0\ominus(0\ominus x)=x$ ?
 $1\oslash(1\oslash x)=x$ ?
 Does midpoint: $x\leq y$ imply $x\leq (x\oplus y)\oslash 2 \leq y$?
True statements for $a,b, c$ floating point numbers, $x,y,z$ real numbers in representable range:
 Commutative law holds
 $a\oplus b=b\oplus a$
 $a\otimes b = b\otimes a$
 $1\otimes a=a$. Note since the commutative law holds, this also means $a\otimes 1=a$. Similarly throughout this section.
 $a\oplus 0=a$
 for $a\neq 0$, $a\oslash a=1$
 if $a\ominus b=0$ then $a=b$,
 $a\oplus b=0$ if and only if $a=-b$
 $2\otimes a$ and $a \oslash 2$ are exact (assuming no overflow or underflow, and in base 2)
 $a\oslash 2^i=a\otimes 2^{-i}$ assuming no underflow or overflow, in base 2.
 TODO ?If $x$ is a power of the base then distributive law does hold since $fl(\beta x) = \beta fl(x)$
 $a\ominus b = a\oplus (-b)$
 $-(a\oplus b)=(-a)\oplus (-b)$
 $a\ominus b=-(b\ominus a)$
 if $a \leq b$ then $a \oplus c \leq b\oplus c$
 $x\leq y$ implies $fl(x)\leq fl(y)$ (Rounding is order preserving  very important)
 $a\otimes b=0$ if and only if $a=0$ or $b=0$.
 $(-a)\oslash b = a \oslash (-b) = -(a \oslash b)$
 $a\oslash 1 = a$
 If $a\leq b$ and $c>0$ then $a\otimes c \leq b\otimes c$ and $a\oslash c \leq b\oslash c$
 If $a\oplus b=a+b$ then $(a\oplus b)\ominus b = a$
 If $a\otimes b=a\times b \neq 0$ then $(a\otimes b)\oslash b=a$
 (Kahan, Knuth) $1\oslash(1\oslash(1\oslash x))=1\oslash x$ for all $x\neq 0$
Representation theorem: $\left|\frac{x-fl(x)}{x}\right|\leq u$. Tells about rounding.
Axiom of arithmetic: $fl(x \bullet y) = (x \bullet y)(1+\epsilon)$ for some $|\epsilon|\leq u$.
$\sum_{k=0}^n x_k$ computed by repeated addition has relative error at most $(1+\epsilon)^n-1\approx n\epsilon$.
If $x>y$ are normalized and $2^{-q}\leq1-\frac{y}{x}\leq 2^{-p}$ then at most $q$ and at least $p$ significant binary digits will be lost in the subtraction $x-y$.
If $x$ is in the range of floating point and $x\in[\beta^e,\beta^{e+1})$ then the absolute error satisfies $|x-fl(x)|\leq \frac{1}{2}\beta^{e}\beta^{1-p}$ and (if $x\neq 0$) the relative error satisfies $\left|\frac{x-fl(x)}{x}\right|\leq \frac{1}{2}\beta^{1-p}$. The RHS of the last is called machine precision or unit roundoff.
Most proofs simply expand as $x=(\sum d_i \beta^{i})\beta^e$ and work through details and cases (for x in range above).
Some follow from principle that if number can be represented exactly, it is.
The spacing between a normalized floating point number $x$ and its normalized neighbor is at least $\beta^{-1}\epsilon_M|x|$ and at most $\epsilon_M|x|$.
Two from Goldberg:
Thm 5: $a,b$ floating point numbers; define the sequence $a_0=a$, $a_1=(a_0\ominus b)\oplus b$, $a_2=(a_1 \ominus b)\oplus b$, …, $a_n=(a_{n-1}\ominus b)\oplus b$. Then $a_n$ equals $a_0$ or $a_1$ for all $n>1$: the sequence is stable.
Thm 6: $p$ is the precision; even if $\beta>2$, ops exactly rounded. Then with $k=\lceil p/2\rceil$ and $m=\beta^k+1$, $a$ can be split as $a_H + a_L$ via $a_H=(m\otimes a)\ominus ((m\otimes a) \ominus a)$ and $a_L=a\ominus a_H$. This is useful for making algorithms that stay in range, or for doing double (or higher) precision calculations.
Multiple precision
For some algorithms needing more precision in places, there is a method of using multiple floating point numbers to provide more precision. The trick is that a high part $a_H$ and a low part $a_L$ together represent the real number $a_H+a_L$. Operations on these often go by the names double-double or quad-double. TODO: library, papers.
Knuth double and quad representations, Priest, Lomont template to recursively make bigger floats or integers?
Mandelbrot zoom needs vastly more precision
[Priest91]  Arbitrary precision algorithms from standard operations
Bailey QD and DD algorithms and ideas
Examples
(Knuth, Kulisch) the poly $2y^2+9x^4-y^4$ evaluated at $x=408855776$, $y=708158977$ using IEEE doubles gives magnitude about 3.7x10^19. Evaluating the equivalent form $2y^2+(3x^2-y^2)(3x^2+y^2)$ gives about 1x10^18. The true answer is exactly 1.0 (the pair solves the Pell equation $y^2-3x^2=1$). How/why so bad? Find more examples
Can values drift? Knuth problem: $x_0=u$, $x_{2n+1}=x_{2n}\otimes v$, $x_{2n+2}=x_{2n+1}\oslash v$ for $u,v\neq 0$. What is the largest $k$ so that $x_k \neq x_{k+2}$?
compute associativity of addition and multiplication, in Knuth 4.2.2
changing rounding mode trick to test code.
horner polynomial stability? Most stable?
lomont iszero predicate, is close, errors in old, etc.
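The Knuth/Kulisch polynomial above can be reproduced directly (a sketch in Python, whose floats are IEEE doubles; the exact check uses Python's big integers):

```python
# Kulisch/Knuth polynomial: the true value is exactly 1, since (y, x) solves
# the Pell equation y^2 - 3x^2 = 1, but doubles get it wildly wrong.
x, y = 408855776.0, 708158977.0
x2, y2 = x * x, y * y

naive = 2.0 * y2 + 9.0 * x2 * x2 - y2 * y2               # 2y^2 + 9x^4 - y^4
factored = 2.0 * y2 + (3.0 * x2 - y2) * (3.0 * x2 + y2)  # equivalent form

# exact evaluation with Python's arbitrary precision integers
xi, yi = 408855776, 708158977
exact = 2 * yi**2 + 9 * xi**4 - yi**4

assert exact == 1
assert naive != 1.0 and factored != 1.0   # both float answers are far off
```

The catastrophe comes from cancelling terms of magnitude ~2.5x10^35, where one ulp is already ~3.7x10^19.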
Here’s an ever growing list of common idioms and algorithms and how to do things in a numerically stable manner.
quaternion computation  check algorithms
Quadratic formula: $x_1=\frac{-b-\operatorname{sign}(b)\sqrt{b^2-4ac}}{2a}$, $x_2 = \frac{c}{a x_1}$ (choosing the sign avoids cancellation; the second root uses $x_1 x_2 = c/a$)
sqrt(x^2+y^2) for x and/or y large (overflow?)
x^2 - y^2 vs (x-y)(x+y)
alpha blend
quaternion
Kahan summation
higham: Kahan's example where the FMA computation of a determinant may change sign from the non-FMA version; says do not use FMA indiscriminately
triangle area  Boldo paper nice?
higham ch 1: sample variance, quadratic,
[Priest91], citing Milenkovic, line segment to line segment intersection. If initial floats have $p_1$ precision, fully correct intersection needs $2p_1 +1$ precision.
compute p(x)=(x-10)^9 around x=10 in this form and in expanded polynomial form and graph to show the error
horner method  stable poly evaluation?
compute the sum for pi from large terms down, or from small terms up, and show the difference
compute 1-cos(x) near 0: bad relative error. Use the equivalent sin^2(x)/(1+cos(x)) and get better answers; this removes the cancellation. Or expand in a power series, which allows controlling the error directly.
a geometric ray intersection algorithm that used infinity as part of the correct answer
[pbrt] compute the error of ((a+b)+c)+d versus (a+b)+(c+d): has some tradeoffs
[pbrt]  derives conservative ray bounds intersections and ray triangle intersections, robust spawned rays, etc
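As one worked example from the list, the stable quadratic formula: compute the larger-magnitude root with no cancellation, then get the other from $x_1 x_2 = c/a$. A minimal sketch, assuming real roots and $a \neq 0$; the function name is mine:

```python
import math

def quad_roots(a, b, c):
    """Numerically stable real roots of a*x^2 + b*x + c = 0.
    The sign choice makes b and the sqrt add with the same sign (no
    cancellation); the second root comes from x1*x2 = c/a."""
    d = math.sqrt(b * b - 4.0 * a * c)   # assumes discriminant >= 0
    q = -0.5 * (b + math.copysign(d, b))
    return q / a, c / q                  # q == 0 only if b == c == 0

r1, r2 = quad_roots(1.0, -1e8, 1.0)
# the naive formula (-b + sqrt(...))/(2a) returns 0.0 for the tiny root;
# this form recovers it to full relative precision (~1e-8)
```

For b = -1e8 the naive small root cancels to zero, while c/q keeps about 16 correct digits.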
TODO  put simple code examples here?
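Kahan (compensated) summation from the list above is a good first example. A sketch, assuming IEEE doubles with round-to-nearest:

```python
def kahan_sum(xs):
    """Compensated summation: track the rounding error of each addition
    in c and feed it back in, making the error nearly independent of n."""
    s = 0.0
    c = 0.0                  # running compensation for lost low-order bits
    for x in xs:
        y = x - c            # corrected next term
        t = s + y            # big + small: low bits of y are lost here...
        c = (t - s) - y      # ...and recovered (negated) into c
        s = t
    return s

vals = [1.0, 1e-16, 1e-16, 1e-16, 1e-16]
# naive sum(vals) stays at 1.0 since each 1e-16 is below half an ulp of 1.0;
# kahan_sum recovers the small terms and lands near the true 1 + 4e-16
```

Each pass costs 4 flops per element instead of 1, usually still memory-bound in practice.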
number of decimal digits needed to round-trip binary floats (9 for float32, 17 for float64): the proof is simple, in Goldberg
Integers you can store exactly in float32: the significand has 24 bits (23 stored plus an implicit leading 1), so every integer up to 2^24 = 16777216 is exact, while 2^24 + 1 rounds to 2^24.
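The float32 integer limit is easy to check by round-tripping through binary32; a sketch using Python's struct module (the helper name is mine):

```python
import struct

def to_f32(x):
    """Round a Python float (binary64) through binary32 and back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

assert to_f32(2.0**24) == 16777216.0        # 2^24 is exactly representable
assert to_f32(2.0**24 + 1) == 16777216.0    # 2^24 + 1 rounds back down (ties to even)
assert to_f32(2.0**24 + 2) == 16777218.0    # spacing between floats is now 2
```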
Language support
C++: std::numeric_limits gives max, min, epsilon; std::isnan, std::isinf, etc. C#, others?
memcpy is the only portable, safe way to move a float to/from an integer in C++; union type punning is not guaranteed there (it is in C), and pointer casts violate strict aliasing. C++20 adds std::bit_cast.
things can change between release and debug builds, compiler changes, platform changes, OS changes, etc. Check your environment, or write some code that carefully tests your assumptions.
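In Python, the struct module plays the role of memcpy for inspecting bit patterns; a sketch (helper name mine) that also confirms the 1.0 = 0x3f800000 debugging value:

```python
import struct

def f32_bits(x):
    """IEEE 754 binary32 bit pattern of x, as an unsigned int."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

assert f32_bits(1.0) == 0x3f800000    # sign 0, biased exponent 127, fraction 0
assert f32_bits(-2.0) == 0xc0000000   # sign 1, biased exponent 128, fraction 0
```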
TODO
Things to merge
 don’t call things mantissa  it’s a misnomer  see Knuth. Call it fraction (or significand).
 Image of float density  halves each power of 2, half of all numbers in [0,1]
Z3 proof IEEE
show examples of changing rounding modes and testing code.
mpfr.org  code, GNU, etc.
Since some real numbers will have much larger magnitude than any representation with a fixed size exponent, TODO  infinities, overflow, …
scan for ideas https://news.ycombinator.com/item?id=9837342
some hardware flushes denormals (subnormals) to zero (FTZ/DAZ modes)  loses precision, can cause problems.
MISC
TODO  link to online calculator to visualize
TODO  flesh out.
TODO  “floating point or floating-point”  google, make consistent
Notes on floating point math, theorems, algorithms
float32 normalized range roughly 1.2 * 10^-38 to 3.4 * 10^38 (subnormals reach down to about 1.4 * 10^-45)
some common values seen in debugging: 1.0 = 0x3f800000, check, more?
float32 often much faster, half as much memory as double (better for caches); float16 and bfloat16 even more so.
relative precision: float32 2^-23 ~ 7 decimal digits, double 2^-52 ~ 16 decimal digits
IEEE hardware carries extra bits while rounding: a guard bit, a round bit, and a sticky bit (the OR of all remaining lower bits, used for resolving ties); together they suffice to round every operation correctly.
TODO  endianness? copies into integer for C/C++
Testing and evaluation
Software packages, results
Bibliography
[Goldberg94] “What Every Computer Scientist Should Know About Floating-Point Arithmetic,” Goldberg, 1994
[Knuth] “The Art of Computer Programming, Vol. 2: Seminumerical Algorithms,” Knuth
[Priest91] “Algorithms for Arbitrary Precision Floating Point Arithmetic,” Priest, 1991
[Muller+10] “Handbook of Floating-Point Arithmetic,” Muller et al., 2010
[Muller06] “Elementary Functions, Algorithms and Implementation,” 2nd Ed., Muller, 2006
[Higham02] “Accuracy and Stability of Numerical Algorithms,” 2nd Ed., Higham, 2002
[IEEE754-2019] “IEEE Standard for Floating-Point Arithmetic,” 2019
[Kahan97] “Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic,” Kahan, 1997
[Arnold17] “An Introduction to Floating-Point Arithmetic and Computation,” Jeff Arnold, CERN slides, 2017
[Overton01] “Numerical Computing with IEEE Floating Point Arithmetic,” Michael Overton, 2001
[pbrt] “Managing Rounding Error,” chapter in Physically Based Rendering (PBRT)
Look through Kahan's papers and site for more
a few floating point books exist
PBRT  rounding stuff
[Agner] https://www.agner.org/optimize/blog/read.php?i=1012
[Wiki] https://en.wikipedia.org/wiki/IEEE_754, Feb 2021
some testing http://www.math.utah.edu/~beebe/software/ieee/
[Half] https://en.wikipedia.org/wiki/Half-precision_floating-point_format
[Dawson] TODO  mention this site as a ton of good stuff https://randomascii.wordpress.com/2017/06/19/sometimes-floating-point-math-is-perfect/
[Muller] “On the definition of ulp(x),” 2005, http://www.ens-lyon.fr/LIP/Pub/Rapports/RR/RR2005/RR2005-09.pdf
todo  investigate https://mortoray.com/2015/07/06/essential-facts-about-floating-point-calculations/
check https://news.ycombinator.com/user?id=brucedawson
https://floating-point-gui.de/errors/comparison/
https://floating-point-gui.de/references/
https://anniecherkaev.com/the-secret-life-of-nan
Sites to look at
https://ieeexplore.ieee.org/abstract/document/4358257?section=abstract
Good question: