
Lomont floating point notes


Notes on floating point issues, including lists of facts and theorems useful for algorithm development.


Chris Lomont

Document history

v0.1 Feb 2021 - Initial layout, notes

v0.2 Aug 2022 - significant extensions and cleanup


These are notes on many aspects of floating point numbers in computing. This is very messy, and needs lots of cleaning. Instead of continuing to not post this, I decided to post and edit, clean up, and refine over time. Good luck.

The level is aimed at skilled TODO programmers who want to write good floating point code. There are more details than most places give, providing tools to develop your own algorithms and start learning how to do this well.

These notes collect a significant number of facts useful for understanding, deriving, and developing floating point algorithms.

TODO - quick contents

TODO - image here

TODO If you’re already decently knowledgeable about floating-point, jump to the section titled “The Facts!” for a quick summary of IEEE 754 rules and subsequent theorems as well as many examples of how to use floating point well in common situations.

Floating point numbers

Floating-point formats define a finite precision approximation to infinite precision real numbers.

An arbitrary real number $r$ can be written in base 10 in the form $r=\pm 10^N \times d_0.d_1d_2…$ where $N$ is an integer, the $d_i$ are digits in ${0,1,…,9}$. Note the decimal point between $d_0$ and $d_1$. If we require $d_0 \neq 0$ whenever $r\neq 0$, then the exponent $N$ is uniquely determined by $r$. (To keep uniqueness, we exclude representations with an infinite trailing string of 9’s - these are equivalent to incrementing the digit before the start of the string and replacing those trailing 9’s with 0’s.)

Similarly, and to make things more suitable for computers, represent $r$ in base 2: an arbitrary real number $r$ can be written in base 2 in the form $r=\pm 2^N \times b_0.b_1b_2…$ where $N$ is an integer, the $b_i$ are bits (binary digits) in ${0,1}$. If we require $b_0 \neq 0$ (i.e., $b_0=1$) whenever $r\neq 0$, then $N$ is determined by $r$. (To keep uniqueness, we exclude representations with an infinite trailing string of 1’s - these are equivalent to incrementing the digit before the start of the string and replacing those trailing 1’s with 0’s.)

Following convention, instead of integral $N$ for exponents, we will use $e$ for the integral exponent. This is not the natural base $e=2.718281828…$.

To approximate real numbers for computer calculations, it is convenient to fit an approximation $\hat{r}$ of a real number $r$ into a fixed number of bits. This forces using only finitely many bits to store the exponent $e$ and to store some number $p$ of bits $b_0,…,b_{p-1}$. Including a bit $s$ to denote a sign $\pm1$, this leads to a representation like

$$\hat{r}=(-1)^s \times 2^e \times b_0.b_1b_2…b_{p-1}=(-1)^s\times 2^e\times\sum_{i=0}^{p-1}b_i 2^{-i}$$

TODO - image here

Here $s$ is the sign bit 0 or 1, $e$ is an integer exponent, the $b_i$ are binary digits (bits) each in ${0,1}$, and either $b_0$ is nonzero or all $b_i$ are 0. $p$ is the precision: the number of digits used to represent the floating point approximation to a real number. TODO - clean, moved below. To make this representation finite, the exponent is bounded by minimum and maximum values so that $e_{min}\leq e\leq e_{max}$. Such a format is specified by four parameters: the exponent bounds $e_{min} \leq e_{max}$, the precision $p$, and the base $\beta$.

This is the basic idea behind floating point numbers: approximate a real number by limiting the number of significant digits and restricting the exponent to a limited range.
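This decomposition is easy to see in code. A small sketch (assuming Python, whose floats are IEEE binary64; `math.frexp` recovers a significand/exponent pair, which we normalize into the form above):

```python
import math

def decompose(x):
    """Split a finite nonzero float into (sign, significand, exponent) with
    x == sign * significand * 2**exponent and 1 <= significand < 2."""
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    sign = -1 if m < 0 else 1
    return sign, abs(m) * 2, e - 1  # renormalize so the significand is in [1, 2)

# 6.375 = 2^2 * 1.59375 = 2^2 * (1 + 1/2 + 1/16 + 1/32)
print(decompose(-6.375))   # (-1, 1.59375, 2)
```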

IEEE 754

Most modern computing follows a standard for dealing with floating-point numbers, the IEEE 754 standard, which has several versions. IEEE 754 has specific binary formats for various sizes of floating-point numbers, it precisely specifies minimum requirements to conform to IEEE 754, and most importantly, makes various guarantees on what to expect from a conforming implementation. These guarantees are what enables careful analysis of algorithms to make code robust and portable. Without some guarantees, portable code would be nearly impossible and would certainly be a mess.

IEEE 754 defines representations of floating point formats of various lengths, the most common being a 32-bit and a 64-bit format (called float and double in C/C++, C#, etc.). For the rest of this document we’ll call them float32 and float64 (and float16 and float128, etc., where needed).

todo - IEEE also defines bit sizes >= 128 for multiples of 32.

It also defines both base 2 and base 10 formats, but we’ll only cover base 2 here, which is the base most commonly used. Similar facts and conclusions can be developed for base 10.

Binary formats

How IEEE 754 floating point works is best illustrated by looking at the bit layout of a floating point number. Consider the common 32 bit floating point layout:

  • the highest bit is the sign bit $s$ to denote the sign $(-1)^s$
  • The next $w=8$ bits are an unsigned integer representing the (biased) exponent $E$. The actual exponent is $e=E-127$. The 127 is called the bias, and is determined as $bias = 2^{w-1}-1$. The actual exponent has range $e_{min}\leq e\leq e_{max}$, where $e_{max}=2^{w-1}-1=bias$ and $e_{min}=1-e_{max}$
    • The biased exponent allows faster comparisons of floating point numbers, useful for sorting: stored as a signed (two's complement) field, the exponent of 1/2 ($e=-1$) would be 1111_1111 and that of 2 ($e=1$) would be 0000_0001, which compare in the wrong order as unsigned; biased, 1/2 gives E=0111_1110 and 2 gives E=1000_0000, so unsigned bit comparison matches numeric order. TODO
  • The final $p-1=23$ bits store $p=24$ bits of precision $b_1b_2…b_{23}$: 23 explicitly and the initial hidden bit $b_0$ is determined implicitly
    • The number in base 2 (with decimal point) $b_0.b_1b_2…b_{23}$ is called the significand.
    • if $E=0$ this bit is $b_0=0$, else this bit is $b_0=1$.
    • When $E=0$ and the significand is 0, this represents +0 or -0 (depending on sign bit). These compare as equal, but have different bit patterns.
    • When $E=0$ and the significand is nonzero, this represents a subnormal number (sometimes called denormal).
    • When $0<E<255=2^w-1$ the number is called a normal number (or a normalized float).
  • When $E=255=2^w-1$ and the significand is 0, the float represents $\pm \infty$ with the sign from $s$.
  • When $E=255=2^w-1$ and the significand is nonzero, this represents Not a Number (NaN), used to flag invalid operations like sqrt of a negative number.
float32 layout:

/- 1 sign bit s
|  /- 8 exponent bits E
|  |         /
|  |        /
1  1000_0001  100_1100_0000_0000_0000_0000 = -6.375 = (-1)^1 * 2^(129-127) * (1 + 1/2 + 1/16 + 1/32)
              |                           \
              \- 23 fraction bits (not including 1 hidden bit!) b1 b2 b3 b4 ... b23
This float has sign bit s = 1, biased exponent E = 0x81 = 129 (so the actual exponent is e = 129 - 127 = 2), and fraction bits 100_1100_0…0.

This represents the floating point number 
    (-1)^s * 2^(E-127) * (b0/2^0 + b1/2^1 + b2/2^2 + b3/2^3 + ... + b23/2^23)
where b0 = 0 if E = 0, else b0 = 1
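The example above can be checked by reinterpreting the bits directly (a sketch in Python; `struct` handles the float32 conversion):

```python
import struct

# The 32 bits from the layout above: sign 1, exponent 1000_0001, fraction 100_1100_0...0
bits = 0b1_10000001_10011000000000000000000
value = struct.unpack('>f', struct.pack('>I', bits))[0]
print(hex(bits), value)   # 0xc0cc0000 -6.375
```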

Subnormal numbers TODO - details:

The hidden bit is 1 whenever $E\neq0$, that is, whenever the exponent is not the minimum value. This means the stored fraction is shifted until there is a 1 in the leading spot, and then the 1 is not written in. When $E=0$, the float stores two kinds of numbers: subnormal ones when the fraction bits are not all 0, and a zero when the fraction bits are all 0. Note there is both a +0 and a -0 bit pattern, and they are distinct. When $E$ is not 0 or the highest value the hidden bit is a 1 and the number is called a normal value.

Subnormal numbers (also called denormal): when $E=0$ and the fraction bits are not all 0, the value is $(-1)^s\times 2^{e_{min}}\times 0.b_1b_2…b_{23}$. They fill the gap between 0 and the smallest normal number, providing gradual underflow. TODO - more details

NaN - Agner Fog has good paper on differences between processors, 2018 https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf, also 2020 https://www.agner.org/optimize/nan_propagation.pdf

Not IEEE, bfloat (Brain Floating Point) - 16 bits, in Intel AI processors, Xeon (AVX-512 has BF16 format), Intel FPGAs, Google Cloud TPUs, and Tensorflow - many formats https://en.wikipedia.org/wiki/Bfloat16_floating-point_format bfloat reduces precision from 24 to 8 bits, else same (easy convert between bfloat and float32)

float16 [Half] - used in many image formats - allows larger range than 16 bit ints, stores in half the space as float32

The common float64 (double in C/C++/C#, etc.) is similar, except the field lengths and bias change. There are also float16 and float128 formats defined (and in use in some processors). The general pattern is:

  • highest bit is a sign bit $s$
  • $w$ bits for an unsigned integer exponent $E$, biased with $bias=2^{w-1}-1$. Then the actual exponent $e=E-bias$ has the range $e_{min}\leq e\leq e_{max}$ with $e_{max} = bias= 2^{w-1}-1$ and $e_{min}=1-e_{max}$.
  • $p-1$ bits to store $p$ bits of precision, using the implicit hidden bit method explained above, based on $E$ not being 0 or $2^w-1$.

TODO - above exponent range clarified, note normal, denormal, infinite, and fix table below

Here are some common parameters

Name       Common name           Significand bits (1 less stored)   Exponent bits $w$   Bias $=2^{w-1}-1$   $e_{min}=1-e_{max}$   $e_{max}=bias$
binary16   Half precision        11                                 5                   15                  -14                   +15
binary32   Single precision      24                                 8                   127                 -126                  +127
binary64   Double precision      53                                 11                  1023                -1022                 +1023
binary128  Quadruple precision   113                                15                  16383               -16382                +16383
binary256  Octuple precision     237                                19                  262143              -262142               +262143

There is also a bfloat16 (Brain Float) in use in modern systems for doing AI work: the format is the same as a float32 with the low 16 bits of the significand dropped. This gives it the same range as float32, but less precision, and it takes less memory (so it is faster to move to cache and faster to process).
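Conversion from float32 to bfloat16 is therefore just dropping (or rounding off) the low 16 bits of the float32 bit pattern. A rough sketch of the truncating version (real hardware typically rounds to nearest even instead):

```python
import struct

def f32_bits(x):
    """IEEE float32 bit pattern of x as an unsigned 32-bit int."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

def to_bfloat16_trunc(x):
    """Truncate a float32 value to bfloat16, returned as a float (low 16 bits zeroed)."""
    b = f32_bits(x) & 0xFFFF0000
    return struct.unpack('>f', struct.pack('>I', b))[0]

print(to_bfloat16_trunc(3.14159265))  # ~3.140625: only 8 bits of precision survive
```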

Here’s a representation of all 64 bit doubles, and the meaning of them (idea from [Arnold17]):

Sign bit and exponent, then 13 nibbles of float64 significand What it represents
0x000_0_0000_0000_0000 +0
0x000_0_0000_0000_0001 smallest subnormal
0x000_f_ffff_ffff_ffff largest subnormal
0x001_0_0000_0000_0000 smallest normal
0x002_0_0000_0000_0000 2 times smallest normal
0x7fe_f_ffff_ffff_ffff largest normal
0x7ff_0_0000_0000_0000 $+\infty$
0x7ff_0_0000_0000_0001 smallest NaN payload: sNaN (highest significand bit 1 is qNaN, 0 is sNaN)
0x7ff_f_ffff_ffff_ffff largest NaN payload: qNaN (highest significand bit 1 is qNaN, 0 is sNaN)
0x800_0_0000_0000_0000 -0
0x800_0_0000_0000_0001 smallest negative subnormal
0x800_f_ffff_ffff_ffff largest negative subnormal
0x801_0_0000_0000_0000 smallest negative normal
0xffe_f_ffff_ffff_ffff largest negative normal
0xfff_0_0000_0000_0000 $-\infty$
0xfff_0_0000_0000_0001 smallest negative NaN payload: sNaN (highest significand bit 1 is qNaN, 0 is sNaN)
0xfff_f_ffff_ffff_ffff largest negative NaN payload: qNaN (highest significand bit 1 is qNaN, 0 is sNaN)
0x000_0_0000_0000_0000 +0, back to the start!
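Several of these float64 boundary values can be verified directly (a sketch, assuming Python floats are IEEE binary64):

```python
import struct, sys

def f64_from_bits(b):
    """Interpret an unsigned 64-bit int as an IEEE float64 bit pattern."""
    return struct.unpack('>d', struct.pack('>Q', b))[0]

smallest_subnormal = f64_from_bits(0x0000000000000001)   # 2^-1074, about 5e-324
smallest_normal    = f64_from_bits(0x0010000000000000)   # 2^-1022
largest_normal     = f64_from_bits(0x7FEFFFFFFFFFFFFF)   # about 1.798e308
pos_inf            = f64_from_bits(0x7FF0000000000000)

print(smallest_subnormal, smallest_normal, largest_normal, pos_inf)
```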


For more details on ranges, see the section on IEEE above - TODO - merge

Todo - add bfloat and float16 to ranges, add eps, c++ values, etc…

Laid out as bits, floating point numbers look like this (from https://steve.hollasch.net/cgindex/coding/ieeefloat.html):

         Sign bit, exponent bits, fraction bits (including the implied '1' bit)
8-bit : (1, 4, 4) SEEEEFFF
bfloat: (1, 8, 8) SEEEEEEE EFFFFFFF 
Half  : (1, 5,11) SEEEEEFF FFFFFFFF
Format            Denormalized range               Normalized range               Approximate decimal range
Half (float16)    ±2^-24 to (1-2^-10)×2^-14        ±2^-14 to (2-2^-10)×2^15       ±≈10^-7.2 to ≈10^4.8
Single Precision  ±2^-149 to (1-2^-23)×2^-126      ±2^-126 to (2-2^-23)×2^127     ±≈10^-44.85 to ≈10^38.53
Double Precision  ±2^-1074 to (1-2^-52)×2^-1022    ±2^-1022 to (2-2^-52)×2^1023   ±≈10^-323.3 to ≈10^308.3

Rounding rules

A rounding mode takes a real number, treats it as if it’s infinitely precise, and rounds to the nearest representable floating point value for a given format. This also triggers exceptions for inexact, underflow, or overflow as appropriate (but note most programming languages do not support seeing these exceptions!).

There are 5 rounding modes (as of the 2008 spec):

TODO - pic for rounding modes, draw picture: arrows for each right, each left, each to 0, each away from 0, each to even.

  1. Round ties to even: this is the default, and wisest choice. When a value is exactly between two floating point values, it rounds to the one with a 0 in the lowest significand bit. TODO - explain why good; rounds values of magnitude at least $b^{e_{max}}\times(b-\frac{1}{2} b^{1-p})$ to $\infty$ of the same sign. TODO - same emax?
  2. Round ties to away: breaks ties toward the value with larger magnitude. Same infinity behavior as #1.
  3. Round toward positive: rounds $x$ to the closest floating point value no less than $x$, possibly $+\infty$.
  4. Round toward negative: rounds $x$ to the closest floating point value no greater than $x$, possibly $-\infty$.
  5. Round toward zero: rounds $x$ to the closest floating point value no greater in magnitude than $x$ (truncation).

Arithmetic guarantees

NOTE: due to NaN, negating a comparison is not the opposite comparison: e.g., NOT(x < y) is not equivalent to x >= y, since both x < y and x >= y are false when either operand is NaN.

IEEE implementations shall provide

  • nextUp(x), nextDown(x) - give neighbor values, behaves as expected (except can give -0, infinity, NaN->NaN quiet)
  • r = remainder(x,y) for $y\neq 0$ and finite $x,y$: $r=x-y\times n$ where $n$ is the integer nearest $x/y$, ties broken to even, regardless of rounding mode. remainder is always exact. If $r=0$ then its sign is that of $x$. remainder(x, inf) = x for finite x.
  • minNum, maxNum NOTE: version in 2008 replaced in 2019!
  • operations $\oplus, \otimes, \oslash, \ominus$
  • sqrt. Note sqrt(-0) = -0, all others shall have positive sign bit.
  • fma(x,y,z) = (x\times y)+z with only one rounding at end (can underflow, overflow, inexact) - todo - explain why FMA so useful
  • copy, negate, abs, copySign - all quiet, only affect sign bits. negate(x) is not the same as 0-x (they differ at x = 0)
  • compare < <= = >= > todo - explain!
  • totalOrder x<y, x>y,

Language dependent on how things compare to NaN? - define some for C/C++, C#?

correctly rounded operations?

  • r = remainder (x,y)

  • operations $\oplus, \otimes, \oslash, \ominus$ all behave as expected when one operand is an infinity and the other is finite

  • sqrt infinity = infinity

  • remainder(x,infinity)=x

  • NaN propagates through operations

NaN and denormal sometimes slow

  • correctly rounded ops are: $+-\times \div \sqrt{}$ and FMA.
  • Many implementations have sin,cos, exp, ln, etc. close, especially near 0 (todo - test?)

recommended operations: these may have low error (todo- a few ulps? Intel?)

  • exp, expm1, exp2, exp2m1, exp10, exp10m1
  • log,log2,log10
  • logp1,log2p1,log10p1
  • hypot
  • rsqrt
  • compound(x,n)=(1+x)^n
  • rootn(x,n)=x^1/n
  • pown(x,n)=x^n
  • pow(x,y)=x^y
  • sin,cos,tan,sinPi,cosPi,tanPi,asin,acos,atan,atan2,asinPi,acosPi,atan2Pi
  • sinh, cosh, tanh, asinh, acosh, atanh

TODO - IEEE 9.2.1 has special values for many functions - are these true? If so, gives some nice handles for problems


These set flags in the hardware, sadly are not often exposed to software.

  1. Invalid operation -> NaN:
    1. 0 times infinity, in multiplication or in FMA
    2. adding or subtracting two oppositely signed infinities
    3. 0/0 or inf/inf
    4. remainder(x,y) for y = 0 or x infinite (and neither NaN)
    5. sqrt of a number < 0
  2. Divide nonzero by 0: gives the appropriately signed infinity
  3. Overflow: result too big in magnitude for the format; result is infinity
  4. Underflow: result is tiny (below the smallest normal magnitude); result is subnormal or 0
  5. Inexact: the rounded result differs from the infinitely precise result



Before looking at what IEEE 754 guarantees and how to leverage the guarantees, some notation needs to be defined.

1985: must be provided: +, -, *, /, sqrt, rem, comparison (obvious, and -inf = -inf, +inf = +inf, x ≠ NaN for any x). rem returns x - (round(x/y) × y).

1985: recommend: -x returns sign flip, possibly different from 0-x (e.g., when x=0)

2008: added decimal, merged in IEEE 854 (1987) (radix independent standard)

IEEE 754 is a standard defining floating point systems, by far the most widely used format. Created in 1985, it has had several revisions:

  • IEEE 754-1985 Standard for Binary Floating-Point Arithmetic

    • Didn’t specify which bits denote a sNaN or a qNaN (?)
    • The Intel 8087 coprocessor was the first integrated circuit to implement this standard.
    • Only 4 rounding modes (round to even, to 0, to inf, and to-inf)
  • IEEE 854-1987 Standard for Radix-Independent Floating-Point Arithmetic

    • Added radix independent floating point arithmetic (base 10 for example)
  • IEEE 754-2008 Standard for Floating-Point Arithmetic (also called ISO/IEC/IEEE 60559:2011)

    • Merged the 1985 and 1987 standards

    • denormalized renamed to subnormal (TODO - use throughout)

    • Added more bit size formats (float16, float128, floatN for N=128+32k where k positive integer?)

    • new decimal formats

    • added new rounding mode: round away from 0

    • Standardized new operations, like fused multiply and add FMA.

    • defined min and max, but left them not sign ambiguous when operands are +0 and -0

    • min and max select the number when used with a quiet NaN (replaced in 2019 standard, so don’t use unless sure)

    • 2008 (p35, 6.2.1) std says highest fraction bit = 1 is a qNaN, else a sNaN. This bit position was in the 1985 std, but didn’t define the value for which is which; hardware standardized on this which was formalized in the 2008 version. TODO

  • IEEE 754-2019 (also called ISO/IEC 60559:2020)

    • This is the current standard (2021)
    • Minor fixes replacing 2008 version
    • Adds the significant change that min and max should propagate NaNs, 2008 didn’t
    • Does not have A+B = B+A for different NaNs [Agner]

These standards define a few key items:

  • Binary formats for floating point numbers
  • Rounding rules for converting real numbers to floating point
  • Operation behavior, such as how accurately various operations must be performed, and
  • Exceptions such as divide by 0 or overflow

Special Number Operations

Operations on special numbers are well-defined by IEEE. In the simplest case, any operation with a NaN yields a NaN result. Other operations are as follows:

Operation              Result
n ÷ ±∞                 0
±∞ × ±∞                ±∞
±nonzero ÷ ±0          ±∞
±finite × ±∞           ±∞
∞ + ∞ and ∞ − (−∞)     +∞
−∞ − ∞ and −∞ + (−∞)   −∞
∞ − ∞ and −∞ + ∞       NaN
±0 ÷ ±0                NaN
±∞ ÷ ±∞                NaN
±∞ × 0                 NaN
NaN == NaN             false
NaN != NaN             true

To sum up, the following are the corresponding values for a given representation:

Sign  Exponent (e)     Fraction (f)      Value
0     00⋯00            00⋯00             +0
0     00⋯00            00⋯01 … 11⋯11     positive denormalized real: 0.f × 2^(−b+1)
0     00⋯01 … 11⋯10    XX⋯XX             positive normalized real: 1.f × 2^(e−b)
0     11⋯11            00⋯00             +∞
0     11⋯11            00⋯01 … 01⋯11     sNaN
0     11⋯11            1X⋯XX             qNaN
1     00⋯00            00⋯00             −0
1     00⋯00            00⋯01 … 11⋯11     negative denormalized real: −0.f × 2^(−b+1)
1     00⋯01 … 11⋯10    XX⋯XX             negative normalized real: −1.f × 2^(e−b)
1     11⋯11            00⋯00             −∞
1     11⋯11            00⋯01 … 01⋯11     sNaN
1     11⋯11            1X⋯XX             qNaN

Here $b$ is the exponent bias and $e$ is the stored (biased) exponent.

See https://stackoverflow.com/questions/18118408/what-is-the-difference-between-quiet-nan-and-signaling-nan

I’ll be using C++ in this answer instead of C because it offers the convenient std::numeric_limits::quiet_NaN and std::numeric_limits::signaling_NaN which I could not find in C conveniently.

I could not however find a function to classify if a NaN is sNaN or qNaN, so let’s just print out the NaN raw bytes:
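The missing program can be sketched in Python (float32 bit patterns via `struct`; sNaN is omitted here since Python may quiet a signaling NaN while moving it through a float):

```python
import math, struct

def f32_hex(x):
    """Hex string of the IEEE float32 bit pattern of x."""
    return format(struct.unpack('>I', struct.pack('>f', x))[0], '08x')

print('qnan', f32_hex(float('nan')))   # 7fc00000
print(' inf', f32_hex(math.inf))       # 7f800000
print('-inf', f32_hex(-math.inf))      # ff800000
```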


qnan 7fc00000
snan 7fa00000
 inf 7f800000
-inf ff800000
nan0 7fc00000
nan1 7fc00001
nan2 7fc00002
 0/0 ffc00000
sqrt ffc00000

The Facts!

Here is a summary of things that are true and not true for IEEE 754 floating point numbers (and similar formats) which are useful when creating, analyzing, or fixing algorithms.


Notation to make statements precise: let $\mathbb{R}$ be the real numbers, $\mathbb{F}$ be the subset that is representable as floating point values. Note that $\mathbb{F}$ is NOT a field. Call $\mathbb{F}$ the representable numbers. The real values that round to a representable number are said to lie in the representable range.

To keep the following from being incredibly verbose, we assume all operations are performed on numbers that avoid overflow, underflow, NaN, infinities, etc. For practical work that must handle those cases, each statement below needs to be carefully checked for edge cases.

Write $x,y,z$ for real numbers, $a,b,c$ for floating point numbers.


  • Let $\beta$ be the base of the representation, usually 2 or 10. Almost everything below will be base $\beta=2$.
  • Let $s$ be 0 or 1 to denote the sign of a number via $(-1)^s$.
  • Let $p$ be the precision of the representation, which is the number of base $\beta$ significant digits, including any implied 1 bit hidden in normalized numbers.
  • Let integers $e_{min} < e_{max}$ denote the valid exponent range, i.e., the integer $e$ satisfies $e_{min}\leq e \leq e_{max}$. TODO - includes inf and denormal?
  • Let $b_i$ be bits (digits when in base 10) with each $0\leq b_i<\beta$.
  • Then the floating point numbers representable as $(-1)^s\times \beta^e\times \sum_{i=0}^{p-1}b_i \beta^{-i}$ are a subset of the real numbers called $\mathbb{F}$
  • Call the expression $f=b_0.b_1…b_{p-1}$ the fraction, following [Knuth]. We’ll avoid the word “mantissa,” often used here, for the same reasons Knuth gives.

Note these are discrete: there are “gaps” between them. Roughly speaking, for exponent $e$, two neighboring floating point numbers differ by $\beta^{e-p+1}$. Real numbers between any two floating point numbers can be approximated by taking the nearest one, with ties broken by a tie breaking rule called a rounding mode.
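These gaps are directly observable (a sketch; `math.ulp` and `math.nextafter` need Python 3.9+):

```python
import math

# Neighboring float64 values around 1.0: the gap is 2^-52 above and 2^-53 below,
# since the exponent decreases when crossing the power of 2 at 1.0.
up   = math.nextafter(1.0, math.inf)
down = math.nextafter(1.0, -math.inf)
print(up - 1.0)                  # 2**-52
print(1.0 - down)                # 2**-53: the gap halves below a power of 2
print(math.ulp(1.0) == 2**-52)   # True
```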

Errors in representation (TODO - cleanup, merge, seems conflicting info…):

  • Write $fl(x)$ for the floating point approximation to $x$ when $fl(x)$ is in $\mathbb{F}$. Also written as $round(x)$ to make some exposition clearer.

  • The machine epsilon $\epsilon_M$ (often written eps or $\epsilon$) is distance from 1.0 to next larger representable floating point value, $\epsilon_M = \beta^{-(p-1)}=\beta^{1-p}$. In the binary case $\beta=2$ this is $\epsilon_M=2^{1-p}$ .

  • Converting a real number $x$ in the representable range to the floating point number $fl(x)$ rounds, giving a relative error of at most $\epsilon_M/2$.

  • Units in last place of a number $x$, often denoted ulp(x), is (following [Mueller], who follows [Kahan TODO]): ulp(x) is the gap between the two floating point numbers nearest to $x$, even if $x$ is one of them; ulp(NaN) is NaN. Some places define it instead as: if $x>0$, ulp(x) is the distance between $x$ and the next larger representable floating point number; if $x<0$, between $x$ and the next smaller. Note the gap size changes at powers of 2 (todo example)

  • ulp = units in last place: for float32 ($p=24$), in the range $[2^e,2^{e+1})$ the ulp is $2^{e-23}$; near 1, ulp $=2^{-23}\approx1.19209…\times10^{-7}$.

  • Unit roundoff is the maximum relative error of rounding, $\mu=\epsilon_M/2 = \frac{1}{2}\beta^{1-p}$ ($=2^{-p}$ in base 2)

  • The absolute error is $|x-fl(x)|$

  • For nonzero $x$ the relative error is $\frac{|x-fl(x)|}{|x|}$. This is at most u = eps/2

Basic Operations:

  • Infinite precision operations on real numbers are the operations $\bullet\in{+,-,\times,\div }$
  • Floating point operations are $\circ\in{\oplus,\ominus,\otimes,\oslash}$. For example $fl(x+y)$ is written $x\oplus y$

Floating point representation theorem: $fl(x)=x(1+\delta$) where $|\delta| \leq \mu$

Fundamental axiom of floating-point arithmetic: $fl(x \bullet y)=(x \bullet y)(1+\delta),$ where $|\delta|\leq \mu$.

TODO: want limits on when a+-b is a, a */ b is a, etc. Tells what number sizes will not affect a calculation step, allows different analysis

Lomont: check: if precision $p$ and $a\geq b \geq 0$ and $b\leq a\cdot 2^{-(p+2)}$, then $a\oplus b=a$ TODO - tested on float, experiments… size on b? Same for -, or negative numbers? Also checked for subtraction - seems right? Did get matches at ratio 2^24 for float32, which has p = 24 (23 bits stored)

Lomont: check: if $a\geq b \geq 0$ then $a\otimes b = a \iff $ TODO - size on b? Same for div, or negative numbers? here rounds once a/b >= 2^(-(p+1)) - easy to check, is beyond eps? Yes, only happens if value exactly 1.0 ?(TODO)?

TODO - check more exactly…. Prove solidly

IEEE Guarantees

TODO: slight disclaimer on rounding mode and denormalized numbers and overflows

  1. The floating point operations $\oplus,\ominus,\otimes,\oslash$, and $\sqrt{x}$ all act as if they are computed to infinite precision, then rounded according to the current rounding mode. Thus the relative error is at most (TODO - based on rounding mode)
  2. any real number exactly representable as a float will be stored as exactly that float - so exact values are fine, like many integers, 1.5, 1.125, etc. But 0.1, 0.2, and 0.3 are all infinitely repeating in binary - the rule is that a fraction is exact only when its denominator (in lowest terms) is a product of primes dividing the base (only 2 for base 2; 2 and 5 for base 10).
  3. The IEEE standard guarantees bit patterns for +0 and -0 compare as equal numbers.

Nice trick: multiplying or dividing by powers of the base (usually 2) are exact as long as no overflow or underflow.
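A quick check of this trick (sketch, float64): scaling by a power of 2 only changes the exponent field, so it round-trips exactly, absent overflow or underflow:

```python
# Multiplying/dividing by powers of two only changes the exponent field,
# so it is exact (no overflow/underflow occurs for these values).
for x in (0.1, 1/3, 2.5e-7, -3.14159):
    assert (x * 2**40) / 2**40 == x
    assert (x / 2**100) * 2**100 == x
print("power-of-two scaling round-trips exactly")
```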

For 32 bit floats, the unit roundoff is $u=2^{-24}\approx 5.96\times 10^{-8}$ (and machine epsilon $\epsilon_M=2^{-23}\approx1.19\times10^{-7}$).

$u \approx 6\times10^{-8}$ for float32, $2^{-53}\approx1.1\times10^{-16}$ for float64

todo - use fl or round for operation?

$a\oplus b=round(a+b)$ - this says addition treated as if infinite precision, then rounded

$a\ominus b=fl(a-b)$ - this says subtraction treated as if infinite precision, then rounded

$a\otimes b=fl(a\times b)$ - this says multiplication treated as if infinite precision, then rounded

$a\oslash b=fl(a/b)$ - this says division treated as if infinite precision, then rounded

$\sqrt{a}$ computes $fl(\sqrt{a})$ - this says square root is treated as if infinite precision, then rounded

Rounding puts the value within relative error $u$. Thus the computed $a\circ b$ lies in the interval $[(a\bullet b)(1-u),(a\bullet b)(1+u)]$ for each corresponding pair of ops above.

For a particular value, write $\delta$ for the actual error. Then for a given $a,b$ we can write $a\circ b=(a\bullet b)(1+\delta)$ where $|\delta| \leq u$.

(For round to nearest the bound is $|\delta|\leq u=\epsilon_M/2$; for the directed rounding modes it is only $|\delta|\leq \epsilon_M$.)

rel error: $\frac{|x-fl(x)|}{|x|}\leq\frac{1}{2}\beta^{1-p}$, which is $2^{-p}$ for $\beta=2$


Often analyze with $x\oplus y=fl(x+y)=(x+y)(1+\epsilon)$ for some $|\epsilon|\leq u$

TODO use $\delta_k$ for various proofs needing exact values.

[pbrt], following [Higham], notes that bounding $(1\pm \epsilon)^n$ by $1\pm(n+1)\epsilon$ is too loose; instead use $1+\theta_n$ where $|\theta_n|\leq\frac{n\epsilon}{1-n\epsilon}$, and call the bound $\gamma_n=\frac{n\epsilon}{1-n\epsilon}$ to clean up proofs.

TODO: make a guidelines section of easy to apply help

Multiplication and division add a relative error of about one $\epsilon$ per use. A subtraction (or an addition where one value is negative) of like-sized numbers can add a lot of relative error (cancellation), while addition of same-sign values adds about 1 eps. This is why the associative law breaks down.

Steps to check an algo:

  1. analyze based on infinite precision, condition numbers, etc.
  2. analyze based on floating point ranges - does it fit?
  3. analyze based on normalized precision
  4. analyze the denormal cases - fewer digits available
  5. analyze NaN propagation - check at some point to ensure behavior, assert
  6. todo? enough?

Tricks: todo

  1. change rounding modes, check answers
  2. change precision, check answers

Not true

Here are things that are not true for floating point operations, though many are true for real numbers:

  • Associative law fails:
    • $(a\oplus b)\oplus c \neq a\oplus (b\oplus c)$ in general
    • $(a\otimes b)\otimes c \neq a\otimes(b\otimes c)$ in general
  • Distributive law fails: $a\otimes (b\oplus c)$ need not be $(a\otimes b) \oplus(a\otimes c)$. This also fails for other operation combinations.
  • $a\oplus b =a$ does not mean $b=0$.
  • Inverses: $(1\oslash x)\otimes x= x$ is not always true! Similar failures occur for each of the 4 basic ops!
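Concrete instances are easy to find (a sketch, float64):

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)   # 0.6000000000000001
print(a + (b + c))   # 0.6
assert (a + b) + c != a + (b + c)      # associativity fails

# a + b == a does not imply b == 0: 1.0 is below half an ulp of 1e16
big, tiny = 1e16, 1.0
assert big + tiny == big and tiny != 0
```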

comparisons with a NaN are mostly false (NaN compares unordered with everything, including itself)

n*x != x+x+…+x (different number of roundings!)

guarding only against division by 0 is often broken - the division can still overflow quite often

comparing x == y is bad. |x-y| < eps is also bad unless you know x and y behave a certain way; otherwise use a tolerance scaled to the magnitudes of x and y.

float(str(x)) != x - be super careful about roundtripping through strings - give C code, C# code, etc. examples. Also these are not portable across platforms (Mac vs Intel bit us)

FMA may change results: show example from [Higham] where FMA changes the sign of a determinant: that is, fma$(a,b,c) = fl(a\times b+c)$ need not be $(a\otimes b)\oplus c$

example: c = a>=b ? std::sqrt(a*a-b*b) : 0 - there are values of a,b that give sqrt of a negative number, surprisingly!

Catastrophic cancellation

catastrophic cancellation: happens on subtraction (or addition of different signed numbers!), worse the closer in magnitude the numbers are
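A sketch of the effect: computing 1 − cos(x) directly for small x loses essentially all significant digits, while the algebraically equal form 2·sin(x/2)² does not (true value here is about 5e-17):

```python
import math

x = 1e-8
naive  = 1 - math.cos(x)            # cos(x) rounds to (nearly) 1, subtraction cancels
stable = 2 * math.sin(x / 2) ** 2   # no cancellation: both factors are tiny and accurate
print(naive, stable)                # naive is 0.0 (or one ulp of 1); stable is ~5e-17
```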


Here is a list of true statements you can use when developing algorithms.

Sterbenz lemma: if $a$ and $b$ are floating point numbers with $1/2 \leq a/b \leq 2$, then $a\ominus b$ is exact: $a\ominus b = a-b$.

Note: Sterbenz warns where catastrophic cancellation occurs! So its use depends on whether you want exactness and/or worry about precision loss!
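The lemma can be spot-checked with exact rational arithmetic (a sketch; `fractions.Fraction` converts a binary float exactly, and any pair in [1,2] satisfies 1/2 ≤ a/b ≤ 2):

```python
import random
from fractions import Fraction

random.seed(1)
for _ in range(1000):
    a = random.uniform(1.0, 2.0)
    b = random.uniform(1.0, 2.0)          # a/b is automatically in [1/2, 2]
    # Sterbenz: the floating point subtraction equals the exact real subtraction
    assert Fraction(a) - Fraction(b) == Fraction(a - b)
print("Sterbenz exactness verified on random pairs")
```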

TODO -= Derive following PBRT

Splitting: $((x\oplus y)\ominus x)+((x\oplus y)\ominus((x\oplus y)\ominus x))=x\oplus y$
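This splitting is the idea behind error-free addition. A sketch of the classic TwoSum algorithm (Knuth/Møller; not from this document's text, assuming round to nearest and no overflow), which recovers the exact rounding error of an addition:

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, err) with s = fl(a+b) and s + err == a + b exactly."""
    s = a + b
    bb = s - a                       # the part of b that made it into s
    err = (a - (s - bb)) + (b - bb)  # what was rounded away
    return s, err

s, err = two_sum(0.1, 0.2)
# s + err equals fl(0.1) + fl(0.2) exactly, checked as rationals:
assert Fraction(s) + Fraction(err) == Fraction(0.1) + Fraction(0.2)
print(s, err)
```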

Cancellation amount (Loss of Precision Theorem): if $x,y$ are positive normalized floating point numbers with $x>y$ and $2^{-q} \leq 1-(y/x) \leq 2^{-p}$, then at most $q$ and at least $p$ significant digits are lost in computing $x-y$.

If $x$ is a real number in floating point range (no overflow or underflow), then $fl(x) = x(1+\delta)$ where $|\delta|\leq u$, the machine unit roundoff.

The unit roundoff is $u=2^{-t-1}$ for a $t$ bit stored fractional part; machine eps is $\epsilon_M = 2u = 2^{-t}$.

Fundamental Axiom of Floating Point arithmetic: for $x,y$ normalized floats, op one of $+,-,\times,\div$, with $x$ op $y$ in floating point range, $fl(x \text{ op } y)=(x \text{ op } y)(1+\delta)$ for some $|\delta|\leq u$


  • does $x\oslash y=x\otimes(1\oslash y)$ ? I’d guess not
  • $0\ominus(0\ominus x)=x$ ?
  • $1\oslash(1\oslash x)=x$ ?
  • Does midpoint: $x\leq y$ imply $x\leq (x\oplus y)\oslash 2 \leq y$?

True statements for $a,b, c$ floating point numbers, $x,y,z$ real numbers in representable range:

  • Commutative law holds
    • $a\oplus b=b\oplus a$
    • $a\otimes b = b\otimes a$
  • $1\otimes a=a$. Note since the commutative law holds, this also means $a\otimes 1=a$. Similarly throughout this section.
  • $a\oplus 0=a$
  • for $a\neq 0$, $a\oslash a=1$
  • if $a\ominus b=0$ then $a=b$,
  • $a\oplus b=0$ if and only if $a=-b$
  • $2\otimes a$ and $a \oslash 2$ are exact (assuming no overflow or underflow, and in base 2)
  • $a\oslash 2^i=a\otimes 2^{-i}$ assuming no underflow or overflow, and in base 2.
  • TODO- ?If $x$ is a power of the base then distributive law does hold since $fl(\beta x) = \beta fl(x)$
  • $a\ominus b = a\oplus -b$
  • $-(a\oplus b)=-a\oplus-b$
  • $a\ominus b=-(b\ominus a)$
  • if $a \leq b$ then $a \oplus c \leq b\oplus c$
  • $x\leq y$ implies $fl(x)\leq fl(y)$ (Rounding is order preserving - very important)
  • $a\otimes b=0$ if and only if $a=0$ or $b=0$.
  • $(-a)\oslash b = a \oslash (-b) = -(a \oslash b)$
  • $a\oslash 1 = a$
  • If $a\leq b$ and $c>0$ then $a\otimes c \leq b\otimes c$ and $a\oslash c \leq b\oslash c$
  • If $a\oplus b=a+b$ then $(a\oplus b)\ominus b = a$
  • If $a\otimes b=a\times b \neq 0$ then $(a\otimes b)\oslash b=a$
  • (Kahan, Knuth) $1\oslash(1\oslash(1\oslash x))=1\oslash x$ for all $x\neq 0$

Representation theorem: $\frac{|x-fl(x)|}{|x|}\leq u$. Tells about rounding

Axiom of arithmetic: $fl(x \circ y) = (x \circ y)(1+\epsilon)$ for some $|\epsilon|\leq u$.

$\sum_{k=0}^n x_k$ has at most $(1+\epsilon)^n-1\approx n\epsilon$ error.

If $x>y$ are normalized and $2^{-q}\leq1-\frac{y}{x}\leq 2^{-p}$ then at most $q$ and at least $p$ significant binary digits will be lost in the subtraction $x-y$.

if $x$ is in the range of floating point and $|x|\in[\beta^e,\beta^{e+1})$ then the absolute error satisfies $|x-fl(x)|\leq \frac{1}{2}\beta^{e+1-p}$ and (if $x\neq 0$) the relative error satisfies $\frac{|x-fl(x)|}{|x|}\leq \frac{1}{2}\beta^{1-p}$. The RHS of the last is called machine precision or unit roundoff.

Most proofs simply expand as $x=(\sum d_i \beta^{-i})\beta^e$ and work through details and cases (for x in range above).

Some follow from principle that if number can be represented exactly, it is.

spacing between normalized floating point number $x$ and normalized neighbor is at least $\beta^{-1}\epsilon_M|x|$ and at most $\epsilon_M|x|$.

Two from Goldberg:

Thrm 5: $a,b$ floating point numbers; define the sequence $a_0=a$, $a_1=(a_0\ominus b)\oplus b$, $a_2=(a_1 \ominus b)\oplus b$, …, $a_n=(a_{n-1}\ominus b)\oplus b$. Then either $a_n=a_0$ for all $n$, or $a_n=a_1$ for all $n\geq 1$. The sequence stabilizes immediately rather than drifting.

Thrm 6: $p$ is the precision, operations exactly rounded (this holds even if $\beta>2$). If $k=\lceil p/2\rceil$ and $m=\beta^k+1$, then $a$ can be split as $a=a_H + a_L$ via $a_H=(m\otimes a)\ominus (m\otimes a \ominus a)$ and $a_L=a\ominus a_H$. This is useful for making algorithms that stay in range, or for doing double (or higher) precision calculations.
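For IEEE doubles ($\beta=2$, $p=53$, so $k=27$ and $m=2^{27}+1$) this is Veltkamp splitting; a sketch, assuming round-to-nearest and no overflow (the function name is mine):

```python
def split(a: float):
    """Veltkamp splitting: returns (hi, lo) with hi + lo == a exactly,
    each half fitting in roughly half the significand bits."""
    m = 2.0 ** 27 + 1.0   # beta^k + 1 with beta = 2, k = ceil(53/2) = 27
    t = m * a             # m (x) a
    hi = t - (t - a)      # a_H = (m (x) a) (-) (m (x) a (-) a)
    lo = a - hi           # a_L = a (-) a_H, exact
    return hi, lo

hi, lo = split(0.1)       # hi + lo reproduces 0.1 exactly
```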

Multiple precision

For algorithms needing more precision in places, there is a standard technique of using multiple floating point numbers to provide it: a high part $a_H$ and a low part $a_L$ together represent the real number $a_H+a_L$. Formats and operations built this way go by the names double-double or quad-double. TODO library, papers.
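The basic building block is an error-free transformation: Knuth's TwoSum computes both the rounded sum and its exact rounding error using only ordinary additions. A sketch:

```python
from fractions import Fraction

def two_sum(a: float, b: float):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and s + e == a + b
    exactly (barring overflow), with no assumption about operand magnitudes."""
    s = a + b
    bb = s - a                     # the part of b that actually got added
    e = (a - (s - bb)) + (b - bb)  # what was lost to rounding
    return s, e

# 1.0 is below one ulp of 1e16, so it survives entirely in the error term.
s, e = two_sum(1e16, 1.0)
```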

Knuth double and quad representations, Priest, Lomont template to recursively make bigger floats or integers?

Mandelbrot deep zoom needs vastly more precision.

[Priest91] - Arbitrary precision algorithms from standard operations

Bailey QD and DD algorithms and ideas


(Knuth, Kulisch) The polynomial $2y^2+9x^4-y^4$ evaluated at $x=408855776$, $y=708158977$ using IEEE doubles gives about $-3.7\times10^{19}$. Evaluating the equivalent form $2y^2+(3x^2-y^2)(3x^2+y^2)$ gives about $1\times10^{18}$. The true answer is exactly $1.0$. Why so bad? The huge intermediate terms (around $2.5\times10^{35}$) cancel almost completely, leaving pure rounding noise. Find more examples.
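The example is easy to reproduce; a sketch in Python, where exact integer arithmetic confirms the true value and double arithmetic does not:

```python
x, y = 408855776, 708158977

# Python integers are exact: the value really is 1
# (this pair satisfies y^2 - 3x^2 = 1, which forces the polynomial to 1).
exact = 2*y*y + 9*x**4 - y**4

# The same expression in IEEE doubles: the ~2.5e35 terms cancel almost
# completely, so the result is rounding noise of enormous magnitude.
xf, yf = float(x), float(y)
approx = 2.0*yf*yf + 9.0*xf**4 - yf**4
```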

Can values drift? A Knuth problem: $x_0=u$, $x_{2n+1}=x_{2n}\otimes v$, $x_{2n+2}=x_{2n+1}\oslash v$ for $u,v\neq 0$ (repeatedly multiply and divide by $v$). What is the largest $k$ such that $x_k \neq x_{k+2}$?

Compute associativity failures of addition and multiplication; see Knuth 4.2.2.

Changing the rounding mode as a trick to test code sensitivity.

Horner polynomial evaluation stability? Is it the most stable?

Lomont iszero predicate, isclose, errors in old versions, etc.

Here’s an ever-growing list of common idioms and algorithms, and how to do each in a numerically stable manner.

quaternion computation - check algorithms

Quadratic formula: $x_1=\frac{-b - \operatorname{sign}(b)\sqrt{b^2-4ac}}{2a}$, $x_2 = \frac{c}{a x_1}$. This avoids the cancellation between $-b$ and the square root that the textbook formula suffers for one of the two roots.
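A sketch of that idea, assuming real roots and $a\neq 0$ (the function name is mine, not a standard API):

```python
import math

def solve_quadratic(a: float, b: float, c: float):
    """Roots of a x^2 + b x + c = 0, avoiding cancellation between -b and
    the square root. Assumes b^2 >= 4ac and a != 0."""
    d = math.sqrt(b*b - 4.0*a*c)
    q = -0.5 * (b + math.copysign(d, b))  # b and sign(b)*d add, never cancel
    x1 = q / a                            # larger-magnitude root
    x2 = c / q                            # other root, via x1 * x2 = c / a
    return x1, x2

# Roots ~1e8 and ~1e-8: the naive formula loses the small root to cancellation.
r_big, r_small = solve_quadratic(1.0, -1e8, 1.0)
```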

sqrt(x^2+y^2) for x and/or y large: the intermediate squares can overflow (or underflow) even when the result is perfectly representable.
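The standard fix is to factor out the larger magnitude so the squared quantity is at most 1; a sketch (in practice prefer the library's `math.hypot` / C `hypot`, which handle these cases):

```python
import math

def hypot_scaled(x: float, y: float) -> float:
    """sqrt(x^2 + y^2) without overflow in the intermediate squares:
    factor out max(|x|, |y|) so the ratio r satisfies |r| <= 1."""
    ax, ay = abs(x), abs(y)
    if ax < ay:
        ax, ay = ay, ax
    if ax == 0.0:
        return 0.0
    r = ay / ax
    return ax * math.sqrt(1.0 + r*r)

h = hypot_scaled(3e200, 4e200)   # naive squaring would overflow to inf
```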

x^2-y^2 vs (x-y)(x+y)

alpha blend


Kahan summation
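A sketch of the compensated loop (this is classic Kahan summation, not Neumaier's improved variant, so it can still fail when a single term dwarfs the running sum):

```python
def kahan_sum(xs):
    """Compensated summation: tracks the low-order bits lost by each add."""
    s = 0.0
    c = 0.0                # running compensation (lost low-order bits)
    for x in xs:
        y = x - c          # apply previously lost bits
        t = s + y          # big + small: low bits of y may be lost here...
        c = (t - s) - y    # ...recover exactly what was lost
        s = t
    return s

data = [0.1] * 10          # naive sum gives 0.9999999999999999
```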

Higham: a Kahan example where FMA computation of a determinant may change sign from the non-FMA version - do not use FMA indiscriminately.

triangle area - Boldo paper nice?

higham ch 1: sample variance, quadratic,

[Priest91], citing Milenkovic, line segment to line segment intersection. If initial floats have $p_1$ precision, fully correct intersection needs $2p_1 +1$ precision.

Compute $p(x)=(x-10)^9$ around $x=10$ both in this form and in expanded polynomial form, and graph to show the error.
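A numeric version of that experiment (the coefficients are the binomial expansion of $(x-10)^9$; near $x=10$ the true value is around $10^{-27}$, far below the rounding noise of the expanded form):

```python
# (x - 10)^9 expanded: sum_k C(9,k) x^k (-10)^(9-k), highest power first
coeffs = [1, -90, 3600, -84000, 1260000, -12600000,
          84000000, -360000000, 900000000, -1000000000]

def horner(cs, x):
    """Evaluate a polynomial given coefficients from highest power down."""
    b = 0.0
    for ck in cs:
        b = b * x + ck
    return b

x = 10.001
factored = (x - 10.0) ** 9    # about 1e-27, nearly full relative accuracy
expanded = horner(coeffs, x)  # rounding noise far larger than the true value
```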

horner method - stable poly evaluation?

Compute a sum for pi from large values down, and from small values up, and show the difference (summing small-to-large is more accurate).

Computing $1-\cos(x)$ near 0 has bad relative error; use the equivalent $\sin^2(x)/(1+\cos(x))$ to get better answers, which removes the cancellation. Or expand in a power series, which allows controlling the error directly.
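A sketch of the rewrite; near zero the direct form loses essentially all relative accuracy (the true value at $x=10^{-8}$ is about $5\times10^{-17}$), while the rewritten form keeps it:

```python
import math

def one_minus_cos(x: float) -> float:
    """1 - cos(x) without catastrophic cancellation near x = 0,
    via the identity 1 - cos(x) = sin(x)^2 / (1 + cos(x))."""
    s = math.sin(x)
    return s * s / (1.0 + math.cos(x))

direct = 1.0 - math.cos(1e-8)   # cancellation: almost no correct digits
better = one_minus_cos(1e-8)    # true value is about 5e-17
```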

A geometric ray intersection algorithm that used infinity as part of the correct answer.

[pbrt] Compute the error of $((a+b)+c)+d$ versus $(a+b)+(c+d)$ - there is a tradeoff.

[pbrt] Derives conservative ray-bounds intersections and ray-triangle intersections, robust spawned rays, etc.

TODO - put simple code examples here?

Number of decimal digits needed to round-trip a binary float - the proof is simple; see Goldberg.

Integers you can store exactly in float32: the significand has 24 bits (23 stored plus an implicit leading 1), so every integer up to $2^{24}=16777216$ is exact, and $2^{24}+1$ is the first integer that is not - it rounds to $2^{24}$.
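This is easy to check by round-tripping through a 4-byte float with the struct module (the helper name is mine):

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32, then widen back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# 2^24 is exact; 2^24 + 1 is halfway between neighbors and rounds to even;
# above 2^24 the spacing between float32 values is 2.
```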

Language support

C++: max, min, epsilon, isnan, isinf, etc. (std::numeric_limits, <cmath>). C#, others?

memcpy is the only portable, safe way to move a float to/from an integer in C/C++ - type punning through unions is not guaranteed in C++ (it is allowed in C), and pointer casts violate strict aliasing. C++20 adds std::bit_cast for exactly this.

Things can change between release and debug builds, compiler changes, platform changes, OS changes, etc. Check your environment, or write code that carefully tests your assumptions.


Things to merge

  • Don’t call it the mantissa - that’s a misnomer (see Knuth). Call it the fraction (or significand).
  • Image of float density - the density halves at each power of 2; about half of all positive floats lie in $[0,1)$.

Z3 proof IEEE

show examples of changing rounding modes and testing code.

mpfr.org - code, GNU, etc.

Since some real numbers have much larger magnitude than anything representable with a fixed-size exponent, TODO - infinities, overflow, …

scan for ideas https://news.ycombinator.com/item?id=9837342

Some hardware flushes denormals (subnormals) to zero for speed - this loses precision and can cause problems.


TODO - link to online calculator to visualize

TODO - flesh out.

TODO - “floating point or floating-point” - google, make consistent

Notes on floating point math, theorems, algorithms

float32 range: smallest positive normal about 1.2 * 10^-38 (subnormals reach down to about 1.4 * 10^-45), largest about 3.4 * 10^38.

Some common values seen in debugging: 1.0 = 0x3F800000, 2.0 = 0x40000000, -1.0 = 0xBF800000, +inf = 0x7F800000. More?
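These are easy to confirm by viewing a float32's bit pattern; a sketch using struct (the helper name is mine):

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to float32, as an unsigned 32-bit integer."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

# Sign bit, 8 exponent bits (biased by 127), 23 fraction bits:
# 1.0 has exponent field 127 = 0x7F, fraction 0 -> 0x3F800000.
```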

float32 is often much faster than float64 and takes half the memory, which helps caches; float16 and bfloat16 even more so.

Relative precision: float32 2^-23 ~ 1.2 * 10^-7, about 7 decimal digits; float64 2^-52 ~ 2.2 * 10^-16, about 15-16 decimal digits.

IEEE implementations carry extra bits during arithmetic: a guard bit, a round bit, and a sticky bit (the OR of all lower-order bits shifted off); together these suffice to round every operation correctly.

TODO - endianess? copies into integer for C/C++

Testing and evaluation

Software packages, results


[Goldberg94] “What Every Computer Scientist Should Know About Floating-Point Arithmetic,” Goldberg, 1994

[Knuth] “Seminumerical Algorithms,” The Art of Computer Programming Vol. 2, Knuth

[Priest91] “Algorithms for Arbitrary Precision Floating Point Arithmetic,” Priest, 1991

[Muller+10] “Handbook of Floating-Point Arithmetic,” Muller et al., 2010

[Muller06] “Elementary Functions, Algorithms and Implementation,” 2nd Ed., Muller, 2006

[Higham02] “Accuracy and Stability of Numerical Algorithms,” 2nd Ed., Higham, 2002

[IEEE-754-2019] “IEEE Standard for Floating-Point Arithmetic,” 2019

[Kahan97] “Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic,” Kahan, 1997

[Arnold17] “An Introduction to Floating-Point Arithmetic and Computation,” Jeff Arnold, CERN slides, 2017

[Overton01] “Numerical Computing with IEEE Floating Point Arithmetic,” Michael Overton, 2001

[pbrt] “Managing Rounding Error,” Physically Based Rendering (PBRT)

Look through Kahan's other writings.

a few floating point books exist


[Agner] https://www.agner.org/optimize/blog/read.php?i=1012

[Wiki] https://en.wikipedia.org/wiki/IEEE_754, Feb 2021

some testing http://www.math.utah.edu/~beebe/software/ieee/

[Half] https://en.wikipedia.org/wiki/Half-precision_floating-point_format

[Dawson] TODO - mention this site as a ton of good stuff: https://randomascii.wordpress.com/2017/06/19/sometimes-floating-point-math-is-perfect/

[Muller] “On the definition of ulp(x)”, 2005, http://www.ens-lyon.fr/LIP/Pub/Rapports/RR/RR2005/RR2005-09.pdf

TODO - investigate https://mortoray.com/2015/07/06/essential-facts-about-floating-point-calculations/

check https://news.ycombinator.com/user?id=brucedawson




Sites to look at



Good question:


Categories: Algorithms Numerical Math

Tags: Software

