Lomont.org

(Also lomonster.com and clomont.com)

Generating Uniform Random Numbers on [0,1]

Published

How does one generate uniform random numbers in [0,1] using IEEE 754 floating point formats? It's tricky.

Generating Uniform Random Numbers on [0,1]

Chris Lomont, March 2017

Common advice for generating uniform random numbers in [0,1] in many languages looks like this [1,2,3]:

double value = ((double)rand())/(RAND_MAX);

This however does not generate many possible floating-point numbers in \([0,1]\), leaving gaps - there are many floating-point numbers that should occur but cannot due to this method. To understand why, you need to understand how floating-point numbers are (usually) stored on computers.

Throughout we'll assume the random number generators produce uniformly random integers in a known range.

IEEE 754 floating-point format

The most common way to store floating-point numbers is the IEEE 754 format, which has 1 sign bit, \(e\) exponent bits, and \(m\) mantissa bits. I'll write \(S\) for the sign bit, \(E\) for the exponent bit pattern and \(M\) for the mantissa bit pattern. For a 64 bit double used in C, C#, JavaScript, and many similar languages, it has this layout:

sign exponent mantissa
1 bit 11 bits 52 bits

When \(E\neq 0\) and \(E\neq 2047\), this represents the floating point value \(f=(-1)^S \times 2^{E-bias}\times 1.M\).

When \(E=0\) and \(M=0\), then \(f=0\).

These two cases are called normal numbers, and most numbers you'll use fall in this category.

When \(E=0\) and \(M\neq 0\), then the value \(f\) is \(f=(-1)^S \times 2^{1-bias}\times 0.M\).

This case is called a subnormal number. It lets you use smaller numbers than naively using the above format for all bit patterns. Note the implicit 0 in the mantissa and the 1 in the exponent, instead of 0. These ensure that all values behave nicely.

When \(E=2047\) and \(M=0\), the bit patterns represents signed \(\infty\).

When \(E=2047\) and \(M\neq0\), the bit pattern represents NaNs (Not a Number), used for error conditions.

The bias for a 64-bit IEEE 754 floating-point number is 1023, allowing positive and negative exponents. The \(1.M\) symbols mean there is an implied 1 bit, then a decimal point, then \(M\) is the 52 bits following the decimal. Thus if \(M=1010...000_2\), \(1.M\) is \(1+1\times\frac{1}{2}+0\times\frac{1}{4}+1\times\frac{1}{8}+0...0=1.625\). When a number is stored in this format the hardware determines position of the leading 1 bit, from that computes the exponent bits \(E\), and if \(E>0\), removes the implicit leading bit, and stores the next 52 bits for \(M\). If \(E>2046\), then one of the infinities is stored. If \(E=0\), then subnormal numbers are stored.

For the rest of this article we'll use the non-negative floating-point values, i.e., \(S=0\).

The problem

So, now that you understand the format, why is the original code bad? The first problem is rand() in many languages does not generate enough values; in many C environments RAND_MAX was (is?) 32767, meaning there are only 32767 possible output values. This is fine for a few uses, but if you really need a large number of floating-point values with more variety for modern simulations, you need more possible output values.

How many IEEE 754 floating-point values are there in \([0,1]\)? For each exponent in \(E=0,E=1,...,E=bias-1\) there are \(2^{52}\) distinct floating-point values, and the final value \(f=1\) is the exponent \(E=bias\) with \(M=0\). This is \(1+1023\times 2^{52}\) which is almost \(2^{62}\). So there is no way to generate every possible floating-point number in [0,1] with the appropriate odds using less than a 62 random bits.

So the next attempt might be something like

double value = (double)Rand64()/(pow(2,64));

The problem with this (and all such attempts), is this assumes the possible floating-point values are equally spaced, which they are not. For example, when \(E=1\), the lowest bit of \(M\) is a change of \(2^{-1074}\), whereas when \(E=1000\), the lowest bit of \(M\) is a change of \(2^{-75}\).

Thus all such attempts will fail at hitting all possible floating-point values in \([0,1]\), lowering the quality of the random number generation for demanding simulations.

The solution

To generate all possible floating-point bit patterns with the appropriate probabilities, consider generating arbitrarily long random bit sequences \(0.b_1b_2b_3...\), and rounding to the nearest representable floating-point numbers. This terminates since the smallest possible non-zero positive IEEE 754 64-bit number is \(E=0\) and \(M=1\), which is \(2^{1-1023-51}=2^{-1074}\).

To do rounding, note that the probability of a trailing infinite sequence of 0s or 1s is each zero. (Note this is not the same as saying that it cannot happen; it is saying the probability of the infinite sequence happening is zero, which are two distinct things. As long as your simulation uses a finite number of random values, which seems pretty likely, then this is mathematically correct.).

To generate numbers in \([0,1)\), round down; to generate in \((0,1]\), round up, and for our case \([0,1]\), we'll round up if the first bit past \(M\) is 1, else we'll round down. This will generate all possible floating-point numbers, with the correct probabilities.

In pseudo code (note this does not work in JavaScript since there is no 64 bit integer type) this is (we assume rand64 returns a uniform random 64 bit integer, over the whole range. Using other sources can be made to fit by concatenating bits as needed.):

// get exponent and first part of mantissa
i32 e = -1 + 64    // want e = -1, -64 happens always below
uint64 m = 0       // this must be at least 64 bits
do
    m = rand64()    // get 64 random bits
    e = e - 64      // we used 64 bits
    if (e < -1074) 
        return 0.0   // will round to 0.0
    while m == 0    // if all bits are so far 0, continue till we get a 1
   
// pos will be the index of the leading 1 in the mantissa 
i32 pos = 63       // start at highest bit
while ((m >> pos) & 1) == 0
    pos = pos - 1
   
pos = 63 - pos     // invert for ease of use. pos is now the # of leading 0s 

if (pos != 0)
    // we need more bits: shift m, and adjust exponent
    m = (m << pos) | (rand64() >> (64-pos))
    e = e - pos

// at this point, e is the exponent value 
// m has a 1 in the highest bit, then 63 more bits, 
// enough to pick the proper 52 for the mantissa and round correctly

if (e > -1023)
    // normal numbers
    m = m >> 10              // shift unneeded bits out, 54 left
    m = (m >> 1) + (m & 1)   // shift and round
    if ((m>>53) == 1)       
        // rounding caused overflow past implicit leading 1, fix exponent, m fixed below
        e = e + 1

else
    // subnormal numbers, want top 52 bits of m
    i32 d = -1023 - e        // amount e is past the smallest e = -1023

    m = m >> (11 + d)        // want 52 bits left, minus the excess, plus 1 for carry
    m = (m >> 1) + (m & 1)   // shift and round
    e = -1023                // where we are
    if ((m>>52) == 1)       
        // rounding caused overflow into implicit leading 0, fix exponent, m fixed below
        e = e + 1

uint64 mask = 1              // a single one bit
mask = (mask << 52) - 1      // done this way to avoid language specific shift issues
m = m & mask                 // mask bits 52-63 out. m now ready

// now e is the final exponent and m is the final 52 bit mantissa, with no leading 1
   
// create final 64 bit representation of the final value
int64 fi 
fi = e + 1023                // exponent
fi = fi << 52                // room for m
fi = fi | m                  // final value

// language specific conversion trick
double f = convert fi   

// some example ways to convert are
// C/C++ :  memcpy(&f,&fi,8)  // other methods often use undefined behavior
// C#    :  f = BitConverter.Int64BitsToDouble(fi)
// Java  :  f = Double.longBitsToDouble(fi)

return f

Finally, specific language tricks/support allow shortcuts for some of these steps.

More problems

Do the same analysis for the interval \([a,b]\), \(a\leq b\) arbitrary. It's messy but doable from the techniques explained above.

[1] http://stackoverflow.com/questions/1340729/how-do-you-generate-a-random-double-uniformly-distributed-between-0-and-1-from-c

[2] http://stackoverflow.com/questions/6218399/how-to-generate-a-random-number-between-0-and-1

[3] http://stackoverflow.com/questions/9878965/rand-between-0-and-1

Categories: Software Math

Tags: Software C# C++ Math IEEE 754

Comments

Comment is disabled to avoid unwanted discussions from 'localhost:1313' on your Disqus account...