Lomont.org

(Also lomonster.com and clomont.com)

Color Conversions

This is a quick note on converting byte-valued colors in 0-255 to and from floating-point colors in 0-1.

The Problem

A common (yet incorrect) method (shown in C++) looks like this:

#include <cstdint> // uint8_t

// byte color in 0-255 to 32-bit float in 0-1
float ColorU8ToF32_BAD(uint8_t colorB)
{
    return colorB/255.0f;
}

uint8_t ColorF32ToU8_BAD(float colorF)
{
    return (uint8_t)(colorF*255.0f+0.5f);    
}

This suffers from significant bias away from the end colors 0 and 255, and the +0.5f “trick” to round values does not always work, which can bite you. Some problems with this style of conversion:

  1. The end values 0 and 255 each cover only half as much of the [0,1] range as the other byte values, making this method highly non-uniform. This bias reduces image quality.
  2. The common method to round a float to an integer, (int)(floatVal + 0.5), can fail to round correctly. First, casting in C/C++ truncates towards 0, which mis-rounds negative values (not a problem here). Second, floating-point precision makes some values round incorrectly: applying the above to the float value 0.49999997f yields 1, not 0. (This value is the predecessor of 0.5f, obtainable via the nextafter and nexttoward routines in <cmath> from C++11 on.)
  3. Values slightly outside the range [0,1], common in routine calculations, cause wrapping instead of clamping (strictly speaking, converting an out-of-range float to uint8_t is undefined behavior in C++). Pre-clamping floats to [0,1] suffices, but it’s better to make this safer since users often don’t clamp.
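
The first two problems can be checked at compile time. The sketch below restates the conversion above as a constexpr function (the name NaiveF32ToU8 is hypothetical, chosen only for this illustration). It assumes IEEE 754 floats evaluated in single precision with the default round-to-nearest-even mode; platforms that evaluate float expressions in extended precision may behave differently.

```cpp
#include <cstdint> // uint8_t

// Same arithmetic as ColorF32ToU8_BAD above, made constexpr so it can be
// checked with static_assert. The name is hypothetical.
constexpr uint8_t NaiveF32ToU8(float colorF)
{
    return static_cast<uint8_t>(colorF * 255.0f + 0.5f);
}

// Problem 1: byte 0 covers only [0, 0.5/255), about 0.00196 wide ...
static_assert(NaiveF32ToU8(0.0019f) == 0);
static_assert(NaiveF32ToU8(0.0020f) == 1);
// ... while byte 1 covers [0.5/255, 1.5/255), about 0.00392 wide.
static_assert(NaiveF32ToU8(0.0058f) == 1);
static_assert(NaiveF32ToU8(0.0060f) == 2);

// Problem 2: 0.49999997f is the float just below 0.5f, so it should round
// to 0. The sum 0.49999997f + 0.5f is not representable and rounds up to
// exactly 1.0f, so the truncating cast produces 1.
static_assert(static_cast<int>(0.49999997f + 0.5f) == 1);
```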

The Solution

To continue, here are some important rules for how numerics work in C/C++ (and related languages). We’ll design the final code to be explicit so that it can be ported elsewhere correctly.

  1. Floating point values on most architectures are IEEE 754 format. All floats in this note are such.
  2. Real numbers are called representable if they can be stored exactly as floating point. Representable values must be expressible as a finite sum of powers of 2 (and satisfy some other technical relationships). For example, 0.3, 1.9, and 9.1 are not representable, while 0.5, 0.75, and 23.125 are.
  3. Casting from a float to an integer truncates towards zero (this is true in most languages, such as C/C++, C#, Java, etc.)
  4. Addition, subtraction, multiplication, and division of representable numbers, as guaranteed by the IEEE 754 standard, are computed as if with infinite precision and then rounded to the limited number of bits in the format. Rounding ties can be broken in one of several ways, so I’ll be careful to avoid mistakes here.
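
Rules 2 and 3 are easy to check at compile time. A minimal sketch, assuming float is IEEE 754 binary32 and double is binary64; the helper name IsExactInFloat is hypothetical. A value that is representable as a float survives the trip through float unchanged, while a non-representable one comes back slightly different.

```cpp
// True when x can be stored in a float without rounding: converting to
// float and back to double returns the original value. Hypothetical helper.
constexpr bool IsExactInFloat(double x)
{
    return static_cast<double>(static_cast<float>(x)) == x;
}

// Rule 2: finite sums of powers of 2 (within float precision) are exact.
static_assert(IsExactInFloat(0.5));
static_assert(IsExactInFloat(0.75));
static_assert(IsExactInFloat(23.125));
static_assert(!IsExactInFloat(0.3));

// Rule 3: float-to-integer casts truncate towards zero, not towards
// negative infinity and not to nearest.
static_assert(static_cast<int>(1.9f) == 1);
static_assert(static_cast<int>(-1.9f) == -1);
static_assert(static_cast<int>(-0.7f) == 0);
```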

Here is a better method of converting colors. First, some desires/requirements:

  1. Uniform representation: we want floats in [0,1] to be equally likely to end up as 0,1,2,…,255. Since the number of representable values in [0,1] is not a multiple of 256, there must be some bias, but we will require it to be minimal. The method above makes the mistake of giving 0 and 255 half as many values as 1,2,…,254. This is bad.
  2. Tolerance for some numerical error. Because floating-point values have limited precision, computing with them produces results that are close to, but not quite, the mathematical truth, which leads to rounding errors or comparison errors.
  3. TODO

To match up the interval [0,1] with the colors {0,1,…,255}, treat the latter as the interval [0,256) (open ended) and for a moment treat the floats as [0,1) open ended. This treats the single value 1.0f as off the end (to be handled), and now there are nice intervals for each piece:

[0,1) [1,2) [2,3) ... [255,256) <-> [0,a) [a,2a) [2a,3a) ... [255a, 256a)

where a = 1/256. This has the nice property that a is representable, and all the interval endpoints are representable. So there is no error in moving between these, and all are equal sized.
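
The choice of 256 rather than 255 is what keeps the endpoints exact. As a small compile-time check, assuming IEEE 754 float and double (the helper names QuotientIsExact and AllEndpointsExact are hypothetical): dividing by 256 only changes the exponent, while dividing by 255 has to round.

```cpp
// True when the float quotient n/d matches the double quotient, i.e. the
// float division did not have to round away bits that double kept.
// Hypothetical helper for illustration.
constexpr bool QuotientIsExact(int n, int d)
{
    return static_cast<double>(static_cast<float>(n) / static_cast<float>(d))
        == static_cast<double>(n) / static_cast<double>(d);
}

// Every interval endpoint n/256, n = 0..256, is representable.
constexpr bool AllEndpointsExact()
{
    for (int n = 0; n <= 256; ++n)
        if (!QuotientIsExact(n, 256)) return false;
    return true;
}

static_assert(AllEndpointsExact());
static_assert(!QuotientIsExact(1, 255)); // 1/255 is not representable
```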

To map colors, let’s map midpoints to midpoints, and explicitly list each operation for careful analysis.

#include <cstdint>   // uint8_t
#include <algorithm> // std::clamp since C++ 17
#include <cmath>     // std::floor, becomes constexpr in C++ 23

// byte color in 0-255 to 32-bit float in 0-1
float ColorU8ToF32_GOOD(uint8_t colorB)
{
    float f1 = (float) colorB; // convert integral to float, exact
    float f2 = f1 + 0.5f;      // bias, move interval midpoints to midpoints, exact
    float f3 = f2 / 256.0f;    // scale to 0-1 range, exact (can be made * of inverse)
    return f3;
}

uint8_t ColorF32ToU8_GOOD(float colorF)
{
    float f1 = colorF * 256.0f; // scale, is exact for floats near [0,1]
    float f2 = std::floor(f1);  // maps [n,n+1) to n; replaces the -0.5 bias and rounding step
    float f3 = std::clamp(f2, 0.0f, 255.0f); // handles 1.0f and slightly out-of-range inputs
    return (uint8_t)f3;
}

Let’s analyze:

  1. In ColorU8ToF32_GOOD, f1, f2, and f3 are all exact. This is nice since we can reverse the conversion and get the exact integer back, with zero error. You can even compare the results bitwise as floats, which is usually a bad numerical idea. But here the conversion is perfect (and reversible).
  2. The other direction needs some care. The idea is to scale the [0,1] interval to the [0,256] interval, and map each interval [n,n+1) to the integer n. Using clamp ensures that slightly out-of-bounds float values (common from computations) and the single value 1.0f do not end up out of bounds. It’s important to clamp while the range is wide. Simply casting the result to uint8_t will incorrectly map 1.0f to the byte value 0 if care is not taken.
  3. Note that compilers will compile the verbose code above as tightly as a terse version.
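
The order of the clamp and the narrowing cast can be illustrated with a small sketch. It goes through int so that every conversion is well defined, and it uses a truncating cast rather than std::floor so the checks can run at compile time (the two agree for non-negative inputs). The names ScaledBin, WrapToU8, and ClampToU8 are hypothetical.

```cpp
#include <cstdint>   // uint8_t
#include <algorithm> // std::clamp (C++17)

// Scale [0,1] to [0,256] and truncate to an integer bin index.
// Note that 1.0f lands on 256, one past the last valid bin.
constexpr int ScaledBin(float colorF)
{
    return static_cast<int>(colorF * 256.0f);
}

// Narrowing first wraps modulo 256, so bin 256 becomes 0.
constexpr uint8_t WrapToU8(float colorF)
{
    return static_cast<uint8_t>(ScaledBin(colorF));
}

// Clamping while the range is still wide keeps 1.0f at 255.
constexpr uint8_t ClampToU8(float colorF)
{
    return static_cast<uint8_t>(std::clamp(ScaledBin(colorF), 0, 255));
}

static_assert(ScaledBin(1.0f) == 256);
static_assert(WrapToU8(1.0f) == 0);    // full intensity wraps to zero
static_assert(ClampToU8(1.0f) == 255); // full intensity stays at 255
static_assert(ClampToU8(-0.01f) == 0);
```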

Final Code

For a last gain, in C++, we can make the conversions constexpr. Since std::floor is not constexpr until C++23, I’ll use another method. The C++ standard guarantees that casting a floating-point value to an integral type truncates towards zero. This differs from floor for negative values, but any negative values arising from numerical error are fine to truncate towards 0, since they get clamped to 0 anyway. This leads to the best C/C++ version:

// Chris Lomont, 2019
#include <cstdint>   // uint8_t
#include <algorithm> // std::clamp since C++ 17

// Code to convert between byte colors in {0,1,...,255}
// and float32 colors in [0,1]. These methods are uniform, 
// stable, and high quality.

// byte color in 0-255 to 32-bit float in 0-1
constexpr float ColorU8ToF32(uint8_t colorB)
{
    constexpr float inv256 = 1.0f/256.0f; // exact
    return (colorB + 0.5f) * inv256; // exact
}

constexpr uint8_t ColorF32ToU8(float colorF)
{
    int32_t i32 = static_cast<int32_t>(colorF * 256.0f); // scale, truncate towards 0
    return static_cast<uint8_t>(std::clamp(i32, 0, 255));
}

// some tests to check it all
// static_assert only compiles if the constexpr call can be evaluated at compile time
static_assert(ColorU8ToF32(127)==(127.0f+0.5f)/256.0f);
static_assert(ColorF32ToU8(0.0f) == 0);
static_assert(ColorF32ToU8(0.5f-1.0f/1024.0f) == 127);
static_assert(ColorF32ToU8(0.5f) == 128);
static_assert(ColorF32ToU8(1.0f) == 255);
static_assert(ColorF32ToU8(1.01f) == 255);
static_assert(ColorF32ToU8(-0.01f) == 0);

Making the functions constexpr allows putting in compile-time testing via use of static_assert.
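
Because both directions are constexpr, the reversibility claim can also be checked exhaustively at compile time. A sketch, with the two conversions repeated under the hypothetical names U8ToF32 and F32ToU8 so the block compiles on its own (constexpr loops need C++14 or later):

```cpp
#include <cstdint>   // uint8_t, int32_t
#include <algorithm> // std::clamp (C++17)

// Copies of the two conversions above, renamed so this block stands alone.
constexpr float U8ToF32(uint8_t colorB)
{
    return (colorB + 0.5f) * (1.0f / 256.0f);
}

constexpr uint8_t F32ToU8(float colorF)
{
    int32_t i32 = static_cast<int32_t>(colorF * 256.0f);
    return static_cast<uint8_t>(std::clamp(i32, 0, 255));
}

// Convert every byte to float and back; true if all 256 survive unchanged.
constexpr bool AllBytesRoundTrip()
{
    for (int b = 0; b < 256; ++b)
        if (F32ToU8(U8ToF32(static_cast<uint8_t>(b))) != b) return false;
    return true;
}

static_assert(AllBytesRoundTrip());
```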

Also, compilers will likely apply the division-to-multiply trick, and you can even hand-code the conversion to remove all multiplies and divisions altogether by manipulating the exponent directly, for example with the C++11 scalbn-style functions.
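
As a sketch of the exponent approach: std::scalbn(x, n) from <cmath> scales x by the floating-point radix raised to n, which is 2^n on IEEE 754 binary formats (assumed here). The function names below are hypothetical, and since std::scalbn is not constexpr before C++23 these are ordinary runtime functions.

```cpp
#include <cstdint>   // uint8_t
#include <algorithm> // std::clamp (C++17)
#include <cmath>     // std::scalbn (C++11)

// (colorB + 0.5) * 2^-8, the same value as dividing by 256.
inline float ColorU8ToF32_Scalbn(uint8_t colorB)
{
    return std::scalbn(colorB + 0.5f, -8);
}

// colorF * 2^8, then truncate and clamp as in the final code.
inline uint8_t ColorF32ToU8_Scalbn(float colorF)
{
    int i = static_cast<int>(std::scalbn(colorF, 8));
    return static_cast<uint8_t>(std::clamp(i, 0, 255));
}

// Runtime check: the scalbn version agrees with plain arithmetic for
// every byte value, and the round trip returns the original byte.
inline bool ScalbnMatchesArithmetic()
{
    for (int b = 0; b < 256; ++b) {
        uint8_t byte = static_cast<uint8_t>(b);
        float f = ColorU8ToF32_Scalbn(byte);
        if (f != (b + 0.5f) / 256.0f) return false;
        if (ColorF32ToU8_Scalbn(f) != byte) return false;
    }
    return true;
}
```

Whether this is actually faster than a single multiply depends on the compiler and target, so it is worth measuring before adopting it.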

As a final note, for debugging such things, it’s useful to know a few C++ functions:

  1. to_chars and from_chars from the header <charconv> are especially useful. They convert floating-point values to and from char arrays, and are the only functions in the C++ library that are guaranteed to round-trip such values correctly, which means you can write a float out as text, read it back, and get the same float. Surprisingly, other methods in the standard do not guarantee this, and often fail. And, since it’s C++, even these methods have a significant gotcha: the round trip is not guaranteed across implementations!
  2. The other functions mentioned above, that provide manipulation of float internals and finding nearby floats, are useful for learning and testing.
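
A minimal sketch of such a round trip, assuming a standard library that ships the floating-point overloads of to_chars and from_chars (older libraries provided only the integer ones). The helper name RoundTripsThroughText is hypothetical.

```cpp
#include <charconv>     // std::to_chars, std::from_chars (C++17)
#include <system_error> // std::errc

// Write a float as text, read it back, and report whether the exact same
// value came out. Hypothetical helper for debugging.
inline bool RoundTripsThroughText(float value)
{
    char buf[64];
    auto out = std::to_chars(buf, buf + sizeof buf, value);
    if (out.ec != std::errc{}) return false; // buffer too small

    float parsed = 0.0f;
    auto in = std::from_chars(buf, out.ptr, parsed);
    return in.ec == std::errc{} && in.ptr == out.ptr && parsed == value;
}
```

The characters between buf and out.ptr are not null-terminated, so printing them requires passing the length explicitly.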

As a final test, let’s run all possible float32s through the above routines and check that the ranges are all the same size. Note we cannot count how many floats hit each bin and expect the counts to be the same: there are many more representable values near 0 than away from 0. But checking the range is a good test.

// Final test
TODO
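
The exhaustive sweep is left as a TODO above. A cheaper check in the same spirit, which is not the exhaustive test described, is to probe only the edges of each bin using std::nextafter. Because the conversion never decreases as its input increases, matching edges imply every float in between lands in the same bin. The names below are hypothetical, and the conversion is repeated so the block stands alone.

```cpp
#include <cstdint>   // uint8_t, int32_t
#include <algorithm> // std::clamp (C++17)
#include <cmath>     // std::nextafter (C++11)

// Copy of the final float-to-byte conversion, renamed for this sketch.
inline uint8_t F32ToU8(float colorF)
{
    int32_t i32 = static_cast<int32_t>(colorF * 256.0f);
    return static_cast<uint8_t>(std::clamp(i32, 0, 255));
}

// Each byte n should own exactly [n/256, (n+1)/256): the lower endpoint
// maps to n, and the largest float below the upper endpoint still maps to n.
inline bool AllBinsHaveEqualRange()
{
    for (int n = 0; n < 256; ++n) {
        float lo = n / 256.0f;                      // exact
        float hi = (n + 1) / 256.0f;                // exact
        float lastInBin = std::nextafter(hi, 0.0f); // largest float < hi
        if (F32ToU8(lo) != n) return false;
        if (F32ToU8(lastInBin) != n) return false;
    }
    return true;
}
```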
