Lomont.org

(Also lomonster.com and clomont.com)

Accurate color conversions

This is a quick note on converting byte color values in 0-255 to and from floating-point colors in 0-1 in a way that avoids common errors.

Chris Lomont, June 2023

This is a quick note for accurate conversions between color values represented as a byte value in $\{0,1,\ldots,255\}$ and a floating-point value in $[0.0,1.0]$. Surprisingly, this is tricky to do well, and the most common methods found online suffer from significant problems. This result is something I’ve derived many times for gadgets (like Hypnocube stuff) over the years, and to avoid having to rederive it yet again, I decided to write it out here once and for all.

TL;DR

Use the C++ code at the bottom of this post, and you’re good to go. It avoids many pitfalls for this problem you’ll find on the web.

The Problem

Abstractly, we want to convert color values represented as integers, which is how colors are generally stored in image formats, to and from floating point values, which is how colors are frequently manipulated for image processing tasks. The most common case is byte values representing red, green, and blue, and sometimes alpha, in the integer range $\{0,1,\ldots,255\}$, converted to and from the floating point interval $[0.0,1.0]$. For this note I’ll write floating point values using a decimal point, and integers without one. Also note the integer range can abstractly be $N$ color values $0$ through $N-1$, and the analysis below still works. Here $N=256$.

We’ll start off with one requirement for color conversions, and add more as we find them.

  1. Roundtripping: going roundtrip integer to float to integer must return the original integer value.

The bad method

The most common advice is to convert integer $i$ to float $f$ with a simple $$ f = i/255.0 $$ This has the nice property that $0 \rightarrow 0.0$ and $255\rightarrow 1.0$ which seems right. The next question is how to convert floating values back, and this leads to trouble.

Since the first direction used a divide by $255.0$, it seems reasonable to do what most places will advise, and multiply by $255.0$ to go the other way. So we’ll consider a first step as $$ \hat{f} = f\times 255.0 $$ $\hat{f}$ has range $[0.0, 255.0]$. How do we convert these to integers? We can try rounding, floor, ceiling, or other methods.

Floor maps half-open intervals to integers as follows (except the last, which is a single point): $$ \begin{align*} [0.0,1.0) & \rightarrow 0 \\ [1.0,2.0) & \rightarrow 1 \\ [2.0,3.0) & \rightarrow 2 \\ & \ldots \\ [254.0,255.0) & \rightarrow 254 \\ \{255.0\} & \rightarrow 255 \end{align*} $$ And right away you see the problem. Each integer $n$ comes from a floating point range $[n,n+1)$ except the largest integer $255$, which comes from a single point. This non-uniformity is bad since it means operations on images as floating point will bias away from the value 255.

Using ceiling has the same problem, except it biases away from $0$. Rounding (we’ll pick ties round up; any rounding mode will have similar issues) leads to $$ \begin{align*} [0.0,0.5) & \rightarrow 0 \\ [0.5,1.5) & \rightarrow 1 \\ [1.5,2.5) & \rightarrow 2 \\ & \ldots \\ [253.5,254.5) & \rightarrow 254 \\ [254.5,255.0] & \rightarrow 255 \end{align*} $$ This is better, except the intervals that map to the integers $0$ and $255$ are half the size of those for all the other integers. This means that image processing will lose some representation of the end colors, which is bad. This leads to the second requirement:

  2. Uniformity: each integer should come from a uniform size floating point range (as much as possible).

We’ll call this method the BAD METHOD (it’s by far the most common on the web) $$ \begin{align*} f &= \frac{i}{255.0} \\ i &= round(f\times 255.0) \end{align*} $$
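The half-width end bins are easy to see by counting. Here is a minimal sketch, assuming evenly spaced sample points in $[0.0,1.0)$; the names `BadFloatToByte` and `BadMethodHistogram` are made up for this illustration and are not part of the final code.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// BAD METHOD inverse: round(f * 255.0), clamped into 0..255.
inline std::uint8_t BadFloatToByte(float f)
{
    const long r = std::lround(f * 255.0f);
    return static_cast<std::uint8_t>(r < 0 ? 0 : (r > 255 ? 255 : r));
}

// Count how many of `samples` evenly spaced floats in [0,1) land on each byte.
// Sample k sits at (k + 0.5) / samples, so no sample falls exactly on a tie.
inline std::array<int, 256> BadMethodHistogram(int samples)
{
    std::array<int, 256> counts{};
    for (int k = 0; k < samples; ++k)
    {
        const float f = (k + 0.5f) / static_cast<float>(samples);
        ++counts[BadFloatToByte(f)];
    }
    return counts;
}
```

Calling `BadMethodHistogram(255 * 200)` puts 200 samples on each interior byte but only 100 on each of bytes 0 and 255, which is the non-uniformity described above.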

The better method

You can try lots of other mappings, and soon you’ll realize that the multiply by $255.0$ is the culprit. You need 256 equal-sized “bins”, not 255. Split $[0.0,1.0)$ into $256$ bins, each of width $\Delta = \frac{1.0}{256.0}$. Note that for now I removed the single end value $1.0$. Similarly, consider the integers as 256 bins, each of the form $[n,n+1)$. Now let’s map the center of each integer bin to the center of the corresponding floating point bin. We could pick different maps, but these center maps have nice properties. This map looks like: $$ f = \frac{i+0.5}{256.0} $$ A reasonable inverse would be $$ i = f\times 256.0 - 0.5 $$ and this would map exactly back, even in IEEE 754 floating point. But the float values $0.0$ and $1.0$ would map to $-0.5$ and $255.5$, outside the legal range $\{0,1,\ldots,255\}$, so we need something more careful. We want the map $$ [n\Delta,(n+1)\Delta) \rightarrow n $$ Multiplying the interval by $256.0$ leaves the desired $[n,n+1) \rightarrow n$, so we can use the inverse $$ i = \lfloor{f\times 256.0}\rfloor $$ where $\lfloor t \rfloor$ is the floor function. The single float value $1.0$ maps to 256, outside the legal range, so we clamp the result to $[0,255]$. Be careful to catch this case.

This method is found on many websites that have reached this far in the analysis, and it works far better than the first method. It is uniform, and because each integer is sent to the middle of a range and the whole range is sent back to that integer, small errors in floating point calculations still land on the intended pixel value. This need for fuzzy safety, and for catching errors off either end, is so important in practice that we’ll add a requirement:

  3. Robust: the method must handle out of range correctly and be robust against small precision errors in roundtrip.

We’ll call this the BETTER METHOD $$ \begin{align*} f &= \frac{i+0.5}{256.0} \\ i &= clamp(floor(f\times 256.0),0,255) \end{align*} $$ But there is still a better method, one I have so far not seen anywhere. What is the issue with the above? Well, consider where 0 (darkest, least energy) and 255 (brightest, most energy) map to: $$ \begin{align*} 0 &\rightarrow \frac{0.5}{256.0} \neq 0.0 \\ 255 &\rightarrow \frac{255.5}{256.0} \neq 1.0 \end{align*} $$ Having the darkest, lowest byte value not map to the darkest, lowest floating point value causes some problems. Consider checking two images for differences: often the values are subtracted, then the difference is scaled up to increase its visibility. Having a nonzero value, then multiplying, will imply errors that should not be there. There are many little places where having this slightly larger than 0.0 “lowest value” will cause trouble, and the same happens for the slightly less than 1.0 “highest value”. So let’s add another requirement.

  4. Full-range: end points should map to endpoints.
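To make the BETTER METHOD and its endpoint issue concrete, here is a minimal C++17 sketch. The function names `BetterByteToFloat` and `BetterFloatToByte` are made up for this illustration. It uses a truncating cast in place of floor, which agrees with floor for non-negative inputs; negative inputs get clamped to 0 either way.

```cpp
#include <algorithm> // std::clamp
#include <cstdint>

// BETTER METHOD: byte -> center of its floating point bin of width 1/256.
constexpr float BetterByteToFloat(std::uint8_t i)
{
    return (static_cast<float>(i) + 0.5f) / 256.0f;
}

// BETTER METHOD inverse: floor(f * 256), clamped to 0..255.
// The cast truncates toward zero, which equals floor for f >= 0.
constexpr std::uint8_t BetterFloatToByte(float f)
{
    const int n = static_cast<int>(f * 256.0f);
    return static_cast<std::uint8_t>(std::clamp(n, 0, 255));
}

// Bin centers are exact in float, but the ends of the byte range
// do not reach the ends of [0,1]:
static_assert(BetterByteToFloat(0) == 0.001953125f);   // 0.5/256, not 0.0
static_assert(BetterByteToFloat(255) == 0.998046875f); // 255.5/256, not 1.0
```

The round trip works for every byte, but black arrives as $1/512$ instead of $0.0$, so any later gain applied to the float image lifts it further off zero.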

The best method

Here is a method that meets all these requirements, the BEST METHOD $$ \begin{align*} f &= \frac{i}{255.0} \\ i &= clamp(floor(f\times 256.0),0,255) \end{align*} $$ So, here are the requirements repeated that we must meet:

  1. Roundtripping: going roundtrip integer to float to integer must return the original value.
  2. Uniformity: each integer should come from a uniform size floating point range (as much as possible).
  3. Robust: the method must handle out of range correctly and be robust against small precision errors in roundtrip.
  4. Full-range: end points should map to endpoints.

Let’s prove these are met.

Let $\Delta=\frac{1}{256}$. For $i\in \{0,1,2,\ldots,254\}$ let $I_i$ be the half open interval $[i \Delta,(i+1)\Delta)$, and for $i=255$ let $I_{255}=[255\Delta,256\Delta]$, a closed interval. Then I claim

  1. $i$ maps into $I_i$. $0$ maps to the left edge of $I_0$, $255$ maps to the right edge of $I_{255}$, and for $i=1,2,\ldots,254$, $i$ maps away from the endpoints of $I_i$. In fact, $i$ maps to the point a distance $\frac{i}{255\times256}$ in from the left edge of $I_i$.
  2. $I_i$ maps to $i$.

If these are true, then the requirements are met:

  1. Roundtripping: $i\rightarrow I_i\rightarrow i$ for $i = 0,1,2,\ldots,255$ is clear.

  2. Uniformity: The $I_i$ are disjoint, each is contained in $[0.0,1.0]$, and they cover $[0.0,1.0]$, and each has width $\Delta$.

  3. Robust: each $i$ maps to a point that has a neighborhood that maps back to $i$. The endpoints are covered by the clamp(_,0,255) part, and every other value of $i$ maps a distance $\frac{i}{255.0\times256.0}>0$ into its interval.

  4. Full-range: this is easily hand checked for $i=0$ and $i=255$.

Proof of claims:

  1. $\frac{i}{256.0}\leq\frac{i}{255.0}\leq\frac{i+1}{256.0}$ gives $i\rightarrow I_i$ (clear denominators, simplify). The distance from the left-hand side of the interval to where $i$ maps is $\frac{i}{255.0}-\frac{i}{256.0}=\frac{i}{255.0\times256.0}$, so the image of $i$ is not an endpoint for $i\neq 0,255$.

  2. For $i<255$, $$ \begin{eqnarray*} clamp(floor(I_i\times256.0),0,255)&=&clamp(floor([i,i+1)),0,255)\\ &=& clamp(i,0,255)\\ &=&i \end{eqnarray*} $$ For $i=255$, floor returns 255 or 256, and the clamp maps both to 255.
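The “clear denominators” step in the first claim can be spelled out as follows, for $0\leq i\leq 255$: $$ \begin{align*} \frac{i}{256}\leq\frac{i}{255} &\iff 255\,i\leq 256\,i \iff 0\leq i \\ \frac{i}{255}\leq\frac{i+1}{256} &\iff 256\,i\leq 255\,i+255 \iff i\leq 255 \end{align*} $$ Equality holds on the left only for $i=0$ and on the right only for $i=255$, so every other $i$ lands strictly inside $I_i$.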

So the algorithm meets all requirements.
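The proof is over the reals, so it is worth checking that 32-bit floats behave the same way. Below is a minimal C++17 sketch of an exhaustive check, assuming IEEE 754 floats. The names are made up for this illustration, and it uses a plain divide and `std::floor` rather than the constexpr tricks of the final code. It also nudges each float by one representable step in both directions, as a small test of the robustness requirement.

```cpp
#include <algorithm> // std::clamp
#include <cmath>     // std::floor, std::nextafter
#include <cstdint>

// BEST METHOD forward map: f = i / 255.0
inline float BestByteToFloat(std::uint8_t i)
{
    return static_cast<float>(i) / 255.0f;
}

// BEST METHOD inverse: i = clamp(floor(f * 256.0), 0, 255)
inline std::uint8_t BestFloatToByte(float f)
{
    const float scaled = std::floor(f * 256.0f);
    return static_cast<std::uint8_t>(std::clamp(scaled, 0.0f, 255.0f));
}

// True when every byte round-trips, and still round-trips after the float
// is moved to its nearest representable neighbor on either side.
inline bool BestMethodRoundTripsRobustly()
{
    for (int i = 0; i <= 255; ++i)
    {
        const float f = BestByteToFloat(static_cast<std::uint8_t>(i));
        const float below = std::nextafter(f, -1.0f); // just under f
        const float above = std::nextafter(f, 2.0f);  // just over f
        if (BestFloatToByte(f) != i) return false;
        if (BestFloatToByte(below) != i) return false; // clamp covers i == 0
        if (BestFloatToByte(above) != i) return false; // clamp covers i == 255
    }
    return true;
}
```

A one-step nudge is far smaller than the margin $\frac{i}{255\times256}$ from the proof, so this is only a sanity check on the endpoints and the clamp, not a measure of how much error the method tolerates.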

Final Code

For BEST METHOD in C++, we can make the conversions constexpr. Since std::floor is not constexpr until C++ 23, I’ll use another method. Making the functions constexpr allows putting in compile-time testing via use of static_assert. The C++ standard guarantees that casting a floating point value to an integral type truncates towards zero, which is different than floor for negative values, but any negative values from numerical errors will be ok to truncate towards 0 in this case.

This leads to the C++ BEST METHOD:

#pragma once
/*
MIT License
Copyright 2023 Chris Lomont, www.lomont.org

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights 
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
copies of the Software, and to permit persons to whom the Software is 
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in 
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL 
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS 
IN THE SOFTWARE.
*/

// code to do accurate conversions between 
// byte colors in 0-255 and 
// float colors in 0.0 to 1.0f
// see https://lomont.org/posts/2023/accuratecolorconversions/


#include <cstdint>   // uint8_t
#include <algorithm> // std::clamp since C++ 17

// Code to convert between byte colors in {0,1,...,255}
// and float32 colors in [0,1]. These methods are uniform, 
// stable, and high quality.

// byte color in 0-255 to 32-bit float in 0-1
constexpr float ColorU8ToF32(uint8_t colorB)
{
    constexpr float inv255 = 1.0f/255.0f; 
    return colorB * inv255; 
}

// float color in [0,1] to byte color in 0-255
// floats out of range clamped on output
constexpr uint8_t ColorF32ToU8(float colorF)
{
    // cast truncates towards 0, which is slightly different than 
    // floor, but works ok here since <= 0 values get clamped to 0.
    int32_t i32 = static_cast<int32_t>(colorF * 256.0f);
    return static_cast<uint8_t>(std::clamp(i32, 0, 255));
}

// some tests to check it all
// static assert only works for compile time if the constexpr succeeds
static_assert(ColorF32ToU8(-0.01f) == 0); // clamps
static_assert(ColorF32ToU8(0.0f) == 0);
static_assert(ColorF32ToU8(0.5f-1.0f/65536.0f) == 127);
static_assert(ColorF32ToU8(0.5f) == 128);
static_assert(ColorF32ToU8(1.0f) == 255);
static_assert(ColorF32ToU8(1.01f) == 255); // clamps

// all these should be exact as float32s
static_assert(ColorU8ToF32(0) == 0.0f); 
static_assert(ColorU8ToF32(255) == 1.0f);
static_assert(ColorU8ToF32(2) - ColorU8ToF32(1) == 1.0f/255.0f); 
static_assert(ColorU8ToF32(127) == 127.0f*(1.0f/255.0f)); 
// note that, as floats, i/255.0f and i*(1.0f/255.0f) can differ slightly for some i

You can fiddle with this code on godbolt.org to see what your compiler does to it.

Code Notes

Some notes to think through when you’re converting math to floating-point. Also see my floating point notes. I assume IEEE 754 floating point format here.

  1. Real numbers are called representable if they can be stored exactly as floating point. Representable values must be expressible as a sum of powers of 2 (and satisfy some other technical relationships). For example, 0.3, 1.9, 9.1 are not representable. 0.5, 0.75, 23.125 are representable.
  2. Casting from a float to an integer truncates towards zero (this is true in most languages, such as C/C++, C#, Java, etc.). Note this is not the same as floor, which moves towards $-\infty$.
  3. Addition, subtraction, multiplication, and division of representable numbers, as guaranteed by the IEEE 754 standard, are computed as if there were infinite precision and rounded at the end to the limited number of bits in the format. Rounding ties can be broken by one of several methods. I’ll be careful to avoid mistakes here.
  4. The common method to round a float to an integer, (int)(floatVal + 0.5), can fail to round correctly. First, casting in C/C++ truncates towards 0, which gives wrong results for negative values (although that is not a problem here). Second, floating point precision issues make some values round incorrectly: applying the above to the float value 0.49999997f yields not 0, but 1. (This value is the predecessor of 0.5f, obtainable via the nextafter and nexttoward routines in <cmath> from C++ 11 on.) This is the only positive value before around 8388609.0f for which this fails. It fails for a lot of negative numbers.
  5. You can convert multiplies and divides by powers of two for floating point to faster methods with things like C++ 11 scalbn style functions.
  6. Compilers will likely do the division to multiply trick, and you can even hand code it to remove all multiplies and divisions altogether using tricks on the exponent.
  7. to_chars and from_chars from the header <charconv> are especially useful. They convert floating-point to and from char arrays, and are the only methods in the C++ library that are guaranteed to roundtrip such values correctly, which means you can output a float to text and then back and get the same float. Surprisingly, other methods in the standard do not guarantee this, and often fail. And, since it’s C++, even these methods have a significant gotcha: you cannot use them across implementations, since that is not guaranteed to round trip!
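Notes 2 and 4 are easy to check directly. Here is a minimal sketch, assuming IEEE 754 floats with round-to-nearest-even and no extended intermediate precision; the helper names are made up for this illustration.

```cpp
#include <cmath> // std::floor, std::nextafter

// Note 2: a cast truncates toward zero, while floor moves toward -infinity.
inline int TruncateCast(float x) { return static_cast<int>(x); }
inline int FloorToInt(float x) { return static_cast<int>(std::floor(x)); }

// Note 4: the common "add 0.5 and cast" rounding idiom.
inline int RoundByAddingHalf(float x)
{
    const float shifted = x + 0.5f; // this sum is itself rounded to a float
    return static_cast<int>(shifted);
}

// The largest float strictly below 0.5f (0.49999997f in the note above).
inline float PredecessorOfHalf() { return std::nextafter(0.5f, 0.0f); }
```

`TruncateCast(-0.5f)` gives 0 while `FloorToInt(-0.5f)` gives -1. `RoundByAddingHalf(PredecessorOfHalf())` gives 1 because the sum lies exactly halfway between two floats and the tie rounds up to 1.0f, whereas `std::lround` on the same input gives 0. For a negative input such as -1.4f, the idiom gives 0 instead of -1.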

Addendum

Existing implementations

A: f = i/255.0 and i = func(f * 255.0)

  1. Apple Metal shading language does A (https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf, tables 7.6 and 7.7, float to integer). It does specify that i->f must be full range, which we agree with, but i = intRTNE(max(f*255.0,0.0),255.0) is bad (intRTNE is a rounding mode?)
  2. OpenGL is also bad (see their docs…)
  3. Unity forum: (byte)(float*255.0) is the best answer! https://answers.unity.com/questions/1359964/float-to-byte-for-color32.html
  4. scikit-image https://scikit-image.org/docs/stable/api/skimage.util.html img_as_float32, img_as_ubyte i = np.clip(np.rint(f*255.0))
  5. DirectX: f=i/255.0, i=floor(clamp(f)*255.0+0.5) https://learn.microsoft.com/en-us/windows/win32/direct3d10/d3d10-graphics-programming-guide-resources-data-conversion
  6. https://registry.khronos.org/DataFormat/specs/1.3/dataformat.1.3.html
  7. Blinn, “Dirty Pixels”, i = int(D*255.0+0.5)
  8. look into opengl pixel transfer, or image format, find conversions
  9. some forum posts
    1. https://stackoverflow.com/questions/48351010/is-it-a-bug-scaling-0-0-1-0-float-to-byte-by-multiplying-by-255
    2. https://docs.google.com/document/d/1tNrMWShq55rfltcZxAx1N-6f82Dt7MWLDHm-5GQVEnE/edit
    3. https://devskrol.com/2021/02/20/a-tip-a-day-python-tip-8-normalize-image-pixel-values-or-divide-by-255/
    4. http://answers.google.com/answers/threadview/id/502016.html
    5. https://medium.com/analytics-vidhya/a-tip-a-day-python-tip-8-why-should-we-normalize-image-pixel-values-or-divide-by-255-4608ac5cd26a
    6. https://stackoverflow.com/questions/53674869/how-do-i-convert-rgb-to-float-values
    7. https://hub.jmonkeyengine.org/t/converting-rgb-values-in-0-255-format-to-values-in-0-0f-1-0f-format/22080/5
    8. https://imagej.nih.gov/ij/developer/source/ij/process/ByteProcessor.java.html
    9. https://docs.nvidia.com/cuda/npp/group__image__color__to__gray.html
    10. https://scikit-image.org/docs/stable/user_guide/data_types.html
    11. https://learn.microsoft.com/en-us/dotnet/maui/user-interface/graphics/colors
    12. https://forums.ni.com/t5/LabVIEW/How-to-convert-32-bit-floating-point-pixel-values-to-colour/td-p/3961230
    13. https://community.khronos.org/t/how-to-convert-from-rgb-255-to-opengl-float-color/29288/3
  10. Total error (compared to double) for i/255.0f is 2.54e-6, for i*(1.0f/255.0f) is 7.56e-6, so the constant multiply has about 3x the error in conversion.
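The totals in the last item could be measured along the following lines. The note does not say how the error was summed, so the sum of absolute differences against a double-precision reference used here is an assumption, as are the names `ConversionError` and `TotalByteToFloatError`.

```cpp
#include <cmath> // std::fabs

// Accumulated absolute error of each byte -> float variant, in double.
struct ConversionError
{
    double divide;   // error of i / 255.0f
    double multiply; // error of i * (1.0f / 255.0f)
};

inline ConversionError TotalByteToFloatError()
{
    const float inv255 = 1.0f / 255.0f; // rounded once, reused for every i
    ConversionError total{0.0, 0.0};
    for (int i = 0; i <= 255; ++i)
    {
        const double reference = i / 255.0; // double-precision reference
        const float byDivide = static_cast<float>(i) / 255.0f;
        const float byMultiply = static_cast<float>(i) * inv255;
        total.divide += std::fabs(static_cast<double>(byDivide) - reference);
        total.multiply += std::fabs(static_cast<double>(byMultiply) - reference);
    }
    return total;
}
```

The divide is correctly rounded from the exact quotient, while the multiply starts from an already rounded `inv255` and rounds again, so the multiply total cannot come out smaller than the divide total.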

Scikit code has as references:

References
.. [1] DirectX data conversion rules. https://msdn.microsoft.com/en-us/library/windows/desktop/dd607323%28v=vs.85%29.aspx
.. [2] Data Conversions. In “OpenGL ES 2.0 Specification v2.0.25”, pp 7-8. Khronos Group, 2010.
.. [3] Proper treatment of pixels as integers. A.W. Paeth. In “Graphics Gems I”, pp 249-256. Morgan Kaufmann, 1990.
.. [4] Dirty Pixels. J. Blinn. In “Jim Blinn’s corner: Dirty Pixels”, pp 47-57. Morgan Kaufmann, 1998.

Misc

Possible other requirement, but at odds with the above: some error metric (avg, max, mean?) on the float -> int -> float roundtrip.

THE END!
