Compression Notes
Here are some notes on compression.
Copyright Chris Lomont, August 2016, Version 1.0
Introduction
Consider storing the 8-letter sequence AAAABBCC. Letters are often stored at 8 bits per letter, which would require 64 bits. If you want to store this in fewer bits, you could assign shorter bit sequences to each letter, for example, storing symbols using 2 bits each:
Letter | Bit string |
---|---|
A | 00 |
B | 01 |
C | 10 |
This requires 16 bits, a substantial savings. We’ll call this type of table a code table. A code table lists a bitstring for each symbol in some set of symbols, called an alphabet. Common alphabets used in compression are bits {0,1} and bytes {0,1,2,…,255}.
Note that A makes up 1/2 of the symbols, while B and C each make up 1/4. So giving A a shorter code, and B and C possibly longer codes, may save even more. A little fiddling shows that
Letter | Bit string |
---|---|
A | 0 |
B | 00 |
C | 01 |
does just this. Now the length is 4 * 1 + 4 * 2 = 12 bits (four A's at one bit each, plus two B's and two C's at two bits each). This illustrates a general principle: the more common a symbol, the fewer bits we want to use to denote it, in order to get better compression.
But if we try to decode such a string, there are ambiguities. For example, the string 00 may mean AA or it may mean B, so we need to find strings that can be decoded unambiguously. We find that
Letter | Bit string |
---|---|
A | 0 |
B | 10 |
C | 11 |
works. Every string of A, B, and C can be encoded this way and decoded unambiguously.
To make this more useful we need to find a way to compute such bit strings for any sequence we want to encode. We also need to transmit the code table, so a decoder knows what each bitstring means.
Huffman compression
There is an optimal method to assign a bit string to each symbol in a sequence to be compressed, called Huffman coding. We build a binary tree from the symbols up, taking at each step the two least frequent items. This pushes less frequent items deeper into the tree, which gives them longer codewords, and it is provably optimal. Here are the steps:
1. Scan the sequence, obtaining a count C(S) for each distinct symbol S. We'll take S to be a byte value in 0-255.
2. For each occurring symbol, create a Node(count, symbol, left, right, codeword) containing the symbol S, the count C(S), and left and right node pointers initialized to null. The codeword is empty for now. Store the nodes in a list (or priority queue).
3. Find the two nodes with the lowest counts in the list, remove them from the list, create a parent node with the two nodes as left and right children, parent symbol = nothing, parent count = sum of the children counts, and put the parent in the list.
4. If there is more than one node left in the list, go to step 3.
5. Now we assign codewords. Give the root node an empty codeword, then for each left step append a '0' to the parent codeword to get the child codeword, and for each right step append a '1'.
6. Note all leaves in the tree are symbol nodes. Keep each symbol and codeword pair; the rest of the tree can be thrown out. (A sketch of the whole process follows.)
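Here is a minimal sketch of these steps in Python (the tiebreak field is just there to keep heap entries comparable; real implementations choose tie-breaking rules to suit):

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict:
    # Steps 1-2: count symbols; each heap entry is (count, tiebreak, tree),
    # where a tree is either a symbol or a (left, right) pair.
    counts = Counter(data)
    heap = [(c, s, s) for s, c in counts.items()]
    heapq.heapify(heap)
    tiebreak = 256
    # Steps 3-4: merge the two least frequent nodes until one tree remains.
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, tiebreak, (t1, t2)))
        tiebreak += 1
    # Steps 5-6: walk the tree, appending '0' going left and '1' going right.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # a lone symbol still needs one bit
    walk(heap[0][2], "")
    return codes
```

On AAAABBCC this yields A -> 0, B -> 10, C -> 11, matching the table above.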
It is inefficient to parse the sequence twice, once to count the symbol frequencies, and once to compress, so many Huffman compressors build the frequency count as the data is parsed. This is called an adaptive model. This requires memory to store the table and routines to keep everything in sync. The decompressor also builds the same model as symbols are decompressed so the compressor and decompressor stay in lockstep.
Another option is to transmit the codeword table to the decompressor. Transmitting both the symbols and their codewords takes a lot of space, even if they are compressed, but there is a better way. It turns out it is possible to send, for each codeword length present, only those symbols having that length.
To do this, the codewords need to be put into canonical order, a (somewhat) unique ordering called the Canonical Huffman code. Here is the algorithm:
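In sketch form (Python; this assumes the codeword lengths are already known, say from the tree built above):

```python
def canonical_codes(lengths: dict) -> dict:
    # lengths maps symbol -> codeword length from the Huffman tree.
    # Sort by (length, symbol) and hand out codes in counting order,
    # left-shifting the running code whenever the length increases.
    code = 0
    prev_len = 0
    codes = {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len
        codes[sym] = format(code, "0%db" % length)
        code += 1
        prev_len = length
    return codes
```

Only the lengths matter now: any two code tables with the same codeword lengths produce identical canonical codes.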
Now transmit the maximum length of a codeword, and for each codeword length, write the count of symbols with that length followed by those symbols. This allows the decompressor to reconstruct the codeword table. Surprisingly, it even allows decompressing directly from this table, with no need to compute anything more. This permits low memory decompression by simply referencing the table in place: recreate the process above, reading bits to make decisions while walking the table. Here is the algorithm:
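A sketch of that table-walking decode in Python, assuming counts[i] is the number of codewords of length i+1, symbols lists the symbols in canonical order, and read_bit is an assumed helper returning the next bit of the stream:

```python
def decode_symbol(read_bit, counts, symbols):
    code = 0    # the bits read so far, as an integer
    first = 0   # canonical value of the first codeword of this length
    index = 0   # index into symbols of that first codeword
    for count in counts:
        code = (code << 1) | read_bit()
        if code - first < count:       # the codeword is complete
            return symbols[index + code - first]
        index += count                 # move past this length's symbols
        first = (first + count) << 1   # first code, one bit longer
    raise ValueError("invalid bitstream")
```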
With these few algorithms you can make a Huffman compressor that transmits the code table to the decompressor.
There are several details that allow making this even more efficient. Multiple sets of integers need to be compressed to send all this. For example, most Huffman trees have no very short codes, so sending the minimum and maximum code lengths present avoids sending some zero counts. The code length counts usually go from low to high and back to low as the length increases, and this too can be exploited.
A compressed stream looks like
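One plausible layout (the exact field order here is an assumption, but it carries exactly the information described above):

- the minimum codeword length present
- the maximum codeword length (or the delta from the minimum; see the next note)
- for each length from min to max, the count of symbols of that length, followed by those symbols
- the Huffman coded data itself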
If the minimum code length is > 0, then it is smaller to transmit the delta max-min instead of the max, possibly saving some bits.
Storing these various integers leads to the next useful topic, which is how to store integers in an adaptive manner, called Universal encodings.
- https://en.wikipedia.org/wiki/Canonical_Huffman_code
- https://pineight.com/mw/index.php?title=Canonical_Huffman_code
- http://www.compressconsult.com/huffman
Universal encodings
Quite often it is useful to send some integers without knowing ahead of time how big they might be. For example, in a compression format, the final length may need to be sent. One may just assume it will be 32 bits or less and send 32 bits, but this is wasteful if someone uses the format for short items, and it fails if someone needs more than 32 bits.
At the cost of a few bits, this can be done in many ways. Here are a few.
Notations
The length of a number $N$ in bits, which we write $b(N)$, is $b(N)=1+\lfloor\log_2 N\rfloor$, where $\lfloor x \rfloor$ is the greatest integer less than or equal to $x$. Note that $\lceil\log_2 N\rceil$, where $\lceil x \rceil$ is the least integer greater than or equal to $x$, fails for powers of 2; the correct form of that flavor is $b(N)=\lceil \log_2 (N+1)\rceil$. We also take the convention that $b(0)=b(1)=1$, i.e., the value 0 requires one bit to store.
Unary
The simplest way to send the integer N without knowing its size beforehand is to send N zero bits followed by a 1 bit (or flip the 0's and 1's). The decoder counts the zero bits, stopping at the 1 bit, and deduces the number N. This takes $N+1$ bits, which is of course quite wasteful in many cases. If it is known that $N>1$, then $N-1$ zeros suffice if both encoder and decoder agree on this.
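As code (Python; read_bit is an assumed helper returning the next bit):

```python
def unary_encode(n: int) -> str:
    # n zeros, then a terminating one
    return "0" * n + "1"

def unary_decode(read_bit) -> int:
    n = 0
    while read_bit() == 0:
        n += 1
    return n
```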
Truncated coding
When storing some integer from a range [0,n-1] where n may not be a power of two, storing it in $k=b(n)$ bits is somewhat wasteful, since when n is not a power of two, there is a way to store some of the numbers in $k$ bits and others in $k-1$ bits. The idea is to see how many of the $2^k$ binary values are not used, write this many values in $k-1$ bits, then the rest from the top down in $k$ bits.
It is especially neat how these can be arranged as prefix codes, so they are uniquely decodable.
Encoding
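A sketch (Python, for $n\geq 2$): the $u=2^k-n$ lowest values get $k-1$ bits, and the rest get $k$ bits, shifted up by $u$ to keep the code prefix-free.

```python
def truncated_encode(x: int, n: int) -> str:
    # encode x in [0, n-1] as a prefix code
    k = (n - 1).bit_length()   # ceil(log2 n) for n >= 2
    u = (1 << k) - n           # number of unused k-bit patterns
    if x < u:
        return format(x, "0%db" % (k - 1))
    return format(x + u, "0%db" % k)
```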
For example, suppose you need to encode 5 possible values from [0,4]. These cannot all fit into two bits, while three bits can encode 8 different values, which is too many. There are 3 unused three-bit values, so 3 of the outputs can each be one bit shorter. Note how looking at the first two bits tells whether or not a third one is needed.
value | bits |
---|---|
0 | 00 |
1 | 01 |
2 | 10 |
3 | 110 |
4 | 111 |
Decoding
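The matching decode (Python; read_bit as before):

```python
def truncated_decode(read_bit, n: int) -> int:
    k = (n - 1).bit_length()
    u = (1 << k) - n
    x = 0
    for _ in range(k - 1):              # read the first k-1 bits
        x = (x << 1) | read_bit()
    if x >= u:                          # a k-th bit is needed
        x = ((x << 1) | read_bit()) - u
    return x
```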
Golomb
The next interesting format is the Golomb code, defined by a positive integer $m$. To encode an integer $N\geq 0$, compute the quotient $q=\lfloor N/m \rfloor$ and remainder $r=N-qm$. Then encode $q$ in unary and $r$ in binary using $b(m)$ bits. When $m=2^k$ the remainder is simply the low $k$ bits, which is efficient; this special case is called a Rice code.
This takes $q+1+b(m)$ bits (or one less for some numbers using truncated codes), better than unary for almost every case.
Picking the best $m$ can be done by testing on your data; it is tricky to estimate, but there is a lot of literature on it. Golomb codes are optimal when the integers follow a geometric distribution.
Encoding
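A sketch (Python, for $m\geq 2$), using the truncated coding above for the remainder, which is where the occasional saved bit comes from:

```python
def golomb_encode(n: int, m: int) -> str:
    q, r = divmod(n, m)
    # quotient in unary, remainder in truncated binary
    return unary_encode(q) + truncated_encode(r, m)
```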
Decoding:
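And decoding:

```python
def golomb_decode(read_bit, m: int) -> int:
    q = unary_decode(read_bit)   # count zeros up to the 1
    return q * m + truncated_decode(read_bit, m)
```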
Elias coding
Peter Elias (1975) made three universal codes, called the gamma, delta, and omega codes. Each encodes the size of an integer $N>0$ (in one or more recursive steps), then the integer itself. Note these are for positive integers, but any finite set of integers can be shifted to satisfy this. He called unary the alpha code and binary the beta code. The others are named based on how many prefixes they have.
Gamma code
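The gamma code writes $b(N)-1$ zeros, then $N$ in binary; the leading 1 bit of $N$ terminates the zero run. A sketch:

```python
def gamma_encode(n: int) -> str:
    # n > 0: b(n)-1 zeros, then n in binary
    return "0" * (n.bit_length() - 1) + format(n, "b")
```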
Decoding is easy.
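Count the zeros, then read that many more bits:

```python
def gamma_decode(read_bit) -> int:
    zeros = 0
    while read_bit() == 0:
        zeros += 1
    n = 1                       # the 1 bit just read is n's top bit
    for _ in range(zeros):
        n = (n << 1) | read_bit()
    return n
```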
Delta code

Here the length of the length is written, which shortens the codes for larger numbers.
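The delta code gamma-codes the length $b(N)$, then writes $N$ without its leading 1 bit. A sketch:

```python
def delta_encode(n: int) -> str:
    # n > 0: gamma-code b(n), then n minus its leading 1 bit
    return gamma_encode(n.bit_length()) + format(n, "b")[1:]
```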
Decoding is easy.
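Gamma-decode the length, then read the rest of the number:

```python
def delta_decode(read_bit) -> int:
    n = 1
    for _ in range(gamma_decode(read_bit) - 1):
        n = (n << 1) | read_bit()
    return n
```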
Omega code

Here prefixes of prefixes are written, recursively, until the value drops to 1. Each prefix starts with a 1 bit, so a 0 terminates the list.

Encoding
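A sketch; each prepended group is a value written in its natural binary length, and the final 0 stops the decoder:

```python
def omega_encode(n: int) -> str:
    # n > 0: recursively prepend lengths, terminated by a 0 bit
    code = "0"
    while n > 1:
        code = format(n, "b") + code
        n = n.bit_length() - 1   # the next group encodes this length
    return code
```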
Decoding
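Each 1 bit read announces another group; a 0 says the current value is the answer:

```python
def omega_decode(read_bit) -> int:
    n = 1
    while read_bit() == 1:
        value = 1                # the 1 just read is the group's top bit
        for _ in range(n):
            value = (value << 1) | read_bit()
        n = value
    return n
```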
In my experience the delta codes are most useful for everyday compression values.
Even-Rodeh
These codes are equivalent to the Stout code $S_3$ below, so I’ll skip them.
Stout
Better than the Elias codes for many values are the Stout codes, defined recursively for each $k\geq2$ as follows.
An integer $N\geq 0$ is encoded as $S(k,L)\text{ }0\text{ }B(N,L)$ where $L$ is the bit length of $N$ (recall if $N=0$ then $L=1$), $B(a,b)$ is the binary representation of $a$ in $b$ bits, and the $S(k,n)$ are defined recursively as
$$ S(k,n)= \begin{cases} B(n,k) & \text{if $0\leq n < 2^k$} \\ S(k,\lfloor\log_2 n\rfloor - k)\, B(n,\lfloor\log_2 n\rfloor + 1) & \text{if $n\geq 2^k$} \end{cases} $$
This basically keeps prepending the length of the next value (minus $k$) until the value drops below $2^k$, then stops. The 0 in the definition is a delimiter telling the decoder to stop after the next symbol, which is the original number.
Encoding
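A sketch with the default $k=2$ (the $b(0)=1$ convention follows the notation section):

```python
def stout_encode(n: int, k: int = 2) -> str:
    # emits S(k, L), then the 0 delimiter, then B(n, L), where L = b(n)
    length = max(1, n.bit_length())            # b(0) = 1 by convention
    suffix = "0" + format(n, "0%db" % length)
    v = length
    chain = ""
    while v >= (1 << k):                       # build S(k, length)
        chain = format(v, "b") + chain         # v in its natural bit length
        v = v.bit_length() - 1 - k             # next prefix: floor(log2 v) - k
    return format(v, "0%db" % k) + chain + suffix
```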
Decoding. This one is tricky to get correct, but the following works.
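A sketch: after the first $k$ bits, every 1 bit starts another length group, and the 0 delimiter says the next value read is the number itself:

```python
def stout_decode(read_bit, k: int = 2) -> int:
    def read(nbits, value=0):
        for _ in range(nbits):
            value = (value << 1) | read_bit()
        return value
    v = read(k)                  # the terminal k-bit prefix
    while read_bit() == 1:       # a 1 leads a (v + k + 1)-bit length group
        v = read(v + k, 1)       # the 1 already read is its top bit
    return read(v)               # after the 0 delimiter, n itself in v bits
```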
Binary Adaptive Sequential Coding
Binary Adaptive Sequential Coding, or BASC for short, is a very useful technique to code a sequence of numbers where the size varies little locally, but may vary a lot over longer ranges. This is useful for things like sending arithmetic code tables, or Huffman lengths, where a sequence of numbers is relatively the same in size but does have enough short ones that simply using fixed encoding is wasteful.
The idea is this. The length $b(x_1)$ of the first number is stored, then the count of numbers (so the decoder knows how to start and when to stop), then the numbers themselves, starting with $x_1$ in $b(x_1)$ bits. When the next number can be stored in the same number of bits as the previous, a $0$ is written, else enough $1$'s are written to tell the decoder to increase the bit size, then a $0$ delimiter, then the next number. Each number read sets the bit size for the following ones.
Variants fiddle with how fast the bitsizes can decrease or use the average of some previous few to smooth out some types of data.
Encoding
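A sketch of the body (Python; the count and first-length header is assumed to be sent separately, write_bit is an assumed bit sink, and the exact grow rule here is one reasonable reading of the description):

```python
def basc_encode(xs, write_bit):
    def emit(v, nbits):                  # v in nbits bits, high bit first
        for i in range(nbits - 1, -1, -1):
            write_bit((v >> i) & 1)
    size = max(1, xs[0].bit_length())    # b(x1)
    emit(xs[0], size)
    for x in xs[1:]:
        b = max(1, x.bit_length())
        if b <= size:
            write_bit(0)                 # fits in the current size
            emit(x, size)
        else:
            for _ in range(b - size):    # one 1 per extra bit needed
                write_bit(1)
            write_bit(0)                 # delimiter
            emit(x, b)
        size = b                         # next size tracks this number
```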
Decoding
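The decoder mirrors this, recomputing each bit size from the number it just read:

```python
def basc_decode(read_bit, count, first_size):
    def read(nbits):
        v = 0
        for _ in range(nbits):
            v = (v << 1) | read_bit()
        return v
    out = [read(first_size)]
    size = max(1, out[0].bit_length())
    while len(out) < count:
        grow = 0
        while read_bit() == 1:           # 1s announce a larger size
            grow += 1
        x = read(size + grow)
        out.append(x)
        size = max(1, x.bit_length())
    return out
```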
Lomont1
I've found, when encoding file sizes, that a simple method results in shorter codes than the above methods on average. I suspect it has a name, but since I don't know what it's called, I call it the Lomont1 method. It breaks an integer $N$ into $k$-bit chunks, then stores each chunk prefixed with a 0 meaning it is the last chunk or a 1 meaning more chunks follow. $k=6$ is good for file sizes.
In many specialized compression formats I’ve designed, various Lomont codes perform very well for things like file sizes (with parameters 6,0 - depends on use), etc.
Note that as $N$ gets very large the Elias codes are better.
Encoding
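A sketch (Python; the least-significant-chunk-first order is an assumption, and the second parameter mentioned above is not described here):

```python
def lomont1_encode(n: int, k: int = 6) -> str:
    out = ""
    while True:
        chunk = n & ((1 << k) - 1)   # low k bits
        n >>= k
        out += ("1" if n else "0") + format(chunk, "0%db" % k)
        if n == 0:
            return out
```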
Arithmetic coding

Arithmetic coding encodes an entire sequence as a single number in $[0,1)$ by repeatedly narrowing an interval according to symbol probabilities. Suppose the alphabet is {A, B}, where A has probability 1/3 and B has probability 2/3, so A owns the subinterval [0, 1/3) and B owns [1/3, 1):

```
0        1/3                      1
+--------+------------------------+
|   A    |           B            |
+--------+------------------------+
```

To encode a symbol, the current interval [low, high) shrinks to that symbol's subinterval. The range must be taken first so both updates use the old bounds:

```
range <- high - low
high  <- low + high(symbol) * range
low   <- low + low(symbol)  * range
```

Here low(symbol) and high(symbol) are the symbol's cumulative probability bounds: A has [0, 1/3), B has [1/3, 1). For example, the new A cutoff is 1/3 of the way from low to high, at low + 1/3 * (high - low). Encoding the sequence ABBA:

```
Encode A: [0, 1)      -> [0, 1/3)      low = 0,               high = 0 + 1/3 * 1
Encode B: [0, 1/3)    -> [1/9, 1/3)    low = 0 + 1/3 * 1/3,   high = 0 + 1 * 1/3
Encode B: [1/9, 1/3)  -> [5/27, 1/3)   low = 1/9 + 1/3 * 2/9, high = 1/9 + 1 * 2/9
Encode A: [5/27, 1/3) -> [5/27, 19/81) low = 5/27,            high = 5/27 + 1/3 * 4/27
```

Any number in the final interval [5/27, 19/81) identifies the sequence ABBA: the decoder replays the same subdivisions and watches which subinterval the number falls into at each step.

In practice we cannot store real numbers to arbitrary precision, so implementations track the interval with integers, for example in 31 bits:
```
low  = 0
high = 0x7FFFFFFF
```
Scaling probabilities to integer counts, with total the number of symbols in the model and Cumulative(S) the sum of counts of symbols up to and including S, the update for symbol S becomes (high first, so both lines use the old low):

```
step = (high - low + 1) / total
high = low + step * Cumulative(S) - 1
low  = low + step * Cumulative(S-1)
```
Whenever the interval falls entirely into the lower or upper half of the full range, its leading bit is decided, so that bit can be written out and the interval doubled:

```
while high < range50Percent OR range50Percent <= low
    if high < range50Percent
        Write(0)
        low  = 2 * low
        high = 2 * high + 1
    else  // range50Percent <= low
        Write(1)
        low  = 2 * (low - range50Percent)
        high = 2 * (high - range50Percent) + 1
```
If instead the interval straddles the midpoint while shrinking, the next bit cannot be decided yet; count these rescalings and emit the pending bits once the direction is known:

```
// stretch interval to preserve size
while range25Percent <= low AND high < range75Percent
    scaling++   // possible overflow here, very unlikely
    low  = 2 * (low - range25Percent)
    high = 2 * (high - range25Percent) + 1
```
Putting it all together, with 31-bit ranges the constants are:

```
range100Percent = 0x80000000
range75Percent  = 0x60000000
range50Percent  = 0x40000000
range25Percent  = 0x20000000
```
Initialize:

```
low     = 0
high    = range100Percent - 1
scaling = 0
```
Then to encode a symbol S:

```
total     = number of symbols in stream
assert(total < (1 << 29))
lowCount  = Cumulative(S-1)   // sum of counts of symbols below S
highCount = Cumulative(S)     // sum of counts of symbols up to and including S

// update bounds
step = (high - low + 1) / total
high = low + step * highCount - 1
low  = low + step * lowCount

// apply scaling to keep ranges in bounds
while high < range50Percent OR range50Percent <= low
    if high < range50Percent
        Write(0)
        low  = 2 * low
        high = 2 * high + 1
        // output pending underflow bits, if any
        for (; scaling > 0; scaling--)
            Write(1)
    else  // range50Percent <= low
        Write(1)
        low  = 2 * (low - range50Percent)
        high = 2 * (high - range50Percent) + 1
        // output pending underflow bits, if any
        for (; scaling > 0; scaling--)
            Write(0)

// stretch interval to preserve size
while range25Percent <= low AND high < range75Percent
    scaling++   // possible overflow here, very unlikely
    low  = 2 * (low - range25Percent)
    high = 2 * (high - range25Percent) + 1
```
When all symbols are encoded, flush enough bits to pin down a value inside the final interval:

```
// two possible low and high distributions, so
// two bits are enough to distinguish them
if low < range25Percent
    Write(0)
    Write(1)
    for (i = 0; i < scaling + 1; ++i)   // final scaling
        Write(1)
else
    Write(1)
    Write(0)
    for (i = 0; i < scaling + 1; ++i)   // final scaling
        Write(0)
```

Note the decoder may need more 0 bits than were output; it is hard to determine precisely how many (experimentally around 20-25). The bitstream can simply return 0's when out of bits; the literature has other solutions.
The decoder initializes the same way and pre-fills a buffer with the first bits of the stream:

```
low    = 0
high   = range100Percent - 1
buffer = 0

// start buffer with values
for (i = 0; i < 31; ++i)
    buffer = 2 * buffer + ReadBit()
```
Then to decode a symbol:

```
// split range into single steps
total = number of symbols in stream
assert(total < (1 << 29))
step  = (high - low + 1) / total
S     = the symbol whose cumulative count range contains (buffer - low) / step
lowCount  = Cumulative(S-1)   // sum of counts of symbols below S
highCount = Cumulative(S)     // sum of counts of symbols up to and including S

// update bounds
high = low + step * highCount - 1
low  = low + step * lowCount

// same scaling as the encoder, feeding bits into the buffer
while high < range50Percent OR range50Percent <= low
    if high < range50Percent
        low    = 2 * low
        high   = 2 * high + 1
        buffer = 2 * buffer + ReadBit()
    else  // range50Percent <= low
        low    = 2 * (low - range50Percent)
        high   = 2 * (high - range50Percent) + 1
        buffer = 2 * (buffer - range50Percent) + ReadBit()

while range25Percent <= low AND high < range75Percent
    low    = 2 * (low - range25Percent)
    high   = 2 * (high - range25Percent) + 1
    buffer = 2 * (buffer - range25Percent) + ReadBit()

return S
```