(Also lomonster.com and clomont.com)

Compression Notes


Here's some notes on compression.

Compression Notes

Copyright Chris Lomont, August 2016 Version 1.0 ## Introduction

Consider storing the 8 letter sequence AAAABBCC. Often letters are stored 8 bits per letter, which would require 64 bits. If you want to store this in a fewer number of bits, you could assign shorter bit sequences to each letter, for example, storing symbols using 2 bits each:

Letter Bit string
A 00
B 01
C 10

This requires 16 bits, a substantial savings. We'll call this type of table a code table. A code table lists a bitstring for each symbol in some set of symbols, called an alphabet. Common alphabets used in compression are bits {0,1} and bytes {0,1,2,...,255}.

Note that A uses 1/2 of the symbols, B and C each use 1/4. So storing A in a shorter length and B and C in perhaps longer lengths may save even more. A little fiddling shows that

Letter Bit string
A 0
B 00
C 01

does just this. Now the length is 4 * 1 + 4 * 2 = 12 bits. This illustrates a general principle that the more common a symbol, the fewer bits we want to use to denote it in order to get better compression.

But if we try to decode such a string, there are ambiguities. For example, the string 00 may mean AA or it may mean B, so we need to find strings that can be decoded unambiguously. We find that

Letter Bit string
A 0
B 10
C 11

works. Every string of A, B, and C can be encoded this way and decoded unambiguously.

To make this more useful we need to find a way to compute such bit strings for any sequence we want to encode. We also need to transmit the code table, so a decoder knows what each bitstring means.

Huffman compression

There is an optimal method to assign a bit string to each symbol in a sequence to be compressed, called Huffman coding. We build a binary tree from the symbols up, taking at each step the two least frequent items, which has the effect of making less frequent items lower in the tree, which gives them longer codewords. This is provably optimal. Here are steps

  1. Scan the sequence, obtaining a count C(S) for each distinct symbol S. We'll take S to be byte value in 0-255.
  2. For each occurring symbol, create a Node(count,symbol,left,right,codeword) containing the symbol S, the count C(S), and left and right node pointers, initialized to null. Codeword is empty for now. Store in a list (or priority queue).
  3. Find the two nodes with the lowest counts in the list, remove them from the list, create a parent node with the two nodes as left and right, parent symbol = nothing, parent count = sum of children counts, and put the parent in the list.
  4. if there is more than one node left in the list, go to step 3.
  5. Now we assign codewords. Give the root node an empty codeword, and the for each left step append a '0' to the parent codeword to get the child codeword, and each right step append a 1.
  6. Note all leaves in the tree are symbol nodes. Keep each symbol and codeword pair. The rest of the tree can be thrown out.

It is inefficient to parse the sequence twice, once to count the symbol frequencies, and once to compress, so many Huffman compressors build the frequency count as the data is parsed. This is called an adaptive model. This requires memory to store the table and routines to keep everything in sync. The decompressor also builds the same model as symbols are decompressed so the compressor and decompressor stay in lockstep.

Another option is to transmit the codeword table to the decompressor. Transmitting both the symbols and their codewords takes a lot of space, even if they are compressed, but there is a better way. It turns out it is possible to send for each length of codeword present only those symbols with that codeword length.

To do this, the codewords need put in canonical order, which means a (somewhat) unique ordering, called the Canonical Huffman code. Here is the algorithm:

1. Sort symbols first by increasing length, then within each length, by symbol value
2. codeword = 0 // start at 1-bit codeword 0
3. Then do
    foreach symbol in symbols
        // lengthen code
        while bitLength of codeword < bitLength of symbol codeword
            append 0 to codeword
        set leaf codeword = codeword
        add 1 to codeword

Now transmit the max length of a codeword, and for each length of codeword, write the count of symbols with that length followed by those symbols. This allows the decompressor to reconstruct the codeword table. Surprisingly, it even allows decompression directly from this table with no need to compute more items. This allows low memory decompression by simply referencing the table in place. This is done via recreating the process above while reading in bits to make decisions, and walking the table. Here is the algorithm

firstCodewordOnRow = 0 // 0 codeword
symbolFound = false
tableIndex  = point to table start
while not symbolFound
    numberOfCodes = read from table
    if numberOfCodes > 0 AND accumulator - firstCodewordOnRow < numberOfCodes
        itemIndex   = accumulator - firstCodewordOnRow
        tableIndex += itemIndex * BitsPerSymbol // skip to symbol
        symbol      = read BitsPerSymbol
        symbolFound = true
        firstCodewordOnRow += numberOfCodes

        assert accumulator < 0x80000000 // catch overflows!
        accumulator = 2*accumulator + read bit from table // accumulate bits
        firstCodewordOnRow <<= 1

        // next entry
        tableIndex += numberOfCodes * BitsPerSymbol

With these few algorithms you can make a Huffman compressor that transmits the code table to the decompressor.

There are several details that allow making this even more efficient. Multiple sets of integers need compressed to send all this. For example, most Huffman trees don't have low length codes, so sending the min and max code lengths present saves some zero values needing sent. The code length counts usually go from low to high then back to low as the length increases, so this can be exploited.

A compressed stream looks like

1. Number of symbols encoded so decompressor knows when to stop
2. Bits per symbol (ASCII needs 7, for example, shortening many later values)
3. The table, which consists of 
   1. Bits per code length to tell how many bits are in each code length transmitted
   2. Minimum code length present
   3. Maximum code length present
   4. Then for each codeword length in increasing order, 
      4a. transmit a count of such codewords
      4b. transmit that many symbols in increasing order.

If the minimum code length is > 0, then it is smaller to transmit the delta max-min instead of the max, possibly saving some bits.

Storing these various integers leads to the next useful topic, which is how to store integers in an adaptive manner, called Universal encodings.

https://en.wikipedia.org/wiki/Canonical_Huffman_code https://pineight.com/mw/index.php?title=Canonical_Huffman_code http://www.compressconsult.com/huffman

Universal encodings

Quite often it is useful to send some integers without knowing ahead of time how big they might be. For example, in a compression format, the final length may need sent. One may just assume it will be 32 bits or less, and send 32 bits, but this is wasteful if someone uses the format for short items. And it fails if someone needs more than 32 bits.

At the cost of a few bits, this can be done in many ways. Here are a few.


The length of a number \(N\) in bits, which we write \(b(N)\), is \(b(N)=1+\lfloor\log_2 N\rfloor\) where \(\lfloor x \rfloor\) is the greatest integer less than or equal to \(x\). Note that \(\lceil\log_2 N\rceil\), where \(\lceil x \rceil\) is the least integer greater than or equal to \(x\) fails for powers of 2: the correct form would be \(b(n)=\lceil \log_2 (N+1)\rceil\). We also take the convention that \(b(0)=b(1)=1\), i.e., the 0 bit requires one bit to store.


The simplest way to send the integer N in binary without knowing the length beforehand is to send N zero bits followed by a 1 bit (or flip the 0's and 1's). The decoder counts the zero bits, stopping at the 1 bit, and deduces the number N. Of course, this is quite wasteful in many cases. If \(N>1\), then \(N-1\) zeros suffice if both encode and decoder know this. This takes \(b(N)+1\) bits.

Truncated coding

When storing some integer from a range [0,n-1] where n may not be a power of two, storing it in \(k=b(n)\) bits is somewhat wasteful, since when n is not a power of two, there is a way to store some of the numbers in \(k\) bits and others in \(k-1\) bits. The idea is to see how many of the \(2^k\) binary values are not used, write this many values in \(k-1\) bits, then the rest from the top down in \(k\) bits.

It is especially neat how these can be arranged as prefix codes, so they are uniquely decodable.


Assert value < n
k = BitsRequired n
// u = number of unused codewords
u = (1U << k) - n 
if value < u
    Write  (value) in k-1 bits
    Write (value + u) in k bits

For example, suppose you need to encode 5 possible values from [0,4]. Not all choices can be fit into two bits, and three bits, while enough, can encode 8 different values, so is too much. There are 3 unused values, so 3 of the outputs can be a one bit short each. Note how looking at the first two bits tells whether or not a third one is needed.

value bits
0 00
1 01
2 10
3 110
4 111


k = BitsRequired n
// u = number of unused codewords
u = (1U << k) - n; 
x = Read k-1 bits
if x >= u 
    // need another bit
    x = 2 * x + Read 1 bit
    x -= u
return x


The next interesting format are the Golomb codes, which are defined by a positive integer \(m\). To encode an integer \(N\geq 0\), compute the quotient \(q=\lfloor N/m \rfloor\) and remainder \(r=N-qm\). Then encode \(q\) in unary and r in binary using \(b(m)\) bits. When \(m=2^k\) this can be done efficiently and is called a Rice code.

This takes \(q+1+b(m)\) bits (or one less for some numbers using truncated codes), better than unary for almost every case.

Picking the best \(m\) can be done via testing your data, and is tricky to estimate, but there is a lot of literature on it.

This is most optimal when the integers are from a certain distribution.


b = BitsRequired m
n = value
q = n/m     
r = n%m
for i = 1 to q
    Write 1
Write 0
write truncated code (r,m)


q = 0
while Read(1) == 1
r = read truncated code m
return q * m + r

Elias coding

Peter Elias (1975) made three universal codes, called gamma, delta, and omega codes. Each encode the size of an integer \(N>0\) (in one or more recursive steps) then the integer itself. Note these are for positive integers, but any finite set of integers can be shifted to satisfy this. He called unary the alpha code and binary the beta code. These others are named based on how man prefixes they have.

Gamma code

1. L = number of bits to store N > 0. Note L>0
2. Write L-1 0 bits, then a 1 bit.
3. Write N using L-1 bits (strips off the highest 1 bit of N

Decoding is easy.

Delta code Here the length of the length is written, which shortens larger numbers

1. L1 = number of bits to store N > 0. Note L1 > 0
2. L2 = number of bits to store L1. Note L2 > 0
3. Write L2-1 0 bits
4. Write the L2 bit representation of L1, which starts with a 1.
5. Write L1-1 bits of N, stripping of the highest 1.

Decoding is easy.

Omega code. Here prefixes of prefixes are written until they stop. Each one starts with a 1, so a 0 terminates the list. Encoding

value = N
stack = new Stack<int>
while value != 1
    value = (BitsRequired value32) - 1
while stack not empty
    v = statck pop
    Write v in BitsRequired for v
Write 0 // terminates the value


n = 1
    b = Read 1
    if b != 0
        t = Read n
        n = (1U << n) + t
    while b != 0
return n

In my experience the delta codes are most useful for everyday compression values.


These codes are equivalent to the Stout code \(S_3\) below, so I'll skip them.


Better than the Elias codes for many values, are the Stout codes, defined recursively for each \(k\geq2\) as follows.

An integer \(N\geq 0\) is encoded as \(S(k,L)\text{ }0\text{ }B(N,L)\) where \(L\) is the bit length of \(N\) (recall if \(N=0\) then \(L=1\)), \(B(a,b)\) is the binary representation of \(a\) in \(b\) bits, and the \(S(k,n)\) are defined recursively as

\[ S(k,n)= \begin{cases} B(n,l) & \text{if $0\leq n < 2^k$} \\ S(\lfloor\log_2 n\rfloor - k) B(n,\lfloor\log_2 n\rfloor + 1) & \text{if $n\geq 2^k$} \end{cases} \]

This basically keep prepending the length of the next value (minus \(k\)) until the length gets below a length \(k\), then it stops. The 0 in the definition is a delimiter telling the decoder to stop after the next symbol, which is the original number.


Recurse(n, k)
    if n < (1 << k)
        Write n in k bits
        m = FloorLog2(n)
        Recurse(m - k, k)
        Write n in m + 1 bits

    m = BitsRequired(value)
    Write(value in m bits

Decoding. This one is tricky to get correct, but this works

    v = Read k bits
    while true
        b = Read 1 bit
        if (b == 0) // end of chain
            return Read v bits
        v = v + k
        v = (1<<v) + Read v bits

Binary Adaptive Sequential Coding

Binary Adaptive Sequential Coding, or BASC for short, is a very useful technique to code a sequence of numbers where the size varies little locally, but may vary a lot over longer ranges. This is useful for things like sending arithmetic code tables, or Huffman lengths, where a sequence of numbers is relatively the same in size but does have enough short ones that simply using fixed encoding is wasteful.

The idea is this. The length \(b(x_1)\) of the first number is stored, the count of numbers (so the decoder knows where to start and how to start), then the numbers are stored, starting with \(x_1\) in \(b(x_1)\) bits. When the next number can be stored in the same number of bits as the previous, a \(0\) is written, else enough \(1\)'s are written to tell the decoder to increase the bit size, then a \(0\) delimiter, then the next number. Each number read sets the bitsize for the next ones.

Variants fiddle with how fast the bitsizes can decrease or use the average of some previous few to smooth out some types of data.


store number of entries if desired
b1 = BitsRequired(data[0])
store initial number of bits b1 if desired
for i = 1 to data.Count-1
    d = data[i]
    b2 = BitsRequired(d)
    if (b2 <= b1)
        // b1 is enough bits, signal with a 0
        // b2 requires more bits, tell how many
        for ik = 0; ik < b2-b1; ++ik
        Write(0,1)         // end of bit count
        Write(d, b2-1)     // strip off leading '1'
    b1 = BitsRequired(d)   // for next pass


decode number of entries if stored
decode initial number of bits b1 if stored
xi = Read(b1)  // initial value
output(x1) to user
while more to decode
    decision = Read(1)
    if (decision == 0)
        // bi is <= b(i-1), so enough bits
        xi = Read(b1)
        // bi is bigger than b(i-1), must increase it
        delta = 0
            decision = Read(1)
        } while decision != 0
        b1 += delta
        xi = Read(b1 - 1); 
        // xi has implied leading 1, so reinsert it
        xi |= 1U << (int) (b1-1)
    output xi to user
    b1 = BitsRequired(xi)  // for next pass


I've found, when encoding file sizes, that a simple method results in shorter codes than the above methods on average. I suspect it is named, but since I don't know what it's called, I call it the Lomont1 method. It breaks an integer \(N\) into \(k\)-bit sized chunks, then stores each prefixed with a 0 meaning last chunk or 1 meaning more chunks follow. \(k\)=6 is good for file sizes.

In many specialized compression formats I've designed, various Lomont codes perform very well for things like file sizes (with parameters 6,0 - depends on use), etc.

Note that as \(N\) gets very large the Elias codes are better.


```C# // write value in chunks of possible changing size EncodeLomont1 (value, chunkSize, deltaChunk) mask = (1U << chunkSize) - 1 while value32 >= (1 << chunkSize)

    Write 1 // single bit - denotes another chunk coming
    Write (value & mask) in chunkSize bits
    value >>= chunkSize // shrink value
    mask = (1U << chunkSize) - 1 // new mask

    if deltaChunk != 0
        chunkSize += deltaChunk // cahange chunksize
        if chunkSize <= 0
            chunkSize = 1 // do not allow non-positive chunk sizes

Write 0 in one bit - marks last chunk
Write value in chunkSize bits

Decoding is easy.

### Conclusion
There is a large literature on such items. See Salomon's [Handbook of Data Compression](http://www.springer.com/us/book/9781848829022) for a solid overview.

TODO - nice table here

## Arithmetic compression
Huffman compression allocates an integral number of bits per symbol, which is optimal as long as the frequency of each symbol is a rational number with a power of 2 as denominator (because each bit effectively divides a probability by 2), but is not optimal for, say, a sequence of A's and B's here p(A) = 1/3 and p(B) = 2/3. (This is because 1/3 is non-terminating in binary). To get optimal symbol compression in such cases arithmetic compression effectively allocates partial bits to symbols.

Note however Huffman is very good in practice and rarely is arithmetic compression needed. But it's a useful tool.

### Shannon entropy
What is meant by optimal compression? At the symbol level, not taking account into larger patterns, a concept called Shannon Entropy quantifies the optimum. For symbols $a_1,a_2,...,a_m$ with probabilities $p_i = p(a_i)$, the **Shannon Entropy**.

$$\sum_{i=1}^m - p_i \log_2 p_i$$

For example if there are 2 symbols in the alphabet, each with likelihood 1/2, the Shannon Entropy is 

$$\sum_{i=1}^2 - \frac{1}{2} \log_2 \frac{1}{2} = \frac{1}{2}+\frac{1}{2}=1$$

which matches the intuitive notion that each bit requires one bit to send on average. However, suppose the two symbols occur with probability $1/8$ and $7/8$, then the Shannon Entropy is **0.54356**, much less than one. This means it takes less than a bit, on average, to store a bit. 

### Main idea
Arithmetic compression works by representing a stream of symbols as a real number between 0 and 1, where an ever shrinking interval represents the symbols seen so far. To illustrate with two symbols with probabilites $p(A)=1/3$ and $p(B)=2/3$, imagine them laid end to end: 
0       1/3              1
|  A     |        B      |

If the first symbol is A, then the real number will lay in the first part, else if B, in the B range. So how to encode more symbols? Subdivide the current interval with the same ratios as before (or changing if doing adaptive probabilities). Here is a few steps, with the interval stretched visually to see it, encoding ABBA (and tracking the high = 1 and low = 0 ends of the interval):
                                               low  <- low + low(symbol)  * (high - low)
                                               high <- low + high(symbol) * (high - low)
                                               new A cutoff 1/3 from low to high = low + 1/3(high-low)

0       1/3              1
+------------------------+                     0 +  0  * (1-0) = 0      
|  A     |        B      |   Encode A =>       0 + 1/3 * (1-0) = 1/3
+------------------------+                     0 + 1/3 * (1-0) = 1/9

0       1/9             1/3
+------------------------+                     0 +  0  * (1/3 - 0) = 0  TODO
|  A     |        B      |   Encode B =>       0 + 1/3 * (1/3 - 0) = 1/3
+------------------------+                     0 + 1/3 * (1/3 - 0) = 1/3

1/9 5/27 1/3 +------------------------+ | A | B | Encode B => +------------------------+

5/27 19/81 1/3 +------------------------+ | A | B | Encode A => +------------------------+

5/27 19/81 +------------------------+ | A | B |

So after encoding the symbols ABBA with the given probabilities, every number in the range [5/27,19/81] represents this stream. A decoder, given the probabilities, and number of symbols and any number in this range, can decode this string.

The problem with this simple variation is that as the stream of symbols gets longer and longer, the range gets smaller and smaller, requiring longer and longer numbers to store the endpoints. Fortunately there is a clever integer way to implement this, only using fixed length integers.

### Integer version
Here is how the integer version works, using 32-bit integers for concreteness.

Instead of probabilities we'll use counts C(S) of each symbol S, and the sum of the counts, called total.

Then let low(S) be the sum of all counts up to but not including S, and high(S) be the sum including S. These numbers, divided by total, represent the probability range.

Treat the real number range [0,1] as the integer range [0,2^31] (we don't use the last bit to allow certain calculations, below). Represent the current range as [low, high), where the right integer is one less than the max range (open ended).

Initialize the high and low to 

low = 0 high = 0x7FFFFFFF

Then, given symbol S, update the range via

step = (high - low + 1)/total
high = low + step * C(S) - 1 low = low + step * C(S)

This squeezes the endpoints together, and would eventually make them too close and step would become 0, losing information. So the trick is to output known bits and rescale the integers. This happens if high < 1/2, when it is known the range is in the lower half, so we can output a 0 bit, or when 1/2 < low, when the range stays in the high half, when we can output a 1 bit. Then we stretch the range by a factor of 2. This looks like

while high < range50Percent OR range50Percent <= low if highValue < range50Percent Write(0) low = 2 * low high = 2 * high + 1 else if range50Percent <= lowValue Write(1) low = 2 * (low - range50Percent) high = 2 * (high - range50Percent) + 1

This leaves one last problem - what if the range is shrinking, but overlaps the halfway mark? This is handled by ensuring the range never gets too small, and delaying the decision on which side the range lands in, by discarding a bit, incrementing a counter to track it, and continuing. This looks like 

// stretch interval to preserve size while range25Percent <= low AND high < range75Percent scaling++ // possible overflow here, very unlikely low = 2 * (low - range25Percent) high = 2 * (high - range25Percent) + 1

So we add to the initialization setting scaling to 0. Finally, we need to output these scaling bits when relevant, which was when we knew which half we were in, so rewrite the first routine and merge to get an encoder.

step = (high - low + 1)/total
high = low + step * highCount - 1 low = low + step * lowCount

// apply scaling to keep ranges in bounds while high < range50Percent OR range50Percent <= low if highValue < range50Percent Write(0) low = 2 * low high = 2 * high + 1 // output decision if any for (; scaling > 0; scaling--) Write(1) else if range50Percent <= lowValue Write(1) low = 2 * (low - range50Percent) high = 2 * (high - range50Percent) + 1 // output decision if any for (; scaling > 0; scaling--) Write(0)

// stretch interval to preserve size while range25Percent <= low AND high < range75Percent scaling++ // possible overflow here, very unlikely low = 2 * (low - range25Percent) high = 2 * (high - range25Percent) + 1

This can overflow if the range shrinks more than 2^32 times without choosing a half, since scaling can overflow. But it's unlikely, and can be checked if needed. 

The decoder does the same work. Here are example working algorithms:

This is an implementation for 32 bit architectures. The constants representing the 1/4, 1/2, 3/4, and the entire interval can be changed for other sizes.

 - number of symbols in stream
 - counts table (full or sparse). 1 bit denotes full (0) or sparse(1) table present
    - Both tables start with: 
        - min+1 and max+1 symbol index are Elias delta encoded
        - number of bits +1 in table Elias delta encoded, allows skipping table
    - Full table (when min to max symbol are mostly present):  
        - list of frequencies (max-min+1 entries) is BASC encoded
    - Sparse table (when many symbols not present, such as LZ77 distance values)
        - min+1 and max+1 symbol index are Elias delta encoded
        - number of bits +1 in table Elias delta encoded, allows skipping table
        - BASC counts
        - BASC symbols

All values are unsigned 32 bit integers.

Some constants - use 31 bits to allow multiplication by 2 of open intervals in algorithm without overflow

range100Percent = 0x80000000 range75Percent = 0x60000000 range50Percent = 0x40000000 range25Percent = 0x20000000

Initialize encoder

low = 0 high = range100Percent - 1 scaling = 0

Encode a symbol S

total = number of symbols in stream assert(total < (1 << 29)) lowCount = Cumulative(S-1) = sum of counts of symbols below S highCount = Cumulative(S) = sum of counts of symbols up to and including S

// update bounds. step = (high - low + 1)/total
high = low + step * highCount - 1 low = low + step * lowCount

// apply scaling to keep ranges in bounds while high < range50Percent OR range50Percent <= low if highValue < range50Percent Write(0) low = 2 * low high = 2 * high + 1 // output decision if any for (; scaling > 0; scaling--) Write(1) else if range50Percent <= lowValue Write(1) low = 2 * (low - range50Percent) high = 2 * (high - range50Percent) + 1 // output decision if any for (; scaling > 0; scaling--) Write(0)

// stretch interval to preserve size while range25Percent <= low AND high < range75Percent scaling++ // possible overflow here, very unlikely low = 2 * (low - range25Percent) high = 2 * (high - range25Percent) + 1

Finish encoding

// two possible low and high distributions, so // two bits enough to distinguish if low < range25Percent Write(0) Write(1) for (var i = 0; i < scaling + 1; ++i) // final scaling Write(1) else Write(1) Write(0) for (var i = 0; i < scaling + 1; ++i) //final scaling Write(0)

// note decoder may need more 0 bits than output, hard to determine // precisely how many (experimentally around 20-25).
// bitstream can simply return 0's when out of bits. // literature has other solutions

Initialize decoder

low = 0 high = range100Percent - 1 buffer = 0 // start buffer with values for (i = 0; i < 31; ++i) buffer = 2* buffer + ReadBit()

Decode a symbol S (after freq table constructed/transmitted). Mostly mimics encoder

//split range into single steps step = (high - low + 1)/total S = symbol entry with probability value (buffer - low)/step

total = number of symbols in stream assert(total < (1 << 29)) lowCount = Cumulative(S-1) = sum of counts of symbols below S highCount = Cumulative(S) = sum of counts of symbols up to and including S

high = low + step highCount - 1 low = low + step lowCount

while high < range50Percent OR range50Percent <= low if high < range50Percent low = 2 * low high = 2 * high + 1 buffer = 2buffer + ReadBit() else if range50Percent <= low low = 2 (low - range50Percent) high = 2 * (high - range50Percent) + 1 buffer = 2 * (buffer - range50Percent) + ReadBit()

while range25Percent <= low AND high < range75Percent low = 2 * (low - range25Percent) high = 2 * (high - range25Percent) + 1 buffer = 2 * (buffer - range25Percent) + ReadBit() return S ```

And there you have it. Arithmetic compression.

LZ compression

Arithmetic and Huffman compression compress symbols, and do not (without higher order probability modeling) take into account order. A static Huffman code used on AAAAAB and AABAA will be the same length. To take into account larger features in symbol streams, patterns are detected and referenced, making much stronger compression possible. Two common ones are LZ77 (after Abraham Lempel and Jacob Ziv in 1977), which looks at past data for a match on upcoming data, and stores info about the match instead of the upcoming data, and LZ78 (Lempel and Ziv, 1978), which builds a dictionary of past seen symbol strings, which outputs dictionary indices to replace longer strings as it compresses.

These methods underlie all sorts of modern technology from image compression to internet transmission.

Each also has a huge amount of variants in the literature.


I'll explain LZSS (Lempel–Ziv–Storer–Szymanski), a decent variant to learn from.

In this variant, for each next symbol to encode, the history (up to an optional max distance back) is searched for matching the next and upcoming data. If there is a useful match at a distance with length, then the pair (distance,length) is stored next. If there is no useful match, the symbol itself is stored. To differentiate these cases, a decision bit before can signal what the next item is.

All the magic is in the choices of allowed distances, the matching lengths, and tricks to encode each. Since the lengths tend to skew towards smaller numbers, Huffman encoding is often used, but sometimes Golomb coding works better.

Another trick, instead of coding naive (distance,length) pairs is to encode them as a variant of distance *(maxLength+1)+length with truncated coding. This sometimes saves a bit per item. However, smarter encoding of them by any more advanced method usually beats this.


LZ78 is like LZ77, in that it uses a dictionary of previous phrases (LZ77 used the recent memory buffer). LZ78 uses a clever method to encode tokens in a dictionary that is built (and optionally flushed) adaptively. Read elsewhere for more details.


LZCL, LZ77 Chris Lomont style, is a version I created using many of the above methods, in order to meet requirements for a specific project. See here http://clomont.com/software/lzcl-an-embedded-compression-library/ for details, or here https://github.com/ChrisLomont/CompressionTools for my GitHub implementation.


Delta, unique, range

Sometimes data is in such a format that doing some filtering on it first allows better compression. Three basic ones are delta, unique, and range.


Replace each value with the difference from the previous one, keeping the first one the same. This sometimes smooths out values, making runs of smaller or zero values. As an extreme example, consider the data \(1,2,3,....,1000000\). Compressing this is not as nice as compressing the delta encoded version \(1,1,1,...,1\), a run of 1,000,000 1's. Rarely (but sometimes!) does applying multiple delta filters help.


This is useful when the symbols occurring are sparse compared to the possible data size. For example, if the integers \(1,99,1000000\) occur in some stream, encoding the 1 and the 1000000 should use a different number of bits. A unique filter replaces these with the values \(0,1,2\) and a table telling how to decode.


Similar to the unique filter, a range filter is useful when the data occurs in a certain range \([a,b]\), and shifting all data to make this range be \([0,b-a]\) allows encoding symbols in a smaller number of bits.

Sample compression tools

The reason I wrote this up was because I needed a specialized compressor for some embedded work, and I found nothing suitable. The main requirements were very small code size and very low RAM usage. I needed to fit into FLASH a lot more data than the FLASH allowed uncompressed. Thus the decompressor had access to the compressed data, and only needed to decompress it once in a while to use then discard. I've developed all sorts of specialized algorithms over the years, including many compression algorithms, and this time I decided to take the time to write up some notes and create tools for myself to develop such algorithms more quickly. The result is a codebase of pluggable codecs, analysis, and development tools to make future designs happen more quickly.

The result is explained in my article LZCL: An embedded compression library, with code at GitHub at https://github.com/ChrisLomont/CompressionTools.

Happy hacking! Chris Lomont

Categories: Software

Tags: Software Compression


Comment is disabled to avoid unwanted discussions from 'localhost:1313' on your Disqus account...