Description of LZ77

Basic Definitions

In this article, \(D\) is used to represent an i-elements data array. \(d_0, d_1, ... , d_i\) are the elements of \(D\). Any continuous subset of \(D\) is called slices, represented by \(S(i, j)\), where \(i\) and \(j\) is the start and end position. For two slices \(S_1\) and \(S_2\), if \(S_1\) and \(S_2\) are the same, this pair of slices is called a match. \(|S_1|\) and \(|S_2|\) are their lengths. If data \(D\) and a slice \(S\) have more than \(k\) same elements from the start, we say \(D\) and \(S\) is a \(k\)-match.

General Description

Given a data array \(D\), LZ77 compresses the data by replacing the repeated part with a position and length indactor. For example, if we have the following data:

1 2 3 4 5 1 2 3 4 8 9 0

It is easy to point out that the slice 1 2 3 4 appears twice. The slice tt the very first is called \(S_1(0,3)\) and another is \(S_2(5,8)\). It means that there is no need to store this slice twice, but a slice and just a indactor pointing where we can find this silce. Here, to index \(S_2(i_2,j_2)\) by \(S_1(i_1,j_1)\), we use the following indactor:

i_2 - i_1, |S_2|

The first term of this indactor represents the distance of \(S_1\) from current position. The second term is the length of \(S_2\). Then, the example we give at first can be compressed to:

1 2 3 4 5 5 4 8 9 0

However, by looking at the above compressed data, we do not know which part is compressed and which part is raw. Thus, to address this problem, we introduce the compress flag. Compress flag is a 8-bit byte \([b_1, b_2, b_3, b_4, b_5, b_6, b_7, b_8]\). Each bit is assigned to a part of following data. If the bit is \(0\), this part is compressed, and vice versa. One thing need to be pointed out is that we always check the bits from low to high, so \(b_8\) is always assigned to the first part following this flag.

Apparently, the simplest way of using compress flag is as follows:

If current bit is \(1\), the following one byte in raw data.
If current bit is \(0\), the following two byte represent a indactor.

Now we can rewrite the compress example:

0xDF 1 2 3 4 5 4 8 9 0x01 0

0xDF = `0b11011111`.
0x01 = `0b00000001`.

You can manually decompress and validate this compressed array.

Algorithm

We have talked about the general idea. However, there are several remaining problems:

How far should we look back?
What is the bound of the length of match?
How do we know that we reach the end of data?

Generally speaking, the shorter we look back, the faster the algorithm will be. The most common solution is set the upbound of the distance we can look back. If we reach the max distance and there is still no match, then current byte is marked as raw. We can also say the totally same thing for the upper bound of the length of match. However, for lower bound, things become different, because the indactor contains two bytes. If we compress a one-byte slice, we will get a longer data. Thus, the lower bound is usually fixed as 3, which is just larger than the length of indactor. The final problem is that how should we mark this stream as end. There are variant ways to achieve this. Here we will use a simple way: a 0-length compressed slice with position at 0 represents the end of data. This type of marks is safe because we can not get a 0-length compressed slice. Back to the example in previous section, now it becomes:

0xDF 1 2 3 4 5 4 8 9 0x01 0 0 0

0xDF = `0b11011111`.
0x01 = `0b00000001`.

Compress

Now we will actually talk about the compress/decompress method. Slightly different from our previous discuss, when we try to find a match, we do not search in the raw data, but in a special slice, called window. A windows \(W\) is a fixed-size slice. The content of window can be changed during the compression/decompression. The advantage of using window is that:

We do not need to touch the data.
Window is a better choice when we are working on stream data.
We can set the initial value of window according dictonary to get better compression ratio.

Here we give the algorithms as follows:

Initialize window.
Set current position p = 0, current window position w = 0
Initialize flag byte.
for 1 to 8 do:
- find match between data and window from current position.
- if match exists
  - set current bit to 0, write the position and length of match
- else
  - set current bit to 1, write current byte.
Repeat until the end of data.
Write the end mark.

Decompress

Decompress is based on the same idea of compress. Thus, we won't talk about it too much but just give the algorithm:

Initialize window.
Set current position p = 0, current window position w = 0
Read flag byte.
for 1 to 8 do:
- if current bit is 0
  - read position and lengeth, copy the slice(position, length) of window to output.
- if current bit is 1
  - read one byte and write to output.
Repeat until reach the end mark.