
 Contents

 Article 1: Notes on Data Compression & Related topics
 Article 2: Data Compression Implementation
Notes on Data Compression & Related topics
by
Simon Southwell
29th March 1999
 1 Introduction
 1.1 Information
 1.2 Compression Applications
 2 Lossless Data Compression Techniques
 2.1 Run Length Encoding
 2.2 Huffman Encoding
 2.3 Arithmetic Encoding
 2.4 Models
 2.5 LZ Adaptive Coding
 3 The Lempel-Ziv-Welch Compression Algorithm
 3.1 Compression
 3.2 Decompression
 3.3 The KωK Exception Case
1 Introduction
1.1 Information
When we receive junk mail through the door, we can often feel that it contains
almost no information, and that the bin is the only fit place for it, often unopened. But
this feeling of 'content free' communication is really a subjective assessment. For a
definition of information that will be useful for us, we need to remove our 'human'
interest. Carrying on the junk mail analogy: if, every month, you receive a letter from
a credit card company extolling the virtues of their particular credit card, perhaps
because it has a picture of a Moose on the front, then the letters AFTER the first
are, indeed, information free. When you receive such a letter you know exactly what
it contains, and it can go straight in the bin.
Perhaps, though, the credit card company are becoming more desperate each month
to win your business, and are reducing the first year's APR each time they send you
the letter. The rest of the bumph is the same, but the APR value is different each time.
So, the credit card company could just send a postcard with a number on it: the new
APR being offered. The information content is therefore not zero, but you should be
able to see that it is still small.
Now suppose, on another day, another credit card company sends you a letter,
but they have come up with a scheme whereby the more you borrow, the more
interest they pay you. In addition, they give charitable donations to save Moose from
the illegal Moose slave market every time you use their card. Now, you must agree,
you are not likely to expect such a communication. If one turned up, you'd probably
not believe a word of it, but you certainly would not have expected it. Therefore its
information content is very high. None of what is in the communication was known
to you before, and now you know.
This, then, gives us a definition of information content: information content is
proportional to the data's 'unexpectedness'. Relating this to computer signaling,
suppose we are receiving a serial bit stream, 1's and 0's at regular intervals. Now
suppose we always receive a steady stream of, say, 1's. Initially the information
content is high, the first bit could be either a 1 or a 0. However, as time passes, and
1s continue to arrive, the amount of information the signal is conveying heads
towards zero, as the subsequent values become increasingly predictable. At the other
extreme, suppose the bit stream indicates the flip of a coin, with one bit for each flip;
1 for heads, say, and 0 for tails. Assuming the flipping is not biased in any way, we
cannot predict the value of the next bit to be received by looking at the present, or
any of the past bits. The incoming signal is totally random, and contains 100%
information. However, suppose that someone at the transmitting end forgot to
connect up the transmission wire to the transmission electronics, and instead of coin
flips, the line is floating around randomly, so that a random sequence of 1s and 0s is
received at the other end. What is the information content of the received signal now?
It's 100%! The signal at the receiver is still totally unpredictable. Just because some
person is attaching meaning to the signal (i.e. the result of coins being flipped) doesn't
change the information value, even when the data doesn't actually have an intended
meaning. Indeed, one can argue that the signal in the example is conveying
information about the voltage that the loose transmission wire is at, and something
about the detection circuit's characteristics for distinguishing 1s and 0s. Meaning is
only something that we add to information.
Somewhere between these two extremes, totally predictable and totally
random, lie all the values in between. Much of the data we produce contains a great
deal of information, but it isn't totally random. Take English text, for example. Suppose one is
reading English text, one character at a time (I don't recommend this). If one comes
across a 'Q' (or a 'q'!), I'd bet you would be almost certain that the next character
would be a 'u' (but less likely a 'U'). You then find you are reading a memo regarding
Australia's leading airline, and the next character turns out to be an 'A'. If the next
letter turned out to be 'X' then you might feel a typing error had occurred. So we see that we're in this
grey area where we can give a good guess of what might follow based on what we
have already seen. It's not 100% as we've seen, but the data is not random either.
Much data is like this; the data contains information, but is not 100% efficiently
encoded, and redundant, predictable, data is contained within it. It is this
characteristic of some data that allows us to 'compress' the data. What we need to do
is encode the data more efficiently such that every bit contains information.
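This notion of average 'unexpectedness' can be made precise as the Shannon entropy of the symbol distribution (the standard measure, though not named above). A minimal sketch, assuming a zero-order view in which each symbol is treated as independent of its neighbours:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(data):
    # Shannon entropy: average information per symbol, in bits.
    # Fully predictable data scores 0; unbiased random bits score 1 bit per bit.
    counts = Counter(data)
    total = len(data)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

print(entropy_bits_per_symbol("1111111111"))  # 0.0: a steady stream of 1s carries no new information
print(entropy_bits_per_symbol("1010011010"))  # 1.0: like unbiased coin flips, pure information
```

The compression task described in the text is then exactly the task of re-encoding data so that its stored size approaches its entropy.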
1.2 Compression Applications
The need for more and more secondary storage has been apparent in recent years
due to the growth in the area of information technology. Transfer of data over
transmission media has also increased, putting large demands on bandwidth. Solving
some of these problems has led to technological advances and innovation, such as
fibre optics, read/write optical disks and helical scan tape drives. All these advances
have increased bandwidth and storage capacities, but often have only just satisfied the
ever increasing demands.
One solution for an 'instant' increase in capacity of a medium is to compress the
data, using an encoding scheme which represents this data in fewer symbols than the
original data set. This will be possible if the data is structured 'inefficiently' and
contains redundancy, as discussed earlier. Compression methods have been around
for a while, but only in more recent years, with the maturing of VLSI technology, has
it been possible to make practical compression implementations at high enough rates
to be used in useful applications. Compression methods may be implemented on Digital
Signal Processors (DSPs), or, for the maximum data rates, in dedicated hardware, as
ASICs (Application Specific Integrated Circuits).
There are two major classes of data compression techniques:
Lossy algorithms act on data which has redundancy in information which is 'not
critical' in order for the data to retain 'meaning'. This usually means data that is to be
interpreted by human perception. The most classic cases of this type of data are
images (either still or moving pictures) and audio (either voice or music). Of the
former type, an example is the JPEG (Joint Photographic Expert Group) standard.
This takes advantage of the fact that high frequency components in an image are
much less important (for human interpretation) than low frequency components, and
it thus reduces (or loses) some of the information in this part of the data content in
order to compress the image into a smaller space. Lossy algorithms are also
characterised by achieving high data compression figures, as one might expect if data
can be thrown away. But this all depends on just how much data can tolerably be lost.
After all, in the limit, continued compression will result in just a single bit, but this
carries no interpretable data whatsoever.
By cutting down this data to smaller sizes, reduced storage requirements are not
the only benefit. A fair sized image may be 2MBytes as raw data, but might
compress down to 50KBytes. Imagine downloading this image over the web at, say,
5KBytes/s (on a good day). The original image would take nearly seven minutes.
Compressed, however, the image is transferred in ten seconds.
Lossless algorithms, by contrast, do not 'lose' one single bit of data, and any
compressed file is decodable to its original form. Here the algorithms work on data
redundancy due to 'inefficient' representation - effectively, on average, more than 1
bit is used to store 1 bit of information. If the data is 100% efficient, then no lossless
data compression technique can compress it any further, and will in fact expand the
data slightly due to its own coding overheads. However, much data does have this
type of redundancy in it; databases with largely empty search tables, source code with
repeated spaces and frequent keyword strings, English text, and so on.
In the world of computer storage, compression of data is desirable for two
reasons. Firstly, more data can be stored per unit of media (reducing the cost of that
media) and secondly the transfer rate to the device is increased in proportion to the
compression rate (with a sufficiently fast compression device, and a large enough
buffer), thus reducing the storage time for a given unit of data. It, of course, goes
without saying that lossless compression techniques are used exclusively in computer
secondary storage, where all types of data are to be stored, and these types are not
predictable in advance. Thus 'universal' compression algorithms are utilised. It is
possible that data is compressed by the computer system itself before sending to
secondary storage, but this is most likely to be much slower than the transfer rate
achievable between the host and a storage device. Dedicated hardware could reside
on the host to speed this operation up considerably, but this may become a resource
with multiple demands on it from several data transfers that may be required at any
one time, causing a bottleneck. The fastest solution of all is to have each storage
device contain within it its own dedicated compression hardware, to compress data
on-the-fly as it arrives at the device, and to decompress it as it leaves. The host is not
involved and only 'sees' an increase in the storage capacity, and an increase in the
maximum sustainable data transfer rate.
So, which algorithm to choose? Which is the best? There are certainly a few to
choose from (refs here), and, as with many engineering problems, there are tradeoffs
to be made before a final implementation is decided upon. These problems can be as
diverse as the maximum amount of dedicated memory that's required, or who owns a
particular patent and dictates the licensing agreements for that patent. In the
following sections we will look at some lossless algorithms in a general manner,
followed by a more detailed look at how two of these algorithms may be turned into
algorithms that can (and are) used in embedded hardware compression 'engines' in
data storage devices.
2 Lossless Data Compression Techniques
Lossless compression algorithms (sometimes referred to as noiseless or
reversible algorithms) are characterised by the ability to retrieve the pre-encoded
data with 100% accuracy. They all attempt to re-encode data in order to remove
redundancy, and this implies that data with no redundancy cannot be compressed by
these (or any other) techniques without some loss of information. Indeed, all these
algorithms will have some degree of overhead in their encoding methods and will
therefore actually expand non-redundant data. An example of such data might be a
data set that has already been compressed.
Lossless data compression techniques do not always remove redundancy with
100% efficiency, and often algorithms with different emphases are used together to
give more efficient compression. Or perhaps some pre-processing of data (which
doesn't compress) is used in order to make the data more favourable for the
compression technique, e.g. the Burrows-Wheeler transform (BWT).
Many algorithms exist with broadly similar characteristics, and subvariants of
these abound in countless measure, each trying to squeeze out the last drop of
efficiency. In the following discussions, various different types of algorithm are given
a perusal in order to make comparisons, and to give a context to two particular
schemes, which we will look at most fully. As has been stated before, one may
implement data compression as software running on a CPU, and much has been
written on software implementation (see Nelson and Gailly 1996 for an excellent
account). To give a different flavour, I want to look at these techniques with a
hardware implementation bent. Hopefully this will give a different perspective on the
subject, but it will also place different constraints on us that might not be applicable
to a software implementation. For example, we are likely to want to compress data in
excess of 40Mbytes/s, but the amount of memory available to us might be less than
1Kbyte. So what gives! Well, we will have to design an ASIC or FPGA, no small
task, and we are likely to fall short in compression ratio compared with a good
software compression program; but then a Formula 1 racing car might not be
comfortable, but it sure leaves a fat Mercedes standing. So let's actually look at our
first strategy: run-length encoding.
2.1 Run Length Encoding
This is probably the simplest algorithm to understand, though it is quite powerful
for particular types of data that have certain characteristics, as we shall see.
This algorithm works by substituting a run of repeated characters with a single
instance of the repeated character followed by a count indicating the repetition
number. So, for example, suppose we had a data set consisting of the following
sequence of characters from a 3 character alphabet:
AAAABBBBBBBCCCCCCCAAACCCCCCAAAAAAA
This might be encoded as:
A4B7C7A3C6A7
The new encoding has added some extra 'symbols' for encoding the run length, but in
this very simple scheme we have compressed the character sequence by a ratio of
34/12, or roughly 2.8:1. Of course, the scheme shown is too simple to be practical
(you could make it more efficient, for instance, by not having to encode a run length
of '1' for single characters).
The compression ratio also assumes that we didn't need to add any more 'bits' in
order to insert run length characters, which is the same as saying that the original
alphabet was grossly inefficiently encoded. A 3 character alphabet needs only 2 bits to
encode each character. Assuming that we allow run lengths of 1 to 10, then the run
lengths plus the original alphabet can all be encoded in 4 bit characters. So, the
original message has 34 x 2 = 68 bits, while the compressed message has 12 x 4 = 48
bits: a compression ratio of approximately 1.42. So you see, the practicalities of
encoding limit simplistic schemes, and one must be careful to understand what one is
compressing. It is very easy to compress data with high redundancy, but data is rarely
that forthcoming.
Even in a practical scheme, you should be able to see that it will only be good if
the data it compresses has a high incidence of repetition. This characteristic is not
shared universally between data sets, and run length encoding is often restricted to
data which is essentially a bit stream (hence has a two character alphabet, 1 and 0)
and which also has a high likelihood of repetition, such as one might find in graphical
data, e.g. bitmaps. It is also sometimes used as a post-processing stage for other
compression techniques whose output has some of the favourable characteristics
of run length encoding.
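The substitution described above can be sketched in a few lines. This is the naive character-and-count scheme from the example, not a production encoder (a real one must also decide how to represent the counts in bits, as discussed):

```python
def rle_encode(data):
    # Replace each run of a repeated character with a (character, count)
    # pair, as in the A4B7C7A3C6A7 example.
    out, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                      # extend to the end of the run
        out.append((data[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    # Expand each (character, count) pair back into a run.
    return "".join(ch * n for ch, n in pairs)

print(rle_encode("AAAABBBBBBBCCCCCCCAAACCCCCCAAAAAAA"))
# [('A', 4), ('B', 7), ('C', 7), ('A', 3), ('C', 6), ('A', 7)]
```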
2.2 Huffman Encoding
This scheme (Huffman 1952) is a 'variable-length' encoding. It is still widely
used today, but can be quite complicated to implement. The general aim is to have a
set of uniquely identifiable 'codewords,' which have different lengths, and then assign
the smallest codewords to represent the most frequent characters in the data set.
Of course, identifying which characters (or, more generally, symbols) occur most
frequently is not an easy task. We will need a 'model' of this character frequency; i.e.
a table of probabilities for the occurrence of any given character. There are many ways
to do this, but the simplest to understand is where the entire data set is processed,
counting character occurrences as we go. The table is then constructed as 'number of
characters X divided by total number of characters in the data set.' Actually, for the
method we're about to describe, the character totals are all we need, as only their
relative rank is important and we want to avoid any non-integer arithmetic in an
implementation. So, we can easily rank each character in terms of its frequency for a
given data set. Now we need a coding scheme with the following attributes. It is
made up of a variable length set of codes, and each code is uniquely identifiable from
all the others.
It isn't necessarily obvious how we might do this. One can't simply assign low
numbered codes with fewer bits, as in 0, 1, 10, 11, 100 etc. Given a bit stream of, say,
010101001101010, the coding scheme could yield many different interpretations. The
way to proceed is to have a coding scheme whereby no code is a prefix of another
code. For example: 00, 10, 11, 010, 011 etc. Here, the two bit codes are 00, 10 and 11.
This leaves 01 as an unused two bit number. All other codes must use this sequence
as the first two bits of their code. By expanding a scheme like this for a more practical
number of codewords (such as 256) an implementation can be realised. Note,
however, that it is not possible to have codes of a maximum of N bits encode 2^N
codewords, and the longer codewords will 'expand' the characters they encode.
Some characters must be more frequently encountered than others.
An algorithm can be described which automatically sorts the codes for each
character, assigning the smallest codes to the most frequent characters. The algorithm
builds a Huffman Tree, which can then be read off to find the code to use.
First the frequency counts for each character used in the data set are calculated, as
described above. The two lowest ranked (i.e. lowest count) characters are 'paired,'
forming a 'node' of a tree, where the new node has a count which is the sum of the
two characters' counts, and has 'branches' to each of the two paired characters. These
characters will no longer be considered in subsequent pairings. If more than one
choice is available when pairing, then it is totally arbitrary which choice is made. The
next two lowest values are paired in the same way, but any new nodes formed, with
their summed counts, must be considered as if they were characters. This procedure
continues until only one node exists at the top of the tree, which should have a count
total equal to the number of characters in the data set being considered. The 'leaves'
of the tree should be the characters that appear in the data set.
Codes can now be assigned as follows: starting from the top of the tree, to get to
a character one must move down a branch, left or right, to reach a given character.
We can assign (arbitrarily) 0 to mean left and 1 to mean right (though the reverse
works just as well). So the path from the top of the tree to the leaf containing the
character maps out a sequence of 0s and 1s for the left or right traversals. These are
the bits which make up the code for that character. All the other characters are
assigned codes in the same manner.
Let's have an example. Suppose that we wish to encode the short message
RINGADINGDING. The frequency table for this example is as follows:

  Character:  R  A  D  G  I  N
  Frequency:  1  1  2  3  3  3

The sequence of lowest frequency pairings is shown in Figure 2.1, below. The final
tree now gives us the coding for the characters. Choosing a left branch as 0, and a
right branch as 1, we have the codes of R=0000, A=0001, D=001, G=01, I=10 and
N=11. As you should be able to see, the most frequent characters have been assigned
the shortest codes, and the shorter codes are not prefixes of the longer ones. The
message (RINGADINGDING) is thus encoded as:
00001011010001001101101001101101
This is 32 bits, which gives a compression ratio (assuming 8 bit ASCII characters in
the original 13 character message) of 104/32 = 3.25:1 (not bad). However, if we
received the above encoded message, how would we know what the codes were and
how they had been assigned? Unfortunately, in our simple scheme, we would also
have to pass over the code table in order for the message to be decoded. In our
example, this would be an enormous overhead. As data sets get bigger the overhead
diminishes and the method becomes more practical.
Figure 2.1 Building a Huffman Tree
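The pairing procedure of Figure 2.1 can be sketched with a priority queue; this is a generic Huffman builder, not tied to any particular implementation. As the text notes, tie-breaking between equal counts is arbitrary, so the exact codes may differ from the R=0000, A=0001... assignment above, but the total encoded length will not:

```python
import heapq
from collections import Counter

def huffman_codes(message):
    # Count character frequencies, then repeatedly pair the two
    # lowest-count nodes until a single tree remains.
    freq = Counter(message)
    # Heap entries: (count, tiebreak, tree); a tree is a character (leaf)
    # or a (left, right) tuple (internal node). The unique tiebreak value
    # stops Python ever comparing the trees themselves.
    heap = [(n, i, ch) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: 0 = left, 1 = right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("RINGADINGDING")
encoded = "".join(codes[c] for c in "RINGADINGDING")
print(len(encoded))   # 32 bits, as in the worked example
```

Whatever tie-breaking choices are made, every Huffman tree for the same counts gives the same total encoded length, which is why the 32-bit figure is reproducible.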
It might be possible to agree on a fixed coding table, so that both sender and
receiver know the code mappings and a table needn't be sent. This is fine if the same
message is always sent, but every new message will have a different frequency table,
and thus a different Huffman tree with its code assignments. Data of similar origin
might have similar frequency distributions, for example English text. So we might
have an English text code mapping, and could use this if all messages are in English.
The compression rate will only be as good as the table, and so care would need to be
taken when constructing this 'model.' Many examples of English would need to be
used, and unless the message has exactly the frequency distribution of the fixed table
we are using, the compression ratio will always fall short of the ideal. This is why you
are unlikely to come across this kind of precalculated table in any implementation.
Constructing and sending the codes for the particular message is more likely, and the
penalty in speed efficiency of reading the data twice (once for the mapping, once for
the encoding) is borne.
An alternative is to start with all characters having equal probability. As new
characters are encoded, the frequency distribution is updated 'on the fly.' As the
character ranks change, so do the code assignments. The receiver has the same
algorithm available to it. As it decodes data it too keeps track of the relative rankings
and adjusts the code mappings in exactly the same manner as the encoder. This has the
advantage of obviating the need to transmit the coding table, and the encoded data is
not read twice in order to construct the codes. This will still not be as efficient
(compression ratio wise) as the Huffman tree construction mentioned above, but it is an
improvement on fixed tables, with the speed advantages mentioned, and it is used
widely with Huffman and other variable-length coding schemes.
2.3 Arithmetic Coding
Huffman coding, described in the previous section, has been called 'best,'
assuming the two pass encoding described above. This is true only if we use codes
that are an integer number of bits, and are fixed. A related method is Arithmetic
coding (Witten et al. 1987). With this method symbols can effectively be encoded in
fractions of a bit. The method will achieve at least as good a result as Huffman
coding, and will often beat it. The penalty for this gain, as with so many things in
engineering, is increased complexity.
The basis for the algorithm is to find a unique fractional number for the message
being encoded, based on the probability of the symbols appearing in the exact order
as in the message. It won't be an exact probability measure, as two messages may
have exactly the same probability. As with Huffman encoding, each symbol has a
probability of appearing in the message, which is the frequency of the symbol divided
by the length of the message. So, for example, if the symbol '#' appears 5 times in a 20
symbol message, its probability is 5/20 = 0.25. For a message to be encoded, all its
symbols have their probability calculated in such a manner, with the totals, hopefully,
adding to 1 (i.e. there is a 100% chance of finding a symbol that is somewhere in the
message). Next, the probability space is divided up into ranges matching the symbol
probabilities, stacked on top of each other, as it were. E.g. if a message has a 3
symbol alphabet (A1, A2 and A3, say), and the probabilities for the symbols are 0.25,
0.45 and 0.30, then we would divide the range of 0 to 1 into 0 to 0.25, 0.25 to 0.7
and 0.7 to 1.0. When the first symbol arrives for encoding, we select the range
correspondingly (so if it were A2, the range is 0.25 to 0.7). This new range is now
our working range and is divided into the same proportional segments as the 0
to 1.0 range was initially. When the next symbol is encoded we modify the current
range to be the sub-range of the first symbol's range. So, if the next symbol is A1,
then we need the range 0 to 0.25 of the range 0.25 to 0.7, which is 0.25 to 0.3625.
This continues for the entire message. The range gets narrower, and the decimal
fractions get longer. When the entire message is encoded, we can pick one of the
range bounds as the encoded message (usually the lower bound).
Let's have a more extended example to illustrate these ideas. We wish to encode
the message RINTINTIN. (10 characters, including the fullstop.) Firstly we work
out the symbol probabilities (which I have made easier by choosing a 10 symbol
message):
  Symbol   Probability   Lower Bound   Upper Bound
  R        0.1           0.9           1.0
  I        0.3           0.6           0.9
  N        0.3           0.3           0.6
  T        0.2           0.1           0.3
  .        0.1           0.0           0.1
Also shown in the table are the range allocations for the symbols, with the upper and
lower bound difference equalling the symbols' probabilities. It should be noted that
the order in which the symbols' probability ranges are allocated is arbitrary, though, of
course, the encoder and decoder must agree on this point.
So, to start encoding the message, we take the first symbol 'R', and modify the
current range to be 0.9 to 1.0. The final encoding will have a value which must lie
between (or possibly on) these two values. The next symbol is 'I', with a range of 0.6
to 0.9. We then calculate what this range is within the previously calculated range.
So, 0.6 of the way between 0.9 and 1.0 is 0.9 + ((1.0 - 0.9) * 0.6) = 0.96. Similarly,
0.9 of the way between 0.9 and 1.0 is 0.99. So we now know the message values lies
between 0.96 and 0.99. We have narrowed the range of possible numbers for our
message. This procedure is carried through for the rest of the message's symbols,
with the range becoming more and more restricted. The table below shows the results
for the entire example message.
  Symbol   Lower Bound    Upper Bound    Difference
  R        0.9            1.0            0.1
  I        0.96           0.99           0.03
  N        0.969          0.978          0.009
  T        0.9699         0.9717         0.0018
  I        0.97098        0.97152        0.00054
  N        0.971142       0.971304       0.000162
  T        0.9711582      0.9711906      0.0000324
  I        0.97117764     0.97118736     0.00000972
  N        0.971180556    0.971183472    0.000002916
  .        0.9711805560   0.9711808476   0.0000002916
As can be seen, the resulting numbers converge toward a particular point, with
the difference between the upper and lower bounds becoming smaller and smaller.
The final lower bound (0.9711805560) can now be used to uniquely identify the
message. To decode the message, the reverse is applied. Since 0.9711805560 lies
between 0.9 and 1.0, we know that the first symbol must be 'R'. If we now subtract
the lower bound value for 'R' (i.e. 0.9) from this number, and divide by the
probability for 'R' (0.1), that is (0.9711805560 - 0.9)/0.1, then we get the value
0.711805560. Since this lies between 0.6 and 0.9, the next symbol must be 'I'. This
process is repeated to decode the entire message.
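The narrowing-range procedure in the tables above can be sketched directly with floating point. Two liberties are taken for numerical safety: the encoder returns the midpoint of the final range rather than the lower bound used in the text (the lower bound sits exactly on a decoding boundary, which floating-point rounding handles badly), and the decoder is told the message length rather than relying on the terminating fullstop:

```python
def arith_encode(message, ranges):
    # Repeatedly narrow [low, high) to the current symbol's sub-range.
    low, high = 0.0, 1.0
    for sym in message:
        s_low, s_high = ranges[sym]
        width = high - low
        low, high = low + width * s_low, low + width * s_high
    # The text picks the lower bound; the midpoint is used here because,
    # with floats, the lower bound lands exactly on a decode boundary.
    return (low + high) / 2

def arith_decode(value, ranges, length):
    # Reverse the process: find which symbol's range the value falls in,
    # then rescale the value into that range and repeat.
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= value < s_high:
                out.append(sym)
                value = (value - s_low) / (s_high - s_low)
                break
    return "".join(out)

# Range allocations from the table above.
ranges = {"R": (0.9, 1.0), "I": (0.6, 0.9), "N": (0.3, 0.6),
          "T": (0.1, 0.3), ".": (0.0, 0.1)}
code = arith_encode("RINTINTIN.", ranges)
print(arith_decode(code, ranges, 10))   # RINTINTIN.
```

A real implementation would, as the next paragraphs describe, use scaled integer arithmetic and emit bits as soon as the leading digits of the bounds agree, rather than carrying ever-longer fractions.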
Now you are probably thinking that not only is this method more complicated
than Huffman coding, it is very much more complicated than Huffman coding,
and is the effort worth it? We have used multiplication and division, with large
decimal fractions with many significant figures. Can we even handle this in a
computer program or an ASIC? We can, fortunately, and using only integer
arithmetic and minimal storage of bits. Firstly, any decimal fraction can be
represented as a binary fraction, as one might expect, and, like floating point
numbers, decimal places may be inferred, effectively scaling the numbers to be
integers. Also, it may be noticed from the table that, once a significant figure is the
same in both the upper and lower bound values, it will remain at this value for the
rest of the encoding. This means that, when this occurs, we can output that value (or
bit), rescale the remaining bits, and continue. With all this it is possible to get sufficient
resolution with just 16 bit arithmetic. There is one problem where the value may be
converging to a point where (equivalently in decimal) the lower bound values are
...9999... and the upper bound values are ...0000... . This can be solved by
'borrowing', in a similar fashion to arithmetic subtraction. It is mathematically
provable (though not by me) that 1.0 = 0.999999... recurring. We can substitute the
zeros with this, so long as we remember, and correct the situation once the
convergence has resolved one way or the other (i.e. on the ...9999... side or the
...0000... side).
So, we can solve some of the practical difficulties, and make this a viable
algorithm for use in software or hardware systems. Having said this, the
implementation is still not as simple as other alternatives, and this may be why
Arithmetic Coding hasn't caught on as one might expect. That isn't to say it hasn't
been used. IBM introduced an Arithmetic codec (coder/decoder) implementing a
variant called IDRC (Improved Data Recording Capability - it was used in a tape
storage device), which Exabyte, for one, have used.
2.4 Models
The Huffman and Arithmetic Coding techniques outlined above both require a
table of symbol probabilitiesa 'Model.' This can be a 100% accurate model, as
used in the examples for these methods, but which requires a time consuming first
pass over the data in order to construct the mode, which also needs to be transmitted
(or stored) with the data in order to reconstruct the message. Adaptive methods were
eluded to in the Huffman discussion. These have the advantage of needing no first
pass of the data, and the model may be reconstructed on decoding, obviating the need
for transmission or storage of that model with the message. The price paid is
decreased compression ratios.
Models, ideally, should accurately reflect the probability distribution of the
symbol occurrences, which, if compression is to occur, should be non-uniform. In the
above discussions, it has been assumed that a symbol's probability is fixed. This was
calculated as the number of occurrences of the symbol divided by the message length.
This takes no account of any other symbol in the message. This is a zero-order model.
A symbol's probability does, in fact, jump around quite a bit. For example, a text file
may have a space character after every word, which might be a probability of 1 in 10,
or 0.1. If the current character is a fullstop, then the probability of a space might
rise to 0.7, say. By looking at the context of the character, i.e. the symbols occurring
around it, a better statistical model may be obtained. The order of the model indicates
the number of symbols used in determining the current symbol's context. So a 2nd
order model will use the two previous symbols in order to calculate the current
symbol's probability. However, as the model order goes up, so do the memory
requirements for storing statistics for each context. In a zero-order model, with 8
bit symbols, only 256 statistics need to be kept. A 1st order model needs 65536 and a
2nd order model needs 16Mwords. I.e. it quickly gets very silly, and managing the
resource also becomes cumbersome, having to search through larger amounts of data,
and updating (in adaptive methods) greater table sizes.
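A first-order model of the kind just described (one previous symbol of context) is easy to sketch; the sample text here is made up purely for illustration:

```python
from collections import Counter, defaultdict

def build_first_order_model(data):
    # First-order model: for each single-symbol context (the previous
    # character), count which symbols follow it. With 8 bit symbols a
    # full table of these counters has 256 x 256 entries, as in the text.
    model = defaultdict(Counter)
    for prev, cur in zip(data, data[1:]):
        model[prev][cur] += 1
    return model

model = build_first_order_model("the cat sat on the mat. the end.")
print(dict(model["t"]))   # what tends to follow 't' in this sample
```

Even this tiny sample shows the point made above: after a 't' the symbol 'h' is far more likely than average, so a coder driven by this context model can assign it a shorter code.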
There is another way: the LZ adaptive coding class of algorithms, which we shall
look at next.
2.5 LZ Adaptive Coding
The Lempel-Ziv family of compression techniques are what are known as
'substitutional' compression algorithms. They are characterised by the use of
'codewords' to substitute for 'strings' of symbols, which then reference a store to
reconstruct the strings on decoding. The methods have some advantages over other
methods in that they are 'universal'; i.e. they make no assumption about the data set's
symbol probability distributions, and are inherently adaptive. They also
automatically encode the 'model' table within the encoded data, so it can be
reconstructed by the decoder, without additional data transmission.
There are two main flavours of LZ algorithm: LZ1 (also known as LZ77; Ziv et
al. 1977) and LZ2 (also known as LZ78; Ziv et al. 1978). The first of these is
conceptually the simplest to understand, but can be slightly more complicated to
implement in software, or indeed hardware. In brief, the last n symbols are kept in a
buffer. As new symbols arrive, they are appended to the current 'string' of symbols. If
the string is in the buffer, then the string is maintained, and the buffer updated with
the new symbol. Once the string has grown to a point where it is no longer in the
buffer, the string as it stood before the last symbol was appended (which was in the
buffer) is substituted for a code which consists of a pointer into the buffer where the
string occurs, and a run-length indicating how long the string was. The symbol that
caused the search to fail now becomes the single symbol string, it is appended to the
buffer, and the process continues. Since only the last n symbols are remembered, the
buffer can be thought of as a window on the data, n symbols wide, sliding along the
data set. Hence you will often find this method called the Sliding Window technique.
The buffer is a history of the last n symbols, and so is often referred to as a History
Buffer. In its purest form all strings are encoded as pointer/run-length codes, but this
is inefficient if the string is a single symbol. Variants might allow single characters
to be encoded as 'literals', but then the pointer/run-length codes must be
distinguishable from literals.
So this is quite a simple method. The difficulty is in determining whether a
particular string is in the history buffer, and ensuring the search is exhaustive. This
method is, in my experience, easier to do well in hardware than in software, where
Content Addressable Memory can be implemented, if not easily, at least with some
confidence. More on this later.
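The sliding window scheme just described can be sketched in a few lines of Python. This is only an illustrative toy (a tiny window, a naive exhaustive search, and every output a pointer/run-length/next-symbol triple, as in the purest form described above), not a production LZ1 coder:

```python
def lz77_encode(data, window=16):
    """Toy LZ1 sketch: emit (offset, length, next-symbol) triples."""
    out, i = [], 0
    while i < len(data):
        start = max(0, i - window)
        best_off, best_len = 0, 0
        # Exhaustively search the history buffer for the longest match,
        # leaving at least one symbol over to emit as the 'next' symbol.
        for j in range(start, i):
            k = 0
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out
```

Decoding simply reverses the substitution: for each triple, copy `length` symbols starting `offset` back from the end of the output, then append the literal symbol.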
The second method (LZ2) is similar to LZ1 in its substitutional nature, but uses a
'dictionary' of strings which have previously been seen in the input data, with
codewords making references into the dictionary's list of strings, or 'entries.' These
codewords are then substituted in place of the strings of symbols.
Initially the dictionary contains only the single symbol strings (e.g. values 0 to
255 for 8 bit bytes). As symbols arrive, they are appended to the current string, and
the dictionary is searched to see whether it has an entry for that string. If it does, then
a new symbol is appended to the current string. This continues until no entry is found
in the dictionary for the current string. When this happens a new entry is assigned a
codeword, and the string placed in the dictionary entry for that codeword. The
codeword for the last matching string is output, and in the purest LZ2 form, the
symbol causing the mismatch is also output. However, in the most popular variant,
due to Welch (Welch 1984), called LZW, the offending symbol becomes the current
single symbol string, and the process continues.
So this rounds up the four major algorithms (excluding run-length encoding,
rarely used on its own) used in varying degrees in both software and hardware
implementations. Next I want to look in more detail at one of the more popular LZ
algorithms, LZW, as a case study, and at what it takes to make it a fully specified
and operational algorithm which can be implemented as a hardware codec, usually as a
standard for tape storage devices.
3 The Lempel-Ziv-Welch Compression Algorithm
As has been mentioned in section 2.5 above, the LZ2 algorithm, in particular the
LZW variant, has a dictionary with a number of entries. Each entry has a codeword
associated with it, with a unique value. Each entry refers to a string of symbols that
have either been seen in the incoming data stream before, or are the single symbol
strings that the dictionary was initialised with.
From now on, symbols will be 8 bit bytes, with values 0 to 255. It doesn't matter
if these bytes represent ASCII codes, or object data or whatever. The entries, with
their associated codewords, which represent the single byte strings are known as root
entries (or codewords). To make this algorithm practical we will have to start
imposing some restrictions on what is allowed. For example, we cannot keep building
new entries in our dictionary indefinitely, and it is usual (though not universal) to
restrict codewords to be 12 bits wide, giving a possible 4096 codewords, 256 of
which must be the root codewords. The others remain unassigned at the start of a
compression procedure, and we are free to assign them to particular strings as we
choose.
In our previous discussions, no limit has been placed on the size of the strings that the
dictionary may refer to. There are two reasons why we might want to restrict this.
The first is to do with the fact that our dictionary will be held in memory, and we
cannot allow the dictionary to grow unpredictably large. Secondly, for reasons that
we will come on to, on decompression, we will have to hold entire strings together
before outputting the bytes that make them up, and we will want to limit the holding
store used to do this. These are practical restrictions, rather than algorithm
necessities, but realising the algorithm in tangible form is what we're aiming to do.
Now, each string in the dictionary is one that we have seen in the input data
stream, and we only add new entries when we see a string which has not been seen
before (but we do always add it). It must therefore always be the case that a new entry
consists of a string that already has an entry in the dictionary, plus the single byte
which caused the string to be 'new.' Therefore, we can always make a new dictionary
entry by storing a byte value and a pointer to another entry in the dictionary, where
the entry pointed to, appended with the byte value, equals the new string being added.
If a codeword is 12 bits, and the byte takes 8 bits, all 4096 entries in the dictionary
will need 80K bits in total. This, then, gives us a predictable maximum dictionary
size, which is a manageable quantity.
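A dictionary entry, then, is nothing more than a prefix pointer and a byte. The Python sketch below (the function names and the tiny example dictionary are mine, for illustration only) shows both the storage calculation and how a full string is recovered by walking the pointers:

```python
def entry_bits(entries=4096, codeword_bits=12, byte_bits=8):
    """Total dictionary storage: each entry is a 12-bit prefix pointer
    plus an 8-bit byte, i.e. 20 bits, and 4096 x 20 = 81920 bits."""
    return entries * (codeword_bits + byte_bits)

def string_of(dictionary, code):
    """Recover the string for a codeword by following prefix pointers
    back to a root entry.  The dictionary maps a codeword to a
    (prefix codeword or None, byte) pair."""
    chars = []
    while code is not None:
        code, byte = dictionary[code]
        chars.append(byte)
    return ''.join(reversed(chars))

# A root entry for 'R' plus two built entries: (R)I, then (RI)N.
d = {ord('R'): (None, 'R'), 256: (ord('R'), 'I'), 257: (256, 'N')}
print(entry_bits())       # 81920
print(string_of(d, 257))  # RIN
```

Note that the bytes come off the pointer chain in reverse order, a point which will matter when we come to decompression.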
Now I've gone through some of the practical aspects of implementing this
algorithm before embarking on describing the actual algorithm in detail, because the
algorithm is not independent of practical considerations. It was designed to be
implemented in digital form, and the way it is described is based on these practical
concerns. So the next section describes the algorithm in more general terms, but you
should see that the method is shaped by the restrictions we have placed on our
implementation. We will first look at compression, working through an example to
illustrate what the process is. The example we will use is of encoding the message
RINGADINGDING.
3.1 Compression
The table below (Table 3.1) shows the sequence of operations when compressing
the example. It is assumed that, at the start of the compression, the dictionary is in its
initial state; i.e. it contains entries for the 256 single byte strings only. So as to avoid
getting confused with actual numbers, the following convention will be used. All of
the bytes will be represented by their ASCII character (e.g. R). A codeword for a
string will have its string surrounded with parentheses; e.g. (IN) is the codeword for
the string 'IN'. At this point it won't matter what value is assigned to the codewords,
so long as it is understood that they will all be different.
Row | Input | String | Match? | Build Entry | Output
----+-------+--------+--------+-------------+-------
  1 |   R   |   R    |  yes   |      -      |   -
  2 |   I   |   RI   |  no    |    (R)I     |  (R)
  3 |   N   |   IN   |  no    |    (I)N     |  (I)
  4 |   G   |   NG   |  no    |    (N)G     |  (N)
  5 |   A   |   GA   |  no    |    (G)A     |  (G)
  6 |   D   |   AD   |  no    |    (A)D     |  (A)
  7 |   I   |   DI   |  no    |    (D)I     |  (D)
  8 |   N   |   IN   |  yes   |      -      |   -
  9 |   G   |   ING  |  no    |    (IN)G    |  (IN)
 10 |   D   |   GD   |  no    |    (G)D     |  (G)
 11 |   I   |   DI   |  yes   |      -      |   -
 12 |   N   |   DIN  |  no    |    (DI)N    |  (DI)
 13 |   G   |   NG   |  yes   |      -      |   -
 14 | <EOF> | use best match  |      -      |  (NG)
Table 3.1 Compression of RINGADINGDING
From the table, the first character input is 'R', as shown in the leftmost column.
Since this is the first input of any kind, the current string we are working with
must be an empty (or NULL) string. Adding the 'R' makes it a
single character string, as shown in the next column. Is this string in the dictionary?
Well, yes. All single character strings are always in the dictionary; that is how it is
initialised. The third column indicates this with a 'yes'. When a match is found in the
dictionary, no new entry is built (the matched entry already exists), and nothing is
output (under normal circumstances). The next character is input, 'I', and the current
working string becomes "RI". This string is not in the dictionary. We haven't
built any new entries yet, so only single character strings can match. The 'no' in the
third column shows that no match is to be found. So, a new string has been seen, and
an entry must be built for it.
From an algorithm point of view, the entire string could be stored in any manner,
so long as it can be matched against (however inefficiently) at some future point.
However, we're working toward a practical implementation, and introducing a
representation now that isn't strictly necessary will save us a lot of trouble later.
Since the entry to be built is for a string which must be composed of a string already
in the dictionary, plus one character, then all that is needed to store the new string is
the codeword for the string already in the dictionary, and the character which, when
appended, makes the new string. If it is not clear to you why this works, remember,
we keep pulling in characters from the input, appending them to the current string,
until the current string is not in the dictionary. Therefore, the current string at that
point must be one character longer than a string with a dictionary entry. Thus, as
shown in column 4 of the table, the new entry is '(R)I', meaning the codeword for
the string "R" (a single character string in this case), followed by the character 'I',
representing the string "RI". At this point output is required. On a mismatch we
output the codeword for the string last matched against. This is the same codeword
as was added in the new entry we built (in this case (R) ).
Now, you might ask, why not output the codeword for the new entry we just
built, as this represents all the input we have had so far? Well, when we decompress
the data, we'd be in the unfortunate position of receiving a codeword for a string
which wasn't in the dictionary. For example, (RI) would be the first codeword for
decompression which clearly cannot exist in a dictionary which is initialised only for
single character strings. As we shall see later, even if we don't output new entry
codewords as we build them, this problem is not entirely avoided. More of this later.
So (R) is output, which still leaves the input character 'I' unrepresented, and so this
becomes our new string, lest we forget about it.
The next input character is 'N', making a new current string of IN. There is no
match in the dictionary, and so '(I)N' is added, as for '(R)I' previously. The
codeword '(I)' is output, and the new string becomes "N". Now this procedure
remains the same for the next four input characters (G, A, D, I) where four new
entries are built ((N)G, (G)A, (A)D and (D)I) and four codewords are output ((N),
(G), (A), (D)).
Then, something more interesting happens (at last!). We have "I" as our
unprocessed string, and 'N' as our input character, giving us "IN", which is (hurrah!)
in the dictionary, built at row 3. Now all we have to do is remember the entry for
"IN" (stored as (I)N) and input the next character. No codeword is output, and no
new entry is built. The next character input is 'G', giving us "ING" as our current
string, and this is not in the dictionary. In fact, no three character entries have yet
been built. The procedure for a failed match applies once more. A new entry is built,
stored as (IN)G. That is, the entry number for (IN) is stored, along with the character
value for 'G'. The string we have matched is output (i.e. (IN)); the first dynamically
generated entry codeword. And the current string is set to "G".
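The whole compression loop, including the end-of-data flush described below, can be captured in a few lines of Python. This sketch keeps whole strings as dictionary keys for clarity, rather than the pointer/byte entries a real implementation would use, and ignores the 4096-entry limit:

```python
def lzw_compress(data):
    # Initialise the dictionary with the 256 single-byte root strings.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    out, current = [], ""
    for ch in data:
        if current + ch in dictionary:
            current += ch                         # match: keep growing
        else:
            out.append(dictionary[current])       # output last match
            dictionary[current + ch] = next_code  # build the new entry
            next_code += 1
            current = ch                          # mismatching byte is
                                                  # the new current string
    if current:
        out.append(dictionary[current])           # <EOF>: use best match
    return out
```

Compressing "RINGADINGDING" yields ten codewords, matching the Output column of Table 3.1: the six roots (R)(I)(N)(G)(A)(D), then (IN), (G), (DI) and finally (NG).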
The rest of the input is processed in the same manner, until the end. There is one
further match to a dynamically generated dictionary entry, and it is processed in the
same manner as described above. The only thing left to clear up is what to do with
the end of the data stream. It will always be the case that at the end of processing
there will be a string that is matched in the dictionary, but has not been represented in
the output. This could be a single character string, or, as in our example a match with
a dynamically allocated entry. Since the normal procedure is to carry it forward for
appending the next input character, we must do something different, since no more
input is forthcoming. In this case we simply 'pretend' that there is no match, and
output the codeword for the current outstanding string. No entry is built, and the
current string is effectively set to the null string. All the input has been represented in
the output, in an encoded form, which must, therefore, contain enough information
about the original data to recover it. It doesn't explicitly contain the dictionary, but,
as we shall soon see, this too has been encoded in the output data; a feature which is
one of the major attractions to this class of algorithm. So, without further ado, let's
decompress the encoded data, and see what happens.
3.2 Decompression
In decompression, the input must consist of the codewords we have previously
generated as the compressed output, in the order in which we generated them. These
will be, as in our compression example, a mixture of root codewords and dynamically
allocated codewords. We can only start with an initialised dictionary, as the final
compression dictionary has not been stored with the data. This restricts the point at
which decompression can begin within a compressed data set. Only at points where
the dictionary was in its initial state during compression can decompression begin. In
the simple examples here, this means at the beginning of the data.
Table 3.2 below shows the decompression sequence for our previously
compressed example. The first six codewords are all root codewords, and their byte
values may be sent directly to the output. At each stage (except for the first
codeword) a dictionary entry is built with a byte value of the current input codeword,
and a pointer to the previously processed codeword's dictionary entry. I.e. an entry
representing the last string processed with the current codeword's byte value
appended to it. Hence the first entry built is (R)I, since, when (I) is input, the last
codewords processed was (R) and the byte value of (I) is 'I'.
I/P  | Byte Value | Ptr | Root? | LIFO | Build Entry | O/P
-----+------------+-----+-------+------+-------------+----
(R)  |     R      |  -  |  yes  |  R   |             |  R
(I)  |     I      |  -  |  yes  |  I   |    (R)I     |  I
(N)  |     N      |  -  |  yes  |  N   |    (I)N     |  N
(G)  |     G      |  -  |  yes  |  G   |    (N)G     |  G
(A)  |     A      |  -  |  yes  |  A   |    (G)A     |  A
(D)  |     D      |  -  |  yes  |  D   |    (A)D     |  D
(IN) |     N      | (I) |  no   |  N   |             |
     |     I      |  -  |  yes  |  NI  |    (D)I     |  I
     |            |     |       |      |             |  N
(G)  |     G      |  -  |  yes  |  G   |    (IN)G    |  G
(DI) |     I      | (D) |  no   |  I   |             |
     |     D      |  -  |  yes  |  ID  |    (G)D     |  D
     |            |     |       |      |             |  I
(NG) |     G      | (N) |  no   |  G   |             |
     |     N      |  -  |  yes  |  GN  |    (DI)N    |  N
     |            |     |       |      |             |  G
Table 3.2 Decompression of Previously Compressed Data
The seventh codeword input is (IN); a non-root codeword. It already has an
entry in the dictionary at this point, as shown on the third line of the table. This
entry has a byte value of 'N', and so this is placed on a 'LIFO' (last in, first out
store). As we follow this example through, you will see that the byte values
recovered from a multi-byte codeword come out in reverse order. Therefore we must
store them until the entire string is decoded, and then output them in the correct
order. A LIFO is used, where we can 'push' bytes into it, and then 'pop' them when
we have all the bytes, the last byte pushed being the first byte popped.
The entry for (IN) also has a pointer to
another entry; in this case it is the root entry (I). Its byte value is also pushed onto the
LIFO and, as it is a root codeword, the contents of the LIFO are flushed to the output.
At this time, an entry is built using the same rules as before. The new entry has a
pointer to the previously processed codeword (i.e. (D)) and a byte value equal to the
current string's first byte. Since we terminated the linked-list traversal pointing at
(I), the byte value is 'I', as shown in the table. The rest of the input codewords are
processed in a similar manner, with the entire original data recovered, and the
dictionary left in the same state as at the end of compression.
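The decompression procedure, with its pointer chasing and LIFO, might be sketched as below. It follows Table 3.2 directly: entries are (pointer, byte) pairs, and each codeword's bytes are pushed onto a LIFO and popped out in the correct order. Note that this version assumes every input codeword already has an entry; the exception to that assumption is the subject of the next section.

```python
def lzw_decompress(codes):
    # Roots 0-255 have no prefix pointer; built entries point back at
    # the entry for their prefix string, as in Table 3.2.
    dictionary = {i: (None, chr(i)) for i in range(256)}
    next_code, prev, out = 256, None, []
    for code in codes:
        lifo, c = [], code
        while c is not None:            # follow the pointer chain,
            c, byte = dictionary[c]     # pushing each byte value
            lifo.append(byte)
        first = lifo[-1]                # deepest byte = first character
        while lifo:
            out.append(lifo.pop())      # flush the LIFO to the output
        if prev is not None:
            # Build the entry the compressor built at this step: a
            # pointer to the previous codeword, plus the current
            # string's first byte.
            dictionary[next_code] = (prev, first)
            next_code += 1
        prev = code
    return ''.join(out)
```

Feeding it the ten codewords from the compression example recovers "RINGADINGDING", building exactly the entries shown in the Build Entry column.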
So, we have managed to recover the data, without loss of information, and
without storing the dictionaryall was encoded uniquely within the compressed data
itself. And that would be all there is to it, except there is one slight twist in the tale.
There is a circumstance where, during decompression, a codeword is encountered
which has no entry in the dictionary! This is the KωK exception case.
3.3 The KωK Exception Case
Welch describes in his paper a particular corner case in the LZW algorithm
which, if processed as described above, will cause a breakdown of the procedure. It is
called the KωK (kay-omega-kay) exception case, since inputting the sequence KωK,
where K is any single character and ω is a string for which there is a dictionary entry,
causes a codeword for the string KωK to be output (during compression) immediately
after it is built in the dictionary. This causes a problem on decompression, since no
entry will have been built on encountering it, as entries are built only after the
codeword is processed. More properly, the sequence KωKωK causes the problem, and
the clue to its solution is in the fact that the entry needed is the very next entry we
would have built at the end of processing the codeword causing the problem. So, we
might have enough information to construct the missing entry uniquely, and recover
the situation.
To see how, an example is in order. The simplest KωK sequence is the repeated
character. This is because K can be any character, and all single character strings have
entries in the dictionary by default. Table 3.3 below shows a repeated character
compression sequence.
I/P | String | Match? | Build Entry |  O/P
----+--------+--------+-------------+------
 K  |  K     |  yes   |      -      |   -
 K  |  KK    |  no    |    (K)K     |  (K)
 K  |  KK    |  yes   |      -      |   -
 K  |  KKK   |  no    |    (KK)K    |  (KK)
 K  |  KK    |  yes   |      -      |   -
 K  |  KKK   |  yes   |      -      |   -
 K  |  KKKK  |  no    |    (KKK)K   |  (KKK)
Table 3.3 Repeated Character Compression Case
The procedure followed is, I hope you can see, exactly that in the previous
compression example. Entries are built when no match is found, and the last
character input is carried forward as the new current string.
Now, if we are presented with the codeword sequence shown in the output
column of the table, what will happen? We can process the first codeword, as it is a
root codeword with a known byte value. No entry can be built, as no previous
codeword has been processed. The next codeword is (KK). But this is a dynamically
allocated codeword, and we haven't built any entries yet. There are a couple of
slightly different methods for detecting, and then correcting, this problem but,
basically, the problem is detected if the value of the codeword is the same as the next
dictionary entry that will be built (and they are built in ascending order). It is
corrected by placing the byte value of the last processed codeword onto the LIFO,
and resubmitting the last codeword value again in place of the current problematic
one. This will, once the LIFO is flushed, place the right characters on the output, and
will result in the 'missing' dictionary entry being built, using the normal procedures.
An alternative exists where the missing entry is built first, but the result is just the
same, and it is left to the reader to figure out the details. This final detection and
correction works for all KωK cases, and ties up the last loose end, as far as the basic
algorithm goes.
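The detection and correction can be folded into a decompressor in just a couple of lines. The sketch below uses whole strings as dictionary values for brevity (rather than pointer/byte pairs and a LIFO); the key observation is that the missing entry can only be the previous string plus its own first character:

```python
def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]         # first codeword must be a root
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # KwK exception: the codeword's entry has not been built
            # yet, but it can only be prev plus prev's first character.
            entry = prev + prev[0]
        else:
            raise ValueError("corrupt input codeword")
        out.append(entry)
        dictionary[next_code] = prev + entry[0]   # build the entry
        next_code += 1
        prev = entry
    return ''.join(out)
```

The repeated-character sequence of Table 3.3 exercises the exception twice: the codewords (K)(KK)(KKK) both arrive before their entries exist, yet decode correctly to a run of Ks.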
Describing an algorithm and manipulating symbols on paper shows how an
algorithm might work, but we haven't arrived yet at a method which is fully
practical. There are still some details to consider before the above can become a
working piece of software, or a digital hardware implementation. Some additional
(fortunately much simpler) algorithms need to be roped in, and some design
constraints and limits placed on the algorithm to map the symbolic representation
onto more practical units of information, such as bytes. But that must wait
for the next article.
Last updated 16th July 2007
© 19992007 Simon Southwell. All rights reserved.
simon@anitasimulators.org.uk