Data Compression for Beginners

by Carole Williams

Last updated November 30, 1996 (this is OLD INFO!).


Data compression is the process of reproducing information in a more compact form (Sayood, 1996, p. 1). People compress data to save space in memory and to decrease data transmission times. Data compression algorithms remove repeated and/or irreleva nt information from the data (Jacobson). The compression ratio is a measurement of how well a particular piece of data is compressed. Literally, the compression ratio is the number of bits required to represent the data before compression divided by the number of bits required to represent the data after compression (Sayood, p. 5).

Data compression is either lossless or lossy. In the "equation" X-->Xcompressed-->Y, X is the original data, Xcompressed is the data after compression, and Y is the data uncompressed. After lossless compression, Y is identical to X. After lossy compression, Y is not exactly X because some bits have been removed (Sayood, p. 3). Lossless compression is required when the user needs to reconstruct the original data exactly. If the post-compression data will be processed or enhan ced in order to produce more information, lossless data compression is necessary (Sayood, p. 3). Most textual data requires lossless compression (Sayood, p. 3).

Lossy compression has one big advantage over lossless: lossy generally allows for higher compression ratios (Sayood, p. 3). Sound and video data are often compressed using lossy compression because the loss of small amounts of those types of data ge nerally does not reduce the perceived quality of the post-compression data (Prosise, 1993).

Users choose among data compression schemes depending upon the characteristics of the data they want to compress (Sayood, p. 6), since some schemes work better and enable higher compression ratios for some types of data, but not for others. Well-know n data compression schemes include Huffman coding, arithmetic coding, dictionary techniques, run length (en)coding, quantization, transform coding and linear prediction.

The basic idea behind Huffman coding is that the algorithm reads through the data and assigns sequences of binary code to the symbols used in the source. Compression occurs because Huffman coding assigns short code sequences to the common symbols and longer code sequences to the uncommon symbols (Jacobson). A simplistic example: in the English alphabet, 'e' is the most common letter. In an English text document, the most common letter (probably 'e') gets the shortest binary sequence: possibly '1' . The next most common letter could get a '0'. As the shorter sequences are used up, the algorithm inserts longer and longer sequences until letters like 'q' or 'z' get the longest sequences of code. Huffman coding does not work well when one or a few symbols are very common and the rest very uncommon.

Arithmetic coding works well in that situation and with sources that have very small alphabets (for example, binary sources) (Sayood, p. 61). In arithmetic coding, a code (usually a range between 0 and 1) is assigned to a sequence of symbols rather t han to individual symbols, allowing for higher compression ratios. Arithmetic coding is more complex than both Huffman coding and my explanation. JBIG (Joint Bi-level Experts Group), the standard for coding black and white (a binary source: one bit per pixel) images, is a well-known application of arithmetic coding (Gailly, FAQ [19]). JBIG's standard also allows for progressive encoding. During progressive encoding, a low resolution representation of the image (requiring comparatively less bits) is s ent first, so that the user on the other end of the transmission has something to look at immediately. More bits are sent to refine the image to better quality (Sayood, p. 87-88).

Dictionary techniques look for repeated symbols in the source. The algorithm program identifies strings of repeated symbols and stores a (hopefully shorter) code instead of the string (Robert, 1994). Dictionary techniques work best with small source s of frequently occurring symbol patterns, such as text (Sayood, p. 97). Several well known adaptive dictionary techniques are the result of two papers written by Jacob Ziv and Abraham Lempel in 1977 and 1978. Their 1977 paper outlined a dictionary tech nique which used the source output as the dictionary (Sayood, p. 116), so that the compressed file is made up of codes which point to previous sequences found earlier in the file (Jacobson). Programs which incorporate the 1977 paper's theories are known as the LZ77 or LZ1 family, which includes PKZip, Zip, LHarc, and ARJ.

Ziv and Lempel's 1978 paper outlined a dictionary technique in which a dictionary (separate from the source) is dynamically constructed from patterns in the source output (Sayood, p. 116). The dictionary varies in size and is transmitted or stored al ong with the compressed file. In 1984, Terry Welch incorporated LZ78 (or LZ2) theories into a data compression algorithm, known as LZW, which inspired most of the interest in Ziv and Lempel's work (Sayood, p. 107). Three well-known applications which in corporate LZW are Unix compress, V.42bis (a modem standard with built-in compression (Jacobson)) and GIF (Graphics Interchange Format) (Sayood, p. 113). GIF was the most popular image compression format for use on the World Wide Web, before Unisys (and p ossibly IBM) began to consider enforcing patents they hold on the LZW compression algorithms involved (Moncur, 1995). If Unisys does enforce those patents, other free domain image compression methodologies will probably emerge to take GIF's place in the WWW image world.

Run length (en)coding (RLE) reduces a series of repeated bits to two parts: one template bit and a code identifying how many times that bit is repeated in the run. RLE works very well for files that repeat the same bit frequently, such as some bitma pped images. If the source has long repeats of the same bit and/or lots of repeats, RLE achieves a high compression ratio (Prosise).

Quantization is a complicated form of lossy data compression which can require lots of computer processing. Very basically, the quantization algorithm represents a set of values in the source with a much smaller set (Sayood, p. 169). The information in the source is grouped into like blocks or vectors. A representative block or vector is substituted for the original (Sayood, p. 214). The Linde-Buzo-Gray (LBG) algorithm is used for most vector quantization (Sayood, p. 224). GIF does color quantiza tion as well as LZW (Jacobson). On my examples page, the photograph of Bonnie in GIF form as compared to JPEG (Joint Photographic Experts Group) form shows the results of color quantization.

Transform coding techniques divide the source data into parts. The individual parts are compressed using whatever method is most applicable (Sayood, p. 325). The lossy version of the public domain JPEG image standard is an example of transform codin g. During a complicated process, JPEG lossy images are transformed, quantized and Huffman coded in order to achieve a high rate of compression (Sayood, p. 343-345). The resulting image is in 24-bit color and, even though it has undergone lossy compressi on, appears to be of higher quality than the GIF'd lossless image of the same photograph. The JPEG standard takes advantage of the way the human mind perceives visual images, and loses data that humans won't miss. This process of losing the bits that hu mans don't need to see is called perceptual coding (Jacobson).

JPEG also has a lossless variant, which uses linear prediction for data compression. During linear prediction, the values of individual pixels are predicted based on the values of surrounding pixels. Rather than coding each individual pixel value, t he difference between the prediction and reality is coded (Sayood, p. 132-4). The lossless JPEG standard is not used very much because the compression achieved is much less than that achieved by lossy JPEG (Gailly, FAQ[19]).

Video compression can be thought of as repeated image compression over a period of time, since video is a series of images (Sayood, p. 381). Most video compression methods use the first frame to predict the second frame, the second frame to predict t he third frame, etc. The difference between the predicted frame and the actual frame is encoded. To view the video, the receiver or playback mechanism uses the prediction, the prediction scheme and the previously reconstructed frame to show the next fra me (Sayood, p. 382). Motion compensation, predicting the motion of objects during a series of frames, is also a factor in video compression (Sayood, p. 382), since more objects moving, or objects moving quickly, increases the amount of data to be compres sed. Understandably, video compression is very complicated and the end result (after compression) is still a lot of data (Sayood, p. 381).

The MPEG (Motion Pictures Experts Group) standard, simplistically, can be thought of as a series of stills (transformed and quantized for compression) which are each used to predict the next in the series. MPEG-1, a standard for digital storage media only, achieved playback at VHS quality for low or moderate levels of motion in the video sequence. High levels of motion degrade the quality of the video. MPEG-2 can be used for any application, achieve higher quality, and compresses full screen/full m otion video to 500KB per second. MPEG-2 is currently in the specifications for HDTV (Sayood, p. 403, 406).

For more accurate, technical and comprehensive information, see one or all of the documents in my reference and links page.



References and Useful Links


Gailly, J.-L. (No date). Comp.compression frequently asked questions. [Online]. Available HTTP: http://www.cis.ohio-state.edu/hypertext/faq/usenet/compression-f aq/top.html [1996, November 4].

Jacobson, G. (No date). Data compression [Online]. Available HTTP: http://www.cs.princeton.edu/courses/cs333/notes/complec.txt/ [1996, November 14].

Moncur, M. (1995, January 9). A summary of the GIF licensing situation [Online]. Available HTTP: http://www.xmission.com/~mgm/gif/ [1996, November 27].

Nelson, M. and J.-L. Gailly. (1996). The Data Compression Book. New York: M&T Books.

Note: I did not use The Data Compression Book for this presentation, because it was checked out while I was doing my research. However, I did read recommendations of it, so it is included for anyone using this reference list as a starting point for research.

Prosise, J. (1993, May 25). Understanding data compression. PC Magazine [Database], 12, 305(3). Available: Expanded Academic ASAP/RN: A13725816 [1996, November 14].

Robert, T. (1994, January). Data compression: packing it in. Compute [Database], 16, 52(1). Available: Expanded Academic ASAP/RN: A14731989 [1996, November 14].

Sayood, K. (1996). Introduction to Data Compression. San Francisco: Morgan Kaufmann Publishers, Inc.

To beginning of full text
Back to data compression presentation