Zero-Error Data Compression
Q1. (DNA Encoding) You are asked to develop a system to store the DNA sequences of some viruses for a medical company. Each sequence consists of n characters, where n is a large number. There are only four types of characters, namely A, C, G and T, in these sequences. The company wants to reduce the storage cost of the DNA sequences. Your friends suggest the following schemes for you to choose from:
Scheme 1 (Random Code): To form the codebook, you randomly (under the uniform distribution) and independently assign a bit sequence to each possible DNA sequence. To reconstruct the DNA sequence from a given bit sequence, you search for a unique and typical DNA sequence which matches the given bit sequence. If one exists, the corresponding DNA sequence is output. Otherwise, an error message is reported.
Scheme 2 (Uniform Code): Each character is uniquely assigned 2 bits and you encode the sequence character by character. The decoder takes 2 bits at a time and outputs the corresponding character.
Scheme 3 (Instant Code): The encoding process is the same as the uniform code except that you adopt the following codebook:
Character    Bit Sequence
A            1
C            01
G            001
T            000
The decoder keeps reading bits until a 1 is obtained or three 0's are obtained. In the former case, it decides whether the character is A, C or G based on the number of 0's received. In the latter case, T is output.
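A minimal Python sketch of this decoding rule follows (the dictionary and function names are illustrative, not part of the question):

# Sketch of the instant-code decoder described above (illustrative only).
CODEBOOK = {"1": "A", "01": "C", "001": "G", "000": "T"}

def decode_instant(bits):
    """Decode by reading bits until a '1' or three '0's complete a codeword."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if b == "1" or len(buf) == 3:  # a codeword is complete
            out.append(CODEBOOK[buf])
            buf = ""
    if buf:
        raise ValueError("input ends with a truncated codeword")
    return "".join(out)

print(decode_instant("101001000"))  # -> ACGT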
(a) Why does the company prefer the last two codes to the random code?
(b) Construct the code trees for the uniform code and instant code.
(c) Consider a code for which we can construct a code tree. Let $\ell_1, \ell_2, \ell_3, \ell_4$ be the numbers of bits assigned to the characters A, C, G and T respectively and let $\ell_{\max} = \max_i \ell_i$. Extend the tree to level $\ell_{\max}$. How many descendants of each codeword are there in level $\ell_{\max}$? Based on this observation, prove
$$\sum_{i=1}^{4} 2^{-\ell_i} \le 1.$$
This is called the Kraft Inequality.
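As a quick numerical sanity check (a sketch, not part of the question), the Kraft sums of the uniform code and the instant code can be computed directly:

# Kraft sums for the two codebooks above (illustrative check).
uniform_lengths = [2, 2, 2, 2]  # A, C, G, T each use 2 bits
instant_lengths = [1, 2, 3, 3]  # lengths of the codewords 1, 01, 001, 000

for name, lengths in [("uniform", uniform_lengths), ("instant", instant_lengths)]:
    kraft_sum = sum(2 ** -l for l in lengths)
    print(name, kraft_sum)  # both print 1.0: the inequality holds with equality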
(d) The uniform code and the instant code belong to a class of source codes called prefix codes (a.k.a. prefix-free codes). In a prefix code, no codeword is a prefix of another codeword. Why do the codeword lengths of a prefix code satisfy the Kraft Inequality? Conversely, if a set of codeword lengths satisfies the Kraft Inequality, can they be the codeword lengths of a prefix code?
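For the converse direction, one standard construction assigns codewords greedily in order of non-decreasing length (a sketch of the so-called canonical code; this is an illustration, not necessarily the intended proof):

# Build a prefix code from lengths satisfying the Kraft Inequality (sketch).
def canonical_code(lengths):
    """Assign codewords greedily in order of non-decreasing length."""
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft Inequality violated"
    codewords, next_val, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_val <<= l - prev_len  # descend to level l of the code tree
        codewords.append(format(next_val, "0{}b".format(l)))
        next_val += 1              # skip the subtree rooted at this codeword
        prev_len = l
    return codewords

print(canonical_code([1, 2, 3, 3]))  # -> ['0', '10', '110', '111']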
(e) Generalize the results in (c) and (d) to D-ary encoding (i.e. we use D-ary symbols instead of bits to represent the characters).
Key Concepts: Zero-error compression, code tree, Kraft Inequality, prefix code
Q2 (Uniquely Decodable Code and Non-Singular Code) In a uniquely decodable code, any two distinct sequences of characters are encoded into two distinct bit sequences. A code is said to be non-singular if any two distinct characters are mapped into two distinct bit sequences.
(a) Give a code for the DNA sequences which is uniquely decodable but not a prefix code. Also, give a code for the DNA sequences which is non-singular but not uniquely decodable.
(b) Let $\ell_1, \ell_2, \ell_3, \ell_4$ be the numbers of bits assigned to the characters A, C, G and T respectively and let $\ell_{\max} = \max_i \ell_i$. Consider the following equation:
$$\left( \sum_{i=1}^{4} 2^{-\ell_i} \right)^N = \sum_{i_1=1}^{4} \cdots \sum_{i_N=1}^{4} 2^{-(\ell_{i_1} + \cdots + \ell_{i_N})}.$$
How does the quantity $\ell_{i_1} + \cdots + \ell_{i_N}$ relate to the encoding of N characters? Hence, give an upper bound for $\left( \sum_{i=1}^{4} 2^{-\ell_i} \right)^N$ and prove that
$$\sum_{i=1}^{4} 2^{-\ell_i} \le 1.$$
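The identity itself can be checked numerically (an illustrative sketch with the instant-code lengths):

# Numerical check of the expansion identity above (illustrative).
from itertools import product

lengths = [1, 2, 3, 3]  # instant-code lengths for A, C, G, T
N = 3
lhs = sum(2 ** -l for l in lengths) ** N
rhs = sum(2 ** -sum(seq) for seq in product(lengths, repeat=N))
print(lhs, rhs)  # equal: the right side runs over all 4**N length-N sequences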
(c) Prove that the codeword lengths of a uniquely decodable code satisfy the Kraft Inequality.
(d) Generalize the results to D-ary encoding.
Key Concepts: Uniquely decodable code, non-singular code, Kraft Inequality for uniquely decodable codes
Q3 (Alternative Proof for Entropy Bound)
(a) Prove that for any $x > 0$, $\ln x \le x - 1$. What are the necessary and sufficient conditions for equality? This inequality is called the Fundamental Inequality.
(b) Let $p_1, p_2, p_3, p_4$ be the probabilities of the characters A, C, G and T respectively. By using the result in (a) and the Kraft Inequality, prove that
$$\sum_{i=1}^{4} p_i \ell_i \ge -\sum_{i=1}^{4} p_i \log_2 p_i.$$
What are the necessary and sufficient conditions for equality?
(c) Let L be the expected number of bits per character and X be a random variable with the probability distribution $\{p_1, p_2, p_3, p_4\}$. Use the result of (b) to prove that $L \ge H(X)$. This lower bound is known as the entropy bound. What are the necessary and sufficient conditions for equality? What does it imply for source coding?
(d) Consider a random variable Y which has m possible outcomes. Extend the above results to the encoding of the outcomes of Y. Use the result of (c) to prove that $L \ge H(Y)$.
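A small numerical illustration of the entropy bound (the probabilities below are hypothetical):

# Expected length of the instant code versus the entropy (illustrative sketch).
from math import log2

p = [0.5, 0.25, 0.125, 0.125]  # hypothetical probabilities for A, C, G, T
lengths = [1, 2, 3, 3]         # instant-code lengths

L = sum(pi * li for pi, li in zip(p, lengths))
H = -sum(pi * log2(pi) for pi in p)
print(L, H)  # both are 1.75: equality holds because each p_i is a power of 1/2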
Key Concepts: Fundamental Inequality, entropy bound
Q4 (Tightness of Entropy Bound)
(a) Consider the binary encoding of a fair die toss. Is it possible to achieve the entropy bound?
(b) Let X be a random variable with the outcomes {1, 2, ..., m} and let $p_i$ be the probability of outcome i. Also, let $\ell_i$ be the number of bits used to encode outcome i. Prove that there exists a prefix code such that $\ell_i = \lceil \log_2 (1/p_i) \rceil$. This code is called the Shannon Code. Prove that
$$H(X) \le L < H(X) + 1.$$
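A minimal sketch of the Shannon code lengths and the bound in (b), using the hypothetical distribution {0.6, 0.3, 0.1} that reappears in Q5:

# Shannon code lengths l_i = ceil(log2(1/p_i)) (illustrative sketch).
from math import ceil, log2

p = [0.6, 0.3, 0.1]                         # hypothetical distribution
lengths = [ceil(log2(1 / pi)) for pi in p]  # -> [1, 2, 4]
kraft = sum(2 ** -l for l in lengths)       # 0.8125 <= 1, so a prefix code exists
L = sum(pi * li for pi, li in zip(p, lengths))
H = -sum(pi * log2(pi) for pi in p)
print(lengths, kraft, L, H)                 # H <= L < H + 1 holds (1.30 <= 1.6 < 2.30)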
(c) Instead of encoding symbol by symbol, we encode n symbols at a time, i.e. we construct the Shannon code for all possible sequences of n symbols. Let L be the expected number of bits to encode n symbols and $X^n = (X_1, \ldots, X_n)$ be the random variable for a sequence of n symbols. Prove that
$$\frac{H(X^n)}{n} \le \frac{L}{n} < \frac{H(X^n)}{n} + \frac{1}{n}.$$
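For intuition, take i.i.d. tosses of a fair die (a hypothetical source): every length-n sequence has probability $6^{-n}$, so the block Shannon code uses $\lceil n \log_2 6 \rceil$ bits and the per-symbol rate tends to the entropy:

# Per-symbol rate of the block Shannon code for i.i.d. fair-die tosses (sketch).
from math import ceil, log2

H = log2(6)                 # entropy of one toss, about 2.585 bits
for n in (1, 2, 5, 10, 100):
    rate = ceil(n * H) / n  # every length-n sequence has probability 6**-n
    print(n, rate)          # 3.0, 3.0, 2.6, 2.6, 2.59: the rate approaches H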
Key Concepts: Shannon Code, tightness of entropy bound, asymptotic analysis for block encoding
Q5 (Huffman Code)
(a) Consider the encoding of 3 symbols with probability mass function {0.6, 0.3, 0.1}. Construct a Shannon code for these 3 symbols. Can you improve the coding scheme? Guess an algorithm to construct a prefix code with the smallest expected number of bits. Below are hints if needed:
i) What is the relationship between the number of bits and the probability for a particular symbol? Prove your answer formally.
ii) Consider the code tree of a prefix code. There always exists an optimal code (optimal in the sense of expected number of bits) such that the two longest codewords have the same number of bits and they differ only in the last bit. Why?
iii) The common prefix of the two longest codewords, together with the other codewords, forms another prefix code.
(b) Prove that your algorithm in (a) indeed gives the smallest expected number of bits among all prefix codes. Is it also the best one among all the uniquely decodable codes?
(c) Give an upper bound on the gap between the expected number of bits of your algorithm and the entropy bound. How can you reduce the gap? What is the disadvantage of doing so?
(d) Suppose your algorithm is used to encode a symbol with N possibilities. What is the time complexity of your algorithm?
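For reference, here is a minimal heap-based sketch of the greedy merging algorithm (assuming your guess in (a) was Huffman's procedure); the N - 1 heap operations plus one tree traversal give O(N log N) time:

# Huffman coding sketch: repeatedly merge the two least probable subtrees.
import heapq
from itertools import count

def huffman_lengths(probs):
    """Return the codeword length of each symbol in O(N log N) time."""
    tie = count()  # tie-breaker so heap entries never compare the trees
    heap = [(p, next(tie), i) for i, p in enumerate(probs)]  # leaves are indices
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))  # internal node
    lengths = [0] * len(probs)
    def walk(tree, depth):  # a symbol's length is its depth in the final tree
        if isinstance(tree, tuple):
            walk(tree[0], depth + 1)
            walk(tree[1], depth + 1)
        else:
            lengths[tree] = depth
    walk(heap[0][2], 0)
    return lengths

print(huffman_lengths([0.6, 0.3, 0.1]))  # -> [1, 2, 2], expected length 1.4 bits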
(e) In the previous parts, you have shown that your algorithm requires the smallest expected number of bits and has polynomial-time complexity. However, it is not adopted on its own in popular compression software. For instance, LZMA is adopted in 7-Zip and DEFLATE (a combination of LZ77 and Huffman coding) is adopted in gzip. Why is this the case?
Key Concepts: Huffman code and its optimality, dynamic programming, greedy algorithm, computational complexity of Huffman code, limitations of implementation of Huffman code
A profile of David Huffman by Scientific American