• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • Dokkio Sidebar (from the makers of PBworks) is a Chrome extension that eliminates the need for endless browser tabs. You can search all your online stuff without any extra effort. And Sidebar was #1 on Product Hunt! Check out what people are saying by clicking here.


Problem Set 4

Page history last edited by sidjaggi 10 years, 10 months ago

Zero-Error Data Compression

Q1. (DNA Encoding) You are asked to develop a system to store the DNA sequences of some viruses for a medical company. Each sequence consists of n, which is a large number, characters. There are only four types of characters, namely, A,C,G and T, in these sequences. The company wants to reduce the storage cost for the DNA sequences. Your friends suggest the following schemes for you to choose:

Scheme 1 (Random Code): To form the codebook, you randomly (under the uniform distribution) and independently assign Formula bits to each possible DNA sequence. To reconstruct the DNA sequence from a given bit sequence, you search for a unique and typical DNA sequence which matches the given bit sequence. If so, the corresponding DNA sequence is outputted. Otherwise, simply pops up an error message.

Scheme 2 (Uniform Code): Each character is assigned 2 bits uniquely and you encode the sequence character-by-character. The decoder takes 2 bits each time and outputts the corresponding character.

Scheme 3 (Instant Code): The encoding process is the same as the uniform code except that you adopt the following codebook:

 Bit Sequence
A 1
C 01
T 000

The decoder keeps reading the bits until a bit 1 is obtained or three 0's are obtained. In the former case, it decides whether a character is A, C or G based on the number of zeros received. In the latter case, T is outputted.

(a) Why does the company prefer the last two codes to the random code?

(b) Construct the code trees for the uniform code and instant code.

(c) Consider a code in which we can construct a code tree. Let Formula be the number of bits assigned to the characters A, C, G and T respectively and let Formula. Extend the tree to level Formula. How many descendents of each codeword are there in level Formula? Based on this observation, prove


This is called the Kraft Inequality.

(d) Uniform code and instant code belong to a class of source code called prefix code (a.k.a. prefix-free code). In prefix code, none of the codeword is the prefix of another codeword. Why do the codeword lengths of a prefix code satisfy the Kraft Inequality? Conversely, if the codeword lengths satisfy the Kraft Inequality, can they be the codeword lengths of a prefix code?

(e) Generalize the results in (c) and (d) to D-ary encoding (i.e. we use D-ary symbols instead of bits to represent the characters).


Key Concepts: Zero-error compression, code tree, Kraft Inequality, prefix code


Q2 (Uniquely Decodable Code and Non-Singular Code) In a uniquely decodable code, any two distinct sequences of characters are encoded into two distinct bit sequences. A code is said to be non-singular if two characters are mapped into two distinct bit sequences.

(a) Give a code for the DNA sequences which is uniquely decodable but not a prefix code. Also, give a code for the DNA sequence which is non-singular but not uniquely decodable.


(b) Let Formula be the number of bits assigned to the characters A, C, G and T respectively and let Formula. Consider the following equation.


How does the quantity Formula relate to the encoding of N characters? Hence, give an upper bound for Formula and prove that


(c) Prove that the codeword lengths of a uniquely decodable code satisfy the Kraft Inequality.

(d) Generalize the results to D-ary encoding.


Key Concepts: Uniquely decodable code, non-singular code, Kraft inequality for uniquely decodable code


Q3 (Alternative Proof for Entropy Bound)

(a) Prove that for any x>0, Formula. What are the necessary and sufficient conditions for the equality? This inequality is called Fundamental Inequality.

(b) Let Formula be the probabilities of the characters A, C, G and T respectively. By using the results in (a) and Kraft Inequality, prove that


What are the necessary and sufficient conditions for the equality?

(c) Let L be the expected number of bits of a character and X be a random variable with the probability distribution Formula. Use the results of (b) to prove that Formula. This lower bound is known as the entropy bound. What are the necessary and sufficient conditions for the equality? What does it imply to the source coding?

(d) Consider a random variable Y which has m possible outcomes. Extend the above results to encode the outcomes of Y. Use the results of (c) to prove that Formula.


Key Concepts: Fundamental Inequality, entropy bound


Q4 (Tightness of Entropy Bound)

(a) Consider the binary encoding of a fair dice toss. Is it possible to achieve the entropy bound?

(b) Let X be a random variable with the outcomes {1, 2, ..., m} and Formula be the probability of outcome i. Also, let Formula be the number of bits to encode outcome i. Prove that there exists a prefix code such that Formula. This code is called Shannon Code. Prove that


(c) Instead of encoding symbol-by-symbol, we encode n symbols at a time, i.e. we introduce the Shannon code for all possible sequence of n symbols. Let L be the expected number of bits to encode n symbols and Formula be the random variable for a sequence of n symbols. Prove that Formula.


Key Concepts: Shannon Code, tightness of entropy bound, asymptotic analysis for block encoding


Q5 (Huffmann Code)

(a) Consider the encoding of 3 symbols with probability mass function {0.6, 0.3, 0.1}. Construct a Shannon code for these 3 symbols. Can you improve the coding scheme? Guess an algorithm to construct prefix code with the smallest expected number of bits. Below are the hints if needed:

i) What is the relationship between the number of bits and the probability for a particular symbol? Prove your answer formally.

ii) Consider the code tree of a prefix code. There always exists an optimal code (optimal in the sense of number of bits) such that the two longest codewords have the same number of bits and they only differ in the last bit. Why?

iii) The prefix of the two longest codewords and other codewords form another prefix code.

(b) Prove that your algorithm in (a) indeed gives the smallest expected number of bits among all prefix codes. Is it also the best one among all the uniquely decodable codes?

(c) Give an upper bound between the expected number of bits by your algorithm and the entropy bound. How can you reduce the gap? What is the disadvantage of doing so?

(d) Suppose your algorithm is used to encode a symbol with N possibilities. What is the time complexity of your algorithm?

(e) In the previous parts, you have shown that your algorithm requires the smallest expected number of bits and it has polynomial-time complexity. However, this is not adopted in those popular compression software. For instance, LZMA is adopted in winrar and DEFLATE (a combination of LZ77 and Huffmann code) is adopted in gzip. Why is it the case?


Kep Concepts: Huffmann code and its optimality, dynamic programming, greedy algorithm, computational complexity of Huffmann code, limitations of implementation of Huffmann code


A profile of David Huffman by Scientific American

Comments (0)

You don't have permission to comment on this page.