scribe_node_4


Symbol-by-symbol Source Coding: JIANG Yunxiang

 

Notations: 

A source code Formula is a mapping from Formula to Formula, where Formula is

the source alphabet and Formula is the code alphabet. Formula is

called the codeword of x.

 The expected lengthFormula is denoted by

Formula

whereFormula is the length of the codeword for x.

 

 

 

 

 

 

 

1. Some codes definition

(a)  Non-singular codes (random codes): the codewords of two different symbols are different.

Every element in Formula maps into a different codeword string.

    i.e. Formula     Formula

 

 

(b) Uniquely decodable codes (uniform codes)

A code is uniquely decodable if its extension is non-singular. The extension Formula  of a source code C is denoted by

Formula

Also, we could conclude that  different sequences maps into different streams of codewords.

   i.e. Formula     Formula

     (If Formula, then Formula the codeword stream for Formula)

 

 

 

 

(c) Prefix-free or instantaneous codes: prefix-free or instantaneous codes are codes where no codeword is a prefix of any other codewords.

The code tree for uniform code can be seen as follows:

 

 

 

Uniform code and instantaneuos code tree

Uniform code and instant code belong to a class of source code called prefix-free code which has been introduced above.

 

For a uniform code, each character is assigned uniquely and you encode the sequence character-by-character. The decoder takes the same bits each time and outputs the corresponding character.

 

The code tree for uniform code can be seen as follows:

 

 

For instantaneous codes, the decoder keeps reading the bits until a bit 0 is obtained or three 1's are obtained. In the former case, it decides whether a character is A, C or G based on the number of zeros received. In the latter case, T is output. The instantaneous code tree could be seen as follows:

 

2. Kraft inequality

 

The important things about a code are the decodability and the codeword length. The following theorem connects the two.

Theorem (Kraft inequality)

For a prefix code over an alphabet of size D for m possible outputs, the codeword lengths Formula, must satisfy the Kraft inequality

 

Formula

 

Conversely, given lengths that satisfy the Kraft inequality, one can construct a prefix-free code.

 

Proof:

 

A source code can be related to a D-ary tree. A codeword of length Formula  correspond to a node at level Formula (the root is at level 0). If it is prefix, no codeword is a descendant of any other codewords in the tree.

Start with the code tree. One can grow from each leaf complete descendants up to the depth of the tree, the length of the longest codeword. By counting the disjoint leafs added at this level and compare to the maximum possible number, the inequality is proven.

To prove the converse, start with a full tree of the maximum length M. Repeatedly assign the leftmost node at level l to a codeword of length l and

remove its descendants, from Formula to Formula. At any stage, the existence of a node at level Formula is guaranteed by the Kraft inequality.

 

 

3. Uniquely Decodable Code and Non-Singular Code

 

(a) Give a code for the DNA sequences which is uniquely decodable but not a prefix code. Also, give a code for the DNA sequence which is non-singular but not uniquely decodable.

source singular  non-singular but not uniquely decodable uniquely decodable but not instantaneous instantaneous 
10 
010  00  10 
01  11  110 
10  110  111 

 

For the example of non-singular but not uniquely decodable codes , Formula, but Formula, which means that it is not uniquely decodable.

 

For the example of  uniquely decodable but not instantaneous codes, it is easy to prove that this codeword is uniquely decodable but G is the prefix of T.

for the example of instantaneous codes, this codeword is uniquely decodable and instantaneous.

 

 

 Let Formula be the number of bits assigned to the characters A, C, G and T respectively and let Formula. Consider the following equation.

Formula

How does the quantity Formula relate to the encoding of N characters? Hence, give an upper bound for Formula and prove that

Formula.

 

 

Proof:

 

Formula 

                          Formula    //Formulacan be any sequence consisting of N symbols; Formulais the length of Formula

                          Formula

Compare the two expressions, we find that if i is Formula, then A(i) is the number of source sequences mapping into length-i codeword stream.

The size of length-i codeword stream is Formula(binary case).

By the definition of uniquely decodable code, at most one source sequence is mapped into one codeword stream.

Therefore, no more than Formula source sequences can be mapped into length-i codeword, i.e. Formula

Formula

Take Nth root at both side, we get Formula.

 

 

4. Alternative Proof for Entropy Bound

   (a) Prove that for any x>0, Formula. What are the necessary and sufficient conditions for the equality? This inequality is called Fundamental Inequality.

Proof:

when FormulaFormula,         Formula => Formula. Let Formula, we could get that

Formula

Formula

Formula

Thus, we could get the maximum value of Formula when Formula. As a result, Formula => 

Formula.  Formula is  the necessary and sufficient conditions for the equality.

(b) According to Jensen's inequality

 Formula

then

Formula

 

 

5. Tightness of Entropy Bound

(a) Consider the binary encoding of a fair dice toss. Is it possible to achieve the entropy bound?

Formula

Formula is always an integer while Formula is usually not a rational number. They are not equal in general.

The easiest way to make logarithm an integer is to use ceiling function.

 

(b) 

Let X be a random variable with the outcomes {1, 2, ..., m} and Formula be the probability of outcome i. Also, let Formula be the number of bits to encode outcome i. Prove that there exists a prefix code such that Formula. This code is called Shannon Code. Prove that

Formula

Proof:

          firstly, Formula     

          then Formula           Proof over.

(C)   Instead of encoding symbol-by-symbol, we encode n symbols at a time, i.e. we introduce the Shannon code for all possible sequence of n symbols. Let L be the expected number of bits to encode n symbols and Formula be the random variable for a sequence of n symbols. 

 

The average length of n symbols Formula, the average length of each symbol Formula

 

 

6. Huffman code

Given an algorithm that achieve the requirement, we suggest Huffman code which could be described as follows:

Algorithm:

Step1: calculate the probability of all N probable sequences and then sort them descendant.

Step2: Plus the two lowest probability event (sequence) N and N-1 to event N+1. Then record the higher one as 1 and lower one as 0.

Step3: Insert event N+1 to the sorted previous N-2 probable event according to the descendant probability.

Step3: Step into upper branch. If there is only one event in current branch, define it as root, go to Step4. Otherwise, go to Step 2.

Step4: Read {0,1} sequence from root to leaves of the tree as the corresponding bit-sequence. 

 

   For example: if p=0.9, 1-p=0.2, we want to get length-3 sequence of coin tossed. All the possible event could be written as follows: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Since the probability of these events are decided by the probability of Heads and Tails. So we just need to calculate the number of H and T in each events to sort them. TTT have three T so it must be the lowest one. And HTT, THT, TTH have the lower one, so could be the lower one than TTT. But each of them could be randomly sorted between each other. Then HHT, HTH, THH is lower than before events. The highest is HHH. After sorting them, we execute step2-step4 accordingly. As result, HHH, HHT, HTH, HTT, THH, THT, TTH, TTT could be encoded sequentially as 1, 011, 010, 00011, 001, 00010, 00001, 00000. And the average length of this example = 1*0.512+3*0.128+3*0.128+5*0.032+3*0.128+5*0.032+5*0.032+5*0.008= 2.304 Formula 

   Actually, in this unbalance tree algorithm,  the number bits assigned a unique length-N bit-sequence to each possible outcome decided by the probability p according to Shannon’s source coding theorem 

  Time-complexity: In this algorithm, it has two parts: the one is sorting algorithm and the other is encoding algorithm.  For normal sorting algorithm like merge sorting algorithm, time complexity is O(N* N). (N represents the possible number of outcomes). But for this condition, just based on the number of Heads and Tails in each sequence, sorting algorithm could be according to the number of Heads in each sequence from highest to lowest. So time complexity could be O(N). And the time complexity of encoding algorithm is O((1+1)*(N-1)) = O(2*(N-1)). (the first 1 represents plus every two event probability, the second 1represents insert step as Step3 described, and (N-1) represents the depth of calculated tree branch). 

So it is could be concluded that time complexity of this algorithm is linear.