**Symbol-by-symbol Source Coding: JIANG Yunxiang**

Notations:

A source code $C$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, where $\mathcal{X}$ is the source alphabet and $\mathcal{D}$ is the code alphabet. $C(x)$ is called the codeword of $x$.

The expected length is denoted by

$$L(C) = \sum_{x \in \mathcal{X}} p(x)\,\ell(x),$$

where $\ell(x)$ is the length of the codeword for $x$.
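As a small illustration, the expected length can be computed directly (the probabilities and the code here are assumed for the example):

```python
# Expected codeword length L(C) = sum_x p(x) * l(x) for a toy DNA code.
# The probabilities below are illustrative assumptions, not from the notes.
p = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
code = {"A": "0", "C": "10", "G": "110", "T": "111"}

L = sum(p[x] * len(code[x]) for x in p)
print(L)  # 1.75 bits per symbol
```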

**1. Some code definitions**

(a) Non-singular codes: the codewords of two different symbols are different, i.e. every element of $\mathcal{X}$ maps into a different string in $\mathcal{D}^*$:

$$x \neq x' \Rightarrow C(x) \neq C(x').$$

(b) Uniquely decodable codes

A code is uniquely decodable if its extension is non-singular. The extension $C^*$ of a source code $C$ is defined by

$$C^*(x_1 x_2 \cdots x_n) = C(x_1)C(x_2)\cdots C(x_n).$$

Equivalently, different source sequences map into different streams of codewords:

if $x_1 \cdots x_n \neq x_1' \cdots x_m'$, then the codeword stream $C(x_1)\cdots C(x_n) \neq C(x_1')\cdots C(x_m')$.

(c) Prefix-free or instantaneous codes: codes in which no codeword is a prefix of any other codeword.
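The prefix-free property is easy to check mechanically. A minimal sketch (after sorting, any prefix relation must hold between lexicographic neighbours):

```python
# Check whether a code is prefix-free: no codeword is a prefix of another.
def is_prefix_free(codewords):
    words = sorted(codewords)
    # After sorting, a prefix can only immediately precede its extension.
    return all(not words[i + 1].startswith(words[i])
               for i in range(len(words) - 1))

print(is_prefix_free(["0", "10", "110", "111"]))  # True
print(is_prefix_free(["10", "00", "11", "110"]))  # False: 11 is a prefix of 110
```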

The uniform code and the instantaneous code both belong to the class of prefix-free codes introduced above. Their code trees can be seen as follows:

*(Figure: uniform code and instantaneous code trees)*

For a uniform code, each character is assigned a codeword of the same fixed length, and a sequence is encoded character by character. The decoder reads the same number of bits each time and outputs the corresponding character.

For the instantaneous code, the decoder keeps reading bits until a 0 is obtained or three 1's are obtained. In the former case, it outputs A, C or G according to the number of 1's received before the 0; in the latter case, it outputs T.
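This decoding rule can be sketched directly for the instantaneous DNA code from the table below (A=0, C=10, G=110, T=111); trailing incomplete codewords are silently ignored in this minimal sketch:

```python
# Decoder for the instantaneous code A=0, C=10, G=110, T=111:
# keep reading bits until a 0 arrives or three 1's have been seen.
def decode(bits):
    out, ones = [], 0
    for b in bits:
        if b == "1":
            ones += 1
            if ones == 3:      # three 1's: output T immediately
                out.append("T")
                ones = 0
        else:                  # a 0 terminates the current codeword
            out.append("ACG"[ones])  # number of 1's so far picks A, C or G
            ones = 0
    return "".join(out)

print(decode("010110111"))  # ACGT
```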

**2. Kraft inequality**

The important properties of a code are its decodability and its codeword lengths. The following theorem connects the two.

**Theorem (Kraft inequality)**

For a prefix-free code over an alphabet of size $D$ with $m$ possible outcomes, the codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$ must satisfy the Kraft inequality

$$\sum_{i=1}^{m} D^{-\ell_i} \le 1.$$

Conversely, given lengths $\ell_1, \ldots, \ell_m$ that satisfy the Kraft inequality, one can construct a prefix-free code with these codeword lengths.


**Proof:**

A source code can be related to a $D$-ary tree. A codeword of length $\ell_i$ corresponds to a node at level $\ell_i$ (the root is at level 0). If the code is prefix-free, no codeword is a descendant of any other codeword in the tree.

Start with the code tree and grow from each codeword node its complete set of descendants down to level $\ell_{\max}$, the length of the longest codeword. The codeword of length $\ell_i$ contributes $D^{\ell_{\max}-\ell_i}$ leaves at this level, and these sets of leaves are disjoint. Comparing their total count with the maximum possible number of nodes at level $\ell_{\max}$ gives $\sum_i D^{\ell_{\max}-\ell_i} \le D^{\ell_{\max}}$, which proves the inequality.

To prove the converse, order the lengths so that $\ell_1 \le \ell_2 \le \cdots \le \ell_m$ and start with the full $D$-ary tree of depth $\ell_m$. For $i$ from 1 to $m$, repeatedly assign the leftmost free node at level $\ell_i$ to a codeword of length $\ell_i$ and remove its descendants. At each stage, the existence of a free node at level $\ell_i$ is guaranteed by the Kraft inequality.
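The converse construction can be sketched as follows (a minimal binary-case sketch; codewords are returned in order of increasing length):

```python
# Construct a prefix-free binary code from lengths satisfying Kraft:
# for each length in increasing order, take the leftmost unused node
# at that level of the tree and skip its entire subtree.
def kraft_code(lengths):
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft inequality violated"
    code, next_val, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_val <<= (l - prev_len)            # descend to level l
        code.append(format(next_val, f"0{l}b"))  # leftmost free node at level l
        next_val += 1                          # skip this codeword's subtree
        prev_len = l
    return code

print(kraft_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```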

**3. Uniquely Decodable Code and Non-Singular Code**

(a) Give a code for the DNA sequences which is uniquely decodable but not a prefix code. Also, give a code for the DNA sequence which is non-singular but not uniquely decodable.

| source | singular | non-singular but not uniquely decodable | uniquely decodable but not instantaneous | instantaneous |
|--------|----------|-----------------------------------------|------------------------------------------|---------------|
| A      | 0        | 0                                       | 10                                       | 0             |
| C      | 0        | 010                                     | 00                                       | 10            |
| G      | 0        | 01                                      | 11                                       | 110           |
| T      | 0        | 10                                      | 110                                      | 111           |

For the example of a non-singular but not uniquely decodable code, $C(C) = 010$ and $C(A)C(T) = 0 \cdot 10 = 010$, but $C \neq AT$. The stream 010 has more than one valid parsing (indeed $C(G)C(A) = 010$ as well), which means the code is not uniquely decodable.

For the example of a uniquely decodable but not instantaneous code, one can verify (e.g. by the Sardinas-Patterson test) that the code is uniquely decodable, but $C(G) = 11$ is a prefix of $C(T) = 110$, so it is not instantaneous.

For the example of an instantaneous code, no codeword is a prefix of any other, so the code is both uniquely decodable and instantaneous.
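The ambiguity of the non-singular code can be exhibited directly:

```python
# The non-singular code A=0, C=010, G=01, T=10 is not uniquely decodable:
# the stream "010" is C(C), but also C(A)+C(T) and C(G)+C(A).
code = {"A": "0", "C": "010", "G": "01", "T": "10"}

print(code["C"])              # 010
print(code["A"] + code["T"])  # 010 -- same stream, different source sequence
print(code["G"] + code["A"])  # 010 -- a third valid parsing
```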

Let $\ell_A, \ell_C, \ell_G, \ell_T$ be the number of bits assigned to the characters A, C, G and T respectively, and let $K = 2^{-\ell_A} + 2^{-\ell_C} + 2^{-\ell_G} + 2^{-\ell_T}$. Consider the following quantity:

$$K^N = \left(\sum_{x} 2^{-\ell_x}\right)^{N} = \sum_{x_1, \ldots, x_N} 2^{-(\ell_{x_1} + \cdots + \ell_{x_N})}.$$

How does the quantity $K^N$ relate to the encoding of $N$ characters? Hence, give an upper bound for $K^N$ and prove that $K \le 1$.

**Proof:**

Write

$$K^N = \sum_{\boldsymbol{x}} 2^{-\ell(\boldsymbol{x})} = \sum_{i=1}^{N\ell_{\max}} A(i)\,2^{-i},$$

where $\boldsymbol{x}$ ranges over all sequences of $N$ symbols and $\ell(\boldsymbol{x})$ is the length of the codeword stream for $\boldsymbol{x}$. Comparing the two expressions, $A(i)$ is the number of source sequences mapped into a length-$i$ codeword stream.

There are $2^i$ codeword streams of length $i$ (binary case). By the definition of a uniquely decodable code, at most one source sequence is mapped into each codeword stream. Therefore, no more than $2^i$ source sequences can be mapped into length-$i$ codeword streams, i.e. $A(i) \le 2^i$, and hence

$$K^N \le \sum_{i=1}^{N\ell_{\max}} 2^i \cdot 2^{-i} = N\ell_{\max}.$$

Taking the $N$-th root on both sides, we get $K \le (N\ell_{\max})^{1/N} \to 1$ as $N \to \infty$, hence $K \le 1$.
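As a quick numerical sanity check of the bound, using the uniquely decodable code from the table (lengths 2, 2, 2, 3):

```python
# Numerical check of K**N <= N * l_max for the uniquely decodable
# code A=10, C=00, G=11, T=110 (lengths 2, 2, 2, 3).
lengths = [2, 2, 2, 3]
K = sum(2 ** -l for l in lengths)
l_max = max(lengths)

for N in (1, 10, 100):
    assert K ** N <= N * l_max  # the upper bound from the proof

print(K)  # 0.875, consistent with K <= 1 for uniquely decodable codes
```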

**4. Alternative Proof for Entropy Bound**

(a) Prove that for any $x > 0$, $\ln x \le x - 1$. What is the necessary and sufficient condition for equality? This inequality is called the *Fundamental Inequality*.

**Proof:**

Let $f(x) = \ln x - (x - 1)$. Then $f'(x) = \frac{1}{x} - 1$, so $f'(x) > 0$ when $0 < x < 1$ and $f'(x) < 0$ when $x > 1$.

Thus $f$ attains its maximum value at $x = 1$, where $f(1) = 0$. As a result, $f(x) \le 0$ for all $x > 0$, i.e.

$$\ln x \le x - 1,$$

and $x = 1$ is the necessary and sufficient condition for equality.
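A quick spot-check of the inequality on a small grid of values:

```python
# Spot-check the fundamental inequality ln(x) <= x - 1 for x > 0,
# with equality at x = 1.
import math

for x in (0.1, 0.5, 1.0, 2.0, 10.0):
    assert math.log(x) <= x - 1 + 1e-12

print(math.log(1.0), 1.0 - 1)  # 0.0 0.0 -- equality at x = 1
```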

(b) For a uniquely decodable code with codeword lengths $\ell_i$, by Jensen's inequality (concavity of the logarithm),

$$\sum_i p_i \log_2 \frac{2^{-\ell_i}}{p_i} \le \log_2 \sum_i p_i \cdot \frac{2^{-\ell_i}}{p_i} = \log_2 \sum_i 2^{-\ell_i} \le \log_2 1 = 0,$$

where the last step uses the Kraft inequality. Then

$$L - H(X) = \sum_i p_i \ell_i - \sum_i p_i \log_2 \frac{1}{p_i} = -\sum_i p_i \log_2 \frac{2^{-\ell_i}}{p_i} \ge 0,$$

so $L \ge H(X)$.

**5. Tightness of Entropy Bound**

(a) Consider the binary encoding of a fair dice toss. Is it possible to achieve the entropy bound?

No. The codeword length $\ell_i$ is always an integer, while $\log_2 \frac{1}{p_i} = \log_2 6$ is not even a rational number. They are not equal in general, so the entropy bound cannot be achieved exactly.

The easiest way to make the logarithm an integer is to apply the ceiling function: $\ell_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil$.

(b)

Let $X$ be a random variable with outcomes $\{1, 2, \ldots, m\}$ and let $p_i$ be the probability of outcome $i$. Also, let $\ell_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil$ be the number of bits used to encode outcome $i$. Prove that there exists a prefix-free code with these lengths. This code is called the *Shannon code*. Prove that $H(X) \le L < H(X) + 1$.

**Proof:**

Firstly,

$$\sum_i 2^{-\ell_i} = \sum_i 2^{-\lceil \log_2 (1/p_i) \rceil} \le \sum_i 2^{-\log_2 (1/p_i)} = \sum_i p_i = 1,$$

so the lengths satisfy the Kraft inequality and a prefix-free code exists. Then

$$L = \sum_i p_i \left\lceil \log_2 \frac{1}{p_i} \right\rceil < \sum_i p_i \left( \log_2 \frac{1}{p_i} + 1 \right) = H(X) + 1,$$

and $L \ge H(X)$ by the entropy bound. This completes the proof.
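A numerical illustration of the Shannon code lengths and both bounds, for an assumed example distribution:

```python
# Shannon code lengths l_i = ceil(log2(1/p_i)) for an assumed distribution;
# verify that Kraft holds (a prefix code exists) and H(X) <= L < H(X) + 1.
import math

p = [0.4, 0.3, 0.2, 0.1]
lengths = [math.ceil(math.log2(1 / pi)) for pi in p]
H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

assert sum(2 ** -l for l in lengths) <= 1  # Kraft inequality
assert H <= L < H + 1                      # Shannon code bound
print(lengths)  # [2, 2, 3, 4]
```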

(c) Instead of encoding symbol by symbol, we encode $n$ symbols at a time, i.e. we construct the Shannon code for all possible sequences of $n$ symbols. Let $L$ be the expected number of bits to encode $n$ symbols and let $X^n = (X_1, \ldots, X_n)$ be the random variable for a sequence of $n$ symbols.

For the block of $n$ symbols, $H(X^n) \le L < H(X^n) + 1$. If the symbols are i.i.d., $H(X^n) = nH(X)$, so the average length per symbol satisfies

$$H(X) \le \frac{L}{n} < H(X) + \frac{1}{n},$$

which approaches the entropy bound as $n \to \infty$.
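For the fair die of part (a), this per-symbol rate can be tabulated directly:

```python
# Per-symbol rate of Shannon-coding blocks of n fair-die rolls: all 6**n
# sequences are equally likely, so each codeword needs ceil(n * log2(6))
# bits, and L/n approaches H = log2(6) ~ 2.585 as n grows.
import math

H = math.log2(6)
rates = {n: math.ceil(n * H) / n for n in (1, 5, 100)}
print(rates)  # {1: 3.0, 5: 2.6, 100: 2.59}
```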

**6. Huffman code**

To meet this requirement algorithmically, we suggest the Huffman code, which can be described as follows:

**Algorithm:**

Step 1: Calculate the probabilities of all $N$ possible sequences and sort them in descending order.

Step 2: Add the probabilities of the two least likely events (sequences), events $N$ and $N-1$, to form a new event $N+1$. Label the branch to the more likely one with 1 and to the less likely one with 0.

Step 3: Insert event $N+1$ into the sorted list of the remaining $N-2$ events, keeping descending order of probability.

Step 4: Move up to the next level of the tree. If only one event remains, define it as the root and go to Step 5; otherwise, go to Step 2.

Step 5: Read off the {0,1} labels on the path from the root to each leaf as the corresponding codeword.

For example, if $p = 0.8$ and $1-p = 0.2$, and we want to encode length-3 sequences of coin tosses, the possible events are HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Since the probability of each event is determined only by its numbers of Heads and Tails, we can sort the events by counting H's and T's: HHH ($0.8^3 = 0.512$) is the most likely; HHT, HTH and THH ($0.128$ each) come next; then HTT, THT and TTH ($0.032$ each); and TTT ($0.008$) is the least likely. Events of equal probability may be ordered arbitrarily among themselves. After sorting, we execute Steps 2-5 accordingly. As a result, HHH, HHT, HTH, HTT, THH, THT, TTH, TTT can be encoded as 1, 011, 010, 00011, 001, 00010, 00001, 00000 respectively, and the average length of this example is $1 \times 0.512 + 3 \times 0.128 + 3 \times 0.128 + 5 \times 0.032 + 3 \times 0.128 + 5 \times 0.032 + 5 \times 0.032 + 5 \times 0.008 = 2.184$ bits.
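The steps above can be sketched with Python's `heapq` (a minimal sketch of the merge procedure; tie-breaking may permute codewords among equally likely events, but the codeword lengths and the 2.184-bit average are the same for any valid Huffman tree):

```python
# Huffman coding via a min-heap, applied to the length-3 sequences of a
# p = 0.8 biased coin from the example above.
import heapq
from itertools import product

def huffman(probs):
    # Heap entries: (probability, tie-breaker, partial code table).
    heap = [(q, i, {sym: ""}) for i, (sym, q) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        q0, _, c0 = heapq.heappop(heap)  # least likely subtree -> bit 0
        q1, _, c1 = heapq.heappop(heap)  # next least likely   -> bit 1
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (q0 + q1, counter, merged))
        counter += 1
    return heap[0][2]

p = 0.8
probs = {"".join(s): p ** s.count("H") * (1 - p) ** s.count("T")
         for s in product("HT", repeat=3)}
code = huffman(probs)
avg = sum(probs[s] * len(code[s]) for s in probs)
print(round(avg, 3))  # 2.184
```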

In fact, in this unbalanced tree, the length of the bit-sequence assigned to each possible outcome is determined by its probability, in line with Shannon's source coding theorem.

Time complexity: the algorithm has two parts, a sorting step and an encoding step. A general comparison sort such as merge sort takes $O(N \log N)$ time, where $N$ is the number of possible outcomes. In this special case, however, the events can be sorted simply by the number of Heads in each sequence, from highest to lowest, so sorting takes $O(N)$. The encoding step takes $O(2(N-1))$: each of the $N-1$ merges adds two probabilities and performs one insertion as in Step 3.

So for this example the overall time complexity is linear; for a general source, the $O(N \log N)$ sorting step dominates.

## Comments (2)

## sidjaggi said

at 12:43 am on Nov 15, 2011

Nice job.

1. The figure showing the code tree for instantaneous codes, and the corresponding text, is confusing. Where is the root of the tree? Did you mean three 1s instead of three zeroes?

2. About your description of the time-complexity of Huffman codes, the preprocessing step (of sorting the probabilities) takes time N\log(N), so the overall complexity is slightly greater than linear.

Also, add your own name to the scribe notes -- be proud! ;)

## Jiang Yunxiang said

at 2:20 am on Nov 21, 2011

Sorry to reply so late.

1. Yeah, I made the mistake you pointed out. I have modified it to three 1s instead of three zeroes.

2. Oh, yes. My description was not clear; it only applies to the two-sided coin toss case. For the general scenario, the time complexity is as you say.
