


Page history last edited by Jiang Yunxiang 12 years, 7 months ago

Symbol-by-symbol Source Coding: JIANG Yunxiang



A source code $C$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, where $\mathcal{X}$ is

the source alphabet and $\mathcal{D}$ is the code alphabet ($\mathcal{D}^*$ denotes the set of finite strings over $\mathcal{D}$). $C(x)$ is

called the codeword of $x$.

The expected length $L(C)$ is defined by

$L(C) = \sum_{x \in \mathcal{X}} p(x)\, l(x),$

where $l(x)$ is the length of the codeword for $x$.
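
As a quick sketch of this definition (the helper name expected_length and the example distribution are ours; the code {0, 10, 110, 111} is the instantaneous code for the DNA alphabet discussed below):

```python
# Expected codeword length L(C) = sum over x of p(x) * l(x).
def expected_length(code, p):
    """code: dict symbol -> codeword string; p: dict symbol -> probability."""
    return sum(p[x] * len(code[x]) for x in code)

code = {"A": "0", "C": "10", "G": "110", "T": "111"}
p = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
print(expected_length(code, p))  # 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75
```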








1. Some code definitions

(a) Non-singular codes: the codewords of two different symbols are different.

Every element of $\mathcal{X}$ maps into a different codeword string,

    i.e. $x \neq x' \;\Rightarrow\; C(x) \neq C(x')$



(b) Uniquely decodable codes

A code is uniquely decodable if its extension is non-singular. The extension $C^*$ of a source code $C$ is defined by

$C^*(x_1 x_2 \cdots x_n) = C(x_1)\,C(x_2)\cdots C(x_n)$

Also, we can conclude that different source sequences map into different streams of codewords,

   i.e. $x^n \neq \tilde{x}^n \;\Rightarrow\; C^*(x^n) \neq C^*(\tilde{x}^n)$

     (If $x^n \neq \tilde{x}^n$, then $C^*(x^n)$, the codeword stream for $x^n$, differs from $C^*(\tilde{x}^n)$.)





(c) Prefix-free or instantaneous codes: codes where no codeword is a prefix of any other codeword.

Uniform code and instantaneous code trees

Uniform (fixed-length) codes and instantaneous codes both belong to the class of prefix-free codes introduced above.


For a uniform code, each character is assigned a distinct codeword of the same length, and a sequence is encoded character by character. The decoder takes the same number of bits each time and outputs the corresponding character.


The code tree for the uniform code can be seen as follows:

[Figure: code tree for the uniform code]


For the instantaneous code, the decoder keeps reading bits until a 0 is obtained or three 1's are obtained. In the former case, it decides whether the character is A, C or G based on the number of 1's received before the 0; in the latter case, T is output. The instantaneous code tree can be seen as follows:

[Figure: code tree for the instantaneous code]
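
This decoding rule can be sketched in a few lines (a minimal illustration; the function name decode is ours):

```python
# Instantaneous decoder for the code A -> 0, C -> 10, G -> 110, T -> 111:
# a codeword is recognized the moment its last bit arrives, with no look-ahead.
def decode(bits):
    table = {"0": "A", "10": "C", "110": "G", "111": "T"}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in table:      # complete codeword: emit immediately
            out.append(table[buf])
            buf = ""
    return "".join(out)

print(decode("0101101110"))  # -> ACGTA
```

Because the code is prefix-free, the buffer can match at most one codeword, which is why no look-ahead is ever needed.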


2. Kraft inequality


The important properties of a code are its decodability and its codeword lengths. The following theorem connects the two.

Theorem (Kraft inequality)

For a prefix-free code over an alphabet of size $D$ with $m$ codewords, the codeword lengths $l_1, l_2, \ldots, l_m$ must satisfy the Kraft inequality

$\sum_{i=1}^{m} D^{-l_i} \le 1$

Conversely, given lengths that satisfy the Kraft inequality, one can construct a prefix-free code with those lengths.




A source code can be related to a $D$-ary tree. A codeword of length $l_i$ corresponds to a node at level $l_i$ (the root is at level 0). If the code is prefix-free, no codeword is a descendant of any other codeword in the tree.

To prove the inequality, start with the code tree and grow from each codeword node its complete subtree of descendants down to depth $l_{\max}$, the length of the longest codeword. A codeword at level $l_i$ contributes $D^{l_{\max}-l_i}$ leaves at level $l_{\max}$, and these sets of leaves are disjoint. Comparing their total count with the maximum possible number $D^{l_{\max}}$ gives $\sum_i D^{l_{\max}-l_i} \le D^{l_{\max}}$, i.e. $\sum_i D^{-l_i} \le 1$.

To prove the converse, start with a full tree of the maximum depth $l_{\max}$. Order the lengths $l_1 \le l_2 \le \cdots \le l_m$ and repeatedly assign the leftmost available node at level $l_i$ to a codeword of length $l_i$, removing its descendants. At every stage, the existence of an available node at level $l_i$ is guaranteed by the Kraft inequality.
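
The converse construction can be sketched directly for the binary case (the function name kraft_code is ours; it follows the leftmost-available-node idea above):

```python
# Build a binary prefix-free code from lengths satisfying the Kraft inequality,
# assigning codewords in order of increasing length: keep an integer counter,
# extend it with zeros when moving to a deeper level, and step it by one after
# each assignment so later codewords avoid the subtree just used.
def kraft_code(lengths):
    assert sum(2.0 ** -l for l in lengths) <= 1.0, "Kraft inequality violated"
    codewords, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev_len)                  # descend to level l
        codewords.append(format(value, "0{}b".format(l)))
        value += 1                                # next free node at this level
        prev_len = l
    return codewords

print(kraft_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```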



3. Uniquely Decodable Code and Non-Singular Code


(a) Give a code for the DNA sequences which is uniquely decodable but not a prefix code. Also, give a code for the DNA sequences which is non-singular but not uniquely decodable.

source | singular | non-singular but not uniquely decodable | uniquely decodable but not instantaneous | instantaneous
A      | 0        | 0                                       | 10                                       | 0
C      | 0        | 010                                     | 00                                       | 10
G      | 0        | 01                                      | 11                                       | 110
T      | 0        | 10                                      | 110                                      | 111


For the example of the non-singular but not uniquely decodable code, the codeword stream 010 has several valid interpretations: $C(C) = 010$, but also $C(A)C(T) = 0\cdot 10 = 010$ and $C(G)C(A) = 01\cdot 0 = 010$, which means that the code is not uniquely decodable.


For the example of the uniquely decodable but not instantaneous code, it can be shown that the code is uniquely decodable, but it is not instantaneous because the codeword for G (11) is a prefix of the codeword for T (110).

For the example of the instantaneous code, the code is both uniquely decodable and instantaneous.
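
These prefix claims can be spot-checked mechanically (the helper name is_prefix_free is ours; the two codes are the instantaneous and the uniquely-decodable-but-not-instantaneous columns of the table):

```python
# A code is prefix-free iff, in sorted order, no codeword is a prefix of its
# immediate successor (a prefix always sorts directly before its extensions).
def is_prefix_free(codewords):
    words = sorted(codewords)
    return all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))

print(is_prefix_free(["0", "10", "110", "111"]))  # True: instantaneous
print(is_prefix_free(["10", "00", "11", "110"]))  # False: 11 is a prefix of 110
```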



Let $l_A, l_C, l_G, l_T$ be the number of bits assigned to the characters A, C, G and T respectively, and let $S = 2^{-l_A} + 2^{-l_C} + 2^{-l_G} + 2^{-l_T}$. Consider the following equation.


How does the quantity $S^N$ relate to the encoding of $N$ characters? Hence, give an upper bound for $S^N$ and prove that $S \le 1$.


$S^N = \Big(\sum_{x} 2^{-l_x}\Big)^N = \sum_{x^N} 2^{-l(x^N)} = \sum_{i=1}^{N l_{\max}} A(i)\, 2^{-i}$    // $x^N$ can be any sequence consisting of $N$ symbols; $l(x^N)$ is the length of the codeword stream for $x^N$


Comparing the two expressions, we find that if $i$ is the length of a codeword stream, then $A(i)$ is the number of source sequences mapping into length-$i$ codeword streams.

The number of distinct length-$i$ codeword streams is $2^i$ (binary case).

By the definition of a uniquely decodable code, at most one source sequence is mapped into each codeword stream.

Therefore, no more than $2^i$ source sequences can be mapped into length-$i$ codeword streams, i.e. $A(i) \le 2^i$, so

$S^N \le \sum_{i=1}^{N l_{\max}} 2^i\, 2^{-i} = N l_{\max}$


Taking the $N$th root on both sides, we get $S \le (N l_{\max})^{1/N} \to 1$ as $N \to \infty$, hence $S \le 1$.
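
To see numerically why taking the $N$th root forces $S \le 1$, note how fast the bound $(N l_{\max})^{1/N}$ shrinks toward 1 (the value $l_{\max} = 3$ below is just an illustration):

```python
# The bound S <= (N * l_max)^(1/N) tightens toward 1 as N grows.
l_max = 3  # illustrative longest-codeword length
vals = [(N * l_max) ** (1.0 / N) for N in (1, 10, 100, 1000)]
print(vals)  # strictly decreasing, approaching 1
```

Since $S$ is a fixed constant while the bound approaches 1, any $S > 1$ would eventually violate the bound; hence $S \le 1$.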



4. Alternative Proof for Entropy Bound

   (a) Prove that for any $x > 0$, $\ln x \le x - 1$. What is the necessary and sufficient condition for equality? This inequality is called the Fundamental Inequality.


Let $f(x) = \ln x - (x - 1)$. Then $f'(x) = \frac{1}{x} - 1$, so $f'(x) > 0$ for $0 < x < 1$ and $f'(x) < 0$ for $x > 1$.


Thus $f$ attains its maximum value at $x = 1$, where $f(1) = 0$. As a result, $\ln x - (x - 1) \le 0$, i.e.

$\ln x \le x - 1$ for all $x > 0$, and $x = 1$ is the necessary and sufficient condition for equality.

(b) According to Jensen's inequality, for a uniquely decodable code with codeword lengths $l(x)$,

$L(C) - H(X) = \sum_x p(x)\,l(x) + \sum_x p(x)\log_2 p(x) = -\sum_x p(x)\log_2 \frac{2^{-l(x)}}{p(x)} \ge -\log_2 \sum_x p(x)\,\frac{2^{-l(x)}}{p(x)} = -\log_2 \sum_x 2^{-l(x)} \ge 0,$

where the first inequality is Jensen's inequality applied to the concave function $\log_2$, and the last step follows from the Kraft inequality. Hence $L(C) \ge H(X)$.


5. Tightness of Entropy Bound

(a) Consider the binary encoding of a fair die toss. Is it possible to achieve the entropy bound?


The codeword length $l_i$ is always an integer, while $\log_2 \frac{1}{p_i}$ is usually not a rational number (for a fair die, $\log_2 6$ is irrational). They are not equal in general, so the entropy bound cannot be achieved exactly.

The easiest way to make the logarithm an integer is to use the ceiling function.



(b) Let $X$ be a random variable with outcomes $\{1, 2, \ldots, m\}$ and let $p_i$ be the probability of outcome $i$. Also, let $l_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil$ be the number of bits used to encode outcome $i$. Prove that there exists a prefix code with these lengths; this code is called the Shannon code. Prove that

$H(X) \le L < H(X) + 1$

          First, $\sum_i 2^{-l_i} = \sum_i 2^{-\lceil \log_2(1/p_i) \rceil} \le \sum_i 2^{-\log_2(1/p_i)} = \sum_i p_i = 1$, so the lengths satisfy the Kraft inequality and a prefix code exists.

          Then $L = \sum_i p_i \left\lceil \log_2 \frac{1}{p_i} \right\rceil \ge \sum_i p_i \log_2 \frac{1}{p_i} = H(X)$ and $L < \sum_i p_i \left( \log_2 \frac{1}{p_i} + 1 \right) = H(X) + 1$. This completes the proof.
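
A quick numeric check of these bounds for the fair die from part (a) (variable names are ours):

```python
import math

# Shannon code lengths l_i = ceil(log2(1/p_i)) for a fair six-sided die,
# verifying both the Kraft inequality and H(X) <= L < H(X) + 1.
p = [1.0 / 6] * 6
lengths = [math.ceil(math.log2(1.0 / pi)) for pi in p]  # ceil(log2 6) = 3 each
L = sum(pi * li for pi, li in zip(p, lengths))          # expected length: 3 bits
H = -sum(pi * math.log2(pi) for pi in p)                # log2 6, about 2.585 bits
print(lengths, L, H)
```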

(c) Instead of encoding symbol-by-symbol, we encode $n$ symbols at a time, i.e. we use the Shannon code for all possible sequences of $n$ symbols. Let $L$ be the expected number of bits to encode $n$ symbols and $X^n$ be the random variable for a sequence of $n$ symbols.


Applying part (b) to $X^n$, the average length for $n$ symbols satisfies $H(X^n) \le L < H(X^n) + 1$, so the average length per symbol satisfies $\frac{H(X^n)}{n} \le \frac{L}{n} < \frac{H(X^n)}{n} + \frac{1}{n}$. For an i.i.d. source $H(X^n) = nH(X)$, so $L/n \to H(X)$ as $n \to \infty$.
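
This squeeze can be illustrated numerically with a biased coin, $p(H) = 0.8$ (an illustrative choice; all names are ours):

```python
import math
from itertools import product

# Shannon-code n symbols at a time for a biased coin and watch the per-symbol
# rate L/n get squeezed between H(X) and H(X) + 1/n.
p = {"H": 0.8, "T": 0.2}
H = -sum(q * math.log2(q) for q in p.values())  # about 0.722 bits/symbol
for n in (1, 2, 4, 8):
    L = 0.0
    for seq in product("HT", repeat=n):
        q = math.prod(p[s] for s in seq)         # probability of the block
        L += q * math.ceil(math.log2(1.0 / q))   # Shannon length of the block
    print(n, L / n)  # per-symbol rate, between H and H + 1/n
```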



6. Huffman code

As an algorithm that achieves this requirement, we suggest the Huffman code, which can be described as follows:


Step 1: Calculate the probabilities of all $N$ possible sequences and sort them in descending order.

Step 2: Add the probabilities of the two lowest-probability events (sequences), $N$ and $N-1$, to form a new event $N+1$. Label the branch of the higher-probability event 1 and the lower one 0.

Step 3: Insert event $N+1$ into the sorted list of the remaining $N-2$ events, preserving descending order of probability.

Step 4: Move up to the next level of the tree. If only one event remains, it is the root; go to Step 5. Otherwise, go to Step 2.

Step 5: Read off the {0,1} labels from the root to each leaf of the tree to obtain the corresponding bit-sequence.
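
The steps above can be sketched with a heap in place of explicit re-sorting (a minimal illustration; the function name huffman is ours):

```python
import heapq
import math
from itertools import count, product

# Huffman coding: repeatedly merge the two least-probable events, labelling the
# lower-probability branch 0 and the higher one 1, as in the steps above.
def huffman(probs):
    """probs: dict event -> probability; returns dict event -> codeword."""
    tie = count()  # tiebreaker so the heap never has to compare dicts
    heap = [(p, next(tie), {e: ""}) for e, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # lowest probability -> bit 0
        p2, _, c2 = heapq.heappop(heap)   # second lowest -> bit 1
        merged = {e: "0" + w for e, w in c1.items()}
        merged.update({e: "1" + w for e, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

# The length-3 biased-coin example from the text, with p(H) = 0.8:
probs = {"".join(s): math.prod(0.8 if c == "H" else 0.2 for c in s)
         for s in product("HT", repeat=3)}
code = huffman(probs)
avg = sum(probs[s] * len(code[s]) for s in probs)
print(round(avg, 3))  # 2.184
```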


   For example: if p = 0.8 and 1-p = 0.2, and we want to encode length-3 sequences of coin tosses, the possible events are HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Since the probability of each event is determined by the numbers of heads and tails it contains, we only need to count the H's and T's in each event to sort them. TTT, with three T's, has the lowest probability (0.008); HTT, THT and TTH come next (0.032 each) and may be ordered arbitrarily among themselves; then HHT, HTH and THH (0.128 each); HHH has the highest probability (0.512). After sorting, we execute Steps 2-5 accordingly. As a result, HHH, HHT, HTH, HTT, THH, THT, TTH, TTT can be encoded respectively as 1, 011, 010, 00011, 001, 00010, 00001, 00000. The average length of this example is 1*0.512 + 3*0.128 + 3*0.128 + 5*0.032 + 3*0.128 + 5*0.032 + 5*0.032 + 5*0.008 = 2.184 bits, slightly above the entropy $3H(0.8) \approx 2.166$ bits.

   In fact, this unbalanced tree assigns a unique bit-sequence to each possible outcome, with the number of bits determined by the probability $p$, in agreement with Shannon's source coding theorem.

  Time complexity: the algorithm has two parts, sorting and encoding. A general-purpose comparison sort such as merge sort takes $O(N \log N)$ time, where $N$ is the number of possible outcomes. In this special case, however, the order depends only on the number of heads in each sequence, so the sequences can be sorted by counting heads, which takes $O(N)$ time. The encoding part takes $O((1+1)(N-1)) = O(2(N-1))$ time: at each of the $N-1$ merges, one addition of two event probabilities and one insertion (Step 3) are performed, and $N-1$ is the number of merges needed to reach the root.

So it can be concluded that, for this special case, the time complexity of the algorithm is linear.



Comments (2)

sidjaggi said

at 12:43 am on Nov 15, 2011

Nice job.

1. The figure showing the code tree for instantaneous codes, and the corresponding text, is confusing. Where is the root of the tree? Did you mean three 1s instead of three zeroes?

2. About your description of the time-complexity of Huffman codes, the preprocessing step (of sorting the probabilities) takes time N log(N), so the overall complexity is slightly greater than linear.

Also, add your own name to the scribe notes -- be proud! ;)

Jiang Yunxiang said

at 2:20 am on Nov 21, 2011

Sorry to reply so late.
1. Yes, I made the mistake you describe. I have modified it to three 1s instead of three zeroes.
2. Yes, my description was not clear; it only applies to the two-sided coin toss case. For the general scenario, the time complexity is as you say.
