Introduction to Coding Problem and Part of Problem Set 1(Scribe: WANG Limin)
I. Introduction to Coding Problem
The main problem of information theory can be described as follows: Imagine a stream of source data, say in the form of bits (0 or 1), is being transmitted over a communication channel. In the source, we want to encode the source message to make it more efficient and robust for transmission. Meanwhile, receiver will hope to decode the message and get meaning of the message(sometimes we may not recover the source message totally due to the noise in the channel).
See the next figure for an example, our source send a message denoted as
where the alphabet is
, then the encoder encodes the message as codeword
where the alphabet is
, finally the decoder receives the codeword and recovers the message as
where the alphabet is
. Encoder is usually a function that assigns a codeword to each source message and decoder attempts to correct any errors in transmission and then recovers the original message. Usually the random variable
form a first order Markov Chain, i.e. 

Ideally, we wish to decode correctly and get the source message, i.e.
. However, in fact there is always noise during the transmission (e.g. bit 0 has probability p to be turned into 1 and vice-versa) and we can not guarantee the receiver recovers the message totally correct. In information theory, we usually consider the error probability
less than some threshold
,i.e.
. In addition to robustness, we need to take the efficiency into account and want the length of our code is smaller than the source message, i.e.
. We usually define the rate
. When
, the error
may converge to
and the rate
can reach to best rate
, see the next figure for illustration.

There is a common sense that encoding source information, by adding additional information, sometimes referred to as redundancy, that can be used to detect and perhaps correct errors in transmission. The more redundancy that we add to the original source code. the more reliably we can detect and correct errors, but the less efficient we become at transmitting the source data. Thus, we need to keep a balance between efficiency and robustness.
II. Part of Problem Set 1
A certain fair coin is tossed n times. Each time it is tossed, the result is independent of the previous toss. This is an example of an i.i.d. binary random variable.
Q1: (Worst-case "compression")
(a) If the coin described above is tossed n times, how many distinct length-n sequences of coin tosses are possible?
The answer is 
(b) Given N possible outcomes of a random variable, how many bits (elements from {0,1}) are required to assign a unique length-N bit-sequence to each possible outcome?
The answer is
,see the next figure for an example, we have 8 possible out comes and we need 3 bit to assign a unique length-3 bit sequence to each possible outcome. Generally, for a full binary tree with N leaves, the height is 

(c) Describe a method based on binary trees, to assign a length-N bit-sequence to an outcome, based on a binary tree, and a corresponding method to reconstruct the coin-toss sequence from the bit-sequence.
Code: if coin is tossed head, the output bit is 0; if coin is tossed tail the output bit is 1.
Decode: if the bit is 0, the tossed result is head; if the bit is 1, the tossed result is tail.
Since the coin is fair, i.e the probability of head is 0.5, we can use a full binary tree for our coding and decoding scheme, see the next figure for a example. If we coin is head, we then go left and the bit is 0. If coin is tail, we then go right and the bit is 1. The same process is for decoding.

(d) If the encoding alphabet is ternary (eg: {0,1,2}) instead of binary, how does your answer to (b) change? How about your answer to (c), if ternary trees are allowed? What if you were tossing a six-sided dice instead of a coin? How do your answers change? What is the time-complexity of your assignment schemes?
The answer is
for problem (b). We can think of ternary trees with N leaves, then the average of tree height is
, see the next figure for a example with 9 leaves, we only need 2 ternary bits.

Coding idea:
Step1: We can according the coding scheme of problem (d) to map the tossed result into a n-length bit, the time complex is
, e.g. HTHHT coded as 01001.
Step2: We need translate the binary code into ternary code, the time complexity is
, e.g. (01001)2=(100)3
Thus the total time complexity is
.
Decoding idea:
The reverse order of coding scheme and the time complexity is
.
Note: n is the length of bit sequence.
If I toss a six-sided dice, the answer is similar to a coin.
A second coin is biased with probability p, i.e., it comes up Heads with probability p and Tails with probability 1-p. Each time it is tossed, the result is independent of the previous toss. This is an example of an i.i.d. binary random variable, also called a Bernoulli(p) random variable.
Q2: (Average-case compression)
(b) What is the probability of observing a particular coin-toss sequence with
Heads?
The answer is 
(c) What is the size of the type-class T(k,n) of all coin-toss sequences with exactly k Heads? What is therefore the probability of observing the type-class T(k,n)?
The size of the type-class T(k,n) is
and thus the probability of observing the type class T(k,n) is 
The other questions are in the next scribe and thanks for your reading.
Comments (8)
ZhanLei_Jacky said
at 2:04 pm on Sep 23, 2011
Hi, regarding the second figure showing the relationship between Rn and n, why the Rn decreases as n increases while Rn = n / m?
Jacky
WANG Limin said
at 2:43 pm on Sep 23, 2011
I think the second figure show our coding goal. We hope that Rn and en decreases when n increase. The reason may be related to the coding scheme. I do not know the details either.
Pili HU said
at 10:08 pm on Sep 23, 2011
I think the left part and right part of figure2 illustrates difference scenes. For the right part, it's easier to understand that, when m is a constant number, if we increase code length, we could reach lower error rate. As when n goes infinite, we can repeat the message for infinite times, to guarantee it could be correctly received. For the left part, I think it illustrates the scene that, the CODING ALGORITHM is defined, and m appears to be a function of n, say f(n), thus the rate differs with different n(more precisely, n be a function of m, but this is invertible).
Here's an example: consider a parity checking scheme, which will append one more bit to the message. Thus, n = m + 1, Rn = n / (n - 1) . The series of Rn appears to be, 2/1, 3/2, ..., 1(at infinite, this is Rbest)
Of course, this scheme can't help to correct fault messages.
That's my understanding of the graph...
ZhanLei_Jacky said
at 11:00 am on Sep 26, 2011
Actually, even though m is a function of n, you cannot guarantee that Rn is a decreasing function~ I mean, the figure may not be true for some CODING ALGORITHMs.
sidjaggi said
at 2:16 am on Oct 11, 2011
Good comments, all!
First of all, let's clarify the meaning of rate for compression. Before that, let's even clarify the meaning of "code".
Our encoding and decoding functions are denoted respectively by the functions f and g. That is, x^n = f(u^m), and u'^m = g(x^n) = g(f(u^m)). The sequence u^m is a random sequence due to the fact that U is a random variable. Because of this x^n and u'^m may also be random variables (includnig their lengths).
Also, in general, we allow randomization in our definitions of functions, so, for instance, x^n may not be a deterministic function of u^m.
Together, (f,g) denotes the "code".
In general, the code may be "good", or "bad". Let's call a code ``\epsilon_m-good" if the (average) probability of error (defined as \sum_{u^m} \Pr(u^m) l(u'^m \neq u^m) is less than some \epsilon_m.
Also, the "rate" R_m should be defined as 1/m \sum_{u^m} \Pr(u^m) l(x^n). (Note that I denote it as a function of m rather than n.)
Here l(x^n) is defined as the length of the codeword x^n (might be variable length, not necessarily exactly n), and \Pr(u^m) is the probability of observing message u^m. Hence \sum_{u^m} \Pr(u^m) l(x^n) denotes the average length of the codewords, and when we further divide by m, this gives the "amortized" average length, i.e., the average length of encoding per input symbol. This average length is, in general, of course a function of the encoding-decoding scheme.
Now let's see what the goals are.
We want that for increasing m, that there exist an infinite sequence of \epsilon_m-good codes (f,g) (one for each m) so that, simultaneously, R_m converges to R_best, and \epsilon_m converges to zero.
It is true that it is not clear that this is possible (for one thing, it depends on R_best), but that is one of the goals of this class.
Hope that clarifies definitions -- we'll also repeat this in class!
sidjaggi said
at 2:16 am on Oct 11, 2011
Good job with the scribe notes, btw! I like the figure!
sidjaggi said
at 2:26 am on Oct 11, 2011
Actually, for the answers you gave, would you mind explaining a little bit more, so that future readers will be able to understand a bit better? For instance, for 1(b), your answer is
"The answer is \lceil \log_2 N \rciel"
I would have explained, perhaps, with the help of a figure of a tree, showing why this is the case. Similarly for 1(c) the tree figure would help! For 1(d), doing a better job describing the algorithm for completeness sake (even if the next scribe also does so) would be nice. And so on.
Think of these notes as being for posterity -- future generations of information theorists should look at these notes and be like "WOWWWWW!!!!" ;)
WANG Limin said
at 8:35 pm on Oct 11, 2011
Thanks for your advice for the scribe note and I will modify my scribe note and add some details
You don't have permission to comment on this page.