• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!



Page history last edited by Manson Fong 12 years, 7 months ago

Scribe note: Manson FONG (Week 4)


Source coding theorem


1. Overview


First, let's recall the scenario of source coding.


Here, on the left is a source, which "emits" sequences of source symbols from a certain alphabet Formula. The sequences, Formula, are sent to an encoder, which converts the sequences into sequences of code symbols, FormulaFormula being from some alphabet Formula. For simplicity, Formula will be taken as binary. The goal in source coding then is to encode (or represent) the source sequences using as few bits as possible, at the same time allowing the code sequences to be decoded uniquely into their respective original sequences, Formula. In other words, it is desirable for m to be as small as possible for a given n. For "historical" reasons, the sender (the coder) is always named Alice, and the receiver (the decoder) is always named Bob.


     For a source sequence of length n, it would take Formula bits to encode the sequence without any compression. We can, of course, go with a more economical way, by taking into account the statistical properties of the source symbols. For example, in a symbol-by-symbol compression scheme, we can opt to only shorten the codes for symbols that occur with relatively high frequencies, and that alone, on average, would already result in a shorter code length for a long sequence. Or, in a whole-sequence compression scheme, as discussed in a couple of classes back, we could opt to encode only the typical sequences Formula with Formula bits, and leave the atypical sequences unencoded.


     The source coding theorem states that for n i.i.d. random variables, each with entropy  Formula1, there is a limit as to how much the data can be compressed; moreover, this limit happens to be  Formula . Conversely, when a scheme attempts to compress data beyond this limit, there is always a certain non-zero probability of error, i.e., there are occasions when the original data cannot be recovered (or be decoded).


     In the following, we will first prove the converse of the theorem by establishing a lower bound for the probability of error for schemes that disregard this limit,  Formula. After that, we will return to show that the limit can indeed be achieved, using the method of random coding.


2. Converse for source coding theorem


Our goal here is to prove that when we compress the data associated with n i.i.d. random variables to a sequence of  Formula  bits, then for Formula, Formula; i.e., if Formula, there would be a finite probability of error.




Formula (Formula Formula's are i.i.d.)

Formula (Formula Formula)

Formula (Formula data processing inequality)


Formula (Formula the maximum of Formula is achieved

                                         when every Formula's are independent)

Formula (Formula the maximum value of Formula is 1 since

                                 Formula is binary)

Formula (Formula Fano's inequality)




Formula as Formula.


Obviously, therefore, when Formula, Formula.


3. Achieveability for source coding theorem (using the method of random coding)

To prove the achieveability for source coding theorem, we will consider a scenario in which every possible source sequence Formula is randomly assigned a code Formula. Here, the rate Formula.


     To decode, i.e., to retrieve the original sequence Formula given a particular string Formula, the decoder would check if there is a unique, typical Formula that is mapped to this Formula. If so, the decoder would output Formula; if not, the decoder would declare a decoding error. Below, we will prove that the probability of error, i.e., Formula, is asymptotically negligible.





First, recall that the codes for a particular sequence is assigned entirely randomly, regardless of what codes have already been used. Thus, the set of possible resulting code sequences can be approximated by the set of typical sequences.


Therefore, given Formula and a particular source sequence Formula, we have Formula,


Moreover, since the sequences Formula are chosen ENTIRELY randomly, Formula. Hence,










As Formula, the probability of error is negligible.


Intuition for channel coding theorem


In channel coding, we are concerned about transmitting a message, Formula, over a noisy channel. The message, after transmission, will be converted to a certain Formula. To motivate the proof of the channel coding theorem (to be discussed in the next class), we will first consider a special case of the DMC (discrete memoryless channel), in which the channel is characterized by the per-symbol transition probabilities Formula.


     The first question one might like to ask is: how many number of distinct messages can such a channel "support", i.e., to transmit reliably with negligible errors? Assuming that only the typical sequences Formula can be observed on the receiving end of the channel , then, there are approximately Formula spaces available2. However, because of the noise in the channel, a typical Formula can be mapped to approximately Formula elements, which is the size of of  Formula.


     To ensure that the messages can be decoded accurately, the encoding scheme must be such that given any two messages Formula and Formula, the two sets Formula and  Formula must be as non-overlapping as possible. To use an analogy, it is like we are trying to fit balls of size Formula into a larger ball of size  Formula, and there are only just so much room for Formulaballs.



     Moreover, since Formula depends on Formula, Formulaalso depends on Formula; therefore, the number of distinct messages that can be sent per block of n bits varies according to Formula. As a result, it is useful to define a quantity, called capacity (denoted by Formula), which is the maximum rate achieveable for a given channel (defined via its transition probabilities Formula). It is a rate in a sense that it specifies the number of distinct messages that can be sent per block of n bits, through this expression Formula.


     The channel coding theorem is a rigorous statement of the above reasonings. It states that it is impossible to transmit messages at any rate beyond  Formula; in other words, any rates beyond Formula are not achievable.




1. In the following, we use Formula and Formula interchangeably. Here, of course, Formula represents the probability mass functions (p.m.f.s) of the random variable Formula.

2. In principle, there can be as many as Formula distinct codes for a code sequence of length n. The number of allowed distinct messages, however, is smaller than Formula, because the set of observable Formula's are constrained by both Formula and Formula. Thus, such rate as Formula  is not achieveable, as required by the channel coding theorem.


Comments (6)

sidjaggi said

at 3:00 am on Oct 11, 2011

Thanks for your (partial) notes thus far, Manson. We'll see some more material in today's class that'll hopefully help address your last concern!

Jiang Yunxiang said

at 11:35 am on Oct 11, 2011

In a DMC channel, the probability of y could easily be gained by p(x) and p(y/x). If we also consider typical sequence of y, its number is obviously the previous one (2^nH(y)), not |y|^n.
Of course, we could generate the distinct codes based on |y|^n. But |y|^n - 2^nH(y) ones are atypical which could be considered as the error correcting fields.
If the actual sequence falls into typical sequence, we can say this code is correct. But if the actual sequence falls into atypical sequence without any error correction, we should consider it as atypical sequence. Then considering some error correction method, we should try our best to map the atypical sequence into a typical sequence.
I think this may be the key of channel coding theorem in theoretical analysis. That's why in the first design of channel coding, we need map 2^nH(x) into 2^nH(y), not |y|^n. Mapping is a method that does not mean actual codes could not fall into other balls.
Because of just comments, I could not type some formula in this reply and just give some simple analysis about this problem. Actually, it could be solved in mathematical way. This is just my point of view, welcome to give some comments.

Jiang Yunxiang said

at 11:44 am on Oct 11, 2011

Moreover, increasing coding space of y could not increase the mutual information I(x,y). Using 2^nH(y) space is enough to achieve this capacity. Another |y|^n - 2^nH(y) can be considered as redundant space.

sidjaggi said

at 3:01 am on Oct 11, 2011

I'll give more comments once your notes are complete (this Thursday, yes?) And the other scribe should also be working on his/her notes ;)

sidjaggi said

at 11:56 am on Oct 11, 2011

Nice intuition, yunxiang.
however, remember a y^n that's atypical with respect to a particular x^n may be typical with respect to another x'^n.
the real reason is a little bit more subtle, and we'll discuss this in class today (remind me if i forget :)

Manson Fong said

at 1:01 pm on Oct 11, 2011

Thanks Prof. Jaggi & Yunxiang, for the responses. Looking forward to the discussion today!

I will try to complete the notes by 11:59 pm this Thursday :)

You don't have permission to comment on this page.