Scribe note: Manson FONG (Week 4)
Source coding theorem
1. Overview
First, let's recall the scenario of source coding.
In this setup, on the left is a source, which "emits" sequences of source symbols from a certain alphabet $\mathcal{X}$. The sequences, $x^n = (x_1, \ldots, x_n)$, are sent to an encoder, which converts them into sequences of code symbols, $b^m = (b_1, \ldots, b_m)$, each $b_i$ being from some alphabet $\mathcal{B}$. For simplicity, $\mathcal{B}$ will be taken as binary, i.e., $\mathcal{B} = \{0, 1\}$. The goal in source coding is then to encode (or represent) the source sequences using as few bits as possible, while at the same time allowing the code sequences to be decoded uniquely into their respective original sequences, $x^n$. In other words, it is desirable for $m$ to be as small as possible for a given $n$. For "historical" reasons, the sender (the coder) is always named Alice, and the receiver (the decoder) is always named Bob.
For a source sequence of length $n$, it would take $n \log_2 |\mathcal{X}|$ bits to encode the sequence without any compression. We can, of course, be more economical by taking into account the statistical properties of the source symbols. For example, in a symbol-by-symbol compression scheme, we can opt to shorten the codes for symbols that occur with relatively high frequencies, and that alone, on average, would already result in a shorter code length for a long sequence. Or, in a whole-sequence compression scheme, as discussed a couple of classes back, we could opt to encode only the typical sequences, using roughly $nH(X)$ bits, and leave the atypical sequences unencoded.
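To make the symbol-by-symbol idea concrete, here is a small sketch comparing a fixed-length code with a variable-length prefix code matched to the symbol frequencies. The four-symbol source, its probabilities, and the prefix code are all toy values invented for this illustration.

```python
# Hypothetical illustration of symbol-by-symbol compression: a 4-symbol
# source encoded with (a) a fixed-length code and (b) a variable-length
# prefix code that gives shorter codewords to more frequent symbols.
from math import log2

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

fixed_bits = log2(len(probs))                 # 2 bits/symbol, no compression
prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# expected code length per symbol under the prefix code
avg_bits = sum(p * len(prefix_code[s]) for s, p in probs.items())
# entropy of the source, the fundamental lower bound
entropy = -sum(p * log2(p) for p in probs.values())

print(fixed_bits, avg_bits, entropy)          # 2.0 1.75 1.75
```

For this particular source the prefix code meets the entropy exactly (1.75 bits/symbol versus 2 uncompressed); in general a symbol-by-symbol code only gets within one bit of the entropy.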
The source coding theorem states that for $n$ i.i.d. random variables, each with entropy $H(X)$, there is a limit to how much the data can be compressed; moreover, this limit happens to be $nH(X)$ bits. Conversely, when a scheme attempts to compress the data beyond this limit, there is always a certain non-zero probability of error, i.e., there are occasions when the original data cannot be recovered (decoded).
In the following, we will first prove the converse of the theorem by establishing a lower bound on the probability of error for schemes that disregard this limit, $nH(X)$. After that, we will return to show that the limit can indeed be achieved, using the method of random coding.
2. Converse for source coding theorem
Our goal here is to prove that when we compress the data associated with $n$ i.i.d. random variables $X_1, \ldots, X_n$ into a sequence of $m = nR$ bits, then for $R < H(X)$, $P_e \not\to 0$; i.e., if $R < H(X)$, there would be a finite probability of error.
Proof
Let the encoder map the source sequence $X^n$ to the bits $B^m = (B_1, \ldots, B_m)$, and let $\hat{X}^n$ denote the decoder's estimate, so that $X^n \to B^m \to \hat{X}^n$ forms a Markov chain. Then

$nH(X) = H(X^n)$  (the $X_i$'s are i.i.d.)
$= I(X^n; \hat{X}^n) + H(X^n \mid \hat{X}^n)$  (definition of mutual information)
$\le I(X^n; B^m) + H(X^n \mid \hat{X}^n)$  (data processing inequality)
$\le \sum_{i=1}^{m} H(B_i) + H(X^n \mid \hat{X}^n)$  ($I(X^n; B^m) \le H(B^m)$, and the maximum of $H(B^m)$ is achieved when the $B_i$'s are independent)
$\le m + H(X^n \mid \hat{X}^n)$  (the maximum value of $H(B_i)$ is 1 since $B_i$ is binary)
$\le m + 1 + P_e \, n \log_2 |\mathcal{X}|$  (Fano's inequality)

Thus,
$P_e \ge \dfrac{nH(X) - m - 1}{n \log_2 |\mathcal{X}|} = \dfrac{H(X) - R - 1/n}{\log_2 |\mathcal{X}|} \to \dfrac{H(X) - R}{\log_2 |\mathcal{X}|}$ as $n \to \infty$.

Obviously, therefore, when $R < H(X)$, $P_e$ is bounded away from zero, and the original data cannot always be recovered.
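The Fano-based lower bound $P_e \ge (H(X) - R - 1/n)/\log_2 |\mathcal{X}|$ can be evaluated numerically. Below is a hypothetical check for a source of fair binary digits ($H(X) = 1$, $|\mathcal{X}| = 2$) compressed at rate $R = 0.5$; all parameter values are chosen only for illustration.

```python
# Hypothetical numeric check of the converse bound:
# P_e >= (H(X) - R - 1/n) / log2|X|.
from math import log2

H = 1.0      # entropy of a fair bit
alpha = 2    # alphabet size |X|
R = 0.5      # attempted compression rate (bits per source symbol)

for n in (10, 100, 1000):
    lower_bound = (H - R - 1 / n) / log2(alpha)
    print(n, lower_bound)
# The bound tends to (H - R)/log2|X| = 0.5 as n grows: in the limit,
# at least half of all blocks must be decoded incorrectly.
```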
3. Achievability for the source coding theorem (using the method of random coding)
To prove achievability for the source coding theorem, we will consider a scenario in which every possible source sequence $x^n$ is randomly assigned a code $f(x^n) \in \{0, 1\}^m$. Here, the rate $R = m/n$.
To decode, i.e., to retrieve the original sequence given a particular string $b^m$, the decoder checks whether there is a unique typical $x^n$ that is mapped to this $b^m$. If so, the decoder outputs that $x^n$; if not, the decoder declares a decoding error. Below, we will prove that the probability of error, $P_e$, is asymptotically negligible whenever $R > H(X)$.
Proof
First, recall that the code for a particular sequence is assigned entirely at random, regardless of what codes have already been used. Moreover, since the atypical sequences carry vanishing probability, the set of source sequences we need to consider can be approximated by the set of typical sequences, of which there are approximately $2^{nH(X)}$.
Therefore, given a typical source sequence $x^n$ and any other typical sequence $\tilde{x}^n \neq x^n$, a decoding error can occur only if their codes collide, i.e., only if $f(\tilde{x}^n) = f(x^n)$. Moreover, since the codes are chosen ENTIRELY randomly, $\Pr[f(\tilde{x}^n) = f(x^n)] = 2^{-m}$. Hence, by the union bound over the approximately $2^{nH(X)}$ typical sequences,
$P_e \lesssim 2^{nH(X)} \cdot 2^{-m} = 2^{-n(R - H(X))}$.
Consequently, as $n \to \infty$, if $R > H(X)$, the probability of error is negligible.
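The collision probability at the heart of the random-coding argument can be estimated empirically. The following is a toy Monte Carlo sketch (all parameters are stand-in values, not from the lecture): each "typical" sequence receives an independent uniform $m$-bit code, and we estimate how often some other sequence collides with a given one.

```python
# Hypothetical Monte Carlo sketch of the random-coding argument:
# assign each typical sequence an independent uniform m-bit code and
# estimate the probability that a fixed sequence collides with another.
import random

random.seed(0)

n_typical = 2**10   # stand-in for the ~2^{nH(X)} typical sequences
m = 16              # code length in bits; rate chosen above the entropy

trials = 1000
collisions = 0
for _ in range(trials):
    codes = [random.getrandbits(m) for _ in range(n_typical)]
    target = codes[0]
    if any(c == target for c in codes[1:]):
        collisions += 1

print(collisions / trials)
# The union bound predicts at most (n_typical - 1) * 2**-m ~ 0.0156,
# and the empirical estimate should be consistent with that.
```

Doubling $m$ (i.e., raising the rate) squares the per-pair collision probability, which is why the error decays exponentially in $n(R - H(X))$.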
4. Intuition for channel coding theorem
In channel coding, we are concerned with transmitting a message, $W$, over a noisy channel. The message, encoded as a channel input $x^n$, will after transmission be converted to a certain output $y^n$. To motivate the proof of the channel coding theorem (to be discussed in the next class), we will first consider a special case of the DMC (discrete memoryless channel), in which the channel is characterized by the per-symbol transition probabilities $p(y|x)$.
The first question one might like to ask is: how many distinct messages can such a channel "support", i.e., transmit reliably with negligible errors? Assuming that only the typical sequences $y^n$ can be observed on the receiving end of the channel, there are approximately $2^{nH(Y)}$ "spaces" available^2. However, because of the noise in the channel, a typical input $x^n$ can be mapped to approximately $2^{nH(Y|X)}$ elements, which is the size of the conditionally typical set of $y^n$ given $x^n$.
To ensure that the messages can be decoded accurately, the encoding scheme must be such that, given any two messages $w$ and $w'$, the two sets of likely outputs for $x^n(w)$ and $x^n(w')$ must be as non-overlapping as possible. To use an analogy, it is as if we are trying to fit balls of size $2^{nH(Y|X)}$ into a larger ball of size $2^{nH(Y)}$, and there is only room for about $2^{n(H(Y) - H(Y|X))} = 2^{nI(X;Y)}$ such balls.
Moreover, since $H(Y)$ depends on the input distribution $p(x)$, $I(X;Y)$ also depends on $p(x)$; therefore, the number of distinct messages that can be sent per block of $n$ channel symbols varies according to $p(x)$. As a result, it is useful to define a quantity called the capacity (denoted by $C$), which is the maximum rate achievable for a given channel (defined via its transition probabilities $p(y|x)$): $C = \max_{p(x)} I(X;Y)$. It is a rate in the sense that it specifies the number of distinct messages that can be sent per block of $n$ channel symbols, through the expression $M = 2^{nC}$.
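The maximization over $p(x)$ can be carried out by brute force for a simple channel. Below is a hedged sketch for a binary symmetric channel (BSC) with crossover probability $\epsilon$, a standard example not taken from these notes; the known closed form is $C = 1 - H_b(\epsilon)$, achieved by the uniform input.

```python
# Hedged sketch: compute C = max over p(x) of I(X;Y) for a binary
# symmetric channel by scanning input distributions on a grid.
from math import log2

def hb(p):
    # binary entropy function H_b(p)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def mutual_information(px, eps):
    # I(X;Y) = H(Y) - H(Y|X) for a BSC with flip probability eps
    py1 = px * (1 - eps) + (1 - px) * eps   # P(Y = 1)
    return hb(py1) - hb(eps)

eps = 0.1
capacity = max(mutual_information(px / 1000, eps) for px in range(1001))
print(capacity, 1 - hb(eps))   # both approximately 0.531
```

The scan peaks at the uniform input $p(x) = 1/2$, matching the closed-form answer; a noisier channel (larger $\epsilon$) shrinks $C$ toward zero.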
The channel coding theorem is a rigorous statement of the above reasoning. It states that it is impossible to transmit messages reliably at any rate beyond $C$; in other words, any rate beyond $C$ is not achievable.
Notes:
1. In the following, we use $p(x)$ and $p_X(x)$ interchangeably. Here, of course, $p_X$ represents the probability mass function (p.m.f.) of the random variable $X$.
2. In principle, there can be as many as $|\mathcal{Y}|^n$ distinct output sequences of length $n$. The number of allowed distinct messages, however, is smaller than $|\mathcal{Y}|^n$, because the set of observable $y^n$'s is constrained by both $p(x)$ and $p(y|x)$. Thus, a rate as high as $\log |\mathcal{Y}|$ is not achievable, as required by the channel coding theorem.