**Scribe note: Manson FONG (Week 4)**

**Source coding theorem**

**1. Overview**

First, let's recall the scenario of source coding.

Here, on the left is a source, which "emits" sequences of source symbols from a certain alphabet . The sequences, , are sent to an encoder, which converts the sequences into sequences of code symbols, , being from some alphabet . For simplicity, will be taken as **binary**. The goal in source coding then is to encode (or represent) the source sequences using as few bits as possible, at the same time allowing the code sequences to be decoded uniquely into their respective original sequences, . In other words, it is desirable for *m* to be as small as possible for a given *n*. For "historical" reasons, the sender (the coder) is always named Alice, and the receiver (the decoder) is always named Bob.

For a source sequence of length n, it would take bits to encode the sequence without any compression. We can, of course, go with a more economical way, by taking into account the statistical properties of the source symbols. For example, in a symbol-by-symbol compression scheme, we can opt to only shorten the codes for symbols that occur with relatively high frequencies, and that alone, on average, would already result in a shorter code length for a long sequence. Or, in a whole-sequence compression scheme, as discussed in a couple of classes back, we could opt to encode only the typical sequences with bits, and leave the atypical sequences unencoded.

The **source coding theorem** states that for *n* **i.i.d.** random variables, each with entropy * *^{1}, there is a limit as to how much the data can be compressed; moreover, this limit happens to be * ** *. Conversely, when a scheme attempts to compress data beyond this limit, there is always a certain non-zero probability of error, i.e., there are occasions when the original data cannot be recovered (or be decoded).

In the following, we will first prove the converse of the theorem by establishing a lower bound for the probability of error for schemes that disregard this limit, * *. After that, we will return to show that the limit can indeed be achieved, using the method of random coding.

**2. Converse for source coding theorem**

Our goal here is to prove that when we compress the data associated with *n* i.i.d. random variables to a sequence of * ** * bits, then for , ; i.e., if , there would be a finite probability of error.

**Proof**

( 's are i.i.d.)

( )

( data processing inequality)

( the maximum of is achieved

when every 's are independent)

( the maximum value of is 1 since

is binary)

( Fano's inequality)

Thus,

as .

Obviously, therefore, when , .

**3. Achieveability for source coding theorem (using the method of random coding)**

To prove the **achieveability** for source coding theorem, we will consider a scenario in which every possible source sequence is randomly assigned a code . Here, the rate .

To decode, i.e., to retrieve the original sequence given a particular string , the decoder would check if there is a unique, typical that is mapped to this . If so, the decoder would output ; if not, the decoder would declare a decoding error. Below, we will prove that the probability of error, i.e., , is asymptotically negligible.

**Proof**

First, recall that the codes for a particular sequence is assigned entirely randomly, regardless of what codes have already been used. Thus, the set of possible resulting code sequences can be approximated by the set of typical sequences.

Therefore, given and a particular source sequence , we have ,

.

Moreover, since the sequences are chosen ENTIRELY randomly, . Hence,

.

Consequently,

As , the probability of error is negligible.

**Intuition for channel coding theorem**

In channel coding, we are concerned about transmitting a message, , over a noisy channel. The message, after transmission, will be converted to a certain . To motivate the proof of the **channel coding theorem** (to be discussed in the next class), we will first consider a special case of the **DMC (discrete memoryless channel)**, in which the channel is characterized by the per-symbol transition probabilities .

The first question one might like to ask is: how many number of distinct messages can such a channel "support", i.e., to transmit reliably with negligible errors? Assuming that only the typical sequences can be observed on the receiving end of the channel** **, then, there are approximately spaces available^{2}. However, because of the noise in the channel, a typical can be mapped to approximately elements, which is the size of of .

To ensure that the messages can be decoded accurately, the encoding scheme must be such that given *any* two messages and , the two sets and must be as non-overlapping as possible. To use an analogy, it is like we are trying to fit balls of size into a larger ball of size , and there are only just so much room for balls.

Moreover, since depends on , also depends on ; therefore, the number of distinct messages that can be sent per block of *n* bits varies according to . As a result, it is useful to define a quantity, called **capacity **(denoted by ), which is the maximum rate achieveable for a given channel (defined via its transition probabilities ). It is a rate in a sense that it specifies the number of distinct messages that can be sent per block of *n* bits, through this expression .

The channel coding theorem is a rigorous statement of the above reasonings. It states that it is impossible to transmit messages at any rate beyond ; in other words, any rates beyond are not achievable.

Note:

**1**. In the following, we use and interchangeably. Here, of course, represents the probability mass functions (p.m.f.s) of the random variable .

**2.** In principle, there can be as many as distinct codes for a code sequence of length n. The number of allowed distinct messages, however, is smaller than , because the set of observable 's are constrained by both and . Thus, such rate as is not achieveable, as required by the channel coding theorem.

## Comments (6)

## sidjaggi said

at 3:00 am on Oct 11, 2011

Thanks for your (partial) notes thus far, Manson. We'll see some more material in today's class that'll hopefully help address your last concern!

## Jiang Yunxiang said

at 11:35 am on Oct 11, 2011

In a DMC channel, the probability of y could easily be gained by p(x) and p(y/x). If we also consider typical sequence of y, its number is obviously the previous one (2^nH(y)), not |y|^n.

Of course, we could generate the distinct codes based on |y|^n. But |y|^n - 2^nH(y) ones are atypical which could be considered as the error correcting fields.

If the actual sequence falls into typical sequence, we can say this code is correct. But if the actual sequence falls into atypical sequence without any error correction, we should consider it as atypical sequence. Then considering some error correction method, we should try our best to map the atypical sequence into a typical sequence.

I think this may be the key of channel coding theorem in theoretical analysis. That's why in the first design of channel coding, we need map 2^nH(x) into 2^nH(y), not |y|^n. Mapping is a method that does not mean actual codes could not fall into other balls.

Because of just comments, I could not type some formula in this reply and just give some simple analysis about this problem. Actually, it could be solved in mathematical way. This is just my point of view, welcome to give some comments.

## Jiang Yunxiang said

at 11:44 am on Oct 11, 2011

Moreover, increasing coding space of y could not increase the mutual information I(x,y). Using 2^nH(y) space is enough to achieve this capacity. Another |y|^n - 2^nH(y) can be considered as redundant space.

## sidjaggi said

at 3:01 am on Oct 11, 2011

I'll give more comments once your notes are complete (this Thursday, yes?) And the other scribe should also be working on his/her notes ;)

## sidjaggi said

at 11:56 am on Oct 11, 2011

Nice intuition, yunxiang.

however, remember a y^n that's atypical with respect to a particular x^n may be typical with respect to another x'^n.

the real reason is a little bit more subtle, and we'll discuss this in class today (remind me if i forget :)

## Manson Fong said

at 1:01 pm on Oct 11, 2011

Thanks Prof. Jaggi & Yunxiang, for the responses. Looking forward to the discussion today!

I will try to complete the notes by 11:59 pm this Thursday :)

You don't have permission to comment on this page.