Scribe Notes 7-3


First Draft

 

In this lecture, we finished up Problem Set 7. Topics include:

 

  • Rate distortion theorem: achievability and converse
  • Calculation of R(D): method of test channel and idea of reverse water-filling
  • Iterative algorithm for computing rate distortion and channel capacity 

 

Background and Definitions

 

First let us recall our framework:

Figure 1

Here, $\hat{X}^n$ can be different from $X^n$ because we allow distortion. $R$ is the rate, which we want to minimize for a fixed allowed distortion.

 

In previous lectures, we talked about distortion for a symbol. We now define distortion for vectors as the average per symbol:

$d(x^n, \hat{x}^n) = \frac{1}{n}\sum_{i=1}^{n} d(x_i, \hat{x}_i)$
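As a tiny illustration (the function names here are my own, not from the lecture), the per-symbol distortion of two binary sequences under Hamming distortion can be computed as follows:

```python
# Average per-symbol distortion d(x^n, xhat^n) = (1/n) * sum_i d(x_i, xhat_i),
# shown here for the Hamming distortion d(x, xhat) = 1 if x != xhat else 0.

def hamming(x, xhat):
    return 0 if x == xhat else 1

def avg_distortion(xs, xhats, d=hamming):
    assert len(xs) == len(xhats)
    return sum(d(a, b) for a, b in zip(xs, xhats)) / len(xs)

print(avg_distortion([0, 1, 1, 0], [0, 1, 0, 0]))  # -> 0.25
```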

 

We defined the rate distortion function as:

$R(D) = \min_{p(\hat{x} \mid x):\ \mathbb{E}\,d(X,\hat{X}) \le D} I(X; \hat{X})$

The term $\mathbb{E}\,d(X,\hat{X})$ in the inequality constraint is the expected distortion. The feasible set $\{p(\hat{x} \mid x) : \mathbb{E}\,d(X,\hat{X}) \le D\}$ grows as $D$ gets larger, hence by definition $R(D)$ is non-increasing.

 

We shall show that this definition of R(D) indeed gives the smallest rate for fixed distortion D.

 

Rate Distortion Theorem

 

This is Thm 10.2.1 on p.307 of the textbook:

  • R(D) is the minimum achievable rate at distortion D for an i.i.d. source X with distribution p(x) and bounded distortion function $d(x, \hat{x})$.

 

Proof: (Achievability) For $R > R(D)$, we would like to find encoding and decoding functions $(f_n, g_n)$ (see Figure 1) such that the limit of the expected distortion is no more than $D$. The textbook gives two proofs. In class we only looked at the one in Section 10.6. We make use of random coding, as in the proof of achievability of the channel coding theorem.

  • To generate our codebook (for rate $R$), we just randomly pick $2^{nR}$ sequences $\hat{X}^n(w)$, $w = 1, \ldots, 2^{nR}$ (each drawn i.i.d. according to $p(\hat{x})$; for fixed $p(x)$, $p(\hat{x})$ is computed by $p(\hat{x}) = \sum_x p(x)\,p(\hat{x} \mid x)$, where $p(\hat{x} \mid x)$ achieves the minimum in the definition of $R(D)$).
  • For encoding, given $x^n$, we try to find a sequence $\hat{x}^n(w)$ in the codebook such that $x^n$ and $\hat{x}^n(w)$ are jointly typical. If there is only one such $\hat{x}^n(w)$, we make $W$ the index of that $\hat{x}^n(w)$. If there are many, we can pick any of them (the book suggests taking the first one in lexicographical order). If there is no such $\hat{x}^n(w)$, just give any index (the book suggests letting $W = 1$).
  • The decoding side, on receiving an index $w$, just decodes it as $\hat{X}^n(w)$.

We then need to compute the expected distortion and probability of error. This part was not done in class, and was left as an exercise. In short, we should get that the average distortion is close to D, and the probability of error tends to zero.
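To make the random-coding construction a bit more concrete, here is a toy simulation sketch for a Bernoulli(p) source with Hamming distortion. It replaces the joint-typicality encoder of the proof with the simpler minimum-distortion encoder (pick the codeword closest to the source sequence), so it is only an illustration of the qualitative behaviour: with roughly $2^{nR}$ codewords drawn from the optimal $p(\hat{x})$ and $R$ slightly above $R(D)$, the average distortion comes out close to $D$ (for small $n$ it sits somewhat above $D$). All names and parameter choices are mine.

```python
import random
from math import log2

def h2(q):                                       # binary entropy in bits
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def simulate(p=0.5, D=0.2, n=12, trials=200, seed=0):
    rng = random.Random(seed)
    R = h2(p) - h2(D) + 0.1                      # rate slightly above R(D) = H(p) - H(D)
    a = (p - D) / (1 - 2 * D)                    # optimal Pr(xhat = 1) from the test channel
    M = max(1, round(2 ** (n * R)))              # codebook size ~ 2^{nR}
    total = 0.0
    for _ in range(trials):
        # random codebook: M sequences drawn i.i.d. from p(xhat)
        book = [[1 if rng.random() < a else 0 for _ in range(n)] for _ in range(M)]
        x = [1 if rng.random() < p else 0 for _ in range(n)]
        # minimum-distortion encoding (a simplification of joint-typicality encoding)
        best = min(sum(xi != ci for xi, ci in zip(x, c)) for c in book)
        total += best / n
    return total / trials

print(simulate())   # average distortion; roughly approaches D = 0.2 as n grows
```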

 

Proof: (Converse) For any $(2^{nR}, n)$ code with expected distortion $\mathbb{E}\,d(X^n, \hat{X}^n) \le D$, we want to show that $R \ge R(D)$. The inequalities involved are given on p.317 of the textbook. Instead of copying all the steps here, I just mention some points of interest:

  • The book says that $I(X_i; \hat{X}_i) \ge R(\mathbb{E}\,d(X_i, \hat{X}_i))$ (in Step g) follows from the definition of R(D). Indeed, by putting $D = \mathbb{E}\,d(X_i, \hat{X}_i)$ in the definition, we see that the inequality constraint is always satisfied (expected distortion $\le \mathbb{E}\,d(X_i, \hat{X}_i)$), hence we get the result we need in Step g.
  • Step h makes use of Jensen's inequality. For this to hold, we need R(D) to be a convex function. The proof of this fact can be found on p.316 of the textbook. Actually, even without the definition using mutual information, convexity of R(D) is still expected.

    Let us consider a random black-and-white image. It takes 0 kB for expected distortion of 50% (no need to store anything, just reproducing an all-white image will do). Suppose we also have an optimal way that compresses the image to 100 kB for expected distortion of 10%. Then, one obvious way to get expected distortion of 30% $\left(= \frac{50\% + 10\%}{2}\right)$ is dividing the image into two halves, then using the first scheme in one half and the second scheme in the other half. This combined scheme takes 50 $\left(= \frac{0 + 100}{2}\right)$ kB. This technique is referred to as time sharing. Hence, the smallest size for 30% distortion must not exceed 50 kB. The numbers above are arbitrarily chosen, so this argument actually shows that R(D) is expected to be convex for any D.

  • For Step j, we believe that we should have "$\ge$" rather than "$=$":

    $\mathbb{E}\,d(X^n, \hat{X}^n) \le D$ (by assumption)

    $\Rightarrow R(\mathbb{E}\,d(X^n, \hat{X}^n)) \ge R(D)$ (since $R(D)$ is non-increasing)

    $\Rightarrow nR \ge n\,R(\mathbb{E}\,d(X^n, \hat{X}^n)) \ge n\,R(D)$.

    Note that this change does not affect the validity of the proof.

 

Calculation of R(D)

 

Method of Test Channel

 

With the definition using mutual information, we can calculate R(D) for a Bernoulli(p) source, with the distortion measure being the Hamming distance.

 

Now, $I(X; \hat{X}) = H(X) - H(X \mid \hat{X})$

$= H(p) - H(X \oplus \hat{X} \mid \hat{X})$

$\ge H(p) - H(X \oplus \hat{X})$ (conditioning reduces entropy)

$\ge H(p) - H(D)$

Here, $X \oplus \hat{X} = 0$ if and only if $X = \hat{X}$, i.e. no distortion. If $\Pr(X \ne \hat{X}) = \mathbb{E}\,d(X, \hat{X}) \le D \le \tfrac{1}{2}$, then $H(X \oplus \hat{X}) \le H(D)$.

Since the above holds for all $p(\hat{x} \mid x)$ satisfying the distortion constraint, we see that $R(D) \ge H(p) - H(D)$.

 

Now, we would like to show that for D small ($D \le \min\{p, 1-p\}$), $R(D) = H(p) - H(D)$. (If $D > \min\{p, 1-p\}$, $R(D) = 0$.) We need to find a joint distribution which satisfies the expected distortion constraint and achieves $I(X; \hat{X}) = H(p) - H(D)$. We do so by considering the following test channel, with p and D fixed, and a to be chosen:

Figure 2

We choose a such that $a(1-D) + (1-a)D = p$, i.e. $a = \frac{p - D}{1 - 2D}$. Then, we have expected distortion $= \Pr(X \ne \hat{X}) = D$, and $H(X \mid \hat{X}) = H(D)$, making $I(X; \hat{X}) = H(p) - H(D)$.

 

So finally, we have proved that for a Bernoulli(p) source with Hamming distortion, $R(D) = H(p) - H(D)$ for D small ($D \le \min\{p, 1-p\}$).
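As a quick numeric sanity check of the test-channel construction (a sketch; the variable names are mine): with $a = \frac{p-D}{1-2D}$ we indeed get $\Pr(X = 1) = p$, expected distortion exactly $D$, and $I(X; \hat{X}) = H(p) - H(D)$.

```python
from math import log2

def h2(q):
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

p, D = 0.3, 0.1
a = (p - D) / (1 - 2 * D)            # Pr(xhat = 1), chosen so that Pr(x = 1) = p

# Joint distribution of (xhat, x): xhat ~ Bernoulli(a), then x = xhat flipped w.p. D.
joint = {(1, 1): a * (1 - D), (1, 0): a * D,
         (0, 1): (1 - a) * D, (0, 0): (1 - a) * (1 - D)}

pxh = {v: sum(pr for (xh, x), pr in joint.items() if xh == v) for v in (0, 1)}
px  = {v: sum(pr for (xh, x), pr in joint.items() if x == v) for v in (0, 1)}

dist = sum(pr for (xh, x), pr in joint.items() if x != xh)          # E d(X, Xhat)
mi   = sum(pr * log2(pr / (pxh[xh] * px[x]))                        # I(X; Xhat)
           for (xh, x), pr in joint.items() if pr > 0)

print(px[1], dist, mi, h2(p) - h2(D))   # -> 0.3, 0.1, and mi equals H(p) - H(D)
```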

 

We can also use the same idea to calculate R(D) for a Gaussian source with variance $\sigma^2$ (with squared-error distortion) and get (note that we first need to extend the rate distortion theorem to continuous alphabets): $R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}$ for $0 \le D \le \sigma^2$, and $R(D) = 0$ for $D > \sigma^2$. More details can be found in Section 10.3.2 of the textbook.
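For instance, with $\sigma^2 = 4$ and $D = 1$ this gives $R(D) = \frac{1}{2}\log_2 4 = 1$ bit per symbol, and halving the distortion to $D = 0.5$ costs an extra half bit ($\frac{1}{2}\log_2 8 = 1.5$ bits).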

 

Reverse water-filling

 

We now consider the case of k independent Gaussian sources $X_1, \ldots, X_k$ with variances $\sigma_1^2, \ldots, \sigma_k^2$ respectively, and our distortion function being $d(x^k, \hat{x}^k) = \sum_{i=1}^{k} (x_i - \hat{x}_i)^2$.

 

We recall the method for computing channel capacity of parallel Gaussian channels in Scribe Notes 6-5. There we first solve some Kuhn-Tucker equations, then show that the solution can be found by the method of water-filling. Similar steps can be applied here to come up with a reverse water-filling method, where we get a constant $\lambda$, and only the variables with $\sigma_i^2 > \lambda$ are to be described. For the examples in Figure 3, we only need to describe the blue parts. For variables with $\sigma_i^2 \le \lambda$, no bits are used.

 

Figure 3

 

More formally, we have Thm 10.3.3 on p.314:

  • $R(D) = \sum_{i=1}^{k} \frac{1}{2}\log\frac{\sigma_i^2}{D_i}$, with $D_i = \min\{\lambda, \sigma_i^2\}$ and $\lambda$ chosen so that $\sum_{i=1}^{k} D_i = D$.

 

Readers are recommended to read Section 10.3.3 of the textbook for the actual derivation of the Kuhn-Tucker equations and other details.
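As a numerical illustration of the theorem (a sketch under my own naming: bisect on the water level $\lambda$ until $\sum_i D_i = D$, then sum the per-component rates):

```python
from math import log2

def reverse_waterfill(sigma2, D, iters=100):
    """Reverse water-filling: find lambda with sum_i min(lambda, sigma_i^2) = D,
    then R(D) = sum_i (1/2) log2(sigma_i^2 / D_i) over components with D_i < sigma_i^2."""
    assert 0 < D <= sum(sigma2)
    lo, hi = 0.0, max(sigma2)
    for _ in range(iters):                         # bisection on the water level
        lam = (lo + hi) / 2
        if sum(min(lam, s) for s in sigma2) < D:
            lo = lam
        else:
            hi = lam
    Di = [min(lam, s) for s in sigma2]
    R = sum(0.5 * log2(s / d) for s, d in zip(sigma2, Di) if d < s)
    return lam, Di, R

# Three independent Gaussian sources; components with variance below lambda get zero bits.
print(reverse_waterfill([1.0, 4.0, 0.25], D=1.5))  # lambda = 0.625, R ~ 1.68 bits
```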

 

Blahut-Arimoto Algorithm

 

Here we give an iterative algorithm that computes the rate distortion function.

 

First, let us consider a simpler problem: given two convex sets, find the minimum distance between them (see Figure 4):

Figure 4

 

One possible method is as follows:

  • Pick any two initial points $a_0 \in A$ and $b_0 \in B$.
  • Find in set A the closest point $a_1$ to $b_0$.
  • Now find in set B the closest point $b_1$ to $a_1$.
  • Carry on similarly until we see convergence.

 

It can be shown that this algorithm will indeed converge to the minimum distance if the "distance function" satisfies certain conditions (sorry I couldn't find the original paper by Csiszár and Tusnády, so I cannot be sure what the exact conditions are. I guess we need the "distance function" to be non-negative, convex and bounded for the initial points).
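Here is a small sketch of this alternating procedure in a case where the closest-point steps are trivial: two disks in the plane with Euclidean distance. (The setup and numbers are mine, purely for illustration.)

```python
from math import hypot

def closest_in_disk(pt, disk):
    """Closest point of the disk (cx, cy, r) to pt."""
    cx, cy, r = disk
    dx, dy = pt[0] - cx, pt[1] - cy
    d = hypot(dx, dy)
    if d <= r:                       # pt already inside the disk
        return pt
    return (cx + dx * r / d, cy + dy * r / d)

A = (0.0, 0.0, 1.0)                  # disk A: center (0, 0), radius 1
B = (4.0, 3.0, 1.0)                  # disk B: center (4, 3), radius 1
a, b = (0.0, 1.0), (4.0, 4.0)        # arbitrary initial points in A and B

for _ in range(50):                  # alternate: closest point in A to b, then in B to a
    a = closest_in_disk(b, A)
    b = closest_in_disk(a, B)

print(a, b, hypot(a[0] - b[0], a[1] - b[1]))   # distance converges to 5 - 1 - 1 = 3
```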

 

[Another method is gradient descent:

  • Pick any two initial points $a_0 \in A$ and $b_0 \in B$.
  • Move $a_0$ to $a_1$ along the negative gradient such that it is closer to $b_0$.
  • Now move $b_0$ to $b_1$ along the negative gradient such that it is closer to $a_1$.
  • Carry on similarly until we see convergence.

This method is not mentioned in the textbook. It is included here just for reference.]

 

It is interesting to note the similarities between this problem and the computation of R(D). We can rewrite the definition of R(D) as (see Q6(d), PS7):

$R(D) = \min_{q(\hat{x})}\ \min_{p(\hat{x} \mid x):\ \mathbb{E}\,d(X,\hat{X}) \le D}\ \sum_{x, \hat{x}} p(x)\,p(\hat{x} \mid x) \log\frac{p(\hat{x} \mid x)}{q(\hat{x})}$

We can further write $R(D) = \min_{p \in A}\ \min_{q \in B}\ D(p \,\|\, q)$, where A is the set of joint distributions $p(x, \hat{x})$ with marginal $p(x)$ satisfying the expected distortion constraint, and B is the set of product distributions $p(x)\,q(\hat{x})$. Note that sets of probability distributions are convex, and the distortion constraint is linear, hence the sets A and B are convex. Also, we know that Kullback-Leibler divergence has the desired properties we need (convex, non-negative, etc.; see Chapter 2 of the textbook). Hence, the algorithm we presented above can be applied here for computing R(D). Readers are suggested to refer to Section 10.8 for more details.
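Below is a minimal sketch of this alternating minimization for R(D) (the Blahut-Arimoto iteration; Section 10.8 gives the actual derivation). It handles the distortion constraint through a slope parameter $s < 0$ in the exponential update $p(\hat{x} \mid x) \propto q(\hat{x})\,e^{s\,d(x,\hat{x})}$, so each value of $s$ yields one point $(D, R(D))$ on the curve; the function names, the fixed iteration count, and the example values are my own choices.

```python
from math import exp, log2

def blahut_arimoto_rd(px, d, s, iters=500):
    """Alternating minimization for the rate-distortion function.

    px : source distribution p(x) as a list
    d  : distortion matrix, d[x][xh]
    s  : slope parameter (s < 0); each s gives one point (D, R(D)) on the curve
    """
    nx, nxh = len(px), len(d[0])
    q = [1.0 / nxh] * nxh                          # initial output distribution q(xhat)
    for _ in range(iters):
        # Step 1: optimal test channel for the current q
        w = [[q[xh] * exp(s * d[x][xh]) for xh in range(nxh)] for x in range(nx)]
        pxh_x = [[w[x][xh] / sum(w[x]) for xh in range(nxh)] for x in range(nx)]
        # Step 2: output marginal induced by that channel
        q = [sum(px[x] * pxh_x[x][xh] for x in range(nx)) for xh in range(nxh)]
    D = sum(px[x] * pxh_x[x][xh] * d[x][xh] for x in range(nx) for xh in range(nxh))
    R = sum(px[x] * pxh_x[x][xh] * log2(pxh_x[x][xh] / q[xh])
            for x in range(nx) for xh in range(nxh) if pxh_x[x][xh] > 0)
    return D, R

# Bernoulli(0.3) source, Hamming distortion; the resulting (D, R) pair should
# lie on the curve R(D) = H(p) - H(D) computed earlier.
print(blahut_arimoto_rd([0.7, 0.3], [[0, 1], [1, 0]], s=-3.0))
```

Note that the distortion constraint is not imposed directly here; to hit a prescribed D one would sweep the slope parameter s.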

 

We note also that a similar method can be applied to compute channel capacity. The main difference is that instead of alternating minimization as above, we need to do alternating maximization.
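For comparison, a sketch of the alternating-maximization counterpart for channel capacity (again only an illustration; the names and the fixed iteration count are mine):

```python
from math import exp, log, log2

def blahut_arimoto_capacity(pyx, iters=500):
    """Alternating maximization for channel capacity.

    pyx : channel transition matrix, pyx[x][y] = p(y | x)
    Returns (capacity in bits, optimizing input distribution r).
    """
    nx, ny = len(pyx), len(pyx[0])
    r = [1.0 / nx] * nx                            # input distribution, start uniform
    for _ in range(iters):
        py = [sum(r[x] * pyx[x][y] for x in range(nx)) for y in range(ny)]
        # c[x] = exp( D( p(y|x) || p(y) ) ), computed in nats
        c = [exp(sum(pyx[x][y] * log(pyx[x][y] / py[y])
                     for y in range(ny) if pyx[x][y] > 0)) for x in range(nx)]
        z = sum(r[x] * c[x] for x in range(nx))
        r = [r[x] * c[x] / z for x in range(nx)]   # scale up inputs with large divergence
    return log2(z), r

# Binary symmetric channel with crossover 0.1: capacity is 1 - H(0.1) ~ 0.531 bits.
print(blahut_arimoto_capacity([[0.9, 0.1], [0.1, 0.9]]))
```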

 
