Generally source coding reduce the redundancy, channel coding add the redundancy.

Description of an arbitrary real number requires an infinite number of bits. Finite representation of a continuous random variable can never be perfect. So it's necessary to define the "goodness" of a representation of a source. The problem is: Given a source distribution and a distortion measure, what is the minimum expected distortion achievable at a particular rate? or What's the minimum rate required to get a proper distortion?

The representation of X is denoted as .

1.1 1bit random variable Quantization Example

Let , assume the distortion measure is . We have only one bit to represent X, so it's clear that the bit will show whether X>0 or not. To minimize squared error, each reproduced symbol should be the conditional mean of its region, the equation and figure is showed as below:

The reconstruction points are at -0.79 and 0.79.

1.2 2bit Quantization Example

Like PS7 Q1, if we choose one bit to express X which is from {0,1,2,3}, we will choose 0 to express X=0 or 1, choose 1 to express X=2 or 3. When we're decoding, we decode 0 to X=0, decode 1 to X=1.

So the distortion's expectation E(distortion) = 0.25*1+0.25*1 = 0.5.f

If X is from {0,1,2,3}^2, we use 2 bits to express X, it's just like to select 4 positions in the graph:

The left side is the same choice as 1bit quantization, but for some positions(?), the distortion > 1. If we choose picture of right side, the distortion will keep 1.

Given X'(X), to choose X which is most closest to X' as the representation is a good choice. It only reach local minimum, not global minimum. This partition over space is called Voronoi partition. It's used in K-means clustering. First the algorithm randomly selects K centers, then divide the space according to distances between all points and these centers. Then it adjust the centers according to the new partition over the space. After some iterations, centers never change. The algorithm reach the local optimum.

2.1 Rate distortion encoder and decoder

2.2 distortion measure

Distortion measure is a mapping: . It's said to be bounded if

Distortion between sequences and is

2.3 rate distortion function

A rate distortion pair (R,D) is said to be achievable if there exists a sequence of -rate distortion codes with . R(D) is the infimum of rates R such that (R,D) is in the rate distortion region of the source for a given distortion D. D(R) is the infimum of all distortions D such that (R,D) is in the rate distortion region of the source for a given rate R.

The *information rate distortion function* :

Why is this? Look at the figure below:

The size of typical set of Xn is . Look reversely of the test channel, the ball size is . So the number of ball is : .

So the rate is .

2.3.1 rate distortion function of binary source

The rate distortion rate R(D) for a Bernoulli(p) source with Hamming distortion is given by:

Proof: Let denote modulo 2 addition. means .

The lower bound is found by a joint distribution that meets the distortion constraint, just like:

. If D >= p, just set . The function is illustrated in:

2.3.2 rate distortion function of Gaussian Source

The rate distortion rate R(D) for a source with squared-error distortion is given by:

Proof:

The joint distribution that achieves this lower bound is shown as:

if , we choose . If , .

The function is illustrated in:

2.3.2 rate distortion function of Independent normal random sources

The information rate distortion function of vector is :

, where

So, . To solve this optimization problem, we use Lagrange multipliers. The solution is:

Like a kind of reverse water-filling, the result is shown in:

No bits are used to describe random variables with variances less than the given constant , which is chosen so that .

2.4 converse to the rate distortion theorem

To prove: we cannot achieve a distortion of less than D if we describe X at a rate less than R(D). R(D) is given as above.

Proof: we prove R >= R(D) for randomized mappings fn and gn, as long as fn takes on at most 2^nR values.

(a): the range of fn is at most 2^nR.

(b): entropy >= 0

(c): data-processing inequality

(d): Xi is independent

(e): chain rule for entropy

(f): the conditioning will reduce the entropy

(g): definition of the rate distortion function

(h): convexity of the rate distortion function, Jensen's inequality

(i): definition of distortion for blocks of length n

For a fixed codebook C, There exist 3 categories of sequences :

1. Nontypical sequences: we can choose n large enough so that . So the nontypical sequences can contribute at most to the expected distortion.

2. Typical sequences such that there exists a codeword that is jointly typical with . Since the source sequence and the codeword are strongly jointly typical, the continuity of the distortion as a function of the joint distribution ensures that they are also distortion typical. So the distortion between these and their codewords is bounded by . The contribution to expected distortion is 1* = .

3. Typical sequences such that there doesn't exist a codeword that is jointly typical with . Let Pe be the total probability of such sequences. The contribution to expected distortion is at most .

The first and last class are not well represented by the distortion code. But Pr(X in first class) is very small. For Pe:

and

so , also is very small if and .

So the bad representation of x^n is very small. So the average distortion is close to D and the probability of second class is close to 1.

The problem is to minimize R(D) over a convex set of all and for all x and satisfying the distortion limit.

Let , and differentiate with respect to , we have

, which means .

Given , we get: .

Multiplying this by p(x) and summing over all x, we get :

If , according to KKT conditions, . So we get:

, if . , if .

If The distance measure is the relative entropy, we have lemma:

We use this lemma 10.131 to rewrite the R(D):

If A is the set of all joint distributions with marginal p(x) that satisfy the distortion constraints and if B is the set of product distributions p(x)r(x') with arbitrary r(x'), we can get: .

With Lagrange multipliers' use, we can obtain:

For this conditional distribution , we calculate the output distribution that minimizes the mutual information(10.132,133), which is:

. So we can repeat the process to minimize objective function over q(*|*) and over r(*), finally get the optimum.