A simple code scheme (Scribe: Siyao)
1. Overview
In this lecture, we are interested in designing a code (a pair of encoder and decoder) for a particular source. (Let's ignore noise in the following discussion, since this is purely a compression problem.)
(A few words about the code model: the encoder encodes a message m from the source into a codeword c and sends the codeword to the decoder; the decoder then decodes the codeword into m′. We care about efficiency (the output length of the encoder) and error, so we want both the rate R and the error ϵ to be as small as possible. It is easy to see that there is a rate-error tradeoff. If we want no error, we can just let c = m, so that R = 1 and there is no saving in efficiency. If we want high efficiency, in the extreme case R = 0, no information is left after encoding, so the error ϵ must be very large.)
We first analyze the source and show that it has a very good concentration property. Specifically, the source has almost all of its probability on a "small" set S of outcomes. Taking advantage of this property, we will not lose much by considering only outcomes in the set S and ignoring the rest. Note that if we just need to distinguish elements of S, log |S| bits per element are enough. So we can design a simple code in the following way.
The encoder maps S one-to-one into the set {0,1}^{⌈log |S|⌉} and the decoder is the inverse of this map (the encoder ignores any outcome not in S, and hence those correspond to error events). The output length of the encoder is about log |S|, which is smaller than the message length, and the error is very small by the definition of S.
It's already a very good code. Actually, we can improve it by changing a little bit to get a variable-length code. For a variable-length code, we consider the expected output length of the encoder. The idea is that instead of ignoring outcomes not in S, we encode them as well. Since the probability of outcomes not in S is very small, letting the encoder's output for them be as long as their original length won't increase the expected output length by very much. At the same time, the error ϵ is reduced to 0 when we encode all outcomes instead of only S. So a new code comes up as follows:
For outcomes in S, the encoder maps S one-to-one into {0,1}^{⌈log |S|⌉} and adds an extra 0 to the beginning of the encoding. For outcomes not in S, the encoding is the input itself, with an extra 1 added to the beginning. The extra bit distinguishes outcomes in S from outcomes not in S. It's simple to give the decoder, so I skip the description.
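As a sketch of this two-part scheme (Python used for illustration; the set S and its ranking are placeholders here, to be supplied by the construction in Section 4):

```python
import math

def encode_simple(x, rank):
    """Two-part code: '0' + fixed-width index for strings in the typical set S,
    '1' + the raw string otherwise. `rank` maps each string in S to its index."""
    if x in rank:
        width = max(1, math.ceil(math.log2(len(rank))))
        return "0" + format(rank[x], "0%db" % width)
    return "1" + x  # atypical input: pay full length plus one flag bit

def decode_simple(code, strings):
    """Inverse of encode_simple; `strings` lists S in rank order."""
    if code[0] == "1":
        return code[1:]
    return strings[int(code[1:], 2)]
```

With |S| ≪ 2^n, typical inputs cost only ⌈log |S|⌉ + 1 bits instead of n, at the price of one extra bit on atypical inputs.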
We've almost finished the design of the code. For completeness, we should also care about the computational complexity of the encoder/decoder, i.e., whether the encoder/decoder can work with limited time and limited space. More specifically, can we give efficient encoding/decoding algorithms? Fortunately, we can.
Finally, it's time to come up with the last question. Is it possible for any other scheme to "substantially" improve on our scheme?
2. Preliminaries
In the following discussion, log is base 2.
Definition 1. A p-biased distribution X over {0,1}^n, denoted Bern(p)^n, is a distribution such that for every x ∈ {0,1}^n, Pr[X = x] = p^{|x|} (1 − p)^{n − |x|}, where |x| is the number of 1s in x. In other words, every bit is independently 1 with probability p, and 0 with probability 1 − p.
The source distributions we consider are over {0,1}^n, with X ∼ Bern(p)^n where p < 1/2. It's straightforward to show that the expectation of |x| is np and that the variance of |x| is np(1 − p).
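As a quick numeric sanity check of these two formulas (a small sketch, not part of the notes), the mean and variance can be computed exactly from the binomial law of |x|:

```python
from math import comb

def weight_moments(n, p):
    """Exact E[|x|] and Var[|x|] for x ~ Bern(p)^n, from the binomial pmf
    Pr[|x| = k] = C(n, k) p^k (1 - p)^(n - k)."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, var
```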
Definition 2. T(k, n) is the type class of all sequences x ∈ {0,1}^n with |x| = k (that is, k ones and n − k zeroes).
The size of the type class T(k, n) is |T(k, n)| = C(n, k) = n!/(k!(n − k)!).
Definition 3. The entropy of a distribution X over a finite set Ω is defined as follows: H(X) = − Σ_{x ∈ Ω} Pr[X = x] log Pr[X = x]. For p ∈ [0, 1], we write H(p) = −p log p − (1 − p) log(1 − p) for the binary entropy function.
Entropy is a measure of randomness: the larger the entropy, the more random the distribution. By the definition of entropy, the uniform distribution has maximal entropy. In the binary case, a bit that is 1 with probability 1/2 and 0 with probability 1/2 has maximal entropy. Let's also see the Taylor expansion of H(p + x) to verify that a uniformly random bit has maximal entropy:

H(p + x) = H(p) + H′(p) x + (1/2) H″(ξ) x²   for some ξ between p and p + x,

where

H′(p) = log((1 − p)/p)

and

H″(p) = −1/(p(1 − p) ln 2) < 0.

When we consider p = 1/2, we have

H(1/2) = 1 and H′(1/2) = log 1 = 0.

Hence, we can know

H(1/2 + x) = 1 − x²/(2 ξ(1 − ξ) ln 2) ≤ 1,

that's H(1/2 + x) ≤ H(1/2) = 1 for every x, with equality only at x = 0.
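The quadratic behavior around p = 1/2 is easy to verify numerically (a small check based on the expansion above):

```python
from math import log, log2

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def quad_approx(x):
    """H(1/2 + x) ~ 1 - 2x^2/ln 2, from the Taylor expansion: the linear term
    vanishes since H'(1/2) = 0, and H''(1/2) = -4/ln 2."""
    return 1 - 2 * x * x / log(2)
```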
Lemma 4. C(n, k) ≐ 2^{nH(k/n)}, where A_n ≐ B_n means lim_{n→∞} (1/n) log(A_n/B_n) = 0.

Proof. By Stirling's approximation n! = √(2πn) (n/e)^n (1 + o(1)), we have

C(n, k) = n!/(k!(n − k)!) = 2^{nH(k/n) + O(log n)},

and the O(log n) term vanishes after dividing by n. □
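Lemma 4 can be sanity-checked numerically: the gap (1/n) log C(n, k) − H(k/n) is of order (log n)/n and shrinks as n grows (a small sketch; Python's exact big-integer `math.comb` makes this easy):

```python
from math import comb, log2

def H(p):
    """Binary entropy in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def gap(n, k):
    """(1/n) log2 C(n, k) - H(k/n); Lemma 4 says this tends to 0."""
    return log2(comb(n, k)) / n - H(k / n)
```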
Definition 5. For p, q ∈ (0, 1), let D(p‖q) be the binary relative entropy defined as D(p‖q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)), with the convention 0 log 0 = 0 (so the definition extends to p ∈ [0, 1]).
It's a measure of the (entropy) distance between two distributions. Considering the Taylor expansion of the function p ↦ D(p‖q) around p = q up to the second derivative, we can obtain D(p‖q) ≥ (2/ln 2)(p − q)² (see Pinsker's inequality).
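This lower bound can be verified exhaustively on a grid (a small check; the constant 2/ln 2 appears because we measure D in bits):

```python
from math import log, log2

def D(p, q):
    """Binary relative entropy D(p||q) in bits, with 0 log 0 = 0."""
    s = 0.0
    if p > 0:
        s += p * log2(p / q)
    if p < 1:
        s += (1 - p) * log2((1 - p) / (1 - q))
    return s

def pinsker_holds(steps=100):
    """Check D(p||q) >= (2/ln 2)(p - q)^2 on a grid of (p, q) pairs."""
    c = 2 / log(2)
    return all(D(i / steps, j / steps) >= c * (i / steps - j / steps) ** 2 - 1e-12
               for i in range(steps + 1) for j in range(1, steps))
```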
3. Properties of the source
Each particular sequence with k ones is observed with probability p^k (1 − p)^{n−k}, so the probability of observing the type class T(k, n) is

Pr[T(k, n)] = C(n, k) p^k (1 − p)^{n−k}.

By Lemma 4, we obtain

Pr[T(k, n)] ≐ 2^{nH(k/n)} p^k (1 − p)^{n−k} = 2^{−n D(k/n ‖ p)},

and in fact Pr[T(k, n)] ≤ 2^{−n D(k/n ‖ p)}. The last inequality is because C(n, k) ≤ 2^{nH(k/n)}. Intuitively, D(k/n ‖ p) tells us that if k/n is far from p, the probability of observing the type class T(k, n) will be very small. Concretely, let's try to bound the probability of observing an element from an ϵ-strongly atypical set, i.e., T(k, n) with |k/n − p| > ϵ, as follows:

Pr[T(k, n)] ≤ 2^{−n D(k/n ‖ p)} ≤ 2^{−(2/ln 2) n ϵ²},

using Pinsker's inequality from Section 2.
By the union bound, the probability of observing an element from any ϵ-strongly atypical set can be bounded by

Σ_{k : |k/n − p| > ϵ} Pr[T(k, n)] ≤ (n + 1) · 2^{−(2/ln 2) n ϵ²},

since there are at most n + 1 type classes in total.
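In fact the per-type-class bound Pr[T(k, n)] ≤ 2^{−nD(k/n‖p)} holds exactly for every k, because C(n, k) ≤ 2^{nH(k/n)}; a small exhaustive check:

```python
from math import comb, log2

def D(a, b):
    """Binary relative entropy in bits, with 0 log 0 = 0."""
    s = 0.0
    if a > 0:
        s += a * log2(a / b)
    if a < 1:
        s += (1 - a) * log2((1 - a) / (1 - b))
    return s

def type_bound_holds(n, p):
    """Check C(n,k) p^k (1-p)^(n-k) <= 2^(-n D(k/n || p)) for all k."""
    return all(
        comb(n, k) * p**k * (1 - p) ** (n - k) <= 2 ** (-n * D(k / n, p)) + 1e-12
        for k in range(n + 1)
    )
```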
In other words, the source has most of its probability concentrated in ϵ-strongly typical sets. Let S be the union of all ϵ-strongly typical sets. The size of S can be upper bounded by

|S| = Σ_{k : |k/n − p| ≤ ϵ} C(n, k) ≤ (n + 1) · 2^{nH(p+ϵ)} ≤ 2^{nH(p+ϵ) + c log n},

where c is a constant and we use p + ϵ ≤ 1/2, so that H(k/n) ≤ H(p + ϵ) for every k in the range. And |S| can be lower bounded by

|S| ≥ C(n, ⌊n(p + ϵ)⌋) ≥ 2^{nH(p+ϵ) − c log n}.
This tells us that the size of S is roughly 2^{nH(p+ϵ)}, and since H(p + ϵ) < 1 we have |S| ≪ 2^n. In a word, the source has most of its probability on a set S of size about 2^{nH(p+ϵ)}, specifically, the union of all ϵ-strongly typical type classes.
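The estimate |S| ≈ 2^{nH(p+ϵ)} can also be checked with exact binomial sums (a small numeric sketch):

```python
from math import ceil, comb, floor, log2

def H(p):
    """Binary entropy in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def typical_rate(n, p, eps):
    """(1/n) log2 |S|, where S is the union of eps-strongly typical classes."""
    lo, hi = ceil(n * (p - eps)), floor(n * (p + eps))
    return log2(sum(comb(n, k) for k in range(lo, hi + 1))) / n
```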
4. Efficient encoding/decoding algorithms
The encoding algorithm is as follows: On input x ∈ {0,1}^n,
1. If x ∉ S, output 1x.
2. Else, let k = |x|, a = Σ_{j=⌈n(p−ϵ)⌉}^{k−1} C(n, j) (the number of strings in S with fewer ones than x), and b = 0.
From i = 1 to n, we do the following for each i:
1) If xi = 0, continue.
2) Else, a = a + C(n − i, k − b) and b = b + 1.
After that, convert a into a binary string a′ of length ⌈log |S|⌉ ≈ nH(p + ϵ) and output 0a′.
The decoding algorithm is as follows: On input x ∈ {0,1}^{n+1} or {0,1}^{⌈log |S|⌉+1},
1. If x0 = 1, output x1x2.....xn.
2. Else, interpret x1x2... as the integer a, let k = ⌈n(p − ϵ)⌉, and do the following until a < C(n, k): a = a − C(n, k), k = k + 1.
Then, let b = 0. From i = 1 to n, we do the following for each i:
1) If a ≥ C(n − i, k − b), then a = a − C(n − i, k − b), b = b + 1, and output 1;
2) Else, output 0 and continue.
First we define the order on elements of S. Strings with fewer 1s are smaller than strings with more 1s. Strings with the same number of 1s are compared by their numeric value (reading each string as a binary number). The encoding algorithm computes the rank of an element of S, i.e., it counts how many strings in S are smaller than it. Elements not in S are simply encoded as themselves. The extra bit at the beginning of the encoding identifies whether the input belongs to S or not.
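The rank-based encoder and decoder above can be written out concretely (a Python sketch of the Section 4 algorithms; parameter and function names are mine):

```python
from math import ceil, comb, floor, log2

def _range_S(n, p, eps):
    """Allowed numbers of ones for the eps-strongly typical set S."""
    return ceil(n * (p - eps)), floor(n * (p + eps))

def encode(x, p, eps):
    """Map x in {0,1}^n to '0' + rank (typical) or '1' + x (atypical)."""
    n, k = len(x), x.count("1")
    lo, hi = _range_S(n, p, eps)
    if not lo <= k <= hi:
        return "1" + x
    a = sum(comb(n, j) for j in range(lo, k))  # strings in S with fewer ones
    b = 0                                      # ones seen so far
    for i, bit in enumerate(x):                # rank within the type class
        if bit == "1":
            a += comb(n - i - 1, k - b)        # smaller strings with a 0 here
            b += 1
    size = sum(comb(n, j) for j in range(lo, hi + 1))  # |S|
    return "0" + format(a, "0%db" % max(1, ceil(log2(size))))

def decode(code, n, p, eps):
    """Invert encode: recover x from its codeword."""
    if code[0] == "1":
        return code[1:]
    lo, _ = _range_S(n, p, eps)
    a, k = int(code[1:], 2), lo
    while a >= comb(n, k):                     # recover the number of ones
        a -= comb(n, k)
        k += 1
    out, b = [], 0
    for i in range(n):                         # unrank within the type class
        c = comb(n - i - 1, k - b)
        if a >= c:
            a -= c
            b += 1
            out.append("1")
        else:
            out.append("0")
    return "".join(out)
```

A round trip over all inputs of a small length confirms the pair is lossless: typical strings get ⌈log |S|⌉ + 1 bits, the rest n + 1.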
If we calculate C(n, k) for all 0 ≤ k ≤ n in advance and store them in memory, the encoding and decoding algorithms take only O(n) time. If we calculate everything locally, we need at most O(n²) time using the following code, where c[i][j] = C(i, j):

for(i=0;i<=n;++i) c[i][0]=1;
for(i=1;i<=n;++i)
    for(j=1;j<=i;++j)
        c[i][j]=c[i-1][j]+c[i-1][j-1];

(The recurrence is Pascal's rule C(i, j) = C(i−1, j) + C(i−1, j−1); the array c is assumed zero-initialized, so that c[i−1][i] = 0 when j = i.)
The expected encoding length is bounded by

E[|Enc(X)|] ≤ Pr[X ∈ S] · (⌈log |S|⌉ + 1) + Pr[X ∉ S] · (n + 1) ≤ nH(p + ϵ) + O(log n).
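For concrete parameters, the expected length can be computed exactly (a numeric sketch of the bound above; the two code-length cases follow the construction in Section 4):

```python
from math import ceil, comb, floor, log2

def expected_length(n, p, eps):
    """Expected codeword length of the two-part code: typical strings cost
    ceil(log2 |S|) + 1 bits, atypical strings cost n + 1 bits."""
    lo, hi = ceil(n * (p - eps)), floor(n * (p + eps))
    size = sum(comb(n, k) for k in range(lo, hi + 1))      # |S|
    typical_cost = ceil(log2(size)) + 1
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    p_typ = sum(pmf[lo:hi + 1])
    return p_typ * typical_cost + (1 - p_typ) * (n + 1)
```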
5. Can we improve?
Note that each element in the ϵ-strongly typical set has "nearly equal" probability of being observed. By symmetry, we should give them almost equal code lengths.
Since about log |S| ≈ nH(p + ϵ) bits are necessary to distinguish |S| nearly equiprobable elements, we can't substantially improve our code. Even if we relax our requirement and only require the error to be a constant, roughly nH(p) bits are still necessary.
6. What changes for general alphabets?
Consider a general alphabet {0, 1, ..., m} instead of {0, 1}. Denote the outcome of one toss by X, with Pr[X = i] = p_i for 0 ≤ i ≤ m and Σ_i p_i = 1. The type class is now T(k0, ..., km; n), meaning symbol i happens k_i times in the outcome of n tosses. The probability that a particular sequence in type class T(k0, ..., km; n) occurs is Π_i p_i^{k_i}, and type class T(k0, ..., km; n) has n!/(k0! · · · km!) possible sequences. So we observe type class T(k0, ..., km; n) with probability (n!/(k0! · · · km!)) · Π_i p_i^{k_i}.
Next, we should argue that n!/(k0! · · · km!) ≐ 2^{nH(k0/n, ..., km/n)} and that Pr[T(k0, ..., km; n)] ≐ 2^{−nD((k0/n, ..., km/n) ‖ (p0, ..., pm))}, where D is the generalized relative entropy D(a‖b) = Σ_i a_i log(a_i/b_i). And show that D(a‖b) satisfies a Pinsker-type lower bound by the Taylor expansion for multiple-variable functions.
After that, we can show that the typical type classes are T(nq0, ..., nqm; n) with each q_i close to p_i, and design a variable-length code with expected length about nH(p0, ..., pm).
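The multinomial type-class probability from this section can be sanity-checked by summing over all type classes, which must give total mass 1 (a small sketch; the example alphabets and probabilities are arbitrary):

```python
from itertools import product
from math import factorial, prod

def multinomial(ks):
    """n! / (k0! ... km!), the size of the type class T(k0,...,km; n)."""
    return factorial(sum(ks)) // prod(factorial(k) for k in ks)

def type_prob(ks, ps):
    """Probability of observing the type class T(k0,...,km; n)."""
    return multinomial(ks) * prod(p**k for p, k in zip(ps, ks))

def total_mass(n, ps):
    """Sum of type-class probabilities over all (k0,...,km) with sum n."""
    m = len(ps)
    return sum(type_prob(ks, ps)
               for ks in product(range(n + 1), repeat=m) if sum(ks) == n)
```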
Comments (3)
sidjaggi said
at 2:44 am on Oct 11, 2011
Great job, Siyao! Corrected a few typos (mostly grammatical :) in your write-up.
Comments:
1. The way you define \doteq in class (and indeed the way I defined it the first time, though not the second time) is not the standard way in textbooks, where the definition is usually \lim_{n\rightarrow \infty} \log(A_n/B_n) \rightarrow 0. Sequences satisfying this definition will satisfy your definition, but not necessarily the other way round.
2. Pinsker's inequality -- the inequality between K-L divergence and the L_1 distance is originally due to Pinsker -- perhaps put in a link to the wiki site for Pinsker's inequality?
3. The Taylor series expansion of H(p+\epsilon) would be good to show, to relate it to H(p)
4. What changes in your discussions for general alphabets? (Multinomials coefficients, Taylor series expansions, etc.). Be good to have a short section at the end describing, at a high level these changes.
5. The order complexity of computing {n}\choose{k} -- your description is admirably succinct :), but not all readers have a background in algorithms. Would you mind specifying it in some more detail?
Thanks again for a sterling job!
Chin Ho Lee said
at 1:43 pm on Oct 18, 2011
I think the definition of \doteq is \lim_{n\rightarrow \infty} 1/n \log(A_n/B_n) \rightarrow 0 in Cover and Thomas. (there is an extra 1/n)
siyao said
at 12:16 pm on Oct 12, 2011
Thanks for your comments! I will fix them by this Friday.