**Odds and ends**

So far most of the sequences of random variables we've discussed correspond to *memoryless* stochastic processes.

**Q1. (Memory)** (a) Let *X₁, X₂, …* be a stationary stochastic process (what do those terms mean?). What would a good notion of the "average entropy of the process"... the *entropy rate*... be? Can you give two alternate definitions?

(b) How would you define the *stationary distribution* of a Markov chain? When would you expect it NOT to exist? How would you compute the stationary distribution(s) of a Markov chain? When would you expect the Markov chain NOT to have a unique stationary distribution? (Proofs are outside the scope of the class.)

(c) For a Markov process, can you give a simpler formulation of the entropy rate? How would you write this in terms of the transition matrix of the Markov chain? (Here's a somewhat amusing use of Markov chains)
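As a numerical sanity check for (b) and (c), here is a minimal sketch. The 2-state transition matrix is a hypothetical example of my choosing; the stationary distribution is found by power iteration, and the entropy rate uses the standard expression for a stationary Markov chain, *H = −Σᵢ πᵢ Σⱼ P(i,j) log₂ P(i,j)*:

```python
import numpy as np

# Hypothetical 2-state transition matrix (rows sum to 1), chosen for illustration.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: a left eigenvector of P with eigenvalue 1,
# computed here by power iteration (pi = pi @ P until convergence).
pi = np.ones(P.shape[0]) / P.shape[0]
for _ in range(1000):
    pi = pi @ P
pi /= pi.sum()

# Entropy rate of the stationary chain, in bits per symbol:
# H = -sum_i pi_i * sum_j P[i,j] * log2 P[i,j]
H_rate = -sum(pi[i] * P[i, j] * np.log2(P[i, j])
              for i in range(P.shape[0]) for j in range(P.shape[1])
              if P[i, j] > 0)
print(pi, H_rate)
```

For this matrix the stationary distribution is (0.75, 0.25); note how the entropy rate is a π-weighted average of the per-state transition entropies.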

(d) Let *π* be the unique stationary distribution of a Markov chain, and let *μₙ* be the distribution of the *n*th r.v. in the chain. Prove that *μₙ* gets monotonically closer to *π*, *i.e.*, that *D(μₙ||π) ≥ D(μₙ₊₁||π)*, AND that *D(μₙ||π) → 0*.

(e) The Second law of thermodynamics states that the entropy of an isolated system never decreases. What's the implicit assumption here? Can you construct an example where this assumption is violated, and therefore the entropy of the system does actually decrease?

(f) How would you compress a Markov source using a Huffman code? (There's no single right answer.)

*Key concepts in Q1:* *Non i.i.d. processes, entropy rates, Markov sources, stationary distributions, existence/uniqueness of stationary distributions of Markov processes, convergence to stationary distribution, second law of thermodynamics, compression of non-i.i.d. sources.*

**Q2. (Arithmetic codes)** Consider the following coding scheme. A source generates i.i.d. bits 0 and 1 with probabilities *p* and *1-p*. For the first bit the encoder observes, it divides the unit interval [0,1) into two sub-intervals [0,*p*) and [*p*,1), and selects the first or the second depending on whether the bit is 0 or 1, respectively. For each successive bit the encoder divides the previous interval in an analogous manner -- for example, the bit sequence 01 corresponds to the interval [*p*², *p*).
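The subdivision rule can be sketched in a few lines of Python (the helper name `interval` is mine, not part of the problem statement):

```python
def interval(bits, p):
    """Return the half-open interval [lo, hi) for a bit sequence.

    Bit 0 selects the left fraction p of the current interval,
    bit 1 selects the right fraction 1-p, as described above.
    """
    lo, hi = 0.0, 1.0
    for b in bits:
        split = lo + p * (hi - lo)
        if b == 0:
            hi = split      # keep the left sub-interval
        else:
            lo = split      # keep the right sub-interval
    return lo, hi
```

For instance, `interval([0, 1], p)` reproduces the interval [*p*², *p*) from the example above.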

(i) What interval does the bit sequence 0110 correspond to?

(ii) After this division into intervals, what should the encoder actually transmit to the decoder?

(iii) Given your answer in (ii), what should the decoder do to decode?

(iv) Can you think of an implementation to do both encoding and decoding in a streaming manner? What is the time-complexity of your algorithms?

(v) What is the length of the interval assigned to each sequence? Why is this a "good" thing? *(Hint: Encoding length, dimension of alphabet, compare with Huffman code...)*

(vi) What if the input alphabet were non-binary?

(vii) What if the source were not i.i.d.? For instance, what if it were a Markov process? Or if it were not identically distributed?

(viii) After reading this link, do you have any comments on software patents?

*Key concepts in Q2:* *Arithmetic coding (encoder/decoder), expected rate, complexity, intervals of real line, non-i.i.d. source coding, software patents.*

**Q3. (Weak typicality)** We define the *ε*-weakly typical set *A_ε^(n)* as the set of all sequences *xⁿ* such that *|−(1/n) log p(xⁿ) − H(p)| ≤ ε*. Stated differently, *xⁿ* is *ε*-weakly typical if the average (over block-length *n*) of the negative logarithm of the probability of the sequence occurring (under the probability distribution *p(.)*) is "roughly equal" to the entropy *H(p)*, i.e., *−(1/n) log p(xⁿ) ≈ H(p)*.
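As an empirical illustration of the definition (a sketch; the Bernoulli(0.2) source and block length are arbitrary choices of mine), the per-symbol negative log-probability of a long i.i.d. sequence should concentrate around *H(p)* by the WLLN:

```python
import math
import random

random.seed(0)
p = 0.2                                           # hypothetical Bernoulli(p) source
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # source entropy in bits

n = 100_000
x = [1 if random.random() < p else 0 for _ in range(n)]

# For an i.i.d. source, -(1/n) log2 p(x^n) is the average per-symbol
# negative log-probability:
neg_log_prob = -sum(math.log2(p if b else 1 - p) for b in x) / n
print(H, neg_log_prob)   # the two should be close for large n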

(a) Show that weakly typical sets are also "good enough", in the sense that they have properties similar to those that we've used for strongly typical sets. Namely, show that

(i) *Pr(Xⁿ ∈ A_ε^(n)) → 1* as *n → ∞* for a sequence of i.i.d. random variables (use WLLN -- the Weak Law of Large Numbers). How about if the sequence is NOT i.i.d.?

(ii) *|A_ε^(n)| ≤ 2^(n(H(p)+ε))*.

(iii) *|A_ε^(n)| ≥ (1−ε)·2^(n(H(p)−ε))* for sufficiently large *n*.

(b) Can you think of one reason why weak typicality might not be as "representative" of typicality as strong typicality? Can you then think of some reasons why people might prefer using it?

*Key concepts in Q3:* *Weak typicality, definition, comparison with strong typicality, properties (size, probability), weak law of large numbers.*

**Q4. (Differential Entropy)** Let *X* be a continuous random variable with pdf *p(x)*.

(a) What would be a "natural" definition of the *differential entropy h(X)* of *X*?

(b) Which properties of entropy does differential entropy satisfy? Which does it not? In particular, consider PS2, Questions 2 and 3. How about the corresponding "differential mutual information"?

(c) For *p(x)* with finite support, let the p.m.f. *{pᵢ}* be a quantized version of *p(x)* with quantization step *Δ*. That is, let *pᵢ = ∫ p(x) dx*, integrated over *[iΔ, (i+1)Δ)*. Can you relate the quantities *h(X)* and *H(X^Δ)*, the entropy of the quantized variable?
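A quick numerical illustration of (c), under assumptions of my choosing: the pdf *p(x) = 2x* on [0,1] and step *Δ = 2⁻¹⁰* are arbitrary, and the candidate relation being tested, *H(X^Δ) ≈ h(X) − log₂ Δ*, is the standard quantization identity, stated here as something to verify rather than derive:

```python
import math

# Hypothetical pdf p(x) = 2x on [0, 1]; its differential entropy in bits,
# computed analytically, is h(X) = -(ln 2 - 1/2) / ln 2 ≈ -0.2787.
h = -(math.log(2) - 0.5) / math.log(2)

Delta = 2**-10
N = int(1 / Delta)

# Quantized pmf: p_i = integral of 2x over [i*Delta, (i+1)*Delta)
#              = ((i+1)^2 - i^2) * Delta^2 = (2i+1) * Delta^2
pmf = [(2 * i + 1) * Delta**2 for i in range(N)]
H_quant = -sum(q * math.log2(q) for q in pmf)

# Candidate relation to check: H(X^Delta) ≈ h(X) - log2(Delta)
print(H_quant, h - math.log2(Delta))
```

With these numbers the two printed values agree to within about 10⁻³, which is the kind of agreement (c) is asking you to explain.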

(d) How about Q3(a) from this problem set?

(e) For a given mean and variance, what distribution has maximal differential entropy?

*Key concepts in Q4:* *Differential entropy, properties of differential entropy (differences and similarities with "discrete" entropy)*

## Comments (7)

## Tong said

at 5:13 pm on Feb 20, 2009

Can you provide some reference about today's lecture or others?

## Cho Yiu Ng said

at 2:24 pm on Feb 21, 2009

Please refer to chapter 4 of the textbook.

## MEI Yuchen said

at 4:26 pm on Mar 1, 2009

Where may I find the details of Arithmetic codes?

## MEI Yuchen said

at 5:51 pm on Mar 1, 2009

Is it possible to do arithmetic coding in a streaming manner if p is not equal to 0.5?

For example, suppose p=0.4.

Then the interval for 100 is (0.4, 0.496). However, if we do streaming translation, we will produce a 1 as soon as we see the first 1.

Then no matter what we add to the codeword later, it can never fall into the correct interval.

## Silas said

at 9:47 am on Mar 2, 2009

I agree with Mei Yuchen.

## Cho Yiu Ng said

at 10:37 am on Mar 3, 2009

Good point, Yuchen! And you have at least one "supporter" as well. Any objections? Or on the other hand, if you agree, can you prove it (a little bit) more formally?

For details of arithmetic coding, of course, wikipedia is always your friend:

http://en.wikipedia.org/wiki/Arithmetic_coding

You may also refer to the following lecture notes for IEG7007 (p.532-561):

http://course.ie.cuhk.edu.hk/~ieg7007/L7_Text%20Compression.ppt

## Lingxiao XIA said

at 10:24 am on Mar 4, 2009

it is easy to prove that at least the first log(1/length of interval)-1 bits are the same, in this case, log(1/0.6) +2 = 3, we need three bits for the interval, but only the first log(1/0.6) - 1 = 0 bit can be used in the next one
