**Entropy: Definitions and properties**

Let *(X,Y)* ∈ *{0,1}×{0,1}* be a pair of *correlated random variables* with joint probability distribution *q(.,.)*. In particular, let *q(1,1)* be the probability that it rains on Christmas at both the Shenzhen and Hong Kong airports, *q(0,0)* the probability that it rains on Christmas at neither the Shenzhen nor the Hong Kong airport, *q(1,0)* the probability that it rains on Christmas at the Shenzhen airport but not at the Hong Kong airport, and *q(0,1)* the probability that it does not rain on Christmas at the Shenzhen airport but does rain at the Hong Kong airport. Assume that rainfall is independent from year to year (this is reasonable), but that rainfall in Shenzhen and in Hong Kong on the same day is correlated (given their locations, this is also reasonable). These events are measured for *n* years, resulting in the sequence of pairs of random variables *(X_1,Y_1), (X_2,Y_2), ..., (X_n,Y_n)*, where each pair of random variables is distributed i.i.d. according to the distribution *q(.,.)*.
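The i.i.d. model above is easy to simulate; here is a minimal Python sketch (the numbers in *q* are made up for illustration, not real rainfall statistics) showing that the empirical joint type of the observed sequence concentrates around *q(.,.)* as *n* grows:

```python
import random

# Hypothetical joint distribution q(x, y) over {0,1} x {0,1} -- the
# probabilities below are illustrative, not real rainfall statistics.
q = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}

def sample_years(n, rng):
    """Draw n i.i.d. (rain-in-Shenzhen, rain-in-Hong-Kong) pairs from q."""
    pairs, weights = zip(*q.items())
    return rng.choices(pairs, weights=weights, k=n)

rng = random.Random(0)
seq = sample_years(100_000, rng)
# The empirical type (joint histogram) of the sequence concentrates
# around q as n grows -- the basis for "typicality" in Q1.
emp = {xy: seq.count(xy) / len(seq) for xy in q}
for xy in q:
    assert abs(emp[xy] - q[xy]) < 0.02
```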

**Q1:** *(Joint entropy, conditional entropy, mutual information, and divergence)* **(a)** What is a "natural" definition of the *jointly strongly typical set* *T^{(n)}(X,Y)*, the events of which have a "high" probability of occurring? What is the size of this set? (Note Q2 from PS1.) Denote the coefficient of the leading term of the exponent of this quantity by *H(X,Y)*, the *joint entropy* between **X** and **Y**.
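For a distribution where every *n·q(x,y)* is an integer, the sequences whose empirical joint type is *exactly* *q* can be counted with a multinomial coefficient; the sketch below (the dyadic distribution is illustrative) shows that the exponent of this count tracks the joint entropy, up to polylogarithmic corrections:

```python
from math import factorial, log2, prod

def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Illustrative dyadic joint distribution, chosen so that n*q(x,y) is an
# integer and the exact type class is nonempty.
q = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
n = 80
counts = [round(n * p) for p in q.values()]

# Number of length-n sequences of pairs whose empirical joint type is
# exactly q: the multinomial coefficient n! / prod(counts!).
type_class_size = factorial(n) // prod(factorial(c) for c in counts)

# Its exponent is n*H(X,Y) to leading order (polylog corrections remain).
print(log2(type_class_size) / n, entropy(q))
```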

**(b)** Suppose a weather station has records of the daily rainfall at the Shenzhen airport over the previous *n* years, and, as expected, the rainfall record *x^n* on Christmas is "typical". What is a natural definition of the *conditionally strongly typical set* *T^{(n)}(Y|x^n)*, the events of which have a "high" probability of occurring conditioned on *X^n = x^n*? What is the size of this set? Denote the coefficient of the leading term of the exponent of this quantity by *H(Y|X)*, the *conditional entropy* of **Y** given **X**.

**(c)** How about the corresponding answers when the rainfall in Hong Kong is recorded instead of that in Shenzhen?

**(d)** What is the ratio of the sizes of the typical sets *T^{(n)}(Y)* and *T^{(n)}(Y|x^n)*? Denote the coefficient of the leading term of the exponent of this quantity by *I(X;Y)*, the *mutual information* of **X** and **Y**. Similarly, compute the ratio of the sizes of the typical sets *T^{(n)}(X)* and *T^{(n)}(X|y^n)*. Why do you suppose this quantity is called the mutual information?

**(e)** What is the probability of observing a sequence *x^n* with type *p(.)*, rather than the true type *q(.)*? (Here *q(.)* refers to the marginal of *q(.,.)* on *X*.) Denote the coefficient of the leading term of the exponent of this quantity by *D(p||q)*, the divergence/relative entropy/Kullback-Leibler "distance" from *q(.)* to *p(.)*. Why do you suppose this quantity is called the divergence?
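The exponent in (e) can be checked numerically in the binary case: the exact probability (under the true distribution) of seeing a sequence with an atypical type decays as *2^{-nD(p||q)}* to leading order. A sketch, with illustrative values of the true and observed types:

```python
from math import comb, log2

def D(p, q):
    """Binary KL divergence D(p||q) in bits between Bernoulli parameters."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            total += a * log2(a / b)
    return total

q1 = 0.3   # "true" marginal probability of rain (illustrative)
p1 = 0.5   # observed (atypical) type
n = 200
k = round(n * p1)
# Exact probability, under Bernoulli(q1)^n, of a sequence having type p1:
prob = comb(n, k) * q1**k * (1 - q1)**(n - k)
# Its exponent is -n*D(p1||q1) to leading order (method of types).
print(-log2(prob) / n, D(p1, q1))
```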

**(f)** How about upper and lower bounds on the probability of *typical* elements?

**(g)** For general random variables **X** and **Y** over arbitrary finite alphabets and with arbitrary joint distributions, can you guess (or derive!) the quantities above?

**Key concepts in Question 1:** Joint entropy, jointly typical set, conditional entropy, conditionally typical set, mutual information, divergence.

**Q2:** *(Information theory equalities -- use 1(f))* **(a)** Prove the following equalities: *H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)*, and *I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)*.

**(b)** Represent the quantities *H(X), H(Y), H(X|Y), H(Y|X), H(X,Y)* and *I(X;Y)* in a *Venn diagram*.

**(c)** *(Chain rule for entropies:)* Prove that *H(X_1, ..., X_n) = Σ_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1)*.

**(d)** *(Chain rule for mutual information:)* Define the *conditional mutual information* as *I(X;Y|Z) = H(X|Z) - H(X|Y,Z)*. Prove that *I(X_1, ..., X_n; Y) = Σ_{i=1}^{n} I(X_i; Y | X_{i-1}, ..., X_1)*.
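The two-variable chain rule *H(X,Y) = H(X) + H(Y|X)* can be sanity-checked numerically, computing *H(Y|X)* directly as an average of the entropies of the conditional distributions rather than as a difference of entropies. A sketch over a random (purely illustrative) joint distribution:

```python
from itertools import product
from math import log2
import random

# Numerical check of H(X,Y) = H(X) + H(Y|X), with H(Y|X) computed
# directly from the conditional distributions p(y|x). The joint
# distribution is random, purely for illustration.
rng = random.Random(42)
raw = {xy: rng.random() for xy in product((0, 1), repeat=2)}
Z = sum(raw.values())
joint = {xy: v / Z for xy, v in raw.items()}

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

px = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
# H(Y|X) = sum_x p(x) * H(Y | X = x)
HYgX = sum(px[x] * H({y: joint[(x, y)] / px[x] for y in (0, 1)})
           for x in (0, 1))
assert abs(H(joint) - (H(px) + HYgX)) < 1e-9
print(H(joint), H(px) + HYgX)
```

Iterating the same decomposition coordinate by coordinate gives the *n*-variable chain rule.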

**Key concepts in Question 2:** Information equalities, Venn diagram, chain rule for entropies/mutual informations.

**Q3:** *(Information theory inequalities)* **(a)** Prove the following inequalities by noting elementary properties of the log function -- *H(X) ≥ 0*, *H(Y|X) ≥ 0*, and *H(X,Y) ≥ 0* (when are these tight?), and show that in general *D(p||q) ≠ D(q||p)* *(Hint: Give an example)*.

**(b)** What are "reasonable" definitions of *convexity, strict convexity, concavity,* and *strict concavity* of functions?

**(c)** *(Jensen's inequality:)* For an arbitrary convex function *f* and arbitrary random variable *X*, prove by induction that *E[f(X)] ≥ f(E[X])* -- here *E* is the *expectation operator* (averaging) over the random variable. For a strictly convex function *f*, prove that equality holds if and only if *X* is a constant *(Hint: Sketch convex functions)*.
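Jensen's inequality *E[f(X)] ≥ f(E[X])* is easy to spot-check numerically for a specific convex function; a sketch using *f(x) = x²* and random finitely supported random variables (the choice of *f* and the distributions are illustrative):

```python
import random

# Spot check of Jensen's inequality E[f(X)] >= f(E[X]) for the convex
# function f(x) = x^2, over random finitely supported X.
rng = random.Random(7)
f = lambda x: x * x

for _ in range(100):
    support = [rng.uniform(-5, 5) for _ in range(6)]
    w = [rng.random() for _ in range(6)]
    s = sum(w)
    probs = [v / s for v in w]
    EX = sum(p * x for p, x in zip(probs, support))
    EfX = sum(p * f(x) for p, x in zip(probs, support))
    # For f(x) = x^2 the gap EfX - f(EX) is Var(X) >= 0; equality
    # holds only when X is (essentially) constant.
    assert EfX >= f(EX) - 1e-9
print("Jensen's inequality holds on all trials")
```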

**(d)** Use (c) to prove that *H(X) ≤ log |X|* (i.e., the log of the *alphabet-size* of *X*), and that *D(p||q) ≥ 0* (does this justify calling *D(p||q)* a "distance"?), *(Hint: What is the appropriate convex function?)*. Also, show that *I(X;Y) ≥ 0* (*Hint: Use 2(a)*). When are these tight?

**(e)** **(Convexity/Concavity of entropic quantities)** Is entropy a convex or concave function of the input distribution? How about mutual information -- is it convex or concave in the input distribution *p(x)*? How about in the transition probabilities *p(y|x)*?

**(f)** **(Conditioning reduces entropy:)** Use (d) to prove that *H(X|Y) ≤ H(X)*. When is this tight?
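"Conditioning reduces entropy", *H(X|Y) ≤ H(X)*, can be spot-checked numerically over randomly generated joint distributions; a minimal sketch on the binary alphabet:

```python
from itertools import product
from math import log2
import random

# Spot check that conditioning reduces entropy, H(X|Y) <= H(X),
# on randomly generated joint distributions over {0,1} x {0,1}.
rng = random.Random(1)

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

for _ in range(1000):
    raw = {xy: rng.random() for xy in product((0, 1), repeat=2)}
    Z = sum(raw.values())
    joint = {xy: v / Z for xy, v in raw.items()}
    px = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    # H(X|Y) = sum_y p(y) * H(X | Y = y)
    HXgY = sum(py[y] * H({x: joint[(x, y)] / py[y] for x in (0, 1)})
               for y in (0, 1))
    assert HXgY <= H(px) + 1e-9
print("H(X|Y) <= H(X) on all trials")
```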

**(g)** *(Data processing inequality:)* If *X → Y → Z* form a *Markov chain* *of degree 1*, prove that *I(X;Z) ≤ I(X;Y)* (*Hint:* Expand *I(X;Y,Z)* in two ways using 2(d), and use the Markov property). When is this tight?
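The data processing inequality *I(X;Z) ≤ I(X;Y)* can be tested numerically on randomly generated binary Markov chains *X → Y → Z*; a sketch (all distributions random, purely illustrative):

```python
from math import log2
import random

# Spot check of the data processing inequality I(X;Z) <= I(X;Y) on
# randomly generated binary Markov chains X -> Y -> Z.
rng = random.Random(3)

def mutual_info(joint):
    """I(A;B) in bits from a joint dict {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

for _ in range(500):
    px = rng.random()                              # P(X = 1)
    p_y_given_x = [rng.random(), rng.random()]     # P(Y = 1 | X = x)
    p_z_given_y = [rng.random(), rng.random()]     # P(Z = 1 | Y = y)
    jxy, jxz = {}, {}
    for x in (0, 1):
        for y in (0, 1):
            pxy = (px if x == 1 else 1 - px) * \
                  (p_y_given_x[x] if y == 1 else 1 - p_y_given_x[x])
            jxy[(x, y)] = jxy.get((x, y), 0.0) + pxy
            for z in (0, 1):
                pz = p_z_given_y[y] if z == 1 else 1 - p_z_given_y[y]
                jxz[(x, z)] = jxz.get((x, z), 0.0) + pxy * pz
    assert mutual_info(jxz) <= mutual_info(jxy) + 1e-9
print("I(X;Z) <= I(X;Y) on all trials")
```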

**(h)** *(Fano's inequality:)* Let *Y* be related to *X* via *p(y|x)*, let *X̂ = g(Y)* be some deterministic function of *Y*, and let *P_e* be the probability of the event *E* that *X̂ ≠ X*. Prove that *H(X|Y) ≤ H(P_e) + P_e log(|X| - 1)*. *(Crucial in proving converses in our coding theorems. Hint: Proceed as follows. Consider the binary random variable corresponding to **E**. Expand *H(E, X|X̂)* in two ways, and use bounds on entropy functions. In particular, why is *H(E)* at most 1, why is *H(E|X, X̂) = 0*, and why is *H(X|E, X̂)* at most *P_e log(|X| - 1)*? Lastly, use the data processing inequality...)*
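Fano's inequality *H(X|Y) ≤ h(P_e) + P_e log(|X| - 1)* (in bits, with *h* the binary entropy) can be spot-checked numerically: the sketch below uses a random prior and channel (both illustrative) and the maximum a posteriori guess as the deterministic estimator *X̂(y)*.

```python
from math import log2
import random

# Spot check of Fano's inequality H(X|Y) <= h(Pe) + Pe*log2(|X|-1),
# with Xhat(y) = argmax_x p(x|y) the MAP guess of X from Y.
rng = random.Random(5)

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

for _ in range(200):
    nx, ny = 3, 4                         # alphabet sizes (illustrative)
    px = [rng.random() for _ in range(nx)]
    px = [v / sum(px) for v in px]
    # Random channel p(y|x): each row is a distribution over y.
    rows = []
    for _ in range(nx):
        r = [rng.random() for _ in range(ny)]
        rows.append([v / sum(r) for v in r])
    # Conditional entropy H(X|Y) and error probability of the MAP guess.
    HXgY, Pe = 0.0, 0.0
    for y in range(ny):
        py = sum(px[x] * rows[x][y] for x in range(nx))
        post = [px[x] * rows[x][y] / py for x in range(nx)]
        HXgY += py * -sum(p * log2(p) for p in post if p > 0)
        Pe += py * (1 - max(post))
    assert HXgY <= h2(Pe) + Pe * log2(nx - 1) + 1e-9
print("Fano's inequality holds on all trials")
```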

**Key concepts in Question 3:** Positivity of entropy, conditional entropy, mutual information, divergence. Upper bounds on their values. Conditioning reduces entropy. Data processing inequality. Fano's inequality.

**Q4:** *(Bound on divergence)* Let *V(p,q) = Σ_x |p(x) - q(x)|* denote the *total variational distance* between two p.m.f.s *p* and *q*. Prove that *D(p||q) ≥ c·V(p,q)²* for some constant *c*. *(Crucial in proving achievability in our coding theorems.)*
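This is Pinsker's inequality. The exercise only asks for *some* constant *c*; the sketch below checks the specific (known-valid) choice *c = 1/(2 ln 2)* when *D* is measured in bits and *V(p,q) = Σ_x |p(x) - q(x)|*, over random distributions on a small alphabet:

```python
from math import log, log2
import random

# Spot check of Pinsker's inequality D(p||q) >= V^2 / (2 ln 2), with
# D in bits and V(p, q) = sum_x |p(x) - q(x)|, on random p.m.f.s.
rng = random.Random(11)

def D(p, q):
    """KL divergence in bits between two p.m.f.s given as lists."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

for _ in range(1000):
    k = 4
    p = [rng.random() for _ in range(k)]
    p = [v / sum(p) for v in p]
    q = [rng.random() for _ in range(k)]
    q = [v / sum(q) for v in q]
    V = sum(abs(pi - qi) for pi, qi in zip(p, q))
    assert D(p, q) >= V * V / (2 * log(2)) - 1e-9
print("Pinsker's inequality holds on all trials")
```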

**Pretty pictures**

Here are some images that attempt to graphically represent the structure of jointly typical sets (Copyright Sidharth Jaggi, 2002)

Pairs of length-12 binary strings with distribution (0.25, 0.25, 0.25, 0.25)

Pairs of length-11 binary strings with distribution (3/11, 3/11, 2/11, 3/11)

Pairs of length-12 binary strings with distribution (5/12, 2/12, 0, 5/12)

Pairs of length-12 binary strings with distribution (1/3, 1/6, 1/6, 1/3)

## Comments (6)

## MEI Yuchen said

at 9:16 pm on Jan 21, 2009

I add a "log" in the first inequality of Q3(d). Otherwise the inequality will never be tight.

Am I right?

## sidjaggi said

at 10:49 pm on Jan 21, 2009

Ah yes, one of the inequalities we didn't do in class...

Any comments, anyone? (In the spirit of not giving substantial comments on a thread for 24 hours, I refuse to get involved at this point :)

## Cho Yiu Ng said

at 1:57 pm on Feb 10, 2009

One comment to the submitted solutions. When you state equality holds iff "certain conditions" are true, remember to prove it. So for those who have shown that, please try to do so by yourself.

## CAI, Sheng said

at 4:03 pm on Sep 28, 2011

I think there are two typos in Q2. In (b), X(X,Y) should be H(X,Y). In (c), H(X_{i}|X_{i-1},...,X_{n}) should be H(X_{i}|X_{i-1},...,X_{1}) :)

## Zirui Zhou said

at 1:55 pm on Oct 4, 2011

Haha, I also found that. Strongly agree with you.

## sidjaggi said

at 2:12 pm on Oct 4, 2011

Thanks, guys :)
