Problem-Set-2

Page history last edited by sidjaggi 12 years, 6 months ago

Entropy: Definitions and properties

Let (X,Y) in {0,1}x{0,1} be two correlated random variables with probability distribution q(.,.). In particular, let q(1,1) be the probability that it rains on Christmas at both Shenzhen and Hong Kong airports, q(0,0) be the probability that it rains on Christmas at neither Shenzhen nor Hong Kong airport, q(1,0) be the probability that it rains on Christmas at Shenzhen airport but not at Hong Kong airport, and q(0,1) be the probability that it does not rain on Christmas at Shenzhen airport but does rain at Hong Kong airport. Assume that rain is independent from year to year (this is reasonable), but that rain in Shenzhen and rain in Hong Kong on the same day of a given year are correlated (given their locations, this is also reasonable). These events are measured for n years, resulting in the sequence of pairs of random variables (X_1,Y_1),...,(X_n,Y_n), where the pairs are distributed i.i.d. according to the distribution q(.,.).

Q1: (Joint entropy, conditional entropy, mutual information, and divergence) (a) What is a "natural" definition for the jointly strongly typical set Formula , the events of which have a "high" probability of occurring? What is the size of this set? (Note Q2 from PS1). Denote the coefficient of the leading term of the exponent of this quantity by H(X,Y), the joint entropy of X and Y.

(b) Suppose a weather station has records of the daily rainfall at Shenzhen airport for the previous n years, and, as expected, the rainfall records Formula on Christmas are "typical". What is a natural definition of the conditionally strongly typical set Formula , the events of which have a "high" probability of occurring conditioned on Formula ? What is the size of this set? Denote the coefficient of the leading term of the exponent of this quantity by H(Y|X), the conditional entropy of Y given X.

(c) What are the corresponding answers if the rainfall in Hong Kong is recorded instead of the rainfall in Shenzhen?

(d) What is the ratio of the sizes of the typical sets Formula and Formula ? Denote the coefficient of the leading term of the exponent of this quantity by I(X;Y), the mutual information of X and Y. Similarly, compute the ratio of the sizes of the typical sets Formula and Formula . Why do you suppose this quantity is called the mutual information?

(e) What is the probability of observing a sequence Formula with type p(.) rather than the true type q(.)? (Here q(.) refers to the marginal of q(.,.) on X.) Denote the coefficient of the leading term of the exponent of this quantity by D(p||q), the divergence/relative entropy/Kullback-Leibler "distance"/cross entropy from q(.) to p(.). Why do you suppose this quantity is called the divergence?

(f) How about upper and lower bounds on the probability of typical elements?

(g) For general random variables X and Y over arbitrary finite alphabets and with arbitrary joint distributions, can you guess (or derive!) the quantities above?

Key concepts in Question 1: Joint entropy, jointly typical set, conditional entropy, conditionally typical set, mutual information, divergence.
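
As a numerical companion to Question 1, the sketch below evaluates the quantities named above for one concrete 2x2 distribution. The values of q(.,.) and the observed type p_obs are hypothetical and chosen only for illustration, and the formulas used are the standard textbook definitions (in bits), which is an assumption about what parts (a)-(e) lead to rather than something stated on this page.

```python
import math

# Hypothetical joint pmf q(x, y): x = 1 if it rains in Shenzhen, y = 1 if it
# rains in Hong Kong. The numbers are illustrative, not from the problem set.
q = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}

def marginal(joint, axis):
    """Marginal pmf of one coordinate (axis 0 is X, axis 1 is Y)."""
    out = {}
    for pair, prob in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + prob
    return out

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(prob * math.log2(prob) for prob in pmf.values() if prob > 0)

def conditional_entropy(joint, given_axis):
    """Entropy (bits) of the other coordinate given coordinate `given_axis`."""
    marg = marginal(joint, given_axis)
    return -sum(prob * math.log2(prob / marg[pair[given_axis]])
                for pair, prob in joint.items() if prob > 0)

def mutual_information(joint):
    """I(X;Y) in bits, computed as sum q(x,y) log[q(x,y) / (q(x) q(y))]."""
    mx, my = marginal(joint, 0), marginal(joint, 1)
    return sum(prob * math.log2(prob / (mx[x] * my[y]))
               for (x, y), prob in joint.items() if prob > 0)

def kl_divergence(p, r):
    """D(p||r) in bits; assumes r(a) > 0 wherever p(a) > 0."""
    return sum(pa * math.log2(pa / r[a]) for a, pa in p.items() if pa > 0)

qx = marginal(q, 0)
h_xy = entropy(q)
h_y_given_x = conditional_entropy(q, 0)
h_x_given_y = conditional_entropy(q, 1)
i_xy = mutual_information(q)
p_obs = {0: 0.5, 1: 0.5}  # hypothetical observed type of the X-sequence
d_pq = kl_divergence(p_obs, qx)

print(f"H(X,Y) = {h_xy:.4f} bits")
print(f"H(Y|X) = {h_y_given_x:.4f} bits, H(X|Y) = {h_x_given_y:.4f} bits")
print(f"I(X;Y) = {i_xy:.4f} bits")
print(f"D(p||q) = {d_pq:.4f} bits")
```

Changing q to a product distribution should drive I(X;Y) to zero, which is one way to sanity-check an answer to part (d).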

Q2: (Information theory equalities -- use 1(f)) (a) Prove the following equalities

Formula

(b) Represent the quantities H(X), H(Y), H(X|Y), H(Y|X), H(X,Y) and I(X;Y) in a Venn diagram.

(c) (Chain rule for entropies:) Prove that H(X_{1},...,X_{n}) = sum_{i=1}^{n} H(X_{i}|X_{i-1},...,X_{1}).

(d) (Chain rule for mutual information:) Define the conditional mutual information as Formula . Prove that Formula .

Key concepts in Question 2: Information equalities, Venn diagram, chain rule for entropies/mutual informations.
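
A numerical check is no substitute for the proofs asked for in Question 2, but it can catch a mis-stated identity before you try to prove it. The sketch below draws a random joint pmf on three bits and compares both sides of the chain rule for entropies and of the usual expressions for mutual information. Since the formulas in 2(a) did not survive on this page, the identities tested here are the standard textbook ones, which is an assumption; the helper names are hypothetical.

```python
import itertools
import math
import random

random.seed(0)

# Random joint pmf on (X, Y, Z) in {0,1}^3; purely illustrative.
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.uniform(0.05, 1.0) for _ in outcomes]
norm = sum(weights)
p = {o: w / norm for o, w in zip(outcomes, weights)}

def marginal(joint, idx):
    """Marginal pmf on the coordinates listed in the tuple `idx`."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def H(joint, idx):
    """Joint entropy in bits of the coordinates in `idx`."""
    return -sum(prob * math.log2(prob)
                for prob in marginal(joint, idx).values() if prob > 0)

def H_cond(joint, target, given):
    """H(target | given) from the definition -sum p(t,g) log2 p(t|g)."""
    joint_tg = marginal(joint, target + given)
    marg_g = marginal(joint, given)
    acc = 0.0
    for key, prob in joint_tg.items():
        if prob > 0:
            g = key[len(target):]
            acc -= prob * math.log2(prob / marg_g[g])
    return acc

X, Y, Z = (0,), (1,), (2,)

# Chain rule: H(X,Y,Z) = H(X) + H(Y|X) + H(Z|X,Y)
lhs = H(p, X + Y + Z)
rhs = H(p, X) + H_cond(p, Y, X) + H_cond(p, Z, X + Y)
print(f"chain rule: {lhs:.6f} vs {rhs:.6f}")

# Three expressions that should agree if they all equal I(X;Y)
i1 = H(p, X) - H_cond(p, X, Y)
i2 = H(p, Y) - H_cond(p, Y, X)
i3 = H(p, X) + H(p, Y) - H(p, X + Y)
print(f"I(X;Y): {i1:.6f}, {i2:.6f}, {i3:.6f}")
```

Conditional entropies are computed directly from conditional probabilities here, rather than as differences of joint entropies, so the agreement is not circular.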

Q3: (Information theory inequalities) (a) Prove the following inequalities by noting elementary properties of the log function -- Formula , Formula , Formula , (when are these tight?) and Formula (Hint: Give an example).

(b) What are "reasonable" definitions of convexity, strict convexity, concavity, and strict concavity of functions?

(c) (Jensen's inequality:) For an arbitrary convex function f and arbitrary random variable X, prove by induction that E[f(X)] >= f(E[X]) -- here E is the expectation operation (averaging) over the random variable. For a strictly convex function f, prove that equality holds if and only if X is a constant (Hint: Sketch convex functions).

(d) Use (c) to prove that Formula (i.e., the alphabet-size of X), Formula (does this justify calling it a "distance"?), Formula (Hint: What is the appropriate convex function?). Also, show that Formula (Hint: Use 2(a)). When are these tight?

(e) (Convexity/Concavity of entropic quantities) Is entropy a convex or concave function of the input distribution? How about mutual information -- is it convex or concave in the input distribution p(x)? How about in the transition probabilities p(y|x)?

(f) (Conditioning reduces entropy:) Use (d) to prove that H(X|Y) <= H(X). When is this tight?

(g) (Data processing inequality:) If Formula form a Markov chain of degree 1, prove that Formula (Hint: Expand in two ways using 2(d), and use the Markov property). When is this tight?

(h) (Fano's inequality:) Let Y be related to X via p(y|x), let Formula be some deterministic function of Y, and P be the probability of the event E that Formula . Prove that Formula . (Crucial in proving converses in our coding theorems. Hint: Proceed as follows. Consider the binary random variable corresponding to E. Expand in two ways, and use bounds on entropy functions. In particular, why is Formula at most 1, Formula , and Formula is at most Formula ? Lastly, use the data processing inequality...)

Key concepts in Question 3: Positivity of entropy, conditional entropy, mutual information, divergence. Upper bounds on their values. Conditioning reduces entropy. Data processing inequality. Fano's inequality.
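
The inequalities in Question 3 can be spot-checked numerically before attempting the proofs. The sketch below builds a hypothetical ternary source and two hypothetical noisy channels, forming a chain X -> Y -> Z, and checks conditioning reduces entropy, the data processing inequality in the form I(X;Y) >= I(X;Z), and Fano's inequality in the common textbook form H(X|Y) <= H(P) + P log(|X|-1) with a maximum a posteriori estimator. The exact statements intended in (g) and (h) are not recoverable from this page, so these forms are assumptions, as are all the numbers.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def output_pmf(px, channel):
    """Output pmf for input pmf px and channel[x][y] = p(y|x)."""
    return [sum(px[x] * channel[x][y] for x in range(len(px)))
            for y in range(len(channel[0]))]

def mutual_information(px, channel):
    """I(input; output) in bits, computed as H(output) - H(output | input)."""
    h_out_given_in = sum(px[x] * entropy(channel[x]) for x in range(len(px)))
    return entropy(output_pmf(px, channel)) - h_out_given_in

def cascade(first, second):
    """Channel from X to Z when Z depends on X only through Y."""
    return [[sum(first[x][y] * second[y][z] for y in range(len(second)))
             for z in range(len(second[0]))]
            for x in range(len(first))]

# Hypothetical source and channels, chosen only for illustration.
px = [0.5, 0.3, 0.2]
x_to_y = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
y_to_z = [[0.7, 0.15, 0.15], [0.15, 0.7, 0.15], [0.15, 0.15, 0.7]]

h_x = entropy(px)
i_xy = mutual_information(px, x_to_y)
i_xz = mutual_information(px, cascade(x_to_y, y_to_z))
h_x_given_y = h_x - i_xy

# Error probability of the MAP estimate of X from Y.
n = len(px)
p_correct = sum(max(px[x] * x_to_y[x][y] for x in range(n)) for y in range(n))
p_err = 1.0 - p_correct
fano_bound = entropy([p_err, 1.0 - p_err]) + p_err * math.log2(n - 1)

print(f"H(X|Y) = {h_x_given_y:.4f} <= H(X) = {h_x:.4f}")
print(f"I(X;Y) = {i_xy:.4f} >= I(X;Z) = {i_xz:.4f}")
print(f"H(X|Y) = {h_x_given_y:.4f} <= Fano bound = {fano_bound:.4f}")
```

Making y_to_z the identity matrix is a quick way to explore when the data processing inequality is tight.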

Q4: (Bound on divergence^[1]) Let Formula denote the total variational distance between two p.m.f.s p and q. Prove that D(p||q) is at least c times the square of this distance, for some constant c. (Crucial in proving achievability in our coding theorems.)
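
Before proving a bound of this kind, it can help to look for the constant empirically. The sketch below samples random pairs of p.m.f.s and records the smallest observed ratio of D(p||q) to the squared L1 distance sum_x |p(x)-q(x)|. Two assumptions are made here: the distance is taken to be the L1 norm (some texts define total variation as half of this, which rescales c by 4), and divergence is measured in bits, for which the lemma cited in the footnote gives c = 1/(2 ln 2).

```python
import math
import random

random.seed(1)

def random_pmf(k):
    """Random pmf on k symbols with all masses bounded away from zero."""
    w = [random.uniform(0.05, 1.0) for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def kl_bits(p, q):
    """D(p||q) in bits; q is assumed strictly positive."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1_distance(p, q):
    """sum_x |p(x) - q(x)| (assumed convention for the distance in Q4)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

c = 1.0 / (2.0 * math.log(2))  # constant for divergence measured in bits

worst_ratio = float("inf")
for _ in range(2000):
    p, q = random_pmf(4), random_pmf(4)
    ratio = kl_bits(p, q) / l1_distance(p, q) ** 2
    worst_ratio = min(worst_ratio, ratio)

print(f"smallest observed D/L1^2 = {worst_ratio:.4f}, c = {c:.4f}")
```

Random sampling only explores part of the simplex, so the smallest observed ratio is evidence for a constant, not a proof of one.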

Pretty pictures

Here are some images that attempt to graphically represent the structure of jointly typical sets (Copyright Sidharth Jaggi, 2002)

Pairs of length-12 binary strings with distribution (0.25,0.25,0.25,0.25)

Pairs of length-11 binary strings with distribution (3/11,3/11,2/11,3/11)

Pairs of length-12 binary strings with distribution (5/12,2/12,0,5/12)

Pairs of length-12 binary strings with distribution (1/3,1/6,1/6,1/3)
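
For string lengths as short as those in the pictures, the number of pairs with a given joint type can be counted exactly and compared with the exponent from Q1(a). The sketch below does this with a multinomial coefficient for the four distributions in the captions. Reading the third caption as (5/12, 2/12, 0, 5/12) is an assumption (the entries then sum to 1), and the function names are hypothetical.

```python
import math

def type_class_size(counts):
    """Number of sequences of pairs whose joint type has these symbol counts."""
    size = math.factorial(sum(counts))
    for c in counts:
        size //= math.factorial(c)  # exact at every step (multinomial)
    return size

def entropy_of_type(counts):
    """Entropy in bits of the empirical distribution counts / n."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Counts of the four pair values for each pictured distribution.
# The count does not depend on which pair value gets which entry.
pictured = [
    (3, 3, 3, 3),  # length 12, (0.25, 0.25, 0.25, 0.25)
    (3, 3, 2, 3),  # length 11, (3/11, 3/11, 2/11, 3/11)
    (5, 2, 0, 5),  # length 12, assumed reading (5/12, 2/12, 0, 5/12)
    (4, 2, 2, 4),  # length 12, (1/3, 1/6, 1/6, 1/3)
]

results = []
for counts in pictured:
    n = sum(counts)
    size = type_class_size(counts)
    rate = math.log2(size) / n
    h = entropy_of_type(counts)
    results.append((counts, size, rate, h))
    print(f"n={n}, counts={counts}: {size} pairs, "
          f"log2(size)/n = {rate:.3f}, H = {h:.3f}")
```

At n around 12 the normalized log-size still sits well below the entropy, which illustrates why H(X,Y) is only the leading term of the exponent.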

Footnotes

[1] This is tricky. It is Lemma 11.6.1 in Elements of Information Theory (2nd Edition) (or Lemma 12.6.1 in the 1st Edition).

Comments (6)

I add a "log" in the first inequality of Q3(d). Otherwise the inequality will never be tight.
Am I right?

Ah yes, one of the inequalities we didn't do in class...
Any comments, anyone? (In the spirit of not giving substantial comments on a thread for 24 hours, I refuse to get involved at this point :)

One comment to the submitted solutions. When you state equality holds iff "certain conditions" are true, remember to prove it. So for those who have shown that, please try to do so by yourself.

I think there are two typos in Q2. In (b), X(X,Y) should be H(X,Y). In (c), H(X_{i}|X_{i-1},...,X_{n}) should be H(X_{i}|X_{i-1},...,X_{1}) :)

Haha, I also found that. Strongly agree with you.

Thanks, guys :)

Comments by: MEI Yuchen; sidjaggi; Cho Yiu Ng; CAI, Sheng; Zirui Zhou; sidjaggi
