Scribe_Note_2


Entropy: Definitions and Properties (Scribe by Zirui)

 

1. Overview

     In the previous lecture, we introduced a simple coding theorem, and we saw that many of the definitions and properties involved are closely tied to the entropy function $H(X)$. In this lecture we rigorously define the fundamental quantities and tools that are prerequisites for the further development of information theory. Besides entropy, we will introduce relative entropy, joint entropy, conditional entropy and mutual information. After defining entropy and mutual information, we establish the chain rules, the nonnegativity of mutual information, the data-processing inequality, Jensen's inequality and Fano's inequality.

 

     Keywords: entropy, relative entropy, mutual information, chain rules.

 

2. Definitions

     2.1 Entropy

          Def. 1 The entropy $H(X)$ of a discrete random variable $X$ with probability mass function $p(x)$ is defined by $H(X) = -\sum_{x} p(x)\log p(x)$.

 

          In the previous lecture, we saw that this entropy has many properties that agree with the intuitive notion of what a measure of information should be. More precisely, $H(X)$ is the leading term in the exponent of the size of the strongly typical set, $|T_\epsilon^n(X)| \doteq 2^{nH(X)}$. Also, entropy is a function only of the distribution of $X$ -- it describes the uncertainty of that random variable.
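          To make Definition 1 concrete, here is a minimal numerical sketch (the helper name `entropy` and the example distributions are our own choices, not from the lecture), computing $H(X)$ in bits for a pmf given as a list of probabilities.

```python
import math

def entropy(p):
    """Shannon entropy H(X) in bits of a pmf given as a list of probabilities."""
    assert abs(sum(p) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))                 # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))                 # biased coin: about 0.469 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over 4 outcomes: 2.0 bits
```

          Note that outcomes with zero probability contribute nothing, which matches the convention $0\log 0 = 0$.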

 

     2.2 Joint Entropy

          Def. 2 The joint entropy $H(X,Y)$ of a pair of discrete random variables $(X,Y)$ with a joint distribution $p(x,y)$ is defined as $H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\log p(x,y)$.

          Intuitively, the jointly $\epsilon$-strongly typical set can be defined as $T_\epsilon^n(X,Y) = \{(x^n,y^n) : |\frac{1}{n}N(a,b \mid x^n,y^n) - p(a,b)| \le \epsilon \text{ for all } (a,b)\}$, where $N(a,b \mid x^n,y^n)$ denotes the number of indices $i$ with $(x_i,y_i) = (a,b)$.

          Claim 1 $|T_\epsilon^n(X,Y)| \doteq 2^{nH(X,Y)}$.

         The proof of this claim is the same as the proof in the previous lecture that $|T_\epsilon^n(X)| \doteq 2^{nH(X)}$, and we will give a similar argument when we prove the claim about conditional entropy below.
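          As a small illustration of Definition 2 (a sketch with a made-up joint pmf, not one from the lecture), the joint entropy is computed by summing over all pairs of outcomes.

```python
import math

def joint_entropy(p_xy):
    """H(X,Y) in bits for a joint pmf given as a 2D list (rows: x, columns: y)."""
    return -sum(p * math.log2(p) for row in p_xy for p in row if p > 0)

# Example joint distribution on {0,1} x {0,1} (values chosen for illustration).
p_xy = [[0.4, 0.1],
        [0.2, 0.3]]
print(joint_entropy(p_xy))  # about 1.846 bits
```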

 

     2.3 Conditional Entropy

          Def. 3 If $(X,Y) \sim p(x,y)$, the conditional entropy $H(Y|X)$ is defined as $H(Y|X) = \sum_{x} p(x) H(Y \mid X=x) = -\sum_{x}\sum_{y} p(x,y)\log p(y|x)$.
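          Definition 3 can be checked numerically. In this sketch (the joint pmf is our own example), $H(Y|X)$ is computed both as $\sum_x p(x)H(Y \mid X=x)$ and as $-\sum_{x,y} p(x,y)\log p(y|x)$, and the two agree.

```python
import math

def entropy(p):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Joint pmf p(x, y) on {0,1} x {0,1} (rows: x, columns: y), chosen for illustration.
p_xy = [[0.4, 0.1],
        [0.2, 0.3]]
p_x = [sum(row) for row in p_xy]   # marginal distribution of X

# H(Y|X) = sum_x p(x) H(Y | X = x)
h_cond_1 = sum(px * entropy([pxy / px for pxy in row])
               for px, row in zip(p_x, p_xy))

# H(Y|X) = -sum_{x,y} p(x,y) log p(y|x)
h_cond_2 = -sum(pxy * math.log2(pxy / px)
                for px, row in zip(p_x, p_xy) for pxy in row if pxy > 0)

print(h_cond_1, h_cond_2)  # both about 0.8464 bits
```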

          Intuitively, given any $x^n \in T_\epsilon^n(X)$, define $T_\epsilon^n(Y \mid x^n) = \{y^n : (x^n,y^n) \in T_\epsilon^n(X,Y)\}$ as the strongly conditional typical set.

          Claim 2 $|T_\epsilon^n(Y \mid x^n)| \doteq 2^{nH(Y|X)}$.

          Proof: For simplicity, we only consider the case where $X$ and $Y$ each take only two values in the sample space, say 0 and 1. Assume that the marginal distribution of $X$ is $(q, 1-q)$, so a typical $x^n$ contains roughly $nq$ zeros and $n(1-q)$ ones. Then the number of strongly conditionally typical sequences should be

          $|T_\epsilon^n(Y \mid x^n)| \approx \binom{nq}{nq\,p(1|0)} \binom{n(1-q)}{n(1-q)\,p(1|1)}$.

          Using Stirling's formula and the proposition that $\binom{n}{n\alpha} \doteq 2^{nH_b(\alpha)}$, where $H_b$ denotes the binary entropy function, we reach the following equation

          $\frac{1}{n}\log |T_\epsilon^n(Y \mid x^n)| \to q\,H_b(p(1|0)) + (1-q)\,H_b(p(1|1)) = \sum_x p(x) H(Y \mid X=x).$

          From the definition we know that the last term above is just $H(Y|X)$, hence we have completed the proof.
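          The counting step above rests on the fact that $\binom{n}{n\alpha} \doteq 2^{nH_b(\alpha)}$. Here is a quick numerical check of that exponent (a sketch; the parameter values are our own).

```python
import math

def binary_entropy(a):
    """Binary entropy H_b(a) in bits."""
    return 0.0 if a in (0.0, 1.0) else -(a * math.log2(a) + (1 - a) * math.log2(1 - a))

alpha = 0.3
for n in (100, 1000, 10000):
    k = round(n * alpha)
    exponent = math.log2(math.comb(n, k)) / n   # (1/n) log2 C(n, n*alpha)
    print(n, exponent, binary_entropy(alpha))   # exponent approaches H_b(0.3) ~ 0.881
```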

 

     2.4 Relative Entropy

          Def. 4 The relative entropy or Kullback-Leibler distance between two probability mass functions $p(x)$ and $q(x)$ is defined as $D(p\|q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$.

          Intuitively, the relative entropy is a measure of the distance between two distributions. However, it is not a true distance because it does not satisfy the symmetry property: in general $D(p\|q) \ne D(q\|p)$. Later we will show that relative entropy is nonnegative.
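          A small sketch (the example distributions are our own) that computes $D(p\|q)$ and illustrates the lack of symmetry:

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # about 0.737 bits
print(kl_divergence(q, p))  # about 0.531 bits -- a different value, so D is not symmetric
```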

 

      

     2.5 Mutual Information

 

          Def. 5 Consider two random variables $X$ and $Y$ with a joint probability mass function $p(x,y)$ and marginal probability mass functions $p(x)$ and $p(y)$. The mutual information $I(X;Y)$ is the relative entropy between the joint distribution and the product distribution $p(x)p(y)$:

       $I(X;Y) = D\big(p(x,y)\,\|\,p(x)p(y)\big) = \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$.

 

           Mutual information is a measure of the amount of information that one random variable contains about another random variable. It is the reduction in the uncertainty of one random variable due to the knowledge of the other. We will gain a deeper understanding after establishing the following properties.
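           As a sketch of Definition 5 (the example joint pmf is our own), $I(X;Y)$ can be computed as the relative entropy between $p(x,y)$ and $p(x)p(y)$; for independent variables it is zero.

```python
import math

def mutual_information(p_xy):
    """I(X;Y) in bits = D(p(x,y) || p(x)p(y)) for a joint pmf given as a 2D list."""
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    return sum(p_xy[i][j] * math.log2(p_xy[i][j] / (p_x[i] * p_y[j]))
               for i in range(len(p_x)) for j in range(len(p_y))
               if p_xy[i][j] > 0)

p_xy = [[0.4, 0.1],
        [0.2, 0.3]]
print(mutual_information(p_xy))  # about 0.1245 bits

# If X and Y are independent, the joint equals the product of marginals, so I(X;Y) = 0.
p_indep = [[0.25, 0.25],
           [0.25, 0.25]]
print(mutual_information(p_indep))  # 0.0
```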

 

3. Properties and Chain Rules.

     3.1 Information Theory Equalities

          (a) Equalities of mutual information

          $I(X;Y) = H(X) - H(X|Y)$

          $I(X;Y) = H(Y) - H(Y|X)$

          $I(X;Y) = H(X) + H(Y) - H(X,Y)$

          $I(X;Y) = I(Y;X)$

          $I(X;X) = H(X)$

 

          All the above equalities can be proved directly using the definitions of mutual information and entropy.

          Also, there is a Venn diagram that expresses the relationship between mutual information and entropy accurately. See Figure 1 below.
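          These identities are also easy to verify numerically. The following sketch (reusing our made-up joint pmf, not one from the lecture) checks that $I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)$.

```python
import math

def H(p):
    """Entropy in bits of a flat list of probabilities (zero entries are skipped)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p_xy = [[0.4, 0.1],
        [0.2, 0.3]]
p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]

H_X, H_Y = H(p_x), H(p_y)
H_XY = H([p for row in p_xy for p in row])
I_XY = H_X + H_Y - H_XY          # I(X;Y) = H(X) + H(Y) - H(X,Y)
H_X_given_Y = H_XY - H_Y         # H(X|Y) = H(X,Y) - H(Y)

print(I_XY)                      # about 0.1245 bits
print(H_X - H_X_given_Y)         # the same value: I(X;Y) = H(X) - H(X|Y)
```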

 

          (b) Chain Rule for Entropy

 

          Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then $H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$.

          In the case of two random variables, the proof follows easily from the definitions. For three or more random variables, mathematical induction completes the proof.
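          The chain rule can also be verified numerically. In this sketch (with a made-up joint pmf over three binary variables), $H(X_1,X_2,X_3)$ is compared against $H(X_1) + H(X_2|X_1) + H(X_3|X_1,X_2)$, where the conditional entropies are computed directly from their definitions.

```python
import math

# Joint pmf over three binary variables (values chosen for illustration; they sum to 1).
p = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.08, (1, 0, 1): 0.12, (1, 1, 0): 0.18, (1, 1, 1): 0.12,
}

def marginal(keep):
    """Marginal pmf over the coordinates listed in `keep`."""
    m = {}
    for xs, pr in p.items():
        key = tuple(xs[i] for i in keep)
        m[key] = m.get(key, 0.0) + pr
    return m

def H(dist):
    """Entropy in bits of a pmf given as a dict of probabilities."""
    return -sum(pr * math.log2(pr) for pr in dist.values() if pr > 0)

p1, p12 = marginal([0]), marginal([0, 1])

# H(X2 | X1) = -sum p(x1,x2) log p(x2|x1);  H(X3 | X1,X2) analogously.
H2_given_1 = -sum(pr * math.log2(pr / p1[(x1,)]) for (x1, x2), pr in p12.items())
H3_given_12 = -sum(pr * math.log2(pr / p12[(x1, x2)]) for (x1, x2, x3), pr in p.items())

print(H(p))                                  # H(X1, X2, X3)
print(H(p1) + H2_given_1 + H3_given_12)      # the same value, as the chain rule predicts
```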

 

          (c) Chain Rule for Mutual Information

 

          Define the conditional mutual information as $I(X;Y \mid Z) = H(X|Z) - H(X \mid Y,Z)$. Then we have $I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_1)$.

 

          Proof:   $I(X_1, X_2, \ldots, X_n; Y) = H(X_1, \ldots, X_n) - H(X_1, \ldots, X_n \mid Y)$

                  $= \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) - \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1, Y)$

                  $= \sum_{i=1}^{n} \big[ H(X_i \mid X_{i-1}, \ldots, X_1) - H(X_i \mid X_{i-1}, \ldots, X_1, Y) \big]$

                  $= \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_1)$,

                  where the second line uses the chain rule for entropy on both terms.
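          The chain rule for mutual information can be checked on a concrete example. This sketch (the three-variable joint pmf is our own) verifies $I(X_1,X_2;Y) = I(X_1;Y) + I(X_2;Y \mid X_1)$, computing each mutual information directly from its definition.

```python
import math

# Joint pmf over (X1, X2, Y), all binary (values chosen for illustration; they sum to 1).
p = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.08, (1, 0, 1): 0.12, (1, 1, 0): 0.18, (1, 1, 1): 0.12,
}

def marginal(keep):
    """Marginal pmf over the coordinates listed in `keep`."""
    m = {}
    for xs, pr in p.items():
        key = tuple(xs[i] for i in keep)
        m[key] = m.get(key, 0.0) + pr
    return m

p1, py = marginal([0]), marginal([2])
p12, p1y = marginal([0, 1]), marginal([0, 2])

# I(X1, X2; Y) = sum p(x1,x2,y) log [ p(x1,x2,y) / (p(x1,x2) p(y)) ]
I_12_Y = sum(pr * math.log2(pr / (p12[(x1, x2)] * py[(y,)]))
             for (x1, x2, y), pr in p.items())

# I(X1; Y) = sum p(x1,y) log [ p(x1,y) / (p(x1) p(y)) ]
I_1_Y = sum(pr * math.log2(pr / (p1[(x1,)] * py[(y,)]))
            for (x1, y), pr in p1y.items())

# I(X2; Y | X1) = sum p(x1,x2,y) log [ p(x1) p(x1,x2,y) / (p(x1,x2) p(x1,y)) ]
I_2_Y_given_1 = sum(pr * math.log2(p1[(x1,)] * pr / (p12[(x1, x2)] * p1y[(x1, y)]))
                    for (x1, x2, y), pr in p.items())

print(I_12_Y, I_1_Y + I_2_Y_given_1)  # the two values agree, as the chain rule predicts
```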

 

 

     3.2 Information Theory Inequalities.

          (a) $H(X) \ge 0$, $H(Y|X) \ge 0$, $H(X,Y) \ge 0$.

          These three inequalities follow directly from the properties of the log function: since $0 \le p \le 1$ for every probability appearing in the sums, each term $-\log p$ is nonnegative.

 

 

          (b) $D(p\|q) \ge 0$, $I(X;Y) \ge 0$, $H(X|Y) \le H(X)$, $H(X) \le \log|\mathcal{X}|$.

          The proofs of the above inequalities need an important lemma called Jensen's inequality.

 

          Lemma 1 (Jensen's inequality) If $f$ is a convex function and $X$ is a random variable, then $Ef(X) \ge f(EX)$. Moreover, if $f$ is strictly convex, the equality implies that $X = EX$ with probability 1, i.e. $X$ is a constant.

 

          Proof: For a two-mass-point distribution, the inequality becomes $p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2)$, which follows directly from the definition of convex functions. Suppose that the theorem is true for distributions with $k-1$ mass points. Then writing $p_i' = p_i/(1-p_k)$ for $i = 1, 2, \ldots, k-1$, we have

 

           $\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1-p_k)\sum_{i=1}^{k-1} p_i' f(x_i)$

                                $\ge p_k f(x_k) + (1-p_k) f\left(\sum_{i=1}^{k-1} p_i' x_i\right)$

                                $\ge f\left(p_k x_k + (1-p_k)\sum_{i=1}^{k-1} p_i' x_i\right)$

                                $= f\left(\sum_{i=1}^{k} p_i x_i\right)$,

          hence the proof is completed.
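          A quick numerical illustration of the lemma (a sketch; the convex function $f(x) = x^2$ and the distributions are our own choices): $Ef(X) \ge f(EX)$, with equality when $X$ is a constant.

```python
# Jensen's inequality check for the convex function f(x) = x**2.
values = [1.0, 2.0, 5.0]
probs = [0.2, 0.5, 0.3]

EX = sum(p * x for p, x in zip(probs, values))
EfX = sum(p * x ** 2 for p, x in zip(probs, values))
print(EfX, EX ** 2)   # E f(X) = 9.7  >=  f(EX) = 7.29

# A degenerate (constant) random variable gives equality.
const = [3.0, 3.0, 3.0]
EX_c = sum(p * x for p, x in zip(probs, const))
EfX_c = sum(p * x ** 2 for p, x in zip(probs, const))
print(EfX_c, EX_c ** 2)   # both 9.0
```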

 

 

          First we prove $D(p\|q) \ge 0$. Let $A = \{x : p(x) > 0\}$ be the support set of $p(x)$. Then

          $-D(p\|q) = -\sum_{x\in A} p(x)\log\frac{p(x)}{q(x)}$

                            $= \sum_{x\in A} p(x)\log\frac{q(x)}{p(x)}$

                            $\le \log\sum_{x\in A} p(x)\frac{q(x)}{p(x)}$   (by Jensen's inequality, since $\log$ is concave)

                            $= \log\sum_{x\in A} q(x)$

                            $\le \log 1$

                            $= 0$.

 

          Secondly, we prove the nonnegativity of mutual information. Actually, the inequality is trivial once we see that $I(X;Y) = D\big(p(x,y)\,\|\,p(x)p(y)\big) \ge 0$.

 

          Also, since $I(X;Y) \ge 0$ and $I(X;Y) = H(X) - H(X|Y)$, we have $H(X|Y) \le H(X)$, i.e. conditioning reduces entropy.

 

          The last inequality to be proved is $H(X) \le \log|\mathcal{X}|$. Let $u(x) = \frac{1}{|\mathcal{X}|}$ be the uniform probability mass function over $\mathcal{X}$, and let $p(x)$ be the probability mass function of $X$. Then $D(p\|u) = \sum_{x} p(x)\log\frac{p(x)}{u(x)} = \log|\mathcal{X}| - H(X)$,

          hence by the nonnegativity of relative entropy, the inequality $H(X) \le \log|\mathcal{X}|$ holds.
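          This last bound can also be seen numerically. The sketch below (with our own example pmfs) checks that $D(p\|u) = \log|\mathcal{X}| - H(X) \ge 0$, with equality for the uniform distribution.

```python
import math

def entropy(p):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

for p in ([0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]):
    u = [1.0 / len(p)] * len(p)        # uniform pmf over the same alphabet
    log_alphabet = math.log2(len(p))
    # D(p || u) equals log|X| - H(X), and it is nonnegative.
    print(kl_divergence(p, u), log_alphabet - entropy(p), entropy(p) <= log_alphabet)
```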