Entropy: Definitions and Properties (Scribe by Zirui)
1. Overview
In the previous lecture we introduced a simple coding theorem, and we saw that many of its definitions and properties are tied to the entropy function $H(X)$. In this lecture, fundamental quantities and tools are introduced and rigorously defined, which is a prerequisite for the further development of information theory. Besides entropy, we introduce relative entropy, joint entropy, conditional entropy and mutual information. After defining entropy and mutual information, we establish chain rules, the nonnegativity of mutual information, the data-processing inequality, Jensen's inequality and Fano's inequality.
Keywords: entropy, relative entropy, mutual information, chain rules.
2. Definitions
2.1 Entropy
Def. 1 The entropy $H(X)$ of a discrete random variable $X$ with probability mass function $p(x)$ is defined by $H(X) = -\sum_{x} p(x) \log p(x)$, where the logarithm is taken to base 2, so entropy is measured in bits.
In the previous lecture, we saw that this entropy has many properties that agree with the intuitive notion of what a measure of information should be. More precisely, $nH(X)$ is the leading term in the exponent of the size of the strongly typical set. Also, entropy is a function only of the distribution of $X$ -- it describes the uncertainty of that random variable.
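For example, if $X \sim \mathrm{Bernoulli}(p)$, then $H(X) = -p \log p - (1-p) \log(1-p)$: this equals $1$ bit when $p = 1/2$ and tends to $0$ as $p \to 0$ or $p \to 1$, matching the intuition that a fair coin is maximally uncertain while a nearly deterministic one carries almost no information.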
2.2 Joint Entropy
Def. 2 The joint entropy $H(X, Y)$ of a pair of discrete random variables $(X, Y)$ with a joint distribution $p(x, y)$ is defined as $H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y)$.
Intuitively, the jointly $\epsilon$-strongly typical set could be defined as $T_\epsilon^{(n)}(X, Y) = \left\{ (x^n, y^n) : \left| \tfrac{1}{n} N(a, b \mid x^n, y^n) - p(a, b) \right| \le \epsilon \ \text{for all } (a, b) \right\}$, where $N(a, b \mid x^n, y^n)$ denotes the number of indices $i$ with $(x_i, y_i) = (a, b)$.
Claim 1 $\left| T_\epsilon^{(n)}(X, Y) \right| \approx 2^{nH(X, Y)}$, in the sense that $\tfrac{1}{n} \log \left| T_\epsilon^{(n)}(X, Y) \right| \to H(X, Y)$ as $n \to \infty$ and $\epsilon \to 0$.
The proof of this claim is the same as the proof in the previous lecture that $\left| T_\epsilon^{(n)}(X) \right| \approx 2^{nH(X)}$, and we give a similar argument below when proving the corresponding claim about conditional entropy.
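As a small worked example (this particular distribution is chosen only for illustration), take binary $X$ and $Y$ with $p(0,0) = 1/2$, $p(0,1) = 1/4$, $p(1,0) = p(1,1) = 1/8$. Then $H(X, Y) = \tfrac{1}{2}\log 2 + \tfrac{1}{4}\log 4 + \tfrac{1}{8}\log 8 + \tfrac{1}{8}\log 8 = \tfrac{7}{4}$ bits.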
2.3 Conditional Entropy
Def. 3 If $(X, Y) \sim p(x, y)$, the conditional entropy $H(Y \mid X)$ is defined as $H(Y \mid X) = \sum_{x} p(x) H(Y \mid X = x) = -\sum_{x}\sum_{y} p(x, y) \log p(y \mid x)$.
Intuitively, given any strongly typical $x^n$, define $T_\epsilon^{(n)}(Y \mid x^n) = \left\{ y^n : (x^n, y^n) \in T_\epsilon^{(n)}(X, Y) \right\}$ as the strongly conditional typical set.
Claim 2 $\left| T_\epsilon^{(n)}(Y \mid x^n) \right| \approx 2^{nH(Y \mid X)}$ for every strongly typical $x^n$.
Proof: For simplicity, we only consider the case where $X$ and $Y$ each take two values, say 0 and 1. Assume that the marginal distribution of $X$ is $(p_X(0), p_X(1))$, so a strongly typical $x^n$ contains approximately $n p_X(0)$ zeros and $n p_X(1)$ ones. A sequence $y^n$ is strongly conditionally typical exactly when, among the positions where $x_i = a$, approximately $n p(a, b)$ of them have $y_i = b$. Then the number of strongly conditionally typical sequences should be
$\left| T_\epsilon^{(n)}(Y \mid x^n) \right| \approx \binom{n p_X(0)}{n p(0, 0)} \binom{n p_X(1)}{n p(1, 0)}$.
Using Stirling's formula and the proposition that $\binom{n}{k} = 2^{n\left(H_2(k/n) + o(1)\right)}$, where $H_2$ denotes the binary entropy function, we reach the following equation:
$\frac{1}{n} \log \left| T_\epsilon^{(n)}(Y \mid x^n) \right| \approx p_X(0)\, H_2\!\left(p(0 \mid 0)\right) + p_X(1)\, H_2\!\left(p(0 \mid 1)\right) = \sum_{x} p_X(x)\, H(Y \mid X = x)$.
From the definition we know that the last term above is just $H(Y \mid X)$, hence we have completed the proof.
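For the two-by-two example above, the marginal of $X$ is $(3/4, 1/4)$ and the conditional distributions of $Y$ are $p(y \mid X = 0) = (2/3, 1/3)$ and $p(y \mid X = 1) = (1/2, 1/2)$, so $H(Y \mid X) = \tfrac{3}{4} H_2(1/3) + \tfrac{1}{4} H_2(1/2) \approx 0.939$ bits. Note that $H(X) + H(Y \mid X) \approx 0.811 + 0.939 = 1.75 = H(X, Y)$ bits, anticipating the chain rule established in Section 3.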
2.4 Relative Entropy
Def. 4 The relative entropy or Kullback-Leibler distance between two probability mass functions $p(x)$ and $q(x)$ is defined as $D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$, with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$.
Intuitively, the relative entropy is a measure of the distance between two distributions. However, it is not a true distance, because it is not symmetric. Later we will show that relative entropy is nonnegative.
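For instance, with $p = (1/2, 1/2)$ and $q = (1/4, 3/4)$ on $\{0, 1\}$, we get $D(p \| q) = \tfrac{1}{2}\log 2 + \tfrac{1}{2}\log\tfrac{2}{3} \approx 0.208$ bits, while $D(q \| p) = \tfrac{1}{4}\log\tfrac{1}{2} + \tfrac{3}{4}\log\tfrac{3}{2} \approx 0.189$ bits, so in general $D(p \| q) \neq D(q \| p)$.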
2.5 Mutual Information
Def. 5 Consider two random variables $X$ and $Y$ with a joint probability mass function $p(x, y)$ and marginal probability mass functions $p(x)$ and $p(y)$. The mutual information $I(X; Y)$ is the relative entropy between the joint distribution and the product distribution:
$I(X; Y) = D\big(p(x, y) \,\|\, p(x) p(y)\big) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$.
Mutual information is a measure of the amount of information that one random variable contains about another random variable. It is the reduction in the uncertainty of one random variable due to the knowledge of the other. We will gain a deeper understanding after establishing the properties below.
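Two extreme cases illustrate this: if $Y = X$ with $X \sim \mathrm{Bernoulli}(1/2)$, then $I(X; Y) = H(X) = 1$ bit, since knowing $Y$ removes all uncertainty about $X$; if $X$ and $Y$ are independent, then $p(x, y) = p(x) p(y)$ and $I(X; Y) = 0$.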
3. Properties and Chain Rules
3.1 Information Theory Equalities
(a) Equalities of mutual information:
$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$, together with $I(X; Y) = I(Y; X)$ and $I(X; X) = H(X)$.
All the above equalities can be proved directly using the definition of mutual information and entropy.
Also, the relationship between mutual information and entropy can be expressed accurately in a Venn diagram. See Figure 1 below.
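As a quick numerical sanity check (a minimal Python sketch, not part of the lecture; the joint pmf below is the illustrative two-by-two example used earlier), the equalities can be verified directly:

```python
from math import log2

# Illustrative two-by-two joint pmf p(x, y) for binary X and Y
# (an assumed example distribution, not taken from the lecture).
p_xy = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}

def H(dist):
    """Entropy in bits of a pmf given as a dict {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginal pmfs p(x) and p(y).
p_x = {x: sum(p for (a, _), p in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in p_xy.items() if b == y) for y in (0, 1)}

H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy)

# H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), where p(y|x) = p(x,y)/p(x).
H_Y_given_X = -sum(p * log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

# I(X;Y) = D( p(x,y) || p(x) p(y) ).
I_XY = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

# The equalities I(X;Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) hold numerically.
assert abs(I_XY - (H_Y - H_Y_given_X)) < 1e-9
assert abs(I_XY - (H_X + H_Y - H_XY)) < 1e-9
print(round(H_X, 3), round(H_Y, 3), round(H_XY, 3), round(H_Y_given_X, 3), round(I_XY, 3))
# -> 0.811 0.954 1.75 0.939 0.016
```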
(b) Chain Rule for Entropy
Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then $H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$.
For the case of two random variables, the chain rule follows directly from the definitions, as written out below; for three or more random variables, mathematical induction completes the proof.
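In the two-random-variable case, writing $p(x, y) = p(x)\, p(y \mid x)$ gives
$H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y) = -\sum_{x}\sum_{y} p(x, y) \log p(x) - \sum_{x}\sum_{y} p(x, y) \log p(y \mid x) = H(X) + H(Y \mid X)$.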
(c) Chain Rule for Mutual Information
Define the conditional mutual information of $X$ and $Y$ given $Z$ as $I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$. Then we have $I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_1)$.
Proof: $I(X_1, \ldots, X_n; Y) = H(X_1, \ldots, X_n) - H(X_1, \ldots, X_n \mid Y) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) - \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1, Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_1)$, where the middle equality applies the chain rule for entropy to both terms.
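In particular, for two variables the chain rule reads $I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$.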
3.2 Information Theory Inequalities
(a) $H(X) \ge 0$, $H(X, Y) \ge 0$, $H(Y \mid X) \ge 0$.
These three inequalities follow directly from the properties of the log function.
(b) $D(p \| q) \ge 0$, $I(X; Y) \ge 0$, $H(X \mid Y) \le H(X)$, $H(X) \le \log |\mathcal{X}|$.
The proofs of the above inequalities rely on an important lemma called Jensen's inequality.
Lemma 1 (Jensen's inequality) If $f$ is a convex function and $X$ is a random variable, then $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$. Moreover, if $f$ is strictly convex, then equality implies that $X = \mathbb{E}[X]$ with probability 1, i.e. $X$ is a constant.
Proof: We prove this for discrete distributions by induction on the number of mass points. For a two-mass-point distribution, the inequality becomes $p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2)$, which follows directly from the definition of convex functions. Suppose that the theorem is true for distributions with $k - 1$ mass points. Then writing $p_i' = p_i / (1 - p_k)$ for $i = 1, 2, \ldots, k - 1$, we have
$\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i) \ge p_k f(x_k) + (1 - p_k) f\!\left( \sum_{i=1}^{k-1} p_i' x_i \right) \ge f\!\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i \right) = f\!\left( \sum_{i=1}^{k} p_i x_i \right)$,
hence the proof is completed.
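For example, applying the lemma to the strictly convex function $f(t) = t^2$ gives $\mathbb{E}[X^2] \ge (\mathbb{E}[X])^2$, i.e. $\mathrm{Var}(X) \ge 0$, with equality if and only if $X$ is constant.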
First we prove $D(p \| q) \ge 0$. Let $A = \{x : p(x) > 0\}$ be the support set of $p(x)$. Then
$-D(p \| q) = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \le \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} = \log \sum_{x \in A} q(x) \le \log 1 = 0$,
where the first inequality follows from Jensen's inequality applied to the concave function $\log$.
Secondly, we prove the nonnegativity of mutual information. Actually, the inequality is trivial once we see that $I(X; Y) = D\big(p(x, y) \,\|\, p(x) p(y)\big) \ge 0$.
Also, as $I(X; Y) = H(X) - H(X \mid Y)$ holds, and $I(X; Y) \ge 0$, we have $H(X \mid Y) \le H(X)$, i.e. conditioning reduces entropy.
The last inequality to be proved is $H(X) \le \log |\mathcal{X}|$. Let $u(x) = \frac{1}{|\mathcal{X}|}$ be the uniform probability mass function over $\mathcal{X}$, and let $p(x)$ be the probability mass function of $X$. Then
$D(p \| u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X)$,
hence by the nonnegativity of relative entropy, the inequality $H(X) \le \log |\mathcal{X}|$ holds.
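For the two-by-two example used earlier, these bounds can be checked directly: $H(X \mid Y) = H(X, Y) - H(Y) \approx 1.75 - 0.954 = 0.796 \le 0.811 \approx H(X)$, and $H(X) \approx 0.811 \le \log 2 = 1$.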