Scribe Notes 2


Intuition behind "Entropy":

Joint Entropy: Total uncertainty of 2 or more random variables (r.v.)

Conditional Entropy: Given the knowledge of some r.v., how much we still don't know about others

Mutual Information:  Given the knowledge of some r.v., how much we do know about others

Divergence: A quantity used to measure the difference between two distributions

                  Notice: divergence is not a distance in the mathematical sense, because in general $D(p\|q) \neq D(q\|p)$.
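                  For example (a small illustrative check): take $p = \mathrm{Bern}(1/2)$ and $q = \mathrm{Bern}(1/4)$. Then, in bits, $D(p\|q) = \frac{1}{2}\log\frac{1/2}{1/4} + \frac{1}{2}\log\frac{1/2}{3/4} \approx 0.208$, while $D(q\|p) = \frac{1}{4}\log\frac{1/4}{1/2} + \frac{3}{4}\log\frac{3/4}{1/2} \approx 0.189$, so reversing the arguments changes the value.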

 

Formal Definitions:

Entropy:

  $H(X) = -\sum_{x} p(x)\log p(x)$

Joint Entropy:

  $H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\log p(x,y)$

Mutual Information:

  $I(X;Y) = \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$

Conditional Entropy:

  $H(Y|X) = -\sum_{x}\sum_{y} p(x,y)\log p(y|x)$

Divergence:

  $D(p\|q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$
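A minimal Python sketch (the joint pmf below is just an illustrative choice, not from the lecture) that evaluates each of these quantities for two binary r.v.s:

import math
# Illustrative joint pmf p(x, y) for binary X and Y (values chosen arbitrarily).
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
def H(pmf):
    # Entropy in bits: -sum p log2 p (terms with p = 0 contribute 0).
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)
# Marginal pmfs p(x) and p(y).
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}
# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x).
H_Y_given_X = -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items() if p > 0)
# Mutual information I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ].
I_XY = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)
# Divergence D(p||q) between pmfs on the same alphabet (assumes q > 0 wherever p > 0).
def D(p, q):
    return sum(pi * math.log2(pi / q[s]) for s, pi in p.items() if pi > 0)
print(H(px), H(py), H(joint), H_Y_given_X, I_XY, D(px, py))
print(abs(I_XY - (H(px) + H(py) - H(joint))) < 1e-12)  # I(X;Y) = H(X) + H(Y) - H(X,Y)

The last line checks numerically one of the identities proved below, I(X;Y) = H(X) + H(Y) - H(X,Y).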

 

Relationship between H and I (presented as a Venn diagram):

  [Venn diagram relating H(X), H(Y), H(X,Y), H(X|Y), H(Y|X) and I(X;Y); embedded slide not available.]

Copied from the textbook: Elements of Information Theory (2nd edition), by Cover and Thomas.

 

Some Equalities:

(1) 

      $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$

Pf: 

           $I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$

      $= \sum_{x,y} p(x,y)\log\frac{p(x|y)}{p(x)}$

      $= -\sum_{x,y} p(x,y)\log p(x) + \sum_{x,y} p(x,y)\log p(x|y)$

      $= -\sum_{x} p(x)\log p(x) + \sum_{x,y} p(x,y)\log p(x|y)$

      $= H(X) - \left(-\sum_{x,y} p(x,y)\log p(x|y)\right)$

      $= H(X) - H(X|Y)$

     In a similar manner, we can easily prove the remaining equalities.

 

(2) Chain Rule

      $H(X,Y) = H(X) + H(Y|X)$

Pf(1):

           $H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y)$

      $= -\sum_{x,y} p(x,y)\log\big(p(x)\,p(y|x)\big)$

      $= -\sum_{x,y} p(x,y)\log p(x) - \sum_{x,y} p(x,y)\log p(y|x)$

      $= -\sum_{x} p(x)\log p(x) - \sum_{x,y} p(x,y)\log p(y|x)$

      $= H(X) + H(Y|X)$

Pf(2):

      $\log p(X,Y) = \log p(X) + \log p(Y|X)$      //Apply log function to both sides of $p(x,y) = p(x)\,p(y|x)$

      $E[\log p(X,Y)] = E[\log p(X)] + E[\log p(Y|X)]$  //Calculate the expectation of both sides

           $E[\log p(X,Y)] = \sum_{x,y} p(x,y)\log p(x,y) = -H(X,Y)$

      $E[\log p(X)] = \sum_{x} p(x)\log p(x) = -H(X)$

           $E[\log p(Y|X)] = \sum_{x,y} p(x,y)\log p(y|x) = -H(Y|X)$

      $-H(X,Y) = -H(X) - H(Y|X)$

      $H(X,Y) = H(X) + H(Y|X)$
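The chain rule can also be verified numerically from the definitions; a short Python sketch (the joint pmf is again an arbitrary illustrative choice):

import math
# Arbitrary illustrative joint pmf p(x, y) for binary X and Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {0: 0.5, 1: 0.5}  # marginal p(x) = sum over y of p(x, y) for the pmf above
H_XY = -sum(p * math.log2(p) for p in joint.values())                        # H(X,Y)
H_X = -sum(p * math.log2(p) for p in px.values())                            # H(X)
H_Y_given_X = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items())  # H(Y|X)
print(H_XY, H_X + H_Y_given_X)  # both equal: H(X,Y) = H(X) + H(Y|X) (up to rounding)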

 

      $I(X;Y,Z) = I(X;Y) + I(X;Z|Y)$      (chain rule for mutual information)

Pf:

           $I(X;Y,Z) = H(X) - H(X|Y,Z)$

      $= H(X) - H(X|Y) + H(X|Y) - H(X|Y,Z)$

      $= \big(H(X) - H(X|Y)\big) + \big(H(X|Y) - H(X|Y,Z)\big)$

      $= I(X;Y) + I(X;Z|Y)$

 

Some Inequalities:

(1)  Definition of convexity, strict convexity, concavity, strict concavity.

      Convexity: f(x) is said to be convex on (a,b) if

                     $f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda)f(x_2)$     for all $x_1, x_2 \in (a,b)$ and $0 \le \lambda \le 1$

      Strict Convexity: f(x) is said to be strictly convex on (a,b) if

                             $f(\lambda x_1 + (1-\lambda)x_2) < \lambda f(x_1) + (1-\lambda)f(x_2)$     for all $x_1 \ne x_2 \in (a,b)$ and $0 < \lambda < 1$

      The definitions of concavity and strict concavity are similar to the two above; just reverse the inequality signs.

      Geometric meaning of convexity: the line segment between $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lies above the curve of f(x) between $x_1$ and $x_2$.

      Generalizing the idea:  $E[f(X)] \ge f(E[X])$ for a convex function f (Jensen's inequality)

                                     For a strictly convex function, $E[f(X)] = f(E[X])$ if and only if X is constant, i.e. X takes a single value with probability 1.
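      For example: take the convex function $f(x) = x^2$ and let X be uniform on {0, 2}. Then $E[f(X)] = (0 + 4)/2 = 2$ while $f(E[X]) = f(1) = 1$, so $E[f(X)] \ge f(E[X])$ holds strictly; since f is strictly convex, equality would require X to be constant.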

(2) 

      $D(p\|q) \ge 0$

Pf:

           $-D(p\|q) = -\sum_{x} p(x)\log\frac{p(x)}{q(x)}$

      $= \sum_{x} p(x)\log\frac{q(x)}{p(x)}$

      $= E\left[\log\frac{q(X)}{p(X)}\right]$

      $\le \log E\left[\frac{q(X)}{p(X)}\right]$          // log is a concave function (Jensen's inequality)

      $= \log \sum_{x} p(x)\frac{q(x)}{p(x)}$

      $= \log \sum_{x} q(x)$

      $= \log 1 = 0$

      Therefore, $D(p\|q) \ge 0$, with "=" if and only if p and q are the same distribution.

      Moreover, since $I(X;Y) = D\big(p(x,y)\,\|\,p(x)p(y)\big)$, we also get $I(X;Y) \ge 0$, with "=" if and only if X and Y are independent.
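A small numerical check of the non-negativity of D, and of the asymmetry noted in the intuition section (a Python sketch; the pmfs are arbitrary test values):

import math
import random
def D(p, q):
    # KL divergence in bits; assumes q(x) > 0 wherever p(x) > 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
def random_pmf(n):
    # A random pmf on an alphabet of size n, for testing only.
    w = [random.random() for _ in range(n)]
    return [x / sum(w) for x in w]
random.seed(0)
for _ in range(5):
    p, q = random_pmf(4), random_pmf(4)
    assert D(p, q) >= 0 and D(q, p) >= 0           # non-negativity
    print(round(D(p, q), 4), round(D(q, p), 4))    # the two directions generally differ
print(D([0.3, 0.7], [0.3, 0.7]))                   # 0.0: equality exactly when p = q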

(3)

      $H(X|Y) \le H(X)$ (Conditioning reduces entropy)

Pf:

      $I(X;Y) = H(X) - H(X|Y)$

      Since $I(X;Y) \ge 0$, we directly draw the conclusion $H(X|Y) \le H(X)$, with "=" if and only if X and Y are independent.
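      For example, if X is a fair coin and Y = X (we observe X exactly), then $H(X|Y) = 0 \le H(X) = 1$ bit; if instead Y is independent of X, then $H(X|Y) = H(X)$ and equality holds.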

(4)

      Let $X \to Y \to Z$ be a Markov chain (i.e. given Y, X and Z are independent). Then $I(X;Y) \ge I(X;Z)$. (Data processing inequality)

Pf:

      $I(X;Y,Z) = I(X;Z) + I(X;Y|Z)$

                           $= I(X;Y) + I(X;Z|Y)$      (expanding $I(X;Y,Z)$ by the chain rule in both orders)

      By definition of the Markov chain, $I(X;Z|Y) = 0$.

      Also, $I(X;Y|Z) \ge 0$.

      So, $I(X;Y) \ge I(X;Z)$, as desired. "=" if and only if $I(X;Y|Z) = 0$, i.e. $X \to Z \to Y$ also forms a Markov chain.  Q.E.D

     How to understand this inequality?

          Consider X as the source of information; after some processing of X we obtain Y. Finally, we try to reconstruct X from Y, and the reconstruction we obtain is Z.

          The inequality simply tells us that X is at least as closely related to Y as it is to Z (since Y is obtained directly from X): further processing of Y can never increase the information it carries about X.
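          To make the inequality concrete, here is a small Python sketch (the binary symmetric channels and the crossover probability 0.1 are arbitrary illustrative choices): a uniform bit X is passed through one BSC to get Y, and Y through a second BSC to get Z, so $X \to Y \to Z$ is a Markov chain; the exact values of I(X;Y) and I(X;Z) are then computed from the joint distributions.

import math
def mutual_information(joint):
    # I in bits, computed from a dict {(a, b): probability}.
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)
def bsc(input_dist, eps):
    # Joint (input, output) pmf of a binary symmetric channel with crossover probability eps.
    joint = {}
    for x, px in input_dist.items():
        joint[(x, x)] = joint.get((x, x), 0) + px * (1 - eps)
        joint[(x, 1 - x)] = joint.get((x, 1 - x), 0) + px * eps
    return joint
# X uniform; Y = BSC(0.1) applied to X; Z = BSC(0.1) applied to Y.
joint_XY = bsc({0: 0.5, 1: 0.5}, 0.1)
joint_XZ = {}
for (x, y), pxy in joint_XY.items():
    for z, pz_given_y in ((y, 0.9), (1 - y, 0.1)):   # p(z|y) for the second BSC(0.1)
        joint_XZ[(x, z)] = joint_XZ.get((x, z), 0) + pxy * pz_given_y
print(mutual_information(joint_XY), mutual_information(joint_XZ))  # I(X;Y) >= I(X;Z)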

Comments (5)

Lingxiao XIA said

at 6:08 am on Feb 2, 2009

I don't really understand the Markov chain for random variables.

Is it a Markov chain for one r.v., called a distribution?

Silas said

at 9:12 am on Feb 2, 2009

Dear Lingxiao,

I think you should view a Markov chain as a joint distribution of 3 random variables. This is because the definition of a Markov chain is stated in terms of (conditional) independence of random variables, and independence can only be defined once the joint distribution is specified.

Best Regards,
Silas

sidjaggi said

at 10:42 am on Feb 2, 2009

Good answer, Silas.

Here's another question for everyone -- in everything we've done so far, we implicitly assume that it's a 1-Markov process -- i.e., if X <-> Y <-> Z, then Z and X are conditionally independent given Y.

How would you, first of all, define 2-Markov process? How about k-Markov processes? How would the formal definitions change? How about the data processing inequality?

Not that we have any use for such definitions in the near future (or perhaps at all in this course), but it's an interesting mental exercise to think about :)

Silas said

at 10:31 am on Feb 4, 2009

The proof for D(p||q) >= 0 in this scribe note is not rigorous enough, because q/p is undefined when p is zero. The proof of this theorem on p.28 of Thomas Cover's book is written in a more rigorous way.

By the way, I think this scribe note is well organised and is easy to read.

MEI Yuchen said

at 3:15 pm on Feb 4, 2009

Thank you for your positive review, Silas.

About the proof of non-negativity of KL divergence: the proof in the textbook is more rigorous than mine.
But we can view it from another angle: if p(x) = 0 for some x, the original definition of D(p||q) is not "well defined", since log(0) is undefined.
I think we have the convention that p*log(anything)=0 if p=0.
Again, thank you for your comment!
