
# Scribe Note 5_1


# 5.1 Entropy rate of a process

Key concepts: Non i.i.d. processes, entropy rates, Markov sources, stationary distributions, existence and uniqueness of stationary distributions of Markov processes, convergence to stationary distribution, second law of thermodynamics, compression of non i.i.d. sources.

In previous discussions we looked at the entropy of a single random variable and of a finite number of random variables. Today we discuss the entropy of a stochastic process, which is a sequence of random variables. In particular, our focus will be on stationary (time-homogeneous/time-invariant) Markov chains with a finite state space. We begin by introducing the definitions of a stationary process, a Markov chain, and a stationary distribution. Then we discuss the relationship with the second law of thermodynamics, and finish with a brief description of how to compress such a process using Huffman codes.

A stochastic process $\{X_i\}_{i \ge 1}$ is a sequence of random variables. It is said to be stationary if the distribution of any subset of the sequence of random variables is invariant under shifts of time. That is,

$$\Pr(X_1 = x_1, \ldots, X_n = x_n) = \Pr(X_{1+l} = x_1, \ldots, X_{n+l} = x_n)$$

for any $n$ and $l$, and ALL $x_1, \ldots, x_n \in \mathcal{X}$.

Let us recall the definition of a Markov chain. A Markov chain is a stochastic process with the property that "given the present, the future is independent of the past", that is,

$$\Pr(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n).$$

In our discussion, we will be focusing on stationary (also called time-homogeneous/time-invariant) Markov chains, which satisfy the additional constraint

$$\Pr(X_{n+1} = b \mid X_n = a) = \Pr(X_2 = b \mid X_1 = a)$$

for all $n$ and all states $a, b$. We also define the transition matrix $P$ of such a Markov chain by $P_{ab} = \Pr(X_2 = b \mid X_1 = a)$.

A stationary distribution is a probability distribution $\mu$ on the states that is the same at any time $n$, that is,

$$\Pr(X_n = a) = \mu_a$$

for all $n$. In particular, a distribution $\mu$ is stationary for a Markov chain with transition matrix $P$ if $\mu P = \mu$ (viewing $\mu$ as a row vector), i.e. $\mu_b = \sum_a \mu_a P_{ab}$ for every state $b$.

Remark: A stationary distribution is NOT the same as a stationary process.
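As a concrete illustration, a stationary distribution can be computed numerically as a left eigenvector of $P$ with eigenvalue 1. A minimal sketch, using a made-up 3-state transition matrix rather than one from the lecture:

```python
import numpy as np

# A made-up 3-state transition matrix (rows sum to 1), purely for
# illustration; it is not an example from the lecture.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4]])

# A stationary distribution satisfies mu P = mu (as a row vector), so mu
# is a left eigenvector of P with eigenvalue 1, i.e. an eigenvector of
# P transposed.
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
mu /= mu.sum()                      # normalize to a probability vector

print(mu)
print(np.allclose(mu @ P, mu))      # True: mu is stationary
```

Since all entries of this $P$ are positive, the chain is irreducible and aperiodic, so this is its unique stationary distribution.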

One might immediately ask whether a stationary distribution always exists and whether it is unique. For a Markov chain with a finite state space, one can show that at least one stationary distribution always exists. However, it may not be unique, as the following example illustrates.

Example 1: Consider a Markov chain whose transition matrix is the identity matrix $I$. Then every distribution is stationary, so the Markov chain does not have a unique stationary distribution.

Given the transition matrix of a Markov chain, we can always view it as a weighted directed graph $G$. In this graph, there is an edge from node $a$ to node $b$ with weight $P_{ab}$ if and only if $P_{ab} > 0$. We say a Markov chain is irreducible if its graph $G$ is strongly connected, that is, for every pair of nodes $a$ and $b$, there exists a directed path from $a$ to $b$. Moreover, we call a Markov chain aperiodic if, for each state, the greatest common divisor of the lengths of the paths from that state back to itself is 1.

Proposition: If a finite stationary Markov chain with transition matrix $P$ is irreducible and aperiodic, then it has a unique stationary distribution $\mu$. Moreover, $\mu_1 P^n \to \mu$ for any initial distribution $\mu_1$.

While we are not going to give the proof here, we consider what happens if either one of the properties is missing. Suppose a Markov chain is not irreducible. Then the example above shows that it is possible to have more than one stationary distribution. Now consider a Markov chain with transition matrix

$$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix};$$

this chain is irreducible but not aperiodic. It is easy to see that if $\mu_1 = (1, 0)$, then $\mu_n = (1, 0)$ when $n$ is odd and $\mu_n = (0, 1)$ when $n$ is even. Hence it does not tend towards the stationary distribution $(1/2, 1/2)$.
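This oscillation is easy to check numerically. A short sketch using the two-state chain that deterministically swaps the states:

```python
import numpy as np

# Two-state chain that deterministically swaps states: irreducible,
# but periodic with period 2.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

mus = [np.array([1.0, 0.0])]       # start deterministically in state 0
for _ in range(5):
    mus.append(mus[-1] @ P)

for n, mu in enumerate(mus, start=1):
    print(n, mu)                   # alternates between (1, 0) and (0, 1)
```

The distribution never settles: the unique stationary distribution $(1/2, 1/2)$ is a fixed point but not a limit from this starting point.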

We are now ready to define the entropy of a stochastic process. One attempt would be $\lim_{n \to \infty} H(X_n)$, if the limit exists. However, this definition does not capture the dependency between the variables. Consider instead the following definition.

Definition (Entropy rate of a stochastic process): The entropy rate of a stochastic process $\{X_i\}$ is defined by

$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n),$$

if the limit exists.

At first sight, although this definition captures the dependency of the random variables, the right-hand side does not seem easy to compute compared to the first attempt. We now show that for a stationary Markov chain this is not the case. By the chain rule and the Markov property, one has

$$\frac{1}{n} H(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) = \frac{1}{n} H(X_1) + \frac{n-1}{n} H(X_2 \mid X_1) \longrightarrow H(X_2 \mid X_1).$$

Therefore, in order to compute the entropy rate of a stationary Markov process, we can simply find a stationary distribution $\mu$ and compute

$$H(X_2 \mid X_1) = -\sum_a \mu_a \sum_b P_{ab} \log P_{ab}.$$
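This recipe — find $\mu$, then average the entropies of the rows of $P$ under $\mu$ — is a few lines of NumPy. The two-state chain below is an arbitrary example, not one from the notes:

```python
import numpy as np

# Entropy rate of a time-invariant Markov chain, computed as
# H(X_2 | X_1) = -sum_a mu_a sum_b P_ab log2 P_ab, with X_1 ~ mu.
# The two-state chain below is an arbitrary example.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
mu /= mu.sum()                      # for this P, mu = (0.8, 0.2)

# Treat 0 log 0 as 0 by masking zero entries of P.
logP = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
H_rate = -np.sum(mu[:, None] * P * logP)
print(H_rate)                       # entropy rate in bits per symbol
```

Note that the rate weights each row's conditional entropy by how often the chain visits that state, which is why a rarely visited "noisy" state contributes little.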

Let $\mu$ be the unique stationary distribution of a Markov chain, and let $\mu_n$ be the distribution of the $n$th random variable in the chain. We now show that $D(\mu_n \| \mu)$ decreases to $0$ as $n$ tends to infinity. More precisely, we show that $D(\mu_{n+1} \| \mu) \le D(\mu_n \| \mu)$ and $D(\mu \| \mu_{n+1}) \le D(\mu \| \mu_n)$.

Proof: We prove both inequalities by first showing a more general inequality. Recall that the chain rule for relative entropy says that given two joint probability distributions $p(x, y)$ and $q(x, y)$, one has

$$D(p(x, y) \,\|\, q(x, y)) = D(p(x) \,\|\, q(x)) + D(p(y \mid x) \,\|\, q(y \mid x)), \text{ and}$$

$$D(p(x, y) \,\|\, q(x, y)) = D(p(y) \,\|\, q(y)) + D(p(x \mid y) \,\|\, q(x \mid y)).$$

Note that if $p(x, y)$ and $q(x, y)$ are joint distributions of two consecutive states of the same Markov chain (started from different initial distributions), then $p(y \mid x) = q(y \mid x) = P_{xy}$, where $P$ is the transition matrix of the chain. Hence $D(p(y \mid x) \,\|\, q(y \mid x)) = 0$. Previously we also showed that $D(p \,\|\, q) \ge 0$ for any distributions $p$ and $q$. Therefore we have

$$D(p(x) \,\|\, q(x)) = D(p(x, y) \,\|\, q(x, y)) = D(p(y) \,\|\, q(y)) + D(p(x \mid y) \,\|\, q(x \mid y)) \ge D(p(y) \,\|\, q(y)).$$

The first inequality then follows from replacing $p(x)$ and $q(x)$ by $\mu_n$ and $\mu$ respectively: since $\mu$ is a stationary distribution, $q(y) = \mu$ as well, so $D(\mu_n \| \mu) \ge D(\mu_{n+1} \| \mu)$. The second inequality is proved similarly.

We have just shown that the relative entropy to the stationary distribution decreases with $n$. However, this does NOT imply that the entropy $H(\mu_n)$ always increases or always decreases. This can be seen from the fact that the entropy of a random variable is maximized when its distribution is uniform. If the stationary distribution is non-uniform and we choose the uniform distribution as our initial distribution, then the entropy must decrease. Likewise, if the stationary distribution is uniform, then its entropy is maximal, and there is no way to start with a distribution of higher entropy.
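A quick simulation makes the decreasing case concrete. The chain below is made up; its stationary distribution works out to $(0.8, 0.2)$, so starting from the uniform distribution the entropy can only fall:

```python
import numpy as np

def entropy(dist):
    """Shannon entropy in bits, with 0 log 0 = 0."""
    dist = dist[dist > 0]
    return -np.sum(dist * np.log2(dist))

# Made-up chain whose stationary distribution is (0.8, 0.2).  Starting
# from the uniform distribution, which maximizes entropy, H(mu_n) can
# only fall as mu_n converges to the non-uniform stationary distribution.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
mu = np.array([0.5, 0.5])

ents = []
for _ in range(20):
    ents.append(float(entropy(mu)))
    mu = mu @ P

print(ents[0], ents[-1])  # starts at 1.0 bit, decreases toward H((0.8, 0.2))
```

Swapping in a doubly stochastic $P$ (uniform stationary distribution) would show the opposite: entropy rising toward its maximum.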

Therefore, one possible implicit assumption in the second law of thermodynamics, which states that the entropy of an isolated system never decreases, is that the stationary distribution of the system is uniform.

We finish this section by considering how to compress a Markov source with Huffman coding. One way is to do Huffman coding according to the stationary distribution of the Markov chain, since this is the limiting distribution of the process (assuming the chain is irreducible and aperiodic). Another way is to do Huffman coding according to the transition probabilities of the process, using a separate codebook for each current state.
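The second approach can be sketched as follows: build one Huffman codebook per state from that state's row of transition probabilities, then encode each symbol with the codebook of the symbol that preceded it. The three-symbol source below and the `huffman_code` helper are our own illustrations, not material from the notes:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman codebook {symbol: bitstring} for a distribution."""
    tie = count()   # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

# Hypothetical three-symbol Markov source: P[a][b] = Pr(next = b | current = a).
P = {"a": {"a": 0.8, "b": 0.1, "c": 0.1},
     "b": {"a": 0.3, "b": 0.4, "c": 0.3},
     "c": {"a": 0.2, "b": 0.3, "c": 0.5}}

# One codebook per state: each symbol is encoded with the Huffman code
# for the transition distribution out of the previous symbol.
codebooks = {s: huffman_code(P[s]) for s in P}

def encode(seq):
    bits = ""
    for prev, cur in zip(seq, seq[1:]):   # first symbol sent separately
        bits += codebooks[prev][cur]
    return bits

print(encode("aaabca"))
```

Coding per transition achieves roughly the entropy rate $H(X_2 \mid X_1)$ per symbol, whereas coding with a single codebook for the stationary distribution only approaches the generally larger $H(X_1)$.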

# 5.3 Weak typicality

Key concepts: Weak typicality, definition, comparison with strong typicality, properties (size, probability), weak law of large numbers

We first recall the definition of convergence in probability from mathematical analysis, then state the weak law of large numbers. After that we introduce weak typicality and finish the section by briefly comparing weak and strong typicality.

Definition (Convergence in probability): A sequence of random variables $\{Y_n\}$ is said to converge to $Y$ in probability (denoted by $Y_n \xrightarrow{p} Y$) if for any $\epsilon > 0$ and $\delta > 0$, there exists an $N$ such that for all $n > N$, $\Pr(|Y_n - Y| > \epsilon) < \delta$.

Theorem (Weak law of large numbers): Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables with expected value $\mu$. Then $\frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{p} \mu$.

Definition (ε-weakly typical set): Let $X_1, \ldots, X_n$ be $n$ i.i.d. random variables with distribution $p$. We define the set

$$A_\epsilon^{(n)} = \left\{ (x_1, \ldots, x_n) : \left| -\frac{1}{n} \log p(x_1, \ldots, x_n) - H(X) \right| \le \epsilon \right\}$$

to be its ε-weakly typical set.
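To make the definition concrete, here is a small membership check, using an arbitrary three-symbol distribution of our own choosing: a long i.i.d. sample almost always lands in the weakly typical set, while an atypical sequence does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary three-symbol i.i.d. source, purely for illustration.
p = np.array([0.5, 0.3, 0.2])
H = -np.sum(p * np.log2(p))                 # H(X) in bits

def is_weakly_typical(x, eps):
    """Check |-(1/n) log2 p(x_1..x_n) - H| <= eps for a sequence x."""
    logp = np.sum(np.log2(p[x]))            # log of a product = sum of logs
    return bool(abs(-logp / len(x) - H) <= eps)

n, eps = 10_000, 0.05
x = rng.choice(3, size=n, p=p)
print(is_weakly_typical(x, eps))            # True with high probability
```

By contrast, a run of ten thousand copies of the rarest symbol has per-symbol log-probability $-\log_2 0.2 \approx 2.32$, far from $H(X) \approx 1.49$, so it is not typical.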

Now we show that weakly typical sets have some of the good properties that we used for strongly typical sets.

Proposition: For a sequence of i.i.d. random variables $X_1, \ldots, X_n \sim p$, $-\frac{1}{n} \log p(X_1, \ldots, X_n) \xrightarrow{p} H(X)$.

Proof.

$$-\frac{1}{n} \log p(X_1, \ldots, X_n) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i) \xrightarrow{p} \mathbb{E}[-\log p(X)] = H(X),$$

where the first equality follows from the independence of the variables, the convergence in probability follows from the weak law of large numbers, and the last equality follows from the definition of entropy.

Proposition: For all $\delta > 0$, there exists $N$ such that $\Pr(A_\epsilon^{(n)}) > 1 - \delta$ for all $n > N$.

Proof. By the previous proposition, $-\frac{1}{n} \log p(X_1, \ldots, X_n) \xrightarrow{p} H(X)$. Hence for any $\delta > 0$, there exists $N$ such that for all $n > N$, $\Pr\left( \left| -\frac{1}{n} \log p(X_1, \ldots, X_n) - H(X) \right| > \epsilon \right) < \delta$; equivalently, $\Pr\left( \left| -\frac{1}{n} \log p(X_1, \ldots, X_n) - H(X) \right| \le \epsilon \right) > 1 - \delta$. Therefore, $\Pr(A_\epsilon^{(n)}) > 1 - \delta$.

Proposition: $(1 - \delta)\, 2^{n(H(X) - \epsilon)} \le |A_\epsilon^{(n)}| \le 2^{n(H(X) + \epsilon)}$ for $n$ sufficiently large.

Proof. For each $x^n \in A_\epsilon^{(n)}$, $2^{-n(H(X) + \epsilon)} \le p(x^n) \le 2^{-n(H(X) - \epsilon)}$. There are $|A_\epsilon^{(n)}|$ typical sequences. Hence one has

$$1 \ge \Pr(A_\epsilon^{(n)}) \ge |A_\epsilon^{(n)}|\, 2^{-n(H(X) + \epsilon)},$$

and it follows that $|A_\epsilon^{(n)}| \le 2^{n(H(X) + \epsilon)}$. Similarly, by the previous proposition, for $n$ sufficiently large $1 - \delta \le \Pr(A_\epsilon^{(n)}) \le |A_\epsilon^{(n)}|\, 2^{-n(H(X) - \epsilon)}$, which gives the lower bound.
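Both size bounds can be verified by brute force for a small binary source. All the parameters below ($p$, $n$, $\epsilon$) are arbitrary choices for illustration:

```python
import itertools
import numpy as np

# Brute-force the eps-weakly typical set of a binary i.i.d. source for a
# small n, then check the size bounds against it.
p = np.array([0.7, 0.3])
H = -np.sum(p * np.log2(p))
n, eps = 12, 0.1

A = []          # the typical sequences
prob_A = 0.0    # their total probability
for x in itertools.product([0, 1], repeat=n):
    logp = float(sum(np.log2(p[s]) for s in x))
    if abs(-logp / n - H) <= eps:
        A.append(x)
        prob_A += 2.0 ** logp       # p(x) = 2^{log2 p(x)}

print(len(A), prob_A)
# Upper bound |A| <= 2^{n(H+eps)} holds for every n; the lower bound
# |A| >= Pr(A) * 2^{n(H-eps)} holds since each typical x has
# p(x) <= 2^{-n(H-eps)}.
```

Note how small the typical set is relative to all $2^{12} = 4096$ sequences, even though it carries most of the probability as $n$ grows.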

Recall that the ε-strongly typical set is defined to be $T_\epsilon^{(n)} = \left\{ x^n : \left| \frac{N(a; x^n)}{n} - p(a) \right| \le \epsilon \text{ for all } a \in \mathcal{X} \right\}$, where $N(a; x^n)$ is the number of occurrences of the symbol $a$ in $x^n$. Weak typicality may not be as "representative" as strong typicality, since the weakly typical set includes not only the strongly typical type classes but also other, exponentially small type classes. However, this definition of strong typicality is not well-defined when the alphabet is infinite. Hence people may sometimes still prefer to use weak typicality.