**5.1 Entropy rate of a process**

**Key concepts: Non i.i.d. processes, entropy rates, Markov sources, stationary distributions, existence and uniqueness of stationary distributions of Markov processes, convergence to stationary distribution, second law of thermodynamics, compression of non i.i.d. sources.**

In the previous discussions we have looked at the entropy of a single random variable and the entropy of a finite number of random variables. Today we discuss the entropy of a stochastic process, which is a sequence of random variables. In particular, our focus will be on "stationary (time-homogeneous/time invariant) Markov chain with a finite state space". We begin by introducing the definition of a stationary process, markov chain, and stationary distribution. Then we will discuss its relationship with the second law of thermodynamics, and finish with a brief description of how to compress a process using Huffman code.

A *stochastic process* is a sequence of random variables. It is said to be *stationary* if the distribution of any subset of the sequence of the random variables is invariant under shifts of time. That is

for any and , and **ALL** .

Let us recall the definition of a Markov chain. A *Markov chain* is a stochastic process such that it has the property of "*knowing the present, the future is independent of the past*", that is

.

In our discussion, we will be focusing on stationary (also called time-homogeneous/time-invariant) Markov chain, which has the following additional constraint:

for all . We also define the *transition matrix* for a Markov chain to be .

A *stationary distribution* is a probability distribution that is the same at any time , that is

for all . In particular, a distribution is stationary for a Markov chain if .

**Remark:** A stationary distribution is *NOT* the same as a stationary process.

One would immediate question if a stationary distribution always exists and whether this distribution is unique. For a Markov chain with finite state space, one can show that it always has at least one stationary distribution. However, it may not be unique, as illustrated in the following example.

**Example 1: **Consider a Markov chain with the transition matrix being the identity matrix . Then every distribution is stationary but the Markov chain does not have a unique stationary distribution.

Given a transition matrix of a Markov chain, we can always consider it as a weighted directed graph . In this graph, there is an edge connecting node to node with weight if and only if . We say a Markov chain is *irreducible* if its graph is* strongly connected*, that is, for every pair of nodes and , there exists a path connecting to . Moreover, we also call a Markov chain *aperiodic *if the largest common divisor of the lengths of different paths from a state to itself is 1.

**Proposition: **If a finite stationary Markov chain with transition matrix is *irreducible* and *aperiodic*, then it has a unique stationary distribution . Moreover, for any distribution , .

While we are not going to give the proof here, we consider what happens if either one of the properties is missing. Suppose a Markov chain is not irreducible. Then our example above shows that it is possibly to have more than one stationary distribution. Now consider a Markov chain with transition matrix , this chain is irreducible but not aperiodic. It is easy to see that if , then when n is odd and when n is even . Hence it does not tend towards the stationary distribution .

We are now ready to define the entropy of a stochastic process. One attempt to define the entropy of a process would be if the limit exists. However, this definition does not capture the dependency of the variables. Consider the following definition

**Definition (Entropy rate of a stochastic process)**: The entropy of a stochastic process is defined by if the limit exists.

At first sight, although this definition captures the dependency of the random variables, the R.H.S. does not seem easy to compute compared to the first one. Now we show that for a stationary Markov chain this is not the case. By its definition, one has

Therefore, in order to compute the entropy of a Markov process, we can simply find a stationary distribution , and compute .

Let be the unique stationary distribution of a Markov chain, and be the distribution of the th random variable in the chain. We now show that decreases to as tends to infinity. More precisely, we show that and .

*Proof***: **We prove both inequality by first showing a more general inequality. Recall the *chain rule for relative entropy* says that given two joint probability distributions one has

, and

Note that if and are two distributions of the same Markov chain, , where is the transition probability of the chain. Hence . Previously we also showed that for any and . Therefore we have

.

The first inequality then follows from replacing and by and respectively, since is a stationary distribution, . The second inequality is similar.

We have just showed that the relative entropy of a Markov chain decreases with respect to . However this does *NOT* imply whether its entropy always increases or decreases. This can be easily shown by considering the fact that the entropy of a random variable is maximized when its distribution is uniform. If the stationary distribution is non-uniform and we choose a uniform distribution to be our initial distribution, then the entropy must decrease. Likewise, if the stationary distribution is uniform, then its entropy is maximized and there is no way we can start with a distribution that has a higher entropy rate.

Therefore one possible implicit assumption in the second law of thermodynamics, which states that the entropy of an isolated system never decreases, is the stationary distribution of the system is uniform.

We finish this section by considering how to compress a Markov source with Huffman coding. One way to compress a Markov source is to do Huffman coding according to the stationary distribution of the Markov chain, since this is the limiting distribution of the process (assuming it is irreducible and periodic). Another way to compress a Markov source is to do Huffman coding according to the transition probability of the process.

**5.3 Weak typicality**

**Key concepts: Weak typicality, definition, comparison with strong typicality, properties (size, probability), weak law of large numbers**

We first recall the definitions of convergence in probability in mathematical analysis, then we state the weak law of large number. After that we introduce weak typicality and finish the section by briefly comparing weak typicality and strong typicality.

**Definition (Convergence in probability)** A sequence of random variables is said to converge to in probability (denoted by ) if for any , , there exists an , such that for all ,.

**Theorem (Strong law of large numbers) **Let be a sequence of i.i.d. random variables with expected value . The *weak law of large numbers* states that .

**Definition (ε-weakly typical set)** Let be n random variables. We define the set

to be its *ε-weakly typical set*.

Now we show that the weakly typical sets have some good properties that we used for strongly typical sets

**Proposition** For a sequence of i.i.d. random variables , .

Proof.

where the first quality follows from independence of the variables, convergence in probability follows from weak law of large numbers and the last equality follows from the definition of entropy.

**Proposition** For all , there exists such that for all , .

Proof. By the previous proposition, . Hence for any , choose , then there exists N such that for all , , equivalently, . Therefore, .

**Proposition** for sufficiently large.

Proof. For each in , . There are typical sequences. Hence one has and it follows that .

Recall that the ε-strongly typical set is defined to be . Weak-typicality may not be as "representative" of typicality as strong typicality as weak typical not only includes strong typical classes, but also other exponentially small type classes. However, this definition for strong typicality is not well-defined when the alphabet size is not finite. Hence sometimes people may still prefer using the weak typicality.

## Comments (0)

You don't have permission to comment on this page.