
# Processes with memory

## 1. Entropy rate

In previous lectures, we covered basic notions of information theory and the source coding and channel coding theorems. For simplicity, we assumed that sources are memoryless and i.i.d., which is not always reasonable. In English, for example, the characters are clearly not i.i.d.: combinations like "THE" are very common. Moreover, in practice we often encounter sequences of random variables coming from different sources, and their distributions are likely to be non-identical. Hence, we need to consider sources with memory that are not necessarily identically distributed. Markov chains are used to model such sources, for example texts and speech.

**Definition (Stationary process).** A stochastic process X^{∞} = {X_{i}} is said to be *stationary* if

$$\Pr(X_1 = x_1, \ldots, X_n = x_n) = \Pr(X_{1+k} = x_1, \ldots, X_{n+k} = x_n)$$

for all x_{i} and every n, k; that is, the joint distribution is invariant under shifts of the time index.

Note that being stationary does not imply independence. For example, take a collection of i.i.d. random variables {X_{i}} and define Z_{n} = X_{n+1} + X_{n}.
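Here {Z_{n}} is stationary but its terms are not independent: since the X_{i} are i.i.d., the joint law of (Z_{1+k}, \ldots, Z_{n+k}) does not depend on the shift k, yet consecutive terms share a summand (assuming the X_{i} are non-constant with finite variance):

$$\operatorname{Cov}(Z_n, Z_{n+1}) = \operatorname{Cov}(X_{n+1} + X_n,\; X_{n+2} + X_{n+1}) = \operatorname{Var}(X_{n+1}) > 0.$$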

**Definition (Stationary distribution). **A probability distribution on the states of a stochastic process is *stationary* if it is the same at time n + 1 as at time n.

**Definition (Markov chain/Markov process, time invariance).** A stochastic process X^{∞} = {X_{i}} is a *Markov chain* or a *Markov process* if for any n = 1, 2,...,

$$\Pr(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n)$$

for all x_1, ..., x_{n+1}. The Markov chain is said to be *time-invariant* if p(x_{n+1}|x_{n}) does not depend on n.

Let X^{∞} = {X_{i}} be a time-invariant Markov process. From the above definition, we see that the joint distribution of X_1, ..., X_n is determined by the initial distribution and the transition matrix P = (P_{ij}), where P_{ij} = Pr(X_{n+1} = j | X_{n} = i). Namely,

$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, P_{x_1 x_2} P_{x_2 x_3} \cdots P_{x_{n-1} x_n}.$$

If v is a probability vector satisfying v = vP, then v is a stationary distribution; that is, v is a left eigenvector of P with eigenvalue 1. Since every stochastic matrix has eigenvalue 1, such an eigenvector always exists, and normalising a nonnegative one gives a stationary distribution. We may now ask: does a stationary distribution always exist, and is it unique when it exists? To answer this, we need to define two properties which a Markov chain can possess.

**Definition (Irreducibility, aperiodicity).** We say X^{∞} is *irreducible* if it is possible to go from any state to any other state with positive probability in a finite number of steps. It is said to be *aperiodic* if, for each state, the greatest common divisor of the lengths of paths from that state to itself is 1.

**Theorem (Existence and uniqueness of stationary distribution).** If a finite-state Markov chain is irreducible and aperiodic, then it has a unique stationary distribution, and the distribution of X_n converges to it as n → ∞ regardless of the initial distribution.
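As a quick numerical sketch, a stationary distribution can be computed by solving v = vP together with the normalisation constraint. The two-state matrix below is a made-up example, not one from the lecture:

```python
import numpy as np

# A hypothetical two-state chain (illustrative values):
# state 0 stays with prob 0.9, state 1 stays with prob 0.7.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# A stationary distribution satisfies v = vP, i.e. v is a left
# eigenvector of P with eigenvalue 1.  Solve (P^T - I) v = 0 together
# with the normalisation constraint sum(v) = 1 via least squares.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
v, *_ = np.linalg.lstsq(A, b, rcond=None)

print(v)                      # stationary distribution
print(np.allclose(v, v @ P))  # check v = vP
```

This chain is irreducible and aperiodic (all entries of P are positive), so by the theorem the solution (0.75, 0.25) is its unique stationary distribution.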

We now define the entropy rate of a process, which measures how the entropy of the sequence of random variables grows with n.

**Definition (Entropy rate).** The *entropy rate* H(X^{∞}) of a stochastic process X^{∞} = {X_{i}} is defined as

$$H(X^{\infty}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n),$$

provided the limit exists.

For a stationary stochastic process, its entropy rate has an alternative expression.

**Theorem (Entropy rate of stationary processes).** Let X^{∞} = {X_{i}} be a stationary stochastic process. Then we have

$$H(X^{\infty}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1).$$

For a stationary time-invariant Markov chain, we have from the above

$$H(X^{\infty}) = H(X_2 \mid X_1) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij},$$

where μ = (μ_i) denotes the stationary distribution.
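This formula can be checked numerically. The two-state transition matrix below is a hypothetical example (entropies are in bits, i.e. log base 2):

```python
import numpy as np

# Entropy rate of a stationary time-invariant Markov chain:
#   H = -sum_{i,j} mu_i P_ij log2(P_ij),
# where mu is the stationary distribution.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])  # hypothetical example chain

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, vl = np.linalg.eig(P.T)
mu = np.real(vl[:, np.argmin(np.abs(w - 1))])
mu = mu / mu.sum()

# H(X2 | X1): average of the row entropies weighted by mu.
row_entropies = -np.sum(np.where(P > 0, P * np.log2(P), 0.0), axis=1)
H = float(mu @ row_entropies)
print(mu, H)
```

Note that the entropy rate (about 0.57 bits here) is lower than the entropy of the stationary distribution itself, reflecting the predictability contributed by the memory.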

## 2. Entropy and the second law of thermodynamics

The second law of thermodynamics states that the entropy of an isolated system can never decrease. In the context of information theory, there are situations where entropy decreases with time. Consider a Markov process with the following transition matrix:

Since the number of states carrying positive probability shrinks over successive iterations, the entropy decreases when the initial distribution is, for example, the uniform distribution (1/4, 1/4, 1/4, 1/4); starting from (1, 0, 0, 0) the entropy is already zero and cannot decrease further.

In general, if the stationary distribution is non-uniform, then the entropy decreases with time for some initial distributions; conversely, if the stationary distribution is uniform, the entropy is non-decreasing. The former situation does not arise in physics, since the stationary distribution is uniform for many physical systems.
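A small simulation illustrates the point. The absorbing 4-state matrix below is a stand-in (the lecture's matrix is not reproduced here); its stationary distribution (1, 0, 0, 0) is non-uniform, and starting from the uniform distribution the entropy strictly decreases:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical 4-state chain with absorbing state 0: every other state
# falls into state 0 with probability 1/2 and stays put otherwise.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.5]])

mu = np.full(4, 0.25)  # uniform initial distribution, H = 2 bits
for t in range(5):
    print(t, round(entropy(mu), 3))
    mu = mu @ P        # one step of the chain
```

As mass concentrates on the absorbing state, the printed entropies fall monotonically from 2 bits toward 0.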

## 3. Compressing a Markov source

To compress a Markov source, we may use Huffman codes. We discuss two variants of this method.

Method 1: Compute the stationary distribution of the Markov process, and then compress blocks of n symbols with a Huffman code built on the block probabilities p(x_1, ..., x_n). Here, n is chosen to be large.

Method 2: Two compression schemes are employed, one for each possible previous symbol of the binary source. If the last symbol was j, we compress the next symbol using a Huffman code for the conditional distribution (p_{1j}, p_{0j}). The method works because both the encoder and the decoder know in advance the initial distribution, the transition probabilities, and the first symbol, which indicates which scheme is used for each subsequent symbol.
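Method 2 can be sketched as follows. The three-symbol alphabet and its conditional probabilities are made-up values for illustration (a ternary alphabet is used because single-symbol Huffman coding is trivial on a binary one), and `huffman` is a textbook Huffman construction rather than anything specified in the notes:

```python
import heapq
from itertools import count

def huffman(probs):
    """Return a prefix code {symbol: bitstring} for the distribution probs."""
    tick = count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]

# Hypothetical Markov source: cond[j] is the distribution of the next
# symbol given that the last symbol was j (illustrative values).
cond = {
    "a": {"a": 0.8, "b": 0.1, "c": 0.1},
    "b": {"a": 0.5, "b": 0.4, "c": 0.1},
    "c": {"a": 0.4, "b": 0.1, "c": 0.5},
}
# One Huffman code per possible previous symbol.
codes = {j: huffman(p_j) for j, p_j in cond.items()}

def encode(seq):
    # The first symbol is sent with a fixed-length code known to both
    # sides; each later symbol uses the code selected by its predecessor.
    first = {"a": "00", "b": "01", "c": "10"}[seq[0]]
    return first + "".join(codes[prev][s] for prev, s in zip(seq, seq[1:]))

print(encode("aabca"))
```

The decoder mirrors this: after recovering a symbol, it switches to that symbol's code to parse the next codeword, so no side information beyond the transition probabilities is needed.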

For further information about entropy rate, one may refer to Chapter 4 of the textbook.
