
The Noisy-Channel Coding Theorem

Michael W. Macon

December 18, 2015

Abstract

This is an exposition of two important theorems of information theory often singularly referred to as The Noisy-Channel Coding Theorem. Given a few assumptions about a channel and a source, the coding theorem demonstrates that information can be communicated over a noisy channel at a non-zero rate approximating the channel's capacity with an arbitrarily small probability of error. It originally appeared in C.E. Shannon's seminal paper, A Mathematical Theory of Communication, and was subsequently rigorously reformulated and proved by A.I. Khinchin. We first introduce the concept of entropy as a measure of information, and discuss its essential properties. We then state McMillan's Theorem and attempt to provide a detailed sketch of the proof of what Khinchin calls Feinstein's Fundamental Lemma, both crucial results used in the proof of the coding theorem. Finally, keeping in view some stringent assumptions and Khinchin's delicate arguments, we attempt to frame the proof of The Noisy-Channel Coding Theorem.

1 Introduction

C.E. Shannon wrote in 1948,

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point[5].”

At the time Shannon wrote his celebrated paper, A Mathematical Theory of Communication, the efforts of communications engineers were beginning to shift from analog transmission models to digital models.


This meant that rather than focusing on improving the signal-to-noise ratio by means of increasing the amplitude of the waveform signal, engineers were concerned with efficiently transmitting a discrete time sequence of symbols over a fixed bandwidth. Shannon's paper piqued the interest of engineers because it describes what can be attained by regulating the encoding system. It was Norbert Wiener in [7] who first proposed that a message should rightly be viewed as a stochastic process[3]. R.V.L. Hartley proposed the logarithmic function for a measure of information[5]. However, it was Shannon who formalized the theory by giving mathematical definitions of information, source, code and channel, and a way to quantify the information content of sources and channels. He established theoretical limits for the quantity of information that can be transmitted over noisy channels and also for the rate of its transmission.

At the same time, mathematicians and statisticians became interested in the new theory of information, primarily because of Shannon's paper[5] and Wiener's book[7]. As McMillan paints it, information theory “is a body of statistical mathematics.” The model for communication became equivalent to statistical parameter estimation. If x is the transmitted input message, and y the received message, then the joint distribution of x and y completely characterizes the communication system. The problem is then reduced to accurately estimating x given a joint sample (x, y)[3].

The Noisy-Channel Coding Theorem is the most consequential feature of information theory. An input message sent over a noiseless channel can be discerned from the output message. However, when noise is introduced to the channel, different messages at the channel input can produce the same output message. Thus noise creates a non-zero probability of error, Pe, in transmission. Introducing a simple coding system like a repetition code or a linear error correcting code will reduce Pe, but simultaneously reduce the rate, R, of transmission. The prevailing conventional wisdom among scientists and engineers in Shannon's day was that any code that would allow Pe → 0 would also force R → 0. Nevertheless, by meticulous logical deduction, Shannon showed that the output message of a source could be encoded in a way that makes Pe arbitrarily small. This result is what we shall call the Noisy Channel Coding Theorem Part 1. Shannon further demonstrated that a code exists such that the rate of transmission can approximate the capacity of the channel, i.e. the maximum rate possible over a given channel. This fact we shall call The Noisy Channel Coding Theorem Part 2. These two results have inspired generations of engineers, and persuaded some to confer the title of “Father of the Information Age” to Claude Shannon.
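The repetition-code tradeoff is easy to see numerically. The following minimal sketch, assuming a binary symmetric channel with crossover probability p as a toy model, computes the error probability of an n-fold repetition code under majority decoding together with its rate R = 1/n:

```python
from math import comb

def repetition_code_error(n: int, p: float) -> float:
    """Probability that majority decoding fails for an n-fold repetition
    code over a binary symmetric channel with crossover probability p
    (n odd, so ties cannot occur)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

p = 0.1
for n in [1, 3, 5, 9, 15, 25]:
    Pe = repetition_code_error(n, p)
    R = 1.0 / n  # source bits carried per channel bit
    print(f"n = {n:2d}   Pe = {Pe:.2e}   R = {R:.3f}")
# Pe -> 0 only as n grows, but then the rate R = 1/n -> 0 as well;
# the coding theorem shows this tradeoff is not fundamental.
```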

As Khinchin narrates, the road to a rigorous proof of Shannon's theorems is “long and thorny.” Shannon's initial proofs were considered by mathematicians to be incomplete with “artificial restrictions” that weakened and oversimplified them. In the years after the publication of Shannon's theorem, research mathematicians such as McMillan, Feinstein et al., began to systematize information theory on a sound mathematical foundation[2].


Now we attempt to follow Khinchin down the lengthy and complicated road to the proof of the Noisy Channel Coding Theorem.

2 Entropy

In his famous paper [5] Shannon proposed a measure for the amount of uncertainty or entropy encoded in a random variable. If a random variable X is realized, then this uncertainty is resolved. So it stands to reason that the entropy is proportional to the amount of information gained by a realization of random variable X. In fact, we can take the amount of information gained by a realization of X to be equal to the entropy of X. As Khinchin states,

“...the information given us by carrying out some experiment consists in removing the uncertainty which existed before the experiment. The larger this uncertainty, the larger we consider to be the amount of information obtained by removing it[2].”

Suppose A is a finite set. Let an element a from set A be called a letter, and the set A be called an alphabet.

Definition 1. Let X be a discrete random variable with alphabet A and probability mass function pX(a) for a ∈ A. The entropy H(X) of X is defined by

H(X) = -\sum_{a \in A} p(a) \log p(a)

Note: The base of the logarithm is fixed, but may be arbitrarily chosen. Base 2 will be used in the calculations and examples throughout this paper. In this case, the unit of measurement is the bit. It is also assumed that 0 · log(0) = 0.

For example, the entropy of the binomial random variable X ∼ B(2, 0.4) is given by

H(X) = -(0.36 \log 0.36 + 0.48 \log 0.48 + 0.16 \log 0.16) = 1.4619 bits.
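As a quick sanity check, the same value can be computed directly from the binomial probabilities; a minimal sketch:

```python
from math import log2

# pmf of X ~ B(2, 0.4): P(X=0), P(X=1), P(X=2)
pmf = [0.6 * 0.6, 2 * 0.4 * 0.6, 0.4 * 0.4]   # [0.36, 0.48, 0.16]

def entropy(probs):
    """Shannon entropy in bits, with the convention 0*log(0) = 0."""
    return -sum(q * log2(q) for q in probs if q > 0)

print(f"H(X) = {entropy(pmf):.4f} bits")   # about 1.4619
```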

The entropy of a continuous random variable X with probability density function p(x) is given by

H(X) = -\int_{-\infty}^{\infty} p(x) \log p(x)\,dx.

The entropy of a continuous random variable is sometimes called differential entropy. As mentioned above, the entropy H(X) is interpreted as a measure of the uncertainty or information encoded in a random variable, but it is also importantly viewed as the mean minimum number of binary distinctions (yes or no questions) required to describe X. So H(X) is also called the description length.

The definition of entropy can be extended to two random variables.


Definition 2. The joint entropy H(X,Y) of a pair of discrete random variables (X,Y) with a joint distribution p(a,b) is defined as

H(X,Y) = -\sum_{a \in A} \sum_{b \in B} p(a,b) \log p(a,b)

Definition 3. The conditional entropy H(Y|X) is defined as

H(Y|X) = \sum_{a \in A} p(a) \, H(Y|X=a)
       = -\sum_{a \in A} p(a) \sum_{b \in B} p(b|a) \log p(b|a)
       = -\sum_{a \in A} \sum_{b \in B} p(a,b) \log p(b|a)

The quantity H(Y|X=a) is a function of the realized value of X, and the conditional entropy H(Y|X) is its expected value over X. That is, H(Y|X) is the expected amount of information still to be gained from the realization of Y once the information gained by the realization of X is known.

Theorem 1. H(X,Y) = H(X) + H(Y|X)

Proof.

H(X,Y) = -\sum_{a \in A} \sum_{b \in B} p(a,b) \log p(a,b)
       = -\sum_{a \in A} \sum_{b \in B} p(a,b) \log \big( p(a)\, p(b|a) \big)
       = -\sum_{a \in A} \sum_{b \in B} p(a,b) \log p(a) - \sum_{a \in A} \sum_{b \in B} p(a,b) \log p(b|a)
       = -\sum_{a \in A} p(a) \log p(a) - \sum_{a \in A} \sum_{b \in B} p(a,b) \log p(b|a)
       = H(X) + H(Y|X)

Corollary 1. If X and Y are independent, it follows immediately that

H(X,Y) = H(X) + H(Y).

Corollary 2. We also have H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).

Corollary 3. Note that H(X) - H(X|Y) = H(Y) - H(Y|X).
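For a concrete check of Theorem 1 and its corollaries, the following sketch, using an arbitrary made-up joint distribution, computes H(X), H(Y), H(X,Y) and the conditional entropies from a small joint table:

```python
from math import log2

# made-up joint pmf p(a, b) over A = {0, 1}, B = {0, 1, 2}
joint = {(0, 0): 0.20, (0, 1): 0.25, (0, 2): 0.05,
         (1, 0): 0.10, (1, 1): 0.10, (1, 2): 0.30}

def H(probs):
    """Entropy in bits of a collection of probabilities (0 log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

A, B = {a for a, _ in joint}, {b for _, b in joint}
pX = {a: sum(joint[a, b] for b in B) for a in A}
pY = {b: sum(joint[a, b] for a in A) for b in B}

# conditional entropy straight from Definition 3
HY_given_X = sum(pX[a] * H([joint[a, b] / pX[a] for b in B]) for a in A)

HX, HY, HXY = H(pX.values()), H(pY.values()), H(joint.values())
print(f"H(X,Y)        = {HXY:.4f}")
print(f"H(X) + H(Y|X) = {HX + HY_given_X:.4f}")     # equal (Theorem 1)
print(f"H(X) - H(X|Y) = {HX - (HXY - HY):.4f}")
print(f"H(Y) - H(Y|X) = {HY - HY_given_X:.4f}")     # equal (Corollary 3)
```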

We now prove a useful inequality.


Jensen's Inequality.

If f(x) is a real-valued, convex continuous function and p_1, p_2, ..., p_n are positive numbers such that \sum_{i=1}^{n} p_i = 1 and x_1, x_2, ..., x_n ∈ R, then

f(p_1 x_1 + p_2 x_2 + \cdots + p_n x_n) \le p_1 f(x_1) + p_2 f(x_2) + \cdots + p_n f(x_n)

Proof of Jensen's Inequality.

Jensen's Inequality can be proved by mathematical induction. It is assumed that f is convex, and for the case n = 2 we have

f(p_1 x_1 + p_2 x_2) \le p_1 f(x_1) + p_2 f(x_2),

which is precisely the definition of a convex function. Now assume the inequality is true for n = k, that is, for any positive weights summing to 1,

f\left( \sum_{i=1}^{k} p_i x_i \right) \le \sum_{i=1}^{k} p_i f(x_i).

Consider the case n = k + 1 (we may assume p_{k+1} < 1, otherwise the statement is trivial). Since the weights p_i / (1 - p_{k+1}), i = 1, ..., k, are positive and sum to 1,

f\left( \sum_{i=1}^{k+1} p_i x_i \right) = f\left( p_{k+1} x_{k+1} + (1 - p_{k+1}) \sum_{i=1}^{k} \frac{p_i}{1 - p_{k+1}} x_i \right)

\le p_{k+1} f(x_{k+1}) + (1 - p_{k+1}) f\left( \sum_{i=1}^{k} \frac{p_i}{1 - p_{k+1}} x_i \right)   (since f is convex)

\le p_{k+1} f(x_{k+1}) + (1 - p_{k+1}) \sum_{i=1}^{k} \frac{p_i}{1 - p_{k+1}} f(x_i)   (by the induction hypothesis)

= p_{k+1} f(x_{k+1}) + \sum_{i=1}^{k} p_i f(x_i) = \sum_{i=1}^{k+1} p_i f(x_i)

Theorem 2. For any two random variables X and Y, H(X,Y) ≤ H(X) + H(Y), where H(X,Y) = H(X) + H(Y) if and only if X and Y are independent.

Proof.


For the proof, bear in mind that a function is convex if its second derivative is non-negative. The function f(x) = -\log_b x is convex on (0, ∞), since f''(x) = 1/(x^2 \ln b) > 0, and it is continuous, so Jensen's inequality applies to it. Equivalently, for any probability weights q_i and positive numbers x_i,

\sum_i q_i \log x_i \le \log \left( \sum_i q_i x_i \right).

Then

H(X,Y) - H(X) - H(Y)
= -\sum_{a \in A} \sum_{b \in B} p(a,b) \log p(a,b) + \sum_{a \in A} p(a) \log p(a) + \sum_{b \in B} p(b) \log p(b)
= \sum_{a \in A} \sum_{b \in B} p(a,b) \log \frac{1}{p(a,b)} + \sum_{a \in A} \sum_{b \in B} p(a,b) \log p(a) + \sum_{a \in A} \sum_{b \in B} p(a,b) \log p(b)
= \sum_{a \in A} \sum_{b \in B} p(a,b) \log \frac{p(a)\, p(b)}{p(a,b)}
\le \log \left( \sum_{a \in A} \sum_{b \in B} p(a)\, p(b) \right)   (by Jensen's inequality)
= \log(1) = 0

It is evident that if X and Y are independent we have p(a)p(b) = p(a,b), so every term log(p(a)p(b)/p(a,b)) vanishes and the equality holds.

It can also be demonstrated that the amount of uncertainty encoded in X will stay the same or diminish given the information gained from the realization of Y. This is demonstrated in the following theorem.

Theorem 3. For any two random variables X and Y,

H(X|Y) \le H(X).

Proof. From Theorems 1 and 2, it follows that

H(X|Y) = H(X,Y) - H(Y) \le H(X) + H(Y) - H(Y) = H(X).

2.1 Properties of Entropy

The entropy function described in Definition 1 is, in fact, the most fitting measure for the uncertainty of a random variable. If pX(a) = 1 for any a ∈ A, then the uncertainty of the random variable will be 0. Otherwise the uncertainty will be positive. So the amount of uncertainty, i.e. the information encoded in a random variable, is a non-negative function. Furthermore, if all of the probabilities pX(a) for all a ∈ A are equal, then, accordingly, the uncertainty of the random variable will achieve a maximum.


Here are some necessary properties of entropy required for any sound measure of information or uncertainty:

1. H(X) ≥ 0

2. H(X) is a continuous function of P = {pX(a)}_{a ∈ A}

3. H(X) is a symmetric function of P, i.e. the pX(a)'s can be permuted.

4. H(X) can be expanded. That is, if some a with 0 probability is added to the alphabet, the entropy will not change.

5. If pX(a) = 1 for some a ∈ A, then the uncertainty vanishes and H(X) = 0.

6. H(X) is maximum when X is uniformly distributed on the alphabet A.

7. H(X,Y) = H(X) + H(Y|X)

Furthermore, Khinchin shows that if a function H(X) is continuous and has properties #4, #6 and #7, then it must be of the form

H(X) = -\lambda \sum_{a \in A} p(a) \log p(a)

for some positive constant λ. In other words, up to the choice of λ (equivalently, of the base of the logarithm), H(X) is the only reasonable measure of information with the desired properties[2].

Theorem 4. For a discrete random variable X with alphabet A = {a_1, a_2, ..., a_n}, H(X) ≤ log n. That is, the maximum value of H(X) is log n.

Proof. For this proof keep in mind that ln x ≤ x − 1, which can be easily ascertained graphically. Given this, and that \log x = \ln x / \ln 2, it follows that

\ln x \le x - 1 \iff \frac{\ln x}{\ln 2} \le \frac{x - 1}{\ln 2} \iff \log x \le (x - 1)\,\frac{\ln e}{\ln 2} \iff \log x \le (x - 1) \log e.


So

H(X) - \log n
= -\sum_{a \in A} p(a) \log p(a) - \log n
= -\sum_{a \in A} p(a) \log p(a) - \sum_{a \in A} p(a) \log n
= -\sum_{a \in A} p(a) \big( \log p(a) + \log n \big)
= \sum_{a \in A} p(a) \log \frac{1}{n\, p(a)}
\le \sum_{a \in A} p(a) \left( \frac{1}{n\, p(a)} - 1 \right) \log e   (using the identity listed above)
= \left( \sum_{a \in A} \frac{1}{n} - \sum_{a \in A} p(a) \right) \log e
= \left( n \cdot \frac{1}{n} - 1 \right) \log e
= 0

We will not prove all seven properties of entropy listed above, as these can be easily found in the literature. However, below we give an elegant demonstration of property #6: together with Theorem 4, it shows that the uniform distribution attains the maximum value log n.

Proof. Let X be a random variable with alphabet A = {a_1, a_2, ..., a_n} such that pX(a_i) = 1/n for all i. Then it follows that

H(X) = -\sum_{a \in A} p_X(a) \log p_X(a)
     = -\sum_{a \in A} \frac{1}{n} \log \frac{1}{n}
     = -\left( \frac{1}{n} \log \frac{1}{n} \right) \sum_{a \in A} 1
     = -\left( \frac{1}{n} \log \frac{1}{n} \right) n = -\log \frac{1}{n} = \log n
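The bound of Theorem 4 is easy to observe numerically; the sketch below compares the entropy of a few distributions on a four-letter alphabet (the particular distributions are arbitrary) with the maximum log 4 = 2 bits:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits (0 log 0 = 0)."""
    return -sum(q * log2(q) for q in probs if q > 0)

n = 4
candidates = {
    "uniform":       [0.25, 0.25, 0.25, 0.25],
    "mildly skewed": [0.40, 0.30, 0.20, 0.10],
    "very skewed":   [0.85, 0.05, 0.05, 0.05],
    "degenerate":    [1.00, 0.00, 0.00, 0.00],
}
print(f"log n = {log2(n):.4f} bits")
for name, p in candidates.items():
    print(f"{name:14s} H = {entropy(p):.4f} bits")
# Only the uniform distribution reaches log n; the degenerate one gives 0.
```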

3 McMillan’s Theorem, Sources and Channels

We now give a formal definition of source and channel, and introduce the concept of a compound channel. We also prove a consequential lemma, which demonstrates that sequences of trials of a Markov chain can be partitioned into two distinct sets. Lastly, we state McMillan's theorem, a central result used in the proof of Feinstein's Fundamental Lemma and the Noisy Channel Coding Theorem. Since an information source can be represented by a sequence of random variables, and can be modeled with a Markov chain[1], we begin by defining the entropy of a Markov chain.

3.1 Entropy and Markov Chains

Let {X_l} be a finite stationary Markov chain with n states. Suppose the transition matrix of the chain is Q = (q_{i,k}) where i, k = 1, 2, 3, ..., n. Here q_{i,k} = P(X_l = k | X_{l−1} = i) is the probability of a transition from state i to state k at any step l. Recall that for any stationary Markov chain with finitely many states, a stationary distribution ~µ exists for which ~µQ = ~µ.

Definition 4. Suppose {X_l} is a stationary Markov chain with transition matrix Q = (q_{i,k}) and stationary distribution ~µ = (µ_1, µ_2, ..., µ_n), with i, k = 1, 2, 3, ..., n. Let the distribution of X_1 be ~µ. The entropy rate of the Markov chain is given by

H(A) = -\sum_{i,k} \mu_i \, q_{i,k} \log q_{i,k}

The entropy rate of a Markov chain is the average transition entropy when the chain proceeds one step.
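To make the definition concrete, the following sketch, assuming a made-up two-state chain, finds the stationary distribution of a transition matrix and evaluates the entropy rate of Definition 4:

```python
from math import log2

# made-up two-state transition matrix Q; row i holds q_{i,k}
Q = [[0.9, 0.1],
     [0.4, 0.6]]

# stationary distribution of a 2-state chain: mu Q = mu, mu_0 + mu_1 = 1
mu0 = Q[1][0] / (Q[0][1] + Q[1][0])
mu = [mu0, 1 - mu0]                      # here [0.8, 0.2]

# entropy rate H = -sum_{i,k} mu_i q_{i,k} log q_{i,k}
H = -sum(mu[i] * Q[i][k] * log2(Q[i][k])
         for i in range(2) for k in range(2) if Q[i][k] > 0)
print(f"stationary distribution: {mu}")
print(f"entropy rate H = {H:.4f} bits per step")
```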

We will follow the definition and theorem stated in Khinchin.

Definition 5. Let P_k be the probability of state k in a Markov chain and suppose that the relative frequency of state k after a sufficiently large number of trials s is given by m_k / s. A Markov chain is called ergodic if

P\left\{ \left| \frac{m_k}{s} - P_k \right| > \delta \right\} < \varepsilon

for arbitrarily small ε > 0 and δ > 0, and for sufficiently large s.

Consider the set of all possible sequences of s consecutive trials of a Markov chain. An element from this set can be represented as a sequence k_1, k_2, ..., k_s where k_1, k_2, ..., k_s are numbers from {1, 2, ..., n}. Let C be an arbitrary sequence from this set. Writing P_{k_1} for the probability of the initial state k_1 and p_{il} for the transition probability from state i to state l, we have

p(C) = P_{k_1} \, p_{k_1 k_2} \, p_{k_2 k_3} \cdots p_{k_{s-1} k_s} = P_{k_1} \prod_{i=1}^{n} \prod_{l=1}^{n} (p_{il})^{m_{il}}

where i, l = 1, 2, ..., n and m_{il} is the number of pairs k_r k_{r+1} with 1 ≤ r ≤ s − 1 and k_r = i, k_{r+1} = l.


3.2 A Partitioning Lemma

Lemma 1. A Partitioning Lemma
Let ε > 0 and η > 0 be arbitrarily small numbers. For sufficiently large s, all sequences of the form C can be divided into two sets with the following properties:
1) The probability of any sequence in the first group satisfies the inequality

\left| \frac{\log \frac{1}{p(C)}}{s} - H \right| < \eta

2) The sum of the probabilities of all sequences of the second group is less than ε.

The following is a proof of property 1 in the above lemma.

Proof. A sequence C is in the first group if it satisfies the following:

1) p(C) > 0, and
2) |m_{il} − s P_i p_{il}| < sδ for all i, l.

If C is in the first group, we have m_{il} = s P_i p_{il} + sδθ_{il} where −1 < θ_{il} < 1, 1 ≤ i ≤ n, 1 ≤ l ≤ n. Let * denote the restriction of a product or sum to non-zero factors.

Now

p(C) = P_{k_1} \prod_i \prod_l{}^{*} (p_{il})^{s P_i p_{il} + s\delta\theta_{il}}.

Thus

\log \frac{1}{p(C)} = -\log P_{k_1} - s \sum_i \sum_l{}^{*} P_i p_{il} \log p_{il} - s\delta \sum_i \sum_l{}^{*} \theta_{il} \log p_{il}
                   = -\log P_{k_1} + sH - s\delta \sum_i \sum_l{}^{*} \theta_{il} \log p_{il}

Rearranging, we have

\left| \frac{\log \frac{1}{p(C)}}{s} - H \right| \le \frac{1}{s} \log \frac{1}{P_{k_1}} + \delta \sum_i \sum_l{}^{*} \log \frac{1}{p_{il}}.

Consequently, for sufficiently large s, sufficiently small δ and any η > 0, we have that

\left| \frac{\log \frac{1}{p(C)}}{s} - H \right| < \eta
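The concentration expressed by the lemma can be observed empirically. The sketch below, reusing the made-up two-state chain from the entropy-rate example above, simulates sequences of s trials and compares log(1/p(C))/s with the entropy rate H:

```python
import random
from math import log2

random.seed(0)
Q = [[0.9, 0.1], [0.4, 0.6]]          # made-up transition matrix
mu = [0.8, 0.2]                        # its stationary distribution
H = -sum(mu[i] * Q[i][k] * log2(Q[i][k]) for i in range(2) for k in range(2))

def sample_log_prob_rate(s):
    """Simulate s steps started from the stationary distribution and
    return log(1/p(C)) / s for the observed sequence C."""
    state = 0 if random.random() < mu[0] else 1
    logp = log2(mu[state])
    for _ in range(s - 1):
        nxt = 0 if random.random() < Q[state][0] else 1
        logp += log2(Q[state][nxt])
        state = nxt
    return -logp / s

for s in [100, 1000, 10000]:
    rates = [sample_log_prob_rate(s) for _ in range(20)]
    spread = max(abs(r - H) for r in rates)
    print(f"s = {s:5d}   max |log(1/p(C))/s - H| over 20 runs = {spread:.4f}")
print(f"entropy rate H = {H:.4f}")
# The deviation shrinks as s grows, as the Partitioning Lemma predicts.
```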

3.3 Sources

A discrete source generates messages, which are defined to be ordered sequences of symbols from a finite alphabet. Messages are considered to be a random stochastic process. Let I denote the set of integers (..., −1, 0, 1, 2, ...) and A^I denote the class of infinite sequences x = (..., x_{−1}, x_0, x_1, x_2, ...) where each x_t ∈ A and t ∈ I. A set of sequences x determined by fixing finitely many coordinates is called a cylinder set. An information source consists of a probability measure µ defined over the σ-algebra or Borel field of subsets of A^I, denoted F_A, along with the stochastic process [A^I, F_A, µ]. We simply denote the source [A, µ].

Definition 6. If S ⊂ A^I implies µ(shift(S)) = µ(S), a source [A, µ] is called stationary, where S is any set of elements x, shift(S) is the set of all shift(x), and shift : x_k → x_{k+1} is a shift operator.

Definition 7. If shift(S) = S, then S is called an invariant set. A stationary source [A, µ] is ergodic if µ(S) = 0 or µ(S) = 1 for every invariant set.

Mirroring Definition 1, we have that the quantity of information encoded in the set of all n-term sequences, consisting of a^n events C with probabilities µ(C), is given by

H_n = -\sum_C \mu(C) \log \mu(C)

For a stationary source the expected value of the function f_n(x) = -\frac{1}{n} \log \mu(C) is given by

E\left[ -\frac{1}{n} \log \mu(C) \right] = -\frac{1}{n} \sum_C \mu(C) \log \mu(C) = \frac{H_n}{n}

3.4 McMillan’s Theorem

A key result can be summarized as follows: as n → ∞ we have

E(f_n(x)) \to H,

where f_n(x) = -\frac{1}{n} \log \mu(C). Furthermore, for arbitrarily small ε > 0, δ > 0 and sufficiently large n, we also have

P\{ |f_n(x) - H| > \varepsilon \} < \delta.

So f_n(x) converges in probability to H as n → ∞[2]. The entropy of an information source is defined to be

H = \lim_{n \to \infty} \frac{H_n}{n},

and is the average amount of information gained per symbol produced by the source. On the other hand, the source entropy also represents the average uncertainty per symbol produced by the source[3]. This limit always exists, an important conclusion first published by Shannon[5].


Definition 8. A source is said to have the E-property if all n-term sequences C in the output of the source can be separated into two groups such that
1) For every sequence C of the first group

\left| \frac{\log \mu(C)}{n} + H \right| < \varepsilon

2) The sum of the probabilities of all the sequences of the second group is less than δ, where ε > 0 and δ > 0 are arbitrarily small real numbers.

It can be demonstrated that any ergodic source has the E-property. This is referred to as McMillan's Theorem after Brockway McMillan, who first published this result[3]. It parallels the Partitioning Lemma listed above.

McMillan's Theorem

For arbitrarily small ε > 0, δ > 0 and sufficiently large n, all a^n of the n-term sequences of the ergodic source output are divided into two groups: a high probability group, such that

\left| \frac{1}{n} \log \mu(C) + H \right| < \varepsilon

for each of its sequences, and a low probability group, such that the sum of the probabilities of its sequences is less than δ.
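The two groups can be exhibited explicitly for a memoryless binary source, a particularly simple ergodic source. The sketch below, assuming an i.i.d. Bernoulli(0.3) source, groups the n-term sequences by their number of ones so that counts and probabilities can be summed without enumerating all 2^n sequences:

```python
from math import comb, log2

p, n, eps = 0.3, 1000, 0.05
H = -(p * log2(p) + (1 - p) * log2(1 - p))    # source entropy, about 0.8813 bits

high_count, high_prob = 0, 0.0
for k in range(n + 1):                         # k = number of ones in the sequence
    log_mu = k * log2(p) + (n - k) * log2(1 - p)   # log mu(C) for any such C
    if abs(log_mu / n + H) < eps:              # the E-property condition
        high_count += comb(n, k)
        high_prob += comb(n, k) * 2.0 ** log_mu

print(f"H = {H:.4f} bits")
print(f"high-probability group: 2^{log2(high_count):.1f} of 2^{n} sequences")
print(f"its total probability:  {high_prob:.4f}")   # close to 1
# The remaining (low probability) group carries what is left, less than delta.
```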

McMillan's proof of the existence of such high probability sequences is crucial to the proof of Feinstein's Fundamental Lemma, which relates the cardinality of the set of so-called distinguishable sequences to the ergodic capacity of the channel. This, in turn, is the key to Shannon's proof that there exists a coding system that allows information to be transmitted across any ergodic channel with an arbitrarily small probability of error. We now give a precise definition for communication channel.

3.5 Channels

A communication channel [A, ν_x, B] transmits a message of symbols from the input alphabet A of a source, and outputs the possibly corrupted message consisting of symbols from the output alphabet B. The channel is characterized by a family of probability measures denoted by ν_x, which are conditional probabilities that the signal received when a given sequence x is transmitted belongs to the set S ⊂ B^I, where B^I is the set of all output sequences y.

Definition 9. A channel [A, ν_x, B] is stationary if for all x ∈ A^I and S ⊂ B^I

\nu_{\mathrm{shift}(x)}(\mathrm{shift}(S)) = \nu_x(S),

where shift : x_k → x_{k+1} is a shift operator.

If for a given channel, the first received sequence is unaffected by all the sequences transmitted after the first, then the channel is called non-anticipating.


Definition 10. A channel is said to be without anticipation if the distribution of message y_n is independent of messages transmitted after x_n.

Some channels remember the past history of the channel, and the memory may affect the distribution of error for the sent message.

Definition 11. If message y_n is dependent only on a limited number of preceding input signals x_{n−m}, ..., x_{n−1}, x_n, then a channel is said to have a finite memory m.

Let C^I be the set of pairs (x, y), where x ∈ A^I and y ∈ B^I, and C = A × B is the set of all pairs (a, b) where a ∈ A and b ∈ B. Then C^I is the class of all sequences of the type (x, y) = (..., (x_{−1}, y_{−1}), (x_0, y_0), (x_1, y_1), ...). Let S = M × N where M ⊂ A^I and N ⊂ B^I. Suppose ω(S) is the probability measure of S ⊂ C^I defined by

\omega(S) = \omega(M \times N) = \int_M \nu_x(N) \, d\mu(x).

The connection of a channel [A, ν_x, B] to a source [A, µ] creates a stochastic process [C^I, F_C, ω] that can be considered a source itself. It is referred to as a compound source, denoted by [C, ω]. It can be demonstrated that if [A, µ] and [A, ν_x, B] are stationary, then so is [C, ω]. The sources [A, µ] and [C, ω] correspond to the marginal distribution of the input x and the joint distribution of (x, y), respectively. If we define the probability measure η on F_B as follows

\int_{B^I} k(y) \, d\eta(y) = \int_{A^I} d\mu(x) \int_{B^I} k(y) \, d\nu_x(y),

then the output source [B, η] corresponds to the marginal distribution of y. It can be shown that if [A, µ] and [A, ν_x, B] are stationary, [B, η] is also stationary. Additionally, if [A, µ] is ergodic and if [A, ν_x, B] has a finite memory, then both [C, ω] and [B, η] are ergodic[3][2].

Let H(X,Y), H(X) and H(Y) denote the entropy rates of [C, ω], [A, µ], and [B, η], respectively. The rate of transmission attained by the source [A, µ] over the channel [A, ν_x, B] is given by R(X,Y) = H(X) + H(Y) − H(X,Y). R is dependent on both the channel and information source. However, the supremum of R(X,Y) over all possible ergodic sources, called the ergodic capacity, C, of the channel, depends only on the channel. As McMillan succinctly states,

“R represents that portion of the ‘randomness’ or average uncertainty of each output letter which is not assignable to the randomness created by the channel itself[3].”

Practically speaking, the ergodic capacity of the channel is the maximum number of bits of information that can be transmitted per binary digit transmitted over a channel[4].
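For intuition, consider the memoryless binary symmetric channel with crossover probability p, a far simpler setting than the channels with memory treated here; in that special case the capacity reduces to the familiar closed form 1 − H(p). The sketch below, assuming this memoryless model, checks the closed form by maximizing R = H(Y) − H(Y|X) over Bernoulli(q) inputs on a grid:

```python
from math import log2

def Hb(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -(q * log2(q) + (1 - q) * log2(1 - q))

def rate(q, p):
    """R = H(Y) - H(Y|X) for a Bernoulli(q) input over a BSC(p)."""
    y1 = q * (1 - p) + (1 - q) * p      # probability that the output bit is 1
    return Hb(y1) - Hb(p)               # H(Y|X) = Hb(p) for every input bit

p = 0.1
best_q, best_R = max(((q / 1000, rate(q / 1000, p)) for q in range(1001)),
                     key=lambda t: t[1])
print(f"numerical capacity ~ {best_R:.4f} bits, attained near q = {best_q:.2f}")
print(f"closed form 1 - H(p) = {1 - Hb(p):.4f} bits")
```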


4 Feinstein’s Fundamental Lemma

We now state and sketch the proof of Feinstein’s Fundamental Lemma in thefashion of Khinchin. This will lay the foundation for the proof of Shannon’scoding theorem. We begin by presenting the following three useful inequalities,also attributed to Amiel Feinstein. Their short, ingenious proofs can be foundin [2].

4.1 Three Inequalities

Consider two random variables X, Y and their product X × Y. Let Z be a set of events X_i × Y_k ∈ X × Y. Let U_0 be some set of events X_i ∈ X, and let δ_1 > 0 and δ_2 > 0 be such that

p(Z) > 1 - \delta_1

and

p(U_0) > 1 - \delta_2.

Denote by Γ_i, for i = 1, 2, ..., n, the set of events Y_k ∈ Y such that X_i × Y_k does not belong to Z. Let U_1 be the set of X_i ∈ U_0 for which p_{X_i}(Γ_i) ≤ α. Then we have

Feinstein's Inequality #1

p(U_1) > 1 - \delta_2 - \frac{\delta_1}{\alpha}.

Let i_k denote the value of the subscript i for which the probability p(X_i × Y_k) is maximal. If there is more than one maximal value, then any one will suffice. Let

P = \sum_{k=1}^{m} \sum_{\substack{i=1 \\ i \ne i_k}}^{n} p(X_i \times Y_k).

P is the probability of the occurrence of a pair X_i × Y_k such that X_i is not the event in X that is most probable for the given event Y_k ⊂ Y.

If for a given ε with 0 < ε < 1, a set Δ_i of events Y_k can be associated with each X_i, where i = 1, 2, ..., n, such that

p(\Delta_i \cap \Delta_j) = 0, \quad i \ne j

and

p_{X_i}(\Delta_i) > 1 - \varepsilon, \quad i = 1, 2, ..., n,

then we have

Feinstein's Inequality #2

P \le \varepsilon.

Feinstein's Inequality #3

For n > 1,

H(X|Y) \le P \log(n-1) - P \log P - (1-P) \log(1-P).
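Inequality #3 is what is nowadays usually called Fano's inequality. The sketch below, using an arbitrary made-up joint distribution, computes both sides, taking P to be the probability that X differs from its most probable value given Y:

```python
from math import log2

def H(probs):
    """Entropy in bits (0 log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# made-up joint distribution p(X_i, Y_k), i = 0..2, k = 0..2
joint = [[0.25, 0.03, 0.02],
         [0.04, 0.30, 0.04],
         [0.02, 0.05, 0.25]]
n = len(joint)
pY = [sum(joint[i][k] for i in range(n)) for k in range(n)]

# H(X|Y) = sum_k p(Y_k) H(X | Y_k)
HX_given_Y = sum(pY[k] * H([joint[i][k] / pY[k] for i in range(n)])
                 for k in range(n))

# P: probability that X_i is not the most probable event given Y_k
P = sum(joint[i][k] for k in range(n) for i in range(n)
        if i != max(range(n), key=lambda j: joint[j][k]))

bound = P * log2(n - 1) - P * log2(P) - (1 - P) * log2(1 - P)
print(f"H(X|Y) = {HX_given_Y:.4f} bits <= bound = {bound:.4f} bits")
```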


4.2 Distinguishable Sequences

Suppose m is the memory of a non-anticipating ergodic channel [A, ν_x, B]. Let u denote the cylinder set of sequences {x = (x_{−m}, ..., x_{−1}, x_0, ..., x_{n−1})} with each x_i ∈ A. Let v denote the set of sequences of the type {y = (y_0, ..., y_{n−1})} with all y_i ∈ B. The fragment x_{−m}, ..., x_{−1} is the part of the input sequence x that the channel “remembers,” i.e. it affects the probability of error in the output message. The probability ν_u(v) of receiving an output message y ∈ v is identical for every x ∈ u, i.e. every x with the memory block x_{−m}, ..., x_{−1}. Let V denote a union of v-sequences, and ν_u(V) denote the probability of output sequence y ∈ V given the input message x ∈ u. Let V_i ⊂ B^I be a set of v-sequences, and {V_i} be a class of mutually disjoint sets V_i for 1 ≤ i ≤ N. The set of sequences {u_i} is called distinguishable if ν_{u_i}(V_i) > 1 − λ, where 0 < λ < 1/2.

4.3 Feinstein’s Fundamental Lemma.

If a given channel is stationary, without anticipation, and with finite memory m, then, for sufficiently small λ > 0 and sufficiently large n, there exists a distinguishable group {u_i}, 1 ≤ i ≤ N, of u-sequences with

N > 2^{n(C - \lambda)}

members, where C is the ergodic capacity of the channel.

Proof. Let [A, ν_x, B] be an ergodic, stationary, non-anticipating channel with finite memory. Since C is the least upper bound of R(X,Y) over all ergodic sources, we may choose an ergodic source [A, µ] such that

R(X,Y) = H(X) + H(Y) - H(X,Y) > C - \frac{\lambda}{4}.


By McMillan's theorem, [A, µ] has the E-property. Let w be a cylinder set in A^I and W_0 be the set of all cylinders such that

\left| \frac{\log \mu(w)}{n} + H(X) \right| \le \frac{\lambda}{4}.

Likewise, the sources [C, ω] and [B, η] both have the E-property. Let v ⊂ B^I, (w, v) ∈ C^I, and let Z denote the set of all cylinders such that

\left| \frac{\log \omega(w, v)}{n} + H(X,Y) \right| \le \frac{\lambda}{4}

and

\left| \frac{\log \eta(v)}{n} + H(Y) \right| \le \frac{\lambda}{4}.

The sets W_0, Z and v are “high probability” sets. That is, for sufficiently large n and arbitrarily small λ > 0, δ > 0, ε > 0,

\mu(W_0) > 1 - \frac{\lambda}{4},

\omega(Z) > 1 - \delta,

and

\eta(v) > 1 - \varepsilon.

Khinchin now finds a set of sequences W_1 ⊂ A^I that contains every sequence w ⊂ W_0 such that

\frac{\omega(w, A_w)}{\mu(w)} = \frac{\omega(w, A_w)}{\omega(w, B^I)} > 1 - \frac{\lambda}{2},

where A_w ⊂ B^I denotes the set of v-sequences for which (w, v) ∈ Z. Using Feinstein's Inequality #1, Khinchin estimates the probability of W_1 to be

\mu(W_1) > 1 - 2\lambda.

Let w ⊂ W_1 and v ⊂ A_w. It follows that (w, v) ∈ Z. Thus from the inequalities listed above, we have

\frac{\log \mu(w)}{n} + H(X) \le \frac{\lambda}{4},

\frac{\log \eta(v)}{n} + H(Y) \le \frac{\lambda}{4},

and

\frac{\log \omega(w, v)}{n} + H(X,Y) \ge -\frac{\lambda}{4}.


Then

\log \left( \frac{\omega(w, v)}{\mu(w)\, \eta(v)} \right) + n\big[ H(X,Y) - H(X) - H(Y) \big] \ge -\frac{3}{4} n\lambda.

Since R(X,Y) = H(X) + H(Y) − H(X,Y), we have

\log \frac{\omega(w, v)}{\mu(w)\, \eta(v)} \ge n\left[ R(X,Y) - \frac{3}{4}\lambda \right] \iff \frac{\omega(w, v)}{\mu(w)} \ge 2^{n[R(X,Y) - \frac{3}{4}\lambda]}\, \eta(v).

Recalling that R(X,Y) > C − λ/4, it follows that

\frac{\omega(w, v)}{\mu(w)} \ge 2^{n[C - \lambda]}\, \eta(v).

Summing over all v ⊂ A_w, we have

\frac{\omega(w, A_w)}{\mu(w)} \ge 2^{n[C - \lambda]}\, \eta(A_w).

Since the ratio \frac{\omega(w, A_w)}{\mu(w)} \le 1, we can state that

\eta(A_w) \le 2^{-n(C - \lambda)}.

Khinchin now constructs what he calls special groups of w-sequences. A group {w_i}_{i=1}^{N} is special if a set B_i of v-sequences can be associated with each sequence w_i such that

1) B_i \cap B_j = \emptyset for i \ne j,

2) \frac{\omega(w_i, B_i)}{\mu(w_i)} > 1 - \lambda,

3) \eta(B_i) < 2^{-n(C - \lambda)}.

Note how this description of special groups parallels the definition of distinguishable sequences. It can be demonstrated that every single sequence w ⊂ W_1 forms a special group. A special group is called maximal if the addition of any sequence to it results in the group no longer being special. It can be further shown that, for sufficiently large n, the cardinality, N, of a maximal special group of w-sequences satisfies

N > 2^{n(C - 2\lambda)}.

Now let {w_i}_{i=1}^{N} be an arbitrary maximal special group. Take each sequence w_i and concatenate m letters on the left, obtaining a new sequence u_i of length n + m, which is an extension of w_i. By choosing the appropriate extension it can be shown that for i = 1, 2, ..., N

\frac{\omega(u_i, B_i)}{\mu(u_i)} > 1 - \lambda \iff \nu_{u_i}(B_i) > 1 - \lambda.

Therefore, the sequences {u_i} form a distinguishable group for which N > 2^{n(C − 2λ)}. Since λ can be made arbitrarily small, we have N > 2^{n(C − λ)}, as desired.

5 Shannon’s Theorem Part 1

5.1 Coding

The source and the channel need not share the same alphabet. Let [A_0, µ] be an information source with alphabet A_0 transmitting information through channel [A, ν_x, B]. The source transmits a message θ = (..., θ_{−1}, θ_0, θ_1, ...), where θ ∈ A_0^I. Each θ is encoded into an x ∈ A^I by a mapping x(θ) = x. The mapping is called a code, and it can be regarded as a channel itself. The code channel is denoted [A_0, ρ_θ, A], where for M ∈ F_A

\rho_\theta(M) = \begin{cases} 1 & : x(\theta) \in M \\ 0 & : x(\theta) \notin M. \end{cases}

The linking of the code to the channel can be viewed as a new channel [A_0, λ_θ, B], where for Q ∈ F_B,

\lambda_\theta(Q) = \nu_{x(\theta)}(Q).

Now [A_0, λ_θ, B] is equivalent to [A_0, ν_{x(θ)}, B], and this again creates a compound source [C, ω] where C = A_0 × B, and for M ∈ F_{A_0} and N ∈ F_B

\omega(M \times N) = \int_M \lambda_\theta(N) \, d\mu(\theta) = \int_M \nu_{x(\theta)}(N) \, d\mu(\theta).

Lemma 2. A Useful Inequality

Suppose A and B are finite random variables and

p_{A_i}(B_k) = \frac{p(A_i, B_k)}{p(A_i)},

where A_i and B_k are events in A and B, respectively, and \sum_k{}^{*} denotes the summation over certain values of k. Then

\sum_{i=1}^{n} p(A_i) \sum_k{}^{*} p_{A_i}(B_k) \log p_{A_i}(B_k) \ge \sum_k{}^{*} p(B_k) \log p(B_k).

The above inequality holds whether or not the events B_k form a complete probability distribution.
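Since Lemma 2 is only stated here, a quick numerical spot check may be reassuring. The sketch below, using a made-up joint distribution and an arbitrary subset of k values, evaluates both sides of the inequality:

```python
from math import log2

# made-up joint distribution p(A_i, B_k), 3 rows (i) by 4 columns (k)
joint = [[0.10, 0.05, 0.05, 0.10],
         [0.05, 0.15, 0.05, 0.05],
         [0.10, 0.05, 0.20, 0.05]]
pA = [sum(row) for row in joint]
pB = [sum(joint[i][k] for i in range(3)) for k in range(4)]

subset = [0, 2]     # an arbitrary choice of the starred k values

lhs = sum(pA[i] * sum((joint[i][k] / pA[i]) * log2(joint[i][k] / pA[i])
                      for k in subset)
          for i in range(3))
rhs = sum(pB[k] * log2(pB[k]) for k in subset)
print(f"left side  = {lhs:.4f}")
print(f"right side = {rhs:.4f}")
print("inequality holds:", lhs >= rhs)
```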

The Noisy Channel Coding Theorem Part 1.
Given 1) a stationary, non-anticipating channel [A, ν_x, B] with an ergodic capacity C and finite memory m, and 2) an ergodic source [A_0, µ] with entropy H_0 < C. Let ε > 0. Then for sufficiently large n, the output of the source [A_0, µ] can be encoded into the alphabet A in such a way that each sequence α_i of n letters from the alphabet A_0 is mapped into a sequence u_i of n + m letters from the alphabet A, and such that if the sequence u_i is transmitted through the given channel, we can determine the transmitted sequence α_i with a probability greater than 1 − ε from the sequence received at the channel output.

We will show that a code exists such that P_e, the probability of transmission error, can be made arbitrarily small.

Proof of the Noisy Channel Coding Theorem Part 1.
Assume H_0 < C. Let HPS = {α_1, α_2, ...} denote the set of sequences that have a high probability, and α_0 denote the set of all sequences in the low probability group, as defined by McMillan. Then there exists a λ such that λ < \frac{1}{2}(C − H_0). By McMillan's Theorem, [A_0, µ] has the E-property. Then for an α in the “high probability” group

\frac{\log \mu(\alpha)}{n} + H_0 > -\lambda \iff \mu(\alpha) > 2^{-n(H_0 + \lambda)}.

Since each such sequence has probability greater than 2^{−n(H_0+λ)} and their probabilities sum to at most 1, the cardinality of the HPS is less than 2^{n(H_0+λ)}, and by the choice of λ,

2^{n(H_0 + \lambda)} < 2^{n(C - \lambda)}.

By Feinstein's Fundamental Lemma, there exists a distinguishable set of sequences {u_i} with N > 2^{n(C − λ)} elements. Observe that N is larger than |HPS|. So we can construct a mapping that takes each α_i to a unique u_i. This implies that there is at least one u_i that is not mapped to any α_i; every sequence of the low probability set α_0 is mapped to one of these remaining u_i. Since the set {u_i} forms a distinguishable set of sequences, there is a group {B_i}_{i=1}^{N} such that ν_{u_i}(B_i) > 1 − λ and B_i ∩ B_j = ∅ for i ≠ j. Now divide the sequence θ ∈ A_0^I into subsequences of length n and divide x ∈ A^I into subsequences of length n + m.


Let the kth subsequence α in the message θ correspond to the kth subsequence u_i in x. This correspondence is the unique coding map, x = x(θ). Let β_k, k = 1, 2, ..., denote the distinct sequences of length n composed of letters from B. Then

\omega(\alpha_i \times \beta_k) = \int_{\alpha_i} \lambda_\theta(\beta_k) \, d\mu(\theta) = \int_{\alpha_i} \nu_{x(\theta)}(\beta_k) \, d\mu(\theta)

and

\omega(\alpha_i \times \beta_k) = \mu(\alpha_i)\, \nu_{u_i}(\beta_k).

Given a sequence of length n transmitted from the source [A_0, µ], ω(α_i × β_k) is then the joint probability that this sequence corresponds to an α_i, and yields a sequence of length n + m with letters from B after transmission through the channel [A_0, λ_θ, B] such that the last n letters comprise the sequence β_k. Thus we have

\sum_{\beta_k \subset B_i} \omega(\alpha_i \times \beta_k) = \mu(\alpha_i)\, \nu_{u_i}(B_i) > (1 - \lambda)\, \mu(\alpha_i).

Now we invoke Feinstein's Inequality #2. Let i_k denote the value of the index for which ω(α_i, β_k) has its maximum value. Again, if there is more than one maximum, then any one will suffice. Let

P = \sum_k \sum_{i \ne i_k} \omega(\alpha_i, \beta_k).

P is the probability that α_i is not the most probable input sequence given the output sequence β_k. The probability that the output falls in B_i given the input α_i is

p_{\alpha_i}(B_i) = \frac{\omega(\alpha_i, B_i)}{\mu(\alpha_i)} > 1 - \lambda.

It follows from Feinstein's Inequality #2 that

P \le \lambda \iff 1 - P \ge 1 - \lambda.

This means that the probability that α_i is the most probable input sequence, given the receipt of output sequence β_k, is greater than 1 − λ, and so proves Part 1 of the Noisy Channel Coding Theorem.

6 Shannon’s Theorem Part 2

The Noisy Channel Coding Theorem Part 2.
Under the conditions of the Noisy Channel Coding Theorem Part 1, there exists a code x = x(θ) such that the rate of transmission can be made arbitrarily close to H_0.

Proof of the Noisy Channel Coding Theorem Part 2.
Our goal is to show that the rate of transmission can be made arbitrarily close to the entropy of the source [A_0, µ], denoted H_0, assuming H_0 < C, where C is the ergodic capacity of the channel. To this end, we will find an estimate for the rate. We can regard the output of a source as a stream of symbols from the alphabet A_0 that we can divide into sequences α_i of length n, which, when passed through the channel [A_0, λ_θ, B], yield sequences of length n + m composed of letters from alphabet B. The last n letters of the sequences from the channel output are the delivered sequences, β_k.

Now take a sequence of length s = nt + r output from source [A_0, µ]. Let X denote a sequence of this type, and {X} denote the set of all such sequences. Similarly, let Y denote an output sequence of length s from the channel [A_0, λ_θ, B], composed of letters from alphabet B, and let {Y} denote the set of all sequences Y. Let H(X|Y) denote the entropy of the space {X} conditioned on Y.

Each sequence X of length s = nt + r can be subdivided into t consecutive sequences α^{(1)}, α^{(2)}, ..., α^{(t)} and a remainder sequence of length r < n, denoted α^*. Now

\{X\} = \{\alpha^{(1)}\} \times \cdots \times \{\alpha^{(t)}\} \times \{\alpha^*\}.

So, for any fixed output sequence Y_0, we have

H(X|Y_0) \le \sum_{j=1}^{t} H(\{\alpha^{(j)}\}|Y_0) + H(\alpha^*|Y_0).

Averaging over Y, it follows that

H(X|Y) \le \sum_{j=1}^{t} H(\{\alpha^{(j)}\}|Y) + H(\alpha^*|Y).

As with X, we can also subdivide each sequence Y into t sequences {β^{(j)}}_{j=1}^{t} of length n and a remainder sequence {β^*} of length r < n, so that {Y} = {β^{(1)}} × ⋯ × {β^{(t)}} × {β^*}. Note that the spaces {α^{(j)}, β^{(j)}} and {β^{(j)}} have distributions ω(α_i, β_k) and η(β_k), respectively. Let B^{(j)} denote the set of sequences {β^{(l)}}, l ≠ j, and β^* that comprise Y except β^{(j)}. In other words, for j = 1, 2, ..., t, {Y} = {β^{(j)}} × B^{(j)}. By Lemma 2 it follows that

H(\{\alpha^{(j)}\}|Y) = H(\{\alpha^{(j)}\}|\beta^{(j)} B^{(j)}) \le H(\{\alpha^{(j)}\}|\beta^{(j)}) = H(\alpha|\beta).

Since a is the number of letters in the source alphabet A_0, there are a^r sequences in {α^*}. It follows from Theorem 4 that

H(\alpha^*|Y) \le \log a^r < \log a^n = n \log a,

and, since Part 1 together with Feinstein's Inequality #3 gives H(α|β) < λn for sufficiently large n and any λ > 0,

H(X|Y) \le t\, H(\alpha|\beta) + n \log a < \lambda t n + n \log a \le \lambda s + n \log a.


The quantity of information encoded in X before transmission is sH_0, and since H(X|Y) is the amount of information lost after transmission, the amount of information retained is sH_0 − H(X|Y). Also, when X is transmitted through the channel, the total number of symbols transmitted is (t + 1)(n + m). Thus the rate in bits per symbol is given by

\frac{s H_0 - H(X|Y)}{(t+1)(n+m)}.

Recall that s = nt + r. So s + n = nt + r + n, and consequently, s + n ≥ nt + n = (t + 1)n. Since H(X|Y) ≤ λs + n log a, we have

\frac{s H_0 - H(X|Y)}{(t+1)(n+m)} \ge \frac{s H_0 - \lambda s - n \log a}{(nt + n)\left(1 + \frac{m}{n}\right)} \ge \frac{s H_0 - \lambda s - n \log a}{(s + n)\left(1 + \frac{m}{n}\right)} = \frac{H_0 - \lambda - \frac{n}{s} \log a}{\left(1 + \frac{n}{s}\right)\left(1 + \frac{m}{n}\right)}.

So by choosing n and t sufficiently large that \frac{m}{n} < \varepsilon and \frac{n}{s} \le \frac{1}{t} < \varepsilon, for a given ε > 0, the rate is at least

\frac{H_0 - \lambda - \varepsilon \log a}{(1 + \varepsilon)^2},

which exceeds H_0 − 2λ once ε is sufficiently small. The last inequality implies the existence of a code for which the rate of transmission can be made arbitrarily close to H_0 < C.

7 Conclusion

We cannot generally reconstruct a message sent over a noisy channel. However, if we restrict ourselves to sending distinguishable sequences such that, with a high probability, each is mapped into a particular set at the channel output within a class of mutually disjoint sets, then we can determine the original message virtually without error, so long as the number of different sequences from the output of the given source does not exceed the number of distinguishable sequences at the channel input. The proof of the Noisy Channel Coding Theorem hinges on this powerful idea, which is a synthesis of the results of Shannon, Feinstein and McMillan. McMillan showed that the number of high probability sequences approximates 2^{nH_0}, and Feinstein proved the existence of a distinguishable group with N > 2^{n(C−ε)} members. If H_0 < C, the claims of the Noisy Channel Coding Theorem follow.

Although Shannon's theorem is an existence theorem, its implications have motivated scientists to persist in their search for efficient coding systems. This is evident in the modern-day forward error correction coding systems, some of which closely approximate Shannon's theoretical maximum transmission rates.


Additionally, Shannon's ideas can be found in research fields as varied as linguistics, genetics, neuroscience, computer science and digital communications engineering. The advancement of information theory has provoked a radical shift in perspective within the scientific community. Indeed, Shannon arguably “single-handedly accelerated the rate of scientific progress”[6].

References

[1] Robert B. Ash. Information Theory. Dover Publications, New York, 1990. Unabridged and corrected republication of the work originally published by Interscience Publishers, New York, 1965.

[2] A.I. Khinchin. Mathematical Foundations of Information Theory. Dover Publications, Inc., New York, 1957.

[3] Brockway McMillan. The basic theorems of information theory. The Annals of Mathematical Statistics, 24(2):196–219, 1953.

[4] John Robinson Pierce. An Introduction to Information Theory: Symbols, Signals & Noise. Dover Publications, New York, second, revised edition.

[5] C.E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(4):623–656, Oct 1948.

[6] James V. Stone. Information Theory: A Tutorial Introduction. James V Stone, 2013.

[7] Norbert Wiener. Extrapolation, Interpolation, and Smoothing of Stationary Time Series. The MIT Press, 1964.
