Lecture Notes on Information Theory

Volume I

by

Po-Ning Chen† and Fady Alajaji‡

† Department of Communications Engineering, National Chiao Tung University

1001, Ta Hsueh Road, Hsin Chu, Taiwan 300

Republic of China. Email: [email protected]

‡ Department of Mathematics & Statistics, Queen's University, Kingston, ON K7L 3N6, Canada

Email: [email protected]

August 3, 2005


© Copyright by Po-Ning Chen† and Fady Alajaji‡

August 3, 2005


Preface

The reliable transmission of information-bearing signals over a noisy communication channel is at the heart of what we call communication. Information theory—founded by Claude E. Shannon in 1948—provides a mathematical framework for the theory of communication; it describes the fundamental limits to how efficiently one can encode information and still be able to recover it with negligible loss. This course will examine the basic concepts of this theory. What follows is a tentative list of topics to be covered.

1. Volume I:

(a) Fundamentals of source coding (data compression): Discrete memoryless sources, entropy, redundancy, block encoding, variable-length encoding, Kraft inequality, Shannon code, Huffman code.

(b) Fundamentals of channel coding: Discrete memoryless channels, mutual information, channel capacity, coding theorem for discrete memoryless channels, weak converse, channel capacity with output feedback, the Shannon joint source-channel coding theorem.

(c) Source coding with distortion (rate distortion theory): Discrete memoryless sources, rate-distortion function and its properties, rate-distortion theorem.

(d) Other topics: Information measures for continuous random variables, capacity of discrete-time and band-limited continuous-time Gaussian channels, rate-distortion function of the memoryless Gaussian source, encoding of discrete sources with memory, capacity of discrete channels with memory.

(e) Fundamental background on real analysis and probability (Appendix): The concept of sets, supremum and maximum, infimum and minimum, boundedness, sequences and their limits, equivalence, probability space, random variable and random process, relation between a source and a random process, convergence of sequences of random variables, ergodicity and laws of large numbers, central limit theorem, concavity and convexity, Jensen's inequality.


2. Volume II:

(a) General information measures: Information spectrum and quantile and their properties, Rényi's information measures.

(b) Advanced topics in lossless data compression: Fixed-length lossless data compression theorem for arbitrary sources, variable-length lossless data compression theorem for arbitrary sources, entropy of English, Lempel-Ziv code.

(c) Measure of randomness and resolvability: Resolvability and source coding, approximation of output statistics for arbitrary channels.

(d) Advanced topics in channel coding: Channel capacity for arbitrary single-user channels, optimistic Shannon coding theorem, strong capacity, ε-capacity.

(e) Advanced topics in lossy data compression.

(f) Hypothesis testing: Error exponent and divergence, large deviations theory, Berry-Esseen theorem.

(g) Channel reliability: Random coding exponent, expurgated exponent, partitioning exponent, sphere-packing exponent, the asymptotic largest minimum distance of block codes, Elias bound, Varshamov-Gilbert bound, Bhattacharyya distance.

(h) Information theory of networks: Distributed detection, data compression over distributed sources, capacity of multiple access channels, degraded broadcast channels, Gaussian multiple terminal channels.

As shown by the list, the lecture notes are divided into two volumes. The first volume is suitable for a 12-week introductory course such as the one given at the Department of Mathematics and Statistics, Queen's University at Kingston, Canada. It also meets the needs of a fundamental course for senior undergraduates, such as the one given at the Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan. For an 18-week graduate course, as given in the Department of Communications Engineering, National Chiao-Tung University, Taiwan, the lecturer can selectively add advanced topics covered in the second volume to enrich the lecture content and provide students with a more complete and advanced view of information theory.

The authors are very much indebted to all the people who provided insightful comments on these lecture notes. Special thanks are devoted to Prof. Yunghsiang S. Han of the Department of Computer Science and Information Engineering at National Chi Nan University, Taiwan, for his enthusiasm in testing these lecture notes at his school and for providing the authors with valuable feedback.


Notes to readers. In these notes, all the assumptions, claims, conjectures, corollaries, definitions, examples, exercises, lemmas, observations, properties, and theorems are numbered under the same counter for ease of searching. For example, the lemma that immediately follows Theorem 2.1 will be numbered Lemma 2.2, instead of Lemma 2.1.

In addition, you may obtain the latest version of the lecture notes from http://shannon.cm.nctu.edu.tw. Interested readers are welcome to send comments to [email protected].


Acknowledgements

Thanks are given to our families for their full support during the period of writing these lecture notes.


Table of Contents

List of Tables

List of Figures

1 Introduction
  1.1 Overview
  1.2 System model
  1.3 Fundamental concepts of information theory
  1.4 Joint design versus separate design of source and channel coders

2 Information Measures for Discrete Systems
  2.1 Entropy, joint entropy and conditional entropy
    2.1.1 Self-information
    2.1.2 Entropy
    2.1.3 Properties of entropy
    2.1.4 Joint entropy and conditional entropy
    2.1.5 Properties of joint entropy and conditional entropy
  2.2 Mutual information and conditional mutual information
    2.2.1 Properties of mutual information
    2.2.2 Conditional mutual information and its properties
  2.3 Properties of entropy and mutual information for higher dimensional extensions
  2.4 Relative entropy and hypothesis testing
    2.4.1 Fundamentals on hypothesis testing
    2.4.2 Relative entropy or Kullback-Leibler divergence
  2.5 Properties of divergence and its relation with entropy and mutual information
  2.6 Convexity and concavity of entropy, mutual information and divergence

3 Lossless Data Compression
  3.1 Principles of data compression
  3.2 Block codes for asymptotic lossless data compression
    3.2.1 Block codes for discrete memoryless sources
    3.2.2 Block codes for stationary-ergodic sources
    3.2.3 Redundancy for lossless data compression
  3.3 Variable-length codes for lossless data compression
    3.3.1 Non-singular codes and uniquely decodable codes
    3.3.2 Prefix or instantaneous codes for lossless data compression
    3.3.3 Examples of variable-length lossless data compression codes
      A) Huffman code: a variable-length optimal code
      B) Shannon-Fano-Elias code
    3.3.4 Example study on universal lossless variable-length codes
      A) Adaptive Huffman code
      B) Lempel-Ziv codes

4 Data Transmission and Channel Capacity
  4.1 Principles of data transmission
  4.2 Preliminaries
  4.3 Block codes for data transmission over DMC
  4.4 Examples of DMCs
    4.4.1 Identity channels
    4.4.2 Binary symmetric channels
    4.4.3 Symmetric, weakly symmetric and quasi-symmetric channels
    4.4.4 Binary erasure channels

5 Lossy Data Compression
  5.1 Fundamental concept on lossy data compression
    5.1.1 Motivations
    5.1.2 Distortion measures
    5.1.3 Examples of some frequently used distortion measures
  5.2 Fixed-length lossy data compression codes
  5.3 Rate distortion function for discrete memoryless sources

6 Continuous Sources and Channels
  6.1 Information measures for continuous sources and channels
    6.1.1 Models of continuous sources and channels
    6.1.2 Differential entropy
    6.1.3 Properties of differential entropies
    6.1.4 Operational meaning of differential entropy
    6.1.5 Relative entropy and mutual information for continuous sources and channels
  6.2 Lossy data compression for continuous sources
    6.2.1 Rate distortion function for specific sources
      A) Binary sources
      B) Gaussian sources
  6.3 Channel coding theorem for continuous channels
  6.4 Capacity-cost functions for specific continuous channels
    6.4.1 Memoryless additive Gaussian channels
    6.4.2 Capacity for uncorrelated parallel Gaussian channels
    6.4.3 Capacity for correlated parallel additive Gaussian channels
    6.4.4 Capacity for band-limited waveform channels with white Gaussian noise
    6.4.5 Capacity for filtered waveform stationary Gaussian channels
  6.5 Information-transmission theorem
  6.6 Capacity bound for non-Gaussian channels

A Mathematical Background on Real Analysis
  A.1 The concept of sets
  A.2 Supremum and maximum
  A.3 Infimum and minimum
  A.4 Boundedness and supremum/infimum operations
  A.5 Sequences and their limits
  A.6 Equivalence

B Mathematical Background on Probability and Stochastic Processes
  B.1 Concept of source and channel and some frequently used mathematical models
  B.2 Probability space
  B.3 Random variable and random process
  B.4 Observation probability space
  B.5 Relation between a source and a random process
  B.6 Statistical properties of random sources
  B.7 Convergence of sequences of random variables
  B.8 Ergodicity and laws of large numbers
    B.8.1 Laws of large numbers
    B.8.2 Ergodicity and strong law of large numbers
  B.9 Central limit theorem
  B.10 Concavity, convexity and Jensen's inequality

C Problems


List of Tables

3.1 An example of the δ-typical set with n = 2 and δ = 0.3, where F2(0.3) = {AB, AC, BA, BB, BC, CA, CB}. The codeword set is {001(AB), 010(AC), 011(BA), 100(BB), 101(BC), 110(CA), 111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following each codeword indicates those sourcewords that are encoded to this codeword. The source distribution is PX(A) = 0.4, PX(B) = 0.3, PX(C) = 0.2 and PX(D) = 0.1.

6.1 List of entropies of m-level quantized source and m-level Riemann approximation of differential entropy.


List of Figures

1.1 General model of a communication system.
1.2 General model of a communication system.
1.3 A specific model of a source coder.

2.1 Relation between entropy and mutual information.
2.2 Communication context of the data processing lemma.

3.1 Block diagram of a data compression system.
3.2 Possible codebook C∼n and its corresponding Sn. The solid box indicates the decoding mapping from C∼n back to Sn.
3.3 Behavior of the probability of block decoding error as block length n goes to infinity for a discrete memoryless source.
3.4 Classification of variable-length codes.
3.5 Tree structure of a prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111.
3.6 Example of the Huffman encoding.
3.7 Example of the sibling property based on the code tree from P_X^(16). The arguments inside the parenthesis following aj respectively indicate the codeword and the probability associated with aj. "b" is used to denote the internal nodes of the tree with the assigned (partial) code as its subscript. The number in the parenthesis following b is the probability sum of all its children.
3.8 (Continued from Figure 3.7) Example of violation of the sibling property after observing a new symbol a3 at n = 17. Note that node a1 is not adjacent to its sibling a2.
3.9 (Continued from Figure 3.8) Update of Huffman code. The sibling property holds now for the new code.

4.1 A data transmission system, where U represents the message for transmission, X denotes the codeword corresponding to the channel input symbol U, Y represents the received vector due to channel input X, and U denotes the reconstructed message from Y.
4.2 Permissible (Pe, H(X|Y)) region of Fano's inequality.
4.3 Binary symmetric channel.
4.4 Binary erasure channel.

5.1 Example for applications of lossy data compression codes.
5.2 "Grouping" as one kind of lossy data compression.

6.1 The water-pouring scheme for parallel Gaussian channels.
6.2 The water-pouring for lossy data compression of parallel Gaussian sources.
6.3 Band-limited waveform channels with white Gaussian noise.
6.4 Filtered Gaussian channel.
6.5 Equivalent model of filtered Gaussian channel.
6.6 The water-pouring scheme.
6.7 The Shannon limits for (2, 1) and (3, 1) codes under the binary-input AWGN channel.
6.8 The Shannon limits for (2, 1) and (3, 1) codes under continuous-input AWGN channels.

A.1 Illustrated example for Lemma A.22.

B.1 General relations of random processes.
B.2 Relation of ergodic random processes respectively defined through time-shift invariance and the ergodic theorem.
B.3 The support line y = ax + b of the convex function f(x).


Chapter 1

Introduction

1.1 Overview

At its inception, the main role of information theory was to provide the engineering and scientific communities with a mathematical framework for the theory of communication by establishing the fundamental limits on the performance of various communication systems. Its birth was initiated with the publication of the works [2, 3] of Claude E. Shannon, who stated that it is possible to send information-bearing signals at a fixed rate through a noisy communication channel with an arbitrarily small probability of error as long as the communication rate is below a certain fixed quantity that depends on the channel characteristics; he "baptized" this quantity with the name of channel capacity. He further proclaimed that random sources – such as speech, music or image signals – possess an irreducible complexity beyond which they cannot be compressed distortion-free. He called this complexity the source entropy. He went on to assert that if a source has an entropy that is less than the capacity of a communication channel, then asymptotically error-free transmission of the source over the channel can be achieved.

Inspired and guided by the pioneering ideas of Shannon, information theorists gradually expanded their interests beyond communication theory, and investigated fundamental questions in many other related fields. Among them we cite:

• statistical physics (thermodynamics, quantum information theory);

• computer science (algorithmic complexity, resolvability);

• probability theory (large deviations, limit theorems);

• statistics (hypothesis testing, multi-user detection, Fisher information, estimation);


Figure 1.1: General model of a communication system. (Block diagram: Source → Source Encoder → Channel Encoder → Modulator → Physical Channel → Demodulator → Channel Decoder → Source Decoder → Destination; the modulator, physical channel and demodulator together form a discrete channel. The figure also labels the "Transmitter Part," the "Receiver Part," and the "Focus of these notes.")

• economics (gambling theory, investment theory);

• biology (biological information theory);

• cryptography (data security, watermarking);

• networks (self-similarity, traffic regulation theory).

In these lecture notes, however, we focus our attention on the study of the basic theory of communication – information storage and information transmission – from which information theory originated.

1.2 System model

A simple model of a general communication system is depicted in Figure 1.1. Let us briefly describe each block in the figure.

• Source: The source is usually modelled as a random process (the necessary background regarding random processes is introduced in Appendix B). It can be discrete (finite or countable alphabet) or continuous (uncountable alphabet) in value and in time.

• Source Encoder: Its role is to represent the source in a compact fashion by removing its unnecessary or redundant content (i.e., compression).


• Channel Encoder: Its role is to enable the reliable reproduction of the source encoder output after its transmission through a noisy communication channel. This is achieved by adding redundancy to the source encoder output.

• Modulator: It transforms the channel encoder output into a waveform suitable for transmission over the physical channel. This is usually accomplished by varying the parameters of a sinusoidal signal in proportion with the data provided by the channel encoder output.

• Physical Channel: It consists of the noisy (or unreliable) medium that the transmitted waveform traverses. It is usually modelled via a conditional (or transition) probability distribution of receiving an output given that a specific input was sent.

• Receiver Part: It consists of the demodulator, the channel decoder and the source decoder, where the reverse operations are performed. The destination represents the sink where the source estimate provided by the source decoder is reproduced.

In these notes, we will model the concatenation of the modulator, physical channel and demodulator via a discrete-time channel with a given conditional probability distribution. Given a source and a discrete channel, our objectives will then consist of determining the fundamental limits of how well we can construct a (source/channel) coding scheme so that:

• the smallest number of source encoder symbols is required to represent each source sample within a prescribed distortion level D, where D ≥ 0;

• the largest rate of information can be transmitted between the channel encoder input and the channel decoder output with an arbitrarily small probability of error;

• we can guarantee that the source is transmitted over the channel and reproduced at the destination within distortion D, where D ≥ 0.

1.3 Fundamental concepts of information theory

A) What is information?

Information is a message that is previously uncertain to receivers. Any already-known result is certainly non-informative. Uncertainty is therefore a key characteristic in measuring information content. We will continue the discussion on information measures in Section 1.3-D).

After obtaining the information, one may wish to store it or convey it to others; this raises the following question:

B) How to represent information for ease of storage or conveyance to others?

Perhaps the most effective and natural way to represent information is to use pre-defined symbols and their concatenations. For example, there are 26 symbols defined in English, and people use concatenations of these 26 symbols to communicate with each other. Another example is the concatenation of "0" and "1" symbols used in computer and digital communication systems. After the information is symbolized, storage or conveyance of these symbols becomes straightforward. In reality, pre-defined symbols and their concatenations are usually referred to as languages, while in the field of communications they are named codes.

The receiver of the symbolized information is often assumed to know all the "possibilities" of the conveyed information; he is just uncertain about which "possibility" is going to be received. For example, an English listener knows a priori that one of the words in an English dictionary is going to be spoken, even if he cannot tell which one before its reception. In digital communications, the collection of all possible concatenations of pre-defined symbols is often called the codebook (or simply the code).

The dictionary (i.e., the vocabulary base of the receiver) chosen for an information transmission system may not be universally the same. Moreover, the vocabulary selected to describe an informational event may not be identical from person to person, even with the same dictionary (such vocabulary ambiguity rarely happens in engineering-defined codebooks). As it turns out, some codes may be more "lengthy" than others. This brings up another essential question in information theory.

C) What is the most compact representation for an informative message?

An answer to the above question is exemplified as follows. Suppose there are two dictionaries for a 0-1 language:

    code 1:  event one : 00,  event two : 01,  event three : 10,  event four : 11

    code 2:  event one : 0,   event two : 10,  event three : 110, event four : 111

which are both good for conveying the information of "which of the four events occurs?" Assume that the probabilities of occurrence of the four events are respectively 1/2, 1/4, 1/8 and 1/8. Then we can answer the above question by computing the average number of code bits required by the first code,

    (1/2) × 2 + (1/4) × 2 + (1/8) × 2 + (1/8) × 2 = 2 code bits,

and by the second code,

    (1/2) × 1 + (1/4) × 2 + (1/8) × 3 + (1/8) × 3 = 7/4 code bits.

Hence, the second code is more compact due to its smaller average codeword length required for information storage.
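This computation is easy to check numerically. The following minimal Python sketch (the variable names are ours) evaluates the average codeword length of both codes under the probabilities given above:

    # Average codeword length of the two example codes for the
    # four-event source with probabilities 1/2, 1/4, 1/8, 1/8.
    probs = [1/2, 1/4, 1/8, 1/8]
    code1 = ["00", "01", "10", "11"]    # fixed-length code
    code2 = ["0", "10", "110", "111"]   # variable-length code

    def average_length(code, probs):
        """Expected number of code bits per event."""
        return sum(p * len(cw) for p, cw in zip(probs, code))

    print(average_length(code1, probs))  # 2.0 code bits
    print(average_length(code2, probs))  # 1.75 code bits (= 7/4)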

In order to assert the optimality of the second code (i.e., that it is the most compact code possible), a straightforward approach is to exhaust the average codeword lengths of all possible four-event descriptive codes, and show that all these average codeword lengths are no smaller than 7/4 code bits. This may be tedious work, especially when the number of possible events becomes large. An alternative approach is to quantitatively find the minimum attainable average codeword length, and illustrate that the second code indeed achieves this minimum. The question is how to obtain the minimum average codeword length without examining all possible code designs. This can be answered by the next query.

D) How to measure information?

As mentioned in Section 1.3-A), a good measure of information content should be based on its degree of uncertainty.

From a probabilistic viewpoint, if an event is less likely to happen, it should carry more information when it occurs, because it is more uncertain that the event would happen. In addition, it is reasonable to require additivity of the information measure, i.e., the degree of uncertainty of a joint occurrence of independent events should equal the sum of the degrees of uncertainty of the individual events. Moreover, a small change in event probability should only yield a small variation in event uncertainty. For example, two events with probabilities 0.20001 and 0.19999 should reasonably possess comparable information content. As it turns out, the only measure satisfying these demands is the self-information, defined as the logarithm of the reciprocal of the event probability (cf. Section 2.1.1). It is then legitimate to adopt the entropy—the expected value of the self-information—as a measure of information.


However, from the standpoint of engineers, a more useful definition for an information measure is the average codeword length of the most compact code representing the information messages. Under such a definition, engineers can directly determine the minimum space required to store the information based on the quantity of the information measure.

In 1948, Shannon proved that the above two viewpoints are actually equivalent (under some constraints): the minimum average code length for a source code is indeed equal to the entropy of the source. One can then compute the entropy of the source in the previous example,

    (1/2) log2(1/(1/2)) + (1/4) log2(1/(1/4)) + (1/8) log2(1/(1/8)) + (1/8) log2(1/(1/8)) = 7/4 bits,

which assures that the average codeword length of the second code design indeed achieves this minimum.
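As a quick sanity check, the entropy of this source can also be evaluated numerically and compared with the 7/4 code bits obtained above (a minimal Python sketch under the same distribution):

    import math

    # Source distribution from the example: four events with these probabilities.
    probs = [1/2, 1/4, 1/8, 1/8]

    # Entropy in bits: H = sum of p * log2(1/p).
    entropy = sum(p * math.log2(1 / p) for p in probs)

    print(entropy)  # 1.75 bits, matching the average length of the second code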

Shannon's work laid the foundation of the field of information theory. In addition, his work indicates that the mathematical results of information theory can serve as a guide for the development of information manipulation systems.

In the case of data transmission over a noisy channel, the concern is different from that of data storage (or error-free transmission). The sender wishes to transmit to the receiver a sequence of pre-defined information symbols under an acceptable symbol error rate. Code redundancy is therefore added to combat the noise. For example, one may employ the three-times repetition code (i.e., transmit 111 for information symbol 1 and send 000 for information symbol 0), and apply the majority law at the receiver end so that any single-bit error can be recovered. Yet the addition of the two extra code bits, although improving the error recovery capability, decreases the information transmission efficiency: only one information symbol is transmitted per three code symbols, which is quantitatively termed 1/3 information symbol per channel usage. A natural question that arises here is the following.

E) Given a noisy channel, what is the maximum transmission efficiency attainable by channel code designs, subject to an arbitrarily small error probability for information symbols?

The answer to question E) is baptized by Shannon with the name of channel capacity. Before we proceed to interpret the concept behind the channel capacity, we have to decipher the condition of arbitrarily small information-symbol error rate.

Suppose the channel noise induces a probability of 0.1 of receiving 1 given that 0 is transmitted, and the same probability of receiving 0 given that 1 is transmitted. An uncoded transmission therefore results in an information-symbol error rate of 0.1 with a transmission efficiency of 1 information symbol per channel usage. Applying the three-times repetition code improves the information-symbol error rate to

    (3 choose 2) (0.1)^2 (0.9) + (3 choose 3) (0.1)^3 = 0.028,

with only 1/3 information symbol transmitted per channel usage. The question is: "Can one design a channel code with the same transmission efficiency and a smaller information-symbol error rate than 0.028?" The answer is affirmative. In fact, Shannon found that for any given ε > 0, there exists a channel code which transmits 1/3 information symbol per channel usage with an information-symbol error rate smaller than ε. As the error demand ε can be any positive number, we say that an arbitrarily small (but possibly never exactly zero) information-symbol error rate can be achieved by a deliberate code design at a transmission efficiency of 1/3 information symbol per channel usage over this noisy channel. For convenience, information theorists use reliable transmission efficiency (or simply reliable transmission rate) to abbreviate the above lengthy description. Channel capacity is then the maximum reliable transmission rate over a noisy channel.
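The 0.028 figure is just the probability that majority decoding of the three-times repetition code fails, and can be reproduced in a few lines (a minimal Python sketch with the crossover probability 0.1 assumed above):

    from math import comb

    p = 0.1  # crossover probability of the example channel

    # Majority decoding of the 3-times repetition code fails when
    # 2 or 3 of the transmitted bits are flipped by the channel.
    error_rate = comb(3, 2) * p**2 * (1 - p) + comb(3, 3) * p**3

    print(error_rate)  # 0.028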

Can one determine the maximum reliable transmission rate without exhausting all possible channel code designs? As anticipated, Shannon answered the question with a positive yes. Observe that a good channel code basically increases the certainty about the channel inputs given the channel outputs, although both the channel inputs and channel outputs are uncertain before the transmission begins (the channel inputs are decided by the information transmitted, and the channel outputs are the joint results of the channel inputs and the noise). So the design of a channel code should consider the statistically "shared information" between the channel inputs and outputs, so that once a channel output is observed, the receiver is more certain about which channel input was transmitted. To further clarify the concept, consider the following example.

Example 1.1

Channel Model: Suppose that the channel input is a two-tuple (x1, x2) in {(a, a), (a, b), (b, a), (b, b)}. Due to channel noise, only x1 survives at the channel output. In other words, if (y1, y2) represents the channel output, then y1 = x1 and y2 = b.

Common Uncertainty Between Channel Input and Output: Based on the channel model, the channel input has two uncertainties, X1 and X2, since each of them could be either a or b (prior to the beginning of the transmission). However, the channel output only possesses one uncertainty, Y1, because Y2 is deterministically known to be b. So the shared information or common uncertainty between the channel input and output (prior to the beginning of the transmission) is Y1 = X1.

Channel Code: Suppose that Jack and Mary wish to use this noisy channel to reliably convey one of four possible events. This may be achieved as follows.

• First, pre-define the transmission codebook by mapping the four-event information symbols to legitimate channel symbols, and make it known to both Jack and Mary. This may not be possible with only one channel usage. Hence, they define the transmission codebook based on two channel usages as:

    event 1 = (a, d) (a, d),
    event 2 = (a, d) (b, d),
    event 3 = (b, d) (a, d),
    event 4 = (b, d) (b, d),

where "d" means "don't-care" and can be any element of {a, b}. Then a reliable (actually error-free) transmission of the four-event information between Jack and Mary is established. The resultant transmission rate is

    log2(4 events) / (2 channel usages) = 1 information bit per channel usage.

It is noted that the above transmission code only uses the uncertainty X1. This is simply because the uncertainty X2 is useless for the information exchange between Jack and Mary.

From the above example, one may conclude that the design of a good transmission code should consider the common uncertainty (or, more formally, the mutual information) between the channel inputs and channel outputs. It is then natural to wonder whether or not this "consideration" can be expressed mathematically. Indeed, this was established by Shannon when he showed that the bound on the reliable transmission rate (in information bits per channel usage) is the maximum attainable channel mutual information (i.e., the common uncertainty prior to the beginning of the transmission). With his ingenious work, once again, the engineering and probabilistic viewpoints coincide.


1.4 Joint design versus separate design of source and channel coders

In this section, we wish to justify, from the viewpoint of information theory, the general assumption of a uniformly distributed channel input, which is often made in channel coder design.

As depicted in Figure 1.2, a source encoder maps information symbols, representing informative events, to source codewords (e.g., u = f(z)). The channel encoder then selects channel codewords according to the source codewords (e.g., x = g(u)). In principle, these two coders can be jointly treated as a mapping directly from information symbols to channel codewords (e.g., x = g(f(z)) = h(z), where h = g ◦ f). It is then natural to foresee that a joint design of the source-channel code (i.e., finding the best mapping h(·)) is advantageous, but hard, because the main concerns of the two coders are somewhat contrary to each other: the source coder removes the redundancy of the information symbols, while the channel coder brings in redundancy to compensate for the noise. Yet it has recently been shown that when the channel noise is non-stationary in time, a joint design of source and channel coders may outperform the concatenation of a separately designed optimal source code and best channel code [4].

Figure 1.2: General model of a communication system. (Block diagram: Source → (z) → Source Encoder → (u) → Channel Encoder → (x) → Modulator → Physical Channel → Demodulator → Channel Decoder → Source Decoder → Destination; "Transmitter Part" and "Receiver Part" label the two halves.)

Conventionally, the source code and the channel code are designed independently, even if they are placed in one communication system. So to speak, the source code only focuses on the compression of the information message without considering the statistics of the channel noise, while the channel code adds code redundancy to balance the noise effect on the transmitted codewords by simply assuming that there is no redundancy in its inputs, which is conceptually equivalent to the channel encoder inputs being uniformly distributed. The equivalence between optimal source compression and uniformly distributed channel encoder inputs can be justified by the next example.

As mentioned in the previous section, the optimal source code for the event probabilities 1/2, 1/4, 1/8, 1/8 is

    event one   = e1 : 0
    event two   = e2 : 10
    event three = e3 : 110
    event four  = e4 : 111                                   (1.4.1)

where in this setting (cf. Figure 1.3), Z1, Z2, Z3, . . . draws values from {e1, e2, e3, e4} and is independent and identically distributed, and U1, U2, U3, . . . belongs to {0, 1}. It can be verified that, given a sequence of code bits u1, u2, . . . , un, the event sequence z1, z2, . . . , zm−1 can be uniquely determined, except possibly for the last event zm (otherwise, the source code would not be a distortionless compression code). Thus, we can derive Pr{Un+1 = un+1 | U1 = u1, U2 = u2, . . . , Un = un} by distinguishing the following cases:

• u1, u2, . . . , un uniquely determines z1, z2, . . . , zm: In this case,

    Pr{Un+1 = un+1 | U1 = u1, U2 = u2, . . . , Un = un}
        = Pr{Un+1 = un+1}
        = Pr{Zm+1 = e1} if un+1 = 0, and Pr{Zm+1 ≠ e1} if un+1 = 1
        = 1/2.

• u1, u2, . . . , un−1 uniquely determines z1, z2, . . . , zm−1 but zm remains undecided: In this case, un = 1, and

    Pr{Un+1 = un+1 | U1 = u1, U2 = u2, . . . , Un = un}
        = Pr{Un+1 = un+1 | Zm ∈ {e2, e3, e4}}
        = Pr{Zm = e2 | Zm ∈ {e2, e3, e4}} if un+1 = 0, and Pr{Zm ≠ e2 | Zm ∈ {e2, e3, e4}} if un+1 = 1
        = 1/2.

• u1, u2, . . . , un−2 uniquely determines z1, z2, . . . , zm−1 but zm remains undecided: In this case, un = un−1 = 1, and

    Pr{Un+1 = un+1 | U1 = u1, U2 = u2, . . . , Un = un}
        = Pr{Un+1 = un+1 | Zm ∈ {e3, e4}}
        = Pr{Zm = e3 | Zm ∈ {e3, e4}} if un+1 = 0, and Pr{Zm ≠ e3 | Zm ∈ {e3, e4}} if un+1 = 1
        = 1/2.

We then conclude that

    Pr{Un+1 = un+1 | U1 = u1, U2 = u2, . . . , Un = un} = 1/2

for all (u1, u2, . . . , un, un+1) ∈ {0, 1}^(n+1), and U1, U2, U3, . . . is an i.i.d. sequence with uniform marginal distribution.

Figure 1.3: A specific model of a source coder. (The source encoder maps the input sequence . . . , Z3, Z2, Z1 ∈ {e1, e2, e3, e4} to the output sequence . . . , U3, U2, U1 ∈ {0, 1}.)
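The conclusion can also be checked empirically: encode a long i.i.d. sample from this source with the code in (1.4.1) and inspect the statistics of the output bits. The following is a minimal simulation sketch (the function and variable names are ours):

    import random
    from collections import Counter

    # Code (1.4.1): an optimal prefix code for the distribution 1/2, 1/4, 1/8, 1/8.
    codebook = {"e1": "0", "e2": "10", "e3": "110", "e4": "111"}
    events, probs = ["e1", "e2", "e3", "e4"], [1/2, 1/4, 1/8, 1/8]

    random.seed(0)
    source = random.choices(events, weights=probs, k=200_000)  # i.i.d. source Z1, Z2, ...
    bits = "".join(codebook[z] for z in source)                # encoder output U1, U2, ...

    # Empirical distribution of single output bits and of non-overlapping bit pairs:
    print(Counter(bits))                                               # close to equal counts of '0' and '1'
    print(Counter(bits[i:i + 2] for i in range(0, len(bits) - 1, 2)))  # close to uniform over {00, 01, 10, 11}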

The above example shows that an optimal source encoder as defined in (1.4.1) does transform its non-uniformly distributed input into a uniformly distributed output, as we have claimed. An alternative conceptual proof of this statement is that if U1, U2, U3, . . . were not marginally uniformly distributed, then its entropy (rate) would be equal to

    R = p log2(1/p) + (1 − p) log2(1/(1 − p)) < 1 bit per U-symbol,

where Pr{U = 0} = p ≠ 1/2. Then, from Shannon's source coding theorem, we could construct another source encoder to transform U1, U2, . . . , Un into a shorter binary sequence of length ñ with ñ/n ≈ R < 1; hence, further compression of U1, U2, . . . , Un would result, which contradicts the optimality of the source encoder in (1.4.1).

We summarize the discussion in this section as follows. The output of an optimal source encoder, in the sense of minimizing the average per-letter codeword length (i.e., the number of U's divided by the number of Z's), which asymptotically achieves the per-letter source entropy (i.e., the overall entropy of Z1, Z2, . . . divided by the number of Z's), should be asymptotically i.i.d. with uniform marginal distribution. In case the average per-letter codeword length of the optimal source code equals the per-letter source entropy, its output becomes exactly i.i.d. with equiprobable marginals. Accordingly, in a communication system with separate source and channel coders, where an entropy-achieving optimal source code is employed, the channel code designer can certainly focus only on the establishment of a good channel code based on the uniform i.i.d. input assumption. Such a separation in code design considerations simplifies the system design effort. An interesting query that follows is that if the source code designer does not perform his duty well and fails to provide a uniform i.i.d. source encoder output, then the optimal channel code that assumes a uniform i.i.d. input may degrade in performance. In such a case, performance improvement by adopting a channel code specifically designed for a non-optimal source code may become fairly feasible!1

1Recently, there have been publications dealing with specific channel code designs for known data compressors, such as MELP and CELP. A quick example is the one by Alajaji, Phamdo and Fuja [1]. Interested readers should be able to find more references under the keyword joint source-channel code.


Bibliography

[1] F. Alajaji, N. Phamdo and T. Fuja, "Channel codes that exploit the residual redundancy in CELP-encoded speech," IEEE Trans. Speech and Audio Processing, vol. 4, no. 5, pp. 325–336, Sept. 1996.

[2] C. E. Shannon, "A mathematical theory of communication," Bell Sys. Tech. Journal, vol. 27, pp. 379–423, 623–656, 1948.

[3] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana, IL, 1949.

[4] S. Vembu, S. Verdú and Y. Steinberg, "The source-channel separation theorem revisited," IEEE Trans. on Information Theory, vol. 41, no. 1, pp. 44–54, Jan. 1995.


Chapter 2

Information Measures for Discrete Systems

In this chapter, we define information measures for discrete systems from a probabilistic standpoint. Properties of these information measures are then addressed. The relation between the probabilistically defined information measures and the coding limits will be discussed in subsequent chapters.

2.1 Entropy, joint entropy and conditional entropy

2.1.1 Self-information

Let E be an event with probability Pr(E), and let I(E) represent the amount of information you gain when you learn that E has occurred (or, equivalently, the amount of uncertainty you lose after learning that E has happened). Then a natural question to ask is: "What properties should I(E) have?" The answer may vary from person to person. Here are some common properties that I(E), which is called the self-information, is reasonably expected to have.

1. I(E) should be a function of Pr(E).

In other words, this property says that I(E) depends on E only through its probability, so that we may write I(E) = I(Pr(E)), where I(·) on the right-hand side is a function defined over [0, 1]. In general, people expect that the less likely an event is, the more information you gain when you learn that it has happened. In other words, I(Pr(E)) is a decreasing function of Pr(E).

2. I(Pr(E)) should be continuous in Pr(E).

Intuitively, we should expect that a small change in Pr(E) corresponds to a small change in the uncertainty of E.


3. If E1 and E2 are independent events, then I(E1 ∩ E2) = I(E1) + I(E2), or equivalently, I(Pr(E1) × Pr(E2)) = I(Pr(E1)) + I(Pr(E2)).

This property declares that the amount of uncertainty we lose by learning that both E1 and E2 have occurred should be equal to the sum of the individual uncertainty losses for independent E1 and E2.

Next, we show that the only function that satisfies properties 1, 2 and 3 is the logarithmic function.

Theorem 2.1 The only function defined over p ∈ [0, 1] and satisfying

1. I(p) is monotonically decreasing in p;

2. I(p) is a continuous function of p for 0 ≤ p ≤ 1;

3. I(p1 × p2) = I(p1) + I(p2);

is I(p) = −C · log(p), where C is a positive constant.

Proof:

Step 1: Claim. For n = 1, 2, 3, · · · ,

    I(1/n) = −C · log(1/n),

where C > 0 is a constant.

Proof: Conditions 1 and 3 respectively imply

    n < m  ⇒  I(1/n) < I(1/m)                                    (2.1.1)

and

    I(1/(mn)) = I(1/m) + I(1/n),                                  (2.1.2)

where n, m = 1, 2, 3, · · · . Now, using (2.1.2), we can show by induction that

    I(1/n^k) = k · I(1/n)                                         (2.1.3)

for all positive integers n and non-negative integers k. Note that (2.1.3) already proves the claim for the case of n = 1.


Now let n be a fixed positive integer greater than 1. Then for any positive integer r, there exists a non-negative integer k such that

    n^k ≤ 2^r < n^(k+1).

By (2.1.1), we obtain

    I(1/n^k) ≤ I(1/2^r) < I(1/n^(k+1)),

which, together with (2.1.3), yields

    k · I(1/n) ≤ r · I(1/2) < (k + 1) · I(1/n).

Hence, by I(1/n) > I(1) = 0,

    k/r ≤ I(1/2)/I(1/n) ≤ (k + 1)/r.

On the other hand, by the monotonicity of the logarithm, we obtain

    log n^k ≤ log 2^r ≤ log n^(k+1)  ⇔  k/r ≤ log(2)/log(n) ≤ (k + 1)/r.

Therefore,

    | log(2)/log(n) − I(1/2)/I(1/n) | < 1/r.

Since n is fixed, and r can be made arbitrarily large, we can let r → ∞ to get

    I(1/n) = C · log(n),

where C = I(1/2)/log(2) > 0. This completes the proof of the claim.

Step 2: Claim. I(p) = −C · log(p) for positive rational numbers p, where C > 0 is a constant.

Proof: A rational number p can be represented by a ratio of two integers, i.e., p = r/s, where r and s are both positive integers. Then condition 3 gives that

    I(1/s) = I((r/s) · (1/r)) = I(r/s) + I(1/r),

which, from Step 1, implies that

    I(p) = I(r/s) = I(1/s) − I(1/r) = C · log s − C · log r = −C · log p.


Step 3: For any p ∈ [0, 1], it follows by continuity that

    I(p) = lim_{a↑p, a rational} I(a) = lim_{b↓p, b rational} I(b) = −C · log(p).

□

2.1.2 Entropy

Entropy is a measure of the amount of information (or uncertainty) contained in the source. The source can be modelled as a random process, which is a collection of random variables indexed through an index set (cf. Appendix B). For simplicity, we first assume that the index set associated with the random process corresponding to the source consists of only one index. It is also assumed that the source alphabet X is finite. Then, as indicated in the previous subsection, the self-information can be probabilistically defined as

    I(x) ≜ − log PX(x),

where x ∈ X is a possible outcome of the source, and PX(·) is the probability distribution of the source X. This definition fits the intuition that a less likely outcome will bring more information. By extending the concept, entropy is defined as follows.

Definition 2.2 (entropy) For a source X, the entropy H(X) is defined by

    H(X) ≜ −∑_{x∈X} PX(x) · log PX(x) = E[− log PX(X)] = E[I(X)].

By the above definition, entropy can be interpreted as the expected or average amount of (self-)information you gain when you learn that one of the |X| outcomes has occurred, where |X| is the cardinality of X. Another interpretation is that H(X) is a measure of the uncertainty of the random variable X. Sometimes, H(X) is also written as H(PX) for notational convenience.

When the base of the logarithm operation is 2, entropy is expressed in bits; when the natural logarithm is employed, entropy is measured in nats. For example, the entropy of a fair coin source is 1 bit or log(2) nats.

In computing the entropy, we adopt the convention that

0 · log 0 = 0,

which can be justified by continuity since x log x → 0 as x → 0. Also note that the entropy only depends on the probability distribution of the source, and is not affected by the symbols that represent the outcomes. For example, we can use 0 and 2 to denote the head and tail of a fair coin source, and the entropy still remains 1 bit. For ease of computation, the natural logarithm is assumed throughout unless otherwise stated.

Example 2.3 Let X be a random variable with PX(1) = p and PX(0) = 1 − p. Then H(X) = −p · log p − (1 − p) · log(1 − p). This is called the binary entropy function.
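For a quick numerical feel of the binary entropy function, here is a minimal Python sketch (base-2 logarithm, so the values are in bits):

    import math

    def binary_entropy(p: float) -> float:
        """H(p) = -p*log2(p) - (1-p)*log2(1-p), with the convention 0*log 0 = 0."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    for p in (0.0, 0.1, 0.25, 0.5, 0.9):
        print(p, round(binary_entropy(p), 4))
    # The maximum value, 1 bit, is attained at p = 1/2 (a fair coin).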

2.1.3 Properties of entropy

Lemma 2.4 H(X) ≥ 0. Equality holds if, and only if, X is deterministic. (When X is deterministic, the uncertainty of X is obviously zero.)

Proof: 0 ≤ PX(x) ≤ 1 implies that log[1/PX(x)] ≥ 0 for every x ∈ X . Hence,

    H(X) = ∑_{x∈X} PX(x) log(1/PX(x)) ≥ 0,

with equality if, and only if, PX(x) = 1 for some x ∈ X. □

Lemma 2.5 If a random variable X takes values from a set X, then H(X) ≤ log(|X|), where |X| denotes the size of the set X.

Proof:

    log |X| − H(X) = log |X| × [∑_{x∈X} PX(x)] − [−∑_{x∈X} PX(x) log PX(x)]

                   = ∑_{x∈X} PX(x) × log |X| + ∑_{x∈X} PX(x) log PX(x)

                   = ∑_{x∈X} PX(x) log[|X| × PX(x)]

                   ≥ ∑_{x∈X} PX(x) (1 − 1/(|X| × PX(x)))
                     (because (∀ y > 0) log(y) ≥ 1 − (1/y), with equality if, and only if, y = 1)

                   = ∑_{x∈X} (PX(x) − 1/|X|)

                   = 1 − 1 = 0.


Equality holds if, and only if, (∀ x ∈ X) |X| × PX(x) = 1, which means PX(x) is a uniform distribution on X. □

Intuitively, H(X) tells us how random X is. Indeed, X is deterministic (not random at all) if, and only if, H(X) = 0. If X is uniform (or equiprobable), H(X) is maximized and is equal to log |X|.

In the above proof, we employed the inequality that for any y > 0,

    y − 1 ≥ log(y) ≥ 1 − 1/y,

with equality at y = 1. This inequality is referred to as the Fundamental Inequality. It can be easily justified by curve drawing.

The previous lemma can also be proven using the log-sum inequality stated below.

Lemma 2.6 (log-sum inequality) For non-negative numbers a1, a2, . . ., an and b1, b2, . . ., bn,

    ∑_{i=1}^{n} ai log(ai/bi) ≥ (∑_{i=1}^{n} ai) log((∑_{i=1}^{n} ai)/(∑_{i=1}^{n} bi)),      (2.1.4)

with equality if, and only if, (∀ 1 ≤ i ≤ n) ai/bi = a1/b1. (By convention, 0 · log(0) = 0, 0 · log(0/0) = 0 and a · log(a/0) = ∞ if a > 0. Again, this can be justified by "continuity.")

Proof: Without loss of generality, assume that ai > 0 and bi > 0. Jensen's inequality tells us that

    ∑_{i=1}^{n} αi f(ti) ≥ f(∑_{i=1}^{n} αi ti)

for any strictly convex function f(·), αi ≥ 0, and ∑_{i=1}^{n} αi = 1; equality holds if, and only if, ti is a constant for all i. Hence, by setting αi = bi/∑_{j=1}^{n} bj, ti = ai/bi, and f(t) = t · log(t), we obtain the desired result. □
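As a small numerical illustration of (2.1.4), the two sides can be compared for an arbitrary choice of positive numbers (a minimal Python sketch; the particular values below are ours):

    import math

    # Log-sum inequality: sum_i a_i*log(a_i/b_i) >= (sum_i a_i)*log(sum_i a_i / sum_i b_i).
    a = [0.5, 1.0, 2.5]
    b = [1.0, 0.2, 0.8]

    lhs = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    rhs = sum(a) * math.log(sum(a) / sum(b))

    print(lhs, rhs, lhs >= rhs)  # lhs exceeds rhs; equality needs a_i/b_i constant in i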

2.1.4 Joint entropy and conditional entropy

We now consider the case where the index set associated with the random source consists of two indexes. Then the self-information of such a source is probabilistically defined as

    I(x, y) ≜ − log PX,Y(x, y),

where (x, y) ∈ X × Y is a possible outcome of the source, and PX,Y(·, ·) is the probability distribution of the source (X, Y). This leads us to the definitions of joint entropy and conditional entropy:


Definition 2.7 (joint entropy)

    H(X, Y) ≜ −∑_{(x,y)∈X×Y} PX,Y(x, y) · log PX,Y(x, y) = E[− log PX,Y(X, Y)].

Definition 2.8 (conditional entropy) The conditional entropy H(Y|X) is defined as

    H(Y|X) ≜ ∑_{x∈X} PX(x) (−∑_{y∈Y} PY|X(y|x) · log PY|X(y|x)).      (2.1.5)

Equation (2.1.5) can be written in three different but equivalent forms:

    H(Y|X) = ∑_{x∈X} PX(x) · H(Y|X = x)
           = −∑_{(x,y)∈X×Y} PX,Y(x, y) · log PY|X(y|x)
           = E[− log PY|X(Y|X)].

The relationship between joint entropy and conditional entropy is exhibited by the fact that the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other.

Theorem 2.9 (chain rule for entropy)

H(X, Y ) = H(X) + H(Y |X) (2.1.6)

Proof: This can be easily justified by:

PX,Y (x, y) = PX(x)PY |X(y|x),

since

H(X, Y ) = E[− log PX,Y (X,Y )]

= E[− log PX(X)] + E[− log PY |X(Y |X)]

= H(X) + H(Y |X).

□

By its definition, joint entropy is commutative; i.e., H(X, Y) = H(Y, X). Hence,

H(X, Y ) = H(X) + H(Y |X) = H(Y ) + H(X|Y ) = H(Y,X),


which implies that

H(X)−H(X|Y ) = H(Y )−H(Y |X). (2.1.7)

The above quantity is exactly equal to the mutual information, which will be introduced in the next section.

The conditional entropy can be thought of in terms of a channel whose input is the random variable X and whose output is the random variable Y. H(X|Y) is then called the equivocation1 and corresponds to the uncertainty in the channel input from the receiver's point of view. For example, suppose that the set of possible outcomes of the random vector (X, Y) is {(0, 0), (0, 1), (1, 0), (1, 1)}, where none of the elements has zero probability mass. When the receiver Y receives 1, he still cannot determine exactly what the sender X observes (it could be either 1 or 0); therefore, the uncertainty, from the receiver's viewpoint, depends on the probabilities PX|Y(0|1) and PX|Y(1|1).

Similarly, H(Y|X), which is called the prevarication,2 is the uncertainty in the channel output from the transmitter's point of view. In other words, the sender knows exactly what he sends, but is uncertain about what the receiver will finally obtain.

A case of specific interest is when H(X|Y) = 0. By its definition, H(X|Y) = 0 when X becomes deterministic after observing Y. In such a case, the uncertainty about X given Y is completely zero.

The next corollary can be proved similarly to Theorem 2.9.

Corollary 2.10 (chain rule for conditional entropy)

H(X,Y |Z) = H(X|Z) + H(Y |X, Z).

2.1.5 Properties of joint entropy and conditional entropy

Lemma 2.11 Side information Y reduces the uncertainty about X unless X and Y are independent. That is,

H(X|Y) ≤ H(X)

with equality if, and only if, X and Y are independent.

1 Equivocation refers to the act of deliberately using vague or ambiguous language in order to deceive or to avoid speaking the truth.

2 Prevarication refers to the act of deliberately avoiding a task that one ought to do, or avoiding telling people something that they want to be told.


Proof:

H(X) − H(X|Y) = ∑_{(x,y)∈X×Y} PX,Y(x, y) · log [ PX|Y(x|y) / PX(x) ]

= ∑_{(x,y)∈X×Y} PX,Y(x, y) · log [ PX|Y(x|y)PY(y) / ( PX(x)PY(y) ) ]

= ∑_{(x,y)∈X×Y} PX,Y(x, y) · log [ PX,Y(x, y) / ( PX(x)PY(y) ) ]

≥ ( ∑_{(x,y)∈X×Y} PX,Y(x, y) ) log [ ∑_{(x,y)∈X×Y} PX,Y(x, y) / ∑_{(x,y)∈X×Y} PX(x)PY(y) ]    (by the log-sum inequality)

= 0

with equality if, and only if,

(∀ (x, y) ∈ X × Y)  PX,Y(x, y) / ( PX(x)PY(y) ) = constant.

Since probabilities must sum to 1, the above constant equals 1, which is exactly the case where X and Y are independent. 2

Lemma 2.12 Entropy is additive for independent random variables, i.e.,

H(X,Y ) = H(X) + H(Y ) for independent X and Y.

Proof: By the previous lemma, independence of X and Y implies H(Y|X) = H(Y). Hence,

H(X, Y ) = H(X) + H(Y |X) = H(X) + H(Y ).

2

Since “conditioning” never increases entropy, it follows that

H(X,Y ) = H(X) + H(Y |X) ≤ H(X) + H(Y ). (2.1.8)

The above lemma tells us that equality holds in (2.1.8) only when X is independent of Y.

A similar result to (2.1.8) also applies to conditional entropy.


Lemma 2.13 Conditional entropy is lower additive; i.e.,

H(X1, X2|Y1, Y2) ≤ H(X1|Y1) + H(X2|Y2).

Equality holds if, and only if, (X1|Y1) and (X2|Y2) are independent, namely

PX1,X2|Y1,Y2(x1, x2|y1, y2) = PX1|Y1(x1|y1)PX2|Y2(x2|y2).

Proof: For any y1 and y2,

H(X1, X2|y1, y2) ≤ H(X1|y1, y2) + H(X2|y1, y2), (2.1.9)

≤ H(X1|y1) + H(X2|y2), (2.1.10)

where

H(X1|y1) ≜ − ∑_{x1∈X} PX1|Y1(x1|y1) log PX1|Y1(x1|y1),

and H(X2|y2), H(X1|y1, y2) and H(X2|y1, y2) are similarly defined. By taking the expectation with respect to (Y1, Y2) in the above inequalities, we obtain

H(X1, X2|Y1, Y2) ≤ H(X1|Y1) + H(X2|Y2).

For (2.1.9), equality holds if, and only if, X1 and X2 are conditionally independent given (Y1, Y2) = (y1, y2). Since this should hold for any y1 and y2, X1 and X2 must be conditionally independent given (Y1, Y2). For (2.1.10), equality holds if, and only if, X1 is independent of Y2, and X2 is independent of Y1. Hence, the desired equality condition of the lemma is obtained. 2

2.2 Mutual information and conditional mutual information

For two random variables X and Y, the mutual information between X and Y is the reduction in the uncertainty of Y due to the knowledge of X (or vice versa). For example, in Example 1.1, the mutual information of the channel is the first argument X1. A dual definition of mutual information states that it is the average amount of information that Y has (or contains) about X, or that X has (or contains) about Y. Under this definition, we can say that the shared (or mutual) uncertainty (or information) in Example 1.1 between channel sender and channel receiver is Uncertainty X1.

We can think of the mutual information between X and Y in terms of a channel whose input is X and whose output is Y. The reduction of the uncertainty is, by definition, the total uncertainty of X (i.e., H(X); e.g., Uncertainty X1 and Uncertainty X2 in Example 1.1) minus the uncertainty of X after observing Y (i.e., H(X|Y); e.g., Uncertainty X2 in Example 1.1). Mathematically,

mutual information = I(X; Y) ≜ H(X) − H(X|Y).    (2.2.1)

It can be easily verified from (2.1.7) that mutual information is symmetric; i.e., I(X; Y) = I(Y; X).
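This symmetry can be checked numerically. The sketch below (again with an arbitrary joint pmf) evaluates I(X;Y) as H(X) − H(X|Y), as H(Y) − H(Y|X), and as H(X) + H(Y) − H(X,Y), and the three values coincide.

```python
import numpy as np

P_XY = np.array([[0.3, 0.1],
                 [0.2, 0.4]])          # arbitrary joint pmf of (X, Y)
P_X = P_XY.sum(axis=1)
P_Y = P_XY.sum(axis=0)

H = lambda p: -np.sum(p * np.log(p))   # entropy in nats (all masses positive here)
H_XY = H(P_XY)

I_1 = H(P_X) - (H_XY - H(P_Y))         # H(X) - H(X|Y)
I_2 = H(P_Y) - (H_XY - H(P_X))         # H(Y) - H(Y|X)
I_3 = H(P_X) + H(P_Y) - H_XY           # H(X) + H(Y) - H(X,Y)
print(I_1, I_2, I_3)                   # identical values
```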

2.2.1 Properties of mutual information

Lemma 2.14

1. I(X; Y) = ∑_{x∈X} ∑_{y∈Y} PX,Y(x, y) log [ PX,Y(x, y) / ( PX(x)PY(y) ) ].

2. I(X; Y) = I(Y; X).

3. I(X; Y) = H(X) + H(Y) − H(X, Y).

4. I(X; Y) ≤ H(X), with equality if, and only if, X is a function of Y (i.e., X = f(Y) for some function f(·)).

5. I(X; Y) ≥ 0, with equality if, and only if, X and Y are independent.

Proof: Properties 1, 2, 3, and 4 follow immediately from the definition. Property 5 is a direct consequence of Lemma 2.11. 2

The relationship between H(X), H(Y), H(X, Y), H(X|Y), H(Y|X) and I(X; Y) can be illustrated by the Venn diagram in Figure 2.1.

2.2.2 Conditional mutual information and its properties

The conditional mutual information, denoted by I(X; Y|Z), is defined as the common uncertainty between X and Y under the knowledge of Z. It is mathematically defined by

I(X; Y |Z) = H(X|Z)−H(X|Y, Z).

Lemma 2.15 (chain rule for mutual information)

I(X; Y, Z) = I(X; Y ) + I(X; Z|Y ) = I(X; Z) + I(X; Y |Z).


(Diagram omitted: Venn diagram showing H(X) and H(Y) overlapping in I(X; Y), with H(X|Y) and H(Y|X) as the non-overlapping parts and H(X, Y) as the union.)

Figure 2.1: Relation between entropy and mutual information.

Proof: Without loss of generality, we only prove the first equality:

I(X; Y, Z) = H(X)−H(X|Y, Z)

= H(X)−H(X|Y ) + H(X|Y )−H(X|Y, Z)

= I(X; Y ) + I(X; Z|Y ).

2

The above lemma can be read as: the information that (Y, Z) has about X is equal to the information that Y has about X plus the information that Z has about X when Y is already known.

2.3 Properties of entropy and mutual information for higher dimensional extensions

Theorem 2.16 (chain rule for entropy) Let X1, X2, . . ., Xn be drawn according to PXn(x1, . . . , xn). Then

H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1).

(It can also be written as

H(Xn) = ∑_{i=1}^{n} H(Xi|X^{i−1}),

where X^i ≜ (X1, . . . , Xi).)


Proof: From (2.1.6),

H(X1, X2, . . . , Xn) = H(X1, X2, . . . , Xn−1) + H(Xn|Xn−1, . . . , X1). (2.3.1)

Once again, applying (2.1.6) to the first term on the right-hand side of (2.3.1), we have

H(X1, X2, . . . , Xn−1) = H(X1, X2, . . . , Xn−2) + H(Xn−1|Xn−2, . . . , X1).

The desired result can then be obtained by repeatedly applying (2.1.6). 2

Theorem 2.17 (chain rule for conditional entropy)

H(X1, X2, . . . , Xn|Y) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1, Y).

Proof: The theorem can be proved similarly to Theorem 2.16. 2

Theorem 2.18 (chain rule for mutual information)

I(X1, X2, . . . , Xn; Y) = ∑_{i=1}^{n} I(Xi; Y|Xi−1, . . . , X1).

Proof: This can be proved by first expressing mutual information in terms of entropy and conditional entropy, and then applying the chain rules for entropy and conditional entropy. 2

Theorem 2.19 (independence bound on entropy)

H(X1, X2, . . . , Xn) ≤ ∑_{i=1}^{n} H(Xi).

Equality holds if, and only if, Xi is independent of (Xi−1, . . . , X1) for every i.

Proof: By applying the chain rule for entropy,

H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1) ≤ ∑_{i=1}^{n} H(Xi).

Equality holds when each conditional entropy is equal to its associated entropy. 2


Theorem 2.20 (bound on mutual information) If {(Xi, Yi)}_{i=1}^{n} is a process such that PY^n|X^n = ∏_{i=1}^{n} PYi|Xi, then

I(X1, . . . , Xn; Y1, . . . , Yn) ≤ ∑_{i=1}^{n} I(Xi; Yi)

with equality if, and only if, {Xi}_{i=1}^{n} are independent.

Proof: From (2.1.8), we have

H(Y1, . . . , Yn) ≤ ∑_{i=1}^{n} H(Yi).

By the independence of {(Yi|Xi)}_{i=1}^{n} and applying Theorem 2.16,

H(Y1, . . . , Yn|X1, . . . , Xn) = ∑_{i=1}^{n} H(Yi|Y1, . . . , Yi−1, X1, . . . , Xn) = ∑_{i=1}^{n} H(Yi|Xi).

Hence,

I(Xn; Yn) = H(Yn) − H(Yn|Xn) ≤ ∑_{i=1}^{n} H(Yi) − ∑_{i=1}^{n} H(Yi|Xi) = ∑_{i=1}^{n} I(Xi; Yi)

with equality if, and only if, {Yi}_{i=1}^{n} are independent, which holds if, and only if, {Xi}_{i=1}^{n} are independent. 2

Lemma 2.21 (data processing inequality) (This is also called the data processing lemma.) If X → Y → Z, then I(X; Y) ≥ I(X; Z).

Proof: Since X and Z are conditionally independent given Y, we have I(X; Z|Y) = 0. By the chain rule for mutual information,

I(X; Z) + I(X; Y |Z) = I(X; Y, Z) (2.3.2)

= I(X; Y ) + I(X; Z|Y )

= I(X; Y ). (2.3.3)


(Diagram omitted: Source → Encoder → Channel → Decoder, with intermediate variables U (source output), X (channel input), Y (channel output), V (decoder output), and I(U; V) ≤ I(X; Y).)

"By processing, we can only lose mutual information, but the remaining mutual information may be in a more useful form!"

Figure 2.2: Communication context of the data processing lemma.

Since I(X; Y|Z) ≥ 0, I(X; Y) ≥ I(X; Z), with equality if, and only if, I(X; Y|Z) = 0. 2

The data processing inequality means that mutual information cannot increase after processing. This result is somewhat counter-intuitive since, given two random variables X and Y, we might believe that applying a well-designed processing scheme to Y, which can generally be represented by a mapping g(Y), could possibly increase the mutual information. However, for any g(·), X → Y → g(Y) forms a Markov chain, which implies that data processing cannot increase mutual information. A direct look at the communication context of the data processing lemma is depicted in Figure 2.2, and summarized in the next corollary.

Corollary 2.22 For any g(·) which forms X → Y → g(Y ),

I(X; Y ) ≥ I(X; g(Y )).

The final remark on mutual information is that if Z obtains all of its information about X through Y, then knowing Z will not help increase the mutual information between X and Y.

Corollary 2.23 If X → Y → Z, then

I(X; Y|Z) ≤ I(X; Y).

Proof: The proof follows from (2.3.2) and (2.3.3). 2

Note that it is possible that I(X; Y|Z) > I(X; Y) when X, Y and Z do not form a Markov chain. For example, let X and Y be independent equiprobable binary random variables, and let Z = X + Y. Then

I(X; Y|Z) = H(X|Z) − H(X|Y, Z)
          = H(X|Z)
          = PZ(0)H(X|Z = 0) + PZ(1)H(X|Z = 1) + PZ(2)H(X|Z = 2)
          = 0 + 0.5 + 0
          = 0.5 bits,

which is clearly larger than I(X; Y) = 0.
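The counterexample can also be checked by direct enumeration. The sketch below builds the joint distribution of (X, Y, Z) with Z = X + Y and computes I(X;Y) and I(X;Y|Z) in bits.

```python
import numpy as np
from itertools import product

# X, Y independent equiprobable bits; Z = X + Y.
p = {}
for x, y in product([0, 1], repeat=2):
    p[(x, y, x + y)] = 0.25

def H(joint):
    """Entropy (in bits) of a dict {outcome: probability}."""
    probs = np.array(list(joint.values()))
    return -np.sum(probs * np.log2(probs))

def marginal(joint, idx):
    """Marginalize the joint dict onto the coordinates listed in idx."""
    out = {}
    for k, v in joint.items():
        kk = tuple(k[i] for i in idx)
        out[kk] = out.get(kk, 0.0) + v
    return out

H_XY = H(marginal(p, (0, 1)))
H_X = H(marginal(p, (0,)))
H_Y = H(marginal(p, (1,)))
I_XY = H_X + H_Y - H_XY                    # = 0 bits

H_XZ = H(marginal(p, (0, 2)))
H_YZ = H(marginal(p, (1, 2)))
H_Z = H(marginal(p, (2,)))
I_XY_given_Z = H_XZ + H_YZ - H(p) - H_Z    # = 0.5 bits
print(I_XY, I_XY_given_Z)
```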

2.4 Relative entropy and hypothesis testing

In addition to the probabilistically defined entropy and mutual information, another measure that is frequently considered in information theory is the relative entropy or divergence. We will give its definition and statistical properties in this section.

2.4.1 Fundamentals on hypothesis testing

One of the fundamental problems in statistics is to decide between two alternative explanations for the observed data. For example, when gambling, one may wish to test whether the game is fair or not. Similarly, a sequence of observations on the market may reveal whether a new product is successful or not. This is the simplest form of the hypothesis testing problem, which is usually named simple hypothesis testing.

It has quite a few applications in information theory. One of the frequently cited examples is the alternative interpretation of the law of large numbers. Another example is the computation of the true coding error (for universal codes) by testing the empirical distribution against the true distribution. All of these cases will be discussed subsequently.

The simple hypothesis testing problem can be formulated as follows:

Problem: Let X1, . . . , Xn be a sequence of observations possibly drawn according to either a null hypothesis distribution PXn or an alternative hypothesis distribution P̂Xn. These hypotheses are usually denoted by:

• H0 : PXn

• H1 : P̂Xn

Based on one sequence of observations xn, one has to decide which of the hypotheses is true. This is denoted by a decision mapping φ(·), where

φ(xn) = 0, if the distribution of Xn is classified to be PXn;
φ(xn) = 1, if the distribution of Xn is classified to be P̂Xn.

Accordingly, the possible observed sequences are divided into two groups:

Acceptance region for H0 : {xn ∈ Xn : φ(xn) = 0}
Acceptance region for H1 : {xn ∈ Xn : φ(xn) = 1}.


Hence, depending on the true distribution, there are two possible types of error probabilities:

Type I error :  αn = αn(φ) = PXn ( {xn ∈ Xn : φ(xn) = 1} )
Type II error : βn = βn(φ) = P̂Xn ( {xn ∈ Xn : φ(xn) = 0} ).

The choice of the decision mapping depends on the optimization criterion. Two of the most frequently used criteria in information theory are:

1. Bayesian hypothesis testing.

φ(·) is chosen so that the Bayesian cost

π0 αn + π1 βn

is minimized, where π0 and π1 are the prior probabilities for the null hypothesis and the alternative hypothesis, respectively. The mathematical expression for Bayesian testing is:

min_{φ} [ π0 αn(φ) + π1 βn(φ) ].

2. Neyman-Pearson hypothesis testing subject to a fixed test level.

φ(·) is chosen so that the type II error βn is minimized subject to a constant bound on the type I error, i.e.,

αn ≤ ε.

The mathematical expression for Neyman-Pearson testing is:

min_{φ : αn(φ) ≤ ε} βn(φ).

The set of mappings φ considered in the minimization could range over two different classes: deterministic rules and randomized rules. The main difference between a randomized rule and a deterministic rule is that the former allows the mapping φ(xn) to be random on {0, 1} for some xn, while the latter only accepts a deterministic assignment in {0, 1} for all xn. For example, a randomized rule for a specific observation xn can be

φ(xn) = 0, with probability 0.2;
φ(xn) = 1, with probability 0.8.


2.4.2 Relative entropy or Kullback-Leibler divergence

The Neyman-Pearson lemma shows the well-known fact that the likelihood ratio test is always an optimal test.

Lemma 2.24 (Neyman-Pearson lemma) For a simple hypothesis testing problem, define an acceptance region for the null hypothesis through the likelihood ratio as

An(τ) ≜ { xn ∈ Xn : PXn(xn)/P̂Xn(xn) > τ },

and let

α*n ≜ PXn{ Acn(τ) }  and  β*n ≜ P̂Xn{ An(τ) }.

Then for the type I error αn and type II error βn associated with another choice of acceptance region for the null hypothesis, we have

αn ≤ α*n  ⇒  βn ≥ β*n.

Proof: Let B be any choice of acceptance region for the null hypothesis. Then

αn + τβn = ∑_{xn∈Bc} PXn(xn) + τ ∑_{xn∈B} P̂Xn(xn)

= ∑_{xn∈Bc} PXn(xn) + τ [ 1 − ∑_{xn∈Bc} P̂Xn(xn) ]

= τ + ∑_{xn∈Bc} [ PXn(xn) − τP̂Xn(xn) ].    (2.4.1)

Observe that (2.4.1) is minimized by choosing B = An(τ). Hence,

αn + τβn ≥ α*n + τβ*n,

which immediately implies the desired result. 2

The Neyman-Pearson lemma indicates that no other choice of acceptance region can simultaneously improve both the type I error and the type II error of the likelihood ratio test. Indeed, from (2.4.1), it is clear that for any achievable (αn, βn) one can always find a likelihood ratio test that performs at least as well. Therefore, the likelihood ratio test is an optimal test. The statistical properties of the likelihood ratio therefore become essential in the theory of hypothesis testing.
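To see the trade-off concretely, here is a small sketch of a likelihood-ratio test (an assumed toy setup: a Bernoulli source observed n = 10 times, with PX(1) = 0.5 under H0 and P̂X(1) = 0.8 under H1). It sweeps the threshold τ of the acceptance region An(τ) and reports the resulting (αn, βn) pairs; lowering one error raises the other.

```python
import numpy as np
from itertools import product

n = 10
p0, p1 = 0.5, 0.8          # P_X(1) under H0 and under H1 (assumed example values)

def prob(seq, p):          # i.i.d. probability of a binary sequence under Bernoulli(p)
    k = sum(seq)
    return (p ** k) * ((1 - p) ** (n - k))

seqs = list(product([0, 1], repeat=n))
P0 = np.array([prob(s, p0) for s in seqs])   # null hypothesis P_{X^n}
P1 = np.array([prob(s, p1) for s in seqs])   # alternative hypothesis
ratio = P0 / P1                              # likelihood ratio

for tau in [0.25, 1.0, 4.0]:
    accept_H0 = ratio > tau                  # acceptance region A_n(tau)
    alpha = P0[~accept_H0].sum()             # type I error
    beta = P1[accept_H0].sum()               # type II error
    print(f"tau={tau:5.2f}  alpha={alpha:.4f}  beta={beta:.4f}")
```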

We next define a new measure based on the likelihood ratio.

Definition 2.25 (relative entropy rate or Kullback-Leibler divergence rate3) The relative entropy rate or divergence rate with respect to a sequence of observations Xn for null hypothesis distribution PXn against alternative hypothesis distribution P̂Xn is defined by

(1/n) D(Xn‖X̂n) = (1/n) D(PXn‖P̂Xn) ≜ EXn [ (1/n) log ( PXn(Xn) / P̂Xn(Xn) ) ].

3 Here, "rate" means "normalized by the number of samples," i.e., n. We can similarly have entropy rate and (mutual-)information rate.

If the observations are i.i.d. under both hypotheses, then the divergence rate reduces to a single-letter formula:

(1/n) D(Xn‖X̂n) = EXn [ (1/n) log ( PXn(Xn) / P̂Xn(Xn) ) ]
               = ∑_{i=1}^{n} (1/n) EX [ log ( PX(Xi) / P̂X(Xi) ) ]
               = D(X‖X̂).

This quantity D(X‖X̂) is referred to as the divergence.

As the name reveals, D(X‖X̂) can be viewed as a measure of the divergence of distribution PX from distribution P̂X. It is also called relative entropy, a name owing to the fact that some researchers treat it as a measure of the inefficiency of mistakenly assuming that the distribution is P̂X when the true distribution is PX. For example, if we know the true distribution PX of a source, then we can construct a lossless data compression code whose average codeword length achieves the entropy H(X) (this will be discussed in the next chapter). If, however, we mistakenly thought the "true" distribution is P̂X and employed the "best" code corresponding to P̂X, then the resultant average codeword length becomes

∑_{x∈X} [ −PX(x) · log P̂X(x) ].

As a result, the difference between the resultant average codeword length and H(X) is the relative entropy D(X‖X̂). Hence, divergence is a measure of the system cost (e.g., more storage consumed) paid due to mis-classifying the system statistics.
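This coding interpretation is easy to verify numerically: the mismatched average codeword length is the cross-entropy ∑ −PX(x) log P̂X(x), and its gap from H(X) is exactly the divergence. The sketch below uses an arbitrary illustrative pair of distributions (natural logarithms, so the cost is in nats).

```python
import numpy as np

P = np.array([0.5, 0.25, 0.125, 0.125])   # true source distribution P_X
Q = np.array([0.25, 0.25, 0.25, 0.25])    # mistakenly assumed distribution (the \hat{P}_X above)

H_P = -np.sum(P * np.log(P))              # entropy H(X) in nats
cross_entropy = -np.sum(P * np.log(Q))    # average length of the mismatched code (nats)
D = np.sum(P * np.log(P / Q))             # divergence D(X || \hat{X})

print(cross_entropy - H_P, D)             # the two numbers coincide
```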

Note that when computing divergences, we adopt the convention that

0 · log(0/p) = 0 and p · log(p/0) = ∞ for p > 0.

2.5 Properties of divergence and its relation with entropy and mutual information

Lemma 2.26 (nonnegativity of divergence)

D(X‖X̂) ≥ 0

with equality if, and only if, PX(x) = P̂X(x) for all x ∈ X.

Proof:

D(X‖X̂) = ∑_{x∈X} PX(x) log ( PX(x) / P̂X(x) )
        ≥ ( ∑_{x∈X} PX(x) ) log ( ∑_{x∈X} PX(x) / ∑_{x∈X} P̂X(x) )
        = 0,

where the second step follows from the log-sum inequality, with equality if, and only if, PX(x) = P̂X(x) for all x ∈ X. 2

Observation 2.27 (mutual information and divergence)

I(X; Y ) = D(PX,Y ‖PX × PY ).

Proof: The observation follows directly from the definitions of divergence and mutual information. 2

Lemma 2.28 (chain rule for divergence)

D(X, Y‖X̂, Ŷ) = D(X‖X̂) + D((Y|X)‖(Ŷ|X̂)),

where

D((Y|X)‖(Ŷ|X̂)) ≜ ∑_{x∈X} ∑_{y∈Y} PX,Y(x, y) log ( PY|X(y|x) / P̂Y|X̂(y|x) ).

Proof: This can be proved directly from the definition of divergence. 2

Definition 2.29 (refinement of a distribution) Given a distribution PX on X, divide X into k mutually disjoint sets U1, U2, . . ., Uk satisfying

X = ∪_{i=1}^{k} Ui.

Define a new distribution PU by

PU(Ui) = ∑_{x∈Ui} PX(x).

Then PX is called a refinement of PU.


Let us briefly discuss the relation between processing of information and the refinement of it. Processing of information can be modelled as a (many-to-one) function mapping, and refinement is actually the opposite of processing. Recall that the data processing lemma shows that mutual information can never increase due to processing. Hence, if one wishes to increase the mutual information, he should simultaneously anti-process (or refine) the input and output statistics.

From Observation 2.27, the mutual information can be viewed as the divergence of the joint input-output distribution against the product of the input and output marginals. It is therefore reasonable to expect that a similar effect of processing and refinement should also apply to divergence. This is shown in the next lemma.

Lemma 2.30 (divergence is either unchanged or increased after refinement) Let PX and P̂X be refinements of PU and P̂U, respectively. Then

D(PX‖P̂X) ≥ D(PU‖P̂U).

Proof: By the log-sum inequality,

∑_{x∈Ui} PX(x) log ( PX(x) / P̂X(x) ) ≥ ( ∑_{x∈Ui} PX(x) ) log ( ∑_{x∈Ui} PX(x) / ∑_{x∈Ui} P̂X(x) )
= PU(Ui) log ( PU(Ui) / P̂U(Ui) ),    (2.5.1)

with equality if, and only if,

(∀ x ∈ Ui)  PX(x)/P̂X(x) = PU(Ui)/P̂U(Ui).

Hence,

D(PX‖P̂X) = ∑_{i=1}^{k} ∑_{x∈Ui} PX(x) log ( PX(x) / P̂X(x) )
          ≥ ∑_{i=1}^{k} PU(Ui) log ( PU(Ui) / P̂U(Ui) )
          = D(PU‖P̂U),

with equality if, and only if,

(∀ i)(∀ x ∈ Ui)  PX(x)/P̂X(x) = PU(Ui)/P̂U(Ui). 2

One drawback of adopting the divergence as a measure between two distributions is that it does not satisfy the symmetry requirement of a metric,4 since interchanging the two arguments may yield a different quantity. In other words,

D(PX‖P̂X) ≠ D(P̂X‖PX)

in general. Due to this, other measures, such as the variational distance, are sometimes used instead.

Definition 2.31 (variational distance) The variational distance between two distributions PX and P̂X is

‖PX − P̂X‖ ≜ ∑_{x∈X} |PX(x) − P̂X(x)|.

An alternative but equivalent definition of the variational distance is:

‖PX − P̂X‖ = 2 · sup_{E⊂X} |PX(E) − P̂X(E)|.

Lemma 2.32 (variational distance and divergence)

D(X‖X̂) ≥ (1/2) · ‖PX − P̂X‖².

Proof:

Step 1: Define A ≜ {x ∈ X : PX(x) > P̂X(x)}. Then the variational distance is equal to

‖PX − P̂X‖ = 2 [ PX(A) − P̂X(A) ].

Step 2: Define two random variables U and Û as

U = 1 if X ∈ A, and U = 0 if X ∈ Ac;
Û = 1 if X̂ ∈ A, and Û = 0 if X̂ ∈ Ac.

Then PX and P̂X are refinements of PU and P̂U, respectively. From Lemma 2.30,

D(PX‖P̂X) ≥ D(PU‖P̂U).

4 A distance or metric [1][2, pp. 139] should satisfy the properties of i) non-negativity; ii) being zero if, and only if, the two points coincide; iii) symmetry; and iv) the triangle inequality.


Step 3: It remains to show that

D(PU‖P̂U) ≥ 2 [ PX(A) − P̂X(A) ]² = 2 [ PU(1) − P̂U(1) ]².

For ease of notation, let p = PU(1) and q = P̂U(1). Then proving the above inequality is equivalent to showing that

p · log(p/q) + (1 − p) · log( (1 − p)/(1 − q) ) ≥ 2(p − q)².

Define

f(p, q) ≜ p · log(p/q) + (1 − p) · log( (1 − p)/(1 − q) ) − 2(p − q)²,

and observe that

df(p, q)/dq = (p − q) ( 4 − 1/( q(1 − q) ) ) ≤ 0 for q ≤ p.

Thus, f(p, q) is non-increasing in q for q ≤ p. Also note that f(p, q) = 0 for q = p. Therefore,

f(p, q) ≥ 0 for q ≤ p.

The proof is completed by noting that

f(p, q) ≥ 0 for q > p,

since f(1 − p, 1 − q) = f(p, q). 2
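The bound of Lemma 2.32 can be spot-checked on random distribution pairs, as in the sketch below (random five-point distributions; natural logarithms, so the divergence is in nats).

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    P = rng.random(5); P /= P.sum()
    Q = rng.random(5); Q /= Q.sum()
    D = np.sum(P * np.log(P / Q))      # divergence D(P || Q)
    V = np.sum(np.abs(P - Q))          # variational distance ||P - Q||
    print(D >= 0.5 * V ** 2, D, 0.5 * V ** 2)
```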

The above lemma tells us that for a sequence of distribution pairs {(PXn, P̂Xn)}_{n≥1}, when D(PXn‖P̂Xn) goes to zero, ‖PXn − P̂Xn‖ goes to zero as well. The converse, however, does not necessarily hold. A quick counterexample is PXn(0) = 1 − PXn(1) = 1/n > 0 and P̂Xn(0) = 1 − P̂Xn(1) = 0. In this case, D(PXn‖P̂Xn) = ∞ since, by convention, (1/n) · log((1/n)/0) = ∞, and yet

‖PXn − P̂Xn‖ = 2 [ PXn({x : PXn(x) > P̂Xn(x)}) − P̂Xn({x : PXn(x) > P̂Xn(x)}) ] = 2/n → 0.

We can, however, bound D(PX‖P̂X) by the variational distance between PX and P̂X when D(PX‖P̂X) < ∞.


Lemma 2.33 If D(PX‖P̂X) < ∞, then

D(PX‖P̂X) ≤ [ 1 / min_{x : PX(x)>0} min{PX(x), P̂X(x)} ] · ‖PX − P̂X‖.

Proof: Without loss of generality, we assume that PX(x) > 0 for all x. Since D(PX‖P̂X) < ∞, PX(x) > 0 implies P̂X(x) > 0. Let

t ≜ min_{x : PX(x)>0} min{PX(x), P̂X(x)}.

Then for all x,

log ( PX(x) / P̂X(x) ) ≤ | log PX(x) − log P̂X(x) |
≤ | max_{ min{PX(x),P̂X(x)} ≤ s ≤ max{PX(x),P̂X(x)} } d log(s)/ds | · |PX(x) − P̂X(x)|
= [ 1 / min{PX(x), P̂X(x)} ] · |PX(x) − P̂X(x)|
≤ (1/t) · |PX(x) − P̂X(x)|.

Hence,

D(PX‖P̂X) = ∑_{x∈X} PX(x) · log ( PX(x) / P̂X(x) )
          ≤ (1/t) ∑_{x∈X} PX(x) · |PX(x) − P̂X(x)|
          ≤ (1/t) ∑_{x∈X} |PX(x) − P̂X(x)|
          = (1/t) · ‖PX − P̂X‖. 2

The next lemma discusses the effect of side information on divergence. As stated in Lemma 2.11, side information usually reduces entropy; it, however, increases divergence. Both results can be interpreted as saying that side information is useful. For entropy, side information provides us with more information, so uncertainty decreases. For divergence, it is a measure or index of how easily one can differentiate the source between two candidate distributions: the larger the divergence, the easier it is to tell the two distributions apart and make the right guess. In the extreme case where the divergence is zero, one can never tell which distribution is the right one, since they produce the same source. So, when we obtain more information (side information), we should be able to make a better decision on the source statistics, which implies that the divergence should become larger.

Definition 2.34 (conditional divergence)

D(PX‖P̂X|Z) = D(PX|Z‖P̂X|Z) ≜ ∑_{z∈Z} ∑_{x∈X} PX,Z(x, z) log ( PX|Z(x|z) / P̂X|Z(x|z) ).

Lemma 2.35 (conditioning never decreases divergence)

D(PX|Z‖P̂X|Z) = D(PX‖P̂X|Z) ≥ D(PX‖P̂X).

Proof:

D(PX‖P̂X|Z) − D(PX‖P̂X)

= ∑_{z∈Z} ∑_{x∈X} PX,Z(x, z) · log ( PX|Z(x|z) / P̂X|Z(x|z) ) − ∑_{x∈X} PX(x) · log ( PX(x) / P̂X(x) )

= ∑_{z∈Z} ∑_{x∈X} PX,Z(x, z) · log ( PX|Z(x|z) / P̂X|Z(x|z) ) − ∑_{x∈X} ( ∑_{z∈Z} PX,Z(x, z) ) · log ( PX(x) / P̂X(x) )

= ∑_{z∈Z} ∑_{x∈X} PX,Z(x, z) · log [ PX|Z(x|z) P̂X(x) / ( P̂X|Z(x|z) PX(x) ) ]

≥ ∑_{z∈Z} ∑_{x∈X} PX,Z(x, z) · ( 1 − P̂X|Z(x|z) PX(x) / ( PX|Z(x|z) P̂X(x) ) ),    since log(u) ≥ 1 − 1/u,

= 1 − ∑_{x∈X} ( PX(x) / P̂X(x) ) ∑_{z∈Z} PZ(z) P̂X|Z(x|z)

= 1 − ∑_{x∈X} ( PX(x) / P̂X(x) ) P̂X(x)

= 1 − ∑_{x∈X} PX(x) = 0,

with equality if, and only if, for all x and z,

PX(x)/P̂X(x) = PX|Z(x|z)/P̂X|Z(x|z). 2


Note that it is not necessarily true that

D(PX|Z‖P̂X̂|Ẑ) ≥ D(PX‖P̂X̂).

In other words, side information is helpful for divergence only when it provides information on the similarity or difference of the two distributions. For the above case, Z only provides information about X, and Ẑ only regards X̂; so the divergence certainly cannot be expected to increase. The next lemma shows that if (Z, Ẑ) is independent of (X, X̂), then the side information (Z, Ẑ) does not help in increasing the divergence of X against X̂.

Lemma 2.36 (independent side information provides no help in increasing divergence)

D(PX|Z‖P̂X̂|Ẑ) = D(PX‖P̂X̂),

provided that (X, X̂) is independent of (Z, Ẑ).

Proof: This can be easily justified from the definition of divergence. 2

Lemma 2.37 (additivity for independence)

D(PX,Z‖P̂X,Z) = D(PX‖P̂X) + D(PZ‖P̂Z),

provided that (X, X̂) is independent of (Z, Ẑ).

Proof: This can be easily proved from the definition. 2

2.6 Convexity and concavity of entropy, mutual information and divergence

We close this chapter by addressing the convexity and concavity of the information measures with respect to distributions, which will be useful when optimizing the information measures over distribution spaces.

Lemma 2.38

1. H(PX) is a concave function of PX, namely

H(λPX + (1 − λ)P̂X) ≥ λH(PX) + (1 − λ)H(P̂X).

2. If I(X; Y) is re-written as I(PX, PY|X), then it is a concave function of PX (for fixed PY|X), and a convex function of PY|X (for fixed PX).

3. D(PX‖P̂X) is convex with respect to both the first argument PX and the second argument P̂X, and is also convex in the pair (PX, P̂X); i.e., if (PX, P̂X) and (QX, Q̂X) are two pairs of probability mass functions, then

D( λPX + (1 − λ)QX ‖ λP̂X + (1 − λ)Q̂X ) ≤ λ · D(PX‖P̂X) + (1 − λ) · D(QX‖Q̂X),    (2.6.1)

for all λ ∈ [0, 1].

Proof:

1.

H(λPX + (1 − λ)P̂X) − [ λH(PX) + (1 − λ)H(P̂X) ]

= λ ∑_{x∈X} PX(x) log [ PX(x) / ( λPX(x) + (1 − λ)P̂X(x) ) ]
  + (1 − λ) ∑_{x∈X} P̂X(x) log [ P̂X(x) / ( λPX(x) + (1 − λ)P̂X(x) ) ]

≥ λ ( ∑_{x∈X} PX(x) ) log [ ∑_{x∈X} PX(x) / ∑_{x∈X} ( λPX(x) + (1 − λ)P̂X(x) ) ]
  + (1 − λ) ( ∑_{x∈X} P̂X(x) ) log [ ∑_{x∈X} P̂X(x) / ∑_{x∈X} ( λPX(x) + (1 − λ)P̂X(x) ) ]

= 0,

with equality if, and only if, PX(x) = P̂X(x) for all x.

2. Let λ̄ ≜ 1 − λ. Then

I(λPX + λ̄P̂X, PY|X) − λI(PX, PY|X) − λ̄I(P̂X, PY|X)

= λ ∑_{y∈Y} ∑_{x∈X} PX(x)PY|X(y|x) log [ ∑_{x'∈X} PX(x')PY|X(y|x') / ∑_{x'∈X} ( λPX(x') + λ̄P̂X(x') )PY|X(y|x') ]
  + λ̄ ∑_{y∈Y} ∑_{x∈X} P̂X(x)PY|X(y|x) log [ ∑_{x'∈X} P̂X(x')PY|X(y|x') / ∑_{x'∈X} ( λPX(x') + λ̄P̂X(x') )PY|X(y|x') ]

≥ 0    (by the log-sum inequality),

with equality if, and only if,

∑_{x∈X} PX(x)PY|X(y|x) = ∑_{x∈X} P̂X(x)PY|X(y|x) for all y ∈ Y.

Now turn to the convexity of I(PX, PY|X) with respect to PY|X. For ease of notation, let PY and P̂Y denote the output distributions induced by PX through PY|X and P̂Y|X, respectively, and let

PYλ(y) ≜ λPY(y) + λ̄P̂Y(y)  and  PYλ|X(y|x) ≜ λPY|X(y|x) + λ̄P̂Y|X(y|x).

Then

λI(PX, PY|X) + λ̄I(PX, P̂Y|X) − I(PX, λPY|X + λ̄P̂Y|X)

= λ ∑_{x∈X} ∑_{y∈Y} PX(x)PY|X(y|x) log ( PY|X(y|x) / PY(y) )
  + λ̄ ∑_{x∈X} ∑_{y∈Y} PX(x)P̂Y|X(y|x) log ( P̂Y|X(y|x) / P̂Y(y) )
  − ∑_{x∈X} ∑_{y∈Y} PX(x)PYλ|X(y|x) log ( PYλ|X(y|x) / PYλ(y) )

= λ ∑_{x∈X} ∑_{y∈Y} PX(x)PY|X(y|x) log [ PY|X(y|x)PYλ(y) / ( PY(y)PYλ|X(y|x) ) ]
  + λ̄ ∑_{x∈X} ∑_{y∈Y} PX(x)P̂Y|X(y|x) log [ P̂Y|X(y|x)PYλ(y) / ( P̂Y(y)PYλ|X(y|x) ) ]

≥ λ ∑_{x∈X} ∑_{y∈Y} PX(x)PY|X(y|x) ( 1 − PY(y)PYλ|X(y|x) / ( PY|X(y|x)PYλ(y) ) )
  + λ̄ ∑_{x∈X} ∑_{y∈Y} PX(x)P̂Y|X(y|x) ( 1 − P̂Y(y)PYλ|X(y|x) / ( P̂Y|X(y|x)PYλ(y) ) )

= 0,

with equality if, and only if,

(∀ x ∈ X, y ∈ Y)  PY(y)/PY|X(y|x) = P̂Y(y)/P̂Y|X(y|x).


3. For convexity with respect to the first argument (the second argument, being common to all three divergences, cancels out), let PXλ(x) ≜ λPX(x) + (1 − λ)QX(x). Then

λD(PX‖·) + (1 − λ)D(QX‖·) − D(PXλ‖·)
= λ ∑_{x∈X} PX(x) log ( PX(x) / PXλ(x) ) + (1 − λ) ∑_{x∈X} QX(x) log ( QX(x) / PXλ(x) )
≥ 0,

with equality if, and only if, PX(x) = QX(x) for all x.

Similarly, for convexity with respect to the second argument, let P̂Xλ(x) ≜ λP̂X(x) + (1 − λ)Q̂X(x). Then

λD(PX‖P̂X) + (1 − λ)D(PX‖Q̂X) − D(PX‖P̂Xλ)
= λ ∑_{x∈X} PX(x) log ( P̂Xλ(x) / P̂X(x) ) + (1 − λ) ∑_{x∈X} PX(x) log ( P̂Xλ(x) / Q̂X(x) )
≥ λ ∑_{x∈X} PX(x) ( 1 − P̂X(x)/P̂Xλ(x) ) + (1 − λ) ∑_{x∈X} PX(x) ( 1 − Q̂X(x)/P̂Xλ(x) )    (since log(u) ≥ 1 − 1/u)
= 1 − ∑_{x∈X} PX(x) ( λP̂X(x) + (1 − λ)Q̂X(x) ) / P̂Xλ(x)
= 1 − ∑_{x∈X} PX(x) = 0,

with equality if, and only if, P̂X(x) = Q̂X(x) for all x with PX(x) > 0.

Finally, by the log-sum inequality, for each x,

( λPX(x) + (1 − λ)QX(x) ) log [ ( λPX(x) + (1 − λ)QX(x) ) / ( λP̂X(x) + (1 − λ)Q̂X(x) ) ]
≤ λPX(x) log ( λPX(x) / ( λP̂X(x) ) ) + (1 − λ)QX(x) log ( (1 − λ)QX(x) / ( (1 − λ)Q̂X(x) ) ).

Summing over x, we obtain (2.6.1).

2


Bibliography

[1] A. N. Kolmogorov and S. V. Fomin, Introductory Real Analysis. New York: Dover Publications, Inc., 1970.

[2] H. L. Royden, Real Analysis, 3rd edition. New York: Macmillan Publishing Company, 1988.


Chapter 3

Lossless Data Compression

3.1 Principles of data compression

As mentioned in Chapter 1, data compression describes methods of representing a source by a code whose average codeword length is acceptably small. Since a source is modelled as a random variable, the average codeword length of a codebook is calculated based on the probability distribution of that random variable. For example, a ternary source X exhibits three possible outcomes with

PX(x = outcomeA) = 0.5;

PX(x = outcomeB) = 0.25;

PX(x = outcomeC) = 0.25.

Suppose that a binary code book is designed for this source, in which outcomeA, outcomeB and outcomeC are respectively symbolized as 0, 10, and 11. Then the average codeword length is

length(0) × PX(outcomeA) + length(10) × PX(outcomeB) + length(11) × PX(outcomeC)
= 1 × 0.5 + 2 × 0.25 + 2 × 0.25
= 1.5 bits.
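The sketch below simply reproduces this arithmetic and compares the result with the source entropy in bits; for this particular source and code they coincide, so the code is optimal.

```python
import math

probs   = {"A": 0.5, "B": 0.25, "C": 0.25}      # outcomeA, outcomeB, outcomeC
lengths = {"A": 1,   "B": 2,    "C": 2}         # codewords 0, 10, 11

avg_len = sum(lengths[s] * probs[s] for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
print(avg_len, entropy)    # 1.5 bits and 1.5 bits
```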

There are usually no constraints on the basic structure of a code. In the case where the codeword length for each source outcome can be different, the code is called a variable-length code. When the codeword lengths of all source outcomes are equal, the code is referred to as a fixed-length code. It is obvious that the minimum average codeword length among all variable-length codes is no greater than that among all fixed-length codes, since the latter is (usually treated as) a subclass of the former. We will see in this chapter that these two minimum average codeword lengths (for variable-length and fixed-length codes) coincide for sources with good probabilistic characteristics, such as stationary ergodicity. But in general, they are different (cf. Volume II of the lecture notes).

For fixed-length codes, the sequence of adjacent codewords for subsequent information is concatenated together, and some punctuation mechanism (such as marking the beginning of each codeword, or delineating internal sub-blocks by synchronization between encoder and decoder) is normally considered an implicit part of the codewords. Due to constraints on space or processing capability, the sequence of source symbols may be too long for the encoder to deal with all at once; therefore, segmentation before encoding is often necessary for feasibility. For example, suppose that we need to encode the grades of a class with 100 students. There are three grade levels: A, B and C. Observing that there are 3^100 possible grade combinations for 100 students, a straightforward code design requires ⌈log2(3^100)⌉ = 159 bits to encode these combinations. Now suppose that the encoder facility can only process 16 bits at a time. Then the above code design becomes infeasible, and segmentation is unavoidable. Under such a constraint, we may encode the grades of 10 students at a time, which requires ⌈log2(3^10)⌉ = 16 bits. As a consequence, for a class of 100 students, the code requires 160 bits in total.
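The two bit counts in this example are reproduced by the short computation below.

```python
import math

whole_class = math.ceil(100 * math.log2(3))   # ceil(log2(3^100)) = 159 bits for the whole class
per_group   = math.ceil(10 * math.log2(3))    # ceil(log2(3^10))  = 16 bits per group of 10 students
print(whole_class, 10 * per_group)            # 159 vs 160 bits in total with segmentation
```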

In the above example, the letters in the grade set {A, B, C} and the letters from the code alphabet {0, 1} are often called the sourcewords (or source symbols) and the codewords, respectively. The block diagram of a source coding system is depicted in Figure 3.1.

(Diagram omitted: Source (a pseudo facility) → sourcewords (source symbols) → Source encoder → codewords → Source decoder → sourcewords.)

Figure 3.1: Block diagram of a data compression system.

When adding segmentation mechanisms to fixed-length codes, the codes can be loosely divided into two groups. The first one is block codes, in which the encoding of the next segment is independent of the previous segments. If the encoding of the next segment, somehow, retains and uses some knowledge of earlier segments, the code is called a fixed-length1 tree code.

In this chapter, we will first introduce data compression for block coding schemes in Section 3.2. Data compression for variable-length codes will be discussed separately in Section 3.3.

1 If the segments of a tree code (at different positions) have different lengths, a variable-length tree code results. Here, we focus on tree codes with a fixed length in each segment.

3.2 Block codes for asymptotic lossless data compression

3.2.1 Block codes for discrete memoryless sources

Definition 3.1 (discrete memoryless source) A discrete memoryless source (DMS) consists of a sequence of independent and identically distributed (i.i.d.) random variables X1, X2, X3, . . .. In particular, if PX(·) is the common distribution of the Xi's, then

PXn(x1, x2, . . . , xn) = ∏_{i=1}^{n} PX(xi).

Definition 3.2 An (n, M) block code for asymptotic lossless data compression of blocklength n and size M is a set {c1, c2, . . . , cM} consisting of M codewords;2 each codeword represents a group of source symbols of length n.

The encoding of a block code can be symbolically represented as

(x1, x2, . . . , xn) → cm ∈ {c1, c2, . . . , cM}.

This procedure will be repeated for each consecutive block of length n, i.e.,

· · · (x3n, . . . , x31)(x2n, . . . , x21)(x1n, . . . , x11) → · · · |cm3|cm2|cm1,

where "|" reflects the necessity of a "punctuation mechanism" or "synchronization mechanism" for consecutive source block coders.

The next theorem is the basis for Shannon’s source coding theorem.

2 One can binary-index the codewords in {c1, c2, . . . , cM} by r ≜ ⌈log2 M⌉ bits. Since the behavior of block codes is investigated as n and M become large (or more precisely, tend to infinity), it is legitimate to replace ⌈log2 M⌉ by log2 M. With this convention, the data compression rate or code rate is

bits required per source symbol = r/n ≈ (1/n) log2 M.

For computational convenience, nats (natural logarithm) are often used instead of bits; hence, the code rate becomes

nats required per source symbol = (1/n) log M.

This convention will be used throughout the lecture notes.


Theorem 3.3 (asymptotic equipartition property or AEP3) If X1, X2, . . ., Xn, . . . are i.i.d., then

−(1/n) log PXn(X1, . . . , Xn) → H(X) in probability.

Proof: The theorem follows by observing that for an i.i.d. sequence,

−(1/n) log PXn(X1, . . . , Xn) = −(1/n) ∑_{i=1}^{n} log PX(Xi),

and applying the weak law of large numbers. 2

The AEP theorem is actually the "information theoretic" analog of the weak law of large numbers (WLLN), which states that if

− log PX(X1), − log PX(X2), . . .

are i.i.d., then for any δ > 0,

Pr{ | (1/n) ∑_{i=1}^{n} [− log PX(Xi)] − H(X) | < δ } → 1.

As a consequence of the AEP, all the probability mass will ultimately be placed on the weakly δ-typical set, which is defined as

Fn(δ) ≜ { xn ∈ Xn : | −(1/n) ∑_{i=1}^{n} log PX(xi) − H(X) | < δ }.

It can be shown that almost all the source sequences in Fn(δ) are nearly equiprobable or equally surprising (cf. Property 3 of Theorem 3.4); hence, Theorem 3.3 is named the AEP.

Theorem 3.4 (Shannon-McMillan theorem) Given a discrete memoryless source and any δ greater than zero, the weakly δ-typical set Fn(δ) satisfies

1. PXn(Fcn(δ)) < δ for sufficiently large n, where the superscript "c" denotes the complementary set operation.

2. |Fn(δ)| > (1 − δ)e^{n(H(X)−δ)} for sufficiently large n, and |Fn(δ)| < e^{n(H(X)+δ)} for every n, where |Fn(δ)| denotes the number of elements in Fn(δ).

3 This is also called the entropy stability property.


3. If xn ∈ Fn(δ), then

e^{−n(H(X)+δ)} < PXn(xn) < e^{−n(H(X)−δ)}.

Proof: Property 3 is an immediate consequence of the definition of Fn(δ). For Property 1, we observe that by Chebyshev's inequality,4

PXn(Fcn(δ)) = PXn{ xn ∈ Xn : | −(1/n) log PXn(xn) − H(X) | ≥ δ } ≤ σ²/(nδ²) < δ

for n > σ²/δ³, where

σ² ≜ ∑_{x∈X} PX(x) [log PX(x)]² − H(X)²

is a constant independent of n.

To prove Property 2, we have

1 ≥ ∑_{xn∈Fn(δ)} PXn(xn) > ∑_{xn∈Fn(δ)} e^{−n(H(X)+δ)} = |Fn(δ)| e^{−n(H(X)+δ)},

and, using Property 1,

1 − δ ≤ 1 − σ²/(nδ²) ≤ ∑_{xn∈Fn(δ)} PXn(xn) < ∑_{xn∈Fn(δ)} e^{−n(H(X)−δ)} = |Fn(δ)| e^{−n(H(X)−δ)}

for n ≥ σ²/δ³. 2

Before proving the asymptotic lossless data compression theorem for block codes, it is important to point out that for block coders, the source sequence cannot be reproduced completely losslessly. One can only design a block code with arbitrarily small reconstruction error (if the block length is long enough). Therefore, the readers should always keep in mind that for block codes, the data compression is only asymptotically lossless with respect to the block length, unlike variable-length codes, where completely lossless data compression is assumed. This explains why the next theorem only shows that the probability mass of non-reconstructible source sequences can be made arbitrarily small.

4 In the proof, we assume that σ² = Var[− log PX(X)] < ∞. This is true for a finite alphabet:

Var[− log PX(X)] ≤ E[(log PX(X))²] = ∑_{x∈X} PX(x)(log PX(x))² ≤ ∑_{x∈X} 0.5414 = 0.5414 × |X| < ∞.


Source   |−(1/2)∑ log PX(xi) − H(X)|      codeword   reconstructed source sequence
AA       0.364 nats, ∉ F2(0.3)            000        ambiguous
AB       0.220 nats, ∈ F2(0.3)            001        AB
AC       0.017 nats, ∈ F2(0.3)            010        AC
AD       0.330 nats, ∉ F2(0.3)            000        ambiguous
BA       0.220 nats, ∈ F2(0.3)            011        BA
BB       0.076 nats, ∈ F2(0.3)            100        BB
BC       0.127 nats, ∈ F2(0.3)            101        BC
BD       0.473 nats, ∉ F2(0.3)            000        ambiguous
CA       0.017 nats, ∈ F2(0.3)            110        CA
CB       0.127 nats, ∈ F2(0.3)            111        CB
CC       0.330 nats, ∉ F2(0.3)            000        ambiguous
CD       0.676 nats, ∉ F2(0.3)            000        ambiguous
DA       0.330 nats, ∉ F2(0.3)            000        ambiguous
DB       0.473 nats, ∉ F2(0.3)            000        ambiguous
DC       0.676 nats, ∉ F2(0.3)            000        ambiguous
DD       1.023 nats, ∉ F2(0.3)            000        ambiguous

Table 3.1: An example of the δ-typical set with n = 2 and δ = 0.3, where F2(0.3) = {AB, AC, BA, BB, BC, CA, CB}. The codeword set is {001(AB), 010(AC), 011(BA), 100(BB), 101(BC), 110(CA), 111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following each codeword indicates the sourcewords that are encoded to this codeword. The source distribution is PX(A) = 0.4, PX(B) = 0.3, PX(C) = 0.2 and PX(D) = 0.1.

The idea behind the proof is basically to encode the source sequences in the weakly δ-typical set Fn(δ) by their binary indices (starting from one), and to encode all source sequences outside Fn(δ) to a default all-zero codeword, which certainly cannot be reproduced without distortion due to the many-to-one mapping. The resultant code rate is (1/n)⌈log(|Fn(δ)| + 1)⌉ nats per source symbol. As revealed in the Shannon-McMillan theorem, almost all the probability mass will be on Fn(δ) as n becomes sufficiently large, and hence, the probability of non-reconstructible source sequences can be made arbitrarily small. A simple example of the above coding scheme is illustrated in Table 3.1.
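The membership column of Table 3.1 can be regenerated directly from the definition of Fn(δ); the sketch below does so for n = 2 and δ = 0.3 with the source distribution of the table.

```python
import math
from itertools import product

P = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}       # source distribution of Table 3.1
H = -sum(p * math.log(p) for p in P.values())      # entropy in nats (about 1.28)

n, delta = 2, 0.3
typical = []
for pair in product(P, repeat=n):
    dist = abs(-sum(math.log(P[s]) for s in pair) / n - H)
    if dist < delta:
        typical.append("".join(pair))
print(sorted(typical))   # ['AB', 'AC', 'BA', 'BB', 'BC', 'CA', 'CB'], matching F_2(0.3)
```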


Theorem 3.5 (Shannon's source coding theorem) Fix a discrete memoryless source

X = {Xn = (X1, X2, . . . , Xn)}_{n=1}^{∞}

with marginal entropy H(Xi) = H(X), and ε > 0 arbitrarily small. There exist δ with 0 < δ < ε and a sequence of block codes { C∼n = (n, Mn) }_{n=1}^{∞} with

(1/n) log Mn < H(X) + δ    (3.2.1)

such that

Pe( C∼n) < ε for all sufficiently large n,

where Pe( C∼n) denotes the probability of decoding error for the block code C∼n.

Proof: Fix δ satisfying 0 < δ < ε. Binary-index the source sequences in Fn(δ/2) starting from one.

For n ≤ 2 log(2)/δ, pick any C∼n = (n, Mn) block code satisfying (3.2.1). For n > 2 log(2)/δ, choose the C∼n = (n, Mn) block encoder as

xn → binary index of xn, if xn ∈ Fn(δ/2);
xn → all-zero codeword, if xn ∉ Fn(δ/2).

Then by the Shannon-McMillan theorem, we obtain

Mn = |Fn(δ/2)| + 1 < e^{n(H(X)+δ/2)} + 1 < 2e^{n(H(X)+δ/2)} < e^{n(H(X)+δ)}

for n > 2 log(2)/δ. Hence, a sequence of (n, Mn) block codes satisfying (3.2.1) is established. It remains to show that the error probability for this sequence of (n, Mn) block codes can be made smaller than ε for all sufficiently large n.

By the Shannon-McMillan theorem,

PXn(Fcn(δ/2)) < δ/2 for all sufficiently large n.

Consequently, for those n satisfying the above inequality and bigger than 2 log(2)/δ,

Pe( C∼n) ≤ PXn(Fcn(δ/2)) < δ ≤ ε.

(For the last step, the readers can refer to Table 3.1 to confirm that the probability of error is contributed only by the ambiguous sequences outside the typical set.) 2

Usually, lim sup_{n→∞} (1/n) log Mn is called the ultimate compression rate of a block code sequence, and is conventionally denoted by R. We will use this naming convention in the converse part of Shannon's source coding theorem. The above theorem concludes that good codes exist for R ≤ H(X) + δ with arbitrarily small δ and sufficiently large blocklength n. The question is: "Can we extend the range of good compression rates below H(X)?" The answer is negative. The next theorem shows that for R < H(X), all codes are bad; hence, we can conclude that H(X) is the minimum compression rate for the existence of asymptotically lossless data compression block codes for discrete memoryless sources.

Theorem 3.6 (strong converse theorem) Fix a discrete memoryless source X = {Xn = (X1, X2, . . . , Xn)}_{n=1}^{∞} with marginal entropy H(Xi) = H(X), and ε > 0 arbitrarily small. For any block code sequence of rate R < H(X) and sufficiently large blocklength n, the probability of block decoding failure Pe satisfies

Pe > 1 − ε.

Proof: Fix any sequence of block codes { C∼n }_{n=1}^{∞} with

R = lim sup_{n→∞} (1/n) log | C∼n| < H(X).

Let Sn be the set of source sequences that can be correctly decoded through the C∼n-coding system. (A quick example is depicted in Figure 3.2.) Then |Sn| = | C∼n|. By choosing δ small enough with ε/2 > δ > 0, and by the definition of the limsup operation, we have

(∃ N0)(∀ n > N0)  (1/n) log |Sn| = (1/n) log | C∼n| < H(X) − 2δ,

which implies

|Sn| < e^{n(H(X)−2δ)}.

Furthermore, from Property 1 of the Shannon-McMillan theorem, we obtain

(∃ N1)(∀ n > N1)  PXn(Fcn(δ)) < δ.

Consequently, for n > N ≜ max{N0, N1, log(2/ε)/δ}, the probability of correct block decoding satisfies

1 − Pe( C∼n) = ∑_{xn∈Sn} PXn(xn)
            = ∑_{xn∈Sn∩Fcn(δ)} PXn(xn) + ∑_{xn∈Sn∩Fn(δ)} PXn(xn)
            ≤ PXn(Fcn(δ)) + |Sn ∩ Fn(δ)| · max_{xn∈Fn(δ)} PXn(xn)
            < δ + |Sn| · max_{xn∈Fn(δ)} PXn(xn)
            < ε/2 + e^{n(H(X)−2δ)} · e^{−n(H(X)−δ)}
            = ε/2 + e^{−nδ}
            < ε,


(Diagram omitted: the set of source symbols, a subset Sn of which is mapped one-to-one onto the codewords, while the remaining source symbols share default codewords.)

Figure 3.2: Possible codebook C∼n and its corresponding Sn. The solid box indicates the decoding mapping from C∼n back to Sn.

which is equivalent to Pe( C∼n) > 1− ε for n > N . 2

The results of the above two theorems are symbolically illustrated in Figure 3.3. It is clear from the figure that the rate of the optimal block code with arbitrarily small decoding error probability must be greater than the entropy. Conversely, the probability of decoding error for any block code of rate smaller than the entropy ultimately approaches 1, which is a strong converse statement; therefore, Theorem 3.6 is named the strong converse theorem.

(Diagram omitted: on the rate axis R, with threshold H(X), Pe → 1 as n → ∞ for all block codes of rate below H(X), and Pe → 0 for the best data compression block code of rate above H(X).)

Figure 3.3: Behavior of the probability of block decoding error as block length n goes to infinity for a discrete memoryless source.

For a more general source, such as a source with memory, the Shannon-McMillan theorem may not apply in its original form, and thereby the validity of Shannon's source coding theorem seems restricted to i.i.d. sources only. However, by exploring the concept behind these theorems, we find that the key to the validity of Shannon's source coding theorem is actually the existence of a set An = {x^n_1, x^n_2, . . . , x^n_M} with M ≈ e^{nH(X)} and PXn(Acn) → 0, namely, the existence of a typical-like set An whose size is relatively small and whose probability mass is large. Thus, if we can find such a typical-like set for a general source, the Shannon source coding theorem for block codes can be generalized to this source. A good example of such an extension is the data compression of stationary-ergodic sources, which is the main subject of the next subsection.


In addition, for a general source, the statement of the converse theorem may not retain the form of Theorem 3.6. Contrary to the i.i.d. (and also the stationary-ergodic) case, where the decoding error approaches one for all codes of rate below the entropy, in general cases (e.g., non-stationary non-ergodic sources) the probability of block decoding failure can only be shown to be bounded away from zero. Since "bounded away from zero" is a weaker statement than the one in the strong converse theorem, the corresponding result is named the weak converse theorem. Details will be provided in Volume II of the lecture notes.

3.2.2 Block codes for stationary-ergodic sources

In practice, the source usually has memory, and its joint distribution is not a product of marginal distributions. In this subsection, we will discuss the asymptotic lossless data compression theorem for a simple model of sources with memory: stationary-ergodic sources.

Before proceeding to generalize the source coding theorem, we first need to generalize the "entropy" measure to a sequence of random variables X (which certainly should be backward compatible with the discrete memoryless case). A straightforward generalization is to adopt the concept of average entropy over a source sequence, which is usually named the entropy rate.

Definition 3.7 (entropy rate) The entropy rate of a source X is defined by

lim_{n→∞} (1/n) H(Xn),

provided the limit exists.

Next we will show that the entropy rate exists for stationary sources (here, we do not need ergodicity for the existence of the entropy rate).

Lemma 3.8 For a stationary source, the conditional entropy

H(Xn|Xn−1, . . . , X1)

is non-increasing in n and bounded from below by zero. Hence, by Lemma A.25, the limit

lim_{n→∞} H(Xn|Xn−1, . . . , X1)

exists.


Proof:

H(Xn|Xn−1, . . . , X1) ≤ H(Xn|Xn−1, . . . , X2) (3.2.2)

= H(Xn−1|Xn−2, . . . , X1), (3.2.3)

where (3.2.2) follows since conditioning never increases entropy, and (3.2.3) holds because of the stationarity assumption. 2

Lemma 3.9 (Cesàro-mean theorem) If an → a and bn = (1/n) ∑_{i=1}^{n} ai, then bn → a as n → ∞.

Proof: an → a implies that for any ε > 0, there exists N such that for all n > N, |an − a| < ε. Then

|bn − a| = | (1/n) ∑_{i=1}^{n} (ai − a) |
         ≤ (1/n) ∑_{i=1}^{n} |ai − a|
         = (1/n) ∑_{i=1}^{N} |ai − a| + (1/n) ∑_{i=N+1}^{n} |ai − a|
         ≤ (1/n) ∑_{i=1}^{N} |ai − a| + ((n − N)/n) ε.

Hence, lim sup_{n→∞} |bn − a| ≤ ε. Since ε can be made arbitrarily small, the lemma holds. 2

Theorem 3.10 For a stationary source, the entropy rate always exists and is equal to

lim_{n→∞} (1/n) H(Xn) = lim_{n→∞} H(Xn|Xn−1, . . . , X1).

Proof: The theorem can be proved by writing

(1/n) H(Xn) = (1/n) ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1)    (chain rule for entropy)

and applying the Cesàro-mean theorem. 2

Exercise 3.11 (1/n)H(Xn) is non-increasing in n for a stationary source.


It is obvious that when X = {Xn = (X1, . . . , Xn)}_{n=1}^{∞} is discrete memoryless, H(Xn) = n × H(X) for every n. Hence,

lim_{n→∞} (1/n) H(Xn) = H(X).

For a first-order stationary Markov source,

lim_{n→∞} (1/n) H(Xn) = lim_{n→∞} H(Xn|Xn−1, . . . , X1) = H(X2|X1),

where

H(X2|X1) ≜ − ∑_{x1∈X} ∑_{x2∈X} π(x1) PX2|X1(x2|x1) · log PX2|X1(x2|x1),

and π(·) is the stationary distribution of the Markov source. Furthermore, if the Markov source is binary with PX2|X1(0|1) = α and PX2|X1(1|0) = β, then

lim_{n→∞} (1/n) H(Xn) = ( β/(α + β) ) Hb(α) + ( α/(α + β) ) Hb(β),

where Hb(α) ≜ −α log α − (1 − α) log(1 − α) is the binary entropy function.
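For a concrete number, the sketch below evaluates the entropy rate of such a binary first-order Markov source with assumed transition probabilities α = 0.1 and β = 0.2, both from the closed-form expression above and directly from the definition H(X2|X1) with the stationary distribution π.

```python
import math

alpha, beta = 0.1, 0.2                # P(0|1) and P(1|0); assumed example values

def Hb(p):                            # binary entropy function in nats
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Closed-form expression for the entropy rate.
rate_formula = beta / (alpha + beta) * Hb(alpha) + alpha / (alpha + beta) * Hb(beta)

# Direct computation: stationary distribution pi, then H(X2 | X1).
pi = {0: alpha / (alpha + beta), 1: beta / (alpha + beta)}
P = {0: {0: 1 - beta, 1: beta},       # transition row P_{X2|X1}(.|0)
     1: {0: alpha, 1: 1 - alpha}}     # transition row P_{X2|X1}(.|1)
rate_direct = -sum(pi[a] * P[a][b] * math.log(P[a][b]) for a in (0, 1) for b in (0, 1))

print(rate_formula, rate_direct)      # the two values agree
```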

Theorem 3.12 (generalized AEP or Shannon-McMillan-Breiman theorem) If X1, X2, . . ., Xn, . . . are stationary-ergodic, then

−(1/n) log PXn(X1, . . . , Xn) → lim_{n→∞} (1/n) H(Xn)    almost surely.

Since the AEP theorem (law of large numbers) is valid for stationary-ergodic sources, all consequences of the AEP follow, including Shannon's source coding theorem.

Theorem 3.13 (Shannon's source coding theorem for stationary-ergodic sources) Fix a stationary-ergodic source

X = {Xn = (X1, X2, . . . , Xn)}_{n=1}^{∞}

with entropy rate

H ≜ lim_{n→∞} (1/n) H(Xn)

and ε > 0 arbitrarily small. There exist δ with 0 < δ < ε and a sequence of block codes { C∼n = (n, Mn) }_{n=1}^{∞} with

(1/n) log Mn < H + δ,

such that

Pe( C∼n) < ε for all sufficiently large n,

where Pe( C∼n) denotes the probability of decoding error for the block code C∼n.


Theorem 3.14 (strong converse theorem) Fix a stationary-ergodic source

X = {Xn = (X1, X2, . . . , Xn)}_{n=1}^{∞}

with entropy rate H and ε > 0 arbitrarily small. For any block code of rate R < H and sufficiently large blocklength n, the probability of block decoding failure Pe satisfies

Pe > 1 − ε.

In general, it is hard to check whether a process is ergodic or not. A specific case for which ergodicity can be verified is the case of Markov sources. Specifically, if a Markov source is irreducible, it is ergodic. Note that irreducibility can be verified in terms of the transition probability matrix. Some useful observations regarding stationarity and ergodicity of Markov sources are summarized below.

Observation 3.15

1. An irreducible finite-state Markov source is ergodic.

2. The generalized AEP theorem holds for irreducible stationary Markov sources. For example, if the Markov source is of first order, then

−(1/n) log PXn(Xn) → lim_{n→∞} (1/n) H(Xn) = H(X2|X1)    almost surely.

In more complicated situations, such as when the sources are non-stationary (with time-varying statistics), the quantity lim_{n→∞} (1/n) H(Xn) is no longer valid (it may not even exist). This renders the need to establish new entropy measures which appropriately characterize the operational limits of arbitrary stochastic systems. This is achieved in [2] and [3], where Han and Verdu introduce the notions of inf/sup-entropy rates and illustrate the key role these entropy measures play in proving a general asymptotic lossless (block) source coding theorem and a general channel coding theorem. More specifically, they demonstrate that for an arbitrary finite-alphabet source X (not necessarily stationary and ergodic), the expression for the minimum achievable (block) source coding rate is given by the sup-entropy rate H̄(X), defined by

H̄(X) ≜ inf_{β∈ℜ} { β : lim sup_{n→∞} Pr[ −(1/n) log PXn(Xn) > β ] = 0 }.

More details will be provided in Volume II of the lecture notes.


3.2.3 Redundancy for lossless data compression

Shannon's source coding theorem tells us that the minimum data compression rate achieving arbitrarily small error probability for stationary-ergodic sources is the entropy rate. This implies that there is redundancy in the original source (otherwise, it could not be compressed). As justified in Section 1.4, the output of an optimal completely (resp. asymptotically) lossless data compressor should be (resp. asymptotically) i.i.d. with uniform marginal distribution; if it were not, redundancy would remain in the output, and the compressor could not be claimed optimal. This motivates the need to define the redundancy of a source. The redundancy can be classified into two parts:

• intra-sourceword redundancy: the redundancy due to the non-uniform marginal, and

• inter-sourceword redundancy: the redundancy due to the source memory.

We then quantitatively define the two kinds of redundancy in the following.

Definition 3.16 (redundancy)

1. The redundancy of a stationary source due to non-uniform marginals is
$$\rho_D \triangleq \log|\mathcal{X}| - H(X_1).$$

2. The redundancy of a stationary source due to source memory is
$$\rho_M \triangleq H(X_1) - \lim_{n\to\infty}\frac{1}{n}H(X^n).$$

3. The total redundancy of a stationary source is
$$\rho_T \triangleq \rho_D + \rho_M = \log|\mathcal{X}| - \lim_{n\to\infty}\frac{1}{n}H(X^n).$$

We can summarize the redundancy of some typical stationary sources in the following table.

Source                                           ρ_D                  ρ_M                     ρ_T
i.i.d. uniform                                   0                    0                       0
i.i.d. non-uniform                               log|X| − H(X_1)      0                       ρ_D
stationary first-order symmetric Markov(5)       0                    H(X_1) − H(X_2|X_1)     ρ_M
stationary first-order non-symmetric Markov      log|X| − H(X_1)      H(X_1) − H(X_2|X_1)     ρ_D + ρ_M

(5) A first-order Markov process is symmetric if, for any x_1 and x_1', {a : a = P_{X_2|X_1}(y|x_1) for some y} = {a : a = P_{X_2|X_1}(y|x_1') for some y}.
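To make Definition 3.16 concrete, the following sketch (ours, with hypothetical transition probabilities) computes ρ_D, ρ_M and ρ_T in bits for a stationary binary first-order Markov source, reusing the binary entropy helper from the earlier sketch.

```python
import math

def Hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

# Hypothetical binary Markov source: P(1|0) = beta, P(0|1) = alpha
alpha, beta = 0.4, 0.2
pi0, pi1 = alpha/(alpha+beta), beta/(alpha+beta)   # stationary marginal
H1 = Hb(pi1)                                       # H(X1), marginal entropy
H_rate = pi0*Hb(beta) + pi1*Hb(alpha)              # H(X2|X1), entropy rate

rho_D = math.log2(2) - H1        # redundancy due to non-uniform marginal (log|X|, |X| = 2)
rho_M = H1 - H_rate              # redundancy due to source memory
rho_T = rho_D + rho_M            # total redundancy = log|X| - entropy rate
print(rho_D, rho_M, rho_T)
```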


3.3 Variable-length codes for lossless data compression

3.3.1 Non-singular codes and uniquely decodable codes

In this subsection, we limit our discussion to variable-length (completely) lossless data compression codes. A variable-length lossless data compression code is a code in which the source symbols can be completely reconstructed without distortion. In order to achieve this goal, the source symbols have to be encoded unambiguously in the sense that any two different source symbols (with positive probabilities) are represented by different codewords. Codes satisfying this property are called non-singular codes.

In practice, the encoder often needs to encode a sequence of source symbols, which results in a concatenated sequence of codewords. If any concatenation of codewords can also be unambiguously reconstructed without punctuation, then the code is said to be uniquely decodable. Note that a non-singular code is not necessarily uniquely decodable. For example, consider a code for source symbols {A, B, C, D, E, F} given by

code of A = 0,

code of B = 1,

code of C = 00,

code of D = 01,

code of E = 10,

code of F = 11.

The above code is clearly non-singular; it is, however, not uniquely decodable because the codeword sequence 01 can be reconstructed as AB or D.

The following theorem provides a necessary condition that every uniquely decodable code must satisfy.

Theorem 3.17 (Kraft inequality) A uniquely decodable code C with binary code alphabet {0, 1} and with M codewords having lengths ℓ_0, ℓ_1, ℓ_2, ..., ℓ_{M−1} must satisfy the following inequality:
$$\sum_{m=0}^{M-1} 2^{-\ell_m} \le 1.$$



Proof: Suppose that we use the codebook C to encode N source symbols (arriving in sequence); this yields a concatenated codeword sequence
$$c_1 c_2 c_3 \cdots c_N.$$
Let the lengths of the codewords be respectively denoted by
$$\ell(c_1), \ell(c_2), \ldots, \ell(c_N).$$
Consider the quantity
$$\sum_{c_1\in C}\sum_{c_2\in C}\cdots\sum_{c_N\in C} 2^{-[\ell(c_1)+\ell(c_2)+\cdots+\ell(c_N)]}.$$

It is obvious that this quantity equals
$$\left(\sum_{c\in C}2^{-\ell(c)}\right)^N = \left(\sum_{m=0}^{M-1}2^{-\ell_m}\right)^N.$$

(Note that |C| = M.) On the other hand, every code sequence with length
$$i = \ell(c_1)+\ell(c_2)+\cdots+\ell(c_N)$$
contributes the same amount, 2^{−i}, to the sum. Let A_i denote the number of code sequences that have length i. Then the above quantity can be rewritten as
$$\left(\sum_{m=0}^{M-1}2^{-\ell_m}\right)^N = \sum_{i=1}^{LN}A_i\,2^{-i}, \quad\text{where } L \triangleq \max_{c\in C}\ell(c).$$
(Here, we implicitly and reasonably assume that the smallest codeword length is 1.)

Since C is by assumption a uniquely decodable code, every code sequence must be unambiguously decodable. Observe that there are at most 2^i distinct binary sequences of length i; therefore A_i ≤ 2^i, and
$$\left(\sum_{m=0}^{M-1}2^{-\ell_m}\right)^N = \sum_{i=1}^{LN}A_i\,2^{-i} \le \sum_{i=1}^{LN}2^i\,2^{-i} = LN,$$
which implies that
$$\sum_{m=0}^{M-1}2^{-\ell_m} \le (LN)^{1/N}.$$


The proof is completed by noting that the above inequality holds for every N, and the upper bound (LN)^{1/N} goes to 1 as N goes to infinity. □
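As a sanity check (our sketch, not a construction from the notes), the code below evaluates the Kraft sum for the six-codeword code for {A, ..., F} given earlier; since the sum exceeds 1, Theorem 3.17 confirms that the code cannot be uniquely decodable, which is also the point of Exercise 3.19 later in this subsection.

```python
def kraft_sum(lengths, D=2):
    """Return sum_m D**(-l_m) for codeword lengths l_m."""
    return sum(D ** (-l) for l in lengths)

# Lengths of the non-singular code {0, 1, 00, 01, 10, 11} for A,...,F:
lengths = [1, 1, 2, 2, 2, 2]
print(kraft_sum(lengths))      # 2.0 > 1, so the code is not uniquely decodable
```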

The Kraft inequality is a very useful tool, especially for showing that the fundamental lower bound on the average codeword length is the entropy.

Theorem 3.18 The average binary codeword length of every uniquely decodable code of a source is lower-bounded by the source entropy (measured in bits).

Proof: Let the source be modelled as a random variable X, and denote a source symbol by x. The codeword for source symbol x and its length are respectively denoted by c_x and ℓ(c_x). Hence,
$$\sum_{x\in\mathcal{X}}P_X(x)\ell(c_x) - H(X) = \sum_{x\in\mathcal{X}}P_X(x)\ell(c_x) - \sum_{x\in\mathcal{X}}\bigl(-P_X(x)\log_2 P_X(x)\bigr)$$
$$= \frac{1}{\log 2}\sum_{x\in\mathcal{X}}P_X(x)\log\frac{P_X(x)}{2^{-\ell(c_x)}}$$
$$\ge \frac{1}{\log 2}\left[\sum_{x\in\mathcal{X}}P_X(x)\right]\log\frac{\sum_{x\in\mathcal{X}}P_X(x)}{\sum_{x\in\mathcal{X}}2^{-\ell(c_x)}}\quad\text{(log-sum inequality)}$$
$$= -\frac{1}{\log 2}\log\left[\sum_{x\in\mathcal{X}}2^{-\ell(c_x)}\right]$$
$$\ge 0,$$
where the last inequality follows from the Kraft inequality. □

From the above theorem, we know that the average codeword length is lower-bounded by the source entropy. Indeed, a lossless data compression code whose average codeword length achieves the entropy is optimal (since if its average codeword length were below the entropy, the Kraft inequality would be violated and the code would no longer be uniquely decodable). We summarize the main results of this subsection as follows:

1. Unique decodability ⇒ the Kraft inequality.

2. Unique decodability ⇒ average codeword length of variable-length codes ≥ H(X).

Exercise 3.19

1. Find a non-singular and also non-uniquely decodable code that violates the Kraft inequality. (Hint: the answer is already provided in this subsection.)


2. Find a non-singular and also non-uniquely decodable code that beats the entropy lower bound. (Hint: same as the previous one.)

[Figure 3.4: Classification of variable-length codes: prefix codes form a subset of uniquely decodable codes, which in turn form a subset of non-singular codes.]

Note that all the above discussions are based on the assumption that the source alphabet is finite. For simplicity, this assumption will be used throughout the lecture notes except when stated otherwise.

3.3.2 Prefix or instantaneous codes for lossless data compression

A prefix code is a variable-length code which is self-punctuated in the sense that there is no need to append extra symbols to differentiate adjacent codewords. A more precise definition follows:

Definition 3.20 (prefix code) A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.

A prefix code is also named an instantaneous code because a codeword sequence can be decoded instantaneously, without reference to future codewords in the same sequence. Note that a uniquely decodable code is not necessarily prefix-free and may not be decodable instantaneously. The relationship between the different codes encountered thus far is depicted in Figure 3.4.

A prefix code can be represented graphically as an initial segment of a tree. A graphical portrayal of an example is shown in Figure 3.5.

Observation 3.21 (prefix code to Kraft inequality) There exists a binary prefix code with M codewords of lengths ℓ_m for m = 0, ..., M−1 if, and only if, the Kraft inequality holds.


[Figure 3.5: Tree structure of a prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111.]

Proof:

1. [The forward part] Prefix codes satisfy the Kraft inequality.

The codewords of a prefix code can always be placed on a binary tree. Let
$$\ell_{\max} \triangleq \max_{0\le m\le M-1}\ell_m.$$
A full binary tree has 2^{ℓ_max} nodes on level ℓ_max. When any node is chosen as a codeword, all of its descendants are excluded from being codewords; hence each codeword of length ℓ_m obstructs exactly 2^{ℓ_max−ℓ_m} nodes on level ℓ_max. Note that no two codewords obstruct the same nodes on level ℓ_max. Hence the total number of obstructed nodes on level ℓ_max cannot exceed 2^{ℓ_max}, i.e.,

$$\sum_{m=0}^{M-1}2^{\ell_{\max}-\ell_m} \le 2^{\ell_{\max}},$$


which immediately implies the Kraft inequality:
$$\sum_{m=0}^{M-1}2^{-\ell_m} \le 1.$$

(This part can also be proved by noting that a prefix code is a uniquely decodable code. The purpose of giving this proof is to illustrate the tree structure of a prefix code.)

2. [The converse part] Kraft inequality implies the existence of a prefix code.

Suppose that ℓ_0, ℓ_1, ..., ℓ_{M−1} satisfy the Kraft inequality. We will show that there exists a binary tree with M selected nodes, where the ith node resides on level ℓ_i.

Let n_i be the number of nodes (among the M nodes) residing on level i (namely, n_i is the number of codewords with length i, or n_i = |{m : ℓ_m = i}|), and let
$$\ell_{\max} \triangleq \max_{0\le m\le M-1}\ell_m.$$
Then from the Kraft inequality, we have
$$n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}}2^{-\ell_{\max}} \le 1.$$

The above inequality can be rewritten in a form more suitable for this proof:
$$n_1 2^{-1} \le 1$$
$$n_1 2^{-1} + n_2 2^{-2} \le 1$$
$$\cdots$$
$$n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}}2^{-\ell_{\max}} \le 1.$$
Hence,
$$n_1 \le 2$$
$$n_2 \le 2^2 - n_1 2^1$$
$$\cdots$$
$$n_{\ell_{\max}} \le 2^{\ell_{\max}} - n_1 2^{\ell_{\max}-1} - \cdots - n_{\ell_{\max}-1}2^1,$$

which can be interpreted in terms of a tree model as follows: the first inequality says that the number of codewords of length 1 is no more than the number of available nodes on the first level, which is 2. The second inequality says that the number of codewords of length 2 is no more than the total number of nodes on the second level, which is 2^2, minus the number of nodes obstructed by the first-level nodes already


occupied by codewords. The succeeding inequalities demonstrate the availability of a sufficient number of nodes at each level after the nodes blocked by shorter-length codewords have been removed. Because this is true at every codeword length up to the maximum codeword length, the assertion of the theorem is proved. □
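The converse argument above is constructive: nodes can be assigned level by level, always taking the leftmost node not obstructed by an already chosen codeword. A minimal sketch of this idea (ours; the function name and canonical assignment are our choices) is given below; given lengths satisfying the Kraft inequality it returns a binary prefix code with those lengths.

```python
def prefix_code_from_lengths(lengths):
    """Given lengths satisfying the Kraft inequality, return a binary prefix
    code (list of codeword strings) with those lengths (canonical assignment)."""
    assert sum(2 ** (-l) for l in lengths) <= 1, "Kraft inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        l = lengths[i]
        value <<= (l - prev_len)           # move down to level l in the tree
        codewords[i] = format(value, f"0{l}b")
        value += 1                          # next unobstructed node on level l
        prev_len = l
    return codewords

print(prefix_code_from_lengths([2, 2, 2, 3, 4, 4]))
# ['00', '01', '10', '110', '1110', '1111'], the code of Figure 3.5
```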

The following theorem gives the relationship between the average codeword length of a prefix code and the entropy.

Theorem 3.22

1. For any prefix code, the average codeword length is no less than entropy.

2. There must exist a prefix code whose average codeword length is no greater than (entropy + 1) bits, namely,
$$\bar\ell \triangleq \sum_{x\in\mathcal{X}}P_X(x)\ell(c_x) \le H(X) + 1, \tag{3.3.1}$$
where c_x is the codeword for source symbol x, and ℓ(c_x) is the length of codeword c_x.

Proof: A prefix code is uniquely decodable, and hence its average codeword length is no less than the entropy (measured in bits).

To prove the second part, we design a prefix code satisfying both (3.3.1) and the Kraft inequality, which immediately implies the existence of the desired code by Observation 3.21. Choose the codeword length for source symbol x as
$$\ell(c_x) = \lfloor -\log_2 P_X(x)\rfloor + 1. \tag{3.3.2}$$
Then
$$2^{-\ell(c_x)} \le P_X(x).$$
Summing both sides over all source symbols, we obtain
$$\sum_{x\in\mathcal{X}}2^{-\ell(c_x)} \le 1,$$
which is exactly the Kraft inequality. On the other hand, (3.3.2) implies
$$\ell(c_x) \le -\log_2 P_X(x) + 1,$$
which in turn implies
$$\sum_{x\in\mathcal{X}}P_X(x)\ell(c_x) \le \sum_{x\in\mathcal{X}}\bigl[-P_X(x)\log_2 P_X(x)\bigr] + \sum_{x\in\mathcal{X}}P_X(x) = H(X) + 1. \qquad\Box$$
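A small sketch of this construction (ours, using a hypothetical source distribution): it computes the lengths ℓ(c_x) = ⌊−log₂ P_X(x)⌋ + 1 from (3.3.2), verifies the Kraft inequality, and checks that the average length lies between H(X) and H(X) + 1.

```python
import math

def code_lengths(probs):
    """Codeword lengths l(x) = floor(-log2 P(x)) + 1 from the proof of Theorem 3.22."""
    return [math.floor(-math.log2(p)) + 1 for p in probs]

probs = [0.5, 0.25, 0.125, 0.125]             # hypothetical source distribution
lengths = code_lengths(probs)
H = -sum(p * math.log2(p) for p in probs)      # source entropy in bits
avg = sum(p * l for p, l in zip(probs, lengths))
print(lengths, sum(2.0 ** (-l) for l in lengths) <= 1, H <= avg <= H + 1)
```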


In the above discussion, it is assumed that encoding is performed on a single source symbol at a time. When encoding based on the concatenation of source symbols is allowed, and the source is memoryless, we can make the average per-source-symbol codeword length of the prefix code arbitrarily close to the entropy. For example, a source with source alphabet

{A, B, C}

and probabilities P_X(A) = 0.8, P_X(B) = P_X(C) = 0.1 has entropy equal to

−0.8 log₂ 0.8 − 0.1 log₂ 0.1 − 0.1 log₂ 0.1 ≈ 0.92 bits.

One of the best prefix codes for single-letter encoding of the above source is c(A) = 0, c(B) = 10 and c(C) = 11, where the function c(·) represents the prefix codeword assigned to the source symbol. Then the resultant average codeword length is

0.8 × 1 + 0.2 × 2 = 1.2 bits ≥ 0.92 bits.

(From this, the reader should observe that the optimal variable-length data compression code for a fixed source block length generally has an average codeword length strictly larger than the per-letter source entropy. It is only when the source block length approaches infinity that the average codeword length can be made arbitrarily close to the per-letter source entropy. See the follow-up example in the next paragraph for more intuition.)

Now if we prefix-encode two consecutive source symbols at a time, the new source alphabet becomes

{AA, AB, AC, BA, BB, BC, CA, CB, CC},

and the resultant probability is calculated by
$$(\forall\, x_1, x_2 \in \{A,B,C\})\quad P_{X^2}(x_1,x_2) = P_X(x_1)P_X(x_2)$$


under the assumption that the source is memoryless. Then one of the best prefix codes for the new source symbol pairs is

c(AA) = 0

c(AB) = 100

c(AC) = 101

c(BA) = 110

c(BB) = 111100

c(BC) = 111101

c(CA) = 1110

c(CB) = 111110

c(CC) = 111111.

The average codeword length per source symbol now becomes
$$\frac{0.64(1\times 1) + 0.08(3\times 3 + 4\times 1) + 0.01(6\times 4)}{2} = 0.96\ \text{bits},$$

which is closer to the per-source-symbol entropy 0.92 bits.
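The arithmetic above can be reproduced mechanically; the short sketch below (ours) recomputes the per-symbol average length of the pair code listed above.

```python
probs = {"A": 0.8, "B": 0.1, "C": 0.1}
pair_code = {"AA": "0", "AB": "100", "AC": "101", "BA": "110",
             "BB": "111100", "BC": "111101", "CA": "1110",
             "CB": "111110", "CC": "111111"}

# Average codeword length per source symbol (two symbols per codeword)
avg = sum(probs[a] * probs[b] * len(pair_code[a + b])
          for a in probs for b in probs) / 2
print(avg)   # 0.96 bits/symbol, versus 1.2 for single-letter encoding
```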

From this example, we conclude in the next corollary that a prefix code can be found with per-source-symbol average codeword length arbitrarily close to the per-source-symbol entropy, provided the source is memoryless.

Corollary 3.23 Fix ε > 0 and a memoryless source X with marginal distribution P_X. A prefix code can always be found with
$$\bar\ell \le H(X) + \varepsilon,$$
where ℓ̄ is the average per-source-symbol codeword length, and H(X) is the per-source-symbol entropy.

Proof: Choose n large enough such that 1/n < ε. Find a prefix code for n concatenated source symbols X^n satisfying
$$\sum_{x^n\in\mathcal{X}^n}P_{X^n}(x^n)\ell_{x^n} \le H(X^n) + 1,$$
where ℓ_{x^n} denotes the codeword length for the concatenated source symbol x^n. Dividing both sides by n, and observing that H(X^n) = nH(X) for a memoryless source, we obtain
$$\bar\ell \le H(X) + \frac{1}{n} \le H(X) + \varepsilon. \qquad\Box$$


Corollary 3.23 can actually be extended to stationary sources.

We end this section by remarking on the relation between variable-length uniquely decodable codes and prefix codes.

Corollary 3.24 A uniquely decodable code can always be replaced by a prefix code with the same average codeword length.

3.3.3 Examples of variable-length lossless data compression codes

A) Huffman code: a variable-length optimal code

In this subsection, we will introduce a simple optimal variable-length code, named the Huffman code. Here "optimality" means that it yields the minimum average codeword length among all uniquely decodable codes for the same source. We now begin our examination of Huffman coding with a simple observation.

Observation 3.25 Given a source with source alphabet {1, ..., K} and probabilities {p_1, ..., p_K}, let ℓ_i be the binary codeword length of symbol i. Then there exists an optimal uniquely decodable variable-length code satisfying:

1. pi > pj implies `i ≤ `j.

2. The two longest codewords have the same length.

3. The two longest codewords differ only in the last bit and correspond to thetwo least-frequent symbols.

Proof: First, we note that any optimal code that is uniquely decodable must satisfy the Kraft inequality. In addition, for any set of codeword lengths that satisfies the Kraft inequality, there exists a prefix code with that same set of codeword lengths. Therefore, it suffices to show that there exists an optimal prefix code satisfying the above three properties.

1. Suppose there is an optimal prefix code violating this property. Then we can interchange the codeword for symbol i with that for symbol j and obtain a better code, a contradiction.

2. Without loss of generality, let the probabilities of the source symbols satisfy

p1 ≤ p2 ≤ p3 ≤ · · · ≤ pK .


Therefore, by the first property, there exists an optimal prefix code with codeword lengths
$$\ell_1 \ge \ell_2 \ge \ell_3 \ge \cdots \ge \ell_K.$$
Suppose the codeword lengths of the two least-frequent source symbols satisfy ℓ_1 > ℓ_2. Then we can discard the last ℓ_1 − ℓ_2 code bits from the first codeword and obtain a better code. (From the definition of prefix codes, it is obvious that the new code is still a prefix code.)

3. Since all the codewords of a prefix code reside on the leaves (when the code is viewed as a binary tree), we can interchange the siblings of two branches without changing the average codeword length. Property 2 implies that the two least-frequent codewords have the same codeword length. Hence, by repeatedly interchanging the siblings of the tree, we can obtain a prefix code that meets the requirement. □

The above observation proves the existence of an optimal prefix code that satisfies the stated properties. As it turns out, the Huffman code is one such code. In what follows, we introduce the Huffman code construction algorithm.

Huffman encoding algorithm:

1. Combine the two least probable source symbols into a new single symbol, whose probability is equal to the sum of the probabilities of the original two. Thus we have to encode a new source alphabet with one less symbol. Repeat this step until we get down to the problem of encoding just two symbols in a source alphabet, which can be encoded merely using 0 and 1.

2. Go backward by splitting one of the two (combined) symbols into the two original symbols; the codewords of the two split symbols are formed by appending 0 for one of them and 1 for the other to the codeword of their combined symbol. Repeat this step until all the original symbols have been recovered and each has obtained a codeword.

We now give an example of Huffman encoding.

Example 3.26 Consider a source with alphabet {1, 2, 3, 4, 5, 6} with probabilities 0.25, 0.25, 0.25, 0.1, 0.1 and 0.05, respectively. By following the Huffman encoding procedure as shown in Figure 3.6, we obtain the Huffman code

{00, 01, 10, 110, 1110, 1111}.
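A compact sketch of the Huffman procedure (ours, using Python's heapq rather than the hand construction of Figure 3.6); applied to the distribution of Example 3.26 it yields the codeword lengths {2, 2, 2, 3, 4, 4}, matching the code above up to relabeling of 0/1 at each merge.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return a dict {symbol: codeword} built by repeatedly merging the two
    least probable groups, as in the Huffman encoding algorithm above."""
    tiebreak = count()
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)       # two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.1, 5: 0.1, 6: 0.05})
print(sorted(len(w) for w in code.values()))   # [2, 2, 2, 3, 4, 4]
```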


[Figure 3.6: Example of the Huffman encoding. The two least probable symbols are merged repeatedly (0.05 + 0.1 = 0.15, 0.15 + 0.1 = 0.25, 0.25 + 0.25 = 0.5, 0.25 + 0.25 = 0.5, 0.5 + 0.5 = 1.0), and the codewords 00, 01, 10, 110, 1110, 1111 are assigned by going backward through the merges.]

B) Shannon-Fano-Elias code

Assume X = {0, 1, ..., L−1} and P_X(x) > 0 for all x ∈ X. Define
$$F(x) \triangleq \sum_{a\le x}P_X(a), \quad\text{and}\quad \bar F(x) \triangleq \sum_{a<x}P_X(a) + \frac{1}{2}P_X(x).$$

Encoder: For any x ∈ X, express F̄(x) in binary, say
$$\bar F(x) = .c_1 c_2 \ldots c_k \ldots,$$
and take the first k bits as the codeword of source symbol x, i.e.,
$$(c_1, c_2, \ldots, c_k),$$
where k ≜ ⌈log₂(1/P_X(x))⌉ + 1.

Decoder: Given codeword (c_1, ..., c_k), compute the cumulative sum F(·) starting from the smallest element in {0, 1, ..., L−1} until the first x satisfying
$$F(x) \ge .c_1\ldots c_k.$$


Then x should be the original source symbol.

Proof of decodability: For any number a ∈ [0, 1], let ⌊a⌋_k denote the operation that truncates the binary representation of a after k bits (i.e., removes the (k+1)th bit, the (k+2)th bit, etc.). Then
$$\bar F(x) - \lfloor\bar F(x)\rfloor_k < \frac{1}{2^k}.$$

Since k = ⌈log₂(1/P_X(x))⌉ + 1,
$$\frac{1}{2^k} \le \frac{1}{2}P_X(x) = \left[\sum_{a<x}P_X(a) + \frac{P_X(x)}{2}\right] - \sum_{a\le x-1}P_X(a) = \bar F(x) - F(x-1).$$

Hence,
$$F(x-1) = \left[F(x-1) + \frac{1}{2^k}\right] - \frac{1}{2^k} \le \bar F(x) - \frac{1}{2^k} < \lfloor\bar F(x)\rfloor_k.$$

In addition,
$$F(x) > \bar F(x) \ge \lfloor\bar F(x)\rfloor_k.$$
Consequently, x is the first element satisfying
$$F(x) \ge .c_1 c_2\ldots c_k. \qquad\Box$$

Average codeword length:
$$\bar\ell = \sum_{x\in\mathcal{X}}P_X(x)\left[\left\lceil\log_2\frac{1}{P_X(x)}\right\rceil + 1\right] < \sum_{x\in\mathcal{X}}P_X(x)\log_2\frac{1}{P_X(x)} + 2 = (H(X) + 2)\ \text{bits}.$$

Observation 3.27 The Shannon-Fano-Elias code is a prefix code.
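The following sketch (ours, with a hypothetical distribution) implements the Shannon-Fano-Elias code exactly as described: the encoder truncates F̄(x) to k = ⌈log₂(1/P_X(x))⌉ + 1 bits, and the decoder scans for the first x with F(x) ≥ .c₁…c_k.

```python
import math

def sfe_encode(probs, x):
    """Shannon-Fano-Elias codeword (bit string) for symbol x in {0,...,L-1}."""
    Fbar = sum(probs[a] for a in range(x)) + probs[x] / 2
    k = math.ceil(math.log2(1 / probs[x])) + 1
    bits = ""
    for _ in range(k):                 # binary expansion of Fbar, truncated to k bits
        Fbar *= 2
        bit = int(Fbar)
        bits += str(bit)
        Fbar -= bit
    return bits

def sfe_decode(probs, bits):
    """Return the first x with F(x) >= .c1...ck."""
    value = sum(int(b) * 2.0 ** (-(i + 1)) for i, b in enumerate(bits))
    F = 0.0
    for x, p in enumerate(probs):
        F += p
        if F >= value:
            return x

probs = [0.25, 0.5, 0.125, 0.125]      # hypothetical distribution on {0,1,2,3}
for x in range(4):
    cw = sfe_encode(probs, x)
    assert sfe_decode(probs, cw) == x
    print(x, cw)
```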

3.3.4 Example study on universal lossless variable-length codes

In Section 3.3.3, we assumed that the source distribution is known. Thus we can use either Huffman codes or Shannon-Fano-Elias codes to compress the source.


What if the source distribution is not known a priori? Is it still possible to establish a completely lossless data compression code which is universally good (or asymptotically optimal) for all sources of interest? The answer is affirmative. Two examples are the adaptive Huffman codes and the Lempel-Ziv codes.

A) Adaptive Huffman code

A straightforward universal coding scheme is to use the empirical distribution (or relative frequencies) as the true distribution, and then apply the optimal Huffman code according to this empirical distribution. If the source is i.i.d., the relative frequencies will converge to the true marginal probabilities. Therefore, such universal codes should be good for all i.i.d. sources. However, in order to get an accurate estimate of the true distribution, one must observe a sufficiently long sourceword sequence, which makes the coder suffer a long delay. This can be improved by using the adaptive universal Huffman code [1].

The working procedure of the adaptive Huffman code is as follows. Start with an initial guess of the source distribution (based on the assumption that the source is a DMS). As each new source symbol arrives, encode it with the Huffman coding scheme built from the current estimated distribution, and then update the estimated distribution and the Huffman codebook according to the newly arrived source symbol.

To be specific, let the source alphabet be X ≜ {a_1, ..., a_J}. Define
$$N(a_i|x^n) \triangleq \text{the number of occurrences of } a_i \text{ in } x_1, x_2, \ldots, x_n.$$
Then the (current) relative frequency of a_i is N(a_i|x^n)/n. Let c_n(a_i) denote the Huffman codeword of source symbol a_i with respect to the distribution
$$\left\{\frac{N(a_1|x^n)}{n}, \frac{N(a_2|x^n)}{n}, \cdots, \frac{N(a_J|x^n)}{n}\right\}.$$

Now suppose that x_{n+1} = a_j. The codeword c_n(a_j) is output, and the relative frequency of each source outcome becomes
$$\frac{N(a_j|x^{n+1})}{n+1} = \frac{n\times\bigl(N(a_j|x^n)/n\bigr) + 1}{n+1}$$
and
$$\frac{N(a_i|x^{n+1})}{n+1} = \frac{n\times\bigl(N(a_i|x^n)/n\bigr)}{n+1}\quad\text{for } i\ne j.$$

This observation results in the following distribution update policy:
$$P^{(n+1)}_X(a_j) = \frac{n\,P^{(n)}_X(a_j) + 1}{n+1}$$
and
$$P^{(n+1)}_X(a_i) = \frac{n}{n+1}\,P^{(n)}_X(a_i)\quad\text{for } i\ne j,$$
where P^{(n+1)}_X represents the estimate of the true distribution P_X at time n+1.
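A minimal sketch of this update rule (ours): starting from a uniform initial guess, the estimate is refreshed after each observed symbol exactly as in the two formulas above.

```python
def update_estimate(est, n, symbol):
    """One step of the adaptive distribution update:
    P^(n+1)(a_j) = (n*P^(n)(a_j) + 1)/(n+1),  P^(n+1)(a_i) = n/(n+1)*P^(n)(a_i), i != j."""
    new = {a: n / (n + 1) * p for a, p in est.items()}
    new[symbol] = (n * est[symbol] + 1) / (n + 1)
    return new

alphabet = ["a", "b", "c"]
est = {a: 1 / len(alphabet) for a in alphabet}    # initial uniform guess
for n, sym in enumerate(["a", "b", "a", "a", "c"]):
    est = update_estimate(est, n, sym)
# With n starting at 0 the first update overwrites the initial guess, so the
# estimate ends up equal to the empirical relative frequencies:
print(est)   # {'a': 0.6, 'b': 0.2, 'c': 0.2}
```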

Note that in the adaptive Huffman coding scheme, the encoder and decoder need not be re-designed at every time instant, but only when the estimated distribution changes sufficiently that the sibling property is violated.

Definition 3.28 (sibling property) A prefix code is said to have the sibling property if its code tree satisfies:

1. every node in the code tree (except for the root node) has a sibling (i.e., the tree is saturated), and

2. the nodes can be listed in non-increasing order of probabilities with each node being adjacent to its sibling.

The next observation indicates that the Huffman code is the only prefix code satisfying the sibling property.

Observation 3.29 A prefix code is a Huffman code if, and only if, it satisfies the sibling property.

An example of a code tree satisfying the sibling property is shown in Figure 3.7. The first requirement is satisfied since the tree is saturated. The second requirement can be checked from the node list in Figure 3.7.

If the next observation (n = 17) is a_3, then its codeword 100 is output (using the Huffman code corresponding to P^{(16)}_X). The estimated distribution is updated as

$$P^{(17)}_X(a_1) = \frac{16\times(3/8)}{17} = \frac{6}{17}, \qquad P^{(17)}_X(a_2) = \frac{16\times(1/4)}{17} = \frac{4}{17},$$
$$P^{(17)}_X(a_3) = \frac{16\times(1/8)+1}{17} = \frac{3}{17}, \qquad P^{(17)}_X(a_4) = \frac{16\times(1/8)}{17} = \frac{2}{17},$$
$$P^{(17)}_X(a_5) = \frac{16\times(1/16)}{17} = \frac{1}{17}, \qquad P^{(17)}_X(a_6) = \frac{16\times(1/16)}{17} = \frac{1}{17}.$$

The sibling property is then violated (cf. Figure 3.8). Hence, the codebook needs to be updated according to the new estimated distribution, and the observation at n = 18 shall be encoded using the new codebook in Figure 3.9.


[Figure 3.7: Example of the sibling property based on the code tree from P^{(16)}_X. Leaves: a1(00, 3/8), a2(01, 1/4), a3(100, 1/8), a4(101, 1/8), a5(110, 1/16), a6(111, 1/16); internal nodes: b0(5/8), b1(3/8), b10(1/4), b11(1/8). The node list b0 ≥ b1 ≥ a1 ≥ a2 ≥ b10 ≥ b11 ≥ a3 ≥ a4 ≥ a5 ≥ a6 is in non-increasing order of probability with each node adjacent to its sibling. The arguments in the parentheses following a_j indicate the codeword and the probability associated with a_j; "b" denotes an internal node of the tree with the assigned (partial) code as its subscript, and the number in its parentheses is the probability sum of all its children.]

B) Lempel-Ziv codes

We now introduce a well-known and feasible universal coding scheme, which is named after its inventors, Lempel and Ziv.

Suppose the source alphabet is binary. Then the Lempel-Ziv encoder can be described as follows.

Encoder:

1. Parse the input sequence into strings that have never appeared before. For example, if the input sequence is 1011010100010..., the algorithm first reads the letter 1 and finds that it has never appeared before, so 1 is the first string. It then reads the letter 0 and finds that it has never appeared before, so 0 is the next string. The algorithm next reads 1, which has already appeared; hence it reads one more letter and obtains the new string 11. Repeating this procedure, the source sequence is parsed into the strings

   1, 0, 11, 01, 010, 00, 10.

[Figure 3.8: (Continued from Figure 3.7) Example of violation of the sibling property after observing a new symbol a3 at n = 17. Leaves: a1(00, 6/17), a2(01, 4/17), a3(100, 3/17), a4(101, 2/17), a5(110, 1/17), a6(111, 1/17); internal nodes: b0(10/17), b1(7/17), b10(5/17), b11(2/17). In the ordered node list b0 ≥ b1 ≥ a1 ≥ b10 ≥ a2 ≥ a3 ≥ a4 ≥ b11 ≥ a5 ≥ a6, node a1 is not adjacent to its sibling a2, so the sibling property fails.]

2. Let L be the number of distinct strings in the parsing. Then we need ⌈log₂(L+1)⌉ bits to index these strings (starting from one, with the all-zero index reserved for the empty string). In the above example, L = 7, so each index takes 3 bits:

   parsed source:  1    0    11   01   010  00   10
   index:          001  010  011  100  101  110  111

[Figure 3.9: (Continued from Figure 3.8) Update of the Huffman code. New codewords: a1(10, 6/17), a2(00, 4/17), a3(01, 3/17), a4(110, 2/17), a5(1110, 1/17), a6(1111, 1/17); internal nodes: b1(10/17), b0(7/17), b11(4/17), b111(2/17). The ordered node list b1 ≥ b0 ≥ a1 ≥ b11 ≥ a2 ≥ a3 ≥ a4 ≥ b111 ≥ a5 ≥ a6 pairs each node with its sibling, so the sibling property holds for the new code.]

   The codeword of each string is the index of its prefix concatenated with the last bit of the string. For example, the codeword of source string 010 is the index of 01, i.e., 100, concatenated with the last bit of the source string, i.e., 0. Through this procedure, the above parsed strings (each index taking 3 bits) are encoded into

(000, 1)(000, 0)(001, 1)(010, 1)(100, 0)(010, 0)(001, 0)

or equivalently, 0001000000110101100001000010.

Note that the conventional Lempel-Ziv encoder requires two passes: the first pass to determine L, and the second pass to generate the codewords. The algorithm, however, can be modified so that it requires only one pass over the entire source string. Also note that the above algorithm assigns the same number of bits to every location index, which can likewise be relaxed by proper modification.

Decoder: The decoding is straightforward from the encoding procedure.
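A sketch of the two-pass encoder described above (ours; the function names are our choices, and we use ⌈log₂(L+1)⌉-bit indices so that the all-zero index can denote the empty prefix). Applied to 1011010100010 it reproduces the parsing and the bit stream shown above.

```python
import math

def lz_parse(bits: str):
    """Parse the input into strings never seen before (incremental parsing)."""
    phrases, seen, cur = [], set(), ""
    for b in bits:
        cur += b
        if cur not in seen:
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    return phrases     # a non-empty tail repeating an earlier phrase is dropped here

def lz_encode(bits: str) -> str:
    phrases = lz_parse(bits)
    width = math.ceil(math.log2(len(phrases) + 1))    # index 0 = empty prefix
    index = {p: i + 1 for i, p in enumerate(phrases)}
    out = ""
    for p in phrases:
        prefix, last = p[:-1], p[-1]
        out += format(index.get(prefix, 0), f"0{width}b") + last
    return out

print(lz_parse("1011010100010"))   # ['1', '0', '11', '01', '010', '00', '10']
print(lz_encode("1011010100010"))  # 0001000000110101100001000010
```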

Theorem 3.30 The above algorithm asymptotically achieves the entropy rate of any stationary source (with unknown statistics).


Proof: Please refer to Section 3.4 of Volume II of the lecture notes. □


Bibliography

[1] R. G. Gallager, "Variations on a theme by Huffman," IEEE Trans. on Inform. Theory, Vol. 24, pp. 668–674, November 1978.

[2] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Trans. on Inform. Theory, Vol. IT–39, No. 3, pp. 752–772, May 1993.

[3] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. on Inform. Theory, Vol. IT–40, No. 4, pp. 1147–1157, July 1994.


Chapter 4

Data Transmission and Channel Capacity

4.1 Principles of data transmission

A noisy communication channel is a channel for which the output is not completely determined by the input. Similar to data compression, such a channel is mathematically modelled by a probabilistic description. Each possible channel input x induces a probability distribution on the channel output y according to a transition probability P_{Y|X}(y|x). Since two different inputs may give rise to the same output, the receiver, upon receipt of an output, needs to guess what the most probable input is.

The codewords for data transmission are selected from the set of channel input symbols. The designer of a data transmission code needs to carefully select codewords so that minimal ambiguity is obtained at the channel output site. For example, suppose that the channel transition probability is given by:

PY |X(y = 0|x = 00) = 1

PY |X(y = 0|x = 01) = 1

PY |X(y = 1|x = 10) = 1

PY |X(y = 1|x = 11) = 1,

which can be graphically depicted as follows:

[Channel diagram: inputs 00 and 01 are both mapped to output 0 with probability 1, and inputs 10 and 11 are both mapped to output 1 with probability 1.]

and a binary message (either event A or event B) is required to be transmitted from the sender to the receiver. Then the data transmission code of (00 for event A, 10 for event B) obviously induces less ambiguity at the receiver site than the code of (00 for event A, 01 for event B).

In short, the objective in designing a data transmission code is to transform a noisy channel into a reliable channel for the messages intended for transmission. To achieve this goal, the designer of a data transmission code needs to take advantage of the common parts between the sender site and the receiver site that are least affected by the noise. Probabilistically, these common parts constitute the channel mutual information.

[Figure 4.1: A data transmission system: U → Channel Encoder → X → Channel P_{Y|X}(y|x) → Y → Channel Decoder → Û, where U represents the message for transmission, X denotes the codeword (channel input) corresponding to message U, Y represents the received vector due to channel input X, and Û denotes the message reconstructed from Y.]

After a "least-noise-affected" subset of the channel input symbols is wisely selected as codewords, the messages intended to be transmitted can be reliably sent to the receiver with arbitrarily small error. Theorists then raise the following question:

What is the maximum amount of information (per channel usage) that can be reliably transmitted via a given noisy channel?

In the example above, we can transmit a binary informational message error-free, and hence the amount of information that can be reliably transmitted is at least 1 bit per channel usage. It can be expected that the amount of information that can be reliably transmitted over a highly noisy channel should be less than that over a less noisy channel. But such a comparison requires a good measure of the "noisiness" of channels.

From the viewpoint of information theory, channel capacity is perhaps a good measure of the noisiness of a channel; it is defined as the maximum amount of information (per channel usage) that can be transmitted via this channel with arbitrarily small error. In addition to its dependence on the channel noise, channel capacity also depends on the coding constraint imposed on the


channel input, such as "only convolutional codes are allowed." When no coding constraint is applied to the channel input (so that variable-length codes can be employed), the derivation of the channel capacity is usually viewed as a hard problem, and it is only partially solved so far. In this chapter, we will introduce the channel capacity for block codes (namely, only block transmission codes can be used). Throughout the chapter, the noisy channel is assumed to be memoryless.

4.2 Preliminaries

Definition 4.1 (fixed-length data transmission code) An (n, M) fixed-length data transmission code for channel input alphabet X^n and output alphabet Y^n consists of

1. M informational messages intended for transmission;

2. an encoding function
$$f : \{1, 2, \ldots, M\} \to \mathcal{X}^n;$$

3. a decoding function
$$g : \mathcal{Y}^n \to \{1, 2, \ldots, M\},$$
which is (usually) a deterministic rule that assigns a guess to each possible n-dimensional received vector.

The channel inputs in {x^n ∈ X^n : x^n = f(m) for some 1 ≤ m ≤ M} are the codewords of the data transmission block code.

Definition 4.2 (average probability of error) The average probability of error for a C_n = (n, M) code with encoder f(·) and decoder g(·) transmitted over channel Q_{Y^n|X^n} is defined as
$$P_e(\mathcal{C}_n) = \frac{1}{M}\sum_{i=1}^{M}\lambda_i,$$
where
$$\lambda_i \triangleq \sum_{\{y^n\in\mathcal{Y}^n\,:\,g(y^n)\ne i\}} Q_{Y^n|X^n}\bigl(y^n|f(i)\bigr).$$

Under the criterion of average probability of error, all codewords are treated equally; namely, the prior probabilities of the selected M codewords are uniform.


Definition 4.3 (discrete memoryless channel) A discrete memoryless channel (DMC) is a channel whose transition probability Q_{Y^n|X^n} satisfies
$$Q_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n}Q_{Y|X}(y_i|x_i).$$

4.3 Block codes for data transmission over DMC

Our target in this section is to find a good data transmission block code (or to show the existence of a good data transmission block code). From the (weak) law-of-large-numbers viewpoint, a good choice is to draw the data transmission block codewords based on the joint typical set between the input and the output of the channel, since all the probability mass is ultimately placed on the joint typical set. A decoding failure then occurs only when the channel input-output pair does not lie in the joint typical set, which implies that the probability of decoding error is ultimately small. We begin our discussion with the definition of the joint typical set.

Definition 4.4 (joint typical set) The set F_n(δ) of jointly δ-typical sequences (x^n, y^n) with respect to the memoryless distribution P_{X^n,Y^n} is defined by
$$F_n(\delta) \triangleq \Bigl\{(x^n,y^n)\in\mathcal{X}^n\times\mathcal{Y}^n : \Bigl|-\frac{1}{n}\log P_{X^n}(x^n) - H(X)\Bigr| < \delta,\ \Bigl|-\frac{1}{n}\log P_{Y^n}(y^n) - H(Y)\Bigr| < \delta,$$
$$\text{and}\ \Bigl|-\frac{1}{n}\log P_{X^n,Y^n}(x^n,y^n) - H(X,Y)\Bigr| < \delta\Bigr\}.$$

In short, it says that the empirical entropy is δ-close to the true entropy.

With the above definition, we derive the joint AEP theorem.

Theorem 4.5 (joint AEP) If (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n), ... are i.i.d., then
$$-\frac{1}{n}\log P_{X^n}(X_1, X_2, \ldots, X_n) \to H(X)\ \text{in probability};$$
$$-\frac{1}{n}\log P_{Y^n}(Y_1, Y_2, \ldots, Y_n) \to H(Y)\ \text{in probability};$$
and
$$-\frac{1}{n}\log P_{X^n,Y^n}\bigl((X_1,Y_1),\ldots,(X_n,Y_n)\bigr) \to H(X,Y)\ \text{in probability}.$$


Proof: By the weak law of large numbers, we have the desired result. □

Before proving the coding theorem for data transmission, we need to first prove the Shannon-McMillan theorem for pairs.

Theorem 4.6 (Shannon-McMillan theorem for pairs) Given a dependent pair of DMSs with joint entropy H(X, Y) and any δ greater than zero, we can choose n big enough so that the jointly δ-typical set satisfies:

1. P_{X^n,Y^n}(F_n^c(δ)) < δ for sufficiently large n.

2. The number of elements in F_n(δ) is at least (1 − δ)e^{n(H(X,Y)−δ)} for sufficiently large n, and at most e^{n(H(X,Y)+δ)} for every n.

3. If (x^n, y^n) ∈ F_n(δ), its probability of occurrence satisfies
$$e^{-n(H(X,Y)+\delta)} < P_{X^n,Y^n}(x^n,y^n) < e^{-n(H(X,Y)-\delta)}.$$

Proof: The proof is similar to that of the Shannon-McMillan theorem for a single memoryless source, and hence we omit it. □

We are now ready to show the channel coding theorem.

Theorem 4.7 (Shannon's channel coding theorem) Consider a DMC with marginal transition probability Q_{Y|X}(y|x). Define the channel capacity(1)
$$C \triangleq \max_{P_{X,Y}:\,P_{Y|X}=Q_{Y|X}} I(X;Y) = \max_{P_X} I(P_X, Q_{Y|X}),$$
and fix ε > 0 arbitrarily small. There exist γ > 0 and a sequence of data transmission block codes {C_n = (n, M_n)}_{n=1}^∞ with
$$\frac{1}{n}\log M_n > C - \gamma$$
such that
$$P_e(\mathcal{C}_n) < \varepsilon\quad\text{for sufficiently large } n.$$

(1) Note that the mutual information is actually a function of the input statistics P_X and the channel statistics Q_{Y|X}. Hence, we may write it as
$$I(P_X, Q_{Y|X}) \triangleq \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}P_X(x)Q_{Y|X}(y|x)\log\frac{Q_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}}P_X(x')Q_{Y|X}(y|x')}.$$
Such an expression is more suitable for calculating the channel capacity.


Proof: It suffices to prove the existence of a good block code sequence (satisfying the rate condition, i.e., (1/n) log M_n > C − γ for some γ > 0) whose average decoding error is ultimately less than ε. In the proof, the good block code sequence is not designed deterministically; instead, its existence is proven by showing that, for a class of block code sequences and a code-selecting distribution over them, the expectation of the average block decoding error (evaluated under this code-selecting distribution) can be made smaller than ε for n sufficiently large. Hence, there must exist a desired good code sequence among them.

Fix some γ ∈ (0, 4ε). Observe that there exists N_0 such that for n > N_0, we can choose an integer M_n with
$$C - \frac{\gamma}{2} \ge \frac{1}{n}\log M_n > C - \gamma.$$

(Since we are only concerned with the case of "all sufficiently large n," it suffices to consider only those n satisfying n > N_0, and ignore those n with n ≤ N_0.)

Define δ ≜ γ/8. Let $P_{\hat{X}}$ be the probability distribution achieving the channel capacity,(2) i.e.,
$$C \triangleq \max_{P_X} I(P_X, Q_{Y|X}) = I(P_{\hat{X}}, Q_{Y|X}).$$
Denote by $P_{\hat{Y}^n}$ the channel output distribution due to the channel input $P_{\hat{X}^n}$ (where $P_{\hat{X}^n}(x^n) = \prod_{i=1}^{n}P_{\hat{X}}(x_i)$) through the channel $Q_{Y^n|X^n}$, i.e.,
$$P_{\hat{X}^n\hat{Y}^n}(x^n,y^n) \triangleq P_{\hat{X}^n}(x^n)\,Q_{Y^n|X^n}(y^n|x^n)$$
and
$$P_{\hat{Y}^n}(y^n) \triangleq \sum_{x^n\in\mathcal{X}^n}P_{\hat{X}^n\hat{Y}^n}(x^n,y^n).$$

We then present the proof in three steps.

Step 1: Code construction. For any blocklength n, independently select M_n channel inputs with replacement(3) from X^n according to the distribution $P_{\hat{X}^n}(x^n)$. For the selected M_n channel inputs $\mathcal{C}_n \triangleq \{c_1, c_2, \ldots, c_{M_n}\}$, respectively define the encoder f_n(·) and decoder g_n(·) as
$$f_n(m) = c_m \quad\text{for } 1\le m\le M_n,$$

(2) The supremum of a concave function is achievable (and hence may be called a maximum) except when there are jumps at the boundary points. Since I(P_X, Q_{Y|X}) is concave and continuous with respect to P_X, the achievability of the supremum is guaranteed.

(3) Here, the channel inputs are selected with replacement. This means it is possible, and acceptable, that all the selected M_n channel inputs are identical.


and
$$g_n(y^n) = \begin{cases} m, & \text{if } c_m \text{ is the only codeword in } \mathcal{C}_n \text{ satisfying } (c_m,y^n)\in F_n(\delta);\\ \text{any index in } \{1,2,\ldots,M_n\}, & \text{otherwise},\end{cases}$$
where F_n(δ) is defined in Definition 4.4 with respect to the distribution $P_{\hat{X}^n\hat{Y}^n}$.

Step 2: Error probability. For the data transmission code defined above, the conditional probability of error given that message m was sent, denoted by λ_m, can be upper bounded by
$$\lambda_m \le \sum_{\{y^n\in\mathcal{Y}^n:\,(c_m,y^n)\notin F_n(\delta)\}} Q_{Y^n|X^n}(y^n|c_m)\;+\;\sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\ \sum_{\{y^n\in\mathcal{Y}^n:\,(c_{m'},y^n)\in F_n(\delta)\}} Q_{Y^n|X^n}(y^n|c_m), \tag{4.3.1}$$

where the first term in (4.3.1) covers the case in which the received channel output y^n is not weakly jointly δ-typical with c_m (and hence the decoding rule g_n(·) may result in a wrong guess), and the second term in (4.3.1) reflects the situation in which y^n is weakly jointly δ-typical not only with the transmitted codeword c_m but also with another codeword c_{m'} (which may also cause a decoding error).

Taking the expectation with respect to the mth codeword-selecting distribution $P_{\hat{X}^n}(c_m)$, (4.3.1) gives
$$E[\lambda_m] = \sum_{c_m\in\mathcal{X}^n}P_{\hat{X}^n}(c_m)\,\lambda_m \le \sum_{c_m\in\mathcal{X}^n}\ \sum_{y^n\notin F_n(\delta|c_m)}P_{\hat{X}^n}(c_m)\,Q_{Y^n|X^n}(y^n|c_m) + \sum_{c_m\in\mathcal{X}^n}\ \sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\ \sum_{y^n\in F_n(\delta|c_{m'})}P_{\hat{X}^n}(c_m)\,Q_{Y^n|X^n}(y^n|c_m)$$
$$= P_{\hat{X}^n\hat{Y}^n}\bigl(F_n^c(\delta)\bigr) + \sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\ \sum_{c_m\in\mathcal{X}^n}\ \sum_{y^n\in F_n(\delta|c_{m'})}P_{\hat{X}^n\hat{Y}^n}(c_m,y^n), \tag{4.3.2}$$
where
$$F_n(\delta|x^n) \triangleq \{y^n\in\mathcal{Y}^n : (x^n,y^n)\in F_n(\delta)\}.$$

Step 3: The expectation of the average decoding error P_e(C_n) (over the M_n selected codewords) with respect to the random code selection C_n can be expressed as:

$$E[P_e(\mathcal{C}_n)] = \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n}P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{M_n})\left(\frac{1}{M_n}\sum_{m=1}^{M_n}\lambda_m\right)$$
$$= \frac{1}{M_n}\sum_{m=1}^{M_n}\ \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n}P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n})\left(\sum_{c_m\in\mathcal{X}^n}P_{\hat{X}^n}(c_m)\lambda_m\right)$$
$$= \frac{1}{M_n}\sum_{m=1}^{M_n}\ \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n}P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n})\times E[\lambda_m]$$
$$\le \frac{1}{M_n}\sum_{m=1}^{M_n}\ \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n}P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n})\times\left[P_{\hat{X}^n\hat{Y}^n}\bigl(F_n^c(\delta)\bigr) + \sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\sum_{c_m\in\mathcal{X}^n}\sum_{y^n\in F_n(\delta|c_{m'})}P_{\hat{X}^n\hat{Y}^n}(c_m,y^n)\right] \tag{4.3.3}$$
$$= P_{\hat{X}^n\hat{Y}^n}\bigl(F_n^c(\delta)\bigr) + \frac{1}{M_n}\sum_{m=1}^{M_n}\sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\ \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n}P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n})\times\sum_{c_m\in\mathcal{X}^n}\sum_{y^n\in F_n(\delta|c_{m'})}P_{\hat{X}^n\hat{Y}^n}(c_m,y^n),$$
where (4.3.3) follows from (4.3.2), and the last step holds since $P_{\hat{X}^n\hat{Y}^n}(F_n^c(\delta))$ is a constant independent of c_1, ..., c_{M_n} and m. Observe that for n > N_0,

$$\sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\ \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{m-1}\in\mathcal{X}^n}\sum_{c_{m+1}\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n}P_{\hat{X}^n}(c_1)\cdots P_{\hat{X}^n}(c_{m-1})P_{\hat{X}^n}(c_{m+1})\cdots P_{\hat{X}^n}(c_{M_n})\times\sum_{c_m\in\mathcal{X}^n}\sum_{y^n\in F_n(\delta|c_{m'})}P_{\hat{X}^n\hat{Y}^n}(c_m,y^n)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\ \sum_{c_m\in\mathcal{X}^n}\sum_{c_{m'}\in\mathcal{X}^n}\sum_{y^n\in F_n(\delta|c_{m'})}P_{\hat{X}^n}(c_{m'})P_{\hat{X}^n\hat{Y}^n}(c_m,y^n)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\ \sum_{c_{m'}\in\mathcal{X}^n}\sum_{y^n\in F_n(\delta|c_{m'})}P_{\hat{X}^n}(c_{m'})\left(\sum_{c_m\in\mathcal{X}^n}P_{\hat{X}^n\hat{Y}^n}(c_m,y^n)\right)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\ \sum_{c_{m'}\in\mathcal{X}^n}\sum_{y^n\in F_n(\delta|c_{m'})}P_{\hat{X}^n}(c_{m'})P_{\hat{Y}^n}(y^n)$$
$$= \sum_{\substack{m'=1\\ m'\neq m}}^{M_n}\ \sum_{(c_{m'},y^n)\in F_n(\delta)}P_{\hat{X}^n}(c_{m'})P_{\hat{Y}^n}(y^n)$$
$$\le \sum_{\substack{m'=1\\ m'\neq m}}^{M_n}|F_n(\delta)|\,e^{-n(H(\hat X)-\delta)}e^{-n(H(\hat Y)-\delta)}$$
$$\le \sum_{\substack{m'=1\\ m'\neq m}}^{M_n}e^{n(H(\hat X,\hat Y)+\delta)}e^{-n(H(\hat X)-\delta)}e^{-n(H(\hat Y)-\delta)}$$
$$= (M_n-1)\,e^{n(H(\hat X,\hat Y)+\delta)}e^{-n(H(\hat X)-\delta)}e^{-n(H(\hat Y)-\delta)}$$
$$\le M_n\,e^{n(H(\hat X,\hat Y)+\delta)}e^{-n(H(\hat X)-\delta)}e^{-n(H(\hat Y)-\delta)}$$
$$\le e^{n(C-4\delta)}\cdot e^{-n(I(\hat X;\hat Y)-3\delta)} = e^{-n\delta},$$
where the last step follows since C = I(X̂; Ŷ) by definition of X̂ and Ŷ, and (1/n) log M_n ≤ C − γ/2 = C − 4δ. Consequently,
$$E[P_e(\mathcal{C}_n)] \le P_{\hat{X}^n\hat{Y}^n}\bigl(F_n^c(\delta)\bigr) + e^{-n\delta},$$
which for sufficiently large n (and n > N_0) can be made smaller than 2δ = γ/4 < ε by the Shannon-McMillan theorem for pairs. □


Usually, we take liminf_{n→∞}(1/n) log M_n as the ultimate data transmission rate of a block code sequence, also denoted by R. This follows the same convention as in data compression; however, a little confusion may arise (cf. page 50) from the use of the same notation. In data compression, the code rate measures the number of bits required per sourceword, while in data transmission it measures the number of bits carried by one channel input symbol. Apparently, a smaller R is preferred in data compression, while a larger R conveys information faster in data transmission. As a consequence, in order to have the error probability vanish as n tends to infinity (i.e., the error can be made arbitrarily small for all sufficiently large n), the liminf operation on the code rate should be used in data transmission, instead of the limsup operation adopted in data compression.

In the previous chapter, we proved that the lossless data compression rate is lower bounded by the source entropy. From the above theorem, we obtain a parallel result for data transmission: reliable(4) transmission is achieved at any rate R with R > C − γ for some γ > 0. Next, similarly to source coding, we will prove a converse statement: when one desires to increase R beyond C (i.e., R > C), reliability in data transmission is no longer guaranteed; hence, the reliable data transmission rate is upper bounded by the channel capacity C. We start with a lemma that relates the probability of error (in guessing the random variable X based on a received dependent random variable Y) to the conditional entropy H(X|Y).

Lemma 4.8 (Fano's inequality) Let X and Y be two random variables, correlated in general, with values in X and Y, respectively, where X is finite but Y can be an infinite set. Let x̂ ≜ g(y) be an estimate of x from observing y. Define the probability of estimation error as
$$P_e \triangleq \Pr\{g(Y)\ne X\}.$$
Then for any estimating function g(·),
$$H_b(P_e) + P_e\cdot\log(|\mathcal{X}|-1) \ge H(X|Y),$$
where H_b(P_e) is the binary entropy function defined by
$$H_b(t) \triangleq -t\cdot\log t - (1-t)\cdot\log(1-t).$$

Proof: Define a new random variable
$$E \triangleq \begin{cases}1, & \text{if } g(Y)\ne X,\\ 0, & \text{if } g(Y) = X.\end{cases}$$

(4) "Reliable" is shorthand for "the block decoding error can be made arbitrarily small for all sufficiently large blocklengths."


Then using the chain rule for conditional entropy, we obtain
$$H(E, X|Y) = H(X|Y) + H(E|X, Y) = H(E|Y) + H(X|E, Y). \tag{4.3.4}$$

Observe that E is a function of X and Y; hence, H(E|X, Y) = 0. Since conditioning never increases entropy, H(E|Y) ≤ H(E) = H_b(P_e). The remaining term, H(X|E, Y), can be bounded as follows:
$$H(X|E, Y) = \Pr(E=0)H(X|Y, E=0) + \Pr(E=1)H(X|Y, E=1) \le (1-P_e)\cdot 0 + P_e\cdot\log(|\mathcal{X}|-1),$$

since X = g(Y) when E = 0, and, given E = 1, we can upper bound the conditional entropy by the logarithm of the number of remaining outcomes, i.e., (|X| − 1). Combining these results, we obtain Fano's inequality. □

Fano's inequality cannot be improved in the sense that the bound is met with equality in some specific cases. Any bound that is achieved in some cases is often referred to as sharp.(5) From the proof of the above lemma, we observe that equality holds in Fano's inequality if H(E|Y) = H(E) and H(X|Y, E = 1) = log(|X| − 1). The former is equivalent to E being independent of Y; the latter holds if, and only if, P_{X|Y}(·|y) is uniformly distributed over the set X \ {g(y)}. We can therefore construct an example for which Fano's inequality holds with equality.

Example 4.9 Suppose that X and Y are two independent random variables, both uniformly distributed on {0, 1, 2}. Let the estimator be x̂ = g(y) = y. Then
$$P_e = \Pr\{g(Y)\ne X\} = \Pr\{Y\ne X\} = 1 - \sum_{x=0}^{2}P_X(x)P_Y(x) = \frac{2}{3}.$$

In this case, equality holds in Fano's inequality, i.e.,
$$H_b\!\left(\frac{2}{3}\right) + \frac{2}{3}\cdot\log(3-1) = H(X|Y) = H(X) = \log 3.$$
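A quick numerical check of Example 4.9 (our sketch): with |X| = 3 and P_e = 2/3, the left-hand side of Fano's inequality indeed equals log 3 = H(X|Y) (natural logarithms, i.e., nats).

```python
import math

def fano_lhs(pe: float, alphabet_size: int) -> float:
    """H_b(P_e) + P_e * log(|X| - 1), in nats."""
    hb = -pe * math.log(pe) - (1 - pe) * math.log(1 - pe)
    return hb + pe * math.log(alphabet_size - 1)

pe = 2 / 3
print(fano_lhs(pe, 3), math.log(3))   # both ~1.0986 nats
```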

We also remark that Fano's inequality can, in some situations, also provide bounds on the error probability. This can be seen by plotting its permissible region (cf. Figure 4.2). When
$$\log(|\mathcal{X}|-1) < H(X|Y) \le \log|\mathcal{X}|,$$
we obtain
$$0 < \inf\{a : H_b(a) + a\log(|\mathcal{X}|-1)\ge H(X|Y)\} \le P_e \le \sup\{a : H_b(a) + a\log(|\mathcal{X}|-1)\ge H(X|Y)\} < 1;$$
so both lower and upper bounds on P_e are provided. When
$$0 < H(X|Y) \le \log(|\mathcal{X}|-1),$$
we obtain
$$P_e \ge \inf\{a : H_b(a) + a\log(|\mathcal{X}|-1)\ge H(X|Y)\} > 0;$$
thus, if H(X|Y) is bounded away from zero, P_e is also bounded away from zero. Fano's inequality is the key to the converse of Shannon's channel coding theorem.

[Figure 4.2: Permissible (P_e, H(X|Y)) region of Fano's inequality. The boundary curve H_b(P_e) + P_e log(|X| − 1) increases from 0 at P_e = 0 to its maximum value log|X| at P_e = (|X| − 1)/|X|.]

(5) Definition. A bound is said to be sharp if the bound is achievable in some specific cases. A bound is said to be tight if the bound is achievable in all cases.

Theorem 4.10 (weak converse to Shannon's channel coding theorem) Fix a DMC with marginal transition probability Q_{Y|X}. For any data transmission code sequence {C_n = (n, M_n)}_{n=1}^∞, if
$$\liminf_{n\to\infty}\frac{1}{n}\log M_n > C,$$
then the average probability of block decoding error is bounded away from zero for all n sufficiently large.


Proof: For an (n, M_n) block data transmission code, an encoding function
$$f_n : \{1, 2, \ldots, M_n\} \to \mathcal{X}^n$$
is chosen, and each index i is equally likely under the average-probability-of-error criterion. Hence, we can assume that the information message is generated by a random variable W uniformly distributed over {1, 2, ..., M_n}. As a result,
$$H(W) = \log M_n.$$

Since W → X^n → Y^n forms a Markov chain (because Y^n depends only on X^n), the data processing lemma gives I(W; Y^n) ≤ I(X^n; Y^n). We can also bound I(X^n; Y^n) by the channel capacity C as
$$I(X^n;Y^n) \le \max_{P_{X^n,Y^n}:\,P_{Y^n|X^n}=Q_{Y^n|X^n}} I(X^n;Y^n)$$
$$\le \max_{P_{X^n,Y^n}:\,P_{Y^n|X^n}=Q_{Y^n|X^n}} \sum_{i=1}^{n}I(X_i;Y_i)\quad\text{(by Theorem 2.20)}$$
$$\le \sum_{i=1}^{n}\ \max_{P_{X^n,Y^n}:\,P_{Y^n|X^n}=Q_{Y^n|X^n}} I(X_i;Y_i)$$
$$= \sum_{i=1}^{n}\ \max_{P_{X_i,Y_i}:\,P_{Y_i|X_i}=Q_{Y_i|X_i}} I(X_i;Y_i)$$
$$= nC.$$

Consequently, defining P_e(C_n) as the probability of error in guessing W from the observation Y^n via a decoding function
$$g_n : \mathcal{Y}^n \to \{1, 2, \ldots, M_n\},$$
which is exactly the average block decoding error, we get

$$\log M_n = H(W) = H(W|Y^n) + I(W;Y^n)$$
$$\le H(W|Y^n) + I(X^n;Y^n)$$
$$\le H_b\bigl(P_e(\mathcal{C}_n)\bigr) + P_e(\mathcal{C}_n)\cdot\log(|\mathcal{W}|-1) + nC\quad\text{(by Fano's inequality)}$$
$$\le \log 2 + P_e(\mathcal{C}_n)\cdot\log(M_n-1) + nC\quad\text{(since } H_b(t)\le\log 2\ \text{for all } t\in[0,1])$$
$$\le \log 2 + P_e(\mathcal{C}_n)\cdot\log M_n + nC,$$

which implies that
$$P_e(\mathcal{C}_n) \ge 1 - \frac{C}{(1/n)\log M_n} - \frac{\log 2}{\log M_n}.$$


So if liminf_{n→∞}(1/n) log M_n > C, then there exist δ > 0 and an integer N such that for n ≥ N,
$$\frac{1}{n}\log M_n > C + \delta.$$

Hence, for n ≥ N_0 ≜ max{N, 2 log 2/δ},
$$P_e(\mathcal{C}_n) \ge 1 - \frac{C}{C+\delta} - \frac{\log 2}{n(C+\delta)} \ge \frac{\delta}{2(C+\delta)}. \qquad\Box$$

4.4 Examples of DMCs

4.4.1 Identity channels

An identity channel has input and output alphabets of equal size (|X| = |Y|) and a channel transition probability satisfying
$$Q_{Y|X}(y|x) = \text{either } 1 \text{ or } 0.$$
In such a channel, H(Y|X) = 0, since Y is completely determined by X (no uncertainty about Y remains once X is given). As a consequence,
$$I(X;Y) = H(Y) - H(Y|X) = H(Y),$$
and the channel capacity is
$$C = \max_{P_X} I(X;Y) = \max_{P_X} H(Y) = \log|\mathcal{Y}|\ \text{nats/channel usage}.$$

4.4.2 Binary symmetric channels

A binary symmetric channel (BSC) is a channel with binary input and output alphabets for which the probability that an input symbol is complemented at the output is the same for both input symbols, as shown in Figure 4.3.

This is the simplest model of a channel with errors, yet it captures most of the complexity of the general problem. To compute its channel capacity,


[Figure 4.3: Binary symmetric channel: input 0 goes to output 0 with probability 1 − ε and to output 1 with probability ε; input 1 goes to output 1 with probability 1 − ε and to output 0 with probability ε.]

we first bound the mutual information by
$$\begin{aligned}
I(X;Y) &= H(Y) - H(Y|X)\\
&= H(Y) - \sum_{x=0}^{1} P_X(x)H(Y|X=x)\\
&= H(Y) - \sum_{x=0}^{1} P_X(x)H_b(\varepsilon)\\
&= H(Y) - H_b(\varepsilon)\\
&\le \log(2) - H_b(\varepsilon),
\end{aligned}$$
where $H_b(u) \triangleq -u\log u - (1-u)\log(1-u)$ is the binary entropy function, and the last inequality follows because $Y$ is a binary random variable. Equality is achieved when $H(Y) = \log(2)$, which is induced by a uniform input distribution. Hence,
$$C \triangleq \max_{P_X} I(X;Y) = [\log(2) - H_b(\varepsilon)] \text{ nats/channel usage}.$$

An alternative way to derive the channel capacity of the BSC is to first assume $P_X(0) = p = 1 - P_X(1)$ and to express $I(X;Y)$ as
$$\begin{aligned}
I(X;Y) = {}& (1-\varepsilon)\log(1-\varepsilon) + \varepsilon\log(\varepsilon)\\
&- [p(1-\varepsilon) + (1-p)\varepsilon]\log[p(1-\varepsilon) + (1-p)\varepsilon]\\
&- [p\varepsilon + (1-p)(1-\varepsilon)]\log[p\varepsilon + (1-p)(1-\varepsilon)];
\end{aligned}$$
maximizing this quantity over $p\in[0,1]$ yields the maximizer $p^* = 1/2$, which immediately gives $C = \log(2) - H_b(\varepsilon)$.
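As a quick numerical sanity check, the short Python sketch below evaluates $I(X;Y)$ over a grid of input probabilities $p$ and compares the maximum with the closed form $\log(2) - H_b(\varepsilon)$; the function names are illustrative only.

import numpy as np

def binary_entropy(u):
    # Hb(u) = -u*log(u) - (1-u)*log(1-u) in nats, with 0*log(0) treated as 0.
    u = np.clip(u, 1e-12, 1 - 1e-12)
    return -u * np.log(u) - (1 - u) * np.log(1 - u)

def bsc_mutual_information(p, eps):
    # I(X;Y) = H(Y) - H(Y|X) for a BSC with P_X(0) = p and crossover probability eps.
    p_y0 = p * (1 - eps) + (1 - p) * eps   # P_Y(0)
    return binary_entropy(p_y0) - binary_entropy(eps)

eps = 0.1
grid = np.linspace(0.0, 1.0, 1001)
numerical_max = max(bsc_mutual_information(p, eps) for p in grid)
print(numerical_max, np.log(2) - binary_entropy(eps))   # both ~0.368 nats, maximized at p = 1/2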


4.4.3 Symmetric, weakly symmetric and quasi-symmetric channels

A channel is said to be weakly symmetric if the set
$$A(x) \triangleq \{P_{Y|X}(y_1|x), \ldots, P_{Y|X}(y_{|\mathcal{Y}|}|x)\}$$
is identical for every $x$, and $\sum_{x\in\mathcal{X}} P_{Y|X}(y_i|x)$ equals a constant for every $1\le i\le|\mathcal{Y}|$. If, in addition, the set
$$B(y) \triangleq \{P_{Y|X}(y|x_1), \ldots, P_{Y|X}(y|x_{|\mathcal{X}|})\}$$
is also identical for every $y$, the channel is called a symmetric channel. One can equivalently define symmetry and weak symmetry in terms of the channel probability transition matrix $[P_{Y|X}]$. A channel is weakly symmetric if every row of $[P_{Y|X}]$ is a permutation of the first row (the entries of each row must sum to 1 by basic probability theory) and all the column sums are equal; if, in addition, every column of $[P_{Y|X}]$ is also a permutation of the first column, the channel is symmetric.

A quick example of a symmetric channel is the BSC introduced in the previous subsection. Another example, with ternary input and output alphabets, is
$$\begin{aligned}
&P_{Y|X}(0|0) = 0.4,\quad P_{Y|X}(1|0) = 0.1,\quad P_{Y|X}(2|0) = 0.5;\\
&P_{Y|X}(0|1) = 0.5,\quad P_{Y|X}(1|1) = 0.4,\quad P_{Y|X}(2|1) = 0.1;\\
&P_{Y|X}(0|2) = 0.1,\quad P_{Y|X}(1|2) = 0.5,\quad P_{Y|X}(2|2) = 0.4.
\end{aligned}$$

Furthermore, the frequently encountered mod-$q$ channel, modelled as
$$Y = (X + Z) \bmod q,$$
where the channel input $X$ and the noise $Z$ are independent and take values in the same alphabet $\{0, 1, \ldots, q-1\}$, is also a symmetric channel.

The channel capacity of a symmetric channel can be computed in a way similar to the BSC:
$$\begin{aligned}
I(X;Y) &= H(Y) - H(Y|X)\\
&= H(Y) - \sum_{x\in\mathcal{X}} P_X(x)H(Y|X=x)\\
&= H(Y) - \sum_{x\in\mathcal{X}} P_X(x)H(Z)\\
&= H(Y) - H(Z)\\
&\le \log|\mathcal{Y}| - H(Z), \qquad (4.4.1)
\end{aligned}$$
with equality if the output distribution is uniform, which is achieved by a uniform input. Therefore,
$$C \triangleq \max_{P_X} I(X;Y) = [\log|\mathcal{Y}| - H(Z)] \text{ nats/channel usage}.$$
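For concreteness, the following minimal sketch evaluates this capacity formula for the ternary symmetric channel given earlier in this subsection and checks it against the mutual information achieved by the uniform input; the code and its names are purely illustrative.

import numpy as np

# Transition matrix of the ternary symmetric channel from the example above (rows: inputs).
Q = np.array([[0.4, 0.1, 0.5],
              [0.5, 0.4, 0.1],
              [0.1, 0.5, 0.4]])

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))        # in nats

# Capacity formula for a symmetric channel: C = log|Y| - H(row distribution).
C_formula = np.log(Q.shape[1]) - entropy(Q[0])

# Mutual information achieved by the uniform input distribution.
px = np.ones(3) / 3
py = px @ Q                              # output distribution
I_uniform = entropy(py) - sum(px[i] * entropy(Q[i]) for i in range(3))

print(C_formula, I_uniform)              # both ~0.155 nats/channel usage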

Actually, capacity being achieved by a uniform input is not restricted to symmetric and weakly symmetric channels; it extends to quasi-symmetric channels, defined as channels whose transition matrix can be partitioned along its columns into weakly symmetric sub-arrays. An example of a quasi-symmetric channel is the binary erasure channel with transition matrix
$$\begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(1|0) & P_{Y|X}(e|0)\\ P_{Y|X}(0|1) & P_{Y|X}(1|1) & P_{Y|X}(e|1)\end{bmatrix}
= \begin{bmatrix} 1-\varepsilon & 0 & \varepsilon\\ 0 & 1-\varepsilon & \varepsilon\end{bmatrix}.$$
We can partition this transition matrix (along its columns) into weakly symmetric sub-arrays as
$$\left[\begin{array}{cc|c} 1-\varepsilon & 0 & \varepsilon\\ 0 & 1-\varepsilon & \varepsilon\end{array}\right].$$

An intuitive interpretation of why a uniform input achieves the capacity of quasi-symmetric (hence also weakly symmetric and symmetric) channels is that if the channel treats all input symbols equally, then all input symbols should be used equally often.

Exercise: Show that the capacity of a quasi-symmetric channel with weakly symmetric sub-arrays $Q_1, Q_2, \ldots, Q_n$ of sizes $|\mathcal{X}|\times|\mathcal{Y}_1|, |\mathcal{X}|\times|\mathcal{Y}_2|, \ldots, |\mathcal{X}|\times|\mathcal{Y}_n|$, respectively, is given by
$$C = \sum_{i=1}^{n} a_i C_i,$$
where $a_i$ equals the sum of any row of $Q_i$, and $C_i = \log|\mathcal{Y}_i| - H(\text{normalized row distribution of } Q_i)$.
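As an illustration, and assuming the formula stated in the exercise, the sketch below applies it to the binary erasure channel of this subsection, whose transition matrix splits into a $2\times 2$ sub-array with row sum $a_1 = 1-\varepsilon$ and a $2\times 1$ sub-array with row sum $a_2 = \varepsilon$; it recovers $C = (1-\varepsilon)\log 2$.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))        # in nats

eps = 0.25

# Sub-array Q1: the two non-erasure columns; sub-array Q2: the erasure column.
Q1 = np.array([[1 - eps, 0.0], [0.0, 1 - eps]])
Q2 = np.array([[eps], [eps]])

C = 0.0
for Q in (Q1, Q2):
    a = Q[0].sum()                                     # row sum a_i (same for every row)
    if a > 0:
        C_i = np.log(Q.shape[1]) - entropy(Q[0] / a)   # log|Y_i| - H(normalized row)
        C += a * C_i

print(C, (1 - eps) * np.log(2))          # both equal (1 - eps)*log(2) nats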

4.4.4 Binary erasure channels

The binary erasure channel (BEC) has a form similar to the BSC, except that bits are erased with some probability. It is shown in Figure 4.4.

We calculate the capacity of the binary erasure channel as follows:
$$\begin{aligned}
C &= \max_{P_X} I(X;Y)\\
&= \max_{P_X}\,[H(Y) - H(Y|X)]\\
&= \max_{P_X}\,[H(Y)] - H_b(\varepsilon).
\end{aligned}$$


[Figure 4.4: Binary erasure channel. Inputs 0 and 1 are received correctly with probability $1-\varepsilon$ and erased (mapped to $e$) with probability $\varepsilon$.]

Now we note that $H(Y)\le\log(3)$ because the output alphabet $\{0, 1, e\}$ has size 3, with equality achieved by a uniform channel output. But since no input distribution yields a uniform channel output, we cannot take $\log(3)$ as an achievable maximum. A different approach is therefore needed to calculate the channel capacity of the BEC.

Definition 4.11 (mutual information for a specific input symbol) Define the mutual information for a specific input symbol as
$$I(x; Y) \triangleq \sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{P_Y(y)}.$$

From the above definition, the mutual information becomes
$$\begin{aligned}
I(X;Y) &\triangleq \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{X,Y}(x,y)\log\frac{P_{X,Y}(x,y)}{P_X(x)P_Y(y)}\\
&= \sum_{x\in\mathcal{X}} P_X(x)\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log\frac{P_{Y|X}(y|x)}{P_Y(y)}\\
&= \sum_{x\in\mathcal{X}} P_X(x)\,I(x; Y).
\end{aligned}$$

$I(x; Y)$ can be re-written as
$$I(x; Y) = \log\bigl(1/P_X(x)\bigr) - H(x|Y) = I(x) - H(x|Y),$$
where $I(x)$ is the self-information of $x$ (cf. Subsection 2.1.1), and
$$H(x|Y) = -\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log P_{X|Y}(x|y).$$


So $I(x; Y)$ can be interpreted as the self-information of $X = x$ minus the uncertainty about $X = x$ that remains after observing $Y$; namely, $I(x; Y)$ is the "mutual information" between the input symbol $x$ and the receiver output $Y$ given that $x$ is transmitted.

Observation 4.12 An input distribution $P_X$ achieves the channel capacity $C$ if, and only if,
$$I(x; Y) \begin{cases} = C, & \text{for } P_X(x) > 0;\\ \le C, & \text{for } P_X(x) = 0.\end{cases}$$

Proof: The if part holds straightforwardly; hence, we only prove the only-if part.

Without loss of generality, we assume that $P_X(x) < 1$ for all $x\in\mathcal{X}$, since $P_X(x) = 1$ for some $x$ implies $I(X; Y) = 0$.

The problem of calculating the channel capacity is to maximize
$$I(X;Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)Q_{Y|X}(y|x)\log\frac{Q_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')Q_{Y|X}(y|x')}, \qquad (4.4.2)$$
subject to the condition
$$\sum_{x\in\mathcal{X}} P_X(x) = 1 \qquad (4.4.3)$$
for a given $Q_{Y|X}$. By the Lagrange multiplier argument, maximizing (4.4.2) subject to (4.4.3) is equivalent to maximizing
$$f(P_X) \triangleq \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)Q_{Y|X}(y|x)\log\frac{Q_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x')Q_{Y|X}(y|x')} + \lambda\left(\sum_{x\in\mathcal{X}} P_X(x) - 1\right).$$

We then take the derivative of the above quantity with respect to $P_X(x'')$ and obtain (the detailed differentiation is given after the proof)
$$\frac{\partial f(P_X)}{\partial P_X(x'')} = I(x''; Y) - 1 + \lambda.$$

From Property 2 of Lemma 2.38, $I(X;Y) = I(P_X, Q_{Y|X})$ is a concave function in $P_X$. Therefore, the maximum occurs at zero derivative whenever $P_X(x)$ does not lie on the boundary, namely when $1 > P_X(x) > 0$. For those $P_X(x)$ lying on the boundary, i.e., $P_X(x) = 0$, the maximum occurs if, and only if, any displacement from the boundary toward the interior decreases the quantity, which implies a non-positive derivative, namely
$$I(x; Y) \le -\lambda + 1, \quad\text{for those } x \text{ with } P_X(x) = 0.$$

To summarize, if an input distribution $P_X$ achieves the channel capacity, then for some $\lambda$,
$$I(x''; Y) \begin{cases} = -\lambda + 1, & \text{for } P_X(x'') > 0;\\ \le -\lambda + 1, & \text{for } P_X(x'') = 0\end{cases}$$
(with the above result, $C = -\lambda + 1$ holds trivially; why?), which completes the proof of the only-if part. 2

The detail of taking the derivative in the above proof is as follows:
$$\begin{aligned}
\frac{\partial f(P_X)}{\partial P_X(x'')} = {}&\frac{\partial}{\partial P_X(x'')}\Biggl[\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)Q_{Y|X}(y|x)\log Q_{Y|X}(y|x)\\
&\qquad - \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)Q_{Y|X}(y|x)\log\Bigl[\sum_{x'\in\mathcal{X}} P_X(x')Q_{Y|X}(y|x')\Bigr] + \lambda\Bigl(\sum_{x\in\mathcal{X}} P_X(x) - 1\Bigr)\Biggr]\\
= {}&\sum_{y\in\mathcal{Y}} Q_{Y|X}(y|x'')\log Q_{Y|X}(y|x'') - \sum_{y\in\mathcal{Y}} Q_{Y|X}(y|x'')\log\Bigl[\sum_{x'\in\mathcal{X}} P_X(x')Q_{Y|X}(y|x')\Bigr]\\
&\qquad - \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)Q_{Y|X}(y|x)\frac{Q_{Y|X}(y|x'')}{\sum_{x'\in\mathcal{X}} P_X(x')Q_{Y|X}(y|x')} + \lambda\\
= {}&I(x''; Y) - \sum_{y\in\mathcal{Y}}\Bigl[\sum_{x\in\mathcal{X}} P_X(x)Q_{Y|X}(y|x)\Bigr]\frac{Q_{Y|X}(y|x'')}{\sum_{x'\in\mathcal{X}} P_X(x')Q_{Y|X}(y|x')} + \lambda\\
= {}&I(x''; Y) - \sum_{y\in\mathcal{Y}} Q_{Y|X}(y|x'') + \lambda\\
= {}&I(x''; Y) - 1 + \lambda.
\end{aligned}$$

From this observation, the capacity of the BEC must satisfy one of the following three cases:
$$C = I(0; Y) = I(1; Y) \quad\text{for } P_X(0) > 0 \text{ and } P_X(1) > 0 \qquad (4.4.4)$$
or
$$C = I(0; Y) \ge I(1; Y) \quad\text{for } P_X(0) = 1 \text{ and } P_X(1) = 0 \qquad (4.4.5)$$
or
$$C = I(1; Y) \ge I(0; Y) \quad\text{for } P_X(0) = 0 \text{ and } P_X(1) = 1. \qquad (4.4.6)$$

Since (4.4.5) and (4.4.6) only yield an uninteresting zero capacity, it remains to verify whether or not (4.4.4) can give a positive capacity.

By expanding (4.4.4), we obtain
$$\begin{aligned}
C = I(0; Y) &= -H_b(\varepsilon) - (1-\varepsilon)\cdot\log P_Y(0) - \varepsilon\cdot\log P_Y(e)\\
\phantom{C} = I(1; Y) &= -H_b(\varepsilon) - (1-\varepsilon)\cdot\log P_Y(1) - \varepsilon\cdot\log P_Y(e),
\end{aligned}$$
which implies $P_Y(0) = P_Y(1)$. Since $P_Y(e)$ always equals $\varepsilon$, the equality of $P_Y(0)$ and $P_Y(1)$ immediately gives $P_Y(0) = P_Y(1) = (1-\varepsilon)/2$, so the uniform input distribution maximizes the channel mutual information. Finally, we obtain that the channel capacity of the BEC equals
$$\begin{aligned}
C &= -H_b(\varepsilon) - (1-\varepsilon)\cdot\log\frac{1-\varepsilon}{2} - \varepsilon\cdot\log(\varepsilon)\\
&= \log(2)\,(1-\varepsilon) \text{ nats/channel usage}\\
&= 1-\varepsilon \text{ bits/channel usage}.
\end{aligned}$$
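The computation can be double-checked numerically against Observation 4.12: under the uniform input, $I(0;Y)$ and $I(1;Y)$ should coincide and equal $(1-\varepsilon)\log 2$. The sketch below is illustrative only.

import numpy as np

eps = 0.2
# BEC transition probabilities Q[x][y], with outputs '0', '1', 'e'.
Q = {0: {'0': 1 - eps, '1': 0.0, 'e': eps},
     1: {'0': 0.0, '1': 1 - eps, 'e': eps}}

px = {0: 0.5, 1: 0.5}                                   # uniform input
py = {y: sum(px[x] * Q[x][y] for x in px) for y in ('0', '1', 'e')}

def I_specific(x):
    # I(x;Y) = sum_y Q(y|x) log( Q(y|x) / P_Y(y) )
    return sum(Q[x][y] * np.log(Q[x][y] / py[y]) for y in py if Q[x][y] > 0)

print(I_specific(0), I_specific(1), (1 - eps) * np.log(2))   # all three are equal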


Chapter 5

Lossy Data Compression

5.1 Fundamental concepts of lossy data compression

5.1.1 Motivations

In practice, one sometimes needs to compress a source at a rate less than its entropy, which is known to be the minimum code rate for lossless data compression. In such a case, some sort of data loss is inevitable. The resultant codes are usually referred to as lossy data compression codes.

Some examples requiring lossy data compression are given below.

Example 5.1 (digitization or quantization of continuous signals) The information content of continuous signals, such as voice or multi-dimensional images, is usually infinite. It may require an infinite number of bits to digitize such a source without data loss, which is not feasible. Therefore, a lossy data compression code must be used to reduce the output of a continuous source to a finite number of bits.

Example 5.2 (constraint on channel capacity) Transmitting a source through a channel whose capacity is less than the source entropy is challenging. As stated in the channel coding theorem, transmission at a rate above the channel capacity always introduces some error. Recall that Fano's inequality only provides a lower bound on the decoding error and does not tell us how large the error actually is. Hence, the error could grow beyond control when one tries to convey a source that generates information at a rate above the channel capacity.

In order to have good control over the transmission error, an alternative approach is to first reduce the data rate with a manageable distortion, and then transmit the compressed data at a rate less than the channel capacity. With this approach, the error is only introduced at the (lossy) data compression step


since the error due to transmission over the channel can be made arbitrarily small (cf. Figure 5.1).

[Figure 5.1: Example of an application of lossy data compression codes. Top: a source with $H(X) > C$ sent directly over a channel with capacity $C$ yields an output with unmanageable error. Bottom: the source is first passed through a lossy data compressor introducing error $E$ and reducing the rate to $R_r < C$; the channel output then has manageable error $E$.]

Example 5.3 (extracting useful information) In some situations, part of the information is not useful for the operational objective. A quick example is the hypothesis testing problem, in which the system designer cares only about the likelihood ratio of the null hypothesis distribution against the alternative hypothesis distribution. Therefore, any two distinct source letters which produce the same likelihood ratio should not be encoded into different codewords. The resultant code is usually a lossy data compression code, since reconstruction of the source from such a code is usually impossible.

5.1.2 Distortion measures

A source is modelled as a random process $Z_1, Z_2, \ldots, Z_n$. For simplicity, we assume that the source discussed in this section is memoryless and has a finite generic alphabet. Our objective is to compress the source at a rate less than its entropy under a pre-specified criterion. In general, the criterion is given by a distortion measure as defined below.

Definition 5.4 (distortion measure) A distortion measure is a mapping
$$\rho : \mathcal{Z}\times\hat{\mathcal{Z}} \to \Re^+,$$
where $\mathcal{Z}$ is the source alphabet, $\hat{\mathcal{Z}}$ is the reproduction alphabet for the compressed code, and $\Re^+$ is the set of non-negative real numbers.

From the above definition, the distortion measure $\rho(z,\hat{z})$ can be viewed as the cost of representing the source symbol $z$ by a reproduction symbol $\hat{z}$. One then expects to choose a certain number of (typical) reproduction letters in $\hat{\mathcal{Z}}$ that represent the source letters at least cost.

When $\hat{\mathcal{Z}} = \mathcal{Z}$, the selection of typical reproduction letters amounts to dividing the source letters into several groups and choosing one element of each group to represent the remaining members of that group. For example, suppose that $\mathcal{Z} = \hat{\mathcal{Z}} = \{1, 2, 3, 4\}$. Due to some constraints, we need to reduce the number of outcomes to 2, and the resultant expected cost cannot be larger than 0.5. (Note that the constraints for a lossy data compression code are usually specified on the resultant entropy and the expected distortion; here we adopt the number-of-outcomes constraint instead of an entropy constraint simply because it is easier to understand, especially for readers unfamiliar with this subject.) The source is uniformly distributed. Given a distortion measure in matrix form
$$[\rho(i,j)] \triangleq \begin{bmatrix} 0 & 1 & 2 & 2\\ 1 & 0 & 2 & 2\\ 2 & 2 & 0 & 1\\ 2 & 2 & 1 & 0\end{bmatrix},$$
the two groups that cost least are $\{1, 2\}$ and $\{3, 4\}$. We may choose 1 and 3, respectively, as the typical elements for these two groups (cf. Figure 5.2). The expected cost of this selection is
$$\frac{1}{4}\rho(1,1) + \frac{1}{4}\rho(2,1) + \frac{1}{4}\rho(3,3) + \frac{1}{4}\rho(4,3) = \frac{1}{2}.$$

Note that the entropy of the source is reduced from 2 bits to 1 bit.

[Figure 5.2: "Grouping" as one kind of lossy data compression. Source letters 1 and 2 form one group with representative 1; source letters 3 and 4 form another group with representative 3.]
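The expected-cost computation of this grouping example is easy to reproduce; the short sketch below is purely illustrative.

import numpy as np

# Distortion matrix rho[z-1][zhat-1] from the example; source uniform over {1,2,3,4}.
rho = np.array([[0, 1, 2, 2],
                [1, 0, 2, 2],
                [2, 2, 0, 1],
                [2, 2, 1, 0]], dtype=float)

pz = np.ones(4) / 4
h = {1: 1, 2: 1, 3: 3, 4: 3}              # grouping: {1,2} -> 1, {3,4} -> 3

expected_cost = sum(pz[z - 1] * rho[z - 1][h[z] - 1] for z in h)
print(expected_cost)                       # 0.5

# Entropy of the reproduction (in bits): the two representatives are equally likely.
p_rep = np.array([0.5, 0.5])
print(-(p_rep * np.log2(p_rep)).sum())     # 1.0 bit, down from 2 bits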

Sometimes, it is convenient to have $|\hat{\mathcal{Z}}| = |\mathcal{Z}| + 1$. For example, take
$$\mathcal{Z} = \{1, 2, 3\} \;(|\mathcal{Z}| = 3), \qquad \hat{\mathcal{Z}} = \{1, 2, 3, e\} \;(|\hat{\mathcal{Z}}| = 4),$$
and define the distortion measure by
$$[\rho(i,j)] \triangleq \begin{bmatrix} 0 & 2 & 2 & 0.5\\ 2 & 0 & 2 & 0.5\\ 2 & 2 & 0 & 0.5\end{bmatrix}.$$

The source is again uniformly distributed. In this example, representing a source letter by a distinct letter in $\{1, 2, 3\}$ other than itself costs four times as much as representing it by $e$. Therefore, if only 2 outcomes are allowed and the expected distortion cannot be greater than $1/3$, then employing the typical elements 1 and $e$ to represent source letter 1 and source letters $\{2, 3\}$, respectively, is an optimal choice. The resultant entropy is reduced from $\log(3)$ nats to $[\log(3) - (2/3)\log(2)]$ nats.

It should be pointed out that having $|\hat{\mathcal{Z}}| > |\mathcal{Z}| + 1$ is usually not advantageous. Indeed, it has been proved that under some reasonable assumptions on the distortion measure, a reproduction alphabet larger than $|\mathcal{Z}| + 1$ does not perform better.

5.1.3 Examples of some frequently used distortion measures

Example 5.5 (Hamming distortion measure) Let the source and reproduction alphabets be the same, i.e., $\mathcal{Z} = \hat{\mathcal{Z}}$. Then the Hamming distortion measure is given by
$$\rho(z,\hat{z}) \triangleq \begin{cases} 0, & \text{if } z = \hat{z};\\ 1, & \text{if } z \ne \hat{z}.\end{cases}$$
It is also named the probability-of-error distortion measure because
$$E[\rho(Z,\hat{Z})] = \Pr(Z \ne \hat{Z}).$$

Example 5.6 (squared error distortion) Let the source and reproduction alphabets be the same, i.e., $\mathcal{Z} = \hat{\mathcal{Z}}$. Then the squared error distortion is given by
$$\rho(z,\hat{z}) \triangleq (z - \hat{z})^2.$$
The squared error distortion measure is perhaps the most popular distortion measure used for continuous alphabets.

The squared error distortion has the advantages of simplicity and of admitting closed-form solutions for most cases of interest, such as least squares prediction. Yet this distortion measure has been criticized as a poor match for human perception. For example, two speech waveforms in which one is a slightly time-shifted version of the other may have a large squared error distortion, yet they sound very similar to the human ear.

The above definitions are single-letter distortion measures, since they consider only one random variable $Z$ drawing a single letter. For sources modelled as a sequence of random variables $Z_1, \ldots, Z_n$, some extension needs to be made. A straightforward extension is the additive distortion measure.

Definition 5.7 (additive distortion measure) The additive distortion measure $\rho_n$ between sequences $z^n$ and $\hat{z}^n$ is defined by
$$\rho_n(z^n,\hat{z}^n) = \sum_{i=1}^{n}\rho(z_i,\hat{z}_i).$$

Another example that is also based on a per-symbol distortion is the maximum distortion measure:

Definition 5.8 (maximum distortion measure)
$$\rho_n(z^n,\hat{z}^n) = \max_{1\le i\le n}\rho(z_i,\hat{z}_i).$$

After defining distortion measures for source sequences, a natural question to ask is whether reproducing a source sequence $z^n$ by a sequence $\hat{z}^n$ of the same length is a must. To be more precise, can we use $\hat{z}^k$ to represent $z^n$ for $k\ne n$? The answer is certainly yes, provided a distortion measure between $z^n$ and $\hat{z}^k$ is defined. A quick example is a source that is a ternary sequence of length $n$, while the (fixed-length) data compression result is a set of binary indexes of length $k$, taken as small as possible subject to some given constraints. Hence, $k$ is not necessarily equal to $n$. One of the problems with taking $k\ne n$ is that the distortion measure for sequences can no longer be defined based on per-letter distortions, and hence a per-letter formula for the best lossy data compression rate may not be obtainable.

In order to alleviate the aforementioned ($k\ne n$) problem, we claim that for most cases of interest it is reasonable to assume $k = n$. This is because one can actually implement the lossy data compression from $\mathcal{Z}^n$ to a binary index of length $k$ in two steps: the first step applies a lossy compression mapping $h_n : \mathcal{Z}^n \to \hat{\mathcal{Z}}^n$, and the second step indexes $h_n(\mathcal{Z}^n)$ into $\{0,1\}^k$. For ease of understanding, these two steps are illustrated below.

Step 1: Find the data compression code
$$h_n : \mathcal{Z}^n \to \hat{\mathcal{Z}}^n$$
for which the pre-specified distortion constraint and rate constraint are both satisfied.

Step 2: Derive the (asymptotically) lossless data compression block code for the source $h_n(Z^n)$. The existence of such a code with block length
$$k > H(h_n(Z^n))$$
is guaranteed by Shannon's source coding theorem.

Through the above two steps, a lossy data compression code from
$$\mathcal{Z}^n \xrightarrow{\text{Step 1}} \hat{\mathcal{Z}}^n \xrightarrow{\text{Step 2}} \{0, 1\}^k$$
is established. Since the second step has already been discussed under lossless data compression, any theorem regarding lossy data compression is basically a theorem about the first step.

5.2 Fixed-length lossy data compression codes

As with the lossless source coding theorem, the objective is to find the theoretical limit of the compression rate for lossy data compression codes. Before introducing the main theorem, we first need to define lossy data compression codes.

Definition 5.9 (fixed-length lossy data compression code subject to an average distortion constraint) An $(n, M, D)$ fixed-length lossy data compression code for source alphabet $\mathcal{Z}^n$ and reproduction alphabet $\hat{\mathcal{Z}}^n$ consists of a compression function
$$h : \mathcal{Z}^n \to \hat{\mathcal{Z}}^n$$
with codebook size (i.e., size of the image $h(\mathcal{Z}^n)$) equal to $|h(\mathcal{Z}^n)| = M$, and with average distortion satisfying
$$E\left[\frac{1}{n}\rho_n(Z^n, h(Z^n))\right] \le D.$$

Since the size of the codebook is $M$, it can be binary-indexed with $\log_2 M$ bits. Therefore, the average rate of such a code is $(1/n)\log_2 M$ bits per source symbol (or $(1/n)\log M$ nats per source symbol).

Note that a parallel definition for variable-length lossy source compression codes can also be given. However, there are no conclusive results to date on the bounds of such code rates, and hence we omit them for the moment; this remains an interesting open research problem.

After defining fixed-length lossy data compression codes, we are ready to define achievable rate-distortion pairs.

Definition 5.10 (achievable rate-distortion pair) For a given sequence of distortion measures $\{\rho_n\}_{n\ge 1}$, a rate-distortion pair $(R, D)$ is achievable if there exists a sequence of fixed-length lossy data compression codes $(n, M_n, D)$ with ultimate code rate $\limsup_{n\to\infty}(1/n)\log M_n \le R$.

With the achievable rate-distortion region in hand, we define the rate-distortion function as follows.

Definition 5.11 (rate-distortion function) The rate-distortion function, denoted by $R(D)$, is given by
$$R(D) \triangleq \inf\{R\in\Re : (R, D) \text{ is an achievable rate-distortion pair}\}.$$

5.3 Rate distortion function for discrete memoryless sources

The main result here is for a discrete memoryless source (DMS) and a bounded additive distortion measure. "Boundedness" of a distortion measure means that
$$\max_{(z,\hat{z})\in\mathcal{Z}\times\hat{\mathcal{Z}}}\rho(z,\hat{z}) < \infty.$$

The basic idea of choosing the data compression codewords for a DMS is to draw the codewords from the distortion typical set. This set is defined in a way similar to the jointly typical set for channels.

Definition 5.12 (distortion typical set) For a memoryless distribution with generic marginal $P_{Z,\hat{Z}}$ and a bounded additive distortion measure $\rho_n(\cdot,\cdot)$, the distortion $\delta$-typical set is defined by
$$\begin{aligned}
\mathcal{D}_n(\delta) \triangleq \Bigl\{(z^n,\hat{z}^n)\in\mathcal{Z}^n\times\hat{\mathcal{Z}}^n :\;
&\Bigl|-\tfrac{1}{n}\log P_{Z^n}(z^n) - H(Z)\Bigr| < \delta,\\
&\Bigl|-\tfrac{1}{n}\log P_{\hat{Z}^n}(\hat{z}^n) - H(\hat{Z})\Bigr| < \delta,\\
&\Bigl|-\tfrac{1}{n}\log P_{Z^n,\hat{Z}^n}(z^n,\hat{z}^n) - H(Z,\hat{Z})\Bigr| < \delta,\;\text{and}\\
&\Bigl|\tfrac{1}{n}\rho_n(z^n,\hat{z}^n) - E[\rho(Z,\hat{Z})]\Bigr| < \delta\Bigr\}.
\end{aligned}$$

Note that this is the definition of the jointly typical set with an additional constraint that the distortion be close to its expected value. Since the additive distortion measure between two jointly i.i.d. random sequences, i.e.,
$$\rho_n(Z^n,\hat{Z}^n) = \sum_{i=1}^{n}\rho(Z_i,\hat{Z}_i),$$
is a sum of i.i.d. random variables, the (weak) law of large numbers holds, and an AEP-like theorem can therefore be derived for the distortion typical set.

Theorem 5.13 If $(Z_1,\hat{Z}_1), (Z_2,\hat{Z}_2), \ldots, (Z_n,\hat{Z}_n), \ldots$ are i.i.d., and $\rho_n$ is a bounded additive distortion measure, then
$$-\frac{1}{n}\log P_{Z^n}(Z_1, \ldots, Z_n) \to H(Z) \text{ in probability};$$
$$-\frac{1}{n}\log P_{\hat{Z}^n}(\hat{Z}_1, \ldots, \hat{Z}_n) \to H(\hat{Z}) \text{ in probability};$$
$$-\frac{1}{n}\log P_{Z^n,\hat{Z}^n}((Z_1,\hat{Z}_1), \ldots, (Z_n,\hat{Z}_n)) \to H(Z,\hat{Z}) \text{ in probability};$$
$$\text{and}\quad \frac{1}{n}\rho_n(Z^n,\hat{Z}^n) \to E[\rho(Z,\hat{Z})] \text{ in probability}.$$

Proof: Functions of independent random variables are also independent random variables. Thus, by the weak law of large numbers, we have the desired result. 2

It should be pointed out that without the boundedness assumption, the normalized sum of an i.i.d. sequence does not necessarily converge in probability to a finite mean. That is why the additional "boundedness" condition on the distortion measure is imposed; it guarantees the required convergence.


Theorem 5.14 (AEP for distortion measure) Given a discrete memoryless source $Z$, a single-letter conditional distribution $P_{\hat{Z}|Z}$, and any $\delta > 0$, the distortion $\delta$-typical set satisfies

1. $P_{Z^n,\hat{Z}^n}(\mathcal{D}_n^c(\delta)) < \delta$ for $n$ sufficiently large;

2. for all $(z^n,\hat{z}^n)$ in $\mathcal{D}_n(\delta)$,
$$P_{\hat{Z}^n}(\hat{z}^n) \ge P_{\hat{Z}^n|Z^n}(\hat{z}^n|z^n)\,e^{-n[I(Z;\hat{Z})+3\delta]}. \qquad (5.3.1)$$

Proof: The first statement follows from the definition. The second statement can be proved by
$$\begin{aligned}
P_{\hat{Z}^n|Z^n}(\hat{z}^n|z^n) &= \frac{P_{Z^n,\hat{Z}^n}(z^n,\hat{z}^n)}{P_{Z^n}(z^n)}
= P_{\hat{Z}^n}(\hat{z}^n)\frac{P_{Z^n,\hat{Z}^n}(z^n,\hat{z}^n)}{P_{Z^n}(z^n)P_{\hat{Z}^n}(\hat{z}^n)}\\
&\le P_{\hat{Z}^n}(\hat{z}^n)\frac{e^{-n[H(Z,\hat{Z})-\delta]}}{e^{-n[H(Z)+\delta]}\,e^{-n[H(\hat{Z})+\delta]}}
= P_{\hat{Z}^n}(\hat{z}^n)\,e^{n[I(Z;\hat{Z})+3\delta]}.
\end{aligned}$$

2

Before we go further toward the lossy data compression theorem, we also need the following inequality.

Lemma 5.15 For $0\le x\le 1$, $0\le y\le 1$, and $n > 0$,
$$(1 - xy)^n \le 1 - x + e^{-yn}, \qquad (5.3.2)$$
with equality if, and only if, $(x, y) = (1, 0)$.

Proof: Let $g_y(t) \triangleq (1 - yt)^n$. By taking the second derivative of $g_y(t)$ with respect to $t$, one can show that this function is convex for $t\in[0,1]$. Hence, for any $x\in[0,1]$,
$$\begin{aligned}
(1 - xy)^n = g_y\bigl((1-x)\cdot 0 + x\cdot 1\bigr) &\le (1-x)\cdot g_y(0) + x\cdot g_y(1)\\
&\qquad\text{with equality iff } (x=0)\vee(x=1)\vee(y=0)\\
&= (1-x) + x\cdot(1-y)^n\\
&\le (1-x) + x\cdot(e^{-y})^n\\
&\qquad\text{with equality iff } (x=0)\vee(y=0)\\
&\le (1-x) + e^{-ny}\\
&\qquad\text{with equality iff } x=1.
\end{aligned}$$
From the above derivation, equality holds in (5.3.2) if, and only if,
$$[(x=0)\vee(x=1)\vee(y=0)] \wedge [(x=0)\vee(y=0)] \wedge [x=1] = (x=1,\, y=0).$$
(Note that $(x=0)$ represents the set $\{(x,y)\in\Re^2 : x=0 \text{ and } y\in[0,1]\}$; similar definitions apply to the other sets.) 2
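Although the lemma is proved analytically above, inequality (5.3.2) is also easy to check numerically on a grid; the sketch below is illustrative only.

import numpy as np

n = 50
xs = np.linspace(0.0, 1.0, 101)
ys = np.linspace(0.0, 1.0, 101)
X, Y = np.meshgrid(xs, ys)

lhs = (1.0 - X * Y) ** n
rhs = 1.0 - X + np.exp(-Y * n)
print(np.all(lhs <= rhs + 1e-12))          # True: (1-xy)^n <= 1 - x + e^{-yn}

# The gap closes (equality) only near (x, y) = (1, 0).
gap = rhs - lhs
i, j = np.unravel_index(np.argmin(gap), gap.shape)
print(X[i, j], Y[i, j], gap[i, j])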

Theorem 5.16 (rate distortion theorem) For a DMS and a bounded additive distortion measure (namely,
$$\rho_{\max} \triangleq \max_{(z,\hat{z})\in\mathcal{Z}\times\hat{\mathcal{Z}}}\rho(z,\hat{z}) < \infty \quad\text{and}\quad \rho_n(z^n,\hat{z}^n) = \sum_{i=1}^{n}\rho(z_i,\hat{z}_i)\text{),}$$
the rate-distortion function is
$$R(D) = \min_{P_{\hat{Z}|Z}\,:\,E[\rho(Z,\hat{Z})]\le D} I(Z;\hat{Z}).$$

Proof: Denote $f(D) \triangleq \min_{P_{\hat{Z}|Z}:\,E[\rho(Z,\hat{Z})]\le D} I(Z;\hat{Z})$. We shall show that $R(D)$ as defined in Definition 5.11 equals $f(D)$.

1. Achievability (i.e., $R(D+\varepsilon)\le f(D)+4\varepsilon$ for arbitrarily small $\varepsilon > 0$): We need to show that for any $\varepsilon > 0$, there exist $0 < \gamma < 4\varepsilon$ and a sequence of lossy data compression codes $\{(n, M_n, D+\varepsilon)\}_{n=1}^{\infty}$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log M_n \le f(D) + \gamma.$$

The proof is as follows.

Step 1: Optimizer. Let $P_{\hat{Z}|Z}$ be the optimizer of $f(D)$, i.e.,
$$f(D) = \min_{P_{\hat{Z}|Z}:\,E[\rho(Z,\hat{Z})]\le D} I(Z;\hat{Z}) = I(Z;\hat{Z}).$$
Then
$$E[\rho(Z,\hat{Z})] \le D \quad\left(\text{and also }\frac{1}{n}E[\rho_n(Z^n,\hat{Z}^n)]\le D\right).$$
Choose $M_n$ to satisfy
$$f(D) + \frac{1}{2}\gamma \le \frac{1}{n}\log M_n \le f(D) + \gamma$$
for some $\gamma\in(0, 4\varepsilon)$; such a choice exists for all sufficiently large $n > N_0$ for some $N_0$. Define
$$\delta \triangleq \min\left\{\frac{\gamma}{8},\; \frac{\varepsilon}{1 + 2\rho_{\max}}\right\}.$$


Step 2: Random coding. Independently select $M_n$ codewords from $\hat{\mathcal{Z}}^n$ according to
$$P_{\hat{Z}^n}(\hat{z}^n) = \prod_{i=1}^{n} P_{\hat{Z}}(\hat{z}_i),$$
and denote this random codebook by $\mathcal{C}_n$, where
$$P_{\hat{Z}}(\hat{z}) = \sum_{z\in\mathcal{Z}} P_Z(z)P_{\hat{Z}|Z}(\hat{z}|z).$$

Step 3: Encoding rule. Define a subset of $\mathcal{Z}^n$ as
$$\mathcal{J}(\mathcal{C}_n) \triangleq \{z^n\in\mathcal{Z}^n : \exists\,\hat{z}^n\in\mathcal{C}_n \text{ such that } (z^n,\hat{z}^n)\in\mathcal{D}_n(\delta)\},$$
where $\mathcal{D}_n(\delta)$ is defined under $P_{\hat{Z}|Z}$. Based on the codebook
$$\mathcal{C}_n = \{c_1, c_2, \ldots, c_{M_n}\},$$
define the encoding rule as
$$h_n(z^n) = \begin{cases} c_m, & \text{if } (z^n, c_m)\in\mathcal{D}_n(\delta) \text{ (when more than one codeword satisfies the requirement, pick any one);}\\ 0, & \text{otherwise.}\end{cases}$$
Note that when $z^n\in\mathcal{J}(\mathcal{C}_n)$, we have $(z^n, h_n(z^n))\in\mathcal{D}_n(\delta)$ and
$$\frac{1}{n}\rho_n(z^n, h_n(z^n)) \le E[\rho(Z,\hat{Z})] + \delta \le D + \delta.$$

Step 4: Probability of falling outside $\mathcal{J}(\mathcal{C}_n)$. Let $N_1$ be such that for $n > N_1$,
$$P_{Z^n,\hat{Z}^n}(\mathcal{D}_n^c(\delta)) < \delta.$$
Let
$$\Omega \triangleq P_{Z^n}(\mathcal{J}^c(\mathcal{C}_n)).$$
Then by the random coding argument,
$$E[\Omega] = \sum_{\mathcal{C}_n} P_{\hat{Z}^n}(\mathcal{C}_n)\left[\sum_{z^n\notin\mathcal{J}(\mathcal{C}_n)} P_{Z^n}(z^n)\right]
= \sum_{z^n\in\mathcal{Z}^n} P_{Z^n}(z^n)\left[\sum_{\mathcal{C}_n:\,z^n\notin\mathcal{J}(\mathcal{C}_n)} P_{\hat{Z}^n}(\mathcal{C}_n)\right].$$

For any given $z^n$, selecting a codebook $\mathcal{C}_n$ satisfying $z^n\notin\mathcal{J}(\mathcal{C}_n)$ is equivalent to independently drawing $M_n$ $n$-tuples from $\hat{\mathcal{Z}}^n$, none of which is distortion jointly typical with $z^n$. Hence,
$$\sum_{\mathcal{C}_n:\,z^n\notin\mathcal{J}(\mathcal{C}_n)} P_{\hat{Z}^n}(\mathcal{C}_n) = \left(\Pr\left[(z^n,\hat{Z}^n)\notin\mathcal{D}_n(\delta)\right]\right)^{M_n}.$$
For convenience, let $K(z^n,\hat{z}^n)$ be the indicator function of $\mathcal{D}_n(\delta)$, i.e.,
$$K(z^n,\hat{z}^n) = \begin{cases} 1, & \text{if } (z^n,\hat{z}^n)\in\mathcal{D}_n(\delta);\\ 0, & \text{otherwise.}\end{cases}$$
Then
$$\sum_{\mathcal{C}_n:\,z^n\notin\mathcal{J}(\mathcal{C}_n)} P_{\hat{Z}^n}(\mathcal{C}_n) = \left(1 - \sum_{\hat{z}^n\in\hat{\mathcal{Z}}^n} P_{\hat{Z}^n}(\hat{z}^n)K(z^n,\hat{z}^n)\right)^{M_n}.$$


Continuing the computation of $E[\Omega]$, we get
$$\begin{aligned}
E[\Omega] &= \sum_{z^n\in\mathcal{Z}^n} P_{Z^n}(z^n)\left[1 - \sum_{\hat{z}^n\in\hat{\mathcal{Z}}^n} P_{\hat{Z}^n}(\hat{z}^n)K(z^n,\hat{z}^n)\right]^{M_n}\\
&\le \sum_{z^n\in\mathcal{Z}^n} P_{Z^n}(z^n)\left[1 - \sum_{\hat{z}^n\in\hat{\mathcal{Z}}^n} P_{\hat{Z}^n|Z^n}(\hat{z}^n|z^n)e^{-n(I(Z;\hat{Z})+3\delta)}K(z^n,\hat{z}^n)\right]^{M_n} \quad\text{(by (5.3.1))}\\
&= \sum_{z^n\in\mathcal{Z}^n} P_{Z^n}(z^n)\left[1 - e^{-n(I(Z;\hat{Z})+3\delta)}\sum_{\hat{z}^n\in\hat{\mathcal{Z}}^n} P_{\hat{Z}^n|Z^n}(\hat{z}^n|z^n)K(z^n,\hat{z}^n)\right]^{M_n}\\
&\le \sum_{z^n\in\mathcal{Z}^n} P_{Z^n}(z^n)\left(1 - \sum_{\hat{z}^n\in\hat{\mathcal{Z}}^n} P_{\hat{Z}^n|Z^n}(\hat{z}^n|z^n)K(z^n,\hat{z}^n) + \exp\left\{-M_n e^{-n(I(Z;\hat{Z})+3\delta)}\right\}\right) \quad\text{(from (5.3.2))}\\
&\le \sum_{z^n\in\mathcal{Z}^n} P_{Z^n}(z^n)\left(1 - \sum_{\hat{z}^n\in\hat{\mathcal{Z}}^n} P_{\hat{Z}^n|Z^n}(\hat{z}^n|z^n)K(z^n,\hat{z}^n) + \exp\left\{-e^{n(f(D)+\gamma/2)}e^{-n(I(Z;\hat{Z})+3\delta)}\right\}\right)\\
&\qquad\text{(since } f(D)+\gamma/2 \le (1/n)\log M_n)\\
&\le 1 - P_{Z^n,\hat{Z}^n}(\mathcal{D}_n(\delta)) + \exp\{-e^{n\delta}\} \quad\text{(since } f(D) = I(Z;\hat{Z}) \text{ and } \delta\le\gamma/8)\\
&= P_{Z^n,\hat{Z}^n}(\mathcal{D}_n^c(\delta)) + \exp\{-e^{n\delta}\}\\
&\le \delta + \delta = 2\delta, \quad\text{for } n > N \triangleq \max\left\{N_0,\, N_1,\, \frac{1}{\delta}\log\log\left(\frac{1}{\min\{\delta, 1\}}\right)\right\}.
\end{aligned}$$
Since $E[\Omega] = E[P_{Z^n}(\mathcal{J}^c(\mathcal{C}_n))] \le 2\delta$, there must exist a codebook $\mathcal{C}_n^*$ such that $P_{Z^n}(\mathcal{J}^c(\mathcal{C}_n^*))$ is no greater than $2\delta$.

Step 5: Calculation of distortion. For the optimal codebook $\mathcal{C}_n^*$ (from the previous step) and $n > N$, the distortion is
$$\begin{aligned}
\frac{1}{n}E[\rho_n(Z^n, h_n(Z^n))] &= \sum_{z^n\in\mathcal{J}(\mathcal{C}_n^*)} P_{Z^n}(z^n)\frac{1}{n}\rho_n(z^n, h_n(z^n)) + \sum_{z^n\notin\mathcal{J}(\mathcal{C}_n^*)} P_{Z^n}(z^n)\frac{1}{n}\rho_n(z^n, h_n(z^n))\\
&\le \sum_{z^n\in\mathcal{J}(\mathcal{C}_n^*)} P_{Z^n}(z^n)(D+\delta) + \sum_{z^n\notin\mathcal{J}(\mathcal{C}_n^*)} P_{Z^n}(z^n)\rho_{\max}\\
&\le (D+\delta) + 2\delta\cdot\rho_{\max}\\
&\le D + \delta(1 + 2\rho_{\max})\\
&\le D + \varepsilon.
\end{aligned}$$

2. Converse part (i.e., $R(D+\varepsilon)\ge f(D)$ for arbitrarily small $\varepsilon > 0$ and any $D\in\{D\ge 0 : f(D) > 0\}$): We need to show that for any sequence of $\{(n, M_n, D_n)\}_{n=1}^{\infty}$ codes with
$$\limsup_{n\to\infty}\frac{1}{n}\log M_n < f(D),$$
there exists $\varepsilon > 0$ such that
$$D_n = \frac{1}{n}E[\rho_n(Z^n, h_n(Z^n))] > D + \varepsilon$$
for $n$ sufficiently large.

The proof is as follows.

Step 1: Convexity of mutual information. By the convexity of the mutual information $I(Z;\hat{Z})$ with respect to $P_{\hat{Z}|Z}$,
$$I(Z;\hat{Z}_\lambda) \le \lambda\cdot I(Z;\hat{Z}_1) + (1-\lambda)\cdot I(Z;\hat{Z}_2),$$
where $\lambda\in[0,1]$ and
$$P_{\hat{Z}_\lambda|Z}(\hat{z}|z) \triangleq \lambda P_{\hat{Z}_1|Z}(\hat{z}|z) + (1-\lambda)P_{\hat{Z}_2|Z}(\hat{z}|z).$$

Step 2: Convexity of $f(D)$. Let $P_{\hat{Z}_1|Z}$ and $P_{\hat{Z}_2|Z}$ be two distributions achieving $f(D_1)$ and $f(D_2)$, respectively. Since
$$\begin{aligned}
E[\rho(Z,\hat{Z}_\lambda)] &= \sum_{z\in\mathcal{Z}} P_Z(z)\sum_{\hat{z}\in\hat{\mathcal{Z}}} P_{\hat{Z}_\lambda|Z}(\hat{z}|z)\rho(z,\hat{z})\\
&= \sum_{z\in\mathcal{Z}} P_Z(z)\sum_{\hat{z}\in\hat{\mathcal{Z}}}\left[\lambda P_{\hat{Z}_1|Z}(\hat{z}|z) + (1-\lambda)P_{\hat{Z}_2|Z}(\hat{z}|z)\right]\rho(z,\hat{z})\\
&= \lambda D_1 + (1-\lambda)D_2,
\end{aligned}$$
we have
$$\begin{aligned}
f(\lambda D_1 + (1-\lambda)D_2) &\le I(Z;\hat{Z}_\lambda)\\
&\le \lambda I(Z;\hat{Z}_1) + (1-\lambda)I(Z;\hat{Z}_2)\\
&= \lambda f(D_1) + (1-\lambda)f(D_2).
\end{aligned}$$
Therefore, $f(D)$ is a convex function.

Step 3: Strict decrease and continuity of $f(D)$. By definition, $f(D)$ is non-increasing in $D$. Also, $f(D) = 0$ for
$$D \ge \min_{P_{\hat{Z}}}\sum_{z\in\mathcal{Z}}\sum_{\hat{z}\in\hat{\mathcal{Z}}} P_Z(z)P_{\hat{Z}}(\hat{z})\rho(z,\hat{z})$$
(which is finite by the boundedness of the distortion measure). Together with its convexity, the strict decrease and continuity of $f(D)$ over $\{D\ge 0 : f(D) > 0\}$ follow.


Step 4: Main proof.
$$\begin{aligned}
\log M_n &\ge H(h_n(Z^n))\\
&= H(h_n(Z^n)) - H(h_n(Z^n)|Z^n), \quad\text{since } H(h_n(Z^n)|Z^n) = 0;\\
&= I(Z^n; h_n(Z^n))\\
&= H(Z^n) - H(Z^n|h_n(Z^n))\\
&= \sum_{i=1}^{n} H(Z_i) - \sum_{i=1}^{n} H(Z_i|h_n(Z^n), Z_1, \ldots, Z_{i-1})\\
&\qquad\text{by the independence of the } Z_i\text{'s and the chain rule for conditional entropy;}\\
&\ge \sum_{i=1}^{n} H(Z_i) - \sum_{i=1}^{n} H(Z_i|\hat{Z}_i), \quad\text{where } \hat{Z}_i \text{ is the } i\text{th component of } h_n(Z^n);\\
&= \sum_{i=1}^{n} I(Z_i;\hat{Z}_i)\\
&\ge \sum_{i=1}^{n} f(D_i), \quad\text{where } D_i \triangleq E[\rho(Z_i,\hat{Z}_i)];\\
&= n\sum_{i=1}^{n}\frac{1}{n}f(D_i)\\
&\ge n f\left(\sum_{i=1}^{n}\frac{1}{n}D_i\right), \quad\text{by the convexity of } f(D);\\
&= n f\left(\frac{1}{n}E[\rho_n(Z^n, h_n(Z^n))]\right),
\end{aligned}$$
where the last step follows since the distortion measure is additive. Finally, $\limsup_{n\to\infty}(1/n)\log M_n < f(D)$ implies the existence of $N$ and $\gamma > 0$ such that $(1/n)\log M_n < f(D) - \gamma$ for all $n > N$. Therefore, for $n > N$,
$$f\left(\frac{1}{n}E[\rho_n(Z^n, h_n(Z^n))]\right) < f(D) - \gamma,$$
which, together with the strict decrease of $f(D)$, implies
$$\frac{1}{n}E[\rho_n(Z^n, h_n(Z^n))] > D + \varepsilon$$
for some $\varepsilon = \varepsilon(\gamma) > 0$ and for all $n > N$.

3. Summary: For $D\in\{D\ge 0 : f(D) > 0\}$, the achievability and converse parts jointly imply that $f(D) + 4\varepsilon \ge R(D+\varepsilon) \ge f(D)$ for arbitrarily small $\varepsilon > 0$. Together with the continuity of $f(D)$, we obtain that $R(D) = f(D)$ for $D\in\{D\ge 0 : f(D) > 0\}$.

For $D\in\{D\ge 0 : f(D) = 0\}$, the achievability part gives us $f(D) + 4\varepsilon = 4\varepsilon \ge R(D+\varepsilon) \ge 0$ for arbitrarily small $\varepsilon > 0$. This immediately implies that $R(D) = 0 = f(D)$, as desired. 2

The formula for the rate-distortion function obtained in the previous theorem is also valid for the squared error distortion over the real numbers, even though it is unbounded. Here, the boundedness assumption is imposed just to facilitate the exposition of the current proof. Readers may refer to Volume II of the lecture notes for a more general proof.

The discussion of lossy data compression, especially for continuous sources, will be continued in Section 6.2. Examples of the calculation of rate-distortion functions will also be given in that section.

After introducing Shannon's source coding theorem for block codes, Shannon's channel coding theorem for block codes, and the rate-distortion theorem (for i.i.d. or stationary ergodic settings), we would like to once again make clear the key concepts behind these lengthy proofs, namely the typical set and random coding. The typical-set argument (specifically, the $\delta$-typical set for source coding, the joint $\delta$-typical set for channel coding, and the distortion typical set for rate-distortion) uses the law of large numbers or AEP reasoning to claim the existence of a set with very high probability; hence, the respective information manipulation can focus on that set with negligible performance loss. The random-coding argument shows that the expectation of the desired performance over all possible information manipulation schemes (randomly drawn according to some properly chosen statistics) is already acceptably good, and hence the existence of at least one good scheme that fulfills the desired performance index is validated. As a result, in situations where these two arguments apply, a similar theorem can often be established. The question is: can we extend the theorems to cases where the two arguments fail? Only when a new proof technique (other than these two arguments) is developed can the answer be affirmative. We will further explore this issue in Volume II of the lecture notes.


Chapter 6

Continuous Sources and Channels

We introduced the fundamental theory of discrete sources and channels in the previous chapters. In this chapter, we turn our focus to its extension to continuous sources and channels.

6.1 Information measures for continuous sources and channels

6.1.1 Models of continuous sources and channels

Based on their characteristics, continuous sources can be classified into two categories: sources with discrete time instances and continuous alphabet, and sources with continuous time instances and continuous alphabet. The former are often named discrete-time continuous sources, while the latter are called waveform sources.

A waveform source can be made into a discrete-time continuous source by sampling, and a continuous source can be made into a purely discrete source by quantization. So we can consider the theory of continuous sources, as well as of waveform sources, an extension of that of discrete sources.

As mentioned in Chapter 2, a source can be modelled as a random process $\{X_t, t\in I\}$. This model also applies to continuous sources. Specifically, for a discrete-time continuous source the support of $X_t$ is continuous and the index set $I$ is discrete. When both the support of $X_t$ and the index set $I$ are continuous, $\{X_t, t\in I\}$ becomes a waveform source.


6.1.2 Differential entropy

Recall that the definition of entropy for a discrete source $X$ is
$$H(X) \triangleq \sum_{x\in\mathcal{X}} -P_X(x)\log P_X(x) \text{ nats}.$$
By Shannon's source coding theorem, this quantity is the minimum average codeword length achievable for lossless data compression.

Example 6.1 (extension of entropy to continuous sources) Consider a source $X$ with source alphabet $[0, 1)$ and uniform generic distribution. We can make it discrete by quantizing it into $m$ levels as
$$q_m(X) = \frac{i}{m}, \quad\text{if } \frac{i-1}{m} \le X < \frac{i}{m},$$
for $1\le i\le m$. Then the entropy of the quantized source is
$$H(q_m(X)) = -\sum_{i=1}^{m}\frac{1}{m}\log\left(\frac{1}{m}\right) = \log(m) \text{ nats}.$$
Since the entropy $H(q_m(X))$ of the quantized source is a lower bound to the entropy $H(X)$ of the uniformly distributed continuous source,
$$H(X) \ge \lim_{m\to\infty} H(q_m(X)) = \infty.$$

2

The above example shows that compressing such a continuous source without distortion indeed requires an infinite number of bits. In fact, all continuous sources have infinite entropy (see the footnote below). Thus, when studying continuous sources, the entropy measure is limited in its effectiveness, and the introduction of a new measure is therefore necessary.

Footnote. Proof: For any continuous source $X$, there must exist a non-empty open interval on which the cumulative distribution function $F_X(\cdot)$ is strictly increasing. Now quantize the source into $m+1$ levels as follows:

• Assign one level to the complement of this open interval, and

• assign $m$ levels to this open interval such that the probability mass on this interval, denoted by $a$, is equally distributed over these $m$ levels. (This is similar to making an equal partition of the relevant domain of $F_X^{-1}(\cdot)$.)

Then
$$H(X) \ge H(X_\Delta) = -(1-a)\cdot\log(1-a) - a\cdot\log\frac{a}{m},$$
where $X_\Delta$ represents the quantized version of $X$. The lower bound goes to infinity as $m$ tends to infinity. 2


Definition 6.2 (differential entropy) The differential entropy (in nats) of a continuous source with generic probability density function (pdf) $p_X$ is defined as
$$h(X) \triangleq -\int_{\mathcal{X}} p_X(x)\log p_X(x)\,dx.$$

The next example demonstrates the quantitative difference between the entropy and the differential entropy.

Example 6.3 A continuous source $X$ with source alphabet $[0, 1)$ and pdf $f(x) = 2x$ has differential entropy equal to
$$\int_0^1 -2x\log(2x)\,dx = \left.\frac{x^2\bigl(1 - 2\log(2x)\bigr)}{2}\right|_0^1 = \frac{1}{2} - \log(2) \approx -0.193 \text{ nats}.$$

To derive its entropy (asymptotically), we quantize the source uniformly into $m$ levels, i.e.,
$$q(X) = \frac{i}{m}, \quad\text{if } \frac{i-1}{m} \le X < \frac{i}{m},$$
for $1\le i\le m$. Hence,
$$\Pr\left\{q(X) = \frac{i}{m}\right\} = \frac{2i-1}{m^2},$$
and the entropy of the quantized source is
$$-\sum_{i=1}^{m}\frac{2i-1}{m^2}\log\left(\frac{2i-1}{m^2}\right) = \left[-\frac{1}{m^2}\sum_{i=1}^{m}(2i-1)\log(2i-1) + 2\log(m)\right] \text{ nats},$$
which goes to infinity as $m$ tends to infinity (cf. Table 6.1).

In order to compare these two entropies, the integral for the differential entropy can be evaluated in terms of its Riemann sum (cf. the footnote on Riemann integration in Section 6.1.4), i.e., evaluating the function $-2x\log(2x)$ at the $m$ points
$$\frac{1}{2m},\; \frac{3}{2m},\; \ldots,\; \frac{2i-1}{2m},\; \ldots,\; \frac{2m-1}{2m},$$
and summing the products of $1/m$ (the width of each interval) and these function values:
$$\sum_{i=1}^{m}\left[-\frac{2i-1}{m}\log\left(\frac{2i-1}{m}\right)\right]\frac{1}{m} = \left[-\frac{1}{m^2}\sum_{i=1}^{m}(2i-1)\log(2i-1) + \log(m)\right] \text{ nats},$$
which converges to $-0.193$ as $m$ tends to infinity (cf. Table 6.1).


  m     entropy of quantized source     m-level Riemann approximation of differential entropy
  2     0.562335 nats                   -0.130812 nats
  4     1.212314 nats                   -0.173980 nats
  8     1.891987 nats                   -0.187455 nats
  16    2.581090 nats                   -0.191499 nats
  32    3.273057 nats                   -0.192679 nats
  64    3.965867 nats                   -0.193016 nats
  128   4.658919 nats                   -0.193111 nats
  256   5.352040 nats                   -0.193137 nats
  512   6.045180 nats                   -0.193144 nats

Table 6.1: List of entropies of the m-level quantized source and the m-level Riemann approximation of the differential entropy.
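The entries of Table 6.1 follow directly from the two sums derived above; the short sketch below reproduces them and is purely illustrative.

import numpy as np

def quantized_entropy(m):
    i = np.arange(1, m + 1)
    p = (2 * i - 1) / m**2                       # Pr{ q(X) = i/m }
    return -np.sum(p * np.log(p))                # in nats

def riemann_diff_entropy(m):
    x = (2 * np.arange(1, m + 1) - 1) / (2 * m)  # midpoints (2i-1)/(2m)
    return np.sum(-2 * x * np.log(2 * x)) / m    # Riemann sum of -2x log(2x)

for m in [2, 4, 8, 16, 32, 64, 128, 256, 512]:
    print(m, quantized_entropy(m), riemann_diff_entropy(m))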

From the above example, a simple approximation to the difference between the entropy and the differential entropy can be made as follows:
$$\begin{aligned}
\Delta_E &\triangleq \text{entropy} - \text{differential entropy}\\
&\approx \sum_{i=-\infty}^{\infty}[-\Delta x\cdot f(i\Delta x)]\log[\Delta x\cdot f(i\Delta x)] - \sum_{i=-\infty}^{\infty}[-f(i\Delta x)\log f(i\Delta x)]\,\Delta x\\
&= \sum_{i=-\infty}^{\infty} f(i\Delta x)\,[-\Delta x\cdot\log(\Delta x)],
\end{aligned}$$
where $f(x)$ is the pdf of the source and $\Delta x$ is the sampling width. If there exists an interval $(a, b)$ on which the minimum of $f(x)$ is strictly larger than 0, then
$$\begin{aligned}
\Delta_E &\approx \sum_{i\in\mathcal{Z}} f(i\Delta x)\,[-\Delta x\cdot\log(\Delta x)]\\
&\ge \sum_{i\in\mathcal{Z}:\,i\Delta x\in(a,b)} f(i\Delta x)\,[-\Delta x\cdot\log(\Delta x)]\\
&\ge \left[\min_{x\in(a,b)} f(x)\right][-\log(\Delta x)]\sum_{i\in\mathcal{Z}:\,i\Delta x\in(a,b)}\Delta x\\
&\ge \left[\min_{x\in(a,b)} f(x)\right][-\log(\Delta x)]\,(|a-b| - 2\Delta x),
\end{aligned}$$
which goes to infinity as $\Delta x$ tends to zero, where $\mathcal{Z}$ represents the set of all integers. This analysis confirms again that the entropy of a continuous source is infinite.


Note that the differential entropy, unlike the entropy, can be negative.

Example 6.4 (differential entropy of continuous sources with uniform generic distribution) A continuous source $X$ with uniform generic distribution over $(a, b)$ has differential entropy
$$h(X) = \log|b - a| \text{ nats}.$$

Example 6.5 (differential entropy of Gaussian sources) A continuous source $X$ with Gaussian generic distribution of mean $\mu$ and variance $\sigma^2$ has differential entropy
$$\begin{aligned}
h(X) &= \int_{\Re}\phi(x)\left[\frac{1}{2}\log(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\right]dx\\
&= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}E[(X-\mu)^2]\\
&= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2}\\
&= \frac{1}{2}\log(2\pi\sigma^2 e) \text{ nats},
\end{aligned}$$
where $\phi(x)$ is the pdf of the Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
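The closed form $\tfrac{1}{2}\log(2\pi e\sigma^2)$ can be checked against a direct numerical integration of $-\int\phi\log\phi$; the sketch below is illustrative only.

import numpy as np

mu, sigma = 1.0, 2.0
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200001)
phi = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

h_numeric = -np.trapz(phi * np.log(phi), x)       # numerical -integral of phi*log(phi)
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_closed)                        # both ~2.112 nats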

6.1.3 Properties of differential entropies

The AEP theorem for discrete sources tells us that the number of elements in a typical set is approximately $e^{nH(X)}$. Extending the AEP theorem from discrete sources to continuous sources by counting the number of elements in a set defined through a law-of-large-numbers argument seems somewhat useless, since the total number of elements in a continuous domain is infinite. Therefore, we consider its volume instead, and conclude that most of the probability mass is eventually placed on a typical set with controlled volume.

Theorem 6.6 (AEP for continuous sources) Let $X_1, \ldots, X_n$ be a sequence of random variables drawn i.i.d. according to the density $p_X(\cdot)$. Then
$$-\frac{1}{n}\log p_{X^n}(X_1, \ldots, X_n) \to E[-\log p_X(X)] = h(X) \text{ in probability}.$$

Proof: This is an immediate consequence of the law of large numbers. 2


Definition 6.7 (typical set) For $\delta > 0$ and any $n$, define the typical set as
$$\mathcal{F}_n(\delta) \triangleq \left\{x^n\in\mathcal{X}^n : \left|-\frac{1}{n}\log p_{X^n}(x_1, \ldots, x_n) - h(X)\right| < \delta\right\}.$$

Definition 6.8 (volume) The volume of a set $A$ is defined as
$$\mathrm{Vol}(A) \triangleq \int_A dx_1\cdots dx_n.$$

Theorem 6.9 (Shannon–McMillan theorem for continuous sources)

1. For $n$ sufficiently large, $P_{X^n}\{\mathcal{F}_n^c(\delta)\} < \delta$.

2. $\mathrm{Vol}(\mathcal{F}_n(\delta)) \le e^{n(h(X)+\delta)}$ for all $n$.

3. $\mathrm{Vol}(\mathcal{F}_n(\delta)) \ge (1-\delta)e^{n(h(X)-\delta)}$ for $n$ sufficiently large.

Proof: The proof is an extension of the Shannon–McMillan theorem for discrete sources, and hence we omit it. 2

We can also derive a source coding theorem for continuous sources and conclude that when a continuous source is compressed by quantization, it is beneficial to put most of the quantization effort on its typical set instead of on the entire source space, namely, assigning $m-1$ levels to elements in $\mathcal{F}_n(\delta)$ and 1 level to the elements outside $\mathcal{F}_n(\delta)$. As a result, if the differential entropy of a continuous source is larger, a larger number of quantization levels is expected to be required in order to minimize the distortion introduced by quantization. (Note that the general result also depends on the definition of the distortion measure.) So we may conclude that continuous sources with higher differential entropy contain more information in volume. The next theorem says that among all continuous sources with identical mean and variance, the Gaussian source has the largest differential entropy.

Theorem 6.10 (maximal differential entropy of the Gaussian source) The Gaussian source has the largest differential entropy among all continuous sources with identical mean $\mu$ and variance $\sigma^2$.

Proof: Let $p(\cdot)$ be the pdf of a continuous source $X$, and let $\phi(\cdot)$ be the pdf of a Gaussian source $Y$. Assume that these two sources have the same mean $\mu$ and variance $\sigma^2$. Observe that

$$\begin{aligned}
-\int_{\Re}\phi(y)\log\phi(y)\,dy &= \int_{\Re}\phi(y)\left[\frac{1}{2}\log(2\pi\sigma^2) + \frac{(y-\mu)^2}{2\sigma^2}\right]dy\\
&= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}E[(Y-\mu)^2]\\
&= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}E[(X-\mu)^2]\\
&= -\int_{\Re}p(x)\left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\right]dx\\
&= -\int_{\Re}p(x)\log\phi(x)\,dx.
\end{aligned}$$

Hence,
$$\begin{aligned}
h(Y) - h(X) &= -\int_{\Re}\phi(y)\log\phi(y)\,dy + \int_{\Re}p(x)\log p(x)\,dx\\
&= -\int_{\Re}p(x)\log\phi(x)\,dx + \int_{\Re}p(x)\log p(x)\,dx\\
&= \int_{\Re}p(x)\log\frac{p(x)}{\phi(x)}\,dx\\
&\ge \int_{\Re}p(x)\left(1 - \frac{\phi(x)}{p(x)}\right)dx \quad\text{(fundamental inequality)}\\
&= \int_{\Re}\bigl(p(x) - \phi(x)\bigr)\,dx = 0,
\end{aligned}$$
with equality if, and only if, $p(x) = \phi(x)$ for all $x\in\Re$. 2

In the next lemma, we list some properties of differential entropy for continuous sources that are the same in concept as in the discrete case.

Lemma 6.11

1. $h(X|Y) \le h(X)$ with equality if, and only if, $X$ and $Y$ are independent.

2. (chain rule for differential entropy)
$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i|X_1, X_2, \ldots, X_{i-1}).$$

3. $h(X^n) \le \sum_{i=1}^{n} h(X_i)$ with equality if, and only if, $\{X_i\}_{i=1}^{n}$ are independent.


There are some properties that are conceptually different in the continuous case from their counterparts in the discrete case.

Lemma 6.12

• (Discrete case) For any one-to-one correspondence mapping $f$,
$$H(f(X)) = H(X).$$

• (Continuous case) For the mapping $f(x) = ax$ with a non-zero constant $a$,
$$h(f(X)) = h(X) + \log|a|.$$

Proof: (Continuous case only) Let $p_X(\cdot)$ and $p_{f(X)}(\cdot)$ be the pdfs of the original source and the mapped source, respectively. Then
$$p_{f(X)}(u) = \frac{1}{|a|}\,p_X\!\left(\frac{u}{a}\right).$$
Substituting the new pdf into the formula for the differential entropy gives the desired result. 2

The above lemma says that the differential entropy can be increased by a one-to-one correspondence mapping. Therefore, if one views this quantity as a measure of the information content of continuous sources, one concludes that the information content can be increased by a one-to-one mapping, which is somewhat contrary to intuition. For this reason, it may not be appropriate to interpret the differential entropy as an index of the information content of continuous sources. In Section 6.1.4, we will see that this quantity can, however, be viewed as a measure of quantization efficiency. More details will follow.

Some researchers interpret the differential entropy as a convenient intermediate quantity for calculating the mutual information and the divergence for systems with continuous alphabets, which is in general true. Before introducing the formulas for relative entropy and mutual information in continuous settings, we point out that the interpretation, as well as the operational characteristics, of the mutual information and the divergence for systems with continuous alphabets is exactly the same as for systems with discrete alphabets.

We end this subsection with some useful results on differential entropy.

Corollary 6.13 For a sequence of continuous sources $X^n = (X_1, \ldots, X_n)$ and a non-singular $n\times n$ matrix $A$,
$$h(AX^n) = h(X^n) + \log|\det A|,$$
where $|\det A|$ denotes the absolute value of the determinant of the matrix $A$.

Corollary 6.14 $h(X + c) = h(X)$ and $h(X|Y) = h(X + Y|Y)$.


6.1.4 Operational meaning of differential entropy

Lemma 6.15 Given the pdf $f(x)$ of a continuous source $X$, and supposing that $-f(x)\log_2 f(x)$ is Riemann-integrable (see the footnote below on Riemann and Lebesgue integration), uniformly quantizing the random source to $n$-bit accuracy, i.e., with quantization width no greater than $2^{-n}$, requires approximately $h(X) + n$ bits (for $n$ large enough).

Footnote (Riemann and Lebesgue integrals). Riemann integral: let $s(x)$ denote a step function on $[a, b)$, i.e., there exists a partition $a = x_0 < x_1 < \cdots < x_n = b$ such that $s(x)$ is constant on each $(x_i, x_{i+1})$ for $0\le i < n$. A function $f(x)$ is Riemann integrable if
$$\int_a^b f(x)\,dx \triangleq \sup_{s(x):\,s(x)\le f(x)}\int_a^b s(x)\,dx = \inf_{s(x):\,s(x)\ge f(x)}\int_a^b s(x)\,dx.$$
Example of a non-Riemann-integrable function: $f(x) = 0$ if $x$ is irrational and $f(x) = 1$ if $x$ is rational. Then
$$\sup_{s(x):\,s(x)\le f(x)}\int_a^b s(x)\,dx = 0, \quad\text{but}\quad \inf_{s(x):\,s(x)\ge f(x)}\int_a^b s(x)\,dx = (b-a).$$
Lebesgue integral: let $t(x)$ denote a simple function, i.e., a linear combination of indicator functions of (finitely many) mutually disjoint partitions. Specifically, let $\mathcal{U}_1, \ldots, \mathcal{U}_m$ be a mutually disjoint partition of the domain $\mathcal{X}$ (namely, $\cup_{i=1}^m\mathcal{U}_i = \mathcal{X}$ and $\mathcal{U}_i\cap\mathcal{U}_j = \emptyset$ for $i\ne j$), and let the indicator function of $\mathcal{U}_i$ be $1(x;\mathcal{U}_i) = 1$ if $x\in\mathcal{U}_i$ and 0 otherwise. Then $t(x) = \sum_{i=1}^m a_i 1(x;\mathcal{U}_i)$ is a simple function. A function $f(x)$ is Lebesgue integrable if
$$\int_a^b f(x)\,dx = \sup_{t(x):\,t(x)\le f(x)}\int_a^b t(x)\,dx = \inf_{t(x):\,t(x)\ge f(x)}\int_a^b t(x)\,dx.$$
The previous example is actually Lebesgue integrable, and its Lebesgue integral equals zero.

Proof:

Step 1: Mean-value theorem. Let $\Delta = 2^{-n}$ be the width between two adjacent quantization levels, and let $t_i = i\Delta$ for every integer $i$. By the mean-value theorem [1], we can choose $x_i\in[t_{i-1}, t_i]$ such that
$$\int_{t_{i-1}}^{t_i} f(x)\,dx = f(x_i)(t_i - t_{i-1}) = \Delta\cdot f(x_i).$$

Step 2: Definition of $h_\Delta(X)$. Let
$$h_\Delta(X) \triangleq -\sum_{i=-\infty}^{\infty}[f(x_i)\log_2 f(x_i)]\,\Delta.$$
Since $-f(x)\log_2 f(x)$ is Riemann-integrable,
$$h_\Delta(X) \to h(X) \quad\text{as } \Delta = 2^{-n}\to 0.$$
Therefore, given any $\varepsilon > 0$, there exists $N$ such that for all $n > N$,
$$|h(X) - h_\Delta(X)| < \varepsilon.$$

Step 3: Computation of $H(X_\Delta)$. The entropy of the quantized source $X_\Delta$ is
$$H(X_\Delta) = -\sum_{i=-\infty}^{\infty} p_i\log_2 p_i = -\sum_{i=-\infty}^{\infty}\bigl(f(x_i)\Delta\bigr)\log_2\bigl(f(x_i)\Delta\bigr) \text{ bits}.$$

Step 4: $H(X_\Delta) - h_\Delta(X)$. From Steps 2 and 3,
$$\begin{aligned}
H(X_\Delta) - h_\Delta(X) &= -\sum_{i=-\infty}^{\infty}[f(x_i)\Delta]\log_2\Delta\\
&= (-\log_2\Delta)\sum_{i=-\infty}^{\infty}\int_{t_{i-1}}^{t_i} f(x)\,dx\\
&= (-\log_2\Delta)\int_{-\infty}^{\infty} f(x)\,dx = -\log_2\Delta = n.
\end{aligned}$$
Hence,
$$[h(X) + n] - \varepsilon < H(X_\Delta) = h_\Delta(X) + n < [h(X) + n] + \varepsilon$$
for $n > N$. 2

Since $H(X_\Delta)$ is the minimum average codeword length for lossless data compression, uniformly quantizing a continuous source up to $n$-bit accuracy requires approximately $h(X) + n$ bits. Therefore, we may conclude that the larger the differential entropy, the larger the average number of bits required to uniformly quantize the source to a fixed accuracy.

This operational meaning of differential entropy can be used to interpret the properties introduced in the previous subsection. For example, "$h(X + c) = h(X)$" can be interpreted as "a shift in value does not change the quantization efficiency of the original source."


Example 6.16 Find the minimum average number of bits required to uniformly quantize the decay time (in years) of a radium atom to 3-digit accuracy, given that the half-life of radium is 80 years. Note that the half-life of a radium atom is the median of its decay time distribution $f(x) = \lambda e^{-\lambda x}$, where $x > 0$.

Since the median is 80, we obtain
$$\int_0^{80}\lambda e^{-\lambda x}\,dx = 0.5,$$
which implies $\lambda = 0.00866$. Also, 3-digit accuracy is approximately equivalent to $\log_2 999 = 9.96 \approx 10$-bit accuracy. Therefore, the number of bits required to uniformly quantize the source is approximately
$$h(X) + 10 = \log_2\frac{e}{\lambda} + 10 = 18.29 \text{ bits}.$$
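The numbers in this example are easy to verify; the short sketch below is illustrative only.

import numpy as np

half_life = 80.0
lam = np.log(2) / half_life                    # median of Exp(lambda) is ln(2)/lambda = 80
h_bits = np.log2(np.e / lam)                   # differential entropy of Exp(lambda), in bits
n_bits = np.log2(999)                          # 3-digit accuracy ~ 10-bit accuracy

print(lam)                                      # ~0.00866
print(h_bits + round(n_bits))                   # ~18.29 bits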

6.1.5 Relative entropy and mutual information for continuous sources and channels

Definition 6.17 (relative entropy) The relative entropy between two densities $p_X$ and $p_{\hat{X}}$ is defined as
$$D(X\|\hat{X}) \triangleq \int_{\mathcal{X}} p_X(x)\log\frac{p_X(x)}{p_{\hat{X}}(x)}\,dx.$$

Definition 6.18 (mutual information) The mutual information for the input-output joint density $p_{X,Y}(x, y)$ is defined as
$$I(X; Y) \triangleq \int_{\mathcal{X}\times\mathcal{Y}} p_{X,Y}(x, y)\log\frac{p_{X,Y}(x, y)}{p_X(x)p_Y(y)}\,dx\,dy.$$

In contrast to the situation for entropy, the properties of relative entropy and mutual information in the continuous case are the same as those in the discrete case. In particular, the mutual information of the quantized version of a continuous channel converges to the mutual information of the continuous channel itself (as the quantization step size goes to zero). Hence, some researchers prefer to define the mutual information of a continuous channel directly as the limit over quantized versions of the channel.

Here, we quote some of the properties of relative entropy and mutual information from the discrete setting. Readers can refer to Chapter 2 for more properties.


Lemma 6.19

1. $D(X\|\hat{X}) \ge 0$ with equality if, and only if, $p_X = p_{\hat{X}}$.

2. $I(X; Y) \ge 0$ with equality if, and only if, $X$ and $Y$ are independent.

3. $I(X; Y) = h(Y) - h(Y|X)$.

6.2 Lossy data compression for continuous sources

Since the entropy of a continuous source is infinite, compressing a continuous source without distortion is impossible according to Shannon's source coding theorem. Thus, one way to characterize data compression for continuous sources is to encode the original source subject to a constraint on the distortion, which yields the rate-distortion function (cf. Chapter 5).

Conceptually, the rate-distortion function is the minimum data compression rate (in nats per source letter, or nats per source sample) for which the distortion constraint is satisfied. The next theorem provides a general upper bound on $R(D)$ for the squared error distortion measure.

Theorem 6.20 Under the squared error distortion measure, namely
$$\rho(z,\hat{z}) = (z - \hat{z})^2,$$
the rate-distortion function of a continuous source $Z$ with zero mean and variance $\sigma^2$ satisfies
$$R(D) \le \begin{cases}\dfrac{1}{2}\log\dfrac{\sigma^2}{D}, & \text{for } 0\le D\le\sigma^2;\\[1ex] 0, & \text{for } D > \sigma^2.\end{cases}$$
Equality holds when $Z$ is Gaussian.

Proof: By Theorem 5.16 (extended to the squared error distortion measure),
$$R(D) = \min_{p_{\hat{Z}|Z}\,:\,E[(Z-\hat{Z})^2]\le D} I(Z;\hat{Z}).$$
So for any $p_{\hat{Z}|Z}$ satisfying the distortion constraint,
$$R(D) \le I(p_Z, p_{\hat{Z}|Z}).$$


For $0\le D\le\sigma^2$, choose a dummy Gaussian random variable $W$ with zero mean and variance $aD$, where $a = 1 - D/\sigma^2$, independent of $Z$. Let $\hat{Z} = aZ + W$. Then
$$E[(Z-\hat{Z})^2] = E[(1-a)^2Z^2] + E[W^2] = (1-a)^2\sigma^2 + aD = D,$$
which satisfies the distortion constraint. Note that the variance of $\hat{Z}$ equals $E[a^2Z^2] + E[W^2] = \sigma^2 - D$. Consequently,
$$\begin{aligned}
R(D) &\le I(Z;\hat{Z})\\
&= h(\hat{Z}) - h(\hat{Z}|Z)\\
&= h(\hat{Z}) - h(W + aZ|Z)\\
&= h(\hat{Z}) - h(W|Z) \quad\text{(by Corollary 6.14)}\\
&= h(\hat{Z}) - h(W) \quad\text{(by Lemma 6.11)}\\
&= h(\hat{Z}) - \frac{1}{2}\log\bigl(2\pi e(aD)\bigr)\\
&\le \frac{1}{2}\log\bigl(2\pi e(\sigma^2 - D)\bigr) - \frac{1}{2}\log\bigl(2\pi e(aD)\bigr)\\
&= \frac{1}{2}\log\frac{\sigma^2}{D}.
\end{aligned}$$

For D > σ2, let Z satisfy PrZ = 0 = 1, and be independent of Z. ThenE[(Z − Z)2] = E[Z2] + E[Z2]− 2E[Z]E[Z] = σ2 < D, and I(Z; Z) = 0. Hence,R(D) = 0 for D > σ2.

The achievability of the upper bound by Gaussian source will be proved inTheorem 6.22. 2

6.2.1 Rate distortion function for specific sources

A) Binary sources

A specific application of the rate-distortion function that is useful in practice arises when binary alphabets and the Hamming additive distortion measure are assumed. The Hamming additive distortion measure is defined as

    ρ_n(z^n, ẑ^n) = Σ_{i=1}^n z_i ⊕ ẑ_i,

where “⊕” denotes modulo-two addition. In this case, ρ_n(z^n, ẑ^n) is exactly the number of bit changes, or bit errors, after compression. Therefore, the distortion bound D becomes a bound on the average probability of bit error. Specifically, among n compressed bits, one expects E[ρ_n(Z^n, Ẑ^n)] bit errors; hence, the expected bit error rate is (1/n)E[ρ_n(Z^n, Ẑ^n)].

Theorem 6.21 Fix a memoryless binary source

    Z = {Z^n = (Z₁, Z₂, . . . , Z_n)}_{n=1}^∞

with marginal distribution P_Z(0) = 1 − P_Z(1) = p. Assume the Hamming additive distortion measure is employed. Then the rate-distortion function is

    R(D) = H_b(p) − H_b(D),  if 0 ≤ D < min{p, 1 − p};
           0,                if D ≥ min{p, 1 − p},

where H_b(p) ≜ −p · log(p) − (1 − p) · log(1 − p) is the binary entropy function.

Proof: Assume without loss of generality that p ≤ 1/2.

We first prove the theorem for 0 ≤ D < min{p, 1 − p} = p. Observe that

    H(Z|Ẑ) = H(Z ⊕ Ẑ|Ẑ).

Also observe that

    E[ρ(Z, Ẑ)] ≤ D implies Pr{Z ⊕ Ẑ = 1} ≤ D.

Then

    I(Z; Ẑ) = H(Z) − H(Z|Ẑ)
            = H_b(p) − H(Z ⊕ Ẑ|Ẑ)
            ≥ H_b(p) − H(Z ⊕ Ẑ)     (conditioning never increases entropy)
            ≥ H_b(p) − H_b(D),

where the last inequality follows since the binary entropy function H_b(x) is increasing for x ≤ 1/2, and Pr{Z ⊕ Ẑ = 1} ≤ D. Since the above derivation holds for any P_{Ẑ|Z}, we have

    R(D) ≥ H_b(p) − H_b(D).

It remains to show that the lower bound is achievable by some P_{Ẑ|Z}, or equivalently, that H(Z|Ẑ) = H_b(D) for some P_{Ẑ|Z}. By defining P_{Z|Ẑ}(0|0) = P_{Z|Ẑ}(1|1) = 1 − D, we immediately obtain H(Z|Ẑ) = H_b(D). The desired P_{Ẑ|Z} can be obtained by solving

    1 = P_Ẑ(0) + P_Ẑ(1)
      = [P_Z(0)/P_{Z|Ẑ}(0|0)] · P_{Ẑ|Z}(0|0) + [P_Z(0)/P_{Z|Ẑ}(0|1)] · P_{Ẑ|Z}(1|0)
      = [p/(1 − D)] · P_{Ẑ|Z}(0|0) + [p/D] · (1 − P_{Ẑ|Z}(0|0))

and

    1 = P_Ẑ(0) + P_Ẑ(1)
      = [P_Z(1)/P_{Z|Ẑ}(1|0)] · P_{Ẑ|Z}(0|1) + [P_Z(1)/P_{Z|Ẑ}(1|1)] · P_{Ẑ|Z}(1|1)
      = [(1 − p)/D] · (1 − P_{Ẑ|Z}(1|1)) + [(1 − p)/(1 − D)] · P_{Ẑ|Z}(1|1),

which yield

    P_{Ẑ|Z}(0|0) = [(1 − D)/(1 − 2D)] · (1 − D/p)   and   P_{Ẑ|Z}(1|1) = [(1 − D)/(1 − 2D)] · (1 − D/(1 − p)).

This completes the proof for 0 ≤ D < min{p, 1 − p} = p.

Now, in the case p ≤ D < 1 − p, we can let P_{Ẑ|Z}(1|0) = P_{Ẑ|Z}(1|1) = 1 to obtain I(Z; Ẑ) = 0 and

    E[ρ(Z, Ẑ)] = Σ_{z=0}^{1} Σ_{ẑ=0}^{1} P_Z(z) P_{Ẑ|Z}(ẑ|z) ρ(z, ẑ) = p ≤ D.

Similarly, in the case D ≥ 1 − p, we let P_{Ẑ|Z}(0|0) = P_{Ẑ|Z}(0|1) = 1 to obtain I(Z; Ẑ) = 0 and

    E[ρ(Z, Ẑ)] = Σ_{z=0}^{1} Σ_{ẑ=0}^{1} P_Z(z) P_{Ẑ|Z}(ẑ|z) ρ(z, ẑ) = 1 − p ≤ D.   □

B) Gaussian sources

Theorem 6.22 Fix a memoryless source

    Z = {Z^n = (Z₁, Z₂, . . . , Z_n)}_{n=1}^∞

with zero-mean Gaussian marginal distribution of variance σ². Assume that the squared error distortion measure is employed. Then the rate-distortion function is

    R(D) = (1/2) log(σ²/D),  if 0 ≤ D ≤ σ²;
           0,                if D > σ².


Proof: From Theorem 6.20, it suffices to show that, for the Gaussian source, (1/2) log(σ²/D) is a lower bound to R(D) for 0 ≤ D ≤ σ². This can be proved as follows.

For the Gaussian source Z and any Ẑ with E[(Z − Ẑ)²] ≤ D,

    I(Z; Ẑ) = h(Z) − h(Z|Ẑ)
            = (1/2) log(2πeσ²) − h(Z − Ẑ|Ẑ)                    (Corollary 6.14)
            ≥ (1/2) log(2πeσ²) − h(Z − Ẑ)                      (Lemma 6.11)
            ≥ (1/2) log(2πeσ²) − (1/2) log(2πe Var[Z − Ẑ])     (Theorem 6.10)
            ≥ (1/2) log(2πeσ²) − (1/2) log(2πe E[(Z − Ẑ)²])
            ≥ (1/2) log(2πeσ²) − (1/2) log(2πeD)
            = (1/2) log(σ²/D).   □
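For concreteness, the two closed-form rate-distortion functions above are easy to evaluate numerically. The following Python sketch (my own illustration) computes R(D) in bits for a Bernoulli(p) source under Hamming distortion and for a zero-mean Gaussian source under squared error:

```python
import math

def Hb(x):
    """Binary entropy function in bits (Hb(0) = Hb(1) = 0)."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def rd_binary(D, p):
    """R(D) for a Bernoulli(p) source under Hamming distortion (Theorem 6.21)."""
    return Hb(p) - Hb(D) if 0 <= D < min(p, 1 - p) else 0.0

def rd_gaussian(D, var):
    """R(D) for a zero-mean Gaussian source of variance var under squared error (Theorem 6.22)."""
    return 0.5 * math.log2(var / D) if 0 < D <= var else 0.0

print(rd_binary(0.1, 0.5))     # ≈ 0.531 bits per source letter
print(rd_gaussian(0.25, 1.0))  # = 1.0 bit per source sample
```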

6.3 Channel coding theorem for continuous channels

Deriving the channel capacity of a memoryless continuous channel without any constraint on the inputs is somewhat impractical, especially when the input can be any number on the infinite real line. Hence a cost constraint is usually imposed on the input, typically of the form

    E[t(X)] ≤ S   (or (1/n) Σ_{i=1}^n E[t(X_i)] ≤ S for a sequence of random inputs),

where t(·) is a non-negative cost function.

Example 6.23 (average power constraint) t(x) ≜ x², i.e., the constraint is that the average input power is bounded above by S.

As extended from the discrete case, the channel capacity of a discrete-time continuous channel with an input cost constraint is of the form

    C(S) ≜ max_{p_X : E[t(X)] ≤ S} I(X; Y).                              (6.3.1)

We first show that, under this definition, C(S) is a concave function of S.


Lemma 6.24 (concavity of the capacity-cost function) C(S) defined in (6.3.1) is concave, continuous, and strictly increasing in S.

Proof: Let P_{X₁} and P_{X₂} be two distributions that achieve C(P₁) and C(P₂), respectively. Denote P_{X_λ} ≜ λP_{X₁} + (1 − λ)P_{X₂}. Then

    C(λP₁ + (1 − λ)P₂) = max_{P_X : E[t(X)] ≤ λP₁+(1−λ)P₂} I(P_X, P_{Y|X})
                       ≥ I(P_{X_λ}, P_{Y|X})
                       ≥ λ I(P_{X₁}, P_{Y|X}) + (1 − λ) I(P_{X₂}, P_{Y|X})
                       = λ C(P₁) + (1 − λ) C(P₂),

where the first inequality holds since

    E_{X_λ}[t(X)] = ∫_ℜ t(x) dP_{X_λ}(x)
                  = λ ∫_ℜ t(x) dP_{X₁}(x) + (1 − λ) ∫_ℜ t(x) dP_{X₂}(x)
                  = λ E_{X₁}[t(X)] + (1 − λ) E_{X₂}[t(X)]
                  ≤ λP₁ + (1 − λ)P₂,

and the second inequality follows from the concavity of mutual information with respect to its first argument (cf. Lemma 2.38). Accordingly, C(S) is concave in S.

Furthermore, it is easily seen from the definition that C(S) is non-decreasing, which, together with its concavity, implies that it is continuous and strictly increasing. □

Although the capacity-cost function formula in (6.3.1) is valid for a general cost function t(·), we only substantiate it under the average power constraint in the next forward channel coding theorem. Its validity in more general cases can be proved similarly, based on the same concept.

Theorem 6.25 (forward channel coding theorem for continuous channels under average power constraint) For any ε ∈ (0, 1), there exist 0 < γ < 2ε and a sequence of data transmission codes {C̃_n = (n, M_n)}_{n=1}^∞ satisfying

    (1/n) log M_n > C(S) − γ

and, for each codeword c = (c₁, c₂, . . . , c_n),

    (1/n) Σ_{i=1}^n c_i² ≤ S                                             (6.3.2)


such that the probability of decoding error P_e(C̃_n) is less than ε for all sufficiently large n.

Proof: The theorem holds trivially when C(S) = 0, because we can choose M_n = 1 for every n, which yields P_e(C̃_n) = 0. Hence, assume without loss of generality that C(S) > 0.

Step 0:

Take a positive γ satisfying γ < min{2ε, C(S)}. Pick ξ > 0 small enough such that 2[C(S) − C(S − ξ)] < γ; the existence of such ξ is assured by the continuity of C(S). Hence, we have C(S − ξ) − γ/2 > C(S) − γ > 0. Choose M_n to satisfy

    C(S − ξ) − γ/2 > (1/n) log M_n > C(S) − γ,

where such a choice exists for all sufficiently large n. Take δ = γ/8. Let P_X be the distribution that achieves C(S − ξ); hence, E[X²] ≤ S − ξ and I(X; Y) = C(S − ξ).

Step 1: Random coding with average power constraint.

Randomly draw M_n codewords according to the distribution P_{X^n} with

    P_{X^n}(x^n) = Π_{i=1}^n P_X(x_i).

By the law of large numbers, each randomly selected codeword c_m = (c_{m1}, . . . , c_{mn}) satisfies

    lim_{n→∞} (1/n) Σ_{i=1}^n c_{mi}² = E[X²] ≤ S − ξ   almost surely,

for m = 1, 2, . . . , M_n.

Step 2: Coder.

For the M_n selected codewords c₁, . . . , c_{M_n}, replace the codewords that violate the power constraint (i.e., (6.3.2)) by the all-zero codeword 0. Define the encoder as

    f_n(m) = c_m   for 1 ≤ m ≤ M_n.


When receiving an output sequence y^n, the decoder g_n(·) is given by

    g_n(y^n) = m,          if (c_m, y^n) ∈ F_n(δ) and (∀ m′ ≠ m) (c_{m′}, y^n) ∉ F_n(δ);
               arbitrary,  otherwise,

where

    F_n(δ) ≜ { (x^n, y^n) ∈ X^n × Y^n :  |−(1/n) log p_{X^n Y^n}(x^n, y^n) − h(X, Y)| < δ,
               |−(1/n) log p_{X^n}(x^n) − h(X)| < δ,  and  |−(1/n) log p_{Y^n}(y^n) − h(Y)| < δ }.

Step 3: Probability of error.

Let λ_m denote the error probability given that codeword m is transmitted. Define

    E₀ ≜ { x^n ∈ X^n : (1/n) Σ_{i=1}^n x_i² > S }.

Then, by following a similar argument as in (4.3.2), we get

    E[λ_m] ≤ P_{X^n}(E₀) + P_{X^n,Y^n}(F_n^c(δ))
             + Σ_{m′=1, m′≠m}^{M_n}  Σ_{c_m ∈ X^n}  Σ_{y^n ∈ F_n(δ|c_{m′})}  P_{X^n,Y^n}(c_m, y^n),

where

    F_n(δ|x^n) ≜ { y^n ∈ Y^n : (x^n, y^n) ∈ F_n(δ) }.

Note that the additional term P_{X^n}(E₀), compared with (4.3.2), copes with the errors due to the all-zero codeword replacement; it is less than δ for all sufficiently large n by the law of large numbers. Finally, by carrying out a procedure similar to the proof of the capacity theorem for discrete channels (cf. page 87), we obtain

    E[P_e(C̃_n)] ≤ P_{X^n}(E₀) + P_{X^n,Y^n}(F_n^c(δ)) + M_n · e^{n(h(X,Y)+δ)} e^{−n(h(X)−δ)} e^{−n(h(Y)−δ)}
                ≤ P_{X^n}(E₀) + P_{X^n,Y^n}(F_n^c(δ)) + e^{n(C(S−ξ)−4δ)} · e^{−n(I(X;Y)−3δ)}
                = P_{X^n}(E₀) + P_{X^n,Y^n}(F_n^c(δ)) + e^{−nδ}.

Accordingly, we can make the average probability of error E[P_e(C̃_n)] less than 3δ = 3γ/8 < 3ε/4 < ε for all sufficiently large n. □


Next, we present the converse to the forward channel coding theorem. The basic idea of the proof is very similar to that for the discrete case.

Theorem 6.26 (converse channel coding theorem for continuous channels) For any sequence of data transmission codes {C̃_n = (n, M_n)}_{n=1}^∞ with each codeword satisfying (6.3.2), if the ultimate data transmission rate satisfies

    liminf_{n→∞} (1/n) log M_n > C(S),

then its probability of decoding error is bounded away from zero for all n sufficiently large.

Proof: For an (n, M_n) block data transmission code, an encoding function is chosen as

    f_n : {1, 2, . . . , M_n} → X^n,

and each index i is equally likely under the average probability of block decoding error criterion. Hence, we can assume that the information message is generated by a random variable W uniformly distributed over {1, 2, . . . , M_n}. As a result,

    H(W) = log M_n.

Since W → X^n → Y^n forms a Markov chain (because Y^n depends only on X^n), we obtain by the data processing lemma that I(W; Y^n) ≤ I(X^n; Y^n). We can also bound I(X^n; Y^n) by nC(S) as follows:

    I(X^n; Y^n) ≤ max_{P_{X^n} : (1/n)Σ_{i=1}^n E[X_i²] ≤ S} I(X^n; Y^n)
               ≤ max_{P_{X^n} : (1/n)Σ_{i=1}^n E[X_i²] ≤ S} Σ_{j=1}^n I(X_j; Y_j)                        (Theorem 2.20)
               = max_{(P₁,...,P_n) : (1/n)Σ_{i=1}^n P_i = S}  max_{P_{X^n} : (∀ i) E[X_i²] ≤ P_i} Σ_{j=1}^n I(X_j; Y_j)
               ≤ max_{(P₁,...,P_n) : (1/n)Σ_{i=1}^n P_i = S}  Σ_{j=1}^n  max_{P_{X^n} : (∀ i) E[X_i²] ≤ P_i} I(X_j; Y_j)
               ≤ max_{(P₁,...,P_n) : (1/n)Σ_{i=1}^n P_i = S}  Σ_{j=1}^n  max_{P_{X_j} : E[X_j²] ≤ P_j} I(X_j; Y_j)
               = max_{(P₁,...,P_n) : (1/n)Σ_{i=1}^n P_i = S}  Σ_{j=1}^n C(P_j)
               = max_{(P₁,...,P_n) : (1/n)Σ_{i=1}^n P_i = S}  n Σ_{j=1}^n (1/n) C(P_j)
               ≤ max_{(P₁,...,P_n) : (1/n)Σ_{i=1}^n P_i = S}  n C( (1/n) Σ_{j=1}^n P_j )                  (by concavity of C(S))
               = n C(S).

Consequently, by defining P_e(C̃_n) as the error of guessing W by observing Y^n via a decoding function

    g_n : Y^n → {1, 2, . . . , M_n},

which is exactly the average block decoding failure, we get

    log M_n = H(W)
            = H(W|Y^n) + I(W; Y^n)
            ≤ H(W|Y^n) + I(X^n; Y^n)
            ≤ H_b(P_e(C̃_n)) + P_e(C̃_n) · log(|W| − 1) + nC(S)     (by Fano's inequality)
            ≤ log(2) + P_e(C̃_n) · log(M_n − 1) + nC(S)             (since H_b(t) ≤ log(2) for all t ∈ [0, 1])
            ≤ log(2) + P_e(C̃_n) · log M_n + nC(S),


which implies that

    P_e(C̃_n) ≥ 1 − C(S)/[(1/n) log M_n] − log(2)/log M_n.

So if liminf_{n→∞} (1/n) log M_n > C(S), then there exist δ > 0 and an integer N such that for n ≥ N,

    (1/n) log M_n > C(S) + δ.

Hence, for n ≥ N₀ ≜ max{N, 2 log(2)/δ},

    P_e(C̃_n) ≥ 1 − C(S)/(C(S) + δ) − log(2)/(n(C(S) + δ)) ≥ δ/(2(C(S) + δ)).   □

Since the capacity is now a function of the cost constraint, it is named the capacity-cost function. Although we can derive the capacity of any memoryless discrete-time continuous channel (at least numerically) using the above formula, the one for the additive Gaussian channel is cited most often. This is not only because it possesses a closed-form expression, but also because it is a good model for some practical communication channels.

6.4 Capacity-cost functions for specific continuous channels

6.4.1 Memoryless additive Gaussian channels

Definition 6.27 (memoryless additive channel) Let

    X₁, . . . , X_n  and  Y₁, . . . , Y_n

be the input and output sequences of the channel, and let N₁, . . . , N_n be the noise. Then a memoryless additive channel is defined by

    Y_i = X_i + N_i

for each i, where {(X_i, Y_i, N_i)}_{i=1}^n are i.i.d. and X_i is independent of N_i.

Definition 6.28 (memoryless additive Gaussian channel) A memoryless additive channel is called a memoryless additive Gaussian channel if the noise is a Gaussian random variable.

Theorem 6.29 (capacity of the memoryless additive Gaussian channel under average power constraint) The capacity of a memoryless additive Gaussian channel with noise model³ N(0, σ²) and average power constraint E[X²] ≤ S on the channel input X is

    C(S) = (1/2) log(1 + S/σ²)   nats/channel symbol.

³N(0, σ²) denotes the Gaussian distribution with zero mean and variance σ².

Proof: By definition,

    C(S) = max_{p_X : E[X²] ≤ S} I(X; Y)
         = max_{p_X : E[X²] ≤ S} (h(Y) − h(Y|X))
         = max_{p_X : E[X²] ≤ S} (h(Y) − h(N + X|X))
         = max_{p_X : E[X²] ≤ S} (h(Y) − h(N|X))
         = max_{p_X : E[X²] ≤ S} (h(Y) − h(N))
         = ( max_{p_X : E[X²] ≤ S} h(Y) ) − h(N),

where N represents the additive Gaussian noise. We thus need to find an input distribution satisfying E[X²] ≤ S that maximizes the differential entropy of Y. Recall that the differential entropy subject to a mean and variance constraint is maximized by a Gaussian random variable (cf. Theorem 6.10); also, the differential entropy of a Gaussian random variable with variance σ² is (1/2) log(2πeσ²) nats (cf. Example 6.5), independently of its mean. Therefore, by taking X to be a Gaussian random variable with distribution N(0, S), Y achieves its largest variance S + σ² under the constraint E[X²] ≤ S. Consequently,

    C(S) = (1/2) log(2πe(S + σ²)) − (1/2) log(2πeσ²)
         = (1/2) log(1 + S/σ²)   nats/channel symbol.   □
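As a one-line numerical illustration of this formula (my own, with an arbitrarily chosen SNR), the capacity can be reported in both nats and bits per channel use:

```python
import math

snr = 10 ** (20 / 10)                 # 20 dB signal-to-noise ratio S/sigma^2 (arbitrary choice)
C_nats = 0.5 * math.log(1 + snr)      # capacity in nats per channel symbol
C_bits = 0.5 * math.log2(1 + snr)     # same capacity in bits per channel symbol

print(C_nats, C_bits)                 # ≈ 2.308 nats ≈ 3.329 bits per channel use
```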

Theorem 6.30 (worseness in capacity of Gaussian noise) For any memoryless additive discrete-time continuous channel whose noise has zero mean and variance σ², the capacity subject to the average power constraint S is lower bounded by

    (1/2) log(1 + S/σ²),

which is the capacity of the memoryless additive discrete-time Gaussian channel. (This means that Gaussian noise is the “worst” kind of noise in the sense of having the least channel capacity.)


Proof: Let p_{Y_g|X_g} and p_{Y|X} denote the transition probabilities of the Gaussian channel and of some other channel satisfying the cost constraint, and let N_g and N respectively denote their noises. Then for any Gaussian input p_{X_g},

    I(p_{X_g}, p_{Y|X}) − I(p_{X_g}, p_{Y_g|X_g})
      = ∫_ℜ ∫_ℜ p_{X_g}(x) p_N(y − x) log [p_N(y − x)/p_Y(y)] dy dx
        − ∫_ℜ ∫_ℜ p_{X_g}(x) p_{N_g}(y − x) log [p_{N_g}(y − x)/p_{Y_g}(y)] dy dx
      = ∫_ℜ ∫_ℜ p_{X_g}(x) p_N(y − x) log [p_N(y − x)/p_Y(y)] dy dx
        − ∫_ℜ ∫_ℜ p_{X_g}(x) p_N(y − x) log [p_{N_g}(y − x)/p_{Y_g}(y)] dy dx
      = ∫_ℜ ∫_ℜ p_{X_g}(x) p_N(y − x) log [ p_N(y − x) p_{Y_g}(y) / (p_{N_g}(y − x) p_Y(y)) ] dy dx
      ≥ ∫_ℜ ∫_ℜ p_{X_g}(x) p_N(y − x) ( 1 − p_{N_g}(y − x) p_Y(y) / (p_N(y − x) p_{Y_g}(y)) ) dy dx
      = 1 − ∫_ℜ [ p_Y(y)/p_{Y_g}(y) ] ( ∫_ℜ p_{X_g}(x) p_{N_g}(y − x) dx ) dy
      = 0,

with equality if, and only if, p_Y(y)/p_{Y_g}(y) = p_N(y − x)/p_{N_g}(y − x) for all x. Therefore,

    max_{p_X : E[X²] ≤ S} I(p_X, p_{Y_g|X_g}) = I(p*_{X_g}, p_{Y_g|X_g})
                                             ≤ I(p*_{X_g}, p_{Y|X})
                                             ≤ max_{p_X : E[X²] ≤ S} I(p_X, p_{Y|X}).   □

6.4.2 Capacity for uncorrelated parallel Gaussian channels

Suppose there are k mutually independent Gaussian channels with noise powers σ₁², σ₂², . . . , σ_k². If one wants to transmit information using these channels simultaneously (in parallel), what will be the effective channel capacity, and how should the signal power be apportioned among the channels? The answer to this question is the so-called water-pouring scheme. As illustrated in Fig. 6.1, rectangles whose height equals the noise power of each channel and whose width equals one (hence, the area of each rectangle is the noise power of the corresponding Gaussian channel) are placed in order inside a container. For the affordable input power, namely S, we pour water into the container so that the resultant overall area of the filled water equals S. The water area above each rectangle is exactly the power that should be allotted to that channel. In particular, channel 3 in the figure should never be used because there is no water above the respective rectangle.

[Figure 6.1: The water-pouring scheme for parallel Gaussian channels; water of total area S = S₁ + S₂ + S₄ fills above the noise levels σ₁², σ₂², σ₃², σ₄².]

Theorem 6.31 (capacity of parallel additive Gaussian channels) The capacity of k parallel additive Gaussian channels under an overall input power constraint S is

    C(S) = Σ_{i=1}^k (1/2) log(1 + S_i/σ_i²),

where σ_i² is the noise variance of channel i,

    S_i = max{0, θ − σ_i²},

and θ is chosen to satisfy Σ_{i=1}^k S_i = S.

This capacity is achieved by a set of independent Gaussian inputs with zero mean and variances S_i.

Proof: By definition,

    C(S) = max_{p_{X^k} : Σ_{i=1}^k E[X_i²] ≤ S} I(X^k; Y^k).


Since the noises N₁, . . . , N_k are independent,

    I(X^k; Y^k) = h(Y^k) − h(Y^k|X^k)
                = h(Y^k) − h(N^k + X^k|X^k)
                = h(Y^k) − h(N^k|X^k)
                = h(Y^k) − h(N^k)
                = h(Y^k) − Σ_{i=1}^k h(N_i)
                ≤ Σ_{i=1}^k h(Y_i) − Σ_{i=1}^k h(N_i)
                = Σ_{i=1}^k I(X_i; Y_i)
                ≤ Σ_{i=1}^k (1/2) log(1 + S_i/σ_i²),

with equality if each input is a Gaussian random variable with zero mean and variance S_i, and the inputs are independent, where S_i is the individual power constraint applied on channel i with Σ_{i=1}^k S_i = S.

So the problem is reduced to finding the power allotment that maximizes the capacity subject to the constraint Σ_{i=1}^k S_i = S. Using the Lagrange multiplier technique, the maximizer of

    max { Σ_{i=1}^k (1/2) log(1 + S_i/σ_i²) + λ ( Σ_{i=1}^k S_i − S ) }

can be found by taking the derivative with respect to S_i and setting it to zero, which yields

    (1/2) · 1/(S_i + σ_i²) + λ = 0,  if S_i > 0;
    (1/2) · 1/(S_i + σ_i²) + λ ≤ 0,  if S_i = 0.

Hence,

    S_i = θ − σ_i²,  if S_i > 0;
    S_i ≥ θ − σ_i²,  if S_i = 0,

where θ = −1/(2λ). □
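The water level θ in Theorem 6.31 can be found numerically, for example by bisection. The sketch below (my own illustration; function and variable names are not from the notes) allocates power across channels with given noise variances and evaluates the resulting capacity:

```python
import math

def water_fill(noise_vars, S, tol=1e-12):
    """Return per-channel powers S_i = max(0, theta - sigma_i^2) with sum S_i = S, plus theta."""
    lo, hi = 0.0, max(noise_vars) + S          # theta certainly lies in this interval
    while hi - lo > tol:                       # bisection on the water level
        theta = (lo + hi) / 2
        used = sum(max(0.0, theta - v) for v in noise_vars)
        if used > S:
            hi = theta
        else:
            lo = theta
    theta = (lo + hi) / 2
    return [max(0.0, theta - v) for v in noise_vars], theta

noise_vars = [1.0, 2.0, 6.0, 3.0]              # example noise powers (arbitrary)
S = 4.0                                        # total input power budget (arbitrary)
powers, theta = water_fill(noise_vars, S)
C = sum(0.5 * math.log(1 + p / v) for p, v in zip(powers, noise_vars))

print("theta  =", round(theta, 4))                  # water level
print("powers =", [round(p, 4) for p in powers])    # the sigma^2 = 6 channel gets no power here
print("C      =", round(C, 4), "nats per parallel channel use")
```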

A theorem on the rate-distortion function parallel to that on the capacity-cost function can also be established (cf. Fig. 6.2).


Theorem 6.32 (rate-distortion function for parallel Gaussian sources) Given k mutually independent Gaussian sources with variances σ₁², . . . , σ_k², the overall rate-distortion function under the additive squared error distortion constraint

    Σ_{i=1}^k E[(Z_i − Ẑ_i)²] ≤ D

is given by

    R(D) = Σ_{i=1}^k (1/2) log(σ_i²/D_i),

where D_i = min{θ, σ_i²} and θ is chosen to satisfy Σ_{i=1}^k D_i = D.

[Figure 6.2: Water-pouring for lossy data compression of parallel Gaussian sources. Water of total amount D is poured in; in the depicted case D = D₁ + D₂ + D₃ + D₄ = σ₁² + σ₂² + θ + σ₄².]

6.4.3 Capacity for correlated parallel additive Gaussian channels

In the previous subsection, we considered k parallel Gaussian channels in which the noise samples of different channels are independent; the resulting capacity-achieving power allocation follows the water-pouring scheme. In this subsection, we consider parallel Gaussian channels whose noises are correlated. Perhaps surprisingly, the result again turns out to be a water-pouring scheme.

Let K_N be the covariance matrix of the noise (N₁, N₂, . . . , N_k), and let K_X be the covariance matrix of the input (X₁, . . . , X_k). Assume that K_N is positive definite.⁴ The input power constraint becomes

    Σ_{i=1}^k E[X_i²] = tr(K_X) ≤ S,

where tr(·) represents the trace of the k × k matrix K_X. Assume that the input is independent of the noise. Then

    I(X^k; Y^k) = h(Y^k) − h(Y^k|X^k)
                = h(Y^k) − h(N^k + X^k|X^k)
                = h(Y^k) − h(N^k|X^k)
                = h(Y^k) − h(N^k).

Since h(N^k) is not determined by the input, the capacity-finding problem reduces to maximizing h(Y^k) over all possible inputs satisfying the power constraint.

Now observe that the covariance matrix of Y^k equals K_Y = K_X + K_N, which implies that the differential entropy of Y^k is upper bounded by

    h(Y^k) ≤ (1/2) log((2πe)^k |K_X + K_N|),

which is achieved by making Y^k Gaussian. It remains to find the K_X (if possible) under which this upper bound is both achieved and maximized.

Decompose K_N into its diagonal form as

    K_N = QΛQᵗ,

where the superscript “t” denotes matrix transposition, QQᵗ = I_{k×k}, and I_{k×k} is the identity matrix of order k. Note that since K_N is positive definite, Λ is a diagonal matrix with positive diagonal components equal to the eigenvalues of K_N. Then

    |K_X + K_N| = |K_X + QΛQᵗ|
               = |Q| · |QᵗK_XQ + Λ| · |Qᵗ|
               = |QᵗK_XQ + Λ|
               = |A + Λ|,

⁴A matrix K_{k×k} is positive definite if for every (x₁, . . . , x_k),

    [x₁, . . . , x_k] K [x₁, . . . , x_k]ᵗ ≥ 0,

with equality only when x_i = 0 for 1 ≤ i ≤ k.


where A ≜ QᵗK_XQ. Since tr(A) = tr(K_X), the problem is further transformed into maximizing |A + Λ| subject to tr(A) ≤ S.

Observing that A + Λ is positive definite (because Λ is positive definite), together with Hadamard's inequality,⁵ we have

    |A + Λ| ≤ Π_{i=1}^k (A_{ii} + λ_i),

where λ_i is the entry of Λ in the i-th row and i-th column, which is exactly the i-th eigenvalue of K_N. Thus, the maximum value of |A + Λ| under tr(A) ≤ S is achieved by a diagonal A with

    Σ_{i=1}^k A_{ii} = S.

Finally, we can apply the Lagrange multiplier technique, as in Theorem 6.31, to obtain

    A_{ii} = max{0, θ − λ_i},

where θ is chosen to satisfy Σ_{i=1}^k A_{ii} = S. We summarize the result in the next theorem.

Theorem 6.34 (capacity of correlated parallel additive Gaussian channels) The capacity of k parallel additive Gaussian channels with positive-definite noise covariance matrix K_N, under an overall input power constraint S, is

    C(S) = Σ_{i=1}^k (1/2) log(1 + S_i/λ_i),

where λ_i is the i-th eigenvalue of K_N,

    S_i = max{0, θ − λ_i},

and θ is chosen to satisfy Σ_{i=1}^k S_i = S.

This capacity is achieved by a set of independent Gaussian inputs (along the eigenvectors of K_N) with zero mean and variances S_i.

⁵Lemma 6.33 (Hadamard's inequality) Any positive definite k × k matrix K satisfies

    |K| ≤ Π_{i=1}^k K_{ii},

where K_{ii} is the entry of K in the i-th row and i-th column. Equality holds if, and only if, the matrix is diagonal.
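Computationally, Theorem 6.34 combines an eigendecomposition of K_N with the same water-filling routine used in the uncorrelated case. A small sketch of this (my own illustration; it reuses the hypothetical water_fill helper from the earlier sketch):

```python
import numpy as np

def correlated_capacity(K_N, S):
    """Capacity (nats) of correlated parallel Gaussian channels via eigen water-filling."""
    lam = np.linalg.eigvalsh(K_N)          # eigenvalues of the noise covariance matrix
    powers, _ = water_fill(list(lam), S)   # assumes the water_fill sketch defined earlier is in scope
    return sum(0.5 * np.log(1 + p / l) for p, l in zip(powers, lam))

K_N = np.array([[2.0, 0.5],
                [0.5, 1.0]])               # example positive-definite noise covariance (arbitrary)
print(correlated_capacity(K_N, S=3.0))     # capacity in nats per parallel channel use
```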


6.4.4 Capacity for band-limited waveform channels with white Gaussian noise

A common model for communication over a radio network or a telephone line is a band-limited channel with white noise. This is a continuous-time channel, modelled as

    Y_t = (X_t + N_t) ∗ h(t),

where “∗” denotes convolution, X_t is the waveform source, Y_t is the waveform output, N_t is the white noise, and h(t) is the band-limited filter (cf. Fig. 6.3).

[Figure 6.3: Band-limited waveform channel with white Gaussian noise, Y_t = (X_t + N_t) ∗ h(t).]

Perhaps the most well-known result on band-limited waveform channels is the sampling theorem, which states that sampling a band-limited signal with period 1/(2W) (i.e., at a sampling rate of 2W samples per second) is sufficient to reconstruct the signal from its samples, where W is the bandwidth of the signal. Based on this theorem, one can sample the filtered waveform signal and the filtered waveform noise, and reconstruct them distortionlessly at the sampling frequency 2W.

Let us now briefly describe how to transmit information over waveform channels. For a fixed interval [0, T), select M different functions (a waveform codebook)

    c₁(t), c₂(t), . . . , c_M(t),

one for each information message. Based on the received function y(t) for t ∈ [0, T) at the channel output, a guess of the transmitted message is made.

Now assume for the moment that the channel is noiseless, namely N_t = 0. Note that the input waveform codeword c_j(t) is time-limited to [0, T); so it cannot be band-limited. However, the receiver can only observe a band-limited version c̄_j(t) of c_j(t), due to the ideal band-limiting filter h(t). By the sampling theorem, the band-limited but time-unlimited c̄_j(t) can be distortionlessly reconstructed from its (possibly infinitely many) samples taken with period 1/(2W). Due to a practical system constraint, the receiver can only use those samples within time [0, T) to guess what the transmitter originally sent. (This is an implicit system constraint that is often not mentioned in the channel description.) Notably, these 2WT samples may not reconstruct the time-unlimited c̄_j(t) without distortion. As a result, the waveform codewords {c_j(t)}_{j=1}^M are chosen such that their residual signals {c̃_j(t)}_{j=1}^M, after experiencing the ideal lowpass filter h(t) and the implicit 2WT-sample constraint, are more “resistant” to noise.

In summary, the time-limited waveform c_j(t) passes through an ideal lowpass filter and is sampled, and only 2WT samples survive at the receiver end in the absence of noise. Hence, the power constraint in the capacity-cost function is actually applied to X̃_t (the signal that can be reconstructed from the 2WT samples seen at the receiver end), rather than to the transmitted signal X_t. (Indeed, the signal-to-noise ratio of interest in most communication problems is the ratio of the signal power that survives at the receiver end to the noise power experienced by this received signal. Do not mistake the signal power in this ratio for the transmitted power at the transmitter end.)

Let us now turn to the band-limited noise. Since the white Gaussian noise N_t is no longer white after it passes through the band-limiting filter, a natural question is whether the 2WT noise samples can be assumed i.i.d. Gaussian. The answer is affirmative, provided the right sampling rate is used. To be specific, the samples of the filtered noise Ñ_t = N_t ∗ h(t) satisfy

    E[Ñ_{i/(2W)} Ñ_{k/(2W)}]
      = E[ ( ∫_ℜ h(τ) N_{i/(2W)−τ} dτ ) ( ∫_ℜ h(τ′) N_{k/(2W)−τ′} dτ′ ) ]
      = ∫_ℜ ∫_ℜ h(τ) h(τ′) E[ N_{i/(2W)−τ} N_{k/(2W)−τ′} ] dτ′ dτ
      = ∫_ℜ ∫_ℜ h(τ) h(τ′) (N₀/2) δ( i/(2W) − k/(2W) − τ + τ′ ) dτ′ dτ
      = (N₀/2) ∫_ℜ h(τ) h(τ − (i − k)/(2W)) dτ
      = (N₀/2) ∫_ℜ ( ∫_{−W}^{W} (1/√(2W)) e^{j2πfτ} df ) ( ∫_{−W}^{W} (1/√(2W)) e^{j2πf′(τ − (i−k)/(2W))} df′ ) dτ
      = (N₀/(4W)) ∫_{−W}^{W} ∫_{−W}^{W} ( ∫_ℜ e^{j2π(f+f′)τ} dτ ) e^{−j2πf′(i−k)/(2W)} df′ df
      = (N₀/(4W)) ∫_{−W}^{W} ∫_{−W}^{W} δ(f + f′) e^{−j2πf′(i−k)/(2W)} df′ df
      = (N₀/(4W)) ∫_{−W}^{W} e^{j2πf(i−k)/(2W)} df
      = (N₀/2) · sin(π(i − k)) / (π(i − k))                                  (6.4.1)
      = N₀/2, if i = k;  0, if i ≠ k,                                         (6.4.2)

where the ideal filter is chosen to satisfy

    ∫_{−∞}^{∞} |H(f)|² df = 1.

Note that W cancels out in (6.4.1) because the sampling rate is exactly twice the bandwidth of the filter h(t). This indicates that if the sampling rate is not chosen appropriately, the noise samples become statistically correlated.

Hence, the capacity-cost function⁶ of this channel subject to input waveform width T (and the implicit 2WT-sample system constraint) is equal to

    C_T(S) = max_{p_{X^{2WT}} : Σ_{i=1}^{2WT} E[X²_{i/(2W)}] ≤ S} I(X^{2WT}; Y^{2WT})
           = Σ_{i=1}^{2WT} (1/2) log(1 + S_i/σ_i²)
           = Σ_{i=1}^{2WT} (1/2) log(1 + (S/(2WT))/(N₀/2))
           = Σ_{i=1}^{2WT} (1/2) log(1 + S/(WTN₀))
           = WT · log(1 + S/(WTN₀)),                                         (6.4.3)

where the input samples X^{2WT} that achieve the capacity are i.i.d. Gaussian distributed. It can be shown, similarly to (6.4.2), that a white Gaussian process X_t yields i.i.d. Gaussian distributed filtered samples X^{2WT} if the right sampling rate is employed.

⁶We use C_T(S) to denote the capacity-cost function subject to input waveform width T; this notation is specific to waveform channels. It should not be confused with C(S), which represents the capacity-cost function of a discrete-time channel, where the channel input is transmitted only at each sampled time instant, and hence no duration T is involved. One can, however, measure C(S) in bits per sample period to relate the quantity to the usual unit of data transmission speed, such as bits per second.

When W → ∞,

    C_T(S) → S/N₀   (nats per time interval of length T).

This is the channel capacity of the Gaussian waveform channel of infinite bandwidth. From this formula, we see that the capacity grows linearly with the input power. Note that in the above capacity-cost function the factor T seems irrelevant; in fact, the input distribution that achieves the capacity satisfies that the overall power of its samples within [0, T) is no greater than S, and hence T enters implicitly through the parameter S.

Example 6.35 (telephone line channel) Suppose telephone signals are band-limited to 4 kHz. Given a signal-to-noise ratio (SNR) of 20 dB (namely, S/(WN₀) = 20 dB) and T = 1 millisecond, the capacity of the band-limited Gaussian waveform channel equals

    C_T(S) = WT log(1 + S/(WTN₀))
           = 4000 × (1 × 10⁻³) × log(1 + 100/(1 × 10⁻³)).

Therefore, the maximum reliable transmission speed is

    C_T(S)/T = 46051.7 nats per second = 66438.6 bits per second.


Special attention is needed here: the capacity-cost formula used above is based on a greatly simplified channel model, i.e., the noise is additive white Gaussian. From Theorem 6.30, we learn that the capacity formula for such a simplified model only provides a lower bound to the true channel capacity. Therefore, it is possible that the true channel capacity is higher than the quantity obtained in the above example!

6.4.5 Capacity for filtered waveform stationary Gaussian channels

A channel with filtered input and additive Gaussian noise is shown in Fig. 6.4, where the filter is no longer ideally band-limited and the Gaussian noise is perhaps colored instead of white. This kind of channel arises repeatedly in the physical world, so the calculation of its capacity is of fundamental importance.

[Figure 6.4: Filtered Gaussian channel: X_t passes through the filter H(f), and the noise N_t is added at the filter output, producing Y_t.]

In analyzing the channel, we are free to change it into any equivalent form that suits our purpose. Let PSD_N(f) be the power spectral density of the additive Gaussian noise N_t, and let

    PSD_{N′}(f) ≜ PSD_N(f)/|H(f)|²

be the power spectral density of N′_t. Then the channel can be viewed as a channel with the noise N′_t appearing before the filter (cf. Fig. 6.5). At first glance at Fig. 6.5, it seems we could use the same sampling-theorem technique as before to solve the problem. Yet, as the filter is no longer required to be band-limited, we may not be able to reconstruct the signal waveforms from a finite number of samples. Also, N′_t may not be white even if N_t is; hence, the samples are not necessarily memoryless. Accordingly, an alternative approach is needed to find the capacity.


[Figure 6.5: Equivalent model of the filtered Gaussian channel: the noise N′_t is added to X_t before the filter H(f), producing Y_t.]

Lemma 6.36 Any real-valued function v(t) defined over [0, T) can be decomposed as

    v(t) = Σ_{i=1}^∞ v_i Ψ_i(t),

where the real-valued functions {Ψ_i(t)} form any orthonormal set of functions on [0, T) (which spans the space containing v(t)), namely

    ∫₀^T Ψ_i(t) Ψ_j(t) dt = 1, if i = j;  0, if i ≠ j,

and

    v_i = ∫₀^T Ψ_i(t) v(t) dt.

When the real-valued function v(t) is a random process, the resulting coefficients {v_i}_{i=1}^∞ are also random. The basic idea of the next lemma is to choose a proper {Ψ_i}_{i=1}^∞ such that the coefficients {v_i}_{i=1}^∞ are uncorrelated.

Lemma 6.37 (Karhunen-Loève expansion) Given a stationary random process V_t with autocorrelation function

    φ_V(t) = E[V_τ V_{τ+t}],

let {Ψ_i(t)}_{i=1}^∞ and {λ_i}_{i=1}^∞ be the eigenfunctions and eigenvalues of φ_V(t), namely

    ∫₀^T φ_V(t − s) Ψ_i(s) ds = λ_i Ψ_i(t),   0 ≤ t < T.

Then the expansion coefficients {Λ_i}_{i=1}^∞ of V_t with respect to the orthonormal functions {Ψ_i(t)}_{i=1}^∞ are uncorrelated. In addition, if V_t is Gaussian, then {Λ_i}_{i=1}^∞ are independent Gaussian random variables.


Proof:

    E[Λ_i Λ_j] = E[ ∫₀^T Ψ_i(t) V_t dt × ∫₀^T Ψ_j(s) V_s ds ]
              = ∫₀^T ∫₀^T Ψ_i(t) Ψ_j(s) E[V_t V_s] dt ds
              = ∫₀^T ∫₀^T Ψ_i(t) Ψ_j(s) φ_V(t − s) dt ds
              = ∫₀^T Ψ_i(t) ( ∫₀^T Ψ_j(s) φ_V(t − s) ds ) dt
              = ∫₀^T Ψ_i(t) (λ_j Ψ_j(t)) dt
              = λ_i, if i = j;  0, if i ≠ j.     □

We are now ready to express the input X_t and the noise N′_t in terms of the Karhunen-Loève expansion basis {Ψ_i(t)}_{i=1}^∞ associated with the autocorrelation function of N′_t. Abusing notation slightly, we let N′_i and X_i be the Karhunen-Loève coefficients of N′_t and X_t with respect to Ψ_i(t). Since {N′_i}_{i=1}^∞ are independent Gaussian distributed, we obtain from Theorem 6.34 that the channel capacity subject to input waveform width T is

    C_T(S) = Σ_{i=1}^∞ (1/2) log( 1 + max{0, θ − λ_i}/λ_i )
           = Σ_{i=1}^∞ (1/2) log[ max(1, θ/λ_i) ]
           = Σ_{i=1}^∞ (1/2) max[ 0, log(θ/λ_i) ],

where θ is the solution of

    S = Σ_{i=1}^∞ max[0, θ − λ_i]

and E[(N′_i)²] = λ_i is the i-th eigenvalue of the autocorrelation function of N′_t (corresponding to the eigenfunction Ψ_i(t)).

(corresponding to eigenfunction Ψi(t)).

We then summarize the result in the next theorem.


Theorem 6.38 Given a filtered Gaussian waveform channel with noise spectral density PSD_N(f) and filter H(f),

    C(S) = lim_{T→∞} C_T(S)/T = (1/2) ∫_{−∞}^{∞} max[ 0, log( θ / (PSD_N(f)/|H(f)|²) ) ] df,

where θ is the solution of

    S = ∫_{−∞}^{∞} max[ 0, θ − PSD_N(f)/|H(f)|² ] df.

Proof: This is a consequence of the Toeplitz distribution theorem.⁷ □

⁷[Toeplitz distribution theorem] Consider a zero-mean stationary random process V_t with power spectral density PSD_V(f) and ∫_{−∞}^{∞} PSD_V(f) df < ∞. Denote by λ₁(T), λ₂(T), λ₃(T), . . . the eigenvalues of the Karhunen-Loève expansion corresponding to the autocorrelation function of V_t over a time interval of width T. Then for any real-valued continuous function a(·) satisfying a(t) ≤ K · t for 0 ≤ t ≤ max_{f∈ℜ} PSD_V(f) and some finite constant K,

    lim_{T→∞} (1/T) Σ_{i=1}^∞ a(λ_i(T)) = ∫_{−∞}^{∞} a(PSD_V(f)) df.

As can be seen from the channel capacity formula in the above theorem, it also follows the water-pouring scheme. In other words, we can view the curve of PSD_N(f)/|H(f)|² as a bowl, and imagine water being poured into the bowl up to a level θ under which the area of the water equals S (cf. Fig. 6.6). The water then assumes the shape of the optimum transmission power spectrum.

[Figure 6.6: The water-pouring scheme: (a) the spectrum of PSD_N(f)/|H(f)|² and the corresponding water-pouring; (b) the input spectral density that achieves capacity.]
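Numerically, the water level θ in Theorem 6.38 can again be found by bisection, this time over a frequency grid. The sketch below (my own illustration; the filter and noise spectrum are arbitrary choices) approximates the two integrals by Riemann sums:

```python
import numpy as np

f = np.linspace(-8.0, 8.0, 160001)                 # frequency grid (Hz), wide enough for this example
df = f[1] - f[0]
H2 = 1.0 / (1.0 + (f / 2.0) ** 2)                  # |H(f)|^2 of a hypothetical low-pass filter
psd_n = 0.5 * np.ones_like(f)                      # flat noise PSD N0/2 with N0 = 1
bowl = psd_n / H2                                  # PSD_N(f) / |H(f)|^2, the "bowl" to fill

S = 4.0                                            # available input power (arbitrary)
lo, hi = bowl.min(), bowl.min() + S / df           # bracket for the water level theta
for _ in range(200):                               # bisection on theta
    theta = 0.5 * (lo + hi)
    if np.sum(np.maximum(0.0, theta - bowl)) * df > S:
        hi = theta
    else:
        lo = theta

C = 0.5 * np.sum(np.maximum(0.0, np.log(theta / bowl))) * df
print(theta, C)                                    # water level and capacity in nats per unit time
```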

6.5 Information-transmission theorem

Theorem 6.39 (joint source-channel coding theorem) Fix a distortion measure. A DMS can be reproduced at the output of a channel with distortion less than D (by taking sufficiently large blocklength) if

    R(D)/T_s < C(S)/T_c,

where T_s and T_c represent the durations per source letter and per channel input, respectively. Note that R(D) and C(S) should be expressed in the same units, i.e., both in nats (natural logarithm) or both in bits (base-2 logarithm).

Theorem 6.40 (joint source-channel coding converse) All data transmission codes will have average distortion larger than D for sufficiently large blocklength if

    R(D)/T_s > C(S)/T_c.

Example 6.41 (additive white Gaussian noise (AWGN) channel with binary channel input) Assume that the discrete-time binary source is memoryless with uniform marginal distribution, and that the discrete-time channel has a binary input alphabet and a real-line output alphabet with Gaussian transition probability. Denote by P_b the probability of bit error.

From Theorem 6.21, the rate-distortion function for a binary uniform source under the Hamming additive distortion measure is

    R(D) = log(2) − H_b(D), if 0 ≤ D ≤ 1/2;
           0,               if D > 1/2.

Notably, the distortion bound D is exactly a bound on the bit error rate P_b, since the Hamming additive distortion measure is used.

According to [2], the channel capacity-cost function for the binary-input AWGN channel is

    C(S) = S/σ² − (1/√(2π)) ∫_{−∞}^{∞} e^{−y²/2} log[ cosh( S/σ² + y√(S/σ²) ) ] dy
         = (E_b T_c/T_s)/(N₀/2) − (1/√(2π)) ∫_{−∞}^{∞} e^{−y²/2} log[ cosh( (E_b T_c/T_s)/(N₀/2) + y√((E_b T_c/T_s)/(N₀/2)) ) ] dy
         = 2Rγ_b − (1/√(2π)) ∫_{−∞}^{∞} e^{−y²/2} log[ cosh( 2Rγ_b + y√(2Rγ_b) ) ] dy,

where R = T_c/T_s is the code rate for data transmission, measured in source letters per channel usage (or information bits per channel bit), and γ_b (often denoted by E_b/N₀) is the signal-to-noise ratio per information bit.

Then, from the joint source-channel coding theorem, good codes exist when

    R(D) < (T_s/T_c) C(S),

or equivalently

    log(2) − H_b(P_b) < (1/R) [ 2Rγ_b − (1/√(2π)) ∫_{−∞}^{∞} e^{−y²/2} log[cosh(2Rγ_b + y√(2Rγ_b))] dy ].

By re-formulating the above inequality as

    H_b(P_b) > log(2) − 2γ_b + (1/(R√(2π))) ∫_{−∞}^{∞} e^{−y²/2} log[cosh(2Rγ_b + y√(2Rγ_b))] dy,

a lower bound on the bit error probability as a function of γ_b is established. This is the Shannon limit for any code over the binary-input Gaussian channel (cf. Fig. 6.7).


[Figure 6.7: The Shannon limits for (2, 1) and (3, 1) codes under the binary-input AWGN channel: bit error rate P_b versus γ_b (dB) for code rates R = 1/2 and R = 1/3.]

The result in the above example became important with the invention of Turbo coding, for which near-Shannon-limit performance was first obtained. This implies that a near-optimal channel code has been constructed, since, in principle, no code can perform better than the Shannon limit.

Example 6.42 (AWGN channel with real-valued input) Assume that the binary source is memoryless with uniform marginal distribution, and that the channel has real-line input and output alphabets with Gaussian transition probability. Denote by P_b the probability of bit error.

Again, the rate-distortion function for a binary uniform source under the Hamming additive distortion measure is

    R(D) = log(2) − H_b(D), if 0 ≤ D ≤ 1/2;
           0,               if D > 1/2.

In addition, the channel capacity-cost function for the real-input AWGN channel is

    C(S) = (1/2) log(1 + S/σ²)
         = (1/2) log(1 + (E_b T_c/T_s)/(N₀/2))
         = (1/2) log(1 + 2Rγ_b)   nats/channel symbol,

where R = T_c/T_s is the code rate for data transmission, measured in information bits per channel usage, and γ_b = E_b/N₀ is the signal-to-noise ratio per information bit.

Then, from the joint source-channel coding theorem, good codes exist when

    R(D) < (T_s/T_c) C(S),

or equivalently

    log(2) − H_b(P_b) < (1/R) · (1/2) log(1 + 2Rγ_b).

By re-formulating the above inequality as

    H_b(P_b) > log(2) − (1/(2R)) log(1 + 2Rγ_b),

a lower bound on the bit error probability as a function of γ_b is established. This is the Shannon limit for any code over the real-input Gaussian channel (cf. Fig. 6.8).

[Figure 6.8: The Shannon limits for (2, 1) and (3, 1) codes under the continuous-input AWGN channel: bit error rate P_b versus γ_b (dB) for code rates R = 1/2 and R = 1/3.]
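The real-input Shannon limit plotted above is easy to evaluate numerically: for a given code rate R and γ_b, one solves H_b(P_b) = log 2 − (1/(2R)) log(1 + 2Rγ_b) for P_b ∈ [0, 1/2]. A small sketch (my own illustration) using bisection:

```python
import math

def Hb(x):
    """Binary entropy function in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def shannon_limit_pb(R, gamma_b_dB):
    """Smallest achievable bit error rate over the real-input AWGN channel (Example 6.42)."""
    gamma_b = 10 ** (gamma_b_dB / 10)
    rhs = math.log(2) - (1 / (2 * R)) * math.log(1 + 2 * R * gamma_b)
    if rhs <= 0:
        return 0.0                      # capacity exceeds the source rate: Pb can be driven to 0
    lo, hi = 0.0, 0.5                   # Hb is increasing on [0, 1/2]
    for _ in range(100):                # bisection: find Pb with Hb(Pb) = rhs
        mid = 0.5 * (lo + hi)
        if Hb(mid) < rhs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(shannon_limit_pb(R=0.5, gamma_b_dB=-2.0))   # ≈ 0.05: no code can do better at -2 dB
print(shannon_limit_pb(R=0.5, gamma_b_dB=0.0))    # ≈ 0: 0 dB is the R = 1/2 threshold on this channel
```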


6.6 Capacity bound for non-Gaussian channels

If a channel has additive but non-Gaussian noise and an input power constraint, then it is often hard to calculate the channel capacity, let alone derive a closed-form capacity formula. Hence, in this section, we only introduce an upper bound and a lower bound on the capacity of non-Gaussian channels.

Definition 6.43 (entropy power) The entropy power of a random variable N is defined as

    N_e ≜ (1/(2πe)) e^{2·h(N)}.

Lemma 6.44 For a discrete-time continuous-alphabet additive-noise channel, the channel capacity-cost function satisfies

    (1/2) log( (S + σ²)/N_e ) ≥ C(S) ≥ (1/2) log( (S + σ²)/σ² ),

where S is the bound in the input power constraint and σ² is the noise power.

Proof: The lower bound was already proved in Theorem 6.30. The upper bound follows from

    I(X; Y) = h(Y) − h(N) ≤ (1/2) log[2πe(S + σ²)] − (1/2) log[2πeN_e].   □

The entropy power of a noise N can be viewed as the average power of a Gaussian random variable that has the same differential entropy as N. For a Gaussian noise N, its entropy power equals

    N_e = (1/(2πe)) e^{2h(N)} = Var(N),

from which the name comes.

Whenever two independent Gaussian noises N₁ and N₂ are added, the power (variance) of the sum equals the sum of the powers (variances) of the two noises. This relationship can be written as

    e^{2h(N₁+N₂)} = e^{2h(N₁)} + e^{2h(N₂)},

or equivalently,

    Var(N₁ + N₂) = Var(N₁) + Var(N₂).


However, when two independent noises are non-Gaussian, the relationship becomes

    e^{2h(N₁+N₂)} ≥ e^{2h(N₁)} + e^{2h(N₂)},

or equivalently,

    N_e(N₁ + N₂) ≥ N_e(N₁) + N_e(N₂).

This is called the entropy-power inequality, and it indicates that the sum of two independent noises may have larger entropy power than the sum of the individual entropy powers, except for Gaussian noises.
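As a quick numerical illustration of the entropy-power inequality (my own sketch, not from the notes), take N₁ and N₂ independent and uniform on (0, 1); their sum has a triangular density with differential entropy 1/2 nat, so the entropy powers can be evaluated in closed form:

```python
import math

# N1, N2 ~ Uniform(0,1): h(Ni) = log(1) = 0 nats, so Ne(Ni) = 1/(2*pi*e)
h_uniform = 0.0
Ne_uniform = math.exp(2 * h_uniform) / (2 * math.pi * math.e)

# N1 + N2 has the triangular density on (0,2); its differential entropy is 1/2 nat
h_sum = 0.5
Ne_sum = math.exp(2 * h_sum) / (2 * math.pi * math.e)

print(Ne_sum, Ne_uniform + Ne_uniform)   # ≈ 0.159 ≥ 0.117, consistent with the EPI
```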


Bibliography

[1] Tom M. Apostol, Calculus, 2nd edition, 1967.

[2] S. A. Butman and R. J. McEliece, “The ultimate limits of binary coding for a wideband Gaussian channel,” DSN Progress Report 42-22, Jet Propulsion Lab., Pasadena, CA, pp. 78-80, August 1974.


Appendix A

Mathematical Background on Real Analysis

A.1 The concept of sets

A set is a collection of objects. A set may consist of people, of numbers, of points, of any objects we can see and handle, or of abstract objects. The objects which belong to a set are called elements.

A set becomes useful if, given any object, we can decide whether it belongs to the collection we are interested in, or whether it does not. Such a requirement results in the notion of a field or algebra.¹

¹A set F is said to be a field or algebra of Ω if it is a nonempty collection of subsets of Ω with the following properties:

1. ∅ ∈ F and Ω ∈ F;

2. A ∈ F ⇒ A^c = Ω \ A ∈ F, where the superscript “c” and operator “\” represent the set complement and set subtraction operations, respectively;

3. A ∈ F and B ∈ F imply that A ∪ B ∈ F.

The set Ω is named the sample space, consisting of all possible outcomes of an experiment, while the set F is referred to as the event space. One can view F as a mechanism to decide whether an object lies in a subset of Ω. For example, the first condition corresponds to the mechanism to determine whether an outcome lies in the empty set (impossible) or in the sample space (certain). The second condition can be interpreted as saying that having a mechanism to determine whether the outcome lies in A is equivalent to having a mechanism to determine whether the outcome lies in A^c. The third condition says that if we have a mechanism to determine whether the outcome lies in A and a mechanism to determine whether the outcome lies in B, then we can surely determine whether the outcome lies in the union of A and B. Note that the three conditions together indicate that A ∈ F and B ∈ F imply A ∩ B ∈ F; hence, it is unnecessary to state this as a separate condition of an algebra.

Although the field or algebra provides most of the necessary mechanisms for locating an element in a (sub-)set, problems may still be encountered when one is trying to identify the “limit” of a sequence of elements. For example, if Ω = ℜ (the real line) and F is the collection of all open, semi-open and closed intervals whose two endpoints are rational numbers, one cannot rely on a field or algebra to identify whether an element lies in

    ⋃_{i=1}^∞ [0, 1.11...1)   (with i ones after the decimal point),

for the above set does not belong to F. This induces the necessity of introducing the notion of σ-field or σ-algebra.²

We now examine the relations of sets.

Definition A.1 (subset) If each element in set A is also an element of set B, we say that A is a subset of B. Symbolically, we write

    A ⊂ B (or B ⊃ A).

Some people use the notation A ⊆ B to indicate that “A is contained in B and may be equal to B,” while A ⊂ B excludes the case A = B. In these lecture notes, we adopt the convention that “⊂” and “⊆” are equivalent in the sense of the above definition.

Definition A.2 (equality) If A ⊂ B and also B ⊂ A, then we say that A and B are equal, which is denoted by A = B.

In addition to comparing sets, we can also combine them in various ways to form new sets.

Definition A.3 (union) The union of A and B is the new set consisting of all elements which belong to A, or B, or both. This set has the symbol A ∪ B.

²A set F is said to be a σ-field or σ-algebra of Ω if it is a nonempty collection of subsets of Ω with the following properties:

1. ∅ ∈ F and Ω ∈ F;

2. A ∈ F ⇒ A^c ∈ F, where the superscript “c” represents the set complement operation;

3. A_i ∈ F for i = 1, 2, 3, . . . imply that ∪_{i=1}^∞ A_i ∈ F.

From the definition, it can be derived that the smallest σ-field is {∅, Ω}, and the largest σ-field is the power set 2^Ω of Ω.


Definition A.4 (intersection) The intersection of A and B is the new set consisting of all elements which belong to both A and B. This set has the symbol A ∩ B.

Definition A.5 (complement) The complement of A in a universal set Ω is the subset of Ω consisting of all those objects in Ω which do not belong to A. We denote this subset by Ω \ A. If there is no ambiguity in the universal set taken, it will be abbreviated by A^c.

The most common universal set taken in the literature is the real line, commonly denoted by ℜ. In the sequel, we will adopt this conventional universal set unless otherwise stated.

A helpful device in the study of relations between sets is the Venn diagram. Since it can be found in any fundamental analysis book, we omit it here.

A.2 Supremum and maximum

After introducing the concept of sets, we can then discuss the operation of taking the supremum/maximum and infimum/minimum over a subset of the universal set ℜ, the set of all real numbers.

Definition A.6 (upper bound of a set) A real number u is called an upper bound of a non-empty subset A of ℜ if every element of A is less than, or equal to, u; if the subset A has a finite upper bound, we say that A is bounded above. Symbolically, the definition becomes:

    A ⊂ ℜ is bounded above ⟺ (∃ u ∈ ℜ) such that (∀ a ∈ A) a ≤ u.

The above definition of upper bound can be extended to empty sets or non-bounded-above sets as follows: the upper bound of an empty set is −∞, and that of a non-bounded-above set is ∞. This extended definition will be adopted in the lecture notes.

Definition A.7 (least upper bound) For every non-empty bounded-above subset of ℜ, the collection of upper bounds has a least member, which is called the least upper bound. In addition, the least upper bound of an empty set is −∞, and that of a non-bounded-above set is ∞.


We now distinguish two situations: (i) the least upper bound of a set A belongs to A, and (ii) the least upper bound of a set A does not belong to A. It is quite easy to create examples for both situations. A quick example for (i) is (0, 1], while (0, 1) illustrates (ii). These two situations therefore give rise to two new notions based on the least upper bound.

Definition A.8 (supremum or least upper bound) The least upper bound of A is called the supremum over A, and is denoted by sup A.

As stated above, sup A may or may not be a member of the set A. In the case sup A ∈ A, sup A is actually the greatest element of A, which is called the maximum over A.

Definition A.9 (maximum) If sup A ∈ A, then sup A is also called the maximum over A, and is denoted by max A. However, if sup A ∉ A, then we say that the maximum over A does not exist.

Some properties regarding the supremum are listed below.

Property A.10 (properties of supremum)

1. The supremum always exists in ℜ ∪ {−∞, ∞}. (This is called the completeness axiom.)

2. (∀ a ∈ A) a ≤ sup A.

3. If −∞ < sup A < ∞, then (∀ ε > 0)(∃ a₀ ∈ A) a₀ > sup A − ε. (The existence of a₀ ∈ (sup A − ε, sup A] for any ε > 0 under the condition |sup A| < ∞ is called the approximation property for suprema.)

4. If sup A = ∞, then (∀ L ∈ ℜ)(∃ B₀ ∈ A) B₀ > L.

5. If supA = −∞, then A is empty.

It is quite common in information theory to have to establish that a finite number α is the supremum of a set A. To do this, one must show that α satisfies both properties 2 and 3, i.e.,

    (∀ a ∈ A) a ≤ α                            (A.2.1)

and

    (∀ ε > 0)(∃ a₀ ∈ A) a₀ > α − ε.            (A.2.2)

To be precise, (A.2.1) says that α is an upper bound of A, and (A.2.2) says that no number less than α is an upper bound, so α is the least upper bound or supremum. However, to show that sup A = ∞, one only needs to show property 4, i.e.,

    (∀ L ∈ ℜ)(∃ B₀ ∈ A) B₀ > L.

The last property, for sup A = −∞, holds by definition.

Some properties regarding the maximum are listed below.

Property A.11 (properties of maximum)

1. (∀ a ∈ A) a ≤ max A, if max A exists in ℜ ∪ {−∞, ∞}.

2. max A ∈ A.

From the above property, in order to establish α = max A, one needs to show that α satisfies both

    (∀ a ∈ A) a ≤ α   and   α ∈ A.

A.3 Infimum and minimum

One can define the infimum and minimum based on the concept of the greatest lower bound, which is exactly dual to the definition of the least upper bound. We therefore only give the definitions in this section.

Definition A.12 (lower bound of a set) A real number ℓ is called a lower bound of a non-empty subset A of ℜ if every element of A is greater than, or equal to, ℓ; if the subset A has a finite lower bound, we say that A is bounded below. Symbolically, the definition becomes:

    A ⊂ ℜ is bounded below ⟺ (∃ ℓ ∈ ℜ) such that (∀ a ∈ A) a ≥ ℓ.

Definition A.13 (greatest lower bound) For every non-empty bounded-below subset of ℜ, the collection of lower bounds has a greatest member, which is called the greatest lower bound. In addition, the greatest lower bound of an empty set is ∞, and that of a non-bounded-below set is −∞.

Definition A.14 (infimum or greatest lower bound) The greatest lower bound of A is called the infimum over A, and is denoted by inf A.


Definition A.15 (minimum) If inf A ∈ A, then inf A is also called the minimum over A, and is denoted by min A. However, if inf A ∉ A, we say that the minimum over A does not exist.

Property A.16 (properties of infimum)

1. The infimum always exists in ℜ ∪ {−∞, ∞}. (This is called the completeness axiom.)

2. (∀ a ∈ A) a ≥ inf A.

3. If ∞ > inf A > −∞, then (∀ ε > 0)(∃ a₀ ∈ A) a₀ < inf A + ε. (The existence of a₀ ∈ [inf A, inf A + ε) for any ε > 0 under the assumption |inf A| < ∞ is called the approximation property for infima.)

4. If inf A = −∞, then (∀ L ∈ ℜ)(∃ B₀ ∈ A) B₀ < L.

5. If infA = ∞, then A is empty.

Property A.17 (properties of minimum)

1. (∀ a ∈ A) a ≥ min A, if min A exists in ℜ ∪ {−∞, ∞}.

2. min A ∈ A.

A.4 Boundedness and supremum/infimum operations

Definition A.18 (boundedness) A subset A of ℜ is said to be bounded if it is bounded above and also bounded below; otherwise it is called unbounded.

Lemma A.19 (condition for boundedness) A subset A of ℜ is bounded if, and only if, (∃ k ∈ ℜ) such that (∀ a ∈ A) |a| ≤ k.

Lemma A.20 (monotone property) Suppose that A and B are non-empty subsets of ℜ, and A ⊂ B. Then

1. sup A ≤ sup B.

2. inf A ≥ inf B.


The next lemma is useful in proving some theorems in information theory.

Lemma A.21 (supremum for set operations) Define the “addition” of two sets A and B to be

A + B ≜ {c ∈ ℝ : c = a + b for some a ∈ A and b ∈ B}.

Define the “scalar multiplication” of a set A by k to be

k · A ≜ {c ∈ ℝ : c = k · a for some a ∈ A}.

Define the “negation” of a set A to be

−A ≜ {c ∈ ℝ : c = −a for some a ∈ A}.

Then

1. If A and B are bounded above, then A + B is also bounded above and sup(A + B) = sup A + sup B.

2. If 0 < k < ∞ and A is bounded above, then k · A is also bounded above and sup(k · A) = k · sup A.

3. sup A = − inf(−A) and inf A = − sup(−A).

A similar result for the “product” of sets is in fact not necessarily true! In other words, define the “product” of the sets A and B as

A · B ≜ {c ∈ ℝ : c = ab for some a ∈ A and b ∈ B}.

Then both of the following situations can happen:

sup(A · B) > (sup A) · (sup B)

sup(A · B) = (sup A) · (sup B).
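The following small Python sketch (with illustrative finite sets chosen here, so that sup coincides with max) verifies the addition rule of Lemma A.21 and exhibits one pair of sets for which sup(A · B) strictly exceeds (sup A)(sup B).

    # Illustrative finite sets (not from the notes): the sum rule holds, the product
    # rule does not have to.
    A = [-2.0, -1.0]
    B = [-3.0, -1.0]

    sum_set = [a + b for a in A for b in B]
    prod_set = [a * b for a in A for b in B]

    print(max(sum_set), "=", max(A) + max(B))      # -2.0 = -2.0, as Lemma A.21 asserts
    print(max(prod_set), ">", max(A) * max(B))     # 6.0 > 1.0, so sup(A.B) > (sup A)(sup B)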

Lemma A.22 (supremum/infimum for a monotone function)

1. If f(x) is a non-decreasing real-valued function of x, then

sup{x ∈ ℝ : f(x) < ε} = inf{x ∈ ℝ : f(x) ≥ ε}

and

sup{x ∈ ℝ : f(x) ≤ ε} = inf{x ∈ ℝ : f(x) > ε}.

2. If f(x) is a non-increasing real-valued function of x, then

sup{x ∈ ℝ : f(x) > ε} = inf{x ∈ ℝ : f(x) ≤ ε}

and

sup{x ∈ ℝ : f(x) ≥ ε} = inf{x ∈ ℝ : f(x) < ε}.

Readers may refer to Figure A.1 for an illustrated example of Lemma A.22.


A.5 Sequences and their limits

Let N denote the set of “natural numbers” (positive integers) {1, 2, 3, · · ·}. A sequence is a real-valued function on N, denoted by

f : N → ℝ.

In other words, f(n) is a real number for each n = 1, 2, 3, · · · . It is usual to write f(n) = a_n, and we often indicate the sequence by the notations

a_1, a_2, a_3, · · · , a_n, · · · = {a_n}_{n=1}^∞.

One important question that arises with a sequence is what happens when n gets large. To be precise, we want to know whether, once n is large enough, every a_n is close to some fixed number L (which is the limit of a_n).

Definition A.23 (limit) The limit of {a_n}_{n=1}^∞ is the real number L satisfying: (∀ ε > 0)(∃ N) such that (∀ n > N)

|a_n − L| < ε.

If no such L satisfies the above statement, we say that the limit of {a_n}_{n=1}^∞ does not exist.

Note that in the above definition, ±∞ is not a legitimate limit for any sequence. In fact, if (∀ L)(∃ N) such that (∀ n > N) a_n > L, then we say that a_n diverges to ∞, denoted by a_n → ∞. A similar convention applies to a_n diverging to −∞. For convenience, we adopt in these lecture notes the viewpoint that when a_n diverges to ∞ or −∞, the limit of a_n exists in ℝ ∪ {−∞, ∞}. Based on this viewpoint, some properties regarding the limit of a sequence are quoted below.

Property A.24

1. lim_{n→∞}(a_n + b_n) = lim_{n→∞} a_n + lim_{n→∞} b_n.

2. lim_{n→∞}(α · a_n) = α · lim_{n→∞} a_n.

3. lim_{n→∞}(a_n b_n) = (lim_{n→∞} a_n)(lim_{n→∞} b_n).

Lemma A.25 (convergence of monotone sequences) If a_n is non-decreasing in n, then lim_{n→∞} a_n exists in ℝ ∪ {−∞, ∞}. Likewise, if a_n is non-increasing in n, then lim_{n→∞} a_n exists in ℝ ∪ {−∞, ∞}.


As stated above, the limit of a sequence may not exist! For example, take a_n = (−1)^n. Then a_n is close to either −1 or 1 for n large. Hence, a more general definition that can describe the general limiting behavior of a sequence is required.

Definition A.26 (limsup and liminf) The limit supremum of {a_n}_{n=1}^∞ is the extended real number^3

lim sup_{n→∞} a_n ≜ lim_{n→∞} ( sup_{k≥n} a_k ),

and the limit infimum of {a_n}_{n=1}^∞ is the extended real number

lim inf_{n→∞} a_n ≜ lim_{n→∞} ( inf_{k≥n} a_k ).

(Some also use the notations lim with an overbar and lim with an underbar to denote limsup and liminf, respectively.)

Note that the limit supremum and limit infimum of a sequence are always defined in the extended real number system, since the sequences sup_{k≥n} a_k = sup{a_k : k ≥ n} and inf_{k≥n} a_k = inf{a_k : k ≥ n} are monotone in n (cf. the lemma on the convergence of monotone sequences, Lemma A.25). An immediate consequence of the definitions of limsup and liminf then follows.

Lemma A.27 (limit) lim_{n→∞} a_n = L if, and only if,

lim sup_{n→∞} a_n = lim inf_{n→∞} a_n = L.

Some properties regarding the limsup and liminf of sequences (which are parallel to Properties A.10 and A.16) are listed below.

Property A.28 (properties of limit supremum)

1. The limit supremum always exists in the extended real number system.

2. If | lim sup_{n→∞} a_n| < ∞, then (∀ ε > 0)(∃ N) such that (∀ n > N) a_n < lim sup_{n→∞} a_n + ε. (Note that this holds for every n > N.)

3. If | lim sup_{n→∞} a_n| < ∞, then (∀ ε > 0 and integer K)(∃ N > K) such that a_N > lim sup_{n→∞} a_n − ε. (Note that this holds only for one N, which is larger than K.)

3By extended real number, we mean that it is either a real number or ±∞.


Property A.29 (properties of limit infimum)

1. The limit infimum always exists in the extended real number system.

2. If | lim inf_{n→∞} a_n| < ∞, then (∀ ε > 0 and K)(∃ N > K) such that a_N < lim inf_{n→∞} a_n + ε. (Note that this holds only for one N, which is larger than K.)

3. If | lim inf_{n→∞} a_n| < ∞, then (∀ ε > 0)(∃ N) such that (∀ n > N) a_n > lim inf_{n→∞} a_n − ε. (Note that this holds for every n > N.)

The statements of the above two properties may look too “mathematical” and may confuse students who are not familiar with such formulations. We therefore introduce two terminologies that are often used in information theory: sufficiently large and infinitely often.

Definition A.30 (sufficiently large) We say that a property holds for a sequence a_n almost always or for all sufficiently large n, if the property holds for every n > N for some N.

Definition A.31 (infinitely often) We say that a property holds for a sequence a_n infinitely often or for infinitely many n, if for every K, the property holds for one (specific) N with N > K.

Then Property A.28 can be re-phrased as: if | lim sup_{n→∞} a_n| < ∞, then (∀ ε > 0)

a_n < lim sup_{n→∞} a_n + ε for all sufficiently large n

and

a_n > lim sup_{n→∞} a_n − ε for infinitely many n.

Similarly, Property A.29 becomes: if | lim inf_{n→∞} a_n| < ∞, then (∀ ε > 0)

a_n < lim inf_{n→∞} a_n + ε for infinitely many n

and

a_n > lim inf_{n→∞} a_n − ε for all sufficiently large n.

In terms of these two terminologies, we state the following lemma.


Lemma A.32

1. lim inf_{n→∞} a_n ≤ lim sup_{n→∞} a_n.

2. If a_n ≤ b_n for all sufficiently large n, then

lim inf_{n→∞} a_n ≤ lim inf_{n→∞} b_n and lim sup_{n→∞} a_n ≤ lim sup_{n→∞} b_n.

3. lim sup_{n→∞} a_n < r ⇒ a_n < r for all sufficiently large n.

4. lim sup_{n→∞} a_n > r ⇒ a_n > r for infinitely many n.

5. lim inf_{n→∞} a_n + lim inf_{n→∞} b_n ≤ lim inf_{n→∞} (a_n + b_n)
   ≤ lim sup_{n→∞} a_n + lim inf_{n→∞} b_n
   ≤ lim sup_{n→∞} (a_n + b_n)
   ≤ lim sup_{n→∞} a_n + lim sup_{n→∞} b_n.

6. If lim_{n→∞} a_n exists, then

lim inf_{n→∞} (a_n + b_n) = lim_{n→∞} a_n + lim inf_{n→∞} b_n

and

lim sup_{n→∞} (a_n + b_n) = lim_{n→∞} a_n + lim sup_{n→∞} b_n.

Finally, readers may also interpret the limit supremum and limit infimum in terms of the concept of clustering points. A clustering point is a point that the sequence a_n gets close to infinitely many times. For example, if a_n = sin(nπ/2), then {a_n}_{n≥1} = {1, 0, −1, 0, 1, 0, −1, 0, . . .}. Hence, there are three clustering points in this sequence, namely −1, 0 and 1. The limit supremum is then nothing but the largest clustering point, and the limit infimum is exactly the smallest clustering point. Specifically, lim sup_{n→∞} a_n = 1 and lim inf_{n→∞} a_n = −1. This viewpoint is sometimes more direct than viewing the limsup and liminf quantities purely from their definitions.
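As a small illustration of Definition A.26 and the clustering-point viewpoint, the following Python sketch (with an illustrative truncation level N, an assumption made only for this demo) approximates limsup and liminf of a_n = sin(nπ/2) through the monotone tail sequences sup_{k≥n} a_k and inf_{k≥n} a_k.

    # Approximate limsup/liminf of a_n = sin(n*pi/2) via the tail sup and tail inf.
    import math

    N = 1000
    a = [math.sin(n * math.pi / 2) for n in range(1, N + 1)]

    tail_sups = [max(a[n:]) for n in range(N - 10)]   # sup over the tail {a_k : k >= n}
    tail_infs = [min(a[n:]) for n in range(N - 10)]   # inf over the tail {a_k : k >= n}

    # The tail sups are non-increasing and the tail infs non-decreasing, so their
    # last values approximate the limits in Definition A.26.
    print("limsup ~", round(tail_sups[-1]))   # 1, the largest clustering point
    print("liminf ~", round(tail_infs[-1]))   # -1, the smallest clustering point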


A.6 Equivalence

We close this chapter by providing some equivalent statements that are often used to simplify proofs in information theory. For example, instead of directly showing that quantity x is less than, or equal to, quantity y, one can fix a constant ε > 0 and prove x < y + ε. Since y + ε is a larger quantity than y, in some cases it might be easier to show x < y + ε than x ≤ y. By the next theorem, any proof that concludes “x < y + ε for all ε > 0” immediately gives the desired result, x ≤ y.

Theorem A.33 For any x, y and a in ℝ,

1. x < y + ε for all ε > 0 if, and only if, x ≤ y;

2. x < y − ε for some ε > 0 if, and only if, x < y;

3. x > y − ε for all ε > 0 if, and only if, x ≥ y;

4. x > y + ε for some ε > 0 if, and only if, x > y;

5. |a| < ε for all ε > 0 if, and only if, a = 0.


[Figure A.1 appears here. For a non-decreasing f(x), it marks the common point sup{x : f(x) < ε} = inf{x : f(x) ≥ ε} and the point sup{x : f(x) ≤ ε} = inf{x : f(x) > ε}; for a non-increasing f(x), it marks sup{x : f(x) ≥ ε} = inf{x : f(x) < ε} and sup{x : f(x) > ε} = inf{x : f(x) ≤ ε}.]

Figure A.1: Illustrated example for Lemma A.22.


Appendix B

Mathematical Background on Probability and Stochastic Processes

B.1 Concept of source and channel and some frequently used mathematical models

In communication theory, the informational messages are usually represented by a random process that is referred to as the source. Its statistical structure is completely characterized by an associated probability space.

B.2 Probability space

A probability space is a triple (Ω, F, P), where Ω is the set of all possible outcomes (often named the sample space), F is a σ-field of subsets of Ω (often named the event space), and P is a probability measure on the σ-field, which satisfies

1. 0 ≤ P (A) ≤ 1 for A ∈ F ;

2. P (∅) = 0 and P (Ω) = 1, where ∅ represents the empty set.

3. (countable additivity) If {A_1, A_2, . . .} is a disjoint sequence of sets in F, then

P( ⋃_{k=1}^∞ A_k ) = Σ_{k=1}^∞ P(A_k).

B.3 Random variable and random process

A random variable X defined over the probability space (Ω, F, P) is a real-valued function (i.e., X : Ω → ℝ) satisfying that the set {ω : X(ω) = x} ∈ F for each


real x.1

A random process is a collection of random variables that arise from the same probability space. It can be mathematically represented by the collection

{X_t, t ∈ I},

where X_t denotes the t-th random variable in the process, and the index t runs over an index set I which is arbitrary. The index set I can be uncountably infinite (e.g., I = ℝ), in which case we are effectively dealing with a continuous-time process. We will, however, exclude such a case in this chapter for the sake of simplicity. To be precise, we will only consider the following cases of index set I:

case a) I consists of one index only.
case b) I is finite.
case c) I is countably infinite.

B.4 Observation probability space

The observation probability space (Ω_X, F_X, P_X) of a random variable X defined over the probability space (Ω, F, P) satisfies:

Ω_X = X(Ω),
F_X = {X(A) ⊂ ℝ : A ∈ F},
P_X(G) = P({ω ∈ Ω : X(ω) ∈ G}) for any G ∈ F_X,

where X(A) = {x ∈ ℝ : X(ω) = x for some ω ∈ A}.

In applications, we are perhaps more interested in the observation probability space than in the inherited probability space on which the random variables and

^1 A student from an engineering department may question why we bother to define random variables based on some abstract probability space. He may continue: “A random variable X can simply be defined based on its distribution P_X,” which is indeed true (cf. Section B.4).

A perhaps easier way to understand the abstract definition of random variables is that the inherited probability space (Ω, F, P) on which the random variable is defined describes what truly occurs internally, but is possibly non-observable. In order to infer which non-observable ω occurs, an experiment that results in an observable x, a function of ω, is performed. Such an experiment results in the random variable X, whose probability is thereby defined over the probability space (Ω, F, P).

Another merit of defining random variables based on an abstract probability space can be observed from the extension of random variables to random processes. With the inherited probability space, any finite-dimensional distribution of {X_t, t ∈ I} is well-defined. For example,

Pr[X_1 ≤ x_1, X_5 ≤ x_5, X_9 ≤ x_9] = P({ω ∈ Ω : X_1(ω) ≤ x_1, X_5(ω) ≤ x_5, X_9(ω) ≤ x_9}).


random processes are defined. It can be proved [1, Thm. 14.1] that given a real-valued non-negative function F(·) satisfying lim_{x↓−∞} F(x) = 0 and lim_{x↑∞} F(x) = 1, there exist a random variable and an inherited probability space such that the cumulative distribution function (cdf) of the random variable defined over that probability space is equal to F(·). This result relieves us of the burden of referring to a probability space before defining a random variable. In other words, we can define a random variable X directly by its cdf, i.e., Pr[X ≤ x], without bothering to refer to its inherited probability space. Nevertheless, it is better to keep in mind (and learn) that the formal mathematical notion of random variables and random processes is defined over some inherited probability space.

In what follows, you will notice that most of the properties of random variables (random processes) are defined simply based on their observation probability spaces.

B.5 Relation between a source and a random process

In statistical communication, a discrete source (X_1, X_2, X_3, . . . , X_n) ≜ X^n consists of a sequence of random quantities, where each quantity usually takes values from a source generic alphabet X, namely

(X_1, X_2, . . . , X_n) ∈ X × X × · · · × X ≜ X^n.

The observation probability space of X_i is (X, F_{X_i}, P_{X_i}). The elements of X are usually called letters.

B.6 Statistical properties of random sources

An event A in the observation event space F_X for a random process

X = {. . . , X_{−3}, X_{−2}, X_{−1}, X_0, X_1, X_2, X_3, . . .}

is said to be invariant with respect to a time shift (or shift transformation) if it is unchanged by the time shift; i.e., if we apply the (reverse) time shift operator to the elements of A, we simply get the set A again. For example, if the source alphabet is {0, 1}^∞, then the event

A = {(. . . , x_{−1} = 0, x_0 = 1, x_1 = 0, x_2 = 1, . . .), (. . . , x_{−1} = 1, x_0 = 0, x_1 = 1, x_2 = 0, . . .)}

is shift-invariant.

We now classify several useful statistical properties of random processes.


• Memoryless: A random process

X = {. . . , X_{−2}, X_{−1}, X_0, X_1, X_2, . . .}

is said to be memoryless if the random variables X_i, i = · · · , −1, 0, 1, · · · , are independent and identically distributed (i.i.d.).

• First-order stationary: A process is first-order stationary if the marginal distribution is the same for every time instant.

• Second-order stationary: A process is second-order stationary if the joint distribution of any two (not necessarily consecutive) time instances is invariant with respect to time shifts.

• Weakly stationary process: A process is weakly stationary (or wide-sense stationary, or stationary in the weak or wide sense) if its mean and auto-correlation function are unchanged by a time shift.

• Stationary process: A process is stationary (or strictly stationary) if the probability of every sequence or event is unchanged by a time shift.

• Ergodic process: A process is ergodic if any invariant event in the observation event space F_X has probability either 1 or 0. This definition is not very intuitive, but some interpretations and examples may shed a little light. First, observe that the definition has nothing to do with stationarity. It simply states that events that are unaffected by time-shifting must have probability either zero or one.

The importance of ergodicity derives from the fact that if one wishes to verify that all convergent sample averages^2 converge to a constant, rather than to a random variable, then a necessary condition is that the process be ergodic. For example, ergodicity implies that at most one of the following

^2 Two alternative names for the sample average are time average and Cesàro mean. In these lecture notes, these names will be used interchangeably.


time-shift invariant events has probability one.

{x_{−∞}^{∞} ∈ {0, 1}_{−∞}^{∞} : lim_{n→∞} (x_{−n} + · · · + x_0 + · · · + x_n)/(2n + 1) = 0.0}

{x_{−∞}^{∞} ∈ {0, 1}_{−∞}^{∞} : lim_{n→∞} (x_{−n} + · · · + x_0 + · · · + x_n)/(2n + 1) = 0.1}

...

{x_{−∞}^{∞} ∈ {0, 1}_{−∞}^{∞} : lim_{n→∞} (x_{−n} + · · · + x_0 + · · · + x_n)/(2n + 1) = 0.9}

{x_{−∞}^{∞} ∈ {0, 1}_{−∞}^{∞} : lim_{n→∞} (x_{−n} + · · · + x_0 + · · · + x_n)/(2n + 1) = 1.0}   (B.6.1)

Note that since |x_j| ≤ 1 (boundedness), the limits (if they exist) for all shifted versions of an outcome are the same; i.e., under the assumption that both limits exist,

| lim_{n→∞} (x_{−n} + · · · + x_n)/(2n + 1) − lim_{n→∞} (x_{−n+1} + · · · + x_{n+1})/(2n + 1) |
= lim_{n→∞} | (x_{−n} + · · · + x_n)/(2n + 1) − (x_{−n+1} + · · · + x_{n+1})/(2n + 1) |
= lim_{n→∞} | (x_{−n} − x_{n+1})/(2n + 1) |
≤ lim_{n→∞} 2/(2n + 1) = 0;

so the sets in (B.6.1) are all time-shift invariant. Ergodicity then implies that the time average should converge to some constant. It needs to be pointed out that in the above example, ergodicity does not guarantee that this constant equals the ensemble average. A quick example is that Pr{(. . . , x_{−1} = 0, x_0 = 1, x_1 = 0, x_2 = 1, . . .)} = 0.2 and Pr{(. . . , x_{−1} = 1, x_0 = 0, x_1 = 1, x_2 = 0, . . .)} = 0.8 assure the validity of ergodicity, but

(X_{−n} + · · · + X_0 + · · · + X_n)/(2n + 1)

converges to 1/2, which is not equal to E[X_i] for any i. In summary, ergodicity (and boundedness in values) implies the convergence of the time average to some constant, not necessarily the ensemble average, while stationarity assures that the time average converges to a random variable; hence, it is reasonable to expect that they jointly imply that the ultimate time average equals the ensemble average. This is validated by the well-known ergodic theorem of Birkhoff and Khinchin.


Theorem B.1 (pointwise ergodic theorem) Given a discrete-time stationary random process {X_n}_{−∞<n<∞}. For an arbitrary real-valued function f(·) on ℝ with finite mean (i.e., |E[f(X_n)]| < ∞), there exists a random variable Y such that for any u,^3

lim_{n→∞} (1/n) Σ_{k=1−u}^{n−u} f(X_k) = Y with probability 1.

If, in addition to stationarity, the process is also ergodic, then for any u,

lim_{n→∞} (1/n) Σ_{k=1−u}^{n−u} f(X_k) = E[f(X_0)] with probability 1.

Example B.2 Consider the process {X_i}_{i=−∞}^∞ consisting of a family of i.i.d. binary random variables (obviously, it is stationary and ergodic). Define the function f(·) by f(0) = 0 and f(1) = 1. Hence,^4

E[f(X)] = P_X(0)f(0) + P_X(1)f(1) = P_X(1)

is finite. By the pointwise ergodic theorem, we have

lim_{n→∞} (f(X_1) + f(X_2) + · · · + f(X_n))/n = lim_{n→∞} (X_1 + X_2 + · · · + X_n)/n = P_X(1).

As learned from the above example, one of the important consequences of the pointwise ergodic theorem is that the time average can ultimately replace the statistical average, which is a useful result in engineering. Hence, under stationarity and ergodicity, one who observes

X_1^{30} = 154326543334225632425644234443

from a dice-rolling experiment can draw the conclusion that the true distribution of the dice rolls is well approximated by:

Pr{X_i = 1} ≈ 1/30,  Pr{X_i = 2} ≈ 6/30,  Pr{X_i = 3} ≈ 7/30,
Pr{X_i = 4} ≈ 9/30,  Pr{X_i = 5} ≈ 4/30,  Pr{X_i = 6} ≈ 3/30.

^3 The modes of convergence, including convergence with probability 1, will be discussed in detail in the next section.

^4 In our notation, P_X(0) denotes Pr{X = 0}. The two representations will be used interchangeably throughout the lecture notes.


This result is also known as the law of large numbers. The relation between ergodicity and the law of large numbers will be further explored in Section B.8.
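A brief simulation sketch of this idea is given below; the die bias, the random seed and the sample length are illustrative assumptions, not values from the notes. For a stationary ergodic (here simply i.i.d.) source, relative frequencies computed from one long observed sequence approximate the true distribution.

    # Estimate a die's distribution from relative frequencies of one long realization.
    import random
    from collections import Counter

    true_pmf = {1: 0.05, 2: 0.20, 3: 0.25, 4: 0.30, 5: 0.10, 6: 0.10}  # assumed bias
    random.seed(0)

    n = 100_000
    faces = list(true_pmf)
    weights = [true_pmf[f] for f in faces]
    sample = random.choices(faces, weights=weights, k=n)

    counts = Counter(sample)
    for f in faces:
        print(f, "relative frequency:", counts[f] / n, " true probability:", true_pmf[f])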

We close the discussion on ergodicity by remarking that in the theory of communications, people may assume that the source is stationary or that the source is stationary ergodic. It is rare, however, to see the assumption of a source being ergodic but non-stationary. This is perhaps because an ergodic but non-stationary source not only fails to facilitate the analytical study of communications problems, but also seems to have no application in practice. From this, we learn that assumptions are made either to facilitate the analytical study of a communications problem or to fit a specific need of applications. Without either of these two footings, an assumption becomes of minor interest. This, to some extent, justifies why the ergodicity assumption usually comes after the stationarity assumption. A specific example is the pointwise ergodic theorem, where the random process considered is presumed to be stationary!

• First-order Markov chain: Three random variables X, Y and Z are said to form a Markov chain or a first-order Markov chain if

P_{X,Y,Z}(x, y, z) = P_X(x) · P_{Y|X}(y|x) · P_{Z|Y}(z|y);   (B.6.2)

i.e., P_{Z|X,Y}(z|x, y) = P_{Z|Y}(z|y). This is usually denoted by X → Y → Z.

X → Y → Z is sometimes read as “X and Z are conditionally independent given Y” because it can be shown that (B.6.2) is equivalent to

P_{X,Z|Y}(x, z|y) = P_{X|Y}(x|y) · P_{Z|Y}(z|y).

Therefore, X → Y → Z is equivalent to Z → Y → X. Accordingly, the Markovian notation is sometimes expressed as X ↔ Y ↔ Z.

• Markov chain for random sequences: The random variables X_1, X_2, X_3, . . . are said to form a k-th order Markov chain if

Pr[X_n = x_n | X_{n−1} = x_{n−1}, . . . , X_1 = x_1] = Pr[X_n = x_n | X_{n−1} = x_{n−1}, . . . , X_{n−k} = x_{n−k}].

Each x_{n−k}^{n−1} ∈ X^k is called the state at time n.

A Markov chain is irreducible if, with some positive probability, we can go from any state in X^k to any other state in a finite number of steps, i.e.,

(∀ x^k, y^k ∈ X^k)(∃ an integer j) Pr{X_j^{k+j−1} = x^k | X_1^k = y^k} > 0.


[Figure B.1 appears here, depicting the relations among the classes of i.i.d., Markov, stationary and ergodic processes.]

Figure B.1: General relations of random processes.

A Markov chain is said to be time-invariant or homogeneous if, for every n > k,

Pr[X_n = x_n | X_{n−1} = x_{n−1}, . . . , X_{n−k} = x_{n−k}] = Pr[X_{k+1} = x_{k+1} | X_k = x_k, . . . , X_1 = x_1].

Therefore, a homogeneous first-order Markov chain can be defined through its transition probability matrix

[Pr{X_2 = x_2 | X_1 = x_1}]_{|X|×|X|}

and its initial state distribution P_{X_1}(x). A distribution π(x) on X is said to be a stationary distribution for a homogeneous first-order Markov chain if, for every y ∈ X,

π(y) = Σ_{x∈X} π(x) Pr{X_2 = y | X_1 = x}.

If the initial state distribution is equal to the stationary distribution, then the homogeneous first-order Markov chain is a stationary process.
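As a small illustration, the following Python sketch (with an assumed two-state transition matrix, not one from the notes) finds a stationary distribution π by iterating the fixed-point relation π(y) = Σ_x π(x) Pr{X_2 = y | X_1 = x}.

    # Fixed-point iteration for the stationary distribution of a two-state chain.
    P = [[0.9, 0.1],   # P[x][y] = Pr{X_2 = y | X_1 = x}, an illustrative example
         [0.3, 0.7]]

    pi = [0.5, 0.5]                       # any initial distribution
    for _ in range(1000):                 # iterate pi <- pi * P until it stabilizes
        pi = [sum(pi[x] * P[x][y] for x in range(2)) for y in range(2)]

    print(pi)   # approximately [0.75, 0.25], which satisfies pi = pi * P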

The general relations among i.i.d. sources, Markov sources, stationary sources and ergodic sources are depicted in Figure B.1.


B.7 Convergence of sequences of random variables

In this section, we will discuss the modes in which a random process X_1, X_2, . . . converges to a limiting random variable X. Recall that a random variable is a real-valued function from Ω to ℝ, where Ω is the sample space of the probability space over which the random variable is defined. So the following two expressions will be used interchangeably:

X_1(ω), X_2(ω), X_3(ω), . . . ≡ X_1, X_2, X_3, . . .

for ω ∈ Ω. Note that the random variables in a random process are defined over the same probability space (Ω, F, P).

Definition B.3 (convergence modes for a random sequence)

1. Pointwise convergence on Ω.

{X_n}_{n=1}^∞ is said to converge to X pointwise on Ω if

(∀ ω ∈ Ω) lim_{n→∞} X_n(ω) = X(ω).

This is usually denoted by X_n →^{p.w.} X.

2. Almost sure convergence or convergence with probability 1.

{X_n}_{n=1}^∞ is said to converge to X with probability 1 if

P{ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)} = 1.

This is usually denoted by X_n →^{a.s.} X.

3. Convergence in probability.

{X_n}_{n=1}^∞ is said to converge to X in probability if, for any ε > 0,

lim_{n→∞} P{ω ∈ Ω : |X_n(ω) − X(ω)| > ε} = lim_{n→∞} Pr{|X_n − X| > ε} = 0.

This is usually denoted by X_n →^{p} X.

4. Convergence in r-th mean.

{X_n}_{n=1}^∞ is said to converge to X in r-th mean if

lim_{n→∞} E[|X − X_n|^r] = 0.

This is usually denoted by X_n →^{L^r} X.

5. Convergence in distribution.

{X_n}_{n=1}^∞ is said to converge to X in distribution if

lim_{n→∞} F_{X_n}(x) = F_X(x)

for every continuity point of F_X(x), where

F_{X_n}(x) ≜ Pr{X_n ≤ x} and F_X(x) ≜ Pr{X ≤ x}.

This is usually denoted by X_n →^{d} X.
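The following short Python sketch (an illustrative setup, not from the notes) contrasts convergence in probability with convergence in distribution: with X standard Gaussian, X_n = (1 + 1/n)X converges to X in probability, while Y_n = −X has the same distribution as X for every n (hence converges to X in distribution) even though |Y_n − X| = 2|X| never shrinks.

    # Contrast convergence in probability with convergence in distribution.
    import random

    random.seed(1)
    samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
    eps = 0.1

    for n in (10, 100, 1000):
        xn_dev = sum(abs((1 + 1/n) * x - x) > eps for x in samples) / len(samples)
        yn_dev = sum(abs(-x - x) > eps for x in samples) / len(samples)
        print(n, "Pr{|X_n - X| > eps} ~", round(xn_dev, 3),
              "  Pr{|Y_n - X| > eps} ~", round(yn_dev, 3))
    # The first probability tends to 0 as n grows (convergence in probability);
    # the second stays near Pr{2|X| > eps}, even though Y_n and X share the same cdf.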

An example that facilitates the understanding of pointwise convergence and almost sure convergence is as follows.

Example B.4 Consider a probability space (Ω, 2^Ω, P), where Ω = {0, 1, 2, 3}, P(0) = P(1) = P(2) = 1/3 and P(3) = 0. Define the random variables X_n(ω) = ω/n. Then

Pr{X_n = 0} = Pr{X_n = 1/n} = Pr{X_n = 2/n} = 1/3.

It can be derived that for every ω in Ω, X_n(ω) converges to X(ω), where X(ω) = 0 for every ω ∈ Ω; so

X_n →^{p.w.} X.

Now let X̂(ω) = 0 for ω = 0, 1, 2 and X̂(ω) = 1 for ω = 3. Then both of the following statements are true:

X_n →^{a.s.} X and X_n →^{a.s.} X̂,

since

Pr{ lim_{n→∞} X_n = X̂ } = Σ_{ω=0}^{3} P(ω) · 1{ lim_{n→∞} X_n(ω) = X̂(ω) } = 1,

where 1{·} represents the set indicator function. However, X_n does not converge to X̂ pointwise because lim_{n→∞} X_n(3) ≠ X̂(3). So to speak, pointwise convergence requires “equality” even for those samples without probability mass, which almost sure convergence does not take into consideration.


Observation B.5 (uniqueness of convergence)

1. If X_n →^{p.w.} X and X_n →^{p.w.} Y, then X = Y pointwise, i.e., (∀ ω ∈ Ω) X(ω) = Y(ω).

2. If X_n →^{a.s.} X and X_n →^{a.s.} Y (or X_n →^{p} X and X_n →^{p} Y) (or X_n →^{L^r} X and X_n →^{L^r} Y), then X = Y with probability 1, i.e., Pr{X = Y} = 1.

3. If X_n →^{d} X and X_n →^{d} Y, then F_X(x) = F_Y(x) for every continuity point of F_X(x).

Sometimes, it is easier to establish the convergence of {X_n}_{n=1}^∞ without regard to the properties of the limiting random variable X. This can be done through the next observation.

Observation B.6 (mutual convergence criteria)

1. {X_n}_{n=1}^∞ converges pointwise if, and only if,

(∀ ω ∈ Ω) lim_{n→∞} |X_{n+1}(ω) − X_n(ω)| = 0.

2. {X_n}_{n=1}^∞ converges with probability 1 if, and only if,

P{ω ∈ Ω : lim_{n→∞} |X_{n+1}(ω) − X_n(ω)| = 0} = Pr{ lim_{n→∞} |X_{n+1} − X_n| = 0 } = 1.

3. {X_n}_{n=1}^∞ converges in probability if, and only if, for every ε > 0,

lim_{n→∞} P{ω ∈ Ω : |X_{n+1}(ω) − X_n(ω)| > ε} = lim_{n→∞} Pr{|X_{n+1} − X_n| > ε} = 0.

4. {X_n}_{n=1}^∞ converges in r-th mean if, and only if,

lim_{n→∞} E[|X_{n+1}(ω) − X_n(ω)|^r] = lim_{n→∞} E[|X_{n+1} − X_n|^r] = 0.

Note that it is tempting to think that a similar mutual convergence criterion could exist for convergence in distribution. Yet, this is unfortunately not true in general. A simple example is Pr{X_n = n} = 1 for every n. Then

(∀ x ∈ ℝ) lim_{n→∞} |F_{X_{n+1}}(x) − F_{X_n}(x)| = 0,

but {X_n}_{n=1}^∞ does not converge in distribution to any random variable.

For ease of memorization, the relations among the five modes of convergence can be depicted as follows. As usual, a double arrow denotes implication.


[Diagram: X_n →^{p.w.} X ⇒ X_n →^{a.s.} X ⇒ X_n →^{p} X ⇒ X_n →^{d} X; in addition, X_n →^{L^r} X (r ≥ 1) ⇒ X_n →^{p} X. Dotted arrows, labeled Thm. B.7 and Thm. B.8, relate X_n →^{a.s.} X to X_n →^{L^1} X under the additional conditions stated below.]

There are some other relations among these five convergence modes that are also depicted in the above graph (by dotted lines). They are stated below.

Theorem B.7 (monotone convergence theorem)

X_n →^{a.s.} X, (∀ n) Y ≤ X_n ≤ X_{n+1}, and E[|Y|] < ∞  ⇒  X_n →^{L^1} X  ⇒  E[X_n] → E[X].

Theorem B.8 (dominated convergence theorem)

X_n →^{a.s.} X, (∀ n) |X_n| ≤ Y, and E[|Y|] < ∞  ⇒  X_n →^{L^1} X  ⇒  E[X_n] → E[X].

The implication from X_n →^{L^1} X to E[X_n] → E[X] can be easily seen from

|E[X_n] − E[X]| = |E[X_n − X]| ≤ E[|X_n − X|].

B.8 Ergodicity and laws of large numbers

B.8.1 Laws of large numbers

Consider a random process . . . , X_{−2}, X_{−1}, X_0, X_1, X_2, . . . with a common marginal ensemble mean. Suppose that we wish to estimate the ensemble mean µ on the basis of the observed sequence x_1, x_2, x_3, . . . . The weak and strong laws of large numbers ensure that such inference is possible (with reasonable accuracy), provided that the dependencies between the X_n's are suitably restricted: e.g., the weak law is valid for uncorrelated X_n's, while the strong law is valid for independent X_n's. Since independence is a more restrictive condition than the absence of correlation, one expects the strong law to be more powerful than the weak law. This is indeed the case, as the weak law states that the sample average

(X_1 + · · · + X_n)/n


converges to µ in probability, while the strong law asserts that this convergence takes place with probability 1.

Theorem B.9 (weak law of large numbers) Let {X_n}_{n=1}^∞ be a sequence of uncorrelated random variables with common mean E[X_i] = µ. If the variables also have common variance, or more generally,

lim_{n→∞} (1/n^2) Σ_{i=1}^n Var[X_i] = 0   (equivalently, (X_1 + · · · + X_n)/n →^{L^2} µ),

then the sample average

(X_1 + · · · + X_n)/n

converges to the mean µ in probability.

Proof: By Chebyshev's inequality,

Pr{ |(1/n) Σ_{i=1}^n X_i − µ| ≥ ε } ≤ (1/(n^2 ε^2)) Σ_{i=1}^n Var[X_i].   □

Note that the right-hand side of the above Chebyshev inequality is, up to the factor 1/ε^2, the second moment of the difference between the n-sample average and the mean µ. Thus the variance constraint is equivalent to (X_1 + · · · + X_n)/n →^{L^2} µ, and the Chebyshev bound shows that this L^2-convergence implies convergence in probability to µ.
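A brief simulation sketch of the weak law follows; the i.i.d. Bernoulli(1/2) source, the deviation level ε and the number of trials are illustrative choices for this demo. It compares the empirical deviation probability with the Chebyshev bound Var[X_1]/(nε^2) obtained from the proof above.

    # Empirical deviation probability of the sample mean versus the Chebyshev bound.
    import random

    random.seed(2)
    eps, trials = 0.05, 500
    p, var = 0.5, 0.25

    for n in (100, 1000, 10000):
        deviations = 0
        for _ in range(trials):
            avg = sum(random.random() < p for _ in range(n)) / n
            deviations += abs(avg - p) > eps
        print(n, "empirical Pr ~", deviations / trials,
              " Chebyshev bound:", var / (n * eps * eps))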

Theorem B.10 (Kolmogorov's strong law of large numbers) Let {X_n}_{n=1}^∞ be an independent sequence of random variables with common mean E[X_n] = µ. If either

1. the X_n's are identically distributed; or

2. the X_n's are square-integrable with Σ_{i=1}^∞ Var[X_i]/i^2 < ∞,

then

(X_1 + · · · + X_n)/n →^{a.s.} µ.


Note that the above i.i.d. assumption does not exclude the possibility of µ = ∞ (or µ = −∞), in which case the sample average converges to ∞ (or −∞) with probability 1. Also note that there are cases of independent sequences to which the weak law applies, but the strong law does not. This is due to the fact that

Σ_{i=1}^n Var[X_i]/i^2 ≥ (1/n^2) Σ_{i=1}^n Var[X_i].

The final remark is that Kolmogorov's strong law of large numbers can be extended to a function of an independent sequence of random variables:

(g(X_1) + · · · + g(X_n))/n →^{a.s.} E[g(X_1)].

Such an extension cannot in general be applied to the weak law of large numbers, since g(Y_i) and g(Y_j) can be correlated even if Y_i and Y_j are not.

B.8.2 Ergodicity and strong law of large numbers

After the introduction of Kolmogorov's strong law of large numbers, one may find that the pointwise ergodic theorem (Theorem B.1) actually indicates a similar result. In fact, the pointwise ergodic theorem can be viewed as another version of the strong law of large numbers, which states that for stationary and ergodic processes, time averages converge with probability 1 to the ensemble expectation.

The notion of ergodicity is often confusing to engineering students. There is, however, some justification for this, since the definition is extremely abstract and not very intuitive. Some engineering texts even take as the definition that a stationary process satisfying the ergodic theorem is ergodic.^5 However, the ergodic theorem is in fact a consequence of the original mathematical definition

^5 Here is one example.

Definition B.11 (ergodicity for stationary sources) A stationary random process {X_n} is called ergodic if for arbitrary k and any function f(·) on X^k with finite mean,

(1/n) Σ_{i=1}^n f(X_{i+1}, . . . , X_{i+k}) →^{a.s.} E[f(X_1, . . . , X_k)].

As a result of this definition, a stationary ergodic source is the most general dependent random process for which the strong law of large numbers holds! This definition somehow implies that if a process is not stationary-ergodic, then the strong law of large numbers is violated (or the time average does not converge with probability 1 to its ensemble expectation). But this is not true. One can weaken the conditions of stationarity and ergodicity from their original mathematical definitions to asymptotic stationarity and ergodicity, and still make the strong law of large numbers hold! (Cf. the last remark in this section and also Figure B.2.)


in terms of the shift-invariance property (cf. Section B.6). To define ergodicity in terms of its consequence only confuses students more.

[Figure B.2 appears here, contrasting ergodicity defined through the shift-invariance property with ergodicity defined through the ergodic theorem, i.e., stationarity together with time averages converging to ensemble averages (the law of large numbers).]

Figure B.2: Relation of ergodic random processes respectively defined through time-shift invariance and the ergodic theorem.

Let us try to clarify the notion of ergodicity by the following remarks.

• The concept of ergodicity does not require stationarity. In other words, a non-stationary process can be ergodic.

• Many perfectly good models of physical processes are not ergodic, yet they satisfy a form of the law of large numbers. In other words, non-ergodic processes can be perfectly good and useful models.

• There is no finite-dimensional equivalent definition of ergodicity as there is for stationarity. This fact makes it more difficult to describe and interpret ergodicity.

• I.i.d. processes are ergodic; hence, ergodicity can be thought of as a (kindof) generalization of i.i.d.

• As mentioned earlier, stationarity and ergodicity imply that the time average converges with probability 1 to the ensemble mean. Now if a process is stationary but not ergodic, then the time average still converges, but possibly not to the ensemble mean.

For example, let {A_n}_{n=−∞}^∞ and {B_n}_{n=−∞}^∞ be two i.i.d. binary 0-1 random processes with Pr{A_n = 0} = Pr{B_n = 1} = 1/4. Suppose that X_n = A_n if U = 1, and X_n = B_n if U = 0, where U is an equiprobable binary random variable, and {A_n}_{n=1}^∞, {B_n}_{n=1}^∞ and U are independent. Then {X_n}_{n=1}^∞ is stationary. Is the process ergodic? The answer is negative. If the stationary process were ergodic, then from the pointwise ergodic theorem (Theorem B.1), its relative frequency would converge to

Pr(X_n = 1) = Pr(U = 1) Pr(X_n = 1 | U = 1) + Pr(U = 0) Pr(X_n = 1 | U = 0)
            = Pr(U = 1) Pr(A_n = 1) + Pr(U = 0) Pr(B_n = 1) = 1/2.

However, one should observe that the outputs of (X_1, . . . , X_n) form a Bernoulli process with relative frequency of 1's being either 3/4 or 1/4, depending on the value of U. Therefore,

(1/n) Σ_{i=1}^n X_i →^{a.s.} Y as n → ∞,

where Pr(Y = 1/4) = Pr(Y = 3/4) = 1/2, which contradicts the ergodic theorem. (A short simulation sketch illustrating this is given after this remark.)

From the above example, the pointwise ergodic theorem can actually be made useful in such a stationary but non-ergodic case, since the estimate with a stationary ergodic process (either {A_n}_{n=−∞}^∞ or {B_n}_{n=−∞}^∞) is actually being observed, by measuring the relative frequency (3/4 or 1/4). This reflects a surprising fundamental result of random processes, the ergodic decomposition theorem: under fairly general assumptions, any (not necessarily ergodic) stationary process is in fact a mixture of stationary ergodic processes, and hence one always observes a stationary ergodic outcome. As in the above example, one always observes either A_1, A_2, A_3, . . . or B_1, B_2, B_3, . . ., depending on the value of U, and both sequences are stationary ergodic (i.e., the time-stationary observation X_n satisfies X_n = U · A_n + (1 − U) · B_n).
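Here is the simulation sketch referred to above (the random seed and sample length are illustrative choices): each run first draws U and then observes one long realization of X_n; the time average lands near 3/4 or 1/4, never near the ensemble mean 1/2.

    # Stationary but non-ergodic mixture: time averages converge, but to a random limit.
    import random

    random.seed(3)
    n = 50_000

    for run in range(5):
        U = random.randint(0, 1)
        # A_n = 1 w.p. 3/4 (used when U = 1); B_n = 1 w.p. 1/4 (used when U = 0)
        p_one = 0.75 if U == 1 else 0.25
        time_avg = sum(random.random() < p_one for _ in range(n)) / n
        print("run", run, "U =", U, "time average ~", round(time_avg, 3))
    # The ensemble mean is Pr{X_n = 1} = 1/2, but no single realization's time
    # average approaches it.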

• The previous remark implies that ergodicity is not required for the strong law of large numbers to be useful. The next question is whether or not stationarity is required. Again the answer is negative! In fact, the main concern of the law of large numbers is the convergence of sample averages to their ensemble expectations. It is reasonable to expect that random processes could exhibit transient behavior that violates the definition of stationarity, yet have sample averages that still converge. One can then introduce the notion of asymptotic stationarity to obtain a law of large numbers.

Accordingly, one should not take the notions of stationarity and ergodicity too seriously (if the main concern is the law of large numbers), since they can be significantly weakened while laws of large numbers still hold (i.e., time averages and relative frequencies have desired and well-defined limits).


B.9 Central limit theorem

In this section, we simply quote the classical statement of the central limit theorem.

Theorem B.12 (central limit theorem) If {X_n}_{n=1}^∞ is a sequence of i.i.d. random variables with common marginal mean µ and variance σ^2, then

(1/√n) Σ_{i=1}^n (X_i − µ) →^{d} N(0, σ^2),

where N(0, σ^2) represents the Gaussian distribution with mean 0 and variance σ^2.
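A quick simulation sketch (with illustrative parameters: uniform summands, n = 400, and 5000 trials) shows the normalized sums behaving as the theorem predicts: their empirical variance is close to σ^2 and roughly the Gaussian fraction of them falls within one standard deviation.

    # Normalized sums of i.i.d. Uniform(0, 1) variables behave approximately Gaussian.
    import math
    import random

    random.seed(4)
    mu, sigma2 = 0.5, 1.0 / 12.0      # mean and variance of Uniform(0, 1)
    n, trials = 400, 5000

    sums = []
    for _ in range(trials):
        s = sum(random.random() - mu for _ in range(n)) / math.sqrt(n)
        sums.append(s)

    emp_var = sum(s * s for s in sums) / trials
    frac_within = sum(abs(s) <= math.sqrt(sigma2) for s in sums) / trials
    print("empirical variance ~", round(emp_var, 4), " (sigma^2 =", round(sigma2, 4), ")")
    print("fraction within one sigma ~", round(frac_within, 3), " (Gaussian value ~ 0.683)")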

B.10 Concavity, convexity and Jensen’s inequality

Jensen’s inequality is a useful mathematical bound for the expectation of convex(or concave) functions.

Definition B.13 (convexity) A function f(x) is said to be convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2).

Furthermore, a function f is said to be strictly convex if equality holds only when λ = 0 or λ = 1.

Definition B.14 (concavity) A function f is concave if −f is convex.

Note that when a function has a non-negative (resp. positive) second derivative over (a, b), the function is convex (resp. strictly convex) on that interval. This can be easily shown via the Taylor series expansion of the function.

Theorem B.15 (Jensen's inequality) If f is convex and X is a random variable, then

E[f(X)] ≥ f(E[X]).

Moreover, if f is strictly convex, then equality in the above inequality immediately implies X = E[X] with probability 1.


[Figure B.3 appears here: the graph of a convex function f(x) together with a support line y = ax + b lying below it.]

Figure B.3: The support line y = ax + b of the convex function f(x).

Proof: Let y = ax + b be a support line through the point (E[X], f(E[X])), where a support line^6 (for a convex function) at x_0 is by definition a line passing through the point (x_0, f(x_0)) and lying entirely below the graph of f(·). Thus,

(∀ x ∈ X) ax + b ≤ f(x).

By taking the expectation of both sides, we obtain

a · E[X] + b ≤ E[f(X)],

but we know that a · E[X] + b = f(E[X]). Consequently,

f(E[X]) ≤ E[f(X)].   □

^6 A line y = ax + b is said to be a support line of the function f(x) if, among all lines of the same slope a, it is the largest one satisfying ax + b ≤ f(x) for every x. Hence, a support line does not necessarily pass through the point (x_0, f(x_0)) for every x_0. Here, since we only consider convex functions, the support line at x_0 is guaranteed to pass through (x_0, f(x_0)).
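A tiny numerical sketch of Jensen's inequality (the three-point distribution and the convex function f(x) = x^2 are illustrative choices, not taken from the notes) checks E[f(X)] ≥ f(E[X]) directly.

    # Check E[f(X)] >= f(E[X]) for a convex f and a discrete X.
    p = {-1.0: 0.2, 0.0: 0.5, 2.0: 0.3}           # Pr{X = x}

    def f(x):
        return x * x                               # a convex function

    EX = sum(x * px for x, px in p.items())        # E[X]
    EfX = sum(f(x) * px for x, px in p.items())    # E[f(X)]

    print("E[f(X)] =", EfX, ">= f(E[X]) =", f(EX))
    assert EfX >= f(EX)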


Bibliography

[1] P. Billingsley. Probability and Measure, 2nd edition, New York, NY: John Wiley and Sons, 1995.


Appendix C

Problems

Chapter 1

1-1 What is information?

1-2 How does one represent information?

1-3 When do we need translation for symbolized information?

Chapter 2

2-1 Show H(X) ≤ log |X | by log-sum inequality.

2-2 Is H(X|Y ) = H(Y |X)? When are they equivalent?

2-3 Try to find a physical meaning for the quantity:

∆ ≜ I(X; Y) + I(X; Z) − I(X; Y, Z).

2-4 Let X be a random variable defined over a finite alphabet. What is the (general) relationship between H(X) and H(Y) if

(a) Y = log(X)?

(b) Y = sin(X)?

2-5 Let X be a discrete random variable. Show that the entropy of a function of X is less than, or equal to, the entropy of X. (Hint: By the chain rule for entropy,

H(X, f(X)) = H(X) + H(f(X)|X) = H(f(X)) + H(X|f(X)).)


2-6 The World Series is a seven-game series that terminates as soon as either team wins 4 games. Let X be the random variable that represents the outcome of a World Series between teams A and B; for example, possible outcomes are AAAA or BABBAAB. Let Y be the number of games played, which obviously ranges from 4 to 7. Assume that A and B are equally matched and that all games are independent. Calculate H(X), H(Y), H(Y|X) and H(X|Y).

2-7 Give examples of:

(a) I(X; Y |Z) < I(X; Y ).

(b) I(X; Y |Z) > I(X; Y ).

(Hint: For 2-7(a), create an example with I(X; Y|Z) = 0 and I(X; Y) > 0. For 2-7(b), create an example with I(X; Y) = 0 and I(X; Y|Z) > 0.)

2-8 Let the joint distribution of X and Y be:

P_{X,Y}(x, y):        X = 0    X = 1
          Y = 0        1/2      1/4
          Y = 1         0       1/4

Please draw the Venn diagram for

H(X), H(Y), H(X|Y), H(Y|X), H(X, Y) and I(X; Y),

and indicate the quantities (in bits) for each area of the Venn diagram.

2-9 Let X, Y and Z be random variables with finite alphabets, and Z = X + Y.

(a) Prove that H(Z|X) = H(Y|X).

(b) With the above result, prove that the addition of two independent random variables, X and Y, never decreases the entropy, namely

H(Z) ≥ H(X) and H(Z) ≥ H(Y).

(c) Give an example where H(X) > H(Z) and H(Y) > H(Z). Note that now X is not necessarily independent of Y. (Hint: You may create an example with H(Z) = 0, but H(X) > 0 and H(Y) > 0.)


(d) Under what condition does H(Z) = H(X) + H(Y )?

2-10 Let X_1 → X_2 → X_3 → · · · → X_n form a first-order Markov chain. Show that:

(a) I(X1; X2, . . . , Xn) = I(X1; X2).

(b) For any n, H(X1|Xn−1) ≤ H(X1|Xn).

2-11 Prove that refinement never decreases entropy. (Namely, given P_X and a partition {U_1, . . . , U_m} of X, define P_U(i) = P_X(U_i). Then H(X) ≥ H(U).)

2-12 Show by examples that

(a) D(P_{X|Z} ‖ P̂_{X|Z}) > D(P_X ‖ P̂_X).

(b) D(P_{X|Z} ‖ P̂_{X|Z}) < D(P_X ‖ P̂_X).

2-13 Prove that for any two distributions P and Q defined on a common measurable space (Ω, F),

‖P − Q‖_1 = Σ_{ω∈Ω} |P(ω) − Q(ω)| = 2 · sup_{E∈F} |P(E) − Q(E)|.

2-14 Prove that the binary divergence D(p‖q) = p log(p/q) + (1 − p) log[(1 − p)/(1 − q)] is upper bounded by

(p − q)^2 / (q(1 − q))

for 0 < p < 1 and 0 < q < 1.

(Hint: log(1 + u) ≤ u for u > −1 and p/q = 1 + (p − q)/q.)

Chapter 3

3-1 What is the difference between block codes and fixed-length tree codes?

3-2 We know that the average codeword length of any uniquely decodable code must be no less than the entropy. This is not necessarily true for non-singular codes. Give an example of a non-singular code in which the average codeword length is less than the entropy. (Since the proof of the above fact employs the Kraft inequality, such a non-singular code must violate the Kraft inequality. Also, we need its average codeword length to be less than the entropy, so it may be better to create an example with large entropy, such as a source with uniform distribution.)

3-3 When does equality hold in the Kraft inequality? Give an example.

3-4 For a stationary source X^n = (X_1, . . . , X_n), show that


(a) (1/n)H(X^n) ≤ [1/(n − 1)]H(X^{n−1}). (I.e., (1/n)H(X^n) is non-increasing in n.)
(Hint: Subtract the above two terms, and expand H(X^n) and H(X^{n−1}) by the chain rule. Note that by stationarity,

H(X_i|X_{i−1}, . . . , X_1) = H(X_n|X_{n−1}, . . . , X_{n−i+1}).)

(b) (1/n)H(X^n) ≥ H(X_n|X^{n−1}).
(Hint: Subtract the above two terms, and expand H(X^n) by the chain rule. Note that by stationarity,

H(X_i|X_{i−1}, . . . , X_1) = H(X_n|X_{n−1}, . . . , X_{n−i+1}).)

3-5 Let . . . , X_{−1}, X_0, X_1, . . . be a stationary process.

(a) Is H(X_n|X_0) = H(X_{−n}|X_0) true? If it is, prove it; otherwise, show a counterexample. (Hint: H(X_0, X_n) = H(X_{−n}, X_0) by stationarity. Then use the chain rule to compare them.)

(b) Is H(X_n|X_0) ≥ H(X_{n−1}|X_0) true? If it is, prove it; otherwise, show a counterexample. (Hint: You may construct a stationary source with H(X_n|X_0) = 0, i.e., X_n = X_0, but ...)

(c) Is H(X_n|X_1, X_2, . . . , X_{n−1}, X_{n+1}) non-increasing in n? If it is, prove it; otherwise, show a counterexample. (Hint: Subtract two consecutive terms, and use the property of stationarity and the fact that conditioning does not increase entropy.)

3-6 (Random walk) A person walks on a line with the integers marked in order on it. At each time he may walk forward with probability 0.9, which increases his position by 1, or he may walk backward with probability 0.1, which decreases his position by 1. Let X_i be the number he stands on at time instance i, and X_0 = 0. Hence, a random process X_0, X_1, . . . is formed.

(a) Find H(X_1, X_2, . . . , X_n). (Hint: Use the chain rule for H(X_1, . . . , X_n) and note that X_i only depends on X_{i−1}.)

(b) Find the average entropy rate of the random process, i.e.,

lim_{n→∞} (1/n) H(X^n).


(c) Find the expected number of steps the person takes before reversing direction (not including the reversed step).

3-7 A discrete memoryless source emits a sequence of statistically independent binary digits with probabilities p(1) = 0.005 and p(0) = 0.995. A binary codeword is provided for every sequence of 100 digits containing three or fewer ones. In other words, the set of source symbols that are encoded into distinct block codewords is

A ≜ {x^n ∈ {0, 1}^{100} : number of 1's in x^n ≤ 3}.

(a) Show that A is indeed a typical set F_{100}(0.2).

(b) Find the minimum codeword block length for this block coding scheme.

(c) Find the probability of the set of source symbols not in A.

(d) Use Chebyshev's inequality to bound the probability of observing a source sequence outside A. Compare this bound with the actual probability computed previously. (Hint: Let X_i represent the binary random digit at instance i, and let S_n = X_1 + · · · + X_n. Note that Pr{S_100 ≥ 4} is equal to

Pr{|(1/100)S_100 − 0.005| ≥ 0.035}.)

3-8 Under what conditions does the average binary codeword length of a uniquely decodable variable-length code for a source equal the source entropy (measured in bits)? (Hint: See the proof of Theorem 4.18.)

3-9 Consider the binary source {X_n}_{n=−∞}^∞, X_n ∈ {0, 1}, with

Pr{X_{n+1} = j | X_n = i} = p if i = j, and 1 − p otherwise,

where p ∈ [0, 1].

(a) Find the initial state distribution of X_1 required to make the source {X_n}_{n=1}^∞ stationary. (Hint: By stationarity, the distribution of X_1 should be identical to that of X_2.)

(b) Assume that the source {X_n}_{n=1}^∞ is stationary (i.e., the initial state distribution of X_1 is exactly the solution of the previous question). If p = 1/2, show that {X_n}_{n=1}^∞ is a discrete memoryless source, and compute its entropy. (Hint: Show that P_{X_{n+1}|X_n} = P_{X_{n+1}}.)


(c) Suppose that p = 1. Is {X_n}_{n=−∞}^∞ ergodic? (Hint: Ergodicity means every shift-invariant event has probability either 1 or 0.)

(d) Suppose that p = 0. Is {X_n}_{n=−∞}^∞ ergodic? (Hint: Same as above.)

(e) For p ∈ (0, 1), compute the entropy rate of {X_n}_{n=1}^∞.

3-10 Suppose the sources Z_1 and Z_2 are independent of each other, and have the same distribution as Z with

Pr[Z = e_1] = 0.4;  Pr[Z = e_2] = 0.3;  Pr[Z = e_3] = 0.2;  Pr[Z = e_4] = 0.1.

(a) Design the Huffman code for Z. (Requirement: The codeword of event e_1 must be the single bit 0.)

(b) Applying the Huffman code in (a) to the two sources in sequence yields the codeword U_1, U_2, . . . , U_k, where k ranges from 2 to 6, depending on the outcomes of Z_1 and Z_2. Are U_1 and U_2 independent? Justify your answer. (Hint: Examine the value of Pr[U_2 = 0 | U_1 = u_1] for different u_1.)

(c) Is the average per-letter codeword length equal to the per-letter source entropy

0.4 log_2(1/0.4) + 0.3 log_2(1/0.3) + 0.2 log_2(1/0.2) + 0.1 log_2(1/0.1) = 1.84644 bits/letter?

Justify your answer.

(d) Now if we apply the Huffman code in (a) sequentially to the i.i.d. sequence Z_1, Z_2, Z_3, . . . with marginal distribution the same as Z, yielding the output U_1, U_2, U_3, . . ., can U_1, U_2, U_3, . . . be further compressed?

If your answer is NO, prove that U_1, U_2, U_3, . . . is i.i.d. uniform. If your answer is YES, explain why the optimal Huffman code does not give an i.i.d. uniform output. (Hint: Achievability of per-letter average codeword length down to per-letter source entropy.)

3-11 In the second part of Theorem 3.22, it is shown that there exists a prefix code with

ℓ̄ = Σ_{x∈X} P_X(x) ℓ(c_x) ≤ H(X) + 1,

where c_x is the codeword for the source symbol x and ℓ(c_x) is the length of the codeword c_x. Show that the upper bound can be improved to:

ℓ̄ < H(X) + 1.

(Hint: Replace ℓ(c_x) = ⌊− log_2 P_X(x)⌋ + 1 by a new assignment.)

3-12 Let X_1, X_2, X_3, · · · be an i.i.d. discrete source with marginal alphabet {x_1, x_2, x_3, · · ·}, and assume that P_X(x_i) > 0 for every i.

(a) Prove that the average codeword length of the single-letter binary Huffman code is equal to H(X) if, and only if, P_X(x_i) is equal to 2^{−n_i} for every i, where {n_i} is a sequence of positive integers. (Hint: The if-part can be proved by the new bound in Problem 3-11, and the only-if-part can be proved by modifying the proof of Theorem 3.18.)

(b) What is the sufficient and necessary condition under which the average codeword length of the single-letter ternary Huffman code equals H(X)? (Hint: You only need to write down the condition. No proof is necessary.)

(c) Prove that the average codeword length of the two-letter Huffman code cannot be equal to H(X) + 1/2 bits. (Hint: Use the new bound in Problem 3-11.)

Chapter 4

4-1 Consider a discrete memoryless channel with input X and output Y. Assume that the input alphabet is X = {1, 2}, the output alphabet is Y = {0, 1, 2, 3}, and the transition probabilities P_{Y|X} ≜ Pr(Y = y|X = x) are given by

P_{Y|X}(y|x) = 1 − 2ε if x = y; ε if |x − y| = 1; 0 otherwise,

where 0 < ε < 1/2.

(a) Determine the channel probability transition matrix Q ≜ [P_{Y|X}].

(b) Compute the capacity of this channel. What is the maximizing input distribution that achieves capacity?

4-2 The proof of Shannon's channel coding theorem is based on the random coding technique. What is the codeword-selecting distribution of the random codebook? What is the decoding rule used in the proof?


4-3 One is given a communication channel with transition probability P_{Y|X}(y|x) and channel capacity C = max_{P_X} I(X; Y). A statistician preprocesses the output through Ỹ = g(Y). He claims that this will strictly improve the capacity. Show that he is wrong. (Hint: Data processing lemma.)

4-4 Find the channel capacity of the DMC modeled as Y = X + Z, where P_Z(0) = P_Z(a) = 1/2. The alphabet for the channel input X is X = {0, 1}. Assume that Z is independent of X. Discuss the dependence of the channel capacity on the value of a.

4-5 Consider two discrete memoryless channels

(X_1, P_{Y_1|X_1}, Y_1) and (X_2, P_{Y_2|X_2}, Y_2)

with capacities C_1 and C_2, respectively. A new channel (X_1 × X_2, P_{Y_1|X_1} × P_{Y_2|X_2}, Y_1 × Y_2) is formed, in which x_1 ∈ X_1 and x_2 ∈ X_2 are sent simultaneously, resulting in Y_1, Y_2. Find the capacity of this channel.

4-6 Find the channel capacity of a q-ary erasure channel with input alphabet {0, 1, 2, . . . , q − 1}, output alphabet {0, 1, 2, . . . , q − 1, e} and channel transition probability

P_{Y|X}(y|x) = 1 − ε for y = x; ε for y = e; 0 otherwise.

(Hint: It is a weakly symmetric channel; hence, a uniform channel input achieves the capacity.)

4-7 Let the relation between the channel input {X_n}_{n≥1} and channel output {Y_n}_{n≥1} be

Y_n = (α_n × X_n) ⊕ N_n for each n,

where α_n, X_n, Y_n and N_n all take values from {0, 1}, and “⊕” represents the XOR operation. Assume that the attenuation {α_n}_{n=1}^∞, the channel input {X_n}_{n=1}^∞ and the additive noise {N_n}_{n=1}^∞ are independent. Also, {α_n}_{n=1}^∞ and {N_n}_{n=1}^∞ are i.i.d. random sequences with

Pr[α_n = 1] = Pr[α_n = 0] = 1/2 and Pr[N_n = 1] = 1 − Pr[N_n = 0] = ε ∈ (0, 1/2).


(a) Derive the channel transition probability matrix

[PYj |Xj

(0|0) PYj |Xj(1|0)

PYj |Xj(0|1) PYj |Xj

(1|1)

].

(b) The channel is apparently a discrete memoryless channel. Determineits channel capacity C. (Hint: I(x : Y ) = C if PX(x) > 0, where PX

is the channel input distribution that achieves the channel capacity.)

(c) Suppose that αn is known, and consists of k 1’s. Find the maximumI(Xn; Y n) for the same channel with known αn. (Hint: For known αn,(Xj, Yj)n

j=1 are independent. So I(Xn; Y n) ≤ ∑nj=1 I(Xj; Yj). You

can treat that the “capacity, namely maximum mutual information,for a binary symmetric channel is log(2) − Hb(ε) as a known fact,”where Hb(·) is the binary entropy function.)

(d) Some researchers attempt to derive the capacity of the channel in (b) via the following steps:

• Derive the maximum mutual information between the channel input X^n and output Y^n for a given α^n (namely, the solution in (c));

• Calculate the expected value of this maximum mutual information according to the statistics of α^n;

• Take the capacity of the channel to be this "expected value" divided by n.

Does this "capacity" C coincide with that in (b)?
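
A numerical comparison of the two quantities in (d) is easy to carry out. The sketch below is illustrative only: ε = 0.1, the helper names, and the brute-force grid search (instead of a closed form) are my own choices; the single-letter matrices are generated directly from the channel model rather than taken from (a).

    import numpy as np

    def mi(p1, W):
        """I(X;Y) in bits for a binary-input channel W (rows: x = 0, 1) with P_X(1) = p1."""
        p = np.array([1 - p1, p1])
        qy = p @ W
        with np.errstate(divide="ignore", invalid="ignore"):
            t = W * np.log2(W / qy)
        return float(p @ np.nansum(np.where(W > 0, t, 0.0), axis=1))

    def cap(W):
        return max(mi(p1, W) for p1 in np.linspace(0.0, 1.0, 2001))

    def letter_matrix(alpha_law, eps):
        """Single-letter transition matrix of Y = (alpha * X) XOR N for a given law of alpha."""
        W = np.zeros((2, 2))
        for x in (0, 1):
            for alpha, pa in alpha_law:
                for nz, pn in ((0, 1 - eps), (1, eps)):
                    W[x, (alpha * x) ^ nz] += pa * pn
        return W

    eps = 0.1                                         # illustrative value in (0, 1/2)
    W_b  = letter_matrix(((0, 0.5), (1, 0.5)), eps)   # alpha unknown, as in (b)
    W_a1 = letter_matrix(((1, 1.0),), eps)            # alpha known to be 1
    W_a0 = letter_matrix(((0, 1.0),), eps)            # alpha known to be 0
    averaged = 0.5 * cap(W_a1) + 0.5 * cap(W_a0)      # the per-letter average used in (d)
    print("capacity in (b):      ", cap(W_b))
    print("averaged value in (d):", averaged)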

4-8 Statement 1: From the proof of the converse to Shannon's channel coding theorem, the error probability shall ultimately be larger than 1 − (C/R), where R is the transmission rate above the channel capacity C. This converse theorem is applicable to all discrete memoryless channels, including the binary symmetric channels. So, taking, say, R = 4C, Shannon said that the error rate will ultimately be larger than 0.75.

Statement 2: For the binary symmetric channel with equiprobable input from {0, 1}, a pure random guess at the receiver side straightforwardly gives a bit-error rate of 0.5, no matter how large the transmission rate R is.

Is there a conflict between the above two statements? Justify your answer. (Hint: The definition of average probability of error in Shannon's channel coding theorem.)

Definition (average probability of error) The average probability of error for a C̃_n = (n, M) code with encoder f(·) and decoder g(·) transmitted over the channel Q_{Y^n|X^n} is defined as

P_e(C̃_n) = (1/M) Σ_{i=1}^M λ_i,

where

λ_i ≜ Σ_{y^n ∈ Y^n : g(y^n) ≠ i} Q_{Y^n|X^n}(y^n | f(i)).

4-9 Suppose that the blocklength is n = 2 and the code size is M = 2. Assume each code bit is either 0 or 1.

(a) What is the number of all possible codebook designs? (Note: This number includes those lousy code designs, such as {00, 00}.)

(b) Suppose that one randomly draws one of these possible code designs according to the uniform distribution, and applies the selected code to a binary symmetric channel with crossover probability ε. What is the expected error probability, if the decoder simply selects the codeword whose Hamming distance to the received vector is the smallest? (When both codewords have the same Hamming distance to the received vector, the decoder makes an equiprobable guess on the transmitted codeword.)

(c) Explain why the error in (b) does not vanish as ε ↓ 0.

(d) Based on the explanation in (c), show that the error probability of a uniformly drawn binary (n, M) random code with fixed ultimate code rate (namely, log_2(M)/n → R fixed) cannot approach zero faster than "exponentially" with respect to the blocklength n over the memoryless BSC. In other words, there exist p > 0 and a constant A > 0 such that BER_random(n) ≥ A p^n. (Hint: The error of a random (n, M) code is lower bounded by the error of a random (n, 2) code for M ≥ 2.)
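
For parts (b) and (c), a direct Monte Carlo experiment over the uniformly drawn (n = 2, M = 2) codebooks can be instructive. The simulation below is only a sketch: the helper name, the fixed seed, and the trial crossover probabilities are my own choices; the printout lets one observe what happens as ε becomes small.

    import itertools
    import random

    def avg_error(eps, trials=200000, seed=0):
        """Monte Carlo estimate of the expected error probability of a uniformly drawn
        (n = 2, M = 2) code over a BSC(eps) with minimum-distance decoding."""
        rng = random.Random(seed)
        words = list(itertools.product((0, 1), repeat=2))       # all 4 binary words of length 2
        errors = 0
        for _ in range(trials):
            code = [rng.choice(words), rng.choice(words)]        # uniform over all 16 designs
            i = rng.randrange(2)                                 # equiprobable message
            y = tuple(b ^ (rng.random() < eps) for b in code[i]) # BSC output
            d = [sum(a != b for a, b in zip(y, cw)) for cw in code]
            guess = rng.randrange(2) if d[0] == d[1] else d.index(min(d))
            errors += (guess != i)
        return errors / trials

    for eps in (0.2, 0.05, 0.0):
        print(eps, avg_error(eps))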

4-10 Show that the capacity of a quasi-symmetric channel with weakly symmetric sub-arrays Q_1, Q_2, . . . , Q_n, respectively of sizes |X| × |Y_1|, |X| × |Y_2|, . . . , |X| × |Y_n|, is given by

C = Σ_{i=1}^n a_i C_i,

where a_i is equal to the sum of any row in Q_i, and C_i = log|Y_i| − H(normalized row distribution of Q_i). (Hint: Use the result that the channel capacity of a weakly symmetric channel is achieved by a uniform input.)
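
As a sanity check of the formula, one can apply it to the binary erasure channel, whose transition matrix splits into a weakly symmetric non-erasure sub-array and an erasure column. The sketch below uses my own helper names and an illustrative erasure probability; its output can be compared with the known BEC capacity.

    import numpy as np

    def H(p):
        """Entropy in bits of a probability vector."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def quasi_symmetric_capacity(subarrays):
        """Evaluate C = sum_i a_i * C_i over weakly symmetric sub-arrays."""
        C = 0.0
        for Q in subarrays:
            a = Q[0].sum()                                  # common row sum of the sub-array
            C += a * (np.log2(Q.shape[1]) - H(Q[0] / a))    # C_i = log|Y_i| - H(normalized row)
        return C

    eps = 0.3                                               # illustrative erasure probability
    Q1 = np.array([[1 - eps, 0.0], [0.0, 1 - eps]])         # non-erasure outputs
    Q2 = np.array([[eps], [eps]])                           # erasure output
    print(quasi_symmetric_capacity([Q1, Q2]))               # compare with 1 - eps for the BEC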

4-11 Assume that the alphabets for the random variables X and Y are both {1, 2, 3, 4, 5}. Let x̂ = g(y) be an estimate of x from observing y. Define the probability of estimation error as P_e = Pr{g(Y) ≠ X}. Then Fano's inequality bounds P_e through H_b(P_e) + 2P_e ≥ H(X|Y), where H_b(p) = p log_2(1/p) + (1 − p) log_2(1/(1 − p)) is the binary entropy function. The curve H_b(P_e) + 2P_e = H(X|Y) is plotted below.

[Figure: the curve H_b(P_e) + 2P_e = H(X|Y) for |X| = 5, with P_e on the horizontal axis and H(X|Y) (in bits) on the vertical axis; points A, B, C and D are marked on the curve.]

(a) Point A on the above figure shows that if H(X|Y) = 0, zero estimation error, namely P_e = 0, can be achieved. In this case, characterize the distribution P_{X|Y}. Also, give an estimator g(·) that achieves P_e = 0. (Hint: Think of what kind of statistical relation between X and Y can render H(X|Y) = 0.)

(b) Point B on the above figure indicates that when H(X|Y) = log_2(5), the estimation error can only be equal to 0.8. In this case, characterize the distributions P_{X|Y} and P_X. Prove that at H(X|Y) = log_2(5), all estimators yield P_e = 0.8. (Hint: Think of what kind of statistical relation between X and Y can render H(X|Y) = log_2(5).)

(c) Point C on the above figure hints that when H(X|Y) = 2, the estimation error can be as bad as 1. Give an estimator g(·) that leads to P_e = 1, if P_{X|Y}(x|y) = 1/4 for x ≠ y, and P_{X|Y}(x|y) = 0 for x = y. (Hint: The answer is apparent, isn't it?)



(d) Similarly, point D on the above figure hints that when H(X|Y) = 0, the estimation error can be as bad as 1. Give an estimator g(·) that leads to P_e = 1 at H(X|Y) = 0. (Hint: The answer is apparent, isn't it?)

4-12 Can the channel capacity between channel input X and channel output Z be strictly larger than the channel capacity between channel input X and channel output Y? Which lemma or theorem is your answer based on?

[Block diagram: X → Channel P_{Y|X} → Y → deterministic post-processing mapping g(·) → Z = g(Y).]

4-13 Let the single-letter channel transition probability P_{Y|X} of the discrete memoryless channel be defined by the following figure, where 0 < ε < 0.5.

[Figure: a channel diagram with input alphabet X = {1, 2, 3, 4} and output alphabet Y = {1, 2, 3, 4}; each input is connected to one output with probability 1 − ε and to one other output with probability ε.]

(a) Is the channel a weakly symmetric channel? Is the channel a symmetric channel?

(b) Determine the channel capacity of this channel (in bits). Also, indicate the input distribution that achieves the channel capacity. (Hint: You can directly apply the conclusions that we draw about weakly symmetric and symmetric channels to obtain the answers.)

Chapters 5 & 6

6-1 Evaluate the differential entropy in each of the following cases.

(a) The pdf of the source is f(x) = λ e^{−λx}, x ≥ 0;

(b) The pdf of the source is f(x) = (1/2) λ e^{−λ|x|};

(c) The source is X = X_1 + X_2, where X_1 and X_2 are independent Gaussian random variables with mean-variance pairs (µ_1, σ_1^2) and (µ_2, σ_2^2), respectively.
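
The analytical answers to this problem can be checked against a crude numerical integration of −∫ f log_2 f. The sketch below truncates the infinite tails at illustrative points and uses parameter values (λ = 2, σ_1 = 1, σ_2 = 2) that are my own choices.

    import numpy as np

    def diff_entropy(pdf, lo, hi, n=200001):
        """Riemann-sum approximation (in bits) of the differential entropy of pdf on [lo, hi]."""
        x = np.linspace(lo, hi, n)
        f = pdf(x)
        integrand = np.where(f > 0, -f * np.log2(np.where(f > 0, f, 1.0)), 0.0)
        return float(integrand.sum() * (x[1] - x[0]))

    lam = 2.0                                                                         # rate parameter
    print(diff_entropy(lambda x: lam * np.exp(-lam * x), 0.0, 40.0))                  # part (a)
    print(diff_entropy(lambda x: 0.5 * lam * np.exp(-lam * np.abs(x)), -40.0, 40.0))  # part (b)
    # part (c): X1 + X2 is Gaussian with variance sigma1^2 + sigma2^2
    var = 1.0**2 + 2.0**2
    gauss = lambda x: np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    print(diff_entropy(gauss, -60.0, 60.0))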



6-2 Find the mutual information of dependent Gaussian random variables (X, Y) with zero means and covariance matrix

[ σ^2    ρσ^2 ]
[ ρσ^2   σ^2  ].

Evaluate its value for ρ = 1, ρ = 0 and ρ = −1. Comment on your results.
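
A numerical cross-check is possible via I(X; Y) = h(X) + h(Y) − h(X, Y) together with the standard differential entropy formula for Gaussian vectors. The values of ρ below are illustrative, and the helper name is my own; note what happens as ρ approaches ±1, where the covariance matrix becomes singular.

    import numpy as np

    def h_gauss(K):
        """Differential entropy in bits of a Gaussian vector with covariance matrix K."""
        K = np.atleast_2d(np.asarray(K, dtype=float))
        n = K.shape[0]
        return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

    sigma2 = 1.0
    for rho in (0.0, 0.5, 0.9, 0.999, 1.0):     # at rho = 1 the determinant vanishes
        K = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
        I = h_gauss(K[0, 0]) + h_gauss(K[1, 1]) - h_gauss(K)   # I(X;Y) = h(X)+h(Y)-h(X,Y)
        print(rho, I)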

6-3 Prove that, of all probability density functions with support [0, 1], the uniform density function has the largest differential entropy.

6-4 Of all pdfs with continuous support [0, K], where K is finite and K > 1, which pdf has the largest differential entropy? (Hint: If p_X is the pdf that maximizes differential entropy among all pdfs with continuous support [0, K], then E[log p_X(X)] = E[log p_X(Y)] for any random variable Y with continuous support [0, K].)

6-5 Show that the exponential distribution has the largest differential entropy among all probability density functions (pdfs) with mean µ and continuous support [0, ∞). (Hint: The pdf of the exponential distribution with mean µ is given by p_X(x) = (1/µ) exp(−x/µ) for x ≥ 0.)

6-6 Prove that, of all probability mass functions of a non-negative integer-valued source with mean µ, the geometric distribution, P_Z, has the largest entropy. (Note that the probability mass function of the geometric distribution with mean µ is

P_Z(z) = (1/(1 + µ)) (µ/(1 + µ))^z, for z = 0, 1, 2, . . .)

(Hint: Let X be a non-negative integer-valued source with mean µ. Show that H(X) − H(Z) = −D(P_X‖P_Z) ≤ 0.)

6-7 Let X, Y and Z be jointly Gaussian random variables, each with mean 0 and variance 1; let the correlation coefficient of X and Y, as well as that of Y and Z, be ρ, while X and Z are uncorrelated. Determine h(X, Y, Z). What does the result tell you about the possible values of ρ?
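
A short numerical sketch can accompany this problem: it evaluates the determinant of the assumed covariance matrix and the resulting joint differential entropy for a few trial values of ρ (my own choices), which hints at the answer to the last question.

    import numpy as np

    def h_gauss(K):
        """Differential entropy in bits of a Gaussian vector with covariance K; nan if K is invalid."""
        sign, logabsdet = np.linalg.slogdet(K)
        if sign <= 0:
            return float("nan")                  # K is not a positive definite covariance matrix
        n = K.shape[0]
        return 0.5 * (n * np.log2(2 * np.pi * np.e) + logabsdet / np.log(2))

    for rho in (0.0, 0.5, 0.7, 0.71, 0.9):       # trial correlation coefficients
        K = np.array([[1.0, rho, 0.0],
                      [rho, 1.0, rho],
                      [0.0, rho, 1.0]])
        print(rho, np.linalg.det(K), h_gauss(K))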

6-8 Let Y_1 and Y_2 be conditionally independent and conditionally identically distributed given X.

(a) Show that I(X; Y_1, Y_2) = 2 · I(X; Y_1) − I(Y_1; Y_2).

(b) Show that the capacity of the channel X → (Y_1, Y_2) is less than twice the capacity of the channel X → Y_1.



6-9 Consider the channel X → (Y_1, Y_2) with

Y_1 = X + Z_1 and Y_2 = X + Z_2.

Assume Z_1 and Z_2 are zero-mean dependent Gaussian random variables with covariance matrix

[ N    Nρ ]
[ Nρ   N  ].

Apply a power constraint on the input, i.e., E[X^2] ≤ S. Find the capacity under:

(a) ρ = 1;

(b) ρ = 0;

(c) ρ = −1.

6-10 Consider a channel with a vector output for a scalar input, as follows:

X → Channel → (Y_1, Y_2)

Suppose that P_{Y_1,Y_2|X}(y_1, y_2|x) = P_{Y_1|X}(y_1|x) P_{Y_2|X}(y_2|x) for every y_1, y_2 and x.

(a) Show that I(X; Y_1, Y_2) = Σ_{i=1}^2 I(X; Y_i) − I(Y_1; Y_2). (Hint: I(X; Y_1, Y_2) = H(Y_1, Y_2) − H(Y_1, Y_2|X) and H(Y_1, Y_2|X) = H(Y_1|X) + H(Y_2|X).)

(b) Prove that the channel capacity C_two of using two outputs (Y_1, Y_2) is less than C_1 + C_2, where C_j is the channel capacity of using one output Y_j and ignoring the other output.

(c) Further assume that P_{Y_j|X} is Gaussian with mean x and variance σ_j^2. In fact, this channel can be expressed as Y_1 = X + N_1 and Y_2 = X + N_2, where (N_1, N_2) are independent Gaussian random variables with mean zero and covariance matrix

[ σ_1^2   0     ]
[ 0       σ_2^2 ].

Using the fact that h(Y_1, Y_2) ≤ (1/2) log((2πe)^2 |K_{Y_1,Y_2}|), with equality when (Y_1, Y_2) is jointly Gaussian, where K_{Y_1,Y_2} is the covariance matrix of (Y_1, Y_2), derive C_two(S) for the two-output channel under the power constraint E[X^2] ≤ S. (Hint: I(X; Y_1, Y_2) = h(Y_1, Y_2) − h(N_1, N_2) = h(Y_1, Y_2) − h(N_1) − h(N_2).)
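
For part (c), the rate achieved by a Gaussian input X ~ N(0, S) can be computed numerically and set against C_1 + C_2 from part (b). The power and noise-variance values below are illustrative assumptions, and C_1, C_2 are evaluated with the standard single-output Gaussian channel capacity formula.

    import numpy as np

    def h_gauss(K):
        """Differential entropy in bits of a Gaussian vector with covariance matrix K."""
        K = np.atleast_2d(np.asarray(K, dtype=float))
        n = K.shape[0]
        return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

    S, s1, s2 = 4.0, 1.0, 2.0                   # illustrative power and noise standard deviations
    KY = np.array([[S + s1**2, S],
                   [S,         S + s2**2]])     # covariance of (Y1, Y2) when X ~ N(0, S)
    I_two = h_gauss(KY) - h_gauss(s1**2) - h_gauss(s2**2)   # h(Y1,Y2) - h(N1) - h(N2)
    C1 = 0.5 * np.log2(1 + S / s1**2)           # capacity using Y1 alone
    C2 = 0.5 * np.log2(1 + S / s2**2)           # capacity using Y2 alone
    print(I_two, C1 + C2)                       # the first number should not exceed the second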

6-11 Consider the 3-input 3-output memoryless additive Gaussian channel

Y = X + Z,

where X = [X_1, X_2, X_3], Y = [Y_1, Y_2, Y_3] and Z = [Z_1, Z_2, Z_3] are all 3-dimensional real vectors. Assume that X is independent of Z, and the input power constraint is S (i.e., E[X_1^2 + X_2^2 + X_3^2] ≤ S). Also, assume that Z is Gaussian distributed with zero mean and covariance matrix K, where

K = [ 1  0  0 ]
    [ 0  1  ρ ]
    [ 0  ρ  1 ].

(a) Determine the capacity-cost function of the channel if ρ = 0. (Hint: Directly apply Theorem 6.31.)

(b) Determine the capacity-cost function of the channel if 0 < ρ < 1. (Hint: Directly apply Theorem 6.34.)
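
For part (b), a water-pouring computation can be sketched, assuming the usual reduction of Y = X + Z to independent parallel Gaussian channels in the eigenbasis of K. The values of ρ and S, the bisection tolerance, and the function name are my own illustrative choices.

    import numpy as np

    def water_fill(noise, S, tol=1e-12):
        """Water-pouring power allocation over parallel Gaussian channels with
        noise variances `noise` and total input power S; returns (powers, rate in bits)."""
        lo, hi = 0.0, noise.max() + S
        while hi - lo > tol:
            theta = 0.5 * (lo + hi)                          # candidate water level
            if np.maximum(theta - noise, 0.0).sum() > S:
                hi = theta
            else:
                lo = theta
        P = np.maximum(0.5 * (lo + hi) - noise, 0.0)
        return P, float(np.sum(0.5 * np.log2(1 + P / noise)))

    rho, S = 0.5, 3.0                                        # illustrative values
    K = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, rho],
                  [0.0, rho, 1.0]])
    noise = np.linalg.eigvalsh(K)                            # noise variances in the eigenbasis
    P, C = water_fill(noise, S)
    print(noise, P, C)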

Appendix A

A-1 Give examples for which

sup(A · B) > (sup A) · (sup B)

and for which

sup(A · B) = (sup A) · (sup B).

A-2 Prove Lemma A.27 based on the limit definition in Definition A.23, and the limsup and liminf definitions in Definition A.26.

Appendix B

B-1 From the basic definition of a first-order Markov chain X → Y → Z, namely

P_{X,Y,Z}(x, y, z) = P_X(x) · P_{Y|X}(y|x) · P_{Z|Y}(z|y),

show that it is equivalent to "X and Z are conditionally independent given Y", i.e.,

P_{X,Z|Y}(x, z|y) = P_{X|Y}(x|y) · P_{Z|Y}(z|y).

B-2 What is the necessary and sufficient equality condition for Jensen's inequality?

B-3 Prove the following inequalities:

(a) (Markov's inequality) For any non-negative random variable X and any δ > 0, show that

Pr(X ≥ δ) ≤ E[X]/δ.



(Hint: E[X] equals the integral of x weighted by the probability of x over [0, ∞), which is no less than the same integral over [δ, ∞). Together with the fact that x ≥ δ on [δ, ∞), you can obtain this inequality.)

(b) (Chebyshev's inequality) Let Y be a random variable with mean µ and variance σ^2. Show that

Pr{|Y − µ| > ε} ≤ σ^2/ε^2.

(Hint: Let X = (Y − µ)^2, and use Markov's inequality.)

(c) (The weak law of large numbers) Let Z_1, . . . , Z_n be a sequence of i.i.d. random variables with mean µ and variance σ^2. Show that

Pr{ |(1/n) Σ_{i=1}^n Z_i − µ| > ε } ≤ σ^2/(nε^2).

(That is, the sample mean (1/n) Σ_{i=1}^n Z_i converges in probability to µ.) (Hint: Let Y = (1/n) Σ_{i=1}^n Z_i, and use Chebyshev's inequality.)
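
The bound in (c) can be visualized with a small Monte Carlo experiment. The sketch below uses Bernoulli(1/2) samples (so µ = 1/2 and σ^2 = 1/4), an illustrative ε, and my own choice of sample sizes, and it prints the empirical deviation probability next to the bound σ^2/(nε^2).

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma2, eps = 0.5, 0.25, 0.05            # Bernoulli(1/2) source and an illustrative eps
    trials = 2000                                # independent sample means per value of n
    for n in (10, 100, 1000, 10000):
        Z = rng.integers(0, 2, size=(trials, n), dtype=np.uint8)
        freq = np.mean(np.abs(Z.mean(axis=1) - mu) > eps)
        print(n, freq, "Chebyshev bound:", sigma2 / (n * eps**2))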



Index

δ-typical set, 47

AEP, 47, 108, 121
algebra, 161
σ-algebra, 162
asymptotic equipartition property, 47
average power constraint, 132
average probability of error, 80

BEC, 95
binary erasure channel, 95
binary symmetric channel, 92
block code, 45, 46
BSC, 92

capacity
  AWGN, 138
  bandlimited waveform channels, 149
  filtered Gaussian waveform channels, 153
  non-Gaussian channels, 158
  parallel additive Gaussian channels, 141, 145
capacity-cost function, 132
  concavity, 133
central limit theorem, 190
Cesàro-mean theorem, 54
code rate for data compression, 46
communication system, 2
  general description, 2, 3

data processing inequality, 27
  for divergence, 34
  system interpretation, 28
data processing lemma, see data processing inequality
differential entropy
  definition, 119
  Gaussian source, 121
  operation characteristic, 125
  uniform source, 121
discrete memoryless channel, 80
discrete memoryless source, 46
distortion measure, 101
  additive, 104
  Hamming distortion measure, 103
  maximum, 104
  squared error distortion, 103
distortion typical set, 106
divergence, 31, 32
  additivity for independence, 39
  bound from variational distance, 35, 36
  chain rule, 33
  conditional, 38
  convexity, 39
  non-negativity, 32
DMC, 80
dominated convergence theorem, 185

entropy, 17
  additivity for independence, 22
  chain rule, 20, 25
  concavity, 39
  conditional, 20, 21
    chain rule, 21, 26
    lower additivity, 22
  differential, 119
  independent bound, 26
  joint, 19
  non-negativity, 18
  relative entropy, 31, 32
    for continuous cases, 127
  uniformity, 18
  Venn diagram, 25
entropy power, 158
entropy rate, 53
entropy stability property, 47

Fano's inequality, 88
field, 161
σ-field, 162, 174
fixed-length code, 44
fixed-length data transmission code, 80
fixed-length lossy data compression code, 105
fundamental inequality, 19

generalized AEP, 55

Huffman code, 67
  adaptive, 71
hypothesis testing, 29
  Bayes criterion, 30
  Neyman-Pearson criterion, 30
  simple hypothesis testing, 29
  type I error, 30
  type II error, 30

identity channel, 92
infimum, 165, 166
  approximation property for infimum, 166
  completeness axiom, 166
  equivalence to greatest lower bound, 165
  monotone property, 166
  property for monotone function, 167
  set algebraic operations, 167
infinitely often, definition, 170
information-transmission theorem, 153
instantaneous code, 61

Jensen's inequality, 190
joint AEP, 81
joint source-channel code, 9–12
joint source-channel coding theorem, 153
joint typical set, 81

Karhunen-Loève expansion, 151
Kraft inequality, 58
Kullback-Leibler divergence, see divergence

law of large numbers
  strong law, 186, 187
  weak law, 186
Lempel-Ziv code, 73
limit infimum, see liminf under sequence
limit supremum, see limsup under sequence
log-sum inequality, 19

maximum, 164, 165
memoryless additive channel, 138
  Gaussian, 138
minimum, 165, 166
modes of convergence
  almost surely or with probability one, 182
  in distribution, 183
  in mean, 182
  in probability, 182
  mutual convergence criteria, 184
  pointwise, 182
  uniqueness of convergence limit, 184
monotone convergence lemma, 168
monotone convergence theorem, 185
mutual information, 24
  bound for memoryless channel, 26
  chain rule, 24, 26
  conditional, 24, 28
  convexity and concavity, 39
  for continuous cases, 127
  for specific input symbol, 96
  Venn diagram, 25

Neyman-Pearson lemma, 31
non-singular code, 58

pointwise ergodic theorem, 178
prefix code, 61
probability space, 174
processing of distribution, 34

random coding argument, 83
random process, 175
rate-distortion theorem, 109
rate-distortion function, 106
  binary source, 130
  Gaussian source, 131
  parallel Gaussian sources, 143
redundancy, 57
refinement of distribution, 33

self-information, 14, 17
  definition for single observation, 17
  joint, 19
  uniqueness, 15
sequence, 168
  liminf, 169, 170, 171
  limit, 168, 169
  limsup, 169, 171
set, 161
  boundedness, 166
set operation
  complement, 163
  equality, 162
  intersection, 162
  subset, 162
  union, 162
Shannon's channel coding theorem
  direct part, 82, 133
  weak converse, 90, 136
Shannon's source coding theorem
  for DMS, 49
  for stationary-ergodic sources, 55
  strong converse, 51, 56
Shannon, Claude E., 1
Shannon-Fano-Elias code, 69
Shannon-McMillan theorem for continuous sources, 122
Shannon-McMillan theorem, 47
Shannon-McMillan theorem for pairs, 82
Shannon-McMillan-Breiman theorem, 55
sharp, 89
shift-invariant transformation, 176
sibling property, 72
statistics of processes
  ergodic, 177, 187
  Markov
    k-th order, 180
    first-order, 180
  memoryless, 177
  stationary
    first-order, 177
    second-order, 177
    strict sense, 177
  weakly stationary, 177
sufficiently large, definition, 170
supremum, 163, 164
  approximation property for supremum, 164
  completeness axiom, 164
  equivalence to least upper bound, 163, 164
  monotone property, 166
  property for monotone function, 167
  set algebraic operations, 167
symmetric channel, 94

tight, 89
Toeplitz distortion theorem, 153
tree code, 45
typical set, 122

unique decodability, 58

variable-length code, 44
variational distance, 35
  bound from divergence, 35, 36

water-pouring scheme, 141, 153
weakly δ-typical set, 47
weakly symmetric channel, 94
