
CCE506: STOCHASTIC PROCESSES, DETECTION & ESTIMATION

Instructor: Dr. Mustafa El-Halabi

Fall 2020


2

TABLE OF CONTENTS

1 INTRODUCTION TO AXIOMATIC PROBABILITY
  1.1 Why Probability?
  1.2 The Algebra of Sets
  1.3 Probabilistic Models
    1.3.1 Discrete Probability Models
    1.3.2 Continuous Probability Models
  1.4 Combinatorics
  1.5 Conditional Probability
    1.5.1 Independence
  1.6 Application: Random Graphs
  1.7 Application: Machine Learning (Occam's Razor)

2 ONE RANDOM VARIABLE
  2.1 Random Variables
  2.2 Probability Mass Functions
  2.3 Cumulative Distribution Functions
  2.4 Probability Density Functions
    2.4.1 Tail Probability
  2.5 Statistical Parameters
  2.6 Conditioning a Random Variable on an Event
  2.7 Transforms
    2.7.1 The Moment Generating Function (MGF)
    2.7.2 Characteristic Function
  2.8 Evaluating Tail Probabilities
    2.8.1 Markov's Inequality
    2.8.2 Chebyshev's Inequality
    2.8.3 Chernoff's Bound
  2.9 Transformation of a Random Variable

3 PAIRS OF RANDOM VARIABLES
  3.1 Joint PMF, CDF and PDF of a Pair of Random Variables
    3.1.1 Joint PMF of Two Random Variables
    3.1.2 Joint CDF of Two Random Variables
    3.1.3 Joint PDF of Two Random Variables
  3.2 Conditioning One Random Variable on Another
    3.2.1 Independence
  3.3 One Discrete and One Continuous Random Variable
    3.3.1 MAP and ML Estimation
  3.4 Application: Classification
  3.5 Statistical Parameters
    3.5.1 Conditional Expectation and Variance
  3.6 Bivariate Gaussian Distribution
  3.7 Functions of Two Random Variables
    3.7.1 One Function of Two Random Variables
    3.7.2 Transformations of Two Random Variables

4 VECTOR RANDOM VARIABLE
  4.1 Random Vector
  4.2 Matrix Analysis Primer
  4.3 Statistical Parameters
  4.4 Transformations
    4.4.1 Linear Transformations
  4.5 Gaussian Random Variables in Multiple Dimensions
    4.5.1 Quadratic Transformations
    4.5.2 Coloring and Whitening Transformations
  4.6 Overview on Estimation Theory
    4.6.1 Linear Minimum Mean Square Error (LMMSE) Estimation
    4.6.2 Estimation Using a Vector of Observations

5 RANDOM SEQUENCES AND SERIES
  5.1 Random Sequences
  5.2 Convergence of a Random Sequence
    5.2.1 Almost sure convergence
    5.2.2 Convergence in probability
    5.2.3 Convergence in mean square
    5.2.4 Convergence in distribution
    5.2.5 The Weak Law of Large Numbers
  5.3 Asymptotic Equipartition Property

6 STOCHASTIC PROCESSES
  6.1 Definition of a Random Process
  6.2 Joint Distribution of Time Samples
    6.2.1 Statistical Parameters
    6.2.2 Discrete Random Process
    6.2.3 Gaussian Random Process
  6.3 Stationary and Ergodic Random Processes
    6.3.1 Properties of Autocorrelation Function
  6.4 Poisson Processes
  6.5 Power Spectral Density
  6.6 Random Processes in Linear Systems

Copyright © 2019, Dr. Mustafa El-Halabi


LIST OF FIGURES

1.1 Binary Symmetric Channel
1.2 G = (V, E)
1.3 Four-vertex graph
2.1 Visualization of a random variable.
2.2 CDF of a discrete random variable.
2.3 CDF of a continuous random variable.
2.4 Tail probability for a standard Gaussian distribution.
3.1 Example of a training dataset where each circle corresponds to one data instance, with input values on the corresponding axes and its sign indicating the class. For simplicity, only two customer attributes, income and savings, are taken as input, and the two classes are low-risk ("+") and high-risk ("−"). An example discriminant that separates the two types of examples is also shown.
3.2 A scattergram for 200 observations of four different pairs of random variables.
3.3 Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.
4.1 Comparison between linear and non-linear programming.
5.1 Plot of the distribution of Xn(ω)
5.2 Plot of the sequence for ω = 0
5.3 Plot of the sequence Xn(ω)
5.4 Plot of the CDF of 0
5.5 Plot of the CDF of U, X1, X2 and X3
6.1 Several realizations of a random process.
6.2 (a) Sinusoid with random amplitude, (b) sinusoid with random phase.
6.3 A possible realization of the random process W(n) and its corresponding averaging function X(n).
6.4 A possible realization of the random walk process.


CHAPTER 1

INTRODUCTION TO AXIOMATIC PROBABILITY

1.1 Why Probability?

Many electrical engineering students have studied, analyzed, and designed systems from the point of view of steady-state and transient signals, using time-domain or frequency-domain techniques. However, these techniques do not provide a method for accounting for variability in the signal, nor for unwanted disturbances such as interference and noise. We will see that the theory of probability and random processes is useful for modeling the uncertainty of various events (e.g., the arrival of telephone calls and the failure of electronic components). We also know that the performance of many systems is adversely affected by noise, which may often be present in the form of an undesired signal that degrades the performance of the system. Thus, it becomes necessary to design systems that can discriminate against noise and enhance a desired signal.

How do we distinguish between a deterministic signal or function and a stochastic or random phenomenon such as noise? Usually, noise is defined to be any undesired signal, which often occurs in the presence of a desired signal. This definition includes deterministic as well as non-deterministic signals. A deterministic signal is one which may be represented by some parameter values, such as a sinusoid, which may be perfectly reconstructed given an amplitude, frequency, and phase. Stochastic signals, such as noise, do not have this property. While they may be approximately represented by several parameters, stochastic signals have an element of randomness which prevents them from being perfectly reconstructed from a past history. Even the same word spoken by different speakers is not deterministic; there is variability, which can be modeled as a random fluctuation. Likewise, the amplitude and/or phase of a stochastic signal cannot be calculated for any specified future time instant, even though the entire past history of the signal may be known. However, the amplitude and/or phase of a stochastic signal can be predicted to occur with a specified probability, provided certain factors are known. The theory of probability provides a tool to model and analyze phenomena that occur in many diverse fields, such as communications, signal processing, control, and computers. Perhaps the major reason for studying probability and random processes is to be able to model complex systems and phenomena.

Uncertainty and stochasticity can arise from many sources. Researchers have made compelling arguments for quantifying uncertainty using probability since at least the 1980s. Nearly all activities require some ability to reason in the presence of uncertainty. In fact, beyond mathematical statements that are true by definition, it is difficult to think of any proposition that is absolutely true or any event that is absolutely guaranteed to occur. There are three possible sources of uncertainty:

1. Inherent stochasticity in the system being modeled. For example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume that the cards are truly shuffled into a random order.

2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all of the variables that drive the behavior of the system. For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant's choice is deterministic, but from the contestant's point of view, the outcome is uncertain.

3. Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model's predictions. For example, suppose we build a robot that can exactly observe the location of every object around it. If the robot discretizes space when predicting the future location of these objects, then the discretization makes the robot immediately become uncertain about the precise position of objects: each object could be anywhere within the discrete cell that it was observed to occupy.

1.2 The Algebra of Sets

Definition 1. (Set & Subsets) A set is a collection of objects, and a subset is a sub-collection of those objects.

Definition 2. (Cardinality) The cardinality |A| of a set A is the number of elements in A. If a set has cardinality n, then the total number of subsets of the set is 2^n.

Definition 3. (Power Set) The set containing all the subsets of A is called the power set of A; it has cardinality 2^n.

Example 1. An example of a set is A = {a, b, c}. The cardinality of A is 3. Set A has 2^3 = 8 subsets: ∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}.
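The subset count in Example 1 can be checked mechanically. The sketch below (plain Python, not part of the notes) enumerates the power set of a small set by collecting all combinations of each size:

```python
from itertools import combinations

def power_set(s):
    """Return every subset of s, from the empty set up to s itself."""
    elems = sorted(s)
    return [set(c) for r in range(len(elems) + 1)
            for c in combinations(elems, r)]

subsets = power_set({"a", "b", "c"})
print(len(subsets))   # 2^3 = 8 subsets, matching Example 1
```

Doubling the set size multiplies the subset count by 2 per added element, which is exactly the 2^n of Definition 2.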

In the study of probability, we are particularly interested in the set of all outcomes of a random experiment and in the subsets of this set. We will denote an experiment by E.

Definition 4. (Sample Space & Events) The set of all possible outcomes of an experiment is called the sample space and is denoted by Ω. An event is a subset of the sample space Ω.

Example 2.

1. Let E denote the experiment of flipping a coin. The corresponding sample space is Ω = {H, T}. A possible event is the subset {H}: the event that the coin lands heads.

2. Let E denote the experiment of flipping two coins. The corresponding sample space is Ω = {HH, TT, HT, TH}. A possible event is the subset {HH, HT, TH}, which corresponds to getting at least one head in two flips.

The following table illustrates the probabilistic interpretation of set-theoretic notions.

Typical notation           Set jargon                  Probability jargon
Ω                          Collection of objects       Sample space
ω                          Member of Ω                 Elementary event or outcome
A                          Subset of Ω                 Event that an outcome in A occurs
A^c                        Complement of A             No outcome in A occurs
A ∩ B                      Intersection set            Both A and B occur
A ∪ B                      Union set                   Either A or B or both occur
A − B                      Difference set              A occurs but not B
A ∆ B = (A−B) ∪ (B−A)      Symmetric difference set    Either A or B occurs, but not both
∅                          Empty set                   Impossible event
Ω                          Whole set or universe       Certain event
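Python's built-in set type implements each row of the table directly, which makes the correspondence easy to experiment with. The fair-die sample space and events below are illustrative choices, not from the notes:

```python
# Illustrative sample space and events for a single roll of a fair die.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}      # event: the roll is even
B = {4, 5, 6}      # event: the roll is at least 4

print(omega - A)   # complement A^c: no outcome in A occurs
print(A & B)       # intersection A ∩ B: both A and B occur
print(A | B)       # union A ∪ B: either A or B or both occur
print(A - B)       # difference A − B: A occurs but not B
print(A ^ B)       # symmetric difference A ∆ B: exactly one of A, B occurs
```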


Definition 5. (Disjoint Events) Two events A and B are said to be disjoint or mutually exclusive if A ∩ B = ∅.

Definition 6. (n-Partition) Given any set E, an n-partition of E consists of a sequence of sets E_i, i = 1, 2, ..., n, such that E_i ⊂ E, ⋃_{i=1}^n E_i = E, and E_i ∩ E_j = ∅ for all i ≠ j.

Set operations have several properties, which are elementary consequences of the definitions. Some examples are:

(A^c)^c = A,   A ∩ A^c = ∅,   A ∪ Ω = Ω,   A ∩ Ω = A

Two particularly useful properties are given by De Morgan's laws, which state

(⋃_i A_i)^c = ⋂_i A_i^c    (1.1)

(⋂_i A_i)^c = ⋃_i A_i^c    (1.2)
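For two events, De Morgan's laws (1.1)–(1.2) can be checked exhaustively on a small universe; the five-element universe below is an arbitrary choice for this sketch:

```python
from itertools import combinations

def de_morgan_holds(U, A, B):
    """Check identities (1.1) and (1.2) for events A, B inside universe U."""
    union_law = (U - (A | B)) == (U - A) & (U - B)   # (A ∪ B)^c = A^c ∩ B^c
    inter_law = (U - (A & B)) == (U - A) | (U - B)   # (A ∩ B)^c = A^c ∪ B^c
    return union_law and inter_law

# Exhaustive check over every pair of subsets of the universe.
U = frozenset(range(5))
subsets = [frozenset(c) for r in range(len(U) + 1) for c in combinations(U, r)]
print(all(de_morgan_holds(U, A, B) for A in subsets for B in subsets))
```

This is of course only evidence, not a proof; the laws hold for arbitrary (even infinite) index sets.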

1.3 Probabilistic Models

A probabilistic model is a mathematical description of an uncertain situation. It must be in accordance with a fundamental framework, described next.

Definition 7. (Probability Law) The probability law assigns to a set A of possible outcomes (an event) a non-negative number P(A), called the probability of A, which encodes our knowledge or belief about the collective likelihood of the elements of A.

The probability law must satisfy the following axioms.

Definition 8. (Axioms of Probability)

1. Axiom 1: (Non-negativity) P(A) ≥ 0 for every event A.

2. Axiom 2: (Additivity) If A and B are two disjoint events, then the probability of their union satisfies

P(A ∪ B) = P(A) + P(B)

3. Axiom 3: (Normalization) The probability of the entire sample space Ω is equal to 1:

P(Ω) = 1

These axioms are sufficient to establish the following properties:

• P(∅) = 0
• P(A ∩ B^c) = P(A) − P(A ∩ B)
• P(A^c) = 1 − P(A)
• If A ⊆ B, then P(A) ≤ P(B)
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• For mutually exclusive sets {A_n}_{n=1}^N (i.e., A_i ∩ A_j = ∅ for all i ≠ j),

P(⋃_{i=1}^N A_i) = ∑_{i=1}^N P(A_i)


Example 3. Using the axioms of probability and induction,

1. prove the following inequality, known as the Union Bound:

P(⋃_{i=1}^N A_i) ≤ ∑_{i=1}^N P(A_i)

2. prove Boole's inequality:

P(⋂_{i=1}^N A_i) ≥ 1 − ∑_{i=1}^N P(A_i^c)

Solution.

1. We first prove that P(A_1 ∪ A_2) = P(A_1) + P(A_2) − P(A_1 ∩ A_2). To start, express A_1 as (A_1 − A_2) ∪ (A_1 ∩ A_2). Since A_1 − A_2 and A_1 ∩ A_2 are disjoint sets, axiom 2 gives P(A_1) = P(A_1 − A_2) + P(A_1 ∩ A_2). Now, observing that A_1 ∪ A_2 = (A_1 − A_2) ∪ A_2, where A_1 − A_2 and A_2 are disjoint sets, applying axiom 2 again we get P(A_1 ∪ A_2) = P(A_1 − A_2) + P(A_2), with P(A_1 − A_2) = P(A_1) − P(A_1 ∩ A_2). Thus, P(A_1 ∪ A_2) = P(A_1) + P(A_2) − P(A_1 ∩ A_2).

To prove the Union Bound by induction, we first verify that it is true for N = 2. For N = 2, P(A_1 ∪ A_2) = P(A_1) + P(A_2) − P(A_1 ∩ A_2). Since P(A_1 ∩ A_2) ≥ 0,

P(A_1 ∪ A_2) ≤ P(A_1) + P(A_2)

Assume now that P(A_1 ∪ A_2 ∪ ... ∪ A_N) ≤ P(A_1) + P(A_2) + ··· + P(A_N). We need to prove that

P(A_1 ∪ A_2 ∪ ... ∪ A_{N+1}) ≤ P(A_1) + P(A_2) + ··· + P(A_{N+1})

Let B = A_1 ∪ A_2 ∪ ... ∪ A_N. Then,

P(⋃_{i=1}^{N+1} A_i) = P(B ∪ A_{N+1}) ≤ P(B) + P(A_{N+1}) ≤ P(A_1) + ··· + P(A_N) + P(A_{N+1}) = ∑_{i=1}^{N+1} P(A_i)

2. Using the previous result, we have

P(⋃_{i=1}^N A_i) ≤ ∑_{i=1}^N P(A_i)

Also, using De Morgan's law:

P(⋂_{i=1}^N A_i) = P((⋃_{i=1}^N A_i^c)^c) = 1 − P(⋃_{i=1}^N A_i^c) ≥ 1 − ∑_{i=1}^N P(A_i^c)

where the last step applies the union bound to the events A_i^c.
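Both bounds are easy to sanity-check on a concrete finite model. The sketch below uses exact rational arithmetic on a fair-die sample space; the particular events are arbitrary illustrative choices:

```python
from fractions import Fraction

# Uniform probability law on a fair die: P(E) = |E| / |Ω|.
omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    return Fraction(len(event), len(omega))

events = [{1, 2}, {2, 3}, {3, 4}]
union = set().union(*events)
inter = set(omega).intersection(*events)

# Union Bound: P(∪ A_i) ≤ Σ P(A_i)
print(prob(union) <= sum(prob(e) for e in events))
# Boole's inequality: P(∩ A_i) ≥ 1 − Σ P(A_i^c)
print(prob(inter) >= 1 - sum(prob(omega - e) for e in events))
```

Note how loose the bounds can be: here the union bound gives 1 while the true union probability is 4/6, and Boole's lower bound is negative while the true intersection probability is 0.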


Example 4. Let A and B be two events with probabilities P(A) = 3/4 and P(B) = 1/3. Show that

1/12 ≤ P(A ∩ B) ≤ 1/3

Solution.

P(A ∩ B) = P(A) + P(B) − P(A ∪ B) ≥ P(A) + P(B) − 1 = 1/12

Since A ∩ B ⊆ A, P(A ∩ B) ≤ P(A). Similarly, P(A ∩ B) ≤ P(B). Hence, P(A ∩ B) ≤ min{P(A), P(B)} = 1/3.

1.3.1 Discrete Probability Models

If the sample space consists of a finite number of possible outcomes, then the probability law is specified by the probabilities of the events that consist of a single element. In particular, the probability of any event {s_1, s_2, ..., s_n} is the sum of the probabilities of its elements:

P({s_1, s_2, ..., s_n}) = P(s_1) + P(s_2) + ··· + P(s_n)

If the probability space consists of n possible outcomes which are equally likely, then the probability of any event A is given by

P(A) = (number of elements of A)/n = |A|/|Ω|

Hence, when the probability assignment is uniform, i.e., all the outcomes have the same probability (which must be 1/|Ω|), computing probabilities reduces to counting outcomes.

Example 5. Consider the experiment of rolling a pair of 4-sided dice. We assume the dice are fair, and we interpret this assumption to mean that each of the sixteen possible outcomes [pairs (i, j), with i, j = 1, 2, 3, 4] has the same probability of 1/16. To calculate the probability of an event, we count the number of elements of the event and divide by 16 (the total number of outcomes). Using this approach, we can for instance calculate the probability of the event "at least one roll is equal to 4" as 7/16.
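Because the outcomes are equally likely, the probability reduces to counting, which a brute-force enumeration reproduces directly (a sketch, not course code):

```python
from itertools import product

# All 16 equally likely outcomes of rolling a pair of fair 4-sided dice.
outcomes = list(product(range(1, 5), repeat=2))

# Event: at least one roll is equal to 4.
event = [(i, j) for (i, j) in outcomes if i == 4 or j == 4]

print(len(event), "/", len(outcomes))   # 7 / 16
```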

1.3.2 Continuous Probability Models

Probabilistic models with continuous sample spaces differ from their discrete counterparts in that the probabilities of the single-element events may not be sufficient to characterize the probability law. This is illustrated in the following example, which also indicates how to generalize the uniform probability law to the case of a continuous sample space.

Example 6. A wheel of fortune is continuously calibrated from 0 to 1, so that the possible outcomes of an experiment consisting of a single spin are the numbers in the interval Ω = [0, 1]. Assuming a fair wheel, it is appropriate to consider all outcomes equally likely, but what is the probability of an event consisting of a single element? It cannot be positive, because then, using the additivity axiom, it would follow that events with a sufficiently large number of elements would have probability larger than 1. Therefore, the probability of any event that consists of a single element must be 0. In this example, it makes sense to assign probability b − a to any subinterval [a, b] of [0, 1], and to calculate the probability of a more complicated set by evaluating its "length". This assignment satisfies the axioms of probability and qualifies as a legitimate probability law.


1.4 Combinatorics

The calculation of probabilities often involves counting the number of outcomes in various events. Counting can be challenging; the art of counting constitutes a large portion of the field of combinatorics. In this section, we present the basic principle of counting and apply it to a number of situations that are often encountered in probabilistic models.

Definition 9. (The Counting Principle) Consider a process that consists of r stages. Suppose that:

• There are n_1 possible results at the first stage.
• For every possible result at the first stage, there are n_2 possible results at the second stage.
• More generally, for any sequence of possible results at the first i − 1 stages, there are n_i possible results at the ith stage.

Then, the total number of possible results of the r-stage process is

n_1 × n_2 × ··· × n_r    (1.3)

Definition 10. (k-Permutations) The number of different ways to pick k distinct objects out of n distinct objects and arrange them in a sequence is called the number of k-permutations and is given by

n(n−1)···(n−k+1) = n!/(n−k)!    (1.4)

In the special case where k = n, the number of possible sequences is simply called the number of permutations, and is given by

n!    (1.5)

Definition 11. (Combinations) The number of different ways to pick k distinct objects out of n distinct objects without an interest in the ordering is called the number of combinations and is given by

(n choose k) = n!/(k!(n−k)!)    (1.6)
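Python's standard library exposes both counts directly (math.perm and math.comb, available since Python 3.8); the values n = 5, k = 2 below are arbitrary:

```python
import math

n, k = 5, 2

# k-permutations, Eq. (1.4): ordered selections of k objects out of n.
print(math.perm(n, k))       # 5!/(5-2)! = 20
# Combinations, Eq. (1.6): the same selections with order ignored.
print(math.comb(n, k))       # 5!/(2! 3!) = 10
# Each combination corresponds to exactly k! orderings, linking (1.4) and (1.6).
print(math.perm(n, k) == math.comb(n, k) * math.factorial(k))
```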

A combination is a choice of k elements out of an n-element set without regard to order. Thus, a combination can be viewed as a partition of the set in two: one part contains k elements and the other contains the remaining n − k. We now generalize by considering partitions into more than two subsets.

Definition 12. (Partitions) Given an n-element set and nonnegative integers n_1, n_2, ..., n_r whose sum is equal to n, we consider partitions of the set into r disjoint subsets, with the ith subset containing exactly n_i elements. This can be done in

(n choose n_1, n_2, ..., n_r) = n!/(n_1! n_2! ··· n_r!) ways    (1.7)

where (n choose n_1, n_2, ..., n_r) is known as the multinomial coefficient.

Example 7. By rearranging the letters of the word TATTOO we can obtain 6!/(1! 2! 3!) = 60 different words.
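The TATTOO count can be confirmed by brute force: generate all 6! orderings of the letters, deduplicate the repeats, and compare against the multinomial coefficient of Eq. (1.7):

```python
from itertools import permutations
from math import factorial

word = "TATTOO"   # letter counts: A×1, O×2, T×3

# Brute force: the 6! = 720 orderings collapse to the distinct words.
distinct = len(set(permutations(word)))

# Closed form: the multinomial coefficient 6!/(1! 2! 3!).
formula = factorial(6) // (factorial(1) * factorial(2) * factorial(3))

print(distinct, formula)   # 60 60
```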

Theorem 1. (The Binomial Theorem)

(x + y)^n = ∑_{k=0}^n (n choose k) x^k y^{n−k}    (1.8)

Proof. The proof can easily be carried out using induction. Try it!

Theorem 2. (The Multinomial Theorem)

(x_1 + x_2 + ··· + x_r)^n = ∑_{(n_1, n_2, ..., n_r)} (n choose n_1, n_2, ..., n_r) x_1^{n_1} x_2^{n_2} ··· x_r^{n_r}    (1.9)

where the sum ranges over all nonnegative integers n_1, n_2, ..., n_r summing to n.
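A quick numerical check of the Binomial Theorem (1.8) using math.comb; the values of x, y, n below are arbitrary:

```python
import math

x, y, n = 3, 5, 7

# Sum the right-hand side of (1.8) term by term and compare with (x + y)^n.
expansion = sum(math.comb(n, k) * x**k * y**(n - k) for k in range(n + 1))
print(expansion == (x + y) ** n)   # True
```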


Example 8. (The Birthday Paradox)

The birthday paradox is a remarkable phenomenon that examines the chances that two people in a group have the same birthday. It is a "paradox" not because of a logical contradiction, but because it goes against intuition. For ease of calculation, we take the number of days in a year to be 365. Consider the case where there are n people in a room, and let X_i be the birthday of the ith person. The sample space consists of all n-tuples of birthdays; |Ω| = 365^n.

Let

A = "At least two people have the same birthday"
A^c = "No two people have the same birthday"

Clearly P(A) = 1 − P(A^c). We will calculate P(A^c), since it is easier, and then obtain P(A). How many ways are there for no two people to have the same birthday? There are 365 choices for the first person, 364 for the second, ..., and 365 − n + 1 choices for the nth person, for a total of 365 × 364 × ··· × (365 − n + 1) = 365!/(365 − n)!. Thus, we have

P(A^c) = |A^c|/|Ω| = (365 × 364 × ··· × (365 − n + 1))/365^n

Then, P(A) = 1 − (365 × 364 × ··· × (365 − n + 1))/365^n. In fact, for n = 23 people you should be willing to bet that at least two people do have the same birthday, since then P(A) is larger than 50%. For n = 60 people, P(A) is over 99%.
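The two quoted probabilities follow directly from the formula for P(A); a minimal sketch, assuming uniform and independent birthdays:

```python
def p_shared_birthday(n, days=365):
    """P(at least two of n people share a birthday), uniform birthdays."""
    p_distinct = 1.0
    for i in range(n):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct

print(round(p_shared_birthday(23), 4))   # just above 0.5
print(round(p_shared_birthday(60), 4))   # above 0.99
```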

1.5 Conditional Probability

Conditional probability provides us with a way to reason about the outcome of an experiment based on partial information. We thus seek to construct a new probability law that takes into account the available knowledge: a probability law that, for any event A, specifies the conditional probability of A given B, denoted by P(A|B). An appropriate definition of conditional probability is

P(A|B) ≜ P(A ∩ B)/P(B) = P(A, B)/P(B)    (1.10)

defined whenever P(B) > 0. Note that for a fixed event B, it can be verified that the conditional probability P(A|B) forms a legitimate probability law that satisfies the three axioms (verify!). Since conditional probabilities constitute a legitimate probability law, all general properties of probability laws remain valid. For example, a fact such as P(A ∪ C) ≤ P(A) + P(C) translates to the new fact P(A ∪ C|B) ≤ P(A|B) + P(C|B).

Example 9. (Balls and Bins)

Mathematically, the birthday paradox is an example of a more general question, often formulated in terms of balls and bins. Some number n of labeled balls are thrown into some number m of labeled bins. What does the distribution of balls in bins look like? The birthday paradox focuses on the first time a ball lands in a bin with another ball. One might also ask how many of the bins are empty, how many balls are in the fullest bin, and other such questions. Suppose we toss n = 3 (labeled) balls into m = 3 (labeled) bins.

1. What is the probability of having the first bin empty?

2. What is the probability of having the first bin empty given that the second bin is empty?

3. Assume that the balls are not labeled. What is the probability of having the first bin empty given that the second bin is empty?

Solution.

1. For the first bin to be empty, it has to be missed by all n balls. Since each ball lands in the first bin with probability 1/m, the probability that the first bin remains empty is

(1 − 1/m)^n ≈ e^{−n/m}

Copyright © 2019, Dr. Mustafa El-Halabi

In our case, the answer is 8/27.

2. Denote by P(A) and P(B) the probabilities that the first bin and the second bin are empty, respectively. We are trying to figure out P(A|B). But A ∩ B is the event that both of the first two bins are empty, i.e., all three balls fall in the third bin. So, P(A ∩ B) = 1/27. Therefore,

P(A|B) = P(A ∩ B)/P(B) = (1/27)/(8/27) = 1/8

3. With unlabeled balls, the total number of ways to distribute 3 balls among three labeled bins is 10. The number of distributions leaving a given bin empty is 4, so P(B) = 4/10 = 2/5, and only one distribution leaves both the first and second bins empty, so P(A ∩ B) = 1/10. Therefore,

P(A|B) = P(A ∩ B)/P(B) = (1/10)/(4/10) = 1/4
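The answers for the labeled case (parts 1 and 2) can be checked by simulation; a minimal Monte Carlo sketch (variable names are ours):

```python
import random

random.seed(0)
TRIALS = 200_000
empty1 = 0          # count of trials in which bin 0 is empty
empty2 = 0          # count of trials in which bin 1 is empty
empty1_and_2 = 0    # count of trials in which both bins 0 and 1 are empty

for _ in range(TRIALS):
    bins = [random.randrange(3) for _ in range(3)]  # each of 3 balls picks one of 3 bins
    e1 = 0 not in bins
    e2 = 1 not in bins
    empty1 += e1
    empty2 += e2
    empty1_and_2 += e1 and e2

print(empty1 / TRIALS)        # ≈ 8/27 ≈ 0.296
print(empty1_and_2 / empty2)  # ≈ P(A|B) = 1/8
```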

Example 10. (Radar Detection)

If an aircraft is present in a certain area, a radar detects it and generates an alarm signal with probability 0.99. If an aircraft

is not present, the radar generates a (false) alarm, with probability 0.1. We assume that an aircraft is present with probability

0.05. What is the probability of aircraft presence and no detection?

Solution. Let A and B be the events

A = "an aircraft is present", B = "the radar generates an alarm"

Then,

Ac = "an aircraft is not present", Bc = "the radar does not generate an alarm"

Hence, the desired probability is P(A ∩ Bc) = P(A)P(Bc|A) = 0.05 × 0.01 = 0.0005.

Proposition 1. (Multiplication Rule) Assuming that all of the conditioning events have positive probability, we have

P(⋂_{i=1}^{n} Ai) = P(A1)P(A2|A1)P(A3|A2,A1) ··· P(An | ⋂_{i=1}^{n−1} Ai)    (1.11)

Example 11. A batch of one hundred items is inspected by testing four randomly selected items. If one of the four is defective, the batch is rejected. What is the probability that the batch is accepted if it contains five defective items?

Solution. Let A be the event that the batch is accepted. Then, A = A1 ∩ A2 ∩ A3 ∩ A4, where Ai, i = 1,...,4, is the event that the ith tested item is not defective. Using the multiplication rule, we have

P(A) = P(A1)P(A2|A1)P(A3|A1 ∩ A2)P(A4|A1 ∩ A2 ∩ A3) = (95/100) × (94/99) × (93/98) × (92/97) = 0.812

We now explore some applications of conditional probability.


Theorem 3. (Total Probability Theorem) Let A1, ..., An be disjoint events that form an n-partition of the sample space. Then, for any event B, we have

P(B) = Σ_{i=1}^{n} P(Ai)P(B|Ai)    (1.12)

There is a category of problems known as inference. There are a number of "causes" that may result in a certain "effect". We observe the effect, and we wish to infer the cause. The events A1, ..., An are associated with the causes and the event B represents the effect. The probability P(B|Ai) that the effect will be observed when the cause Ai is present amounts to a probabilistic model of the cause-effect relation. Given that the effect B has been observed, we wish to evaluate the probability P(Ai|B) that the cause Ai is present. We refer to P(Ai|B) as the posterior probability of event Ai given the information, to be distinguished from P(Ai), which we call the prior probability. The following rule is used for inference.

Theorem 4. (Bayes' Rule) Let A1, ..., An be disjoint events that form an n-partition of the sample space. Then, for any event B, we have

P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai) / Σ_{j=1}^{n} P(Aj)P(B|Aj)    (1.13)

Example 12. (Mammogram)

A woman in her 40s decides to have a medical test for breast cancer called a mammogram. The test has a sensitivity of 80%, i.e., the test will correctly detect the disease with probability 0.8. On the other hand, the test gives a false alarm with probability 0.1. What is the probability that the woman has cancer if the test is positive? Note that the probability of having breast cancer among women in their 40s is fortunately pretty low: 4/1000.

Solution. Let A = 1 denote the event that the mammogram is positive, B = 1 the event that the woman has breast cancer, and B = 0 the event that she does not. Since the sensitivity of the test is 80%, P(A = 1|B = 1) = 0.8. Also, P(B = 1) = 0.004. Since the probability of false alarm is 0.1, P(A = 1|B = 0) = 0.1. Using Bayes' rule, we can compute

P(B = 1|A = 1) = P(A = 1|B = 1)P(B = 1) / [P(A = 1|B = 1)P(B = 1) + P(A = 1|B = 0)P(B = 0)] = (0.8 × 0.004)/(0.8 × 0.004 + 0.1 × 0.996) = 0.031
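The same computation, written as a small Python helper (the function name `posterior_positive` is ours):

```python
def posterior_positive(prior: float, sensitivity: float, false_alarm: float) -> float:
    """P(disease | positive test) via Bayes' rule."""
    evidence = sensitivity * prior + false_alarm * (1 - prior)  # total probability theorem
    return sensitivity * prior / evidence

p = posterior_positive(prior=0.004, sensitivity=0.8, false_alarm=0.1)
print(round(p, 3))  # 0.031
```

Despite the 80% sensitivity, the posterior is only about 3%, because the disease is rare and false alarms dominate.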

Figure 1.1: Binary symmetric channel: the input X ∈ {0,1} is received as Y ∈ {0,1}; each bit is flipped with probability p and delivered correctly with probability 1 − p.

Example 13. (ML Decision Rule)

Consider a binary symmetric channel (BSC) between a transmitter X and a receiver Y . The transmitter X can equally send

either a 0 or a 1. The receiver Y can receive either a 0 or a 1 according to the following diagram in Figure. 1.1, where p

is the probability of channel error. A detector placed at the channel output observes the incoming bits and decides whether


the received bit corresponds to a 0 or to a 1. The detector's decision rule is the maximum likelihood (ML) decision rule, given as follows:

P(X = 0|Y = i) ≷ P(X = 1|Y = i), for i = 0,1

where the detector decides 0 if the left side is larger and decides 1 otherwise.

1. Find the probability of receiving a 1 and the probability of sending a 0 and receiving a 0.

2. Suppose p = 1/3. If the ML decoder observes a 1, what should it decode?

Solution.

1. Using the total probability theorem:

P(Y = 1) = P(X = 0)P(Y = 1|X = 0) + P(X = 1)P(Y = 1|X = 1) = (1/2)p + (1/2)(1 − p) = 1/2

Using the multiplication rule:

P(X = 0, Y = 0) = P(X = 0)P(Y = 0|X = 0) = (1/2)(1 − p)

2. Since P(Y = 1|X = 0) = 1/3 and P(Y = 1|X = 1) = 2/3, then

P(X = 0|Y = 1) = P(Y = 1|X = 0)P(X = 0)/P(Y = 1) = (1/6)/(1/2) = 1/3

P(X = 1|Y = 1) = P(Y = 1|X = 1)P(X = 1)/P(Y = 1) = (1/3)/(1/2) = 2/3

Using the ML rule, we need to inspect the largest conditional probability, which is 2/3, corresponding to bit 1. Hence, if the decoder observes a 1, it must decode it as 1.
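The posterior computation above can be reproduced with exact rational arithmetic; a minimal sketch (variable names are ours):

```python
from fractions import Fraction

p = Fraction(1, 3)                               # channel error probability
prior = {0: Fraction(1, 2), 1: Fraction(1, 2)}   # equiprobable inputs

def lik(y, x):
    """P(Y = y | X = x) for the binary symmetric channel."""
    return p if y != x else 1 - p

y = 1  # observed output bit
evidence = sum(prior[x] * lik(y, x) for x in (0, 1))              # P(Y = 1) = 1/2
posterior = {x: prior[x] * lik(y, x) / evidence for x in (0, 1)}  # Bayes' rule

print(posterior[0], posterior[1])          # 1/3 2/3
print(max(posterior, key=posterior.get))   # the decoder outputs 1
```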

1.5.1 Independence

An interesting and important case arises when the occurrence of event B provides no information and does not alter the

probability that A has occurred, i.e.,

P(A|B) = P(A) (1.14)

When the above equality holds, we say that A is independent of B (also called unconditionally independent). Note that this

also implies the following

P(A∩B) = P(A)P(B) (1.15)

Independence is not easily visualized in terms of the sample space. A common first thought is that two events are independent

if they are disjoint, but in fact the opposite is true: two disjoint events A and B with P(A) > 0 and P(B) > 0 are never

independent, since their intersection is empty and has a probability 0. This should make sense logically since if A and B are

disjoint, then roughly, the presence of A implies the absence of B (and vice versa). It is clear that the events should not be

independent. The events somehow affect each other.

We noted earlier that the conditional probabilities of events, conditioned on a particular event, form a legitimate probability

law. We can thus talk about independence of various events with respect to this conditional law.

Definition 13. (Conditional Independence)

Unfortunately, unconditional independence is rare, because most events can influence most other events. However, usually

this influence is mediated via other events rather than being direct. We therefore say A and B are conditionally independent

(CI) given C iff the conditional joint can be written as a product of conditional marginals:

P(A ∩ B|C) ≜ P(A,B|C) = P(A|C)P(B|C) ⇔ A ⊥ B | C    (1.16)


Actually, we can write this assumption as a graph A−C −B, which captures the intuition that all the dependencies between

A and B are mediated via C .

Example 14. Show that if A and B are conditionally independent given C , then

P(A|B ∩C ) = P(A|C )

Solution.

P(A ∩ B|C) = P(A ∩ B ∩ C)/P(C) = P(C)P(B|C)P(A|B ∩ C)/P(C) = P(B|C)P(A|B ∩ C)

Since P(A∩B|C ) = P(A|C )P(B|C ), then we conclude that P(A|B ∩C ) = P(A|C ).

Remark 1. For any probability model, let A and B be independent events, and let C be an event such that P(C ) > 0,

P(A|C ) > 0, and P(B|C ) > 0, while A∩B ∩C is empty. Then, A and B cannot be conditionally independent (given C )

since P(A ∩ B ∩ C) = 0 while P(A|C)P(B|C) > 0. Interestingly, independence of two events with respect to the unconditional probability law does not imply conditional independence, and vice versa.

Example 15. Consider two independent fair coin tosses, in which all four possible outcomes are equally likely. Let

H1 = "First toss is a head"

H2 = "Second toss is a head"

D = "The two tosses have different results"

The events H1 and H2 are (unconditionally) independent. But,

P(H1|D) = 1/2, P(H2|D) = 1/2, P(H1 ∩ H2|D) = 0

so that P(H1∩H2|D) 6= P(H1|D)P(H2|D), and H1, H2 are not conditionally independent.

Example 16. Show that X ⊥ Y | Z iff there exist functions g and h such that

p(x,y,z) = g(x,z)h(y,z)

for all x, y, z such that p(z) > 0.

Solution. The “only if” part follows immediately from the definition of conditional independence, so we will only prove the

“if” part. Assume

p(x ,y ,z) = g(x ,z)h(y ,z)

for all x ,y ,z such that p(z)> 0. Then, for such x , y and z , we have

p(x,z) = Σ_y p(x,y,z) = Σ_y g(x,z)h(y,z) = g(x,z) Σ_y h(y,z)

and

p(y,z) = Σ_x p(x,y,z) = Σ_x g(x,z)h(y,z) = h(y,z) Σ_x g(x,z)


Furthermore,

p(z) = Σ_y p(y,z) = Σ_y Σ_x g(x,z)h(y,z) = (Σ_x g(x,z))(Σ_y h(y,z))

Therefore,

p(x,z)p(y,z)/p(z) = (g(x,z) Σ_y h(y,z))(h(y,z) Σ_x g(x,z)) / ((Σ_x g(x,z))(Σ_y h(y,z))) = g(x,z)h(y,z) = p(x,y,z)

Dividing both sides of the previous equation by p(z), we get

p(x|z)p(y|z) = p(x,y|z)

Definition 14. (Independence of Several Events) We say that the events A1, ..., An are independent if

P(⋂_{i∈S} Ai) = ∏_{i∈S} P(Ai), for every subset S of {1,2,...,n}.    (1.17)

For the case of three events A1,A2,A3, independence amounts to satisfying the four conditions

P(A1∩A2) = P(A1)P(A2)

P(A1∩A3) = P(A1)P(A3)

P(A2∩A3) = P(A2)P(A3)

P(A1∩A2∩A3) = P(A1)P(A2)P(A3)

The first three conditions simply assert that any two events are independent, a property known as pairwise independence.
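A classic illustration that pairwise independence does not imply independence uses two fair coin tosses with the third event "the two tosses agree" (a close relative of Example 15); a quick exhaustive check (names are ours):

```python
from itertools import product

omega = list(product([0, 1], repeat=2))  # two fair coin tosses, each outcome has prob 1/4

def P(event):
    """Probability of an event (a predicate on outcomes) under the uniform law."""
    return sum(1 for w in omega if event(w)) / len(omega)

A1 = lambda w: w[0] == 1     # first toss is heads
A2 = lambda w: w[1] == 1     # second toss is heads
A3 = lambda w: w[0] == w[1]  # the two tosses agree

# pairwise independence holds:
assert P(lambda w: A1(w) and A2(w)) == P(A1) * P(A2)
assert P(lambda w: A1(w) and A3(w)) == P(A1) * P(A3)
assert P(lambda w: A2(w) and A3(w)) == P(A2) * P(A3)
# ...but the triple condition fails, so A1, A2, A3 are not independent:
assert P(lambda w: A1(w) and A2(w) and A3(w)) != P(A1) * P(A2) * P(A3)
```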

1.6 Application: Random Graphs

The theory of random graphs lies at the intersection between graph theory and probability theory. From a mathematical

perspective, random graphs are used to answer questions about the properties of typical graphs. Its practical applications

are found in all areas in which complex networks (e.g., the Internet) need to be modeled − a large number of random graph

models are thus known, mirroring the diverse types of complex networks encountered in different areas. A random graph is

obtained by starting with a set of n isolated vertices and adding successive edges between them at random. The aim is to

determine at what stage a particular property of the graph is likely to arise. A graph G is defined using a set of vertices V

and a set of edges E : G = (V ,E ).

Figure 1.2: A graph G = (V,E) on four vertices.


Example 17. In Figure 1.2, the graph G = (V,E) is defined by V = {1,2,3,4} and E = {{1,2},{2,3},{3,4}}.

Figure 1.3: Four-vertex graphs in which vertex 1 has exactly two neighbors: N = {2,3}, N = {3,4}, and N = {2,4}.

Example 18. Consider a graph of four vertices as seen in Figure 1.3. An edge exists between any pair of vertices with probability p, independently. What is the probability that a given vertex is connected to exactly two other vertices?

Solution. Let N denote the set of vertices connected to vertex 1, with |N| = 2. Writing A_{i,j} for the event N = {i,j}, the event we are interested in is A = A_{2,3} ∪ A_{3,4} ∪ A_{2,4}. Hence,

P(A) = P(A_{2,3}) + P(A_{3,4}) + P(A_{2,4}) = 3p²(1 − p)

In general, the probability that a vertex is connected to exactly k others in an n-node graph is

P(A) = C(n−1, k) p^k (1−p)^{n−1−k}
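The general formula is straightforward to evaluate; a minimal sketch (the function name `p_degree` is ours):

```python
from math import comb

def p_degree(n: int, k: int, p: float) -> float:
    """P(a given vertex has exactly k neighbors) in an n-vertex random graph,
    where each of its n-1 potential edges exists independently with probability p."""
    return comb(n - 1, k) * p**k * (1 - p)**(n - 1 - k)

print(p_degree(4, 2, 0.5))  # 3 * p^2 * (1-p) = 0.375 for p = 1/2
```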

1.7 Application: Machine Learning (Occam's Razor)

Consider a box that produces a specific type of numbers between 1 and 100. A machine observing the outputs of that box has to figure out what type of numbers (we call this the concept) the box produces. The machine observes one output of the box, which is the number 16 (we call this observation a positive example of the concept). The machine will have a hard time figuring out what other numbers are positive! It is hard to tell with only one example, so the machine's predictions will be quite vague. Presumably numbers that are similar in some sense to 16 are more likely. But similar in what way? 17 is similar, because it is "close by"; 6 is similar because it has a digit in common; 32 is similar because it is also even and a power of 2; but 99 does not seem similar. Thus some numbers are more likely than others.

Assume now that the machine observes the numbers 8, 2, and 64 as further positive examples. Let the set of positive examples be D = {16, 8, 2, 64}. At this point the machine can draw a number of hypotheses concerning the concept which are consistent with the new evidence. For simplicity, let us assume there are only two hypotheses: h_two, "powers of two", and h_even, "even numbers".

One technique for resolving which hypothesis is more consistent with the observed data is to use what is commonly known as Occam's razor. This technique involves computing the probability of independently sampling N items (with replacement) from h for each hypothesis, i.e., computing the likelihood of each hypothesis

p(D|h) = [1/|h|]^N


Let us see how this works. For h_two, there are only 6 powers of two less than 100, so p(D|h_two) = (1/6)^4 = 7.7 × 10^−4. For h_even, there are 50 even numbers between 1 and 100, so p(D|h_even) = (1/50)^4 = 1.6 × 10^−7. The likelihood ratio is thus almost 5000:1 in favor of h_two. Hence, the machine will elect h_two as the most plausible hypothesis.
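These likelihoods can be checked directly; a minimal sketch (names are ours):

```python
D = [16, 8, 2, 64]  # observed positive examples

hypotheses = {
    "powers of two": [2**i for i in range(1, 7)],  # 2, 4, ..., 64: six numbers <= 100
    "even numbers":  list(range(2, 101, 2)),       # fifty numbers <= 100
}

def likelihood(data, h):
    """P(D|h): independent sampling with replacement from h (zero if an item is not in h)."""
    prob = 1.0
    for d in data:
        prob *= (1 / len(h)) if d in h else 0.0
    return prob

for name, h in hypotheses.items():
    print(name, likelihood(D, h))

ratio = likelihood(D, hypotheses["powers of two"]) / likelihood(D, hypotheses["even numbers"])
print(ratio)  # ≈ 4822.5, i.e., almost 5000:1 in favor of "powers of two"
```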


CHAPTER 2

ONE RANDOM VARIABLE

2.1 Random Variables

In probabilistic models, the outcomes could be numerical or not. However, the non-numerical outcomes may be associated

with some numerical values of interest. When dealing with such numerical values, it is often useful to assign probabilities

to them. This is done through the notion of a random variable. Given an experiment and the corresponding set of possible

outcomes, a random variable associates a particular number with each outcome. We refer to this number as the value of the

random variable.

Definition 15. A random variable is a real-valued function X (ω) of the elements of a sample space, Ω. Given an experiment,

E , with sample space, Ω, the random variable X maps each possible outcome, ω, to a real number X (ω) as specified by

some rule. See Fig. 2.1. We use upper case letters for random variables and use lower case letters for values (realizations)

Figure 2.1: Visualization of a random variable as a mapping ω ↦ X(ω).

of the random variables: X = x means that the random variable X takes on the value x, i.e., X(ω) = x where ω is the outcome.

If the mapping X(ω) is such that the random variable X takes on a finite or countably infinite number of values, then we refer to X as a discrete random variable; whereas, if the range of X(ω) is an uncountably infinite set of points, we refer to X as a continuous random variable.

Example 19. Consider the random experiment of tossing a coin twice. The sample space is Ω = {HH, HT, TH, TT}. Define, for any ω ∈ Ω, X(ω) as the number of heads in ω. Hence, the random variable X takes the values 0, 1, and 2.

There are several basic concepts associated with random variables, which are summarized below. These concepts will be

discussed later in this chapter. Starting with a probabilistic model of an experiment:


• A random variable is a real-valued function of the outcome of the experiment

• A function of a random variable defines another random variable

• We can associate with each random variable certain “averages” of interest, such as the mean and the variance.

• A random variable can be conditioned on an event or another random variable

• There is a notion of independence of a random variable from an event or from another random variable

Since X(ω) is a random variable whose numerical value depends on the outcome of an experiment, we cannot describe the random variable by stating its value; rather, we must give it a probabilistic description by stating the probabilities that X takes on a specific value or set of values, e.g., P(X = 3), P(X > 8), P(−1 ≤ X < 7), etc.

2.2 Probability Mass Functions

We start with the probabilistic description for a discrete random variable.

Definition 16. (Probability Mass Function) If x is any real number, the probability mass function of X, denoted pX(x), is the probability of the event {X = x} consisting of all outcomes that give rise to a value of X equal to x:

pX(x) = P(X = x)

We take the convention that upper case letters (X) represent random variables and lower case letters (x) represent fixed values that the random variable can assume (a realization). The PMF satisfies the normalization property

Σ_x pX(x) = 1    (2.1)

We consider the following famous discrete random variables which are of vital importance in communication theory.

Definition 17 (Bernoulli Random Variable). This is the simplest possible random variable and is used to represent experiments that have two possible outcomes. These experiments are called Bernoulli trials and the resulting random variable is called a Bernoulli random variable. It is common to associate the values {0,1} with the outcomes of the experiment; one is referred to as a success event and the other as a failure event.

If X is a Bernoulli random variable, its PMF is of the form pX(0) = 1 − p and pX(1) = p. A random variable X following a Bernoulli distribution is denoted X ∼ Ber(p).

Definition 18 (Binomial Random Variable). Consider repeating a Bernoulli trial n times, where the outcome of each trial is independent of all others. The Bernoulli trial has a sample space S = {0,1}, and we say that the repeated experiment has a sample space S^n = {0,1}^n, which is referred to as a Cartesian space. The discrete random variable X associated with k successes out of n trials is referred to as the binomial random variable with parameters n and p, having the following PMF

pX(k) = P(X = k) = C(n,k) p^k (1−p)^{n−k}, k = 0,1,...,n    (2.2)

(Note that here and elsewhere, we simplify notation and use k, instead of x, to denote the values of integer-valued random variables.) A random variable X following a binomial distribution is denoted by X ∼ Binom(n,p). Reminder:

Binomial expansion: (a + b)^n = Σ_{k=0}^{n} C(n,k) a^k b^{n−k}    (2.3)

The normalization property, specialized to the binomial random variable, is written as

Σ_{k=0}^{n} C(n,k) p^k (1−p)^{n−k} = (p + 1 − p)^n = 1
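The binomial PMF and its normalization can be checked numerically; a minimal sketch using Python's `math.comb` (the function name `binom_pmf` is ours):

```python
from math import comb, isclose

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binom(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# normalization: the PMF sums to (p + (1-p))^n = 1
assert isclose(sum(binom_pmf(k, 10, 0.3) for k in range(11)), 1.0)

print(binom_pmf(3, 10, 0.3))  # P(X = 3) for X ~ Binom(10, 0.3)
```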


Example 20. Consider a random graph G consisting of 10 vertices. In graph theory, an adjacent vertex or a neighbor of a

vertex v in a graph is a vertex that is connected to v by an edge. In G , each edge exists with probability 1/4. What is the

probability that vertex 1 has 3 neighbors?

P(X = 3) = C(9,3) (1/4)^3 (3/4)^6

Definition 19 (Poisson Random Variable). Consider a binomial random variable X where the number of repeated trials n is very large. In that case, evaluating the binomial coefficients poses numerical problems. If the probability of success in each individual trial, p, is very small, then the binomial random variable can be well approximated by a Poisson random variable. Formally, let n approach infinity and p approach zero in such a way that lim_{n→∞} np = α. Then, the binomial PMF converges to the form

pX(k) = P(X = k) = (α^k / k!) e^{−α}, k = 0,1,2,...    (2.4)

This probability mass function is known as the Poisson distribution of rate α.

Example 21. Verify that the binomial distribution can be well approximated by a Poisson distribution. Let p = α/n. Then,

P(X = k) = C(n,k) p^k (1−p)^{n−k} = [n!/((n−k)! k!)] (α/n)^k (1 − α/n)^{n−k}

= [n(n−1)···(n−k+1)/n^k] (α^k/k!) (1 − α/n)^n / (1 − α/n)^k

Now, as n → ∞ with α moderate, we have

(1 − α/n)^n ≈ e^{−α}, n(n−1)···(n−k+1)/n^k ≈ 1, (1 − α/n)^k ≈ 1

Hence, for n large and α moderate,

P(X = k) ≈ (α^k/k!) e^{−α}
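The convergence can also be observed numerically; a sketch (names are ours) comparing the two PMFs as n grows with α = np held fixed:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, alpha):
    return alpha**k / factorial(k) * exp(-alpha)

alpha = 2.0
for n in (10, 100, 10_000):
    p = alpha / n
    # maximum absolute PMF difference over k = 0..9 shrinks as n grows
    err = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, alpha)) for k in range(10))
    print(n, err)
```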

Example 22. Suppose that the arrival of telephone calls at a switch can be modeled with a Poisson PMF of rate λt, where λ is the average arrival rate in calls/minute. Suppose that the average rate of calls is 10 per minute. Find the probability that fewer than three calls will be received in the first 6 seconds.

Solution.

P(X = k) = ((λt)^k / k!) e^{−λt} = ((10t)^k / k!) e^{−10t}, k = 0,1,2,...

The probability p that fewer than three calls will be received in the first 6 seconds (t = 0.1 min) is

p = Σ_{k=0}^{2} P(X = k) = e^{−1}/0! + e^{−1}/1! + e^{−1}/2! = 0.919    (2.5)
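The same number, computed directly (names are ours):

```python
from math import exp, factorial

lam, t = 10, 0.1    # 10 calls/minute; 6 seconds = 0.1 minute
mean = lam * t      # Poisson rate λt = 1

p_fewer_than_3 = sum(mean**k / factorial(k) * exp(-mean) for k in range(3))
print(p_fewer_than_3)  # ≈ 0.9197, the 0.919 computed above
```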


Definition 20 (Geometric Random Variable). Consider repeating a Bernoulli trial until the first occurrence of a success (with probability p). If X represents the number of trials performed up to and including the first success, then X is a geometric random variable with the following PMF

pX(k) = P(X = k) = p(1−p)^{k−1}, k = 1,2,...    (2.6)

If instead X counts the number of failures that occur before the first success, then the PMF of X is

pX(k) = P(X = k) = p(1−p)^k, k = 0,1,2,...    (2.7)

Example 23. Computer A sends a message to computer B over an unreliable radio link. The message is encoded so that B can detect when errors have been introduced into the message during transmission. If B detects an error, it requests that A retransmit the message. If the probability of a message transmission error is q = 0.1, what is the probability that a message needs to be transmitted more than two times?

Solution. Let X be the random variable denoting the number of transmissions until the message is received without a detected error. X follows a geometric distribution with pX(k) = p(1−p)^{k−1}, k = 1,2,..., where p = 1 − q is the probability of an error-free transmission. The probability that the message needs to be transmitted more than two times is

1 − P(X = 1) − P(X = 2) = 1 − p − p(1−p) = (1−p)² = q² = 0.01
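A quick numerical check of this calculation (names are ours):

```python
q = 0.1    # probability a transmission error is detected
p = 1 - q  # probability a transmission succeeds

def geom_pmf(k, p):
    """P(first success on trial k), k = 1, 2, ..."""
    return p * (1 - p)**(k - 1)

p_more_than_two = 1 - geom_pmf(1, p) - geom_pmf(2, p)
print(p_more_than_two)  # q**2 = 0.01, up to float rounding
```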

Definition 21 (Pascal Random Variable). Consider repeating a Bernoulli trial until the mth occurrence of a success (with probability p). If X represents the number of trials needed until the mth success, then X is a generalized geometric random variable, or Pascal random variable, with the following PMF

pX(k) = P(X = k) = C(k−1, m−1) p^m (1−p)^{k−m}, k = m, m+1,...    (2.8)

The Pascal distribution can easily be derived from the geometric distribution. Note that in the first k − 1 trials we had m − 1 successes, and at the kth trial we had the mth success. Hence,

P(X = k) = C(k−1, m−1) p^{m−1} (1−p)^{k−m} × p = C(k−1, m−1) p^m (1−p)^{k−m}

Definition 22 (Hypergeometric Random Variable). Suppose we are allowed to perform n draws (without replacement) from a population of size N that contains exactly m successes and N − m failures. The number of successes in n draws is a random variable X that follows a hypergeometric distribution of the form

pX(k) = P(X = k) = C(m,k) C(N−m, n−k) / C(N,n), k = 0,1,2,...,n    (2.9)

2.3 Cumulative Distribution Functions

We are thus confronted with the fact that continuous random variables take no fixed value with positive probability. The way to understand this apparent paradox is to realize that continuous random variables are an idealized model of what we normally think of as continuous-valued measurements. For example, a voltmeter only shows a certain number of digits after the decimal point, say 5.127 volts, because physical devices have limited precision. Hence, the measurement X = 5.127 should be understood as saying 5.1265 ≤ X < 5.1275, since all numbers in this range round to 5.127. Now, there is no paradox, since P(5.1265 ≤ X < 5.1275) has positive probability. You may still ask, "why not just use a discrete random variable taking the distinct values k/1000, where k is any integer?" After all, this would model the voltmeter in question. One answer is that if you get a better voltmeter, you need to redefine the random variable, while with the idealized, continuous-random-variable


model, even if the voltmeter changes, the random variable does not. Also, the continuous-random-variable model is often

mathematically simpler to work with.

Since a continuous random variable will typically have zero probability of taking on any specific value, we avoid talking about such probabilities. Instead, events of the form {X ≤ x} can be considered. In general, it is desirable to describe all kinds of random variables (whether discrete or continuous) with a single mathematical concept. This is accomplished with the cumulative distribution function, or CDF.

Definition 23 (Cumulative Distribution Function). The cumulative distribution function (CDF) of a random variable X is:

FX (x) = P(X ≤ x) (2.10)

Loosely speaking, the CDF FX(x) accumulates the probability up to the value x. Any random variable associated with a given probability model has a CDF, regardless of whether it is discrete or continuous. This is because {X ≤ x} is always an event and therefore has a well-defined probability. For a discrete random variable, the PMF and the CDF are related as follows:

FX(x) = P(X ≤ x) = Σ_{k≤x} pX(k)    (2.11)

Properties of CDF:

1. FX (−∞) = 0, FX (∞) = 1

2. 0≤ FX (x)≤ 1

3. For x1 < x2, FX (x1)≤ FX (x2), i.e., the CDF is monotonically non-decreasing

4. For x1 < x2, P(x1 < X ≤ x2) = FX (x2)−FX (x1)

5. FX (x) is right continuous, i.e., FX (a+) = limx→a+ FX (x) = FX (a)

6. P(X = a) = FX (a)−FX (a−)

Example 24. Given the CDF of a random variable X: FX(x) = (1 − e^{−x})U(x), find: (a) P(X > 5), (b) P(X < 5), (c) P(3 < X < 7), (d) P(X > 5|X < 7).

Solution.

(a) P(X > 5) = 1 − P(X ≤ 5) = 1 − FX(5) = e^{−5}

(b) P(X ≤ 5) = P(X < 5) + P(X = 5) = P(X < 5) + 0, so P(X < 5) = FX(5) = 1 − e^{−5}

(c) P(3 < X < 7) = P(3 < X ≤ 7) = FX(7) − FX(3) = e^{−3} − e^{−7}

(d) P(X > 5|X < 7) = P((X > 5) ∩ (X < 7))/P(X < 7) = (e^{−5} − e^{−7})/(1 − e^{−7})
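These four probabilities can be evaluated numerically, taking U(x) as the unit step (names are ours):

```python
from math import exp

def F(x):
    """The CDF of Example 24: (1 - e^{-x}) U(x)."""
    return 1 - exp(-x) if x > 0 else 0.0

print(1 - F(5))              # (a) e^-5
print(F(5))                  # (b) 1 - e^-5
print(F(7) - F(3))           # (c) e^-3 - e^-7
print((F(7) - F(5)) / F(7))  # (d) (e^-5 - e^-7)/(1 - e^-7)
```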

Example 25. You are allowed to take a certain test three times, and your final score will be the maximum of the test scores.

Assume that your score in each test takes one of the values from 1 to 10 with equal probability 1/10, independently of the

scores in other tests. What is the PMF of the final score?

Solution. Let X1, X2, X3 be the three test scores and X = max{X1, X2, X3} be the random variable representing the final score. We start by calculating the CDF of X:

FX(k) = P(X ≤ k) = P(X1 ≤ k, X2 ≤ k, X3 ≤ k) = P(X1 ≤ k)P(X2 ≤ k)P(X3 ≤ k) = (k/10)³


Thus, the PMF of X is given by

pX(k) = P(X = k) = FX(k) − FX(k−1) = (k/10)³ − ((k−1)/10)³, k = 1,...,10
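A quick check that this is a valid PMF (names are ours):

```python
# PMF of the maximum of three independent uniform scores on {1, ..., 10}
pmf = {k: (k / 10)**3 - ((k - 1) / 10)**3 for k in range(1, 11)}

assert abs(sum(pmf.values()) - 1.0) < 1e-9  # a valid PMF sums to one

print(pmf[10])  # 1 - 0.9**3 = 0.271 (up to float rounding)
```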

A random variable is said to be discrete if FX(x) consists only of steps over a countable set X. See Fig. 2.2. In that case,

pX(x) = P(X = x) for every x ∈ X

Figure 2.2: CDF of a discrete random variable.

A random variable is said to be continuous if its FX (x) is continuous and differentiable (except possibly over a countable

set). See Fig. 2.3.

2.4 Probability Density Functions

While the CDF introduced in the last section represents a mathematical tool to statistically describe a random variable,

it is often quite cumbersome to work with CDFs. For example, we will see later in this chapter that the most important

and commonly used random variable, the Gaussian random variable, has a CDF which cannot be expressed in closed form.

Furthermore, it can often be difficult to infer various properties of a random variable from its CDF. To help circumvent these

problems, an alternative and often more convenient description known as the probability density function is often used for

continuous random variables.

Definition 24. The probability density function (PDF) of a random variable X evaluated at X = x is

fX(x) = dFX(x)/dx    (2.12)

Equivalently, given the PDF one can find the CDF:

FX(x) = ∫_{−∞}^{x} fX(t) dt    (2.13)

If FX(x) is differentiable everywhere, then

fX(x) = dFX(x)/dx = lim_{Δx→0} [FX(x + Δx) − FX(x)]/Δx = lim_{Δx→0} P(x < X ≤ x + Δx)/Δx    (2.14)

Figure 2.3: CDF of a continuous random variable.


Hence, we can interpret the PDF as the probability per unit distance.

Properties of the PDF:

1. Non-negativity: fX(x) ≥ 0

2. FX(x) = ∫_{−∞}^{x} fX(a) da

3. Normalization: ∫_{−∞}^{∞} fX(x) dx = 1

4. ∫_{a}^{b} fX(x) dx = P(a < X ≤ b) = FX(b) − FX(a)

5. For any event A, P(X ∈ A) = ∫_{x∈A} fX(x) dx

Note that to qualify as a PDF, a function fX (x) must satisfy the non-negativity and the normalization properties. To

interpret the PDF, note that for an interval [x ,x +δ ] with very small length δ , we have

P(X ∈ [x, x + δ]) = ∫_{x}^{x+δ} fX(t) dt ≈ fX(x)·δ, (2.15)

so we can view fX (x) as the “probability mass per unit length” near x . It is important to note that even though a PDF is

used to calculate event probabilities, fX (x) is not the probability of any particular event. In particular, it is not restricted to

be less than or equal to 1.

The distribution functions which play an important role in communication theory are the uniform, exponential, Laplace, Rayleigh, Rician, and Gaussian distributions.

Definition 25 (Uniform random variable).

fX(x) = 1/(b − a), a ≤ x < b (2.16)

When X is uniformly distributed, we denote it as X ∼ U[a,b].

Definition 26 (Exponential random variable).

fX(x) = (1/b) exp(−x/b) U(x) (2.17)

When X is exponentially distributed with parameter b, we denote it as X ∼ Exp(b).

Definition 27 (Laplace random variable).

fX(x) = (1/(2b)) exp(−|x|/b) (2.18)

When X is Laplacian distributed, we denote it as X ∼ Laplace(b).

Definition 28 (Rayleigh random variable).

fX(x) = (x/σ^2) exp(−x^2/(2σ^2)) U(x) (2.19)

When X is Rayleigh distributed, we denote it as X ∼ Rayleigh(σ).

Definition 29 (Rician random variable).

fX(x) = (x/σ^2) exp(−(x^2 + a^2)/(2σ^2)) I0(ax/σ^2) U(x) (2.20)

where I0(x) is the modified Bessel function of the first kind of order zero, defined as

I0(x) = (1/(2π)) ∫_{0}^{2π} e^{x cos θ} dθ (2.21)

When X is Rician distributed, we denote it as X ∼ Rice(a,σ).


Definition 30 (Gamma random variable). A random variable X is Gamma distributed with parameters (α,λ) if it has the following PDF

fX(x) = λ e^{−λx} (λx)^{α−1} / Γ(α) · U(x) (2.22)

where Γ(α) is the Gamma function, defined as

Γ(α) = ∫_{0}^{∞} y^{α−1} e^{−y} dy (2.23)

The Gamma function has the following properties:

1. Γ(α) = (α − 1)Γ(α − 1)

2. Γ(n) = (n − 1)! for any positive integer n

We denote a Gamma random variable with parameters (α,λ) by X ∼ Gamma(α,λ).

Definition 31 (Gaussian/Normal random variable).

fX(x) = (1/√(2πσ^2)) exp(−(x − µ)^2/(2σ^2)) (2.24)

When X is Gaussian distributed, we denote it as X ∼N (µ,σ2). For the case when µ = 0 and σ = 1, the random variable is

referred to as the standard normal random variable.
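These named densities are easy to check numerically. The following sketch (not part of the notes; the helper names are invented for illustration) evaluates several of the PDFs defined above and verifies the normalization property by trapezoidal integration.

```python
import numpy as np

# Hypothetical helpers implementing the PDFs of Definitions 25, 26, 28, and 31.
def uniform_pdf(x, a, b):
    return np.where((x >= a) & (x < b), 1.0 / (b - a), 0.0)

def exponential_pdf(x, b):
    return np.where(x >= 0, np.exp(-x / b) / b, 0.0)

def rayleigh_pdf(x, sigma):
    return np.where(x >= 0, (x / sigma**2) * np.exp(-0.5 * x**2 / sigma**2), 0.0)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2)

def integrate(y, x):
    # trapezoidal rule: sum of (y[i] + y[i+1]) / 2 * dx
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x) / 2))

# Each area should be close to 1 (the normalization property).
x = np.linspace(-50.0, 50.0, 200_001)
areas = {
    "uniform": integrate(uniform_pdf(x, 0.0, 2.0), x),
    "exponential": integrate(exponential_pdf(x, 3.0), x),
    "rayleigh": integrate(rayleigh_pdf(x, 2.0), x),
    "gaussian": integrate(gaussian_pdf(x, -1.0, 2.0), x),
}
```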

Example 26. Let X be a random variable having the following PDF

fX(x) = B e^{−2x^2 − 3x − 1}

Find B.

Solution. Completing the square, −2x^2 − 3x − 1 = −2(x + 3/4)^2 + 1/8. Hence, we can write fX(x) as

fX(x) = B exp(−2(x + 3/4)^2 + 1/8) = B e^{1/8} exp(−(x + 3/4)^2/(1/2))

By identification with the Gaussian PDF, we get µ = −3/4 and 2σ^2 = 1/2, so σ^2 = 1/4. Normalization then requires B e^{1/8} = 1/√(2π(1/4)), hence B = √(2/π) e^{−1/8}.

Example 27. Assume X ∼ Rayleigh(7); find P(X > 7).

Solution.

P(X > 7) = ∫_{7}^{∞} (x/49) e^{−x^2/98} dx = [−e^{−x^2/98}]_{7}^{∞} = e^{−1/2} ≈ 0.6065
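The result of Example 27 can be confirmed by simulation; a minimal sketch assuming NumPy (the integral above gives exactly e^{−1/2}, independent of σ):

```python
import numpy as np

# Monte Carlo estimate of P(X > sigma) for X ~ Rayleigh(sigma).
rng = np.random.default_rng(0)
sigma = 7.0
samples = rng.rayleigh(scale=sigma, size=1_000_000)
estimate = float(np.mean(samples > sigma))
exact = float(np.exp(-0.5))
```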

Example 28. Show that the standard Gaussian PDF integrates to 1.

Copyright c© 2019, Dr. Mustafa El-Halabi 22

Page 29: CCE506: STOCHASTIC PROCESSES, DETECTION & ESTIMATIONmustafa-halabi.appspot.com/NOTESCCE506.pdf · de nition includes deterministic as well as non-deterministic signals. A deterministic

ONE RANDOM VARIABLE 23

Solution.

[ (1/√(2π)) ∫_{−∞}^{∞} e^{−x^2/2} dx ]^2 = (1/(2π)) ∫_{−∞}^{∞} e^{−x^2/2} dx ∫_{−∞}^{∞} e^{−y^2/2} dy

Let x = r cos θ and y = r sin θ and carry out the change of variables from Cartesian to polar coordinates; we obtain

(1/(2π)) ∫_{0}^{∞} ∫_{0}^{2π} r e^{−r^2/2} dθ dr = ∫_{0}^{∞} r e^{−r^2/2} dr = 1

Example 29. Prove that the exponential random variable satisfies the memoryless property:

P[X > t + h | X > t] = P[X > h]

Solution.

P[X > t + h | X > t] = P[{X > t + h} ∩ {X > t}]/P[X > t] = P[X > t + h]/P[X > t]

= (∫_{t+h}^{∞} (1/b) e^{−x/b} dx)/(∫_{t}^{∞} (1/b) e^{−x/b} dx) = e^{−(t+h)/b}/e^{−t/b} = e^{−h/b} = P[X > h]
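The memoryless property lends itself to a direct simulation check; a sketch assuming NumPy, with b, t, h chosen arbitrarily:

```python
import numpy as np

# Both probabilities below should approach e^{-h/b}.
rng = np.random.default_rng(1)
b, t, h = 2.0, 1.0, 0.5
x = rng.exponential(scale=b, size=2_000_000)
cond = float(np.mean(x[x > t] > t + h))   # estimate of P[X > t+h | X > t]
uncond = float(np.mean(x > h))            # estimate of P[X > h]
exact = float(np.exp(-h / b))
```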

Example 30. Let X be a Laplace random variable with the following PDF:

fX(x) = (1/(2b)) exp(−|x|/b)

Let Y = X^2 + 1. Find the PDF of Y.

Solution. Given that Y = X^2 + 1, in order to find the PDF of Y we start from the CDF. For y > 1,

FY(y) = P[Y ≤ y] = P[X^2 + 1 ≤ y] = P[X^2 ≤ y − 1] = P[−√(y−1) ≤ X ≤ √(y−1)]

= ∫_{−√(y−1)}^{√(y−1)} fX(x) dx = (1/(2b)) [ ∫_{−√(y−1)}^{0} e^{x/b} dx + ∫_{0}^{√(y−1)} e^{−x/b} dx ]

= 1 − e^{−√(y−1)/b}

Hence,

fY(y) = dFY(y)/dy = d/dy (1 − e^{−√(y−1)/b}) = (1/(2b√(y−1))) e^{−√(y−1)/b}, y > 1


2.4.1 Tail Probability

The PDF of the Gaussian random variable X is given by

fX(x) = (1/√(2πσ^2)) exp(−(x − µ)^2/(2σ^2))

This distribution has a bell-shaped curve, centered and symmetric about µ, whose width increases with σ. Probabilities of events involving a Gaussian random variable cannot be evaluated analytically, but rather numerically using tables for the Q(·), φ(·), erf(·), or erfc(·) functions, defined as

Q(x) = (1/√(2π)) ∫_{x}^{∞} exp(−t^2/2) dt (2.25)

φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−t^2/2) dt (2.26)

erf(x) = (2/√π) ∫_{0}^{x} exp(−t^2) dt (2.27)

erfc(x) = (2/√π) ∫_{x}^{∞} exp(−t^2) dt (2.28)

We have the following relationships between these functions: Q(x) = 1 − Q(−x), Q(x) = 1 − φ(x), erfc(x) = 1 − erf(x).

In electrical engineering it is customary to work with the Q(·) function, which is obviously a decreasing function of x . See

Fig. 2.4.

Figure 2.4: Tail probability for a standard Gaussian distribution.

Definition 32 (Tail Probability). The "tail probability" for a Gaussian random variable X ∼ N(µ,σ^2) is defined as:

P(X ≥ x1) = Q((x1 − µ)/σ) (2.29)

Example 31. A random variable X is normally distributed according to

fX(x) = (1/√(8π)) exp(−(x + 3)^2/8)

Evaluate P(|X − 2| > 1).

Solution. Here µ = −3 and σ = 2, so

P(|X − 2| > 1) = P(X < 1) + P(X > 3) = 1 − P(X ≥ 1) + P(X > 3) = 1 − Q(2) + Q(3)
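In code, Q(·) is conveniently obtained from the complementary error function: comparing (2.25) with (2.28) gives Q(x) = erfc(x/√2)/2. A small standard-library sketch redoing Example 31:

```python
from math import erfc, sqrt

def Q(x):
    # Q(x) = erfc(x / sqrt(2)) / 2, from definitions (2.25) and (2.28)
    return 0.5 * erfc(x / sqrt(2))

# Example 31: X ~ N(-3, 4), so P(|X - 2| > 1) = 1 - Q(2) + Q(3)
p = 1 - Q(2) + Q(3)
```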


Example 32. A binary message is transmitted as a signal s, which is either −1 or +1. The communication channel corrupts

the transmission with additive normal noise with µ = 0 and σ2 = 4.

The receiver concludes that the signal −1 (or +1) was transmitted if the value received is < 0 (or ≥ 0, respectively).

What is the probability of error?

Solution. An error occurs whenever −1 is transmitted and the noise N is at least 1, so that s + N = −1 + N ≥ 0, or whenever +1 is transmitted and the noise N is smaller than −1, so that s + N = 1 + N < 0. In the former case, the probability of error is

P(N ≥ 1) = Q(1/2)

In the latter case, the probability of error is the same by symmetry.

2.5 Statistical Parameters

We will start this section by defining important statistical parameters that help analyze a given random experiment and/or statistical data.

Definition 33 (Expected Value). The expected value (mean/average) of a random variable X is

E[X] = ∫_{−∞}^{∞} x fX(x) dx for a continuous random variable

E[X] = ∑_k xk pX(xk) for a discrete random variable (2.30)

The expected value of a discrete random variable is simply a weighted average of the values that the random variable can

take on, weighted by the probability mass of each value. The expected value of a continuous random variable can be viewed

as the centroid if we think of fX (x) as a mass distribution of an object along the x−axis.

Example 33. The mean of the Poisson PMF can be calculated as follows:

E[X] = ∑_{k=0}^{∞} k pX(k) = ∑_{k=0}^{∞} k e^{−λ} λ^k/k! = λ ∑_{k=1}^{∞} e^{−λ} λ^{k−1}/(k−1)! = λ ∑_{m=0}^{∞} e^{−λ} λ^m/m! = λ


Definition 34. The expected value of a function g(X) of a random variable X is

E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx for a continuous random variable

E[g(X)] = ∑_k g(xk) pX(xk) for a discrete random variable (2.31)

Theorem 5. E[aX + b] = aE[X] + b, where a and b are constants.

Proof.

E[aX + b] = ∫_{−∞}^{∞} (ax + b) fX(x) dx = a ∫_{−∞}^{∞} x fX(x) dx + b ∫_{−∞}^{∞} fX(x) dx = aE[X] + b

Thus, E[·] is a linear operator and can be interchanged with any other linear operation (like summation, integration, etc.).

Definition 35 (Moments). The nth moment of a random variable X is:

E[X^n] = ∫_{−∞}^{∞} x^n fX(x) dx for a continuous random variable

E[X^n] = ∑_k xk^n pX(xk) for a discrete random variable (2.32)

Example 34. Find the nth moment of X ∼ U[0,a].

Solution.

fX(x) = 1/a, 0 ≤ x < a ⇒ E[X^n] = ∫_{0}^{a} (x^n/a) dx = a^n/(n + 1)

Example 35. Let Y = g(X) = IC(X) be the indicator function for the event X ∈ C, where C is some interval or union of intervals in the real line:

g(X) = 1 if X ∈ C, and 0 if X ∉ C

The expected value of Y is E[Y] = ∫_{−∞}^{∞} g(x) fX(x) dx = ∫_C fX(x) dx = P[X ∈ C]. Thus, the expected value of the indicator of an event is equal to the probability of the event.

Definition 36 (Central Moments). The nth central moment of a random variable X is:

E[(X − µX)^n] = ∫_{−∞}^{∞} (x − µX)^n fX(x) dx for a continuous random variable

E[(X − µX)^n] = ∑_k (xk − µX)^n pX(xk) for a discrete random variable (2.33)

When n = 2, the 2nd central moment is called the "variance":

E[(X − µX)^2] = Var(X) = σX^2 = E[X^2] − (E[X])^2 (2.34)

When n = 3, the 3rd central moment is called the "skewness":

E[(X − µX)^3] (2.35)

When n = 4, the 4th central moment is called the "kurtosis":

E[(X − µX)^4] (2.36)


Note that the second central moment (the variance) is a measure of the randomness in the data. This concept will be used over and over in designing fundamental blocks of the digital communication system.

Example 36. Show that if Y = aX + b, then Var(Y) = a^2 Var(X).

Solution.

µY = aµX + b ⇒ Y − µY = (aX + b) − (aµX + b) = a(X − µX)

Var(Y) = σY^2 = E[(Y − µY)^2] = E[a^2 (X − µX)^2] = a^2 E[(X − µX)^2] = a^2 Var(X) = a^2 σX^2

Remark 2. The variance of a random variable, unlike the expected value, is not a linear operator. For instance, Var(X + Y) ≠ Var(X) + Var(Y), unless X and Y are statistically independent.

Remark 3. The variance is invariant to a constant shift c: Var(X + c) = Var(X). (Variance is a measure of randomness, and randomness is not affected by a constant shift!)

Example 37. Find the variance of the Poisson distribution.

Solution.

E[X^2] = ∑_{k=0}^{∞} k^2 e^{−λ} λ^k/k! = ∑_{k=1}^{∞} k e^{−λ} λ^k/(k−1)! = ∑_{k=1}^{∞} (k−1) e^{−λ} λ^{k−1} λ/(k−1)! + ∑_{k=1}^{∞} e^{−λ} λ^{k−1} λ/(k−1)!

= λ E[X] + λ ∑_{k=1}^{∞} e^{−λ} λ^{k−1}/(k−1)! = λ^2 + λ

Hence, Var(X) = λ^2 + λ − λ^2 = λ.
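Examples 33 and 37 together say that the Poisson mean and variance are both λ, which is easy to confirm by sampling (a NumPy sketch, not part of the notes):

```python
import numpy as np

# Sampling check: Poisson(lam) has mean lam and variance lam.
rng = np.random.default_rng(2)
lam = 4.0
k = rng.poisson(lam, size=1_000_000)
mean_est, var_est = float(k.mean()), float(k.var())
```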

Example 38. Find the expected value and the variance of the binomial random variable.

Solution.

E[X] = ∑_{k=1}^{n} k (n choose k) p^k (1−p)^{n−k} = ∑_{k=1}^{n} [n(n−1)!/((k−1)!(n−k)!)] p^k (1−p)^{n−k}

= np ∑_{k=1}^{n} (n−1 choose k−1) p^{k−1} (1−p)^{n−k} = np


E[X^2] = ∑_{k=0}^{n} k^2 (n choose k) p^k (1−p)^{n−k}

= ∑_{k=0}^{n} k(k−1) (n choose k) p^k (1−p)^{n−k} + ∑_{k=0}^{n} k (n choose k) p^k (1−p)^{n−k}

= ∑_{k=2}^{n} [n!/((k−2)!(n−k)!)] p^k (1−p)^{n−k} + np

= n(n−1)p^2 ∑_{k=2}^{n} (n−2 choose k−2) p^{k−2} (1−p)^{n−k} + np

= n(n−1)p^2 (p + 1 − p)^{n−2} + np = n(n−1)p^2 + np

Hence, Var(X) = E[X^2] − E[X]^2 = n(n−1)p^2 + np − (np)^2 = np(1−p).
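The binomial mean np and variance np(1−p) derived above can be checked by sampling (a NumPy sketch with arbitrarily chosen n and p):

```python
import numpy as np

# Sampling check of Example 38: Binomial(n, p) has mean np and
# variance np(1 - p).
rng = np.random.default_rng(3)
n, p = 20, 0.3
k = rng.binomial(n, p, size=1_000_000)
mean_est, var_est = float(k.mean()), float(k.var())
exact_mean, exact_var = n * p, n * p * (1 - p)
```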

Example 39. Consider a geometric PMF: pX(k) = (1−p)p^k, k = 0, 1, 2, .... We need to find the variance of X. Note that

p d/dp (p^k) = k p^k ⇒ (p d/dp)^n (p^k) = k^n p^k,

where (p d/dp)^n means applying the operator p d/dp n times. Hence

E[X^n] = ∑_{k=0}^{∞} k^n pX(k) = ∑_{k=0}^{∞} k^n (1−p) p^k = (1−p) (p d/dp)^n ∑_{k=0}^{∞} p^k = (1−p) (p d/dp)^n (1/(1−p))

For n = 1, E[X] = (1−p) p d/dp (1/(1−p)) = (1−p) p/(1−p)^2 = p/(1−p)

For n = 2, E[X^2] = (1−p) (p d/dp) [p/(1−p)^2] = (1−p) p(1+p)/(1−p)^3 = p(1+p)/(1−p)^2

Var(X) = E[X^2] − E[X]^2 = p(1+p)/(1−p)^2 − p^2/(1−p)^2 = p/(1−p)^2
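These moments can be checked by sampling, with one caveat on conventions (flagged here as an assumption about NumPy's parameterization): NumPy's geometric generator counts trials k ≥ 1 with success probability q, whereas here pX(k) = (1−p)p^k counts k ≥ 0 failures, so X corresponds to a NumPy geometric with q = 1−p, minus 1.

```python
import numpy as np

# Sampling check of Example 39 under the parameterization mapping
# X = Geometric(q = 1 - p) - 1.
rng = np.random.default_rng(4)
p = 0.4
x = rng.geometric(1 - p, size=1_000_000) - 1
mean_est, var_est = float(x.mean()), float(x.var())
exact_mean = p / (1 - p)        # 2/3
exact_var = p / (1 - p) ** 2    # 10/9
```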

Example 40. Find the expected value of the random variable X ∼ Rayleigh(σ).

Solution. Integrating by parts with u = x and dv = (x/σ^2) exp(−x^2/(2σ^2)) dx:

E[X] = ∫_{0}^{∞} (x^2/σ^2) exp(−x^2/(2σ^2)) dx = [−x exp(−x^2/(2σ^2))]_{0}^{∞} + ∫_{0}^{∞} exp(−x^2/(2σ^2)) dx

= (1/2) ∫_{−∞}^{∞} exp(−x^2/(2σ^2)) dx = (√(2πσ^2)/2) ∫_{−∞}^{∞} (1/√(2πσ^2)) exp(−x^2/(2σ^2)) dx = σ√(π/2)

Example 41. Find the expected value of X ∼ Gamma(α,λ).

Solution.

E[X] = (1/Γ(α)) ∫_{0}^{∞} λx e^{−λx} (λx)^{α−1} dx = (1/(λΓ(α))) ∫_{0}^{∞} e^{−λx} (λx)^α d(λx) = Γ(α + 1)/(λΓ(α)) = α/λ

Example 42. Show that the mean and the variance of the Gaussian random variable X ∼ N(µ,σ^2) are µ and σ^2, respectively.

Solution.

fX(x) = (1/√(2πσ^2)) exp(−(x − µ)^2/(2σ^2))

For the mean, substitute y = x − µ:

E[X] = ∫_{−∞}^{∞} (x/√(2πσ^2)) exp(−(x − µ)^2/(2σ^2)) dx = ∫_{−∞}^{∞} ((µ + y)/√(2πσ^2)) exp(−y^2/(2σ^2)) dy

= ∫_{−∞}^{∞} (µ/√(2πσ^2)) exp(−y^2/(2σ^2)) dy + ∫_{−∞}^{∞} (y/√(2πσ^2)) exp(−y^2/(2σ^2)) dy = µ + 0 = µ

where the second integral vanishes because its integrand is odd. For the variance, substitute y = x − µ, then t = y^2/(2σ^2):

Var(X) = ∫_{−∞}^{∞} ((x − µ)^2/√(2πσ^2)) exp(−(x − µ)^2/(2σ^2)) dx = ∫_{−∞}^{∞} (y^2/√(2πσ^2)) exp(−y^2/(2σ^2)) dy

= (2/√(2πσ^2)) ∫_{0}^{∞} y^2 exp(−y^2/(2σ^2)) dy = (2σ^2/√π) ∫_{0}^{∞} t^{3/2 − 1} e^{−t} dt = (2σ^2/√π) Γ(3/2) = (2σ^2/√π)(√π/2) = σ^2

Definition 37 (Quantiles). Let X be a random variable and q ∈ (0,1). Then the value x for which P(X < x) ≤ q and P(X > x) ≤ 1 − q is the q-quantile of the distribution of X. Moreover, the value x associated with q = 0.5 is called the median.

2.6 Conditioning a Random Variable on an Event

Definition 38 (Conditional PMF). The conditional PMF of a random variable X, conditioned on a particular event A with P(A) > 0, is defined by

pX|A(x) = P[X = x | A] = P[{X = x} ∩ A]/P(A) (2.37)


Example 43. Let X be the time required to transmit a message, where X is a uniform random variable with SX = {1, 2, ..., L}. Suppose that a source has already been transmitting a message for m time units; find the probability that the remaining transmission time is j time units.

Solution. We are given A = {X > m}, so for m + 1 ≤ m + j ≤ L:

pX(m + j | X > m) = P[X = m + j]/P[X > m] = (1/L)/((L − m)/L) = 1/(L − m), m + 1 ≤ m + j ≤ L

X is equally likely to be any of the remaining L − m possible values. As m increases, 1/(L − m) increases, implying that the end of the message transmission becomes increasingly likely.

Definition 39 (Conditional CDF). The conditional CDF of a random variable X, conditioned on a particular event A with P(A) > 0, is defined by

FX(x|A) = P[{X ≤ x} ∩ A]/P(A) (2.38)

The conditional PDF of X given A is then defined as

fX(x|A) = d/dx FX(x|A) = fX(x)/P(X ∈ A) for x ∈ A, and 0 for x ∉ A (2.39)

Example 44. A binary transmission system sends a "0" bit by transmitting a −v voltage signal, and a "1" by transmitting a +v. The received signal is corrupted by Gaussian noise and given by Y = X + N, where X is the transmitted signal and N ∼ N(0,σ^2) is a noise voltage with PDF fN(x). Assume that P(1) = p = 1 − P(0). Find the PDF of Y.

Solution. Let B0 be the event "0" is transmitted and B1 be the event "1" is transmitted; then B0 and B1 form a partition, and

FY(y) = FY(y|B0)P(B0) + FY(y|B1)P(B1) = P[Y ≤ y | B0](1 − p) + P[Y ≤ y | B1]p

= (1 − p)P[N ≤ y + v] + pP[N ≤ y − v] = (1 − p)FN(y + v) + pFN(y − v)

Taking the derivative of FY(y) with respect to y, we get

fY(y) = (1 − p)fN(y + v) + pfN(y − v) = (1 − p)(1/√(2πσ^2)) e^{−(y+v)^2/(2σ^2)} + p(1/√(2πσ^2)) e^{−(y−v)^2/(2σ^2)}

A conditional PMF or PDF can be thought of as an ordinary distribution over a new universe determined by the conditioning

event. In the same spirit, a conditional expectation is the same as an ordinary expectation, except that it refers to the new

universe, and all probabilities and distributions are replaced by their conditional counterparts.

Definition 40 (Conditional Expected Value). The conditional expected value of a random variable X conditioned on an event A is:

E[g(X)|A] = ∫_{−∞}^{∞} g(x) fX(x|A) dx for a continuous random variable (2.40)

E[g(X)|A] = ∑_k g(xk) pX|A(xk) for a discrete random variable (2.41)


Theorem 6 (The Total Expectation Theorem).

E[X] = ∑_{i=1}^{n} P(Ai) E[X|Ai] (2.42)

Proof. We write pX(x) = ∑_{i=1}^{n} P(Ai) pX|Ai(x), multiply both sides by x, and sum over x:

E[X] = ∑_x x pX(x) = ∑_x x ∑_{i=1}^{n} P(Ai) pX|Ai(x) = ∑_{i=1}^{n} P(Ai) ∑_x x pX|Ai(x) = ∑_{i=1}^{n} P(Ai) E[X|Ai]

Example 45. Messages transmitted by a computer in Boston through a data network are destined for New York with probability 0.5, for Chicago with probability 0.3, and for San Francisco with probability 0.2. The transit time X of a message is random. Its mean is 0.05 seconds if it is destined for New York, 0.1 seconds if it is destined for Chicago, and 0.3 seconds if it is destined for San Francisco. Then the expected value is easily calculated as: E[X] = 0.5×0.05 + 0.3×0.1 + 0.2×0.3 = 0.115 seconds.

Example 46. Let X ∼ N(0,1) and A = {X > 0}. Find E[X|A].

Solution. Since P[X ∈ A] = P[X > 0] = Q(0) = 1/2,

fX|A(x) = (fX(x)/P[X ∈ A]) U(x) = (fX(x)/Q(0)) U(x) = 2 fX(x) U(x)

E[X|A] = ∫_{−∞}^{∞} x fX|A(x) dx = ∫_{0}^{∞} x (2/√(2π)) exp(−x^2/2) dx = 2/√(2π) = √(2/π)

2.7 Transforms

In this section, we introduce the transform associated with a random variable. The transform provides us with an alternative representation of a probability law. It is often convenient for certain types of mathematical manipulations.

2.7.1 The Moment Generating Function (MGF)

The moment generating function (MGF) of X is the transform associated with a random variable X, defined as

MX(u) = E[e^{uX}] (2.43)

where u is a scalar parameter.

Theorem 7.

E[X^k] = MX^{(k)}(u)|_{u=0} (2.44)

Furthermore,

MX(u) = ∑_{k=0}^{∞} (1/k!) (MX^{(k)}(u)|_{u=0}) u^k (2.45)


Proof. MX′(u) = d/du E[e^{uX}] = E[d/du e^{uX}] = E[X e^{uX}]. Taking u = 0, we get MX′(0) = E[X]. Differentiating k times and setting u = 0, we get E[X^k] = MX^{(k)}(u)|_{u=0}. For the series expansion,

E[e^{uX}] = E[∑_{k=0}^{∞} (uX)^k/k!] = ∑_{k=0}^{∞} (u^k/k!) E[X^k] = ∑_{k=0}^{∞} (u^k/k!) MX^{(k)}(u)|_{u=0}

Example 47. Consider the Erlang distribution

fX(x) = (x^{n−1}/(n−1)!) e^{−x} U(x)

Find E[X^k] by application of the MGF.

Solution.

MX(u) = ∫_{0}^{∞} (x^{n−1}/(n−1)!) e^{−x} e^{ux} dx = 1/(1 − u)^n (integration by parts, for u < 1)

MX′(u) = n/(1 − u)^{n+1} ⇒ E[X] = MX′(u)|_{u=0} = n

MX″(u) = n(n+1)/(1 − u)^{n+2} ⇒ E[X^2] = MX″(u)|_{u=0} = n(n+1)

By further inspection, we get

E[X^k] = n(n+1)(n+2)···(n+k−1) = (n+k−1)!/(n−1)!
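The Erlang density above is a Gamma(n, 1) distribution, so the moments obtained from the MGF can be confirmed by sampling (a NumPy sketch, not part of the notes):

```python
import numpy as np

# Sampling check of Example 47: E[X] = n and E[X^2] = n(n + 1).
rng = np.random.default_rng(5)
n = 3
x = rng.gamma(shape=n, scale=1.0, size=1_000_000)
m1, m2 = float(x.mean()), float(np.mean(x**2))
```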

2.7.2 Characteristic Function

The MGF is not always finite. For example, if X is a Cauchy random variable, then it is easy to see that MX(u) = ∞ for all real u ≠ 0. In order to develop transform methods that always work for any random variable X, we introduce the characteristic function of X. The characteristic function of a random variable X is defined as

ΦX(ω) = E[e^{jωX}] (2.46)

If X is a continuous random variable with density f, then

ΦX(ω) = ∫_{−∞}^{∞} e^{jωx} f(x) dx (2.47)

which is the Fourier transform of f with ω replaced by −ω.

Theorem 8.

E[X^k] = (−j)^k ΦX^{(k)}(ω)|_{ω=0} (2.48)

and

ΦX(ω) = ∑_{k=0}^{∞} (1/k!) ΦX^{(k)}(ω)|_{ω=0} ω^k (2.49)

Proof.

ΦX′(ω) = E[d/dω e^{jωX}] = E[jX e^{jωX}] ⇒ ΦX′(0) = jE[X]

ΦX^{(k)}(ω) = E[d^k/dω^k e^{jωX}] = E[(jX)^k e^{jωX}] ⇒ ΦX^{(k)}(0) = j^k E[X^k]

Example 48. Let fX(x) = a e^{−ax} U(x) be the PDF of the random variable X. Using the characteristic function, find E[X^k].

Solution.

ΦX(ω) = a/(a − jω) = 1/(1 − jω/a) = ∑_{k=0}^{∞} (jω/a)^k = ∑_{k=0}^{∞} (j^k/a^k) ω^k

Matching term by term with (2.49), (1/k!) ΦX^{(k)}(ω)|_{ω=0} ω^k = (j^k/a^k) ω^k ⇒ ΦX^{(k)}(ω)|_{ω=0} = j^k k!/a^k

E[X^k] = (−j)^k ΦX^{(k)}(ω)|_{ω=0} = (−j)^k j^k k!/a^k = k!/a^k

Example 49. Consider the binomial distribution pX(k) = (n choose k) p^k (1−p)^{n−k}, k = 0, 1, 2, .... Using the characteristic function, find the second moment of X.

Solution.

ΦX(ω) = E[e^{jωX}] = ∑_k pX(k) e^{jωk} = ∑_{k=0}^{n} (n choose k) (p e^{jω})^k (1−p)^{n−k} = (1 − p + p e^{jω})^n

Expanding e^{jω} = 1 + jω − ω^2/2 + ..., we get

ΦX(ω) = [1 + p(jω − ω^2/2 + ...)]^n = (1 + θ)^n, where θ ≜ jpω − pω^2/2 + ...

= 1 + nθ + (n choose 2) θ^2 + ... = 1 + jnpω − [np/2 + (n choose 2) p^2] ω^2 + ...

Matching with the Taylor expansion (2.49): ΦX′(0) = jnp and (1/2) ΦX″(0) = −[np/2 + (n choose 2) p^2].

Thus, E[X] = (−j) ΦX′(0) = np, and E[X^2] = (−j)^2 ΦX″(0) = 2[np/2 + (n choose 2) p^2] = np + n(n−1)p^2.

Example 50. Let X ∼ N(0,σ^2). Find ΦX(ω) and E[X^{2k}].

Solution.

fX(x) = (1/√(2πσ^2)) exp(−x^2/(2σ^2))


ΦX(ω) = ∫_{−∞}^{∞} (1/√(2πσ^2)) exp(−x^2/(2σ^2)) e^{jωx} dx

= ∫_{−∞}^{∞} (1/√(2πσ^2)) exp(−(x^2 − 2jωσ^2 x)/(2σ^2)) dx

= ∫_{−∞}^{∞} (1/√(2πσ^2)) exp(−(x − jωσ^2)^2/(2σ^2)) exp(−ω^2σ^4/(2σ^2)) dx

= exp(−ω^2σ^2/2)

Expanding the exponential,

ΦX(ω) = ∑_{k=0}^{∞} (1/k!) (−ω^2σ^2/2)^k = ∑_{k=0}^{∞} ((−1)^k σ^{2k}/(k! 2^k)) ω^{2k}

Comparing with the Taylor expansion (2.49), the odd-order derivatives at ω = 0 vanish, and

(1/(2k)!) ΦX^{(2k)}(ω)|_{ω=0} = (1/k!) (−σ^2/2)^k ⇒ ΦX^{(2k)}(ω)|_{ω=0} = ((2k)!/k!) (−σ^2/2)^k

Hence,

E[X^{2k}] = (−j)^{2k} ΦX^{(2k)}(ω)|_{ω=0} = (−1)^k ((2k)!/k!) (−σ^2/2)^k = (σ^{2k}/2^k) ((2k)!/k!)
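The even-moment formula above gives σ^2 for 2k = 2 and 3σ^4 for 2k = 4, which can be confirmed by sampling (a NumPy sketch with an arbitrary σ):

```python
import numpy as np
from math import factorial

# Sampling check of Example 50: E[X^(2k)] = sigma^(2k) * (2k)! / (k! * 2^k).
def exact_even_moment(k, sigma):
    return sigma ** (2 * k) * factorial(2 * k) / (factorial(k) * 2**k)

rng = np.random.default_rng(6)
sigma = 1.5
x = rng.normal(0.0, sigma, size=2_000_000)
m2_est, m4_est = float(np.mean(x**2)), float(np.mean(x**4))
```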

2.8 Evaluating Tail Probabilities

In some cases, evaluating tail probabilities can be quite complicated. Therefore, some bounds can help provide good approximations.

2.8.1 Markov's Inequality

For a non-negative random variable X, the following bound is known as Markov's inequality:

P[X ≥ x0] ≤ µX/x0 (2.50)

Proof.

E[X] = ∫_{−∞}^{∞} x fX(x) dx = ∫_{0}^{∞} x fX(x) dx = ∫_{0}^{x0} x fX(x) dx + ∫_{x0}^{∞} x fX(x) dx

≥ ∫_{x0}^{∞} x fX(x) dx ≥ ∫_{x0}^{∞} x0 fX(x) dx = x0 ∫_{x0}^{∞} fX(x) dx = x0 P[X ≥ x0]

Example 51. Let X be the average life span of a person. Given that µX = 80, we can bound P[X ≥ 120] ≤ 80/120 ≈ 0.67.


2.8.2 Chebyshev's Inequality

For any random variable X, the following bound is known as Chebyshev's inequality:

P[|X − µX| ≥ x0] ≤ σX^2/x0^2 (2.51)

Proof.

|X − µX| ≥ x0 ⇔ |X − µX|^2 ≥ x0^2

Applying Markov's inequality to the non-negative random variable |X − µX|^2, we get

P[|X − µX|^2 ≥ x0^2] ≤ E[(X − µX)^2]/x0^2 = Var(X)/x0^2

Example 52. Let X be the average life span of a person. Given that µX = 80 and σX = 10, we can bound the probability of the event that the average life span of a person is greater than 120 or smaller than 40 in the following way:

P[(X > 120) ∪ (X < 40)] = P[|X − 80| ≥ 40] ≤ 100/1600 = 1/16 = 0.0625
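For a distribution whose tail is known exactly, the two bounds can be compared directly. A numerical sketch (the choice X ∼ Exp(b) with x0 = 3b is ours, for illustration): Markov gives µX/x0 = b/x0, Chebyshev applied to the event {X − µ ≥ x0 − µ} gives σ^2/(x0 − µ)^2, and both exceed the exact tail e^{−x0/b}.

```python
import numpy as np

# Markov vs. Chebyshev vs. exact tail for X ~ Exp(b) (mean b, variance b^2).
b = 2.0
x0 = 3 * b
exact = float(np.exp(-x0 / b))        # P[X >= x0] = e^{-3} (about 0.05)
markov = b / x0                       # Markov bound = 1/3
chebyshev = b**2 / (x0 - b) ** 2      # Chebyshev bound = 1/4
```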

2.8.3 Chernoff's Bound

For any random variable X, the following upper bound is known as Chernoff's bound:

P[X ≥ x0] ≤ min_{u≥0} [e^{−ux0} MX(u)] (2.52)

Proof. Using the fact that U(x − x0) ≤ e^{u(x−x0)} for all u ≥ 0,

P[X ≥ x0] = ∫_{x0}^{∞} fX(x) dx = ∫_{−∞}^{∞} fX(x) U(x − x0) dx

≤ ∫_{−∞}^{∞} fX(x) e^{u(x−x0)} dx = e^{−ux0} ∫_{−∞}^{∞} fX(x) e^{ux} dx = e^{−ux0} MX(u)

Minimizing over all non-negative u completes the proof.

(Figure: the unit step U(x − x0) is upper-bounded by the exponential e^{u(x−x0)} for every u ≥ 0.)


Example 53. Let X ∼ N(0,1), so P[X ≥ x0] = Q(x0). We have already derived MX(u) = e^{u^2/2}. Hence,

Q(x0) ≤ exp(u^2/2 − ux0)

To get a tight upper bound, we minimize exp(u^2/2 − ux0) over u:

d/du (u^2/2 − ux0) = 0 ⇒ u* = x0

Substituting back, we get the Chernoff bound: P[X ≥ x0] ≤ e^{−x0^2/2}.
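The bound of Example 53 can be checked against the exact Gaussian tail using the identity Q(x) = erfc(x/√2)/2 (a standard-library sketch, not part of the notes):

```python
from math import erfc, exp, sqrt

# Chernoff bound check: Q(x0) <= e^{-x0^2/2} for a few values of x0.
def Q(x):
    return 0.5 * erfc(x / sqrt(2))

for x0 in (1.0, 2.0, 3.0):
    assert Q(x0) <= exp(-x0**2 / 2)
```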

2.9 Transformation of a Random Variable

Let X have a PDF fX(x), and define Y = g(X) for some function g(·). What is the PDF of Y? Consider the event A = {y ≤ Y < y + dy} for a very small dy. The probability of A is

P(A) = P(y ≤ Y < y + dy) = fY(y) dy

Note that there might be several values of x that map to the same value of y. These are the roots of the equation y = g(x), which we refer to as x1, x2, ..., xN. Let X+ be the subset of roots at which g(x) has positive slope (in the figure, X+ = {x1, x3}) and X− be the subset of roots at which g(x) has negative slope (in the figure, X− = {x2, x4}). Then

A = ∪_{i: xi∈X+} {xi ≤ X ≤ xi + dxi} ∪ ∪_{i: xi∈X−} {xi + dxi < X ≤ xi}

Hence,

fY(y) dy = ∑_{xi∈X+} P[xi ≤ X ≤ xi + dxi] + ∑_{xi∈X−} P[xi + dxi < X ≤ xi]

= ∑_{xi∈X+} fX(xi) dxi + ∑_{xi∈X−} fX(xi)(−dxi) = ∑_{i} fX(xi) |dxi|


As a result,

fY(y) = ∑_{i} [fX(x)/|dy/dx|]|_{xi = g^{−1}(y)} (2.53)

Example 54. Consider Θ ∼ U[−π,π]. Define X = sin Θ. Find fX(x).

Solution. Solving sin θ = x for |x| < 1 gives the two roots θ = sin^{−1}(x) and θ = ±π − sin^{−1}(x). Since dx/dθ = cos θ, with cos(sin^{−1} x) = √(1 − x^2) and cos(±π − sin^{−1}(x)) = −√(1 − x^2), applying (2.53) with fΘ(θ) = 1/(2π) gives

fX(x) = (1/(2π)) (1/|cos θ|)|_{θ=sin^{−1} x} + (1/(2π)) (1/|cos θ|)|_{θ=±π−sin^{−1} x} = 1/(2π√(1 − x^2)) + 1/(2π√(1 − x^2)) = 1/(π√(1 − x^2)), |x| < 1
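The arcsine density of Example 54 can be checked by simulation: integrating 1/(π√(1 − x^2)) over |x| < 1/2 gives 2 sin^{−1}(1/2)/π = 1/3, which a sampled X = sin Θ should reproduce (a NumPy sketch, not part of the notes):

```python
import numpy as np

# Monte Carlo check: X = sin(Theta), Theta ~ U[-pi, pi],
# so P(|X| < 1/2) = 2*arcsin(1/2)/pi = 1/3.
rng = np.random.default_rng(7)
theta = rng.uniform(-np.pi, np.pi, size=1_000_000)
x = np.sin(theta)
estimate = float(np.mean(np.abs(x) < 0.5))
exact = float(2 * np.arcsin(0.5) / np.pi)
```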

Corollary 1. A linear transformation of a Gaussian random variable produces a Gaussian random variable.

Proof. Let X ∼ N(µ,σ^2) and define Y = aX + b. For Y = y, we get x = (y − b)/a, and dy/dx = a. Hence,

fY(y) = (1/|a|) (1/√(2πσ^2)) exp(−(x − µ)^2/(2σ^2))|_{x=(y−b)/a} = (1/√(2πσ^2a^2)) exp(−(y − (aµ + b))^2/(2a^2σ^2))

Thus, Y ∼ N(aµ + b, a^2σ^2).


CHAPTER 3

PAIRS OF RANDOM VARIABLES

Many random experiments involve several random variables. In some experiments a number of different quantities are

measured. For example, the voltage signal at several points in a circuit at some specific time may be of interest. Other

experiments involve the repeated measurement of a certain quantity such as the repeated measurement (sampling) of the

amplitude of an audio or video signal that varies with time. In this chapter, we extend the concepts already introduced to

two random variables.

3.1 Joint PMF, CDF and PDF of a Pair of Random Variables

3.1.1 Joint PMF of Two Random Variables

Consider two discrete random variables X and Y associated with the same experiment. The probabilities of the pairs of values that X and Y can take are captured by the joint PMF of X and Y, denoted pX,Y(x,y) and defined as

pX,Y(x,y) = P[{X = x} ∩ {Y = y}] ≜ P[X = x, Y = y] (3.1)

The marginal PMFs of X and of Y are obtained by summing the joint PMF over all possible values of Y and over all possible values of X, respectively:

pX(x) = ∑_{y∈Y} pX,Y(x,y) (3.2)

pY(y) = ∑_{x∈X} pX,Y(x,y) (3.3)

The same properties that hold for a PMF extend to a joint PMF:

1. 0 ≤ pX,Y(x,y) ≤ 1

2. ∑_x ∑_y pX,Y(x,y) = 1

3. P[(X,Y) ∈ A] = ∑_{(x,y)∈A} pX,Y(x,y)

Example 55. Consider the joint PMF pN,M(n,m) = ((m+n)!/(n!m!)) a^n b^m/(a + b + 1)^{n+m+1}, n, m = 0, 1, 2, .... Find the marginal PMF pN(n).

Solution.

pN(n) = ∑_{m=0}^{∞} ((n+m)!/(n!m!)) a^n b^m/(a + b + 1)^{n+m+1} = (a^n/(a + b + 1)^{n+1}) ∑_{m=0}^{∞} ((n+m)!/(n!m!)) (b/(a + b + 1))^m

Since

∑_{m=0}^{∞} ((n+m)!/(n!m!)) x^m = (1/(1 − x))^{n+1},

with x = b/(a + b + 1) we have 1 − x = (a + 1)/(a + b + 1), so

pN(n) = (a^n/(a + b + 1)^{n+1}) ((a + b + 1)/(a + 1))^{n+1} = (a/(a + 1))^n (1/(a + 1))

3.1.2 Joint CDF of Two Random Variables

The joint CDF of a pair of random variables (X,Y) is defined as

FX,Y(x,y) = P(X ≤ x, Y ≤ y) (3.4)

The following properties are an extension of the properties of the CDF of one random variable:

1. FX,Y(−∞,−∞) = FX,Y(−∞,y) = FX,Y(x,−∞) = 0

2. FX,Y(∞,∞) = 1

3. 0 ≤ FX,Y(x,y) ≤ 1

4. FX,Y(x,∞) = FX(x), FX,Y(∞,y) = FY(y): marginal CDFs

5. FX,Y(x,y) is monotonic, non-decreasing in each argument

6. P(x1 < X ≤ x2, y1 < Y ≤ y2) = FX,Y(x2,y2) − FX,Y(x1,y2) − FX,Y(x2,y1) + FX,Y(x1,y1)

Example 56. The joint CDF for (X,Y) is given by

FX,Y(x,y) = (1 − e^{−ax})(1 − e^{−by}) for x ≥ 0, y ≥ 0, and 0 elsewhere.

Find the marginal CDFs, P(X > x, Y > y), and P(1 < X ≤ 2, 2 < Y ≤ 5).

Solution.

FX(x) = lim_{y→∞} FX,Y(x,y) = 1 − e^{−ax}, x ≥ 0

FY(y) = lim_{x→∞} FX,Y(x,y) = 1 − e^{−by}, y ≥ 0

X and Y have exponential distributions with parameters 1/a and 1/b, respectively. For x, y ≥ 0,

P(X > x, Y > y) = 1 − P({X ≤ x} ∪ {Y ≤ y}) = 1 − P[X ≤ x] − P[Y ≤ y] + P[X ≤ x, Y ≤ y]

= 1 − (1 − e^{−ax}) − (1 − e^{−by}) + (1 − e^{−ax})(1 − e^{−by}) = e^{−ax−by}

P(1 < X ≤ 2, 2 < Y ≤ 5) = FX,Y(2,5) − FX,Y(2,2) − FX,Y(1,5) + FX,Y(1,2)


3.1.3 Joint PDF of Two Random Variables

The joint PDF of a pair of random variables (X,Y) is defined as

fX,Y(x,y) = lim_{ε→0, δ→0} P(x ≤ X ≤ x + ε, y ≤ Y ≤ y + δ)/(εδ) (3.5)

The following properties are an extension of the properties of the PDF of one random variable:

1. fX,Y(x,y) ≥ 0

2. fX,Y(x,y) = ∂^2 FX,Y(x,y)/∂x∂y

3. FX,Y(x,y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y(u,v) dv du

4. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fX,Y(x,y) dx dy = 1: normalization

5. ∫_{−∞}^{∞} fX,Y(x,y) dx = fY(y), ∫_{−∞}^{∞} fX,Y(x,y) dy = fX(x): marginal PDFs

6. P(x1 ≤ X ≤ x2, y1 ≤ Y ≤ y2) = ∫_{x1}^{x2} ∫_{y1}^{y2} fX,Y(x,y) dy dx

Example 57. Suppose (X,Y) are uniformly distributed over the unit circle X^2 + Y^2 ≤ 1; then

fX,Y(x,y) = c, x^2 + y^2 ≤ 1

To find the value of c, we use the normalization property:

∫∫_{x^2+y^2≤1} c dx dy = 1 ⇒ π·c = 1 ⇒ c = 1/π

The marginal distribution is

fX(x) = ∫_{−∞}^{∞} fX,Y(x,y) dy = (1/π) ∫_{−√(1−x^2)}^{√(1−x^2)} dy = (2/π)√(1 − x^2), −1 ≤ x ≤ 1

Example 58. Let fX,Y(x,y) = (1/(2π)) exp(−(x^2 + y^2)/2). Find P(X^2 + Y^2 ≤ 1).

Solution.

P(X^2 + Y^2 ≤ 1) = ∫∫_{x^2+y^2≤1} (1/(2π)) exp(−(x^2 + y^2)/2) dx dy

= ∫_{0}^{2π} ∫_{0}^{1} (r/(2π)) exp(−r^2/2) dr dθ

= ∫_{0}^{1} r exp(−r^2/2) dr = 1 − e^{−1/2}

where we have converted the integral to polar coordinates: dx dy = r dr dθ.
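Since the joint density of Example 58 is that of two independent standard Gaussians, the result can be checked by sampling (a NumPy sketch, not part of the notes):

```python
import numpy as np

# Monte Carlo check: P(X^2 + Y^2 <= 1) = 1 - e^{-1/2} for independent
# standard Gaussians X, Y.
rng = np.random.default_rng(8)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)
estimate = float(np.mean(x**2 + y**2 <= 1.0))
exact = float(1 - np.exp(-0.5))
```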

Example 59. Let f_{X,Y}(x,y) = c exp(−x − y/2) U(x)U(y). Find c and P(X > Y).


Solution.

∫_0^{∞} ∫_0^{∞} c exp(−x − y/2) dx dy = 1 ⇒ c ∫_0^{∞} e^{−x} dx ∫_0^{∞} e^{−y/2} dy = 1 ⇒ 2c = 1 ⇒ c = 1/2

P(X > Y) = ∫_0^{∞} ∫_y^{∞} (1/2) e^{−x} e^{−y/2} dx dy = ∫_0^{∞} (1/2) e^{−y/2} [∫_y^{∞} e^{−x} dx] dy = ∫_0^{∞} (1/2) e^{−3y/2} dy = 1/3
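The factored density means X and Y are independent exponentials with rates 1 and 1/2, so P(X > Y) = 1/3 can also be checked by simulation (a sketch, not part of the notes):

```python
import random

random.seed(0)
N = 200_000
# X ~ Exponential(rate 1), Y ~ Exponential(rate 1/2), independent:
# this matches f(x, y) = (1/2) e^{-x} e^{-y/2} on x, y >= 0
hits = sum(random.expovariate(1.0) > random.expovariate(0.5) for _ in range(N))
estimate = hits / N
print(estimate)  # should be close to 1/3
```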

3.2 Conditioning One Random Variable on Another

Definition 41. (Conditional PMF) Let X and Y be discrete random variables with joint PMF p_{X,Y}(x,y). The conditional PMF of X given Y is defined as

p_{X|Y}(x|y) = P(X = x | Y = y) = p_{X,Y}(x,y) / p_Y(y)   (3.6)

Note that p_{X|Y}(x|y) is a probability law: ∑_x p_{X|Y}(x|y) = 1.

Example 60. Consider a transmitter that is sending messages over a computer network. We define the random variable X as the travel time of a given message and the random variable Y as the length of the given message. We assume that the length of a message can take two possible values: y = 10² bytes with probability 5/6, and y = 10⁴ bytes with probability 1/6. We assume that the travel time X of the message depends on its length Y and on the congestion in the network at the time of transmission. In particular, the travel time is 10⁻⁴Y seconds with probability 1/2, 10⁻³Y seconds with probability 1/3, and 10⁻²Y seconds with probability 1/6. Find the (unconditional) PMF of the travel time of a message.

Solution.

p_Y(y) = 5/6 for y = 10², 1/6 for y = 10⁴

p_{X|Y}(x|10²) = 1/2 for x = 10⁻², 1/3 for x = 10⁻¹, 1/6 for x = 1

p_{X|Y}(x|10⁴) = 1/2 for x = 1, 1/3 for x = 10, 1/6 for x = 100

To find the PMF of X, we use the total probability formula p_X(x) = ∑_y p_Y(y) p_{X|Y}(x|y). We obtain p_X(10⁻²) = (5/6)(1/2) = 5/12, p_X(10⁻¹) = (5/6)(1/3) = 5/18, p_X(1) = (5/6)(1/6) + (1/6)(1/2) = 2/9, p_X(10) = (1/6)(1/3) = 1/18, p_X(100) = (1/6)(1/6) = 1/36.

In some problems, it is necessary to work with joint random variables that differ in type, that is, one is discrete and the other is continuous.

Example 61. The input X to a communication channel is +1 volt or −1 volt with equal probability. The output Y of the channel is the input plus a noise voltage N that is uniformly distributed between −2 volts and +2 volts. Find P(X = +1, Y ≤ 0).

Solution.

P[X = +1, Y ≤ y] = P[Y ≤ y | X = +1] P[X = +1]

where P[X = +1] = 1/2. When the input is X = +1, the output is uniformly distributed over the interval [−1, 3]; therefore,

P[Y ≤ y | X = +1] = (y + 1)/4, −1 ≤ y ≤ 3

Thus,

P[X = +1, Y ≤ 0] = P[Y ≤ 0 | X = +1] P[X = +1] = (1/4)(1/2) = 1/8
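Simulating the channel confirms the result (a quick sketch, not part of the notes):

```python
import random

random.seed(0)
N = 200_000
hits = 0
for _ in range(N):
    x = random.choice([+1, -1])        # equiprobable channel input
    y = x + random.uniform(-2, 2)      # output = input + uniform noise on [-2, 2]
    hits += (x == +1 and y <= 0)
estimate = hits / N
print(estimate)  # should be close to 1/8
```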


Definition 42. (Conditional PDF) Let X and Y be continuous random variables with joint PDF f_{X,Y}(x,y). The conditional PDF of X given Y is defined as

f_{X|Y}(x|y) = f_{X,Y}(x,y) / f_Y(y)   (3.7)

Note that ∫_{−∞}^{∞} f_{X|Y}(x|y) dx = 1.

Example 62. Let f_{X,Y}(x,y) = (1/(π√3)) exp(−(2/3)(x² − xy + y²)). Find f_{X|Y}(x|y) and f_{X|A}(x) where A = {Y > a}.

Solution. We first find the marginal PDF of Y by completing the square in x:

f_Y(y) = (1/(π√3)) e^{−(2/3)y²} ∫_{−∞}^{∞} exp(−(2/3)(x² − xy)) dx

= (1/(π√3)) e^{−(2/3)y²} e^{y²/6} ∫_{−∞}^{∞} exp(−(2/3)(x² − xy + y²/4)) dx

= (1/(π√3)) e^{−y²/2} ∫_{−∞}^{∞} exp(−(2/3)(x − y/2)²) dx = (1/√(2π)) e^{−y²/2} ⇒ Y ∼ N(0,1)

Hence,

f_{X|Y}(x|y) = f_{X,Y}(x,y)/f_Y(y) = √(2/(3π)) exp(−(2/3)(x − y/2)²) ⇒ X|Y ∼ N(y/2, 3/4)

For f_{X|A}(x), we integrate the joint PDF over the event A = {Y > a}:

∫_a^{∞} f_{X,Y}(x,y) dy = ∫_a^{∞} (1/(π√3)) exp(−(2/3)(x² − xy + y²)) dy

= (1/√(2π)) exp(−x²/2) ∫_a^{∞} √(2/(3π)) exp(−(2/3)(y − x/2)²) dy

= (1/√(2π)) exp(−x²/2) Q((2a − x)/√3)

where the last step recognizes the integrand as an N(x/2, 3/4) density. Also,

P(A) = ∫_a^{∞} f_Y(y) dy = ∫_a^{∞} (1/√(2π)) exp(−y²/2) dy = Q(a)

Hence,

f_{X|Y>a}(x) = (1/√(2π)) exp(−x²/2) Q((2a − x)/√3) / Q(a)

3.2.1 Independence

The notion of independence of two random variables is similar to the independence of a random variable from an event. Intuitively, independence means that the value of one random variable provides no information on the value of the other random variable. We say that two random variables X and Y are independent if

p_{X,Y}(x,y) = p_X(x) p_Y(y) if X and Y are discrete random variables   (3.8)

f_{X,Y}(x,y) = f_X(x) f_Y(y) if X and Y are continuous random variables   (3.9)

This is the same as requiring that the two events {X = x} and {Y = y} be independent for every x and every y. Independence is equivalent to the conditions

p_{X|Y}(x|y) = p_X(x) if X and Y are discrete random variables   (3.10)

f_{X|Y}(x|y) = f_X(x) if X and Y are continuous random variables   (3.11)


Independence is useful when dealing with large numbers of random variables whose joint behavior we want to estimate. For instance, whenever we perform repeated measurements of a quantity, such as the voltage of a device, we typically assume that the individual measurements are drawn from the same distribution and are independent of each other. That is, having measured the voltage a number of times does not affect the value of the next measurement. Such random variables are called independent and identically distributed, or i.i.d. for short.

Conversely, dependence can be vital in classification and regression problems. For instance, the traffic lights at an intersection are dependent on each other. This allows a driver to infer that when the lights are green in his direction, there will be no traffic crossing his path, i.e., the other lights will indeed be red. Likewise, whenever we are given a picture x of a digit, we hope that there is dependence between x and its label y.

There is also a notion of conditional independence of two random variables, given an event A with P(A) > 0. The conditioning event A defines a new universe, and all probabilities have to be replaced by their conditional counterparts. For example, X and Y are said to be conditionally independent, given a positive-probability event A, if

p_{X,Y|A}(x,y) = p_{X|A}(x) p_{Y|A}(y) if X and Y are discrete random variables   (3.12)

f_{X,Y|A}(x,y) = f_{X|A}(x) f_{Y|A}(y) if X and Y are continuous random variables   (3.13)

Example 63. A user in a certain cell can be served by one of two base stations, A or B, at a time. After requesting service, the waiting time for the user to be served by A is T_A and the waiting time to be served by B is T_B. T_A and T_B are i.i.d. exponential with parameter 1. What is the probability that the user will be served within 5 seconds of the request? Find the PDF of the waiting time until the user gets served.

Solution.

f_{T_A}(t) = f_{T_B}(t) = e^{−t} U(t) ⇒ F_{T_A}(t) = F_{T_B}(t) = ∫_{−∞}^{t} e^{−x} U(x) dx = 1 − e^{−t}, t ≥ 0

Let T = min{T_A, T_B} denote the time until the user gets served after sending a request.

P(T ≤ 5) = P(min{T_A, T_B} ≤ 5) = 1 − P(min{T_A, T_B} > 5)

= 1 − P({T_A > 5} ∩ {T_B > 5}) = 1 − P(T_A > 5) P(T_B > 5)

= 1 − (1 − F_{T_A}(5))(1 − F_{T_B}(5)) = 1 − e^{−10}

More generally, F_T(t) = P[T ≤ t] = 1 − e^{−2t} ⇒ f_T(t) = dF_T(t)/dt = 2e^{−2t} U(t)
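A quick simulation sketch (not part of the notes) confirms that the minimum of two independent unit-rate exponentials is exponential with rate 2, i.e., it has mean 1/2:

```python
import random

random.seed(0)
N = 200_000
# T = min(T_A, T_B) with T_A, T_B i.i.d. Exponential(rate 1)
samples = [min(random.expovariate(1.0), random.expovariate(1.0)) for _ in range(N)]
mean_T = sum(samples) / N
p_served_5s = sum(t <= 5 for t in samples) / N
print(mean_T, p_served_5s)  # mean close to 1/2; probability close to 1 - e^{-10}
```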

3.3 One Discrete and One Continuous Random Variable

Let A be a discrete random variable with PMF p_A(a). For each A = a with p_A(a) ≠ 0, let Y be a continuous random variable, i.e., F_{Y|A}(y|a) is continuous for all a. The conditional PMF of A given Y can be defined as the limit

p_{A|Y}(a|y) = lim_{Δy→0} P{A = a, y < Y ≤ y + Δy} / P{y < Y ≤ y + Δy}   (3.14)

= lim_{Δy→0} [p_A(a) f_{Y|A}(y|a) Δy] / [f_Y(y) Δy]   (3.15)

= f_{Y|A}(y|a) p_A(a) / f_Y(y)   (3.16)

This leads to a general form of Bayes' rule:

p_{A|Y}(a|y) = f_{Y|A}(y|a) p_A(a) / ∑_{a′} p_A(a′) f_{Y|A}(y|a′)   (3.17)


3.3.1 MAP and ML Estimation

Consider a system where the signal sent is modeled as the following discrete random variable:

X = x₀ with probability p, x₁ with probability 1 − p

and the observed (received) signal is modeled as the random variable Y, such that

Y | X = x ∼ f_{Y|X}(y|x), x ∈ {x₀, x₁}

We wish to estimate X by observing Y, i.e., to obtain an estimation rule X̂(Y) that minimizes the probability of error:

p_e ≜ Pr{X̂(Y) ≠ X} = Pr{X = x₀, X̂ = x₁} + Pr{X = x₁, X̂ = x₀} = Pr{X = x₀} Pr{X̂ = x₁ | X = x₀} + Pr{X = x₁} Pr{X̂ = x₀ | X = x₁}

We define the "maximum a posteriori probability" (MAP) estimator as

X̂(y) = x₀ if p_{X|Y}(x₀|y) > p_{X|Y}(x₁|y); x₁ otherwise   (3.18)

The MAP estimation rule minimizes p_e, since

min_{X̂} p_e = min_{X̂} (1 − Pr{X̂(Y) = X}) = 1 − max_{X̂} Pr{X̂(Y) = X}   (3.19)

= 1 − max_{X̂} ∫_{−∞}^{∞} f_Y(y) Pr{X = X̂(y) | Y = y} dy   (3.20)

= 1 − ∫_{−∞}^{∞} f_Y(y) max_{X̂(y)} Pr{X = X̂(y) | Y = y} dy   (3.21)

and the probability of error is minimized if we pick the largest p_{X|Y}(X̂(y)|y) for every y, which is precisely the MAP decoder. If both X and Y are continuous, the MAP estimation rule reduces to choosing the largest f_{X|Y}(X̂(y)|y) for every y. Alternatively, we can write the MAP estimation rule as

X̂(y)_MAP = arg max_{X̂(y)} f_{X|Y}(X̂(y)|y), ∀y   (3.22)

If p = 1/2, i.e., equally likely signals, then using Bayes' rule the MAP decoder reduces to the "maximum likelihood" (ML) estimator

X̂(y)_ML = arg max_{X̂(y)} f_{Y|X}(y|x), ∀y   (3.23)
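To make this concrete, here is a sketch (my own illustration, not an example from the notes) of MAP detection for x₀ = +1 and x₁ = −1 observed in additive Gaussian noise; the prior p and noise level σ are assumed values. With a non-uniform prior, the MAP threshold shifts away from the ML threshold at zero, and the MAP rule should not do worse:

```python
import math
import random

random.seed(0)
p, sigma = 0.7, 1.0   # assumed prior P(X = +1) and noise standard deviation
N = 100_000

def map_detect(y):
    """Pick the x maximizing p_X(x) f_{Y|X}(y|x) for Gaussian noise."""
    post_plus = p * math.exp(-(y - 1) ** 2 / (2 * sigma**2))
    post_minus = (1 - p) * math.exp(-(y + 1) ** 2 / (2 * sigma**2))
    return +1 if post_plus > post_minus else -1

errors_map = errors_ml = 0
for _ in range(N):
    x = +1 if random.random() < p else -1
    y = x + random.gauss(0, sigma)
    errors_map += (map_detect(y) != x)
    errors_ml += ((+1 if y > 0 else -1) != x)   # ML: threshold at 0
print(errors_map / N, errors_ml / N)  # MAP error rate at most the ML error rate
```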

3.4 Application: Classification

A credit is an amount of money loaned by a financial institution, for example, a bank, to be paid back with interest, generally in installments. It is important for the bank to be able to predict in advance the risk associated with a loan, which is the probability that the customer will default and not pay the whole amount back. This is both to make sure that the bank will make a profit and also to not burden a customer with a loan over his or her financial capacity. In credit scoring, the bank calculates the risk given the amount of credit and the information about the customer. The information about the customer includes the data we have access to that is relevant in calculating his or her financial capacity, namely, income, savings, collaterals, profession, age, past financial history, and so forth. The bank has a record of past loans containing such customer data and whether the loan was paid back or not. From this data of particular applications, the aim is to infer a general rule coding the association between a customer's attributes and his or her risk. That is, the machine learning system fits a model to the past data to be able to calculate the risk for a new application and then decides to accept or refuse it accordingly.


This is an example of a “classification” problem where there are two classes: low-risk and high-risk customers. The

information about a customer makes up the input to the classifier whose task is to assign the input to one of the two classes.

After training with the past data, a classification rule learned may be of the form

IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

for suitable values of θ₁ and θ₂. See Figure 3.1. This is an example of a discriminant; it is a function that separates the examples of different classes.


Figure 3.1: Example of a training dataset where each circle corresponds to one data instance with input values on the corresponding axes and its sign indicates the class. For simplicity, only two customer attributes, income and savings, are taken as input, and the two classes are low-risk ("+") and high-risk ("−"). An example discriminant that separates the two types of examples is also shown.

Having a rule like this, the main application is prediction: once we have a rule that fits the past data, and if the future is similar to the past, then we can make correct predictions for novel instances. Given a new application with a certain income and savings, we can easily decide whether it is low-risk or high-risk. In some cases, instead of making a 0/1 (low-risk/high-risk) decision, we may want to calculate a probability, namely P(Y|X), where X denotes the customer attributes and Y is 0 or 1, respectively, for low-risk and high-risk. From this perspective, we can see classification as learning an association from X to Y. Then for a given X = x, if we have P(Y = 1|X = x) = 0.8, we say that the customer has an 80 percent probability of being high-risk, or equivalently a 20 percent probability of being low-risk. We then decide whether to accept or refuse the loan depending on the possible gain and loss.

Note that classification is what we call a supervised learning problem, where there is an input X, an output Y, and the task is to learn the mapping from the input to the output. The approach in machine learning is that we assume a model defined up to a set of parameters:

y = g(x|θ)

where g(·) is the model and θ are its parameters. Y is a class code (e.g., 0/1), and g(·) is the discriminant function separating instances of different classes. The machine learning program optimizes the parameters θ such that the approximation error is minimized, that is, our estimates are as close as possible to the correct values given in the training set. In classification, Bayes' rule is used to calculate the probabilities of the classes.

Analyzing data pertaining to customers who paid back their loans and others who did not, we would like to learn the class "high-risk customer" so that in the future, when there is a new application for a loan, we can check whether that person fits the class description or not and thus accept or reject the application. Using our knowledge of the application, let us say that we decide there are two pieces of observable information. We observe them because we have reason to believe that they give us an idea about the credibility of a customer. Let us say, for example, we observe the customer's yearly income and savings, which we represent by two random variables X₁ and X₂. With what we can observe, the credibility of a customer is denoted by a Bernoulli random variable C conditioned on the observables X = [X₁, X₂]^T, where C = 1 indicates


a high-risk customer and C = 0 indicates a low-risk customer. Thus if we know P(C|X₁,X₂), when a new application arrives with X₁ = x₁ and X₂ = x₂, we can choose

C = 1 if P(C = 1|x₁,x₂) > P(C = 0|x₁,x₂); C = 0 otherwise   (3.24)

The probability of error is

p_e = 1 − max{P(C = 1|x₁,x₂), P(C = 0|x₁,x₂)}   (3.25)

Let us denote by x = [x₁ x₂]^T the vector of observed variables. The problem then is to be able to calculate P(C|x). Using Bayes' rule, it can be written as

P(C|x) = p(x|C) P(C) / p(x)

• Prior Probability. P(C = 1) is called the prior probability that C takes the value 1, which in our example corresponds to the probability that a customer is high-risk, regardless of the x value. It is called the prior probability because it is the knowledge we have as to the value of C before looking at the observables x, satisfying

P(C = 0) + P(C = 1) = 1

• Class Likelihood. p(x|C) is called the class likelihood and is the conditional probability that an event belonging to C has the associated observation value x. In our case, p(x₁,x₂|C = 1) is the probability that a high-risk customer has X₁ = x₁ and X₂ = x₂. It is what the data tells us regarding the class.

• Evidence. p(x), the evidence, is the marginal probability that an observation x is seen, regardless of whether it is a positive or negative example:

p(x) = ∑_C p(x,C) = p(x|C = 1) P(C = 1) + p(x|C = 0) P(C = 0)

• Posterior Probability. Combining the prior and what the data tells us using Bayes' rule, we calculate the posterior probability of the concept, P(C|x), after having seen the observation x:

posterior = (prior × likelihood) / evidence

Because of normalization by the evidence, the posteriors sum up to 1:

P(C = 0|x) + P(C = 1|x) = 1

Once we have the posteriors, we decide by using Eq. (3.24).

In the general case, we have K mutually exclusive and exhaustive classes Cᵢ, i = 1, ..., K; for example, in optical digit recognition, the input is a bitmap image and there are ten classes. The posterior probability of class Cᵢ can be calculated as

P(Cᵢ|x) = p(x|Cᵢ) P(Cᵢ) / p(x) = p(x|Cᵢ) P(Cᵢ) / ∑_{k=1}^{K} p(x|C_k) P(C_k)

and, for minimum error, the Bayes classifier chooses the class with the highest posterior probability; that is, we choose Cᵢ if P(Cᵢ|x) = max_k P(C_k|x).
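The posterior computation can be sketched in a few lines. All the numbers below (class priors and the Gaussian class likelihoods for a single "income" feature) are invented for illustration; they do not come from any dataset in the notes:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density, used here as a class likelihood p(x|C)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Assumed model: income of low-risk (C = 0) vs high-risk (C = 1) customers
priors = {0: 0.8, 1: 0.2}                        # P(C = 0), P(C = 1)
likelihood = {0: (60.0, 15.0), 1: (30.0, 10.0)}  # (mean, std) of p(x|C)

def posterior(x):
    """P(C|x) = p(x|C) P(C) / evidence, for both classes."""
    joint = {c: priors[c] * gauss_pdf(x, *likelihood[c]) for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

post = posterior(35.0)
decision = max(post, key=post.get)   # Bayes classifier: highest posterior
print(post, decision)
```

Note that the evidence normalization guarantees the two posteriors sum to 1, as stated above.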


3.5 Statistical Parameters

The expected value of X identifies the center of mass of the distribution of X. The variance, defined as the expected value of (X − µ_X)², provides a measure of the spread of the distribution. In the case of two random variables, we are interested in how X and Y vary together. In particular, we are interested in whether the variations of X and Y are correlated.

Definition 43. (Expected Value) The problem of finding the expected value of a function of two or more random variables is similar to that of finding the expected value of a function of a single random variable:

E[g(X,Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x,y) f_{X,Y}(x,y) dx dy if X and Y are continuous random variables; ∑_m ∑_n g(x_m,y_n) p_{X,Y}(x_m,y_n) if X and Y are discrete random variables   (3.26)

Definition 44. (Joint Moments) The joint moments of two random variables X and Y summarize information about their joint behavior. The (j,k)-th joint moment of X and Y is defined as

E[X^j Y^k] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x^j y^k f_{X,Y}(x,y) dx dy if X and Y are continuous random variables; ∑_m ∑_n x_m^j y_n^k p_{X,Y}(x_m,y_n) if X and Y are discrete random variables   (3.27)

If j = 0, we obtain the moments of Y, and if k = 0, we obtain the moments of X. In electrical engineering, it is customary to call the j = k = 1 moment, E[XY], the correlation of X and Y. If E[XY] = 0, then X and Y are said to be orthogonal.

Definition 45. (Central Moments) The central moment of two random variables X and Y is defined as the joint moment of the centered random variables X − µ_X and Y − µ_Y:

E[(X − µ_X)^j (Y − µ_Y)^k] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µ_X)^j (y − µ_Y)^k f_{X,Y}(x,y) dx dy if X and Y are continuous r.v.; ∑_m ∑_n (x_m − µ_X)^j (y_n − µ_Y)^k p_{X,Y}(x_m,y_n) if X and Y are discrete r.v.   (3.28)

Note that if j = 2, k = 0, we get Var(X), and if j = 0, k = 2, we get Var(Y). The covariance of X and Y is defined as the j = k = 1 central moment:

Cov(X,Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − E[X]E[Y]   (3.29)

If Cov(X,Y) = 0, we say that X and Y are uncorrelated. The covariance between two random variables X and Y measures the degree to which X and Y are (linearly) related.

Let us see how the covariance measures the correlation between X and Y. The covariance measures the deviations from µ_X and µ_Y. If a positive value of (X − µ_X) tends to be accompanied by a positive value of (Y − µ_Y), and a negative (X − µ_X) tends to be accompanied by a negative (Y − µ_Y), then (X − µ_X)(Y − µ_Y) will tend to be positive, and its expected value, Cov(X,Y), will be positive. This is the case for the scattergram in Figure 3.2(d), where the observed points tend to cluster along a line of positive slope. On the other hand, if (X − µ_X) and (Y − µ_Y) tend to have opposite signs, then Cov(X,Y) will be negative. Finally, if (X − µ_X) and (Y − µ_Y) sometimes have the same sign and sometimes have opposite signs, then Cov(X,Y) will be close to zero. The three scattergrams in Figure 3.2(a), (b), and (c) fall into this category.

Definition 46. (Correlation Coefficient) Covariances can take any value between −∞ and +∞. Sometimes it is more convenient to work with a normalized measure with a finite range. The (Pearson) correlation coefficient between X and Y is defined as

ρ_{X,Y} = Cov(X,Y) / (σ_X σ_Y)   (3.30)

Theorem 9.

|ρ_{X,Y}| ≤ 1   (3.31)

with equality iff Y = aX + b, with a > 0 if ρ_{X,Y} = 1 and a < 0 if ρ_{X,Y} = −1.


Figure 3.2: A scattergram for 200 observations of four different pairs of random variables.

Proof. For any real a,

E[(X + aY)²] = E[X²] + a²E[Y²] + 2aE[XY] ≥ 0

Minimizing the left-hand side with respect to a:

(d/da) E[(X + aY)²] = 0 ⇒ 2E[XY] + 2aE[Y²] = 0 ⇒ a* = −E[XY]/E[Y²]

Replacing a* in the inequality above, we get

E[X²] − 2E[XY]²/E[Y²] + E[XY]²/E[Y²] ≥ 0 ⇒ E[XY]² ≤ E[X²]E[Y²]

Replacing X with X − µ_X and Y with Y − µ_Y, we get

Cov(X,Y)² ≤ Var(X)Var(Y) ⇒ ρ²_{X,Y} ≤ 1 ⇒ |ρ_{X,Y}| ≤ 1

Concerning the equality, we first prove sufficiency. Let Y = aX + b ⇒ XY = aX² + bX ⇒ E[XY] = aE[X²] + bµ_X. Since E[Y] = aµ_X + b,

Cov(X,Y) = aE[X²] + bE[X] − E[X]E[Y] = aE[X²] + bE[X] − aE[X]² − bE[X] = aVar(X)

Since Var(Y) = a²Var(X), we have ρ_{X,Y} = aVar(X)/(|a|Var(X)) = ±1, with the sign of a.


For necessity, let |ρ_{X,Y}| = 1 ⇒ ρ²_{X,Y} = 1 ⇒ Cov²(X,Y) = Var(X)Var(Y). Then

E[((Cov(X,Y)/Var(X))(X − µ_X) − (Y − µ_Y))²]

= E[Cov²(X,Y)(X − µ_X)²/Var²(X) + (Y − µ_Y)² − 2Cov(X,Y)(X − µ_X)(Y − µ_Y)/Var(X)]

= Cov²(X,Y)Var(X)/Var²(X) + Var(Y) − 2Cov²(X,Y)/Var(X)

= Var(Y) + Var(Y) − 2Var(Y)

= 0

Hence,

∫_{−∞}^{∞} ∫_{−∞}^{∞} [(Cov(X,Y)/Var(X))(x − µ_X) − (y − µ_Y)]² f_{X,Y}(x,y) dx dy = 0

Since f_{X,Y}(x,y) is never negative,

(Cov(X,Y)/Var(X))(X − µ_X) − (Y − µ_Y) = 0

Letting A = Cov(X,Y)/Var(X) and B = Aµ_X − µ_Y, we get AX − Y = B, i.e., X and Y are linearly related.

Remark 4. Intuitively, one might expect the correlation coefficient to equal the slope of the regression line, i.e., the coefficient a in the expression Y = aX + b. However, as shown above, the regression coefficient is in fact given by a = Cov(X,Y)/Var(X).

Theorem 10. If X and Y are independent random variables, then X and Y are uncorrelated, i.e., Cov(X,Y) = 0. The converse is not necessarily true.

A special case where uncorrelatedness implies independence is that of jointly Gaussian random variables, as we will see later.

Example 64. In this example, we present a counterexample where X and Y are uncorrelated but not independent. Consider a pair of random variables X and Y that are uniformly distributed over the unit circle, so that f_{X,Y}(x,y) = 1/π, x² + y² ≤ 1.

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x,y) dy = ∫_{−√(1−x²)}^{√(1−x²)} (1/π) dy = (2/π)√(1 − x²), |x| < 1

By symmetry, f_Y(y) = (2/π)√(1 − y²), |y| < 1, so

f_X(x) f_Y(y) = (2/π)√(1 − x²) × (2/π)√(1 − y²) = (4/π²)√((1 − x²)(1 − y²)) ≠ f_{X,Y}(x,y)

Hence, X and Y are dependent. However,

E[XY] = ∫∫_{x²+y²≤1} (xy/π) dx dy = ∫_{−1}^{1} (x/π) [∫_{−√(1−x²)}^{√(1−x²)} y dy] dx = 0

Since E[X] = E[Y] = 0 (integrating an odd function over a symmetric interval), X and Y are uncorrelated.

Some striking examples of the fact that uncorrelated does not imply independent are shown in Figure 3.3. It shows several data sets where there is clear dependence between X and Y, and yet the correlation coefficient is 0. A more general measure of dependence between random variables is mutual information, which is zero only if the variables truly are independent.
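The unit-disk counterexample of Example 64 can be seen numerically (a sketch, not part of the notes): the sample covariance of X and Y is near zero, while a nonlinear statistic such as Cov(X², Y²) is clearly negative, exposing the dependence.

```python
import random

random.seed(0)
# Rejection-sample points uniformly over the unit disk
pts = []
while len(pts) < 100_000:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y <= 1:
        pts.append((x, y))

def mean(v):
    return sum(v) / len(v)

xs = [p[0] for p in pts]
ys = [p[1] for p in pts]
cov_xy = mean([x * y for x, y in pts]) - mean(xs) * mean(ys)
# Dependence shows up in nonlinear statistics: Cov(X^2, Y^2) = -1/48 on the disk
cov_x2y2 = mean([x * x * y * y for x, y in pts]) - mean([x * x for x in xs]) * mean([y * y for y in ys])
print(cov_xy, cov_x2y2)  # first near 0, second clearly negative
```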


Figure 3.3: Several sets of (x,y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.

Proposition 2.

Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X,Y)   (3.32)

Proof.

Var(aX + bY) = E[(aX + bY − aµ_X − bµ_Y)²]

= E[a²(X − µ_X)² + b²(Y − µ_Y)² + 2ab(X − µ_X)(Y − µ_Y)]

= a²Var(X) + b²Var(Y) + 2abCov(X,Y)

Note that if X and Y are uncorrelated, then Var(aX + bY) = a²Var(X) + b²Var(Y).

Corollary 2. If X1,X2, ... ,Xn are independent random variables, then

Var(X1 +X2 + · · ·+Xn) = Var(X1)+Var(X2)+ · · ·+Var(Xn) (3.33)
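A quick numerical sketch of the corollary (not part of the notes): the sum of n i.i.d. Uniform(0,1) variables, each with variance 1/12, should have variance n/12.

```python
import random

random.seed(0)
N, n = 100_000, 5
# n i.i.d. Uniform(0,1) variables: each has variance 1/12, so the sum has variance n/12
sums = [sum(random.random() for _ in range(n)) for _ in range(N)]
mean_s = sum(sums) / N
var_s = sum((s - mean_s) ** 2 for s in sums) / N
print(var_s, n / 12)  # the two values should be close
```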

Example 65. Given the following joint PDF:

f_{X,Y}(x,y) = (1/(π√3)) exp(−(2/3)(x² − xy + y²))

Find E[XY].

Solution.

E[XY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (xy/(π√3)) exp(−(2/3)(x² − xy + y²)) dx dy

= ∫_{−∞}^{∞} (x/√(2π)) e^{−x²/2} [∫_{−∞}^{∞} y √(2/(3π)) exp(−(2/3)(y − x/2)²) dy] dx

where the inner integral is the mean of an N(x/2, 3/4) density, i.e., x/2. Hence

E[XY] = (1/2) ∫_{−∞}^{∞} (x²/√(2π)) e^{−x²/2} dx = 1/2


Example 66. Let X ∼ N(1,4) and Y ∼ N(−2,1) be correlated with ρ_{XY} = 1/2. We define Z = X − 2Y and W = X + Y + 1. Find Cov(X,Y), E[Z], E[W], Var(Z), Var(W), Cov(W,Z), and ρ_{WZ}.

Solution.

Cov(X,Y) = ρ_{XY} σ_X σ_Y = (1/2)(2)(1) = 1

E[Z] = E[X − 2Y] = E[X] − 2E[Y] = 1 − 2(−2) = 5

E[W] = E[X + Y + 1] = E[X] + E[Y] + 1 = 1 − 2 + 1 = 0

Var(Z) = Var(X − 2Y) = Var(X) + 4Var(Y) − 4Cov(X,Y) = 4 + 4(1) − 4(1) = 4

Var(W) = Var(X + Y + 1) = Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y) = 4 + 1 + 2(1) = 7

E[ZW] = E[(X − 2Y)(X + Y + 1)] = E[X²] − E[XY] + E[X] − 2E[Y²] − 2E[Y]

with E[X²] = Var(X) + E[X]² = 5, E[XY] = Cov(X,Y) + E[X]E[Y] = 1 − 2 = −1, and E[Y²] = Var(Y) + E[Y]² = 5. Hence

E[ZW] = 5 − (−1) + 1 − 2(5) − 2(−2) = 1

ρ_{Z,W} = Cov(Z,W)/(σ_Z σ_W) = (E[ZW] − E[Z]E[W])/(σ_Z σ_W) = (1 − 0)/((2)(√7)) = 1/(2√7)
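These moments can be verified by simulation (a sketch, not part of the notes): correlated Gaussians with the stated means, variances, and ρ = 1/2 are generated via a Cholesky-style construction, and the empirical moments of Z and W are compared with the computed values.

```python
import math
import random

random.seed(0)
N = 200_000
rho, sx, sy = 0.5, 2.0, 1.0
zs, ws = [], []
for _ in range(N):
    g1, g2 = random.gauss(0, 1), random.gauss(0, 1)
    x = 1 + sx * g1                                        # X ~ N(1, 4)
    y = -2 + sy * (rho * g1 + math.sqrt(1 - rho**2) * g2)  # Y ~ N(-2, 1), corr = 1/2
    zs.append(x - 2 * y)
    ws.append(x + y + 1)

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

cov_zw = sum(z * w for z, w in zip(zs, ws)) / N - (sum(zs) / N) * (sum(ws) / N)
print(var(zs), var(ws), cov_zw)  # close to 4, 7, and 1
```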

Theorem 11. Let X and Y be two independent random variables and consider forming two new random variables U = g(X) and V = h(Y). Then, U and V are independent.

Proof.

E[UV] = E[g(X)h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f_{X,Y}(x,y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f_X(x) f_Y(y) dx dy

= ∫_{−∞}^{∞} g(x) f_X(x) dx ∫_{−∞}^{∞} h(y) f_Y(y) dy = E[g(X)]E[h(Y)] = E[U]E[V]

Since this factorization holds for arbitrary functions g and h, in particular for indicator functions of the events {U ≤ u} and {V ≤ v}, the joint CDF of U and V factors, which establishes independence.


3.5.1 Conditional Expectation and Variance

In this section, we revisit the conditional expectation of a random variable X given another random variable Y , and view it

as a random variable determined by Y . We derive a reformulation of the total expectation theorem, called the law of iterated

expectations. We also obtain a new formula, the law of total variance, that relates conditional and unconditional variances.

We introduce a random variable, denoted E [X |Y ], that takes the value E [X |Y = y ] when Y takes the value y . Since

E [X |Y = y ] is a function of y , E [X |Y ] is a function of Y , and its distribution is determined by the distribution of Y . The

properties of E [X |Y ] will be important in the context of estimation and statistical inference.

Example 67. We are given a biased coin and we are told that, because of manufacturing defects, the probability of heads, denoted by Y, is itself random, with a known distribution over the interval [0,1]. We toss the coin a fixed number n of times, and we let X be the number of heads obtained. Then, for any y ∈ [0,1], we have E[X|Y = y] = ny, so E[X|Y] is the random variable nY.

Since E [X |Y ] is a random variable, it has an expectation E [E [X |Y ]] of its own, which can be calculated using the expected

value rule

E[E[X|Y]] = ∑_y E[X|Y = y] p_Y(y) if Y is discrete, and E[E[X|Y]] = ∫_{−∞}^{∞} E[X|Y = y] f_Y(y) dy if Y is continuous.

By the total expectation theorem, both expressions on the right-hand side are equal to E[X]. This brings us to the following conclusion, which is valid for every type of random variable Y (discrete, continuous, or mixed), as long as X has a well-defined and finite expectation E[X].

Theorem 12. (Law of Iterated Expectations)

E [E [X |Y ]] = E [X ] (3.34)

Example 68. Suppose that Y, the probability of heads for our coin, is uniformly distributed over the interval [0,1]. Since E[X|Y] = nY and E[Y] = 1/2, by the law of iterated expectations we have

E[X] = E[E[X|Y]] = E[nY] = nE[Y] = n/2
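The law of iterated expectations in this example can be checked with a short Monte Carlo sketch (the sample size and seed below are arbitrary illustration choices): draw a random bias Y ~ Uniform(0,1), toss the coin n times with that bias, and average the head counts.

```python
import random

random.seed(0)

def sample_heads(n_tosses, trials=200_000):
    """Estimate E[X] where X | Y=y ~ Binomial(n_tosses, y) and Y ~ Uniform(0,1)."""
    total = 0
    for _ in range(trials):
        y = random.random()  # the random bias Y of the coin
        # X | Y = y is the number of heads in n_tosses tosses with bias y
        total += sum(random.random() < y for _ in range(n_tosses))
    return total / trials

n = 10
est = sample_heads(n)
print(est)  # close to n/2 = 5, as the law of iterated expectations predicts
```

The estimate converges to nE[Y] = n/2 regardless of how the bias is realized on each trial.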

Example 69. We start with a stick of length ℓ. We break it at a point which is chosen randomly and uniformly over its length, and keep the piece that contains the left end of the stick. We then repeat the same process on the piece that we were left with. What is the expected length of the piece that we are left with after breaking twice?

Solution. Let Y be the length of the piece after we break for the first time. Let X be the length after we break for the second time. We have E[X|Y = y] = (y + 0)/2 (since the breakpoint is chosen uniformly over a piece of length Y = y). Hence, E[X|Y] = Y/2. For a similar reason, we also have E[Y] = ℓ/2. Thus,

E[X] = E[E[X|Y]] = E[Y/2] = ℓ/4

We now introduce the random variable Var(X|Y), defined as

Var(X|Y) = E[(X − E[X|Y])²|Y] = E[X̃²|Y]   (3.35)

where X̃ = X − E[X|Y] (the estimation error, as we will see later). That is, Var(X|Y = y) = E[X̃²|Y = y].


Theorem 13. (Law of Total Variance)

Var(X ) = E [Var(X |Y )]+Var(E [X |Y ]) (3.36)

Proof. Let X̂ = E[X|Y] and X̃ = X − X̂. Using the fact that E[X̃] = E[X − X̂] = E[X] − E[X̂] = 0 and the law of iterated expectations, we can write the variance of X̃ as Var(X̃) = E[X̃²] = E[E[X̃²|Y]] = E[Var(X|Y)].

We now show that X̂ and X̃ are uncorrelated. Using the law of iterated expectations, we have E[X̂X̃] = E[E[X̂X̃|Y]] = E[X̂E[X̃|Y]] = 0, where the last two equalities follow from the fact that X̂ is completely determined by Y, so that E[X̂X̃|Y] = X̂E[X̃|Y] = 0. It follows that Cov(X̂, X̃) = E[X̂X̃] − E[X̂]E[X̃] = 0 − E[X̂] · 0 = 0. As a consequence, taking the variance of both sides of the equation X = X̂ + X̃, we obtain Var(X) = Var(X̂) + Var(X̃). Hence, Var(X) = E[Var(X|Y)] + Var(E[X|Y]).

The law of total variance is helpful in calculating variances of random variables by using conditioning.

Example 70. Consider again the problem where we break a stick of length ℓ twice at randomly chosen points. Here, Y is the length of the piece left after the first break and X is the length after the second break. We calculated the mean of X as ℓ/4. Find Var(X).

Solution. Since X is uniformly distributed between 0 and Y, we have Var(X|Y) = Y²/12. Thus, since Y is uniformly distributed between 0 and ℓ, we have

E[Var(X|Y)] = (1/12) ∫₀^ℓ (1/ℓ) y² dy = ℓ²/36

We also have E[X|Y] = Y/2, so

Var(E[X|Y]) = Var(Y/2) = (1/4)Var(Y) = (1/4) · (ℓ²/12) = ℓ²/48

Using the law of total variance, we obtain

Var(X) = E[Var(X|Y)] + Var(E[X|Y]) = ℓ²/36 + ℓ²/48 = 7ℓ²/144
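Both the mean ℓ/4 and the variance 7ℓ²/144 of the double stick break can be verified by simulation; the following sketch (sample size and seed are arbitrary choices) draws the two uniform break points directly.

```python
import random

random.seed(1)

ell = 1.0
trials = 400_000
xs = []
for _ in range(trials):
    y = random.uniform(0, ell)  # length kept after the first break
    x = random.uniform(0, y)    # length kept after the second break
    xs.append(x)

mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / trials
print(mean, var)  # close to ell/4 = 0.25 and 7*ell**2/144 ≈ 0.0486
```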

Finally, we present an important property: for any function g(·), we have

E [Xg(Y )|Y ] = g(Y )E [X |Y ] (3.37)

This is because given the value of Y , g(Y ) is a constant and can be pulled outside of the expectation.

Example 71. Let X ∼ exp(1) and suppose that Y|X = x ∼ N(0, x²). Find E[Y²] and E[Y²X³].

Solution.

E[Y²] = E[E[Y²|X]] = E[Var(Y|X)] = E[X²] = ∫₀^∞ x² e^{−x} dx = 2

E[Y²X³] = E[E[Y²X³|X]] = E[X³E[Y²|X]] = E[X³ · X²] = E[X⁵] = ∫₀^∞ x⁵ e^{−x} dx = 5! = 120


3.6 Bivariate Gaussian Distribution

The bivariate Gaussian distribution appears in numerous applications in electrical engineering. It is the most important model used in communication systems that involve dealing with signals in the presence of noise. The random variables X and Y are said to be jointly Gaussian if their joint PDF has the form

f_{X,Y}(x,y) = (1/(2πσ_X σ_Y √(1−ρ²))) exp{ −(1/(2(1−ρ²))) [ ((x−µ_X)/σ_X)² + ((y−µ_Y)/σ_Y)² − 2ρ((x−µ_X)/σ_X)((y−µ_Y)/σ_Y) ] }   (3.38)

The PDF is centered at the point (µ_X, µ_Y), and has a bell shape that depends on the values of σ_X, σ_Y, and ρ_{X,Y} ≜ ρ. If X and Y are jointly Gaussian, then they are individually Gaussian; that is, X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²).

Proposition 3. If a pair of random variables is jointly Gaussian, then both marginals are Gaussian. The converse is not necessarily true: two random variables can have Gaussian marginals without being jointly Gaussian.

Example 72. Let X₁ ∼ N(0,1) and X₂ = +1 with probability 1/2, −1 with probability 1/2, be independent random variables. Let X₃ = X₁X₂; then X₃ ∼ N(0,1), but X₁ and X₃ are not jointly Gaussian.
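One way to see the failure of joint Gaussianity in this example: the sum X₁ + X₃ = X₁(1 + X₂) equals exactly 0 whenever X₂ = −1, so it has a point mass at 0 and cannot be Gaussian, whereas any linear combination of jointly Gaussian variables would be Gaussian. A short simulation sketch (arbitrary sample size and seed):

```python
import random

random.seed(3)

trials = 100_000
zeros = 0
for _ in range(trials):
    x1 = random.gauss(0.0, 1.0)
    x2 = 1.0 if random.random() < 0.5 else -1.0
    x3 = x1 * x2              # marginally N(0,1)
    if x1 + x3 == 0.0:        # X1*(1 + X2) is exactly 0 when X2 = -1
        zeros += 1

print(zeros / trials)  # close to 0.5: a point mass at 0, so X1 + X3 is not Gaussian
```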

Proposition 4. If jointly Gaussian random variables are uncorrelated, then they are independent.

Proposition 5. If f_{X,Y}(x,y) is jointly Gaussian, then f_{X|Y}(x|y) is a Gaussian distribution with expected value

E[X|Y = y] = µ_X + ρ(σ_X/σ_Y)(y − µ_Y)   (3.39)

and variance

Var(X|Y = y) = σ_X²(1 − ρ²)   (3.40)

Proof. Consider the jointly Gaussian distribution f_{X,Y}(x,y). The conditional PDF f_{X|Y}(x|y) is

f_{X|Y}(x|y) = f_{X,Y}(x,y)/f_Y(y) = (1/√(2πσ_X²(1−ρ²))) exp{ −(1/(2(1−ρ²)σ_X²)) [x − µ_X − ρ(σ_X/σ_Y)(y − µ_Y)]² }

Hence, X|Y = y is a Gaussian distribution with mean µ_X + ρ(σ_X/σ_Y)(y − µ_Y) and variance σ_X²(1 − ρ²).
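The conditional-mean and conditional-variance formulas can be checked empirically: generate jointly Gaussian pairs, keep only those samples where Y falls in a narrow bin around a target value y₀, and compare the sample statistics of X against (3.39) and (3.40). The parameter values, bin width, and seed below are arbitrary illustration choices.

```python
import math
import random

random.seed(4)

mu_x, mu_y = 1.0, -2.0
sx, sy, rho = 2.0, 3.0, 0.6

# Build (X, Y) jointly Gaussian by construction:
# X = mu_x + rho*(sx/sy)*(Y - mu_y) + sqrt(1 - rho^2)*sx*Z, with Z ~ N(0,1)
samples = []
for _ in range(400_000):
    y = random.gauss(mu_y, sy)
    x = mu_x + rho * (sx / sy) * (y - mu_y) + math.sqrt(1 - rho**2) * sx * random.gauss(0, 1)
    samples.append((x, y))

y0 = 0.0  # condition on Y being near y0
cond = [x for (x, y) in samples if abs(y - y0) < 0.05]
m = sum(cond) / len(cond)
v = sum((x - m) ** 2 for x in cond) / len(cond)

print(m)  # close to mu_x + rho*(sx/sy)*(y0 - mu_y) = 1 + 0.6*(2/3)*2 = 1.8
print(v)  # close to sx^2*(1 - rho^2) = 4*0.64 = 2.56
```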

3.7 Functions of Two Random Variables

Quite often we are interested in one or more functions of the random variables associated with some experiment. For example, we may be interested in the output of a fading channel, Y = ΛX + N, where Λ, X and N are random variables with given probability distributions.

3.7.1 One Function of Two Random Variables

Let the random variable Z be defined as a function of two random variables X and Y: Z = g(X,Y). The CDF of Z is found as

F_Z(z) = Pr(Z ≤ z) = Pr(g(X,Y) ≤ z) = ∬_{g(x,y)≤z} f_{X,Y}(x,y) dx dy

The PDF of Z is then found by taking the derivative of F_Z(z).

Copyright c© 2019, Dr. Mustafa El-Halabi 54

Page 61: CCE506: STOCHASTIC PROCESSES, DETECTION & ESTIMATIONmustafa-halabi.appspot.com/NOTESCCE506.pdf · de nition includes deterministic as well as non-deterministic signals. A deterministic

PAIRS OF RANDOM VARIABLES 55

Example 73. Let Z = X + Y. We need to find f_Z(z) in terms of the joint PDF of X and Y. The CDF of Z is

F_Z(z) = ∬_{x+y≤z} f_{X,Y}(x,y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{z−y} f_{X,Y}(x,y) dx dy

f_Z(z) = d/dz ∫_{−∞}^{∞} [∫_{−∞}^{z−y} f_{X,Y}(x,y) dx] dy = ∫_{−∞}^{∞} d/dz [∫_{−∞}^{z−y} f_{X,Y}(x,y) dx] dy = ∫_{−∞}^{∞} f_{X,Y}(z − y, y) dy

Furthermore, suppose that X and Y are independent; then

f_Z(z) = ∫_{−∞}^{∞} f_X(z − y) f_Y(y) dy = f_X(z) ∗ f_Y(z)

We could have obtained this result using the characteristic function:

Φ_Z(ω) = E[e^{jωZ}] = E[e^{jωX} e^{jωY}] = E[e^{jωX}]E[e^{jωY}] = Φ_X(ω)Φ_Y(ω) ⇒ f_Z(z) = f_X(z) ∗ f_Y(z)
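For two independent Uniform(0,1) variables, the convolution f_X ∗ f_Y of two rectangles gives the triangular density f_Z(z) = z on [0,1] and 2 − z on [1,2], so F_Z(1/2) = 1/8 and F_Z(1) = 1/2. A quick simulation sketch (arbitrary sample size and seed) confirms these values:

```python
import random

random.seed(5)

trials = 500_000
z = [random.random() + random.random() for _ in range(trials)]

# Triangular CDF of the sum: F_Z(z) = z^2/2 on [0,1], so F_Z(0.5) = 0.125, F_Z(1) = 0.5
p_half = sum(v <= 0.5 for v in z) / trials
p_one = sum(v <= 1.0 for v in z) / trials
print(p_half, p_one)  # close to 0.125 and 0.5
```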

Example 74. Suppose Z = X² + Y², where X ∼ N(0,1), Y ∼ N(0,1), and ρ_{X,Y} = 0. We need to find the PDF of Z.

Φ_Z(ω) = E[e^{jωZ}] = E[e^{jωX²}]E[e^{jωY²}]

E[e^{jωX²}] = ∫_{−∞}^{∞} (1/√(2π)) exp(−x²/2 + jωx²) dx = (1/√(1−2jω)) ∫_{−∞}^{∞} (1/√(2π)) exp(−u²/2) du = 1/√(1−2jω)

Hence, Φ_Z(ω) = 1/(1−2jω). Taking the inverse Fourier transform (replacing −ω with ω), we get

f_Z(z) = (1/2) exp(−z/2) U(z)

A more general method for finding the PDF of Z is as follows:

1. Condition on one of the variables, say X = x. Then, g(X,Y) becomes a one-to-one transformation.

2. Find f_Z(z|x) using the techniques presented earlier for single random variable transformations.

3. f_Z(z) = ∫_{−∞}^{∞} f_Z(z|x) f_X(x) dx

Example 75. Let Z = Y/X, where X ∼ N(0,1), Y ∼ N(0,1), and ρ_{X,Y} = 0. Find f_Z(z).

Solution. Conditioned on X = x, Z ∼ N(0, 1/x²).

f_Z(z) = ∫_{−∞}^{∞} √(x²/(2π)) exp(−z²x²/2) (1/√(2π)) exp(−x²/2) dx

= (1/π) ∫₀^∞ x exp[−(1/2)(1 + z²)x²] dx = 1/(π(1 + z²))

which is the standard Cauchy PDF.
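The Cauchy result can be checked against its CDF, F_Z(z) = 1/2 + arctan(z)/π, so in particular F_Z(1) = 3/4. A simulation sketch (arbitrary sample size and seed):

```python
import random

random.seed(6)

trials = 400_000
count = 0
for _ in range(trials):
    x = random.gauss(0, 1)
    y = random.gauss(0, 1)
    if y / x <= 1.0:   # ratio of independent standard normals is standard Cauchy
        count += 1

print(count / trials)  # close to the Cauchy CDF at 1: 1/2 + arctan(1)/pi = 0.75
```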


3.7.2 Transformations of Two Random Variables

Given two random variables X and Y, we now create two new random variables Z and W according to some transformation of the general form

Z = g₁(X,Y), W = g₂(X,Y)

Define A_XY as the region A_XY = (x, x + ε_x) × (y, y + ε_y). We can write

Pr((x,y) ∈ A_XY) = f_{X,Y}(x,y) ε_x ε_y = f_{X,Y}(x,y) × (Area of A_XY)

Let g₁(·) and g₂(·) map A_XY to A_ZW. Then,

Pr((x,y) ∈ A_XY) = Pr((z,w) ∈ A_ZW) = f_{Z,W}(z,w) × (Area of A_ZW)

Therefore,

f_{Z,W}(z,w) = f_{X,Y}(x,y) (Area of A_XY / Area of A_ZW)

From calculus, we know that

Area of A_XY / Area of A_ZW = |J(X Y / Z W)| = |det [ ∂x/∂z  ∂y/∂z ; ∂x/∂w  ∂y/∂w ]|   (3.41)

Any expression involving X and Y must be replaced with the appropriate functions of Z and W. Let the inverse transformation be

X = h₁(Z,W), Y = h₂(Z,W)

f_{Z,W}(z,w) = f_{X,Y}(x,y) |J(X Y / Z W)| evaluated at x = h₁(z,w), y = h₂(z,w)   (3.42)

or

f_{Z,W}(z,w) = f_{X,Y}(x,y) / |J(Z W / X Y)| evaluated at x = h₁(z,w), y = h₂(z,w)   (3.43)

The above derivations are valid even if the transformation is not one-to-one. If the inverse transformation has multiple roots, then the expressions must be evaluated at each root and the results added.

Example 76. Let X and Y have the following joint distribution

f_{X,Y}(x,y) = (1/(2πσ²)) exp(−(x² + y²)/(2σ²))

Let R = √(X² + Y²) and Θ = tan⁻¹(Y/X). Find f_{R,Θ}(r,θ), f_R(r), and f_Θ(θ).

Solution. X = R cosΘ, Y = R sinΘ,

J(X Y / R Θ) = |det [ cosθ  sinθ ; −r sinθ  r cosθ ]| = |r cos²θ + r sin²θ| = r


Hence, f_{R,Θ}(r,θ) = (r/(2πσ²)) exp(−(x² + y²)/(2σ²))|_{x=r cosθ, y=r sinθ} = (r/(2πσ²)) exp(−r²/(2σ²)), r ≥ 0, 0 ≤ θ ≤ 2π.

f_R(r) = ∫₀^{2π} f_{R,Θ}(r,θ) dθ = (r/σ²) exp(−r²/(2σ²)) U(r)

f_Θ(θ) = ∫₀^∞ f_{R,Θ}(r,θ) dr = 1/(2π), 0 ≤ θ ≤ 2π
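The magnitude f_R is the Rayleigh density, whose mean is σ√(π/2), and the angle is uniform on [0, 2π). Both facts can be checked by simulation (arbitrary sample size and seed):

```python
import math
import random

random.seed(7)

sigma = 1.0
trials = 300_000
r_sum = 0.0
first_quadrant = 0
for _ in range(trials):
    x = random.gauss(0, sigma)
    y = random.gauss(0, sigma)
    r_sum += math.hypot(x, y)                       # R = sqrt(X^2 + Y^2)
    if math.atan2(y, x) % (2 * math.pi) < math.pi / 2:
        first_quadrant += 1                          # Theta in [0, pi/2)

print(r_sum / trials)           # Rayleigh mean: sigma*sqrt(pi/2) ≈ 1.2533
print(first_quadrant / trials)  # uniform angle: close to 1/4
```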

Example 77. Suppose X and Y are independent and both uniformly distributed over (0,1), so that

f_{X,Y}(x,y) = 1, 0 ≤ x, y < 1

An important transformation in the world of computer simulations that generates Gaussian random variables is the "Box-Muller transformation". This transformation defines the new random variables Z and W as follows:

Z = √(−2 ln(X)) cos(2πY)
W = √(−2 ln(X)) sin(2πY)

Show that this transformation maps a pair of independent uniform random variables into a pair of independent Gaussian random variables.

Solution.

W/Z = tan(2πY) ⇒ Y = (1/(2π)) tan⁻¹(W/Z)

Z² + W² = −2 ln X ⇒ X = exp(−(Z² + W²)/2)

∂Z/∂X = −(1/X) cos(2πY)/√(−2 ln X),  ∂W/∂X = −(1/X) sin(2πY)/√(−2 ln X)

∂Z/∂Y = −2π√(−2 ln X) sin(2πY),  ∂W/∂Y = 2π√(−2 ln X) cos(2πY)

J(Z W / X Y) = −2π/X ⇒ f_{Z,W}(z,w) = f_{X,Y}(x,y)/|−2π/X| = (1/(2π)) exp(−(z² + w²)/2)

Hence, (Z,W) is a pair of jointly Gaussian independent random variables, with zero mean and unit variance.
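The Box-Muller transformation translates directly into code; a minimal sketch (sample size and seed are arbitrary choices) generates standard normal pairs from uniform draws and checks the first two moments:

```python
import math
import random

random.seed(8)

def box_muller():
    """Map two independent Uniform(0,1) draws to two independent N(0,1) draws."""
    x = 1.0 - random.random()   # in (0, 1], avoids log(0)
    y = random.random()
    r = math.sqrt(-2.0 * math.log(x))
    return r * math.cos(2 * math.pi * y), r * math.sin(2 * math.pi * y)

samples = []
for _ in range(200_000):
    z, w = box_muller()
    samples.extend([z, w])

mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples) - mean**2
print(mean, var)  # close to 0 and 1
```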

We should highlight the following conclusions:

1. Any linear transformation of independent Gaussian random variables produces jointly Gaussian random variables.

2. Any set of correlated Gaussian random variables can be obtained from uncorrelated Gaussian random variables through a linear transformation.

3. Any linear transformation of jointly Gaussian random variables produces jointly Gaussian random variables.


CHAPTER 4

VECTOR RANDOM VARIABLE

In this chapter, we will introduce random vectors and extend most of the concepts developed in the previous chapter.

Many problems encountered in engineering and science involve sets of random variables that are grouped for some purpose.

Such sets of random variables are conveniently studied by vector methods. For this reason we treat these grouped random

variables as a single object called a random vector.

4.1 Random Vector

Definition 47. (Random Vector) A vector X̄ = (X₁, X₂, ..., Xₙ)ᵀ, having all of its components Xᵢ, i ∈ {1, ..., n}, as random variables, each with mean E(Xᵢ) = µᵢ, is called a random vector.

Example 78. A computer system equipped with a TV camera is designed to recognize black-lung disease from x-rays. It

does this by counting the number of radio-opacities in six lung zones (three in each lung) and estimating the average size of

the opacities in each zone. The result is a 12-component vector from which a decision is made. What is the best computer

decision?

For a random vector X̄, the joint CDF, PMF and PDF are given respectively by

F_X̄(x̄) = F_{X₁,X₂,...,Xₙ}(x₁,x₂,...,xₙ) = P(X₁ ≤ x₁, X₂ ≤ x₂, ..., Xₙ ≤ xₙ)   (4.1)

p_X̄(x̄) = p_{X₁,X₂,...,Xₙ}(x₁,x₂,...,xₙ) = P(X₁ = x₁, X₂ = x₂, ..., Xₙ = xₙ)   (4.2)

f_X̄(x̄) = f_{X₁,X₂,...,Xₙ}(x₁,x₂,...,xₙ) = ∂ⁿ/(∂x₁∂x₂···∂xₙ) F_{X₁,X₂,...,Xₙ}(x₁,x₂,...,xₙ)   (4.3)

Example 79. Let X̄ = (X₁,X₂,X₃)ᵀ denote the position of a particle inside a sphere of radius a centered about the origin. Assume that at the instant of observation, the particle is equally likely to be anywhere in the sphere, that is,

f_X̄(x̄) = 3/(4πa³), √(x₁² + x₂² + x₃²) < a

Compute the probability that the particle lies within the subsphere of radius 2a/3 contained within the larger sphere and centered about the origin.


Solution. Let E denote the event that the particle lies within the subsphere and let

R ≜ {(x₁,x₂,x₃) : √(x₁² + x₂² + x₃²) < 2a/3}

The evaluation of

P(E) = ∭_R f_X̄(x̄) dx̄

is best done using spherical coordinates, that is,

P(E) = (3/(4πa³)) ∫_{r=0}^{2a/3} ∫_{φ=0}^{π} ∫_{θ=0}^{2π} r² sinφ dθ dφ dr = 8/27

Note that in this case the answer can be obtained easily by noting the ratio of volumes: (2a/3)³/a³ = 8/27.

The marginal CDFs and PDFs can be defined as follows:

F_{X₁,X₂,...,Xₘ}(x₁,x₂,...,xₘ) = F_{X₁,X₂,...,Xₙ}(x₁,x₂,...,xₘ,∞,...,∞)   (4.4)

f_{X₁,X₂,...,Xₘ}(x₁,x₂,...,xₘ) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f_{X₁,X₂,...,Xₙ}(x₁,x₂,...,xₙ) dxₘ₊₁ ··· dxₙ₋₁ dxₙ   (4.5)

For a set of n random variables X₁,X₂,...,Xₙ, the conditional PDF of X₁,X₂,...,Xₘ conditioned on Xₘ₊₁,Xₘ₊₂,...,Xₙ is given by:

f_{X₁,...,Xₘ|Xₘ₊₁,...,Xₙ}(x₁,...,xₘ|xₘ₊₁,...,xₙ) = f_{X₁,...,Xₙ}(x₁,...,xₙ) / f_{Xₘ₊₁,...,Xₙ}(xₘ₊₁,...,xₙ)   (4.6)

Definition 48. The elements of a random vector X̄ = [X₁,X₂,...,Xₙ]ᵀ are independent if

f_X̄(x̄) = ∏_{i=1}^{n} f_{Xᵢ}(xᵢ)   (4.7)

For a distribution to be a valid n-dimensional joint distribution, it should satisfy the normalization and the non-negativity

properties.

Example 80. Let f_X̄(x̄) be given as

f_X̄(x̄) = K e^{−x̄ᵀΛ} U(x̄)

where Λ = (λ₁,...,λₙ)ᵀ with λᵢ > 0 for all i, x̄ = (x₁,...,xₙ)ᵀ, U(x̄) = 1 if xᵢ ≥ 0 for i = 1,...,n and zero otherwise, and K is a constant. For what value of K is f_X̄(x̄) a valid joint PDF?

Solution.

∫₀^∞ ··· ∫₀^∞ K e^{−∑ᵢ₌₁ⁿ λᵢxᵢ} dx₁ ··· dxₙ = 1 ⇒ (∫₀^∞ e^{−λ₁x₁} dx₁) ··· (∫₀^∞ e^{−λₙxₙ} dxₙ) = 1/K

Hence, (1/λ₁) × (1/λ₂) × ··· × (1/λₙ) = 1/K ⇒ K = ∏ᵢ₌₁ⁿ λᵢ.

Definition 49. (The Multinomial and Multinoulli Distributions)

The binomial distribution can be used to model the outcomes of coin tosses. To model the outcomes of tossing a K-sided die, we can use the multinomial distribution. This is defined as follows: let X̄ = [X₁,...,X_K]ᵀ be a random vector, where Xⱼ is the number of times side j of the die occurs. Then X̄ has the following joint PMF:

p_X̄(x̄) ≜ (n choose x₁ ··· x_K) ∏_{j=1}^{K} pⱼ^{xⱼ}   (4.8)


where pⱼ is the probability that side j shows up, and

(n choose x₁ ··· x_K) ≜ n!/(x₁! x₂! ··· x_K!)

is the multinomial coefficient (the number of ways to divide a set of size n = ∑_{k=1}^{K} x_k into subsets with sizes x₁ up to x_K).

Now suppose n = 1. This is like rolling a K-sided die once, so X̄ will be a vector of 0s and 1s (a bit vector), in which only one bit can be turned on. Specifically, if the die shows face k, then the k-th bit will be on. In this case, the PMF becomes

p_X̄(x̄) = ∏_{j=1}^{K} pⱼ^{I(xⱼ=1)}   (4.9)

This very common special case is known as the multinoulli distribution.
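A multinomial count vector can be simulated by repeated categorical (multinoulli) draws; the sketch below (face probabilities, sample size, and seed are arbitrary choices) checks that the empirical frequencies converge to the pⱼ:

```python
import random

random.seed(9)

p = [0.2, 0.5, 0.3]   # probabilities of the K = 3 die faces
n = 100_000           # number of rolls
counts = [0, 0, 0]    # the multinomial count vector X
for _ in range(n):
    u = random.random()
    if u < p[0]:               # inverse-CDF draw of one multinoulli trial
        counts[0] += 1
    elif u < p[0] + p[1]:
        counts[1] += 1
    else:
        counts[2] += 1

print([c / n for c in counts])  # close to [0.2, 0.5, 0.3]
```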

4.2 Matrix Analysis Primer

In this section, we review fundamental concepts in matrix analysis pertaining to our treatment of random vectors in this chapter.

Definition 50. (Positive Semi-Definite Matrix) A matrix M is positive semi-definite if z̄ᵀMz̄ ≥ 0 for all z̄ ∈ Rⁿ (we write M ⪰ 0).

As an aside, this also implies that the eigenvalues of M are all nonnegative.

Example 81. The identity matrix I is positive semi-definite since for any X̄ = (X₁, X₂)ᵀ,

X̄ᵀIX̄ = (X₁ X₂) [1 0; 0 1] (X₁; X₂) = ||X̄||² ≥ 0

Definition 51. (Eigenvalues and Eigenvectors) The eigenvalues of an n×n matrix M are those numbers λ for which the equation Mv̄ = λv̄ has a solution v̄ ≠ 0. The column vector v̄ = (v₁,v₂,...,vₙ)ᵀ is called an eigenvector.

Theorem 14. The number λ is an eigenvalue of the square matrix M if and only if

det(M − λI) = 0   (4.10)

Eigenvectors are often normalized so that v̄ᵀv̄ ≜ ||v̄||² = 1.

Example 82. Find the eigenvalues and the eigenvectors of the matrix M = [2 0 0; 0 3 4; 0 4 9].

Solution.

det [2−λ 0 0; 0 3−λ 4; 0 4 9−λ] = −λ³ + 14λ² − 35λ + 22 = 0,

λ₁ = 2, λ₂ = 1, and λ₃ = 11; therefore M ⪰ 0. The eigenvector corresponding to λ₁ = 2 is found using Mv̄ = 2v̄ to be v̄₁ = [1 0 0]ᵀ. Similarly, the other eigenvectors are v̄₂ = [0 2/√5 −1/√5]ᵀ and v̄₃ = [0 1/√5 2/√5]ᵀ.
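The eigenpairs above can be verified by direct multiplication, checking Mv̄ = λv̄ for each pair; a minimal sketch:

```python
import math

# The matrix of Example 82
M = [[2, 0, 0],
     [0, 3, 4],
     [0, 4, 9]]

def matvec(A, v):
    """Multiply a 3x3 matrix by a length-3 vector."""
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]

s = 1 / math.sqrt(5)
pairs = [
    (2.0,  [1.0, 0.0, 0.0]),
    (1.0,  [0.0, 2 * s, -s]),
    (11.0, [0.0, s, 2 * s]),
]

for lam, v in pairs:
    Mv = matvec(M, v)
    # Each eigenpair must satisfy M v = lambda v componentwise
    assert all(abs(Mv[i] - lam * v[i]) < 1e-12 for i in range(3))
print("all eigenpairs check out")
```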


Remark 5. If we model a system by this matrix M, then when the input is an eigenvector v̄, the output will be the same vector up to a constant λ, i.e., v̄ → M → λv̄.

Remark 6. Not all n×n matrices have n distinct eigenvalues or n linearly independent eigenvectors. Sometimes a matrix has fewer than n distinct eigenvalues but still has n linearly independent eigenvectors.

Definition 52. (Similar Matrices) Two n× n matrices A and B are called similar if there exists an n× n matrix T with

det(T ) 6= 0, such that

T−1AT = B (4.11)

Theorem 15. An n×n matrix M is similar to a diagonal matrix if and only if M has n linearly independent eigenvectors.

Theorem 16. Let M be a real symmetric matrix with eigenvalues λ1,λ2, ... ,λn. Then, M has n mutually orthogonal unit

eigenvectors v1,v2, ... ,vn.

Definition 53. (Hermitian Matrix) The matrix M for which the entries mᵢⱼ satisfy mᵢⱼ = m*ⱼᵢ is called a Hermitian matrix.

The following properties concerning the determinant of a matrix will be handy in upcoming sections:

1. For any n×n matrices A and B: det(AB) = det(A)det(B)

2. For an n×n matrix A: det(cA) = cndet(A)

3. For an n×n matrix A: det(AT ) = det(A)

4. For an n×n matrix A: det(A) = ∏ᵢ₌₁ⁿ λᵢ, where {λᵢ}ᵢ₌₁ⁿ are the eigenvalues of A.

5. For an n×n matrix A: Eigenvalues(I + cA) = 1 + c × Eigenvalues(A)

6. For an n×n matrix A: det(A⁻¹) = 1/det(A)

7. For a real symmetric matrix, let U be the matrix whose columns are its normalized eigenvectors; then Uᵀ = U⁻¹.

Theorem 17. (Eigenvalue Decomposition Theorem) Let M be a real symmetric matrix with eigenvalues λ₁,λ₂,...,λₙ and corresponding eigenvectors v̄₁,v̄₂,...,v̄ₙ. Then

M = UΛU⁻¹   (4.12)

where U = [v̄₁ v̄₂ ··· v̄ₙ], and Λ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues, i.e., Λᵢᵢ = λᵢ.

Proof. We can write Λ = UᵀMU. Since U is an orthogonal matrix, Uᵀ = U⁻¹, and M = (Uᵀ)⁻¹ΛU⁻¹ = UΛUᵀ.

4.3 Statistical Parameters

Definition 54. (Mean Vector) The mean vector of X̄ is denoted by µ̄ and defined as µ̄ = (µ₁, µ₂, ..., µₙ)ᵀ.

Definition 55. (Correlation Matrix)

R_X̄X̄ ≜ E[X̄X̄ᵀ]   (4.13)

= [ E[X₁²]   E[X₁X₂]  ···  E[X₁Xₙ] ;
    E[X₂X₁]  E[X₂²]   ···  E[X₂Xₙ] ;
    ⋮        ⋮        ⋱    ⋮ ;
    E[XₙX₁]  E[XₙX₂]  ···  E[Xₙ²] ]   (4.14)


Definition 56. (Covariance Matrix) The covariance matrix K_X̄X̄ of X̄ is defined as

K_X̄X̄ ≜ E[(X̄ − µ̄)(X̄ − µ̄)ᵀ] = E[ (X₁−µ₁; X₂−µ₂; ⋮; Xₙ−µₙ) (X₁−µ₁  X₂−µ₂  ···  Xₙ−µₙ) ]

= E[ (X₁−µ₁)²          (X₁−µ₁)(X₂−µ₂)  ···  (X₁−µ₁)(Xₙ−µₙ) ;
     (X₂−µ₂)(X₁−µ₁)    (X₂−µ₂)²        ···  (X₂−µ₂)(Xₙ−µₙ) ;
     ⋮                  ⋮               ⋱    ⋮ ;
     (Xₙ−µₙ)(X₁−µ₁)    (Xₙ−µₙ)(X₂−µ₂)  ···  (Xₙ−µₙ)² ]

= [ σ₁²  K₁₂  ···  K₁ₙ ;
    K₂₁  σ₂²  ···  K₂ₙ ;
    ⋮    ⋮    ⋱    ⋮ ;
    Kₙ₁  Kₙ₂  ···  σₙ² ]

where Kᵢⱼ = cov(Xᵢ,Xⱼ) = E[(Xᵢ − µᵢ)(Xⱼ − µⱼ)].

The correlation matrix and the covariance matrix are related in the following way:

K_X̄X̄ = R_X̄X̄ − µ̄µ̄ᵀ.   (4.15)

For instance, for n = 2, we have

Cov(X₁,X₂) = E[X₁X₂] − µ₁µ₂,

[ σ_X₁²       cov(X₁,X₂) ; cov(X₁,X₂)  σ_X₂² ] = [ E[X₁²]    E[X₁X₂] ; E[X₁X₂]  E[X₂²] ] − [ µ₁²   µ₁µ₂ ; µ₁µ₂  µ₂² ]
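Relation (4.15) holds for sample moments as well, so it can be checked numerically by drawing a correlated 2-D sample and forming the sample versions of µ̄, R and K (the generating model, sample size, and seed below are arbitrary illustration choices):

```python
import random

random.seed(10)

n = 200_000
data = []
for _ in range(n):
    x1 = random.gauss(1.0, 2.0)                # Var(X1) = 4, mean 1
    x2 = 0.5 * x1 + random.gauss(0.0, 1.0)     # correlated with X1: Cov = 2
    data.append((x1, x2))

mu = [sum(d[i] for d in data) / n for i in range(2)]
R = [[sum(d[i] * d[j] for d in data) / n for j in range(2)] for i in range(2)]
# Sample covariance via K = R - mu mu^T
K = [[R[i][j] - mu[i] * mu[j] for j in range(2)] for i in range(2)]

print(K)  # K[0][0] close to 4, K[0][1] = K[1][0] close to 2, K[1][1] close to 2
```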

Theorem 18. Correlation matrices R_X̄X̄ and covariance matrices K_X̄X̄ are symmetric and non-negative definite.

Proof. Since E[XᵢXⱼ] = E[XⱼXᵢ] for any (i,j)-th element in R_X̄X̄ and K_X̄X̄, we have R_X̄X̄ᵀ = R_X̄X̄ and K_X̄X̄ᵀ = K_X̄X̄. Thus, R_X̄X̄ and K_X̄X̄ are symmetric matrices. For the matrix R_X̄X̄ to be non-negative definite, it should satisfy z̄ᵀR_X̄X̄z̄ ≥ 0 for all z̄:

z̄ᵀR_X̄X̄z̄ = z̄ᵀE[X̄X̄ᵀ]z̄ = E[z̄ᵀX̄X̄ᵀz̄] = E[(z̄ᵀX̄)²] ≥ 0

This implies that the eigenvalues of the correlation matrix are all nonnegative. An identical proof holds for the covariance matrix.

Definition 57. Consider two real n-dimensional random vectors X̄ and Ȳ with respective mean vectors µ_X̄ and µ_Ȳ. Then, if the expected value of their outer product satisfies

E[X̄Ȳᵀ] = µ_X̄ µ_Ȳᵀ   (4.16)

X̄ and Ȳ are said to be uncorrelated. If E[X̄Ȳᵀ] = 0̄, X̄ and Ȳ are said to be orthogonal. Note that in the orthogonal case E[XᵢYⱼ] = 0 for all 1 ≤ i, j ≤ n. Thus, the expected value of the inner product is also zero, i.e., E[X̄ᵀȲ] = 0.

Example 83. The matrix K_X̄X̄ = [2 3; 3 2] cannot be a covariance matrix because

ρ_{X₁X₂} = cov(X₁,X₂)/(σ_{X₁}σ_{X₂}) = K₁₂/(√2 √2) = 3/2 > 1,

which contradicts the fact that |ρ_{X₁X₂}| ≤ 1.


4.4 Transformations

Given X̄ = [X₁ X₂ ··· Xₙ]ᵀ, we form the n-dimensional transformation

Ȳ = [Y₁ Y₂ ··· Yₙ]ᵀ = g(X̄)

The PDF of Ȳ can be obtained as a function of the PDF of X̄ in a way similar to the two-dimensional case:

f_Ȳ(ȳ) = f_X̄(x̄) |det[J(X₁ X₂ ··· Xₙ / Y₁ Y₂ ··· Yₙ)]| evaluated at x̄ = h(ȳ) = f_X̄(x̄) / |det[J(Y₁ Y₂ ··· Yₙ / X₁ X₂ ··· Xₙ)]| evaluated at x̄ = h(ȳ)   (4.17)

where |J| is the Jacobian and X̄ = h(Ȳ) is the inverse transformation.

Example 84. Consider the following joint PDF

f_X̄(x̄) = (2π)^{−3/2} exp[−(x₁² + x₂² + x₃²)/2]

We define the following transformation

y₁ = x₁² − x₂²
y₂ = x₁² + x₂²
y₃ = x₃

Find f_Ȳ(ȳ).

Solution. The 4 solutions of the system are x₁ = ±√((y₁+y₂)/2), x₂ = ±√((y₂−y₁)/2), x₃ = y₃. For the roots to be real, we need y₂ + y₁ ≥ 0 and y₂ − y₁ ≥ 0, i.e., y₂ ≥ |y₁|. We must compute the Jacobian

|J| = |det [2x₁ −2x₂ 0; 2x₁ 2x₂ 0; 0 0 1]| = 8|x₁x₂|

At each of the four roots, |x₁x₂| = √((y₁+y₂)/2) · √((y₂−y₁)/2) = (1/2)√(y₂² − y₁²), so |J| = 4√(y₂² − y₁²) is the same at every root. Labeling the 4 solutions as x̄₁, x̄₂, x̄₃, x̄₄, and noting that f_X̄ takes the same value (2π)^{−3/2} exp[−(y₂ + y₃²)/2] at all of them, we obtain

f_Ȳ(ȳ) = (1/(4√(y₂² − y₁²))) ∑ᵢ₌₁⁴ f_X̄(x̄ᵢ) = ((2π)^{−3/2}/√(y₂² − y₁²)) exp[−(y₂ + y₃²)/2] U(y₂ − y₁)U(y₂ + y₁)

4.4.1 Linear Transformations

Consider the linear transformation

Ȳ = AX̄ + b̄

where the matrix A is defined as

A = [ a₁₁ a₁₂ ··· a₁ₙ ; a₂₁ a₂₂ ··· a₂ₙ ; ⋮ ⋮ ⋱ ⋮ ; aₙ₁ aₙ₂ ··· aₙₙ ] = [ ∂y₁/∂x₁ ∂y₁/∂x₂ ··· ∂y₁/∂xₙ ; ∂y₂/∂x₁ ∂y₂/∂x₂ ··· ∂y₂/∂xₙ ; ⋮ ⋮ ⋱ ⋮ ; ∂yₙ/∂x₁ ∂yₙ/∂x₂ ··· ∂yₙ/∂xₙ ] = J(Y₁ Y₂ ··· Yₙ / X₁ X₂ ··· Xₙ)

In this case,

f_Ȳ(ȳ) = (1/|det(A)|) f_X̄(A⁻¹(ȳ − b̄))   (4.18)


Theorem 19. Given the linear transformation Ȳ = AX̄ + b̄, where X̄ is an n-dimensional random vector and Ȳ is an n-dimensional random vector, we have

1. µ_Ȳ = Aµ_X̄ + b̄

2. R_ȲȲ = AR_X̄X̄Aᵀ + Aµ_X̄b̄ᵀ + b̄µ_X̄ᵀAᵀ + b̄b̄ᵀ

3. K_ȲȲ = AK_X̄X̄Aᵀ

Proof.

1. E[Ȳ] = E[AX̄ + b̄] = E[AX̄] + b̄ = Aµ_X̄ + b̄

2. E[ȲȲᵀ] = E[(AX̄ + b̄)(AX̄ + b̄)ᵀ]
= E[(AX̄ + b̄)((AX̄)ᵀ + b̄ᵀ)]
= E[(AX̄ + b̄)(X̄ᵀAᵀ + b̄ᵀ)]
= E[AX̄X̄ᵀAᵀ + AX̄b̄ᵀ + b̄X̄ᵀAᵀ + b̄b̄ᵀ]
= AR_X̄X̄Aᵀ + Aµ_X̄b̄ᵀ + b̄µ_X̄ᵀAᵀ + b̄b̄ᵀ

3. K_ȲȲ = E[(AX̄ + b̄ − Aµ_X̄ − b̄)(AX̄ + b̄ − Aµ_X̄ − b̄)ᵀ]
= E[A(X̄ − µ_X̄)(X̄ − µ_X̄)ᵀAᵀ]
= AK_X̄X̄Aᵀ

Example 85. Let X̄ = (X₁, X₂)ᵀ and K_X̄X̄ = [4 2; 2 4]. Find A such that Ȳ = AX̄, where Ȳ = (Y₁, Y₂)ᵀ, such that Y₁ and Y₂ are uncorrelated.

Solution. Let

A = [a₁₁ a₁₂; a₂₁ a₂₂], Ȳ = (Y₁ Y₂)ᵀ ⇒ Y₁ = a₁₁X₁ + a₁₂X₂, Y₂ = a₂₁X₁ + a₂₂X₂.

We need K_ȲȲ to be K_ȲȲ = [σ_Y₁² 0; 0 σ_Y₂²]. By Theorem 17 (Eigenvalue Decomposition Theorem) we have Λ = UᵀK_X̄X̄U. Since K_ȲȲ = AK_X̄X̄Aᵀ, we need to pick the matrix A such that A = Uᵀ for K_ȲȲ to be a diagonal matrix. Hence, we need to find the eigenvalues and eigenvectors of K_X̄X̄. It can be easily shown that the eigenvalues of K_X̄X̄ are λ₁ = 2 and λ₂ = 6. For λ₁ = 2 the eigenvector is v̄₁ = [1/√2 −1/√2]ᵀ; for λ₂ = 6 the eigenvector is v̄₂ = [1/√2 1/√2]ᵀ.

A = Uᵀ = (1/√2) [1 −1; 1 1]

This leads to the final result Y₁ = (1/√2)(X₁ − X₂), Y₂ = (1/√2)(X₁ + X₂).
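The decorrelating transformation can be checked by computing K_ȲȲ = AK_X̄X̄Aᵀ explicitly; a minimal sketch with plain 2×2 matrix arithmetic:

```python
import math

# Covariance matrix and decorrelating matrix of Example 85
K = [[4.0, 2.0],
     [2.0, 4.0]]
s = 1 / math.sqrt(2)
A = [[s, -s],
     [s,  s]]

def matmul(P, Q):
    """Multiply two 2x2 matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

At = [[A[j][i] for j in range(2)] for i in range(2)]  # transpose of A
Kyy = matmul(matmul(A, K), At)
print(Kyy)  # close to [[2, 0], [0, 6]]: the eigenvalues of K on the diagonal
```

The diagonal entries of K_ȲȲ are exactly the eigenvalues λ₁ = 2 and λ₂ = 6, and the zero off-diagonal entries confirm that Y₁ and Y₂ are uncorrelated.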


4.5 Gaussian Random Variables in Multiple Dimensions

Definition 58 (Vector Gaussian PDF). The joint Gaussian PDF for a vector of n random variables X̄, with mean vector µ_X̄ and covariance matrix K_X̄X̄, is given by:

f_X̄(x̄) = (1/√((2π)ⁿ det(K_X̄X̄))) exp[−(1/2)(x̄ − µ_X̄)ᵀK_X̄X̄⁻¹(x̄ − µ_X̄)]   (4.19)

Example 86. For n = 1,

f_X̄(x̄) = (1/((2π)^{1/2}σ)) exp[−(1/2)(x̄ − µ̄)ᵀ(1/σ²)(x̄ − µ̄)] = (1/√(2πσ²)) exp[−(1/2)((x − µ_X)/σ)²]

Example 87. For n = 2, X̄ = (X₁,X₂)ᵀ and the covariance matrix K_X̄X̄ is defined by

K_X̄X̄ = [ σ_X₁²       Cov(X₁,X₂) ; Cov(X₁,X₂)  σ_X₂² ] = [ σ_X₁²       ρσ_X₁σ_X₂ ; ρσ_X₁σ_X₂  σ_X₂² ]

det(K_X̄X̄) = σ_X₁²σ_X₂² − ρ²σ_X₁²σ_X₂² = (1 − ρ²)σ_X₁²σ_X₂²

Hence,

f_{X₁X₂}(x₁,x₂) = (1/(2πσ_X₁σ_X₂√(1−ρ²))) exp[−β/(2(1−ρ²))],

where

β = ((x₁−µ_X₁)/σ_X₁)² − 2ρ((x₁−µ_X₁)/σ_X₁)((x₂−µ_X₂)/σ_X₂) + ((x₂−µ_X₂)/σ_X₂)²

Example 88. Let X, Y, Z be three zero-mean jointly Gaussian random variables with the following covariance matrix

K = [ 1 0.2 0.3 ; 0.2 1 0.3 ; 0.3 0.3 1 ],

Find the PDF f_{X,Z}(x,z).

Solution. From the given information, X and Z are jointly Gaussian and K_{XZ} = [1 0.3; 0.3 1]. From K_{XZ} we know that σ_X = σ_Z = 1 and Cov(X,Z) = 0.3, so ρ = 0.3/1 = 0.3. Therefore,

f_{XZ}(x,z) = (1/(2π√0.91)) exp[−(x² − 0.6xz + z²)/(2(0.91))].


Example 89. Find the expression for the PDF of an n-dimensional Gaussian vector with mutually uncorrelated components. Verify that uncorrelated jointly Gaussian random variables are independent.

Solution. The components are mutually uncorrelated, hence Cov(Xᵢ,Xⱼ) = 0 for all i ≠ j. Thus, K_X̄X̄ is a diagonal matrix

K_X̄X̄ = diag(σ₁², σ₂², ..., σₙ²) ⇒ K_X̄X̄⁻¹ = diag(σ₁⁻², σ₂⁻², ..., σₙ⁻²)

det(K_X̄X̄) = ∏ᵢ₌₁ⁿ σᵢ²

(x̄ − µ_X̄)ᵀK_X̄X̄⁻¹(x̄ − µ_X̄) = ∑ᵢ₌₁ⁿ ((xᵢ − µᵢ)/σᵢ)²

Hence,

f_X̄(x̄) = (1/√((2π)ⁿ ∏ᵢ₌₁ⁿ σᵢ²)) exp[−∑ᵢ₌₁ⁿ (xᵢ − µᵢ)²/(2σᵢ²)] = ∏ᵢ₌₁ⁿ (1/√(2πσᵢ²)) exp(−(xᵢ − µᵢ)²/(2σᵢ²)) = ∏ᵢ₌₁ⁿ f_{Xᵢ}(xᵢ)

So uncorrelated jointly Gaussian random variables are independent.

Theorem 20. Let X̄ be jointly Gaussian, A be an invertible matrix, and Ȳ = AX̄ + b̄; then Ȳ is jointly Gaussian, i.e., any invertible linear transformation of jointly Gaussian random variables results in jointly Gaussian random variables.

Proof. We start by performing the following manipulations:

(x̄ − µ_X̄)|_{x̄=A⁻¹(ȳ−b̄)} = A⁻¹ȳ − (A⁻¹b̄ + µ_X̄) = A⁻¹(ȳ − (Aµ_X̄ + b̄)) = A⁻¹(ȳ − µ_Ȳ)

K_ȲȲ = AK_X̄X̄Aᵀ ⇒ K_ȲȲ⁻¹ = (A⁻¹)ᵀK_X̄X̄⁻¹A⁻¹

det(K_ȲȲ) = |K_ȲȲ| = |A||K_X̄X̄||Aᵀ| = |A|²|K_X̄X̄|

Since

f_X̄(x̄) = (1/√((2π)ⁿ det(K_X̄X̄))) exp[−(1/2)(x̄ − µ_X̄)ᵀK_X̄X̄⁻¹(x̄ − µ_X̄)]

the PDF of Ȳ is given by:

f_Ȳ(ȳ) = f_X̄(x̄)/|A| |_{x̄=A⁻¹(ȳ−b̄)} = (1/((2π)^{n/2}√|K_X̄X̄| |A|)) exp[−(1/2)(ȳ − µ_Ȳ)ᵀ (A⁻¹)ᵀK_X̄X̄⁻¹A⁻¹ (ȳ − µ_Ȳ)]

where (A⁻¹)ᵀK_X̄X̄⁻¹A⁻¹ = K_ȲȲ⁻¹ and √|K_X̄X̄| |A| = √|K_ȲȲ|. Hence, Ȳ is jointly Gaussian with µ_Ȳ = Aµ_X̄ + b̄ and K_ȲȲ = AK_X̄X̄Aᵀ.

Example 90. Transform¯X (jointly Gaussian) into

¯Y = (Y1, ... ,Yn) where Yi are i.i.d.


Solution. For the Yᵢ to be i.i.d., the covariance matrix should have the diagonal form

K_ȲȲ = diag(σ_Y₁², σ_Y₂², ..., σ_Yₙ²)

Pick the random vector Ȳ = AX̄, where A is to be chosen such that K_ȲȲ = AK_X̄X̄Aᵀ is diagonal. Since K_X̄X̄ is symmetric, from the Eigenvalue Decomposition Theorem we have

UᵀK_X̄X̄U = Λ = diag(λ₁, λ₂, ..., λₙ)

where the λᵢ are the eigenvalues of K_X̄X̄ and U = [v̄₁, v̄₂, ..., v̄ₙ] is the eigenvector matrix. Hence, A = Uᵀ makes the components uncorrelated, and therefore (being jointly Gaussian) independent. To make them identically distributed as well, scale by Λ^{−1/2}: taking A = Λ^{−1/2}Uᵀ gives K_ȲȲ = Λ^{−1/2}UᵀK_X̄X̄UΛ^{−1/2} = Λ^{−1/2}ΛΛ^{−1/2} = I.

Lemma 1. If X₁, X₂, ..., Xₙ are jointly Gaussian random variables, then Z₁ = a₁X₁ + a₂X₂ + ··· + aₙXₙ is a Gaussian random variable for all choices of the aᵢ with at least one aᵢ ≠ 0. Hence, any (nontrivial) linear combination of jointly Gaussian random variables is a Gaussian random variable.

Proof. We can think of Z₁ as being a component of Z̄ = (Z₁, Z₂, ..., Zₙ)ᵀ, where

(Z₁; Z₂; ⋮; Zₙ) = [a₁ a₂ ··· aₙ; 0 1 ··· 0; ⋮ ⋮ ⋱ ⋮; 0 0 ··· 1] (X₁; X₂; ⋮; Xₙ) = (a₁X₁ + a₂X₂ + ··· + aₙXₙ; X₂; ⋮; Xₙ)

with the matrix above denoted A. If a₁ ≠ 0, A is invertible (full rank), which means that Z̄ is jointly Gaussian; if a₁ = 0, reorder the variables so that a nonzero coefficient appears in the first position. Thus, each component of Z̄ is Gaussian, in particular Z₁.

4.5.1 Quadratic Transformations

Suppose $\bar{X}$ is a vector of zero-mean Gaussian random variables. We seek to find the PDF of an arbitrary quadratic function of $\bar{X}$:
$$Z = \bar{X}^T B \bar{X} = \sum_{i=1}^{n} \sum_{j=1}^{n} B_{ij} X_i X_j \tag{4.20}$$
where $B$ is a symmetric matrix.

Using the characteristic function approach:
$$\Phi_Z(\omega) = E[e^{j\omega Z}] = E[e^{j\omega \bar{X}^T B \bar{X}}] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^n |K_{\bar{X}\bar{X}}|}} \exp\left[-\frac{1}{2}\bar{x}^T K_{\bar{X}\bar{X}}^{-1} \bar{x} + j\omega\, \bar{x}^T B \bar{x}\right] d\bar{x}$$
$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^n |K_{\bar{X}\bar{X}}|}} \exp\left[-\frac{1}{2}\bar{x}^T \left(K_{\bar{X}\bar{X}}^{-1} - 2j\omega B\right) \bar{x}\right] d\bar{x}$$


Define a matrix $F$ such that $F^{-1} = K_{\bar{X}\bar{X}}^{-1} - 2j\omega B$. Then,
$$\Phi_Z(\omega) = \sqrt{\frac{|F|}{|K_{\bar{X}\bar{X}}|}} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^n |F|}} \exp\left[-\frac{1}{2}\bar{x}^T F^{-1} \bar{x}\right] d\bar{x} = \sqrt{\frac{|F|}{|K_{\bar{X}\bar{X}}|}} = \frac{1}{\sqrt{|K_{\bar{X}\bar{X}}|\,|F^{-1}|}} = \frac{1}{\sqrt{|K_{\bar{X}\bar{X}} F^{-1}|}} = \frac{1}{\sqrt{|I - 2j\omega K_{\bar{X}\bar{X}} B|}} = \prod_{k=1}^{n} (1 - 2j\omega \lambda_k)^{-1/2}$$
where $\lambda_k$ is the $k$th eigenvalue of $K_{\bar{X}\bar{X}} B$.

Example 91. (Chi-Square Distribution) Let $\{X_i\}_{i=1}^{n}$ be uncorrelated zero-mean Gaussian random variables of equal variance $\sigma^2$. Define $Z = \sum_{i=1}^{n} X_i^2$. Find the PDF of $Z$.

Solution. Let $\bar{X} = [X_1\ X_2\ \ldots\ X_n]^T$, so that $Z = \bar{X}^T B \bar{X}$ with $B = I$. The covariance matrix is $K_{\bar{X}\bar{X}} = \sigma^2 I \Rightarrow B K_{\bar{X}\bar{X}} = \sigma^2 I \Rightarrow \lambda_k = \sigma^2$ for all $k$. Hence,
$$\Phi_Z(\omega) = \prod_{k=1}^{n} (1 - 2j\omega\sigma^2)^{-1/2} = (1 - 2j\omega\sigma^2)^{-n/2}$$
Taking the inverse Fourier transform we get the PDF of $Z$:
$$f_Z(z) = \frac{z^{\frac{n}{2}-1}}{(2\sigma^2)^{n/2}\,\Gamma(n/2)} \exp\left(-\frac{z}{2\sigma^2}\right) U(z)$$
The above PDF is known as the Chi-Square distribution.
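As a quick numerical sanity check (not part of the original notes), the characteristic function above can be compared against a Monte Carlo estimate of $E[e^{j\omega Z}]$; the parameter values, seed, and sample size below are arbitrary choices:

```python
import numpy as np

# Monte Carlo sanity check of Phi_Z(w) = (1 - 2j w sigma^2)^(-n/2) for
# Z = sum_i X_i^2. The values of n, sigma, w, the seed, and the sample
# size are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n, sigma, w = 4, 1.5, 0.3

X = rng.normal(0.0, sigma, size=(200_000, n))
Z = np.sum(X ** 2, axis=1)

phi_mc = np.mean(np.exp(1j * w * Z))                 # empirical E[exp(jwZ)]
phi_formula = (1 - 2j * w * sigma ** 2) ** (-n / 2)

print(phi_mc, phi_formula)   # the two values agree to a few decimal places
```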

Example 92. Let $X_1, X_2, X_3, X_4$ be zero-mean independent Gaussian random variables with common variance $\sigma^2$. Define $Z = X_1 X_2 + X_3 X_4$. Find the PDF of $Z$.

Solution. Let $\bar{X} = [X_1\ X_2\ X_3\ X_4]^T$. Then, $K_{\bar{X}\bar{X}} = \sigma^2 I$. We can write $Z$ as follows:
$$Z = [X_1\ X_2\ X_3\ X_4] \begin{bmatrix} 0 & 1/2 & 0 & 0 \\ 1/2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/2 \\ 0 & 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}$$
Hence,
$$B K_{\bar{X}\bar{X}} = \begin{bmatrix} 0 & \sigma^2/2 & 0 & 0 \\ \sigma^2/2 & 0 & 0 & 0 \\ 0 & 0 & 0 & \sigma^2/2 \\ 0 & 0 & \sigma^2/2 & 0 \end{bmatrix}$$
The eigenvalues of $B K_{\bar{X}\bar{X}}$ are $\lambda_1 = \lambda_2 = \frac{\sigma^2}{2}$ and $\lambda_3 = \lambda_4 = -\frac{\sigma^2}{2}$. The characteristic function can be written as
$$\Phi_Z(\omega) = \frac{1}{1 - j\omega\sigma^2} \cdot \frac{1}{1 + j\omega\sigma^2} = \frac{1}{1 + (\omega\sigma^2)^2}$$
Taking the inverse Fourier transform, we get
$$f_Z(z) = \frac{1}{2\sigma^2} \exp\left(-\frac{|z|}{\sigma^2}\right) : \text{Laplace distribution}$$
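Both the eigenvalues and the resulting characteristic function can be checked numerically; this sketch (with $\sigma = 1$ and arbitrary seed, sample size, and test frequency) is not part of the original notes:

```python
import numpy as np

# Numerical check of Example 92 with sigma = 1: the eigenvalues of B K_XX
# are +1/2 (twice) and -1/2 (twice), and Z = X1 X2 + X3 X4 has the Laplace
# characteristic function 1 / (1 + (w sigma^2)^2). Seed, sample size, and
# the test frequency w are arbitrary choices.
sigma = 1.0
B = np.array([[0.0, 0.5, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.5],
              [0.0, 0.0, 0.5, 0.0]])
K_XX = sigma ** 2 * np.eye(4)

lams = np.linalg.eigvalsh(B @ K_XX)   # ascending: [-0.5, -0.5, 0.5, 0.5]
print(lams)

rng = np.random.default_rng(1)
X = rng.normal(0.0, sigma, size=(200_000, 4))
Z = X[:, 0] * X[:, 1] + X[:, 2] * X[:, 3]
w = 0.7
phi_mc = np.mean(np.exp(1j * w * Z))             # empirical E[exp(jwZ)]
phi_th = 1.0 / (1.0 + (w * sigma ** 2) ** 2)
print(phi_mc, phi_th)
```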


4.5.2 Coloring and Whitening Transformations

Suppose we have a zero-mean, unit-variance vector of uncorrelated Gaussian random variables $\bar{X}$ and we want to create a Gaussian random vector $\bar{Y}$ with a specific covariance matrix $K$. The transformation that achieves this objective is called a coloring transformation. First, recall that any covariance matrix is symmetric and hence can be decomposed, using the Eigenvalue Decomposition Theorem, into $K = U\Lambda U^T$. Note also that $K$ is positive semi-definite, so its eigenvalues are all non-negative. Thus, the matrix $\Lambda$ is not only diagonal, but its diagonal elements are all non-negative and, as a result, $\Lambda$ is a valid covariance matrix:
$$\Lambda = \begin{bmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_n \end{bmatrix}$$
That is, suppose we create a set of $n$ uncorrelated Gaussian random variables $\bar{Z}$ with covariance matrix $\Lambda$. Then, the matrix $U$ will transform this set of uncorrelated Gaussian random variables into a new set of Gaussian random variables with the desired covariance matrix $K$.

Let $\bar{Z}$ be a zero-mean vector of uncorrelated Gaussian random variables with $\text{Var}(Z_i) = \lambda_i$. Then, $K_{\bar{Z}\bar{Z}} = \Lambda$. Form the linear transformation $\bar{Y} = U\bar{Z}$; then $K_{\bar{Y}\bar{Y}} = U\Lambda U^T = K$. Hence, $\bar{Y}$ has the desired covariance matrix.

In order to form $\bar{Z}$, we simply set $\bar{Z} = \sqrt{\Lambda}\,\bar{X}$, where $\sqrt{\Lambda} = \text{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_n})$. Since $\bar{X}$ has covariance matrix $I$, then $K_{\bar{Z}\bar{Z}} = \sqrt{\Lambda}\, I (\sqrt{\Lambda})^T = \Lambda$. The transformation is summarized by the chain
$$\bar{X} \xrightarrow{\ \sqrt{\Lambda}\ } \bar{Z} \xrightarrow{\ U\ } \bar{Y}, \qquad K_{\bar{X}\bar{X}} = I, \quad K_{\bar{Z}\bar{Z}} = \Lambda, \quad K_{\bar{Y}\bar{Y}} = U\Lambda U^T = K$$
Hence, to get $\bar{Y}$ with covariance matrix $K$ out of $\bar{X}$ with unit covariance matrix, we need to perform the following linear transformation:
$$\bar{Y} = U\sqrt{\Lambda}\,\bar{X} = \sqrt{K}\,\bar{X} \tag{4.21}$$

If the vector $\bar{Y}$ is specified to have a mean of $\mu_{\bar{Y}}$, then the transformation has the following form:
$$\bar{Y} = U\sqrt{\Lambda}\,\bar{X} + \mu_{\bar{Y}} = \sqrt{K}\,\bar{X} + \mu_{\bar{Y}} \tag{4.22}$$

Example 93. Consider a Gaussian random vector $\bar{X}$ of zero mean and $K_{\bar{X}\bar{X}} = I$. Find the transformation that turns $\bar{X}$ into a Gaussian random vector with mean vector $\mu_{\bar{Y}} = [1\ 0\ 3]^T$ and covariance matrix
$$K = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 4 \\ 0 & 4 & 9 \end{bmatrix}$$

Solution. The eigenvalues of $K$ are $\lambda_1 = 2$, $\lambda_2 = 1$, $\lambda_3 = 11$, with corresponding orthonormal eigenvectors $[1\ 0\ 0]^T$, $\frac{1}{\sqrt{5}}[0\ 2\ {-1}]^T$, $\frac{1}{\sqrt{5}}[0\ 1\ 2]^T$. Hence,
$$\sqrt{\Lambda} = \begin{bmatrix} \sqrt{2} & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \sqrt{11} \end{bmatrix}, \quad U = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{2}{\sqrt{5}} & \frac{1}{\sqrt{5}} \\ 0 & -\frac{1}{\sqrt{5}} & \frac{2}{\sqrt{5}} \end{bmatrix} \;\Rightarrow\; U\sqrt{\Lambda} = \begin{bmatrix} \sqrt{2} & 0 & 0 \\ 0 & \frac{2}{\sqrt{5}} & \frac{\sqrt{11}}{\sqrt{5}} \\ 0 & -\frac{1}{\sqrt{5}} & \frac{2\sqrt{11}}{\sqrt{5}} \end{bmatrix}$$
Thus,
$$\bar{Y} = \begin{bmatrix} \sqrt{2} & 0 & 0 \\ 0 & \frac{2}{\sqrt{5}} & \frac{\sqrt{11}}{\sqrt{5}} \\ 0 & -\frac{1}{\sqrt{5}} & \frac{2\sqrt{11}}{\sqrt{5}} \end{bmatrix} \bar{X} + \begin{bmatrix} 1 \\ 0 \\ 3 \end{bmatrix}$$
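The coloring matrix of Example 93 can be verified numerically; this sketch is not part of the original notes:

```python
import numpy as np

# Check of Example 93: with orthonormal eigenvectors, the coloring matrix
# C = U sqrt(Lambda) satisfies C C^T = U Lambda U^T = K.
K = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])

eigvals, U = np.linalg.eigh(K)        # eigenvalues 1, 2, 11 (ascending order)
C = U @ np.diag(np.sqrt(eigvals))     # coloring matrix U * sqrt(Lambda)

print(np.round(C @ C.T, 10))          # reproduces K
```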


As opposed to the coloring transformation, a whitening transformation is a decorrelation transformation that transforms a Gaussian random vector $\bar{Y}$ having a known covariance matrix $K$ and mean vector $\mu_{\bar{Y}}$ into a zero-mean Gaussian random vector $\bar{X}$ with identity covariance matrix. The transformation is called "whitening" because it changes the input vector into a white noise vector, and it is given by
$$\bar{X} = (U\sqrt{\Lambda})^{-1}(\bar{Y} - \mu_{\bar{Y}}) = \Lambda^{-1/2} U^T (\bar{Y} - \mu_{\bar{Y}}) \tag{4.23}$$
Note that since $U$ is an orthogonal matrix, $U^T = U^{-1}$.
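Equation (4.23) can be applied to samples to see the whitening effect; the covariance matrix, mean, seed, and sample size below are illustrative choices, not part of the original notes:

```python
import numpy as np

# Sketch of the whitening transformation (4.23) applied to samples.
# K, mu, the seed, and the sample size are illustrative choices.
rng = np.random.default_rng(2)
K = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])
mu = np.array([1.0, 0.0, 3.0])

eigvals, U = np.linalg.eigh(K)
W = np.diag(1.0 / np.sqrt(eigvals)) @ U.T     # Lambda^{-1/2} U^T

Y = rng.multivariate_normal(mu, K, size=100_000)
X = (Y - mu) @ W.T                            # X = W (Y - mu), applied row-wise

print(np.round(np.cov(X.T), 2))               # approximately the identity
```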

4.6 Overview on Estimation Theory

In this section we are interested in estimating the value of an inaccessible random variable $X$ in terms of the observation of an accessible random variable $Y$. For example, $X$ could be the input to a communication channel and $Y$ the observed output. In a prediction application, $X$ could be a future value of some quantity and $Y$ its present value.

4.6.1 Minimum Mean Square Error (MMSE) Estimation

The estimate for $X$ is given by a function of the observation, $\hat{X} = g(Y)$. In general, the estimation error, $X - \hat{X} = X - g(Y)$, is non-zero. We are usually interested in minimizing the estimation error. When $X$ and $Y$ are continuous random variables, this is equivalent to minimizing the mean square of the estimation error:
$$e = E[(X - g(Y))^2] \tag{4.24}$$
We first consider the case where $g(Y)$ is constrained to be a linear function of $Y$, and then consider the case where $g(Y)$ can be any function, whether linear or non-linear.

• Estimating a random variable $X$ by a constant.

Assume that $g(Y) = a$, where $a$ is a constant. The mean square estimation error to be minimized is
$$\min_a E[(X - a)^2] = \min_a \left(E[X^2] - 2aE[X] + a^2\right)$$
The best $a$ is found by taking the derivative with respect to $a$, setting the result to zero, and solving for $a$; the result is
$$a^* = E[X] \tag{4.25}$$
which makes sense since the expected value of $X$ is the center of mass of the PDF. The mean square error for this estimator is equal to
$$\text{MMSE} = E[(X - a^*)^2] = \text{Var}(X) \tag{4.26}$$

• Estimating a random variable $X$ by a linear function.

Assume that $g(Y) = aY + b$, where $a$ and $b$ are constants. The mean square estimation error to be minimized is
$$\min_{a,b} E[(X - aY - b)^2]$$
We can view this problem as estimating $X - aY$ by a constant $b$, which results in the best $b$ being
$$b^* = E[X - aY] = E[X] - aE[Y] \tag{4.27}$$
This implies that the best $a$ is found by solving
$$\min_a E\left[\big((X - E[X]) - a(Y - E[Y])\big)^2\right] \tag{4.28}$$
We once again differentiate with respect to $a$, set the result to zero, and solve for $a$:
$$a^* = \frac{\text{Cov}(X, Y)}{\text{Var}(Y)} = \rho_{X,Y}\, \frac{\sigma_X}{\sigma_Y} \tag{4.29}$$


Therefore, the LMMSE estimator is
$$\hat{X} = a^* Y + b^* = \rho_{X,Y}\, \sigma_X\, \frac{Y - E[Y]}{\sigma_Y} + E[X] \tag{4.30}$$
The mean square error of the best linear estimator is
$$\text{LMMSE} = E\left[\big((X - E[X]) - a^*(Y - E[Y])\big)^2\right] = \text{Var}(X) - a^*\,\text{Cov}(X, Y) = \text{Var}(X)(1 - \rho_{X,Y}^2) \tag{4.31}$$

• Estimating a random variable $X$ by an arbitrary function.

In general, the estimator for $X$ that minimizes the mean square error is a non-linear function of $Y$. The estimator $g(Y)$ that best approximates $X$ in the sense of minimizing the mean square error must satisfy
$$\min_{g(\cdot)} E[(X - g(Y))^2]$$
The problem can be solved by using conditional expectation:
$$E[(X - g(Y))^2] = E\big[E[(X - g(Y))^2 \mid Y]\big] = \int_{-\infty}^{\infty} E[(X - g(Y))^2 \mid Y = y]\, f_Y(y)\, dy$$
The integrand above is non-negative for all $y$; therefore the integral is minimized by minimizing $E[(X - g(Y))^2 \mid Y = y]$ for each $y$. But $g(y)$ is a constant as far as the conditional expectation is concerned, so the problem is equivalent to estimation by a constant, and the "constant" that minimizes $E[(X - g(Y))^2 \mid Y = y]$ is
$$g^*(y) = E[X \mid Y = y] \tag{4.32}$$
The minimum mean square error is
$$e^* = E[(X - g^*(Y))^2] = \int_{-\infty}^{\infty} E[(X - E[X \mid y])^2 \mid Y = y]\, f_Y(y)\, dy = \int_{-\infty}^{\infty} \text{Var}[X \mid Y = y]\, f_Y(y)\, dy \tag{4.33}$$
Linear estimators are in general suboptimal and have a larger mean square error.

Example 94. Consider the joint PDF
$$f_{XY}(x, y) = \begin{cases} 2e^{-x}e^{-y} & \text{if } 0 \le y \le x < \infty, \\ 0 & \text{otherwise.} \end{cases}$$
Straightforward computations give $E[XY] = 1$, $E[X] = 3/2$, $E[Y] = 1/2$, $\text{Var}(X) = 5/4$, $\text{Var}(Y) = 1/4$, and $\rho_{X,Y} = 1/\sqrt{5}$.

The best linear and non-linear estimators for $X$ in terms of $Y$ are:
$$\hat{X} = \frac{1}{\sqrt{5}}\cdot\frac{\sqrt{5}}{2}\cdot\frac{Y - 1/2}{1/2} + \frac{3}{2} = Y + 1$$
$$E[X \mid y] = \int_{y}^{\infty} x\, e^{-(x - y)}\, dx = y + 1 \;\Rightarrow\; E[X \mid Y] = Y + 1$$
Thus, the optimum linear and non-linear estimators are the same.

The best linear and non-linear estimators for $Y$ in terms of $X$ are:
$$\hat{Y} = \frac{1}{\sqrt{5}}\cdot\frac{1}{2}\cdot\frac{X - 3/2}{\sqrt{5}/2} + \frac{1}{2} = \frac{X + 1}{5}$$
$$E[Y \mid x] = \int_{0}^{x} y\, \frac{e^{-y}}{1 - e^{-x}}\, dy = 1 - \frac{x e^{-x}}{1 - e^{-x}}$$
The optimum linear and non-linear estimators are not the same in this case. See the figure below.


Figure 4.1: Comparison between the linear and non-linear estimators of $Y$ given $x$.
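The two estimators of $Y$ given $X$ in Example 94 can also be compared by simulation; this sketch (with an arbitrary seed and sample size) is not part of the original notes:

```python
import numpy as np

# Monte Carlo comparison of the two estimators of Y given X in Example 94.
# Sampling uses the factorization Y ~ Exp(rate 2), X - Y ~ Exp(rate 1),
# independent, which reproduces f(x, y) = 2 e^{-x} e^{-y} on 0 <= y <= x.
# Seed and sample size are arbitrary choices.
rng = np.random.default_rng(3)
n = 400_000
Y = rng.exponential(scale=0.5, size=n)        # Exp with rate 2
X = Y + rng.exponential(scale=1.0, size=n)    # X = Y + Exp with rate 1

mse_linear = np.mean((Y - (X + 1) / 5) ** 2)  # LMMSE = Var(Y)(1 - rho^2) = 0.2
mse_mmse = np.mean((Y - (1 - X * np.exp(-X) / (1 - np.exp(-X)))) ** 2)

print(mse_linear, mse_mmse)   # the conditional-mean estimator does better
```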

Example 95. Let $X$ be uniformly distributed in the interval $(-1, 1)$ and let $Y = X^2$. Find the best linear estimator for $Y$ in terms of $X$. Compare its performance to the best estimator.

Solution. The mean of $X$ is zero and its correlation with $Y$ is $E[XY] = E[X^3] = \int_{-1}^{1} \frac{x^3}{2}\, dx = 0$. Therefore, $\text{Cov}(X, Y) = 0$ and the best linear estimator for $Y$ is the constant $E[Y]$. The mean square error of this estimator is $\text{Var}(Y)$. The best non-linear estimator is given by $E[Y \mid X = x] = E[X^2 \mid X = x] = x^2$. Its mean square error is $E[(Y - g(X))^2] = E[(X^2 - X^2)^2] = 0$. Thus, in this problem, the best linear estimator performs poorly while the non-linear estimator gives the smallest possible mean square error, zero.

Example 96. The minimum mean square estimator of $X$ in terms of $Y$ when $X$ and $Y$ are jointly Gaussian random variables is given by
$$\hat{X} = E[X \mid Y = y] = E[X] + \rho_{X,Y}\frac{\sigma_X}{\sigma_Y}(Y - E[Y])$$
This is identical to the best linear estimator. Thus, for jointly Gaussian random variables the MMSE estimator is linear.

4.6.2 Estimation Using a Vector of Observations

Here we wish to estimate $X$ by a function $g(\bar{Y})$ of a random vector of $n$ observations so that the mean square error is minimized:
$$\min_{g(\cdot)} E[(X - g(\bar{Y}))^2]$$
To simplify the discussion we will assume that $X$ and the $Y_i$ have zero means. Similarly to the single-observation case, the optimum minimum mean square estimator is
$$g^*(\bar{y}) = E[X \mid \bar{Y} = \bar{y}] \tag{4.34}$$
The minimum mean square error is then
$$E[(X - g^*(\bar{Y}))^2] = \int_{\mathbb{R}^n} E[(X - E[X \mid \bar{Y}])^2 \mid \bar{Y} = \bar{y}]\, f_{\bar{Y}}(\bar{y})\, d\bar{y} = \int_{\mathbb{R}^n} \text{Var}[X \mid \bar{Y} = \bar{y}]\, f_{\bar{Y}}(\bar{y})\, d\bar{y} \tag{4.35}$$


Now suppose the estimate is a linear function of the observations:
$$g(\bar{Y}) = \sum_{k=1}^{n} a_k Y_k = \bar{a}^T \bar{Y}$$
The mean square error is now:
$$E[(X - g(\bar{Y}))^2] = E\left[\left(X - \sum_{k=1}^{n} a_k Y_k\right)^2\right]$$
We take the derivatives with respect to each $a_j$, set them to zero, and obtain the orthogonality conditions:
$$E\left[\left(X - \sum_{k=1}^{n} a_k Y_k\right) Y_j\right] = 0, \qquad j = 1, \ldots, n$$
The orthogonality conditions become:
$$E[XY_j] = E\left[\left(\sum_{k=1}^{n} a_k Y_k\right) Y_j\right] = \sum_{k=1}^{n} a_k E[Y_k Y_j], \qquad j = 1, \ldots, n$$
We obtain a compact expression by introducing matrix notation:
$$E[X\bar{Y}] = R_{\bar{Y}}\,\bar{a}, \qquad \text{where } \bar{a} = (a_1, a_2, \ldots, a_n)^T$$
where $E[X\bar{Y}] = [E[XY_1], E[XY_2], \ldots, E[XY_n]]^T$ and $R_{\bar{Y}}$ is the correlation matrix. Assuming $R_{\bar{Y}}$ is invertible, the optimum coefficients are
$$\bar{a} = R_{\bar{Y}}^{-1}\, E[X\bar{Y}] \tag{4.36}$$
The mean square error of the optimum linear estimator is
$$\text{LMMSE} = E[(X - \bar{a}^T\bar{Y})^2] = E[(X - \bar{a}^T\bar{Y})X] - E[(X - \bar{a}^T\bar{Y})\,\bar{a}^T\bar{Y}] = E[(X - \bar{a}^T\bar{Y})X] = \text{Var}(X) - \bar{a}^T E[\bar{Y}X] \tag{4.37}$$

Now suppose that $X$ has mean $\mu_X$ and $\bar{Y}$ has mean vector $\mu_{\bar{Y}}$, so our estimator now has the form
$$\hat{X} = g(\bar{Y}) = \sum_{k=1}^{n} a_k Y_k + b = \bar{a}^T \bar{Y} + b \tag{4.38}$$
Following arguments similar to the earlier ones, the optimum choice for $b$ is
$$b = E[X] - \bar{a}^T \mu_{\bar{Y}} \tag{4.39}$$
Therefore, the optimum linear estimator has the form
$$\hat{X} = g(\bar{Y}) = \bar{a}^T(\bar{Y} - \mu_{\bar{Y}}) + \mu_X = \bar{a}^T \bar{Z} + \mu_X \tag{4.40}$$
where $\bar{Z} = \bar{Y} - \mu_{\bar{Y}}$ is a random vector with zero mean vector. The mean square error for this estimator is
$$E[(X - g(\bar{Y}))^2] = E[(X - \bar{a}^T\bar{Z} - \mu_X)^2] = E[(W - \bar{a}^T\bar{Z})^2]$$
where $W = X - \mu_X$ has zero mean. We have reduced the general estimation problem to one with zero-mean random variables, i.e., $W$ and $\bar{Z}$, whose solution is given by Equation (4.36). Therefore, the optimum set of linear predictors is given by
$$\bar{a} = R_{\bar{Z}}^{-1}\, E[W\bar{Z}] = K_{\bar{Y}\bar{Y}}^{-1}\, E[(X - \mu_X)(\bar{Y} - \mu_{\bar{Y}})] \tag{4.41}$$
The mean square error is:
$$\text{LMMSE} = E[(X - \bar{a}^T\bar{Y} - b)^2] = E[(W - \bar{a}^T\bar{Z})W] = \text{Var}(W) - \bar{a}^T E[W\bar{Z}] = \text{Var}(X) - \bar{a}^T E[(X - \mu_X)(\bar{Y} - \mu_{\bar{Y}})]$$

This result is of particular importance in the case where $X$ and $\bar{Y}$ are jointly Gaussian random variables. Since the conditional expected value of $X$ given $\bar{Y}$ is then a linear function of $\bar{Y}$, of the form $\hat{X} = \bar{a}^T\bar{Y} + b$, the optimum minimum mean square estimator coincides with the optimum linear estimator.


Example 97. (Multiple Antenna Receiver) A radio receiver has two antennas to receive noisy versions of a signal $X$. The desired signal $X$ is a Gaussian random variable with zero mean and variance 2. The signals received at the first and second antennas are $Y_1 = X + N_1$ and $Y_2 = X + N_2$, where $N_1$ and $N_2$ are zero-mean unit-variance Gaussian random variables. In addition, $X$, $N_1$, and $N_2$ are independent random variables. Find the optimum mean square error linear estimator for $X$ based on a single antenna signal and the corresponding mean square error. Compare the results to the optimum mean square estimator for $X$ based on both antenna signals $\bar{Y} = (Y_1, Y_2)$.

Solution. Since all random variables have zero mean, we only need the correlation matrix and the cross-correlation vector:
$$R_{\bar{Y}} = \begin{bmatrix} E[Y_1^2] & E[Y_1 Y_2] \\ E[Y_1 Y_2] & E[Y_2^2] \end{bmatrix} = \begin{bmatrix} E[(X + N_1)^2] & E[(X + N_1)(X + N_2)] \\ E[(X + N_1)(X + N_2)] & E[(X + N_2)^2] \end{bmatrix} = \begin{bmatrix} E[X^2] + E[N_1^2] & E[X^2] \\ E[X^2] & E[X^2] + E[N_2^2] \end{bmatrix} = \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix}$$
and
$$E[X\bar{Y}] = \begin{bmatrix} E[XY_1] \\ E[XY_2] \end{bmatrix} = \begin{bmatrix} E[X^2] \\ E[X^2] \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$$
The optimum estimator using a single received signal involves solving the $1 \times 1$ version of the above system:
$$\hat{X} = \frac{E[X^2]}{E[X^2] + E[N_1^2]}\, Y_1 = \frac{2}{3}\, Y_1$$
and the associated mean square error is
$$\text{Var}(X) - a^*\,\text{Cov}(X, Y_1) = 2 - \frac{2}{3}\cdot 2 = \frac{2}{3}$$
The coefficients of the optimum estimator using the two antenna signals are
$$\bar{a} = R_{\bar{Y}}^{-1}\, E[X\bar{Y}] = \begin{bmatrix} 3 & 2 \\ 2 & 3 \end{bmatrix}^{-1} \begin{bmatrix} 2 \\ 2 \end{bmatrix} = \frac{1}{5}\begin{bmatrix} 3 & -2 \\ -2 & 3 \end{bmatrix} \begin{bmatrix} 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.4 \end{bmatrix}$$
and the optimum estimator is
$$\hat{X} = 0.4\, Y_1 + 0.4\, Y_2$$
The mean square error for the two-antenna estimator is
$$E[(X - \bar{a}^T\bar{Y})^2] = \text{Var}(X) - \bar{a}^T E[\bar{Y}X] = 2 - \begin{bmatrix} 0.4 & 0.4 \end{bmatrix} \begin{bmatrix} 2 \\ 2 \end{bmatrix} = 0.4$$
As expected, the two-antenna system has a smaller mean square error. Note that the receiver adds the two received signals and scales the result by 0.4:
$$\hat{X} = 0.4\, Y_1 + 0.4\, Y_2 = 0.4(2X + N_1 + N_2) = 0.8\left(X + \frac{N_1 + N_2}{2}\right)$$
so combining the signals keeps the desired signal portion, $X$, constant while averaging the two noise signals $N_1$ and $N_2$.
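The linear system of Example 97 can be solved numerically in a couple of lines:

```python
import numpy as np

# Example 97 numerically: a = R_Y^{-1} E[X Ybar] and the resulting MSE.
R_Y = np.array([[3.0, 2.0],
                [2.0, 3.0]])
r_XY = np.array([2.0, 2.0])   # cross-correlation vector E[X Ybar]
var_X = 2.0

a = np.linalg.solve(R_Y, r_XY)   # optimum coefficients, Eq. (4.36)
mse = var_X - a @ r_XY           # LMMSE, Eq. (4.37)

print(a, mse)                    # [0.4 0.4] and 0.4
```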


CHAPTER 5

RANDOM SEQUENCES AND SERIES

In this chapter, we discuss some fundamental issues related to the asymptotic behavior of collections of infinitely many random variables, called random sequences. Many problems involve counting the number of occurrences of events, measuring cumulative effects, or computing arithmetic averages over a series of measurements. Usually these problems can be reduced to the problem of finding, exactly or approximately, the distribution of a random variable that consists of the sum of $n$ independent, identically distributed random variables.

We will introduce laws (WLLN, SLLN, CLT) that demonstrate the remarkable consistency between probability theory and observed behavior, and that reinforce the relative frequency interpretation of probability.

5.1 Random Sequences

Definition 59. (Random Sequence) A sequence $X_n$, $n = 1, 2, \ldots$, is called a random sequence if its elements are random variables.

Definition 60. A random sequence $\{X_i\}_{i=1}^{n}$ is independent and identically distributed (i.i.d.) if

1. $F_{X_i}(x) = F_X(x)$ for all $i = 1, 2, \ldots, n$ (identically distributed)

2. $F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i)$ (independent)

The principal context in this chapter involves a sequence of i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Let
$$S_n = X_1 + X_2 + \cdots + X_n$$
be the sum of the first $n$ of them. Limit theorems are mostly concerned with the properties of $S_n$ and related variables as $n$ becomes very large. Due to independence, we have
$$\text{Var}(S_n) = \text{Var}(X_1) + \cdots + \text{Var}(X_n) = n\sigma^2$$
Thus, the distribution of $S_n$ spreads out as $n$ increases, and cannot have a meaningful limit. The situation is different if we consider the sample mean, defined as
$$M_n = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{S_n}{n} \tag{5.1}$$
A quick calculation yields
$$E[M_n] = \mu, \qquad \text{Var}(M_n) = \frac{\sigma^2}{n}$$
In particular, the variance of $M_n$ decreases to zero as $n$ increases, and the bulk of the distribution of $M_n$ must be very close to the mean $\mu$. This phenomenon is the subject of certain laws of large numbers, which generally assert that the sample mean


Mn (a random variable) converges to the true mean µ (a number), in a precise sense. These laws provide a mathematical

basis for the loose interpretation of expectation E [X ] = µ as the average of a large number of independent samples drawn

from the distribution of X .

We will also consider a quantity which is intermediate between $S_n$ and $M_n$. We first subtract $n\mu$ from $S_n$ to obtain the zero-mean random variable $S_n - n\mu$, and then divide by $\sigma\sqrt{n}$ to form the unit-variance random variable
$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$

Since the mean and the variance of Zn remain unchanged as n increases, its distribution neither spreads, nor shrinks to a

point. The central limit theorem is concerned with the asymptotic shape of the distribution of Zn and asserts that it becomes

the standard Gaussian distribution.

5.2 Convergence of a Random Sequence

Convergence for a sequence of random variables is not straightforward to define and can occur in a variety of manners.

Example 98. Consider the sequence of real numbers
$$X_n = \frac{n}{n+1}, \qquad n = 0, 1, 2, \ldots$$
This sequence converges to the limit $\ell = 1$; we write $\lim_{n\to\infty} X_n = \ell = 1$. This means that in any neighborhood around 1 we can trap the sequence, i.e.,
$$\forall \varepsilon > 0,\ \exists\, n_0(\varepsilon)\ \text{s.t.}\ \forall n \ge n_0(\varepsilon),\ |X_n - \ell| \le \varepsilon.$$
We can pick $\varepsilon$ very small and be sure that the sequence is trapped after reaching $n_0(\varepsilon)$; as $\varepsilon$ decreases, $n_0(\varepsilon)$ increases. For the considered sequence, $|X_n - 1| = \frac{1}{n+1}$, so $n_0(\varepsilon) = \lceil 1/\varepsilon \rceil$ suffices:
$$\varepsilon = \frac{1}{2} \Rightarrow n_0(\varepsilon) = 2, \qquad \varepsilon = \frac{1}{1000} \Rightarrow n_0(\varepsilon) = 1000.$$

5.2.1 Almost sure convergence

Definition 61. A random sequence $X_n$, $n = 0, 1, 2, \ldots$, converges almost surely to the random variable $X$ iff
$$P\left(\lim_{n\to\infty} X_n = X\right) = 1 \tag{5.2}$$
We write
$$X_n \xrightarrow{\text{a.s.}} X. \tag{5.3}$$

Example 99. Let $W \sim U[0, 1]$. Define the random sequence $X_n = W^n$ for $n = 0, 1, 2, \ldots$. Show that $X_n$ converges almost surely.

Solution. If $W = 1$, then $X_0 = X_1 = \cdots = 1 \Rightarrow \lim_{n\to\infty} X_n = 1$. If $W \neq 1$, then $\lim_{n\to\infty} X_n = 0$. Hence, $X = 1$ if $W = 1$, where $P(W = 1) = 0$, and $X = 0$ if $W \neq 1$, where $P(W \neq 1) = 1$.
$$\therefore\ X_n \xrightarrow{\text{a.s.}} 0$$


Example 100. Let $\omega$ be a random variable that is uniformly distributed on $[0, 1]$. Define the random sequence $X_n$ as $X_n = \omega^n$: $X_0 = 1$, $X_1 = \omega$, $X_2 = \omega^2$, $X_3 = \omega^3, \ldots$ Does this sequence of random variables converge?

Solution. Let us take specific values of $\omega$. For instance, if $\omega = \frac{1}{2}$:
$$X_0 = 1,\ X_1 = \frac{1}{2},\ X_2 = \frac{1}{4},\ X_3 = \frac{1}{8},\ \ldots$$
We can think of the experiment as an urn containing sequences: each time we draw a value of $\omega$, we get a sequence of fixed numbers. In the example of tossing a coin, the output is either heads or tails; here, the output of the experiment is a random sequence, i.e., each outcome is an infinite sequence of numbers. This sequence converges to
$$X = \begin{cases} 0 & \text{if } \omega \neq 1, \text{ with probability } P(\omega \neq 1) = 1 \\ 1 & \text{if } \omega = 1, \text{ with probability } P(\omega = 1) = 0 \end{cases}$$
Since the PDF is continuous, $P(\omega = a) = 0$ for any constant $a$. Notice that convergence of the sequence to 1 is possible but happens with probability 0. Therefore, we say that $X_n$ converges almost surely to 0, i.e., $X_n \xrightarrow{\text{a.s.}} 0$.
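A small simulation (not part of the original notes; seed and sample size are arbitrary) illustrates the pointwise decay of the paths $\omega^n$:

```python
import numpy as np

# Sketch of Examples 99-100: each sampled W in [0, 1) gives a deterministic
# path W^n that decays to 0; the exceptional outcome W = 1 has probability 0.
# Seed and sample size are arbitrary choices.
rng = np.random.default_rng(6)
W = rng.uniform(0.0, 1.0, size=1000)

tail_100 = W ** 100     # value of each sampled path at n = 100
tail_2000 = W ** 2000   # value of each sampled path at n = 2000

# Every path keeps shrinking, and a typical path is already negligible.
print(tail_100.max(), tail_2000.max(), np.median(tail_100))
```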

5.2.2 Convergence in probability

Definition 62. A random sequence $X_n$ converges to the random variable $X$ in probability if
$$\forall \varepsilon > 0, \quad \lim_{n\to\infty} P\{|X_n - X| \ge \varepsilon\} = 0 \tag{5.4}$$
We write:
$$X_n \xrightarrow{p} X.$$

Example 101. Let $X$ be a discrete random variable with support $\{0, 1\}$ and PMF
$$P_X(x) = \begin{cases} 1/3, & x = 1 \\ 2/3, & x = 0 \end{cases}$$
Define $X_n = \left(1 + \frac{1}{n}\right) X$. Show that $X_n \xrightarrow{p} X$.

Solution. Take any $\varepsilon > 0$; note that $|X_n - X| = \frac{1}{n}X$.
If $X = 0$ (with probability 2/3), then $|X_n - X| = 0 \le \varepsilon$.
If $X = 1$ (with probability 1/3), then $|X_n - X| = \frac{1}{n}$; hence $|X_n - X| \le \varepsilon$ only if $\frac{1}{n} \le \varepsilon$.
$$\therefore\ P(|X_n - X| > \varepsilon) = \begin{cases} 1/3, & n < \frac{1}{\varepsilon} \\ 0, & n \ge \frac{1}{\varepsilon} \end{cases}$$
As $n \to \infty$, $P(|X_n - X| > \varepsilon) \to 0$, so $X_n \xrightarrow{p} X$.

Example 102. Consider a random variable $\omega$ uniformly distributed on $[0, 1]$ and the sequence $X_n$ defined on dyadic subintervals: writing $n = 2^k + j$ with $0 \le j < 2^k$,
$$X_n(\omega) = \begin{cases} 1 & \text{if } \omega \in \left[\frac{j}{2^k}, \frac{j+1}{2^k}\right) \\ 0 & \text{otherwise} \end{cases}$$
so that $P(X_n = 1) = \frac{1}{2^k} \to 0$ as $n \to \infty$.


This is the distribution shown in Figure 5.1. Does this sequence converge? Notice that only one of $X_2$ or $X_3$ can be equal to 1 for the same value of $\omega$. Similarly, only one of $X_4, X_5, X_6, X_7$ can be equal to 1 for the same value of $\omega$, and so on. Intuitively, the sequence will converge to 0. Let us take some examples to see how the sequence behaves:
$$\text{for } \omega = 0:\ \underbrace{1}_{n=1}\ \ \underbrace{10}_{n=2,3}\ \ \underbrace{1000}_{n=4,\ldots,7}\ \ \underbrace{10000000}_{n=8,\ldots,15}\ \ldots$$
$$\text{for } \omega = \tfrac{1}{3}:\ \underbrace{1}_{n=1}\ \ \underbrace{10}_{n=2,3}\ \ \underbrace{0100}_{n=4,\ldots,7}\ \ \underbrace{00100000}_{n=8,\ldots,15}\ \ldots$$
From a calculus point of view, these sequences never converge to zero because a "jump" always shows up, no matter how many zeros precede it (see Figure 5.2); for every $\omega$, $X_n(\omega)$ does not converge in the "calculus" sense. This also means that $X_n$ does not converge to zero almost surely (a.s.). The sequence does, however, converge in probability, since
$$\lim_{n\to\infty} P(|X_n - 0| \ge \varepsilon) = 0 \quad \forall \varepsilon > 0.$$

Remark 7. The observed sequence may not converge in the "calculus" sense because of the intermittent "jumps"; however, the frequency of those "jumps" goes to zero as $n$ goes to infinity.

5.2.3 Convergence in mean square

Definition 63. A random sequence $X_n$ converges to a random variable $X$ in the mean square sense if
$$\lim_{n\to\infty} E\left[|X - X_n|^2\right] = 0. \tag{5.5}$$
We write:
$$X_n \xrightarrow{\text{m.s.}} X. \tag{5.6}$$

Remark 8. In mean square convergence, not only must the frequency of the "jumps" go to zero as $n$ goes to infinity, but the "energy" in the jumps must go to zero as well.

Example 103. Let $\{X_i\}_{i=1}^{n}$ be a random process such that $E[X_i] = \mu$, $\text{Var}(X_i) = \sigma^2$, and $\text{Cov}(X_i, X_j) = 0$ for $i \neq j$. Define the sample mean
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Show that $\bar{X}_n \xrightarrow{\text{m.s.}} \mu$.

Solution.
$$E[|\bar{X}_n - \mu|^2] = \text{Var}(\bar{X}_n) = \frac{1}{n^2}\,\text{Var}(X_1 + X_2 + \cdots + X_n) = \frac{\sigma^2}{n}$$
As $n \to \infty$, $E[|\bar{X}_n - \mu|^2] \to 0 \Rightarrow \bar{X}_n \xrightarrow{\text{m.s.}} \mu$.
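The $\sigma^2/n$ decay can be observed empirically; this sketch (with Uniform(0,1) samples, so $\mu = 1/2$ and $\sigma^2 = 1/12$, and an arbitrary seed and repetition count) is not part of the original notes:

```python
import numpy as np

# Sketch of Example 103: E[|Xbar_n - mu|^2] = sigma^2 / n. Uniform(0, 1)
# samples (mu = 1/2, sigma^2 = 1/12) and the repetition count are
# arbitrary illustrative choices.
rng = np.random.default_rng(4)
mu, var = 0.5, 1.0 / 12.0

results = {}
for n in (10, 100, 1000):
    Xbar = rng.uniform(0.0, 1.0, size=(5_000, n)).mean(axis=1)
    results[n] = np.mean((Xbar - mu) ** 2)   # empirical mean square error
    print(n, results[n], var / n)            # empirical vs. theoretical
```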

Example 104. Consider a random variable $\omega$ uniformly distributed over $[0, 1]$, and the sequence $X_n(\omega)$ defined as:
$$X_n(\omega) = \begin{cases} a_n & \text{for } \omega \le \frac{1}{n} \\ 0 & \text{otherwise} \end{cases}$$


Does this sequence converge? Note that $P(X_n = a_n) = \frac{1}{n}$ and $P(X_n = 0) = 1 - \frac{1}{n}$. Let us check the different convergence criteria we have seen so far.

1. Almost sure convergence: $X_n \xrightarrow{\text{a.s.}} 0$ because for every fixed $\omega > 0$ we have $X_n(\omega) = 0$ for all $n > \frac{1}{\omega}$, and $P(\omega > 0) = 1$.

2. Convergence in probability: $X_n \xrightarrow{p} 0$ because
$$\lim_{n\to\infty} P\{|X_n - 0| \ge \varepsilon\} = \lim_{n\to\infty} \frac{1}{n} = 0.$$
(Flash forward: almost sure convergence $\Rightarrow$ convergence in probability.)
$$X_n \xrightarrow{\text{a.s.}} X \Rightarrow X_n \xrightarrow{p} X.$$

3. Mean square convergence:
$$E\left[|X_n - 0|^2\right] = a_n^2\, P(X_n = a_n) + 0 \cdot P(X_n = 0) = \frac{a_n^2}{n}.$$
If $a_n = 10$, then $\lim_{n\to\infty} E\left[|X_n - 0|^2\right] = 0 \Rightarrow X_n \xrightarrow{\text{m.s.}} 0$.
If $a_n = \sqrt{n}$, then $\lim_{n\to\infty} E\left[|X_n - 0|^2\right] = 1 \Rightarrow X_n$ does not converge in m.s. to 0.

In this example, the convergence of $X_n$ in the mean square sense depends on the value of $a_n$.

5.2.4 Convergence in distribution

Definition 64. A random sequence $X_n$ converges to $X$ in distribution if, as $n$ goes to infinity, the values of the sequence become distributed according to a known distribution. We write
$$X_n \xrightarrow{d} X. \tag{5.7}$$
Equivalently, $X_n$ converges to $X$ in distribution iff
$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x) \quad \text{except at points where } F_X(x) \text{ is not continuous.} \tag{5.8}$$

Example 105. Consider the sequence $X_n$ defined as:
$$X_n = \begin{cases} X_1 \sim \text{Bernoulli}\left(\frac{1}{2}\right) & \text{for } n = 1 \\ (X_{n-1} + 1) \bmod 2 = X_{n-1} \oplus 1 & \text{for } n > 1 \end{cases}$$
In which sense, if any, does this sequence converge? This sequence has two possible outcomes depending on the value of $X_1$:
$$X_1 = 1:\quad X_n = 1\,0\,1\,0\,1\,0\,1\,0\,\ldots$$
$$X_1 = 0:\quad X_n = 0\,1\,0\,1\,0\,1\,0\,1\,\ldots$$

1. Almost sure convergence: $X_n$ does not converge almost surely because no realization of the sequence converges: with probability 1, the sequence keeps jumping between 0 and 1.


2. Convergence in probability: $X_n$ does not converge in probability because the frequency of the jumps is constant, equal to $\frac{1}{2}$.

3. Convergence in mean square: $X_n$ does not converge to $\frac{1}{2}$ in the mean square sense because
$$\lim_{n\to\infty} E\left[\left|X_n - \frac{1}{2}\right|^2\right] = E\left[X_n^2 - X_n + \frac{1}{4}\right] = E[X_n^2] - E[X_n] + \frac{1}{4} = 1^2\cdot\frac{1}{2} + 0^2\cdot\frac{1}{2} - \frac{1}{2} + \frac{1}{4} = \frac{1}{4} \neq 0.$$

4. Convergence in distribution: In the limit, since we do not know the value of $X_1$, each value of $X_n$ can be either 0 or 1 with probability $\frac{1}{2}$. Hence, every $X_n$ is a random variable $\sim \text{Bernoulli}(\frac{1}{2})$. We say that $X_n$ converges in distribution to Bernoulli$(\frac{1}{2})$ and denote it by:
$$X_n \xrightarrow{d} \text{Ber}\left(\frac{1}{2}\right).$$

Theorem 21.
$$\left.\begin{array}{l}\text{Almost sure convergence}\\ \text{Convergence in mean square}\end{array}\right\} \Rightarrow \text{convergence in probability} \Rightarrow \text{convergence in distribution.}$$

Note:

• There is no implication in either direction between almost sure and mean square convergence.

• The implications are unidirectional; e.g., convergence in distribution implies neither convergence in probability, nor almost sure convergence, nor mean square convergence.

Example 106. Let the random variable $U$ be uniformly distributed on $[0, 1]$. Consider the sequence defined as:
$$X_n = \frac{(-1)^n U}{n}$$

1. Does this sequence converge almost surely?

2. Does this sequence converge in probability?

3. Does this sequence converge in the mean square sense?

4. Does this sequence converge in distribution?

Solution.

1. Suppose $U = a$. The sequence becomes
$$X_1 = -a,\ X_2 = \frac{a}{2},\ X_3 = -\frac{a}{3},\ X_4 = \frac{a}{4},\ \ldots$$
In fact, for any $a \in [0, 1]$, $\lim_{n\to\infty} X_n = 0$; therefore, $X_n \xrightarrow{\text{a.s.}} 0$.

Remark 9. $X_n \xrightarrow{\text{a.s.}} 0$ because, by definition, a random sequence converges almost surely to the random variable $X$ if the sequence of functions $X_n$ converges for all values of $U$ except possibly on a set of values that has probability zero.


2. By proving almost-sure convergence, we directly get convergence in probability and in distribution. However, for completeness we formally prove that $X_n$ converges to 0 in probability. To do so, we must show that
$$\lim_{n\to\infty} P(|X_n - 0| \ge \varepsilon) = 0 \quad \forall \varepsilon > 0.$$
By definition,
$$|X_n| = \frac{U}{n} \le \frac{1}{n}.$$
Thus, for any fixed $\varepsilon > 0$,
$$P(|X_n| \ge \varepsilon) = P\left(\frac{U}{n} \ge \varepsilon\right) = P(U \ge n\varepsilon) = 0 \quad \text{for all } n > \frac{1}{\varepsilon}, \tag{5.9}$$
where the last step follows from the fact that $U \le 1 < n\varepsilon$ once $n > 1/\varepsilon$. Hence $\lim_{n\to\infty} P(|X_n| \ge \varepsilon) = 0$.

3. In order to answer this question, we need to prove that $\lim_{n\to\infty} E\left[|X_n - 0|^2\right] = 0$. We know that
$$\lim_{n\to\infty} E\left[|X_n - 0|^2\right] = \lim_{n\to\infty} E[X_n^2] = \lim_{n\to\infty} E\left[\frac{U^2}{n^2}\right] = \lim_{n\to\infty} \frac{1}{n^2}\, E[U^2] = \lim_{n\to\infty} \frac{1}{n^2} \int_0^1 u^2\, du = \lim_{n\to\infty} \frac{1}{n^2}\cdot\frac{u^3}{3}\bigg]_0^1 = \lim_{n\to\infty} \frac{1}{3n^2} = 0.$$
Hence, $X_n \xrightarrow{\text{m.s.}} 0$.

4. Recall that the limit random variable $X$ is the constant 0 and therefore has the CDF shown in Figure 5.4. Since $X_n = \frac{(-1)^n U}{n}$, the distribution of each $X_n$ can be derived accordingly.

Remark 10. At 0, the CDF of $X_n$ flip-flops between 0 (if $n$ is even) and 1 (if $n$ is odd) (see Figure 5.5), which implies a discontinuity at that point. Therefore, we say that $X_n$ converges in distribution to the CDF $F_X(x)$ except at points where $F_X(x)$ is not continuous.

5.2.5 The Weak Law of Large Numbers

The weak law of large numbers asserts that the sample mean of a large number of i.i.d. random variables is very close to the true mean, with high probability. As defined earlier, the sample mean is the random variable
$$M_n = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
with mean $\mu$ and variance $\frac{\sigma^2}{n}$. We apply Chebyshev's inequality and obtain
$$P(|M_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}, \quad \text{for any } \varepsilon > 0$$


We observe that for any fixed ε > 0, the right-hand side of this inequality goes to zero as n increases. As a consequence,

we obtain the weak law of large numbers, which is stated below. It turns out that this law remains true even if the Xi have

infinite variance, but a much more elaborate argument is needed, which we omit. The only assumption needed is that E [Xi ]

is well-defined.

Theorem 22 (The Weak Law of Large Numbers (WLLN)). Let X1, X2, ... be independent identically distributed random variables with mean µ. For every ε > 0, we have

P(|Mn − µ| ≥ ε) = P(|(X1 + ··· + Xn)/n − µ| ≥ ε) → 0, as n → ∞ (5.13)

In the language of convergence in probability, we write

Mn →p µ (5.14)

The weak law of large numbers states that for large n, the bulk of the distribution of Mn is concentrated near µ. That is,

if we consider a positive length interval [µ−ε, µ +ε] around µ, then there is high probability that Mn will fall in that interval;

as n→ ∞, this probability converges to 1. Of course, if ε is very small, we may have to wait longer (i.e., need a larger value

of n) before we can assert that Mn is highly likely to fall in that interval.
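This concentration is easy to observe numerically. The following sketch (an illustration, not part of the original notes) estimates P(|Mn − µ| ≥ ε) for i.i.d. Uniform(0,1) samples, where µ = 0.5 and σ² = 1/12, and compares it with the Chebyshev bound σ²/(nε²):

```python
import random

random.seed(1)

def deviation_freq(n, eps=0.05, runs=2000):
    """Empirical estimate of P(|M_n - mu| >= eps) for U(0,1) samples (mu = 0.5)."""
    count = 0
    for _ in range(runs):
        m = sum(random.random() for _ in range(n)) / n
        if abs(m - 0.5) >= eps:
            count += 1
    return count / runs

def chebyshev_bound(n, eps=0.05):
    """The bound sigma^2/(n*eps^2), with sigma^2 = 1/12 for U(0,1)."""
    return (1 / 12) / (n * eps ** 2)

for n in (10, 100, 1000):
    print(n, deviation_freq(n), chebyshev_bound(n))
```

Both the empirical frequency and the bound shrink as n grows; the bound is loose (for n = 10 it exceeds 1), which foreshadows the refinement via the CLT later in this section.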

A stronger version of the WLLN is the Strong Law of Large Numbers (SLLN).

Theorem 23 (The Strong Law of Large Numbers (SLLN)). Let X1, X2, ... be a sequence of i.i.d. random variables with finite mean E[X] = µ and finite variance. Then

P[ lim_{n→∞} Mn = µ ] = 1 (5.15)

In the language of this chapter, we write

Mn →a.s. µ (5.16)

This theorem states that with probability 1, every sequence of sample mean calculation will eventually approach and stay

close to E [X ] = µ. This is the type of convergence we expect in physical situations where statistical regularity holds.

Since almost sure convergence implies convergence in probability, the SLLN implies the WLLN. The difference between the WLLN and the SLLN is subtle and deserves close scrutiny. The weak law states that the probability P(|Mn − µ| ≥ ε) of a significant deviation of Mn from µ goes to zero as n → ∞. Still, for any finite n, this probability can be positive and it is conceivable that once in a while, even if infrequently, Mn deviates significantly from µ. The weak law provides no conclusive information on the number of such deviations, but the strong law does. According to the strong law, and with probability 1, Mn converges to µ. This implies that for any ε > 0, the probability that the difference |Mn − µ| will exceed ε an infinite number of times is equal to zero.

Example 107 (Polling I). Let p be the fraction of voters who support a particular candidate for office. We interview n randomly selected voters and record Mn, the fraction of them that support the candidate. We view Mn as our estimate of p and would like to investigate its properties.

We interpret "randomly selected" to mean that the n voters are chosen independently and uniformly from the given population. Thus, the reply of each person interviewed can be viewed as an independent Bernoulli random variable Xi with success probability p and variance σ² = p(1 − p). Chebyshev's inequality yields

P(|Mn − p| ≥ ε) ≤ p(1 − p)/(nε²)

The true value of the parameter p is assumed to be unknown. On the other hand, it may be verified that p(1 − p) ≤ 1/4, which yields

P(|Mn − p| ≥ ε) ≤ 1/(4nε²)

For example, if ε = 0.1 and n = 100, we obtain

P(|Mn − p| ≥ ε) ≤ 1/(4 × 100 × (0.1)²) = 0.25


In words, with a sample size of n = 100, the probability that our estimate is incorrect by more than 0.1 is no larger than 0.25.

Suppose now that we impose some tight specifications on our poll. We would like to have high confidence (probability at

least 95%) that our estimate will be very accurate (within 0.01 of p). How many voters should be sampled?

The only guarantee that we have at this point is the inequality

P(|Mn − p| ≥ ε) ≤ 1/(4n(0.01)²)

We will be sure to satisfy the above specifications if we choose n large enough so that

1/(4n(0.01)²) ≤ 1 − 0.95 = 0.05

which yields n ≥ 50,000. This choice of n satisfies our specifications, but turns out to be fairly conservative, because it is based on the rather loose Chebyshev inequality. A refinement will be considered in a subsequent example.

Now we introduce one of the most fundamental results in probability theory: the Central Limit Theorem (CLT).

Theorem 24 (The Central Limit Theorem). Let Y1, Y2, ... be a sequence of independent identically distributed random variables with common mean µ and variance σ², and define

Zn = (Y1 + Y2 + ··· + Yn − nµ)/(σ√n)

Then, the CDF of Zn converges to the standard normal CDF

Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−x²/2} dx

in the sense that

lim_{n→∞} P(Zn ≤ z) = Φ(z), ∀z

Proof. The transform (moment generating function) of

Zn = (X1 + X2 + ··· + Xn)/(σ√n),

where Xi = Yi − µ, is given by

MZn(s) = E[e^{sZn}] = E[exp( (s/(σ√n)) ∑_{i=1}^{n} Xi )] = ∏_{i=1}^{n} E[e^{sXi/(σ√n)}] = (MX(s/(σ√n)))ⁿ

We can write the second-order Taylor series expansion of MX(s) around 0 as MX(s) = a + bs + cs² + o(s²). Using the moment generating properties of the transform, we have

a = MX(0) = 1,  b = M′X(s)|_{s=0} = E[X] = 0,  c = (1/2) M″X(s)|_{s=0} = E[X²]/2 = σ²/2

Hence, we can express MZn(s) as

MZn(s) = (MX(s/(σ√n)))ⁿ = (a + bs/(σ√n) + cs²/(σ²n) + o(s²/(σ²n)))ⁿ

Replacing a, b, and c, it follows that

MZn(s) = (1 + s²/(2n) + o(s²/(σ²n)))ⁿ

We now take the limit as n → ∞, and use the identity

lim_{n→∞} (1 + c/n)ⁿ = e^c

to obtain

lim_{n→∞} MZn(s) = e^{s²/2}

which is the transform of a standard normal random variable, establishing the result.

The central limit theorem is surprisingly general. Besides independence, and the implicit assumption that the mean and

the variance are finite, it places no other requirement on the distribution of the Xi , which could be discrete, continuous, or

mixed. This theorem is of tremendous importance for several reasons, both conceptual and practical. On the conceptual side,

it indicates that the sum of a large number of independent random variables is approximately Gaussian. As such, it applies to

many situations in which a random effect is the sum of a large number of small but independent random factors. Noise in many

natural or engineered systems has this property. In a wide array of contexts, it has been found empirically that the statistics

of noise are well-described by Gaussian distributions, and the CLT provides a convincing explanation for this phenomenon.

On the practical side, the CLT eliminates the need for detailed probabilistic models, and for tedious manipulations of PMFs

and PDFs. Rather, it allows the calculation of certain probabilities by simply referring to the Q function table.
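The convergence asserted by the CLT is easy to check by simulation. The sketch below (illustrative, not from the notes; the Uniform(0,1) summands, n = 30, and sample counts are arbitrary choices) standardizes sums of uniforms and checks the resulting fraction inside ±1.96 against the standard normal value 0.95:

```python
import random
import math

random.seed(2)

def standardized_sum(n):
    """Z_n = (Y_1 + ... + Y_n - n*mu) / (sigma*sqrt(n)) for Y_i ~ U(0,1)."""
    mu, sigma = 0.5, math.sqrt(1 / 12)
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

samples = [standardized_sum(30) for _ in range(20_000)]
inside = sum(1 for z in samples if abs(z) <= 1.96) / len(samples)
print("P(|Z| <= 1.96) ~", inside)   # close to 0.95 for a standard normal
```

Even though each summand is far from Gaussian, n = 30 already gives near-normal behavior, which is the practical point of the theorem.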

Example 108. We load a plane of 100 packages whose weights are independent random variables that are uniformly distributed

between 5 and 50 pounds. What is the probability that the total weight will exceed 3000 pounds?

Solution. It is not easy to calculate the CDF of the total weight and the desired probability, but an approximate answer can

be quickly obtained using the CLT.

We want to calculate P(S100 > 3000), where S100 is the sum of the weights of 100 packages. The mean and the variance of the weight of a single package are

µ = (5 + 50)/2 = 27.5,  σ² = (50 − 5)²/12 = 168.75

based on the formulas for the mean and variance of the uniform PDF. We calculate the normalized value

z = (3000 − 100 × 27.5)/√(168.75 × 100) = 1.92

Hence, P(S100 ≤ 3000) ≈ Φ(1.92) = 0.9726. Thus, the desired probability is P(S100 > 3000) = 1 − P(S100 ≤ 3000) = 0.0274.
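A quick Monte Carlo check (not from the notes; the trial count is an arbitrary choice) confirms the CLT-based answer:

```python
import random

random.seed(3)

TRIALS = 20_000
exceed = sum(
    1 for _ in range(TRIALS)
    if sum(random.uniform(5, 50) for _ in range(100)) > 3000
)
estimate = exceed / TRIALS
print("P(S100 > 3000) ~", estimate)   # the CLT approximation gave 0.0274
```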

Example 109 (Polling II). Let us revisit the polling problem. We are interested in the probability P(|Mn − p| ≥ ε) that the polling error is larger than some desired accuracy ε. Because of the symmetry of the PDF of the Gaussian distribution around the mean, we have

P(|Mn − p| ≥ ε) ≈ 2P(Mn − p ≥ ε)

The variance p(1 − p)/n of Mn − p depends on p and is therefore unknown. We note that the probability of a large deviation from the mean increases with the variance. Thus, we can obtain an upper bound on P(Mn − p ≥ ε) by assuming that Mn − p has the largest possible variance, namely 1/(4n), which corresponds to p = 1/2. To calculate this upper bound, we evaluate the standardized value

z = ε/(1/(2√n)) = 2ε√n

and use the normal approximation

P(Mn − p ≥ ε) ≤ 1 − Φ(z) = 1 − Φ(2ε√n)


For instance, consider the case where n = 100 and ε = 0.1. Assuming the worst-case variance, and treating Mn as if it were normal, we obtain

P(|M100 − p| ≥ 0.1) ≈ 2P(Mn − p ≥ 0.1) ≤ 2 − 2Φ(2 × 0.1 × √100) = 0.046

This is much smaller (and more accurate) than the estimate of 0.25 that was obtained using Chebyshev's inequality.

We now consider a reverse problem. How large a sample size n is needed if we wish our estimate Mn to be within 0.01 of p with probability at least 0.95?

Assuming again the worst possible variance, we are led to the condition

2 − 2Φ(2 × 0.01 × √n) ≤ 0.05 ⇒ 2 × 0.01 × √n ≥ 1.96 ⇒ n ≥ 9604

This is significantly better than the sample size of 50,000 that we found using Chebyshev's inequality.
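The two sample-size calculations above can be reproduced in a few lines (a sketch; the value 1.96 is the familiar normal-table entry used in the notes):

```python
import math

eps, alpha = 0.01, 0.05          # desired accuracy and 1 - confidence

# Chebyshev: 1/(4*n*eps^2) <= alpha
n_chebyshev = math.ceil(1 / (4 * eps ** 2 * alpha))

# CLT with worst-case variance: 2 - 2*Phi(2*eps*sqrt(n)) <= alpha,
# i.e. 2*eps*sqrt(n) >= 1.96
z = 1.96
n_clt = math.ceil((z / (2 * eps)) ** 2)

print(n_chebyshev, n_clt)   # 50000 and 9604
```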

5.3 Asymptotic Equipartition Property

In information theory, the analog of the law of large numbers is the asymptotic equipartition property (AEP). It is a direct consequence of the weak law of large numbers. The law of large numbers states that for i.i.d. random variables, (1/n)∑_{i=1}^{n} Xi is close to its expected value E[X] for large values of n. The AEP states that (1/n) log(1/p(X1, X2, ..., Xn)) is close to the entropy H(X), where X1, ..., Xn are i.i.d. random variables and p(X1, X2, ..., Xn) is the probability of observing the sequence X1, ..., Xn. Thus, the probability p(X1, X2, ..., Xn) assigned to an observed sequence will be close to 2^{−nH}.

This enables us to divide the set of all sequences into two sets, the typical set, where the sample entropy is close to the

true entropy, and the nontypical set, which contains the other sequences. In this section we focus on what we call weakly

typical sequences (as opposed to strongly typical sequences), and prove some nice properties involving them.

Definition 65. (Entropy)

Consider a random variable X that takes values in the alphabet X. The entropy of X is defined as

H(X) = − ∑_{x∈X} p(x) log p(x) (5.17)

Theorem 25. (AEP)

If X1, X2, ... are i.i.d. ∼ p(x), then

−(1/n) log p(X1, X2, ..., Xn) → H(X) in probability (5.18)

Proof. Functions of independent random variables are also independent random variables. Thus, since the Xi are i.i.d., so are the log p(Xi). Hence, by the weak law of large numbers,

−(1/n) log p(X1, X2, ..., Xn) = −(1/n) ∑_i log p(Xi) →p −E[log p(X)] = H(X)
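The convergence in Theorem 25 can be watched numerically. The sketch below (illustrative; the Bernoulli source with p = 0.3 is an arbitrary choice) compares the sample entropy −(1/n) log₂ p(X1, ..., Xn) with H(X):

```python
import random
import math

random.seed(5)

p = 0.3
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # entropy in bits

def sample_entropy(n):
    """-(1/n) log2 p(X1, ..., Xn) for an i.i.d. Bernoulli(p) sequence."""
    total = 0.0
    for _ in range(n):
        x = 1 if random.random() < p else 0
        total += math.log2(p) if x == 1 else math.log2(1 - p)
    return -total / n

print(H, sample_entropy(100), sample_entropy(100_000))
```

The longer the sequence, the closer the sample entropy gets to H(X) ≈ 0.881 bits, exactly as the WLLN argument in the proof predicts.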

Definition 66. (Weakly typical sequences)

The weakly typical set A_ε^(n) with respect to p(x) is the set of sequences (x1, x2, ..., xn) ∈ Xⁿ with the property

2^{−n(H(X)+ε)} ≤ p(x1, x2, ..., xn) ≤ 2^{−n(H(X)−ε)} (5.19)

As a consequence of the AEP, we can show that the set A_ε^(n) has the following properties:

Theorem 26.

1. If (x1, x2, ..., xn) ∈ A_ε^(n), then

H(X) − ε ≤ −(1/n) log p(x1, x2, ..., xn) ≤ H(X) + ε

2. Pr{A_ε^(n)} > 1 − ε for n sufficiently large

3. (1 − ε) 2^{n(H(X)−ε)} ≤ |A_ε^(n)| ≤ 2^{n(H(X)+ε)} for n sufficiently large

Proof.

1. The proof is immediate from the definition.

2. From the AEP theorem, the probability of the event (X1, X2, ..., Xn) ∈ A_ε^(n) tends to 1 as n → ∞. Thus, for any δ > 0, there exists an n0 such that for all n ≥ n0, we have

Pr{ |−(1/n) log p(X1, X2, ..., Xn) − H(X)| < ε } > 1 − δ

Setting δ = ε, we obtain the second part of the theorem.

3. We write the following

1 = ∑_{x̄∈Xⁿ} p(x̄) ≥ ∑_{x̄∈A_ε^(n)} p(x̄) ≥ ∑_{x̄∈A_ε^(n)} 2^{−n(H(X)+ε)} = 2^{−n(H(X)+ε)} |A_ε^(n)|

Hence,

|A_ε^(n)| ≤ 2^{n(H(X)+ε)}

Finally, for sufficiently large n, Pr{A_ε^(n)} > 1 − ε, so that

1 − ε < Pr{A_ε^(n)} ≤ ∑_{x̄∈A_ε^(n)} 2^{−n(H(X)−ε)} = 2^{−n(H(X)−ε)} |A_ε^(n)|

Hence,

|A_ε^(n)| ≥ (1 − ε) 2^{n(H(X)−ε)}
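For a small alphabet, the typical set can be enumerated by brute force. The sketch below (illustrative; the Bernoulli(0.3) source, n = 12, and ε = 0.1 are arbitrary choices) checks the cardinality bound of part 3. Note that for such a small n the probability bound of part 2 need not hold yet, which is why the theorem says "for n sufficiently large":

```python
import math
from itertools import product

p, n, eps = 0.3, 12, 0.1
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

typical, prob_typical = 0, 0.0
for seq in product((0, 1), repeat=n):
    k = sum(seq)                          # number of ones in the sequence
    pr = p ** k * (1 - p) ** (n - k)      # p(x1, ..., xn)
    if 2 ** (-n * (H + eps)) <= pr <= 2 ** (-n * (H - eps)):
        typical += 1
        prob_typical += pr

print(typical, 2 ** (n * (H + eps)), prob_typical)
```

For this source only sequences with 3 or 4 ones are weakly typical, so the typical set holds far fewer than the 2¹² = 4096 possible sequences while its size stays below the bound 2^{n(H+ε)}.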


Figure 5.1: Plot of the distribution of Xn(ω) (seven panels showing Xn(ω) versus ω ∈ [0,1] for n = 1, ..., 7)


Figure 5.2: Plot of the sequence for ω = 0 (Xn versus n)

Figure 5.3: Plot of the sequence Xn(ω)

Figure 5.4: Plot of the CDF of the constant 0 (a unit step at x = 0)


Figure 5.5: Plot of the CDFs of U, X1, X2 and X3


CHAPTER 6

STOCHASTIC PROCESSES

In certain random experiments, the outcome is a function of time or space. For example, in speech recognition systems,

decisions are made on the basis of a voltage waveform corresponding to a speech utterance. In an image processing system,

the intensity and color of the image varies over a rectangular region. In a peer-to-peer network, the number of peers in the

system varies with time. In some situations, two or more functions of time may be of interest. For example, the temperature

in a certain city and the demand placed on the local electric power utility vary together in time. The random time functions

in the above examples can be viewed as numerical quantities that evolve randomly in time or space. Thus, what we really have

is a family of random variables indexed by the time or space variable. In this chapter we introduce the notion of a random

process (or stochastic process), which is defined as an indexed family of random variables.

6.1 Definition of a Random Process

Consider a random experiment specified by the outcomes ξ from some sample space S , by the events defined on S , and

by the probabilities on these events. Suppose that to every outcome ξ ∈ S , we assign a function of time according to some

rule:

X (t;ξ )

The graph of the function X (t;ξ ) versus t, for ξ fixed, is called a realization, sample path, or sample function of the random

process. Thus, we can view the outcome of the random experiment as producing an entire function of time as shown in

Figure. 6.1. On the other hand, if we fix a time tk from the index set I, then X(tk;ξ) is a random variable (see Figure. 6.1) since we are mapping ξ onto a real number. Thus, we have created a family (or ensemble) of random variables indexed by the parameter t, {X(t;ξ), t ∈ I}. This family is called a random process. We also refer to random processes as stochastic processes. We usually suppress the ξ and use X(t) to denote a random process.

The index parameter t is typically time, but can also be a spatial dimension.

• For fixed t, X(t,ξ) is a random variable defined on the sample space S

• For fixed ξ , X (t,ξ ) is a deterministic function of t, called a sample function.

A stochastic process is said to be discrete-time if the index set I is a countable set (e.g., the set of integers or the set of nonnegative integers). When dealing with discrete-time processes, we usually use n to denote the time index and Xn to denote the random process. A continuous-time stochastic process is one in which I is continuous (e.g., the real line or the nonnegative real line).

Random processes are used to model random experiments that evolve in time:

• Received sequence/waveform at the output of a communication channel

• Packet arrival times at a node in a communication network


Figure 6.1: Several realizations of a random process.

• Thermal noise in a resistor

• Daily price of stock

• Winnings or losses of a gambler

The answers to a lot of questions involve random processes. For example:

• How do future received values depend on past received values?

• How do future prices of a stock depend on its past values?

• What is the proportion of time a queue is empty?

• What is the average noise power at the output of a circuit?

• What is the probability that a link in a communication network is congested?

• What is the probability that the maximum power in a power distribution line is exceeded?

• What is the probability that a gambler will lose all his capital?

The following example shows how we can imagine a stochastic process as resulting from nature selecting ξ at the beginning

of time and gradually revealing it in time through X (t;ξ ).

Example 110. Let ξ be selected at random from the interval [−1,1]. Define the continuous-time random process X (t,ξ ) by

X (t,ξ ) = ξ cos(2πt)

The realizations of this random process are sinusoids with amplitude ξ , as shown in Figure. 6.2(a). Let ξ be selected at

random from the interval (−π,π) and let Y (t,ξ ) = cos(2πt + ξ ). The realizations of Y (t,ξ ) are phase-shifted versions of

cos(2πt) as shown in Figure. 6.2(b).


Figure 6.2: (a) Sinusoid with random amplitude, (b) Sinusoid with random phase.

The randomness in ξ induces randomness in the observed function X (t,ξ ). In principle, one can deduce the probability of

events involving a stochastic process at various instants of time from probabilities involving ξ by using the equivalent-event

method introduced earlier.

Example 111. Let ξ be a number selected at random from the interval S = [0,1], and let b1b2 ... be the binary expansion of ξ:

ξ = ∑_{i=1}^{∞} bi 2^{−i},  bi ∈ {0,1}

Define the discrete-time process X(n,ξ) by

X(n,ξ) = bn,  n = 1, 2, ...

The resulting process is a sequence of binary numbers, with X(n,ξ) equal to the nth number in the binary expansion of ξ. Find the following probabilities: P[X(1,ξ) = 0] and P[X(1,ξ) = 0 and X(2,ξ) = 1]. The probabilities are obtained by finding the equivalent events in terms of ξ:

P[X(1,ξ) = 0] = P[0 ≤ ξ < 1/2] = 1/2

P[X(1,ξ) = 0 and X(2,ξ) = 1] = P[1/4 ≤ ξ < 1/2] = 1/4
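These probabilities are easy to confirm by simulation (a sketch, not part of the notes; the trial count is arbitrary):

```python
import random

random.seed(7)

def first_bits(xi, n):
    """First n bits of the binary expansion of xi in [0, 1)."""
    out = []
    for _ in range(n):
        xi *= 2
        b = int(xi)
        out.append(b)
        xi -= b
    return out

TRIALS = 50_000
c1 = c2 = 0
for _ in range(TRIALS):
    b = first_bits(random.random(), 2)
    if b[0] == 0:
        c1 += 1
        if b[1] == 1:
            c2 += 1
print(c1 / TRIALS, c2 / TRIALS)   # ~1/2 and ~1/4
```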

In general, the sample paths of a stochastic process can be quite complicated and cannot be described by simple formulas.

In addition, it is usually not possible to identify an underlying probability space for the family of observed functions of time.

Thus the equivalent-event approach for computing the probability of events involving X (t,ξ ) in terms of the probabilities of

events involving ξ does not prove useful in practice. In the next section we show an alternative method for specifying the

probabilities of events involving a stochastic process.


6.2 Joint Distribution of Time Samples

As with random variables, we can mathematically describe a random process in terms of a CDF, a PDF or a PMF. Let X1, X2, ..., Xk be the k random variables obtained by sampling the random process X(t,ξ) at the times t1, t2, ..., tk:

X1 = X(t1,ξ), X2 = X(t2,ξ), ..., Xk = X(tk,ξ)

The joint behavior of the random process at these k time instants is specified by the joint cumulative distribution of the vector random variable (X1, X2, ..., Xk). The probabilities of any event involving the random process at all or some of these time instants can be computed from this CDF using the methods developed for vector random variables. Thus, a stochastic process is specified by the collection of kth-order joint cumulative distribution functions:

FX1,...,Xk(x1, x2, ..., xk) = P[X(t1) ≤ x1, X(t2) ≤ x2, ..., X(tk) ≤ xk] (6.1)

for any k and any choice of sampling instants t1,t2, ... ,tk .

6.2.1 Statistical Parameters

The moments of time samples of a random process can be used to partially specify the random process because they

summarize the information contained in the joint CDFs.

Definition 67 (Mean Function). The mean function of a random process is

E[X(t)] = µX(t) = ∫_{−∞}^{∞} x fX(x;t) dx

Example 112. Consider a random sinusoidal process: X (t) = Acos(2πft +Θ).

Find the expected value of X (t) for each of the following cases:

1. Θ = 0 and A is random

2. Θ∼ U[0,2π] and A = a (fixed)

3. A is random and Θ∼ U[0,2π]

Solution.

1. E[X(t)] = E[A cos(2πft)] = E[A] cos(2πft) = µA cos(2πft)

2. E[X(t)] = E[a cos(2πft + Θ)] = ∫₀^{2π} (1/2π) a cos(2πft + θ) dθ = 0

3. Assuming A and Θ are independent, E[X(t)] = E[A cos(2πft + Θ)] = E[A] E[cos(2πft + Θ)] = 0

Definition 68 (Autocorrelation Function). The autocorrelation function RXX(t1,t2) of a random process is

RXX(t1,t2) = E[X(t1)X(t2)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 x2 fX1X2(x1,x2;t1,t2) dx1 dx2

It is common to write t1 = t, t2 = t + τ, so that RXX(t,t+τ) = E[X(t)X(t+τ)].

Example 113. Find the autocorrelation function of the sinusoidal process

X (t) = Acos(2πft +Θ)

where A and Θ are independent r.v., with Θ ∼ U[0,2π].

Copyright c© 2019, Dr. Mustafa El-Halabi 93

Page 100: CCE506: STOCHASTIC PROCESSES, DETECTION & ESTIMATIONmustafa-halabi.appspot.com/NOTESCCE506.pdf · de nition includes deterministic as well as non-deterministic signals. A deterministic

STOCHASTIC PROCESSES 94

Solution.

RXX(t1,t2) = E[A² cos(2πft1 + Θ) cos(2πft2 + Θ)]
= E[A²] E[(1/2) cos(2πf(t1 + t2) + 2Θ) + (1/2) cos(2πf(t1 − t2))]
= (1/2) E[A²] cos(2πf(t1 − t2))
= (1/2) E[A²] cos(2πfτ)  (τ = t1 − t2)
= RXX(τ)
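A Monte Carlo check of this result (illustrative; the amplitude law A ∼ U(0,2), frequency f = 1, and the sample times are assumptions made for the demo, with E[A²] = 4/3 for this amplitude choice):

```python
import random
import math

random.seed(8)

f, t1, t2 = 1.0, 0.3, 0.8              # so tau = t1 - t2 = -0.5
TRIALS = 100_000
r_hat = 0.0
ea2 = 0.0
for _ in range(TRIALS):
    A = random.uniform(0, 2)            # E[A^2] = 4/3 for this choice
    th = random.uniform(0, 2 * math.pi)
    x1 = A * math.cos(2 * math.pi * f * t1 + th)
    x2 = A * math.cos(2 * math.pi * f * t2 + th)
    r_hat += x1 * x2
    ea2 += A * A
r_hat /= TRIALS
ea2 /= TRIALS
theory = 0.5 * ea2 * math.cos(2 * math.pi * f * (t1 - t2))
print(r_hat, theory)                    # both close to -2/3
```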

Definition 69 (Cross-Correlation Function). The cross-correlation of two random processes X(t) and Y(t) is defined as

RXY(t1,t2) = E[X(t1)Y(t2)]

In general, RXY(t1,t2) ≠ RYX(t1,t2).

The processes X(t) and Y(t) are said to be orthogonal random processes if

RXY(t1,t2) = 0, ∀t1,t2

Definition 70 (Auto-Covariance Function). The auto-covariance function of a random process is

KXX(t1,t2) = Cov(X(t1),X(t2)) = E[(X(t1) − µX(t1))(X(t2) − µX(t2))] = RXX(t1,t2) − µX(t1)µX(t2)

Definition 71 (Cross-Covariance Function). The cross-covariance function of two random processes X(t) and Y(t) is

KXY(t1,t2) = Cov(X(t1),Y(t2)) = E[(X(t1) − µX(t1))(Y(t2) − µY(t2))] = RXY(t1,t2) − µX(t1)µY(t2)

The processes X (t) and Y (t) are said to be uncorrelated if

KXY (t1,t2) = 0, ∀t1,t2

Example 114. Let X (t) = cos(ωt+Θ) and Y (t) = sin(ωt+Θ), where Θ is a random variable uniformly distributed in [−π,π].

Find the cross-covariance of X (t) and Y (t).

Solution. Since both processes are zero-mean, the cross-covariance equals the cross-correlation:

KXY(t1,t2) = E[cos(ωt1 + Θ) sin(ωt2 + Θ)] = E[−(1/2) sin(ω(t1 − t2)) + (1/2) sin(ω(t1 + t2) + 2Θ)] = −(1/2) sin(ω(t1 − t2))

Note that X(t1) and Y(t2) are uncorrelated random variables for t1 and t2 such that ω(t1 − t2) = kπ, where k is any integer.


6.2.2 Discrete Random Process

Definition 72. A discrete random process X (n), also denoted as Xn, is an infinite sequence of random variables X1,X2,X3, ... ;

we think of n as the time index.

1. Mean function: µX (n) = E [X (n)].

2. Auto-correlation function: RXX (k , l) = E [X (k)X (l)].

3. Auto-covariance function: KXX (k , l) = RXX (k , l)−µX (k)µX (l).

Example 115. We generate a discrete random process X[n] by repeatedly throwing a fair die. The values of the random process correspond to the results of each throw.

1. Find the mean function E [X [n]].

2. Find the autocorrelation function RX [k1,k2].

Solution.

1. E[X[n]] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5

2. RX[k1,k2] = E[X[k1]X[k2]].

If k1 ≠ k2, the throws are independent, so RX[k1,k2] = E[X[k1]]E[X[k2]] = (21/6)² = 441/36 = 49/4.

If k1 = k2, RX[k1,k2] = E[X²[k1]] = (1/6)(1² + 2² + 3² + ··· + 6²) = 91/6.
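These values can be checked by simulating the die-throw process (a sketch; the number of throws is arbitrary):

```python
import random

random.seed(9)

N = 100_000
rolls = [random.randint(1, 6) for _ in range(N)]

mean = sum(rolls) / N
second_moment = sum(r * r for r in rolls) / N
# different throws: estimate E[X[k]X[k+1]] by averaging adjacent products
cross = sum(rolls[i] * rolls[i + 1] for i in range(N - 1)) / (N - 1)

print(mean, second_moment, cross)   # ~21/6, ~91/6, ~49/4
```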

6.2.3 Gaussian Random Process

Definition 73 (Gaussian Random Process). A random process for which any n samples X(t1) = X1, X(t2) = X2, ..., X(tn) = Xn form a set of jointly Gaussian random variables, for any n and any t1, t2, ..., tn, is a Gaussian random process.

Example 116. Let W(n) be an i.i.d. Gaussian random process with autocorrelation RWW(k,l) = σ²δ(k − l) and mean µW(n) = 0, ∀n, where

RWW(k,l) = E[WkWl] = { σ², k = l; 0, otherwise }

This means that for l = k, RWW(k,l) = E[Wl²] = Var[Wl] = σ², while for l ≠ k, RWW(k,l) = 0, so Wl and Wk are uncorrelated and therefore independent, because they are jointly Gaussian.

A new averaging random process X(n) is defined as

X(n) = (W(n) + W(n − 1))/2,  n ≥ 1

We need to find the PDF of X(n) and the autocorrelation RXX(k,l).

From previous chapters, we know that X(n) is a Gaussian random variable because it is a linear combination of Gaussian random variables. Hence, it is enough to find the mean and variance of X(n) to find its PDF.

E[X(n)] = E[(Wn + Wn−1)/2] = (1/2)(E[Wn] + E[Wn−1]) = 0

Var[X(n)] = Var[(Wn + Wn−1)/2] = (1/4)(Var[Wn] + Var[Wn−1]) = (1/4)(σ² + σ²) = σ²/2

Therefore,

fXn(xn) = (1/√(2π(σ²/2))) exp(−xn²/(2(σ²/2))) = (1/(σ√π)) exp(−xn²/σ²)

Before we apply the formula, let us try to find the autocorrelation intuitively. By definition:

X1 = (W1 + W0)/2,  X2 = (W2 + W1)/2,  X3 = (W3 + W2)/2.

It is clear that X1 and X3 are uncorrelated (in fact, independent) because they do not have any Wi in common and W(n) is i.i.d. However, the pairs X1, X2 and X2, X3 are correlated.

RXX(k,l) = E[XkXl] = (1/4) E[(Wk + Wk−1)(Wl + Wl−1)] = (1/4)(E[WkWl] + E[WkWl−1] + E[Wk−1Wl] + E[Wk−1Wl−1])

Recall from the definition of W(n) that

E[WkWl] = σ² if k = l, and 0 otherwise;  E[WkWl−1] = σ² if k = l − 1, and 0 otherwise;
E[Wk−1Wl] = σ² if k = l + 1, and 0 otherwise;  E[Wk−1Wl−1] = σ² if k = l, and 0 otherwise.

Therefore,

RXX(k,l) = { σ²/2, k = l; σ²/4, k = l ± 1; 0, otherwise }
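The formula can be verified by simulating the averaging process (an illustrative sketch; σ = 1 and the sample length are arbitrary choices):

```python
import random

random.seed(10)

sigma = 1.0
N = 200_000
W = [random.gauss(0, sigma) for _ in range(N)]
X = [(W[k] + W[k - 1]) / 2 for k in range(1, N)]

def r_hat(lag):
    """Empirical R_XX(k, k + lag), averaged over k."""
    return sum(X[i] * X[i + lag] for i in range(len(X) - lag)) / (len(X) - lag)

print(r_hat(0), r_hat(1), r_hat(2))   # ~ sigma^2/2, sigma^2/4, 0
```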

Example 117 (Random Walk). Let W0 = X0 = constant, and let W1, W2, ... be an i.i.d. random process with the following distribution

Wi = { 1, with probability p; −1, with probability 1 − p }

The random walk process Xn, n = 1, 2, ..., is then defined as

Xn = W0 + W1 + W2 + ··· + Wn.

1. What is the pdf of Xn?

2. Find the mean function of Xn.

3. Find the variance of Xn.

Solution.


Figure 6.3: A possible realization of the random process W(n) and its corresponding averaging function X(n).

1. Xn is a shifted binomial r.v., since the number of positive steps is a sum of Bernoulli r.v.s. So finding the pdf of Xn is the same as finding P(Xn = h). Let U be the number of steps "up", i.e., those with Wi = 1; and let D be the number of steps "down", i.e., those with Wi = −1. Then

U − D = h − X0, U + D = n ⇒ U = (n + h − X0)/2.

Then,

P(Xn = h) = (n choose U) p^U (1 − p)^{n−U} = (n choose (n+h−X0)/2) p^{(n+h−X0)/2} (1 − p)^{(n−h+X0)/2}

2. E[Xn] = E[X0] + E[W1] + ··· + E[Wn] = X0 + n(1 × p + (−1)(1 − p)) = X0 + (2p − 1)n

Leading to the following for different values of p:

if p = 1/2, E[Xn] = X0,
if p > 1/2, E[Xn] → +∞ as n → ∞,
if p < 1/2, E[Xn] → −∞ as n → ∞.

3.

V[Xn] = V[X0] + V[W1] + ··· + V[Wn] (6.2)
= 0 + 4np(1 − p) (6.3)
= 4np(1 − p). (6.4)

where equation (6.2) is applicable because W(n), n = 1, 2, ..., are i.i.d., and (6.3) uses V[Wi] = E[Wi²] − (E[Wi])² = 1 − (2p − 1)² = 4p(1 − p).
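The mean and variance formulas can be confirmed by simulation (a sketch; p = 0.3, n = 50, X0 = 0 and the trial count are arbitrary choices):

```python
import random

random.seed(11)

p, n, X0 = 0.3, 50, 0
TRIALS = 20_000
finals = []
for _ in range(TRIALS):
    x = X0
    for _ in range(n):
        x += 1 if random.random() < p else -1
    finals.append(x)

mean = sum(finals) / TRIALS
var = sum((v - mean) ** 2 for v in finals) / TRIALS
print(mean, var)   # theory: X0 + (2p-1)n = -20 and 4np(1-p) = 42
```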


Figure 6.4: A possible realization of the random walk process

6.3 Stationary and Ergodic Random Processes

Many random processes have the property that the nature of the randomness in the process does not change with time. An observation of the process in the time interval (t0, t1) exhibits the same type of random behavior as an observation in some other time interval (t0 + τ, t1 + τ). This leads us to postulate that the probabilities of samples of the process do not depend on the instant when we begin taking observations; that is, probabilities involving samples taken at times t1, ..., tk will not differ from those taken at t1 + τ, ..., tk + τ.

Definition 74 (Strict-Sense Stationary (SSS) Process). A random process is strict-sense stationary (SSS) if its joint PDF is invariant to a time shift:

fX1,...,Xn(x1, ..., xn; t1, ..., tn) = fX1,...,Xn(x1, ..., xn; t1 + τ, ..., tn + τ) (6.5)

Definition 75 (Wide-Sense Stationary (WSS) Process). A random process is wide-sense stationary (WSS) if the mean and

the autocorrelation function are invariant to a time shift:

1. E [X (t)] = constant

2. RXX (t,t + τ) = RXX (τ)

Example 118. Suppose we modulate a carrier with a random process: Y (t) = X (t)cos(ωt +Θ), where X (t) and Θ are

independent and Θ∼ U[0,2π].

E[Y(t)] = E[X(t)cos(ωt + Θ)] = E[X(t)] E[cos(ωt + Θ)] = 0

RYY(t, t + τ) = E[X(t)X(t + τ)] E[cos(ωt + Θ)cos(ωt + ωτ + Θ)] = (1/2) RXX(t, t + τ) cos(ωτ)

Hence, Y(t) is a WSS process only if RXX(t, t + τ) is a function of τ only.


Example 119. Let X(t) be a WSS random process with autocorrelation RX(τ) = Ae^{−α|τ|}. Find the second moment of the random variable Y = X(5) − X(2).

Solution. E[Y] = E[X(5) − X(2)] = E[X(5)] − E[X(2)] = 0 (since the process is WSS, E[X(5)] = E[X(2)]).

E[(Y − mY)²] = E[Y²] = E[(X(5) − X(2))²] = E[X²(5)] + E[X²(2)] − 2E[X(5)X(2)]
= RX(0) + RX(0) − 2RX(3)  (process is WSS: E[X(t1)X(t2)] = RX(t1 − t2))
= 2Ae⁰ − 2Ae^{−3α} = 2A − 2Ae^{−3α}

Interpretation of the autocorrelation function. Let X(t) be WSS with zero mean. If RX(τ) drops quickly with τ, this means that the samples become uncorrelated quickly as we increase τ. Conversely, if RX(τ) drops slowly with τ, then the samples are highly correlated. Hence, RX(τ) is a measure of the rate of change of X(t) with time t, i.e., of the frequency content of X(t). It turns out that this is not just an intuitive interpretation: the Fourier transform of RX(τ) (the power spectral density) is in fact the average power density of X(t) over frequency.

Definition 76 (Ergodic Process). A WSS random process is ergodic if ensemble averages can be replaced by time averages of any realization of the process. We have two notions of ergodicity:

• Ergodic in the mean:

E[X(t)] = ⟨X(t)⟩ = lim_{t0→∞} (1/(2t0)) ∫_{−t0}^{t0} X(t) dt

• Ergodic in the autocorrelation:

RXX(τ) = ⟨X(t)X(t + τ)⟩ = lim_{t0→∞} (1/(2t0)) ∫_{−t0}^{t0} X(t)X(t + τ) dt

Example 120. Let X(t) = A, where A is a random variable. Hence, E[X(t)] = E[A] and RXX(t, t + τ) = E[A²] = σ²_A + µ²_A. Note that this process is SSS since X(t) = X(t + τ) for every realization. However, since ⟨X(t)⟩ = ⟨A⟩ = A, the time average is random, and ⟨X(t)X(t + τ)⟩ = ⟨A²⟩ = A². As a result, the process is stationary but not ergodic.

Example 121. X(t) = a cos(ωt + Θ), where Θ ∼ U(0, 2π). As derived previously, X(t) is WSS because E[X(t)] = 0 and RXX(τ) = (a²/2)cos(ωτ). The time average of X(t) is ⟨X(t)⟩ = ⟨a cos(ωt + Θ)⟩ = 0, and the time average of X(t)X(t + τ) is

⟨X(t)X(t + τ)⟩ = ⟨a² cos(ωt + Θ)cos(ωt + ωτ + Θ)⟩ = ⟨(a²/2)cos(ωτ)⟩ + ⟨(a²/2)cos(2ωt + 2Θ + ωτ)⟩ = (a²/2)cos(ωτ)

Hence, the process X(t) is ergodic in the mean and the autocorrelation.
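This can be checked numerically: fix one realization of Θ and approximate the time averages by a Riemann sum. The parameters below (a = 2, f0 = 5 Hz, τ = 0.3, window [−200, 200]) are illustrative assumptions; the window spans an integer number of periods so the averages converge cleanly.

```python
import math
import random

random.seed(0)
a, w, tau = 2.0, 2 * math.pi * 5.0, 0.3
theta = random.uniform(0.0, 2 * math.pi)    # one fixed realization of the random phase

T, n = 200.0, 400_000                       # window [-T, T], Riemann sum with n points
dt = 2 * T / n
ts = [-T + k * dt for k in range(n)]

# <X(t)> should equal E[X(t)] = 0
mean_avg = sum(a * math.cos(w * t + theta) for t in ts) * dt / (2 * T)

# <X(t)X(t+tau)> should equal RXX(tau) = (a^2/2) cos(w tau)
corr_avg = sum(a * math.cos(w * t + theta) * a * math.cos(w * (t + tau) + theta)
               for t in ts) * dt / (2 * T)

print(mean_avg)
print(corr_avg, (a * a / 2) * math.cos(w * tau))
```

Because the double-frequency term averages to zero over whole periods, both time averages match the ensemble values regardless of the drawn θ.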

6.3.1 Properties of Autocorrelation Function

Since the autocorrelation function, along with the mean, is considered to be a principal statistical descriptor of a WSS

random process, we will now consider some properties of the autocorrelation function. It should quickly become apparent

that not just any function of τ can be a valid autocorrelation function.


1. RXX (0) = E [X 2(t)]: Average normalized power in X (t)

Proof. To clarify this, note that RXX (0) = E [X 2(t)]. Now, suppose the random process X (t) was a voltage measured

at some point in a system. For a particular realization, x(t), the instantaneous power would be p(t) = x2(t)/r , where

r is the impedance in Ohms. The average power (averaged over all realizations in the ensemble) would then be

Pavg = E [X 2(t)]/r = RXX (0)/r . If, on the other hand, X (t) were a current rather than a voltage, then the average

power would be Pavg = RXX (0)r . From a system level, it is often desirable not to concern ourselves with whether a signal

is a voltage or a current. Accordingly, it is common to speak of normalized power, which is the power measured using a 1 Ohm impedance. With r = 1, the two expressions for average power are the same and equal to the autocorrelation

function evaluated at zero.

2. RXX (−τ) = RXX (τ)

Proof. This property can be easily established from the definition of autocorrelation. Note that RXX(−τ) = E[X(t)X(t − τ)]. Since X(t) is WSS, this expression is the same for any value of t. In particular, replacing t with t + τ gives RXX(−τ) = E[X(t + τ)X(t)] = RXX(τ). As a result of this property, any function of τ which is not even cannot be a valid autocorrelation function.

3. |RXX (τ)| ≤ RXX (0)

Proof. For every t,

(RX(τ))² = (E[X(t)X(t + τ)])² ≤ E[X²(t)] E[X²(t + τ)] = (RX(0))²

where the inequality is the Cauchy-Schwarz inequality and the last step uses stationarity: E[X²(t)] = E[X²(t + τ)] = RX(0).

4. If X (t) has periodic components, then RXX (τ) will have periodic components with the same period

5. If X (t) is ergodic and has no periodic components, then limτ→∞ RXX (τ) = E [X ]2
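These properties can be observed on an estimated autocorrelation. The sketch below (my own illustration) simulates a first-order autoregressive sequence as a hypothetical stand-in for a WSS process; its theoretical autocorrelation is R[k] = a^{|k|}/(1 − a²). Evenness is built into the estimator, and the bound |R̂[k]| ≤ R̂[0] holds for the biased estimator by the Cauchy-Schwarz inequality.

```python
import random

random.seed(42)
N, a = 100_000, 0.8
x = [0.0] * N
for n in range(1, N):
    x[n] = a * x[n - 1] + random.gauss(0.0, 1.0)   # AR(1) process (illustrative choice)

def acorr(x, k):
    """Biased autocorrelation estimate; evenness R(-k) = R(k) is built in."""
    k = abs(k)
    return sum(x[n] * x[n + k] for n in range(len(x) - k)) / len(x)

r0 = acorr(x, 0)
for k in range(1, 10):
    assert abs(acorr(x, k)) <= r0      # property 3: |R(k)| <= R(0)

print(r0, 1 / (1 - a * a))             # estimate vs theoretical R[0] = 1/(1 - a^2)
```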

6.4 Poisson Processes

Consider a process X (t) which counts the number of occurrences of some event in the time interval [0,t). The event

might be the telephone calls arriving at a certain switch in a public telephone network, customers entering a certain store,

or the birth of a certain species of animal under study. Since the random process is discrete (in amplitude), we will describe

it in terms of a probability mass function, pX (i ;t) = Pr(X (t) = i). Each occurrence of the event being counted is referred

to as an arrival or a point. These types of processes are referred to as counting processes or birth processes. Suppose this

random process has the following general properties:

• Independent Increments: The number of arrivals in two non-overlapping intervals are independent. That is, for two

intervals [t1,t2) and [t3,t4) such that t1 ≤ t2 ≤ t3 ≤ t4, the number of arrivals in [t1,t2) is statistically independent of

the number of arrivals in [t3,t4).

• Stationary Increments: The number of arrivals in an interval [t,t + τ) depends only on the length of the interval τ and

not on where the interval occurs, t.

• Distribution of Infinitesimal Increments: For an interval of infinitesimal length, [t,t +∆t), the probability of a single

arrival is proportional to ∆t, and the probability of having more than one arrival in the interval is negligible compared

to ∆t. Mathematically, we say that for some constant λ > 0:

Pr(no arrivals in [t,t +∆t)) = 1−λ ∆t +o(∆t) (6.6)

Pr(one arrival in [t,t +∆t)) = λ ∆t +o(∆t) (6.7)

Pr(more than one arrival in [t,t +∆t)) = o(∆t) (6.8)


Surprisingly enough, these rather general properties are enough to exactly specify the distribution of the counting process as

shown next.

Consider the PMF of the counting process at time t +∆t. In particular, consider finding the probability of the event

X (t +∆t) = 0.

pX (0;t +∆t) = Pr(no arrivals in [0,t +∆t)) (6.9)

= Pr(no arrivals in [0,t))Pr(no arrivals in [t,t +∆t)) (6.10)

= pX(0; t)[1 − λ∆t + o(∆t)] (6.11)

Subtracting pX (0;t) from both sides and dividing by ∆t results in

(pX(0; t + ∆t) − pX(0; t)) / ∆t = −λ pX(0; t) + (o(∆t)/∆t) pX(0; t) (6.12)

Passing to the limit as ∆t→ 0 gives the first order differential equation

(d/dt) pX(0; t) = −λ pX(0; t) (6.13)

The solution to this equation is of the general form

pX(0; t) = c e^{−λt} u(t) (6.14)

The constant c is found to be equal to unity using the fact that at time zero, the number of arrivals must be zero; that is, pX(0; 0) = 1. Therefore,

pX(0; t) = e^{−λt} u(t) (6.15)

The rest of the PMF for the random process X (t) can be specified in a similar manner. We find the probability of the general

event X (t +∆t) = i for some integer i > 0.

pX(i; t + ∆t) = Pr(i arrivals in [0, t)) Pr(no arrivals in [t, t + ∆t))

+ Pr(i − 1 arrivals in [0, t)) Pr(one arrival in [t, t + ∆t))

+ Pr(fewer than i − 1 arrivals in [0, t)) Pr(more than one arrival in [t, t + ∆t))

= pX(i; t)[1 − λ∆t + o(∆t)] + pX(i − 1; t)[λ∆t + o(∆t)] + Σ_{j=0}^{i−2} pX(j; t) o(∆t) (6.16)

As before, subtracting pX (i ;t) from both sides and dividing by ∆t results in

(pX(i; t + ∆t) − pX(i; t)) / ∆t = −λ pX(i; t) + λ pX(i − 1; t) + Σ_{j=0}^{i} pX(j; t) o(∆t)/∆t (6.17)

Passing to the limit as ∆t→ 0 gives another first-order differential equation.

(d/dt) pX(i; t) + λ pX(i; t) = λ pX(i − 1; t) (6.18)

It is fairly easy to solve this set of differential equations. For example, for i = 1, we get

(d/dt) pX(1; t) + λ pX(1; t) = λ pX(0; t) = λ e^{−λt} u(t) (6.19)

together with the initial condition pX(1; 0) = 0. The solution to this equation is

pX(1; t) = λt e^{−λt} u(t) (6.20)

It is straightforward to verify that the general solution to the family of differential equations is

pX(i; t) = ((λt)^i / i!) e^{−λt} u(t) (6.21)


Starting with the three mild assumptions made about the nature of this counting process at the start of this section, we

have demonstrated that X (t) follows a Poisson distribution, hence this process is referred to as a Poisson counting process.
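The derivation can be cross-checked by simulation (a sketch with arbitrary parameters λ = 2, t = 1.5): generate arrivals from i.i.d. exponential inter-arrival times, equivalent to the three axioms as shown later in this section, and compare the empirical PMF of X(t) with (λt)^i e^{−λt}/i!.

```python
import math
import random

random.seed(7)
lam, t, n_trials = 2.0, 1.5, 100_000

def count_arrivals(lam, t):
    """Number of arrivals in [0, t) with Exp(lam) inter-arrival times."""
    elapsed, count = 0.0, 0
    while True:
        elapsed += random.expovariate(lam)
        if elapsed >= t:
            return count
        count += 1

counts = [count_arrivals(lam, t) for _ in range(n_trials)]
for i in range(4):
    emp = counts.count(i) / n_trials
    exact = (lam * t) ** i * math.exp(-lam * t) / math.factorial(i)
    print(i, emp, exact)
```

Each empirical probability should agree with the Poisson PMF to within sampling error (roughly ±0.003 at 100,000 trials).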

Starting with the PMF for the Poisson counting process, one can easily find the mean and autocorrelation functions for this

process. First, the mean function is given by

E[X(t)] = Σ_{i=0}^{∞} i ((λt)^i / i!) e^{−λt} u(t) = λt u(t) (6.22)

In other words, the average number of arrivals in the interval [0, t) is λt. This gives the parameter λ the physical interpretation of the average rate of arrivals, or, as it is more commonly called, the arrival rate of the Poisson process. Another observation we can make from the mean function is that the Poisson counting process is not stationary.

The autocorrelation function can be calculated as follows

RXX(t1, t2) = E[X(t1)X(t2)] = E[X(t1)(X(t1) + (X(t2) − X(t1)))] = E[X²(t1)] + E[X(t1)(X(t2) − X(t1))] (6.23)

To simplify the second expression, we use the independent increments property of the Poisson counting process. Assuming

that t1 < t2, then X (t1) represents the number of arrivals in the interval [0,t1), while X (t2)−X (t1) is the number of arrivals in

the interval [t1,t2). Since these two intervals are non-overlapping, the number of arrivals in the two intervals are independent.

Therefore,

RXX(t1, t2) = E[X²(t1)] + E[X(t1)] E[X(t2) − X(t1)] = Var(X(t1)) + E[X(t1)] E[X(t2)] = λt1 + λ²t1t2 (6.24)

This can be written more concisely in terms of the autocovariance function,

KXX (t1,t2) = Var(X (t1)) = λ t1 (6.25)

If t2 < t1, then the roles of t1 and t2 need to be reversed. In general for the Poisson counting process, we have

KXX (t1,t2) = λ min(t1,t2) (6.26)

Another feature that can be extracted from the PMF of the Poisson counting process is the distribution of the inter-arrival

time. That is, let T be the time at which the first arrival occurs. We seek the distribution of the random variable T . The

CDF of T can be found as

FT(t) = Pr(T ≤ t) = Pr(at least one arrival in [0, t]) = 1 − Pr(no arrivals in [0, t]) = [1 − e^{−λt}] u(t) (6.27)

Therefore, it follows that the arrival time is an exponential random variable with mean E[T] = 1/λ. The PDF of T is

fT(t) = λ e^{−λt} u(t) (6.28)

We could get the same result starting from any point in time. That is, we do not need to measure the time to the next

arrival starting from time zero. Picking any arbitrary point in time t0, we could define T to be the time until the first arrival

after time t0 . Using the same reasoning as above we would arrive at the same exponential distribution. If we pick t0 to be

the time of a specific arrival, and then define T to be the time to the next arrival, then T is interpreted as an inter-arrival

time. Hence, we conclude that the time between successive arrivals in the Poisson counting process follows an exponential

distribution with a mean of 1/λ .
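The exponential inter-arrival law can also be recovered directly from the infinitesimal-increment model: discretize time into slots of width ∆t and place one arrival per slot with probability λ∆t, per Eq. (6.7). The rate and slot width below are illustrative assumptions.

```python
import math
import random

random.seed(3)
lam, dt = 2.0, 0.001                 # rate and slot width (lam * dt << 1)
n_slots = 2_000_000                  # simulate the interval [0, 2000)

# One arrival per slot with probability lam * dt
arrivals = [k * dt for k in range(n_slots) if random.random() < lam * dt]
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]

mean_gap = sum(gaps) / len(gaps)
print(mean_gap, 1 / lam)             # mean inter-arrival time ~ 1/lam

t = 0.7                              # empirical CDF vs F_T(t) = 1 - e^{-lam t}
emp_cdf = sum(g <= t for g in gaps) / len(gaps)
print(emp_cdf, 1 - math.exp(-lam * t))
```

The gaps are geometric in units of ∆t, which converges to the exponential distribution as ∆t → 0.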

6.5 Power Spectral Density

As with deterministic signals, we seek a frequency domain description of a random process. For a deterministic continuous

signal, X (t), the Fourier transform is used to describe its spectral content. In order to study random processes in the frequency

domain, we seek a quantity that will describe the spectral characteristics of a random process. To start with, for a random

process X (t), define a truncated version of the random process as

Xt0(t) = X(t) for |t| < t0, and Xt0(t) = 0 otherwise.


The energy of this truncated process is

EXt0 = ∫_{−t0}^{t0} X²(t) dt = ∫_{−∞}^{∞} X²t0(t) dt

and hence, the time-averaged power is

PXt0 = (1/(2t0)) ∫_{−∞}^{∞} X²t0(t) dt = (1/(2t0)) ∫_{−∞}^{∞} |Xt0(f)|² df

where the second equality follows from Parseval's theorem and Xt0(f) is the Fourier transform of Xt0(t).

Since PXt0 is a random variable, to get the ensemble-averaged power we must take the expectation:

P̄Xt0 = E[PXt0] = (1/(2t0)) ∫_{−∞}^{∞} E[|Xt0(f)|²] df

The power in the (untruncated) random process X(t) is then found by passing to the limit as t0 → ∞:

PX = lim_{t0→∞} P̄Xt0 = lim_{t0→∞} (1/(2t0)) ∫_{−∞}^{∞} E[|Xt0(f)|²] df = ∫_{−∞}^{∞} lim_{t0→∞} (E[|Xt0(f)|²] / (2t0)) df

Then, the average power in the process can be expressed as

PX = ∫_{−∞}^{∞} SXX(f) df (6.29)

where

SXX(f) = lim_{t0→∞} E[|Xt0(f)|²] / (2t0) = power spectral density (PSD) (6.30)

SXX (f ) has the units of power per unit frequency and so it is the power density function of the random process in the

frequency domain. So, SXX (f ) has the property that when integrated over all frequency, the total power in the process is

obtained. The power spectral density has the following properties

1. SXX (f ) is real

2. SXX (f )≥ 0

3. SXX (f ) is an even function of f

Theorem 27 (The Wiener-Khintchine-Einstein Theorem). For a WSS process X (t) whose autocorrelation is RXX (τ), the

PSD of the process is

SXX(f) = F[RXX(τ)] = ∫_{−∞}^{∞} RXX(τ) e^{−j2πfτ} dτ (6.31)

For a discrete-time process Xn, the power spectral density is the discrete-time Fourier transform (DTFT) of the sequence RX[n]:

SX(f) = Σ_{n=−∞}^{∞} RX[n] e^{−j2πnf}, |f| < 1/2 (6.32)

As a result, the autocorrelation function (RXX(τ) or RX[n]) can be recovered from SX(f) by taking the inverse Fourier transform or inverse DTFT:

RX(τ) = ∫_{−∞}^{∞} SX(f) e^{j2πfτ} df

RX[n] = ∫_{−1/2}^{1/2} SX(f) e^{j2πnf} df
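The discrete-time transform pair can be verified numerically for a concrete sequence. The example below (my own illustration) takes RX[n] = a^{|n|}, whose DTFT has the closed form (1 − a²)/(1 − 2a cos(2πf) + a²) by summing the geometric series, and checks both Eq. (6.32) and the inverse DTFT.

```python
import cmath
import math

a = 0.6                                   # hypothetical decay parameter
def R(n):
    return a ** abs(n)                    # autocorrelation sequence R_X[n]

def S(f, M=100):
    """Truncated DTFT of R[n] (Eq. 6.32); the tail beyond |n| = M is negligible."""
    return sum(R(n) * cmath.exp(-2j * math.pi * n * f)
               for n in range(-M, M + 1)).real

def S_closed(f):
    """Closed form obtained from the geometric series."""
    return (1 - a * a) / (1 - 2 * a * math.cos(2 * math.pi * f) + a * a)

def inv_dtft(n, K=500):
    """Inverse DTFT over [-1/2, 1/2] by the midpoint rule; recovers R[n]."""
    df = 1.0 / K
    fs = [-0.5 + (k + 0.5) * df for k in range(K)]
    return sum(S(f) * math.cos(2 * math.pi * n * f) for f in fs) * df  # S is real, even

print(S(0.1), S_closed(0.1))
print(inv_dtft(3), R(3))
```

The midpoint rule is exact here (up to floating-point error) because the truncated DTFT is a trigonometric polynomial sampled on a uniform grid over one period.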


Example 122. Find the autocorrelation function of the sinusoidal process

X (t) = α cos(2πf0t +Θ)

where Θ∼ U[0,2π] and α is a constant. Find SXX (f ).

Solution. E[X(t)] = 0 and RXX(τ) = (α²/2) cos(2πf0τ). Hence, X(t) is WSS. Using the WKE theorem,

SXX(f) = F[RXX(τ)] = F[(α²/2) cos(2πf0τ)] = (α²/4) (δ(f − f0) + δ(f + f0))

Example 123. Discrete-time white noise process: X1, X2, ..., Xn, zero mean, uncorrelated, with average power N:

RX[n] = Nδ[n]

If Xn is also a GRP, then we obtain a discrete time WGN process.

Example 124. Bandlimited zero-mean WSS white noise process X(t). For any t, the samples X(t ± n/(2B)) for n = 0, 1, 2, ... are uncorrelated.

Example 125. White noise process: if we let B → ∞ in the previous example, we obtain a white noise process, which has

SX(f) = N/2 for all f

RX(τ) = (N/2) δ(τ)

If, in addition, X (t) is a GRP, then we obtain the famous white Gaussian noise (WGN) process.

Remarks on white noise:

• For a white noise process, all samples are uncorrelated

• The process is not physically realizable, since it has infinite power. However, it plays a similar role in random processes

to point mass in physics and delta function in linear systems

• Thermal noise and shot noise are well modeled as white Gaussian noise, since they have a flat PSD over a wide band

(GHz)
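A quick numerical illustration of the discrete-time case (a sketch; the power N = 3 is an arbitrary assumption): uncorrelated samples with average power N have an estimated autocorrelation close to Nδ[n], i.e., N at lag 0 and roughly 0 elsewhere.

```python
import random

random.seed(9)
N_power, L = 3.0, 200_000
x = [random.gauss(0.0, N_power ** 0.5) for _ in range(L)]   # uncorrelated, E[x^2] = N

def acorr(x, k):
    """Biased autocorrelation estimate at lag k >= 0."""
    return sum(x[n] * x[n + k] for n in range(len(x) - k)) / len(x)

print(acorr(x, 0))                 # ~ N        (R[0] = N)
print(acorr(x, 1), acorr(x, 5))    # ~ 0, ~ 0   (R[n] = 0 for n != 0)
```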

Definition 77 (Cross-spectral density). The cross-spectral density between two random processes, X (t) and Y (t) is the

Fourier transform of the cross-correlation function

SXY (f ) = F [RXY (τ)]

SXY (f ) = SYX (−f ) = S∗XY (−f )


6.6 Random Processes in Linear Systems

Consider a linear time-invariant system (LTI) described by an impulse response h(t). The input to the system is X (t) and

the output is Y (t):

Y(t) = ∫_{−∞}^{∞} X(u) h(t − u) du (6.33)

Important special case: If the input X (t) is a Gaussian random process, then the output process Y (t) will also be a Gaussian

random process (since the integral above can be approximated by a sum and thus the output process is obtained via a linear

transformation of X (t)). In that case, to completely describe the output of the system, we need to compute the mean and

the autocovariance (or autocorrelation) function of the output.

We focus on finding the mean and autocorrelation function of Y (t) in terms of the mean and autocorrelation functions of

the input process X (t) and the impulse response of the system h(t). We are also interested in finding the crosscorrelation

function between X (t) and Y (t) defined as

RXY(t1, t2) = E[X(t1)Y(t2)]

Note that unlike RX (t1,t2), RXY (t1,t2) is not necessarily symmetric in t1 and t2. However, RXY (t1,t2) = RYX (t2,t1).

Theorem 28. Let X (t) be a WSS process at the input of a LTI filter with impulse response h(t). The output of the filter is

a WSS process Y (t). We have the following relationships:

1. E[Y(t)] = E[X(t)] H(0)

2. RYX(τ) = RX(τ) ⋆ h(τ)

3. RY(τ) = RX(τ) ⋆ h(τ) ⋆ h(−τ)

4. SY(f) = SX(f) |H(f)|²

Proof. Note that here the LTI system is in steady state

1. To find the mean of Y (t), consider

E[Y(t)] = E[∫_{−∞}^{∞} X(τ) h(t − τ) dτ]

= ∫_{−∞}^{∞} E[X(τ)] h(t − τ) dτ

= E[X(t)] ∫_{−∞}^{∞} h(t − τ) dτ = E[X(t)] H(0)

2. To find the crosscorrelation function between Y (t) and X (t), consider

RYX(τ) = E[Y(t + τ)X(t)]

= E[∫_{−∞}^{∞} h(α) X(t + τ − α) X(t) dα]

= ∫_{−∞}^{∞} h(α) RX(τ − α) dα

= h(τ) ⋆ RX(τ)

3. To find the autocorrelation function Y (t), consider

RY(τ) = E[Y(t + τ)Y(t)]

= E[Y(t + τ) ∫_{−∞}^{∞} h(α) X(t − α) dα]

= ∫_{−∞}^{∞} h(α) RYX(τ + α) dα

= h(−τ) ⋆ RYX(τ)

= RX(τ) ⋆ h(τ) ⋆ h(−τ)


4. Taking the Fourier transform of RY(τ) = RX(τ) ⋆ h(τ) ⋆ h(−τ), and noting that F[h(−τ)] = H*(f), we get SY(f) = SX(f) H(f) H*(f) = SX(f)|H(f)|².
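The theorem can be checked end-to-end in discrete time (a sketch; the FIR impulse response and parameters are hypothetical). For white input with RX[n] = Nδ[n], relationship 3 reduces to RY[m] = N Σ_k h[k]h[k + m], which the simulation below reproduces.

```python
import random

random.seed(11)
N_power, L = 2.0, 200_000
h = [1.0, 0.5, 0.25]                        # hypothetical FIR impulse response
x = [random.gauss(0.0, N_power ** 0.5) for _ in range(L)]   # white input, R_X[n] = N d[n]

# Filter: y[n] = sum_k h[k] x[n-k]
y = [sum(h[k] * x[n - k] for k in range(len(h))) for n in range(len(h), L)]

def acorr(z, m):
    """Biased autocorrelation estimate at lag m >= 0."""
    return sum(z[n] * z[n + m] for n in range(len(z) - m)) / len(z)

# Theory: R_Y[m] = (R_X * h * h(-))[m] = N * sum_k h[k] h[k+m]
for m in range(len(h)):
    theory = N_power * sum(h[k] * h[k + m] for k in range(len(h) - m))
    print(m, acorr(y, m), theory)
```

The same check in the frequency domain would compare the averaged periodogram of y with N|H(f)|², per relationship 4.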
