
Chapter 2

Probability and Stochastic Processes

2.1 Introduction

The goal of this chapter is to provide the basic definitions and properties related to probability theory and stochastic processes. It is assumed that the reader has attended a basic course on probability and statistics prior to reading this book. So, the aim here is to help the reader "refresh" the memory and establish a common language and a commonly understood notation.

Besides probability and random variables, random processes will be briefly reviewed and some basic theorems will be stated. Finally, at the end of the chapter, basic definitions and properties related to information theory are summarized.

The reader who is familiar with all these notions can bypass this chapter.

2.2 Probability and Random Variables

A random variable, x, is a variable whose variations are due to chance/randomness. A random variable can be considered as a function, which assigns a value to the outcome of an experiment. For example, in a coin tossing experiment, the corresponding random variable, x, can assume the values $x_1 = 0$ if the result of the experiment is "head" and $x_2 = 1$ if the result is "tail".

We are going to denote a random variable with a lower case roman font, e.g., x, and the values it takes, once an experiment has been performed, with math-mode italics, e.g., x.

A random variable is described in terms of a probability, if it is discrete, or in terms of a probability density function, if it is continuous. For a more formal treatment and discussion, see, e.g., [Papo 02, Jayn 03].


2.2.1 Probability

Although the words "probability" and "probable" are quite common in our everyday vocabulary, the mathematical definition of probability is not a straightforward one, and a number of different definitions have been proposed over the years. Needless to say, whatever definition is adopted, the end result is that the properties and rules which are derived remain the same. Two of the most commonly used definitions are:

Relative Frequency Definition

The probability, P(A), of an event, A, is the limit

$$ P(A) = \lim_{n\to\infty} \frac{n_A}{n}, \qquad (2.1) $$

where n is the total number of trials and $n_A$ the number of times event A occurred. The problem with this definition is that, in practice, in any physical experiment the numbers $n_A$ and n can be large, yet they are always finite. Thus, the limit can only be used as a hypothesis and not as something that can be attained experimentally. In practice, we often use

$$ P(A) \approx \frac{n_A}{n}, \qquad (2.2) $$

for large values of n. However, this has to be used with caution, especially when the probability of an event is very small.

Axiomatic Definition

This definition of probability traces back to the 1933 work of Andrey Kolmogorov, who found a close connection between probability theory and the mathematical theory of sets and functions of a real variable, in the context of measure theory, e.g., [Kolm 56].

The probability, P(A), of an event is a non-negative number assigned to this event, i.e.,

$$ P(A) \ge 0. \qquad (2.3) $$

The probability of an event, C, which is certain to occur, equals one:

$$ P(C) = 1. \qquad (2.4) $$

If two events, A and B, are mutually exclusive (they cannot occur simultaneously), then the probability of occurrence of either A or B (denoted as A ∪ B) is given by

$$ P(A \cup B) = P(A) + P(B). \qquad (2.5) $$

It turns out that these three defining properties, which can be considered as the respective axioms, suffice to develop the rest of the theory. For example, it can be shown that the probability of an impossible event is equal to zero, e.g., [Papo 02].


The previous two approaches to defining probability are not the only ones. Another interpretation, which is in line with the way we are going to use the notion of probability in a number of places in this book in the context of Bayesian learning, has been given by Cox, [Cox 46]. There, probability is seen as a measure of uncertainty concerning an event. Take, for example, the uncertain event of whether the Minoan civilization was destroyed as a consequence of the earthquake that happened close to the island of Santorini. This is obviously not an event whose probability can be tested with repeated trials. However, putting together historical as well as scientific evidence, we can quantify our expression of uncertainty concerning such a conjecture. Also, we can modify the degree of our uncertainty once more historical evidence comes to light due to new archaeological findings. Assigning numerical values to represent degrees of belief, Cox developed a set of axioms encoding common-sense properties of such beliefs, and he arrived at a set of rules equivalent to the ones we are going to review soon; see also [Jayn 03].

The origins of probability theory are traced back to the mid-17th century, in the works of Pierre Fermat (1601–1665), Blaise Pascal (1623–1662) and Christiaan Huygens (1629–1695). The concepts of probability and of the mean value of a random variable can be found there. The motivation for developing the theory was not related to any purpose of "serving society"; the purpose was to serve the needs of gambling and games of chance!

2.2.2 Discrete Random Variables

A discrete random variable, x, can take any value from a finite or countably infinite set $\mathcal{X}$. The probability of the event "x = x ∈ $\mathcal{X}$" is denoted as

$$ P(\mathrm{x} = x) \quad \text{or simply} \quad P(x). \qquad (2.6) $$

The function P(·) is known as the probability mass function (pmf). Being a probability, it has to satisfy the first axiom, i.e., P(x) ≥ 0. Assuming that no two values in $\mathcal{X}$ can occur simultaneously and that after any experiment a single value will always occur, the second and third axioms combined give

$$ \sum_{x\in\mathcal{X}} P(x) = 1. \qquad (2.7) $$

The set $\mathcal{X}$ is also known as the sample space.

Joint and Conditional Probabilities

The joint probability of two events, A, B, is the probability that both events occur simultaneously, and it is denoted as P(A, B). Let us now consider two random variables, x, y, with state spaces $\mathcal{X} = \{x_1, \ldots, x_{n_x}\}$ and $\mathcal{Y} = \{y_1, \ldots, y_{n_y}\}$, respectively. Let us adopt the relative frequency definition and assume that we carry out n experiments and that each one of the values in $\mathcal{X}$ occurred


$n_{x_1}, \ldots, n_{x_{n_x}}$ times and each one of the values in $\mathcal{Y}$ occurred $n_{y_1}, \ldots, n_{y_{n_y}}$ times, respectively. Then,

$$ P(x_i) \approx \frac{n_{x_i}}{n}, \ i = 1, 2, \ldots, n_x, \quad \text{and} \quad P(y_j) \approx \frac{n_{y_j}}{n}, \ j = 1, 2, \ldots, n_y. $$

Let us denote by $n_{ij}$ the number of times the values $x_i$ and $y_j$ occurred simultaneously. Then, $P(x_i, y_j) \approx n_{ij}/n$. A simple reasoning dictates that the total number of times, $n_{x_i}$, that the value $x_i$ occurred is equal to

$$ n_{x_i} = \sum_{j=1}^{n_y} n_{ij}. \qquad (2.8) $$

Dividing both sides of the above by n, the following sum rule readily results:

$$ P(x) = \sum_{y\in\mathcal{Y}} P(x, y): \quad \text{Sum Rule}. \qquad (2.9) $$

The conditional probability of an event, A, given another event, B, is denoted as P(A|B) and is defined as

$$ P(A|B) := \frac{P(A, B)}{P(B)}: \quad \text{Conditional Probability}, \qquad (2.10) $$

provided P(B) ≠ 0. It can be shown that this is indeed a probability, in the sense that it respects all three axioms, e.g., [Papo 02]. We can better grasp its physical meaning if the relative frequency definition is adopted. Let $n_{AB}$ be the number of times that both events occurred simultaneously, and $n_B$ the number of times B occurred, out of n experiments. Then we have

$$ P(A|B) = \frac{n_{AB}}{n}\,\frac{n}{n_B} = \frac{n_{AB}}{n_B}. \qquad (2.11) $$

In other words, the conditional probability of an event, A, given another one, B, is the relative frequency with which A occurred, not with respect to the total number of experiments performed, but relative to the number of times event B occurred.

Written differently, and adopting similar notation in terms of random variables, in conformity with (2.9), the definition of the conditional probability is also known as the product rule of probability, i.e.,

$$ P(x, y) = P(x|y)P(y): \quad \text{Product Rule}. \qquad (2.12) $$

To differentiate them from the joint and conditional probabilities, the probabilities P(x) and P(y) are known as marginal probabilities.

Statistical Independence: Two random variables are said to be statistically independent if and only if their joint probability is written as the product of the respective marginals, i.e.,

$$ P(x, y) = P(x)P(y). \qquad (2.13) $$


Bayes Theorem

Bayes theorem is a direct consequence of the product rule and of the symmetry property of the joint probability, i.e., P(x, y) = P(y, x), and it is stated as

$$ P(y|x) = \frac{P(x|y)P(y)}{P(x)}: \quad \text{Bayes Theorem}, \qquad (2.14) $$

where the marginal, P(x), can be written as

$$ P(x) = \sum_{y\in\mathcal{Y}} P(x, y) = \sum_{y\in\mathcal{Y}} P(x|y)P(y), $$

and it can be considered as the normalizing constant of the numerator on the right-hand side of (2.14), which guarantees that summing up P(y|x) with respect to all possible values of y ∈ $\mathcal{Y}$ results in one.

Bayes theorem plays a central role in Machine Learning, and it will be the basis for developing the so-called Bayesian techniques for estimating the values of unknown parameters.
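As a quick illustration, the sum rule (2.9), the product rule (2.12), and Bayes theorem (2.14) can all be checked numerically on a small joint pmf. A minimal sketch in Python with numpy; the table values below are made up purely for illustration:

import numpy as np

# A hypothetical joint pmf P(x, y) on a 2x3 grid; rows index x, columns index y.
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
assert np.isclose(P_xy.sum(), 1.0)

P_x = P_xy.sum(axis=1)              # sum rule: P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)              # sum rule: P(y) = sum_x P(x, y)
P_x_given_y = P_xy / P_y            # conditional probability: P(x|y) = P(x, y) / P(y)

# Bayes theorem: P(y|x) = P(x|y) P(y) / P(x).
P_y_given_x = (P_x_given_y * P_y) / P_x[:, None]
assert np.allclose(P_y_given_x, P_xy / P_x[:, None])   # agrees with the definition
assert np.allclose(P_y_given_x.sum(axis=1), 1.0)       # P(x) acts as the normalizer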

2.2.3 Continuous Random Variables

So far, we have focussed on discrete random variables. Our interest now turns to the extension of the notion of probability to continuous random variables, which take values on the real axis, ℝ.

The starting point is to compute the probability that a continuous random variable, x, lies in an interval, $x_1 \le x \le x_2$. Note that the two events, $x \le x_1$ and $x_1 < x \le x_2$, are mutually exclusive. Thus, we can write that

$$ P(x \le x_1) + P(x_1 < x \le x_2) = P(x \le x_2). \qquad (2.15) $$

Define the cumulative distribution function (cdf) of x as

$$ F_x(x) := P(\mathrm{x} \le x): \quad \text{Cumulative Distribution Function}. \qquad (2.16) $$

Then, (2.15) can be written as

$$ P(x_1 < x \le x_2) = F_x(x_2) - F_x(x_1). \qquad (2.17) $$

Note that $F_x(\cdot)$ is a monotonically increasing function. Assuming that it is also continuous and differentiable, we can define the probability density function (pdf) of x as

$$ p_x(x) := \frac{dF_x(x)}{dx}: \quad \text{Probability Density Function}, \qquad (2.18) $$

which then leads to

$$ P(x_1 < x \le x_2) = \int_{x_1}^{x_2} p_x(x)\, dx. \qquad (2.19) $$


Also,

$$ F_x(x) = \int_{-\infty}^{x} p_x(z)\, dz. \qquad (2.20) $$

Using familiar arguments from calculus, the pdf can be interpreted as

$$ \Delta P(x < x \le x + \Delta x) \approx p_x(x)\Delta x, \qquad (2.21) $$

which justifies its name as a "density" function, being the probability (ΔP) of x lying in a small interval Δx, divided by the length of this interval. Note that as Δx → 0 this probability tends to zero. Thus, the probability of a continuous random variable taking any single value is zero. Moreover, since P(−∞ < x < +∞) = 1, we have that

$$ \int_{-\infty}^{+\infty} p_x(x)\, dx = 1. \qquad (2.22) $$

Usually, in order to simplify notation, the subscript x is dropped and we will write p(x), unless it is necessary for avoiding possible confusion. Note, also, that we have adopted the lower case "p" to denote a pdf and the capital "P" to denote a probability.

All previously stated rules for probabilities carry over to the case of pdfs, i.e.,

$$ p(x|y) = \frac{p(x, y)}{p(y)}, \qquad p(x) = \int_{-\infty}^{+\infty} p(x, y)\, dy. \qquad (2.23) $$

2.2.4 Mean and Variance

Two of the most common and useful quantities associated with any random variable are its mean value and its variance. The mean value (sometimes called the expected value) is defined as

$$ E[x] := \int_{-\infty}^{+\infty} x\, p(x)\, dx: \quad \text{Mean Value}, \qquad (2.24) $$

where for discrete random variables the integration is replaced by summation $\left(E[x] = \sum_{x\in\mathcal{X}} x P(x)\right)$.

The variance is denoted as $\sigma_x^2$ and is defined as

$$ \sigma_x^2 := \int_{-\infty}^{+\infty} \left(x - E[x]\right)^2 p(x)\, dx: \quad \text{Variance}, \qquad (2.25) $$

where integration is replaced by summation for discrete variables. The variance is a measure of the spread of the values of the random variable around its mean value.


Given two random variables x, y, their covariance is defined as

$$ \operatorname{cov}(x, y) := E\left[\left(x - E[x]\right)\left(y - E[y]\right)\right], \qquad (2.26) $$

and their correlation as

$$ r_{xy} := E[xy] = \operatorname{cov}(x, y) + E[x]E[y]. \qquad (2.27) $$

A random vector is a collection of random variables, $\mathbf{x} = [x_1, \ldots, x_l]^T$, and $p(\mathbf{x})$ is the joint pdf (joint probability for discrete variables), i.e.,

$$ p(\mathbf{x}) = p(x_1, \ldots, x_l). \qquad (2.28) $$

The covariance matrix of a random vector, x, is defined as

$$ \operatorname{Cov}(\mathbf{x}) := E\left[\left(\mathbf{x} - E[\mathbf{x}]\right)\left(\mathbf{x} - E[\mathbf{x}]\right)^T\right]: \quad \text{Covariance Matrix}, \qquad (2.29) $$

or

$$ \operatorname{Cov}(\mathbf{x}) = \begin{bmatrix} \operatorname{cov}(x_1, x_1) & \ldots & \operatorname{cov}(x_1, x_l) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(x_l, x_1) & \ldots & \operatorname{cov}(x_l, x_l) \end{bmatrix}. \qquad (2.30) $$

Similarly, the correlation matrix of a random vector x is defined as

$$ R_x := E\left[\mathbf{x}\mathbf{x}^T\right]: \quad \text{Correlation Matrix}, \qquad (2.31) $$

or

$$ R_x = \begin{bmatrix} E[x_1 x_1] & \ldots & E[x_1 x_l] \\ \vdots & \ddots & \vdots \\ E[x_l x_1] & \ldots & E[x_l x_l] \end{bmatrix} = \operatorname{Cov}(\mathbf{x}) + E[\mathbf{x}]E[\mathbf{x}^T]. \qquad (2.32) $$

Both the covariance and correlation matrices have a very rich structure, which will be exploited in various parts of this book to lead to computational savings, whenever they are present in calculations. For the time being, observe that both are symmetric and non-negative definite. The symmetry, $\Sigma = \Sigma^T$, is readily deduced from the definition. An $l \times l$ matrix, A, is called non-negative definite if

$$ \mathbf{y}^T A \mathbf{y} \ge 0, \quad \forall \mathbf{y} \in \mathbb{R}^l. \qquad (2.33) $$

If the inequality is strict, the matrix is said to be positive definite. For the covariance matrix, we have

$$ \mathbf{y}^T E\left[\left(\mathbf{x} - E[\mathbf{x}]\right)\left(\mathbf{x} - E[\mathbf{x}]\right)^T\right] \mathbf{y} = E\left[\left(\mathbf{y}^T\left(\mathbf{x} - E[\mathbf{x}]\right)\right)^2\right] \ge 0, $$

and the claim has been proved.


Complex Random Variables

A complex random variable, z ∈ ℂ, is a sum

$$ \mathrm{z} = \mathrm{x} + j\mathrm{y}, \qquad (2.34) $$

where x, y are real random variables and $j := \sqrt{-1}$. Note that for complex random variables the pdf cannot be defined, since inequalities of the form $x + jy \le x + jy$ have no meaning. When we write p(z), we mean the joint pdf of the real and imaginary parts, i.e.,

$$ p(z) := p(x, y). \qquad (2.35) $$

For complex random variables, the notions of mean and covariance are defined as

$$ E[\mathrm{z}] := E[\mathrm{x}] + jE[\mathrm{y}], \qquad (2.36) $$

and

$$ \operatorname{cov}(\mathrm{z}_1, \mathrm{z}_2) := E\left[\left(\mathrm{z}_1 - E[\mathrm{z}_1]\right)\left(\mathrm{z}_2 - E[\mathrm{z}_2]\right)^*\right], \qquad (2.37) $$

where "∗" denotes complex conjugation. The latter definition leads to the variance of a complex variable,

$$ \sigma_z^2 = E\left[\left|\mathrm{z} - E[\mathrm{z}]\right|^2\right] = E\left[|\mathrm{z}|^2\right] - \left|E[\mathrm{z}]\right|^2. \qquad (2.38) $$

Similarly, for complex random vectors, $\mathbf{z} = \mathbf{x} + j\mathbf{y} \in \mathbb{C}^l$, we have that

$$ p(\mathbf{z}) := p(x_1, \ldots, x_l, y_1, \ldots, y_l), \qquad (2.39) $$

where $x_i, y_i$, $i = 1, 2, \ldots, l$, are the components of the involved real vectors, respectively. The covariance and correlation matrices are similarly defined, e.g.,

$$ \operatorname{Cov}(\mathbf{z}) := E\left[\left(\mathbf{z} - E[\mathbf{z}]\right)\left(\mathbf{z} - E[\mathbf{z}]\right)^H\right], \qquad (2.40) $$

where "H" denotes the Hermitian (transposition and conjugation) operation. For the rest of the chapter, we are going to deal mainly with real random variables and, whenever needed, differences with the case of complex variables will be stated.

2.2.5 Transformation of Random Variables

Let x and y be two random vectors, which are related via the vector transform

$$ \mathbf{y} = f(\mathbf{x}), \qquad (2.41) $$

where $f: \mathbb{R}^l \mapsto \mathbb{R}^l$ is an invertible transform; that is, given y, $\mathbf{x} = f^{-1}(\mathbf{y})$ can be uniquely obtained. We are given the joint pdf, $p_x(\mathbf{x})$, of x, and the task is to obtain the joint pdf, $p_y(\mathbf{y})$, of y.


Figure 2.1: Note that, by the definition of a pdf, $p_y(y)|\Delta y| = p_x(x)|\Delta x|$.

The Jacobian matrix of the transformation is defined as

$$ J(\mathbf{y}, \mathbf{x}) := \frac{\partial(y_1, y_2, \ldots, y_l)}{\partial(x_1, x_2, \ldots, x_l)} := \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \ldots & \frac{\partial y_1}{\partial x_l} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_l}{\partial x_1} & \ldots & \frac{\partial y_l}{\partial x_l} \end{bmatrix}. \qquad (2.42) $$

Then, it can be shown (e.g., [Papo 02]) that

$$ p_y(\mathbf{y}) = \left.\frac{p_x(\mathbf{x})}{|\det(J(\mathbf{y}, \mathbf{x}))|}\right|_{\mathbf{x} = f^{-1}(\mathbf{y})}, \qquad (2.43) $$

where |det(·)| denotes the absolute value of the determinant of a matrix. For real random variables, i.e., y = f(x), Equation (2.43) simplifies to

$$ p_y(y) = \left.\frac{p_x(x)}{\left|\frac{dy}{dx}\right|}\right|_{x = f^{-1}(y)}. \qquad (2.44) $$

The latter can be graphically understood from Figure 2.1. The following two events have equal probabilities:

$$ P(x < x \le x + \Delta x) = P(y + \Delta y < y \le y), \quad \Delta x > 0, \ \Delta y < 0. $$

Hence, by the definition of a pdf, we have

$$ p_y(y)|\Delta y| = p_x(x)|\Delta x|, \qquad (2.45) $$

which leads to (2.44).

Example 2.1. Consider two random vectors, which are related via the linear transform

$$ \mathbf{y} = A\mathbf{x}, \qquad (2.46) $$


where A is invertible. Compute the joint pdf of y in terms of $p_x(\mathbf{x})$.

The Jacobian of the transformation is easily computed and given (in the two-dimensional case, for illustration) by

$$ J(\mathbf{y}, \mathbf{x}) = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = A. $$

Hence,

$$ p_y(\mathbf{y}) = \frac{p_x(A^{-1}\mathbf{y})}{|\det(A)|}. \qquad (2.47) $$
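The result in (2.47) lends itself to a quick Monte Carlo sanity check. In the sketch below (Python with numpy; taking $p_x$ to be a standard two-dimensional Gaussian purely for illustration), the closed-form value of $p_y$ at a test point is compared with a simple box-counting density estimate of the transformed samples:

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])            # an invertible matrix (assumption)

x = rng.standard_normal((200_000, 2)) # samples of x, so p_x is known in closed form
y = x @ A.T                           # y = A x for each sample

def p_x(v):                           # standard Gaussian pdf in R^2
    return np.exp(-0.5 * np.sum(v**2, axis=1)) / (2 * np.pi)

y0 = np.array([1.0, 0.5])             # test point
formula = p_x((np.linalg.inv(A) @ y0)[None, :])[0] / abs(np.linalg.det(A))  # (2.47)

h = 0.2                               # half-width of a small box around y0
inside = np.all(np.abs(y - y0) < h, axis=1)
estimate = inside.mean() / (2 * h)**2 # fraction of samples in the box / box area

print(formula, estimate)              # the two numbers should be close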

2.3 Examples of Distributions

In this section, some notable examples of distributions are provided. These are popular for modelling the random nature of variables met in a wide range of applications, and they will be used later on in this book.

2.3.1 Discrete Variables

The Bernoulli Distribution

A random variable is said to be distributed according to a Bernoulli distribution if it is binary, $\mathcal{X} = \{0, 1\}$, with

$$ P(x = 1) = p, \quad P(x = 0) = 1 - p. $$

In a more compact way, we write x ∼ Bern(x|p), where

$$ P(x) = \operatorname{Bern}(x|p) := p^x(1 - p)^{1-x}. \qquad (2.48) $$

Its mean value is equal to

$$ E[x] = 1\cdot p + 0\cdot(1 - p) = p, \qquad (2.49) $$

and its variance is equal to

$$ \sigma_x^2 = (1 - p)^2 p + p^2(1 - p) = p(1 - p). \qquad (2.50) $$

The Binomial Distribution

A random variable, x, is said to follow a binomial distribution with parameters n, p, and we write x ∼ Bin(x|n, p), if $\mathcal{X} = \{0, 1, \ldots, n\}$ and

$$ P(x = k) := \operatorname{Bin}(k|n, p) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n. \qquad (2.51) $$

For example, this distribution models the number of times heads occurs in n successive coin tossing trials, where P(Head) = p. The binomial is a generalization of the Bernoulli


distribution, which results if we set n = 1 in (2.51). The mean and variance of the binomial distribution are (Problem 2.1)

$$ E[x] = np, \qquad (2.52) $$

and

$$ \sigma_x^2 = np(1 - p). \qquad (2.53) $$

Figure 2.2a shows the probability P(k) as a function of k for p = 0.4 and n = 9. Figure 2.2b shows the respective cumulative distribution. Observe that the latter has a staircase form, as is always the case for discrete variables.

Figure 2.2: a) The probability mass function (pmf) of the binomial distribution for p = 0.4 and n = 9. b) The respective cumulative probability distribution (cdf). Since the random variable is discrete, the cdf has a staircase-like graph.
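The numbers behind Figure 2.2 are easy to reproduce. A minimal sketch (Python with numpy; math.comb from the standard library supplies the binomial coefficient) evaluates (2.51)-(2.53) for p = 0.4 and n = 9:

import numpy as np
from math import comb

n, p = 9, 0.4
k = np.arange(n + 1)
pmf = np.array([comb(n, kk) * p**kk * (1 - p)**(n - kk) for kk in k])  # (2.51)
cdf = np.cumsum(pmf)                   # the staircase cdf of Figure 2.2b

print(pmf.round(4))
print(cdf.round(4))
print((k * pmf).sum(), n * p)                          # mean, cf. (2.52)
print(((k - n * p)**2 * pmf).sum(), n * p * (1 - p))   # variance, cf. (2.53)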

The Multinomial Distribution

This is a generalization of the binomial distribution, where the outcome of each experiment is not binary but can take one out of K possible values. For example, instead of tossing a coin, a die with K sides is thrown. Each one of the K possible outcomes has probability $P_1, P_2, \ldots, P_K$ of occurring, and we denote

$$ \mathbf{P} = [P_1, P_2, \ldots, P_K]^T. $$

After n experiments, assume that sides $x = 1, x = 2, \ldots, x = K$ occurred $x_1, x_2, \ldots, x_K$ times, respectively. We say that the random (discrete) vector

$$ \mathbf{x} = [x_1, x_2, \ldots, x_K]^T \qquad (2.54) $$

follows a multinomial distribution, x ∼ Mult(x|n, P), if

$$ P(\mathbf{x}) = \operatorname{Mult}(\mathbf{x}|n, \mathbf{P}) := \binom{n}{x_1, x_2, \ldots, x_K} \prod_{k=1}^{K} P_k^{x_k}, \qquad (2.55) $$


where

$$ \binom{n}{x_1, x_2, \ldots, x_K} = \frac{n!}{x_1!\, x_2! \cdots x_K!}. $$

Note that the variables $x_1, \ldots, x_K$ are subject to the constraint

$$ \sum_{k=1}^{K} x_k = n, $$

and also

$$ \sum_{k=1}^{K} P_k = 1. $$

The mean value, the variances and the covariances are given by

$$ E[\mathbf{x}] = n\mathbf{P}, \quad \sigma_k^2 = nP_k(1 - P_k), \ k = 1, 2, \ldots, K, \quad \operatorname{cov}(x_i, x_j) = -nP_iP_j, \ i \ne j. \qquad (2.56) $$
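The moments in (2.56) can be verified by sampling. A minimal sketch, assuming Python with numpy and a made-up probability vector P:

import numpy as np

rng = np.random.default_rng(0)
n, P = 20, np.array([0.2, 0.3, 0.5])      # a three-sided "die" (assumption)

X = rng.multinomial(n, P, size=100_000)   # each row is one random vector x

print(X.mean(axis=0), n * P)                            # E[x] = nP
print(X.var(axis=0), n * P * (1 - P))                   # sigma_k^2 = nP_k(1 - P_k)
print(np.cov(X[:, 0], X[:, 1])[0, 1], -n * P[0] * P[1]) # cov(x_i, x_j) = -nP_iP_j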

2.3.2 Continuous Variables

The Uniform Distribution

A random variable x is said to follow a uniform distribution in an interval (a, b), and we write x ∼ U(a, b), with a > −∞ and b < +∞, if

$$ p(x) = \begin{cases} \frac{1}{b-a}, & \text{if } a \le x \le b, \\ 0, & \text{otherwise}. \end{cases} \qquad (2.57) $$

Figure 2.3 shows the respective graph.

Figure 2.3: The pdf of a uniform distribution U(a, b).

The mean value is equal to

$$ E[x] = \frac{a + b}{2}, \qquad (2.58) $$


and the variance is given by (Problem 2.2)

$$ \sigma_x^2 = \frac{1}{12}(b - a)^2. \qquad (2.59) $$

The Gaussian Distribution

The Gaussian or normal distribution is one of the most widely used distributions in all scientific disciplines. We say that a random variable, x, is Gaussian or normal with parameters µ and σ², and we write x ∼ N(µ, σ²) or N(x|µ, σ²), if

$$ p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \qquad (2.60) $$

It can be shown that the corresponding mean and variance are

$$ E[x] = \mu \quad \text{and} \quad \sigma_x^2 = \sigma^2. \qquad (2.61) $$

Indeed, by the definition of the mean value we have that

$$ E[x] = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{+\infty} x \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{+\infty} (y + \mu) \exp\left(-\frac{y^2}{2\sigma^2}\right) dy, \qquad (2.62) $$

where the change of variables y = x − µ has been used. Due to the symmetry of the exponential function, the integration involving y gives zero and the only surviving term is the one due to µ. Taking into account that a pdf integrates to one, we obtain the result.

For the variance, we have that

$$ \int_{-\infty}^{+\infty} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \sqrt{2\pi}\sigma. \qquad (2.63) $$

Taking the derivative of both sides with respect to σ, we obtain

$$ \int_{-\infty}^{+\infty} \frac{(x - \mu)^2}{\sigma^3} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \sqrt{2\pi}, \qquad (2.64) $$

or

$$ \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^{+\infty} (x - \mu)^2 \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \sigma^2, \qquad (2.65) $$

which proves the claim.

Figure 2.4 shows the graph for two cases, N(x|1, 0.1) and N(x|1, 0.01). Both curves are symmetrically placed around the mean value µ = 1. Observe that the smaller the variance is, the sharper around the mean value the pdf becomes.


Figure 2.4: The graphs of two Gaussian pdfs for µ = 1 and σ² = 0.1 (red) and σ² = 0.01 (gray).

The generalization of the Gaussian to vector variables, $\mathbf{x} \in \mathbb{R}^l$, results in the so-called multivariate Gaussian or normal distribution, x ∼ N(x|µ, Σ), with parameters µ and Σ, which is defined as

$$ p(\mathbf{x}) = \frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}\left(\mathbf{x} - \boldsymbol{\mu}\right)^T \Sigma^{-1} \left(\mathbf{x} - \boldsymbol{\mu}\right)\right): \quad \text{Gaussian PDF}, \qquad (2.66) $$

where |·| denotes the determinant of a matrix. It can be shown (Problem 2.3) that

$$ E[\mathbf{x}] = \boldsymbol{\mu} \quad \text{and} \quad \operatorname{Cov}(\mathbf{x}) = \Sigma. \qquad (2.67) $$

Figure 2.5 shows the two-dimensional normal pdf for two cases. Both share the same mean value, µ = 0, but they have different covariance matrices, i.e.,

$$ \Sigma_1 = \begin{bmatrix} 0.1 & 0.0 \\ 0.0 & 0.1 \end{bmatrix}, \qquad \Sigma_2 = \begin{bmatrix} 0.1 & 0.01 \\ 0.01 & 0.2 \end{bmatrix}. \qquad (2.68) $$

Figure 2.6 shows the corresponding isovalue contours for equal density values. In 2.6a the contours are circles, corresponding to the symmetric pdf in Figure 2.5a with covariance matrix Σ₁. The one shown in Figure 2.6b corresponds to the pdf in Figure 2.5b, associated with Σ₂. Observe that, in general, the isovalue curves are ellipses/hyperellipsoids. They are centered at the mean value, and the orientation of the major axis, as well as their exact shape, is controlled by the eigenstructure of the associated covariance matrix. Indeed, all points $\mathbf{x} \in \mathbb{R}^l$ which score the same density value obey

$$ (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) = \text{constant} = c. \qquad (2.69) $$

We know that the covariance matrix is symmetric, $\Sigma = \Sigma^T$. Thus, its eigenvalues are real and the corresponding eigenvectors can be chosen to form an orthonormal basis (Appendix ??), which leads to its diagonalization, i.e.,

$$ \Sigma = U^T \Lambda U, \qquad (2.70) $$


Figure 2.5: The graphs of two two-dimensional Gaussian pdfs for µ = 0 and different covariance matrices. a) The covariance matrix is diagonal with equal elements along the diagonal. b) The corresponding covariance matrix is non-diagonal.

with

$$ U := [\mathbf{u}_1, \ldots, \mathbf{u}_l], \qquad (2.71) $$

where $\mathbf{u}_i$, $i = 1, 2, \ldots, l$, are the orthonormal eigenvectors, and

$$ \Lambda := \operatorname{diag}\{\lambda_1, \ldots, \lambda_l\} \qquad (2.72) $$

contains the respective eigenvalues. We assume that Σ is invertible; hence all eigenvalues are positive (being positive definite, it has positive eigenvalues, Appendix ??). Due to the orthonormality of the eigenvectors, the matrix U is unitary, i.e., $UU^T = U^TU = I$. Thus, (2.69) can now be written as

$$ \mathbf{y}^T \Lambda^{-1} \mathbf{y} = c, \qquad (2.73) $$

where we have used the linear transformation

$$ \mathbf{y} := U(\mathbf{x} - \boldsymbol{\mu}), \qquad (2.74) $$

which corresponds to a rotation of the axes by U and a translation of the origin to µ. Equation (2.73) can be written as

$$ \frac{y_1^2}{\lambda_1} + \ldots + \frac{y_l^2}{\lambda_l} = c, \qquad (2.75) $$

which is readily observed to be an equation describing a (hyper)ellipsoid in $\mathbb{R}^l$. From (2.74), it is easily seen that it is centered at µ and that the major axes of the ellipsoid are parallel to $\mathbf{u}_1, \ldots, \mathbf{u}_l$ (plug in place of x the unit direction $[1, 0, \ldots, 0]^T$, and so on). The sizes of the respective axes are controlled by the


corresponding eigenvalues. This is shown in Figure 2.6b. For the special case of a diagonal covariance matrix with equal elements across the diagonal, all eigenvalues are equal to the value of the common diagonal element and the ellipsoid becomes a (hyper)sphere (circle).

Figure 2.6: The isovalue contours for the two Gaussians of Figure 2.5. The contours for the Gaussian in 2.5a are circles, while those corresponding to Figure 2.5b are ellipses. The major and minor axes of the ellipse are determined by the eigenvectors/eigenvalues of the respective covariance matrix. For the case of a diagonal matrix with equal elements along the diagonal, all eigenvalues are equal and the ellipse becomes a circle.
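The eigenstructure argument developed above can be made concrete. The following sketch (Python with numpy) diagonalizes the matrix Σ₂ of (2.68) and reads off the orientation and the semi-axes of the isovalue ellipses:

import numpy as np

Sigma2 = np.array([[0.10, 0.01],
                   [0.01, 0.20]])

# Eigendecomposition of the symmetric covariance matrix, cf. (2.70)-(2.72).
lam, U = np.linalg.eigh(Sigma2)   # columns of U are orthonormal eigenvectors
print("eigenvalues :", lam)       # control the lengths of the ellipse axes
print("eigenvectors:", U)         # control the orientation of the axes

# For a contour (x - mu)^T Sigma^{-1} (x - mu) = c, (2.75) gives a semi-axis
# of length sqrt(c * lambda_i) along the i-th eigenvector; e.g., for c = 1:
print("semi-axes for c = 1:", np.sqrt(lam))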

The Gaussian pdf has a number of nice properties, which we are going to discover as we move on in this book. For the time being, note that if the covariance matrix is diagonal,

$$ \Sigma = \operatorname{diag}\{\sigma_1^2, \ldots, \sigma_l^2\}, $$

that is, when the covariances of all pairs of elements are zero, $\operatorname{cov}(x_i, x_j) = 0$, $i \ne j$, then the random variables comprising x are statistically independent. In general, this is not true; uncorrelated variables are not necessarily independent, since independence is a much stronger condition. It is true, however, if they follow a multivariate Gaussian. Indeed, if the covariance matrix is diagonal, then the multivariate Gaussian is written as

$$ p(\mathbf{x}) = \prod_{i=1}^{l} \frac{1}{\sqrt{2\pi}\sigma_i} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right). \qquad (2.76) $$

In other words,

$$ p(\mathbf{x}) = \prod_{i=1}^{l} p(x_i), \qquad (2.77) $$

which is the condition for statistical independence.


The Central Limit Theorem

This is one of the most fundamental theorems in probability theory and statistics, and it partly explains the popularity of the Gaussian distribution. Consider N mutually independent random variables, each following its own distribution, with mean values $\mu_i$ and variances $\sigma_i^2$, $i = 1, 2, \ldots, N$. Define a new random variable as their sum, i.e.,

$$ \mathrm{x} = \sum_{i=1}^{N} \mathrm{x}_i. \qquad (2.78) $$

Then the mean and variance of the new variable are given by

$$ E[\mathrm{x}] = \mu := \sum_{i=1}^{N} \mu_i, \quad \text{and} \quad \sigma_x^2 = \sigma^2 := \sum_{i=1}^{N} \sigma_i^2. \qquad (2.79) $$

It can be shown, e.g., [Papo 02, Jayn 03], that, as N → ∞, the distribution of the normalized variable

$$ \mathrm{z} = \frac{\mathrm{x} - \mu}{\sigma} \qquad (2.80) $$

tends to the standard normal distribution, and for the corresponding pdf we have that

$$ p(z) \xrightarrow{N \to \infty} \mathcal{N}(z|0, 1). \qquad (2.81) $$

In practice, even by summing up a relatively small number of random variables, one can obtain a good approximation to a Gaussian. For example, if the individual pdfs are smooth enough and the random variables are independent and identically distributed (iid), a number between 5 and 10 can be quite good.
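A minimal sketch of this effect (Python with numpy): summing as few as N = 8 iid uniform variables and normalizing as in (2.80) already matches the standard normal quantiles closely:

import numpy as np

rng = np.random.default_rng(0)
N = 8                                     # a small number of summed variables

x = rng.uniform(0, 1, size=(100_000, N)).sum(axis=1)
mu, sigma = N * 0.5, np.sqrt(N / 12.0)    # from (2.79); uniform: mean 1/2, variance 1/12
z = (x - mu) / sigma                      # the normalized variable of (2.80)

# Compare empirical quantiles of z with those of N(0, 1), i.e., -1, 0, +1.
print(np.quantile(z, [0.1587, 0.5, 0.8413]))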

The Exponential Distribution

We say that a random variable follows an exponential distribution with parameter λ > 0 if

$$ p(x) = \begin{cases} \lambda \exp(-\lambda x), & \text{if } x \ge 0, \\ 0, & \text{otherwise}. \end{cases} \qquad (2.82) $$

The distribution has been used, for example, to model the time between arrivals of telephone calls or of a bus at a bus stop. The mean and variance can be easily computed by following simple integration rules, and they are

$$ E[x] = \frac{1}{\lambda}, \qquad \sigma_x^2 = \frac{1}{\lambda^2}. \qquad (2.83) $$


Figure 2.7: The graphs of the pdfs of the Beta distribution for different values of the parameters. a) The dotted line corresponds to a = 1, b = 1, the gray line to a = 0.5, b = 0.5, and the red one to a = 3, b = 3. b) The gray line corresponds to a = 2, b = 3 and the red one to a = 8, b = 4. For values a = b, the shape is symmetric around 1/2. For a < 1, b < 1 it is convex. For a > 1, b > 1, it is zero at x = 0 and x = 1. For a = b = 1, it becomes the uniform distribution. If a < 1, p(x) → ∞ as x → 0, and if b < 1, p(x) → ∞ as x → 1.

The Beta Distribution

We say that a random variable, x ∈ [0, 1], follows a beta distribution with positive parameters, a, b, and we write x ∼ Beta(x|a, b), if

$$ p(x) = \begin{cases} \frac{1}{B(a,b)} x^{a-1}(1 - x)^{b-1}, & \text{if } 0 \le x \le 1, \\ 0, & \text{otherwise}, \end{cases} \qquad (2.84) $$

where B(a, b) is the beta function, defined as

$$ B(a, b) := \int_0^1 x^{a-1}(1 - x)^{b-1} dx. \qquad (2.85) $$

The mean and variance of the beta distribution are given by (Problem 2.4)

$$ E[x] = \frac{a}{a + b}, \qquad \sigma_x^2 = \frac{ab}{(a + b)^2(a + b + 1)}. \qquad (2.86) $$

Moreover, it can be shown (Problem 2.5) that

$$ B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}, \qquad (2.87) $$

where Γ(·) is the gamma function, defined as

$$ \Gamma(a) = \int_0^\infty x^{a-1} e^{-x} dx. \qquad (2.88) $$


The beta distribution is very flexible, and one can achieve various shapes by changing the parameters a, b. For example, if a = b = 1, the uniform distribution results. If a = b, the pdf has a symmetric graph around 1/2. If a > 1, b > 1, then p(x) → 0 both at x = 0 and x = 1. If a < 1 and b < 1, it is convex with a unique minimum. If a < 1, it tends to ∞ as x → 0, and if b < 1, it tends to ∞ as x → 1. Figures 2.7a and 2.7b show the graph of the beta distribution for different values of the parameters.

Figure 2.8: The pdf of the gamma distribution takes different shapes for various values of the parameters: a = 0.5, b = 1 (full gray line), a = 2, b = 0.5 (red), a = 1, b = 2 (dotted).

The Gamma Distribution

A random variable follows the gamma distribution with positive parameters a, b, and we write x ∼ Gamma(x|a, b), if

$$ p(x) = \begin{cases} \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}, & x > 0, \\ 0, & \text{otherwise}. \end{cases} \qquad (2.89) $$

The mean and variance are given by

$$ E[x] = \frac{a}{b}, \qquad \sigma_x^2 = \frac{a}{b^2}. \qquad (2.90) $$

The gamma distribution also takes various shapes by varying the parameters. For a < 1, it is strictly decreasing, with p(x) → ∞ as x → 0 and p(x) → 0


as x → ∞. Figure 2.8 shows the resulting graphs for various values of the parameters.

Remarks 2.1.

• Setting a in the gamma distribution to be an integer (usually a = 2), the so-called Erlang distribution results. This distribution is used to model waiting times in queueing systems.

• The chi-squared is also a special case of the gamma distribution; it is obtained by setting b = 1/2 and a = ν/2. The chi-squared distribution results if we sum up ν squared standard normal variables.

Figure 2.9: The 2-dimensional simplex in $\mathbb{R}^3$.

The Dirichlet Distribution

The Dirichlet distribution can be considered as the multivariate generalization of the beta distribution. Let $\mathbf{x} = [x_1, \ldots, x_K]^T$ be a random vector, with components such that

$$ 0 \le x_k \le 1, \ k = 1, 2, \ldots, K, \quad \text{and} \quad \sum_{k=1}^{K} x_k = 1. \qquad (2.91) $$

In other words, the random variables lie on a (K−1)-dimensional simplex, Figure 2.9. We say that the random vector x follows a Dirichlet distribution with parameters $\mathbf{a} = [a_1, \ldots, a_K]^T$, and we write x ∼ Dir(x|a), if


Figure 2.10: The Dirichlet distribution over the 2D-simplex for a) (0.1, 0.1, 0.1), b) (1, 1, 1) and c) (10, 10, 10).

$$ p(\mathbf{x}) = \operatorname{Dir}(\mathbf{x}|\mathbf{a}) := \frac{\Gamma(a)}{\Gamma(a_1)\cdots\Gamma(a_K)} \prod_{k=1}^{K} x_k^{a_k - 1}, \qquad (2.92) $$

where

$$ a = \sum_{k=1}^{K} a_k. \qquad (2.93) $$

The mean, variances and covariances of the involved random variables are given by (Problem 2.7)

$$ E[\mathbf{x}] = \frac{1}{a}\mathbf{a}, \qquad \sigma_k^2 = \frac{a_k(a - a_k)}{a^2(a + 1)}, \qquad \operatorname{cov}(x_i, x_j) = -\frac{a_i a_j}{a^2(a + 1)}. \qquad (2.94) $$

Figure 2.10 shows the graph of the Dirichlet distribution for different values of the parameters, over the respective 2D-simplex.
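A quick numerical check of (2.91) and (2.94), using numpy's built-in Dirichlet sampler and an arbitrary, illustrative parameter vector:

import numpy as np

rng = np.random.default_rng(0)
avec = np.array([2.0, 3.0, 5.0])      # parameter vector a (assumption)
a = avec.sum()                        # the normalizing sum of (2.93)

X = rng.dirichlet(avec, size=100_000) # samples lying on the 2D simplex

print(np.allclose(X.sum(axis=1), 1.0))       # the simplex constraint (2.91)
print(X.mean(axis=0), avec / a)              # E[x] = a_vec / a, cf. (2.94)
print(X.var(axis=0), avec * (a - avec) / (a**2 * (a + 1)))  # variances, cf. (2.94)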

2.4 Stochastic Processes

The notion of a random variable has been introduced to describe the result of a random experiment whose outcome is a single value; for example, head or tail in a coin tossing experiment, or a value between one and six when throwing the die in a backgammon game.

In this section, the notion of a stochastic process is introduced to describe random experiments where the outcome of each experiment is a function or a sequence; in other words, the outcome of each experiment is an infinite number of values. In this book, we are only going to be concerned with stochastic processes associated with sequences. Thus, the result of a random experiment is a sequence, $u_n$ (sometimes denoted as u(n)), $n \in \mathbb{Z}$, where ℤ is the set of integers. Usually, n is interpreted as a time index, and $u_n$ is called a time series or, in the signal processing jargon, a discrete-time signal. In contrast, if the outcome is a function, u(t), it is called a continuous-time signal. We are


going to adopt the time interpretation of the free variable, n, for the rest of the chapter, without harming generality.

When discussing random variables, we used the notation x to denote the random variable, which assumes a value, x, from the sample space once an experiment is performed. Similarly, we are going to use $u_n$ to denote the specific sequence resulting from a single experiment, and the roman font, $\mathrm{u}_n$, to denote the corresponding discrete-time random process; that is, the rule that assigns a specific sequence as the outcome of an experiment. A stochastic process can be considered as a family or ensemble of sequences. The individual sequences are known as sample sequences or simply as realizations.

For our notational convention, we use the symbol u and not x; this is only for pedagogical reasons, just to make sure that the reader readily recognizes when the focus is on random variables and when it is on random processes. In the signal processing jargon, a stochastic process is also known as a random signal. Figure 2.11 illustrates what we have just said: the outcome of an experiment involving a stochastic process is a sequence of values.

Figure 2.11: The outcome of each experiment, associated with a discrete-time stochastic process, is a sequence of values. For each one of the realizations, the corresponding values obtained at any instant, e.g., n or m, comprise the outcomes of a corresponding random variable, $\mathrm{u}_n$ or $\mathrm{u}_m$, respectively.

Note that, fixing the time to a specific value, e.g., $n = n_0$, $\mathrm{u}_{n_0}$ is a random variable. Indeed, for each random experiment we perform, a single value results at time instant $n_0$. From this perspective, a random process can be considered as a collection of infinitely many random variables, i.e., $\{\mathrm{u}_n, n \in \mathbb{Z}\}$. So, is there a need to study a stochastic process separately from random variables/vectors? The answer is yes, and the reason is that we are going to allow certain time dependencies among the random variables corresponding to different time instants and study the respective effect on the time evolution of the random process. Stochastic processes will be considered in Chapters ??, where the underlying time dependencies will be exploited for computational simplifications, and in ?? in the context of Gaussian processes.


2.4.1 First and Second Order Statistics

For a stochastic process to be fully described, one must know the joint pdfs (pmfs for discrete-valued random variables)

$$ p(u_n, u_m, \ldots, u_r; n, m, \ldots, r), \qquad (2.95) $$

for all possible combinations of the random variables $\mathrm{u}_n, \mathrm{u}_m, \ldots, \mathrm{u}_r$. Note that we have explicitly indicated the dependence of the joint pdfs on the corresponding time indices. However, in practice, and certainly in this book, the emphasis is on computing first and second order statistics only, based on p(u_n; n) and p(u_n, u_m; n, m). To this end, the following quantities are of particular interest.

Mean at Time n:

$$ \mu_n := E[\mathrm{u}_n] = \int_{-\infty}^{+\infty} u_n\, p(u_n; n)\, du_n. \qquad (2.96) $$

Autocovariance at Time Instants n, m:

$$ \operatorname{cov}(n, m) := E\left[\left(\mathrm{u}_n - E[\mathrm{u}_n]\right)\left(\mathrm{u}_m - E[\mathrm{u}_m]\right)\right]. \qquad (2.97) $$

Autocorrelation at Time Instants n, m:

$$ r(n, m) := E[\mathrm{u}_n \mathrm{u}_m]. \qquad (2.98) $$

We refer to these mean values as ensemble averages, to stress that they convey statistical information over the ensemble of sequences that comprise the process.

The respective definitions for complex stochastic processes are

$$ \operatorname{cov}(n, m) = E\left[\left(\mathrm{u}_n - E[\mathrm{u}_n]\right)\left(\mathrm{u}_m - E[\mathrm{u}_m]\right)^*\right] \qquad (2.99) $$

and

$$ r(n, m) = E[\mathrm{u}_n \mathrm{u}_m^*]. \qquad (2.100) $$

2.4.2 Stationarity and Ergodicity

Definition 2.1. Strict-Sense Stationarity: A stochastic process, $\mathrm{u}_n$, is said to be strict-sense stationary (SSS) if its statistical properties are invariant to a shift of the origin, i.e., if, ∀k ∈ ℤ,

$$ p(u_n, u_m, \ldots, u_r) = p(u_{n+k}, u_{m+k}, \ldots, u_{r+k}), \qquad (2.101) $$

for any possible combination of time instants, $n, m, \ldots, r \in \mathbb{Z}$.

In other words, the stochastic processes $\mathrm{u}_n$ and $\mathrm{u}_{n+k}$ are described by the same joint pdfs of all orders. A weaker version of stationarity is that of mth order stationarity, where joint pdfs involving up to m variables are invariant to the choice of the origin. For example, for a second order (m = 2) stationary process, we have that $p(u_n) = p(u_{n+k})$ and $p(u_n, u_r) = p(u_{n+k}, u_{r+k})$, $\forall n, r, k \in \mathbb{Z}$.


Definition 2.2. Wide-Sense Stationarity: A stochastic process, $\mathrm{u}_n$, is said to be wide-sense stationary (WSS) if its mean value is constant over all time instants and its autocorrelation/autocovariance sequences depend only on the difference of the involved time indices, i.e.,

$$ \mu_n = \mu, \quad \text{and} \quad r(n, n - k) = r(k). \qquad (2.102) $$

Note that WSS is a weaker version of second order stationarity; in the latter case, all possible second order statistics are independent of the time origin, while in the former we only require the autocorrelation (autocovariance) to be independent of the time origin. The reason that we focus on the autocorrelation is that this quantity is of major importance in the study of linear systems and in mean square estimation, as we will see in Chapter ??.

Obviously, a strict-sense stationary process is also wide-sense stationary but, in general, not the other way round. For stationary processes, the autocorrelation becomes a sequence with a single time index as the free parameter; thus its value, which measures a relation of the variables at two time instants, depends solely on how much these time instants differ, and not on their specific values.

From our basic statistics course, we know that, given a random variable, x, its mean value can be approximated by the sample mean. Carrying out N successive independent experiments, let $x_n$, $n = 1, 2, \ldots, N$, be the obtained values, known as observations. The sample mean is defined as

$$ \hat{\mu}_N := \frac{1}{N} \sum_{n=1}^{N} x_n. \qquad (2.103) $$

For large enough values of N, we expect the sample mean to be close to the true mean value, E[x]. In a more formal way, this is guaranteed by the fact that $\hat{\mu}_N$ is an unbiased and consistent estimator. We are going to discuss such issues in Chapter ??; however, we can refresh our memory at this point. Every time we repeat the N random experiments, different samples result, and hence a different estimate $\hat{\mu}_N$ is computed. Thus, $\hat{\mu}_N$ is itself a random variable. It is unbiased because it can easily be shown that

$$ E[\hat{\mu}_N] = E[x], \qquad (2.104) $$

and it is consistent because its variance tends to zero as N → +∞ (Problem 2.8). These two properties guarantee that, with high probability, for large values of N, $\hat{\mu}_N$ will be close to the true mean value.

To apply the concept of sample mean approximation to random processes, one must have at one's disposal a number, say N, of realizations and compute the sample mean at different time instants "across the process", i.e., using different realizations representing the ensemble of sequences. Similarly, sample mean arguments can be used to approximate the autocovariance/autocorrelation sequences. However, this is a costly operation, since now each experiment results in an infinite number of values (a sequence of values). Moreover, it is common in practical applications that only one realization is available to the user.


To this end, we are now going to define a special type of stochastic process, for which the sample mean operation can be significantly simplified.

Definition 2.3. Ergodicity: A stochastic process is said to be ergodic if its complete statistics can be determined from any single one of its realizations.

In other words, if a process is ergodic, every single realization carries identical statistical information and can describe the entire random process. Since from a single sequence only one set of pdfs can be obtained, we conclude that every ergodic process is necessarily stationary; a nonstationary process has infinite sets of pdfs, depending upon the choice of the origin. For example, there is only one mean value which can result from a single realization, and it can be obtained as a (time) average over the values of the sequence. Hence, the mean value of a stochastic process which is ergodic must be constant for all time instants, i.e., independent of the time origin. The same is true for all higher order statistics.

A special type of ergodicity is that of second order ergodicity. This means that only statistics up to second order can be obtained from a single realization. Second order ergodic processes are necessarily wide-sense stationary. For second order ergodic processes, the following are true:

$$ E[\mathrm{u}_n] = \mu = \lim_{N\to\infty} \hat{\mu}_N, \qquad (2.105) $$

where

$$ \hat{\mu}_N := \frac{1}{2N + 1} \sum_{n=-N}^{N} u_n. $$

Also,

$$ \operatorname{cov}(k) = \lim_{N\to\infty} \frac{1}{2N + 1} \sum_{n=-N}^{N} (u_n - \mu)(u_{n-k} - \mu), \qquad (2.106) $$

where the limits are in the mean square sense; that is,

$$ \lim_{N\to\infty} E\left[\left|\hat{\mu}_N - \mu\right|^2\right] = 0, $$

and similarly for the autocovariance. Note that, often, ergodicity is only assumed for the computation of the mean and covariance and not for all possible second order statistics. In this case, we talk about mean-ergodic and covariance-ergodic processes.

In summary, when ergodic processes are involved, ensemble averages "across the process" can be obtained as time averages "along the process"; see Figure 2.12. In practice, when only a finite number of samples from a realization is available, the mean and covariance are approximated by the respective sample means.

An issue is to establish conditions under which a process is mean-ergodic or covariance-ergodic. Such conditions do exist, and the interested reader can


Figure 2.12: For ergodic processes, ensemble averages at each time instant ("across" the process) can be computed as time averages "along" the process.

dig out such information from more specialized books, e.g., [Papo 02]. It turns out that the condition for mean-ergodicity relies on second order statistics and the condition for covariance-ergodicity on fourth order statistics.

It is very common in Statistics, as well as in Machine Learning and Signal Processing, to subtract the mean value from the data during the so-called preprocessing stage. In such a case, we say that the data are centered. The resulting new process has zero mean value, and the covariance and autocorrelation sequences coincide. From now on, we are going to assume that the mean is known (or computed as a sample mean) and then subtracted. Such a treatment simplifies the analysis without harming generality.

Example 2.2. The goal of this example is to construct a process which is WSS, yet not ergodic. Let $\mathrm{x}_n$ be a WSS process, i.e.,

$$ E[\mathrm{x}_n] = \mu $$

and

$$ E[\mathrm{x}_n \mathrm{x}_{n-k}] = r_x(k). $$

Define the process

$$ \mathrm{z}_n := \mathrm{a}\mathrm{x}_n, \qquad (2.107) $$

where a is a random variable taking values in {0, 1}, with probabilities P(0) = P(1) = 0.5. Moreover, a and $\mathrm{x}_n$ are statistically independent. Then we have that

$$ E[\mathrm{z}_n] = E[\mathrm{a}\mathrm{x}_n] = E[\mathrm{a}]E[\mathrm{x}_n] = 0.5\mu, \qquad (2.108) $$

and

$$ E[\mathrm{z}_n \mathrm{z}_{n-k}] = E[\mathrm{a}^2]E[\mathrm{x}_n \mathrm{x}_{n-k}] = 0.5 r_x(k). \qquad (2.109) $$

Thus, $\mathrm{z}_n$ is WSS. However, it is not covariance-ergodic. Indeed, some of the realizations will be identically zero (those with a = 0), and the mean value and autocorrelation which result from them as time averages will be zero, which is different from the ensemble averages.


2.4.3 Power Spectral Density

The Fourier transform is an indispensable tool for representing, in a compact way in the frequency domain, the variations that a function/sequence undergoes in terms of its free variable (e.g., time). Stochastic processes are inherently related to time, so the question now raised is whether stochastic processes can be described in terms of a Fourier transform. The answer is in the affirmative, and the vehicle to achieve this is the autocorrelation sequence, for processes which are at least wide-sense stationary. Prior to providing the necessary definitions, it is useful to summarize some commonly used properties of the autocorrelation sequence.

Properties of the Autocorrelation Sequence

Let $\mathrm{u}_n$ be a wide-sense stationary process. Its autocorrelation sequence has the following properties, which are given for the more general complex-valued case:

• Property I.

$$ r(k) = r^*(-k), \quad \forall k \in \mathbb{Z}. \qquad (2.110) $$

This property is a direct consequence of the invariance with respect to the choice of the origin. Indeed,

$$ r(k) = E[\mathrm{u}_n \mathrm{u}_{n-k}^*] = E[\mathrm{u}_{n+k} \mathrm{u}_n^*] = r^*(-k). $$

• Property II.

$$ r(0) = E\left[|\mathrm{u}_n|^2\right]. \qquad (2.111) $$

That is, the value of the autocorrelation at k = 0 is equal to the mean square value of the process. Interpreting the square of a variable as its energy, r(0) can be interpreted as the corresponding (average) power.

• Property III.

$$ r(0) \ge |r(k)|, \quad \forall k \ne 0. \qquad (2.112) $$

The proof is provided in Problem 2.9. In other words, the correlation of the variables corresponding to two different time instants cannot be larger (in magnitude) than r(0). As we will see in Chapter ??, this property is essentially the Cauchy-Schwarz inequality for inner products (see also the Appendix of Chapter ??).

• Property IV. The autocorrelation of a stochastic process is positive definite; that is,

$$ \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m^* r(n, m) \ge 0, \quad \forall a_n \in \mathbb{C}, \ n = 1, 2, \ldots, N, \ \forall N \in \mathbb{Z}. \qquad (2.113) $$


Proof: The proof follows directly from the definition of the autocorrelation:

$$ 0 \le E\left[\left|\sum_{n=1}^{N} a_n \mathrm{x}_n\right|^2\right] = \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m^* E[\mathrm{x}_n \mathrm{x}_m^*], \qquad (2.114) $$

which proves the claim. Note that, strictly speaking, we should say that it is semipositive definite; however, the "positive definite" name is the one that has survived in the literature. This property will be useful when introducing Gaussian processes in Chapter ??.

• Property V. Let $\mathrm{u}_n$ and $\mathrm{v}_n$ be two WSS processes. Define the new process

$$ \mathrm{z}_n = \mathrm{u}_n + \mathrm{v}_n. $$

Then,

$$ r_z(k) = r_u(k) + r_v(k) + r_{uv}(k) + r_{vu}(k), \qquad (2.115) $$

where the cross-correlation between two jointly WSS stochastic processes is defined as

$$ r_{uv}(k) := E[\mathrm{u}_n \mathrm{v}_{n-k}^*], \ k \in \mathbb{Z}: \quad \text{Cross-correlation}. \qquad (2.116) $$

The proof is a direct consequence of the definition. Note that if the two processes are uncorrelated, i.e., $r_{uv}(k) = r_{vu}(k) = 0$, then

$$ r_z(k) = r_u(k) + r_v(k). $$

Obviously, this is also true if the processes $\mathrm{u}_n$ and $\mathrm{v}_n$ are independent and of zero mean value, since then $E[\mathrm{u}_n \mathrm{v}_{n-k}^*] = E[\mathrm{u}_n]E[\mathrm{v}_{n-k}^*] = 0$. It should be stressed here that uncorrelatedness is a weaker condition and does not necessarily imply independence; the opposite is true, for zero mean values.

• Property VI.

$$ r_{uv}(k) = r_{vu}^*(-k). \qquad (2.117) $$

The proof is similar to that of Property I.

• Property VII.

$$ r_u(0)r_v(0) \ge |r_{uv}(k)|^2, \quad \forall k \in \mathbb{Z}. \qquad (2.118) $$

The proof is also given in Problem 2.9.

Power Spectral Density

Definition 2.4. Given a WSS stochastic process, $\mathrm{u}_n$, its power spectral density (PSD) (or simply its power spectrum) is defined as the Fourier transform of its autocorrelation sequence, i.e.,

$$ S(\omega) := \sum_{k=-\infty}^{\infty} r(k) \exp(-j\omega k): \quad \text{Power Spectral Density}. \qquad (2.119) $$


Using the Fourier transform properties, we can recover the autocorrelation sequence via the inverse Fourier transform, i.e.,

$$ r(k) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} S(\omega) \exp(j\omega k)\, d\omega. \qquad (2.120) $$

Due to the properties of the autocorrelation sequence, the PSD has some interesting properties, useful from a practical point of view.

Properties of the PSD

• The PSD of a WSS stochastic process is a real and non-negative function of ω. Indeed, we have that

$$ \begin{aligned} S(\omega) &= \sum_{k=-\infty}^{+\infty} r(k) \exp(-j\omega k) \\ &= r(0) + \sum_{k=-\infty}^{-1} r(k) \exp(-j\omega k) + \sum_{k=1}^{\infty} r(k) \exp(-j\omega k) \\ &= r(0) + \sum_{k=1}^{+\infty} r^*(k) \exp(j\omega k) + \sum_{k=1}^{\infty} r(k) \exp(-j\omega k) \\ &= r(0) + 2\sum_{k=1}^{+\infty} \operatorname{Re}\big(r(k) \exp(-j\omega k)\big), \end{aligned} \qquad (2.121) $$

which proves that the PSD is a real number; in the proof, Property I of the autocorrelation sequence has been used. We defer the proof of the non-negative part until Section 2.4.3.

• The area under the graph of S(ω) is equal to the power of the stochastic process, i.e.,

$$ E[|\mathrm{u}_n|^2] = r(0) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} S(\omega)\, d\omega, \qquad (2.122) $$

which is obtained from (2.120) by setting k = 0. We will come to the physical meaning of this property very soon.

Transmission Through a Linear System

One of the most important tasks in Signal Processing and Systems Theory is the linear filtering operation on an input time series (signal) to generate another output sequence. The block diagram of the filtering operation is shown in Figure 2.13. From linear system theory and signal processing basics, it is known that for a class of linear systems, known as linear time-invariant (LTI) systems, the input-output relation is given via the elegant convolution between the input sequence


Figure 2.13: The linear system (filter) is excited by the input sequence (signal), $u_n$, and provides the output sequence (signal), $d_n$.

and the impulse response of the filter, i.e.,

$$ d_n = w_n * u_n := \sum_{k=-\infty}^{+\infty} w_k^* u_{n-k}: \quad \text{Convolution Sum}, \qquad (2.123) $$

where $\ldots, w_0, w_1, w_2, \ldots$ are the parameters comprising the impulse response describing the filter, e.g., [Proa 92]. In case the impulse response is of finite duration, for example $w_0, w_1, \ldots, w_{l-1}$, with the rest of the values being zero, the convolution can be written as

$$ d_n = \sum_{k=0}^{l-1} w_k^* u_{n-k} = \mathbf{w}^H \mathbf{u}_n, \qquad (2.124) $$

where

$$ \mathbf{w} := [w_0, w_1, \ldots, w_{l-1}]^T, \qquad (2.125) $$

and

$$ \mathbf{u}_n := [u_n, u_{n-1}, \ldots, u_{n-l+1}]^T \in \mathbb{R}^l. \qquad (2.126) $$

The latter is known as the input vector of order l at time n. It is interesting to note that this is a random vector; however, its elements are part of the same stochastic process at successive time instants. This gives the respective autocorrelation matrix certain properties and a rich structure, which will be studied and exploited in Chapter ??. As a matter of fact, this is the reason that we used a different symbol when dealing with processes than with general random vectors; thus, the reader can readily remember that, when dealing with processes, the elements of the involved random vectors have this extra structure. Moreover, observe from Eq. (2.123) that, if the impulse response of the system is zero for negative values of the time index, n, this guarantees causality; that is, the output depends only on the values of the input at the current and previous time instants, and there is no dependence on future values. As a matter of fact, this is also a necessary condition for causality; that is, if the system is causal, then its impulse response is zero for negative time instants, e.g., [Proa 92].

Theorem 2.1. The power spectral density of the output, d_n, of a linear time invariant system, when it is excited by a WSS stochastic process, u_n, is given


by

S_d(ω) = |W(ω)|^2 S_u(ω),    (2.127)

where

W(ω) := Σ_{n=−∞}^{+∞} w_n e^{−jωn}.    (2.128)

Proof: First, it is shown (Problem 2.10) that

r_d(k) = r_u(k) ∗ w_k ∗ w_{−k}^*.    (2.129)

Then, taking the Fourier transform of both sides, we obtain (2.127). To this end, we used the well-known properties of the Fourier transform, i.e.,

r_u(k) ∗ w_k ⟼ S_u(ω)W(ω), and w_{−k}^* ⟼ W^*(ω).
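As a quick numerical sanity check of the theorem (a simulation sketch only, with an arbitrarily chosen impulse response), one can excite an FIR filter with white noise, for which, as shown in Example 2.3 below, S_u(ω) = σ^2, a constant, and compare the averaged periodogram of the output with |W(ω)|^2 σ^2:

import numpy as np

rng = np.random.default_rng(1)

sigma2 = 1.0                            # white-noise input: S_u(omega) = sigma^2
w = np.array([1.0, -0.8, 0.3])          # hypothetical FIR impulse response

nfft, n_seg = 512, 2000
Sd_est = np.zeros(nfft)
for _ in range(n_seg):
    u = rng.standard_normal(nfft) * np.sqrt(sigma2)
    d = np.convolve(u, w)[:nfft]
    Sd_est += np.abs(np.fft.fft(d)) ** 2 / nfft
Sd_est /= n_seg                         # averaged periodogram of the output

Sd_theory = np.abs(np.fft.fft(w, nfft)) ** 2 * sigma2   # |W(omega)|^2 S_u(omega)
print(np.mean(np.abs(Sd_est - Sd_theory)))              # small estimation error

The averaged periodogram is only an estimate of the PSD, so the agreement is up to estimation error, which shrinks as more segments are averaged.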

Figure 2.14: An ideal bandpass filter. The output contains frequencies only in the range |ω − ω_0| < ∆ω/2.

Physical Interpretation of the PSD

We are now ready to justify why the Fourier transform of the autocorrelation sequence was given the specific name of "power spectral density". Figure 2.14 shows the Fourier transform of the impulse response of a very special linear system. The Fourier transform is unity for any frequency in the range |ω − ω_0| ≤ ∆ω/2 and zero otherwise. Such a system is known as a bandpass filter. We assume that ∆ω is very small. Then, using (2.127) and assuming that, within the interval |ω − ω_0| ≤ ∆ω/2, S_u(ω) ≈ S_u(ω_0), we have that

S_d(ω) = { S_u(ω_0), if |ω − ω_0| ≤ ∆ω/2,
         { 0,        otherwise.    (2.130)


Hence,

E[|d_n|^2] = r_d(0) := ∆P = (1/2π) ∫_{−π}^{+π} S_d(ω) dω = S_u(ω_0) ∆ω/π,    (2.131)

where real data have been assumed, which guarantees the symmetry of the (magnitude of the) Fourier transform (S_u(ω) = S_u(−ω)), so that the two bands around ±ω_0 contribute equally. Hence,

(1/π) S_u(ω_0) = ∆P/∆ω.    (2.132)

In other words, the value S_u(ω_0) can be interpreted as the power density (power per frequency interval) in the frequency (spectrum) domain.

Moreover, this also establishes what was said before, that the PSD is a non-negative real function, for any value of ω ∈ [−π, +π]. (The PSD, being the Fourier transform of a sequence, is periodic with period 2π, e.g., [Proa 92].)

Remarks 2.2.

• Note that for any WSS stochastic process, there is only one autocorrelation sequence that describes it. However, the converse is not true. A single autocorrelation sequence can correspond to more than one WSS process. Recall that the autocorrelation is the mean value of a product of random variables, and many random variables can have the same mean value.

• We have shown that the Fourier transform, S(ω), of an autocorrelation sequence, r(k), is non-negative. Moreover, if a sequence, r(k), has a non-negative Fourier transform, then it is positive definite and we can always construct a WSS process that has r(k) as its autocorrelation sequence (e.g., [Papo 02, pages 410, 421]). Thus, the necessary and sufficient condition for a sequence to be an autocorrelation sequence is the non-negativity of its Fourier transform.

Example 2.3. White Noise Sequence. A stochastic process, η_n, is said to be white noise if its mean and autocorrelation sequence satisfy

E[η_n] = 0 and r(k) = { σ_η^2, if k = 0,
                       { 0,     if k ≠ 0,    : White Noise,    (2.133)

where σ_η^2 is its variance. In other words, all variables at different time instants are uncorrelated. If, in addition, they are independent, we say that it is strictly white noise. It is readily seen that its PSD is given by

S_η(ω) = σ_η^2.    (2.134)

That is, it is constant, and this is the reason it is called white noise, in analogy to white light, whose spectrum is equally spread over all wavelengths.
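A tiny simulation sketch (Python/NumPy, with an arbitrarily chosen variance) confirms the defining property: the sample autocorrelation is close to σ_η^2 at lag zero and close to zero at all other lags.

import numpy as np

rng = np.random.default_rng(2)
sigma2 = 2.0
eta = rng.standard_normal(200_000) * np.sqrt(sigma2)

# Sample autocorrelation r(k) = E[eta_n eta_{n-k}] at the first few lags
for k in range(4):
    r_k = np.mean(eta[k:] * eta[:len(eta) - k])
    print(k, round(r_k, 3))             # ~2.0 for k = 0 and ~0.0 otherwise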


2.4.4 Autoregressive Models

We have just seen an example of a stochastic process, namely white noise. We now turn our attention to generating WSS processes via appropriate modeling. In this way, we are going to introduce controlled correlation among the variables corresponding to the various time instants.

Autoregressive processes are among the most popular and widely used models. An autoregressive process of order l, denoted as AR(l), is defined via the following difference equation,

u_n + a_1 u_{n−1} + . . . + a_l u_{n−l} = η_n :  Autoregressive Process,    (2.135)

where η_n is a white noise process with variance σ_η^2.

As is always the case with any difference equation, one starts from some initial conditions and then generates samples recursively, by plugging the input sequence samples into the model. The input samples here correspond to a white noise sequence, and the initial conditions are set equal to zero, u_{−1} = . . . = u_{−l} = 0.

There is no need to mobilize mathematics to see that such a process is not stationary. Indeed, time instant n = 0 is distinctly different from all the rest, since it is the time at which the initial conditions are applied. However, the effects of the initial conditions tend asymptotically to zero if all the roots of the corresponding characteristic polynomial,

z^l + a_1 z^{l−1} + . . . + a_l = 0,

have magnitude less than unity (the solution of the corresponding homogeneous equation, without input, then tends to zero), e.g., [Prie 81]. Then, it can be shown that, asymptotically, the AR(l) process becomes WSS. This is the assumption usually adopted in practice, and it will be the case for us for the rest of this section. Note that the mean value of the process is zero (try it).

The goal now becomes to compute the corresponding autocorrelation sequence, r(k), k ∈ Z. Multiplying both sides of (2.135) by u_{n−k}, k > 0, and taking expectations, we obtain

Σ_{i=0}^{l} a_i E[u_{n−i} u_{n−k}] = E[η_n u_{n−k}], k > 0,

where a_0 := 1, or

Σ_{i=0}^{l} a_i r(k − i) = 0, k > 0.    (2.136)

We have used the fact that E[η_n u_{n−k}] = 0 for k > 0. Indeed, u_{n−k} depends recursively on η_{n−k}, η_{n−k−1}, . . . , which are all uncorrelated with η_n, since the latter is a white noise process. Note that (2.136) is a difference equation, which can be


solved provided we have the initial conditions. To this end, multiply (2.135) by u_n and take expectations, which results in

Σ_{i=0}^{l} a_i r(i) = σ_η^2,    (2.137)

since u_n depends recursively on η_n, which contributes the σ_η^2 term, and on η_{n−1}, . . . , which contribute zeros. Combining (2.137) with (2.136), the following linear system of equations results,

⎡ r(0)   r(1)    . . .  r(l)   ⎤ ⎡ 1   ⎤   ⎡ σ_η^2 ⎤
⎢ r(1)   r(0)    . . .  r(l−1) ⎥ ⎢ a_1 ⎥ = ⎢ 0     ⎥
⎢  ⋮      ⋮              ⋮     ⎥ ⎢ ⋮   ⎥   ⎢ ⋮     ⎥
⎣ r(l)   r(l−1)  . . .  r(0)   ⎦ ⎣ a_l ⎦   ⎣ 0     ⎦ .    (2.138)

These are known as the Yule-Walker equations, whose solution provides the values r(0), . . . , r(l), which are then used as the initial conditions to solve the difference equation in (2.136) and obtain r(k), ∀k ∈ Z.

Observe the special structure of the matrix in the linear system. Matrices of this type are known as Toeplitz, and this is the property that will be exploited to efficiently solve such systems, which result whenever the autocorrelation matrix of a WSS process is involved; see Chapter ??.
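The following sketch solves the Yule-Walker equations for r(0), . . . , r(l), given the AR coefficients and the noise variance; it simply rearranges (2.136)-(2.137) into a linear system in the unknown autocorrelation values, using r(−j) = r(j) for a real process. The function name and the particular numbers are illustrative assumptions.

import numpy as np

def ar_autocorr(a, sigma2):
    # Solve (2.136)-(2.137) for r(0), ..., r(l), given the coefficients
    # a_1, ..., a_l of model (2.135) and the white-noise variance.
    a = np.concatenate(([1.0], np.asarray(a, dtype=float)))   # a_0 := 1
    l = len(a) - 1
    A = np.zeros((l + 1, l + 1))
    for k in range(l + 1):              # equation index k = 0, 1, ..., l
        for i in range(l + 1):
            A[k, abs(k - i)] += a[i]    # coefficient of r(|k - i|); r(-j) = r(j)
    rhs = np.zeros(l + 1)
    rhs[0] = sigma2                     # only the k = 0 equation has sigma^2
    return np.linalg.solve(A, rhs)

# AR(1) with a_1 = -0.9, i.e., u_n = 0.9 u_{n-1} + eta_n:
print(ar_autocorr([-0.9], 1.0))         # r(0) = 1/(1 - 0.81) ~ 5.26, r(1) ~ 4.74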

Besides the autoregressive models, other types of stochastic models have been suggested and used. The autoregressive-moving average (ARMA) model of order (l, m) is defined by the difference equation,

u_n + a_1 u_{n−1} + . . . + a_l u_{n−l} = b_1 η_n + . . . + b_m η_{n−m},    (2.139)

and the moving average model of order m, denoted as MA(m), is defined as

u_n = b_1 η_n + . . . + b_m η_{n−m}.    (2.140)

Note that the AR(l) and MA(m) models can be considered special cases of the ARMA(l, m) model. For a more theoretical treatment of the topic, see, e.g., [Broc 91].

Example 2.4. Consider the AR(1) process,

u_n − a u_{n−1} = η_n,

that is, a_1 = −a in (2.135). Following the general methodology explained before, we have

r(k) − a r(k − 1) = 0, k = 1, 2, . . . ,
r(0) − a r(1) = σ_η^2.

Taking the first equation for k = 1 together with the second one readily results in

r(0) = σ_η^2 / (1 − a^2).


Figure 2.15: a) The time evolution of a realization of the AR(1) process with a = 0.9 and b) the respective autocorrelation sequence. c) The time evolution of a realization of the AR(1) process with a = 0.4 and d) the corresponding autocorrelation sequence.

Plugging this value into the difference equation, we recursively obtain

r(k) = a^{|k|} σ_η^2 / (1 − a^2), k = 0, ±1, ±2, . . . ,    (2.141)

where we used the property r(k) = r(−k). Observe that if |a| > 1, then r(0) < 0, which is meaningless. Also, |a| < 1 guarantees that the root of the characteristic polynomial (z = a) has magnitude smaller than one. Moreover, |a| < 1 guarantees that r(k) −→ 0 as k −→ ∞. This is in line with common sense, since variables that are far apart in time should be (nearly) uncorrelated.
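The closed form (2.141) is easy to check by simulation. The sketch below (parameters chosen arbitrarily) generates a long realization of u_n = a u_{n−1} + η_n, discards a burn-in segment so that the process has approximately reached stationarity, and compares the sample autocorrelation with a^{|k|} σ_η^2/(1 − a^2):

import numpy as np

rng = np.random.default_rng(3)
a, sigma2, N = 0.9, 1.0, 500_000

eta = rng.standard_normal(N) * np.sqrt(sigma2)
u = np.zeros(N)
for n in range(1, N):
    u[n] = a * u[n - 1] + eta[n]        # AR(1) recursion
u = u[1000:]                            # discard the burn-in

r0 = sigma2 / (1 - a ** 2)
for k in range(4):
    r_hat = np.mean(u[k:] * u[:len(u) - k])   # sample autocorrelation at lag k
    print(k, round(r_hat, 2), round(a ** k * r0, 2))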

Figure 2.15 shows the time evolution of realizations of two AR(1) processes (after the processes have converged to stationarity), together with the respective autocorrelation sequences, for a = 0.9 and a = 0.4. Observe that the larger the value of a, the smoother the realization and the slower the time variations.


Figure 2.16: a) The time evolution of a realization from a white noise process. b) The power spectral densities, in dBs, for the two AR(1) sequences of Figure 2.15. The red curve corresponds to a = 0.4 and the gray one to a = 0.9. The smaller the value of a, the closer the process is to white noise, and the more power its PSD assigns to high frequencies. Since the PSD is the Fourier transform of the autocorrelation sequence, observe that the broader a sequence is in time, the narrower its Fourier transform becomes, and vice versa.

This is natural, since nearby samples are highly correlated and so, on average, they tend to have similar values. The opposite is true for small values of a. For comparison purposes, Figure 2.16a shows the case a = 0, which corresponds to white noise. Figure 2.16b shows the power spectral densities corresponding to the two cases of Figure 2.15. Observe that the faster the autocorrelation approaches zero, the more spread out the PSD, and vice versa.

2.5 Information Theory

So far in this chapter, we have looked at some basic definitions and properties concerning probability theory and stochastic processes. In the same vein, we are now going to focus on basic definitions and notions related to information theory. Although information theory was originally developed in the context of the communications and coding disciplines, it has by now been adopted in a wide range of areas, including Machine Learning. Notions from information theory are used to establish cost functions for optimization in parameter estimation problems, and concepts from information theory are employed to estimate unknown probability distributions in the context of constrained optimization tasks. We are going to meet such methods later on in this book.

The father of information theory is Claude Elwood Shannon (1916–2001).


He was an American mathematician and electrical engineer. He founded information theory with the landmark paper "A mathematical theory of communication," published in the Bell System Technical Journal in 1948. However, he is also credited with founding digital circuit design theory in 1937, when, as a 21-year-old master's degree student at the Massachusetts Institute of Technology (MIT), he wrote his thesis demonstrating that electrical applications of Boolean algebra could construct and resolve any logical, numerical relationship. So, he is also credited as one of the fathers of digital computers. Shannon, while working for the national defense during World War II, contributed to the field of cryptography, converting it from an art to a rigorous scientific field.

As is the case with probability, the notion of information is part of our everyday vocabulary. In this context, an event carries information if it is either unknown to us or if the probability of its occurrence is very low and, in spite of that, it happens. For example, if one tells us that the sun shines bright during a summer day in the Sahara desert, we would consider such a statement rather dull and useless. On the contrary, if somebody brings us news of rain in the Sahara during summer, then this statement carries a lot of information and could possibly ignite a discussion concerning climate change.

Thus, trying to formalize the notion of information from a mathematical point of view, it is reasonable to define it in terms of the negative logarithm of the probability of an event. If the event is certain to occur, it carries zero information content; however, if its probability of occurrence is low, then its information content has a large positive value.

2.5.1 Discrete Random Variables

Information

Given a discrete random variable, x, which takes values in the set X, the information associated with any value x ∈ X is denoted as I(x) and is defined as

I(x) = −log P(x) :  Information Associated with x = x ∈ X.    (2.142)

Any base for the logarithm can be used. If the natural logarithm is chosen, information is measured in nats (natural units). If the base-2 logarithm is employed, information is measured in bits (binary digits). Employing the logarithmic function to define information is also in line with common sense reasoning, namely that the information content of two statistically independent events should be the sum of the information conveyed by each of them individually: I(x, y) = −ln P(x, y) = −ln P(x) − ln P(y).

Example 2.5. We are given a binary random variable x ∈ X = {0, 1}, and assume that P(1) = P(0) = 0.5. We can consider this random variable as a source that generates and emits two possible values. The information content of each of the two equiprobable events is

I(0) = I(1) = −log_2 0.5 = 1 bit.


Let us now consider another source of random events, which generates code words comprising k binary variables together. The output of this source can be seen as a random vector with binary-valued elements, x = [x_1, . . . , x_k]^T. The corresponding probability space, X, comprises K = 2^k elements. If all possible values have the same probability, 1/K, then the information content of each possible event is equal to

I(x_i) = −log_2 (1/K) = k bits.

We observe that when the number of possible events is larger, the information content of each individual one (assuming equiprobable events) becomes larger. This is also in line with common sense reasoning: if the source can emit a large number of (equiprobable) events, the occurrence of any one of them carries more information than for a source that can only emit a few possible events.

Mutual and Conditional Information

Besides marginal probabilities, we have already met the concept of conditional probability. This leads to the definition of mutual information.

Given two discrete random variables, x ∈ X and y ∈ Y, the information content provided by the occurrence of the event y = y about the event x = x is measured by the mutual information, denoted as I(x; y) and defined by

I(x; y) := log [P(x|y) / P(x)] :  Mutual Information.    (2.143)

Note that if the two variables are statistically independent, their mutual information is zero; this is most reasonable, since observing y says nothing about x. On the contrary, if, after observing y, it is certain that x will occur, i.e., P(x|y) = 1, then the mutual information becomes I(x; y) = I(x), which is again in line with common reasoning. Mobilizing our, by now familiar, product rule, it is straightforward to see that

I(x; y) = I(y; x).

The conditional information of x given y is defined as

I(x|y) = −log P(x|y) :  Conditional Information.    (2.144)

Making use of the probability product rule, it is simple to show that

I(x; y) = I(x) − I(x|y).    (2.145)

Example 2.6. In a communications channel, the source transmits binary symbols, x, with probability P(0) = P(1) = 1/2. The channel is noisy, so the received symbols, y, may have changed polarity, due to noise, with the following probabilities:

P(y = 0|x = 0) = 1 − p,
P(y = 1|x = 0) = p,
P(y = 1|x = 1) = 1 − q,
P(y = 0|x = 1) = q.

This example illustrates, in its simplest form, the effect of a communications channel. Transmitted bits are hit by noise, and what the receiver receives is noisy (possibly wrong) information. The task of the receiver is to decide, upon reception of a sequence of symbols, which sequence was originally transmitted.

The goal of our example is to determine the mutual information about the occurrence of x = 0 and x = 1, once y = 0 has been observed. To this end, we first need to compute the marginal probabilities,

P(y = 0) = P(y = 0|x = 0)P(x = 0) + P(y = 0|x = 1)P(x = 1) = (1/2)(1 − p + q),

and similarly,

P(y = 1) = (1/2)(1 − q + p).

Thus, the mutual information is

I(0; 0) = log_2 [P(x = 0|y = 0) / P(x = 0)] = log_2 [P(y = 0|x = 0) / P(y = 0)]
        = log_2 [2(1 − p) / (1 − p + q)],

and

I(1; 0) = log_2 [2q / (1 − p + q)].

Let us now consider the case p = q = 0. Then I(0; 0) = 1 bit, which is equal to I(x = 0), since the output specifies the input with certainty. If, on the other hand, p = q = 1/2, then I(0; 0) = 0 bits, since the noise can randomly change polarity with equal probability. If now p = q = 1/4, then I(0; 0) = log_2 (3/2) ≈ 0.585 bits and I(1; 0) = −1 bit. Observe that the mutual information can take negative values, too.
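A small sketch of these computations (the function name is an illustrative choice) reproduces the numbers above for p = q = 1/4:

import numpy as np

def mutual_info_bits(p, q):
    # I(x = 0; y = 0) and I(x = 1; y = 0) for the channel of Example 2.6,
    # with equiprobable inputs, P(x = 0) = P(x = 1) = 1/2.
    P_y0 = 0.5 * (1 - p + q)
    I00 = np.log2((1 - p) / P_y0)       # log2[ P(y=0|x=0) / P(y=0) ]
    I10 = np.log2(q / P_y0)             # log2[ P(y=0|x=1) / P(y=0) ]
    return I00, I10

print(mutual_info_bits(0.25, 0.25))     # (0.585..., -1.0)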

Entropy and Average Mutual Information

Given a discrete random variable, x ∈ X, its entropy is defined as the average information over all possible outcomes, i.e.,

H(x) := −Σ_{x∈X} P(x) log P(x) :  Entropy of x.    (2.146)


Note that if P(x) = 0, we set P(x) log P(x) = 0, taking into consideration that lim_{x→0} x log x = 0.

In a similar way, the average mutual information between two random variables, x and y, is defined as

I(x; y) := Σ_{x∈X} Σ_{y∈Y} P(x, y) I(x; y)
        = Σ_{x∈X} Σ_{y∈Y} P(x, y) log [P(x|y)P(y) / (P(x)P(y))],

or

I(x; y) = Σ_{x∈X} Σ_{y∈Y} P(x, y) log [P(x, y) / (P(x)P(y))] :  Average Mutual Information.    (2.147)

It can be shown that

I(x; y) ≥ 0,

and that it is zero iff x and y are statistically independent (Problem 2.12).

In analogy, the conditional entropy of x given y is defined as

H(x|y) := −Σ_{x∈X} Σ_{y∈Y} P(x, y) log P(x|y) :  Conditional Entropy.    (2.148)

It is readily shown, by taking into account the probability product rule, that

I(x; y) = H(x) − H(x|y).    (2.149)

Lemma 2.1. The entropy of a random variable, x ∈ X, takes its maximum value if all possible values, x ∈ X, are equiprobable.

Proof: The proof is given in Problem 2.14.

In other words, the entropy can be considered a measure of the randomness of a source that emits symbols. The maximum value is associated with the maximum uncertainty about what is going to be emitted, since it occurs when all symbols are equiprobable. The smallest value of the entropy is zero, and it corresponds to the case where all events have zero probability except one, which occurs with probability equal to one.

Example 2.7. Consider a binary source that transmits the values 1 or 0 with probabilities p and 1 − p, respectively. The entropy of the associated random variable is

H(x) = −p log_2 p − (1 − p) log_2 (1 − p).

Figure 2.17 shows its graph for values of p ∈ [0, 1]. Observe that the maximum occurs for p = 1/2.
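A minimal NumPy sketch of this binary entropy function follows; the clipping is just a numerical guard implementing the convention 0 log 0 = 0.

import numpy as np

def binary_entropy(p):
    # H(x) = -p log2(p) - (1 - p) log2(1 - p), with 0 log 0 := 0
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(binary_entropy(np.array([0.1, 0.25, 0.5])))   # peaks at 1 bit for p = 1/2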


Figure 2.17: The maximum value of the entropy for a binary random variable occurs if the two possible events have equal probability, p = 1/2.

2.5.2 Continuous Random Variables

All the definitions given before can be generalized to the case of continuous random variables. However, this generalization must take place with caution. Recall that the probability of occurrence of any single value of a random variable that takes values in an interval of the real axis is zero. Hence, the corresponding information content is infinite.

To define the entropy of a continuous variable, x, we first discretize it and form the corresponding discrete variable, x_∆,

x_∆ := n∆, if (n − 1)∆ < x ≤ n∆,    (2.150)

where ∆ > 0. Then,

P(x_∆ = n∆) = P(n∆ − ∆ < x ≤ n∆) = ∫_{(n−1)∆}^{n∆} p(x) dx = ∆ p(n∆),    (2.151)

where p(n∆) is a number between the maximum and the minimum values of p(x) for x ∈ (n∆ − ∆, n∆] (such a number exists by the mean value theorem). Then we can write

H(x_∆) = −Σ_{n=−∞}^{+∞} ∆ p(n∆) log(∆ p(n∆)),    (2.152)

and since

Σ_{n=−∞}^{+∞} ∆ p(n∆) = ∫_{−∞}^{+∞} p(x) dx = 1,

we obtain

H(x_∆) = −log ∆ − Σ_{n=−∞}^{+∞} ∆ p(n∆) log(p(n∆)).    (2.153)


Note that x_∆ −→ x as ∆ −→ 0. However, if we take the limit in (2.153), the term −log ∆ goes to infinity. This is the crucial difference from the discrete-variable case.

The entropy for a continuous random variable, x, is therefore defined as the limit

H(x) := lim_{∆→0} (H(x_∆) + log ∆),

or

H(x) = −∫_{−∞}^{+∞} p(x) log p(x) dx :  Entropy.    (2.154)

This is the reason the entropy of a continuous variable is often called the differential entropy.

Note that the entropy is still a measure of the randomness (uncertainty) of the distribution describing x. This is demonstrated via the following example.

Example 2.8. We are given a random variable x ∈ [a, b]. Of all the possible pdfs that can describe this variable, find the one that maximizes the entropy.

This task translates into the following constrained optimization task:

maximize with respect to p(·):  H = −∫_a^b p(x) ln p(x) dx,
subject to:  ∫_a^b p(x) dx = 1.

The constraint guarantees that the resulting function is indeed a pdf. Using calculus of variations to perform the maximization (Problem 2.15), it turns out that

p(x) = { 1/(b − a), if x ∈ [a, b],
       { 0,         otherwise.

In other words, the result is the uniform distribution, which is indeed the most random one, since it gives no preference to any particular subinterval of [a, b].

We will come back to this method of estimating pdfs in Section ??. This elegant method for estimating pdfs is due to Jaynes [Jayn 57, Jayn 03] and is known as the maximum entropy method. In its more general form, more constraints are involved, to fit the needs of the specific problem.

Average Mutual Information and Conditional Information

Given two continuous random variables, the average mutual information is defined as

I(x; y) := ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} p(x, y) log [p(x, y) / (p(x)p(y))] dx dy,    (2.155)

and the conditional entropy of x given y as

H(x|y) := −∫_{−∞}^{+∞} ∫_{−∞}^{+∞} p(x, y) log p(x|y) dx dy.    (2.156)


Using standard arguments and the product rule, it is easy to show that

I(x; y) = H(x) − H(x|y) = H(y) − H(y|x).    (2.157)

Relative Entropy or Kullback-Leibler Divergence

The relative entropy or Kullback-Leibler divergence is a quantity that was developed within the context of information theory for measuring the (dis)similarity between two pdfs. It is widely used in Machine Learning optimization tasks when pdfs are involved; see, e.g., Chapter ??. Given two pdfs, p(·) and q(·), their Kullback-Leibler divergence, denoted as KL(p||q), is defined as

KL(p||q) := ∫_{−∞}^{+∞} p(x) log [p(x)/q(x)] dx :  Kullback-Leibler Divergence.    (2.158)

Note that

I(x; y) = KL(p(x, y) || p(x)p(y)).

The Kullback-Leibler divergence is not symmetric, i.e., KL(p||q) ≠ KL(q||p), and it can be shown to be non-negative (the proof is similar to the proof that the mutual information is non-negative; see also Problem ?? of Chapter ??). Moreover, it is zero if and only if p = q.
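The following sketch evaluates the discrete analogue of (2.158) (a sum in place of the integral) for two hypothetical probability vectors, illustrating non-negativity and asymmetry; terms with p(x) = 0 are taken to contribute zero.

import numpy as np

def kl_divergence(p, q):
    # KL(p||q) = sum_x p(x) log(p(x)/q(x)), in nats; 0 log(0/q) := 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # non-negative and asymmetric
print(kl_divergence(p, p))                        # zero iff p = q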

Note that all that has been said concerning entropy and mutual information generalizes readily to the case of random vectors.

2.6 Stochastic Convergence

We are going to close this memory-refreshing tour of probability theory and related concepts by putting some final touches on the convergence of sequences of random variables.

Let a sequence of random variables be given,

x_0, x_1, . . . , x_n, . . . .

We can consider this sequence as a discrete-time stochastic process. Due to the randomness, a realization of this process,

x_0, x_1, . . . , x_n, . . . ,

may or may not converge. Thus, the notion of convergence of random variables has to be treated carefully, and different interpretations have been developed.

Recall from basic calculus that a sequence of numbers, x_n, converges to a value, x, if for every ε > 0 there exists a number, n(ε), such that

|x_n − x| < ε, ∀n ≥ n(ε).    (2.159)


Convergence Everywhere

We say that a random sequence converges everywhere if every realization, x_n, of the random process converges to a value x, according to the definition given in (2.159). Note that each realization may converge to a different value, which can itself be considered the outcome of a random variable x, and we write

x_n −→ x, as n −→ ∞.    (2.160)

It is common to denote a realization (outcome) of a random process as x_n(ζ), where ζ denotes a specific experiment.

Convergence Almost Everywhere

A weaker version of convergence, compared to the previous one, is convergence almost everywhere. Consider the set of outcomes ζ such that

lim x_n(ζ) = x(ζ), n −→ ∞.

We say that the sequence x_n converges almost everywhere if

P(x_n −→ x) = 1, n −→ ∞.    (2.161)

Note that {x_n −→ x} denotes the event comprising all the outcomes ζ such that lim x_n(ζ) = x(ζ). The difference from convergence everywhere is that it is now allowed for a set of realizations of zero probability (e.g., a finite or countably infinite number of them) not to converge.

Convergence in the Mean Square Sense

We say that a random sequence, x_n, converges to the random variable, x, in the mean square (MS) sense if

E[|x_n − x|^2] −→ 0, n −→ ∞.    (2.162)

Convergence in Probability

Given a random sequence, x_n, a random variable, x, and a number ε > 0, {|x_n − x| > ε} is an event. We define the sequence of numbers P({|x_n − x| > ε}), n = 0, 1, . . . . We say that x_n converges to x in probability if this constructed sequence of numbers tends to zero, i.e.,

P({|x_n − x| > ε}) −→ 0, n −→ ∞, ∀ε > 0.    (2.163)

Convergence in Distribution

Given a random sequence, x_n, and a random variable, x, let F_n(x) and F(x) be the corresponding cumulative distribution functions. We say that x_n converges to x in distribution if

F_n(x) −→ F(x), n −→ ∞,    (2.164)


for every point x of continuity of F(x).

It can be shown that if a random sequence converges either almost everywhere or in the MS sense, then it also converges in probability; and if it converges in probability, then it also converges in distribution. The converse arguments are not necessarily true. In other words, the weakest version of convergence is convergence in distribution.
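As an illustration of convergence in probability (a simulation sketch with arbitrarily chosen numbers; essentially the law of large numbers in action), take x_n to be the sample mean of n i.i.d. uniform [0, 1] variables, which converges in probability to 1/2; the probability in (2.163) can then be estimated over many independent realizations and watched shrink as n grows:

import numpy as np

rng = np.random.default_rng(4)
eps = 0.05

# Estimate P(|x_n - 1/2| > eps) from 2000 independent realizations of x_n
for n in [10, 100, 1000, 5000]:
    means = rng.random((2000, n)).mean(axis=1)
    print(n, np.mean(np.abs(means - 0.5) > eps))  # tends to zero as n grows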


Problems

2.1. Derive the mean and variance for the binomial distribution.

2.2. Derive the mean and the variance for the uniform distribution.

2.3. Derive the mean and covariance matrix of the multivariate Gaussian.

2.4. Show that the mean and variance of the beta distribution with parameters a and b are given by

E[x] = a/(a + b)

and

σ_x^2 = ab / ((a + b)^2 (a + b + 1)).

Hint: Use the property Γ(a + 1) = aΓ(a).

2.5. Show that the normalizing constant in the beta distribution with parameters a, b is given by

Γ(a + b) / (Γ(a)Γ(b)).

2.6. Show that the mean and variance of the gamma pdf,

Gamma(x|a, b) = (b^a / Γ(a)) x^{a−1} e^{−bx}, a, b, x > 0,

are given by

E[x] = a/b,  σ_x^2 = a/b^2.

2.7. Show that the mean and variance of a Dirichlet pdf with K variables, x_k, k = 1, 2, . . . , K, and parameters a_k, k = 1, 2, . . . , K, are given by

E[x_k] = a_k/a, k = 1, 2, . . . , K,
σ_k^2 = a_k(a − a_k) / (a^2(1 + a)), k = 1, 2, . . . , K,
cov[x_i x_j] = −a_i a_j / (a^2(1 + a)), i ≠ j,

where a = Σ_{k=1}^{K} a_k.

2.8. Show that the sample mean, using N i.i.d. samples, is an unbiased estimator whose variance tends to zero asymptotically, as N −→ ∞.


2.9. Show that for WSS processes,

r(0) ≥ |r(k)|, ∀k ∈ Z,

and that for jointly WSS processes,

r_u(0) r_v(0) ≥ |r_uv(k)|^2, ∀k ∈ Z.

2.10. Show that the autocorrelation of the output of a linear system, with impulse response w_n, n ∈ Z, is related to the autocorrelation of the input process via

r_d(k) = r_u(k) ∗ w_k ∗ w_{−k}^*.

2.11. Show that ln x ≤ x − 1.

2.12. Show that I(x; y) ≥ 0.

Hint: Use the inequality of Problem 2.11.

2.13. Show that if a_i, b_i, i = 1, 2, . . . , M, are positive numbers such that

Σ_{i=1}^{M} a_i = 1 and Σ_{i=1}^{M} b_i ≤ 1,

then

−Σ_{i=1}^{M} a_i ln a_i ≤ −Σ_{i=1}^{M} a_i ln b_i.

2.14. Show that the maximum value of the entropy of a random variable occurs if all possible outcomes are equiprobable.

2.15. Show that, of all the pdfs that describe a random variable in an interval [a, b], the uniform one maximizes the entropy.


Bibliography

[Broc 91] Brockwell P.J., Davis R.A. Time Series: Theory and Methods, 2nd Ed., Springer, 1991.

[Cox 46] Cox R.T. "Probability, frequency and reasonable expectation," American Journal of Physics, Vol. 14(1), pp. 1–13, 1946.

[Jayn 57] Jaynes E.T. "Information theory and statistical mechanics," Physical Review, Vol. 106(4), pp. 620–630, 1957.

[Jayn 03] Jaynes E.T. Probability Theory: The Logic of Science, Cambridge University Press, 2003.

[Kolm 56] Kolmogorov A.N. Foundations of the Theory of Probability, 2nd Ed., Chelsea Publishing Company, 1956.

[Papo 02] Papoulis A., Pillai S.U. Probability, Random Variables and Stochastic Processes, 4th Ed., McGraw-Hill, 2002.

[Prie 81] Priestley M.B. Spectral Analysis and Time Series, Academic Press, New York, 1981.

[Proa 92] Proakis J., Manolakis D. Digital Signal Processing, 2nd Ed., Macmillan, 1992.