
EE 278 Lecture Notes #2, Winter 2010–2011

Probability

Review and elaboration of basic probability + simple examples of random variables, vectors, and processes

Probability spaces, fair spinner, one coin flip, multiple coin flips, a Bernoulli random process. pdfs and pmfs


Probability Space

Probability assigns a measure like length, area, volume, weight, mass to events = sets in some space

Usually involves sums (discrete probability) or integrals (continuous)

Basic construct is a probability space or experiment (Ω, F, P) which consists of three items:


1. Sample space Ω = an abstract space of elements called points

E.g., {H, T}, {0, 1}, [0, 1), R^k

2. Event space F = collection of subsets of Ω called events, to which probabilities are assigned

E.g., all subsets of Ω

3. Probability measure P = assignment of real numbers to events consistent with a set of axioms

Consider each component in order:


Sample space Ω: an abstract space of elements called (sample) points

Intuition: contains all distinguishable elementary outcomes or finest grain results of an experiment.


Event space F (sigma-field): a collection of subsets of Ω s.t.

a) If F ∈ F, then also F^c ∈ F

b) If Fi ∈ F, i = 1, 2, . . ., then also ∪_i Fi ∈ F

Intuition: Algebraic structure — a)–b) + set theory ⇒ countable set-theoretic operations (union, intersection, complementation, difference) on events produce new events. Ω ∈ F, ∅ ∈ F.

F ⊂ Ω, but F ∈ F (set inclusion vs. element inclusion)

Event spaces are not an issue in the elementary case where Ω is discrete: use F = all subsets of Ω (power set of Ω)

Event spaces are an issue in continuous spaces, where the power set is too large for a useful theory. If Ω = R, use the Borel field B(R) (the smallest event space containing the intervals).


Probability measure P: an assignment of a number P(F) to every event F ∈ F in a way that satisfies Kolmogorov's axioms of probability:

1. P(F) ≥ 0 for all F ∈ F

2. P(Ω) = 1

3. If Fi ∈ F, i = 1, 2, . . . are disjoint or mutually exclusive (Fi ∩ Fj = ∅ if i ≠ j), then

P(∪_i Fi) = Σ_i P(Fi)

Axioms are enough to get a useful calculus of probability + useful mathematical models of random processes with predictable long term behavior (laws of large numbers or ergodic theorems, central limit theorem)


Example: Spinning pointer

Introduce several fundamental ideas in the context of two simple examples: a fair spinning pointer (or wheel) and a single coin flip. Then consider many coin flips.

Spin a fair pointer in a circle:

[Figure: circular spinner, circumference labeled 0.0, 0.25, 0.5, 0.75]

When the pointer stops it can point to any number in the unit interval Ω = [0, 1) ∆= {r : 0 ≤ r < 1}. Describe (Ω, F, P)


Sample space = Ω = [0, 1)

Event space = F = smallest event space containing all of the intervals, called B([0, 1)), the Borel field of [0, 1).

Probability measure: For the fair spinner, the probability the outcome is a point in F ∈ B([0, 1)) is

P(F) = ∫_F f(x) dx

where

f(x) = 1, x ∈ [0, 1),

is a uniform probability density function (pdf)

E.g., if F = [a, b] = {r : a ≤ r ≤ b} with 0 ≤ a ≤ b < 1, P(F) = b − a

The probability of the pointer landing in an interval of length b − a is b − a, the fraction of the sample space corresponding to the event — intuitive!
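As a quick sanity check (a minimal sketch in Python, not part of the original notes), we can simulate the fair spinner and compare empirical interval frequencies against b − a:

```python
import random

def spinner_interval_prob(a, b, trials=200_000):
    """Estimate P([a, b]) for the fair spinner by Monte Carlo."""
    hits = sum(1 for _ in range(trials) if a <= random.random() <= b)
    return hits / trials

a, b = 0.25, 0.6
print(f"empirical ~ {spinner_interval_prob(a, b):.4f}, exact = {b - a:.4f}")
```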


In general, f(x), x ∈ Ω is a pdf if

1. f(x) ≥ 0, all x,

2. ∫_Ω f(x) dx = 1.

pdf ⇒ a probability measure by integration:

P(F) = ∫_F f(x) dx

Can also write

P(F) = ∫ 1_F(x) f(x) dx

where the indicator function of F is given by

1_F(r) ∆= 1 if r ∈ F, 0 if r ∉ F


Same result if instead we set Ω = R, F = B(R), and pdf

f(r) = 1 if r ∈ [0, 1), 0 otherwise.

Comments:

• Event space details ⇒ integrals make sense. Important in research, not so much in practice. But good to know the language when reading the literature.

• The integrals in practice are Riemann. In theory they are Lebesgue (better limiting properties). In most cases the two integrals are the same.

• The axioms of probability are properties of integration in disguise. See the next slide.


PDFs and the axioms of probability

Suppose f is a pdf (nonnegative, ∫_Ω f(x) dx = 1) and

P(F) = ∫_F f(x) dx

Then

• Probabilities are nonnegative since integrating a nonnegative argument yields a nonnegative result. (Axiom 1)

• The probability of the entire sample space is 1 since integrating 1 over the unit interval yields 1. (Axiom 2)


• The probability of a finite union of disjoint regions is the sum of the probabilities of the individual events since integration is linear:

P(F ∪ G) = ∫ 1_{F∪G}(r) f(r) dr = ∫ (1_F(r) + 1_G(r)) f(r) dr

= ∫ 1_F(r) f(r) dr + ∫ 1_G(r) f(r) dr

= P(F) + P(G).

(Part of Axiom 3 from linearity of integration)

Need to show this for a countable number of disjoint events for Axiom 3. Is the above enough? Unfortunately, no for the Riemann integral.

True if use Lebesgue integral. (We do not pursue details.)


Example Probability Spaces: Single coin flip

Develop in two ways:

Model I Direct description — simplest way, fine if all you care about is one coin flip.

Model II As a random variable (or signal processing) defined on the fair spinner. Will define lots of other random variables on the same space.


Coin Flip Model I: Direct description

Sample space Ω = {0, 1} — using numbers instead of head, tail will prove convenient

Event space F = {{0}, {1}, Ω, ∅} = all subsets of Ω (power set of Ω)

Probability measure P: define in terms of a sum of a probability mass function, analogous to the integral of a pdf for the spinner

Given a discrete sample space (= countable number of elements) Ω, a probability mass function (pmf) p(ω) is a nonnegative function such that Σ_{ω∈Ω} p(ω) = 1.


Given a pmf p, define P by P({ω}) = p(ω) + Axiom 3:

P(F) = Σ_{x∈F} P({x}) = Σ_{x∈F} p(x) = Σ_x 1_F(x) p(x)

Again: theory of sums ⇒ axioms of probability satisfied; obvious if Ω is finite.

For the fair coin, set p(1) = p(0) = 1/2. For a biased coin, common to use p(1) = p, p(0) = 1 − p for a parameter p ∈ (0, 1)


Notes:

• Probabilities are defined on sets — P(F); pdfs and pmfs are defined on points!!

• In the discrete case, one point (singleton) sets have possibly nonzero probability, e.g.,

P({0}) = p(0) = 1/2

A pmf gives the probability of something. A pdf is not the probability of anything; e.g., in our uniform spinner case

P({1/2}) = ∫_{1/2}^{1/2} 1 dx = 0.

If P is determined by a pdf, the probability of an individual point (e.g., {1/π}) = 0

Must integrate a pdf to get a probability


Coin Flip Model II: Random variable on spinner

Suppose (Ω, F, P) describes the uniform spinner: Ω = [0, 1), P described by pdf f(r) = 1 for r ∈ [0, 1)

Define a measurement q made on the outcome of the spin by

q(r) = 1_{[0.5,1)}(r) = 1 if r ∈ [0.5, 1), 0 if r ∈ [0, 0.5)   (1)

Simple quantizer

Simple example of a random variable = a real-valued function defined on the sample space, q(r), r ∈ Ω


Simple example of signal processing — the fair spinner produces a signal or input r, the quantizer operates on the signal to produce a value q(r). Think A/D conversion or a threshold decision rule based on a noisy observation

Original probability space + function ⇒ new probability space with binary sample space Ωq = {0, 1}, event space Fq = power set of {0, 1}. The new space inherits a probability measure, say Pq, from the old.

Output space is discrete ⇒ only need a pmf to characterize the probability measure

pq(0) = P({r : q(r) = 0}) = ∫_0^{0.5} 1 dr = 1/2

Often written informally as Pr(q = 0) or P(q = 0), shorthand for “probability random variable q takes on a value of 0”

Similarly pq(1) = P({r : q(r) = 1}) = ∫_{0.5}^{1} 1 dr = 1/2


Can use the pmf to find any probability involving q:

Pq(F) = Σ_{x∈F} pq(x)

Derived new probability measure from old + function q.

Probability measure Pq describing the output of random variable q is called the distribution of q

In general, Pq(F) = P({ω : q(ω) ∈ F}) = P(q^{-1}(F)) relates probability in the new space to probability in the old.

q^{-1}(F) = inverse image of the set F under the mapping q

Basic example of a derived distribution: (Ω, F, P) + q ⇒ (Ωq, Fq, Pq)

General solution: inverse image method Pq(F) = P(q^{-1}(F))
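To illustrate (a sketch, not from the notes), the derived distribution of the quantizer can be estimated by simulating the spinner and applying q:

```python
import random

def q(r):
    """The quantizer of Eq. (1): 1 on [0.5, 1), 0 on [0, 0.5)."""
    return 1 if r >= 0.5 else 0

trials = 100_000
spins = [random.random() for _ in range(trials)]   # uniform spinner outcomes
p_q0 = sum(q(r) == 0 for r in spins) / trials      # estimate of pq(0) = 1/2
p_q1 = sum(q(r) == 1 for r in spins) / trials      # estimate of pq(1) = 1/2
print(p_q0, p_q1)
```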


Notes:

• Multiple ways to arrive at same model — equivalent models

Directly in terms of P, indirectly in terms of probability space + rv

• Basic idea will be seen often:

probability space + function (random variable)

⇒ new probability space with probability measure = distribution of the random variable given by the inverse image formula

Will see many tools and tricks for doing actual calculus.

• Using Model II, can define more random variables on a common experiment, e.g., two coin flips or even an infinite number.

Model I only good for single coin flip.


Two coin flips

Let (Ω,F , P) be the fair spinner experiment.

Rename quantizer q as X (common to use upper case for random variables). Define another random variable Y on the same space:

X(ω) = 0 if ω ∈ [0, 0.5), 1 if ω ∈ [0.5, 1.0)

Y(ω) = 0 if ω ∈ [0, 0.25) ∪ [0.5, 0.75), 1 if ω ∈ [0.25, 0.5) ∪ [0.75, 1.0)

[Figure: the unit interval [0, 1) split into quarters; X(r) = 0 on the left half and X(r) = 1 on the right half, while Y(r) alternates 0, 1, 0, 1 across the quarters]


Single experiment ⇒ values of two random variables X and Y, or a single 2D random vector (X, Y).

Easy to compute pmfs for the individual random variables X and Y (marginal pmfs):

pX(k) = pY(k) = 1/2; k = 0, 1

Equivalent random variables, same pmf.

Now also have the joint pmf of the 2 rvs together: inverse image formula ⇒

pXY(x, y) = Pr(X = x, Y = y) = P({ω : (X, Y)(ω) = (x, y)})

Do the math: pXY(x, y) = 1/4; x = 0, 1; y = 0, 1. E.g.,

pXY(0, 1) = P({ω : ω ∈ [0, 0.5)} ∩ {ω : ω ∈ [0.25, 0.5) ∪ [0.75, 1.0)})

= P({ω : ω ∈ [0.25, 0.5)}) = 1/4
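A Monte Carlo sketch (my addition, not from the notes) confirming that each (x, y) pair occurs with probability about 1/4:

```python
import random
from collections import Counter

def X(w): return 0 if w < 0.5 else 1
def Y(w): return 0 if (0 <= w < 0.25) or (0.5 <= w < 0.75) else 1

trials = 100_000
counts = Counter((X(w), Y(w)) for w in (random.random() for _ in range(trials)))
for xy in sorted(counts):
    print(xy, counts[xy] / trials)   # each close to 0.25
```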


Notes:

• Here separately derived the joint pmf pXY and the marginal pmfs pX, pY. Alternatively could compute the marginals from the joint using total probability:

pX(x) = P({ω : X(ω) = x}) = Σ_{y∈ΩY} P({ω : X(ω) = x, Y(ω) = y}) = Σ_{y∈ΩY} pXY(x, y)

pY(y) = Σ_{x∈ΩX} pXY(x, y)

Joint and marginal pmfs are consistent — can get the marginals either from the original P or from the joint pmf pXY


• For this example pXY(x, y) = pX(x)pY(y), a product pmf

If two discrete random variables satisfy the above, they are said to be independent (will discuss more later)

Can define any number of random variables on a common probability space.

An infinite collection of random variables such as {Xn; n = 0, 1, 2, . . .} defined on a common probability space is called a random process

Extend the example: a Bernoulli random process


A Bernoulli Random Process: Fair Coin Flips

Again let Ω = [0, 1) and P be determined by uniform pdf.

Every number u ∈ [0, 1) has a binary representation

u = Σ_{n=0}^{∞} b_n(u) 2^{−n−1}

binary analog of the decimal representation — unique if we choose the representation b_n not having a finite number of 0s,

e.g., choose 1/2 → 100000···, not 01111···

E.g., if u = 3/4, then b0(u) = b1(u) = 1, bn(u) = 0 for all n ≥ 2.
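A short sketch (not in the original notes) of extracting the bits b_n(u) under the convention above; floating-point doubling naturally yields the terminating representation:

```python
def bits(u, n_bits=8):
    """Binary expansion b_0(u), b_1(u), ... of u in [0, 1):
    u = sum of b_n(u) * 2**(-n-1)."""
    out = []
    for _ in range(n_bits):
        u *= 2
        b = int(u)        # integer part is the next bit
        out.append(b)
        u -= b
    return out

print(bits(0.75))   # [1, 1, 0, 0, 0, 0, 0, 0]: b0 = b1 = 1
print(bits(0.5))    # [1, 0, 0, 0, 0, 0, 0, 0]: the 1000... representation
```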


Xn(u) = bn(u) defines a discrete-time random process {Xn; n = 0, 1, 2, . . .}: one experiment ⇒ an infinite number of rvs!

In the earlier 2D example, X = X0, Y = X1.

Similar computations to the 2D example (inverse image formula) ⇒

• marginal pmfs pXn(x) = 1/2, x = 0, 1 (all equivalent, fair coin flips)

• for any k = 1, 2, . . ., the joint pmf describing the random vector X^k = (X0, . . . , Xk−1) is

p_{X^k}(x^k) = Pr(X^k = x^k) = 2^{−k}; x^k ∈ {0, 1}^k

Hence

p_{X^k}(x^k) = Π_{n=0}^{k−1} pXn(xn); x^k ∈ {0, 1}^k,

a product pmf


A collection of discrete random variables is said to be (mutually) independent if the joint pmf = the product of the marginal pmfs.

A random vector is said to be independent identically distributed or iid (or i.i.d. or IID) if its component random variables are independent and identically distributed

A random process {Xn} is iid if any finite collection of random variables in the collection is iid

An iid process is also called a Bernoulli process (sometimes the name is reserved for binary processes). The classic example is an infinite sequence of fair coin flips.

End of the extended example of the uniform spinner and Bernoulli process; back to general material. Elaborate on Ω, F, P —


Probability spaces: sample space Ω

Common examples:

• {0, 1} Coin flip, value in a data stream at time t

• [0, 1) Fair spin, analog sensor output

• [0, ∞) Time to arrival of first packet, bus, customer

• Z_k ∆= {0, 1, . . . , k − 1} die roll, ascii character, card drawn from deck

• Z_+ = {0, 1, 2, . . .} Number of packets/buses/customers arriving in [0, T)

• R = (−∞, ∞) voltage at a sensor (without known bound)

• {0, 1}^k = all binary k-tuples (a product space), flip one coin k times, flip k coins once, sample k successive values in a data stream


• [0, 1)^k = k-dimensional unit cube, measurements from k identical sensors at different locations

• R^k = k-dimensional Euclidean space, voltage of an array of k sensors

• etc. — e.g., Z^k, sequence spaces such as all binary sequences, waveform spaces such as all continuous or differentiable waveforms


Probability spaces: event space F

Given Ω, the smallest event space is {∅, Ω}.

Biggest event space is the power set. (Too big to be useful for continuous spaces.)

Useful event space in R and R^k is the Borel field = smallest event space containing all of the intervals (k = 1) and rectangles (k > 1)

(Do not need for HW or exams; useful to have an idea when encountered in books and papers)

A measurable space (Ω, F) = sample space + event space of subsets of the sample space


Probability spaces: probability measure

When dealing with finite sample spaces, only need Axiom 3 to hold for finite collections of disjoint events.

Key point A set function P defined on an event space F of subsets of a sample space Ω is a probability measure if and only if (iff) it satisfies the axioms of probability


A trivial example

The simplest possible example is useless except for providing a trivial example.

Ω is any abstract space, F = {Ω, ∅}

P defined by P(Ω) = 1, P(∅) = 0

Axioms of probability measure satisfied.


Simple example: biased coin

Simplest nontrivial example

Ω = {0, 1}

F = {{0}, {1}, Ω, ∅}; axioms of an event space satisfied

P defined by

P(F) = 1 − p if F = {0}, p if F = {1}, 0 if F = ∅, 1 if F = Ω,

where p ∈ (0, 1) is a fixed parameter (p = 0, 1 is a variation on the trivial probability space)

Axioms can be verified in a straightforward manner.


In general cannot do it this way and list the probabilities of every event: too many events

Instead provide a formula for computing probabilities of events as integrals (of a pdf) or sums (of a pmf)

Will see many common examples, most have names (uniform, binomial, geometric, Poisson, Gaussian, Bernoulli, exponential, Laplacian, . . . )

Before more examples, derive several fundamental properties of probability — several elementary and one advanced


Elementary properties of probability

(Ω,F , P)

(a) For all events F, P(F^c) = 1 − P(F)

(b) For all events F, P(F) ≤ 1

(c) Let ∅ be the null or empty set; then P(∅) = 0.

(d) Total Probability If events Fi; i = 1, 2, . . . form a partition of Ω, i.e., if Fi ∩ Fk = ∅ when i ≠ k and ∪_i Fi = Ω, then for any event G

P(G) = Σ_i P(G ∩ Fi).

(e) If G ⊂ F for events G, F, then P(F − G) = P(F) − P(G)

(F − G ∆= F ∩ G^c, also written F\G)


Proof

(a) F ∪ F^c = Ω ⇒ P(F ∪ F^c) = 1 (Axiom 2). F ∩ F^c = ∅ ⇒ 1 = P(F ∪ F^c) = P(F) + P(F^c) (Axiom 3).

(b) P(F) = 1 − P(F^c) ≤ 1 (Axiom 1 and (a) above).

(c) By Axiom 2 and (a) above, P(Ω^c) = P(∅) = 1 − P(Ω) = 0.

Note: The empty set ∅ has probability 0, but P(F) = 0 does not mean F = ∅. E.g., uniform spinner, F = {1/n : n = 1, 2, 3, . . .}

(d) Using set theory and Axiom 3:

P(G) = P(G ∩ Ω) = P(G ∩ (∪_i Fi)) = P(∪_i (G ∩ Fi)) [disjoint] = Σ_i P(G ∩ Fi).


(e) F − G = F ∩ G^c and G are disjoint, so Axiom 3 ⇒

P((F − G) ∪ G) = P(F − G) + P(G).

Since G ⊂ F, G = F ∩ G ⇒

(F − G) ∪ G = (F ∩ G^c) ∪ (F ∩ G) = F ∩ (G ∪ G^c) = F ∩ Ω = F.

Thus P(F) = P(F − G) + P(G).

Note (e) ⇒ that if G ⊂ F, then P(G) ≤ P(F)


An advanced property of probability: Continuity

A sequence of sets Fn, n = 0, 1, 2, . . . is increasing if Fn−1 ⊂ Fn for all n (also called nested), decreasing if Fn ⊂ Fn−1

[Figure: (a) increasing nested sets F1 ⊂ F2 ⊂ F3 ⊂ F4; (b) decreasing nested sets F4 ⊂ F3 ⊂ F2 ⊂ F1]


E.g., increasing: Fn = [0, n), Fn = [1, 2 − 1/n), Fn = (−n, a)

Decreasing: Fn = [1, 3 + 1/n), Fn = (1 − 1/n, 1 + 1/n)

Natural definition of the limit of increasing sets:

lim_{n→∞} Fn ∆= ∪_{n=1}^{∞} Fn

E.g.,

lim_{n→∞} [0, n) = [0, ∞)

lim_{n→∞} [1, 2 − 1/n) = [1, 2)

lim_{n→∞} (−n, a) = (−∞, a)


Natural definition of the limit of decreasing sets:

lim_{n→∞} Fn ∆= ∩_{n=1}^{∞} Fn

E.g.,

lim_{n→∞} [1, 3 + 1/n) = [1, 3]

lim_{n→∞} (1 − 1/n, 1 + 1/n) = {1}

There is no natural definition of a limit of an arbitrary sequence of sets, only for increasing or decreasing.


Continuity of probability If {Fn} is an increasing or decreasing sequence of events, then

P(lim_{n→∞} Fn) = lim_{n→∞} P(Fn)

Prove for increasing sets Fn; n = 0, 1, 2, . . ..

Recall the set theory difference A − B = A ∩ B^c = points in A that are not in B.

Define G0 = F0, Gn = Fn − Fn−1 for n = 1, 2, . . .

The Gn are disjoint,

∪_{k=0}^{n} Fk = Fn = ∪_{k=0}^{n} Gk,   ∪_{k=0}^{∞} Fk = ∪_{k=0}^{∞} Gk


Then

P(lim_{n→∞} Fn) = P(∪_{k=0}^{∞} Fk) = P(∪_{k=0}^{∞} Gk)

= Σ_{k=0}^{∞} P(Gk)   [Axiom 3]

= lim_{n→∞} Σ_{k=0}^{n} P(Gk)   [definition of infinite sum]

Gn = Fn − Fn−1 and Fn−1 ⊂ Fn ⇒ P(Gn) = P(Fn) − P(Fn−1)

⇒ Σ_{k=0}^{n} P(Gk) = P(F0) + Σ_{k=1}^{n} (P(Fk) − P(Fk−1)) = P(Fn).

“Telescoping sum,” all terms cancel but the last one:


P(Fn) = P(Fn) − P(Fn−1)
+ P(Fn−1) − P(Fn−2)
+ P(Fn−2) − P(Fn−3)
...
+ P(F1) − P(F0)
+ P(F0).

⇒ If {Fn} is a sequence of increasing events, then

P(lim_{n→∞} Fn) = lim_{n→∞} P(Fn)

E.g., P((−∞, a]) = lim_{n→∞} P((−n, a]).


Similar proof for decreasing events. E.g.,

P({a}) = lim_{n→∞} P((a − 1/n, a + 1/n))

If P is described by a pdf, then the probability of a point is 0.

Can show: Axioms 1, 2, and 3 for finite collections of disjoint events + continuity of probability ⇔ Axioms 1–3

Kolmogorov's countable additivity axiom ensures good limiting behavior.

Back to more concrete issues.


Discrete probability spaces

Introduce several common examples.

Recall basic construction:

Ω = a discrete (countable) set, F = power set of Ω

Given a probability measure P on (Ω, F) ⇒ pmf

p(ω) ∆= P({ω}); ω ∈ Ω

Conversely, given a pmf p (nonnegative, sums to 1) ⇒ P via

P(F) = Σ_{ω∈F} p(ω).

Calculus (properties of sums) implies axioms of probability satisfied.


Common examples of pmfs

Binary (Bernoulli) pmf. Ω = {0, 1}; p(0) = 1 − p, p(1) = p, where p is a parameter in (0, 1).

A uniform pmf. Ω = Z_n = {0, 1, . . . , n − 1} and p(k) = 1/n; k ∈ Z_n.

Binomial pmf. Ω = Z_{n+1} = {0, 1, . . . , n} and

p(k) = C(n, k) p^k (1 − p)^{n−k}; k ∈ Z_{n+1},

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient (read as “n choose k”).

Geometric pmf. Ω = {1, 2, 3, . . .} and p(k) = (1 − p)^{k−1} p; k = 1, 2, . . ., where p ∈ (0, 1) is a parameter.


The Poisson pmf. Ω = Z_+ = {0, 1, 2, . . .} and p(k) = λ^k e^{−λ}/k!, where λ is a parameter in (0, ∞). (Keep in mind that 0! ∆= 1.)

These are all obviously nonnegative, hence to verify they are pmfs need only show Σ_{ω∈Ω} p(ω) = 1

Obvious for Bernoulli and uniform. For the binomial, use the binomial theorem; for the geometric, use the geometric progression formula; for the Poisson, use the Taylor series expansion for e^x (will do details later, try it on your own)

These are exercises in calculus of sums.
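A quick numeric check of the normalizations (a sketch, not from the notes; the truncation points for the infinite sums are arbitrary choices):

```python
from math import comb, exp, factorial

n, p, lam = 10, 0.3, 2.5

print(sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)))  # binomial: 1.0
print(sum((1 - p)**(k - 1) * p for k in range(1, 500)))                 # geometric: ~1.0
print(sum(lam**k * exp(-lam) / factorial(k) for k in range(100)))       # Poisson: ~1.0
```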

Before continuing to continuous examples (pdfs), consider more general weighted sums called expectations.

(Treated in depth later, useful to introduce basic idea early.)


Discrete expectations

Given pmf p on discrete sample space Ω

Suppose that g is a real-valued function defined on Ω: g : Ω → R

Recall: A real-valued function defined on a probability space is called a random variable¹

Define the expectation of g (with respect to p) by

E(g) = Σ_{ω∈Ω} g(ω) p(ω).

¹There is a required technical condition we will see later, but it is automatic for discrete probability spaces with the power set as event space.


Example: If g(ω) = 1_F(ω), then

P(F) = E(1_F),   (2)

⇒ probability can be viewed as a special case of expectation

Some other important examples:

Suppose that Ω ⊂ R, e.g., R, [0, 1), Z or Z_n. Fix a pmf p.

If g(ω) = ω, mean or first moment

m = Σ_ω ω p(ω)

kth moment

m^{(k)} = Σ_ω ω^k p(ω),

e.g., m = m^{(1)}.


The second moment is often of interest:

m^{(2)} = Σ_ω ω² p(ω).

Centralized moments: Σ_ω (ω − m)^k p(ω). Most important is the variance

σ² = Σ_ω (ω − m)² p(ω).

Note:

σ² = Σ_ω (ω − m)² p(ω) = Σ_ω (ω² − 2ωm + m²) p(ω)

= Σ_ω ω² p(ω) − 2m Σ_ω ω p(ω) + m² Σ_ω p(ω)

= m^{(2)} − 2m² + m² = m^{(2)} − m²


variance = second moment − mean²
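A small check of the identity σ² = m^{(2)} − m² on a made-up pmf (a sketch, not from the notes):

```python
# Hypothetical pmf on {0, 1, 2, 3}
pmf = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

m  = sum(w * p for w, p in pmf.items())                 # first moment (mean)
m2 = sum(w**2 * p for w, p in pmf.items())              # second moment
var_direct = sum((w - m)**2 * p for w, p in pmf.items())

print(var_direct, m2 - m**2)   # identical: variance = m(2) - m^2
```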

Will see a similar definition for the continuous case with a pdf, with integrals instead of sums

For now moments can be viewed as simply attributes of pmfs and pdfs. For many pmfs and pdfs, knowing certain moments completely describes the pmf/pdf.


Computational examples

Discrete uniform pmf on Z_n

P(F) = (1/n) Σ_ω 1_F(ω) = #(F)/n,

#(F) = number of elements in F.

mean:

m = (1/n) Σ_{k=0}^{n−1} k = (n − 1)/2

second moment:

m^{(2)} = (1/n) Σ_{k=0}^{n−1} k² = (n − 1)(2n − 1)/6


Binomial pmf To show it is a valid pmf, use the binomial theorem:

(a + b)^n = Σ_{k=0}^{n} C(n, k) a^k b^{n−k}

Set a = p, b = 1 − p:

Σ_{k=0}^{n} p(k) = Σ_{k=0}^{n} C(n, k) p^k (1 − p)^{n−k} = (p + 1 − p)^n = 1.

Finding the mean is messy, but good practice. Later find shortcuts.

m = Σ_{k=0}^{n} k C(n, k) p^k (1 − p)^{n−k}

= Σ_{k=1}^{n} [n!/((n − k)!(k − 1)!)] p^k (1 − p)^{n−k}


Trick: resembles terms in the binomial theorem, massage it into that form.

Change variables l = k − 1:

m = Σ_{l=0}^{n−1} [n!/((n − l − 1)! l!)] p^{l+1} (1 − p)^{n−l−1}

= np Σ_{l=0}^{n−1} [(n − 1)!/((n − 1 − l)! l!)] p^l (1 − p)^{n−1−l}

= np (p + 1 − p)^{n−1} = np.

Postpone second moment until have a better method.
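A numerical confirmation that the binomial mean is np (a sketch, not from the notes):

```python
from math import comb

n, p = 12, 0.35
mean = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
print(mean, n * p)   # both 4.2 (up to floating point)
```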


Geometric pmf Use the geometric progression: if |a| < 1,

Σ_{k=0}^{∞} a^k = 1/(1 − a),

⇒ the geometric pmf indeed sums to 1.

mean:

m = Σ_{k=1}^{∞} k p(k) = Σ_{k=1}^{∞} k p (1 − p)^{k−1}.

How to evaluate? Can look it up, or use a trick: differentiate the geometric progression formula to obtain

d/da Σ_{k=0}^{∞} a^k = Σ_{k=0}^{∞} k a^{k−1} = d/da [1/(1 − a)] = 1/(1 − a)²


Set a = 1 − p and get

m = 1/p

for the geometric pmf

A similar idea works for the second moment:

m^{(2)} = Σ_{k=1}^{∞} k² p (1 − p)^{k−1} = p (2/p³ − 1/p²) = (2 − p)/p²

hence

σ² = m^{(2)} − m² = (1 − p)/p².


Example of probability calculus using the geometric pmf: Find the probabilities of the events F = {k : k ≥ 10} and G = {k : k is odd}.

Note that F = {10, 11, 12, . . .} and G = {1, 3, 5, 7, . . .}

Solutions:

P(F) = Σ_{k∈F} p(k) = Σ_{k=10}^{∞} p (1 − p)^{k−1}

= [p/(1 − p)] Σ_{k=10}^{∞} (1 − p)^k = [p/(1 − p)] (1 − p)^{10} Σ_{k=10}^{∞} (1 − p)^{k−10}

= p (1 − p)^9 Σ_{k=0}^{∞} (1 − p)^k = (1 − p)^9,


P(G) = Σ_{k∈G} p(k) = Σ_{k=1,3,...} p (1 − p)^{k−1}

= p Σ_{k=0,2,4,...} (1 − p)^k = p Σ_{k=0}^{∞} [(1 − p)²]^k

= p/(1 − (1 − p)²) = 1/(2 − p).
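Both answers can be sanity-checked by direct summation (a sketch; the 10,000-term truncation is an arbitrary choice):

```python
p = 0.4
pmf = lambda k: (1 - p)**(k - 1) * p   # geometric pmf on {1, 2, ...}

P_F = sum(pmf(k) for k in range(10, 10_000))      # k >= 10
P_G = sum(pmf(k) for k in range(1, 10_000, 2))    # k odd
print(P_F, (1 - p)**9)     # both ~0.01008
print(P_G, 1 / (2 - p))    # both 0.625
```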

Poisson pmf

Show it sums to 1:

Σ_{k=0}^{∞} p(k) = Σ_{k=0}^{∞} λ^k e^{−λ}/k! = e^{−λ} Σ_{k=0}^{∞} λ^k/k!   [Taylor series for e^λ]   = 1


mean:

Σ_{k=0}^{∞} k p(k) = Σ_{k=1}^{∞} k λ^k e^{−λ}/k! = e^{−λ} Σ_{k=1}^{∞} λ^k/(k − 1)!.

Change variables l = k − 1:

Σ_{k=0}^{∞} k p(k) = λ e^{−λ} Σ_{l=0}^{∞} λ^l/l! = λ

m^{(2)} = Σ_{k=1}^{∞} k² λ^k e^{−λ}/k! = Σ_{k=2}^{∞} k(k − 1) λ^k e^{−λ}/k! + m

= Σ_{k=2}^{∞} λ^k e^{−λ}/(k − 2)! + m


Change variables l = k − 2:

m^{(2)} = λ² Σ_{l=0}^{∞} λ^l e^{−λ}/l! + m = λ² + λ

so σ² = λ
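A numeric check of the Poisson mean and variance (a sketch, not from the notes; the 200-term truncation is arbitrary):

```python
from math import exp, factorial

lam = 3.7
pmf = lambda k: lam**k * exp(-lam) / factorial(k)

m  = sum(k * pmf(k) for k in range(200))
m2 = sum(k**2 * pmf(k) for k in range(200))
print(m, m2 - m**2)   # both ~3.7 = lambda
```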

Moments crop up in many signal processing applications (and statistical analysis of many kinds), so it is useful to have a table of moment formulas for important distributions handy and an appreciation of the underlying calculus.


Multidimensional pmf’s

Vector spaces with discrete coordinates like Z_n^k are also discrete.

The same ideas of describing probabilities by pmfs work.

Suppose A is a discrete space

A^k = all vectors x = (x0, . . . , x_{k−1}) with xi ∈ A, i = 0, 1, . . . , k − 1 is also discrete.

If we have a pmf p on A^k, i.e., p(x) ≥ 0, Σ_{x∈A^k} p(x) = 1,

then P(F) = Σ_{x∈F} p(x) is a probability measure


Common example of a pmf on vectors is a product pmf:

Product pmf Suppose pi; i = 0, 1, . . . , k − 1 are one-dimensional pmfs on A (for each i, pi(x) ≥ 0 for all x ∈ A and Σ_{x∈A} pi(x) = 1; each nonnegative, sums to 1)

Define the product k-dimensional pmf p on A^k by

p(x) = p(x0, x1, . . . , x_{k−1}) = Π_{i=0}^{k−1} pi(xi).

Easily seen to be a pmf (nonnegative, sums to 1)


Example: product of Bernoulli pmfs:

pi(x) = p^x (1 − p)^{1−x}; x = 0, 1, all i

⇒ product pmf

p(x0, x1, . . . , x_{k−1}) = Π_{i=0}^{k−1} p^{xi} (1 − p)^{1−xi} = p^{w(x0,x1,...,x_{k−1})} (1 − p)^{k−w(x0,x1,...,x_{k−1})},

where w(x0, x1, . . . , x_{k−1}) = number of ones in the binary k-tuple x0, x1, . . . , x_{k−1}, the Hamming weight of the vector.

Will see that when the probability of a bunch of things factors into a bunch of probabilities in this way, things simplify significantly
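A small sketch (not from the notes) of the Hamming-weight form of the product Bernoulli pmf:

```python
from itertools import product

p, k = 0.3, 4

def pmf(x):
    """Product Bernoulli pmf via Hamming weight w = sum(x)."""
    w = sum(x)
    return p**w * (1 - p)**(k - w)

total = sum(pmf(x) for x in product((0, 1), repeat=k))
print(total)   # 1.0: the product pmf sums to 1 over {0,1}^k
```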


Continuous probability spaces

Recall basic construction using pdf on the real line R:

(Ω,F ) = (R,B(R))

f is a pdf on R if f(r) ≥ 0 for all r ∈ Ω and

∫_Ω f(r) dr = 1

Define the set function P by

P(F) = ∫_F f(r) dr = ∫ 1_F(r) f(r) dr, F ∈ B(R)

Unlike a pmf, a pdf is not a probability of something — it is a density of probability. To get a probability, must integrate the pdf over a set.


Common approximation: Suppose Ω = R and f a pdf. Consider the event F = [x, x + ∆x), ∆x small.

Mean value theorem of calculus ⇒

P([x, x + ∆x)) = ∫_x^{x+∆x} f(α) dα ≈ f(x) ∆x

Theoretical issue Does P(F) = ∫_F f(r) dr = ∫ 1_F(r) f(r) dr, F ∈ B(R), in fact define a probability measure? That is, does the set function P satisfy Axioms 1–3?

Answer: Yes, with proper care to use the correct event space and notion of integration. As usual, rarely a problem in practice. Need details in research.

Usually continuous probability just substitutes integration for summation, but pmfs and pdfs are inherently different.


Common examples of pdfs

pdfs are assumed to be 0 outside the specified domain and are given in terms of real-valued parameters b, a, λ > 0, m, and σ > 0.

Uniform pdf. Given b > a, f(r) = 1/(b − a) for r ∈ [a, b].

Exponential pdf. f(r) = λ e^{−λr}; r ≥ 0.

Doubly exponential (or Laplacian) pdf. f(r) = (λ/2) e^{−λ|r|}; r ∈ R.

Gaussian (or Normal) pdf. f(r) = (2πσ²)^{−1/2} exp(−(r − m)²/2σ²); r ∈ R. Commonly denoted by N(m, σ²).


Continuous expectation

Analogous to discrete expectation.

Have Ω, pdf f, function g : Ω → R. Define the expectation of g (with respect to f):

E(g) = ∫ g(r) f(r) dr.

As in the discrete case, P(F) = E(1_F).

Important examples if Ω ⊂ R:

mean or first moment m = ∫ r f(r) dr,

kth moment m^{(k)} = ∫ r^k f(r) dr

second moment m^{(2)} = ∫ r² f(r) dr


centralized moments ∫ (r − m)^k f(r) dr

including the variance σ² = ∫ (r − m)² f(r) dr

When complex-valued functions are allowed, often the kth absolute moment is used: m_a^{(k)} = ∫ |r|^k f(r) dr.

As in the discrete case,

σ² = m^{(2)} − m²


Computational examples

Continuous uniform pdf on [a, b)

Consider a = 0 and b = 1

mean m = ∫_0^1 r dr = r²/2 |_0^1 = 1/2

second moment m^{(2)} = ∫_0^1 r² dr = r³/3 |_0^1 = 1/3,

variance σ² = 1/3 − (1/2)² = 1/12


Exponential pdf

From tables, or continuous analogs of tricks for geometric.

The validation of the pdf (integrates to 1) and the mean, second moment, and variance of the exponential pdf can be found from integral tables, or by the integral analogs of the corresponding computations for the geometric pmf, or from integration by parts:

∫_0^∞ λ e^{−λr} dr = 1

m = ∫_0^∞ r λ e^{−λr} dr = 1/λ

m^{(2)} = ∫_0^∞ r² λ e^{−λr} dr = 2/λ²

σ² = 2/λ² − 1/λ² = 1/λ²


Laplacian pdf = mixture of an exponential and its reverse, left as an exercise.

Gaussian pdf Moment computation is more trouble than it's worth; will find an easier method later.

For now just state:

∫_{−∞}^{∞} (2πσ²)^{−1/2} e^{−(x−m)²/2σ²} dx = 1

∫_{−∞}^{∞} (2πσ²)^{−1/2} x e^{−(x−m)²/2σ²} dx = m

∫_{−∞}^{∞} (2πσ²)^{−1/2} (x − m)² e^{−(x−m)²/2σ²} dx = σ²

i.e., mean = m, variance = σ², as the notation suggests.


Computing probabilities sometimes easy, sometimes not.

With a uniform pdf on [a, b], for a ≤ c < d ≤ b

P([c, d]) = (d − c)/(b − a)

For the exponential pdf, [c, d], 0 ≤ c < d:

P([c, d]) = ∫_c^d λ e^{−λx} dx = e^{−λc} − e^{−λd}.

No such nice form for the Gaussian, but well tabulated in terms of

Φ(α) = (1/√2π) ∫_{−∞}^{α} e^{−u²/2} du

Q(α) = (1/√2π) ∫_{α}^{∞} e^{−u²/2} du = 1 − Φ(α)

erf(α) = (2/√π) ∫_0^{α} e^{−u²} du


Note: Φ(α) = P((−∞, α]) = P({x : x ≤ α}) for N(0, 1)

Q function = complementary function, Q(α) = 1 − Φ(α).

Common in communications systems analysis:

Q(α) = (1/2)[1 − erf(α/√2)] = 1 − Φ(α).

Change variables to find probabilities with tables (or numerically from these functions)

For example, find P((−∞, α)) for N(m, σ²):


Change variables: u = (x − m)/σ ⇒

P({x : x ≤ α}) = ∫_{−∞}^{α} (2πσ²)^{−1/2} e^{−(x−m)²/2σ²} dx

= ∫_{−∞}^{(α−m)/σ} (2π)^{−1/2} e^{−u²/2} du

= Φ((α − m)/σ) = 1 − Q((α − m)/σ).

P((a, b]) = P((−∞, b]) − P((−∞, a]) = Φ((b − m)/σ) − Φ((a − m)/σ).

Symmetry of the Gaussian density ⇒

1 − Φ(a) = Φ(−a).
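A sketch (not from the notes) computing Gaussian interval probabilities via this change of variables, using math.erf and the relation Φ(α) = (1 + erf(α/√2))/2:

```python
from math import erf, sqrt

def Phi(a):
    """Standard normal CDF via erf."""
    return 0.5 * (1 + erf(a / sqrt(2)))

def gaussian_interval(a, b, m=0.0, sigma=1.0):
    """P((a, b]) for N(m, sigma^2) = Phi((b-m)/sigma) - Phi((a-m)/sigma)."""
    return Phi((b - m) / sigma) - Phi((a - m) / sigma)

print(gaussian_interval(-1, 1))                  # ~0.6827 (one-sigma rule)
print(gaussian_interval(0, 2, m=1, sigma=0.5))   # ~0.9545 for X ~ N(1, 0.25)
```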


Multidimensional pdf example

Ω = R²

Event space: the two-dimensional Borel field B(R)² (to give it a name; the main point is there is a standard, useful definition which you can learn about in advanced probability)

Probability measure described by a 2D pdf:

f(x, y) = λµ e^{−λx−µy} for x ∈ [0, ∞), y ∈ [0, ∞); 0 otherwise.

What is the probability of the event F = {(x, y) : x < y}?

Note: Analogous to a product pmf, this is a product pdf


Interpretation: Sample points (x, y) = first arrival times of two distinct types of particle (type A and type B) at a sensor (or packets at a router or buses at a bus stop) after time 0

F = event that a particle of type A arrives at the sensor before one of type B

As with 1D, probability = integral of the pdf over the event:

P(F) = ∫∫_{(x,y)∈F} f(x, y) dx dy = ∫∫_{x≥0, y≥0, x<y} λµ e^{−λx−µy} dx dy.


Now it is just calculus:

P(F) = ∫∫_{x≥0, y≥0, x<y} λµ e^{−λx−µy} dx dy

= λµ ∫_0^∞ dy ∫_0^y dx e^{−λx} e^{−µy}

= λµ ∫_0^∞ dy e^{−µy} ∫_0^y dx e^{−λx}

= λµ ∫_0^∞ dy e^{−µy} (1/λ)(1 − e^{−λy})

= µ ∫_0^∞ dy e^{−µy} − µ ∫_0^∞ dy e^{−(µ+λ)y}

= 1 − µ/(µ + λ) = λ/(µ + λ).
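A Monte Carlo sketch (not from the notes) checking P(F) = λ/(µ + λ) by sampling the two independent exponential arrival times:

```python
import random

lam, mu = 2.0, 0.5
trials = 200_000

# random.expovariate(rate) samples an exponential with the given rate
hits = sum(random.expovariate(lam) < random.expovariate(mu) for _ in range(trials))
print(hits / trials, lam / (lam + mu))   # both ~0.8
```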


Mass functions as densities

Can use continuous ideas in discrete problems with Dirac deltas, but it is usually clumsy and adds complication. Describe briefly.

Dirac delta defined implicitly by its behavior inside an integral: if g(r); r ∈ R is continuous at a ∈ R, then

∫ g(r) δ(r − a) dr = g(a)

(no ordinary function does this; Dirac deltas are generalized functions)

Given a pmf p defined on Ω ⊂ R, can define a pdf f by

f(r) = Σ_ω p(ω) δ(r − ω).


Then f(r) ≥ 0 and

∫ f(r) dr = ∫ [Σ_ω p(ω) δ(r − ω)] dr = Σ_ω p(ω) ∫ δ(r − ω) dr = Σ_ω p(ω) = 1

∫ 1_F(r) f(r) dr = ∫ 1_F(r) [Σ_ω p(ω) δ(r − ω)] dr = Σ_ω p(ω) ∫ 1_F(r) δ(r − ω) dr

= Σ_ω p(ω) 1_F(ω) = P(F).

But sum of pmfs is simpler, deltas usually make things harder.


Multidimensional pdfs: General Case

Multidimensional integrals allow the construction of probabilities on R^k

Given the measurable space (R^k, B(R)^k), a real-valued function f on R^k is a pdf if

f(x) ≥ 0; all x = (x0, x1, . . . , x_{k−1}) ∈ R^k,

∫_{R^k} f(x) dx = 1

Define a set function P by

P(F) = ∫_F f(x) dx, all F ∈ B(R)^k,

where the vector integral is shorthand for the k-dimensional integral

P(F) = ∫_{(x0,x1,...,x_{k−1})∈F} f(x0, x1, . . . , x_{k−1}) dx0 dx1 . . . dx_{k−1}.

As with multidimensional pmf’s, a pdf is not itself the probability ofanything.

As with the 1D case, subject to appropriate assumptions (event space, integral), this does define a probability measure, i.e., it satisfies the axioms.

Two common and very important examples of k-dimensional pdfs:

Product pdf. A product pdf has the form

f(x) = f(x0, x1, . . . , x_{k−1}) = Π_{i=0}^{k−1} fi(xi)

where fi; i = 0, 1, . . . , k − 1 are one-dimensional pdfs on the real line.

Most important special case: all fi(r) are the same (identically distributed)

So all pdfs on R ⇒ product pdfs on R^k


Multidimensional Gaussian pdf

m = (m0, m1, . . . , m_{k−1})^t, a column vector

Λ = a k by k square matrix with entries λ_{i,j}; i = 0, 1, . . . , k − 1; j = 0, 1, . . . , k − 1. Assume

1. Λ is symmetric (Λ^t = Λ or, equivalently, λ_{i,j} = λ_{j,i}, all i, j)

2. Λ is positive definite; i.e., for any nonzero vector y ∈ R^k

y^t Λ y = Σ_{i=0}^{k−1} Σ_{j=0}^{k−1} yi λ_{i,j} yj > 0

A multidimensional pdf is Gaussian if it has the form

f(x) = (2π)^{−k/2} (det Λ)^{−1/2} e^{−(x−m)^t Λ^{−1} (x−m)/2}; x ∈ R^k,

where det Λ is the determinant of the matrix Λ

Notes:

• Λ positive definite ⇒ det Λ > 0 and Λ^{−1} exists

• Hard to show it integrates to 1

• If Λ is a diagonal matrix, this becomes a product pdf

• There is a more general definition of a Gaussian random vector that only requires Λ to be nonnegative definite

• Gaussian important because it crops up often as a good approximation (central limit theorem) and is sometimes a “worst case”


Mixtures

Example of constructing new probability measures from old

{Pi, i = 1, 2, . . .} is a collection of probability measures on a common measurable space (Ω, F)

ai ≥ 0, i = 1, 2, . . ., with Σ_{i=1}^{∞} ai = 1. Then

P(F) = Σ_{i=1}^{∞} ai Pi(F)

is also a probability measure on (Ω, F).

Abbreviation: P = Σ_{i=1}^{∞} ai Pi


Useful for constructing probability measures mixing continuous and discrete:

Ω = R, f a pdf and p a pmf, λ ∈ (0, 1)

Then the mixture

P(F) = λ Σ_{x∈F} p(x) + (1 − λ) ∫_F f(x) dx

E.g., experiment: First spin a fair wheel. If the pointer lands in [0, λ), then roll a die described by p. If the pointer lands in [λ, 1), then choose ω using a Gaussian.

(Almost) the most general model. Find expectations in such cases in the natural way:


Given a function g,

E(g) = λ Σ_{x∈Ω} g(x) p(x) + (1 − λ) ∫_Ω g(x) f(x) dx.

Works for scalar and vector sample spaces.

Another use of mixtures: First choose parameters at random (e.g., the bias of a coin), then use the probability measure described by the parameter (e.g., Bernoulli(p))

Will see examples.


Event Independence

Given (Ω, F, P), two events F and G are independent if P(F ∩ G) = P(F)P(G)

A collection of events {Fi; i = 0, 1, . . . , k − 1} is independent or mutually independent if for any distinct subcollection F_{l_i}; i = 0, 1, . . . , m − 1, l_i < k,

P(∩_{i=0}^{m−1} F_{l_i}) = Π_{i=0}^{m−1} P(F_{l_i}).

Note: Requirement on all subcollections! Not enough to just say P(∩_{i=0}^{k−1} Fi) = Π_{i=0}^{k−1} P(Fi)


Example of what can go wrong: Suppose that

P(F) = P(G) = P(H) = 1/3

P(F ∩ G ∩ H) = 1/27 = P(F)P(G)P(H)

P(F ∩ G) = P(G ∩ H) = P(F ∩ H) = 1/27 ≠ P(F)P(G).

Zero probability on the overlap F ∩ G except where it also overlaps H, i.e., P(F ∩ G ∩ H^c) = 0. Thus P(F ∩ G ∩ H) = P(F)P(G)P(H) = 1/27, but P(F ∩ G) = 1/27 ≠ P(F)P(G) = 1/9.

So P(F ∩ G ∩ H) = P(F)P(G)P(H) for the three events F, G, and H, yet it is not true that P(F ∩ G) = P(F)P(G).

So here it is not true that the three events are mutually independent!


Probabilistic independence has an intuitive interpretation in terms of the next topic — elementary conditional probability.

But definition does not require conditional probability.


Elementary conditional probability

Intuitively, independence of two events means that the occurrence of one event should not affect the probability of occurrence of the other.

E.g., the outcome of the roll of one die does not change the probability of the next roll (or of rolling another die)

Need definition of probability of one event conditioned on another.

Motivation: Given (Ω, F, P), suppose we know that event G has occurred. What is a reasonable definition for the probability that F will occur given (conditioned on) G? P(F | G)


For a fixed G, P(F | G) should be defined for all events F; it should be a probability measure on (Ω, F).

What properties should it have? Clearly P(G^c | G) = 0 & P(G | G) = 1.

Since P(· | G) must be a probability measure ⇒

P(F | G) = P(F ∩ (G ∪ G^c) | G) = P(F ∩ G | G) + P(F ∩ G^c | G),

and P(F ∩ G^c | G) = 0, so

P(F | G) = P(F ∩ G | G)   (1)


Next, no reason to suspect that the relative probabilities within G should change because of the knowledge that G occurred.

E.g., if F ⊂ G is twice as probable as an event H ⊂ G with respect to P, then the same should be true with respect to P(· | G).

For arbitrary events F and H, F ∩ G, H ∩ G ⊂ G ⇒

P(F ∩ G | G)/P(H ∩ G | G) = P(F ∩ G)/P(H ∩ G).

Set H = Ω + (1) ⇒

P(F | G) = P(F ∩ G | G) = P(F ∩ G)/P(G)   (2)

Motivates using (2) as definition of conditional probability


Only works if P(G) > 0; called elementary conditional probability

(will later see nonelementary conditional probability; it is more complicated)

Easy to prove that since P is a probability measure, so is P(· | G)

Independence redux

Suppose that F and G are independent events and that P(G) > 0; then

P(F | G) = P(F ∩ G)/P(G) = P(F);

Occurrence of G does not affect the probability of F: a posteriori probability P(F | G) = a priori probability P(F).

Not used as the definition of independence since less general.


Conditional probability provides a means of constructing new probability spaces from old ones

Example: Given discrete (Ω, F, P) described by a pmf p, and an event A with P(A) > 0. Define the pmf pA:

pA(ω) = p(ω)/P(A) = P({ω} | A) for ω ∈ A; 0 for ω ∉ A

⇒ (Ω, F, PA), where

PA(F) = Σ_{ω∈F} pA(ω) = P(F | A).

pA is a conditional pmf


E.g., p = geometric pmf and A = {ω : ω ≥ K} = {K, K + 1, . . .}.

pA(k) = (1 − p)^{k−1} p / Σ_{l=K}^{∞} (1 − p)^{l−1} p = (1 − p)^{k−1} p / (1 − p)^{K−1}

= (1 − p)^{k−K} p; k = K, K + 1, . . .,

a geometric pmf shifted to begin at k = K


Example: Continuous (Ω, F, P), pdf f, P(A) > 0. Define the conditional pdf fA:

fA(ω) = f(ω)/P(A) for ω ∈ A; 0 for ω ∉ A

and the conditional probability

PA(F) = ∫_{ω∈F} fA(ω) dω = P(F | A).

E.g., f = exponential pdf, A = {r : r ≥ c}:

fA(x) = λ e^{−λx} / ∫_c^∞ λ e^{−λy} dy = λ e^{−λx} / e^{−λc} = λ e^{−λ(x−c)}; x ≥ c,

Like the geometric, conditioned on x ≥ c the pdf has the same form, just shifted.


Interpretation: The exponential models bus arrival times. The pdf for the next arrival given you have waited an hour is the same as when you got to the bus stop.
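A simulation sketch of this memorylessness (not from the notes; the cutoff c = 1.0 and the rate are arbitrary choices):

```python
import random

lam, c, trials = 1.5, 1.0, 200_000

# Arrival times, and remaining waits after already waiting past time c
arrivals = [random.expovariate(lam) for _ in range(trials)]
conditioned = [t - c for t in arrivals if t >= c]

mean_all = sum(arrivals) / len(arrivals)
mean_cond = sum(conditioned) / len(conditioned)
print(mean_all, mean_cond)   # both ~1/lambda: the remaining wait is "fresh"
```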


Bayes Rule

Total probability + conditional probability

Recall total probability: If {Fi; i = 1, 2, . . .} is a partition of Ω, then for any event G

P(G) = Σ_i P(G ∩ Fi).

Suppose we know P(Fi), all i — the a priori probabilities of the collection of events

Suppose know P(G | Fi)

Find a posteriori probabilities P(Fi | G).

Given you observe G, how does that change the probabilities of Fi?


P(Fi | G) = P(Fi ∩ G)/P(G) = P(G | Fi)P(Fi) / Σ_j P(G ∩ Fj) = P(G | Fi)P(Fi) / Σ_j P(G | Fj)P(Fj)

Example of Bayes’ rule: Binary communication channel

Noisy binary communication channel: 0 or 1 is sent and 0 or 1 is received. Assume that 0 is sent with probability 0.2 (⇒ 1 is sent with probability 0.8)

The channel is noisy. If a 0 is sent, a 0 is received with probability 0.9, and if a 1 is sent, a 1 is received with probability 0.975


Can represent this channel model by a probability transition diagram

[Diagram: inputs 0 (P(0) = 0.2) and 1 (P(1) = 0.8); transitions P(0 | 0) = 0.9, P(1 | 0) = 0.1, P(1 | 1) = 0.975, P(0 | 1) = 0.025]

Problem: Given that 0 is received, find the probability that 0 was sent

Sample space Ω = {(0, 0), (0, 1), (1, 0), (1, 1)}; points: (bit sent, bit received)


Define

A = {0 is sent} = {(0, 1), (0, 0)}, B = {0 is received} = {(0, 0), (1, 0)}

The probability measure is defined via the P(A), P(B | A), and P(B^c | A^c) provided on the probability transition diagram of the channel

To find P(A | B), use Bayes' rule:

P(A | B) = P(B | A)P(A) / (P(A)P(B | A) + P(A^c)P(B | A^c))

⇒ P(A | B) = (0.9 × 0.2) / (0.2 × 0.9 + 0.8 × 0.025) = 0.18/0.2 = 0.9,

much higher than the a priori probability of A (= 0.2)
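The same computation in a few lines (a sketch, not from the notes):

```python
P_A = 0.2               # P(0 sent)
P_B_given_A = 0.9       # P(0 received | 0 sent)
P_B_given_notA = 0.025  # P(0 received | 1 sent)

P_B = P_A * P_B_given_A + (1 - P_A) * P_B_given_notA   # total probability
P_A_given_B = P_B_given_A * P_A / P_B                  # Bayes' rule
print(P_A_given_B)   # 0.9
```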


End of basic probability. Next: random variables, vectors, and processes.

EE278: Introduction to Statistical Signal Processing, Winter 2010–2011 · January 11, 2011 · R.M. Gray