
EE 278 Lecture Notes #2, Winter 2010–2011

Probability

Review and elaboration of basic probability + simple examples of random variables, vectors, and processes

Probability spaces, fair spinner, one coin flip, multiple coin flips, a Bernoulli random process. pdfs and pmfs


Probability Space

Probability assigns a measure like length, area, volume, weight, mass to events = sets in some space

Usually involves sums (discrete probability) or integrals (continuous)

Basic construct is a probability space or experiment (Ω, F, P) which consists of three items:


1. Sample space Ω = an abstract space of elements called points

E.g., {H, T}, {0, 1}, [0, 1), R^k

2. Event space F = collection of subsets of Ω called events, to which probabilities are assigned

E.g., all subsets of Ω

3. Probability measure P = assignment of real numbers to events consistent with a set of axioms

Consider each component in order:


Sample space Ω: an abstract space of elements called (sample) points

Intuition: contains all distinguishable elementary outcomes or finest grain results of an experiment.


Event space F (sigma-field): a collection of subsets of Ω s.t.

a) If F ∈ F, then also F^c ∈ F

b) If Fi ∈ F, i = 1, 2, . . ., then also ∪_i Fi ∈ F

Intuition: Algebraic structure — a)–b) + set theory ⇒ countable set-theoretic operations (union, intersection, complementation, difference) on events produce new events. Ω ∈ F, ∅ ∈ F.

F ⊂ Ω, but F ∈ F (set inclusion vs. element inclusion)

Event spaces are not an issue in the elementary case where Ω is discrete: use F = all subsets of Ω (power set of Ω)

Event spaces are an issue in continuous spaces, where the power set is too large for a useful theory. If Ω = R, use the Borel field B(R) (the smallest event space containing the intervals).


Probability measure P: an assignment of a number P(F) to every event F ∈ F in a way that satisfies Kolmogorov's axioms of probability:

1. P(F) ≥ 0 for all F ∈ F

2. P(Ω) = 1

3. If Fi ∈ F, i = 1, 2, . . . are disjoint or mutually exclusive (Fi ∩ Fj = ∅ if i ≠ j), then

P(∪_i Fi) = Σ_i P(Fi)

Axioms are enough to get a useful calculus of probability + useful mathematical models of random processes with predictable long term behavior (laws of large numbers or ergodic theorems, central limit theorem)


Example: Spinning pointer

Introduce several fundamental ideas in the context of two simple examples: a fair spinning pointer (or wheel) and a single coin flip. Then consider many coin flips.

Spin a fair pointer in a circle:

[Figure: circular spinner, circumference labeled 0.0, 0.25, 0.5, 0.75]

When the pointer stops it can point to any number in the unit interval Ω = [0, 1) ∆= {r : 0 ≤ r < 1}. Describe (Ω, F, P)


Sample space = Ω = [0, 1)

Event space = F = smallest event space containing all of the intervals, called B([0, 1)), the Borel field of [0, 1).

Probability measure: For the fair spinner, the probability the outcome is a point in F ∈ B([0, 1)) is

P(F) = ∫_F f(x) dx

where

f(x) = 1, x ∈ [0, 1),

is a uniform probability density function (pdf)

E.g., if F = [a, b] = {r : a ≤ r ≤ b} with 0 ≤ a ≤ b < 1, P(F) = b − a

The probability of the pointer landing in an interval of length b − a is b − a, the fraction of the sample space corresponding to the event — intuitive!
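As a quick sanity check (a minimal sketch in Python, not part of the original notes), we can simulate the fair spinner and compare empirical interval frequencies against b − a:

```python
import random

def spinner_interval_prob(a, b, trials=200_000):
    """Estimate P([a, b]) for the fair spinner by Monte Carlo."""
    hits = sum(1 for _ in range(trials) if a <= random.random() <= b)
    return hits / trials

a, b = 0.25, 0.6
print(f"empirical ~ {spinner_interval_prob(a, b):.4f}, exact = {b - a:.4f}")
```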


In general, f(x), x ∈ Ω is a pdf if

1. f(x) ≥ 0, all x,

2. ∫_Ω f(x) dx = 1.

pdf ⇒ a probability measure by integration:

P(F) = ∫_F f(x) dx

Can also write

P(F) = ∫ 1_F(x) f(x) dx

where the indicator function of F is given by

1_F(r) ∆= 1 if r ∈ F, 0 if r ∉ F


Same result if instead we set Ω = R, F = B(R), and pdf

f(r) = 1 if r ∈ [0, 1), 0 otherwise.

Comments:

• Event space details ⇒ integrals make sense. Important in research, not so much in practice. But good to know the language when reading the literature.

• The integrals in practice are Riemann. In theory they are Lebesgue (better limiting properties). In most cases the two integrals are the same.

• The axioms of probability are properties of integration in disguise. See the next slide.


PDFs and the axioms of probability

Suppose f is a pdf (nonnegative, ∫_Ω f(x) dx = 1) and

P(F) = ∫_F f(x) dx

Then

• Probabilities are nonnegative since integrating a nonnegative argument yields a nonnegative result. (Axiom 1)

• The probability of the entire sample space is 1 since integrating 1 over the unit interval yields 1. (Axiom 2)


• The probability of a finite union of disjoint regions is the sum of the probabilities of the individual events since integration is linear:

P(F ∪ G) = ∫ 1_{F∪G}(r) f(r) dr = ∫ (1_F(r) + 1_G(r)) f(r) dr

= ∫ 1_F(r) f(r) dr + ∫ 1_G(r) f(r) dr

= P(F) + P(G).

(Part of Axiom 3 from linearity of integration)

Need to show this for a countable number of disjoint events for Axiom 3. Is the above enough? Unfortunately, no for the Riemann integral.

True if use Lebesgue integral. (We do not pursue details.)


Example Probability Spaces: Single coin flip

Develop in two ways:

Model I Direct description — simplest way, fine if all you care about is one coin flip.

Model II As a random variable (or signal processing) defined on the fair spinner. Will define lots of other random variables on the same space.


Coin Flip Model I: Direct description

Sample space Ω = {0, 1} — using numbers instead of head, tail will prove convenient

Event space F = {{0}, {1}, Ω, ∅} = all subsets of Ω (power set of Ω)

Probability measure P: define in terms of a sum of a probability mass function, analogous to the integral of a pdf for the spinner

Given a discrete sample space (= countable number of elements) Ω, a probability mass function (pmf) p(ω) is a nonnegative function such that Σ_{ω∈Ω} p(ω) = 1.


Given a pmf p, define P by P({ω}) = p(ω) + Axiom 3:

P(F) = Σ_{x∈F} P({x}) = Σ_{x∈F} p(x) = Σ_x 1_F(x) p(x)

Again: theory of sums ⇒ axioms of probability satisfied; obvious if Ω is finite.

For the fair coin, set p(1) = p(0) = 1/2. For a biased coin, common to use p(1) = p, p(0) = 1 − p for a parameter p ∈ (0, 1)


Notes:

• Probabilities are defined on sets — P(F); pdfs and pmfs are defined on points!!

• In the discrete case, one point (singleton) sets have possibly nonzero probability, e.g.,

P({0}) = p(0) = 1/2

A pmf gives the probability of something. A pdf is not the probability of anything; e.g., in our uniform spinner case

P({1/2}) = ∫_{1/2}^{1/2} 1 dx = 0.

If P is determined by a pdf, the probability of an individual point (e.g., {1/π}) = 0

Must integrate a pdf to get a probability


Coin Flip Model II: Random variable on spinner

Suppose (Ω, F, P) describes the uniform spinner: Ω = [0, 1), P described by pdf f(r) = 1 for r ∈ [0, 1)

Define a measurement q made on the outcome of the spin by

q(r) = 1_{[0.5,1)}(r) = 1 if r ∈ [0.5, 1), 0 if r ∈ [0, 0.5)   (1)

Simple quantizer

Simple example of a random variable = a real-valued function defined on the sample space, q(r), r ∈ Ω


Simple example of signal processing — the fair spinner produces a signal or input r, the quantizer operates on the signal to produce a value q(r). Think A/D conversion or a threshold decision rule based on a noisy observation

Original probability space + function ⇒ new probability space with binary sample space Ωq = {0, 1}, event space Fq = power set of {0, 1}. The new space inherits a probability measure, say Pq, from the old.

Output space is discrete ⇒ only need a pmf to characterize the probability measure

pq(0) = P({r : q(r) = 0}) = ∫_0^{0.5} 1 dr = 1/2

Often written informally as Pr(q = 0) or P(q = 0), shorthand for “probability random variable q takes on a value of 0”

Similarly pq(1) = P({r : q(r) = 1}) = ∫_{0.5}^{1} 1 dr = 1/2


Can use the pmf to find any probability involving q:

Pq(F) = Σ_{x∈F} pq(x)

Derived new probability measure from old + function q.

Probability measure Pq describing the output of random variable q is called the distribution of q

In general, Pq(F) = P({ω : q(ω) ∈ F}) = P(q^{-1}(F)) relates probability in the new space to probability in the old.

q^{-1}(F) = inverse image of the set F under the mapping q

Basic example of a derived distribution: (Ω, F, P) + q ⇒ (Ωq, Fq, Pq)

General solution: inverse image method Pq(F) = P(q^{-1}(F))
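To illustrate (a sketch, not from the notes), the derived distribution of the quantizer can be estimated by simulating the spinner and applying q:

```python
import random

def q(r):
    """The quantizer of Eq. (1): 1 on [0.5, 1), 0 on [0, 0.5)."""
    return 1 if r >= 0.5 else 0

trials = 100_000
spins = [random.random() for _ in range(trials)]   # uniform spinner outcomes
p_q0 = sum(q(r) == 0 for r in spins) / trials      # estimate of pq(0) = 1/2
p_q1 = sum(q(r) == 1 for r in spins) / trials      # estimate of pq(1) = 1/2
print(p_q0, p_q1)
```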


Notes:

• Multiple ways to arrive at same model — equivalent models

Directly in terms of P, indirectly in terms of probability space + rv

• Basic idea will be seen often:

probability space + function (random variable)

⇒ new probability space with probability measure = distribution of the random variable given by the inverse image formula

Will see many tools and tricks for doing actual calculus.

• Using Model II, can define more random variables on a common experiment, e.g., two coin flips or even an infinite number.

Model I only good for single coin flip.


Two coin flips

Let (Ω,F , P) be the fair spinner experiment.

Rename quantizer q as X (common to use upper case for random variables). Define another random variable Y on the same space:

X(ω) = 0 if ω ∈ [0, 0.5), 1 if ω ∈ [0.5, 1.0)

Y(ω) = 0 if ω ∈ [0, 0.25) ∪ [0.5, 0.75), 1 if ω ∈ [0.25, 0.5) ∪ [0.75, 1.0)

[Figure: the unit interval [0, 1) split into quarters; X(r) = 0 on the left half and X(r) = 1 on the right half, while Y(r) alternates 0, 1, 0, 1 across the quarters]


Single experiment ⇒ values of two random variables X and Y, or a single 2D random vector (X, Y).

Easy to compute pmfs for the individual random variables X and Y (marginal pmfs):

pX(k) = pY(k) = 1/2; k = 0, 1

Equivalent random variables, same pmf.

Now also have the joint pmf of the 2 rvs together: inverse image formula ⇒

pXY(x, y) = Pr(X = x, Y = y) = P({ω : (X, Y)(ω) = (x, y)})

Do the math: pXY(x, y) = 1/4; x = 0, 1; y = 0, 1. E.g.,

pXY(0, 1) = P({ω : ω ∈ [0, 0.5)} ∩ {ω : ω ∈ [0.25, 0.5) ∪ [0.75, 1.0)})

= P({ω : ω ∈ [0.25, 0.5)}) = 1/4
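A Monte Carlo sketch (my addition, not from the notes) confirming that each (x, y) pair occurs with probability about 1/4:

```python
import random
from collections import Counter

def X(w): return 0 if w < 0.5 else 1
def Y(w): return 0 if (0 <= w < 0.25) or (0.5 <= w < 0.75) else 1

trials = 100_000
counts = Counter((X(w), Y(w)) for w in (random.random() for _ in range(trials)))
for xy in sorted(counts):
    print(xy, counts[xy] / trials)   # each close to 0.25
```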


Notes:

• Here separately derived the joint pmf pXY and the marginal pmfs pX, pY. Alternatively could compute the marginals from the joint using total probability:

pX(x) = P({ω : X(ω) = x}) = Σ_{y∈ΩY} P({ω : X(ω) = x, Y(ω) = y}) = Σ_{y∈ΩY} pXY(x, y)

pY(y) = Σ_{x∈ΩX} pXY(x, y)

Joint and marginal pmfs are consistent — can get the marginals either from the original P or from the joint pmf pXY


• For this example pXY(x, y) = pX(x)pY(y), a product pmf

If two discrete random variables satisfy the above, they are said to be independent (will discuss more later)

Can define any number of random variables on a common probability space.

An infinite collection of random variables such as {Xn; n = 0, 1, 2, . . .} defined on a common probability space is called a random process

Extend the example: a Bernoulli random process


A Bernoulli Random Process: Fair Coin Flips

Again let Ω = [0, 1) and P be determined by uniform pdf.

Every number u ∈ [0, 1) has a binary representation

u = Σ_{n=0}^{∞} b_n(u) 2^{−n−1}

binary analog of the decimal representation — unique if we choose the representation b_n not having a finite number of 0s,

e.g., choose 1/2 → 100000···, not 01111···

E.g., if u = 3/4, then b0(u) = b1(u) = 1, bn(u) = 0 for all n ≥ 2.
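A short sketch (not in the original notes) of extracting the bits b_n(u) under the convention above; floating-point doubling naturally yields the terminating representation:

```python
def bits(u, n_bits=8):
    """Binary expansion b_0(u), b_1(u), ... of u in [0, 1):
    u = sum of b_n(u) * 2**(-n-1)."""
    out = []
    for _ in range(n_bits):
        u *= 2
        b = int(u)        # integer part is the next bit
        out.append(b)
        u -= b
    return out

print(bits(0.75))   # [1, 1, 0, 0, 0, 0, 0, 0]: b0 = b1 = 1
print(bits(0.5))    # [1, 0, 0, 0, 0, 0, 0, 0]: the 1000... representation
```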


Xn(u) = bn(u) defines a discrete-time random process {Xn; n = 0, 1, 2, . . .}: one experiment ⇒ an infinite number of rvs!

In the earlier 2D example, X = X0, Y = X1.

Similar computations to the 2D example (inverse image formula) ⇒

• marginal pmfs pXn(x) = 1/2, x = 0, 1 (all equivalent, fair coin flips)

• for any k = 1, 2, . . ., the joint pmf describing the random vector X^k = (X0, . . . , Xk−1) is

p_{X^k}(x^k) = Pr(X^k = x^k) = 2^{−k}; x^k ∈ {0, 1}^k

Hence

p_{X^k}(x^k) = Π_{n=0}^{k−1} pXn(xn); x^k ∈ {0, 1}^k,

a product pmf


A collection of discrete random variables is said to be (mutually) independent if the joint pmf = the product of the marginal pmfs.

A random vector is said to be independent identically distributed or iid (or i.i.d. or IID) if its component random variables are independent and identically distributed

A random process {Xn} is iid if any finite collection of random variables in the collection is iid

An iid process is also called a Bernoulli process (sometimes the name is reserved for binary processes). The classic example is an infinite sequence of fair coin flips.

End of the extended example of the uniform spinner and Bernoulli process; back to general material. Elaborate on Ω, F, P —


Probability spaces: sample space Ω

Common examples:

• {0, 1} Coin flip, value in a data stream at time t

• [0, 1) Fair spin, analog sensor output

• [0, ∞) Time to arrival of first packet, bus, customer

• Z_k ∆= {0, 1, . . . , k − 1} die roll, ascii character, card drawn from deck

• Z_+ = {0, 1, 2, . . .} Number of packets/buses/customers arriving in [0, T)

• R = (−∞, ∞) voltage at a sensor (without known bound)

• {0, 1}^k = all binary k-tuples (a product space), flip one coin k times, flip k coins once, sample k successive values in a data stream


• [0, 1)^k = k-dimensional unit cube, measurements from k identical sensors at different locations

• R^k = k-dimensional Euclidean space, voltage of an array of k sensors

• etc. — e.g., Z^k, sequence spaces such as all binary sequences, waveform spaces such as all continuous or differentiable waveforms


Probability spaces: event space F

Given Ω, the smallest event space is {∅, Ω}.

Biggest event space is the power set. (Too big to be useful for continuous spaces.)

Useful event space in R and R^k is the Borel field = smallest event space containing all of the intervals (k = 1) and rectangles (k > 1)

(Do not need for HW or exams; useful to have an idea when encountered in books and papers)

A measurable space (Ω, F) = sample space + event space of subsets of the sample space


Probability spaces: probability measure

When dealing with finite sample spaces, only need Axiom 3 to hold for finite collections of disjoint events.

Key point A set function P defined on an event space F of subsets of a sample space Ω is a probability measure if and only if (iff) it satisfies the axioms of probability


A trivial example

The simplest possible example is useless except for providing a trivial example.

Ω is any abstract space, F = {Ω, ∅}

P defined by P(Ω) = 1, P(∅) = 0

Axioms of probability measure satisfied.


Simple example: biased coin

Simplest nontrivial example

Ω = {0, 1}

F = {{0}, {1}, Ω, ∅}; axioms of an event space satisfied

P defined by

P(F) = 1 − p if F = {0}, p if F = {1}, 0 if F = ∅, 1 if F = Ω,

where p ∈ (0, 1) is a fixed parameter (p = 0, 1 is a variation on the trivial probability space)

Axioms can be verified in a straightforward manner.


In general cannot do it this way and list the probabilities of every event: too many events

Instead provide a formula for computing probabilities of events as integrals (of a pdf) or sums (of a pmf)

Will see many common examples, most have names (uniform, binomial, geometric, Poisson, Gaussian, Bernoulli, exponential, Laplacian, . . . )

Before more examples, derive several fundamental properties of probability — several elementary and one advanced


Elementary properties of probability

(Ω,F , P)

(a) For all events F, P(F^c) = 1 − P(F)

(b) For all events F, P(F) ≤ 1

(c) Let ∅ be the null or empty set; then P(∅) = 0.

(d) Total Probability If events Fi; i = 1, 2, . . . form a partition of Ω, i.e., if Fi ∩ Fk = ∅ when i ≠ k and ∪_i Fi = Ω, then for any event G

P(G) = Σ_i P(G ∩ Fi).

(e) If G ⊂ F for events G, F, then P(F − G) = P(F) − P(G)

(F − G ∆= F ∩ G^c, also written F\G)


Proof

(a) F ∪ F^c = Ω ⇒ P(F ∪ F^c) = 1 (Axiom 2). F ∩ F^c = ∅ ⇒ 1 = P(F ∪ F^c) = P(F) + P(F^c) (Axiom 3).

(b) P(F) = 1 − P(F^c) ≤ 1 (Axiom 1 and (a) above).

(c) By Axiom 2 and (a) above, P(Ω^c) = P(∅) = 1 − P(Ω) = 0.

Note: The empty set ∅ has probability 0, but P(F) = 0 does not mean F = ∅. E.g., uniform spinner, F = {1/n : n = 1, 2, 3, . . .}

(d) Using set theory and Axiom 3:

P(G) = P(G ∩ Ω) = P(G ∩ (∪_i Fi)) = P(∪_i (G ∩ Fi)) [disjoint] = Σ_i P(G ∩ Fi).


(e) F − G = F ∩ G^c and G are disjoint, so Axiom 3 ⇒

P((F − G) ∪ G) = P(F − G) + P(G).

Since G ⊂ F, G = F ∩ G ⇒

(F − G) ∪ G = (F ∩ G^c) ∪ (F ∩ G) = F ∩ (G ∪ G^c) = F ∩ Ω = F.

Thus P(F) = P(F − G) + P(G).

Note (e) ⇒ that if G ⊂ F, then P(G) ≤ P(F)


An advanced property of probability: Continuity

A sequence of sets Fn, n = 0, 1, 2, . . . is increasing if Fn−1 ⊂ Fn for all n (also called nested), decreasing if Fn ⊂ Fn−1

[Figure: (a) increasing nested sets F1 ⊂ F2 ⊂ F3 ⊂ F4; (b) decreasing nested sets F4 ⊂ F3 ⊂ F2 ⊂ F1]


E.g., increasing: Fn = [0, n), Fn = [1, 2 − 1/n), Fn = (−n, a)

Decreasing: Fn = [1, 3 + 1/n), Fn = (1 − 1/n, 1 + 1/n)

Natural definition of the limit of increasing sets:

lim_{n→∞} Fn ∆= ∪_{n=1}^{∞} Fn

E.g.,

lim_{n→∞} [0, n) = [0, ∞)

lim_{n→∞} [1, 2 − 1/n) = [1, 2)

lim_{n→∞} (−n, a) = (−∞, a)


Natural definition of the limit of decreasing sets:

lim_{n→∞} Fn ∆= ∩_{n=1}^{∞} Fn

E.g.,

lim_{n→∞} [1, 3 + 1/n) = [1, 3]

lim_{n→∞} (1 − 1/n, 1 + 1/n) = {1}

There is no natural definition of a limit of an arbitrary sequence of sets, only for increasing or decreasing.


Continuity of probability If {Fn} is an increasing or decreasing sequence of events, then

P(lim_{n→∞} Fn) = lim_{n→∞} P(Fn)

Prove for increasing sets Fn; n = 0, 1, 2, . . ..

Recall the set theory difference A − B = A ∩ B^c = points in A that are not in B.

Define G0 = F0, Gn = Fn − Fn−1 for n = 1, 2, . . .

The Gn are disjoint,

∪_{k=0}^{n} Fk = Fn = ∪_{k=0}^{n} Gk,   ∪_{k=0}^{∞} Fk = ∪_{k=0}^{∞} Gk


Then

P(lim_{n→∞} Fn) = P(∪_{k=0}^{∞} Fk) = P(∪_{k=0}^{∞} Gk)

= Σ_{k=0}^{∞} P(Gk)   [Axiom 3]

= lim_{n→∞} Σ_{k=0}^{n} P(Gk)   [definition of infinite sum]

Gn = Fn − Fn−1 and Fn−1 ⊂ Fn ⇒ P(Gn) = P(Fn) − P(Fn−1)

⇒ Σ_{k=0}^{n} P(Gk) = P(F0) + Σ_{k=1}^{n} (P(Fk) − P(Fk−1)) = P(Fn).

“Telescoping sum,” all terms cancel but the last one:


P(Fn) = P(Fn) − P(Fn−1)
+ P(Fn−1) − P(Fn−2)
+ P(Fn−2) − P(Fn−3)
...
+ P(F1) − P(F0)
+ P(F0).

⇒ If {Fn} is a sequence of increasing events, then

P(lim_{n→∞} Fn) = lim_{n→∞} P(Fn)

E.g., P((−∞, a]) = lim_{n→∞} P((−n, a]).


Similar proof for decreasing events. E.g.,

P({a}) = lim_{n→∞} P((a − 1/n, a + 1/n))

If P is described by a pdf, then the probability of a point is 0.

Can show: Axioms 1, 2, and 3 for finite collections of disjoint events + continuity of probability ⇔ Axioms 1–3

Kolmogorov's countable additivity axiom ensures good limiting behavior.

Back to more concrete issues.


Discrete probability spaces

Introduce several common examples.

Recall basic construction:

Ω = a discrete (countable) set, F = power set of Ω

Given a probability measure P on (Ω, F) ⇒ pmf

p(ω) ∆= P({ω}); ω ∈ Ω

Conversely, given a pmf p (nonnegative, sums to 1) ⇒ P via

P(F) = Σ_{ω∈F} p(ω).

Calculus (properties of sums) implies axioms of probability satisfied.


Common examples of pmfs

Binary (Bernoulli) pmf. Ω = {0, 1}; p(0) = 1 − p, p(1) = p, where p is a parameter in (0, 1).

A uniform pmf. Ω = Z_n = {0, 1, . . . , n − 1} and p(k) = 1/n; k ∈ Z_n.

Binomial pmf. Ω = Z_{n+1} = {0, 1, . . . , n} and

p(k) = C(n, k) p^k (1 − p)^{n−k}; k ∈ Z_{n+1},

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient (read as “n choose k”).

Geometric pmf. Ω = {1, 2, 3, . . .} and p(k) = (1 − p)^{k−1} p; k = 1, 2, . . ., where p ∈ (0, 1) is a parameter.


The Poisson pmf. Ω = Z_+ = {0, 1, 2, . . .} and p(k) = λ^k e^{−λ}/k!, where λ is a parameter in (0, ∞). (Keep in mind that 0! ∆= 1.)

These are all obviously nonnegative, hence to verify they are pmfs need only show Σ_{ω∈Ω} p(ω) = 1

Obvious for Bernoulli and uniform. For the binomial, use the binomial theorem; for the geometric, use the geometric progression formula; for the Poisson, use the Taylor series expansion for e^x (will do details later, try it on your own)

These are exercises in calculus of sums.
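A quick numeric check of the normalizations (a sketch, not from the notes; the truncation points for the infinite sums are arbitrary choices):

```python
from math import comb, exp, factorial

n, p, lam = 10, 0.3, 2.5

print(sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)))  # binomial: 1.0
print(sum((1 - p)**(k - 1) * p for k in range(1, 500)))                 # geometric: ~1.0
print(sum(lam**k * exp(-lam) / factorial(k) for k in range(100)))       # Poisson: ~1.0
```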

Before continuing to continuous examples (pdfs), consider more general weighted sums called expectations.

(Treated in depth later, useful to introduce basic idea early.)


Discrete expectations

Given pmf p on discrete sample space Ω

Suppose that g is a real-valued function defined on Ω: g : Ω → R

Recall: A real-valued function defined on a probability space is called a random variable¹

Define the expectation of g (with respect to p) by

E(g) = Σ_{ω∈Ω} g(ω) p(ω).

¹There is a required technical condition we will see later, but it is automatic for discrete probability spaces with the power set as event space.


Example: If g(ω) = 1_F(ω), then

P(F) = E(1_F),   (2)

⇒ probability can be viewed as a special case of expectation

Some other important examples:

Suppose that Ω ⊂ R, e.g., R, [0, 1), Z or Z_n. Fix a pmf p.

If g(ω) = ω, mean or first moment

m = Σ_ω ω p(ω)

kth moment

m^{(k)} = Σ_ω ω^k p(ω),

e.g., m = m^{(1)}.


The second moment is often of interest:

m^{(2)} = Σ_ω ω² p(ω).

Centralized moments: Σ_ω (ω − m)^k p(ω). Most important is the variance

σ² = Σ_ω (ω − m)² p(ω).

Note:

σ² = Σ_ω (ω − m)² p(ω) = Σ_ω (ω² − 2ωm + m²) p(ω)

= Σ_ω ω² p(ω) − 2m Σ_ω ω p(ω) + m² Σ_ω p(ω)

= m^{(2)} − 2m² + m² = m^{(2)} − m²


variance = second moment − mean²
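A small check of the identity σ² = m^{(2)} − m² on a made-up pmf (a sketch, not from the notes):

```python
# Hypothetical pmf on {0, 1, 2, 3}
pmf = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

m  = sum(w * p for w, p in pmf.items())                 # first moment (mean)
m2 = sum(w**2 * p for w, p in pmf.items())              # second moment
var_direct = sum((w - m)**2 * p for w, p in pmf.items())

print(var_direct, m2 - m**2)   # identical: variance = m(2) - m^2
```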

Will see a similar definition for the continuous case with a pdf, with integrals instead of sums

For now moments can be viewed as simply attributes of pmfs and pdfs. For many pmfs and pdfs, knowing certain moments completely describes the pmf/pdf.


Computational examples

Discrete uniform pmf on Z_n

P(F) = (1/n) Σ_ω 1_F(ω) = #(F)/n,

#(F) = number of elements in F.

mean:

m = (1/n) Σ_{k=0}^{n−1} k = (n − 1)/2

second moment:

m^{(2)} = (1/n) Σ_{k=0}^{n−1} k² = (n − 1)(2n − 1)/6


Binomial pmf To show it is a valid pmf, use the binomial theorem:

(a + b)^n = Σ_{k=0}^{n} C(n, k) a^k b^{n−k}

Set a = p, b = 1 − p:

Σ_{k=0}^{n} p(k) = Σ_{k=0}^{n} C(n, k) p^k (1 − p)^{n−k} = (p + 1 − p)^n = 1.

Finding the mean is messy, but good practice. Later find shortcuts.

m = Σ_{k=0}^{n} k C(n, k) p^k (1 − p)^{n−k}

= Σ_{k=1}^{n} [n!/((n − k)!(k − 1)!)] p^k (1 − p)^{n−k}


Trick: resembles terms in the binomial theorem, massage it into that form.

Change variables l = k − 1:

m = Σ_{l=0}^{n−1} [n!/((n − l − 1)! l!)] p^{l+1} (1 − p)^{n−l−1}

= np Σ_{l=0}^{n−1} [(n − 1)!/((n − 1 − l)! l!)] p^l (1 − p)^{n−1−l}

= np (p + 1 − p)^{n−1} = np.

Postpone second moment until have a better method.
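A numerical confirmation that the binomial mean is np (a sketch, not from the notes):

```python
from math import comb

n, p = 12, 0.35
mean = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
print(mean, n * p)   # both 4.2 (up to floating point)
```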


Geometric pmf Use the geometric progression: if |a| < 1,

Σ_{k=0}^{∞} a^k = 1/(1 − a),

⇒ the geometric pmf indeed sums to 1.

mean:

m = Σ_{k=1}^{∞} k p(k) = Σ_{k=1}^{∞} k p (1 − p)^{k−1}.

How to evaluate? Can look it up, or use a trick: differentiate the geometric progression formula to obtain

d/da Σ_{k=0}^{∞} a^k = Σ_{k=0}^{∞} k a^{k−1} = d/da [1/(1 − a)] = 1/(1 − a)²


Set a = 1 − p and get

m = 1/p

for the geometric pmf

A similar idea works for the second moment:

m^{(2)} = Σ_{k=1}^{∞} k² p (1 − p)^{k−1} = p (2/p³ − 1/p²) = (2 − p)/p²

hence

σ² = m^{(2)} − m² = (1 − p)/p².


Example of probability calculus using the geometric pmf: Find the probabilities of the events F = {k : k ≥ 10} and G = {k : k is odd}.

Note that F = {10, 11, 12, . . .} and G = {1, 3, 5, 7, . . .}

Solutions:

P(F) = Σ_{k∈F} p(k) = Σ_{k=10}^{∞} p (1 − p)^{k−1}

= [p/(1 − p)] Σ_{k=10}^{∞} (1 − p)^k = [p/(1 − p)] (1 − p)^{10} Σ_{k=10}^{∞} (1 − p)^{k−10}

= p (1 − p)^9 Σ_{k=0}^{∞} (1 − p)^k = (1 − p)^9,


P(G) = Σ_{k∈G} p(k) = Σ_{k=1,3,...} p (1 − p)^{k−1}

= p Σ_{k=0,2,4,...} (1 − p)^k = p Σ_{k=0}^{∞} [(1 − p)²]^k

= p/(1 − (1 − p)²) = 1/(2 − p).
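Both answers can be sanity-checked by direct summation (a sketch; the 10,000-term truncation is an arbitrary choice):

```python
p = 0.4
pmf = lambda k: (1 - p)**(k - 1) * p   # geometric pmf on {1, 2, ...}

P_F = sum(pmf(k) for k in range(10, 10_000))      # k >= 10
P_G = sum(pmf(k) for k in range(1, 10_000, 2))    # k odd
print(P_F, (1 - p)**9)     # both ~0.01008
print(P_G, 1 / (2 - p))    # both 0.625
```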

Poisson pmf

Show it sums to 1:

Σ_{k=0}^{∞} p(k) = Σ_{k=0}^{∞} λ^k e^{−λ}/k! = e^{−λ} Σ_{k=0}^{∞} λ^k/k!   [Taylor series for e^λ]   = 1


mean:

Σ_{k=0}^{∞} k p(k) = Σ_{k=1}^{∞} k λ^k e^{−λ}/k! = e^{−λ} Σ_{k=1}^{∞} λ^k/(k − 1)!.

Change variables l = k − 1:

Σ_{k=0}^{∞} k p(k) = λ e^{−λ} Σ_{l=0}^{∞} λ^l/l! = λ

m^{(2)} = Σ_{k=1}^{∞} k² λ^k e^{−λ}/k! = Σ_{k=2}^{∞} k(k − 1) λ^k e^{−λ}/k! + m

= Σ_{k=2}^{∞} λ^k e^{−λ}/(k − 2)! + m


Change variables l = k − 2:

m^{(2)} = λ² Σ_{l=0}^{∞} λ^l e^{−λ}/l! + m = λ² + λ

so σ² = λ
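A numeric check of the Poisson mean and variance (a sketch, not from the notes; the 200-term truncation is arbitrary):

```python
from math import exp, factorial

lam = 3.7
pmf = lambda k: lam**k * exp(-lam) / factorial(k)

m  = sum(k * pmf(k) for k in range(200))
m2 = sum(k**2 * pmf(k) for k in range(200))
print(m, m2 - m**2)   # both ~3.7 = lambda
```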

Moments crop up in many signal processing applications (and statistical analysis of many kinds), so it is useful to have a table of moment formulas for important distributions handy and an appreciation of the underlying calculus.


Multidimensional pmf’s

Vector spaces with discrete coordinates like Z_n^k are also discrete.

The same ideas of describing probabilities by pmfs work.

Suppose A is a discrete space

A^k = all vectors x = (x0, . . . , x_{k−1}) with xi ∈ A, i = 0, 1, . . . , k − 1 is also discrete.

If we have a pmf p on A^k, i.e., p(x) ≥ 0, Σ_{x∈A^k} p(x) = 1,

then P(F) = Σ_{x∈F} p(x) is a probability measure


Common example of a pmf on vectors is a product pmf:

Product pmf Suppose pi; i = 0, 1, . . . , k − 1 are one-dimensional pmfs on A (for each i, pi(x) ≥ 0 for all x ∈ A and Σ_{x∈A} pi(x) = 1; each nonnegative, sums to 1)

Define the product k-dimensional pmf p on A^k by

p(x) = p(x0, x1, . . . , x_{k−1}) = Π_{i=0}^{k−1} pi(xi).

Easily seen to be a pmf (nonnegative, sums to 1)


Example: product of Bernoulli pmfs:

pi(x) = p^x (1 − p)^{1−x}; x = 0, 1, all i

⇒ product pmf

p(x0, x1, . . . , x_{k−1}) = Π_{i=0}^{k−1} p^{xi} (1 − p)^{1−xi} = p^{w(x0,x1,...,x_{k−1})} (1 − p)^{k−w(x0,x1,...,x_{k−1})},

where w(x0, x1, . . . , x_{k−1}) = number of ones in the binary k-tuple x0, x1, . . . , x_{k−1}, the Hamming weight of the vector.

Will see that when the probability of a bunch of things factors into a bunch of probabilities in this way, things simplify significantly
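A small sketch (not from the notes) of the Hamming-weight form of the product Bernoulli pmf:

```python
from itertools import product

p, k = 0.3, 4

def pmf(x):
    """Product Bernoulli pmf via Hamming weight w = sum(x)."""
    w = sum(x)
    return p**w * (1 - p)**(k - w)

total = sum(pmf(x) for x in product((0, 1), repeat=k))
print(total)   # 1.0: the product pmf sums to 1 over {0,1}^k
```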


Continuous probability spaces

Recall basic construction using pdf on the real line R:

(Ω,F ) = (R,B(R))

f is a pdf on R if f(r) ≥ 0 for all r ∈ Ω and

∫_Ω f(r) dr = 1

Define the set function P by

P(F) = ∫_F f(r) dr = ∫ 1_F(r) f(r) dr, F ∈ B(R)

Unlike a pmf, a pdf is not a probability of something — it is a density of probability. To get a probability, must integrate the pdf over a set.


Common approximation: Suppose Ω = R and f a pdf. Consider the event F = [x, x + ∆x), ∆x small.

Mean value theorem of calculus ⇒

P([x, x + ∆x)) = ∫_x^{x+∆x} f(α) dα ≈ f(x) ∆x

Theoretical issue Does P(F) = ∫_F f(r) dr = ∫ 1_F(r) f(r) dr, F ∈ B(R), in fact define a probability measure? That is, does the set function P satisfy Axioms 1–3?

Answer: Yes, with proper care to use the correct event space and notion of integration. As usual, rarely a problem in practice. Need details in research.

Usually continuous probability just substitutes integration for summation, but pmfs and pdfs are inherently different.


Common examples of pdfs

pdfs are assumed to be 0 outside the specified domain and are given in terms of real-valued parameters b, a, λ > 0, m, and σ > 0.

Uniform pdf. Given b > a, f(r) = 1/(b − a) for r ∈ [a, b].

Exponential pdf. f(r) = λ e^{−λr}; r ≥ 0.

Doubly exponential (or Laplacian) pdf. f(r) = (λ/2) e^{−λ|r|}; r ∈ R.

Gaussian (or Normal) pdf. f(r) = (2πσ²)^{−1/2} exp(−(r − m)²/2σ²); r ∈ R. Commonly denoted by N(m, σ²).


Continuous expectation

Analogous to discrete expectation.

Have Ω, pdf f, function g : Ω → R. Define the expectation of g (with respect to f):

E(g) = ∫ g(r) f(r) dr.

As in the discrete case, P(F) = E(1_F).

Important examples if Ω ⊂ R:

mean or first moment m = ∫ r f(r) dr,

kth moment m^{(k)} = ∫ r^k f(r) dr

second moment m^{(2)} = ∫ r² f(r) dr


centralized moments ∫ (r − m)^k f(r) dr

including the variance σ² = ∫ (r − m)² f(r) dr

When complex-valued functions are allowed, often the kth absolute moment is used: m_a^{(k)} = ∫ |r|^k f(r) dr.

As in the discrete case,

σ² = m^{(2)} − m²


Computational examples

Continuous uniform pdf on [a, b)

Consider a = 0 and b = 1

mean m = ∫_0^1 r dr = r²/2 |_0^1 = 1/2

second moment m^{(2)} = ∫_0^1 r² dr = r³/3 |_0^1 = 1/3,

variance σ² = 1/3 − (1/2)² = 1/12


Exponential pdf

From tables, or continuous analogs of tricks for geometric.

The validation of the pdf (integrates to 1) and the mean, second moment, and variance of the exponential pdf can be found from integral tables, or by the integral analogs of the corresponding computations for the geometric pmf, or from integration by parts:

∫_0^∞ λ e^{−λr} dr = 1

m = ∫_0^∞ r λ e^{−λr} dr = 1/λ

m^{(2)} = ∫_0^∞ r² λ e^{−λr} dr = 2/λ²

σ² = 2/λ² − 1/λ² = 1/λ²


Laplacian pdf = mixture of an exponential and its reverse, left as an exercise.

Gaussian pdf Moment computation is more trouble than it's worth; will find an easier method later.

For now just state:

∫_{−∞}^{∞} (2πσ²)^{−1/2} e^{−(x−m)²/2σ²} dx = 1

∫_{−∞}^{∞} (2πσ²)^{−1/2} x e^{−(x−m)²/2σ²} dx = m

∫_{−∞}^{∞} (2πσ²)^{−1/2} (x − m)² e^{−(x−m)²/2σ²} dx = σ²

i.e., mean = m, variance = σ², as the notation suggests.


Computing probabilities sometimes easy, sometimes not.

With a uniform pdf on [a, b], for a ≤ c < d ≤ b

P([c, d]) = (d − c)/(b − a)

For the exponential pdf, [c, d], 0 ≤ c < d:

P([c, d]) = ∫_c^d λ e^{−λx} dx = e^{−λc} − e^{−λd}.

No such nice form for the Gaussian, but well tabulated in terms of

Φ(α) = (1/√2π) ∫_{−∞}^{α} e^{−u²/2} du

Q(α) = (1/√2π) ∫_{α}^{∞} e^{−u²/2} du = 1 − Φ(α)

erf(α) = (2/√π) ∫_0^{α} e^{−u²} du


Note: Φ(α) = P((−∞, α]) = P({x : x ≤ α}) for N(0, 1)

Q function = complementary function, Q(α) = 1 − Φ(α).

Common in communications systems analysis:

Q(α) = (1/2)[1 − erf(α/√2)] = 1 − Φ(α).

Change variables to find probabilities with tables (or numerically from these functions)

For example, find P((−∞, α)) for N(m, σ²):


Change variables: u = (x − m)/σ ⇒

P({x : x ≤ α}) = ∫_{−∞}^{α} (2πσ²)^{−1/2} e^{−(x−m)²/2σ²} dx

= ∫_{−∞}^{(α−m)/σ} (2π)^{−1/2} e^{−u²/2} du

= Φ((α − m)/σ) = 1 − Q((α − m)/σ).

P((a, b]) = P((−∞, b]) − P((−∞, a]) = Φ((b − m)/σ) − Φ((a − m)/σ).

Symmetry of the Gaussian density ⇒

1 − Φ(a) = Φ(−a).
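A sketch (not from the notes) computing Gaussian interval probabilities via this change of variables, using math.erf and the relation Φ(α) = (1 + erf(α/√2))/2:

```python
from math import erf, sqrt

def Phi(a):
    """Standard normal CDF via erf."""
    return 0.5 * (1 + erf(a / sqrt(2)))

def gaussian_interval(a, b, m=0.0, sigma=1.0):
    """P((a, b]) for N(m, sigma^2) = Phi((b-m)/sigma) - Phi((a-m)/sigma)."""
    return Phi((b - m) / sigma) - Phi((a - m) / sigma)

print(gaussian_interval(-1, 1))                  # ~0.6827 (one-sigma rule)
print(gaussian_interval(0, 2, m=1, sigma=0.5))   # ~0.9545 for X ~ N(1, 0.25)
```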


Multidimensional pdf example

Ω = R²

Event space: the two-dimensional Borel field B(R)² (to give it a name; the main point is there is a standard, useful definition which you can learn about in advanced probability)

Probability measure described by a 2D pdf:

f(x, y) = λµ e^{−λx−µy} for x ∈ [0, ∞), y ∈ [0, ∞); 0 otherwise.

What is the probability of the event F = {(x, y) : x < y}?

Note: Analogous to a product pmf, this is a product pdf


Interpretation: Sample points (x, y) = first arrival times of two distinct types of particle (type A and type B) at a sensor (or packets at a router or buses at a bus stop) after time 0

F = event that a particle of type A arrives at the sensor before one of type B

As with 1D, probability = integral of the pdf over the event:

P(F) = ∫∫_{(x,y)∈F} f(x, y) dx dy = ∫∫_{x≥0, y≥0, x<y} λµ e^{−λx−µy} dx dy.


Now it is just calculus:

P(F) = ∫∫_{x≥0, y≥0, x<y} λµ e^{−λx−µy} dx dy

= λµ ∫_0^∞ dy ∫_0^y dx e^{−λx} e^{−µy}

= λµ ∫_0^∞ dy e^{−µy} ∫_0^y dx e^{−λx}

= λµ ∫_0^∞ dy e^{−µy} (1/λ)(1 − e^{−λy})

= µ ∫_0^∞ dy e^{−µy} − µ ∫_0^∞ dy e^{−(µ+λ)y}

= 1 − µ/(µ + λ) = λ/(µ + λ).
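A Monte Carlo sketch (not from the notes) checking P(F) = λ/(µ + λ) by sampling the two independent exponential arrival times:

```python
import random

lam, mu = 2.0, 0.5
trials = 200_000

# random.expovariate(rate) samples an exponential with the given rate
hits = sum(random.expovariate(lam) < random.expovariate(mu) for _ in range(trials))
print(hits / trials, lam / (lam + mu))   # both ~0.8
```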


Mass functions as densities

Can use continuous ideas in discrete problems with Dirac deltas, but it is usually clumsy and adds complication. Describe briefly.

Dirac delta defined implicitly by its behavior inside an integral: if g(r); r ∈ R is continuous at a ∈ R, then

∫ g(r) δ(r − a) dr = g(a)

(no ordinary function does this; Dirac deltas are generalized functions)

Given a pmf p defined on Ω ⊂ R, can define a pdf f by

f(r) = Σ_ω p(ω) δ(r − ω).


Then f(r) ≥ 0 and

∫ f(r) dr = ∫ [Σ_ω p(ω) δ(r − ω)] dr = Σ_ω p(ω) ∫ δ(r − ω) dr = Σ_ω p(ω) = 1

∫ 1_F(r) f(r) dr = ∫ 1_F(r) [Σ_ω p(ω) δ(r − ω)] dr = Σ_ω p(ω) ∫ 1_F(r) δ(r − ω) dr

= Σ_ω p(ω) 1_F(ω) = P(F).

But sum of pmfs is simpler, deltas usually make things harder.


Multidimensional pdfs: General Case

Multidimensional integrals allow the construction of probabilities on R^k

Given the measurable space (R^k, B(R)^k), a real-valued function f on R^k is a pdf if

f(x) ≥ 0; all x = (x0, x1, . . . , x_{k−1}) ∈ R^k,

∫_{R^k} f(x) dx = 1

Define a set function P by

P(F) = ∫_F f(x) dx, all F ∈ B(R)^k,

where the vector integral is shorthand for the k-dimensional integral

P(F) = ∫_{(x0,x1,...,x_{k−1})∈F} f(x0, x1, . . . , x_{k−1}) dx0 dx1 . . . dx_{k−1}.

As with multidimensional pmf’s, a pdf is not itself the probability ofanything.

As with the 1D case, subject to appropriate assumptions (event space, integral), this does define a probability measure, i.e., it satisfies the axioms.

Two common and very important examples of k-dimensional pdfs:

Product pdf. A product pdf has the form

f(x) = f(x0, x1, . . . , x_{k−1}) = Π_{i=0}^{k−1} fi(xi)

where fi; i = 0, 1, . . . , k − 1 are one-dimensional pdfs on the real line.

Most important special case: all fi(r) are the same (identically distributed)

So all pdfs on R ⇒ product pdfs on R^k


Multidimensional Gaussian pdf

m = (m0, m1, . . . , m_{k−1})^t, a column vector

Λ = a k by k square matrix with entries λ_{i,j}; i = 0, 1, . . . , k − 1; j = 0, 1, . . . , k − 1. Assume

1. Λ is symmetric (Λ^t = Λ or, equivalently, λ_{i,j} = λ_{j,i}, all i, j)

2. Λ is positive definite; i.e., for any nonzero vector y ∈ R^k

y^t Λ y = Σ_{i=0}^{k−1} Σ_{j=0}^{k−1} yi λ_{i,j} yj > 0

A multidimensional pdf is Gaussian if it has the form

f(x) = (2π)^{−k/2} (det Λ)^{−1/2} e^{−(x−m)^t Λ^{−1} (x−m)/2}; x ∈ R^k,

where det Λ is the determinant of the matrix Λ

Notes:

• Λ positive definite ⇒ det Λ > 0 and Λ^{−1} exists

• Hard to show it integrates to 1

• If Λ is a diagonal matrix, this becomes a product pdf

• There is a more general definition of a Gaussian random vector that only requires Λ to be nonnegative definite

• Gaussian important because it crops up often as a good approximation (central limit theorem) and is sometimes a “worst case”


Mixtures

Example of constructing new probability measures from old

{Pi, i = 1, 2, . . .} is a collection of probability measures on a common measurable space (Ω, F)

ai ≥ 0, i = 1, 2, . . ., with Σ_{i=1}^{∞} ai = 1. Then

P(F) = Σ_{i=1}^{∞} ai Pi(F)

is also a probability measure on (Ω, F).

Abbreviation: P = Σ_{i=1}^{∞} ai Pi


Useful for constructing probability measures mixing continuous and discrete:

Ω = R, f a pdf and p a pmf, λ ∈ (0, 1)

Then the mixture

P(F) = λ Σ_{x∈F} p(x) + (1 − λ) ∫_F f(x) dx

E.g., experiment: First spin a fair wheel. If the pointer lands in [0, λ), then roll a die described by p. If the pointer lands in [λ, 1), then choose ω using a Gaussian.

(Almost) the most general model. Find expectations in such cases in the natural way:


Given a function g,

E(g) = λ Σ_{x∈Ω} g(x) p(x) + (1 − λ) ∫_Ω g(x) f(x) dx.

Works for scalar and vector sample spaces.

Another use of mixtures: First choose parameters at random (e.g., the bias of a coin), then use the probability measure described by the parameter (e.g., Bernoulli(p))

Will see examples.


Event Independence

Given (Ω, F, P), two events F and G are independent if P(F ∩ G) = P(F)P(G)

A collection of events {Fi; i = 0, 1, . . . , k − 1} is independent or mutually independent if for any distinct subcollection F_{l_i}; i = 0, 1, . . . , m − 1, l_i < k,

P(∩_{i=0}^{m−1} F_{l_i}) = Π_{i=0}^{m−1} P(F_{l_i}).

Note: Requirement on all subcollections! Not enough to just say P(∩_{i=0}^{k−1} Fi) = Π_{i=0}^{k−1} P(Fi)


Example of what can go wrong: Suppose that

P(F) = P(G) = P(H) = 1/3

P(F ∩ G ∩ H) = 1/27 = P(F)P(G)P(H)

P(F ∩ G) = P(G ∩ H) = P(F ∩ H) = 1/27 ≠ P(F)P(G).

Zero probability on the overlap F ∩ G except where it also overlaps H, i.e., P(F ∩ G ∩ H^c) = 0. Thus P(F ∩ G ∩ H) = P(F)P(G)P(H) = 1/27, but P(F ∩ G) = 1/27 ≠ P(F)P(G) = 1/9.

So P(F ∩ G ∩ H) = P(F)P(G)P(H) for the three events F, G, and H, yet it is not true that P(F ∩ G) = P(F)P(G).

So here it is not true that the three events are mutually independent!


Probabilistic independence has an intuitive interpretation in terms of the next topic — elementary conditional probability.

But definition does not require conditional probability.


Elementary conditional probability

Intuitively, independence of two events means that the occurrence of one event should not affect the probability of occurrence of the other.

E.g., the outcome of the roll of one die does not change the probability of the next roll (or of rolling another die)

Need definition of probability of one event conditioned on another.

Motivation: Given (Ω, F, P), suppose we know that event G has occurred. What is a reasonable definition for the probability that F will occur given (conditioned on) G? P(F | G)


For a fixed G, P(F | G) should be defined for all events F; it should be a probability measure on (Ω, F).

What properties should it have? Clearly P(G^c | G) = 0 & P(G | G) = 1.

Since P(· | G) must be a probability measure ⇒

P(F | G) = P(F ∩ (G ∪ G^c) | G) = P(F ∩ G | G) + P(F ∩ G^c | G),

and P(F ∩ G^c | G) = 0, so

P(F | G) = P(F ∩ G | G)   (1)


Next, no reason to suspect that the relative probabilities within G should change because of the knowledge that G occurred.

E.g., if F ⊂ G is twice as probable as an event H ⊂ G with respect to P, then the same should be true with respect to P(· | G).

For arbitrary events F and H, F ∩ G, H ∩ G ⊂ G ⇒

P(F ∩ G | G)/P(H ∩ G | G) = P(F ∩ G)/P(H ∩ G).

Set H = Ω + (1) ⇒

P(F | G) = P(F ∩ G | G) = P(F ∩ G)/P(G)   (2)

Motivates using (2) as definition of conditional probability


Only works if P(G) > 0; called elementary conditional probability

(will later see nonelementary conditional probability; it is more complicated)

Easy to prove that since P is a probability measure, so is P(· | G)

Independence redux

Suppose that F and G are independent events and that P(G) > 0; then

P(F | G) = P(F ∩ G)/P(G) = P(F);

Occurrence of G does not affect the probability of F: a posteriori probability P(F | G) = a priori probability P(F).

Not used as the definition of independence since less general.


Conditional probability provides a means of constructing new probability spaces from old ones

Example: Given discrete (Ω, F, P) described by a pmf p, and an event A with P(A) > 0. Define the pmf pA:

pA(ω) = p(ω)/P(A) = P({ω} | A) for ω ∈ A; 0 for ω ∉ A

⇒ (Ω, F, PA), where

PA(F) = Σ_{ω∈F} pA(ω) = P(F | A).

pA is a conditional pmf


E.g., p = geometric pmf and A = {ω : ω ≥ K} = {K, K + 1, . . .}.

pA(k) = (1 − p)^{k−1} p / Σ_{l=K}^{∞} (1 − p)^{l−1} p = (1 − p)^{k−1} p / (1 − p)^{K−1}

= (1 − p)^{k−K} p; k = K, K + 1, . . .,

a geometric pmf shifted to begin at k = K


Example: Continuous (Ω, F, P), pdf f, P(A) > 0. Define the conditional pdf fA:

fA(ω) = f(ω)/P(A) for ω ∈ A; 0 for ω ∉ A

and the conditional probability

PA(F) = ∫_{ω∈F} fA(ω) dω = P(F | A).

E.g., f = exponential pdf, A = {r : r ≥ c}:

fA(x) = λ e^{−λx} / ∫_c^∞ λ e^{−λy} dy = λ e^{−λx} / e^{−λc} = λ e^{−λ(x−c)}; x ≥ c,

Like the geometric, conditioned on x ≥ c the pdf has the same form, just shifted.


Interpretation: The exponential models bus arrival times. The pdf for the next arrival given you have waited an hour is the same as when you got to the bus stop.
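A simulation sketch of this memorylessness (not from the notes; the cutoff c = 1.0 and the rate are arbitrary choices):

```python
import random

lam, c, trials = 1.5, 1.0, 200_000

# Arrival times, and remaining waits after already waiting past time c
arrivals = [random.expovariate(lam) for _ in range(trials)]
conditioned = [t - c for t in arrivals if t >= c]

mean_all = sum(arrivals) / len(arrivals)
mean_cond = sum(conditioned) / len(conditioned)
print(mean_all, mean_cond)   # both ~1/lambda: the remaining wait is "fresh"
```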


Bayes Rule

Total probability + conditional probability

Recall total probability: If {Fi; i = 1, 2, . . .} is a partition of Ω, then for any event G

P(G) = Σ_i P(G ∩ Fi).

Suppose we know P(Fi), all i — the a priori probabilities of the collection of events

Suppose know P(G | Fi)

Find a posteriori probabilities P(Fi | G).

Given you observe G, how does that change the probabilities of Fi?


P(Fi | G) = P(Fi ∩ G)/P(G) = P(G | Fi)P(Fi) / Σ_j P(G ∩ Fj) = P(G | Fi)P(Fi) / Σ_j P(G | Fj)P(Fj)

Example of Bayes’ rule: Binary communication channel

Noisy binary communication channel: 0 or 1 is sent and 0 or 1 is received. Assume that 0 is sent with probability 0.2 (⇒ 1 is sent with probability 0.8)

The channel is noisy. If a 0 is sent, a 0 is received with probability 0.9, and if a 1 is sent, a 1 is received with probability 0.975


Can represent this channel model by a probability transition diagram

[Diagram: inputs 0 (P(0) = 0.2) and 1 (P(1) = 0.8); transitions P(0 | 0) = 0.9, P(1 | 0) = 0.1, P(1 | 1) = 0.975, P(0 | 1) = 0.025]

Problem: Given that 0 is received, find the probability that 0 was sent

Sample space Ω = {(0, 0), (0, 1), (1, 0), (1, 1)}; points: (bit sent, bit received)


Define

A = {0 is sent} = {(0, 1), (0, 0)}, B = {0 is received} = {(0, 0), (1, 0)}

The probability measure is defined via the P(A), P(B | A), and P(B^c | A^c) provided on the probability transition diagram of the channel

To find P(A | B), use Bayes' rule:

P(A | B) = P(B | A)P(A) / (P(A)P(B | A) + P(A^c)P(B | A^c))

⇒ P(A | B) = (0.9 × 0.2) / (0.2 × 0.9 + 0.8 × 0.025) = 0.18/0.2 = 0.9,

much higher than the a priori probability of A (= 0.2)
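The same computation in a few lines (a sketch, not from the notes):

```python
P_A = 0.2               # P(0 sent)
P_B_given_A = 0.9       # P(0 received | 0 sent)
P_B_given_notA = 0.025  # P(0 received | 1 sent)

P_B = P_A * P_B_given_A + (1 - P_A) * P_B_given_notA   # total probability
P_A_given_B = P_B_given_A * P_A / P_B                  # Bayes' rule
print(P_A_given_B)   # 0.9
```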


End of basic probability. Next: random variables, vectors, and processes.

EE278: Introduction to Statistical Signal Processing, Winter 2010–2011 · January 11, 2011 · R.M. Gray