8/14/2019 Prof[1] F P Kelly - Probability
1/78
Probability
Prof. F.P. Kelly
Lent 1996
These notes are maintained by Paul Metcalfe.
Comments and corrections to [email protected].
Revision: 1.1
Date: 2002/09/20 14:45:43
The following people have maintained these notes.

    – June 2000          Kate Metcalfe
    June 2000 – July 2004  Andrew Rogers
    July 2004 – date       Paul Metcalfe
Contents

Introduction

1 Basic Concepts
  1.1 Sample Space
  1.2 Classical Probability
  1.3 Combinatorial Analysis
  1.4 Stirling's Formula

2 The Axiomatic Approach
  2.1 The Axioms
  2.2 Independence
  2.3 Distributions
  2.4 Conditional Probability

3 Random Variables
  3.1 Expectation
  3.2 Variance
  3.3 Indicator Function
  3.4 Inclusion-Exclusion Formula
  3.5 Independence

4 Inequalities
  4.1 Jensen's Inequality
  4.2 Cauchy-Schwarz Inequality
  4.3 Markov's Inequality
  4.4 Chebyshev's Inequality
  4.5 Law of Large Numbers

5 Generating Functions
  5.1 Combinatorial Applications
  5.2 Conditional Expectation
  5.3 Properties of Conditional Expectation
  5.4 Branching Processes
  5.5 Random Walks

6 Continuous Random Variables
  6.1 Jointly Distributed Random Variables
  6.2 Transformation of Random Variables
  6.3 Moment Generating Functions
  6.4 Central Limit Theorem
  6.5 Multivariate Normal Distribution
Introduction
These notes are based on the course Probability given by Prof. F.P. Kelly in Cam-
bridge in the Lent Term 1996. This typed version of the notes is totally unconnected
with Prof. Kelly.
Other sets of notes are available for different courses. At the time of typing, these courses were:
Probability Discrete Mathematics
Analysis Further Analysis
Methods Quantum Mechanics
Fluid Dynamics 1 Quadratic Mathematics
Geometry Dynamics of D.E.s
Foundations of QM Electrodynamics
Methods of Math. Phys Fluid Dynamics 2
Waves (etc.) Statistical Physics
General Relativity Dynamical Systems
Combinatorics Bifurcations in Nonlinear Convection
They may be downloaded from
http://www.istari.ucam.org/maths/ .
Chapter 1
Basic Concepts
1.1 Sample Space
Suppose we have an experiment with a set Ω of outcomes. Then Ω is called the sample space. A potential outcome ω ∈ Ω is called a sample point.

For instance, if the experiment is tossing coins, then Ω = {H, T}; if the experiment is tossing two dice, then Ω = {(i, j) : i, j ∈ {1, . . . , 6}}.

A subset A of Ω is called an event. An event A occurs if, when the experiment is performed, the outcome ω satisfies ω ∈ A. For the coin-tossing experiment the event of a head appearing is A = {H}, and for the two dice the event of rolling a total of four is A = {(1, 3), (2, 2), (3, 1)}.
1.2 Classical Probability
If Ω is finite, Ω = {ω_1, . . . , ω_n}, and each of the n sample points is equally likely, then the probability of event A occurring is

    P(A) = |A| / |Ω|.
Example. Choose r digits from a table of random numbers. Find the probability that, for 0 ≤ k ≤ 9,

1. no digit exceeds k;
2. k is the greatest digit drawn.

Solution. The event that no digit exceeds k is

    A_k = {(a_1, . . . , a_r) : 0 ≤ a_i ≤ k, i = 1, . . . , r}.

Now |A_k| = (k + 1)^r, so that P(A_k) = ((k + 1)/10)^r.

Let B_k be the event that k is the greatest digit drawn. Then B_k = A_k \ A_{k−1}. Also A_{k−1} ⊂ A_k, so that |B_k| = (k + 1)^r − k^r. Thus

    P(B_k) = ((k + 1)^r − k^r) / 10^r.
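These formulae can be checked by brute force for small r (a quick sketch, not part of the original notes; the function names are mine):

```python
from itertools import product

def p_no_digit_exceeds(k, r):
    # P(A_k): all r digits lie in {0, ..., k}
    return ((k + 1) / 10) ** r

def p_greatest_is(k, r):
    # P(B_k): the greatest digit drawn is exactly k
    return ((k + 1) ** r - k ** r) / 10 ** r

# Cross-check against direct enumeration of all 10**r outcomes for small r.
r = 3
for k in range(10):
    count = sum(1 for digits in product(range(10), repeat=r) if max(digits) == k)
    assert abs(count / 10 ** r - p_greatest_is(k, r)) < 1e-12
```

The events B_k partition the sample space, so their probabilities sum to 1.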
The problem of the points
Players A and B play a series of games. The winner of each game wins a point. The two players are equally skilful, and the stake will be won by the first player to reach a target. They are forced to stop when A is within 2 points of the target and B is within 3 points. How should the stake be divided?
Pascal suggested that the following continuations were equally likely
AAAA AAAB AABB ABBB BBBB
AABA ABBA BABB
ABAA ABAB BBAB
BAAA BABA BBBA
BAAB
BBAA
This makes the ratio 11 : 5. It was previously thought that the ratio should be 6 : 4, on considering the possible terminations, but those results are not equally likely.
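Pascal's count can be verified by enumerating the 2^4 equally likely continuations directly (a sketch, not from the original notes):

```python
from itertools import product

# A needs 2 more points, B needs 3 more, so at most 4 further games decide it.
# Enumerate the 2**4 equally likely continuations and count those A wins.
a_wins = 0
for games in product("AB", repeat=4):
    a = b = 0
    winner = None
    for g in games:
        if g == "A":
            a += 1
        else:
            b += 1
        if a == 2:
            winner = "A"
            break
        if b == 3:
            winner = "B"
            break
    if winner == "A":
        a_wins += 1

print(a_wins, 16 - a_wins)  # → 11 5
```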
1.3 Combinatorial Analysis
The fundamental rule is:
Suppose r experiments are such that the first may result in any of n_1 possible outcomes and, for each of the possible outcomes of the first i − 1 experiments, there are n_i possible outcomes to experiment i. Let a_i be the outcome of experiment i. Then there are a total of ∏_{i=1}^r n_i distinct r-tuples (a_1, . . . , a_r) describing the possible outcomes of the r experiments.
Proof. Induction.
1.4 Stirling's Formula

For functions g(n) and h(n), we say that g is asymptotically equivalent to h, written g(n) ∼ h(n), if g(n)/h(n) → 1 as n → ∞.

Theorem 1.1 (Stirling's Formula). As n → ∞,

    log( n! / (√(2πn) n^n e^{−n}) ) → 0,

and thus n! ∼ √(2πn) n^n e^{−n}.
We first prove the weak form of Stirling's formula, that log(n!) ∼ n log n.

Proof. log n! = ∑_{k=1}^n log k. Now

    ∫_1^n log x dx ≤ ∑_{k=1}^n log k ≤ ∫_1^{n+1} log x dx,

and ∫_1^z log x dx = z log z − z + 1, so

    n log n − n + 1 ≤ log n! ≤ (n + 1) log(n + 1) − n.

Divide by n log n and let n → ∞ to sandwich (log n!)/(n log n) between terms that tend to 1. Therefore log n! ∼ n log n.
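Numerically, the ratio n!/(√(2πn) n^n e^{−n}) is already close to 1 for modest n (a quick check, not part of the original notes):

```python
import math

# The ratio n! / (sqrt(2*pi*n) * n**n * exp(-n)) should tend to 1 as n grows.
def stirling_ratio(n):
    approx = math.sqrt(2 * math.pi * n) * n ** n * math.exp(-n)
    return math.factorial(n) / approx

for n in (5, 20, 100):
    print(n, stirling_ratio(n))
```

The ratio decreases towards 1 from above, roughly like e^{1/(12n)}.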
Now we prove the strong form.
Proof. For x > 0 we have

    1 − x + x² − x³ < 1/(1 + x) < 1 − x + x².

Now integrate from 0 to y to obtain

    y − y²/2 + y³/3 − y⁴/4 < log(1 + y) < y − y²/2 + y³/3.
Let h_n = log( n! e^n / n^{n+1/2} ). Then¹ we obtain

    1/(12n²) − 1/(12n³) ≤ h_n − h_{n+1} ≤ 1/(12n²) + 1/(6n³).

For n ≥ 2, 0 ≤ h_n − h_{n+1} ≤ 1/n². Thus h_n is a decreasing sequence, and

    0 ≤ h_2 − h_{n+1} = ∑_{r=2}^n (h_r − h_{r+1}) ≤ ∑_{r=2}^∞ 1/r².

Therefore h_n is bounded below and decreasing, so is convergent. Let the limit be A. We have obtained

    n! ∼ e^A n^{n+1/2} e^{−n}.
We need a trick to find A. Let I_r = ∫_0^{π/2} sin^r θ dθ. Integrating by parts gives the recurrence I_r = ((r − 1)/r) I_{r−2}, and therefore

    I_{2n} = (2n)!/(2^n n!)² · π/2   and   I_{2n+1} = (2^n n!)²/(2n + 1)!.

Now I_n is decreasing in n, so

    1 ≤ I_{2n}/I_{2n+1} ≤ I_{2n−1}/I_{2n+1} = 1 + 1/(2n) → 1.

But substituting n! ∼ e^A n^{n+1/2} e^{−n} into the formulae for I_{2n} and I_{2n+1}, we get

    I_{2n}/I_{2n+1} → 2π/e^{2A}.

Therefore e^{2A} = 2π, as required.
¹ by playing silly buggers with log(1 + 1/n)
Chapter 2
The Axiomatic Approach
2.1 The Axioms
Let Ω be a sample space. Then a probability P is a real-valued function defined on the subsets of Ω satisfying:

1. 0 ≤ P(A) ≤ 1 for all A ⊆ Ω,
2. P(Ω) = 1,
3. for a finite or infinite sequence A_1, A_2, . . . of disjoint events, P(⋃_i A_i) = ∑_i P(A_i).
The number P(A) is called the probability of event A.
We can look at some distributions here. Consider an arbitrary finite or countable Ω = {ω_1, ω_2, . . .} and an arbitrary collection {p_1, p_2, . . .} of non-negative numbers with sum 1. If we define

    P(A) = ∑_{i : ω_i ∈ A} p_i,

it is easy to see that this function satisfies the axioms. The numbers p_1, p_2, . . . are called a probability distribution. If Ω is finite with n elements, and if p_1 = p_2 = · · · = p_n = 1/n, we recover the classical definition of probability.

Another example: let Ω = {0, 1, . . .} and attach to outcome r the probability p_r = e^{−λ} λ^r / r! for some λ > 0. This is a distribution (as may easily be verified), called the Poisson distribution with parameter λ.
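A quick numerical sanity check (not in the original notes) that these Poisson weights do form a distribution:

```python
import math

# The Poisson weights p_r = exp(-lam) * lam**r / r! are non-negative and sum to 1
# (checked here by truncating the series; lam = 3 is an arbitrary choice).
lam = 3.0
p = [math.exp(-lam) * lam ** r / math.factorial(r) for r in range(100)]
assert all(x >= 0 for x in p)
assert abs(sum(p) - 1) < 1e-12
```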
Theorem 2.1 (Properties of P). A probability P satisfies

1. P(Aᶜ) = 1 − P(A),
2. P(∅) = 0,
3. if A ⊆ B then P(A) ≤ P(B),
4. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. Note that Ω = A ∪ Aᶜ with A ∩ Aᶜ = ∅. Thus 1 = P(Ω) = P(A) + P(Aᶜ). Now we can use this to obtain P(∅) = 1 − P(Ω) = 0. If A ⊆ B, write B = A ∪ (B ∩ Aᶜ), so that P(B) = P(A) + P(B ∩ Aᶜ) ≥ P(A). Finally, write A ∪ B = A ∪ (B ∩ Aᶜ) and B = (B ∩ A) ∪ (B ∩ Aᶜ). Then P(A ∪ B) = P(A) + P(B ∩ Aᶜ) and P(B) = P(B ∩ A) + P(B ∩ Aᶜ), which gives the result.

Theorem 2.2 (Boole's Inequality). For any events A_1, A_2, . . . ,

    P(⋃_{i=1}^n A_i) ≤ ∑_{i=1}^n P(A_i)   and   P(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ P(A_i).
Proof. Let B_1 = A_1 and then inductively let B_i = A_i \ ⋃_{k=1}^{i−1} B_k. Thus the B_i are disjoint and ⋃_i B_i = ⋃_i A_i. Therefore

    P(⋃_i A_i) = P(⋃_i B_i) = ∑_i P(B_i) ≤ ∑_i P(A_i),

as B_i ⊆ A_i.
Theorem 2.3 (Inclusion-Exclusion Formula).
    P(⋃_{i=1}^n A_i) = ∑_{S ⊆ {1,...,n}, S ≠ ∅} (−1)^{|S|−1} P(⋂_{j∈S} A_j).
Proof. We know that P(A_1 ∪ A_2) = P(A_1) + P(A_2) − P(A_1 ∩ A_2), so the result is true for n = 2. We also have that

    P(A_1 ∪ · · · ∪ A_n) = P(A_1 ∪ · · · ∪ A_{n−1}) + P(A_n) − P((A_1 ∪ · · · ∪ A_{n−1}) ∩ A_n).

But by distributivity, we have

    P(⋃_{i=1}^n A_i) = P(⋃_{i=1}^{n−1} A_i) + P(A_n) − P(⋃_{i=1}^{n−1} (A_i ∩ A_n)).

Application of the inductive hypothesis yields the result.
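The formula can be checked mechanically on a small sample space. The helper below is a sketch (the names `union_prob`, `events` and `prob` are mine, not from the notes):

```python
from itertools import combinations

def union_prob(events, prob):
    # Inclusion-exclusion: alternating sum over non-empty index subsets S of the
    # probability of the intersection, weighted by (-1)**(|S| - 1).
    total = 0.0
    n = len(events)
    for size in range(1, n + 1):
        for S in combinations(range(n), size):
            inter = set.intersection(*(events[i] for i in S))
            total += (-1) ** (size - 1) * sum(prob[w] for w in inter)
    return total

# Check against the direct probability of the union on a toy sample space.
prob = {w: 1 / 6 for w in range(6)}          # a fair die
events = [{0, 1, 2}, {1, 3}, {2, 3, 5}]
direct = sum(prob[w] for w in set().union(*events))
assert abs(union_prob(events, prob) - direct) < 1e-12
```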
Corollary (Bonferroni Inequalities).

    ∑_{S ⊆ {1,...,n}, 1 ≤ |S| ≤ r} (−1)^{|S|−1} P(⋂_{j∈S} A_j)   ≤ P(⋃_{i=1}^n A_i)   or   ≥ P(⋃_{i=1}^n A_i)

according as r is even or odd. In other words, if the inclusion-exclusion formula is truncated, the error has the sign of the first omitted term and is smaller in absolute value.

Note that the case r = 1 is Boole's inequality.
Proof. The result is true for n = 2. If it is true for n − 1, then it is true for n and 1 ≤ r ≤ n − 1 by the inductive step above, which expresses an n-fold union in terms of two (n − 1)-fold unions. It is true for r = n by the inclusion-exclusion formula.

Example (Derangements). After a dinner, the n guests take coats at random from a pile. Find the probability that at least one guest has the right coat.

Solution. Let A_k be the event that guest k has his own coat.
We want P(⋃_{i=1}^n A_i). Now

    P(A_{i_1} ∩ · · · ∩ A_{i_r}) = (n − r)!/n!,

by counting the number of ways of matching guests and coats once i_1, . . . , i_r have taken their own. Thus, by inclusion-exclusion,

    P(⋃_{i=1}^n A_i) = ∑_{r=1}^n (−1)^{r−1} (n choose r) (n − r)!/n! = ∑_{r=1}^n (−1)^{r−1}/r! → 1 − e^{−1} as n → ∞.
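The truncated series can be checked exactly by enumerating coat assignments for small n (a sketch, not from the notes):

```python
import math
from itertools import permutations

# P(at least one of n guests gets the right coat) = sum_{r=1}^{n} (-1)**(r-1) / r!,
# which tends to 1 - 1/e as n grows.
def p_some_match(n):
    return sum((-1) ** (r - 1) / math.factorial(r) for r in range(1, n + 1))

# Exact check by enumerating all n! coat assignments for small n.
n = 6
hits = sum(1 for perm in permutations(range(n)) if any(perm[i] == i for i in range(n)))
assert abs(hits / math.factorial(n) - p_some_match(n)) < 1e-12
```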
2.2 Independence

Definition 2.1. Events A and B are independent if P(A ∩ B) = P(A)P(B); events A_1, . . . , A_n are independent if P(⋂_{i∈S} A_i) = ∏_{i∈S} P(A_i) for every non-empty subset S of {1, . . . , n}.

Example. Two fair dice are thrown. Let A_1 be the event that the first die shows an odd number, A_2 the event that the second die shows an odd number, and A_3 the event that the total is odd. Are A_1, A_2 and A_3 independent?

Solution. We first calculate the probabilities of the events A_1, A_2, A_3, A_1 ∩ A_2, A_1 ∩ A_3 and A_1 ∩ A_2 ∩ A_3.

    Event              Probability
    A_1                18/36 = 1/2
    A_2                as above, 1/2
    A_3                18/36 = 1/2
    A_1 ∩ A_2          9/36 = 1/4
    A_1 ∩ A_3          9/36 = 1/4
    A_1 ∩ A_2 ∩ A_3    0

Thus by a series of multiplications we can see that A_1 and A_2 are independent, A_1 and A_3 are independent (also A_2 and A_3), but that A_1, A_2 and A_3 are not independent.
Now we wish to state what we mean by two independent experiments². Consider Ω_1 = {α_1, . . .} and Ω_2 = {β_1, . . .} with associated probability distributions {p_1, . . .} and {q_1, . . .}. Then, by two independent experiments, we mean the sample space Ω_1 × Ω_2 with probability distribution P((α_i, β_j)) = p_i q_j.

Now suppose A ⊆ Ω_1 and B ⊆ Ω_2. The event A can be interpreted as an event in Ω_1 × Ω_2, namely A × Ω_2, and similarly for B. Then

    P(A ∩ B) = ∑_{α_i∈A} ∑_{β_j∈B} p_i q_j = ∑_{α_i∈A} p_i ∑_{β_j∈B} q_j = P(A)P(B),

which is why they are called independent experiments. The obvious generalisation to n experiments can be made; for an infinite sequence of experiments we mean a sample space Ω_1 × Ω_2 × · · · satisfying the appropriate formula for all n ∈ ℕ.

You might like to find the probability that n independent tosses of a biased coin with probability of heads p result in a total of r heads.
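The suggested exercise can be checked by brute force over the product sample space (a sketch; `p_r_heads` is my name, and the answer is the binomial probability):

```python
from itertools import product
from math import comb

# P(exactly r heads in n independent tosses), summed over the product sample
# space, equals C(n, r) * p**r * (1-p)**(n-r).
def p_r_heads(n, p, r):
    total = 0.0
    for outcome in product((1, 0), repeat=n):   # 1 = head, 0 = tail
        if sum(outcome) == r:
            prob = 1.0
            for o in outcome:
                prob *= p if o else (1 - p)
            total += prob
    return total

n, p, r = 5, 0.3, 2
assert abs(p_r_heads(n, p, r) - comb(n, r) * p ** r * (1 - p) ** (n - r)) < 1e-12
```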
2.3 Distributions
The binomial distribution with parameters n and p, 0 ≤ p ≤ 1, has Ω = {0, . . . , n} and probabilities

    p_i = (n choose i) p^i (1 − p)^{n−i}.

Theorem 2.4 (Poisson approximation to binomial). If n → ∞ and p → 0 with np = λ held fixed, then

    (n choose r) p^r (1 − p)^{n−r} → e^{−λ} λ^r / r!.

² or more generally, n.
Proof.

    (n choose r) p^r (1 − p)^{n−r} = [n(n − 1) · · · (n − r + 1) / r!] p^r (1 − p)^{n−r}
        = (n/n)((n − 1)/n) · · · ((n − r + 1)/n) · (np)^r / r! · (1 − p)^{n−r}
        = ∏_{i=1}^r ((n − i + 1)/n) · λ^r / r! · (1 − λ/n)^{n−r}
        → 1 · λ^r / r! · e^{−λ} = e^{−λ} λ^r / r!.
Suppose an infinite sequence of independent trials is to be performed. Each trial results in a success with probability p ∈ (0, 1) or a failure with probability 1 − p. Such a sequence is called a sequence of Bernoulli trials. The probability that the first success occurs after exactly r failures is p_r = p(1 − p)^r. This is the geometric distribution with parameter p. Since ∑_{r=0}^∞ p_r = 1, the probability that all trials result in failure is zero.
2.4 Conditional Probability
Definition 2.2. Provided P(B) > 0, we define the conditional probability of A given B³ to be

    P(A|B) = P(A ∩ B) / P(B).

Whenever we write P(A|B), we assume that P(B) > 0. Note that if A and B are independent then P(A|B) = P(A).

Theorem 2.5.

1. P(A ∩ B) = P(A|B)P(B),
2. P(A ∩ B ∩ C) = P(A|B ∩ C)P(B|C)P(C),
3. P(A|B ∩ C) = P(A ∩ B|C) / P(B|C),
4. the function P(·|B) restricted to subsets of B is a probability function on B.

Proof. Results 1 to 3 are immediate from the definition of conditional probability. For result 4, note that A ∩ B ⊆ B, so P(A ∩ B) ≤ P(B) and thus P(A|B) ≤ 1. Clearly P(B|B) = 1, so it just remains to show the last axiom. For disjoint events A_i,

    P(⋃_i A_i | B) = P(⋃_i (A_i ∩ B)) / P(B) = ∑_i P(A_i ∩ B) / P(B) = ∑_i P(A_i|B),

as required.
³ read "A given B".
Chapter 3
Random Variables
Let Ω be finite or countable, and let p_ω = P({ω}) for ω ∈ Ω.

Definition 3.1. A random variable X is a function X : Ω → ℝ.

Note that "random variable" is a somewhat inaccurate term: a random variable is neither random nor a variable.

Example. If Ω = {(i, j) : 1 ≤ i, j ≤ t}, then we can define random variables X and Y by X(i, j) = i + j and Y(i, j) = max{i, j}.

Let R_X be the image of Ω under X. When this range is finite or countable, the random variable is said to be discrete.

We write P(X = x_i) for ∑_{ω : X(ω)=x_i} p_ω, and for B ⊆ ℝ

    P(X ∈ B) = ∑_{x∈B} P(X = x).

Then (P(X = x), x ∈ R_X) is the distribution of the random variable X. Note that it is a probability distribution over R_X.
3.1 Expectation
Definition 3.2. The expectation of a random variable X is the number

    E[X] = ∑_{ω∈Ω} p_ω X(ω),

provided that this sum converges absolutely.
Note that

    E[X] = ∑_{ω∈Ω} p_ω X(ω) = ∑_{x∈R_X} ∑_{ω : X(ω)=x} p_ω X(ω) = ∑_{x∈R_X} x ∑_{ω : X(ω)=x} p_ω = ∑_{x∈R_X} x P(X = x).

Absolute convergence allows the sum to be taken in any order.

If X is a positive random variable and if ∑_ω p_ω X(ω) = ∞, we write E[X] = +∞. If

    ∑_{x∈R_X, x≥0} x P(X = x) = ∞   and   ∑_{x∈R_X, x<0} x P(X = x) = −∞,

then E[X] is undefined.

Example. Let X have the binomial distribution with parameters n and p. Find E[X].
Solution.

    E[X] = ∑_{r=0}^n r (n choose r) p^r (1 − p)^{n−r}
         = ∑_{r=0}^n r [n! / (r!(n − r)!)] p^r (1 − p)^{n−r}
         = n ∑_{r=1}^n [(n − 1)! / ((r − 1)!(n − r)!)] p^r (1 − p)^{n−r}
         = np ∑_{r=1}^n [(n − 1)! / ((r − 1)!(n − r)!)] p^{r−1} (1 − p)^{n−r}
         = np ∑_{r=0}^{n−1} (n − 1 choose r) p^r (1 − p)^{n−1−r}
         = np.
For any function f : ℝ → ℝ, the composition of f and X defines a new random variable f(X) given by

    f(X)(ω) = f(X(ω)).

Example. If a, b and c are constants, then a + bX and (X − c)² are random variables, defined by

    (a + bX)(ω) = a + bX(ω)   and   (X − c)²(ω) = (X(ω) − c)².

Note that E[X] is a constant.
Theorem 3.1.

1. If X ≥ 0 then E[X] ≥ 0.
2. If X ≥ 0 and E[X] = 0 then P(X = 0) = 1.
3. If a and b are constants then E[a + bX] = a + bE[X].
4. For any random variables X, Y, E[X + Y] = E[X] + E[Y].
5. E[X] is the constant which minimises E[(X − c)²].

Proof. 1. X ≥ 0 means X(ω) ≥ 0 for all ω ∈ Ω, so E[X] = ∑_ω p_ω X(ω) ≥ 0.

2. If there were some ω with p_ω > 0 and X(ω) > 0, then E[X] > 0; therefore P(X = 0) = 1.
3.

    E[a + bX] = ∑_ω (a + bX(ω)) p_ω = a ∑_ω p_ω + b ∑_ω p_ω X(ω) = a + bE[X].

4. Trivial.

5. Now

    E[(X − c)²] = E[(X − E[X] + E[X] − c)²]
               = E[(X − E[X])² + 2(X − E[X])(E[X] − c) + (E[X] − c)²]
               = E[(X − E[X])²] + 2(E[X] − c)E[X − E[X]] + (E[X] − c)²
               = E[(X − E[X])²] + (E[X] − c)².

This is clearly minimised when c = E[X].
Theorem 3.2. For any random variables X_1, X_2, . . . , X_n,

    E[∑_{i=1}^n X_i] = ∑_{i=1}^n E[X_i].

Proof.

    E[∑_{i=1}^n X_i] = E[∑_{i=1}^{n−1} X_i + X_n] = E[∑_{i=1}^{n−1} X_i] + E[X_n].

The result follows by induction.
3.2 Variance

For a random variable X, the variance is

    Var X = E[(X − E[X])²] = σ²,

and the standard deviation is σ = √(Var X).

Theorem 3.3 (Properties of Variance).

(i) Var X ≥ 0, and if Var X = 0 then P(X = E[X]) = 1.

Proof: from properties 1 and 2 of expectation, applied to (X − E[X])².

(ii) If a, b are constants, Var(a + bX) = b² Var X.
Proof.

    Var(a + bX) = E[(a + bX − a − bE[X])²] = b²E[(X − E[X])²] = b² Var X.

(iii) Var X = E[X²] − E[X]².

Proof.

    E[(X − E[X])²] = E[X² − 2XE[X] + (E[X])²]
                  = E[X²] − 2E[X]E[X] + (E[X])²
                  = E[X²] − (E[X])².

Example. Let X have the geometric distribution P(X = r) = pq^r, r = 0, 1, 2, . . . , where p + q = 1. Then E[X] = q/p and Var X = q/p².
Solution.

    E[X] = ∑_{r=0}^∞ r p q^r = pq ∑_{r=0}^∞ r q^{r−1}
         = pq ∑_{r=0}^∞ (d/dq) q^r = pq (d/dq) (1/(1 − q))
         = pq / (1 − q)² = q/p.

    E[X²] = ∑_{r=0}^∞ r² p q^r
          = pq [ ∑_{r=1}^∞ r(r + 1) q^{r−1} − ∑_{r=1}^∞ r q^{r−1} ]
          = pq [ 2/(1 − q)³ − 1/(1 − q)² ]
          = 2q/p² − q/p.

    Var X = E[X²] − E[X]² = 2q/p² − q/p − (q/p)² = q/p².
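Both results can be checked numerically by truncating the series (a sketch, not part of the original notes):

```python
# Numerical check for the geometric distribution P(X = r) = p * q**r, r = 0, 1, ...:
# E[X] = q/p and Var X = q/p**2 (series truncated at a large cutoff).
p = 0.3
q = 1 - p
probs = [p * q ** r for r in range(2000)]
mean = sum(r * pr for r, pr in enumerate(probs))
second = sum(r * r * pr for r, pr in enumerate(probs))
assert abs(mean - q / p) < 1e-9
assert abs(second - mean ** 2 - q / p ** 2) < 1e-9
```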
Definition 3.3. The covariance of random variables X and Y is

    Cov(X, Y) = E[(X − E[X])(Y − E[Y])].

The correlation of X and Y is

    Corr(X, Y) = Cov(X, Y) / √(Var X · Var Y).
Theorem 3.4. Var(X + Y) = Var X + Var Y + 2 Cov(X, Y).

Proof.

    Var(X + Y) = E[((X + Y) − E[X] − E[Y])²]
              = E[(X − E[X])² + (Y − E[Y])² + 2(X − E[X])(Y − E[Y])]
              = Var X + Var Y + 2 Cov(X, Y).
3.3 Indicator Function
Definition 3.4. The indicator function I[A] of an event A ⊆ Ω is the function

    I[A](ω) = 1 if ω ∈ A, and 0 if ω ∉ A.   (3.1)

Note that I[A] is a random variable.

1. E[I[A]] = P(A), since E[I[A]] = ∑_ω p_ω I[A](ω) = P(A).
2. I[Aᶜ] = 1 − I[A].
3. I[A ∩ B] = I[A]I[B].
4. I[A ∪ B] = I[A] + I[B] − I[A]I[B]: indeed I[A ∪ B](ω) = 1 iff ω ∈ A or ω ∈ B, and checking each case shows I[A ∪ B](ω) = I[A](ω) + I[B](ω) − I[A](ω)I[B](ω). Works!
Example. n couples are arranged randomly around a table such that males and females alternate. Let N be the number of husbands sitting next to their wives. Calculate E[N] and Var N.

Solution. Let A_i be the event that couple i sit together, and write

    N = ∑_{i=1}^n I[A_i].

Then

    E[N] = E[∑_{i=1}^n I[A_i]] = ∑_{i=1}^n E[I[A_i]] = ∑_{i=1}^n 2/n = 2.

For the variance,

    E[N²] = E[(∑_{i=1}^n I[A_i])²] = E[∑_i I[A_i]² + ∑_{i≠j} I[A_i]I[A_j]]
          = nE[I[A_1]²] + n(n − 1)E[I[A_1]I[A_2]].

Now E[I[A_i]²] = E[I[A_i]] = 2/n, and

    E[I[A_1]I[A_2]] = P(A_1 ∩ A_2) = P(A_1)P(A_2|A_1)
                    = (2/n) [ (1/(n − 1))·(1/(n − 1)) + ((n − 2)/(n − 1))·(2/(n − 1)) ]
                    = (2/n) · (2n − 3)/(n − 1)².

Therefore

    Var N = E[N²] − E[N]²
          = 2 + 2(2n − 3)/(n − 1) − 4
          = 2(n − 2)/(n − 1).
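A Monte Carlo check of E[N] = 2 (a sketch: I fix the husbands in alternating seats and permute the wives at random; names and seating convention are mine):

```python
import random

# Husbands sit fixed in alternating seats; female seat i is adjacent to male
# seats i and i+1 (mod n). Count couples sitting together and average.
def simulate(n, trials=100_000, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        wives = list(range(n))
        rng.shuffle(wives)                     # wives[i] = wife in female seat i
        total += sum(1 for seat, wife in enumerate(wives)
                     if wife in (seat, (seat + 1) % n))
    return total / trials

est = simulate(6)
assert abs(est - 2) < 0.05
```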
3.5 Independence

Theorem 3.5. If X_1, . . . , X_n are independent random variables and f_1, . . . , f_n are functions ℝ → ℝ, then f_1(X_1), . . . , f_n(X_n) are independent random variables.

Proof.

    P(f_1(X_1) = y_1, . . . , f_n(X_n) = y_n) = ∑_{x_1 : f_1(x_1)=y_1, ..., x_n : f_n(x_n)=y_n} P(X_1 = x_1, . . . , X_n = x_n)
        = ∏_{i=1}^n ∑_{x_i : f_i(x_i)=y_i} P(X_i = x_i)
        = ∏_{i=1}^n P(f_i(X_i) = y_i).
Theorem 3.6. If X_1, . . . , X_n are independent random variables then

    E[∏_{i=1}^n X_i] = ∏_{i=1}^n E[X_i].

NOTE that E[∑ X_i] = ∑ E[X_i] without requiring independence.

Proof. Write R_i for R_{X_i}, the range of X_i. Then

    E[∏_{i=1}^n X_i] = ∑_{x_1∈R_1} · · · ∑_{x_n∈R_n} x_1 · · · x_n P(X_1 = x_1, . . . , X_n = x_n)
                    = ∏_{i=1}^n ∑_{x_i∈R_i} x_i P(X_i = x_i)
                    = ∏_{i=1}^n E[X_i].
Theorem 3.7. If X_1, . . . , X_n are independent random variables and f_1, . . . , f_n are functions ℝ → ℝ, then

    E[∏_{i=1}^n f_i(X_i)] = ∏_{i=1}^n E[f_i(X_i)].

Proof. Obvious from the last two theorems!
Theorem 3.8. If X_1, . . . , X_n are independent random variables then

    Var(∑_{i=1}^n X_i) = ∑_{i=1}^n Var X_i.

Proof.

    Var(∑_{i=1}^n X_i) = E[(∑_{i=1}^n X_i)²] − (E[∑_{i=1}^n X_i])²
                      = E[∑_i X_i² + ∑_{i≠j} X_iX_j] − (∑_i E[X_i])²
                      = ∑_i E[X_i²] + ∑_{i≠j} E[X_i]E[X_j] − ∑_i E[X_i]² − ∑_{i≠j} E[X_i]E[X_j]
                      = ∑_i (E[X_i²] − E[X_i]²)
                      = ∑_i Var X_i,

using independence (E[X_iX_j] = E[X_i]E[X_j] for i ≠ j) in the third line.
Theorem 3.9. If X_1, . . . , X_n are independent identically distributed random variables, then

    Var((1/n) ∑_{i=1}^n X_i) = (1/n) Var X_1.

Proof.

    Var((1/n) ∑_{i=1}^n X_i) = (1/n²) Var(∑_{i=1}^n X_i) = (1/n²) ∑_{i=1}^n Var X_i = (1/n) Var X_1.
Example (Experimental Design). Two rods have unknown lengths a, b. A rule can measure a length, but with an error having mean 0 (unbiased) and variance σ². Errors are independent from measurement to measurement. To estimate a and b we could take separate measurements A, B of each rod:

    E[A] = a, Var A = σ²;   E[B] = b, Var B = σ².

Can we do better? Yes! Measure a + b as X and a − b as Y. Then

    E[X] = a + b, Var X = σ²;   E[Y] = a − b, Var Y = σ²,

and

    E[(X + Y)/2] = a,   Var((X + Y)/2) = σ²/2,
    E[(X − Y)/2] = b,   Var((X − Y)/2) = σ²/2.

So, with the same two measurements, each estimate has half the variance: this is better.
Example (Non-standard dice). You choose one die, then I choose one. The dice can be designed so that, around a cycle, each die beats the next with probability 2/3: P(A beats B) = 2/3, and so on around the cycle. So the relation "A better than B" is not transitive.
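The original figure of the dice is not reproduced above; Efron's dice are a standard set with exactly this cyclic property (so the specific faces below are an assumption, not from the notes):

```python
from itertools import product

# Efron's dice: each die beats the next one in the cycle A -> B -> C -> D -> A
# with probability 2/3.
dice = {
    "A": [4, 4, 4, 4, 0, 0],
    "B": [3, 3, 3, 3, 3, 3],
    "C": [6, 6, 2, 2, 2, 2],
    "D": [5, 5, 5, 1, 1, 1],
}

def p_beats(x, y):
    return sum(1 for a, b in product(dice[x], dice[y]) if a > b) / 36

for x, y in [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]:
    assert abs(p_beats(x, y) - 2 / 3) < 1e-12
```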
Chapter 4
Inequalities
4.1 Jensen's Inequality

A function f : (a, b) → ℝ is convex if

    f(px + (1 − p)y) ≤ pf(x) + (1 − p)f(y)   for all x, y ∈ (a, b) and all p ∈ (0, 1).

It is strictly convex if strict inequality holds whenever x ≠ y. f is concave if −f is convex, and strictly concave if −f is strictly convex.
(Figures: graphs of a convex function, a concave function, and a function neither concave nor convex.)
We know that if f is twice differentiable and f″(x) ≥ 0 for all x ∈ (a, b), then f is convex; it is strictly convex if f″(x) > 0 for all x ∈ (a, b).
Example. f(x) = −log x has f′(x) = −1/x and f″(x) = 1/x² > 0, so f is strictly convex on (0, ∞).

Example. f(x) = −x log x has f′(x) = −(1 + log x) and f″(x) = −1/x < 0, so f is strictly concave on (0, ∞).

Example. f(x) = x³ is strictly convex on (0, ∞) but not on (−∞, ∞).

Theorem 4.1. Let f : (a, b) → ℝ be a convex function. Then

    ∑_{i=1}^n p_i f(x_i) ≥ f(∑_{i=1}^n p_i x_i)

for all x_1, . . . , x_n ∈ (a, b) and p_1, . . . , p_n ∈ (0, 1) with ∑ p_i = 1. Furthermore, if f is strictly convex, then equality holds if and only if all the x_i are equal. In particular, for a random variable X taking values in (a, b),

    E[f(X)] ≥ f(E[X]).
Proof. By induction on n. For n = 1 there is nothing to prove, and n = 2 is the definition of convexity. Assume the result holds up to n − 1, and consider x_1, . . . , x_n ∈ (a, b) and p_1, . . . , p_n ∈ (0, 1) with ∑ p_i = 1. For i = 2, . . . , n, set

    p′_i = p_i / (1 − p_1),   so that ∑_{i=2}^n p′_i = 1.

Then, by the inductive hypothesis twice, first for n − 1 and then for 2,

    ∑_{i=1}^n p_i f(x_i) = p_1 f(x_1) + (1 − p_1) ∑_{i=2}^n p′_i f(x_i)
                        ≥ p_1 f(x_1) + (1 − p_1) f(∑_{i=2}^n p′_i x_i)
                        ≥ f(p_1 x_1 + (1 − p_1) ∑_{i=2}^n p′_i x_i)
                        = f(∑_{i=1}^n p_i x_i).

If f is strictly convex, n ≥ 3 and not all the x_i are equal, then we may assume not all of x_2, . . . , x_n are equal. But then

    (1 − p_1) ∑_{i=2}^n p′_i f(x_i) > (1 − p_1) f(∑_{i=2}^n p′_i x_i),

so the inequality is strict.
Corollary (AM/GM Inequality). For positive real numbers x_1, . . . , x_n,

    (∏_{i=1}^n x_i)^{1/n} ≤ (1/n) ∑_{i=1}^n x_i.

Equality holds if and only if x_1 = x_2 = · · · = x_n.

Proof. Let X take the values x_1, . . . , x_n with

    P(X = x_i) = 1/n,

and let f(x) = −log x, a strictly convex function on (0, ∞). By Jensen's inequality,

    E[f(X)] ≥ f(E[X]),   i.e.   −E[log X] ≥ −log E[X].   [1]

Therefore

    (1/n) ∑_{i=1}^n log x_i ≤ log( (1/n) ∑_{i=1}^n x_i ),

and exponentiating,

    (∏_{i=1}^n x_i)^{1/n} ≤ (1/n) ∑_{i=1}^n x_i.   [2]

For strictness: since f is strictly convex, equality holds in [1], and hence in [2], if and only if x_1 = x_2 = · · · = x_n.
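A quick random check of the inequality and its equality case (a sketch, not in the original notes):

```python
import math
import random

# AM/GM: the geometric mean of positive reals never exceeds the arithmetic mean,
# with equality exactly when all the numbers coincide.
rng = random.Random(7)
for _ in range(1000):
    xs = [rng.uniform(0.1, 10.0) for _ in range(5)]
    gm = math.prod(xs) ** (1 / len(xs))
    am = sum(xs) / len(xs)
    assert gm <= am + 1e-12
assert abs(math.prod([4.0] * 5) ** (1 / 5) - 4.0) < 1e-9   # equality case
```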
If f : (a, b) → ℝ is a convex function, then it can be shown that at each point y ∈ (a, b) there exists a linear function α_y + β_y x such that

    f(x) ≥ α_y + β_y x for all x ∈ (a, b),   and   f(y) = α_y + β_y y.

If f is differentiable at y, then this linear function is the tangent f(y) + (x − y)f′(y).

Let y = E[X], α = α_y and β = β_y. Then f(E[X]) = α + βE[X], so for any random variable X taking values in (a, b),

    E[f(X)] ≥ E[α + βX] = α + βE[X] = f(E[X]).
4.2 Cauchy-Schwarz Inequality
Theorem 4.2. For any random variables X, Y,

    E[XY]² ≤ E[X²]E[Y²].

Proof. For a, b ∈ ℝ, let Z = aX − bY. Then

    0 ≤ E[Z²] = E[(aX − bY)²] = a²E[X²] − 2abE[XY] + b²E[Y²].

This is a quadratic in a with at most one real root, so its discriminant must be ≤ 0. Taking b ≠ 0,

    E[XY]² ≤ E[X²]E[Y²].
Corollary.
|Corr(X, Y)| ≤ 1.

Proof. Apply the Cauchy-Schwarz inequality to the random variables X − E[X] and Y − E[Y].
4.3 Markovs Inequality
Theorem 4.3. If X is any random variable with finite mean, then for any a > 0,

    P(|X| ≥ a) ≤ E[|X|]/a.

Proof. Let A = {|X| ≥ a}. Then |X| ≥ aI[A]. Taking expectations,

    E[|X|] ≥ aE[I[A]] = aP(A) = aP(|X| ≥ a).
4.4 Chebyshevs Inequality
Theorem 4.4. Let X be a random variable with E[X²] < ∞. Then, for all ε > 0,

    P(|X| ≥ ε) ≤ E[X²]/ε².

Proof. Note that

    I[|X| ≥ ε] ≤ X²/ε²,

since the right-hand side is non-negative, and is at least 1 whenever |X| ≥ ε. Taking expectations,

    P(|X| ≥ ε) = E[I[|X| ≥ ε]] ≤ E[X²]/ε².
Note:

1. The result is distribution-free: no assumption is made about the distribution of X (other than E[X²] < ∞).

2. It is the best possible inequality, in the following sense. Let

    X = +ε with probability c/(2ε²),
    X = −ε with probability c/(2ε²),
    X = 0 with probability 1 − c/ε².

Then E[X²] = c and

    P(|X| ≥ ε) = c/ε² = E[X²]/ε²,

so the bound is attained.

3. If μ = E[X], then applying the inequality to X − μ gives

    P(|X − μ| ≥ ε) ≤ Var X/ε².

This is often the most useful form.
4.5 Law of Large Numbers
Theorem 4.5 (Weak law of large numbers). Let X_1, X_2, . . . be a sequence of independent identically distributed random variables with mean μ and variance σ², and let

    S_n = ∑_{i=1}^n X_i.

Then, for all ε > 0,

    P(|S_n/n − μ| ≥ ε) → 0 as n → ∞.
Proof. By Chebyshev's inequality,

    P(|S_n/n − μ| ≥ ε) ≤ E[(S_n/n − μ)²]/ε²
                      = E[(S_n − nμ)²]/(n²ε²)   (properties of expectation)
                      = Var S_n/(n²ε²),   since E[S_n] = nμ.

But Var S_n = nσ², and thus

    P(|S_n/n − μ| ≥ ε) ≤ nσ²/(n²ε²) = σ²/(nε²) → 0.
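A simulation sketch of the theorem (coin flips with p = 1/2; all parameters here are illustrative choices of mine):

```python
import random

# Fraction of runs whose sample mean deviates from p by at least eps, for growing n:
# the weak law says this fraction tends to 0.
rng = random.Random(0)
p, eps, runs = 0.5, 0.05, 200

def deviation_fraction(n):
    bad = 0
    for _ in range(runs):
        mean = sum(rng.random() < p for _ in range(n)) / n
        bad += abs(mean - p) >= eps
    return bad / runs

for n in (100, 1000, 10000):
    print(n, deviation_fraction(n))
```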
Example. Let A_1, A_2, . . . be independent events, each with probability p, and let X_i = I[A_i]. Then μ = E[I[A_i]] = P(A_i) = p, and

    S_n/n = (number of times A occurs) / (number of trials).

The theorem states that

    P(|S_n/n − p| ≥ ε) → 0 as n → ∞,

which recovers the intuitive definition of probability.
Example. A random sample of size n is a sequence X_1, X_2, . . . , X_n of independent identically distributed random variables (n observations). The quantity

    X̄ = (∑_{i=1}^n X_i)/n

is called the SAMPLE MEAN. The theorem states that, provided the variance of X_i is finite, the probability that the sample mean differs from the mean of the distribution by more than ε approaches 0 as n → ∞.
We have shown the weak law of large numbers. Why "weak"? There is also a strong law of large numbers:

    P(S_n/n → μ as n → ∞) = 1.

This is NOT the same as the weak form. What does it mean? Each ω ∈ Ω determines S_n/n, n = 1, 2, . . . , as a sequence of real numbers; hence this sequence either tends to μ or it doesn't. The strong law says

    P({ω : S_n(ω)/n → μ as n → ∞}) = 1.
Chapter 5
Generating Functions
In this chapter, assume that X is a random variable taking values in the range 0, 1, 2, . . . , and let p_r = P(X = r), r = 0, 1, 2, . . . .

Definition 5.1. The probability generating function (p.g.f.) of the random variable X, or of the distribution (p_r : r = 0, 1, 2, . . .), is

    p(z) = E[z^X] = ∑_{r=0}^∞ z^r P(X = r) = ∑_{r=0}^∞ p_r z^r.

This p(z) is a polynomial or a power series. If a power series, then it is convergent for |z| ≤ 1 by comparison with a geometric series:

    |p(z)| ≤ ∑_r p_r |z|^r ≤ ∑_r p_r = 1.
Example. For a fair die, p_r = 1/6 for r = 1, . . . , 6, and

    p(z) = E[z^X] = (1/6)(z + z² + · · · + z⁶) = (z/6) · (1 − z⁶)/(1 − z).
Theorem 5.1. The distribution of X is uniquely determined by the p.g.f. p(z).

Proof. We know that we can differentiate p(z) term by term for |z| < 1:

    p′(z) = p_1 + 2p_2 z + · · · ,

so p′(0) = p_1 (and p(0) = p_0). Repeated differentiation gives

    p^{(i)}(z) = ∑_{r=i}^∞ [r!/(r − i)!] p_r z^{r−i},

and hence p^{(i)}(0) = i! p_i. Thus we can recover p_0, p_1, . . . from p(z).
Theorem 5.2 (Abel's Lemma).

    E[X] = lim_{z↑1} p′(z).

Proof.

    p′(z) = ∑_{r=1}^∞ r p_r z^{r−1} for |z| < 1.

For z ∈ (0, 1), p′(z) is a non-decreasing function of z and is bounded above by

    E[X] = ∑_{r=1}^∞ r p_r.

Choose ε > 0 and N large enough that ∑_{r=1}^N r p_r ≥ E[X] − ε. Then

    lim_{z↑1} ∑_{r=1}^∞ r p_r z^{r−1} ≥ lim_{z↑1} ∑_{r=1}^N r p_r z^{r−1} = ∑_{r=1}^N r p_r ≥ E[X] − ε.

This is true for every ε > 0, and so E[X] = lim_{z↑1} p′(z).

Usually p′(z) is continuous at z = 1, and then E[X] = p′(1). (Recall p(z) = (z/6)(1 − z⁶)/(1 − z) for the fair die.)
Theorem 5.3.

    E[X(X − 1)] = lim_{z↑1} p″(z).

Proof.

    p″(z) = ∑_{r=2}^∞ r(r − 1) p_r z^{r−2}.

The proof is now the same as for Abel's Lemma.
Theorem 5.4. Suppose that X_1, X_2, . . . , X_n are independent random variables with p.g.f.s p_1(z), p_2(z), . . . , p_n(z). Then the p.g.f. of X_1 + X_2 + · · · + X_n is p_1(z)p_2(z) · · · p_n(z).

Proof.

    E[z^{X_1+X_2+···+X_n}] = E[z^{X_1} z^{X_2} · · · z^{X_n}]
                           = E[z^{X_1}]E[z^{X_2}] · · · E[z^{X_n}]
                           = p_1(z)p_2(z) · · · p_n(z).
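For distributions stored as coefficient lists, multiplying p.g.f.s is polynomial multiplication, i.e. a convolution of the probability vectors (a sketch; `poly_mul` is my name):

```python
# Multiply two p.g.f.s given as coefficient lists: out[k] collects all products
# p_i * q_j with i + j = k, which is exactly the distribution of the sum.
def poly_mul(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

die = [0.0] + [1 / 6] * 6                     # fair die: p_1 = ... = p_6 = 1/6
two_dice = poly_mul(die, die)                 # distribution of the sum of two dice
assert abs(two_dice[7] - 6 / 36) < 1e-12      # P(sum = 7) = 6/36
assert abs(sum(two_dice) - 1) < 1e-12
```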
Example. Suppose X has the Poisson distribution

    P(X = r) = e^{−λ} λ^r / r!,   r = 0, 1, . . . .

Then

    E[z^X] = ∑_{r=0}^∞ z^r e^{−λ} λ^r / r! = e^{−λ} e^{λz} = e^{−λ(1−z)}.

Let us calculate the variance of X:

    p′(z) = λe^{−λ(1−z)},   p″(z) = λ²e^{−λ(1−z)}.

Then

    E[X] = lim_{z↑1} p′(z) = p′(1) = λ   (since p′(z) is continuous at z = 1),
    E[X(X − 1)] = p″(1) = λ²,
    Var X = E[X²] − E[X]² = E[X(X − 1)] + E[X] − E[X]² = λ² + λ − λ² = λ.
Example. Suppose that Y has a Poisson distribution with parameter μ. If X and Y are independent, then

    E[z^{X+Y}] = E[z^X]E[z^Y] = e^{−λ(1−z)} e^{−μ(1−z)} = e^{−(λ+μ)(1−z)}.

But this is the p.g.f. of a Poisson random variable with parameter λ + μ. By uniqueness (the first theorem on p.g.f.s), this must be the distribution of X + Y.
Example. X has a binomial distribution,

    P(X = r) = (n choose r) p^r (1 − p)^{n−r},   r = 0, 1, . . . , n,

so

    E[z^X] = ∑_{r=0}^n (n choose r) p^r (1 − p)^{n−r} z^r = (pz + 1 − p)^n.

This shows that X = Y_1 + Y_2 + · · · + Y_n, where Y_1, Y_2, . . . , Y_n are independent random variables each with

    P(Y_i = 1) = p,   P(Y_i = 0) = 1 − p.

Note: if the p.g.f. factorizes, look to see if the random variable can be written as a sum.
5.1 Combinatorial Applications
Tile a (2 × n) bathroom with (2 × 1) tiles. How many ways can this be done? Say f_n; then

    f_n = f_{n−1} + f_{n−2},   f_0 = f_1 = 1.

Let

    F(z) = ∑_{n=0}^∞ f_n z^n.

Multiplying the recurrence by z^n and summing over n ≥ 2,

    ∑_{n=2}^∞ f_n z^n = z ∑_{n=2}^∞ f_{n−1} z^{n−1} + z² ∑_{n=2}^∞ f_{n−2} z^{n−2},

that is,

    F(z) − f_0 − zf_1 = z(F(z) − f_0) + z²F(z),
    F(z)(1 − z − z²) = f_0(1 − z) + zf_1 = 1 − z + z = 1.

Since f_0 = f_1 = 1, F(z) = 1/(1 − z − z²). Let

    α_1 = (1 + √5)/2,   α_2 = (1 − √5)/2.

Then

    F(z) = 1/((1 − α_1 z)(1 − α_2 z))
         = (1/(α_1 − α_2)) [ α_1/(1 − α_1 z) − α_2/(1 − α_2 z) ]
         = (1/(α_1 − α_2)) [ α_1 ∑_{n=0}^∞ α_1^n z^n − α_2 ∑_{n=0}^∞ α_2^n z^n ].

The coefficient of z^n, that is f_n, is

    f_n = (α_1^{n+1} − α_2^{n+1}) / (α_1 − α_2).
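The closed form agrees with the recurrence (a quick check, not part of the original notes):

```python
from math import sqrt

# Check the closed form f_n = (a1**(n+1) - a2**(n+1)) / (a1 - a2) against the
# recurrence f_n = f_{n-1} + f_{n-2} with f_0 = f_1 = 1.
a1 = (1 + sqrt(5)) / 2
a2 = (1 - sqrt(5)) / 2

f = [1, 1]
for n in range(2, 20):
    f.append(f[n - 1] + f[n - 2])

for n in range(20):
    closed = (a1 ** (n + 1) - a2 ** (n + 1)) / (a1 - a2)
    assert round(closed) == f[n]
```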
5.2 Conditional Expectation
Let X and Y be random variables with joint distribution

    P(X = x, Y = y).

Then the distribution of X is

    P(X = x) = ∑_{y∈R_Y} P(X = x, Y = y).

This is often called the marginal distribution of X. The conditional distribution of X given Y = y is

    P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y).
Definition 5.2. The conditional expectation of X given Y = y is

    E[X|Y = y] = ∑_{x∈R_X} x P(X = x|Y = y).

The conditional expectation of X given Y is the random variable E[X|Y] defined by

    E[X|Y](ω) = E[X|Y = Y(ω)].

Thus E[X|Y] : Ω → ℝ.
Example. Let X_1, X_2, . . . , X_n be independent identically distributed random variables with P(X_1 = 1) = p and P(X_1 = 0) = 1 − p. Let

    Y = X_1 + X_2 + · · · + X_n.

Then

    P(X_1 = 1|Y = r) = P(X_1 = 1, Y = r) / P(Y = r)
                     = P(X_1 = 1, X_2 + · · · + X_n = r − 1) / P(Y = r)
                     = P(X_1 = 1) P(X_2 + · · · + X_n = r − 1) / P(Y = r)
                     = p (n − 1 choose r − 1) p^{r−1}(1 − p)^{n−r} / [ (n choose r) p^r (1 − p)^{n−r} ]
                     = (n − 1 choose r − 1) / (n choose r)
                     = r/n.

Then

    E[X_1|Y = r] = 0 · P(X_1 = 0|Y = r) + 1 · P(X_1 = 1|Y = r) = r/n,

so E[X_1|Y = Y(ω)] = (1/n) Y(ω), and therefore E[X_1|Y] = (1/n) Y. Note: this is a random variable, a function of Y.
5.3 Properties of Conditional Expectation
Theorem 5.5.
E[E[X|Y]] = E[X]
Proof.

    E[E[X|Y]] = ∑_{y∈R_Y} P(Y = y) E[X|Y = y]
             = ∑_y P(Y = y) ∑_{x∈R_X} x P(X = x|Y = y)
             = ∑_y ∑_x x P(X = x, Y = y)
             = E[X].
Theorem 5.6. If X and Y are independent, then

    E[X|Y] = E[X].

Proof. If X and Y are independent, then for any y ∈ R_Y,

    E[X|Y = y] = ∑_{x∈R_X} x P(X = x|Y = y) = ∑_x x P(X = x) = E[X].
Example. Let X_1, X_2, . . . be i.i.d. random variables with p.g.f. p(z), and let N be a random variable, independent of X_1, X_2, . . . , with p.g.f. h(z). What is the p.g.f. of X_1 + X_2 + · · · + X_N?

    E[z^{X_1+···+X_N}] = E[E[z^{X_1+···+X_N} | N]]
                      = ∑_{n=0}^∞ P(N = n) E[z^{X_1+···+X_n} | N = n]
                      = ∑_{n=0}^∞ P(N = n) (p(z))^n
                      = h(p(z)).

Then, for example,

    E[X_1 + · · · + X_N] = (d/dz) h(p(z)) |_{z=1} = h′(1)p′(1) = E[N]E[X_1].

Exercise: calculate (d²/dz²) h(p(z)) and hence find Var(X_1 + · · · + X_N) in terms of Var N and Var X_1.
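A Monte Carlo check of E[X_1 + · · · + X_N] = E[N]E[X_1] (the particular choices of N and X_i below are illustrative, not from the notes):

```python
import random

# Random sum with N independent of the X_i: N uniform on {0,...,4}, X_i a die roll.
rng = random.Random(5)
trials = 100_000
total = 0.0
for _ in range(trials):
    N = rng.randrange(5)                               # E[N] = 2
    total += sum(rng.randint(1, 6) for _ in range(N))  # E[X_1] = 3.5
assert abs(total / trials - 2 * 3.5) < 0.1
```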
5.4 Branching Processes
X_0, X_1, . . . is a sequence of random variables, where X_n is the number of individuals in the n-th generation of a population. Assume:

1. X_0 = 1;
2. each individual lives for unit time, then on death produces k offspring with probability f_k, where ∑_k f_k = 1;
3. all offspring behave independently.

Thus

    X_{n+1} = Y_1^n + Y_2^n + · · · + Y_{X_n}^n,

where the Y_i^n are i.i.d. random variables; Y_i^n is the number of offspring of individual i in generation n.

Assume also:

1. f_0 > 0;
2. f_0 + f_1 < 1.

Let F(z) be the probability generating function of the Y_i^n:

    F(z) = ∑_{k=0}^∞ f_k z^k = E[z^{X_1}] = E[z^{Y_i^n}].

Let F_n(z) = E[z^{X_n}]. Then F_1(z) = F(z), the probability generating function of the offspring distribution.

Theorem 5.7.

    F_{n+1}(z) = F_n(F(z)) = F(F(. . . F(z) . . .)),

so F_n(z) is the n-fold iterate of F.
Proof.
\begin{align*}
F_{n+1}(z) &= E\left[z^{X_{n+1}}\right] = E\left[E\left[z^{X_{n+1}} \mid X_n\right]\right] \\
&= \sum_{k=0}^\infty P(X_n = k)\, E\left[z^{X_{n+1}} \mid X_n = k\right] \\
&= \sum_{k=0}^\infty P(X_n = k)\, E\left[z^{Y_1^n + Y_2^n + \dots + Y_k^n}\right] \\
&= \sum_{k=0}^\infty P(X_n = k)\, E\left[z^{Y_1^n}\right] \cdots E\left[z^{Y_k^n}\right] \\
&= \sum_{k=0}^\infty P(X_n = k)\, (F(z))^k = F_n(F(z))
\end{align*}
Theorem 5.8 (Mean and variance of population size). Let
\[ m = \sum_{k=0}^\infty k f_k \qquad \text{and} \qquad \sigma^2 = \sum_{k=0}^\infty (k - m)^2 f_k \]
be the mean and variance of the offspring distribution. Then $E[X_n] = m^n$ and
\[ \operatorname{Var} X_n = \begin{cases} \dfrac{\sigma^2 m^{n-1}(m^n - 1)}{m - 1}, & m \neq 1 \\[1ex] n\sigma^2, & m = 1 \end{cases} \tag{5.1} \]
Proof. One can prove this by calculating $F_n'(z)$ and $F_n''(z)$. Alternatively,
\[ E[X_n] = E[E[X_n \mid X_{n-1}]] = E[m X_{n-1}] = m E[X_{n-1}] = m^n \quad \text{by induction.} \]
Also
\begin{align*}
E\left[(X_n - m X_{n-1})^2\right] &= E\left[E\left[(X_n - m X_{n-1})^2 \mid X_{n-1}\right]\right] \\
&= E[\operatorname{Var}(X_n \mid X_{n-1})] = E[\sigma^2 X_{n-1}] = \sigma^2 m^{n-1}
\end{align*}
Thus
\[ E[X_n^2] - 2m E[X_n X_{n-1}] + m^2 E[X_{n-1}^2] = \sigma^2 m^{n-1} \]
Now calculate
\[ E[X_n X_{n-1}] = E[E[X_n X_{n-1} \mid X_{n-1}]] = E[X_{n-1} E[X_n \mid X_{n-1}]] = E[X_{n-1} \cdot m X_{n-1}] = m E[X_{n-1}^2] \]
Then
\[ E[X_n^2] = \sigma^2 m^{n-1} + m^2 E[X_{n-1}^2] \]
and so
\begin{align*}
\operatorname{Var} X_n &= E[X_n^2] - E[X_n]^2 \\
&= m^2 E[X_{n-1}^2] + \sigma^2 m^{n-1} - m^2 E[X_{n-1}]^2 \\
&= m^2 \operatorname{Var} X_{n-1} + \sigma^2 m^{n-1} \\
&= m^4 \operatorname{Var} X_{n-2} + \sigma^2 (m^{n-1} + m^n) \\
&\;\;\vdots \\
&= m^{2(n-1)} \operatorname{Var} X_1 + \sigma^2 (m^{n-1} + m^n + \dots + m^{2n-3}) \\
&= \sigma^2 m^{n-1} (1 + m + \dots + m^{n-1})
\end{align*}
which gives (5.1) on summing the geometric series.
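The formula $E[X_n] = m^n$ can be checked by direct simulation. A minimal sketch (the offspring distribution $f_0 = 0.25$, $f_1 = 0.25$, $f_2 = 0.5$, giving $m = 1.25$, is an assumed example, not from the notes):

```python
import random

def simulate_branching(f, n_gens, trials=100_000, seed=2):
    """Estimate E[X_n] for a branching process with offspring distribution
    f = [f_0, f_1, ...], starting from X_0 = 1."""
    rng = random.Random(seed)
    cum, c = [], 0.0
    for fk in f:
        c += fk
        cum.append(c)

    def offspring():
        # sample k with probability f_k by inverse transform
        u = rng.random()
        for k, ck in enumerate(cum):
            if u < ck:
                return k
        return len(cum) - 1

    total = 0
    for _ in range(trials):
        x = 1
        for _ in range(n_gens):
            x = sum(offspring() for _ in range(x))
        total += x
    return total / trials

m = 0.25 * 0 + 0.25 * 1 + 0.5 * 2          # mean offspring number, m = 1.25
est = simulate_branching([0.25, 0.25, 0.5], 3)
# Theorem 5.8 predicts E[X_3] = m**3 = 1.953125
```
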
To deal with extinction we need to be careful with limits as $n \to \infty$. Let
\[ A_n = \{X_n = 0\} = \{\text{extinction occurs by generation } n\} \]
and let
\[ A = \bigcup_{n=1}^\infty A_n = \{\text{the event that extinction ever occurs}\} \]
Can we calculate $P(A)$ from $P(A_n)$? More generally, let $A_n$ be an increasing sequence
\[ A_1 \subseteq A_2 \subseteq \dots \]
and define
\[ A = \lim_{n\to\infty} A_n = \bigcup_{n=1}^\infty A_n \]
Define, for $n \geq 1$,
\[ B_1 = A_1, \qquad B_n = A_n \setminus \left( \bigcup_{i=1}^{n-1} A_i \right) = A_n \cap A_{n-1}^c \]
The $B_n$, $n \geq 1$, are disjoint events, and
\[ \bigcup_{i=1}^\infty A_i = \bigcup_{i=1}^\infty B_i, \qquad \bigcup_{i=1}^n A_i = \bigcup_{i=1}^n B_i \]
Therefore
\begin{align*}
P\left( \bigcup_{i=1}^\infty A_i \right) &= P\left( \bigcup_{i=1}^\infty B_i \right) = \sum_{i=1}^\infty P(B_i) \\
&= \lim_{n\to\infty} \sum_{i=1}^n P(B_i) = \lim_{n\to\infty} P\left( \bigcup_{i=1}^n B_i \right) \\
&= \lim_{n\to\infty} P\left( \bigcup_{i=1}^n A_i \right) = \lim_{n\to\infty} P(A_n)
\end{align*}
Thus
\[ P\left( \lim_{n\to\infty} A_n \right) = \lim_{n\to\infty} P(A_n) \]
that is, probability is a continuous set function. Thus
\[ P(\text{extinction ever occurs}) = \lim_{n\to\infty} P(A_n) = \lim_{n\to\infty} P(X_n = 0) = q, \text{ say.} \]
Note that $P(X_n = 0)$, $n = 1, 2, 3, \dots$ is an increasing sequence, so the limit exists. But
\[ P(X_n = 0) = F_n(0), \qquad F_n \text{ being the p.g.f. of } X_n \]
So
\[ q = \lim_{n\to\infty} F_n(0) \]
Also
\[ F(q) = F\left( \lim_{n\to\infty} F_n(0) \right) = \lim_{n\to\infty} F(F_n(0)) \qquad \text{since } F \text{ is continuous} \]
\[ = \lim_{n\to\infty} F_{n+1}(0) = q \]
Thus $F(q) = q$; $q$ is called the extinction probability.
Alternative Derivation. Conditioning on the first generation,
\[ q = \sum_k P(X_1 = k)\, P(\text{extinction} \mid X_1 = k) = \sum_k P(X_1 = k)\, q^k = F(q) \]

Theorem 5.9. The probability of extinction, $q$, is the smallest positive root of the equation $F(q) = q$. If $m$, the mean of the offspring distribution, satisfies $m \leq 1$ then $q = 1$, while if $m > 1$ then $q < 1$.
Proof. Note that
\[ F(1) = 1, \qquad m = \sum_{k=0}^\infty k f_k = \lim_{z \to 1} F'(z) \]
Also
\[ F''(z) = \sum_j j(j-1) f_j z^{j-2} > 0 \quad \text{in } (0, 1), \text{ since } f_0 + f_1 < 1 \]
so $F$ is strictly convex on $(0,1)$, and $F(0) = f_0 > 0$. Thus if $m \leq 1$ there does not exist a $q \in (0, 1)$ with $F(q) = q$, and so $q = 1$. If $m > 1$ then let $\alpha$ be the smallest positive root of $F(z) = z$; then $\alpha < 1$. Further, since $F$ is increasing,
\[ F(0) \leq F(\alpha) = \alpha \implies F_n(0) \leq \alpha \;\; \forall n \text{ (by induction)} \implies q = \lim_{n\to\infty} F_n(0) \leq \alpha \]
and since $q$ is a root of $F(z) = z$, $q = \alpha$.
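The characterisation $q = \lim_n F_n(0)$ gives a direct numerical method: iterate the p.g.f. from $0$. A sketch (the offspring distributions used below are assumed examples):

```python
def extinction_probability(f, tol=1e-12, max_iter=10_000):
    """Extinction probability q = lim F_n(0), computed by iterating the
    offspring p.g.f. F(z) = sum_k f_k z^k starting from z = 0."""
    def F(z):
        return sum(fk * z**k for k, fk in enumerate(f))
    q = 0.0
    for _ in range(max_iter):
        q_next = F(q)
        if abs(q_next - q) < tol:
            break
        q = q_next
    return q_next

# supercritical case: f = [0.25, 0.25, 0.5] gives m = 1.25 > 1;
# the smallest positive root of 0.25 + 0.25 q + 0.5 q^2 = q is q = 0.5
q = extinction_probability([0.25, 0.25, 0.5])
```

For a subcritical distribution such as $[0.5, 0.25, 0.25]$ ($m = 0.75 \leq 1$) the same iteration converges to $q = 1$, in agreement with Theorem 5.9.
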
5.5 Random Walks

Let $X_1, X_2, \dots$ be i.i.d.r.vs, and let
\[ S_n = S_0 + X_1 + X_2 + \dots + X_n, \qquad \text{where usually } S_0 = 0 \]
Then $S_n$ ($n = 0, 1, 2, \dots$) is a 1-dimensional random walk.
We shall assume
\[ X_n = \begin{cases} +1, & \text{with probability } p \\ -1, & \text{with probability } q = 1 - p \end{cases} \tag{5.2} \]
This is a simple random walk. If $p = q = \frac{1}{2}$ then the random walk is called symmetric.

Example (Gambler's Ruin). You have an initial fortune of $A$ and I have an initial fortune of $B$. We toss coins repeatedly; I win with probability $p$ and you win with probability $q$. What is the probability that I bankrupt you before you bankrupt me?
Set $a = A + B$ and $z = B$. Stop the random walk, started at $z$, when it hits $0$ or $a$.
Let $p_z$ be the probability that the random walk hits $a$ before it hits $0$, starting from $z$, and let $q_z$ be the probability that it hits $0$ before it hits $a$, starting from $z$. After the first step the gambler's fortune is either $z - 1$ or $z + 1$, with probability $q$ and $p$ respectively. From the law of total probability,
\[ p_z = q p_{z-1} + p p_{z+1}, \qquad 0 < z < a \]
Also $p_0 = 0$ and $p_a = 1$. We must solve $pt^2 - t + q = 0$:
\[ t = \frac{1 \pm \sqrt{1 - 4pq}}{2p} = \frac{1 \pm (1 - 2p)}{2p} = 1 \;\text{ or }\; \frac{q}{p} \]
The general solution for $p \neq q$ is
\[ p_z = A + B \left( \frac{q}{p} \right)^z \]
The boundary conditions give $A + B = 0$ and $A = 1/\left(1 - (q/p)^a\right)$, and so
\[ p_z = \frac{1 - \left( \frac{q}{p} \right)^z}{1 - \left( \frac{q}{p} \right)^a} \]
If $p = q$ the general solution is $A + Bz$, and
\[ p_z = \frac{z}{a} \]
To calculate $q_z$, observe that this is the same problem with $p$, $q$, $z$ replaced by $q$, $p$, $a - z$ respectively. Thus
\[ q_z = \frac{\left( \frac{q}{p} \right)^a - \left( \frac{q}{p} \right)^z}{\left( \frac{q}{p} \right)^a - 1} \qquad \text{if } p \neq q \]
or
\[ q_z = \frac{a - z}{a} \qquad \text{if } p = q \]
Thus $q_z + p_z = 1$ and so, as we expected, the game ends with probability one.
What happens as $a \to \infty$? Since
\[ \{\text{path hits } 0 \text{ ever}\} = \lim_{a \to \infty} \{\text{path hits } 0 \text{ before it hits } a\} \]
we get
\begin{align*}
P(\text{hits } 0 \text{ ever}) &= \lim_{a\to\infty} P(\text{hits } 0 \text{ before } a) = \lim_{a\to\infty} q_z \\
&= \begin{cases} \left( \frac{q}{p} \right)^z, & p > q \\ 1, & p \leq q \end{cases}
\end{align*}
Let $G$ be the ultimate gain or loss:
\[ G = \begin{cases} a - z, & \text{with probability } p_z \\ -z, & \text{with probability } q_z \end{cases} \tag{5.3} \]
\[ E[G] = \begin{cases} a p_z - z, & \text{if } p \neq q \\ 0, & \text{if } p = q \end{cases} \tag{5.4} \]
"A fair game remains fair": if the coin is fair then games based on it have expected reward $0$.
Duration of a Game. Let $D_z$ be the expected time until the random walk hits $0$ or $a$, starting from $z$. Is $D_z$ finite? Divide time into successive windows of length $a$: the number of windows before one consisting entirely of $+1$ steps (or entirely of $-1$ steps) is geometric with finite mean, and the walk must be absorbed by the end of such a window; hence $D_z$ is bounded by $a$ times this mean, and in particular finite. Condition on the first step:
\begin{align*}
E[\text{duration}] &= E[E[\text{duration} \mid \text{first step}]] \\
&= p\, (E[\text{duration} \mid \text{first step up}]) + q\, (E[\text{duration} \mid \text{first step down}]) \\
&= p(1 + D_{z+1}) + q(1 + D_{z-1})
\end{align*}
so that
\[ D_z = 1 + p D_{z+1} + q D_{z-1} \]
This equation holds for $0 < z < a$, with $D_0 = D_a = 0$. Try for a particular solution $D_z = Cz$:
\[ Cz = Cp(z + 1) + Cq(z - 1) + 1 \implies C = \frac{1}{q - p} \qquad \text{for } p \neq q \]
Consider the homogeneous relation
\[ p t^2 - t + q = 0, \qquad t_1 = 1, \quad t_2 = \frac{q}{p} \]
The general solution for $p \neq q$ is
\[ D_z = A + B \left( \frac{q}{p} \right)^z + \frac{z}{q - p} \]
Substituting $z = 0, a$ to determine $A$ and $B$ gives
\[ D_z = \frac{z}{q - p} - \frac{a}{q - p} \cdot \frac{1 - \left( \frac{q}{p} \right)^z}{1 - \left( \frac{q}{p} \right)^a}, \qquad p \neq q \]
If $p = q$ then a particular solution is $-z^2$, and the general solution is $D_z = -z^2 + A + Bz$. Substituting the boundary conditions gives
\[ D_z = z(a - z), \qquad p = q \]
Example. With initial capital $z$ and total capital $a$:

p     q     z    a     P(ruin)   E[gain]   E[duration]
0.5   0.5   90   100   0.1       0         900
0.45  0.55  9    10    0.21      -1.1      11
0.45  0.55  90   100   0.87      -77       766
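The table rows follow directly from the formulas for $q_z$, $E[G]$ and $D_z$; a minimal sketch evaluating them:

```python
def ruin_quantities(p, z, a):
    """P(ruin), expected gain, and expected duration for a simple random
    walk started at z and absorbed at 0 or a, using the formulas above."""
    q = 1.0 - p
    if p == q:
        pz = z / a                  # P(hit a before 0)
        qz = (a - z) / a            # P(hit 0 before a), i.e. ruin
        dz = z * (a - z)            # expected duration
    else:
        r = q / p
        pz = (1 - r**z) / (1 - r**a)
        qz = (r**a - r**z) / (r**a - 1)
        dz = z / (q - p) - (a / (q - p)) * (1 - r**z) / (1 - r**a)
    gain = a * pz - z
    return qz, gain, dz

# reproduce the middle row of the table: p=0.45, z=9, a=10
ruin, gain, dur = ruin_quantities(0.45, 9, 10)   # ≈ (0.21, -1.1, 11)
```
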
Now stop the random walk when it hits $0$ or $a$; we have absorption at $0$ and $a$. Let
\[ U_{z,n} = P(\text{r.w. hits } 0 \text{ at time } n \mid \text{starts at } z) \]
Then
\[ U_{z,n+1} = p U_{z+1,n} + q U_{z-1,n}, \qquad 0 < z < a, \; n \geq 0 \]
with boundary conditions
\[ U_{0,n} = U_{a,n} = 0 \;\; (n > 0), \qquad U_{0,0} = 1, \qquad U_{z,0} = 0 \;\; (0 < z \leq a) \]
Let
\[ U_z(s) = \sum_{n=0}^\infty U_{z,n} s^n \]
Multiply the recurrence by $s^{n+1}$ and add for $n = 0, 1, 2, \dots$:
\[ U_z(s) = ps U_{z+1}(s) + qs U_{z-1}(s), \qquad \text{where } U_0(s) = 1 \text{ and } U_a(s) = 0 \]
Look for a solution of the form $U_z(s) = (\lambda(s))^z$. Then $\lambda(s)$ must satisfy
\[ \lambda(s) = ps (\lambda(s))^2 + qs \]
which has two roots,
\[ \lambda_1(s), \lambda_2(s) = \frac{1 \pm \sqrt{1 - 4pqs^2}}{2ps} \]
Every solution is of the form
\[ U_z(s) = A(s) (\lambda_1(s))^z + B(s) (\lambda_2(s))^z \]
Substituting $U_0(s) = 1$ and $U_a(s) = 0$ gives $A(s) + B(s) = 1$ and
\[ A(s) (\lambda_1(s))^a + B(s) (\lambda_2(s))^a = 0 \]
so that
\[ U_z(s) = \frac{(\lambda_1(s))^a (\lambda_2(s))^z - (\lambda_1(s))^z (\lambda_2(s))^a}{(\lambda_1(s))^a - (\lambda_2(s))^a} \]
But $\lambda_1(s)\lambda_2(s) = \frac{q}{p}$ (recall the quadratic), so
\[ U_z(s) = \left( \frac{q}{p} \right)^z \frac{(\lambda_1(s))^{a-z} - (\lambda_2(s))^{a-z}}{(\lambda_1(s))^a - (\lambda_2(s))^a} \]
The same method gives the generating function for absorption probabilities at the other barrier. The generating function for the duration of the game is the sum of these two generating functions.
Chapter 6

Continuous Random Variables

In this chapter we drop the assumption that $\Omega$ is finite or countable. Assume instead that we are given a probability $P$ on some subset of $\Omega$.
For example, spin a pointer, and let $\omega \in \Omega$ give the position at which it stops, with $\Omega = \{\omega : 0 \leq \omega < 2\pi\}$. Let
\[ P(\omega \in [0, \theta]) = \frac{\theta}{2\pi} \qquad (0 \leq \theta < 2\pi) \]
Definition 6.1. A continuous random variable $X$ is a function $X : \Omega \to \mathbb{R}$ for which
\[ P(a \leq X(\omega) \leq b) = \int_a^b f(x)\,dx \]
where $f(x)$ is a function satisfying
1. $f(x) \geq 0$
2. $\int_{-\infty}^{+\infty} f(x)\,dx = 1$
The function $f$ is called the probability density function.
For example, if $X(\omega) = \omega$, the position of the pointer, then $X$ is a continuous random variable with p.d.f.
\[ f(x) = \begin{cases} \frac{1}{2\pi}, & 0 \leq x < 2\pi \\ 0, & \text{otherwise} \end{cases} \tag{6.1} \]
This is an example of a uniformly distributed random variable, on the interval $[0, 2\pi]$
in this case. Intuition about probability density functions is based on the approximate relation
\[ P(X \in [x, x + \delta x]) = \int_x^{x+\delta x} f(z)\,dz \approx f(x)\,\delta x \]
Proofs, however, more often use the distribution function
\[ F(x) = P(X \leq x) \]
which is increasing in $x$. If $X$ is a continuous random variable then
\[ F(x) = \int_{-\infty}^x f(z)\,dz \]
and so $F$ is continuous and differentiable, with
\[ F'(x) = f(x) \]
(at any point $x$ where the fundamental theorem of calculus applies).
The distribution function is also defined for a discrete random variable,
\[ F(x) = \sum_{\omega : X(\omega) \leq x} p_\omega \]
in which case $F$ is a step function. In either case
\[ P(a < X \leq b) = P(X \leq b) - P(X \leq a) = F(b) - F(a) \]
Example (the exponential distribution). Let
\[ F(x) = \begin{cases} 1 - e^{-\lambda x}, & 0 \leq x < \infty \\ 0, & x < 0 \end{cases} \tag{6.2} \]
The corresponding pdf is
\[ f(x) = \lambda e^{-\lambda x}, \qquad 0 \leq x < \infty \]
This is known as the exponential distribution with parameter $\lambda$. If $X$ has this distribution then
\[ P(X > x + z \mid X > z) = \frac{P(X > x + z)}{P(X > z)} = \frac{e^{-\lambda(x+z)}}{e^{-\lambda z}} = e^{-\lambda x} = P(X > x) \]
This is known as the memoryless property of the exponential distribution.
Theorem 6.1. If $X$ is a continuous random variable with pdf $f(x)$, and $h(x)$ is a continuous, strictly increasing function with $h^{-1}(x)$ differentiable, then $h(X)$ is a continuous random variable with pdf
\[ f_h(x) = f\left(h^{-1}(x)\right) \frac{d}{dx} h^{-1}(x) \]
Proof. The distribution function of $h(X)$ is
\[ P(h(X) \leq x) = P\left(X \leq h^{-1}(x)\right) = F\left(h^{-1}(x)\right) \]
since $h$ is strictly increasing, where $F$ is the distribution function of $X$. Then
\[ \frac{d}{dx} P(h(X) \leq x) = f\left(h^{-1}(x)\right) \frac{d}{dx} h^{-1}(x) \]
so $h(X)$ is a continuous random variable with pdf $f_h$ as claimed. (It is usually easier to repeat this proof than to remember the result.)
Example. Suppose $X \sim U[0, 1]$, that is, uniformly distributed on $[0, 1]$, and consider $Y = -\log X$. Then
\begin{align*}
P(Y \leq y) &= P(-\log X \leq y) = P\left(X \geq e^{-y}\right) \\
&= \int_{e^{-y}}^1 1\,dx = 1 - e^{-y}
\end{align*}
Thus $Y$ is exponentially distributed (with parameter 1).
More generally:

Theorem 6.2. Let $U \sim U[0, 1]$. For any continuous distribution function $F$, the random variable $X$ defined by $X = F^{-1}(U)$ has distribution function $F$.
Proof.
\[ P(X \leq x) = P\left(F^{-1}(U) \leq x\right) = P(U \leq F(x)) = F(x), \qquad \text{since } U \sim U[0, 1] \]
Remarks.
1. It is a bit more messy for discrete random variables. If
\[ P(X = x_i) = p_i, \qquad i = 0, 1, \dots \]
let
\[ X = x_j \quad \text{if} \quad \sum_{i=0}^{j-1} p_i \leq U < \sum_{i=0}^{j} p_i, \qquad U \sim U[0, 1] \]
2. This is useful for simulations.
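Both remarks can be illustrated directly. A minimal sketch of inverse-transform sampling, for the exponential distribution ($F^{-1}(u) = -\log(1-u)/\lambda$) and for a discrete distribution:

```python
import math
import random

def sample_exponential(lam, rng):
    """Exponential(lam) via Theorem 6.2: X = F^{-1}(U) with
    F(x) = 1 - exp(-lam*x), so F^{-1}(u) = -log(1 - u) / lam."""
    return -math.log(1.0 - rng.random()) / lam

def sample_discrete(xs, ps, rng):
    """Discrete analogue: X = x_j when sum_{i<j} p_i <= U < sum_{i<=j} p_i."""
    u, c = rng.random(), 0.0
    for x, p in zip(xs, ps):
        c += p
        if u < c:
            return x
    return xs[-1]

rng = random.Random(3)
mean_exp = sum(sample_exponential(2.0, rng) for _ in range(100_000)) / 100_000
# E[X] = 1/lam = 0.5 for lam = 2
```
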
6.1 Jointly Distributed Random Variables

For two random variables $X$ and $Y$ the joint distribution function is
\[ F(x, y) = P(X \leq x, Y \leq y), \qquad F : \mathbb{R}^2 \to [0, 1] \]
Let
\[ F_X(x) = P(X \leq x) = P(X \leq x, Y < \infty) = F(x, \infty) = \lim_{y\to\infty} F(x, y) \]
This is called the marginal distribution of $X$. Similarly
\[ F_Y(y) = F(\infty, y) \]
$X_1, X_2, \dots, X_n$ are jointly distributed continuous random variables if, for a set $C \subseteq \mathbb{R}^n$,
\[ P((X_1, X_2, \dots, X_n) \in C) = \int \dots \int_{(x_1, \dots, x_n) \in C} f(x_1, \dots, x_n)\,dx_1 \dots dx_n \]
for some function $f$, called the joint probability density function, satisfying the obvious conditions:
1. $f(x_1, \dots, x_n) \geq 0$
2. $\int \dots \int_{\mathbb{R}^n} f(x_1, \dots, x_n)\,dx_1 \dots dx_n = 1$

Example. ($n = 2$)
\[ F(x, y) = P(X \leq x, Y \leq y) = \int_{-\infty}^x \int_{-\infty}^y f(u, v)\,dv\,du \]
and so
\[ f(x, y) = \frac{\partial^2 F(x, y)}{\partial x \partial y} \]
provided this is defined at $(x, y)$.

Theorem 6.3. If $X$ and $Y$ are jointly continuous random variables then they are individually continuous.
Proof. Since $X$ and $Y$ are jointly continuous random variables,
\begin{align*}
P(X \in A) &= P(X \in A, Y \in (-\infty, +\infty)) \\
&= \int_A \int_{-\infty}^\infty f(x, y)\,dy\,dx = \int_A f_X(x)\,dx
\end{align*}
where
\[ f_X(x) = \int_{-\infty}^\infty f(x, y)\,dy \]
is the pdf of $X$.

Jointly continuous random variables $X$ and $Y$ are independent if
\[ f(x, y) = f_X(x) f_Y(y) \]
Then
\[ P(X \in A, Y \in B) = P(X \in A)\,P(Y \in B) \]
Similarly, jointly continuous random variables $X_1, \dots, X_n$ are independent if
\[ f(x_1, \dots, x_n) = \prod_{i=1}^n f_{X_i}(x_i) \]
where the $f_{X_i}(x_i)$ are the pdfs of the individual random variables.
Example. Two points $X$ and $Y$ are tossed at random and independently onto a line segment of length $L$. What is the probability that $|X - Y| \leq l$?
Suppose that "at random" means uniformly, so that
\[ f(x, y) = \frac{1}{L^2}, \qquad (x, y) \in [0, L]^2 \]
The desired probability is
\[ \int_A f(x, y)\,dx\,dy = \frac{\text{area of } A}{L^2} = \frac{L^2 - 2 \cdot \frac{1}{2}(L - l)^2}{L^2} = \frac{2Ll - l^2}{L^2} \]
where $A = \{(x, y) : |x - y| \leq l\}$.
Example (Buffon's Needle Problem). A needle of length $l$ is tossed at random onto a floor marked with parallel lines a distance $L$ apart, where $l \leq L$. What is the probability that the needle intersects one of the parallel lines?
Let $\Theta \in [0, \pi)$ be the angle between the needle and the parallel lines, and let $X$ be the distance from the bottom of the needle to the line closest to it. It is reasonable to suppose that
\[ X \sim U[0, L], \qquad \Theta \sim U[0, \pi) \]
and that $X$ and $\Theta$ are independent. Thus
\[ f(x, \theta) = \frac{1}{\pi L}, \qquad 0 \leq x \leq L \text{ and } 0 \leq \theta < \pi \]
The needle intersects a line if and only if $X \leq l \sin\Theta$; call this event $A$. Then
\[ P(A) = \int_A f(x, \theta)\,dx\,d\theta = \int_0^\pi \frac{l \sin\theta}{\pi L}\,d\theta = \frac{2l}{\pi L} \]
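The answer $2l/(\pi L)$ can be checked by simulating the pair $(X, \Theta)$ exactly as in the model above:

```python
import math
import random

def buffon_estimate(l, L, trials=200_000, seed=4):
    """Monte Carlo estimate of the probability that a needle of length l
    crosses a line, lines a distance L apart (l <= L); theory: 2l/(pi*L)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, L)               # distance to the nearest line
        theta = rng.uniform(0.0, math.pi)     # angle with the lines
        if x <= l * math.sin(theta):          # crossing condition X <= l sin(theta)
            hits += 1
    return hits / trials

p_hat = buffon_estimate(1.0, 2.0)
# theory: 2*1/(pi*2) = 1/pi ≈ 0.3183
```
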
Definition 6.2. The expectation or mean of a continuous random variable $X$ is
\[ E[X] = \int_{-\infty}^\infty x f(x)\,dx \]
provided that not both of $\int_{-\infty}^0 x f(x)\,dx$ and $\int_0^\infty x f(x)\,dx$ are infinite.
Example (Normal Distribution). Let
\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty \]
This is non-negative; for it to be a pdf we also need to check that
\[ \int_{-\infty}^\infty f(x)\,dx = 1 \]
Make the substitution $z = \frac{x - \mu}{\sigma}$. Then
\[ I = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^\infty e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-\frac{z^2}{2}}\,dz \]
Thus
\begin{align*}
I^2 &= \frac{1}{2\pi} \int_{-\infty}^\infty e^{-\frac{x^2}{2}}\,dx \int_{-\infty}^\infty e^{-\frac{y^2}{2}}\,dy \\
&= \frac{1}{2\pi} \int_{-\infty}^\infty \int_{-\infty}^\infty e^{-\frac{x^2 + y^2}{2}}\,dx\,dy \\
&= \frac{1}{2\pi} \int_0^{2\pi} \int_0^\infty r e^{-\frac{r^2}{2}}\,dr\,d\theta = \frac{1}{2\pi} \int_0^{2\pi} d\theta = 1
\end{align*}
Therefore $I = 1$. A random variable with the pdf $f(x)$ given above has a normal distribution with parameters $\mu$ and $\sigma^2$; we write this as
\[ X \sim N[\mu, \sigma^2] \]
The expectation is
\begin{align*}
E[X] &= \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^\infty x e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx \\
&= \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^\infty (x - \mu) e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx + \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^\infty \mu e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx
\end{align*}
The first term is convergent and equals zero by symmetry, so that
\[ E[X] = 0 + \mu = \mu \]
Theorem 6.4. If $X$ is a continuous random variable then
\[ E[X] = \int_0^\infty P(X \geq x)\,dx - \int_0^\infty P(X \leq -x)\,dx \]
Proof.
\begin{align*}
\int_0^\infty P(X \geq x)\,dx &= \int_0^\infty \int_x^\infty f(y)\,dy\,dx \\
&= \int_0^\infty \int_0^\infty I[y \geq x]\, f(y)\,dy\,dx \\
&= \int_0^\infty \left( \int_0^y dx \right) f(y)\,dy = \int_0^\infty y f(y)\,dy
\end{align*}
Similarly
\[ \int_0^\infty P(X \leq -x)\,dx = -\int_{-\infty}^0 y f(y)\,dy \]
and the result follows.
Note: this holds for discrete random variables too, and is useful as a general way of finding the expectation whether the random variable is discrete or continuous.
If $X$ takes values in the set $\{0, 1, 2, \dots\}$ the theorem states that
\[ E[X] = \sum_{n=1}^\infty P(X \geq n) \]
and a direct proof follows:
\begin{align*}
\sum_{n=1}^\infty P(X \geq n) &= \sum_{n=1}^\infty \sum_{m=0}^\infty I[m \geq n]\, P(X = m) \\
&= \sum_{m=0}^\infty \sum_{n=1}^\infty I[m \geq n]\, P(X = m) \\
&= \sum_{m=0}^\infty m P(X = m) = E[X]
\end{align*}
Theorem 6.5. Let $X$ be a continuous random variable with pdf $f(x)$ and let $h(x)$ be a continuous real-valued function. Then, provided
\[ \int_{-\infty}^\infty |h(x)| f(x)\,dx < \infty, \]
\[ E[h(X)] = \int_{-\infty}^\infty h(x) f(x)\,dx \]
Proof.
\begin{align*}
\int_0^\infty P(h(X) \geq y)\,dy &= \int_0^\infty \left( \int_{x : h(x) \geq y} f(x)\,dx \right) dy \\
&= \int_0^\infty \int_{x : h(x) \geq 0} I[h(x) \geq y]\, f(x)\,dx\,dy \\
&= \int_{x : h(x) \geq 0} \left( \int_0^{h(x)} dy \right) f(x)\,dx \\
&= \int_{x : h(x) \geq 0} h(x) f(x)\,dx
\end{align*}
Similarly
\[ \int_0^\infty P(h(X) \leq -y)\,dy = -\int_{x : h(x) \leq 0} h(x) f(x)\,dx \]
So the result follows from the last theorem.
Definition 6.3. The variance of a continuous random variable $X$ is
\[ \operatorname{Var} X = E\left[(X - E[X])^2\right] \]
Note: the properties of expectation and variance are the same for discrete and continuous random variables; just replace $\sum$ with $\int$ in the proofs.
Example.
\[ \operatorname{Var} X = E[X^2] - E[X]^2 = \int_{-\infty}^\infty x^2 f(x)\,dx - \left( \int_{-\infty}^\infty x f(x)\,dx \right)^2 \]
Example. Suppose $X \sim N[\mu, \sigma^2]$ and let $Z = \frac{X - \mu}{\sigma}$. Then
\begin{align*}
P(Z \leq z) &= P\left( \frac{X - \mu}{\sigma} \leq z \right) = P(X \leq \mu + \sigma z) \\
&= \int_{-\infty}^{\mu + \sigma z} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx \\
&= \int_{-\infty}^z \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}}\,du \qquad \text{substituting } u = \frac{x - \mu}{\sigma} \\
&= \Phi(z), \text{ the distribution function of a } N(0, 1) \text{ random variable}
\end{align*}
Thus $Z \sim N(0, 1)$. What is the variance of $Z$?
\begin{align*}
\operatorname{Var} Z &= E[Z^2] - E[Z]^2 \qquad \text{(the last term is zero)} \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty z^2 e^{-\frac{z^2}{2}}\,dz \\
&= \left[ -\frac{1}{\sqrt{2\pi}}\, z e^{-\frac{z^2}{2}} \right]_{-\infty}^{\infty} + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-\frac{z^2}{2}}\,dz \\
&= 0 + 1 = 1
\end{align*}
And the variance of $X$? Since $X = \mu + \sigma Z$ we have $E[X] = \mu$ (which we knew already) and
\[ \operatorname{Var} X = \sigma^2 \operatorname{Var} Z = \sigma^2, \qquad X \sim N(\mu, \sigma^2) \]
6.2 Transformation of Random Variables

Suppose $X_1, X_2, \dots, X_n$ have joint pdf $f(x_1, \dots, x_n)$. Let
\begin{align*}
Y_1 &= r_1(X_1, X_2, \dots, X_n) \\
Y_2 &= r_2(X_1, X_2, \dots, X_n) \\
&\;\;\vdots \\
Y_n &= r_n(X_1, X_2, \dots, X_n)
\end{align*}
Let $R \subseteq \mathbb{R}^n$ be such that
\[ P((X_1, X_2, \dots, X_n) \in R) = 1 \]
Let $S$ be the image of $R$ under the above transformation, and suppose the transformation from $R$ to $S$ is 1-1 (bijective). Then there exist inverse functions
\[ x_1 = s_1(y_1, y_2, \dots, y_n), \quad x_2 = s_2(y_1, y_2, \dots, y_n), \quad \dots, \quad x_n = s_n(y_1, y_2, \dots, y_n) \]
Assume that $\frac{\partial s_i}{\partial y_j}$ exists and is continuous at every point $(y_1, y_2, \dots, y_n)$ in $S$, and let $J$ be the Jacobian
\[ J = \det \begin{pmatrix} \frac{\partial s_1}{\partial y_1} & \dots & \frac{\partial s_1}{\partial y_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial s_n}{\partial y_1} & \dots & \frac{\partial s_n}{\partial y_n} \end{pmatrix} \tag{6.3} \]
If $A \subseteq R$ then
\begin{align*}
P((X_1, \dots, X_n) \in A) &= \int_A f(x_1, \dots, x_n)\,dx_1 \dots dx_n \tag{1} \\
&= \int_B f(s_1, \dots, s_n)\, |J|\,dy_1 \dots dy_n \qquad \text{where } B \text{ is the image of } A \\
&= P((Y_1, \dots, Y_n) \in B) \tag{2}
\end{align*}
Since the transformation is 1-1, (1) and (2) agree. Thus the density of $Y_1, \dots, Y_n$ is
\[ g(y_1, y_2, \dots, y_n) = \begin{cases} f(s_1(y_1, \dots, y_n), \dots, s_n(y_1, \dots, y_n))\, |J|, & (y_1, y_2, \dots, y_n) \in S \\ 0, & \text{otherwise} \end{cases} \]
Example (density of products and quotients). Suppose that $(X, Y)$ has density
\[ f(x, y) = \begin{cases} 4xy, & 0 \leq x \leq 1, \; 0 \leq y \leq 1 \\ 0, & \text{otherwise} \end{cases} \tag{6.4} \]
Let $U = X/Y$ and $V = XY$. Then
\[ X = \sqrt{UV}, \qquad Y = \sqrt{V/U} \]
that is, $x = \sqrt{uv}$ and $y = \sqrt{v/u}$, so
\[ \frac{\partial x}{\partial u} = \frac{1}{2}\sqrt{\frac{v}{u}}, \quad \frac{\partial x}{\partial v} = \frac{1}{2}\sqrt{\frac{u}{v}}, \quad \frac{\partial y}{\partial u} = -\frac{1}{2} v^{\frac{1}{2}} u^{-\frac{3}{2}}, \quad \frac{\partial y}{\partial v} = \frac{1}{2\sqrt{uv}} \]
Therefore $|J| = \frac{1}{2u}$, and so
\[ g(u, v) = \frac{1}{2u}(4xy) = \frac{1}{2u} \cdot 4\sqrt{uv}\sqrt{\frac{v}{u}} = \begin{cases} \dfrac{2v}{u}, & (u, v) \in D \\ 0, & \text{otherwise} \end{cases} \]
where $D$ is the image of the unit square under the transformation.
Note that $U$ and $V$ are NOT independent:
\[ g(u, v) = \frac{2v}{u}\, I[(u, v) \in D] \]
is not the product of a function of $u$ alone and a function of $v$ alone, since the indicator of $D$ does not factorize.
When the transformations are linear things are simpler still. Let $A$ be an $n \times n$ invertible matrix, and
\[ \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} = A \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} \]
Then
\[ |J| = \left| \det A^{-1} \right| = \left| \det A \right|^{-1} \]
and the pdf of $(Y_1, \dots, Y_n)$ is
\[ g(y_1, \dots, y_n) = \frac{1}{|\det A|} f(A^{-1} y) \]
Example. Suppose $X_1, X_2$ have the joint pdf $f(x_1, x_2)$. Calculate the pdf of $X_1 + X_2$.
Let $Y = X_1 + X_2$ and $Z = X_2$. Then $X_1 = Y - Z$ and $X_2 = Z$, so
\[ A^{-1} = \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix} \tag{6.5} \]
and $\left| \det A^{-1} \right| = 1 = \frac{1}{|\det A|}$. Then
\[ g(y, z) = f(x_1, x_2) = f(y - z, z) \]
This is the joint density of $Y$ and $Z$. The marginal density of $Y$ is
\[ g(y) = \int_{-\infty}^\infty f(y - z, z)\,dz, \qquad -\infty < y < \infty \]
or, by a change of variable,
\[ g(y) = \int_{-\infty}^\infty f(z, y - z)\,dz \]
If $X_1$ and $X_2$ are independent, with pdfs $f_1$ and $f_2$, then $f(x_1, x_2) = f_1(x_1) f_2(x_2)$ and then
\[ g(y) = \int_{-\infty}^\infty f_1(y - z) f_2(z)\,dz \]
-- the convolution of $f_1$ and $f_2$.
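The convolution integral can be evaluated numerically. A sketch (the rectangle-rule evaluation and the choice of two $U[0,1]$ densities, whose sum has the triangular density $g(y) = y$ on $[0,1]$, $2 - y$ on $[1,2]$, are our illustrative assumptions):

```python
def convolve_pdfs(f1, f2, y, n=10_000, lo=0.0, hi=2.0):
    """Numerically evaluate g(y) = integral of f1(y - z) f2(z) dz by the
    midpoint rule, assuming both densities vanish outside [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * h
        total += f1(y - z) * f2(z)
    return total * h

uniform = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0   # U[0,1] density

# sum of two independent U[0,1] variables: triangular density
g_half = convolve_pdfs(uniform, uniform, 0.5)          # ≈ 0.5
g_three_halves = convolve_pdfs(uniform, uniform, 1.5)  # ≈ 0.5
```
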
For a pdf $f(x)$, $\hat{x}$ is a mode if $f(\hat{x}) \geq f(x)$ for all $x$, and $\hat{x}$ is a median if
\[ \int_{-\infty}^{\hat{x}} f(x)\,dx = \int_{\hat{x}}^\infty f(x)\,dx = \frac{1}{2} \]
For a discrete random variable, $\hat{x}$ is a median if
\[ P(X \leq \hat{x}) \geq \frac{1}{2} \quad \text{and} \quad P(X \geq \hat{x}) \geq \frac{1}{2} \]
If $X_1, \dots, X_n$ is a sample from the distribution then recall that the sample mean is
\[ \frac{1}{n} \sum_{i=1}^n X_i \]
Let $Y_1, \dots, Y_n$ (the order statistics) be the values of $X_1, \dots, X_n$ arranged in increasing order. Then the sample median is $Y_{\frac{n+1}{2}}$ if $n$ is odd, or any value in $\left[ Y_{\frac{n}{2}}, Y_{\frac{n}{2}+1} \right]$ if $n$ is even.
If $Y_n = \max\{X_1, \dots, X_n\}$ and $X_1, \dots, X_n$ are i.i.d.r.vs with distribution $F$ and density $f$, then
\[ P(Y_n \leq y) = P(X_1 \leq y, \dots, X_n \leq y) = (F(y))^n \]
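The identity $P(Y_n \leq y) = (F(y))^n$ is simple to verify by simulation; for $U[0,1]$ samples it reads $P(\max \leq y) = y^n$:

```python
import random

def max_cdf_estimate(n, y, trials=100_000, seed=5):
    """Estimate P(max(X_1,...,X_n) <= y) for i.i.d. U[0,1] samples;
    the result above gives (F(y))^n = y^n here."""
    rng = random.Random(seed)
    count = sum(1 for _ in range(trials)
                if all(rng.random() <= y for _ in range(n)))
    return count / trials

est = max_cdf_estimate(5, 0.9)
# theory: 0.9**5 = 0.59049
```
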
[Continuing from the order statistics of $n$ i.i.d. exponential random variables: if $Y_1 \leq \dots \leq Y_n$ are the order statistics, their joint density is $g(y_1, \dots, y_n) = n! f(y_1) \cdots f(y_n)$ on $y_1 \leq \dots \leq y_n$.] Let $Z = AY$, where
\[ A = \begin{pmatrix} 1 & 0 & 0 & \dots & 0 & 0 \\ -1 & 1 & 0 & \dots & 0 & 0 \\ 0 & -1 & 1 & \dots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \dots & -1 & 1 \end{pmatrix} \tag{6.7} \]
so that $Z_1 = Y_1$, $Z_i = Y_i - Y_{i-1}$ for $i \geq 2$, and $\det A = 1$. Then
\begin{align*}
h(z_1, \dots, z_n) &= g(y_1, \dots, y_n) = n! f(y_1) \cdots f(y_n) \\
&= n! \lambda^n e^{-\lambda y_1} \cdots e^{-\lambda y_n} = n! \lambda^n e^{-\lambda(y_1 + \dots + y_n)} \\
&= n! \lambda^n e^{-\lambda(n z_1 + (n-1) z_2 + \dots + z_n)} \\
&= \prod_{i=1}^n i\lambda\, e^{-i\lambda z_{n+1-i}}
\end{align*}
Thus $h(z_1, \dots, z_n)$ is expressed as the product of $n$ density functions, with
\[ Z_{n+1-i} \sim \exp(i\lambda) \]
exponentially distributed with parameter $i\lambda$, and $Z_1, \dots, Z_n$ independent.
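The conclusion $Z_{n+1-i} \sim \exp(i\lambda)$, i.e. $E[Z_k] = 1/((n+1-k)\lambda)$, is easy to check by simulation (a sketch, with the small sample size $n = 4$ chosen for illustration):

```python
import math
import random

def spacing_means(n, lam, trials=50_000, seed=6):
    """Average spacings Z_1 = Y_1, Z_i = Y_i - Y_{i-1} of the order statistics
    of n i.i.d. Exp(lam) variables; theory: Z_k ~ Exp((n + 1 - k) * lam),
    so E[Z_k] = 1 / ((n + 1 - k) * lam)."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(trials):
        ys = sorted(-math.log(1.0 - rng.random()) / lam for _ in range(n))
        prev = 0.0
        for k, y in enumerate(ys):
            sums[k] += y - prev      # accumulate the k-th spacing
            prev = y
    return [s / trials for s in sums]

means = spacing_means(4, 1.0)
# theory: E[Z_1] = 1/4, E[Z_2] = 1/3, E[Z_3] = 1/2, E[Z_4] = 1
```
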
Example. Let $X$ and $Y$ be independent $N(0, 1)$ random variables. Let
\[ D = R^2 = X^2 + Y^2, \qquad \tan\Theta = \frac{Y}{X} \]
so that
\[ d = x^2 + y^2, \qquad \theta = \arctan\frac{y}{x} \]
Then
\[ |J| = \left| \det \begin{pmatrix} 2x & 2y \\ \dfrac{-y/x^2}{1 + (y/x)^2} & \dfrac{1/x}{1 + (y/x)^2} \end{pmatrix} \right| = 2 \tag{6.8} \]
[This continues an example on Bertrand's paradox: a chord of a unit circle is "chosen at random", and we ask for the probability that it is longer than the side of an inscribed equilateral triangle; the answer depends on what "at random" means.]
(2) The chord is perpendicular to a given diameter, and the point of intersection is uniformly distributed over the diameter. A chord at distance $a$ from the centre has half-length $\sqrt{1 - a^2}$, and
\[ a^2 + \left( \frac{\sqrt{3}}{2} \right)^2 = 1 \implies a = \frac{1}{2} \]
so the chord is longer than the side of the triangle if and only if it lies within distance $\frac{1}{2}$ of the centre:
\[ \text{answer} = \frac{1}{2} \]
(3) The foot of the perpendicular to the chord from the centre of the circle is uniformly distributed over the interior of the circle. The chord is longer than the side of the triangle if and only if the foot falls in the interior circle of radius $\frac{1}{2}$:
\[ \text{answer} = \frac{\pi \left( \frac{1}{2} \right)^2}{\pi \cdot 1^2} = \frac{1}{4} \]
6.3 Moment Generating Functions

If $X$ is a continuous random variable then the analogue of the p.g.f. is the moment generating function, defined by
\[ m(\theta) = E\left[e^{\theta X}\right] \]
for those $\theta$ such that $m(\theta)$ is finite, i.e.
\[ m(\theta) = \int_{-\infty}^\infty e^{\theta x} f(x)\,dx \]
where $f(x)$ is the pdf of $X$.
Theorem 6.6. The moment generating function determines the distribution of $X$, provided $m(\theta)$ is finite for all $\theta$ in some interval containing the origin.
Proof. Not proved.

Theorem 6.7. If $X$ and $Y$ are independent random variables with moment generating functions $m_X(\theta)$ and $m_Y(\theta)$, then $X + Y$ has moment generating function
\[ m_{X+Y}(\theta) = m_X(\theta) \cdot m_Y(\theta) \]
Proof.
\[ E\left[e^{\theta(X+Y)}\right] = E\left[e^{\theta X} e^{\theta Y}\right] = E\left[e^{\theta X}\right] E\left[e^{\theta Y}\right] = m_X(\theta)\, m_Y(\theta) \]

Theorem 6.8. The $r$-th moment of $X$, i.e. the expected value of $X^r$, $E[X^r]$, is the coefficient of $\frac{\theta^r}{r!}$ in the series expansion of $m(\theta)$.
Proof. Sketch:
\[ e^{\theta X} = 1 + \theta X + \frac{\theta^2}{2!} X^2 + \dots \]
\[ E\left[e^{\theta X}\right] = 1 + \theta E[X] + \frac{\theta^2}{2!} E[X^2] + \dots \]

Example. Recall that $X$ has an exponential distribution with parameter $\lambda$ if it has density $\lambda e^{-\lambda x}$ for $0 \leq x < \infty$. Then
\[ E\left[e^{\theta X}\right] = \int_0^\infty e^{\theta x} \lambda e^{-\lambda x}\,dx = \int_0^\infty \lambda e^{-(\lambda - \theta) x}\,dx = \frac{\lambda}{\lambda - \theta} = m(\theta) \qquad \text{for } \theta < \lambda \]
\[ E[X] = m'(0) = \left. \frac{\lambda}{(\lambda - \theta)^2} \right|_{\theta = 0} = \frac{1}{\lambda} \]
\[ E[X^2] = m''(0) = \left. \frac{2\lambda}{(\lambda - \theta)^3} \right|_{\theta = 0} = \frac{2}{\lambda^2} \]
Thus
\[ \operatorname{Var} X = E[X^2] - E[X]^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2} \]
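The derivative calculations can be cross-checked numerically: approximate $m'(0)$ and $m''(0)$ by central finite differences (the step size $h$ is an assumed numerical parameter):

```python
def exp_mgf(theta, lam):
    """m(theta) = lam / (lam - theta), valid for theta < lam (Exp(lam))."""
    return lam / (lam - theta)

def moments_from_mgf(m, h=1e-4):
    """First two moments via central finite differences of the mgf at 0:
    E[X] = m'(0), E[X^2] = m''(0), as in Theorem 6.8."""
    first = (m(h) - m(-h)) / (2 * h)
    second = (m(h) - 2 * m(0.0) + m(-h)) / (h * h)
    return first, second

lam = 2.0
ex, ex2 = moments_from_mgf(lambda t: exp_mgf(t, lam))
var = ex2 - ex * ex
# theory: E[X] = 1/lam = 0.5, E[X^2] = 2/lam^2 = 0.5, Var X = 1/lam^2 = 0.25
```
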
[The following proves the closure properties of the normal distribution stated in the preceding theorem (whose statement falls on a missing page): if $X \sim N[\mu_1, \sigma_1^2]$ and $Y \sim N[\mu_2, \sigma_2^2]$ are independent then $X + Y \sim N[\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2]$, and $aX \sim N[a\mu_1, a^2\sigma_1^2]$; the proof uses the mgf $E[e^{\theta X}] = e^{\mu\theta + \frac{1}{2}\sigma^2\theta^2}$ of a $N[\mu, \sigma^2]$ random variable.]
Proof. 1.
\begin{align*}
E\left[e^{\theta(X+Y)}\right] &= E\left[e^{\theta X}\right] E\left[e^{\theta Y}\right] \\
&= e^{\mu_1\theta + \frac{1}{2}\sigma_1^2\theta^2}\, e^{\mu_2\theta + \frac{1}{2}\sigma_2^2\theta^2} \\
&= e^{(\mu_1+\mu_2)\theta + \frac{1}{2}(\sigma_1^2+\sigma_2^2)\theta^2}
\end{align*}
which is the moment generating function of
\[ N[\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2] \]
2.
\[ E\left[e^{\theta(aX)}\right] = E\left[e^{(a\theta)X}\right] = e^{\mu_1(a\theta) + \frac{1}{2}\sigma_1^2(a\theta)^2} = e^{(a\mu_1)\theta + \frac{1}{2}a^2\sigma_1^2\theta^2} \]
which is the moment generating function of
\[ N[a\mu_1, a^2\sigma_1^2] \]
6.4 Central Limit Theorem

Let $X_1, \dots, X_n$ be i.i.d.r.vs, each with mean $0$ and variance $\sigma^2$:
\[ \operatorname{Var} X_i = \sigma^2 \]
Then $X_1 + \dots + X_n$ has variance
\[ \operatorname{Var}(X_1 + \dots + X_n) = n\sigma^2 \]
while $\frac{X_1 + \dots + X_n}{n}$ has variance
\[ \operatorname{Var}\left( \frac{X_1 + \dots + X_n}{n} \right) = \frac{\sigma^2}{n} \]
and $\frac{X_1 + \dots + X_n}{\sqrt{n}}$ has variance
\[ \operatorname{Var}\left( \frac{X_1 + \dots + X_n}{\sqrt{n}} \right) = \sigma^2 \]

Theorem 6.10. Let $X_1, \dots, X_n$ be i.i.d.r.vs with $E[X_i] = \mu$ and $\operatorname{Var} X_i = \sigma^2 < \infty$, and let
\[ S_n = \sum_{i=1}^n X_i \]
Then, for all $(a, b)$ with $a < b$,
\[ \lim_{n\to\infty} P\left( a \leq \frac{S_n - n\mu}{\sigma\sqrt{n}} \leq b \right) = \int_a^b \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}\,dz \]
where the integrand is the pdf of a $N[0, 1]$ random variable.
Proof. Sketch of proof. WLOG take $\mu = 0$ and $\sigma^2 = 1$ (we can replace $X_i$ by $\frac{X_i - \mu}{\sigma}$). The mgf of $X_i$ is
\begin{align*}
m_{X_i}(\theta) = E\left[e^{\theta X_i}\right] &= 1 + \theta E[X_i] + \frac{\theta^2}{2} E[X_i^2] + \frac{\theta^3}{3!} E[X_i^3] + \dots \\
&= 1 + \frac{\theta^2}{2} + \frac{\theta^3}{3!} E[X_i^3] + \dots
\end{align*}
The mgf of $\frac{S_n}{\sqrt{n}}$ is
\begin{align*}
E\left[ e^{\frac{\theta S_n}{\sqrt{n}}} \right] &= E\left[ e^{\frac{\theta}{\sqrt{n}}(X_1 + \dots + X_n)} \right] \\
&= E\left[ e^{\frac{\theta}{\sqrt{n}} X_1} \right] \cdots E\left[ e^{\frac{\theta}{\sqrt{n}} X_n} \right] = \left( m_{X_1}\left( \frac{\theta}{\sqrt{n}} \right) \right)^n \\
&= \left( 1 + \frac{\theta^2}{2n} + \frac{\theta^3 E[X_1^3]}{3!\, n^{3/2}} + \dots \right)^n \to e^{\frac{\theta^2}{2}} \quad \text{as } n \to \infty
\end{align*}
which is the mgf of a $N[0, 1]$ random variable.
Note: if $S_n \sim \operatorname{Bin}[n, p]$ ($X_i = 1$ with probability $p$ and $0$ with probability $1 - p$) then
\[ \frac{S_n - np}{\sqrt{npq}} \to N[0, 1] \]
This is called the normal approximation to the binomial distribution. It applies as $n \to \infty$ with $p$ constant. Earlier we discussed the Poisson approximation to the binomial, which applies when $n \to \infty$ with $np$ constant.

Example. There are two competing airlines; $n$ passengers each select one of the two planes at random. The number of passengers in plane one is
\[ S \sim \operatorname{Bin}\left[ n, \tfrac{1}{2} \right] \]
Suppose each plane has $s$ seats, and let $f(s) = P(S > s)$. By the normal approximation,
\[ f(s) = P\left( \frac{S - \frac{1}{2}n}{\frac{1}{2}\sqrt{n}} > \frac{s - \frac{1}{2}n}{\frac{1}{2}\sqrt{n}} \right) \approx 1 - \Phi\left( \frac{2s - n}{\sqrt{n}} \right) \]
Therefore if $n = 1000$ and $s = 537$ then $f(s) \approx 0.01$: the planes hold $1074$ seats in total, only $74$ in excess of the number of passengers.
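The airline figure follows from a one-line computation with the standard normal distribution function, which can be written via the error function:

```python
import math

def phi_cdf(z):
    """Standard normal distribution function Phi(z), via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def overflow_prob(n, s):
    """Normal approximation to P(S > s) for S ~ Bin(n, 1/2):
    P(S > s) ≈ 1 - Phi((2s - n) / sqrt(n))."""
    return 1.0 - phi_cdf((2.0 * s - n) / math.sqrt(n))

p = overflow_prob(1000, 537)
# the text's figure: roughly 0.01
```
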
Example. An unknown fraction of the electorate, $p$, vote Labour. It is desired to find $p$ to within an error not exceeding $0.005$. How large should the sample be?
Let the fraction of Labour voters in the sample be $p'$. We can never be certain (without complete enumeration) that $|p' - p| \leq 0.005$; instead choose $n$ so that the event $|p' - p| \leq 0.005$ has probability $\geq 0.95$. Writing $S_n$ for the number of Labour voters in the sample, $S_n \sim \operatorname{Bin}[n, p]$,
\[ P(|p' - p| \leq 0.005) = P(|S_n - np| \leq 0.005n) = P\left( \frac{|S_n - np|}{\sqrt{npq}} \leq 0.005 \sqrt{\frac{n}{pq}} \right) \]
We choose $n$ such that this probability is $\geq 0.95$. Since
\[ \int_{-1.96}^{1.96} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx = 2\Phi(1.96) - 1 \approx 0.95 \]
we must choose $n$ so that
\[ 0.005 \sqrt{\frac{n}{pq}} \geq 1.96 \]
But we don't know $p$. However $pq \leq \frac{1}{4}$, with the worst case $p = q = \frac{1}{2}$, so it suffices to take
\[ n \geq \frac{1.96^2}{0.005^2} \cdot \frac{1}{4} \approx 40{,}000 \]
If we replace $0.005$ by $0.01$ then $n \geq 10{,}000$ will be sufficient, and if we replace $0.005$ by $0.045$ then $n \geq 475$ will suffice.
Note: the answer does not depend upon the size of the total population.
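The worst-case sample-size bound $n \geq z^2 / (4\varepsilon^2)$ (with $z = 1.96$) is a one-liner:

```python
import math

def sample_size(eps, z=1.96):
    """Smallest n with eps * sqrt(n / (p*q)) >= z in the worst case
    p = q = 1/2 (so pq <= 1/4): n >= z^2 / (4 * eps^2)."""
    return math.ceil(z * z / (4.0 * eps * eps))

n1 = sample_size(0.005)   # 38416, which the text rounds to 40,000
n2 = sample_size(0.01)    # 9604, rounded to 10,000 in the text
n3 = sample_size(0.045)   # 475
```
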
6.5 Multivariate normal distribution

Let $X_1, \dots, X_n$ be i.i.d. $N[0, 1]$ random variables, with joint density
\begin{align*}
g(x_1, \dots, x_n) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{x_i^2}{2}} \\
&= \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2} \sum_{i=1}^n x_i^2} = \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2} x^T x}
\end{align*}
Write
\[ X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} \]
and let $Z = \mu + AX$, where $A$ is an invertible matrix, so that $X = A^{-1}(Z - \mu)$. The density of $Z$ is
\begin{align*}
f(z_1, \dots, z_n) &= \frac{1}{(2\pi)^{n/2}} \frac{1}{|\det A|} e^{-\frac{1}{2}\left(A^{-1}(z-\mu)\right)^T \left(A^{-1}(z-\mu)\right)} \\
&= \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(z-\mu)^T \Sigma^{-1} (z-\mu)}
\end{align*}
where $\Sigma = A A^T$. This is the multivariate normal density,
\[ Z \sim MVN[\mu, \Sigma] \]
Now $\operatorname{Cov}(Z_i, Z_j) = E[(Z_i - \mu_i)(Z_j - \mu_j)]$, which is the $(i, j)$ entry of
\[ E\left[(Z - \mu)(Z - \mu)^T\right] = E\left[(AX)(AX)^T\right] = A\, E\left[X X^T\right] A^T = A I A^T = A A^T = \Sigma \]
so $\Sigma$ is the covariance matrix.
If the covariance matrix of the MVN distribution is diagonal, then the components of the random vector $Z$ are independent, since
\[ f(z_1, \dots, z_n) = \prod_{i=1}^n \frac{1}{(2\pi)^{1/2} \sigma_i} e^{-\frac{1}{2} \left( \frac{z_i - \mu_i}{\sigma_i} \right)^2} \]
where
\[ \Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_n^2 \end{pmatrix} \]
This is not necessarily true if the distribution is not MVN -- recall sheet 2, question 9.
Example (bivariate normal).
\[ f(x_1, x_2) = \frac{1}{2\pi (1 - \rho^2)^{1/2} \sigma_1 \sigma_2} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left( \frac{x_1 - \mu_1}{\sigma_1} \right)^2 - 2\rho \left( \frac{x_1 - \mu_1}{\sigma_1} \right)\left( \frac{x_2 - \mu_2}{\sigma_2} \right) + \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2 \right] \right\} \]
with $\sigma_1, \sigma_2 > 0$ and $-1 < \rho < +1$. This is the joint density of a bivariate normal random variable. It is an example with
\[ \Sigma^{-1} = \frac{1}{1 - \rho^2} \begin{pmatrix} \frac{1}{\sigma_1^2} & \frac{-\rho}{\sigma_1 \sigma_2} \\ \frac{-\rho}{\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \]
Here $E[X_i] = \mu_i$ and $\operatorname{Var} X_i = \sigma_i^2$, with $\operatorname{Cov}(X_1, X_2) = \sigma_1\sigma_2\rho$, so that
\[ \operatorname{Correlation}(X_1, X_2) = \frac{\operatorname{Cov}(X_1, X_2)}{\sigma_1 \sigma_2} = \rho \]