Transcript of JLarge Slides MT123 2011
Financial Econometrics Lecture Slides:
MFE, Michaelmas Term 2011
Weeks 1-3
Random Variables, Estimators
and Asymptotic Approximation
Jeremy Large
St Hugh’s College and Oxford-Man Institute of Quantitative Finance, University of Oxford
September 27, 2011
1
Contents
1 Basic probability . . . 10
  1.1 Reading: see lecture notes . . . 10
  1.2 Sample spaces, events and axioms . . . 11
  1.3 Independence . . . 16
  1.4 Conditional Probability . . . 18
2 Random variables . . . 24
  2.1 Basics . . . 24
  2.2 Example random variables . . . 27
  2.3 Random walk . . . 31
  2.4 Distribution functions . . . 32
  2.5 Quantile functions . . . 37
  2.6 Some common random variables . . . 39
  2.7 Multivariate random variables . . . 47
  2.8 Moments . . . 56
  2.9 Covariance matrices . . . 63
  2.10 Back to distributions . . . 71
  2.11 Conditional distributions . . . 75
3 Estimators . . . 85
  3.1 Introduction . . . 85
  3.2 Bias and mean square error of estimators . . . 87
4 Simulating random variables . . . 89
  4.1 Pseudo random numbers . . . 89
  4.2 Inverting distribution functions . . . 90
5 Asymptotic approximation . . . 92
  5.1 Motivation . . . 92
  5.2 Definitions . . . 94
  5.3 Some payback . . . 102
  5.4 Some more theory . . . 104
2
Overview of the course
First of two examples
Lehman Brothers : share price (2001 - 2005)
Figure 1: A time-series of Lehman Brothers end-of-day share prices (dollars).
3
Second of two examples
Figure 2: A cross-plot of gold prices against copper prices, Q1 2006 (with trend line).
4
Time-series (Lehman)              Regression (gold-copper)
One quantity changes over time    Some quantities interact
Forecasting and explaining        Explaining
Questions
• How can you make money from each?
• Is there randomness in the two examples? Or is everything ‘deterministic’?
5
Time-series (Lehman)              Regression (gold-copper)
One quantity changes over time    Some quantities interact
Forecasting and explaining        Explaining
Comments
• In the second part of this term Prof Neil Shephard will talk about time-series, regression relationships, and mixtures of the two
• In Hilary Term, Prof Anders Rahbek will go more deeply into time-series: forecasting, volatility
• I provide theoretical underpinnings for both:
– notation
– framework
– proofs
– → a fairly abstract and theoretical start
6
Lecture plan, Thursdays, Weeks 1-4:
• 1pm: Lecture starts
• 1:40pm: 5-minute break, stretch legs
• 2:25pm: 20-minute break for coffee
• 3:25pm: 5-minute break, stretch legs
• 4:15pm: end
• 4:30pm to 5pm: Office hours in this lecture room
Classes take place in Weeks 3-9 this term. Thursday morning.
Kasper Lund-Jensen is the class teacher.
7
Weekly assignments:
Weekly assignments are distributed at each Thursday lecture:
→ Intended to take about three hours (I would recommend you never spend longer than four hours on them)
→ Hand them in at SBS reception by 4pm the Monday 11 days later.
→ Kasper returns your answers, and provides solutions in the classes the following Thursday.
→ grade of either 1 or 0.
→ 1 point will be awarded if the assignment is mostly complete and correct. No
points will be awarded if the assignment is substantially incomplete.
→ Over this term and next, the best 10 out of 16 assignments will count towards the final grade.
8
What will be in the exams and quizzes next term?
All course contents are examinable, unless they have been flagged otherwise (note the starring system in the lecture notes for this part of the course)
Best guide to exam question style : weekly assignments
Best guide to content : highly unlikely to stray beyond material appearing
in the slides covered in lectures, or the assignments.
9
1 Basic probability
Financial econometrics, and much of finance theory, takes the view that asset prices are random.
So, probability theory is the basis of all modern econometrics and much of
economics and finance.
We will also need some linear algebra.
1.1 Reading: see lecture notes
10
1.2 Sample spaces, events and axioms
Example: Vodafone trades to the nearest 0.25p, so 0.25p is the price tick size.
Vodafone prices over one day:
Figure 3: Sample path of the best bid (best available marginal price to a seller) for Vodafone on the LSE’s electronic limit order book SETS for the first working day in 2004.
Write Yi as the price of a very simple asset at time i (after i changes, say).
11
A very simple model: price starts at zero and it can move 1 “tick” up or down each time period, or stay the same!
time i    possible prices Yi
i = 0     0
i = 1     −1, 0, 1
i = 2     −2, −1, 0, 1, 2
i = 3     −3, −2, −1, 0, 1, 2, 3
i = 4     −4, −3, −2, −1, 0, 1, 2, 3, 4
Thus, for example, Y4 can take on 9 different values.
This ‘toy model’ allows us to try out most deep ideas in probability theory
12
Sample space. The set Ω is called the sample space if it contains all possible (primitive) outcomes that we are considering. E.g. if we think about Y4 then its sample space is
Ω = {−4, −3, −2, −1, 0, 1, 2, 3, 4}.
Event. An event is a subset of Ω (which could be Ω itself). E.g. let
A = {1},
i.e. Y4 = 1. Further let B be the event that Y4 is strictly positive, so
B = {1, 2, 3, 4}.
Example 1.1 (value at risk) An important concept is downside risk — how much you can lose, how quickly and how often. In this case the event of a large loss might be defined as
{−4, −3},
a rapid fall of 3 ticks or more. In practice value at risk tends to be computed over a day or more, rather than over tiny time periods.
13
Probability axioms based on the triple (Ω,F , Pr)
F is the ‘power set’ of Ω, which just means it contains all the subsets of Ω:
A ∈ F ↔ A is a subset of Ω.
(technical note: F sometimes contains many – but not all – subsets of Ω)
And Pr is a real-valued function on F (not on Ω) that satisfies
1. Pr(A) ≥ 0, for all A ∈ F (for all A in the set F)
2. Pr(Ω) = 1.
3. If {Ai ∈ F : i = 1, 2, ...} (which is an infinitely large set of elements of F) are disjoint then
Pr( ⋃_{j=1}^∞ Aj ) = ∑_{k=1}^∞ Pr(Ak).
In the Vodafone example:
Pr(Y4 > 0) = ∑_{i=1}^4 Pr(Y4 = i).
14
Comments:
• Only events have probabilities.
• Events, E, are subsets of Ω, not elements. So E ⊆ Ω or, equivalently, E ∈ F.
• Probabilities are always ≥ zero.
• A realization is when a single ω ∈ Ω is picked (‘happens’).
• However, strictly speaking this realization has no probability (giving it a probability makes no sense).
• ⋃ signifies ‘or’; ⋂ signifies ‘and’
15
1.3 Independence
Consider two events A, B which are in F .
When does occurrence of one event not affect the probability of another event
also happening?
When the two events are independent.
Write that the events A, B are independent (in F) iff
Pr(A ∩ B) = Pr(A) × Pr(B).
Write
A ⊥⊥ B.
16
Example 1.2 Let S and T be any subsets of {−1, 0, 1}.
(e.g. suppose that S = {−1, 1} and T = {1})
Define A and B by:
A is [ (Y4 − Y3) ∈ S ]
and
B is [ (Y3 − Y2) ∈ T ].
Many models assume that for any S and T ,
A ⊥⊥ B.
Informally, we mean that (Y4 − Y3) is independent of (Y3 − Y2), so we write this quickly as:
(Y4 − Y3) ⊥⊥ (Y3 − Y2).
We will formalize this later, in terms of ‘random variables’.
17
1.4 Conditional Probability
Definition
Two events, A and B. We might be interested in Pr(A) or Pr(B) or Pr(A ∩B).
Want to know Pr(A|B), assuming Pr(B) > 0.
I constrain my world so that B happens and I ask if A then happens.
This can only be if both A and B happen, so we define
Pr(A|B) = Pr(A ∩ B) / Pr(B).
Think of this as a function of A, with B fixed in the background.
→ that way, it obeys the three standard probability axioms.
This is a vital concept in econometrics.
18
Joint conditional probabilities
Pr(A ∩ B|C).
If
Pr(A ∩ B|C) = Pr(A|C) × Pr(B|C),
we say that conditionally on C, A and B are independent. This is often written as
(A ⊥⊥ B) | C.
19
Conditional probabilities, and time
Suppose we are at time 3; then we know the value of Y3 = 2, say. Then Y4 must be in
{1, 2, 3},
so let’s think of {1, 2, 3} as a new sample space. It is not too hard to define new events and new probabilities
Pr(Y4 > 1 | Y3 = 2) = 1 − Pr(Y4 = 1 | Y3 = 2).
Here, as ever, conditional probabilities are simply standard probabilities,
• but on another sample space.
Let’s never forget the stuff to the right of |.
20
Example 1.3 We may be interested in the forecast distributions, across all x:
Pr(Y4 = x | Y3 = y),
Pr(Y4 = x | Y2 = y),
Pr(Y4 = x | Y1 = y),
Pr(Y4 = x | Y1 = a, Y2 = b, Y3 = c),
the last of which is the distribution of Y4 given we know that the prices at times 1, 2, 3 were a, b and c.
The last conditional probability is a one-step-ahead forecast distribution given the path of the process.
21
A flexible notation for the example on the page before
We may be interested in the forecast distributions:
Pr(Y4|Y3), Pr(Y4|Y2), Pr(Y4|Y1), Pr(Y4|Y1, Y2, Y3),
the distribution of Y4 given we know the price at time 3, 2 or 1.
The last conditional probability is a one-step-ahead forecast distribution given the path of the process.
22
Example 1.4 In many models in financial econometrics:
Pr(Yi|Yi−1, Yi−2, Yi−3, . . . ) = Pr(Yi| Yi−1 ).
That is, given the value of Yi−1, the value of Y two or more periods before is irrelevant to the value of Yi.
This is the Markov Assumption.
A consequence of the Markov Assumption:
(Yi ⊥⊥ Yi−2) |Yi−1.
23
2 Random variables
2.1 Basics
A random variable is a function from Ω to R.
Typically, it is called X(ω).
Most of econometrics is about random variables.
We drop reference to ω, so we will write X as the random variable.
Properties of X are events, for example: ‘X > 0’ is the event
{ω : X(ω) > 0},   (1)
which is a subset of Ω, like every other event, and can have a probability.
24
Independence Two random variables, Y1 and Y2 are independent if
for any events A1 about Y1, and A2 about Y2:
A1 ⊥⊥ A2.
If they are independent, then we write
Y1 ⊥⊥ Y2.
Exercise: prove that if Y1 and Y2 are independent, then for any y1 and y2:
Pr[Y1 ≤ y1 and Y2 ≤ y2] = Pr[Y1 ≤ y1] Pr[Y2 ≤ y2].
25
i.i.d. A sequence of random variables Y1, Y2, ..., YN, ... is said to be i.i.d. (independently and identically distributed) if
• any pair Yi and Yj are independent, and have the same distribution.
26
2.2 Example random variables
Bernoulli random variable:
Could be heads (ω = H) or tails (ω = T ).
Let X(H) = 1 and X(T ) = 0.
We say X is a Bernoulli random variable with two ‘points of support’, {0, 1}.
Write Pr(X = 1) = p and Pr(X = 0) = 1 − p.
Now let’s make a new random variable:
U1 = 2X − 1 ∈ {−1, 1}.
27
A sequence of Bernoulli random variables:
Write Xi as above but for time i, where i can be 1,2,3,...
Assume that Xi are independent and identically distributed (i.i.d.).
Binomial tree process
Yi = Yi−1 + Ui, i = 1, 2, 3, ..., Y0 = 0, (2)
Ui = 2Xi − 1. (3)
What is a Random Process? Nothing other than a sequence of random variables, e.g. Y0, Y1, Y2, ...
→ for example, we record a price at a sequence of times of our choosing.
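The binomial tree process (2)-(3) can be simulated directly. Below is a minimal Python sketch (the course itself uses Matlab); the function name and parameters are my own, not from the slides.

```python
import random

def binomial_tree_path(n, p=0.5, seed=None):
    """Simulate Y_0, ..., Y_n from the binomial tree: Y_i = Y_{i-1} + U_i,
    where U_i = 2*X_i - 1 and the X_i are i.i.d. Bernoulli(p)."""
    rng = random.Random(seed)
    y = [0]                                 # Y_0 = 0
    for _ in range(n):
        x = 1 if rng.random() < p else 0    # Bernoulli draw X_i
        y.append(y[-1] + (2 * x - 1))       # tick up (+1) or down (-1)
    return y

path = binomial_tree_path(100, seed=42)
print(path[:10])
```

Each step moves the price exactly one tick up or down, matching the possible-price table for Yi on the earlier slide.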
28
[Figure: eight simulated sample paths of Yi (i = 0, ..., 100) from the binomial tree, plus a histogram of Y100 against the Binomial density.]
29
Definition of a Binomial random variable:
Suppose we carry out n independent Bernoulli trials with Pr (Xi = 1) = p
→ then
Zn = ∑_{i=1}^n Xi
is a Binomial RV, called Zn.
And we might want to define the random process, Z:
Z = {Zn : n = 1, 2, 3, ...}.   (4)
30
2.3 Random walk
The binomial tree (2) can be written as
Yi = 2 ∑_{j=1}^i Xj − i,   i = 0, 1, 2, ...,   Y0 = 0.
Special case of the random walk process
Yi = Yi−1 + εi,
where εi are i.i.d.
εi are called the ‘shocks’, or ‘residuals’, or ‘innovations’. Note that if we think of Yi as log-prices then
εi = Yi − Yi−1
are returns.
Hence the log-price process can be transformed into an i.i.d. sample by ‘taking first differences’.
31
2.4 Distribution functions
Distribution function of a random variable X is
FX(x) = Pr(X ≤ x).
Density function for continuous X,
f_X(x) = ∂F_X(x)/∂x.
Clearly
F_X(x) = ∫_{−∞}^x f_X(y) dy.
Note that for continuous variables (in inverted commas):
Pr(X = x) = 0,
for every x.
For X with countable support we often write fX(x) for Pr(X = x).
32
Conditional distribution functions
Distribution function of a random variable X conditional on some positive-probability event A is
F_{X|A}(x) = Pr(X ≤ x | A).
Conditional density function for continuous X,
f_{X|A}(x) = ∂F_{X|A}(x)/∂x.
Clearly
F_{X|A}(x) = ∫_{−∞}^x f_{X|A}(y) dy.
33
Mean
This is defined (when it exists) as
E(X) = ∫_{−∞}^∞ x f_X(x) dx.
It is often used as a measure of the average value of a random variable (alternatives include mode and median).
Discrete r.v. : replace integration with summation.
Example 2.1 Suppose X is a Bernoulli trial with Pr(X = 1) = p and Pr(X = 0) = 1 − p. Then
E(X) = 1 × Pr(X = 1) + 0 × Pr(X = 0) = p.   (5)
34
Variance
Variance is defined as
Var(X) = E[{X − E(X)}²]
       = ∫ {x − E(X)}² f_X(x) dx
       = E(X²) − {E(X)}².
The standard deviation is defined as √Var(X).
A further very important formula:
Var(a + bX) = b²Var(X).
(exercise: prove this)
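Alongside the proof, the formula Var(a + bX) = b²Var(X) can be checked numerically. A small Python sketch (the particular values of a and b are arbitrary choices of mine):

```python
import random
import statistics

rng = random.Random(0)
# sample X ~ N(0, 1), then apply the affine transform a + b*X
x = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
a, b = 3.0, -2.0
y = [a + b * xi for xi in x]

var_x = statistics.pvariance(x)
var_y = statistics.pvariance(y)
# Var(a + bX) should equal b^2 * Var(X): the shift a drops out entirely
print(var_y, b ** 2 * var_x)
```

The two printed numbers agree, illustrating that location shifts do not affect variance while scaling enters squared.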
35
Conditional Mean
The conditional expectation of a random variable X given a +ve probability event A is
E(X|A) = ∫ x f_{X|A}(x) dx.
Conditional Variance
By analogy:
Var(X|A) = E(X²|A) − {E(X|A)}².
36
2.5 Quantile functions
Inverting the distribution function, i.e. we ask: for a given u ∈ [0, 1], find x such that
u = F_X(x).
We call
x = F_X^{−1}(u)
the quantile function of X.
The 0.1 quantile tells us the value of X such that only 10% of the population fall below that value. The most well known quantile is
x = F_X^{−1}(0.5),
which is called the median.
37
Example 2.2 Quantiles are central in simple value at risk (VaR) calculations, which measure the degree of risk taken by banks. In simple VaR calculations one looks at the marginal distribution of the returns over a day, written Yi − Yi−1, and calculates
F_{Yi−Yi−1}^{−1}(0.05),
the 5% quantile of the return distribution.
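A quick way to see the 5% quantile in practice is to take its empirical version: sort a sample of daily returns and read off the value below which 5% of the sample falls. In this Python sketch the returns are simulated as N(0, 0.01²) purely for illustration; real VaR work would use observed returns.

```python
import random

rng = random.Random(1)
# hypothetical daily returns, simulated here only to have something to sort
returns = sorted(rng.gauss(0.0, 0.01) for _ in range(10_000))

# empirical 5% quantile: the value below which 5% of the sample falls
var_5pct = returns[int(0.05 * len(returns))]
print(var_5pct)
```

For this normal model the answer sits near −1.645 × 0.01 ≈ −0.016, i.e. on a bad day (worst 5%) the position loses about 1.6% or more.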
38
2.6 Some common random variables
Normal
The normal distribution is important. Does not look immediately attractive:
f_X(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},   x, µ ∈ R, σ² ∈ R+.
Density peaks at µ and is symmetric around µ.
39
Model for returns on daily Sterling/$ 1985 to 2000.
[Figure: density of the daily returns — flexible estimator vs fitted normal.]
40
f_X(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},   x, µ ∈ R, σ² ∈ R+.
Centred at µ; σ² determines its scale (spread).
The notation for a normal r.v. is X ∼ N(µ, σ²).
µ is the mean; σ² is the variance
• We will prove this later
• Notice that together, the mean and variance define the normal distribution
If an i.i.d. sequence has normal random variables, we write it is N.I.D.
• And we will also see NID(µ, σ²).
Another word for normal is ‘Gaussian’.
41
If X ∼ N(µ, σ²) and γ and λ are non-random then
γ + λX ∼ N(γ + λµ, λ²σ²).
One can write
X =(law) µ + σu,
where u ∼ N(0, 1). Equality in law means the left and right hand side quantities have the same law or distribution. Finally, if X and Y are independent normal variables with means µx and µy and variances σx² and σy², then
X + Y ∼ N(µx + µy, σx² + σy²).
That is: the means and variances add up, and normality is maintained.
This is a very convenient result for asset pricing, as we will see later.
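The additivity of means and variances for independent normals can be checked by simulation. A Python sketch with arbitrary illustrative parameters (µx = 1, µy = −0.5, σx = 2, σy = 1.5):

```python
import random
import statistics

rng = random.Random(2)
mu_x, mu_y, sd_x, sd_y = 1.0, -0.5, 2.0, 1.5

# draw X and Y independently and form the sum X + Y
s = [rng.gauss(mu_x, sd_x) + rng.gauss(mu_y, sd_y) for _ in range(200_000)]

print(statistics.fmean(s))       # near mu_x + mu_y = 0.5
print(statistics.pvariance(s))   # near sd_x^2 + sd_y^2 = 6.25
```

The sample mean and variance of the sum land close to µx + µy and σx² + σy², as the result predicts.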
42
Example: Suppose that Ui are i.i.d. N(µ, σ²); then the ‘drifting’ random walk
Yi = Yi−1 + Ui,   Y0 = 0,
has the feature that
Yi ∼ N(iµ, iσ²),
or
Yi+s | Yi ∼ N(Yi + sµ, sσ²).
43
Consider a change to the Binomial tree that we saw earlier:
Replace the scaled and recentred Bernoulli variable with a normal random variable, Ui ∼ N(µ, σ²).
Select µ = 0 and σ² = 4 × 0.5 × 0.5 so that it matches the mean and variance of the previous Binomial tree.
[Figure: eight simulated sample paths of Yi (i = 0, ..., 100) with Gaussian increments, plus a histogram of Y100 against the Gaussian density.]
44
Uniform
Sometimes variables are constrained to live on small intervals. The leading example of this is the standard uniform
f_X(x) = 1,   x ∈ [0, 1].
Used in economic theory as a stylised way of introducing uncertainty into a model, and in simulation.
Chi-squared
Suppose Xi ∼ i.i.d. N(0, 1) (often written NID(0, 1)); then
Y = ∑_{i=1}^ν Xi² ∼ χ²_ν
is a Chi-squared random variable with “degrees of freedom” ν.
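The chi-squared construction is easy to simulate: sum ν squared standard normals. A standard fact (not stated on the slide, but worth checking here) is that χ²_ν has mean ν and variance 2ν; the sketch below verifies both by Monte Carlo.

```python
import random
import statistics

rng = random.Random(3)
nu = 4  # degrees of freedom

# each draw of Y is the sum of nu squared independent N(0,1) variables
draws = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
         for _ in range(100_000)]

print(statistics.fmean(draws))      # near nu = 4
print(statistics.pvariance(draws))  # near 2 * nu = 8
```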
45
Student t
A Student t random variable is generated by a ratio of random variables.
Crude notation for it is:
t_ν = N(0, 1) / √(χ²_ν / ν),
where N(0, 1) ⊥⊥ χ²_ν. This is symmetrically distributed about 0.
46
2.7 Multivariate random variables
Consider a multivariate q × 1 vector where each element is a random variable
X = (X1, ..., Xq)′ .
This vector is itself a random variable (a ‘q-dimensional multivariate random variable’).
Note that we didn’t say that the elements of X had to be independent random variables.
Important example: the elements of this vector could represent the returns on a collection of q assets, such as the FTSE100 equities, daily.
→ Because of this example, multivariate random variables play a central role in portfolio allocation and risk assessments, as well as all aspects of econometrics.
47
Sustained example: returns from a portfolio
Consider the bivariate case where q = 2. We might think of
X = (X1, X2)′ = (Y, Z)′,
where X1 = Y is the return over the next day on IBM and X2 = Z is the return over the next day on the S&P composite index.
Consider the case of measuring the outperformance of the index by IBM. This is
Y − Z.
We can write this as
(1, −1) (Y, Z)′ = b′X,
where
b = (1, −1)′, so b′ = (1, −1).
Thus the outperformance can be measured using linear algebra.
This outperformance can be thought of as a simple portfolio, buying IBM and selling the index.
48
Consider, slightly more abstractly, a portfolio made up of c shares in Y and d in Z. Then the portfolio returns
cY + dZ.
This can be written in terms of vectors as
(c, d) (Y, Z)′ = f′X,   f = (c, d)′.
More generally, we might write p portfolios, each with different portfolio weights, as
( B11 Y + B12 Z, B21 Y + B22 Z, B31 Y + B32 Z, ..., Bp1 Y + Bp2 Z )′ = BX,
where B is the p × 2 matrix whose i-th row is (Bi1, Bi2).
This is a very powerful way of writing out portfolios compactly.
49
So far, the p portfolios each contained only two assets.
But you can extend this easily from 2 to q underlying assets
X = (X1, X2, X3, ..., Xq)′,
and B is the p × q matrix with entries Bij (row i holds the weights of portfolio i):
B =
( B11 B12 B13 · · · B1q )
( B21 B22 B23 · · · B2q )
( B31 B32 B33 · · · B3q )
( ...                   )
( Bp1 Bp2 Bp3 · · · Bpq )
Now the p portfolios, depending upon q assets, have returns
BX = ( ∑_{j=1}^q B1j Xj, ∑_{j=1}^q B2j Xj, ∑_{j=1}^q B3j Xj, ..., ∑_{j=1}^q Bpj Xj )′.
Again this is quite a simple representation of quite a complicated situation.
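The BX representation maps directly onto matrix-vector multiplication. A Python/NumPy sketch with made-up weights and returns (p = 3 portfolios over q = 4 assets; none of these numbers are from the slides):

```python
import numpy as np

# hypothetical weight matrix B: one row per portfolio, one column per asset
B = np.array([[0.25, 0.25, 0.25, 0.25],   # equally weighted portfolio
              [1.0, -1.0,  0.0,  0.0],    # long asset 1, short asset 2
              [0.5,  0.0,  0.5,  0.0]])   # half in assets 1 and 3
# hypothetical asset returns X
X = np.array([0.01, -0.02, 0.03, 0.005])

portfolio_returns = B @ X   # entry i is sum_j B_ij * X_j
print(portfolio_returns)
```

One matrix product computes all p portfolio returns at once, which is the point of the compact notation.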
50
Back on track
In particular if X is a 2 × 1 vector,
X = (X1, X2)′ and x = (x1, x2)′,
then
F_X(x) = Pr(X1 ≤ x1, X2 ≤ x2),
which in the continuous case becomes
F_X(x) = ∫_{−∞}^{x2} ∫_{−∞}^{x1} f_X(y1, y2) dy1 dy2.   (6)
51
Likewise
f_X(x1, x2) = ∂²F_X(x1, x2) / (∂x1 ∂x2).
When X1 ⊥⊥ X2 then this simplifies to
f_X(x1, x2) = ∂²{F_{X1}(x1) F_{X2}(x2)} / (∂x1 ∂x2)
            = {∂F_{X1}(x1)/∂x1} × {∂F_{X2}(x2)/∂x2}
            = f_{X1}(x1) f_{X2}(x2).
52
[Figure: (a) standard normal density; (b) NIG(1,0,0,1) density; (c) standard normal log-density; (d) NIG(1,0,0,1) log-density.]
53
An important point is that, from Eq. (6),
∫_{−∞}^∞ f_X(y, x2) dy = ∂F_X(∞, x2)/∂x2
                       = ∂Pr(X1 ≤ ∞, X2 ≤ x2)/∂x2
                       = ∂F_{X2}(x2)/∂x2
                       = f_{X2}(x2).
Hence if we integrate out a variable from a density function we produce the ‘marginal density’ of the other random variable.
54
Let’s suppose that X2 is a discrete r.v.
Then the conditional distribution function of X1 takes on the form
F_{X1|X2=x2}(x1) = Pr(X1 ≤ x1 | X2 = x2),
while, if X1 is continuous, we define
f_{X1|X2=x2}(x1) = ∂Pr(X1 ≤ x1 | X2 = x2)/∂x1,
which has the properties of a density.
Now, if both X2 and X1 are continuous r.v.s, we define
f_{X1|X2=x2}(x1) = f_X(x1, x2) / f_{X2}(x2).
Intuitive, but the theory behind this is beyond the scope of this course.
55
2.8 Moments
General case
An expectation of a function of a random variable.
Define, for a continuous X, if it exists,
E{g(X)} = ∫ g(x) f_X(x) dx.   (7)
The expectation obeys some important rules. For example if a, b are constants then
E{a + b g(X)} = a + b E{g(X)}.
This follows from the definition of expectations as solutions to integrals (7).
56
Special ‘base’ cases of moments
The most basic moment is known as the first moment:
E(X) = ∫ x f_X(x) dx.   (8)
We’ve also seen the second moment:
E(X²) = ∫ x² f_X(x) dx.   (9)
Even though you’ll see these much more than others, try to see them as special cases.
57
Example 2.3 If X ∼ N(µ, σ²), then
E(X) = ∫_{−∞}^∞ x (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)} dx
     = µ + ∫_{−∞}^∞ (x − µ) (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)} dx
     = µ,
using the fact that a density integrates to one.
Exercise: fill in the working here (use properties of antisymmetric functions)
58
Multivariate mean
Recall we write
X = (X1, X2, X3, ..., Xq)′.
Now each Xj has a mean, E(Xj), so it would be nice to collect these together. The following notation does this. We define
E(X) = ( E(X1), E(X2), E(X3), ..., E(Xq) )′.
This is the mean of the vector.
59
We wrote the return on p portfolios as
BX,
where B is a p × q weight matrix. Then
E(BX) = BE(X).
60
Why? Recall a mean of a vector is the mean of all the elements of the vector:
E(BX) = ( E(∑_{j=1}^q B1j Xj), E(∑_{j=1}^q B2j Xj), E(∑_{j=1}^q B3j Xj), ..., E(∑_{j=1}^q Bpj Xj) )′.
But, for i = 1, 2, ..., p,
E( ∑_{j=1}^q Bij Xj ) = ∑_{j=1}^q E(Bij Xj) = ∑_{j=1}^q Bij E(Xj).
61
Hence
E(BX) = ( ∑_{j=1}^q B1j E(Xj), ∑_{j=1}^q B2j E(Xj), ..., ∑_{j=1}^q Bpj E(Xj) )′ = BE(X),
as stated. This is an important result for econometrics.
62
2.9 Covariance matrices
Univariate covariance
The covariance of X and Y is defined (when it exists) as
Cov(X, Y) = E[{X − E(X)}{Y − E(Y)}]
          = ∫ {x − E(X)}{y − E(Y)} f_{X,Y}(x, y) dx dy
          = E(XY) − E(X)E(Y).
63
Cov(a + bX, c + dY ) = bdCov(X, Y ).
Hence covariances are location invariant.
Var(aX + bY ) = a2Var(X) + b2Var(Y ) + 2abCov(X, Y ).
64
Independence implies uncorrelatedness
Recall, if the moments exist,
Cov(X, Y) = E(XY) − E(X)E(Y).
So if X ⊥⊥ Y then
Cov(X, Y) = E(X)E(Y) − E(X)E(Y) = 0.
If the covariance between X and Y is zero we say they are uncorrelated:
X ⊥ Y.
So
(X ⊥⊥ Y) =⇒ (X ⊥ Y).
The reverse is not true (though in the Gaussian case it is!).
65
Example 2.4 Suppose
X ∼ N(0, 1),   Y = X².
Then
Cov(X, Y) = E(XY) − E(X)E(Y) = E(X³) = 0.
66
Correlation
The correlation of X and Y is defined (when it exists) as
Cor(X, Y) = Cov(X, Y) / √{Var(X)Var(Y)}.
Now
Cor(X, Y ) ∈ [−1, 1],
which follows from the Cauchy-Schwarz inequality.
67
Think of
X = (X1, X2, X3, ..., Xp)′.
Then we define the covariance matrix of X as
Cov(X) =
( Var(X1)      Cov(X1, X2)  · · ·  Cov(X1, Xp) )
( Cov(X2, X1)  Var(X2)      · · ·  Cov(X2, Xp) )
( ...                                          )
( Cov(Xp, X1)  Cov(Xp, X2)  · · ·  Var(Xp)     )
This is a symmetric p × p matrix.
* Covariance matrices are always ‘positive semi-definite’ (which means that the e-values are all ≥ 0 [and real]).
68
The covariance matrix can be calculated as
Cov(X) = E[{X − E(X)}{X − E(X)}′].
Example 2.5 In the IBM and S&P example then we have approximately that
E(X) = (0.0206, −0.00721)′,
Cov(X) =
( 5.07  1.79 )
( 1.79  1.62 )
A very important result is that if B is a q × p matrix of constants, then
• E(a + BX) = a + BE(X)
• Cov(a + BX) = BCov(X)B′.
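The transformation rule Cov(a + BX) = BCov(X)B′ can be verified by simulation. A NumPy sketch reusing the Example 2.5 covariance matrix; the vector a and matrix B below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# the (approximate) IBM / S&P covariance matrix from Example 2.5
Sigma = np.array([[5.07, 1.79],
                  [1.79, 1.62]])

# draw 200,000 mean-zero 2-vectors with covariance Sigma via Cholesky
L = np.linalg.cholesky(Sigma)
X = L @ rng.standard_normal((2, 200_000))   # columns are draws of X

a = np.array([[1.0], [0.0], [-2.0]])        # 3x1 constant shift
B = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [2.0,  0.0]])                 # 3x2 constant matrix
Y = a + B @ X                               # transformed draws

print(np.cov(Y))        # sample covariance of Y
print(B @ Sigma @ B.T)  # the theoretical B Sigma B'
```

The two matrices agree up to Monte Carlo error, and the shift a has no effect on the covariance, matching the formula.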
69
Correlation matrices
Corresponding to the covariance matrix is the correlation matrix, which is (when it exists)
Cor(X) =
( 1             Cor(X1, X2)  · · ·  Cor(X1, Xp) )
( Cor(X2, X1)   1            · · ·  Cor(X2, Xp) )
( ...                                           )
( Cor(Xp, X1)   Cor(Xp, X2)  · · ·  1           )
= Cor(X)′.
This matrix is invariant to location and scale changes, but obviously not general linear transformations.
70
2.10 Back to distributions
Multivariate normal
The p-dimensional X ∼ N(µ, Σ). E(X) = µ, Cov(X) = Σ. Assume |Σ| > 0. Σ is always symmetric of course. Then
f_X(x) = |2πΣ|^{−1/2} exp{ −(1/2)(x − µ)′ Σ^{−1} (x − µ) },   x ∈ R^p.
Here Σ^{−1} is a matrix inverse, which exists due to the |Σ| > 0 assumption. Further
|2πΣ|^{−1/2} = (2π)^{−p/2} |Σ|^{−1/2}.
Exercises:
1. explain why the density has a single peak at µ.
2. how does this simplify if Σ = σ2I?
3. if I tell you Σ and µ, do you know everything about the normal distribution?
71
Example 2.6 Suppose
Σ = σ²Ip,
which means that the elements of X are independent and homoskedastic. Then
f_X(x) = (2πσ²)^{−p/2} exp{ −(1/(2σ²)) (x − µ)′(x − µ) }
       = (2πσ²)^{−p/2} exp{ −(1/(2σ²)) ∑_{i=1}^p (xi − µi)² }.
72
Let X be a p-dimensional multivariate normal.
Then, if a is q × 1 and B is q × p and both are constants,
Y = a + BX ∼ N(a + Bµ, BΣB′),   (10)
a q-dimensional normal.
That is: all linear transformations of normals are normal.
73
In particular if p = 2, then for X = (X1, X2)′,
Σ = Cov(X) =
( Var(X1)       Cov(X1, X2) )   ( σ1²     ρσ1σ2 )
( Cov(X1, X2)   Var(X2)     ) = ( ρσ1σ2   σ2²   ),
where ρ = Cor(X1, X2).
This is an important model.
In an ‘abuse of notation’, we can write (!) X1 as X and X2 as Y.
In which case we get the formulation
(X, Y)′ ∼ N( (µx, µy)′, Σ ),   Σ = ( σx² ρσxσy ; ρσxσy σy² ).
Keep this model in mind for the next few slides ...
74
2.11 Conditional distributions
Basic recap: Consider two (possibly multivariate) discrete random variables X, Y; then
F_{X|Y=y}(x) = Pr(X ≤ x | Y = y) = Pr(X ≤ x, Y = y) / Pr(Y = y).
Likewise in the continuous case, the conditional density is defined by:
f_{X|Y=y}(x) = ∂F_{X|Y=y}(x)/∂x = f_{X,Y}(x, y) / f_Y(y).
So,
f_{X,Y}(x, y) = f_{X|Y=y}(x) f_Y(y)
(known as the marginal-conditional decomposition).
Useful to consider this in the context of Normals...
75
Example of two standard normals
X and Y are Standard Normals. So, both have mean of 0 and variance of 1. The correlation between them is ρ. In this case
Y | (X = x) ∼ N(ρx, 1 − ρ²).
Put another way,
f_{Y|X=x}(y) = (1/√(2π(1 − ρ²))) exp{ −(y − ρx)²/(2(1 − ρ²)) }.   (11)
It is natural to write:
E(Y|X = x) = ρx;
Var(Y|X = x) = 1 − ρ²,
and we often will. Called ‘Conditional Moments’ ...
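These conditional moments can be seen in a simulation: generate correlated standard normals, keep the draws with X near some value x0, and look at the mean and variance of Y within that slice. A Python sketch with ρ = 0.6 and x0 = 1 (both arbitrary choices of mine):

```python
import random
import statistics

rng = random.Random(4)
rho = 0.6
pairs = []
for _ in range(200_000):
    x = rng.gauss(0.0, 1.0)
    # build Y so (X, Y) are standard bivariate normal with correlation rho
    y = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
    pairs.append((x, y))

x0 = 1.0
# draws of Y whose X landed in a narrow band around x0
bucket = [y for (x, y) in pairs if abs(x - x0) < 0.05]

print(statistics.fmean(bucket))       # near E(Y|X = x0) = rho * x0 = 0.6
print(statistics.pvariance(bucket))   # near Var(Y|X = x0) = 1 - rho^2 = 0.64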
76
Conditional first moment
Recall
f_{X|Y=y}(x) = f_{X,Y}(x, y) / f_Y(y).
The definition of the first conditional moment is
E_{X|Y=y}(X) = ∫ x f_{X|Y=y}(x) dx.
We also write this (as on the previous slide) by
E(X|Y = y) = ∫ x f_{X|Y=y}(x) dx.
77
A concise notation of great use
Recall X and Y as Standard Normals, correlation between them is ρ.
We saw that
E(Y |X = x) = ρx.
It will be really helpful to condense this further:
E(Y |X) = ρX.
Write the random variable itself in the place of the particular value that we know it takes under the conditioning, i.e. capitalize X.
Likewise, we can write “Y ’s variance conditional on X” as:
Var(Y|X) = 1 − ρ².   (12)
78
General conditional moments
More generally, the definition of a conditional moment is
E_{X|Y=y}(g(X)) = ∫ g(x) f_{X|Y=y}(x) dx,
which is a function of y, say h(y).
This gives the random variable h(Y).
... and we could consider its expectation: E_Y(h(Y)), or:
E_Y( E_{X|Y}(g(X)) ),   (13)
i.e. (more concisely using the notation of the last slide):
E( E(g(X) | Y) ).   (14)
79
Law of Iterated Expectations
Now recall
f_{X,Y}(x, y) = f_{X|Y=y}(x) f_Y(y).
Doing some algebra, we have that
E_X(g(X)) = E_Y( E_{X|Y}(g(X)) ).
This is the Law of Iterated Expectations, and is very important.
• It allows you to break complex expectations up into manageable chunks.
You can also write the law as:
E( E(g(X) | Y) ) = E(g(X)).
A related result:
Var_X(X) = E_Y( Var_{X|Y}(X|Y) ) + Var_Y( E_{X|Y}(X|Y) ).
80
General bivariate normal. In this case
(X, Y)′ ∼ N( (µX, µY)′, Σ ),   Σ = ( σX² ρσXσY ; ρσXσY σY² ),
then
Y | (X = x) ∼ N( µY + (ρσY/σX)(x − µX), σY²(1 − ρ²) ).
81
Again,
Y | (X = x) ∼ N( µY + (ρσY/σX)(x − µX), σY²(1 − ρ²) ).
To be brief we often write:
Y | X ∼ N( µY + (ρσY/σX)(X − µX), σY²(1 − ρ²) ).
• Conditional variance does not depend upon x or X.
• Change in the conditional mean is
(ρσY/σX)(x − µX),
so is linear in x. The effect is compared to the mean, i.e. x − µX. Dividing by σX removes the scale of x; multiplying by σY puts the variable onto the y scale.
82
Example 2.7 Y is the return on an asset, X is the return on the market portfolio. Then
β_{Y|X} = ρσY/σX
is often called the beta of Y and is a measure of how Y moves with the market.
Notice that we can also write:
β_{Y|X} = Cov(X, Y) / Var(X).
83
Martingale
In modelling dynamics, martingales play a large role.
Consider a sequence of asset prices recorded through time,
Y1, Y2, Y3, ...,
where the subscript reflects time. A natural object to study is
E(Yi | Y1, ..., Yi−1),
the conditional expectation (which we assume exists) of “the future given the past”. If
E(Yi | Y1, ..., Yi−1) = Yi−1,
then the sequence is said to be a martingale with respect to its own past history.
Exercise: Use the Law of Iterated Expectations to prove that if Yi is any Martingale, with fixed Y1, then
E(Y3) = Y1.
84
3 Estimators
3.1 Introduction
A statistic S(X) is a function of a (vector) random variable X.
When we learn about a feature of the probability model we say we are estimating
the model.
If S(X) is intended to describe a feature of the probability model, then we callit an estimator.
If x is the observed value of X, then we call S(x) the resulting estimate.
85
Example 3.1 Let
S(X) = (1/n) ∑_{i=1}^n Xi.
If Xi ∼ NID(µ, σ²) then, using the fact that S(X) is a linear combination of normals, we have that
S(X) ∼ N(µ, σ²/n).
If n is very large the estimator is very close to µ, the average value of the normal distribution.
86
3.2 Bias and mean square error of estimators
Estimate some quantity θ.
Wish for S(X) to be close to θ on average.
Bias: E{S(X)} − θ.
Example 3.2 If Xi ∼ NID(µ, σ²) then
S(X) = X̄ = (1/n) ∑_{i=1}^n Xi,
the sample mean (sample average), has a zero bias as an estimator of µ.
When the bias is zero, the estimator is said to be unbiased.
87
Very large dispersion?
Imprecision of an estimator can be measured with the Mean Square Error criterion:
MSE := E[{S(X) − θ}²] = Var{S(X)} + [E{S(X)} − θ]².
RMSE = Root MSE = square root of the MSE.
→ Which is better: an unbiased estimator, or a biased estimator which is more
precise?
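The decomposition MSE = variance + bias² can be checked by Monte Carlo. The sketch below uses a deliberately biased estimator of my own invention (the sample mean shrunk toward zero by a factor 0.8) so that both terms are nonzero:

```python
import random
import statistics

rng = random.Random(5)
mu, sigma, n = 2.0, 1.0, 20

def shrunk_mean(xs):
    # a deliberately biased estimator: shrink the sample mean toward zero
    return 0.8 * statistics.fmean(xs)

# repeat the experiment many times to estimate the sampling distribution
estimates = []
for _ in range(50_000):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    estimates.append(shrunk_mean(sample))

mse = statistics.fmean((e - mu) ** 2 for e in estimates)
bias = statistics.fmean(estimates) - mu        # close to 0.8*mu - mu = -0.4
var = statistics.pvariance(estimates)
print(mse, var + bias ** 2)   # the two nearly coincide
```

Whether the shrunk estimator beats the unbiased sample mean on MSE is exactly the bias-versus-precision trade-off the question above asks about.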
88
4 Simulating random variables
Simulation is a key technique in advanced modern econometrics.
Produce random variables from known distribution functions.
4.1 Pseudo random numbers
All of the simulation methods are built out of draws based on a sequence of independent and identically distributed (standard) uniform random numbers Ui ∈ [0, 1].
Let’s regard the problem of producing such uniform numbers as solved - Matlab does this for us.
An example is given below (!)
Ui
.734
.452
.234
.123
.987
89
4.2 Inverting distribution functions
Key point: given a source of unlimited simulated i.i.d. uniforms we can produce i.i.d. draws from any continuous distribution F_X(x).
Proof: As Ui is uniform,
Pr(Ui ≤ F_X(x)) = F_X(x).
Thus
Pr(Ui ≤ F_X(x)) = Pr(F_X^{−1}(Ui) ≤ x) = Pr(Xi ≤ x).
So if we take
Xi = F_X^{−1}(Ui),   (15)
then we produce random numbers from any continuous distribution:
→ plug the stream of simulated uniforms into the quantile function (15).
Discrete random variables are very similar but need some attention at the jump-points in the CDF, F.
90
Example 4.1 The exponential distribution. Recall F_X(x) = 1 − exp(−x/µ), and so the quantile function is
F_X^{−1}(p) = −µ log(1 − p).
Hence
−µ log(1 − Ui)
are i.i.d. exponential draws. E.g. with µ = 1:
Ui      Xi
.734    1.324
.452    0.601
.234    0.266
.123    0.131
.987    4.343
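The table can be reproduced in a couple of lines by applying the quantile function (15) to the uniforms from the earlier slide:

```python
import math

mu = 1.0
u_draws = [0.734, 0.452, 0.234, 0.123, 0.987]   # the uniforms from the slide
# inverse-CDF transform: X_i = F^{-1}(U_i) = -mu * log(1 - U_i)
x_draws = [-mu * math.log(1.0 - u) for u in u_draws]
print([round(x, 3) for x in x_draws])
```

The outputs match the Xi column of the table to rounding error.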
91
5 Asymptotic approximation
5.1 Motivation
Classical convergence:

Xn = 3 + 1/n → 3

as n → ∞.

A little more fuzzy when we think of

Xn = 3 + Y/n →? 3,
where Y is a random variable.
There are different measures of convergence. Some need moments, others don't: "convergence in probability" and "convergence in distribution".

Formally we will think of a sequence of random variables X1, X2, . . . , Xn which, as n gets large, will be such that Xn will behave like some other random variable
or constant X.
92
Example 5.1 We are interested in

Xn = (1/n) ∑_{j=1}^{n} Yj.

Then it forms a sequence

X1 = Y1,  X2 = (1/2)(Y1 + Y2),  X3 = (1/3)(Y1 + Y2 + Y3).

What does (1/n) ∑_{j=1}^{n} Yj behave like for large n? What does Xn converge to for large n?
93
5.2 Definitions
Sequence of random variables Xn. Ask if
Xn − X
is small as n goes to infinity.
You can measure smallness in many ways and so there are lots of different notions of convergence.
We discuss three, the second of which will be the most important for us.
94
Definition. (Convergence in mean square) Let X and X1, X2, . . . be random variables. If

lim_{n→∞} E[(Xn − X)²] = 0,

then the sequence X1, X2, . . . is said to converge in mean square to the random variable X. A shorthand notation is

Xn →^{m.s.} X.    (16)

Necessary and sufficient conditions for Xn →^{m.s.} X are that

lim_{n→∞} E(Xn − X) = 0  [asymptotic unbiasedness]  and  lim_{n→∞} Var(Xn − X) = 0.
95
Suppose Y1, ..., Yn are i.i.d. with mean µ and variance σ². Then define

Xn = (1/n) ∑_{i=1}^{n} Yi,

which has

E(Xn) = (1/n) ∑_{i=1}^{n} E(Yi) = µ,

and

Var(Xn) = (1/n²) Var(∑_{i=1}^{n} Yi) = (1/n²) ∑_{i=1}^{n} Var(Yi) = σ²/n.

Hence Xn is unbiased and the variance goes to zero. Hence

Xn →^{m.s.} µ.
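The σ²/n rate can be seen by simulation (a Python/numpy sketch; the Exponential(1) population, which has variance 1, is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, reps = 1.0, 20_000
for n in (10, 100, 1000):
    # reps independent sample means, each from n i.i.d. Exponential(1) draws
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    print(n, round(xbar.var(), 5), sigma2 / n)   # simulated vs theoretical variance
```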
96
Definition. (Convergence in probability) If for all ε, η > 0 there exists an n0 s.t.

Pr(|Xn − X| < η) > 1 − ε,  ∀ n > n0,

then the sequence X1, X2, . . . is said to converge in probability to the random variable X. A shorthand notation is

Xn →^p X.    (17)
97
Definition. (Convergence almost surely) Let X and X1, X2, . . . be random variables. If, for all ε, η > 0, there exists an n0 s.t.

Pr(|Xn − X| < η, ∀ n > n0) > 1 − ε,

then we say that Xn almost surely converges to X, which we write as Xn →^{a.s.} X.

Thus almost sure convergence is about ensuring that the joint behaviour of all events n > n0 is well behaved.

But convergence in probability just looks at the probabilities for each n.
98
Xn →^{a.s.} X ⇒ Xn →^p X.

Further note that Xn →^{a.s.} X neither implies nor is implied by Xn →^{m.s.} X.
99
Theorem. Weak Law of Large Numbers (WLLN). Let Xi ∼ iid, with E(Xi) and Var(Xi) existing. Then

(1/n) ∑_{i=1}^{n} Xi →^p E(Xi),

as n → ∞.

Proof. See lecture notes (uses Chebyshev's inequality, or the generic result that convergence in mean square implies convergence in probability).
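The WLLN can also be seen numerically (a Python/numpy sketch with Xi ∼ U[0, 1], so E(Xi) = 1/2, and a fixed η = 0.01): the probability that the sample mean lands within η of 1/2 rises towards one as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)
eta, reps = 0.01, 2_000
fracs = []
for n in (10, 100, 10_000):
    xbars = rng.uniform(size=(reps, n)).mean(axis=1)
    # Monte Carlo estimate of Pr(|Xbar_n - 1/2| < eta)
    fracs.append(np.mean(np.abs(xbars - 0.5) < eta))
print(fracs)
```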
100
Theorem. (Kolmogorov's) Strong Law of Large Numbers (SLLN). Let Xi ∼ iid, with E(Xi) existing. Then

(1/n) ∑_{i=1}^{n} Xi →^{a.s.} E(Xi),

as n → ∞.

Proof. Difficult. See, for example, Gallant (1997, p. 132).
101
5.3 Some payback
The most important rules are

• If An →^p a, then g(An) →^p g(a), where g(·) is a continuous function at a.

Example 5.2 Suppose Xi ∼ iid, with E(Xi) and Var(Xi) existing, and E(Xi) non-zero. Then

(1/n) ∑_{i=1}^{n} Xi →^p E(Xi),

which implies

1 / ((1/n) ∑_{i=1}^{n} Xi) →^p 1/E(Xi).
102
• If g and h are both continuous functions and

An →^p a,  Bn →^p b,

as n → ∞, then

g(An)h(Bn) →^p g(a)h(b).

Suppose Yi ∼ iid, with E(Yi) and Var(Yi) existing. Then

((1/n) ∑_{i=1}^{n} Xi)((1/n) ∑_{i=1}^{n} Yi) →^p E(Xi)E(Yi).
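Both rules can be sketched in a few lines (Python/numpy; the Exp(2) and U[0, 1] populations are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.exponential(2.0, size=n)   # E(Xi) = 2
y = rng.uniform(size=n)            # E(Yi) = 1/2

recip = 1.0 / x.mean()             # continuous-mapping rule: near 1/E(Xi) = 0.5
prod = x.mean() * y.mean()         # product rule: near E(Xi)E(Yi) = 1.0
print(recip, prod)
```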
103
5.4 Some more theory
Refined measure of convergence
Convergence almost surely or in probability is quite a rough measure, for it says that

Xn − X

implodes to zero with large values of n.

It does not indicate the speed of convergence, nor give any distributional shape to Xn − X.

To improve our understanding we need a concept called convergence in distribution.
104
Definition. (Convergence in Distribution) The sequence X1, X2, . . . of random variables is said to converge in distribution to the random variable X if

FXn(x) → FX(x)    (18)

at every point x where FX is continuous. A shorthand notation is

Xn →^d X.    (19)
105
Generic tools — Central Limit Theorems
Most famous of these is the Lindeberg-Levy ‘CLT’.
Theorem (Lindeberg-Levy) Let X1, X2, . . . be independent, identically distributed random variables, so that EXi = µ, Var(Xi) = σ².

Set

X̄n = (X1 + · · · + Xn)/n.

Then

√n(X̄n − µ) →^d N(0, σ²).
106
Example Suppose Xi are i.i.d. χ₁² (that is, informally, N(0, 1)²). Xi has mean of 1 and variance of 2. The Lindeberg-Levy CLT shows that

√n(X̄ − 1) →^d N(0, 2).
[Figure 4: six estimated density panels, not reproduced here.]

Figure 4: Left panel: estimated density, using 10,000 simulations, of √n(X̄ − 1) from a sample of iid χ₁² variables. Right panel looks at √n(log(X̄) − log(1)). From top to base, graphs have n = 3, 10 and 50.
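The left-hand panels of Figure 4 can be mimicked numerically (a Python/numpy sketch reporting moments rather than density plots): for each n, √n(X̄ − 1) has mean near 0 and variance near 2, and its distribution approaches N(0, 2) as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 100_000
for n in (3, 10, 50):
    xbar = rng.chisquare(1, size=(reps, n)).mean(axis=1)
    z = np.sqrt(n) * (xbar - 1.0)   # the CLT-scaled statistic
    print(n, round(z.mean(), 3), round(z.var(), 3))
```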
107
Very important results in this context are due to Slutsky's Theorem:

• Suppose Xn →^d X and Yn →^p µ. Then XnYn →^d Xµ and Xn/Yn →^d X/µ if µ ≠ 0.

• More generally, suppose Xn →^d X and Yn →^p µ. Let ϕ be a continuous mapping. Then ϕ(Xn, Yn) →^d ϕ(X, µ).
108
Suppose X1, ..., Xn are univariate i.i.d. with mean µ and variance σ².

Lindeberg-Levy:

√n(X̄n − µ)/σ →^d N(0, 1).

And we can also show that:

σ̂² = (1/n) ∑_{i=1}^{n} (Xi − X̄n)² = (1/n) ∑_{i=1}^{n} Xi² − X̄n² →^{a.s.} σ².

Then by Slutsky's Theorem

√n(X̄n − µ)/σ̂ →^d N(0, 1).
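A sketch of this studentisation step (Python/numpy; the exponential population with µ = 5 is an illustrative assumption): replacing σ by the estimate σ̂ leaves the statistic approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n, reps = 5.0, 200, 50_000
x = rng.exponential(mu, size=(reps, n))   # mean mu, standard deviation mu

xbar = x.mean(axis=1)
sighat = x.std(axis=1)                    # sigma-hat, dividing by n (ddof=0)
t = np.sqrt(n) * (xbar - mu) / sighat     # studentised mean
print(round(t.mean(), 3), round(t.var(), 3))   # near 0 and 1
```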
109
Multivariate CLTs:
These will be very important for us.
(Multivariate Lindeberg-Levy)

Let X1, X2, . . . be i.i.d. r.v.s, so that EXi = µ, Var(Xi) = Σ.

Set

X̄n = (X1 + · · · + Xn)/n.

Then

√n(X̄n − µ) →^d N(0, Σ).
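A bivariate sketch (Python/numpy; the pair construction (E1, E1 + E2) from i.i.d. unit exponentials is an illustrative assumption, giving mean (1, 2) and covariance matrix [[1, 1], [1, 2]]): the covariance of √n(X̄n − µ) is close to the population Σ.

```python
import numpy as np

rng = np.random.default_rng(6)
reps, n = 20_000, 100
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 1.0], [1.0, 2.0]])

e = rng.exponential(1.0, size=(reps, n, 2))
# Correlated pairs (E1, E1 + E2), with mean mu and covariance Sigma above
pairs = np.stack([e[..., 0], e[..., 0] + e[..., 1]], axis=-1)

z = np.sqrt(n) * (pairs.mean(axis=1) - mu)   # shape (reps, 2)
print(np.cov(z.T))                           # close to Sigma
```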
110