Transcript of JLarge Slides MT123 2011
Financial Econometrics Lecture Slides:
MFE, Michaelmas Term 2011
Weeks 1-3
Random Variables, Estimators
and Asymptotic Approximation
Jeremy Large
St Hugh’s College and Oxford-Man Institute of Quantitative Finance, University of Oxford
September 27, 2011
1
Contents
1 Basic probability . . . 10
  1.1 Reading: see lecture notes . . . 10
  1.2 Sample spaces, events and axioms . . . 11
  1.3 Independence . . . 16
  1.4 Conditional Probability . . . 18
2 Random variables . . . 24
  2.1 Basics . . . 24
  2.2 Example random variables . . . 27
  2.3 Random walk . . . 31
  2.4 Distribution functions . . . 32
  2.5 Quantile functions . . . 37
  2.6 Some common random variables . . . 39
  2.7 Multivariate random variables . . . 47
  2.8 Moments . . . 56
  2.9 Covariance matrices . . . 63
  2.10 Back to distributions . . . 71
  2.11 Conditional distributions . . . 75
3 Estimators . . . 85
  3.1 Introduction . . . 85
  3.2 Bias and mean square error of estimators . . . 87
4 Simulating random variables . . . 89
  4.1 Pseudo random numbers . . . 89
  4.2 Inverting distribution functions . . . 90
5 Asymptotic approximation . . . 92
  5.1 Motivation . . . 92
  5.2 Definitions . . . 94
  5.3 Some payback . . . 102
  5.4 Some more theory . . . 104
2
Overview of the course
First of two examples
Lehman Brothers : share price (2001 - 2005)
Figure 1: A time-series of Lehman Brothers end-of-day share prices (dollars).
3
Second of two examples
Figure 2: A cross-plot of gold prices against copper prices, Q1 2006 (with trend line).
4
Time-series (Lehman)              Regression (gold-copper)
One quantity changes over time    Some quantities interact
Forecasting and explaining        Explaining
Questions
• How can you make money from each?
• Is there randomness in the two examples? Or is everything ‘deterministic’?
5
Time-series (Lehman)              Regression (gold-copper)
One quantity changes over time    Some quantities interact
Forecasting and explaining        Explaining
Comments
• In the second part of this term Prof Neil Shephard will talk about time-series, regression relationships, and mixtures of the two
• In Hilary Term, Prof Anders Rahbek will go more deeply into time-series: forecasting, volatility
• I provide theoretical underpinnings for both:
– notation
– framework
– proofs
– → a fairly abstract and theoretical start
6
Lecture plan, Thursdays, Weeks 1-4:
• 1pm: Lecture starts
• 1:40pm: 5-minute break, stretch legs
• 2:25pm: 20-minute break for coffee
• 3:25pm: 5-minute break, stretch legs
• 4:15pm: end
• 4:30pm to 5pm: Office hours in this lecture room
Classes take place in Weeks 3-9 this term. Thursday morning.
Kasper Lund-Jensen is the class teacher.
7
Weekly assignments:
Weekly assignments are distributed at each Thursday lecture:
→ Intended to take about three hours (I would recommend you never spend longer than four hours on them)
→ Hand them in at SBS reception by 4pm the Monday 11 days later.
→ Kasper returns your answers, and provides solutions in the classes the following Thursday.
→ grade of either 1 or 0.
→ 1 point will be awarded if the assignment is mostly complete and correct. No
points will be awarded if the assignment is substantially incomplete.
→ Over this term and next, the best 10 out of 16 assignments will count towards the final grade.
8
What will be in the exams and quizzes next term?
All course contents are examinable, unless they have been flagged otherwise (note the starring system in the lecture notes for this part of the course)
Best guide to exam question style : weekly assignments
Best guide to content : highly unlikely to stray beyond material appearing
in the slides covered in lectures, or the assignments.
9
1 Basic probability
Financial econometrics, and much of finance theory, takes the view that asset prices are random.
So, probability theory is the basis of all modern econometrics and much of
economics and finance.
We will also need some linear algebra.
1.1 Reading: see lecture notes
10
1.2 Sample spaces, events and axioms
Example: Vodafone trades to the nearest 0.25p, so 0.25p is the price tick size.
Vodafone prices over one day:
Figure 3: Sample path of the best bid (best available marginal price to a seller) for Vodafone on the LSE’s electronic limit order book SETS for the first working day in 2004.
Write Yi as the price of a very simple asset at time i (after i changes, say).
11
A very simple model: price starts at zero and it can move 1 “tick” up or down each time period, or stay the same!
time i    possible prices Yi
i = 0     0
i = 1     −1, 0, 1
i = 2     −2, −1, 0, 1, 2
i = 3     −3, −2, −1, 0, 1, 2, 3
i = 4     −4, −3, −2, −1, 0, 1, 2, 3, 4
Thus, for example, Y4 can take on 9 different values.
This ‘toy model’ allows us to try out most deep ideas in probability theory
12
Sample space. The set Ω is called the sample space if it contains all possible (primitive) outcomes that we are considering. E.g. if we think about Y4 then its sample space is
Ω = {−4, −3, −2, −1, 0, 1, 2, 3, 4}.
Event. An event is a subset of Ω (which could be Ω itself). E.g. let
A = {1},
i.e. Y4 = 1. Further let B be the event that Y4 is strictly positive, so
B = {1, 2, 3, 4}.
Example 1.1 (value at risk) An important concept is downside risk — how much you can lose, how quickly and how often. In this case the event of a large loss might be defined as
{−4, −3},
a rapid fall of 3 ticks or more. In practice value at risk tends to be computed over a day or more, rather than over tiny time periods.
13
Probability axioms based on the triple (Ω,F , Pr)
F is the ‘power set’ of Ω, which just means it contains all the subsets of Ω:
A ∈ F ↔ A is a subset of Ω.
(technical note: F sometimes contains many – but not all – subsets of Ω)
And Pr is a real-valued function on F (not on Ω) that satisfies
1. Pr(A) ≥ 0, for all A ∈ F (for all A in the set F)
2. Pr(Ω) = 1.
3. If {Ai ∈ F : i = 1, 2, ...} (which is an infinitely large set of elements of F) are disjoint then
Pr( ⋃_{j=1}^∞ Aj ) = ∑_{k=1}^∞ Pr(Ak).
In the Vodafone example:
Pr(Y4 > 0) = ∑_{i=1}^4 Pr(Y4 = i).
14
Comments:
• Only events have probabilities.
• Events, E, are subsets of Ω, not elements. So E ⊆ Ω or, equivalently, E ∈ F.
• Probabilities are always ≥ zero.
• A realization is when a single ω ∈ Ω is picked (‘happens’).
• However, strictly speaking this realization has no probability (giving it a probability makes no sense).
• ⋃ signifies ‘or’; ⋂ signifies ‘and’
15
1.3 Independence
Consider two events A, B which are in F .
When does occurrence of one event not affect the probability of another event
also happening?
When the two events are independent.
Write that the events A, B are independent (in F) iff
Pr(A ∩ B) = Pr(A) × Pr(B).
Write
A ⊥⊥ B.
16
Example 1.2 Let S and T be any subsets of {−1, 0, 1}.
(e.g. suppose that S = {−1, 1} and T = {1})
Define A and B by:
A is [ (Y4 − Y3) ∈ S ]
and
B is [ (Y3 − Y2) ∈ T ].
Many models assume that for any S and T ,
A ⊥⊥ B.
Informally, we mean that (Y4 − Y3) is independent of (Y3 − Y2), so we write this quickly as:
(Y4 − Y3) ⊥⊥ (Y3 − Y2).
We will formalize this later, in terms of ‘random variables’.
17
1.4 Conditional Probability
Definition
Two events, A and B. We might be interested in Pr(A) or Pr(B) or Pr(A ∩B).
Want to know Pr(A|B), assuming Pr(B) > 0.
I constrain my world so that B happens and I ask if A then happens.
This can only be if both A and B happen, so we define
Pr(A|B) = Pr(A ∩ B) / Pr(B).
Think of this as a function of A, with B fixed in the background.
→ that way, it obeys the three standard probability axioms.
This is a vital concept in econometrics.
18
Joint conditional probabilities
Pr(A ∩ B|C).
If
Pr(A ∩ B|C) = Pr(A|C) × Pr(B|C),
we say that conditionally on C, A and B are independent. This is often written as
(A ⊥⊥ B) | C.
19
Conditional probabilities, and time
Suppose we are at time 3; then we know the value of Y3 = 2, say. Then Y4 must be in
{1, 2, 3},
so let’s think of {1, 2, 3} as a new sample space. It is not too hard to define new events and new probabilities
Pr(Y4 > 1 | Y3 = 2) = 1 − Pr(Y4 = 1 | Y3 = 2).
Here, as ever, conditional probabilities are simply standard probabilities,
• but on another sample space.
Let’s never forget the stuff to the right of |.
20
Example 1.3 We may be interested in the forecast distributions, across all x:
Pr(Y4 = x | Y3 = y),
Pr(Y4 = x | Y2 = y),
Pr(Y4 = x | Y1 = y),
Pr(Y4 = x | Y1 = a, Y2 = b, Y3 = c),
the last of which is the distribution of Y4 given we know that the prices at times 1, 2, 3 were a, b and c.
The last conditional probability is a one-step-ahead forecast distribution given the path of the process.
21
A flexible notation for the example on the page before
We may be interested in the forecast distributions:
Pr(Y4|Y3), Pr(Y4|Y2), Pr(Y4|Y1), Pr(Y4|Y1, Y2, Y3),
the distribution of Y4 given we know the price at time 3, 2 or 1.
The last conditional probability is a one-step-ahead forecast distribution given the path of the process.
22
Example 1.4 In many models in financial econometrics:
Pr(Yi|Yi−1, Yi−2, Yi−3, . . . ) = Pr(Yi| Yi−1 ).
That is, given the value of Yi−1, the value of Y two or more periods before is irrelevant to the value of Yi.
This is the Markov Assumption.
A consequence of the Markov Assumption:
(Yi ⊥⊥ Yi−2) |Yi−1.
23
2 Random variables
2.1 Basics
A random variable is a function from Ω to R.
Typically, it is called X(ω).
Most of econometrics is about random variables.
We drop reference to ω, so we will write X as the random variable.
Properties of X are events, for example: ‘X > 0’ is the event
{ω : X(ω) > 0},   (1)
which is a subset of Ω, like every other event, and can have a probability.
24
Independence Two random variables, Y1 and Y2 are independent if
for any events A1 about Y1, and A2 about Y2:
A1 ⊥⊥ A2.
If they are independent, then we write
Y1 ⊥⊥ Y2.
Exercise: prove that if Y1 and Y2 are independent, then for any y1 and y2:
Pr[Y1 ≤ y1 and Y2 ≤ y2] = Pr[Y1 ≤ y1] Pr[Y2 ≤ y2].
25
i.i.d. A sequence of random variables Y1, Y2, ..., YN, ... is said to be i.i.d. (independently and identically distributed) if
• any pair Yi and Yj are independent, and have the same distribution.
26
2.2 Example random variables
Bernoulli random variable:
Could be heads (ω = H) or tails (ω = T ).
Let X(H) = 1 and X(T ) = 0.
We say X is a Bernoulli random variable with two ‘points of support’, {0, 1}.
Write Pr(X = 1) = p and Pr(X = 0) = 1 − p.
Now let’s make a new random variable:
U1 = 2X − 1 ∈ {−1, 1}.
27
A sequence of Bernoulli random variables:
Write Xi as above but for time i, where i can be 1,2,3,...
Assume that Xi are independent and identically distributed (i.i.d.).
Binomial tree process
Yi = Yi−1 + Ui, i = 1, 2, 3, ..., Y0 = 0, (2)
Ui = 2Xi − 1. (3)
What is a Random Process? Nothing other than a sequence of random variables, e.g. Y0, Y1, Y2, ...
→ for example, we record a price at a sequence of times of our choosing.
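The binomial tree process (2)-(3) can be simulated directly. Below is a minimal Python sketch (the course itself uses Matlab); the function name and parameters are my own, not from the slides.

```python
import random

def binomial_tree_path(n, p=0.5, seed=None):
    """Simulate Y_0, ..., Y_n from the binomial tree: Y_i = Y_{i-1} + U_i,
    where U_i = 2*X_i - 1 and the X_i are i.i.d. Bernoulli(p)."""
    rng = random.Random(seed)
    y = [0]                                 # Y_0 = 0
    for _ in range(n):
        x = 1 if rng.random() < p else 0    # Bernoulli draw X_i
        y.append(y[-1] + (2 * x - 1))       # tick up (+1) or down (-1)
    return y

path = binomial_tree_path(100, seed=42)
print(path[:10])
```

Each step moves the price exactly one tick up or down, matching the possible-price table for Yi on the earlier slide.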
28
[Figure: eight simulated sample paths of Yi (i = 0, ..., 100) from the binomial tree, plus a histogram of Y100 against the Binomial density.]
29
Definition of a Binomial random variable:
Suppose we carry out n independent Bernoulli trials with Pr (Xi = 1) = p
→ then
Zn = ∑_{i=1}^n Xi
is a Binomial RV, called Zn.
And we might want to define the random process, Z:
Z = {Zn : n = 1, 2, 3, ...}.   (4)
30
2.3 Random walk
The binomial tree (2) can be written as
Yi = 2 ∑_{j=1}^i Xj − i,   i = 0, 1, 2, ...,   Y0 = 0.
Special case of the random walk process
Yi = Yi−1 + εi,
where εi are i.i.d.
εi are called the ‘shocks’, or ‘residuals’, or ‘innovations’. Note that if we think of Yi as log-prices then
εi = Yi − Yi−1
are returns.
Hence the log-price process can be transformed into an i.i.d. sample by ‘taking first differences’.
31
2.4 Distribution functions
Distribution function of a random variable X is
FX(x) = Pr(X ≤ x).
Density function for continuous X,
f_X(x) = ∂F_X(x)/∂x.
Clearly
F_X(x) = ∫_{−∞}^x f_X(y) dy.
Note that for continuous variables (in inverted commas):
Pr(X = x) = 0,
for every x.
For X with countable support we often write fX(x) for Pr(X = x).
32
Conditional distribution functions
Distribution function of a random variable X conditional on some positive-probability event A is
F_{X|A}(x) = Pr(X ≤ x | A).
Conditional density function for continuous X,
f_{X|A}(x) = ∂F_{X|A}(x)/∂x.
Clearly
F_{X|A}(x) = ∫_{−∞}^x f_{X|A}(y) dy.
33
Mean
This is defined (when it exists) as
E(X) = ∫_{−∞}^∞ x f_X(x) dx.
It is often used as a measure of the average value of a random variable (alternatives include mode and median).
Discrete r.v. : replace integration with summation.
Example 2.1 Suppose X is a Bernoulli trial with Pr(X = 1) = p and Pr(X = 0) = 1 − p. Then
E(X) = 1 × Pr(X = 1) + 0 × Pr(X = 0) = p.   (5)
34
Variance
Variance is defined as
Var(X) = E[{X − E(X)}²]
       = ∫ {x − E(X)}² f_X(x) dx
       = E(X²) − {E(X)}².
The standard deviation is defined as √Var(X).
A further very important formula:
Var(a + bX) = b²Var(X).
(exercise: prove this)
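Alongside the proof, the formula Var(a + bX) = b²Var(X) can be checked numerically. A small Python sketch (the particular values of a and b are arbitrary choices of mine):

```python
import random
import statistics

rng = random.Random(0)
# sample X ~ N(0, 1), then apply the affine transform a + b*X
x = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
a, b = 3.0, -2.0
y = [a + b * xi for xi in x]

var_x = statistics.pvariance(x)
var_y = statistics.pvariance(y)
# Var(a + bX) should equal b^2 * Var(X): the shift a drops out entirely
print(var_y, b ** 2 * var_x)
```

The two printed numbers agree, illustrating that location shifts do not affect variance while scaling enters squared.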
35
Conditional Mean
The conditional expectation of a random variable X given a +ve probability event A is
E(X|A) = ∫ x f_{X|A}(x) dx.
Conditional Variance
By analogy:
Var(X|A) = E(X²|A) − {E(X|A)}².
36
2.5 Quantile functions
Inverting the distribution function, i.e. we ask: for a given u ∈ [0, 1], find x such that
u = F_X(x).
We call
x = F_X^{−1}(u)
the quantile function of X.
The 0.1 quantile tells us the value of X such that only 10% of the population fall below that value. The most well known quantile is
x = F_X^{−1}(0.5),
which is called the median.
37
Example 2.2 Quantiles are central in simple value at risk (VaR) calculations, which measure the degree of risk taken by banks. In simple VaR calculations one looks at the marginal distribution of the returns over a day, written Yi − Yi−1, and calculates
F_{Yi−Yi−1}^{−1}(0.05),
the 5% quantile of the return distribution.
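A quick way to see the 5% quantile in practice is to take its empirical version: sort a sample of daily returns and read off the value below which 5% of the sample falls. In this Python sketch the returns are simulated as N(0, 0.01²) purely for illustration; real VaR work would use observed returns.

```python
import random

rng = random.Random(1)
# hypothetical daily returns, simulated here only to have something to sort
returns = sorted(rng.gauss(0.0, 0.01) for _ in range(10_000))

# empirical 5% quantile: the value below which 5% of the sample falls
var_5pct = returns[int(0.05 * len(returns))]
print(var_5pct)
```

For this normal model the answer sits near −1.645 × 0.01 ≈ −0.016, i.e. on a bad day (worst 5%) the position loses about 1.6% or more.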
38
2.6 Some common random variables
Normal
The normal distribution is important. Does not look immediately attractive:
f_X(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},   x, µ ∈ R, σ² ∈ R+.
Density peaks at µ and is symmetric around µ.
39
Model for returns on daily Sterling/$ 1985 to 2000.
[Figure: density of the daily returns — flexible estimator vs fitted normal.]
40
f_X(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)},   x, µ ∈ R, σ² ∈ R+.
Centred at µ; σ² determines its scale (spread).
The notation for a normal r.v. is X ∼ N(µ, σ²).
µ is the mean; σ² is the variance
• We will prove this later
• Notice that together, the mean and variance define the normal distribution
If an i.i.d. sequence has normal random variables, we write it is N.I.D.
• And we will also see NID(µ, σ²).
Another word for normal is ‘Gaussian’.
41
If X ∼ N(µ, σ²) and γ and λ are non-random then
γ + λX ∼ N(γ + λµ, λ²σ²).
One can write
X =(law) µ + σu,
where u ∼ N(0, 1). Equality in law means the left and right hand side quantities have the same law or distribution. Finally, if X and Y are independent normal variables with means µx and µy and variances σx² and σy², then
X + Y ∼ N(µx + µy, σx² + σy²).
That is: the means and variances add up, and normality is maintained.
This is a very convenient result for asset pricing, as we will see later.
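The additivity of means and variances for independent normals can be checked by simulation. A Python sketch with arbitrary illustrative parameters (µx = 1, µy = −0.5, σx = 2, σy = 1.5):

```python
import random
import statistics

rng = random.Random(2)
mu_x, mu_y, sd_x, sd_y = 1.0, -0.5, 2.0, 1.5

# draw X and Y independently and form the sum X + Y
s = [rng.gauss(mu_x, sd_x) + rng.gauss(mu_y, sd_y) for _ in range(200_000)]

print(statistics.fmean(s))       # near mu_x + mu_y = 0.5
print(statistics.pvariance(s))   # near sd_x^2 + sd_y^2 = 6.25
```

The sample mean and variance of the sum land close to µx + µy and σx² + σy², as the result predicts.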
42
Example: Suppose that Ui are i.i.d. N(µ, σ²); then the ‘drifting’ random walk
Yi = Yi−1 + Ui,   Y0 = 0,
has the feature that
Yi ∼ N(iµ, iσ²),
or
Yi+s | Yi ∼ N(Yi + sµ, sσ²).
43
Consider a change to the Binomial tree that we saw earlier:
Replace the scaled and recentred Bernoulli variable with a normal random variable, Ui ∼ N(µ, σ²).
Select µ = 0 and σ² = 4 × 0.5 × 0.5 so that it matches the mean and variance of the previous Binomial tree.
[Figure: eight simulated sample paths of Yi (i = 0, ..., 100) with Gaussian increments, plus a histogram of Y100 against the Gaussian density.]
44
Uniform
Sometimes variables are constrained to live on small intervals. The leading example of this is the standard uniform
f_X(x) = 1,   x ∈ [0, 1].
Used in economic theory as a stylised way of introducing uncertainty into a model, and in simulation.
Chi-squared
Suppose Xi ∼ i.i.d. N(0, 1) (often written NID(0, 1)); then
Y = ∑_{i=1}^ν Xi² ∼ χ²_ν
is a Chi-squared random variable with “degrees of freedom” ν.
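The chi-squared construction is easy to simulate: sum ν squared standard normals. A standard fact (not stated on the slide, but worth checking here) is that χ²_ν has mean ν and variance 2ν; the sketch below verifies both by Monte Carlo.

```python
import random
import statistics

rng = random.Random(3)
nu = 4  # degrees of freedom

# each draw of Y is the sum of nu squared independent N(0,1) variables
draws = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
         for _ in range(100_000)]

print(statistics.fmean(draws))      # near nu = 4
print(statistics.pvariance(draws))  # near 2 * nu = 8
```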
45
Student t
A Student t random variable is generated by a ratio of random variables.
Crude notation for it is:
t_ν = N(0, 1) / √(χ²_ν / ν),
where N(0, 1) ⊥⊥ χ²_ν. This is symmetrically distributed about 0.
46
2.7 Multivariate random variables
Consider a multivariate q × 1 vector where each element is a random variable
X = (X1, ..., Xq)′ .
This vector is itself a random variable (a ‘q-dimensional multivariate random variable’).
Note that we didn’t say that the elements of X had to be independent random variables.
Important example: the elements of this vector could represent the returns on a collection of q assets, such as the FTSE100 equities, daily.
→ Because of this example, multivariate random variables play a central role in portfolio allocation and risk assessments, as well as all aspects of econometrics.
47
Sustained example: returns from a portfolio
Consider the bivariate case where q = 2. We might think of
X = (X1, X2)′ = (Y, Z)′,
where X1 = Y is the return over the next day on IBM and X2 = Z is the return over the next day on the S&P composite index.
Consider the case of measuring the outperformance of the index by IBM. This is
Y − Z.
We can write this as
(1, −1) (Y, Z)′ = b′X,
where
b = (1, −1)′, so b′ = (1, −1).
Thus the outperformance can be measured using linear algebra.
This outperformance can be thought of as a simple portfolio, buying IBM and selling the index.
48
Consider, slightly more abstractly, a portfolio made up of c shares in Y and d in Z. Then the portfolio returns
cY + dZ.
This can be written in terms of vectors as
(c, d) (Y, Z)′ = f′X,   f = (c, d)′.
More generally, we might write p portfolios, each with different portfolio weights, as
( B11 Y + B12 Z, B21 Y + B22 Z, B31 Y + B32 Z, ..., Bp1 Y + Bp2 Z )′ = BX,
where B is the p × 2 matrix whose i-th row is (Bi1, Bi2).
This is a very powerful way of writing out portfolios compactly.
49
So far, the p portfolios each contained only two assets.
But you can extend this easily from 2 to q underlying assets
X = (X1, X2, X3, ..., Xq)′,
and B is the p × q matrix with entries Bij (row i holds the weights of portfolio i):
B =
( B11 B12 B13 · · · B1q )
( B21 B22 B23 · · · B2q )
( B31 B32 B33 · · · B3q )
( ...                   )
( Bp1 Bp2 Bp3 · · · Bpq )
Now the p portfolios, depending upon q assets, have returns
BX = ( ∑_{j=1}^q B1j Xj, ∑_{j=1}^q B2j Xj, ∑_{j=1}^q B3j Xj, ..., ∑_{j=1}^q Bpj Xj )′.
Again this is quite a simple representation of quite a complicated situation.
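The BX representation maps directly onto matrix-vector multiplication. A Python/NumPy sketch with made-up weights and returns (p = 3 portfolios over q = 4 assets; none of these numbers are from the slides):

```python
import numpy as np

# hypothetical weight matrix B: one row per portfolio, one column per asset
B = np.array([[0.25, 0.25, 0.25, 0.25],   # equally weighted portfolio
              [1.0, -1.0,  0.0,  0.0],    # long asset 1, short asset 2
              [0.5,  0.0,  0.5,  0.0]])   # half in assets 1 and 3
# hypothetical asset returns X
X = np.array([0.01, -0.02, 0.03, 0.005])

portfolio_returns = B @ X   # entry i is sum_j B_ij * X_j
print(portfolio_returns)
```

One matrix product computes all p portfolio returns at once, which is the point of the compact notation.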
50
Back on track
In particular if X is a 2 × 1 vector,
X = (X1, X2)′ and x = (x1, x2)′,
then
F_X(x) = Pr(X1 ≤ x1, X2 ≤ x2),
which in the continuous case becomes
F_X(x) = ∫_{−∞}^{x2} ∫_{−∞}^{x1} f_X(y1, y2) dy1 dy2.   (6)
51
Likewise
f_X(x1, x2) = ∂²F_X(x1, x2) / (∂x1 ∂x2).
When X1 ⊥⊥ X2 then this simplifies to
f_X(x1, x2) = ∂²{F_{X1}(x1) F_{X2}(x2)} / (∂x1 ∂x2)
            = {∂F_{X1}(x1)/∂x1} × {∂F_{X2}(x2)/∂x2}
            = f_{X1}(x1) f_{X2}(x2).
52
[Figure: (a) standard normal density; (b) NIG(1,0,0,1) density; (c) standard normal log-density; (d) NIG(1,0,0,1) log-density.]
53
An important point is that, from Eq. (6),
∫_{−∞}^∞ f_X(y, x2) dy = ∂F_X(∞, x2)/∂x2
                       = ∂Pr(X1 ≤ ∞, X2 ≤ x2)/∂x2
                       = ∂F_{X2}(x2)/∂x2
                       = f_{X2}(x2).
Hence if we integrate out a variable from a density function we produce the ‘marginal density’ of the other random variable.
54
Let’s suppose that X2 is a discrete r.v.
Then the conditional distribution function of X1 takes on the form
F_{X1|X2=x2}(x1) = Pr(X1 ≤ x1 | X2 = x2),
while, if X1 is continuous, we define
f_{X1|X2=x2}(x1) = ∂Pr(X1 ≤ x1 | X2 = x2)/∂x1,
which has the properties of a density.
Now, if both X2 and X1 are continuous r.v.s, we define
f_{X1|X2=x2}(x1) = f_X(x1, x2) / f_{X2}(x2).
Intuitive, but the theory behind this is beyond the scope of this course.
55
2.8 Moments
General case
An expectation of a function of a random variable.
Define, for a continuous X, if it exists,
E{g(X)} = ∫ g(x) f_X(x) dx.   (7)
The expectation obeys some important rules. For example if a, b are constants then
E{a + b g(X)} = a + b E{g(X)}.
This follows from the definition of expectations as solutions to integrals (7).
56
Special ‘base’ cases of moments
The most basic moment is known as the first moment:
E(X) = ∫ x f_X(x) dx.   (8)
We’ve also seen the second moment:
E(X²) = ∫ x² f_X(x) dx.   (9)
Even though you’ll see these much more than others, try to see them as special cases.
57
Example 2.3 If X ∼ N(µ, σ²), then
E(X) = ∫_{−∞}^∞ x (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)} dx
     = µ + ∫_{−∞}^∞ (x − µ) (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)} dx
     = µ,
using the fact that a density integrates to one.
Exercise: fill in the working here (use properties of antisymmetric functions)
58
Multivariate mean
Recall we write
X = (X1, X2, X3, ..., Xq)′.
Now each Xj has a mean, E(Xj), so it would be nice to collect these together. The following notation does this. We define
E(X) = ( E(X1), E(X2), E(X3), ..., E(Xq) )′.
This is the mean of the vector.
59
We wrote the return on p portfolios as
BX,
where B is a p × q weight matrix. Then
E(BX) = BE(X).
60
Why? Recall a mean of a vector is the mean of all the elements of the vector:
E(BX) = ( E(∑_{j=1}^q B1j Xj), E(∑_{j=1}^q B2j Xj), E(∑_{j=1}^q B3j Xj), ..., E(∑_{j=1}^q Bpj Xj) )′.
But, for i = 1, 2, ..., p,
E( ∑_{j=1}^q Bij Xj ) = ∑_{j=1}^q E(Bij Xj) = ∑_{j=1}^q Bij E(Xj).
61
Hence
E(BX) = ( ∑_{j=1}^q B1j E(Xj), ∑_{j=1}^q B2j E(Xj), ..., ∑_{j=1}^q Bpj E(Xj) )′ = BE(X),
as stated. This is an important result for econometrics.
62
2.9 Covariance matrices
Univariate covariance
The covariance of X and Y is defined (when it exists) as
Cov(X, Y) = E[{X − E(X)}{Y − E(Y)}]
          = ∫ {x − E(X)}{y − E(Y)} f_{X,Y}(x, y) dx dy
          = E(XY) − E(X)E(Y).
63
Cov(a + bX, c + dY ) = bdCov(X, Y ).
Hence covariances are location invariant.
Var(aX + bY ) = a2Var(X) + b2Var(Y ) + 2abCov(X, Y ).
64
Independence implies uncorrelatedness
Recall, if the moments exist,
Cov(X, Y) = E(XY) − E(X)E(Y).
So if X ⊥⊥ Y then
Cov(X, Y) = E(X)E(Y) − E(X)E(Y) = 0.
If the covariance between X and Y is zero we say they are uncorrelated:
X ⊥ Y.
So
(X ⊥⊥ Y) =⇒ (X ⊥ Y).
The reverse is not true (though in the Gaussian case it is!).
65
Example 2.4 Suppose
X ∼ N(0, 1),   Y = X².
Then
Cov(X, Y) = E(XY) − E(X)E(Y) = E(X³) = 0.
66
Correlation
The correlation of X and Y is defined (when it exists) as
Cor(X, Y) = Cov(X, Y) / √{Var(X)Var(Y)}.
Now
Cor(X, Y ) ∈ [−1, 1],
which follows from the Cauchy-Schwarz inequality.
67
Think of
X = (X1, X2, X3, ..., Xp)′.
Then we define the covariance matrix of X as
Cov(X) =
( Var(X1)      Cov(X1, X2)  · · ·  Cov(X1, Xp) )
( Cov(X2, X1)  Var(X2)      · · ·  Cov(X2, Xp) )
( ...                                          )
( Cov(Xp, X1)  Cov(Xp, X2)  · · ·  Var(Xp)     )
This is a symmetric p × p matrix.
* Covariance matrices are always ‘positive semi-definite’ (which means that the e-values are all ≥ 0 [and real]).
68
The covariance matrix can be calculated as
Cov(X) = E[{X − E(X)}{X − E(X)}′].
Example 2.5 In the IBM and S&P example then we have approximately that
E(X) = (0.0206, −0.00721)′,
Cov(X) =
( 5.07  1.79 )
( 1.79  1.62 )
A very important result is that if B is a q × p matrix of constants, then
• E(a + BX) = a + BE(X)
• Cov(a + BX) = BCov(X)B′.
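The transformation rule Cov(a + BX) = BCov(X)B′ can be verified by simulation. A NumPy sketch reusing the Example 2.5 covariance matrix; the vector a and matrix B below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# the (approximate) IBM / S&P covariance matrix from Example 2.5
Sigma = np.array([[5.07, 1.79],
                  [1.79, 1.62]])

# draw 200,000 mean-zero 2-vectors with covariance Sigma via Cholesky
L = np.linalg.cholesky(Sigma)
X = L @ rng.standard_normal((2, 200_000))   # columns are draws of X

a = np.array([[1.0], [0.0], [-2.0]])        # 3x1 constant shift
B = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [2.0,  0.0]])                 # 3x2 constant matrix
Y = a + B @ X                               # transformed draws

print(np.cov(Y))        # sample covariance of Y
print(B @ Sigma @ B.T)  # the theoretical B Sigma B'
```

The two matrices agree up to Monte Carlo error, and the shift a has no effect on the covariance, matching the formula.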
69
Correlation matrices
Corresponding to the covariance matrix is the correlation matrix, which is (when it exists)
Cor(X) =
( 1             Cor(X1, X2)  · · ·  Cor(X1, Xp) )
( Cor(X2, X1)   1            · · ·  Cor(X2, Xp) )
( ...                                           )
( Cor(Xp, X1)   Cor(Xp, X2)  · · ·  1           )
= Cor(X)′.
This matrix is invariant to location and scale changes, but obviously not general linear transformations.
70
2.10 Back to distributions
Multivariate normal
The p-dimensional X ∼ N(µ, Σ). E(X) = µ, Cov(X) = Σ. Assume |Σ| > 0. Σ is always symmetric of course. Then
f_X(x) = |2πΣ|^{−1/2} exp{ −(1/2)(x − µ)′ Σ^{−1} (x − µ) },   x ∈ R^p.
Here Σ^{−1} is a matrix inverse, which exists due to the |Σ| > 0 assumption. Further
|2πΣ|^{−1/2} = (2π)^{−p/2} |Σ|^{−1/2}.
Exercises:
1. explain why the density has a single peak at µ.
2. how does this simplify if Σ = σ2I?
3. if I tell you Σ and µ, do you know everything about the normal distribution?
71
Example 2.6 Suppose
Σ = σ²Ip,
which means that the elements of X are independent and homoskedastic. Then
f_X(x) = (2πσ²)^{−p/2} exp{ −(1/(2σ²)) (x − µ)′(x − µ) }
       = (2πσ²)^{−p/2} exp{ −(1/(2σ²)) ∑_{i=1}^p (xi − µi)² }.
72
Let X be a p-dimensional multivariate normal.
Then, if a is q × 1 and B is q × p and both are constants,
Y = a + BX ∼ N(a + Bµ, BΣB′),   (10)
a q-dimensional normal.
That is: all linear transformations of normals are normal.
73
In particular if p = 2, then for X = (X1, X2)′,
Σ = Cov(X) =
( Var(X1)       Cov(X1, X2) )   ( σ1²     ρσ1σ2 )
( Cov(X1, X2)   Var(X2)     ) = ( ρσ1σ2   σ2²   ),
where ρ = Cor(X1, X2).
This is an important model.
In an ‘abuse of notation’, we can write (!) X1 as X and X2 as Y.
In which case we get the formulation
(X, Y)′ ∼ N( (µx, µy)′, Σ ),   Σ = ( σx² ρσxσy ; ρσxσy σy² ).
Keep this model in mind for the next few slides ...
74
2.11 Conditional distributions
Basic recap: Consider two (possibly multivariate) discrete random variables X, Y; then
F_{X|Y=y}(x) = Pr(X ≤ x | Y = y) = Pr(X ≤ x, Y = y) / Pr(Y = y).
Likewise in the continuous case, the conditional density is defined by:
f_{X|Y=y}(x) = ∂F_{X|Y=y}(x)/∂x = f_{X,Y}(x, y) / f_Y(y).
So,
f_{X,Y}(x, y) = f_{X|Y=y}(x) f_Y(y)
(known as the marginal-conditional decomposition).
Useful to consider this in the context of Normals...
75
Example of two standard normals
X and Y are Standard Normals. So, both have mean of 0 and variance of 1. The correlation between them is ρ. In this case
Y | (X = x) ∼ N(ρx, 1 − ρ²).
Put another way,
f_{Y|X=x}(y) = (1/√(2π(1 − ρ²))) exp{ −(y − ρx)²/(2(1 − ρ²)) }.   (11)
It is natural to write:
E(Y|X = x) = ρx;
Var(Y|X = x) = 1 − ρ²,
and we often will. Called ‘Conditional Moments’ ...
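These conditional moments can be seen in a simulation: generate correlated standard normals, keep the draws with X near some value x0, and look at the mean and variance of Y within that slice. A Python sketch with ρ = 0.6 and x0 = 1 (both arbitrary choices of mine):

```python
import random
import statistics

rng = random.Random(4)
rho = 0.6
pairs = []
for _ in range(200_000):
    x = rng.gauss(0.0, 1.0)
    # build Y so (X, Y) are standard bivariate normal with correlation rho
    y = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
    pairs.append((x, y))

x0 = 1.0
# draws of Y whose X landed in a narrow band around x0
bucket = [y for (x, y) in pairs if abs(x - x0) < 0.05]

print(statistics.fmean(bucket))       # near E(Y|X = x0) = rho * x0 = 0.6
print(statistics.pvariance(bucket))   # near Var(Y|X = x0) = 1 - rho^2 = 0.64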
76
Conditional first moment
Recall
f_{X|Y=y}(x) = f_{X,Y}(x, y) / f_Y(y).
The definition of the first conditional moment is
E_{X|Y=y}(X) = ∫ x f_{X|Y=y}(x) dx.
We also write this (as on the previous slide) by
E(X|Y = y) = ∫ x f_{X|Y=y}(x) dx.
77
A concise notation of great use
Recall X and Y as Standard Normals, correlation between them is ρ.
We saw that
E(Y |X = x) = ρx.
It will be really helpful to condense this further:
E(Y |X) = ρX.
Write the random variable itself in the place of the particular value that we know it takes under the conditioning, i.e. capitalize X.
Likewise, we can write “Y ’s variance conditional on X” as:
Var(Y|X) = 1 − ρ².   (12)
78
General conditional moments
More generally, the definition of a conditional moment is
E_{X|Y=y}(g(X)) = ∫ g(x) f_{X|Y=y}(x) dx,
which is a function of y, say h(y).
This gives the random variable h(Y).
... and we could consider its expectation: E_Y(h(Y)), or:
E_Y( E_{X|Y}(g(X)) ),   (13)
i.e. (more concisely using the notation of the last slide):
E( E(g(X) | Y) ).   (14)
79
Law of Iterated Expectations
Now recall
f_{X,Y}(x, y) = f_{X|Y=y}(x) f_Y(y).
Doing some algebra, we have that
E_X(g(X)) = E_Y( E_{X|Y}(g(X)) ).
This is the Law of Iterated Expectations, and is very important.
• It allows you to break complex expectations up into manageable chunks.
You can also write the law as:
E( E(g(X) | Y) ) = E(g(X)).
A related result:
Var_X(X) = E_Y( Var_{X|Y}(X|Y) ) + Var_Y( E_{X|Y}(X|Y) ).
80
General bivariate normal. In this case
(X, Y)′ ∼ N( (µX, µY)′, Σ ),   Σ = ( σX² ρσXσY ; ρσXσY σY² ),
then
Y | (X = x) ∼ N( µY + (ρσY/σX)(x − µX), σY²(1 − ρ²) ).
81
Again,
Y | (X = x) ∼ N( µY + (ρσY/σX)(x − µX), σY²(1 − ρ²) ).
To be brief we often write:
Y | X ∼ N( µY + (ρσY/σX)(X − µX), σY²(1 − ρ²) ).
• Conditional variance does not depend upon x or X.
• Change in the conditional mean is
(ρσY/σX)(x − µX),
so is linear in x. The effect is compared to the mean, i.e. x − µX. Dividing by σX removes the scale of x; multiplying by σY puts the variable onto the y scale.
82
Example 2.7 Y is the return on an asset, X is the return on the market portfolio. Then
β_{Y|X} = ρσY/σX
is often called the beta of Y and is a measure of how Y moves with the market.
Notice that we can also write:
β_{Y|X} = Cov(X, Y) / Var(X).
83
Martingale
In modelling dynamics, martingales play a large role.
Consider a sequence of asset prices recorded through time,
Y1, Y2, Y3, ...,
where the subscript reflects time. A natural object to study is
E(Yi | Y1, ..., Yi−1),
the conditional expectation (which we assume exists) of “the future given the past”. If
E(Yi | Y1, ..., Yi−1) = Yi−1,
then the sequence is said to be a martingale with respect to its own past history.
Exercise: Use the Law of Iterated Expectations to prove that if Yi is any Martingale, with fixed Y1, then
E(Y3) = Y1.
84
3 Estimators
3.1 Introduction
A statistic S(X) is a function of a (vector) random variable X.
When we learn about a feature of the probability model we say we are estimating
the model.
If S(X) is intended to describe a feature of the probability model, then we callit an estimator.
If x is the observed value of X, then we call S(x) the resulting estimate.
85
Example 3.1 Let
S(X) = (1/n) ∑_{i=1}^n Xi.
If Xi ∼ NID(µ, σ²) then, using the fact that S(X) is a linear combination of normals, we have that
S(X) ∼ N(µ, σ²/n).
If n is very large the estimator is very close to µ, the average value of the normal distribution.
86
3.2 Bias and mean square error of estimators
Estimate some quantity θ.
Wish for S(X) to be close to θ on average.
Bias: E{S(X)} − θ.
Example 3.2 If Xi ∼ NID(µ, σ²) then
S(X) = X̄ = (1/n) ∑_{i=1}^n Xi,
the sample mean (sample average), has a zero bias as an estimator of µ.
When the bias is zero, the estimator is said to be unbiased.
87
Very large dispersion?
Imprecision of an estimator can be measured with the Mean Square Error criterion:
MSE := E[{S(X) − θ}²] = Var{S(X)} + [E{S(X)} − θ]².
RMSE = Root MSE = square root of the MSE.
→ Which is better: an unbiased estimator, or a biased estimator which is more
precise?
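The decomposition MSE = variance + bias² can be checked by Monte Carlo. The sketch below uses a deliberately biased estimator of my own invention (the sample mean shrunk toward zero by a factor 0.8) so that both terms are nonzero:

```python
import random
import statistics

rng = random.Random(5)
mu, sigma, n = 2.0, 1.0, 20

def shrunk_mean(xs):
    # a deliberately biased estimator: shrink the sample mean toward zero
    return 0.8 * statistics.fmean(xs)

# repeat the experiment many times to estimate the sampling distribution
estimates = []
for _ in range(50_000):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    estimates.append(shrunk_mean(sample))

mse = statistics.fmean((e - mu) ** 2 for e in estimates)
bias = statistics.fmean(estimates) - mu        # close to 0.8*mu - mu = -0.4
var = statistics.pvariance(estimates)
print(mse, var + bias ** 2)   # the two nearly coincide
```

Whether the shrunk estimator beats the unbiased sample mean on MSE is exactly the bias-versus-precision trade-off the question above asks about.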
88
4 Simulating random variables
Simulation is a key technique in advanced modern econometrics.
Produce random variables from known distribution functions.
4.1 Pseudo random numbers
All of the simulation methods are built out of draws based on a sequence of independent and identically distributed (standard) uniform random numbers Ui ∈ [0, 1].
Let’s regard the problem of producing such uniform numbers as solved - Matlab does this for us.
An example is given below (!)
Ui
.734
.452
.234
.123
.987
89
4.2 Inverting distribution functions
Key point: given a source of unlimited simulated i.i.d. uniforms we can produce i.i.d. draws from any continuous distribution F_X(x).
Proof: As Ui is uniform,
Pr(Ui ≤ F_X(x)) = F_X(x).
Thus
Pr(Ui ≤ F_X(x)) = Pr(F_X^{−1}(Ui) ≤ x) = Pr(Xi ≤ x).
So if we take
Xi = F_X^{−1}(Ui),   (15)
then we produce random numbers from any continuous distribution:
→ plug the stream of simulated uniforms into the quantile function (15).
Discrete random variables are very similar but need some attention at the jump-points in the CDF, F.
90
Example 4.1 The exponential distribution. Recall F_X(x) = 1 − exp(−x/µ), and so the quantile function is
F_X^{−1}(p) = −µ log(1 − p).
Hence
−µ log(1 − Ui)
are i.i.d. exponential draws. E.g. with µ = 1:
Ui      Xi
.734    1.324
.452    0.601
.234    0.266
.123    0.131
.987    4.343
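The table can be reproduced in a couple of lines by applying the quantile function (15) to the uniforms from the earlier slide:

```python
import math

mu = 1.0
u_draws = [0.734, 0.452, 0.234, 0.123, 0.987]   # the uniforms from the slide
# inverse-CDF transform: X_i = F^{-1}(U_i) = -mu * log(1 - U_i)
x_draws = [-mu * math.log(1.0 - u) for u in u_draws]
print([round(x, 3) for x in x_draws])
```

The outputs match the Xi column of the table to rounding error.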
91
5 Asymptotic approximation
5.1 Motivation
Classical convergence:

Xn = 3 + 1/n → 3

as n → ∞.

A little more fuzzy when we think of

Xn = 3 + Y/n →? 3,
where Y is a random variable.
There are different measures of convergence. Some need moments, others don't: "convergence in probability" and "convergence in distribution".

Formally we will think of a sequence of random variables X1, X2, . . . , Xn which, as n gets large, will be such that Xn will behave like some other random variable
or constant X.
92
Example 5.1 We are interested in

Xn = (1/n) ∑_{j=1}^{n} Yj.

Then it forms a sequence

X1 = Y1,  X2 = (1/2)(Y1 + Y2),  X3 = (1/3)(Y1 + Y2 + Y3).

What does (1/n) ∑_{j=1}^{n} Yj behave like for large n? What does Xn converge to for large n?
93
5.2 Definitions
Sequence of random variables Xn. Ask if
Xn − X
is small as n goes to infinity.
You can measure smallness in many ways and so there are lots of different notions of convergence.
We discuss three, the second of which will be the most important for us.
94
Definition. (Convergence in mean square) Let X and X1, X2, . . . be random variables. If

lim_{n→∞} E[(Xn − X)²] = 0,

then the sequence X1, X2, . . . is said to converge in mean square to the random variable X. A shorthand notation is

Xn →^{m.s.} X.    (16)

Necessary and sufficient conditions for Xn →^{m.s.} X are that

lim_{n→∞} E(Xn − X) = 0  [asymptotic unbiasedness]  and  lim_{n→∞} Var(Xn − X) = 0.
95
Suppose Y1, ..., Yn are i.i.d. with mean µ and variance σ². Then define

Xn = (1/n) ∑_{i=1}^{n} Yi,

which has

E(Xn) = (1/n) ∑_{i=1}^{n} E(Yi) = µ,

and

Var(Xn) = (1/n²) Var(∑_{i=1}^{n} Yi) = (1/n²) ∑_{i=1}^{n} Var(Yi) = σ²/n.

Hence Xn is unbiased and the variance goes to zero. Hence

Xn →^{m.s.} µ.
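The σ²/n rate can be seen by simulation (a Python/numpy sketch; the Exponential(1) population, which has variance 1, is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, reps = 1.0, 20_000
for n in (10, 100, 1000):
    # reps independent sample means, each from n i.i.d. Exponential(1) draws
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    print(n, round(xbar.var(), 5), sigma2 / n)   # simulated vs theoretical variance
```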
96
Definition. (Convergence in probability) If for all ε, η > 0 there exists an n0 s.t.

Pr(|Xn − X| < η) > 1 − ε,  ∀ n > n0,

then the sequence X1, X2, . . . is said to converge in probability to the random variable X. A shorthand notation is

Xn →^p X.    (17)
97
Definition. (Convergence almost surely) Let X and X1, X2, . . . be random variables. If, for all ε, η > 0, there exists an n0 s.t.

Pr(|Xn − X| < η, ∀ n > n0) > 1 − ε,

then we say that Xn almost surely converges to X, which we write as Xn →^{a.s.} X.

Thus almost sure convergence is about ensuring that the joint behaviour of all events n > n0 is well behaved.

But convergence in probability just looks at the probabilities for each n.
98
Xn →^{a.s.} X ⇒ Xn →^p X.

Further note that Xn →^{a.s.} X neither implies nor is implied by Xn →^{m.s.} X.
99
Theorem. Weak Law of Large Numbers (WLLN). Let Xi ∼ iid, with E(Xi) and Var(Xi) existing. Then

(1/n) ∑_{i=1}^{n} Xi →^p E(Xi),

as n → ∞.

Proof. See lecture notes (uses Chebyshev's inequality, or the generic result that convergence in mean square implies convergence in probability).
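The WLLN can also be seen numerically (a Python/numpy sketch with Xi ∼ U[0, 1], so E(Xi) = 1/2, and a fixed η = 0.01): the probability that the sample mean lands within η of 1/2 rises towards one as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)
eta, reps = 0.01, 2_000
fracs = []
for n in (10, 100, 10_000):
    xbars = rng.uniform(size=(reps, n)).mean(axis=1)
    # Monte Carlo estimate of Pr(|Xbar_n - 1/2| < eta)
    fracs.append(np.mean(np.abs(xbars - 0.5) < eta))
print(fracs)
```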
100
Theorem. (Kolmogorov's) Strong Law of Large Numbers (SLLN). Let Xi ∼ iid, with E(Xi) existing. Then

(1/n) ∑_{i=1}^{n} Xi →^{a.s.} E(Xi),

as n → ∞.

Proof. Difficult. See, for example, Gallant (1997, p. 132).
101
5.3 Some payback
The most important rules are

• If An →^p a, then g(An) →^p g(a), where g(·) is a continuous function at a.

Example 5.2 Suppose Xi ∼ iid, with E(Xi) and Var(Xi) existing, and E(Xi) non-zero. Then

(1/n) ∑_{i=1}^{n} Xi →^p E(Xi),

which implies

1 / ((1/n) ∑_{i=1}^{n} Xi) →^p 1/E(Xi).
102
• If g and h are both continuous functions and

An →^p a,  Bn →^p b,

as n → ∞, then

g(An)h(Bn) →^p g(a)h(b).

Suppose Yi ∼ iid, with E(Yi) and Var(Yi) existing. Then

((1/n) ∑_{i=1}^{n} Xi)((1/n) ∑_{i=1}^{n} Yi) →^p E(Xi)E(Yi).
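Both rules can be sketched in a few lines (Python/numpy; the Exp(2) and U[0, 1] populations are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.exponential(2.0, size=n)   # E(Xi) = 2
y = rng.uniform(size=n)            # E(Yi) = 1/2

recip = 1.0 / x.mean()             # continuous-mapping rule: near 1/E(Xi) = 0.5
prod = x.mean() * y.mean()         # product rule: near E(Xi)E(Yi) = 1.0
print(recip, prod)
```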
103
5.4 Some more theory
Refined measure of convergence
Convergence almost surely or in probability is quite a rough measure, for it says that

Xn − X

implodes to zero with large values of n.

It does not indicate the speed of convergence, nor give any distributional shape to Xn − X.

To improve our understanding we need a concept called convergence in distribution.
104
Definition. (Convergence in Distribution) The sequence X1, X2, . . . of random variables is said to converge in distribution to the random variable X if

FXn(x) → FX(x)    (18)

at every point x where FX is continuous. A shorthand notation is

Xn →^d X.    (19)
105
Generic tools — Central Limit Theorems
Most famous of these is the Lindeberg-Levy ‘CLT’.
Theorem (Lindeberg-Levy) Let X1, X2, . . . be independent, identically distributed random variables, so that EXi = µ, Var(Xi) = σ².

Set

X̄n = (X1 + · · · + Xn)/n.

Then

√n(X̄n − µ) →^d N(0, σ²).
106
Example Suppose Xi are i.i.d. χ₁² (that is, informally, N(0, 1)²). Xi has mean of 1 and variance of 2. The Lindeberg-Levy CLT shows that

√n(X̄ − 1) →^d N(0, 2).
[Figure 4: six estimated density panels, not reproduced here.]

Figure 4: Left panel: estimated density, using 10,000 simulations, of √n(X̄ − 1) from a sample of iid χ₁² variables. Right panel looks at √n(log(X̄) − log(1)). From top to base, graphs have n = 3, 10 and 50.
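The left-hand panels of Figure 4 can be mimicked numerically (a Python/numpy sketch reporting moments rather than density plots): for each n, √n(X̄ − 1) has mean near 0 and variance near 2, and its distribution approaches N(0, 2) as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 100_000
for n in (3, 10, 50):
    xbar = rng.chisquare(1, size=(reps, n)).mean(axis=1)
    z = np.sqrt(n) * (xbar - 1.0)   # the CLT-scaled statistic
    print(n, round(z.mean(), 3), round(z.var(), 3))
```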
107
Very important results in this context are due to Slutsky's Theorem:

• Suppose Xn →^d X and Yn →^p µ. Then XnYn →^d Xµ and Xn/Yn →^d X/µ if µ ≠ 0.

• More generally, suppose Xn →^d X and Yn →^p µ. Let ϕ be a continuous mapping. Then ϕ(Xn, Yn) →^d ϕ(X, µ).
108
Suppose X1, ..., Xn are univariate i.i.d. with mean µ and variance σ².

Lindeberg-Levy:

√n(X̄n − µ)/σ →^d N(0, 1).

And we can also show that:

σ̂² = (1/n) ∑_{i=1}^{n} (Xi − X̄n)² = (1/n) ∑_{i=1}^{n} Xi² − X̄n² →^{a.s.} σ².

Then by Slutsky's Theorem

√n(X̄n − µ)/σ̂ →^d N(0, 1).
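A sketch of this studentisation step (Python/numpy; the exponential population with µ = 5 is an illustrative assumption): replacing σ by the estimate σ̂ leaves the statistic approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n, reps = 5.0, 200, 50_000
x = rng.exponential(mu, size=(reps, n))   # mean mu, standard deviation mu

xbar = x.mean(axis=1)
sighat = x.std(axis=1)                    # sigma-hat, dividing by n (ddof=0)
t = np.sqrt(n) * (xbar - mu) / sighat     # studentised mean
print(round(t.mean(), 3), round(t.var(), 3))   # near 0 and 1
```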
109
Multivariate CLTs:
These will be very important for us.
(Multivariate Lindeberg-Levy)

Let X1, X2, . . . be i.i.d. r.v.s, so that EXi = µ, Var(Xi) = Σ.

Set

X̄n = (X1 + · · · + Xn)/n.

Then

√n(X̄n − µ) →^d N(0, Σ).
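A bivariate sketch (Python/numpy; the pair construction (E1, E1 + E2) from i.i.d. unit exponentials is an illustrative assumption, giving mean (1, 2) and covariance matrix [[1, 1], [1, 2]]): the covariance of √n(X̄n − µ) is close to the population Σ.

```python
import numpy as np

rng = np.random.default_rng(6)
reps, n = 20_000, 100
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 1.0], [1.0, 2.0]])

e = rng.exponential(1.0, size=(reps, n, 2))
# Correlated pairs (E1, E1 + E2), with mean mu and covariance Sigma above
pairs = np.stack([e[..., 0], e[..., 0] + e[..., 1]], axis=-1)

z = np.sqrt(n) * (pairs.mean(axis=1) - mu)   # shape (reps, 2)
print(np.cov(z.T))                           # close to Sigma
```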
110