Advanced Time Series Econometrics Year 3 Notes


  • Advanced Time Series Econometrics (L12621)

    Patrick Marsh, School of Economics, Room B42 SCGB

    e-mail: [email protected]

    January 15, 2015

    Course Content

    Part 1. Non-Stationary Time Series

    These lecture notes are designed to be self-contained. Additional reading, particularly of peer-reviewed published research articles, will be detailed in the notes with links to the articles in Moodle. Much of the material is covered by standard advanced econometrics textbooks, the most useful of which is: Hamilton J.D. Time Series Analysis (Princeton, 1994).

    Sporadic references to specific chapters will appear in the notes below. There are additional references to several key articles contained within the notes. These articles are available on the Moodle website for this module.

    1 Non-Stationary Time Series [Hamilton Chapters 15 and 16]

    Recall from Econometrics II last year the definition of a stationary time series:

    (S1): E(y_t) = μ, -∞ < μ < ∞, for all t.
    (S2): Var(y_t) = σ² < ∞ for all t.
    (S3): Cov(y_t, y_{t-k}) = γ(k) (the autocovariance function) for all t.

    These imply that the mean, variance and autocovariances are all both finite and constant for all t. A non-stationary time series is one which violates one, or more, of these three conditions.

  • Example 1: Linear Trending Series

    Consider the series

    y_t = α + βt + ε_t, t = 1, .., T,

    where T denotes the sample size, ε_t ~ IID(0, σ²) and β ≠ 0. Observe that

    E(y_t) = α + βt + E[ε_t] = α + βt.

    Consequently, since E(y_t) is not constant for all t, condition (S1) is not satisfied and the process is non-stationary.

    In this course we will focus on two particular types of violation of stationarity which seem to routinely occur with both economic and financial data. The first is the linear trend in mean, as in Example 1.

    The second is what is known as an integrated or difference-stationary process. Linear trends and integration can be hard to differentiate between and, indeed, often occur together. First we will define exactly what we mean by an integrated or difference-stationary process.

    1.1 Preliminaries

    Definition: A time series {y_t} is said to be integrated of order d, denoted I(d), if it must be differenced d times to make the resulting series both stationary and invertible.

    Notation: The dth difference of the series is given by

    Δ^d y_t = (1 - L)^d y_t,

    where L is the lag operator, defined such that L^k y_t = y_{t-k}, for k = 0, 1, 2, ... Most often (but not always) d = 1 (i.e. first differences), so we have Δy_t = y_t - y_{t-1}.

    ARMA(p, q) Models

    Recall from Econometrics II the ARMA(p, q) model:

    y_t = φ_1 y_{t-1} + ... + φ_p y_{t-p} + ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q},

    or equally φ(L) y_t = θ(L) ε_t,

    where φ(L) = 1 - φ_1 L - ... - φ_p L^p and θ(L) = 1 + θ_1 L + ... + θ_q L^q.

    If the roots of φ(z) = 0 all lie strictly outside the unit circle |z| = 1 then {y_t} is stationary.

    If the roots of θ(z) = 0 all lie strictly outside the unit circle |z| = 1 then {y_t} is invertible.

    The ARIMA class of models is based on the observation that many economic time series appear stationary after differencing.


  • 1.2 ARIMA(p, d, q) Models

    An ARIMA(p, d, q) process is one which takes the form

    Δ^d y_t = φ_1 Δ^d y_{t-1} + ... + φ_p Δ^d y_{t-p} + ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q},

    or, φ(L)(1 - L)^d y_t = θ(L) ε_t,

    where φ(z) and θ(z) satisfy the stationarity and invertibility conditions above, respectively. Thus the ARIMA(p, d, q) is just a standard ARMA(p, q) applied to the differenced series (1 - L)^d y_t.

    Example 2

    Consider the process

    y_t = y_{t-1} + ε_t, t = 1, .., T, (1)

    with ε_t ~ IID(0, σ²) and starting value y_0 a fixed (finite) constant. We need to check the stationarity conditions. By repeated back substitution we have

    y_t = y_0 + Σ_{i=1}^t ε_i,

    so that E(y_t) = y_0 + E(Σ_{i=1}^t ε_i) = y_0. This is both finite and constant for all t. Hence condition (S1) is satisfied. However,

    Var(y_t) = Var(y_0 + Σ_{i=1}^t ε_i) = 0 + Var(Σ_{i=1}^t ε_i) = tσ²,

    which violates condition (S2) since the variance depends on t. Thus the process in (1) is not stationary. However, if we apply differences with d = 1 we obtain

    Δy_t = y_t - y_{t-1} = ε_t,

    and then E(Δy_t) = E[ε_t] = 0, Var(Δy_t) = Var(ε_t) = σ² and Cov[Δy_t, Δy_{t-k}] = Cov[ε_t, ε_{t-k}] = 0. So conditions (S1), (S2) and (S3) are all met and so Δy_t is a stationary process.

    The process in (1) is called a random walk. It is the simplest possible example of an I(1) series because Δy_t is an IID process; that is, y_t ~ ARIMA(0, 1, 0).
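    The failure of (S2) for the random walk is easy to check numerically. The following sketch (plain Python; T = 200, σ = 1 and y_0 = 5 are made-up illustrative values, not from the notes) simulates many random walks and compares the Monte Carlo mean and variance of y_T with the theoretical values y_0 and Tσ²:

```python
import random

random.seed(0)
T, sigma, y0 = 200, 1.0, 5.0   # illustrative values, not from the notes
R = 2000                       # Monte Carlo replications

# Simulate R independent random walks y_t = y_{t-1} + eps_t and record y_T
yT = []
for _ in range(R):
    y = y0
    for _ in range(T):
        y += random.gauss(0.0, sigma)
    yT.append(y)

mean_yT = sum(yT) / R
var_yT = sum((y - mean_yT) ** 2 for y in yT) / R

# (S1) holds: E(y_T) = y0.  (S2) fails: Var(y_T) = T * sigma^2 grows with t.
print(mean_yT)   # close to 5.0 = y0
print(var_yT)    # close to 200.0 = T * sigma^2
```

    With T doubled, the simulated variance roughly doubles as well, which is exactly the violation of (S2) derived above.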

    Example 3

    Consider the process

    φ(L) y_t = ε_t, ε_t ~ IID(0, σ²),

    with φ(L) = (1 - L)(1 - 0.5L). Note that immediately we can say y_t is non-stationary (think of what the roots of φ(z) = 0 must be). In this case the relevant ARIMA model is

    (1 - 0.5L) Δy_t = ε_t,


  • or we could instead write φ(L) Δy_t = θ(L) ε_t,

    with φ(L) = (1 - 0.5L) and θ(L) = 1.

    Example 4: What is the order of integration of the series

    y_t = ε_t - ε_{t-1}, ε_t ~ IID(0, σ²)?

    In terms of the first two conditions:

    (S1) E(y_t) = E(ε_t) - E(ε_{t-1}) = 0
    (S2) Var(y_t) = Var(ε_t) + Var(ε_{t-1}) - 2Cov[ε_t, ε_{t-1}] = 2σ²,

    while for (S3) with k = 1 we have

    Cov[y_t, y_{t-1}] = E[y_t y_{t-1}] = E[(ε_t - ε_{t-1})(ε_{t-1} - ε_{t-2})] = -E(ε²_{t-1}) = -σ²,

    while for k > 1 we have Cov[y_t, y_{t-k}] = E[y_t y_{t-k}] = E[(ε_t - ε_{t-1})(ε_{t-k} - ε_{t-k-1})] = 0. Consequently the series is stationary. However it is NOT I(0) because it is not invertible. Writing the process in ARMA form we have

    y_t = θ(L) ε_t = (1 - L) ε_t,

    which is a moving average unit root process. Moreover, no amount of differencing will yield a process which is BOTH stationary and invertible.

    In fact this type of process often occurs when the difference operator has been applied too many times, i.e. the process has been over-differenced. We could say that since y_t = Δε_t and ε_t is stationary and invertible, then y_t is I(-1). We can say that all I(d) processes with d < 0 are non-invertible.
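    The autocovariances derived in Example 4 can be checked by simulation. This sketch (plain Python; σ = 1 and the sample length are arbitrary choices) estimates γ(0), γ(1) and γ(2) for y_t = ε_t - ε_{t-1}:

```python
import random

random.seed(1)
n = 100_000
eps = [random.gauss(0.0, 1.0) for _ in range(n + 1)]
y = [eps[t] - eps[t - 1] for t in range(1, n + 1)]   # y_t = eps_t - eps_{t-1}

def acov(x, k):
    """Sample autocovariance of x at lag k."""
    m = sum(x) / len(x)
    return sum((x[t] - m) * (x[t - k] - m) for t in range(k, len(x))) / (len(x) - k)

print(acov(y, 0))   # about  2 = 2*sigma^2
print(acov(y, 1))   # about -1 = -sigma^2
print(acov(y, 2))   # about  0
```

    The estimates match (S2) and the (S3) calculations above: a stationary series, but one generated by over-differencing white noise.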

    Example 5

    Consider

    y_t = α + βt + ε_t, ε_t ~ IID(0, σ²).

    Many (old) statistics textbooks advise dealing with the non-stationarity in this model (note E(y_t) = α + βt) by taking first differences, yielding

    Δy_t = β + Δε_t.

    However, doing so yields a series which is stationary but non-invertible, similar to the previous example. Note also that it is difficult to do any inference on β if we difference this way.

    What should we do instead? One option is to de-trend by running an OLS regression of y_t on the constant and linear trend (this is the simplest regression model with x_t = t) and then take the residuals

    ε̂_t = y_t - α̂ - β̂t,


  • where α̂ and β̂ are the OLS estimators from that regression. This approach works fine if the process is I(0), but there are problems, which we shall explore later, if it is I(1).

    One solution to the problems involving non-stationary time series is to transform them to stationary series, typically through differencing. Since the transformed series are then stationary, standard results for modelling such series (and performing inference on them) hold. As we shall see, this is not the case for non-stationary series. Moreover, as we have seen, there are problems associated with this approach.

    Example 6

    Suppose that log(GDP_t) is non-stationary but Δlog(GDP_t) is stationary. It is relatively common to work with the latter, i.e. growth rates of GDP. However, as we saw in Example 5, we do lose information on the level of GDP. This may be acute if we want to measure the long-run relationships between GDP and other variables. As we'll see later, this is where CO-INTEGRATION (i.e. what Sir Clive Granger won his Nobel Prize for) becomes an extremely important tool.

    Specifically, differences of variables only measure short-run effects, because these are changes, not the level or long-run position of the variable. By working in differences we lose information on the long-run effects.
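    The de-trending regression from Example 5 (OLS of y_t on a constant and t, then take residuals) can be sketched in a few lines of plain Python; α = 2, β = 0.5, σ = 1 and T = 120 are made-up illustrative values:

```python
import random

random.seed(2)
T, alpha, beta = 120, 2.0, 0.5          # made-up parameter values
t = list(range(1, T + 1))
y = [alpha + beta * s + random.gauss(0.0, 1.0) for s in t]

# OLS of y_t on a constant and the linear trend t
t_bar = sum(t) / T
y_bar = sum(y) / T
beta_hat = sum((a - t_bar) * (b - y_bar) for a, b in zip(t, y)) \
           / sum((a - t_bar) ** 2 for a in t)
alpha_hat = y_bar - beta_hat * t_bar

# The de-trended series is the vector of OLS residuals
resid = [b - alpha_hat - beta_hat * a for a, b in zip(t, y)]

print(beta_hat)     # close to the true slope 0.5
print(sum(resid))   # OLS residuals sum to (numerically) zero
```

    With the I(0) error used here the residuals behave like a stationary series; the later sections explain why the same recipe misbehaves when the error is I(1).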

    1.2.1 Drifts and Trends

    So far we have only considered zero-mean ARIMA processes. In practice series rarely have means which are identically zero. An example of an ARIMA process with a non-zero mean is given by the model

    y_t = α + βt + v_t, t = 1, .., T,
    v_t = φ_1 v_{t-1} + ε_t, ε_t ~ IID(0, σ²). (2)

    If |φ_1| < 1 (the usual condition for stationarity of an AR(1)) we refer to this model as a trend-stationary model. This is because the stochastic part of the model, v_t, satisfies the stationarity conditions, so that simply subtracting the unconditional mean α + βt from y_t would yield a stationary (and invertible) series.

    If φ_1 = 1, so that v_t = v_{t-1} + ε_t is I(1), and hence so is y_t, then differencing gives

    Δy_t = α + βt + v_t - (α + β(t - 1) + v_{t-1}) = β + ε_t.

    This model is referred to as a random walk with drift, where the drift term is given by β.

    Regardless of whether |φ_1| < 1 or φ_1 = 1, the process has a linear trend in the mean since, whatever the value of φ_1, E(y_t) = α + βt.


  • DEFINITION: An ARMA(p, q) model, φ(L) y_t = θ(L) ε_t, is said to contain d unit roots if d of the solutions to φ(z) = 0 lie exactly on the unit circle, |z| = 1. Note that any roots with modulus 1 are also valid, e.g. -1, i, -i etc.

    Equivalently, we can say that an I(d) series has, or admits, d unit roots in its autoregressive polynomial.

    Example 7: Consider the process

    y_t = φ_1 y_{t-1} + ε_t, ε_t ~ IID(0, σ²).

    If φ_1 = 1 then φ(z) has a root equal to 1 (a unit root) and the process is I(1) (in fact it is the random walk process), while if |φ_1| < 1 the root is stable and the process is I(0).

    Example 8: Consider the ARMA(2, 0) process

    y_t = φ_1 y_{t-1} + φ_2 y_{t-2} + ε_t, ε_t ~ IID(0, σ²).

    If we rewrite this equation as

    y_t - y_{t-1} = (φ_1 - 1) y_{t-1} + φ_2 y_{t-2} + ε_t
                  = (φ_1 - 1) y_{t-1} - (φ_1 - 1) y_{t-2} + (φ_1 + φ_2 - 1) y_{t-2} + ε_t,

    then we can write

    Δy_t = (φ_1 - 1) Δy_{t-1} + (φ_1 + φ_2 - 1) y_{t-2} + ε_t.

    If φ_1 + φ_2 = 1 then this model can be written as

    Δy_t = (φ_1 - 1) Δy_{t-1} + ε_t.

    So that if also 0 < φ_1 < 2, then this is an ARIMA(1, 1, 0) model and the process has one (d = 1) unit root and one stable root. It is therefore I(1).

    If, however, φ_1 = 2 and φ_2 = -1, then we obtain Δy_t = Δy_{t-1} + ε_t, or

    Δy_t - Δy_{t-1} = ε_t
    Δ(Δy_t) = ε_t
    Δ²y_t = ε_t,

    so that the process has two unit roots and is I(2). In terms of a lag polynomial we have

    φ(L) = 1 - φ_1 L - φ_2 L² = 1 - 2L + L² = (1 - L)² = Δ²,

    in this case.
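    The factorizations in Example 8 can be verified by solving φ(z) = 1 - φ_1 z - φ_2 z² = 0 directly. A short sketch (plain Python; φ_1 = 1.5, φ_2 = -0.5 is an illustrative choice satisfying φ_1 + φ_2 = 1, not a value from the notes):

```python
import cmath

def ar2_roots(phi1, phi2):
    """Roots of phi(z) = 1 - phi1*z - phi2*z^2, sorted by modulus."""
    a, b, c = -phi2, -phi1, 1.0              # rewrite as a*z^2 + b*z + c = 0
    disc = cmath.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    return sorted(roots, key=abs)

# phi1 + phi2 = 1 with 0 < phi1 < 2: one unit root, one stable root -> I(1)
print(ar2_roots(1.5, -0.5))   # roots z = 1 and z = 2

# phi1 = 2, phi2 = -1: phi(z) = (1 - z)^2, a double unit root -> I(2)
print(ar2_roots(2.0, -1.0))   # both roots equal to 1
```

    The first case factors as (1 - L)(1 - 0.5L), matching the one-unit-root ARIMA(1, 1, 0) form derived above.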


  • 1.3 Properties of Unit Root Series

    Consider, once again, the AR(1) process:

    y_t = φ_1 y_{t-1} + ε_t, ε_t ~ IID(0, σ²),

    with initial condition y_0. By repeated back substitution we obtain that

    y_t = ε_t + φ_1 ε_{t-1} + φ_1² ε_{t-2} + ... + φ_1^{t-1} ε_1 + φ_1^t y_0 = Σ_{i=0}^{t-1} φ_1^i ε_{t-i} + φ_1^t y_0.

    For the random walk case, where φ_1 = 1:

    (i) Var(y_t) = tσ² → ∞ as t → ∞.
    (ii) The initial condition, y_0, matters and does not vanish as t → ∞. Contrast this with the stationary case |φ_1| < 1, in which φ_1^t y_0 → 0 as t → ∞. The former is an example of the long memory property of integrated series (while I(0) series are termed short memory). Similarly, at t = 50, say, the weight of the shock from t = 49 (i.e. ε_49) is the same as it will be when the time series reaches t = 50000000.

    Note that if |φ_1| > 1 (termed an explosive root, and the series an explosive AR(1)) then these problems are exacerbated.

    1.4 Comparisons with Trend-Stationary Models

    Consider the following two models, both observed for t = 1, 2, .., T:

    (a): y_t = α + βt + u_t
    (b): y_t = β + y_{t-1} + u_t,

    where in each case u_t denotes an I(0) process. Both of these models have been used extensively to model real macroeconomic time series data. Note that they both capture the general pattern of macroeconomic behaviour, i.e. series combining trends and random fluctuations: e.g. GDP trends upwards (we hope) but over time there are fluctuations about the trend. (See figures 1 & 2 on Moodle.)

    TREND-STATIONARY MODEL (a): says that GDP is growing along the trend line. There are random fluctuations about this trend (the u_t's) but their effects are only short term (short memory). So after, for example, an earthquake, war or other significant event, GDP_t has a tendency to return to the trend line. Hence shocks to the economy have only a transient impact in this model.

    UNIT ROOT MODEL (b): says that GDP grows on average by β each year, but the effects of unexpected shocks (the u_t's) are persistent (long memory). So after, for example, an earthquake, war or other significant event, GDP_t remains below trend and starts growing again from this new lower level. Shocks to the economy have a permanent impact.

    Unit root models are typically associated with real business cycle models, while trend-stationary models are associated with Keynesian theories.


  • 1.5 The Spurious Regression Problem

    Suppose that {y_t} and {x_t} are both unit root I(1) series, both observed for t = 1, 2, ..., T, i.e.

    y_t = y_{t-1} + v_t & x_t = x_{t-1} + w_t,

    where v_t and w_t are I(0) series. Consider running the OLS regression of y on x,

    y_t = α + βx_t + e_t.

    Then:

    The OLS estimator for β has a non-degenerate limiting distribution. This is in contrast to the case where both x and y are stationary, where β̂ →p β. When both x and y are I(1) processes the fitted relationship is just an outcome of some random variable and not related to the actual relationship between the variables.

    The t-statistic for testing hypotheses on β does not have a t-distribution, even asymptotically as the sample size tends to infinity. Worse, it diverges as the sample size increases.

    The usual R² measure has a non-degenerate limiting distribution; it does not converge to the true correlation between x and y.

    Now consider the case where v_t and w_t are independent (and hence so are x_t and y_t), so that in the regression β = 0.

    If we run a regression of y on x then we will likely get a value of β̂ very different from 0, even in very large sample sizes.

    R² will not converge to zero, the true correlation between x and y. It will suggest there is a good fit of the regression even though there shouldn't be.

    Tests of H0: β = 0 will tend to reject far too often if we use critical values from the t-distribution and, moreover, in large samples we will reject H0 no matter what (finite) critical values we use.

    These findings tend towards the same conclusion: that even if y_t and x_t are independent, it is very possible that standard estimators and tests will mislead you into thinking there is a (long-run) relationship between the variables y and x. This is called the spurious regression problem. Two papers exploring this problem in considerable detail are Granger and Newbold (1974) and Phillips (1986).

    The spurious regression problem is the principal reason why it is vital to pre-test data for the presence of a unit root. This is what the remainder of this half of the module will focus upon.
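    The over-rejection described above is easy to reproduce with a small Monte Carlo (plain Python; the sample size, replication count and seed are arbitrary choices). Two independent random walks are generated, y is regressed on x by OLS, and the conventional t-test of H0: β = 0 at the nominal 5% level is applied:

```python
import random

random.seed(3)
T, R = 200, 500        # arbitrary sample size and replication count
reject = 0

for _ in range(R):
    # Two INDEPENDENT driftless random walks
    x_val = y_val = 0.0
    xs, ys = [], []
    for _ in range(T):
        x_val += random.gauss(0.0, 1.0)
        y_val += random.gauss(0.0, 1.0)
        xs.append(x_val)
        ys.append(y_val)

    # OLS of y on a constant and x
    mx = sum(xs) / T
    my = sum(ys) / T
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    beta_hat = sxy / sxx

    # Conventional t-statistic for H0: beta = 0
    sse = sum((b - my - beta_hat * (a - mx)) ** 2 for a, b in zip(xs, ys))
    s2 = sse / (T - 2)
    t_stat = beta_hat / (s2 / sxx) ** 0.5
    if abs(t_stat) > 1.96:
        reject += 1

print(reject / R)   # far above the nominal 0.05
```

    For two independent stationary series the rejection rate would be close to 0.05; here it is many times larger, and it grows with T, which is the spurious regression problem in action.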


  • 2 Unit Root Testing

    As we have just seen, models with a unit root have very different properties from stationary models. Therefore, it is very important to test whether a given time series may have a unit root before proceeding to model its relationship with other variables. The first specific tests developed to this end are the Dickey-Fuller unit root tests. We will explore these tests and, in particular, detail their large sample properties and show that they have non-standard (i.e. not Normal, t, F or Chi-Square) limiting distributions.

    2.1 Dickey-Fuller Unit Root Tests

    Assume that we have T + 1 observations on a time series y_t generated by the following:

    y_t = ρ y_{t-1} + u_t, t = 1, .., T, (3)

    where u_t ~ IID(0, σ²) and we assume the initial condition y_0 is a random variable with finite variance. We can rewrite (3) as

    Δy_t = δ y_{t-1} + u_t, t = 1, .., T, (4)

    where Δy_t = y_t - y_{t-1} and δ = ρ - 1. Dickey and Fuller (DF) (1979 and 1981) consider tests of the following:

    H0: ρ = 1 (i.e. δ = 0), y_t ~ I(1)
    vs H1: |ρ| < 1 (i.e. -2 < δ < 0), y_t ~ I(0).

    Under the null hypothesis y_t is an integrated (unit root) process, y_t = y_0 + Σ_{i=1}^t u_i. Under the alternative it is stationary, i.e. I(0). DF suggest using OLS to estimate δ in (4) and then propose two possible statistics for testing H0 against H1. These are the normalized bias statistic

    T(ρ̂ - 1) = T (Σ_{t=1}^T Δy_t y_{t-1}) / (Σ_{t=1}^T y²_{t-1}) = (T^{-1} Σ_{t=1}^T Δy_t y_{t-1}) / (T^{-2} Σ_{t=1}^T y²_{t-1}),

    and the one-sided t-statistic

    t_DF = (ρ̂ - 1)/se(ρ̂); se(ρ̂) = (σ̂² / Σ_{t=1}^T y²_{t-1})^{1/2},

    and the variance is estimated by

    σ̂² = Σ_{t=1}^T (y_t - ρ̂ y_{t-1})² / (T - 1).


  • As we will soon establish, the limiting null distributions of these two tests are not standard normal (N(0, 1)). This is a crucial property of both T(ρ̂ - 1) and t_DF. Comparing outcomes of these statistics with critical values obtained from standard normal tables will NOT deliver tests with the anticipated size (i.e. probability of incorrect rejection of a true H0). In fact the large-sample critical values for the tests are:

    Level       1%      2.5%    5%      10%
    T(ρ̂ - 1)   -13.8   -10.5   -8.10   -5.70
    t_DF        -2.58   -2.23   -1.95   -1.62
    N(0, 1)     -2.33   -1.96   -1.65   -1.28

    Note that, for example, using a standard normal critical value at 5%, i.e. -1.65, for the t_DF statistic would imply a test which actually has a size of near 10%.
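    Both statistics can be computed in a few lines. The sketch below (plain Python; the seed and T = 250 are arbitrary choices) simulates a series under H0, so the statistics should typically lie to the right of the left-tail critical values and the tests should usually not reject:

```python
import random

random.seed(4)
T = 250
# Simulate a driftless random walk: H0 (rho = 1) is true
y = [0.0]
for _ in range(T):
    y.append(y[-1] + random.gauss(0.0, 1.0))

num = sum((y[t] - y[t - 1]) * y[t - 1] for t in range(1, T + 1))
den = sum(y[t - 1] ** 2 for t in range(1, T + 1))

rho_hat = 1.0 + num / den                    # OLS estimate of rho
nb_stat = T * (rho_hat - 1.0)                # normalized bias statistic
s2 = sum((y[t] - rho_hat * y[t - 1]) ** 2 for t in range(1, T + 1)) / (T - 1)
t_df = (rho_hat - 1.0) / (s2 / den) ** 0.5   # Dickey-Fuller t-ratio

# Compare with the 5% left-tail critical values -8.10 and -1.95
print(nb_stat, nb_stat < -8.10)
print(t_df, t_df < -1.95)
```

    Re-running with |ρ| < 1 in the simulation sends both statistics far into the left tail, which previews the consistency argument in Section 2.3.1.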

    We shall now turn our attention to formally establishing the limiting distributions of the T(ρ̂ - 1) and t_DF statistics under the unit root null hypothesis. Unfortunately, standard methods of obtaining these do not apply in this problem, and so we introduce a new tool called the functional central limit theorem [FCLT], which is the cornerstone of distribution theory in the non-stationary case.

    2.2 Unit Root Asymptotic Distribution Theory

    2.2.1 Introduction

    Consider the AR(1) process

    y_t = ρ y_{t-1} + u_t, u_t ~ N(0, σ²),

    and y_0 = 0. The OLS estimator is

    ρ̂ = (Σ_{t=1}^T y_t y_{t-1}) / (Σ_{t=1}^T y²_{t-1}) = (Σ_{t=1}^T (ρ y_{t-1} + u_t) y_{t-1}) / (Σ_{t=1}^T y²_{t-1}) = ρ + (Σ_{t=1}^T y_{t-1} u_t) / (Σ_{t=1}^T y²_{t-1}),

    and if |ρ| < 1 then (e.g. Hamilton, Ch. 8) the standard limit theorems apply, so that

    T^{1/2} (ρ̂ - ρ) →d N(0, (1 - ρ²)).

    One can immediately see a problem if ρ = 1! The scaling required to get a limit distribution is different. To see this consider

    T(ρ̂ - 1) = (T^{-1} Σ_{t=1}^T y_{t-1} u_t) / (T^{-2} Σ_{t=1}^T y²_{t-1}). (5)

    Consider the numerator in (5): when ρ = 1 then y_t = u_t + u_{t-1} + ... + u_1 (since y_0 = 0), so that

    y_t ~ N(0, σ²t).


  • Also, when ρ = 1, then

    y²_t = (y_{t-1} + u_t)² = y²_{t-1} + 2 y_{t-1} u_t + u²_t,

    or equally

    y_{t-1} u_t = (1/2)(y²_t - y²_{t-1} - u²_t). (6)

    If we sum (6) from 1 to T, we get

    Σ_{t=1}^T y_{t-1} u_t = (1/2)(y²_T - y²_0) - (1/2) Σ_{t=1}^T u²_t.

    Using the fact that y_0 = 0 and dividing by both T and σ², we get

    (1/(σ²T)) Σ_{t=1}^T y_{t-1} u_t = (1/2)[y_T / (σ T^{1/2})]² - (1/(2σ²)) (1/T) Σ_{t=1}^T u²_t.

    But since y_T ~ N(0, σ²T), then y_T / (σ²T)^{1/2} ~ N(0, 1), so that y²_T / (σ²T) ~ χ²_1, and by the law of large numbers (1/T) Σ_{t=1}^T u²_t →p E[u²_t] = σ². Putting these results together we find

    (1/(σ²T)) Σ_{t=1}^T y_{t-1} u_t →d (1/2)(χ²_1 - 1).

    In the denominator of (5), note that y_{t-1} ~ N(0, σ²(t - 1)), so that E[y²_{t-1}] = σ²(t - 1), and

    E[Σ_{t=1}^T y²_{t-1}] = Σ_{t=1}^T E[y²_{t-1}] = σ² Σ_{t=1}^T (t - 1) = σ² T(T - 1)/2.

    Although we don't yet have the tools to derive the asymptotic distribution of the denominator, it should be pretty clear that such a distribution can only be obtained if the scaling is T^{-2}. The required tools begin with the definition of:

    2.2.2 Brownian Motion

    Consider the random walk process y_t = y_{t-1} + u_t, where u_t ~ IIDN(0, 1) and y_0 = 0. Thus (and as above) y_t = Σ_{j=1}^t u_j ~ N(0, t). Consider also, for s > t,

    y_s - y_t = u_{t+1} + u_{t+2} + ... + u_s ~ N(0, s - t),

    and moreover y_t - y_s is independent of y_r - y_q if t > s > r > q.

    Consider now y_t - y_{t-1} = u_t ~ N(0, 1), but regard u_t as the sum of two independent variables, say

    u_t = e_{1t} + e_{2t},

    where both e_{1t} and e_{2t} are N(0, 1/2) variables. We could then associate e_{1t} with the change between y_{t-1} and some mid-point y_{t-1/2}, say, so that

    y_{t-1/2} - y_{t-1} = e_{1t} ; y_t - y_{t-1/2} = e_{2t},

    but still

    y_t - y_{t-1} = e_{1t} + e_{2t}.

    In fact we could go further and consider N - 1 interim points, so that

    y_t - y_{t-1} = e_{1t} + e_{2t} + ... + e_{Nt}; e_{it} ~ IIDN(0, 1/N), i = 1, .., N.

    Further, we could consider what happens if we allow N → ∞. Doing so defines the continuous time process known as standard Brownian motion, which is defined as:

    DEFINITION: A standard Brownian motion W(·) is a continuous time stochastic process associating each date t ∈ [0, 1] with the scalar random variable W(t), such that:
    (a) W(0) = 0;
    (b) for dates 0 ≤ t_1 < t_2 < ... < t_k ≤ 1 the changes [W(t_2) - W(t_1)], ..., [W(t_k) - W(t_{k-1})] are independent normal, with [W(s) - W(t)] ~ N(0, s - t);
    (c) for any given realization, W(t) is continuous in t with probability 1.

    Note we have defined times as lying between 0 and 1, rather than 0 and ∞, for convenience in what follows.
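    The defining property [W(s) - W(t)] ~ N(0, s - t) can be checked against the discrete construction above: summing N/2 increments, each of variance 1/N, approximates W(1/2), whose variance should be 1/2. A sketch (plain Python; N and the replication count are arbitrary choices):

```python
import random

random.seed(5)
N, R = 400, 2000        # interim points per unit of time; Monte Carlo replications

draws = []
for _ in range(R):
    w = 0.0
    for _ in range(N // 2):                   # N/2 increments of variance 1/N
        w += random.gauss(0.0, (1.0 / N) ** 0.5)
    draws.append(w)                           # an approximate draw of W(1/2)

var_half = sum(w * w for w in draws) / R      # W(1/2) has mean zero
print(var_half)   # close to 0.5 = Var[W(1/2)]
```

    Summing the remaining N/2 increments would give an independent draw of W(1) - W(1/2), illustrating the independent-increments property (b).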

    2.2.3 The Functional Central Limit Theorem (FCLT)

    The simplest version of the Central Limit Theorem (CLT) has that if u_t ~ IID(0, σ²) and ū = T^{-1} Σ_{t=1}^T u_t, then T^{1/2} ū →d N(0, σ²) as T → ∞.

    Consider now just the first half of a sample (we discard the rest) and the (half) sample mean

    ū_{[T/2]} = (1/[T/2]) Σ_{t=1}^{[T/2]} u_t,

    where [T/2] denotes the largest integer smaller than or equal to T/2 (this is called the integer part of T/2). Notice that also, as T → ∞,

    [T/2]^{1/2} ū_{[T/2]} →d N(0, σ²),

    and notice that this (half) sample mean is independent of the (half) sample mean constructed from the rest of the data.

    We can generalize to taking the rth fraction of a sample, where r ∈ [0, 1], by defining

    X_T(r) = (1/T) Σ_{t=1}^{[Tr]} u_t.

  • Note the denominator in X_T(r) is T, not [Tr]. Now as r moves between 0 and 1, X_T(r) is a step function with

    X_T(r) = 0                      when 0 ≤ r < 1/T
           = u_1/T                  when 1/T ≤ r < 2/T
           = (u_1 + u_2)/T          when 2/T ≤ r < 3/T
             :
           = (Σ_{t=1}^T u_t)/T      when r = 1.

    Then,

    T^{1/2} X_T(r) = (1/T^{1/2}) Σ_{t=1}^{[Tr]} u_t = ([Tr]/T)^{1/2} (1/[Tr]^{1/2}) Σ_{t=1}^{[Tr]} u_t,

    but as T → ∞, (1/[Tr]^{1/2}) Σ_{t=1}^{[Tr]} u_t →d N(0, σ²) by the CLT, while ([Tr]/T)^{1/2} → r^{1/2}, so that

    T^{1/2} X_T(r) →d N(0, σ²r),

    or

    T^{1/2} X_T(r)/σ →d N(0, r).

    Similarly, and for r_2 > r_1,

    T^{1/2} (X_T(r_2) - X_T(r_1))/σ →d N(0, r_2 - r_1),

    and this is independent of T^{1/2} X_T(r)/σ provided r_1 > r.

    The Functional Central Limit Theorem: The sequence of stochastic functions {T^{1/2} X_T(·)/σ} has an asymptotic probability law described by standard Brownian motion W(·); that is,

    T^{1/2} X_T(·)/σ →d W(·). (7)

    The result in (7) is the FCLT. Although we've here assumed that the u_t are IID, in fact it holds under far weaker conditions.

    Notice that X_T(1) is the sample mean, i.e. X_T(1) = T^{-1} Σ_{t=1}^T u_t. Consequently the standard CLT is obtained as a special case of the FCLT, i.e.

    T^{1/2} X_T(1)/σ = (1/(σ T^{1/2})) Σ_{t=1}^T u_t →d W(1) ~ N(0, 1).

    2.2.4 The Continuous Mapping Theorem (CMT)

    Let S(·) be a continuous time stochastic process, with S(r) being the value it takes at some date r ∈ [0, 1]. Note that S(r) is a continuous function of r (with probability 1). Consider a sequence of continuous functions {S_T(r)} such that S_T(·) →d S(·); then if g(·) is a continuous functional, the CMT states that

    g(S_T(·)) →d g(S(·)). (8)

    In this context the most commonly used functionals are (stochastic) integrals, e.g. ∫_0^1 S(r)dr, or simpler functions such as [S(r)]². The CMT also applies to continuous functionals mapping a continuous bounded function on [0, 1] to another, e.g. g(h(·)) = σh(·). We can use exactly this, so that we have

    T^{1/2} X_T(r) = σ [T^{1/2} X_T(r)/σ] →d σ W(r) ~ N(0, σ²r).

    Consider also the function S_T(r) = [T^{1/2} X_T(r)]². Since above we had T^{1/2} X_T(r) →d σ W(r), it follows from the CMT that

    S_T(r) →d σ² W(r)².

    2.2.5 Applications to Unit Root Processes

    Consider again the random walk process

    y_t = y_{t-1} + u_t, u_t ~ IID(0, σ²) & y_0 = 0.

    Then y_t = Σ_{j=1}^t u_j, and so we can define the following stochastic function X_T(r):

    X_T(r) = 0         when 0 ≤ r < 1/T
           = y_1/T     when 1/T ≤ r < 2/T
           = y_2/T     when 2/T ≤ r < 3/T
             :
           = y_T/T     when r = 1.

    We can plot this (see the final figure on Moodle) as a function of r. Doing so yields rectangles of width 1/T and height y_{t-1}/T, and thus area y_{t-1}/T². The integral of (the area under) X_T(r) over r ∈ [0, 1] is therefore given by

    ∫_0^1 X_T(r)dr = y_1/T² + y_2/T² + ... + y_{T-1}/T² = T^{-2} Σ_{t=1}^T y_{t-1},

    so that

    ∫_0^1 T^{1/2} X_T(r)dr = T^{-3/2} Σ_{t=1}^T y_{t-1}.

    Thus, since we know from the FCLT and CMT that T^{1/2} X_T(r) →d σ W(r), we have

    ∫_0^1 T^{1/2} X_T(r)dr →d σ ∫_0^1 W(r)dr,

    so that in fact we have shown that

    T^{-3/2} Σ_{t=1}^T y_{t-1} →d σ ∫_0^1 W(r)dr.

    It can be shown that ∫_0^1 W(r)dr ~ N(0, 1/3).

    Notice that if {y_t} is a random walk then the sample mean ȳ = T^{-1} Σ_{t=1}^T y_t diverges, whereas T^{-3/2} Σ_{t=1}^T y_{t-1} = T^{-1/2} ȳ (with ȳ here the mean of the lagged series) converges to a normal limiting variable. Contrast this with the usual Central Limit Theorem type results for either stationary or independent data, where it is T^{1/2} ȳ that converges to a normal limiting variable.

    Consider next the sum of squares of a random walk. Let

    S_T(r) = T [X_T(r)]²,

    so that

    S_T(r) = 0          when 0 ≤ r < 1/T
           = y²_1/T     when 1/T ≤ r < 2/T
           = y²_2/T     when 2/T ≤ r < 3/T
             :
           = y²_T/T     when r = 1,

    similar to above. Then

    ∫_0^1 S_T(r)dr = y²_1/T² + y²_2/T² + ... + y²_{T-1}/T² = T^{-2} Σ_{t=1}^T y²_{t-1},

    and then (using both the FCLT and CMT) S_T(·) →d σ² [W(·)]², and so

    T^{-2} Σ_{t=1}^T y²_{t-1} →d σ² ∫_0^1 W(r)² dr.

    Ultimately the point here is to collect results useful in working out the limiting distribution of the Dickey-Fuller tests. For that, also recall that

    T^{-1} Σ_{t=1}^T y_{t-1} u_t = (1/2)(1/T)[y²_T - Σ_{t=1}^T u²_t] = (1/2) S_T(1) - (1/2)(1/T) Σ_{t=1}^T u²_t,

    given the definition of S_T(r). By the usual Law of Large Numbers,

    (1/T) Σ_{t=1}^T u²_t →p σ²,

    and since S_T(1) →d σ² [W(1)]², then

    (σ²T)^{-1} Σ_{t=1}^T y_{t-1} u_t →d (1/2)[W(1)² - 1],

    which is the same as we saw before, noting that W(1)² ~ χ²_1.

    At this stage it is worth collating all of the results so far obtained. If y_t = y_{t-1} + u_t with y_0 = 0 and u_t ~ IID(0, σ²), then:

    a): T^{-1/2} Σ_{t=1}^T u_t →d σ W(1) ~ N(0, σ²) (9)

    b): T^{-1} Σ_{t=1}^T y_{t-1} u_t →d (σ²/2)[W(1)² - 1] ~ (σ²/2)[χ²_1 - 1] (10)

    c): T^{-3/2} Σ_{t=1}^T y_{t-1} →d σ ∫_0^1 W(r)dr ~ N(0, σ²/3) (11)

    d): T^{-2} Σ_{t=1}^T y²_{t-1} →d σ² ∫_0^1 W(r)² dr. (12)

    Note that it is the SAME standard Brownian motion process, W(r), throughout.
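    Result (c) is easy to check by simulation: with σ = 1, T^{-3/2} Σ y_{t-1} should be approximately N(0, 1/3) for large T. A sketch (plain Python; T, the seed and the replication count are arbitrary choices):

```python
import random

random.seed(6)
T, R = 300, 2000
vals = []
for _ in range(R):
    y, s = 0.0, 0.0
    for _ in range(T):
        s += y                         # accumulates the sum of y_{t-1}
        y += random.gauss(0.0, 1.0)    # y_t = y_{t-1} + u_t
    vals.append(s / T ** 1.5)          # T^{-3/2} * sum of y_{t-1}

mean_v = sum(vals) / R
var_v = sum((v - mean_v) ** 2 for v in vals) / R
print(mean_v)   # close to 0
print(var_v)    # close to 1/3, as in result (c) with sigma = 1
```

    Replacing s += y with s += y * y and the scaling with T ** 2 gives draws from the non-normal limit in result (d) instead.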

    2.3 Asymptotic Distributions of Unit Root Test Statistics

    Recall that ρ̂ = Σ_{t=1}^T y_t y_{t-1} / Σ_{t=1}^T y²_{t-1}, or equivalently

    T(ρ̂ - 1) = (T^{-1} Σ_{t=1}^T y_{t-1} u_t) / (T^{-2} Σ_{t=1}^T y²_{t-1}), (13)

    with u_t ~ IID(0, σ²). Then, using the results above, we have shown that

    T^{-1} Σ_{t=1}^T y_{t-1} u_t →d (σ²/2)[W(1)² - 1] & T^{-2} Σ_{t=1}^T y²_{t-1} →d σ² ∫_0^1 W(r)² dr. (14)

    Since the ratio in (13) is a continuous function of its numerator and denominator (which is positive with probability 1), we can state that under H0: ρ = 1 the OLS estimator satisfies

    T(ρ̂ - 1) →d [(1/2)(W(1)² - 1)] / [∫_0^1 W(r)² dr]. (15)

    Sometimes we see the numerator in (15) written instead as ∫_0^1 W(r)dW(r) (this is an example of a stochastic integral), and so

    T(ρ̂ - 1) = (T^{-1} Σ_{t=1}^T y_{t-1} u_t) / (T^{-2} Σ_{t=1}^T y²_{t-1}) →d [∫_0^1 W(r)dW(r)] / [∫_0^1 W(r)² dr].


  • Critical values have been tabulated for the distribution of the RHS of (15), which we shall denote by κ:

    Table 1:

    κ = ∫_0^1 W(r)dW(r) / ∫_0^1 W(r)² dr     N(0, 1)
    Pr[κ < -13.8] = 0.01                     -2.33
    Pr[κ < -10.5] = 0.025                    -1.96
    Pr[κ < -8.1]  = 0.05                     -1.645
    Pr[κ < -5.7]  = 0.10                     -1.282
    Pr[κ < 0.93]  = 0.90                      1.282
    Pr[κ < 1.28]  = 0.95                      1.645
    Pr[κ < 2.03]  = 0.99                      2.33

    Clearly the distribution of κ is not standard normal; instead it is a non-standard distribution called the Dickey-Fuller distribution.

    It also follows from (15) that ρ̂ is a super-consistent estimator of ρ when ρ = 1, in that it converges to the true value at rate T, rather than the more usual T^{1/2}.

    The other popular unit root test statistic for H0: ρ = 1 is the OLS t-ratio:

    t_DF = (ρ̂ - 1) / (σ̂² / Σ_{t=1}^T y²_{t-1})^{1/2}, (16)

    where σ̂² = T^{-1} Σ_{t=1}^T (y_t - ρ̂ y_{t-1})². Again this will have a non-standard asymptotic distribution. To obtain it we rewrite (16) as

    t_DF = T(ρ̂ - 1) [σ̂^{-2} T^{-2} Σ_{t=1}^T y²_{t-1}]^{1/2}
         = [(T^{-1} Σ_{t=1}^T y_{t-1} u_t) / (T^{-2} Σ_{t=1}^T y²_{t-1})] [σ̂^{-2} T^{-2} Σ_{t=1}^T y²_{t-1}]^{1/2}
         = (T^{-1} Σ_{t=1}^T y_{t-1} u_t) / [σ̂ (T^{-2} Σ_{t=1}^T y²_{t-1})^{1/2}]
         →d (σ²/2)[W(1)² - 1] / [σ (σ² ∫_0^1 W(r)² dr)^{1/2}]
         = ∫_0^1 W(r)dW(r) / (∫_0^1 W(r)² dr)^{1/2},

    since σ̂² →p σ².

    Critical values have also been tabulated for this Dickey-Fuller distribution:

    Table 2:

    τ = ∫_0^1 W(r)dW(r) / (∫_0^1 W(r)² dr)^{1/2}     N(0, 1)
    Pr[τ < -2.58] = 0.01                             -2.33
    Pr[τ < -2.23] = 0.025                            -1.96
    Pr[τ < -1.95] = 0.05                             -1.645
    Pr[τ < -1.62] = 0.10                             -1.282
    Pr[τ < 0.89]  = 0.90                              1.282
    Pr[τ < 1.28]  = 0.95                              1.645
    Pr[τ < 2.00]  = 0.99                              2.33


  • Example 9: The following AR(1) model was fitted by OLS for t = 1947Q2 to 1989Q1 (T = 168) for data on the US nominal 3-month T-bill rate:

    î_t = 0.99694 i_{t-1},
          (0.010592)

    where the figure in parentheses is the estimated standard error. We then find

    T(ρ̂ - 1) = 168 (0.99694 - 1) = -0.51
    t_DF = (0.99694 - 1)/0.010592 = -0.29,

    which are well above any of the (left tail) critical values in Tables 1 or 2, so we cannot reject H0: ρ = 1 in favour of H1: ρ < 1. We also cannot reject against H1: ρ > 1.
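    The arithmetic in Example 9 is just two one-line computations:

```python
rho_hat, se, T = 0.99694, 0.010592, 168   # estimates reported in Example 9

nb_stat = T * (rho_hat - 1.0)   # normalized bias statistic
t_df = (rho_hat - 1.0) / se     # Dickey-Fuller t-ratio

print(round(nb_stat, 2))   # -0.51
print(round(t_df, 2))      # -0.29
```

    Both values lie far to the right of the 5% left-tail critical values -8.10 and -1.95, matching the conclusion in the text.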

    2.3.1 Consistency under H1: |ρ| < 1

    Under the alternative the model is a stationary AR(1):

    y_t = ρ y_{t-1} + u_t,

    where |ρ| < 1. For this model we know that ρ̂ is a consistent estimator of ρ, so that ρ̂ →p ρ. Under H1, ρ̂ - 1 →p ρ - 1 < 0, since |ρ| < 1. Consequently T(ρ̂ - 1) will diverge to -∞, and so no matter which critical value we choose, Pr[T(ρ̂ - 1) < cv] → 1 as T → ∞. I.e. the test will reject with probability 1 when H1 is true. Consistency of the t-statistic t_DF follows in the same way as we showed when we looked at the power of the t-test in Introductory Econometrics (L11221).

    2.3.2 The Initial Value/Condition

    So far we have assumed y_0 = 0. Here we'll weaken this slightly and instead consider that either:
    (a) y_0 = c, a constant, or
    (b) y_0 has a specified distribution with finite variance, e.g. y_0 ~ N(0, σ²).

    Notice that (b) includes (a) as a special case, and in turn the case y_0 = 0. We assume that y_0 is independent of {u_t}_{t≥1}. All other previous assumptions are maintained.

    Consider

    T(ρ̂ - 1) = (T^{-1} Σ_{t=1}^T y_{t-1} u_t) / (T^{-2} Σ_{t=1}^T y²_{t-1}) = [T^{-1}(y_0 u_1 + y_1 u_2 + .. + y_{T-1} u_T)] / [T^{-2}(y²_0 + y²_1 + .. + y²_{T-1})]. (17)

    The denominator of (17) satisfies

    T^{-2} Σ_{t=1}^T y²_{t-1} = T^{-2} Σ_{t=1}^T (Σ_{j=1}^{t-1} u_j + y_0)² = T^{-2} Σ_{t=1}^T (S²_{t-1} + 2 S_{t-1} y_0 + y²_0),

    where S_{t-1} = Σ_{j=1}^{t-1} u_j. Consequently, using S_T(r) and X_T(r) as previously defined,

    T^{-2} Σ_{t=1}^T y²_{t-1} = ∫_0^1 S_T(r)dr + 2 y_0 T^{-1/2} ∫_0^1 T^{1/2} X_T(r)dr + y²_0/T,

    and so, since ∫_0^1 T^{1/2} X_T(r)dr →d σ ∫_0^1 W(r)dr, the second and third terms vanish as T → ∞, and we have

    T^{-2} Σ_{t=1}^T y²_{t-1} →d σ² ∫_0^1 W(r)² dr,

    as we did before. Similarly, the numerator of (17) is

    T^{-1} Σ_{t=1}^T y_{t-1} u_t = T^{-1} Σ_{t=1}^T (S_{t-1} + y_0) u_t = T^{-1} Σ_{t=1}^T S_{t-1} u_t + y_0 T^{-1/2} (T^{-1/2} Σ_{t=1}^T u_t);

    a standard CLT shows that T^{-1/2} Σ_{t=1}^T u_t →d N(0, σ²), and hence the second term above vanishes as T → ∞, meaning that, also as before,

    T^{-1} Σ_{t=1}^T y_{t-1} u_t →d (σ²/2)(W(1)² - 1) ≡ σ² ∫_0^1 W(r)dW(r).

    The initial value (under assumption (a) or (b)) has no effect on the asymptotic distribution of T(ρ̂ - 1), nor therefore on that of t_DF.

    2.4 Augmented Dickey-Fuller Tests

So far we have assumed that u_t ∼ IID(0, σ²), which of course is likely to be an unrealistic assumption in practice. Instead suppose that {y_t} is generated by

    y_t = ρ y_{t−1} + u_t
    u_t = Σ_{i=1}^p φ_i u_{t−i} + e_t + Σ_{j=1}^q θ_j e_{t−j},

with e_t ∼ IID(0, σ_e²) and E[(e_t² − σ_e²)²] = μ₄ < ∞.


The unit root hypothesis remains H0 : ρ = 1, but when H0 is true then y_t is an ARIMA(p, 1, q) process and so u_t = y_t − y_{t−1} = Δy_t. When p and q are unknown then we approximate using

    y_t = ρ y_{t−1} + Σ_{i=1}^k d_i Δy_{t−i} + e_t,    (18)

known as the Augmented Dickey-Fuller regression. In (18) we allow k to grow with the sample size, for example letting k → ∞ as T → ∞ but k/T^{1/3} → 0.

OLS applied to (18) yields consistent estimators (at rate T^{1/2}) for the {d_i}_{i=1}^k, and the t-statistic for testing H0 has the same asymptotic distribution as in the simpler case above, i.e.

    t_ADF →d ∫₀¹ W(r) dW(r) / ( ∫₀¹ W(r)² dr )^{1/2}.

The distribution of T(ρ̂ − 1) is not the same, however. In fact it can be shown that, under H0,

    T(ρ̂ − 1) / ( 1 − Σ_{i=1}^k d̂_i ) →d ∫₀¹ W(r) dW(r) / ∫₀¹ W(r)² dr.

Formal derivations of these results are found in, for example, Hamilton (Section 17.1).

We call the tests t_ADF and T(ρ̂ − 1)/(1 − Σ_{i=1}^k d̂_i) the Augmented Dickey-Fuller (ADF) tests, in the sense that the OLS regression of y_t on y_{t−1} is augmented by the lagged values {Δy_{t−i}}_{i=1}^k as k → ∞, subject to k/T^{1/3} → 0. In practice T is not infinite and so we need to choose a value of k for our regression. Typically we use:
(a) Information criteria, such as the Akaike or Bayesian Information Criteria (AIC, BIC) seen before in Econometrics modules.
(b) Deterministic rules, such as k = ⌊4(T/100)^{1/4}⌋ or k = ⌊12(T/100)^{1/4}⌋ - see Schwert (1989).
(c) Data-based lag selection, which involves a step-wise procedure in which we initially choose a (large) value of k = kmax, for example one of those above, and then use a regression t-test of H0 : d_{kmax} = 0. If we don't reject, we decrease the number of lags in the ADF regression by one and then test H0 : d_{kmax−1} = 0; if we continue to fail to reject, we keep on reducing the number of lags by one until we do reject. This procedure is described fully in Ng and Perron (1995).

Notice that running (18) implies losing k + 1 observations - since Δy_{t−k} is only defined once t reaches k + 2 - or, another way to think of it, we have an extra k nuisance parameters (ρ is the parameter of interest) to estimate. One consequence of this is that often the asymptotic critical values obtained from the Dickey-Fuller distributions may not be accurate in finite samples.
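The step-down procedure in (c) can be sketched as follows. This is only an illustration of the general-to-specific idea, not Ng and Perron's exact implementation: the simulated series (a random walk with MA(1) errors), the starting value kmax from Schwert's rule, and the 5% two-sided normal critical value 1.96 are all assumptions for the example.

```python
import numpy as np

def last_lag_tstat(y, k):
    """OLS of y_t on y_{t-1} and k lagged differences, as in (18);
    returns the t-statistic on d_k, the coefficient of the longest lag."""
    dy = np.diff(y)
    T = len(y)
    rows = range(k + 1, T)
    X = np.array([[y[t - 1]] + [dy[t - 1 - i] for i in range(1, k + 1)]
                  for t in rows])
    Y = y[k + 1:]
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    n, p = X.shape
    s2 = resid @ resid / (n - p)            # error variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)       # OLS covariance matrix
    return beta[-1] / np.sqrt(cov[-1, -1])

def step_down_k(y, kmax, crit=1.96):
    """Reduce k from kmax until the last lag is significant; 0 if none is."""
    for k in range(kmax, 0, -1):
        if abs(last_lag_tstat(y, k)) > crit:
            return k
    return 0

rng = np.random.default_rng(0)
e = rng.standard_normal(300)
u = np.convolve(e, [1.0, 0.5])[:300]        # MA(1) errors (an assumption)
y = np.cumsum(u)                            # unit root, so H0 is true
kmax = int(12 * (len(y) / 100) ** 0.25)     # Schwert's rule, here kmax = 15
print(step_down_k(y, kmax))
```

The selected k would then be used in the ADF regression (18) before computing t_ADF.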


3 Invariant Tests of a Unit Root

When hypothesis testing in the presence of nuisance parameters (i.e. parameters that are not specified by the null hypothesis - e.g. the d_i's in the ADF regression) we need to ensure that our test statistics have (at least asymptotic) distributions which do not depend on these nuisance parameters at all.

Feasible tests whose distributions do not depend on nuisance parameters are said to be similar or invariant. As far as we are concerned these two terms mean the same thing.

For example, since the Dickey-Fuller distributions don't depend on y0 under assumptions (a) and (b) above, we can say the tests are asymptotically invariant with respect to y0. They are not exactly invariant, though.

If we want to make our tests exactly invariant with respect to y0 then this can be achieved by including an intercept term in the test regression, i.e.

    y_t = μ + ρ y_{t−1} + u_t,  u_t ∼ IID(0, σ²),    (19)

so that we regress y_t on a constant and y_{t−1}.

The limiting null distributions of the resulting test statistics are different, however, from what we derived previously. Specifically

    T(ρ̃ − 1) →d [ ½(W(1)² − 1) − W(1) ∫₀¹ W(r) dr ] / [ ∫₀¹ W(r)² dr − ( ∫₀¹ W(r) dr )² ] ≡ ∫₀¹ W̄(r) dW(r) / ∫₀¹ W̄(r)² dr,

where W̄(r) = W(r) − ∫₀¹ W(s) ds is de-meaned Brownian motion. (Note that ρ̃ above isn't the same estimator as in the case with no intercept.)

We derive the test statistics in the following way. First regress y_t on μ. The (OLS) estimator for μ is ȳ = T⁻¹ Σ_{t=1}^T y_t. Consequently we define the residuals of this regression by

    ũ_t = y_t − ȳ = y_t − T⁻¹ Σ_{s=1}^T y_s.

Under H0 : ρ = 1 and y_t = y_{t−1} + u_t, we have

    ũ_t = Σ_{j=1}^t u_j + y0 − T⁻¹ Σ_{s=1}^T ( Σ_{j=1}^s u_j + y0 ) = Σ_{j=1}^t u_j − T⁻¹ Σ_{s=1}^T Σ_{j=1}^s u_j,

which clearly does not involve y0 at all. If we also divide by T^{1/2}, then for t = [Tr],

    T^{−1/2} ũ_t = T^{−1/2} Σ_{j=1}^t u_j − T^{−3/2} Σ_{s=1}^T Σ_{j=1}^s u_j →d σ ( W(r) − ∫₀¹ W(s) ds ) = σ W̄(r).
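That the de-meaned residuals ũ_t do not involve y0 can also be confirmed numerically; a small sketch, feeding the same shocks through two different initial values:

```python
import numpy as np

rng = np.random.default_rng(42)
u = rng.standard_normal(200)

def demeaned_resid(y0):
    """Random walk from y0 driven by the shocks u, then ũ_t = y_t - ȳ."""
    y = y0 + np.cumsum(u)        # y_t = y0 + sum_{j<=t} u_j under H0: rho = 1
    return y - y.mean()

# Identical residual series whatever the initial value.
print(np.allclose(demeaned_resid(0.0), demeaned_resid(100.0)))  # True
```

Subtracting the sample mean removes y0 exactly, which is what makes the intercept-augmented tests exactly invariant to the initial condition.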


We can pursue this line further. Now suppose that the data itself is generated by

    (1 − ρL)(y_t − μ − βt) = ε_t,    (20)

then it turns out that the asymptotic distributions of tests generated from the regression (19) will depend upon the value of β.

We can rewrite (20) in the following way,

    y_t = ρ y_{t−1} + (1 − ρ)μ + βt − ρβ(t − 1) + ε_t = ρ y_{t−1} + μ* + β*t + ε_t,

where μ* = (1 − ρ)μ + ρβ and β* = (1 − ρ)β. Consequently when ρ = 1 then μ* = β and β* = 0, so the model is a random walk with drift. If we ignore the presence of β (which is effectively what (19) does) then the resulting tests are useless. It can be shown that such tests have zero asymptotic power and are therefore not consistent tests.

We can obtain consistent and also invariant tests (with respect to all of the nuisance parameters) simply by including a constant and a linear trend in the test regression,

    y_t = ρ y_{t−1} + μ + βt + u_t,

i.e. we regress y_t on a constant, a linear trend and y_{t−1}. As above this is achieved by obtaining residuals from a regression of y_t on the constant and trend. The resulting unit root test statistics can be shown to have the following null asymptotic distributions;

    T(ρ̂ − 1) →d ∫₀¹ Ŵ(r) dW(r) / ∫₀¹ Ŵ(r)² dr ;  t_DF →d ∫₀¹ Ŵ(r) dW(r) / ( ∫₀¹ Ŵ(r)² dr )^{1/2},

where

    Ŵ(r) = W̄(r) − 12 (r − ½) ∫₀¹ (s − ½) W(s) ds

is de-meaned and de-trended Brownian motion.

We began with the simplest case of no constant and no trend, then considered introducing a constant, and finally had both a constant and trend. Notice that the asymptotic distributions of the resulting unit root tests all have essentially the same form - the only difference being whether or not we are de-meaning (including a constant) and de-trending (also including a trend) the Brownian motion.


The effect on the critical values of this can be seen in the following Table:

    Table 3                               1%      2.5%    5%      10%
    No Constant, No Trend   t_DF         -2.58   -2.23   -1.95   -1.62
    No Constant, No Trend   T(ρ̂ − 1)    -13.8   -10.5   -8.10   -5.70
    Constant, No Trend      t_DF         -3.43   -3.12   -2.86   -2.57
    Constant, No Trend      T(ρ̂ − 1)    -20.7   -16.9   -14.1   -11.3
    Constant, Trend         t_DF         -3.96   -3.66   -3.41   -3.12
    Constant, Trend         T(ρ̂ − 1)    -29.5   -25.1   -21.8   -18.3

Notice that the effect of de-meaning, and then also de-trending, is to shift the critical values to the left.

    Many (numerical) studies have been made into the finite sample size and powerproperties of these unit root tests - see for example the papers by Schwert (1989) orNg and Perron (1995). One striking finding is how much less power there is whenwe include a trend. It is not immediately apparent from the limiting distributionsdescribed above why the effect on power should be so dramatic.

Above, tests which are invariant to a constant and trend were constructed from residuals obtained from OLS estimation of the data on the constant and trend. Recall from Econometrics I and Advanced Econometric Theory the Normal linear regression model

    y = XB + u,

where X is an n × k matrix of explanatory variables and B is a k × 1 vector of parameters. The OLS estimator is B̂ = (X′X)⁻¹X′y and the residuals are

    û = y − XB̂ = y − X(X′X)⁻¹X′y = My,

where M = I − X(X′X)⁻¹X′.

Now suppose that the data are generated by (20), which we can rewrite as

    y_t = μ + βt + u_t ;  u_t = ρ u_{t−1} + ε_t    (21)

and ε_t ∼ N(0, σ²), so that in terms of the linear regression model

    X = [ 1  1
          1  2
          1  3
          :  :
          1  T ],   B = ( μ, β )′,

then we would construct the DF unit root tests from the residuals, e.g. ρ̃ = Σ_{t=2}^T û_t û_{t−1} / Σ_{t=2}^T û²_{t−1}.
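The construction just described (OLS residuals from a regression of y_t on a constant and trend, then the DF coefficient computed from those residuals) can be sketched with numpy; the simulated random walk and the sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 250
y = np.cumsum(rng.standard_normal(T))      # a driftless random walk (rho = 1)

# X = [1, t]: constant and linear trend; M = I - X(X'X)^{-1}X'.
X = np.column_stack([np.ones(T), np.arange(1, T + 1)])
M = np.eye(T) - X @ np.linalg.inv(X.T @ X) @ X.T
u_tilde = M @ y                            # residuals from constant + trend

# DF coefficient from the detrended residuals, as in the notes.
rho_tilde = (u_tilde[1:] @ u_tilde[:-1]) / (u_tilde[:-1] @ u_tilde[:-1])
stat = T * (rho_tilde - 1)
print(stat)   # compare with the Constant, Trend row of Table 3
```

By construction u_tilde is orthogonal to the columns of X, so any statistic built from it is exactly invariant to μ and β.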


It actually turns out that every test which is invariant with respect to both μ (and so y0) and β can be constructed from the elements of the (T − k)-dimensional vector

    v = C′û / ‖û‖,

where we have decomposed M = CC′ and C is a T × (T − k) matrix also satisfying C′C = I_{T−k}.

Then according to Marsh (2007) v will have a density function which depends only upon the parameter ρ. Call this density f_v(ρ). Recall from Advanced Econometric Theory the Cramer-Rao Lower Bound, which states (in the current notation)

    Var( ρ̂(v) ) ≥ I_v(ρ)⁻¹,

where ρ̂(v) is any unbiased estimator of ρ and I_v(ρ) = −E[ d² ln f_v(ρ) / dρ² ] is Fisher Information. In fact the CRLB is a special case of a more fundamental bound, which states that if z(v) is ANY statistic with mean E[z(v)] = τ(ρ) then

    Var[ z(v) ] ≥ ( dτ(ρ)/dρ )² I_v(ρ)⁻¹.

That is, Fisher Information represents a fundamental measure of precision for any statistic at all, whether estimator or, as is of interest here, test. Note that the unit root hypothesis is H0 : ρ = 1. Marsh (2007) proves that for the model (21)

    I_v(1) = 0.

That is, there is NO information in any statistic which is invariant to a linear trend at the very parameter value (i.e. at H0) that we are interested in, and the variance of ANY statistic will therefore be unbounded.

There are many different trends that we could use rather than just the linear one, such as √t, log(t), t² or e^t. As far as I'm aware it is only the linear trend which does this. This also illustrates why it is so difficult to tell the difference between linear trends and stochastic ones, as described at the beginning of these notes.

Obviously this highlights the need to use a linear trend only when one is strictly necessary, and procedures have been developed to try to ensure this is the case - see for instance Harvey, Leybourne and Taylor (2009), who propose methods to do exactly that.

4 Spurious Regression

It is vitally important that we detect whether time series have unit roots (i.e. stochastic trends) because of the possibility of obtaining spurious results when we regress one on another. Here we focus on the case of a regression involving two independent I(1) variables. Later on in this module it will be shown that some linear combination of I(1) variables may yield an I(0) variable - this is the case of co-integration and is what Sir Clive Granger won his Nobel prize for.

First, though, consider two I(1) variables {y_t}_{t=0}^∞ and {x_t}_{t=0}^∞ generated as

    y_t = y_{t−1} + u_t ;  u_t ∼ IID(0, σ_u²) & y0 = 0
    x_t = x_{t−1} + ε_t ;  ε_t ∼ IID(0, σ_ε²) & x0 = 0,

with E[u_t ε_s] = 0 for all s and t.

Now consider the regression model

    y_t = α + β x_t + e_t.    (22)

Since E[u_t ε_s] = 0 for all s and t implies {y_t}_{t=0}^∞ and {x_t}_{t=0}^∞ are independent, then the slope of this regression should be zero, i.e. β = 0. But is it true that OLS estimators and tests find this? E.g. does β̂ →p 0?

Define W_u(r) and W_ε(r) as the independent Brownian motions obtained from cumulating and scaling the {u_t} and {ε_t}, exactly as we did previously. Also let x̄ and ȳ be the sample means of the {x_t} and {y_t} series. The OLS estimator is then

    β̂ = Σ_{t=1}^T y_t (x_t − x̄) / Σ_{t=1}^T (x_t − x̄)²
       = [ T⁻² Σ_{t=1}^T y_t x_t − T⁻² x̄ Σ_{t=1}^T y_t ] / [ T⁻² Σ_{t=1}^T (x_t − x̄)² ]
       = [ T⁻² Σ_{t=1}^T y_t x_t − (T^{−1/2} ȳ)(T^{−1/2} x̄) ] / [ T⁻² Σ_{t=1}^T x_t² − T⁻¹ x̄² ].    (23)

We can use all of our previous results to find

    T^{−1/2} ȳ →d σ_u ∫₀¹ W_u(r) dr    (24)
    T^{−1/2} x̄ →d σ_ε ∫₀¹ W_ε(r) dr    (25)
    T⁻² Σ_{t=1}^T x_t² →d σ_ε² ∫₀¹ W_ε(r)² dr,    (26)

while for T⁻² Σ_{t=1}^T y_t x_t we can easily generalize to find

    T⁻² Σ_{t=1}^T y_t x_t →d σ_u σ_ε ∫₀¹ W_u(r) W_ε(r) dr.    (27)

If we then apply the limits in (24) to (27) to (23), then via the CMT we have

    β̂ →d (σ_u/σ_ε) [ ∫₀¹ W_u(r) W_ε(r) dr − ( ∫₀¹ W_u(r) dr )( ∫₀¹ W_ε(r) dr ) ] / [ ∫₀¹ W_ε(r)² dr − ( ∫₀¹ W_ε(r) dr )² ] := (σ_u/σ_ε) η,    (28)


say. In addition,

    T^{−1/2} α̂ = T^{−1/2} ȳ − β̂ T^{−1/2} x̄,    (29)

and so we immediately find

    T^{−1/2} α̂ →d σ_u [ ∫₀¹ W_u(r) dr − η ∫₀¹ W_ε(r) dr ].    (30)

To summarize these results: (28) demonstrates that β̂ converges to a well-defined random variable in the limit, i.e. it does not converge to 0. That is, β̂ is not a consistent estimator of β in this context. In addition, a standard t-test of H0 : β = 0 can be shown to diverge as T → ∞. That is, as the sample size becomes infinite we will reject H0 : β = 0, using such a test, with probability 1. These findings lead, inevitably, to spurious inference about the existence of a relationship between y_t and x_t.
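The spurious-regression phenomenon is easy to reproduce by simulation; a sketch under the DGP above, with α and β estimated by OLS as in (22) (the sample sizes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

def spurious_t(T):
    """Regress one independent random walk on another; return the t-stat on beta."""
    y = np.cumsum(rng.standard_normal(T))
    x = np.cumsum(rng.standard_normal(T))
    X = np.column_stack([np.ones(T), x])          # constant and x_t, as in (22)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s2 = e @ e / (T - 2)                          # error variance estimate
    se_b = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se_b

# Although beta = 0, the |t|-statistic tends to grow with T.
for T in (50, 200, 800):
    print(T, abs(spurious_t(T)))
```

A draw-by-draw comparison need not be monotone, but averaging over replications shows |t| growing roughly like √T, so the usual 1.96 rule rejects with probability approaching 1.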

4.1 Possible cures for spurious regressions

(1) Include lags of x_t and y_t in (22) - see for example Hamilton (1994, p. 561).
(2) Difference the data before estimation, i.e. run the regression

    Δy_t = α + β Δx_t + u_t.

This yields T^{1/2}-consistent estimators for α and β and we can apply standard asymptotic theory, i.e. the limit distributions are Normal. However, we still do need to check whether x_t and y_t are I(1) - differencing is not a good idea if they are not. We also lose one of the benefits of dealing with non-stationary data, which is that we have much faster rates of convergence than with stationary data.
(3) We can estimate (22) by Generalized Least Squares, assuming the errors have first-order autocorrelation. The resulting estimators α̂_GLS and β̂_GLS are asymptotically equal in distribution to the estimators obtained from suggestion (2), provided that both series are I(1) and not co-integrated.
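Cure (2) can be illustrated in the same simulation setting; after differencing, both series are IID, standard asymptotics apply, and the nominal 5% t-test of H0 : β = 0 rejects at close to its nominal rate. A sketch, with arbitrary simulation settings:

```python
import numpy as np

rng = np.random.default_rng(3)

def diff_reg_t(T):
    """t-statistic on beta from regressing Dy_t on a constant and Dx_t."""
    dy = rng.standard_normal(T)      # Dy_t = u_t for a driftless random walk
    dx = rng.standard_normal(T)      # Dx_t = eps_t, independent of u_t
    X = np.column_stack([np.ones(T), dx])
    b = np.linalg.lstsq(X, dy, rcond=None)[0]
    e = dy - X @ b
    s2 = e @ e / (T - 2)
    return b[1] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

# Rejection frequency of the nominal 5% two-sided test of H0: beta = 0.
rejections = sum(abs(diff_reg_t(200)) > 1.96 for _ in range(500))
print(rejections / 500)  # close to 0.05
```

In contrast to the levels regression of the previous section, the test now has (approximately) correct size.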
