The APARCH Model
Asymmetric Power ARCH = APARCH
The APARCH model of Ding, Granger, and Engle (1993) is

$$A_t = \sigma_t \varepsilon_t, \qquad (1)$$

where $\varepsilon_t \overset{\mathrm{iid}}{\sim} \mathrm{N}(0,1)$, and

$$\sigma_t^\delta = \alpha_0 + \sum_{i=1}^{m} \alpha_i \left(|A_{t-i}| - \gamma_i A_{t-i}\right)^\delta + \sum_{j=1}^{s} \beta_j \sigma_{t-j}^\delta, \qquad (2)$$

where $\delta > 0$ and $-1 < \gamma_i < 1$.
ARCH(m): $\delta = 2$; $\gamma_i = 0$; $\beta_j = 0$
GARCH(m, s): $\delta = 2$; $\gamma_i = 0$
GJR-GARCH(m, s) or TGARCH(m, s): $\delta = 2$
The APARCH Model
The APARCH model exhibits several stylized properties of financial time series.

Even though the conditional distribution is normal, the unconditional distribution has excess kurtosis (fat tails).

Because a shock at time $t-1$ also impacts the variance at time $t$, the volatility is more likely to be high at time $t$ if it was also high at time $t-1$. This yields clustering of volatility.

The APARCH model, like the GJR-GARCH model, additionally captures asymmetry in return volatility. That is, volatility tends to increase more when returns are negative, as compared to positive returns of the same magnitude.

The APARCH model also yields the long-memory property of returns.
Fitting Time Series Models in R
There are a number of R functions that perform the computations to fit various time series models.

ARMA / ARIMA:              arima (stats)
ARMA order determination:  autofit (itsmr)
ARMA + GARCH:              garchFit (fGarch)
APARCH:                    garchFit (fGarch)

The APARCH model includes the TGARCH and GJR models, among others; see equation (2).

Also see the help page for fGarch-package in fGarch.
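As a concrete illustration, here is a minimal sketch of fitting an APARCH(1, 1) with garchFit (the returns vector r and the option settings are assumptions for illustration):

library(fGarch)
# r is assumed to be a numeric vector of returns.
fit_aparch <- garchFit(~ aparch(1, 1), data = r,
                       include.mean = TRUE,    # also fit a mean term
                       include.delta = TRUE,   # estimate the power delta
                       trace = FALSE)
summary(fit_aparch)   # reports alpha0, alpha1, gamma1, beta1, delta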
Other R Functions for Time Series Models
There are R functions for forecasting using different time series models that have been fitted.

ARMA / ARIMA:                      predict.Arima (stats)
APARCH (including ARMA + GARCH):   predict (fGarch)

There are also R functions for simulating data from different time series models.

ARMA / ARIMA:                      arima.sim (stats)
APARCH (including ARMA + GARCH):   garchSim (fGarch)
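For example, a hedged sketch of simulation plus forecasting (the GARCH(1, 1) parameter values below are made up):

library(fGarch)
# Simulate 1000 observations from a GARCH(1,1) specification.
spec <- garchSpec(model = list(omega = 1e-6, alpha = 0.1, beta = 0.8))
x <- garchSim(spec, n = 1000)

# Refit the model and forecast 10 steps ahead.
fit <- garchFit(~ garch(1, 1), data = x, trace = FALSE)
predict(fit, n.ahead = 10)   # mean forecasts and standard deviations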
Other Ways of Modeling Time Series
As we have emphasized, data from a general time series may have special characteristics that lead us to use special models, and to fit those models in special ways.

(What’s the main difference in our approach to fitting models for time series?)

Despite the special nature of time series data, general statistical models are often useful, especially if those models are fitted locally.
Statistical Models
There are several reasons that we build models and fit them to observational data.

In many applications, we are interested in some asymmetric relationship between a random variable Y and another variable x. (Both of these may be vectors.)

That is, we think of some subset of variables as being “dependent” on some other subset of variables.

In time series applications, the dependent variable is often the same variable as some of the others; it is just the time of observation that is different.
Statistical Models
We can think of the asymmetric relationship between a random variable Y and a variable x as a process that accepts x as input and outputs Y:

$$Y \leftarrow \text{unknown process} \leftarrow x,$$

or

$$X_t \leftarrow \text{unknown process} \leftarrow x_{t-1}, x_{t-2}, \ldots$$

The relationship might also be described by a statement of the form

$$Y \leftarrow f(x) \quad \text{or} \quad Y \approx f(x).$$
Statistical Models
There are several reasons that we build models and fit them to observational data.

One reason is to control a physical process.

Another reason is to understand a data-generating process.

A major reason also is to predict a future outcome of a data-generating process.

When the objective is understanding, a mathematical model (an equation or set of equations) usually works best.

When the emphasis is on prediction, we can form a “black box” or a computer algorithm that accepts a set of input values x, combines them in various ways, and then produces an output y.

This kind of statistical model is often called an algorithmic model. The algorithm is a sequence of rules, possibly randomized.
Statistical Models
“Statistics” means using data, and statistical models are fitted by use of observations.

In the case of mathematical models, fitting usually means estimation.

In some applications with mathematical models, the fitting is “smoothing”, and the model itself may be semi-parametric.

In the case of algorithmic models, the fitting is “training”: a set of inputs with known outputs is run through the model, and rules are chosen such that similar input would yield similar output.

There is a very interesting article by Leo Breiman called “Statistical modeling: The two cultures” in Statistical Science in 2001.
Use of Statistical Models in Time Series
ARMA and GARCH models can be used to understand relationships, both serial and, in their multivariate versions, among different time series.

They also can be used in prediction or forecasting.

Other types of statistical models may be more or less useful in forecasting.

One retrospective use of any smoothing model is to give a clearer picture of the past, and, possibly, to give a short-term forecast.

Another use of smoothing models is to provide an estimate of the volatility of a process.
Estimation of the Volatility Using Smoothing Models

There are various ways the volatility can be estimated.

One interesting way, illustrated in Example 4.6 in Tsay, is based on the simple diffusion model

$$Y_t = \mu(x_{t-1})\,\mathrm{d}t + \sigma(x_{t-1})\,\mathrm{d}W_t,$$

where $W_t$ is a standard Brownian motion (i.e., with variance 1 per unit time), and $Y_t = x_t - x_{t-1}$.

(We’ll discuss this model more later.)
Estimation of the Volatility Using Smoothing Models

The problem, of course, is that a standard deviation cannot be estimated based on a single observation.

If, however, we assume that $\sigma(x_{t-1})$ is smoothly varying in time, then

$$|y_t| \approx \sigma(x_{t-1}),$$

and a smoothed regression of $|y_t|$ on $x_{t-1}$ is an estimate of $\sigma(x_{t-1})$, as in Figure 4.7(d) in Tsay.
What are the properties of such an estimator?
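A minimal sketch of this estimate in R (x is assumed to be the observed series; the bandwidth is an arbitrary illustrative choice):

y <- diff(x)            # y_t = x_t - x_{t-1}
x_lag <- head(x, -1)    # the lagged levels x_{t-1}
sig_hat <- ksmooth(x_lag, abs(y), kernel = "normal", bandwidth = 0.5)
plot(x_lag, abs(y))
lines(sig_hat, col = "red")   # smoothed estimate of sigma(x_{t-1})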
Statistical Models
Probably the most familiar statistical model, after simple probability models, is a parametric regression model.

The ARMA and GARCH models are also parametric models.

We’ll now discuss a couple of nonparametric statistical models that are also nonlinear.

Kernel regression is primarily a smoothing model.

A neural net is primarily a classification model, but is also a smoothing model.

The kernel regression model can be formulated in a simple mathematical form. The neural net model is an algorithmic model.

Although these are nonparametric models, they both have components that might be considered parametric.
Kernel Regression
Local regression is a type of nonlinear modeling.

The objective is prediction, rather than estimation of a parameter or building a model that aids in understanding of relationships.

A simple form of local regression is to use a filter or kernel function to provide local weighting of the observed data.

This approach ensures that at a given point the observations close to that point influence the estimate at the point more strongly than more distant observations.
Kernel Regression
Given pairs of data

$$(x_1, y_1), \ldots, (x_n, y_n),$$

for prediction of y at an arbitrary point x, in this approach we convolve the observations with a unimodal function that decreases rapidly away from the point:

$$y(x) = \frac{1}{N} \sum_i K(x - x_i)\, y_i.$$

The function K is the filter or the kernel.
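A minimal R sketch of this formula (the standard normal kernel and the bandwidth h are illustrative choices, and the function names are mine):

# The convolution form above, with a scaled (bandwidth-h) kernel.
kernel_predict <- function(x0, x, y, h = 1) {
  w <- dnorm((x0 - x) / h) / h   # K_h(x0 - x_i)
  sum(w * y) / length(x)         # (1/N) sum_i K_h(x0 - x_i) y_i
}

# A common variant normalizes by the total weight (Nadaraya-Watson).
nw_predict <- function(x0, x, y, h = 1) {
  w <- dnorm((x0 - x) / h)
  sum(w * y) / sum(w)
}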
Kernels
A kernel function has two arguments representing the two points in the convolution, but we typically use a single argument that represents a scaled distance between the two points.

We often choose the kernel function to be a probability density function; that is, one with the properties

$$K(x) \geq 0 \quad \text{and} \quad \int K(x)\,\mathrm{d}x = 1.$$

We also often scale the kernel with a scale or “bandwidth” h and define

$$K_h(x) = \frac{1}{h} K(x/h), \quad \text{so that} \quad \int K_h(x)\,\mathrm{d}x = 1.$$
Choice of Kernels
The standard normal density has the properties described above, so the kernel is often chosen to be the standard normal density.

As it turns out, the kernel density estimator is not very sensitive to the form of the kernel.

Although the kernel may be from a parametric family of distributions, in kernel density estimation, we do not estimate those parameters; hence, the kernel method is a nonparametric method.
Choice of Kernels
Sometimes, a kernel with finite support is easier to work with.

In the univariate case, a useful general form of a compact kernel is

$$K(t) = \kappa_{rs} (1 - |t|^r)^s \, \mathrm{I}_{[-1,1]}(t),$$

where

$$\kappa_{rs} = \frac{r}{2 B(1/r,\, s+1)}, \quad \text{for } r > 0,\ s \geq 0,$$

and $B(a, b)$ is the complete beta function.
Choice of Kernels
This general form leads to several simple specific cases:

• for r = 1 and s = 0, it is the rectangular or uniform kernel;
• for r = 1 and s = 1, it is the triangular kernel;
• for r = 2 and s = 1 ($\kappa_{rs} = 3/4$), it is the “Epanechnikov” kernel, which yields the optimal rate of convergence of the MISE;
• for r = 2 and s = 2 ($\kappa_{rs} = 15/16$), it is the “biweight” kernel.

If r = 2 and s → ∞, we have the Gaussian kernel (with some rescaling).
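These kernels are easy to compute directly; a small R sketch of the general form (the function name is mine; beta is R’s complete beta function):

compact_kernel <- function(t, r = 2, s = 1) {
  kappa <- r / (2 * beta(1/r, s + 1))              # kappa_rs
  ifelse(abs(t) <= 1, kappa * (1 - abs(t)^r)^s, 0)
}
compact_kernel(0, r = 2, s = 1)   # 0.75, the Epanechnikov constant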
Kernel Methods
In kernel methods, the locality of influence is controlled by a window around the point of interest.

The choice of the size of the window, or the “bandwidth”, is the most important issue in the use of kernel methods.

In univariate applications, the window size is just a length, usually denoted by “h” (except maybe in time series applications).

In practice, for a given choice of the size of the window, the argument of the kernel function is transformed to reflect the size.

In the multivariate case, the transformation is accomplished using a positive definite matrix, V, whose determinant measures the volume (size) of the window.
Choice of Bandwidth
There are two ways to choose a bandwidth.

One is based on the mean integrated squared error (MISE). In this method, the MISE for an assumed model is determined, and then the bandwidth that minimizes it is determined.

The other method is a data-based method: we use cross-validation to determine the optimal bandwidth.

In cross-validation, for a given bandwidth, we fit a model using all of the data except for a few points (“leave-out-d”), then determine the SSE using all of the data.

We do this over a grid of bandwidths. Then we do this multiple times (“k-fold cross-validation”).

The best bandwidth is the one that minimizes the SSE (from all data).
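A minimal sketch of the data-based approach (leave-out-1 rather than leave-out-d, with ksmooth as the smoother; both simplifications are illustrative):

# Leave-one-out cross-validation over a grid of bandwidths.
cv_bandwidth <- function(x, y, h_grid) {
  cv_sse <- sapply(h_grid, function(h) {
    pred <- sapply(seq_along(x), function(i)
      ksmooth(x[-i], y[-i], kernel = "normal",
              bandwidth = h, x.points = x[i])$y)
    sum((y - pred)^2, na.rm = TRUE)   # NA when no points fall in the window
  })
  h_grid[which.min(cv_sse)]
}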
Nonparametric Smoothing
There are various methods, such as running medians or running (weighted) means. (Running means are moving averages.)

Use of the kernel function is simple, and there are a number of functions in R to do various kinds of kernel fitting.

The simplest (for simple regression) is ksmooth.

The R function lowess does locally weighted smoothing using weighted running means.

These methods are widely used for smoothing time series. The emphasis is on prediction, rather than model building.
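For example (a hedged sketch; the bandwidth and the lowess span are arbitrary choices):

# x is assumed to be a numeric series observed at times 1, ..., n.
t <- seq_along(x)
plot(t, x, type = "l")
lines(ksmooth(t, x, kernel = "normal", bandwidth = 10), col = "red")
lines(lowess(t, x, f = 0.2), col = "blue")   # f = fraction of points used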
General Additive Time Series Models
A model of the form

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi} + \varepsilon_i$$

can be generalized by replacing the constant (but unknown) coefficients by unknown functions (with specified forms):

$$y_i = f_1(x) x_{1i} + \cdots + f_m(x) x_{mi} + \varepsilon_i.$$

Hastie and Tibshirani have written extensively on such models.
Artificial Neural Networks
One of the most common “black box” algorithms is called an “artificial neural network” because some of its early development was inspired in part by the behavior of biological neurons and the nervous system.

A neural network accepts a set of input values x, combines them into intermediate values (in a “hidden layer”), and then combines the values of the hidden layer into a single output y.
[Diagram: a feed-forward network with two inputs x1 and x2, a hidden layer of three neurons f1, f2, f3, and a single output y.]
Neural Networks
There are many variations of neural networks. (People avoiding perishing have published papers “introducing” and naming over 50 types.)

The simplest and most common class of neural networks is the class of feed-forward networks. This means that the graph of the network is directed, and the nodes can be numbered in such a way that the edges from any node all go to a node with a larger number.

Feed-forward networks are also called “back-propagation” networks.

For a given input, the neural net makes a decision about what is to be the output. This is done in various ways.
Neural Networks
A standard neural net with one hidden layer is described in terms of the number of input values, the number of neurons in the hidden layer, and the number of output values.

Thus, the NN in the illustration is a “2-3-1” network.

If some input goes directly to the output without being processed in a hidden layer, the network is called a “skip-layer” network.

For p inputs and m outputs, an NN can be thought of as a function

$$f: \mathbb{R}^p \to \mathbb{R}^m.$$
Neural Networks
From a mathematical standpoint, there is an interesting theorem due to Kolmogorov (1957) that states:

Given any continuous function $f: \mathbb{R}^p \to \mathbb{R}^m$, there is a network with p input nodes, 2p + 1 intermediate nodes, and m output nodes that implements the function y = f(x) exactly.
Neural Networks
Kolmogorov’s theorem is not useful for designing a neural net. Most NNs have fewer than 2p + 1 neurons in the hidden layer.

How do we measure the complexity of an NN model?

Mostly by the number of neurons in the hidden layer, even though the functions in those neurons can have varying degrees of complexity.

Also, for discontinuous functions or functions over nonconvex domains, multiple hidden layers may be necessary.
Activation Function
We can think of the functions in the hidden layer as being artificial neurons that pass information forward.

An activation function determines whether, or how, information is passed forward.

A common type of activation function is a signum or Heaviside function. In that case, the neuron is called a “perceptron”.

Another common type is linear; because it is “adaptive” when the neural net is trained, such a neuron is called an “adaline”.

The activation function may also be nonlinear, such as a sigmoid function. In that case, the neuron is called “logistic”.
Activation Functions
Various ways of combining the inputs $x_{ij}$ in the hidden layer (that is, various choices for the functions $f_d$) and various ways of putting the results of the functions together to yield predicted outputs $\hat{y}_i$ are tried.

For the ith input to the system, the activation function in the dth neuron in a hidden layer may be a particular weighted average of the input elements $x_{ij}$:

$$f_d(x_i) = \sum_j w_{dj} x_{ij}.$$

In a perceptron, all that matters is $\mathrm{sign}(f_d(x_i))$; in a linear neuron, the value $f_d(x_i)$ is used.

If the activation function is nonlinear, such as a logistic, then the output value may be

$$f_d\left(\sum_j w_{dj} x_{ij}\right).$$
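As a concrete illustration, a minimal sketch of a one-hidden-layer forward pass with logistic hidden neurons (all names and shapes here are mine):

sigmoid <- function(z) 1 / (1 + exp(-z))
# x: input vector; W: (neurons x inputs) weight matrix; v: output weights.
forward <- function(x, W, v) {
  hidden <- sigmoid(W %*% x)   # f_d(sum_j w_dj x_j) for each hidden neuron d
  sum(v * hidden)              # linear combination at the output node
}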
Neural Networks
A neural network is trained by means of a set of inputs with corresponding outputs. In this process, the activation functions are trained.

For example, in the linear threshold function involving the weights $w_{dj}$, we may begin with a set of weights $w_{dj}^{(0)}$ and then, based on how well the output values match those in the training set, update the weights as

$$w_{dj}^{(k+1)} = w_{dj}^{(k)} + \delta\left(w_{dj}^{(k)}\right).$$

Under a least-squares criterion, the rules that yield a minimum of

$$\sum_i (y_i - \hat{y}_i)^2$$

are chosen, and we say the neural net is trained.
Neural Networks in R
The R package nnet has functions for neural nets.

The most important function is nnet, which fits (or “trains”) a single-hidden-layer net. The function nnet produces an object of class nnet.

The generic functions summary and predict operate on objects of class nnet in the expected way.

The arguments for nnet are pretty standard. The model can be specified either by a formula and a dataframe, or by naming the input and output variables.

• size tells the number of nodes in the hidden layer.
• skip tells whether skip-layer connections are allowed.
• linout tells whether the output units are linear or logistic.
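For example, with the formula interface (the dataframe df and its variables are hypothetical):

library(nnet)
# df is assumed to have a response y and predictors x1 and x2.
fit <- nnet(y ~ x1 + x2, data = df, size = 2, linout = TRUE, skip = TRUE)
summary(fit)
predict(fit, newdata = df)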
Neural Networks in Time Series Applications
In a time series application, we have data $r_1, \ldots, r_n$, and for $i = k+1, \ldots, n$, we choose a subsequence $x_i = (r_{i-1}, \ldots, r_{i-k})$ as an input to produce an output $\hat{r}_i$ as a predictor of $r_i$.

For example, consider the time series shown on the next slide.
A Time Series

[Figure: plot of the series r_t against time, t = 1, ..., 100.]
Neural Networks in Time Series Applications
Let k = 3; that is, we will use $x_i = (r_{i-1}, r_{i-2}, r_{i-3})$ as an input to produce an output $\hat{r}_i$ as a predictor of $r_i$. (We lose the first 3 observations.)

This means that there will be 3 input nodes and 1 output node. Let’s use 2 nodes in the hidden layer; that is, the “size” is 2. Let’s allow skipping. We are using a 3-2-1 net with a possible skip layer.

Let’s train the neural net, then compute the fitted values.

library(nnet)
k <- 3
n <- length(r)
y_r <- r[(k+1):n]                                  # targets r_4, ..., r_n
x_r <- cbind(r[3:(n-1)], r[2:(n-2)], r[1:(n-3)])   # lags 1, 2, and 3
nn_r <- nnet(x_r, y_r, size = 2, linout = TRUE, maxit = 1000,
             skip = TRUE, decay = 0.01)
haty_nn_r <- predict(nn_r, x_r)
lines(c(rep(NA, k), haty_nn_r), col = "red")       # pad the k lost values
A Time Series

[Figure: the same series with the neural-net fitted values overlaid in red.]
Neural Networks
The neural network generally matches the directions (up or down), but the result is smoothed; that is, the neural network predictions miss the larger swings.

How would we use the neural net for forecasting?

We can only get one step ahead at a time, so we would simply iterate: forecast $\hat{r}_{n+1}$, then use it to forecast $\hat{r}_{n+2}$, and so on.
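A sketch of that iteration, continuing the example above (the horizon h is arbitrary):

h <- 10
r_ext <- r
for (j in 1:h) {
  x_new <- rbind(rev(tail(r_ext, k)))       # lags 1, ..., k for the next step
  r_ext <- c(r_ext, predict(nn_r, x_new))   # append the one-step forecast
}
forecasts <- tail(r_ext, h)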
Monte Carlo Forecasting
Monte Carlo can be used for forecasting in any time series model (a “parametric bootstrap”).

At forecast origin t, we forecast at the horizon t + h by use of the fitted (or assumed) model and simulated errors (or “innovations”).

Doing this many times, we get a sample of values $r^{(j)}_{t+h}$.

The mean of this sample is the estimator $\hat{r}_{t+h}$, and the sample quantiles provide confidence limits.
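A minimal sketch for a fitted AR(1) (the model choice, horizon, and number of replications are arbitrary):

fit <- arima(r, order = c(1, 0, 0))
phi <- coef(fit)["ar1"]; mu <- coef(fit)["intercept"]; s <- sqrt(fit$sigma2)
h <- 5; B <- 10000
paths <- replicate(B, {   # each column is one simulated future path
  x <- tail(r, 1)
  for (j in 1:h) x <- c(x, mu + phi * (tail(x, 1) - mu) + rnorm(1, 0, s))
  tail(x, h)
})
rowMeans(paths)                               # point forecasts
apply(paths, 1, quantile, c(0.025, 0.975))    # 95% forecast limits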
Time Series Models of Financial Data
Does it make sense that a financial time series depends only on its past values?

(Why might we think this? Is it really preposterous?)

Most economic time series are interrelated.

If the earnings of a company are up, maybe the stock price will go up (or down!).

If unemployment goes up, maybe housing stock prices will go down.

We will now discuss some topics from Chapter 8.
Bivariate Time Series
Often a time series consists of bivariate data at each time point:

$$x_1, x_2, \ldots, x_n$$
$$y_1, y_2, \ldots, y_n$$

In this case we are generally interested in how the two series change together.
Bivariate Time Series
For two series $x_t$ and $y_t$:

• cross-covariance function:
$$\gamma_{xy}(s, t) = \mathrm{E}\left((x_s - \mu_{xs})(y_t - \mu_{yt})\right)$$

• cross-correlation function (CCF):
$$\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\,\gamma_y(t, t)}}.$$
Stationarity in Bivariate Time Series
A bivariate process $\{x_t, y_t\}$ is said to be jointly stationary if each process is stationary and the cross-covariance function $\gamma_{xy}(s, t)$ is constant for fixed values of $s - t$; that is, for $h = s - t$,

$$\gamma_{xy}(s, t) = \gamma_{xy}(h) = \mathrm{E}\left((x_{t+h} - \mu_x)(y_t - \mu_y)\right).$$

We define the cross-correlation function (CCF) of a jointly stationary process to be

$$\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\,\gamma_y(0)}}.$$

(Recall that some people use a slightly different notation.)
Sample Cross-Covariance and CCF in a Bivariate Stationary Time Series

• sample cross-covariance function:
$$\hat{\gamma}_{xy}(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(y_t - \bar{y})$$

• sample cross-correlation function (CCF):
$$\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\,\hat{\gamma}_y(0)}}.$$

The R function ccf computes (and by default plots) the sample cross-covariance or cross-correlation function.
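For example (a hedged sketch with simulated series; y lags x by two periods, so the sample CCF should peak near lag −2):

set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 200))
y <- c(0, 0, x[1:198]) + rnorm(200, sd = 0.5)   # y_t = x_{t-2} + noise
ccf(x, y, lag.max = 20)   # plots rho_xy(h); peak near h = -2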
Multivariate Time Series
Often a time series consists of multivariate data at each time point:

$$x_1 = (x_{11}, x_{12}, \ldots, x_{1n})$$
$$x_2 = (x_{21}, x_{22}, \ldots, x_{2n})$$
$$\vdots$$
$$x_p = (x_{p1}, x_{p2}, \ldots, x_{pn})$$

(These are column vectors. Note the indexing; time comes second.)

The autocovariance matrix function:

$$\Gamma(h) = \mathrm{E}\left((x_{t+h} - \mu)(x_t - \mu)^{\mathrm{T}}\right).$$

The individual elements of $\Gamma(h)$ are

$$\gamma_{ij}(h) = \mathrm{E}\left((x_{t+h,i} - \mu_i)(x_{t,j} - \mu_j)\right).$$
Note that $\gamma_{ij}(h) = \gamma_{ji}(-h)$, and so

$$\Gamma(h) = \Gamma^{\mathrm{T}}(-h).$$