The APARCH Model
Asymmetric Power ARCH = APARCH
The APARCH model of Ding, Granger, and Engle (1993) is

$$A_t = \sigma_t \varepsilon_t, \qquad (1)$$

where $\varepsilon_t \overset{\mathrm{iid}}{\sim} \mathrm{N}(0,1)$, and

$$\sigma_t^\delta = \alpha_0 + \sum_{i=1}^{m} \alpha_i \left(|A_{t-i}| - \gamma_i A_{t-i}\right)^\delta + \sum_{j=1}^{s} \beta_j \sigma_{t-j}^\delta, \qquad (2)$$

where $\delta > 0$ and $-1 < \gamma_i < 1$.
ARCH(m): $\delta = 2$; $\gamma_i = 0$; $\beta_j = 0$
GARCH(m, s): $\delta = 2$; $\gamma_i = 0$
GJR-GARCH(m, s) or TGARCH(m, s): $\delta = 2$
The APARCH Model
The APARCH model exhibits several stylized properties of financial time series.

Even though the conditional distribution is normal, the unconditional distribution has excess kurtosis (fat tails).

Because a shock at time $t-1$ also impacts the variance at time $t$, the volatility is more likely to be high at time $t$ if it was also high at time $t-1$. This yields clustering of volatility.

The APARCH model, like the GJR-GARCH model, additionally captures asymmetry in return volatility. That is, volatility tends to increase more when returns are negative, as compared to positive returns of the same magnitude.

The APARCH model also yields the long-memory property of returns.
Fitting Time Series Models in R
There are a number of R functions that perform the computations to fit various time series models.

ARMA / ARIMA:              arima (stats)
ARMA order determination:  autofit (itsmr)
ARMA + GARCH:              garchFit (fGarch)
APARCH:                    garchFit (fGarch)

The APARCH model includes the TGARCH and GJR models, among others; see equation (2).

Also see the help page for fGarch-package in fGarch.
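As a concrete illustration, here is a minimal sketch of fitting an APARCH(1, 1) with garchFit (the returns vector r and the option settings are assumptions for illustration):

library(fGarch)
# r is assumed to be a numeric vector of returns.
fit_aparch <- garchFit(~ aparch(1, 1), data = r,
                       include.mean = TRUE,    # also fit a mean term
                       include.delta = TRUE,   # estimate the power delta
                       trace = FALSE)
summary(fit_aparch)   # reports alpha0, alpha1, gamma1, beta1, delta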
Other R Functions for Time Series Models
There are R functions for forecasting using different time series models that have been fitted.

ARMA / ARIMA:                      predict.Arima (stats)
APARCH (including ARMA + GARCH):   predict (fGarch)

There are also R functions for simulating data from different time series models.

ARMA / ARIMA:                      arima.sim (stats)
APARCH (including ARMA + GARCH):   garchSim (fGarch)
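For example, a hedged sketch of simulation plus forecasting (the GARCH(1, 1) parameter values below are made up):

library(fGarch)
# Simulate 1000 observations from a GARCH(1,1) specification.
spec <- garchSpec(model = list(omega = 1e-6, alpha = 0.1, beta = 0.8))
x <- garchSim(spec, n = 1000)

# Refit the model and forecast 10 steps ahead.
fit <- garchFit(~ garch(1, 1), data = x, trace = FALSE)
predict(fit, n.ahead = 10)   # mean forecasts and standard deviations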
Other Ways of Modeling Time Series
As we have emphasized, data from a general time series may have special characteristics that lead us to use special models, and to fit those models in special ways.

(What’s the main difference in our approach to fitting models for time series?)

Despite the special nature of time series data, general statistical models are often useful, especially if those models are fitted locally.
Statistical Models
There are several reasons that we build models and fit them to observational data.

In many applications, we are interested in some asymmetric relationship between a random variable Y and another variable x. (Both of these may be vectors.)

That is, we think of some subset of variables as being “dependent” on some other subset of variables.

In time series applications, the dependent variable is often the same variable as some of the others; it is just the time of observation that is different.
Statistical Models
We can think of the asymmetric relationship between a random variable Y and a variable x as a process that accepts x as input and outputs Y:

$$Y \leftarrow \text{unknown process} \leftarrow x,$$

or

$$X_t \leftarrow \text{unknown process} \leftarrow x_{t-1}, x_{t-2}, \ldots$$

The relationship might also be described by a statement of the form

$$Y \leftarrow f(x) \quad \text{or} \quad Y \approx f(x).$$
Statistical Models
There are several reasons that we build models and fit them to observational data.

One reason is to control a physical process.

Another reason is to understand a data-generating process.

A major reason also is to predict a future outcome of a data-generating process.

When the objective is understanding, a mathematical model (an equation or set of equations) usually works best.

When the emphasis is on prediction, we can form a “black box” or a computer algorithm that accepts a set of input values x, combines them in various ways, and then produces an output y.

This kind of statistical model is often called an algorithmic model. The algorithm is a sequence of rules, possibly randomized.
Statistical Models
“Statistics” means using data, and statistical models are fitted by use of observations.

In the case of mathematical models, fitting usually means estimation.

In some applications with mathematical models, the fitting is “smoothing”, and the model itself may be semi-parametric.

In the case of algorithmic models, the fitting is “training”: a set of inputs with known outputs is run through the model, and rules are chosen such that similar input would yield similar output.

There is a very interesting article by Leo Breiman called “Statistical modeling: The two cultures” in Statistical Science in 2001.
Use of Statistical Models in Time Series
ARMA and GARCH models can be used to understand relationships, both serial and, in their multivariate versions, among different time series.

They also can be used in prediction or forecasting.

Other types of statistical models may be more or less useful in forecasting.

One retrospective use of any smoothing model is to give a clearer picture of the past, and, possibly, to give a short-term forecast.

Another use of smoothing models is to provide an estimate of the volatility of a process.
Estimation of the Volatility Using Smoothing Models

There are various ways the volatility can be estimated.

One interesting way, illustrated in Example 4.6 in Tsay, is based on the simple diffusion model

$$Y_t = \mu(x_{t-1})\,\mathrm{d}t + \sigma(x_{t-1})\,\mathrm{d}W_t,$$

where $W_t$ is a standard Brownian motion (i.e., with variance 1 per unit time), and $Y_t = x_t - x_{t-1}$.

(We’ll discuss this model more later.)
Estimation of the Volatility Using Smoothing Models

The problem, of course, is that a standard deviation cannot be estimated based on a single observation.

If, however, we assume that $\sigma(x_{t-1})$ is smoothly varying in time, then

$$|y_t| \approx \sigma(x_{t-1}),$$

and a smoothed regression of $|y_t|$ on $x_{t-1}$ is an estimate of $\sigma(x_{t-1})$, as in Figure 4.7(d) in Tsay.
What are the properties of such an estimator?
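A minimal sketch of this estimate in R (x is assumed to be the observed series; the bandwidth is an arbitrary illustrative choice):

y <- diff(x)            # y_t = x_t - x_{t-1}
x_lag <- head(x, -1)    # the lagged levels x_{t-1}
sig_hat <- ksmooth(x_lag, abs(y), kernel = "normal", bandwidth = 0.5)
plot(x_lag, abs(y))
lines(sig_hat, col = "red")   # smoothed estimate of sigma(x_{t-1})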
Statistical Models
Probably the most familiar statistical model, after simple probability models, is a parametric regression model.

The ARMA and GARCH models are also parametric models.

We’ll now discuss a couple of nonparametric statistical models that are also nonlinear.

Kernel regression is primarily a smoothing model.

A neural net is primarily a classification model, but is also a smoothing model.

The kernel regression model can be formulated in a simple mathematical form. The neural net model is an algorithmic model.

Although these are nonparametric models, they both have components that might be considered parametric.
Kernel Regression
Local regression is a type of nonlinear modeling.

The objective is prediction, rather than estimation of a parameter or building a model that aids in understanding of relationships.

A simple form of local regression is to use a filter or kernel function to provide local weighting of the observed data.

This approach ensures that at a given point the observations close to that point influence the estimate at the point more strongly than more distant observations.
Kernel Regression
Given pairs of data

$$(x_1, y_1), \ldots, (x_n, y_n),$$

for prediction of y at an arbitrary point x, in this approach we convolve the observations with a unimodal function that decreases rapidly away from the point:

$$y(x) = \frac{1}{N} \sum_i K(x - x_i)\, y_i.$$

The function K is the filter or the kernel.
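A minimal R sketch of this formula (the standard normal kernel and the bandwidth h are illustrative choices, and the function names are mine):

# The convolution form above, with a scaled (bandwidth-h) kernel.
kernel_predict <- function(x0, x, y, h = 1) {
  w <- dnorm((x0 - x) / h) / h   # K_h(x0 - x_i)
  sum(w * y) / length(x)         # (1/N) sum_i K_h(x0 - x_i) y_i
}

# A common variant normalizes by the total weight (Nadaraya-Watson).
nw_predict <- function(x0, x, y, h = 1) {
  w <- dnorm((x0 - x) / h)
  sum(w * y) / sum(w)
}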
Kernels
A kernel function has two arguments representing the two points in the convolution, but we typically use a single argument that represents a scaled distance between the two points.

We often choose the kernel function to be a probability density function; that is, one with the properties

$$K(x) \geq 0 \quad \text{and} \quad \int K(x)\,\mathrm{d}x = 1.$$

We also often scale the kernel with a scale or “bandwidth” h and define

$$K_h(x) = \frac{1}{h} K(x/h), \quad \text{so that} \quad \int K_h(x)\,\mathrm{d}x = 1.$$
Choice of Kernels
The standard normal density has the properties described above, so the kernel is often chosen to be the standard normal density.

As it turns out, the kernel density estimator is not very sensitive to the form of the kernel.

Although the kernel may be from a parametric family of distributions, in kernel density estimation, we do not estimate those parameters; hence, the kernel method is a nonparametric method.
Choice of Kernels
Sometimes, a kernel with finite support is easier to work with.

In the univariate case, a useful general form of a compact kernel is

$$K(t) = \kappa_{rs} (1 - |t|^r)^s \, \mathrm{I}_{[-1,1]}(t),$$

where

$$\kappa_{rs} = \frac{r}{2 B(1/r,\, s+1)}, \quad \text{for } r > 0,\ s \geq 0,$$

and $B(a, b)$ is the complete beta function.
Choice of Kernels
This general form leads to several simple specific cases:

• for r = 1 and s = 0, it is the rectangular or uniform kernel;
• for r = 1 and s = 1, it is the triangular kernel;
• for r = 2 and s = 1 ($\kappa_{rs} = 3/4$), it is the “Epanechnikov” kernel, which yields the optimal rate of convergence of the MISE;
• for r = 2 and s = 2 ($\kappa_{rs} = 15/16$), it is the “biweight” kernel.

If r = 2 and s → ∞, we have the Gaussian kernel (with some rescaling).
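These kernels are easy to compute directly; a small R sketch of the general form (the function name is mine; beta is R’s complete beta function):

compact_kernel <- function(t, r = 2, s = 1) {
  kappa <- r / (2 * beta(1/r, s + 1))              # kappa_rs
  ifelse(abs(t) <= 1, kappa * (1 - abs(t)^r)^s, 0)
}
compact_kernel(0, r = 2, s = 1)   # 0.75, the Epanechnikov constant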
Kernel Methods
In kernel methods, the locality of influence is controlled by a window around the point of interest.

The choice of the size of the window, or the “bandwidth”, is the most important issue in the use of kernel methods.

In univariate applications, the window size is just a length, usually denoted by “h” (except maybe in time series applications).

In practice, for a given choice of the size of the window, the argument of the kernel function is transformed to reflect the size.

In the multivariate case, the transformation is accomplished using a positive definite matrix, V, whose determinant measures the volume (size) of the window.
Choice of Bandwidth
There are two ways to choose a bandwidth.

One is based on the mean integrated squared error (MISE). In this method, the MISE for an assumed model is determined, and then the bandwidth that minimizes it is determined.

The other method is a data-based method: we use cross-validation to determine the optimal bandwidth.

In cross-validation, for a given bandwidth, we fit a model using all of the data except for a few points (“leave-out-d”), then determine the SSE using all of the data.

We do this over a grid of bandwidths. Then we do this multiple times (“k-fold cross-validation”).

The best bandwidth is the one that minimizes the SSE (from all data).
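A minimal sketch of the data-based approach (leave-out-1 rather than leave-out-d, with ksmooth as the smoother; both simplifications are illustrative):

# Leave-one-out cross-validation over a grid of bandwidths.
cv_bandwidth <- function(x, y, h_grid) {
  cv_sse <- sapply(h_grid, function(h) {
    pred <- sapply(seq_along(x), function(i)
      ksmooth(x[-i], y[-i], kernel = "normal",
              bandwidth = h, x.points = x[i])$y)
    sum((y - pred)^2, na.rm = TRUE)   # NA when no points fall in the window
  })
  h_grid[which.min(cv_sse)]
}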
Nonparametric Smoothing
There are various methods, such as running medians or running (weighted) means. (Running means are moving averages.)

Use of the kernel function is simple, and there are a number of functions in R to do various kinds of kernel fitting.

The simplest (for simple regression) is ksmooth.

The R function lowess does locally weighted smoothing using weighted running means.

These methods are widely used for smoothing time series. The emphasis is on prediction, rather than model building.
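For example (a hedged sketch; the bandwidth and the lowess span are arbitrary choices):

# x is assumed to be a numeric series observed at times 1, ..., n.
t <- seq_along(x)
plot(t, x, type = "l")
lines(ksmooth(t, x, kernel = "normal", bandwidth = 10), col = "red")
lines(lowess(t, x, f = 0.2), col = "blue")   # f = fraction of points used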
General Additive Time Series Models
A model of the form

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi} + \varepsilon_i$$

can be generalized by replacing the constant (but unknown) coefficients by unknown functions (with specified forms):

$$y_i = f_1(x) x_{1i} + \cdots + f_m(x) x_{mi} + \varepsilon_i.$$

Hastie and Tibshirani have written extensively on such models.
Artificial Neural Networks
One of the most common “black box” algorithms is called an “artificial neural network” because some of its early development was inspired in part by the behavior of biological neurons and the nervous system.

A neural network accepts a set of input values x, combines them into intermediate values (in a “hidden layer”), and then combines the values of the hidden layer into a single output y.
[Diagram: a feed-forward network with two inputs x1 and x2, a hidden layer of three neurons f1, f2, f3, and a single output y.]
Neural Networks
There are many variations of neural networks. (People avoiding perishing have published papers “introducing” and naming over 50 types.)

The simplest and most common class of neural networks is the class of feed-forward networks. This means that the graph of the network is directed, and the nodes can be numbered in such a way that the edges from any node all go to a node with a larger number.

Feed-forward networks are also called “back-propagation” networks.

For a given input, the neural net makes a decision about what is to be the output. This is done in various ways.
Neural Networks
A standard neural net with one hidden layer is described in terms of the number of input values, the number of neurons in the hidden layer, and the number of output values.

Thus, the NN in the illustration is a “2-3-1” network.

If some input goes directly to the output without being processed in a hidden layer, the network is called a “skip-layer” network.

For p inputs and m outputs, an NN can be thought of as a function

$$f: \mathbb{R}^p \to \mathbb{R}^m.$$
Neural Networks
From a mathematical standpoint, there is an interesting theorem due to Kolmogorov (1957) that states:

Given any continuous function $f: \mathbb{R}^p \to \mathbb{R}^m$, there is a network with p input nodes, 2p + 1 intermediate nodes, and m output nodes that implements the function y = f(x) exactly.
Neural Networks
Kolmogorov’s theorem is not useful for designing a neural net. Most NNs have fewer than 2p + 1 neurons in the hidden layer.

How do we measure the complexity of an NN model?

Mostly by the number of neurons in the hidden layer, even though the functions in those neurons can have varying degrees of complexity.

Also, for discontinuous functions or functions over nonconvex domains, multiple hidden layers may be necessary.
Activation Function
We can think of the functions in the hidden layer as being artificial neurons that pass information forward.

An activation function determines whether, or how, information is passed forward.

A common type of activation function is a signum or Heaviside function. In that case, the neuron is called a “perceptron”.

Another common type is linear; because it is “adaptive” when the neural net is trained, such a neuron is called an “adaline”.

The activation function may also be nonlinear, such as a sigmoid function. In that case, the neuron is called “logistic”.
Activation Functions
Various ways of combining the inputs $x_{ij}$ in the hidden layer (that is, various choices for the functions $f_d$) and various ways of putting the results of the functions together to yield predicted outputs $\hat{y}_i$ are tried.

For the ith input to the system, the activation function in the dth neuron in a hidden layer may be a particular weighted average of the input elements $x_{ij}$:

$$f_d(x_i) = \sum_j w_{dj} x_{ij}.$$

In a perceptron, all that matters is $\mathrm{sign}(f_d(x_i))$; in a linear neuron, the value $f_d(x_i)$ is used.

If the activation function is nonlinear, such as a logistic, then the output value may be

$$f_d\left(\sum_j w_{dj} x_{ij}\right).$$
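As a concrete illustration, a minimal sketch of a one-hidden-layer forward pass with logistic hidden neurons (all names and shapes here are mine):

sigmoid <- function(z) 1 / (1 + exp(-z))
# x: input vector; W: (neurons x inputs) weight matrix; v: output weights.
forward <- function(x, W, v) {
  hidden <- sigmoid(W %*% x)   # f_d(sum_j w_dj x_j) for each hidden neuron d
  sum(v * hidden)              # linear combination at the output node
}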
Neural Networks
A neural network is trained by means of a set of inputs with corresponding outputs. In this process, the activation functions are trained.

For example, in the linear threshold function involving the weights $w_{dj}$, we may begin with a set of weights $w_{dj}^{(0)}$ and then, based on how well the output values match those in the training set, update the weights as

$$w_{dj}^{(k+1)} = w_{dj}^{(k)} + \delta\left(w_{dj}^{(k)}\right).$$

Under a least-squares criterion, the rules that yield a minimum of

$$\sum_i (y_i - \hat{y}_i)^2$$

are chosen, and we say the neural net is trained.
Neural Networks in R
The R package nnet has functions for neural nets.

The most important function is nnet, which fits (or “trains”) a single-hidden-layer net. The function nnet produces an object of class nnet.

The generic functions summary and predict operate on objects of class nnet in the expected way.

The arguments for nnet are pretty standard. The model can be specified either by a formula and a dataframe, or by naming the input and output variables.

• size tells the number of nodes in the hidden layer.
• skip tells whether skip-layer connections are allowed.
• linout tells whether the output units are linear or logistic.
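For example, with the formula interface (the dataframe df and its variables are hypothetical):

library(nnet)
# df is assumed to have a response y and predictors x1 and x2.
fit <- nnet(y ~ x1 + x2, data = df, size = 2, linout = TRUE, skip = TRUE)
summary(fit)
predict(fit, newdata = df)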
Neural Networks in Time Series Applications
In a time series application, we have data $r_1, \ldots, r_n$, and for $i = k+1, \ldots, n$, we choose a subsequence $x_i = (r_{i-1}, \ldots, r_{i-k})$ as an input to produce an output $\hat{r}_i$ as a predictor of $r_i$.

For example, consider the time series shown on the next slide.
A Time Series

[Figure: plot of the series r_t against time, t = 1, ..., 100.]
Neural Networks in Time Series Applications
Let k = 3; that is, we will use $x_i = (r_{i-1}, r_{i-2}, r_{i-3})$ as an input to produce an output $\hat{r}_i$ as a predictor of $r_i$. (We lose the first 3 observations.)

This means that there will be 3 input nodes and 1 output node. Let’s use 2 nodes in the hidden layer; that is, the “size” is 2. Let’s allow skipping. We are using a 3-2-1 net with a possible skip layer.

Let’s train the neural net, then compute the fitted values.

library(nnet)
k <- 3
n <- length(r)
y_r <- r[(k+1):n]                                  # targets r_4, ..., r_n
x_r <- cbind(r[3:(n-1)], r[2:(n-2)], r[1:(n-3)])   # lags 1, 2, and 3
nn_r <- nnet(x_r, y_r, size = 2, linout = TRUE, maxit = 1000,
             skip = TRUE, decay = 0.01)
haty_nn_r <- predict(nn_r, x_r)
lines(c(rep(NA, k), haty_nn_r), col = "red")       # pad the k lost values
A Time Series

[Figure: the same series with the neural-net fitted values overlaid in red.]
Neural Networks
The neural network generally matches the directions (up or down), but the result is smoothed; that is, the neural network predictions miss the larger swings.

How would we use the neural net for forecasting?

We can only get one step ahead at a time, so we would simply iterate: forecast $\hat{r}_{n+1}$, then use it to forecast $\hat{r}_{n+2}$, and so on.
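A sketch of that iteration, continuing the example above (the horizon h is arbitrary):

h <- 10
r_ext <- r
for (j in 1:h) {
  x_new <- rbind(rev(tail(r_ext, k)))       # lags 1, ..., k for the next step
  r_ext <- c(r_ext, predict(nn_r, x_new))   # append the one-step forecast
}
forecasts <- tail(r_ext, h)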
Monte Carlo Forecasting
Monte Carlo can be used for forecasting in any time series model (a “parametric bootstrap”).

At forecast origin t, we forecast at the horizon t + h by use of the fitted (or assumed) model and simulated errors (or “innovations”).

Doing this many times, we get a sample of values $r^{(j)}_{t+h}$.

The mean of this sample is the estimator $\hat{r}_{t+h}$, and the sample quantiles provide confidence limits.
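A minimal sketch for a fitted AR(1) (the model choice, horizon, and number of replications are arbitrary):

fit <- arima(r, order = c(1, 0, 0))
phi <- coef(fit)["ar1"]; mu <- coef(fit)["intercept"]; s <- sqrt(fit$sigma2)
h <- 5; B <- 10000
paths <- replicate(B, {   # each column is one simulated future path
  x <- tail(r, 1)
  for (j in 1:h) x <- c(x, mu + phi * (tail(x, 1) - mu) + rnorm(1, 0, s))
  tail(x, h)
})
rowMeans(paths)                               # point forecasts
apply(paths, 1, quantile, c(0.025, 0.975))    # 95% forecast limits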
Time Series Models of Financial Data
Does it make sense that a financial time series depends only on its past values?

(Why might we think this? Is it really preposterous?)

Most economic time series are interrelated.

If the earnings of a company are up, maybe the stock price will go up (or down!).

If unemployment goes up, maybe housing stock prices will go down.

We will now discuss some topics from Chapter 8.
Bivariate Time Series
Often a time series consists of bivariate data at each time point:

$$x_1, x_2, \ldots, x_n$$
$$y_1, y_2, \ldots, y_n$$

In this case we are generally interested in how the two series change together.
Bivariate Time Series
For two series $x_t$ and $y_t$:

• cross-covariance function:
$$\gamma_{xy}(s, t) = \mathrm{E}\left((x_s - \mu_{xs})(y_t - \mu_{yt})\right)$$

• cross-correlation function (CCF):
$$\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\,\gamma_y(t, t)}}.$$
Stationarity in Bivariate Time Series
A bivariate process $\{x_t, y_t\}$ is said to be jointly stationary if each process is stationary and the cross-covariance function $\gamma_{xy}(s, t)$ is constant for fixed values of $s - t$; that is, for $h = s - t$,

$$\gamma_{xy}(s, t) = \gamma_{xy}(h) = \mathrm{E}\left((x_{t+h} - \mu_x)(y_t - \mu_y)\right).$$

We define the cross-correlation function (CCF) of a jointly stationary process to be

$$\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\,\gamma_y(0)}}.$$

(Recall that some people use a slightly different notation.)
Sample Cross-Covariance and CCF in a Bivariate Stationary Time Series

• sample cross-covariance function:
$$\hat{\gamma}_{xy}(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(y_t - \bar{y})$$

• sample cross-correlation function (CCF):
$$\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\,\hat{\gamma}_y(0)}}.$$

The R function ccf computes (and by default plots) the sample cross-covariance or cross-correlation function.
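For example (a hedged sketch with simulated series; y lags x by two periods, so the sample CCF should peak near lag −2):

set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 200))
y <- c(0, 0, x[1:198]) + rnorm(200, sd = 0.5)   # y_t = x_{t-2} + noise
ccf(x, y, lag.max = 20)   # plots rho_xy(h); peak near h = -2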
Multivariate Time Series
Often a time series consists of multivariate data at each time point:

$$x_1 = (x_{11}, x_{12}, \ldots, x_{1n})$$
$$x_2 = (x_{21}, x_{22}, \ldots, x_{2n})$$
$$\vdots$$
$$x_p = (x_{p1}, x_{p2}, \ldots, x_{pn})$$

(These are column vectors. Note the indexing; time comes second.)

The autocovariance matrix function:

$$\Gamma(h) = \mathrm{E}\left((x_{t+h} - \mu)(x_t - \mu)^{\mathrm{T}}\right).$$

The individual elements of $\Gamma(h)$ are

$$\gamma_{ij}(h) = \mathrm{E}\left((x_{t+h,i} - \mu_i)(x_{t,j} - \mu_j)\right).$$
Note that $\gamma_{ij}(h) = \gamma_{ji}(-h)$, and so

$$\Gamma(h) = \Gamma^{\mathrm{T}}(-h).$$