
Modeling and estimation for nonstationary time series with applications to robust risk management

Ying Chen
Department of Statistics & Applied Probability, Risk Management Institute, National University of Singapore, 117546, Singapore.

Vladimir Spokoiny
Weierstrass-Institute, Mohrenstr. 39, 10117 Berlin, Germany

Summary. In the ideal Black-Scholes world, financial time series are assumed to be 1) stationary (time homogeneous) and 2) conditionally normally distributed given the past. These two assumptions have been widely used in many methods such as RiskMetrics. However, these assumptions are unrealistic. The primary aim of the paper is to account for nonstationarity and heavy tails in time series by presenting a local exponential smoothing approach, in which the smoothing parameter is adaptively selected at every time point and the heavy-tailedness of the process is taken into account. A complete theory addresses both issues. The proposed method is implemented in volatility estimation and risk management based on simulated and real data, including a financial crisis period (2007-2008). Numerical results show that the proposed method delivers accurate and sensitive estimates and forecasts.

Keywords: adaptive estimation; local exponential smoothing; non-stationary time series analysis; risk management

1. Introduction

In the ideal Black-Scholes world, financial time series are assumed to be 1) stationary (time homogeneous, i.e. all the variables obey the same dynamics over time) and 2) conditionally normally distributed given the past. These two assumptions have been widely used in many methods such as RiskMetrics, which has been considered an industry standard in risk management since its introduction by J.P. Morgan in 1994. However, these assumptions are very questionable as far as real-life data are concerned. The time homogeneity assumption does not allow one to model structural shifts or breaks on the market, nor to account for, e.g., macroeconomic or political changes that appear with very high probability over a long time span. The assumption of conditionally Gaussian innovations leads to underestimation of the market risk. Recent studies show that the Gaussian and sub-Gaussian distributions are too light-tailed to model the market risk associated with sudden shocks and crashes, and that heavy-tailed distributions like the Student-t or hyperbolic are more appropriate. A realistic estimation as well as prediction should account for both stylized facts of financial data, which is a rather complicated task. The reason is that these two issues are somewhat contradictory. A robust estimation which is stable against extremes and large shocks in financial data is automatically less sensitive to structural changes. In addition, an estimation which is appropriate for non-stationary time series may introduce heavy tails in the residuals artificially.

The objective of our study is to provide an approach for flexible modeling of financial time series which is sensitive to structural changes and robust against extremes and shocks on the market. The approach is quite general and its scope of applications goes far beyond financial time series.

1.1. Accounting for Non-stationarity
In the financial econometrics literature, one well-known feature is the high degree of persistence of volatility, i.e. the sample autocorrelation function decays hyperbolically. Figure 1 displays the sample ACF plot of an "observed" volatility series, the volatility index (VIX) of the Chicago Board Options Exchange (CBOE), ranging from January 2, 2004 to September 30, 2008. As a weighted average of implied volatilities on the S&P 500 index futures, the VIX reflects the market's expectation of volatility on an annualized basis. In the IGARCH and FIGARCH literature, the persistence is explained by a "long memory" process, see Bollerslev et al. (1992), Engle and Bollerslev (1986), Granger (1980), Granger and Joyeux (1980), and Hosking (1981). Empirical studies show that long memory models seem to provide a better description and predictability than short memory models such as GARCH(1,1), see Baillie et al. (1996). On the other hand, there is a dual view on the true source of the persistence. The theoretical results provided in Diebold and Inoue (2001), Granger and Hyung (2004) and Mikosch and Starica (2004) indicate that the persistence phenomenon may be induced by the non-stationarity of financial data and can be generated by a short-memory model with structural breaks or regime shifts.

Figure 1 also displays the historical values of the VIX. If the persistence is driven by long memory rather than by other causes such as non-stationarity, the volatility should be constant. It is, however, observed that the volatility process changes over time, especially after the global financial crisis that started in July 2007. This observation suggests that a model accounting for non-stationarity, e.g. a dynamic model with changing parameters, might be closer to the real process than a long memory model.


Fig. 1. The sample ACF plot of the VIX (left panel) and the daily closing price of the VIX (right panel), ranging from January 2, 2004 to September 30, 2008.

In other words, stationarity may only hold over a possibly short interval. In a data rich world, it is important to select the local interval over which the data are stationary and the estimation bias can be minimized. The standard way is to recalibrate (reestimate) the model parameters at every time point using the latest information from a time-varying window. More appropriately, one adopts the exponential smoothing approach to account for non-stationarity, which assigns weights to historical data that decrease exponentially with time. The choice of a small window or rapidly decreasing weights results in high variability of the estimated volatility and, as a consequence, of the estimated value of the portfolio risk from day to day. In turn, a large window or a low-pass volatility filtering method results in a loss of sensitivity to significant changes of the market situation. Although the stationary interval may be long or short at each time point, the window size or the weighting parameter is usually assumed constant over time in the standard approaches. For example, the exponential smoothing method has been widely used in financial analysis with success in forecasting. However, with a fixed value of the smoothing parameter, the method may deliver low estimation accuracy, even with an optimal choice of the smoothing parameter. In our study, we adopt a pointwise adaptive approach, which aims to select large windows or slowly decreasing weights in the time homogeneous situation and switches to high-pass filtering if a structural change is detected. In particular, we adaptively select for each time point a window over which the process is assumed to be driven by an exponential smoothing scheme.
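To make the trade-off concrete, the short Python sketch below (purely illustrative; the cut-off level c and the particular values of η are our own choices) computes the effective sample size Σ_m η^m and the lag at which the weight of an observation halves, for a few values of the smoothing parameter η.

```python
import numpy as np

# Illustrative only: effective sample size and half-life of exponentially
# decaying weights w_m = eta^m (m = 0, 1, ...), truncated when w_m < c.
c = 0.01
for eta in (0.60, 0.86, 0.94, 0.98):
    M = int(np.floor(np.log(c) / np.log(eta)))       # cut-off index
    weights = eta ** np.arange(M + 1)
    effective_n = weights.sum()                      # close to 1 / (1 - eta) for small c
    half_life = np.log(0.5) / np.log(eta)            # lag at which the weight halves
    print(f"eta={eta:.2f}  cut-off M={M:3d}  effective n={effective_n:6.1f}  "
          f"half-life={half_life:5.1f} lags")
```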

Recently, a number of local parametric methods have been developed which investigate structural shifts or, equivalently, adjust the smoothing parameter to avoid serious estimation errors and achieve the best possible accuracy of estimation. For example, Fan and Gu (2003) introduce several semiparametric techniques for estimating volatility and portfolio risk. Mercurio and Spokoiny (2004) present an approach to specify a local homogeneous interval, over which volatility is approximated by a constant. Nevertheless, these works focus on only one issue, namely the non-stationarity of the time series, and rely on the unrealistic Gaussian distributional assumption.

1.2. Accounting for Heavy Tails in Innovations
The evidence of non-Gaussian heavy-tailed distributions for the standardized innovations of financial time series is well documented. Among many other heavy-tailed distributional families, the normal inverse Gaussian (NIG) distribution has attracted much attention with its universal goodness-of-fit, see e.g. Embrechts et al. (2002). The NIG distribution nests many distributions, such as the Cauchy and normal distributions, and has turned out to be useful for, though not limited to, identifying the heavy tails of financial data. Some risk management models have been proposed based on the NIG distributional assumption, see e.g. Eberlein and Keller (1995). However, most models rely on one or another kind of parametric assumption and, hence, are not flexible enough for modeling structural changes in financial data.

The primary aim of the paper is to present a realistic approach that accounts for both features: non-stationarity and heavy tails in financial time series. The whole approach can be decomposed into a few steps. First we develop an adaptive procedure for estimation of the time dependent volatility under the assumption of conditionally Gaussian innovations. Then we show that the procedure continues to apply in the case of sub-Gaussian innovations (under some exponential moment conditions). To make this approach applicable to heavy-tailed data, we apply a power transformation to the underlying process. Box and Cox (1964) stimulated the application of power transformations to non-Gaussian variables to obtain another distribution closer to the normal and homoscedastic assumption. Here we follow this route and replace the squared returns by their p-th power to ensure that the resulting "observations" have exponential moments.

1.3. Volatility Estimation by Local Exponential Smoothing
Let S_t be an observed asset price process in discrete time, t = 1, 2, . . ., and let R_t denote the corresponding return process: R_t = log(S_t/S_{t-1}). We model this process via the conditional heteroskedasticity assumption:

\[ R_t = \sqrt{\theta_t}\,\varepsilon_t, \qquad (1) \]

where ε_t, t ≥ 1, is a sequence of standardized innovations satisfying IE(ε_t | F_{t-1}) = 0 and IE(ε_t² | F_{t-1}) = 1, F_{t-1} = σ(R_1, . . . , R_{t-1}) is the σ-field generated by the first t-1 observations, and θ_t is the volatility process, which is assumed to be predictable with respect to F_{t-1}.

In this paper we focus on the problem of filtering the parameter θ_t from the past observations R_1, . . . , R_{t-1}. This problem naturally arises as an important building block for many tasks of financial engineering such as Value-at-Risk or portfolio optimization. Among others, we refer to Christoffersen (2003) for a systematic introduction to risk analysis.

The exponential smoothing (ES) estimate and its variations have been considered as good approximations of the variance, obtained by assigning exponentially decaying weights to the past squared returns:

\[ \theta_t = (1-\eta) \sum_{m=0}^{\infty} \eta^m R_{t-m-1}^2, \qquad \eta \in [0, 1). \]

It is worth noting that the widely used ARCH and GARCH models can be represented by the ES. For example, the popular GARCH(1,1) setup can be reformulated as:

\[ \theta_t = \omega + \alpha R_{t-1}^2 + \beta \theta_{t-1} = \frac{\omega}{1-\beta} + \alpha \sum_{m=0}^{\infty} \beta^m R_{t-m-1}^2. \]

With a proper reparametrization, this is again an exponential smoothing estimate.

ES is in fact a local maximum likelihood estimate (MLE) based on the Gaussian distributional assumption on the innovations, see Section 2. One can expect that this method also does a good job if the innovations are not conditionally Gaussian but their distribution is not far from normal. Our theoretical and numerical results confirm this for the case of a sub-Gaussian distribution of the innovations ε_t, see Section 2 for more details.

To implement the ES approach, one first faces the problem of choosing the smoothing parameter η (or β), which can naturally be treated as a memory parameter. Values of η close to one correspond to a slow decay of the coefficients η^m and hence to a large averaging window, while small values of η result in high-pass filtering. The classical ES methods choose one constant smoothing (memory) parameter. For instance, in the RiskMetrics design, η = 0.94 has been considered an optimized value. This, however, raises the question of whether the experience-based value is really better than others and, further, whether the process is stationary over the selected window. Other, more reliable but computationally demanding, methods choose η by optimizing some objective function, e.g. forecasting errors, see Cheng et al. (2003).
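As a concrete illustration of the plain, non-adaptive ES estimate with a fixed memory parameter, the following Python sketch computes the exponentially weighted volatility estimate from a return series. The function name, the cut-off level c and the simulated example are our own illustrative choices, not part of the procedure described in the paper.

```python
import numpy as np

def es_volatility(returns, eta=0.94, c=0.01):
    """Plain exponential-smoothing volatility estimate (illustrative sketch).

    At each time t, past squared returns R_{t-m-1}^2 receive weight eta^m,
    weights below c are cut off, and the normalized weighted average is returned.
    """
    y = np.asarray(returns) ** 2                 # squared returns
    M = int(np.floor(np.log(c) / np.log(eta)))   # cut-off lag
    w = eta ** np.arange(M + 1)                  # weights eta^0, ..., eta^M
    theta = np.full(len(y), np.nan)
    for t in range(M + 1, len(y)):
        past = y[t - 1 - M:t][::-1]              # R_{t-1}^2, R_{t-2}^2, ...
        theta[t] = np.dot(w, past) / w.sum()     # normalized weighted average
    return theta

# Example with simulated Gaussian returns (volatility 1% per day):
rng = np.random.default_rng(0)
r = 0.01 * rng.standard_normal(500)
print(es_volatility(r, eta=0.94)[-1])
```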


In our study, the smoothing parameter is adaptively selected at every time point. Given a finite set η_{1,t}, . . . , η_{K,t} of possible values of the memory parameter, we calculate K local MLEs θ̃_t^{(k)} for a particular time point t. Then these "weak" estimates are aggregated into one adaptive estimate by using the Spatial Stagewise Aggregation (SAGES) procedure from Belomestny and Spokoiny (2007). In addition, we also present a procedure, referred to as LMS, that chooses η_{k,t} such that the corresponding MLE θ̃_t^{(k)} has the best performance among the considered set of K estimates. Furthermore, we extend the local exponential smoothing to the heavy-tailed NIG distributional framework. It is of practical interest to show that the quasi-ML estimation remains applicable under the NIG distributional assumption. Finally, we demonstrate the implementation of the proposed local exponential smoothing method in volatility estimation and risk management.

The paper is organized as follows. The proposed adaptive exponential smoothing procedure SAGES is described in Section 2. We especially focus on the choice of the tuning parameters of the procedure and suggest a general way of fixing those parameters using the so-called "propagation" condition. The quasi-ML estimation under the NIG distributional assumption is discussed in Section 3. Section 4 compares the predictability of the SAGES method and two dynamic models that use standard ways to account for non-stationarity of time series, i.e. the dynamic GARCH(1,1) and the dynamic ES, on the basis of simulated data. We also check the performance of the methods on real financial data, including two German assets, one US equity and two exchange rates, in Section 5. The backtesting results confirm the overall good performance and robustness of the proposed adaptive procedures. Our theoretical study in Section 2.2 claims a kind of "oracle" optimality for the proposed procedure, and the numerical results for simulated and real data demonstrate quite reasonable performance of the method in the situations we focus on. The proofs are collected in the Appendix in Section 6.

2. Adaptive estimation for sub-Gaussian innovations

This section presents the method of adaptive estimation of the time inhomogeneous volatility process θ_t via aggregation of ES estimates with different memory parameters η. In this section the innovations ε_t in the model (1) are assumed to be Gaussian or sub-Gaussian. An extension to heavy-tailed innovations is discussed in Section 3.

First we show that the ES estimate is a particular case of a local parametric volatility estimate and study some of its properties. Then we introduce the SAGES procedure for aggregating a family of "weak" ES estimates into one adaptive volatility estimate and study its properties in the case of sub-Gaussian innovations.

2.1. Local Parametric Modeling
A time-homogeneous (time-homoskedastic) model means that θ_t is constant. For the homogeneous model θ_t ≡ θ for t from a given time interval I, the parameter θ can be estimated using the (quasi) maximum likelihood method. Suppose first that the innovations ε_t are, conditionally on F_{t-1}, standard normal. Then the joint distribution of R_t for t ∈ I is described by the log-likelihood L_I(θ) = Σ_{t∈I} ℓ(Y_t, θ), where ℓ(y, θ) = -(1/2) log(2πθ) - y/(2θ) is the log-density of the normal distribution N(0, θ) and Y_t denotes the squared return, Y_t = R_t². The corresponding maximum likelihood estimate (MLE) maximizes this likelihood:

\[ \tilde\theta_I = \operatorname*{argmax}_{\theta\in\Theta} L_I(\theta) = \operatorname*{argmax}_{\theta\in\Theta} \sum_{t\in I} \ell(Y_t, \theta), \]

where Θ is a given parametric subset of IR_+. If the ε_t are not conditionally standard normal, the estimate θ̃_I is still meaningful and can be considered as a quasi MLE.

The assumption of time homogeneity is usually too restrictive if the time interval I is sufficiently large. The standard approach is to apply the parametric modeling in a vicinity of the point of interest t. The localizing scheme is generally given by a collection of weights W_t = {w_{st}}, which leads to the local log-likelihood

\[ L(W_t, \theta) = \sum_{s} \ell(Y_s, \theta)\, w_{st} \]

and to the local MLE θ̃_t defined as the maximizer of L(W_t, θ). In this paper we only consider the localizing scheme with exponentially decreasing weights w_{st} = η^{t-s} for s ≤ t, where η is the given "memory" parameter. This is motivated by the generally good performance of exponential smoothing. We also cut off the weights when they become smaller than some prescribed value c > 0, e.g. c = 0.01. However, the properties of the local estimate θ̃_t are general and apply to any localizing scheme.

We denote by θ̃_t the value maximizing the local log-likelihood L(W_t, θ):

\[ \tilde\theta_t = \operatorname*{argmax}_{\theta\in\Theta} L(W_t, \theta). \]

The volatility model is a particular case of an exponential family, so that closed form representations for the local MLE θ̃_t and for the corresponding fitted log-likelihood L(W_t, θ̃_t) are available.

Theorem 2.1. For every localizing scheme W_t, θ̃_t = N_t^{-1} Σ_s Y_s w_{st}, where N_t denotes the sum of the weights w_{st}: N_t = Σ_s w_{st}. Moreover, for every θ > 0 the fitted likelihood ratio L(W_t, θ̃_t, θ) = max_{θ'} L(W_t, θ', θ), with L(W_t, θ', θ) = L(W_t, θ') - L(W_t, θ), satisfies

\[ L(W_t, \tilde\theta_t, \theta) = N_t K(\tilde\theta_t, \theta), \qquad (2) \]

where K(θ, θ') = -0.5{ log(θ/θ') + 1 - θ/θ' } is the Kullback-Leibler information between the two normal distributions with variances θ and θ': K(θ, θ') = E_θ log(dP_θ/dP_{θ'}).

Proof. One can see that L(W_t, θ) = -(N_t/2) log(2πθ) - (2θ)^{-1} Σ_s Y_s w_{st}. This representation yields both assertions of the theorem by simple algebra; see Polzehl and Spokoiny (2006) for more details.

Remark 2.1. The results of Theorem 2.1 rely only on the structure of the function ℓ(y, θ) and do not utilize the assumption of conditional normality of the innovations ε_t. Therefore, they apply whatever the distribution of the innovations ε_t is.
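The closed form of Theorem 2.1 translates directly into code. The sketch below (Python; the helper names are ours) computes the local MLE as a normalized weighted average of squared returns and evaluates the fitted log-likelihood ratio N_t K(θ̃_t, θ).

```python
import numpy as np

def kl_normal_var(theta, theta_prime):
    """Kullback-Leibler information K(theta, theta') between N(0, theta)
    and N(0, theta') as given in Theorem 2.1."""
    r = theta / theta_prime
    return -0.5 * (np.log(r) + 1.0 - r)

def local_mle_and_lr(Y, w, theta0):
    """Local MLE (weighted average of squared returns) and the fitted
    likelihood ratio L(W_t, theta_tilde, theta0) = N_t * K(theta_tilde, theta0)."""
    Y, w = np.asarray(Y, float), np.asarray(w, float)
    N_t = w.sum()
    theta_tilde = np.dot(w, Y) / N_t
    return theta_tilde, N_t * kl_normal_var(theta_tilde, theta0)

# Example: exponential weights eta^{t-s} applied to 50 squared returns.
rng = np.random.default_rng(1)
Y = (0.01 * rng.standard_normal(50)) ** 2
w = 0.94 ** np.arange(len(Y))[::-1]          # most recent observation gets weight one
print(local_mle_and_lr(Y, w, theta0=1e-4))
```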


2.2. Some Properties of θ̃_t in the Homogeneous Situation
This section collects some useful properties of the (quasi) MLE θ̃_t and of the fitted log-likelihood L(W_t, θ̃_t, θ*) in the homogeneous situation θ_s = θ* for all s. The presented results extend Polzehl and Spokoiny (2006) from the case of Gaussian innovations to the more general situation of sub-Gaussian ε_t's.

In the latter work it is stated, for the time homogeneous model θ_s = θ* ∈ Θ for all s with w_s > 0 and for i.i.d. standard normal innovations ε_s, that for any z > 0

\[ IP_{\theta^*}\bigl( L(W_t, \tilde\theta_t, \theta^*) > z \bigr) \equiv IP_{\theta^*}\bigl( N_t K(\tilde\theta_t, \theta^*) > z \bigr) \le 2 e^{-z}. \qquad (3) \]

This result claims that the estimation loss measured by K(θ̃_t, θ*) is with high probability bounded by z/N_t, provided that z is sufficiently large. This fact is very useful: it helps to establish a risk bound for a power loss function and to construct confidence sets for the parameter θ*.

However, the assumption of normality of the innovations ε_t can be too restrictive and is often criticized in the financial literature. Below we extend this result and its corollary to the case of non-Gaussian innovations under some exponential moment conditions. We refer to this situation as the sub-Gaussian case. Later these results, in combination with the power transformation of the data, will be used for studying heavy-tailed innovations, see Section 5.

We assume the following condition on the set Θ of possible values of the volatility parameter.

(Θ) The set Θ is a compact interval in IR_+ not containing the point θ = 0.

Theorem 2.2. Assume (Θ). Let the innovations ε_s be i.i.d. with IE ε_s² = 1 and

\[ \log IE \exp\{ \lambda (\varepsilon_s^2 - 1) \} \le \kappa(\lambda) \qquad (4) \]

for some λ > 0 and some constant κ(λ). Then there is a constant µ_0 > 0 such that for all θ*, θ ∈ Θ

\[ IE_{\theta^*} \exp\{ \mu_0 L(W_t, \tilde\theta_t, \theta^*) \} \equiv IE_{\theta^*} \exp\{ \mu_0 N_t K(\tilde\theta_t, \theta^*) \} \le 1, \qquad (5) \]
\[ IP_{\theta^*}\bigl( L(W_t, \tilde\theta_t, \theta^*) > z \bigr) \equiv IP_{\theta^*}\bigl( N_t K(\tilde\theta_t, \theta^*) > z \bigr) \le 2 e^{-\mu_0 z}. \qquad (6) \]

Now we present a corollary of the exponential bound from Theorem 2.2.

Theorem 2.3. Assume (Θ) and (4). Then for any r > 0

\[ IE_{\theta^*} \bigl| L(W_t, \tilde\theta_t, \theta^*) \bigr|^r \equiv IE_{\theta^*} \bigl| N_t K(\tilde\theta_t, \theta^*) \bigr|^r \le r_r\, \mu_0^{-r}, \]

where r_r = 2r ∫_{z≥0} z^{r-1} e^{-z} dz = 2r Γ(r). Moreover, if z_α satisfies 2 e^{-µ_0 z_α} ≤ α, then

\[ E_{t,\alpha} = \bigl\{ \theta : N_t K(\tilde\theta_t, \theta) \le z_\alpha \bigr\} \qquad (7) \]

is an α-confidence set for the parameter θ* in the sense that

\[ IP_{\theta^*}\bigl( E_{t,\alpha} \not\ni \theta^* \bigr) \le \alpha. \]
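Since K(θ̃_t, θ), viewed as a function of θ, decreases to zero at θ = θ̃_t and increases afterwards, the confidence set (7) is an interval and can be obtained numerically by inverting N_t K(θ̃_t, θ) ≤ z_α. A minimal sketch (Python, using scipy.optimize.brentq; the numerical values in the example are arbitrary placeholders):

```python
import numpy as np
from scipy.optimize import brentq

def kl_normal_var(a, b):
    """K(a, b) = -0.5 * {log(a/b) + 1 - a/b} as in Theorem 2.1."""
    r = a / b
    return -0.5 * (np.log(r) + 1.0 - r)

def confidence_interval(theta_tilde, N_t, z_alpha):
    """Confidence set E_{t,alpha} = {theta : N_t K(theta_tilde, theta) <= z_alpha};
    for the Gaussian volatility model this is an interval around theta_tilde."""
    g = lambda theta: N_t * kl_normal_var(theta_tilde, theta) - z_alpha
    lower = brentq(g, 1e-12 * theta_tilde, theta_tilde)   # root below theta_tilde
    upper = brentq(g, theta_tilde, 1e6 * theta_tilde)     # root above theta_tilde
    return lower, upper

# Example: theta_tilde = 1e-4, N_t = 20, critical value z_alpha = 3.0.
print(confidence_interval(1e-4, 20.0, 3.0))
```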


2.3. Spatially Aggregated Exponential Smoothing (SAGES) Estimate
In this section we focus on the problem of adaptive (data-driven) estimation of the parameter θ_t. We assume that a finite set {η_k, k = 1, . . . , K} of values of the smoothing parameter is given. Every value η_k leads to the localizing weighting scheme w^(k)_{st} = η_k^{t-s} for s ≤ t and to the local ML estimate θ̃^(k)_t:

\[ N_k = \sum_{s} w^{(k)}_{st} = \sum_{m=0}^{M_k} \eta_k^m, \qquad (8) \]
\[ \tilde\theta^{(k)}_t = N_k^{-1} \sum_{s} w^{(k)}_{st} Y_s = N_k^{-1} \sum_{m=0}^{M_k} \eta_k^m\, Y_{t-m-1}, \qquad (9) \]

where M_k = log c / log η_k - 1 is the cutting point, which guarantees that the weights after M_k are bounded by the prescribed value c, i.e. η_k^{M_k+1} ≤ c. It is easy to see that the sum of weights N_k = Σ_s w^(k)_{st} does not depend on t; thus we suppress the index t in the notation. The corresponding fitted log-likelihood L(W^(k)_t, θ̃^(k)_t, θ) is:

\[ L(W^{(k)}_t, \tilde\theta^{(k)}_t, \theta) = N_k K(\tilde\theta^{(k)}_t, \theta). \]

The local MLEs θ̃^(k)_t will be referred to as "weak" estimates. Usually the parameter η_k runs over a wide range, from values close to one down to rather small values, so that at least one of them is "good" in the sense of estimation risk. However, the proper choice of the parameter η generally depends on the variability of the unknown random process θ_s. We aim to construct a data-driven estimate θ̂_t which performs nearly as well as the best one from this family.

In what follows we consider the stagewise aggregation (SAGES) method. The underlying idea of the method is to aggregate all the weak estimates in the form of a convex combination instead of choosing one of them. The procedure is sequential and starts with the estimate θ̃^(1)_t having the largest variability, that is, we set θ̂^(1)_t = θ̃^(1)_t. At every step k ≥ 2 the new estimate θ̂^(k)_t is constructed by aggregating the next "weak" estimate θ̃^(k)_t and the previously constructed estimate θ̂^(k-1)_t. Following Spokoiny (2010), the aggregation is done in terms of the canonical parameter υ, which relates to the natural parameter θ by υ = -1/(2θ). With υ̃^(k)_t = -1/(2θ̃^(k)_t) and υ̂^(k-1)_t = -1/(2θ̂^(k-1)_t),

\[ \hat\upsilon^{(k)}_t = \gamma_k \tilde\upsilon^{(k)}_t + (1 - \gamma_k)\, \hat\upsilon^{(k-1)}_t, \qquad \hat\theta^{(k)}_t = -1/(2\hat\upsilon^{(k)}_t). \]

Equivalently, one can write θ̂^(k)_t = ( γ_k/θ̃^(k)_t + (1 - γ_k)/θ̂^(k-1)_t )^{-1}. The mixing weights {γ_k} are computed on the basis of the fitted log-likelihood by checking whether the previously aggregated estimate θ̂^(k-1)_t is in agreement with the next "weak" estimate θ̃^(k)_t. The difference between these two estimates is measured by the quantity

\[ \gamma_k = K_{ag}\Bigl( \frac{1}{z_{k-1}}\, L(W^{(k)}_t, \tilde\theta^{(k)}_t, \hat\theta^{(k-1)}_t) \Bigr) = K_{ag}\Bigl( \frac{1}{z_{k-1}}\, N_k K(\tilde\theta^{(k)}_t, \hat\theta^{(k-1)}_t) \Bigr), \qquad (10) \]

where z_1, . . . , z_{K-1} are the parameters of the procedure, see Section 2.4 for more details, and K_ag(·) is the aggregation kernel. This kernel decreases monotonously on IR_+, is equal to one in a neighborhood of zero and vanishes outside the interval [0, 1], so that the mixing coefficient γ_k is one if there is no essential difference between θ̃^(k)_t and θ̂^(k-1)_t, and zero if the difference is significant. The significance level is measured by the "critical value" z_{k-1}. In the intermediate case, the mixing coefficient γ_k is between zero and one. The procedure terminates after step k if γ_k = 0, and in this case we define θ̂^(m)_t = θ̂^(k)_t = θ̂^(k-1)_t for all m > k. The formal definition reads as follows (a code sketch of this loop is given after the list):

(a) Initialization: θ̂^(1)_t = θ̃^(1)_t.
(b) Loop: for k ≥ 2,

\[ \hat\theta^{(k)}_t = \Bigl( \frac{\gamma_k}{\tilde\theta^{(k)}_t} + \frac{1 - \gamma_k}{\hat\theta^{(k-1)}_t} \Bigr)^{-1}, \]

where the aggregating parameter γ_k is computed by (10). If γ_k = 0, then terminate by letting θ̂^(k)_t = . . . = θ̂^(K)_t = θ̂^(k-1)_t.
(c) Final estimate: θ̂_t = θ̂^(K)_t.
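A compact Python sketch of this aggregation loop (the helper names are ours; the default kernel of Section 2.4 is assumed, and the grid of η's and the critical values z_k, e.g. those of Table 1, are treated as inputs):

```python
import numpy as np

def kl_normal_var(a, b):
    """K(a, b) = -0.5 * {log(a/b) + 1 - a/b} as in Theorem 2.1."""
    r = a / b
    return -0.5 * (np.log(r) + 1.0 - r)

def k_agg(u, b=1.0 / 6.0):
    """Default aggregation kernel K_ag(u) = {1 - (u - 1/6)_+}_+ (see Section 2.4)."""
    return max(0.0, 1.0 - max(0.0, u - b))

def sages_at_t(Y_past, etas, z, c=0.01):
    """SAGES estimate of theta_t from past squared returns Y_past
    (chronological order, most recent observation last).

    `etas` is the grid eta_1 < ... < eta_K, `z` the critical values
    z_1, ..., z_{K-1} (e.g. the values reported in Table 1)."""
    Y_past = np.asarray(Y_past, float)
    theta_weak, N = [], []
    for eta in etas:
        M = int(np.floor(np.log(c) / np.log(eta)))     # cut-off point M_k
        w = eta ** np.arange(M + 1)                    # weights eta^0, ..., eta^M_k
        past = Y_past[::-1][:M + 1]                    # Y_{t-1}, Y_{t-2}, ...
        N.append(w[:len(past)].sum())
        theta_weak.append(np.dot(w[:len(past)], past) / N[-1])
    theta_hat = theta_weak[0]                          # (a) initialization
    for k in range(1, len(etas)):                      # (b) aggregation loop
        gamma = k_agg(N[k] * kl_normal_var(theta_weak[k], theta_hat) / z[k - 1])
        if gamma == 0.0:                               # early stopping
            break
        theta_hat = 1.0 / (gamma / theta_weak[k] + (1.0 - gamma) / theta_hat)
    return theta_hat                                   # (c) final estimate
```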

In the special case of the SAGES procedure with binary γ_k equal to zero or one, every estimate θ̂^(k)_t, and hence the resulting estimate θ̂_t, coincides with one of the "weak" estimates θ̃^(k)_t. This fact can easily be seen by induction: indeed, if γ_k = 1, then θ̂^(k)_t = θ̃^(k)_t, and if γ_k = 0, then θ̂^(k)_t = θ̂^(k-1)_t. Therefore, in this situation the SAGES method reduces to a kind of local model selection (LMS) procedure. One limitation of SAGES compared to the alternative LMS approach is that it may magnify the bias through the summation, which will be illustrated in the simulation study below. In the meanwhile, LMS may suffer from higher variability since it only considers a discrete and finite set of values of the smoothing parameter.

The next section discusses in detail the choice of the parameters and the identification of the critical values for the SAGES procedure.

2.4. Parameter Choice Using the Propagation Condition
To run the procedure, one has to specify the setup and fix the involved parameters.

The considered setup mainly concerns the set of localizing schemes W^(k)_t = {w^(k)_{st}} for k = 1, . . . , K yielding the set of "weak" estimates θ̃^(k)_t. Due to Theorem 2.2, the variability of every θ̃^(k)_t is characterized by the local sample size N_k (the sum of the corresponding weights w^(k)_{st} over s), which increases with k. In this paper we focus on exponentially decreasing localizing schemes, so that every W^(k)_t is completely specified by the rate η_k and the cutting level c.

So, the aggregating procedure for a family of "weak" ES estimates assumes that a growing sequence of values η_1 < η_2 < . . . < η_K is given in advance. This set leads to the sequence of localizing schemes W^(k)_t with w^(k)_{st} = η_k^{t-s} for s ≤ t and η_k^{t-s} > c, and w^(k)_{st} = 0 otherwise. The corresponding set of "weak" estimates θ̃^(k)_t is defined by (9). The procedure applies to any such sequence for which the following condition is satisfied:

(MD) for some u_0, u with 0 < u_0 ≤ u < 1, the values N_1, . . . , N_K satisfy u_0 ≤ N_{k-1}/N_k ≤ u.


Here we present one example of constructing such a set {η_k}, which is used in our simulation study and application examples.

Example 2.1. [Set {η_k}] Given values η_1 < 1 and a > 1, define the sequence by

\[ \frac{N_{k+1}}{N_k} \approx \frac{1-\eta_k}{1-\eta_{k+1}} = a > 1. \qquad (11) \]

The coefficient a controls the speed at which the variability decreases. The starting value η_1 should be sufficiently small to provide a reasonable degree of localization. Our default values are a = 1.25, η_1 = 0.6, and c = 0.01. The total number K of the considered localizing schemes is fixed by the condition that η_K does not exceed a prescribed value η* < 1. One can expect a very minor influence of the parameters a, c on the performance of the procedure. This is confirmed by our simulation study presented in the preprint.
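A minimal sketch of this grid construction in Python (function name ours), iterating 1 - η_{k+1} = (1 - η_k)/a until the next value would exceed η*:

```python
def eta_grid(eta1=0.6, a=1.25, eta_star=0.985):
    """Memory-parameter grid following the rule 1 - eta_{k+1} = (1 - eta_k) / a."""
    etas = [eta1]
    while True:
        nxt = 1.0 - (1.0 - etas[-1]) / a
        if nxt > eta_star:
            break
        etas.append(nxt)
    return etas

etas = eta_grid()
print(len(etas), [round(e, 3) for e in etas])
# With the default values this reproduces the K = 15 grid
# eta_1 = 0.600, eta_2 = 0.680, ..., eta_15 = 0.982 used in Table 1.
```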

The definition of the mixing coefficients γ_k involves the "aggregation" kernel K_ag. Our theoretical study is done under the following assumptions on this kernel:

(K_ag) The aggregation kernel K_ag is monotonously decreasing on IR_+, with K_ag(0) = 1 and K_ag(1) = 0. Moreover, there exists some b ∈ (0, 1) such that K_ag(u) = 1 for u ≤ b.

Our default choice is K_ag(u) = {1 - (u - 1/6)_+}_+, so that K_ag(u) = 1 for u ≤ 1/6. Another choice is the uniform aggregation kernel K_ag(u) = 1(u ≤ 1). This choice leads to binary mixing coefficients γ_k and hence to the local model selection procedure.

Next we discuss the most important question of choosing the critical values z_k.

The idea behind selecting the critical values z_k is to provide the prescribed performance of the procedure in the simple parametric situation with θ_t ≡ θ*. In this situation, all the squared returns Y_t are i.i.d. and follow the equation Y_t = θ* ε_t². The corresponding joint distribution of all Y_t is denoted by IP_{θ*}. The approach assumes that the distribution of the innovations ε_s is known and satisfies condition (4). A natural candidate is the Gaussian distribution. However, we consider below in Section 3 the case when the ε_s's are obtained from the normal inverse Gaussian distribution, a heavy-tailed distribution, by some power transformation.

The way of selecting the critical values is based on the so-called "propagation" condition, and it can be formulated in a quite general setup. Recall that the SAGES procedure is sequential and delivers after step k the estimate θ̂^(k)_t, which depends on the parameters z_1, . . . , z_{k-1}. We now consider the performance of this procedure in the simple "parametric" situation of constant volatility θ_t ≡ θ*. In this case the "ideal" or optimal choice among the first k estimates θ̃^(1)_t, . . . , θ̃^(k)_t is the one with the smallest variability, that is, the latest estimate θ̃^(k)_t, whose variability is measured by the quantity N_k, see (3). Our approach is similar to the one widely used in hypothesis testing: select the parameters (critical values) by providing the prescribed error under the "null", that is, in the parametric situation. The only difference is that in the estimation problem the risk is measured by another loss function. This consideration leads to the following "propagation" condition: for all θ* ∈ Θ and all k = 2, . . . , K,

\[ IE_{\theta^*} \bigl| L(W^{(k)}_t, \tilde\theta^{(k)}_t, \hat\theta^{(k)}_t) \bigr|^r \equiv IE_{\theta^*} \bigl| N_k K(\tilde\theta^{(k)}_t, \hat\theta^{(k)}_t) \bigr|^r \le \frac{(k-1)\,\rho\, r_r}{K-1}. \qquad (12) \]


Here r_r is from Theorem 2.3, and r and ρ are fixed global parameters. "Propagation" means that under the true homogeneous model the estimate θ̂^(k)_t selected at step k is close to its non-adaptive counterpart θ̃^(k)_t, that is, the local "model" W_t is extended at every step. As a particular case for k = K, the condition (12) implies for θ̂_t = θ̂^(K)_t

\[ IE_{\theta^*} \bigl| N_K K(\tilde\theta^{(K)}_t, \hat\theta_t) \bigr|^r \le \rho\, r_r. \]

This means that the final adaptive estimate θ̂_t is sufficiently close to its non-adaptive counterpart θ̃^(K)_t.

The relation (12) gives us K - 1 inequalities to fix the K - 1 parameters z_1, . . . , z_{K-1}. However, these parameters enter (12) only implicitly and it is unclear how they can be selected in a numerical, algorithmic way. The next section describes a sequential procedure for selecting the parameters z_1, . . . , z_{K-1} one after another by Monte Carlo simulations.

The condition (12) is stated uniformly over θ*. However, the following technical result allows us to reduce the condition to any one particular θ*, e.g. θ* = 1.

Lemma 2.4. Let the squared returns Y_t follow the parametric model with constant volatility parameter θ*, that is, Y_t = θ* ε_t². Then the distribution of the "test statistic" L(W^(k)_t, θ̃^(k)_t, θ̂^(k-1)_t) = N_k K(θ̃^(k)_t, θ̂^(k-1)_t) under IP_{θ*} is the same for all θ* > 0.

The condition (12) involves two more "hyperparameters", r and ρ. These values are global and their choice depends only on the particular application, while the estimation procedure is local: it selects the memory parameter and constructs the estimate θ̂_t separately at each point. The parameters r and ρ can also be selected in a data-driven way by fixing some objective function, e.g. by minimizing the forecasting error. However, our experience based on applications and simulated results (available on demand) indicates a quite minor influence of these parameters within some region. In particular, our default choice r = 1/2 and ρ = 1 demonstrates a very good performance in most situations, and a further optimization yields only a minor improvement in terms of, e.g., forecasting error. The parameter r in (12) specifies the selected loss function. Our default choice r = 1/2 corresponds to the ℓ_1 loss and provides a stable and robust performance of the method. The parameter ρ is similar to a test level parameter and, exactly as in the testing setup, its choice depends upon the subjective requirements on the procedure. Small values of ρ mean that we put more weight on the performance of the method in the time homogeneous (parametric) situation, and such a choice leads to a rather conservative procedure with relatively large critical values. Increasing ρ results in a decrease of the critical values and an increase of the sensitivity of the method to changes in the underlying parameter θ_t, at the cost of some loss of stability in the time homogeneous situation. For most applications, a reasonable range of values of ρ is between 0.2 and 1. Our default is ρ = 1.

Sequential choice of the critical values by Monte Carlo simulations

Below we present one way of selecting the critical values z_k using Monte Carlo simulations from the parametric model, successively, starting from k = 1. To isolate the contribution of z_1 to the final risk of the method, we set all the remaining values equal to infinity: z_2 = . . . = z_{K-1} = ∞. Now, for every particular z_1, the whole set of critical values is fixed and one can run the procedure, leading to the estimates θ̂^(k)_t(z_1) for k = 2, . . . , K. The value z_1 is selected as the minimal one for which

\[ IE_{\theta^*} \bigl| N_k K(\tilde\theta^{(k)}_t, \hat\theta^{(k)}_t(z_1)) \bigr|^r \le \frac{\rho\, r_r}{K-1}, \qquad k = 2, \ldots, K. \]

Such a value exists because the choice z_1 = ∞ leads to θ̂^(k)_t(z_1) = θ̃^(k)_t for all k. Notice that the rule of "early stopping" (the procedure terminates and sets θ̂^(k)_t = . . . = θ̂^(K)_t = θ̂^(k-1)_t if γ_k = 0) is important here; otherwise z_k = ∞ would lead to γ_k = 1 and θ̂^(k)_t = θ̃^(k)_t for all k ≥ 2.

Next, with z_1 fixed, we select z_2 in a similar way. Suppose z_1, . . . , z_{k-1} have already been fixed. We set z_{k+1} = . . . = z_{K-1} = ∞ and vary z_k. Every particular choice of z_k leads to the estimates θ̂^(l)_t(z_1, . . . , z_k) for l = k + 1, . . . , K coming out of the procedure with the parameters z_1, . . . , z_k, ∞, . . . , ∞. We select z_k as the minimal value which fulfills

\[ IE_{\theta^*} \bigl| N_l K(\tilde\theta^{(l)}_t, \hat\theta^{(l)}_t(z_1, \ldots, z_k)) \bigr|^r \le \frac{k\,\rho\, r_r}{K-1}, \qquad l = k + 1, \ldots, K. \qquad (13) \]

By simple induction arguments one can see that such a value exists and that the final procedure with the parameters defined in this way fulfills (12).
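The sequential calibration can be organized as in the following simplified Python sketch (illustrative only: the number of Monte Carlo samples, the search grid for the z_k and the helper names are our own choices; by Lemma 2.4 we may set θ* = 1):

```python
import numpy as np
from math import gamma as gamma_fn

def kl_normal_var(a, b):
    r = a / b
    return -0.5 * (np.log(r) + 1.0 - r)

def k_agg(u, b=1.0 / 6.0):
    return max(0.0, 1.0 - max(0.0, u - b))

def weak_estimates(eps2, etas, c=0.01):
    """Weak ES estimates theta_tilde^(k) and local sample sizes N_k at the last
    time point of one simulated parametric sample (theta* = 1, Y_t = eps_t^2)."""
    theta, N = [], []
    for eta in etas:
        M = int(np.floor(np.log(c) / np.log(eta)))
        w = eta ** np.arange(M + 1)
        theta.append(np.dot(w, eps2[::-1][:M + 1]) / w.sum())
        N.append(w.sum())
    return np.array(theta), np.array(N)

def aggregate(theta_w, N, z):
    """SAGES aggregation for a given vector of critical values (np.inf allowed)."""
    hats = [theta_w[0]]
    for k in range(1, len(theta_w)):
        gam = k_agg(N[k] * kl_normal_var(theta_w[k], hats[-1]) / z[k - 1])
        if gam == 0.0:                                   # early stopping
            hats.extend([hats[-1]] * (len(theta_w) - k))
            break
        hats.append(1.0 / (gam / theta_w[k] + (1.0 - gam) / hats[-1]))
    return np.array(hats)

def calibrate_z(etas, M_sim=200, rho=1.0, r=0.5, z_grid=None, seed=0):
    """Sequential Monte Carlo choice of z_1, ..., z_{K-1} enforcing (13)."""
    rng = np.random.default_rng(seed)
    K = len(etas)
    z_grid = np.linspace(0.005, 1.0, 120) if z_grid is None else z_grid
    n_hist = int(np.ceil(np.log(0.01) / np.log(max(etas)))) + 1
    sims = [weak_estimates(rng.standard_normal(n_hist) ** 2, etas)
            for _ in range(M_sim)]
    r_r = 2.0 * r * gamma_fn(r)                          # r_r from Theorem 2.3
    z = np.full(K - 1, np.inf)                           # start with all z = infinity
    for k in range(K - 1):
        for cand in z_grid:                              # smallest z_{k+1} fulfilling (13)
            z[k] = cand
            risks = np.zeros(K)
            for theta_w, N in sims:
                hats = aggregate(theta_w, N, z)
                risks += (N * np.abs(kl_normal_var(theta_w, hats))) ** r / M_sim
            if np.all(risks[k + 1:] <= (k + 1) * rho * r_r / (K - 1)):
                break
    return z
```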

Note that the proposed Monte Carlo procedure heavily relies on the joint distribution of the estimates θ̃^(1)_t, . . . , θ̃^(K)_t under the parametric measure IP_{θ*}. In particular, it automatically accounts for the correlation between the estimates θ̃^(k)_t.

It is also worth mentioning that the numerical complexity of the proposed procedure is not very high. It suffices to generate M samples from IP_{θ*} once and to compute and store the estimates θ̃^(k,m)_t for every realization, m = 1, . . . , M and k = 1, . . . , K. The SAGES procedure operates with the estimates θ̃^(k)_t and there is no need to keep the samples themselves. Now, with a fixed set of parameters z_k, computing the estimates θ̂^(k)_t requires only a finite number of operations proportional to K. One can roughly bound the total complexity of the Monte Carlo study by C M K² for some fixed constant C.

Table 1 presents the critical values computed by the proposed algorithm from Section 2.4 for the default values of the "hyperparameters", r = 0.5 and α = 1. The set-up is fixed as follows: the set {η_k} is constructed as in Example 2.1 with a = 1.25 and η_1 = 0.6, and the weights w^(k)_{st} are cut off if w^(k)_{st} < c = 0.01. We also restrict the largest η_K to be smaller than η* = 0.985.

To understand the impact of using a continuous aggregation kernel, we also consider the LMS procedure, which comes out of the algorithm with the uniform aggregation kernel K_ag(u) = 1(u ≤ 1). The critical values are not very different for the two procedures.

2.5. Some Theoretical Properties of the SAGES Estimate
Belomestny and Spokoiny (2007) claimed an "oracle" property of the SAGES estimate θ̂_t in the context of the classification problem. The set-up of this paper is different because the data Y_s are not independent and the conditional distribution of every Y_t is not completely specified even for a given θ_t: it still depends on the distribution of the innovations ε_s. Therefore, the results from the latter work do not directly apply here, and below we indicate how they can be extended to the problem of volatility estimation.

The first result gives an upper bound for the critical values zk .


Table 1. Critical values of the SAGES and LMS methods

  k    η_k     M_k    N_k      z_k (SAGES)   z_k (LMS)
  1    0.600     9    2.485      0.192         0.192
  2    0.680    11    3.095      0.548         0.141
  3    0.744    15    3.872      0.587         0.091
  4    0.795    20    4.843      0.220         0.065
  5    0.836    25    6.045      0.134         0.053
  6    0.869    32    7.555      0.145         0.043
  7    0.895    41    9.446      0.117         0.035
  8    0.916    52   11.806      0.087         0.030
  9    0.933    66   14.759      0.076         0.025
 10    0.946    83   18.446      0.065         0.020
 11    0.957   104   23.051      0.050         0.016
 12    0.966   131   28.816      0.037         0.012
 13    0.973   165   36.024      0.022         0.007
 14    0.978   207   45.029      0.015         0.001
 15    0.982   259   56.280        --            --

Theorem 2.5. Let the innovations ε_t satisfy the conditions of Theorem 2.2. Assume (MD) and (K_ag). There are three constants a_0, a_1 and a_2, depending on u_0, u, b, and µ_0 only, such that the choice

\[ z_k = a_0 + a_1 \log \rho^{-1} + a_2\, r \log N_k \]

ensures the propagation condition (12) for all k ≤ K.

The construction of the procedure ensures a risk bound for the adaptive estimate θ̂_t in the time homogeneous situation, see (12). It is natural to expect that a similar behavior is valid in the situation when the time varying parameter θ_t does not significantly deviate from some constant value θ. Here we quantify this property and show how the deviation from the parametric time homogeneous situation can be measured.

Denote by I^(k)_t the support of the k-th weighting scheme corresponding to the memory parameter η_k: I^(k)_t = [t - M_k, t], k = 1, . . . , K. Define for each k and θ

\[ \Delta^{(k)}_t(\theta) = \sum_{s \in I^{(k)}_t} K\bigl( P_{\theta_s}, P_{\theta} \bigr), \qquad (14) \]

where K(P_{θ_s}, P_θ) denotes the Kullback-Leibler distance between the two distributions of Y_s with parameter values θ_s and θ. In the case of Gaussian innovations, K(P_{θ_s}, P_θ) = K(θ_s, θ). The value Δ^(k)_t(θ) can be considered as a distance from the time varying model at hand to the parametric model with the constant parameter θ on the interval I^(k)_t.

Note that the volatility θ_s is in general a random process. Thus, the value Δ^(k)_t(θ) is random as well. Our small modeling bias condition means that there is a number k° such that the modeling bias Δ^(k)_t(θ) is small with high probability for some θ and all k ≤ k°. Consider the corresponding estimate θ̂^(k°)_t obtained after the first k° steps of the algorithm. The next "propagation" result claims that the behavior of the procedure under the small modeling bias condition is essentially the same as in the pure parametric situation.

Theorem 2.6. Assume (Θ), (MD), and (4). Let θ and k° be such that

\[ \max_{k \le k^\circ} IE\, \Delta^{(k)}_t(\theta) \le \Delta \qquad (15) \]

for some Δ ≥ 0. Then for any r > 0

\[ IE_{f(\cdot)} \log\Bigl( 1 + \frac{N^r_{k^\circ} K^r(\tilde\theta^{(k^\circ)}_t, \hat\theta^{(k^\circ)}_t)}{\rho R_r} \Bigr) \le 1 + \Delta, \]
\[ IE_{f(\cdot)} \log\Bigl( 1 + \frac{N^r_{k^\circ} K^r(\tilde\theta^{(k^\circ)}_t, \theta)}{R_r} \Bigr) \le 1 + \Delta, \]

where R_r = r_r in the case of Gaussian innovations and R_r = µ_0^{-r} r_r in the case of sub-Gaussian innovations, with the constant µ_0 from Theorem 2.2.

Due to the "propagation" result, the procedure performs well as long as the "small modeling bias" condition Δ^(k)_t(θ) ≤ Δ is fulfilled. To establish an accuracy result for the final estimate θ̂_t, we have to check that the aggregated estimate θ̂^(k)_t does not vary much at the steps "after propagation", when the divergence Δ^(k)_t(θ) from the parametric model becomes large.

Theorem 2.7. It holds for k ≤ K that

\[ N_k K\bigl( \hat\theta^{(k)}_t, \hat\theta^{(k-1)}_t \bigr) \le z_k. \qquad (16) \]

Moreover, under (MD), it holds for every k′ with k < k′ ≤ K that

\[ N_k K\bigl( \hat\theta^{(k')}_t, \hat\theta^{(k)}_t \bigr) \le a^2 c_u^2\, \bar z_k, \qquad (17) \]

where c_u = (u^{-1/2} - 1)^{-1}, a is a constant depending on Θ only, and \bar z_k = \max_{l \ge k} z_l.

Remark 2.2. An interesting feature of this result is that it is fulfilled with probability one, that is, the control of stability "works" not only with high probability but always. This property follows directly from the construction of the procedure.

The combination of the "propagation" and "stability" statements implies the main result concerning the properties of the adaptive estimate θ̂_t. The result again claims the "oracle" accuracy N_{k°}^{-1/2} for θ̂_t, up to the log factor z_{k°}. We state the result for r = 1/2 only; an extension to arbitrary r > 0 is straightforward.

Theorem 2.8 ("Oracle" property). Assume (Θ), (MD), and (4), and let IE Δ^(k)_t(θ) ≤ Δ for all k ≤ k°, for some k°, θ and Δ. Then

\[ IE_{f(\cdot)} \log\Bigl( 1 + \frac{N^{1/2}_{k^\circ} K^{1/2}(\hat\theta_t, \theta)}{a R_{1/2}} \Bigr) \le \log\bigl( 1 + c_u R^{-1}_{1/2} \sqrt{z_{k^\circ}} \bigr) + \Delta + \alpha + 1, \]

where c_u is the constant from Theorem 2.7 and R_{1/2} is from Theorem 2.6.

Remark 2.3. Before proving the theorem, we briefly comment on the claimed result. By Theorem 2.6, the "oracle" estimate θ̃^(k°)_t ensures that the estimation loss K^{1/2}(θ̃^(k°)_t, θ) is stochastically bounded by Const./N^{1/2}_{k°}, where Const. is a constant depending on Δ from the condition (15). The "oracle" result claims the same property for the adaptive estimate θ̂_t, but the loss K^{1/2}(θ̂_t, θ) is now bounded by Const.·(z_{k°}/N_{k°})^{1/2}. By Theorem 2.5, the parameter z_{k°} is at most logarithmic in the sample size. Hence, the accuracy of adaptive estimation is of the same order as for the "oracle", up to a logarithmic factor which can be viewed as the "payment for adaptation". Belomestny and Spokoiny (2007) argued that the "oracle" result implies rate optimality of the adaptive estimate θ̂_t and that the log-factor z_{k°} cannot be removed or improved.

3. Accounting for Heavy Tails

Until now, we have presented the local exponential smoothing estimation method and its implementation details. The theoretical results are derived in the Gaussian and sub-Gaussian framework. Since financial time series often display heavy-tailed distributions which go far beyond the (sub-)Gaussian case, it is meaningful to discuss whether and how the adaptive method can reasonably be applied to real data. Rather than proposing a general approach covering all heavy-tailed distributions, we focus on one special family, the normal inverse Gaussian (NIG) distribution. This is above all motivated by the flexibility of the NIG distribution. With 4 parameters, the NIG distribution describes well the statistical features that matter for practical purposes, such as the heavy-tailed behavior relevant for risk management. The special interest in the NIG is also motivated by its good performance for real data analysis compared to some other popular heavy-tailed distributions, such as the Student-t, see Chen et al. (2010).

The NIG density is of the form:

\[ f_{NIG}(\varepsilon; \phi, \beta, \delta, \mu) = \frac{\phi\delta}{\pi}\, \frac{K_1\bigl(\phi\sqrt{\delta^2 + (\varepsilon-\mu)^2}\bigr)}{\sqrt{\delta^2 + (\varepsilon-\mu)^2}}\, \exp\bigl\{ \delta\sqrt{\phi^2 - \beta^2} + \beta(\varepsilon - \mu) \bigr\}, \]

where the distributional parameters fulfill the conditions µ ∈ IR, δ > 0 and |β| ≤ φ, and K_λ(·) is the modified Bessel function of the third kind,

\[ K_\lambda(y) = \frac{1}{2} \int_0^\infty u^{\lambda - 1} \exp\Bigl\{ -\frac{y}{2}\bigl( u + u^{-1} \bigr) \Bigr\}\, du. \]

We refer to Barndorff-Nielsen (1997) for a detailed description of the NIG distribution.

Under the NIG distributional assumption, the SAGES method is not directly applicable. The adaptive approach assumes that the distribution of the innovations satisfies condition (4). The exponential moment IE{exp(λε_t²)} of the squared NIG innovations ε_t², however, does not exist because of the heavy tails. The (quasi) ML estimates computed from the NIG innovations have high variability, so that the non-asymptotic bound (4) cannot be established. In other words, the results of Section 2.2 do not apply.
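For reference, a small Python sketch that evaluates the NIG density exactly as written above, using scipy.special.kv for the Bessel function K_1 (the parameter values in the example are arbitrary placeholders):

```python
import numpy as np
from scipy.special import kv  # modified Bessel function K_lambda

def nig_density(x, phi, beta, delta, mu):
    """NIG density f_NIG(x; phi, beta, delta, mu) as given in Section 3.

    Requires |beta| <= phi and delta > 0.
    """
    q = np.sqrt(delta**2 + (x - mu)**2)
    return (phi * delta / np.pi) * kv(1, phi * q) / q \
        * np.exp(delta * np.sqrt(phi**2 - beta**2) + beta * (x - mu))

# Quick sanity check with arbitrary parameters: the density should integrate to about 1.
x = np.linspace(-15.0, 15.0, 20001)
dx = x[1] - x[0]
print(nig_density(x, phi=1.7, beta=-0.5, delta=1.5, mu=0.46).sum() * dx)
```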

To implement the SAGES method, we suggest replacing the squared returns Y_t by their p-th power:

\[ y_{t,p} = Y_t^p = \theta_t^p\, |\varepsilon_t|^{2p}. \qquad (18) \]

Note that, by equation (19) below, the "new" standardized squared innovations ε²_{t,p} = y_{t,p}/ϑ_{t,p} = Y_t^p/(C_p θ_t^p) automatically satisfy IE{ε²_{t,p} | F_{t-1}} = 1. This power transformation makes the distribution of the "new" data closer to the Gaussian case. For example, the kurtosis of the S&P 500 index futures data reduces from 12.6239 (original data) to 2.1785 (p = 0.5) and 2.8659 (p = 0.1). The proposed method with power transformation of course depends on an appropriate choice of p. The choice 0 ≤ p < 1/2 ensures that the resulting "observations" y_{t,p} follow a (sub-)Gaussian distribution and that the exponential moments required in condition (4) exist, see Chen et al. (2008). This theoretically justifies the SAGES procedure applied to the transformed data y_{t,p}.
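The effect of the power transformation is easy to reproduce. The sketch below uses simulated Student-t returns as a stand-in for heavy-tailed data (so the numbers differ from the S&P 500 figures quoted above) and compares the sample excess kurtosis of Y_t^p for several p.

```python
import numpy as np
from scipy.stats import kurtosis, t as student_t

# Simulated heavy-tailed returns (Student-t innovations as a stand-in for NIG).
rng = np.random.default_rng(2)
r = 0.01 * student_t.rvs(df=4, size=5000, random_state=rng)
Y = r**2                                   # squared returns

for p in (1.0, 0.5, 0.1):
    y_p = Y**p                             # power-transformed "observations"
    print(f"p = {p:3.1f}: excess kurtosis = {kurtosis(y_p):7.2f}")
# The transformed data with p < 1/2 are far less heavy-tailed than the raw Y_t.
```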

One question remains in the implementation, namely how to choose the critical values z_k. Exact knowledge of the NIG distributional parameters of the innovations is difficult to obtain in a real-life situation before the volatility of the time series has been estimated. Without this information, the critical values used to adaptively estimate the volatility cannot be calculated. Since the transformed data are closer to the Gaussian case, we suggest applying the critical values z_k computed for the Gaussian case as initial values.

From the power transformation, one easily gets

\[ IE\{ y_{t,p} \mid F_{t-1} \} = IE\{ Y_t^p \mid F_{t-1} \} = \theta_t^p\, IE|\varepsilon_t|^{2p} = \theta_t^p\, C_p = \vartheta_{t,p}. \qquad (19) \]

The adaptive procedure delivers the estimate ϑ̂_{t,p} of ϑ_{t,p} based on the "new" data. The parameter C_p = IE|ε_t|^{2p} is a constant that depends on p and on the distribution of the innovations ε_t, which is assumed to be NIG. It is clear that the variable ϑ_{t,p} is in one-to-one correspondence with the parameter θ_t.

To get the estimate of θ_t from the relation (19), the value of C_p is required. It can be obtained using the standardization condition IE{ε_t² | F_{t-1}} = 1. This suggests choosing C_p such that the estimated squared innovations ε̂_t² = Y_t/θ̂_t have a unit sample mean:

\[ n^{-1} C_p^{1/p} \sum_{t=t_0}^{t_1} \frac{Y_t}{\hat\vartheta_{t,p}^{1/p}} = 1, \qquad (20) \]

where n = t_1 - t_0 + 1 denotes the number of observations.

The above implementation proceeds with the critical values z_k computed for the Gaussian case. One can redo the estimation using critical values recalculated on the basis of the estimated innovations. In Section 4 we report a sensitivity analysis of SAGES with respect to the parameter p and the critical values z_k. It shows that using p around 1/2 works well in terms of predictability. More interestingly, the Gaussian-based critical values already deliver quite good forecasts, and using critical values calculated from the true NIG distributional parameters only slightly improves the accuracy.

The adaptive procedure for the NIG innovations is summarized as follows (a sketch in code is given after the list):

(a) Apply the power transformation to the squared returns Y_t: y_{t,p} = Y_t^p.
(b) Compute the estimate ϑ̂_{t,p} of the parameter ϑ_{t,p} from y_{t,p}, applying the critical values z_k obtained for the Gaussian case.
(c) Estimate the value C_p from equation (20).
(d) Compute the estimates θ̂_t = (ϑ̂_{t,p}/C_p)^{1/p} and identify the NIG distributional parameters from ε̂_t = R_t θ̂_t^{-1/2}.
(e) (Optional) Recalculate the critical values z_k with the identified NIG parameters using Monte Carlo simulation and repeat the above procedure to estimate θ_t.
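A minimal sketch of steps (a)-(d) in Python. The adaptive SAGES estimator of ϑ_{t,p} is abstracted as a callable argument; for runnability we plug in a plain exponential smoother as a placeholder, which is our own simplification and not the procedure of Section 2.

```python
import numpy as np

def ewma(x, eta=0.94):
    """Simple exponential smoothing (placeholder for the adaptive SAGES
    estimate of vartheta_{t,p}); out[t] uses observations up to t-1 only."""
    out = np.full(len(x), np.nan)
    s = x[0]
    for t in range(1, len(x)):
        out[t] = s
        s = eta * s + (1.0 - eta) * x[t]
    return out

def nig_adaptive_volatility(R, p=0.5, estimate_vartheta=ewma):
    """Steps (a)-(d) of the summarized procedure (illustrative sketch only)."""
    R = np.asarray(R, float)
    Y = R**2
    y_p = Y**p                                   # (a) power transformation
    vartheta = estimate_vartheta(y_p)            # (b) estimate of vartheta_{t,p}
    ok = ~np.isnan(vartheta)
    # (c) calibrate C_p from the standardization condition (20)
    C_p = np.mean(Y[ok] / vartheta[ok] ** (1.0 / p)) ** (-p)
    theta = (vartheta / C_p) ** (1.0 / p)        # (d) back-transformed volatility
    eps = R[ok] / np.sqrt(theta[ok])             # standardized innovations for NIG fitting
    return theta, eps, C_p

# Usage (given a return series `R`):  theta_hat, eps_hat, C_p = nig_adaptive_volatility(R, p=0.5)
```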

All the theoretical results from Section 2.5 apply to the so constructed estimate ϑ̂_{t,p} of the parameter ϑ_{t,p} provided p < 1/2. This automatically yields the "oracle" accuracy for the back-transformed estimate θ̂_t of the original volatility θ_t. For convenience of reference, we state the "oracle" result. Below, P_ϑ denotes the distribution of y_{t,p} = ϑ|ε_t|^{2p} with NIG ε_t. Note that neither the procedure nor the result assumes that the parameters of the NIG distribution are known.

Table 2. Simulation scenarios: volatility process

            t ∈ [1, 300]    t ∈ [301, 700]    t ∈ [701, 1000]
  VIX       VIX from Jan. 2, 2004 to Sep. 30, 2008 ( t ∈ [1, 1195] )
  Bl        0.1612          0.0948            0.1612
  Bu        0.1612          0.2639            0.1612
  Sl        0.1612          0.1357            0.1612
  Su        0.1612          0.1868            0.1612

Theorem 3.1 ("Oracle" property for NIG innovations). Let the innovations ε_t be NIG and p < 1/2. Assume (Θ) and (MD), and let, for some k°, ϑ and Δ,

\[ IE \sum_{t \in I} K\bigl( P_{\vartheta_{t,p}}, P_{\vartheta} \bigr) \le \Delta. \]

Then

\[ IE_{f(\cdot)} \log\Bigl( 1 + \frac{N^{1/2}_{k^\circ} K^{1/2}(\hat\vartheta_{t,p}, \vartheta)}{a R_{1/2}} \Bigr) \le \log\bigl( 1 + c_u R^{-1}_{1/2} \sqrt{z_{k^\circ}} \bigr) + \Delta + \alpha + 1, \]

where c_u is the constant from Theorem 2.7 and R_{1/2} is from Theorem 2.6.

4. Simulation Study

This section demonstrates the performance of the proposed SAGES method on simulated data. The initial focus is on the predictability of the SAGES method compared to two dynamic models that use standard ways to account for non-stationarity of time series. In particular, the dynamic GARCH(1,1) model recalibrates the model parameters at every time point based on the past 250 observations, whereas the dynamic exponential smoothing assigns exponentially decaying weights to the past observations. Here we follow the RiskMetrics recommendation by letting the smoothing parameter be η = 0.946 (corresponding to η_10 in SAGES). The second purpose of the simulation study is to investigate the reaction of SAGES to structural shifts, for which the length of the selected homogeneous interval in the SAGES method is displayed. Last but not least, we check the sensitivity of the parameters selected for implementing the SAGES method in terms of predictability.

The scenarios of the simulations are designed to be close to reality, see Table 2. In the scenario VIX, the volatility process is given by the historical values of the VIX (ranging from January 2, 2004 to September 30, 2008). The stochastic innovations are assumed to be NIG distributed. The NIG parameters φ = 1.6885, β = -0.4994, µ = 0.4591 and δ = 1.4719 are estimated using the standardized S&P 500 index futures returns (returns standardized by the VIX). This scenario illustrates the predictability of the different methods in a real-world setting. In addition, some simple scenarios are considered to show the reaction of SAGES when the market structure shifts, where B/S denotes a big/small change and l/u indicates a downward/upward movement. To be more specific, the volatility process is designed to jump up or down around 16.122%, the average value of the VIX data. The jump size is either big, with two sample standard deviations (SD) from the average, or small, with half an SD. Since the VIX's lowest recorded value since its introduction is 9.48%, which occurred on Dec 24, 1993, we set it as the smallest value in the process generation. Again, the innovations are assumed to be NIG distributed with the same distributional parameters presented above.

Fig. 2. Simulation results for scenario VIX with NIG innovations. The red dash-dot line represents the historical values of the VIX, i.e. the true time-varying parameter. The bold solid line represents the estimates obtained by SAGES, the grey solid line those of the dynamic ES, and the blue dashed line those of the dynamic GARCH(1,1). The selected homogeneous intervals for each time point are presented below.

For every scenario, we generate S = 1000 stochastic processes R_t = √θ_t ε_t for each distributional assumption. The sample size of every stochastic process is T = 1000 (T = 1195 for the scenario VIX). To carry out the adaptive estimation procedure, we use the same parameters as in the computation of the critical values, see Table 1. In particular, the parameters of the adaptive algorithm are r = 0.5, K_ag(u) = {1 − (u − 1/6)_+}_+ and α = 1. The set of continuous aggregation intervals is fixed as follows: {η_k = η_1 a^{k−1}} with a = 1.25 and η_1 = 0.6. We also restrict the largest η_K to be smaller than η∗ = 0.985. The cutting parameter in the exponential smoothing estimation is c = 0.01. The procedure starts from t_0 = 301 to ensure that the largest possible homogeneous interval (K = 15) can be considered. When applied to the NIG distributed process, we set p = 0.5 in the power transform. It is worth mentioning that the set-up of the aggregation intervals and the cutting parameter can be freely selected by users. The parameters of the adaptive procedure can also be selected in a data-driven way by minimizing some objective function, e.g. the forecasting error. The small sensitivity analysis presented later justifies our selection in terms of predictability.


Table 3. Forecast evaluation

Scenarios   RMSFE                             MAFE
            SAGES    GARCH(1,1)   ES          SAGES    GARCH(1,1)   ES
VIX         0.0144   0.0269       0.0248      0.0097   0.0188       0.0167
Bl          0.0064   0.0193       0.0110      0.0044   0.0107       0.0037
Bu          0.0100   0.0317       0.0168      0.0073   0.0189       0.0059
Sl          0.0049   0.0108       0.0043      0.0041   0.0078       0.0019
Su          0.0057   0.0109       0.0040      0.0049   0.0079       0.0018

Figure 2 displays the simulation results for scenario VIX (NIG), i.e. using the historical values of VIX and NIG innovations. Three methods, SAGES, dynamic ES and dynamic GARCH(1,1), are used, and the average values of the one-step-ahead forecasts over the 1000 generated processes are depicted. The SAGES forecasts (solid line) are very close to the historical values of VIX (dash-dot line), i.e. the actual values in the future. The two dynamic but non-adaptive methods (displayed as grey solid line and dashed line), on the other hand, deliver forecasts that deviate from the actual values. The adaptively selected values of k are displayed in the bottom plot. Interestingly, the median of k = 10 coincides with the RiskMetrics recommendation, which is the set-up of the dynamic ES. With the flexibility of varying the smoothing parameter over time, however, SAGES obviously performs better in terms of predictability.

Two criteria are used to measure the accuracy of prediction throughout the rest of this study, the root mean square forecast error (RMSFE) and the mean absolute forecast error (MAFE):

RMSFE = { Σ_{t=t_0}^{T} ( θ̂_t^{1/2} − θ_t^{1/2} )² / (T − t_0 + 1) }^{1/2},

MAFE = Σ_{t=t_0}^{T} | θ̂_t^{1/2} − θ_t^{1/2} | / (T − t_0 + 1).
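In code, the two criteria can be computed as follows (a small numpy sketch; using the zero-based index 300 for the paper's starting point t_0 = 301 is our convention).

    import numpy as np

    def forecast_errors(theta_hat, theta_true, t0=300):
        """RMSFE and MAFE of one-step-ahead volatility forecasts theta_hat."""
        e = np.sqrt(theta_hat[t0:]) - np.sqrt(theta_true[t0:])   # errors on the scale theta^(1/2)
        rmsfe = float(np.sqrt(np.mean(e ** 2)))
        mafe = float(np.mean(np.abs(e)))
        return rmsfe, mafe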

The two criteria further support the observation that the proposed SAGES improves the forecasting accuracy, at least for short-term prediction, compared to the standard approaches. The RMSFEs, for example, are 0.0144 (SAGES), 0.0269 (dynamic GARCH) and 0.0248 (ES), which indicates an almost twofold improvement in predictability by using SAGES. Table 3 reports the RMSFE and MAFE of the 1-day-ahead forecasts of SAGES, the dynamic GARCH(1,1) and the dynamic ES for the various scenarios. The results show that SAGES outperforms the dynamic GARCH(1,1) in all scenarios. On the other hand, the non-adaptive exponential smoothing sometimes delivers more accurate predictions than SAGES. This "imperfection" of SAGES can be explained by the design of the scenarios. As an example, we discuss the scenario Bl, where the MAFE ratio is 1.1892, indicating that the exponential smoothing is almost 19% more accurate than SAGES. This scenario, like the other simple scenarios, involves two jumps while all the remaining data follow a stationary process, see Figure 3. It is obvious that SAGES reacts very fast to both changes. In particular, it needs 26 steps to reach the new value at t = 301 and 7 steps at t = 700. The exponential smoothing, on the other hand, needs 76 and 64 steps respectively. Nevertheless, the adaptive method has to pay for this flexibility: it may over-fit and wrongly interpret a variation induced by the innovations as a structural shift. The exponential smoothing avoids this problem by always using a large smoothing parameter over time.


[Figure 3 about here: top panel "Volatility estimates" (θ_t^{1/2} against time t), bottom panel "Length of the homogeneous intervals".]

Fig. 3. Simulation results for scenario Bl with NIG innovations. The red dash-dot line represents the true time-varying parameter. The bold solid line represents the estimates obtained with SAGES, the grey solid line those of the dynamic ES (η = 0.946) and the blue dashed line those of the dynamic GARCH(1,1). The selected homogeneous intervals for each time point are presented in the bottom panel.

As a result, the latter delivers smooth forecasts. Being computed over the whole prediction period, the criteria unfortunately cannot provide information for each individual time point. Besides the fast reaction to structural changes, SAGES delivers a quite reasonable selection of the homogeneous intervals, i.e. a short interval right after a change and a long interval thereafter. The results for the other scenarios are similar; we therefore omit the discussion and figures.

In addition, we justify the default choice of the parameters, i.e. r = 0.5, α = 1.0, a = 1.25, c = 0.01 and p = 0.5, in the adaptive procedure by a sensitivity analysis. It is necessary to investigate the effect of the "hyperparameters" r and ρ (used in the adaptive procedure) and of the parameters a and c (used for constructing the set of homogeneous intervals). Table 4 reports the forecast evaluation criteria, RMSFE and MAFE, of the 1-day-ahead forecasts using SAGES for the scenario VIX. For the default choice with r = 0.5, ρ = 1, a = 1.25 and c = 0.01 we have RMSFE = 0.0144 and MAFE = 0.0097. Every parameter is varied around its default value, while the other parameters are kept fixed. It is worth mentioning that the critical values change with the new parameter values and need to be recalculated. In terms of predictability, SAGES appears robust to the variation of the parameters, with predictability similar to that of the default choice. The comparison of the forecast evaluation criteria justifies the default choice of r and ρ in the SAGES approach, which gives the most accurate predictions among the listed alternatives.

We also measure the sensitivity of the adaptive method to the parameter p over a range from 0.1 to 1.


Table 4. Sensitivity analysis of r, ρ, a, c

r (def. 0.5)    RMSFE    MAFE       ρ (def. 1)     RMSFE    MAFE
0.3             0.0146   0.0101     0.5            0.0152   0.0107
0.7             0.0152   0.0108     0.7            0.0151   0.0106
1.0             0.0153   0.0109     1.5            0.0148   0.0103

a (def. 1.25)   RMSFE    MAFE       c (def. 0.01)  RMSFE    MAFE
1.1             0.0145   0.0095     0.005          0.0147   0.0098
1.5             0.0151   0.0107     0.02           0.0151   0.0106

Table 5. Sensitivity analysis of p

                CV Gaussian           CV NIG
p (def. 0.5)    RMSFE    MAFE         RMSFE    MAFE
0.1             0.1980   0.1871       0.1835   0.1734
0.2             0.0464   0.0420       0.0412   0.0383
0.3             0.0200   0.0158       0.0175   0.0138
0.4             0.0149   0.0104       0.0137   0.0092
0.6             0.0144   0.0098       0.0138   0.0094
0.7             0.0148   0.0102       0.0146   0.0101
0.8             0.0157   0.0112       0.0154   0.0111
0.9             0.0172   0.0128       0.0172   0.0128
1.0             0.0193   0.0150       0.0193   0.0150

The SAGES forecasts are calculated based on either the Gaussian-based critical values (after the power transformation) or the NIG-based critical values (i.e. critical values obtained by Monte Carlo simulation with the NIG distributional parameters assumed to be known). Table 5 reports the forecast evaluation criteria of the 1-day-ahead forecasts using SAGES for the scenario VIX. The default choice with p = 0.5 yields RMSFE = 0.0144 and MAFE = 0.0097. As expected, the Gaussian-based critical values work well in terms of predictability, although using the NIG-based critical values can slightly improve the accuracy of prediction. Moreover, the best accuracy of prediction is reached for the default choice p = 0.5, no matter which kind of critical values is considered. We therefore suggest using the transformation technique and the Gaussian-based critical values when estimating volatility in practice.

5. Application to Risk Analysis

We now turn to the empirical application of the proposed adaptive SAGES method in risk management. A sound risk management model is of great importance in the financial markets. It enables the identification and analysis of risks and guides financial institutions in decision-making under uncertainty. Classic risk management methods such as RiskMetrics are based on unrealistic assumptions and may deliver inaccurate risk measurements when these assumptions are violated, i.e. when 1) financial time series are no longer stationary and 2) returns are clearly not normally distributed. For example, as the global financial crisis unfolded from late July 2007 on, it was questioned whether the risk models preferred by the market are really adequate to measure the real risks. In this section, we propose and implement a new adaptive risk model based on the proposed SAGES method with the NIG distributional assumption. The performance of the adaptive model is investigated in a set of real-data analyses. Compared to two alternative models based on the dynamic exponential smoothing/GARCH(1,1) approach and the normality assumption, the adaptive model shows superior performance.

5.1. Data description

Our empirical analysis is based on a broad range of data: two market summary indices (the S&P 500 index from 1950/01/03 to 2009/01/20 and the Hong Kong Hang Seng index (HSI) from 1987/01/03 to 2009/01/20, both from Yahoo! Finance), the 10-year treasury note (T-note) from 1962/01/03 to 2009/01/20 from Yahoo! Finance, and the EUR exchange rate from 2000/01/03 to 2009/01/12 from the Federal Reserve Statistical Release. We first remove the serial correlation from the log returns of the four financial instruments whenever the Ljung-Box test of no serial correlation is rejected at the 5% significance level. The linear filters used are collected in the Appendix (Section 6). The descriptive statistics of the filtered return series (denoted by R_t) are reported in Table 6. In summary, the empirical characteristics of the series are in line with the well-known stylized facts reported in the financial econometrics literature. For example, the series display heavy-tailed distributional behavior: the Kolmogorov-Smirnov (KS) test of a standard normal distribution is rejected at the 5% significance level.
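A hedged sketch of this preprocessing step is given below; the statsmodels-based test and the ARMA(4,4) order are placeholders chosen for illustration, while the filters actually fitted to the four series are listed in the Appendix.

    import numpy as np
    from statsmodels.stats.diagnostic import acorr_ljungbox
    from statsmodels.tsa.arima.model import ARIMA

    def prefilter(log_returns, lags=20, order=(4, 0, 4)):
        """Remove serial correlation from log returns when Ljung-Box rejects."""
        lb = acorr_ljungbox(log_returns, lags=[lags])
        if lb["lb_pvalue"].iloc[0] < 0.05:      # reject "no serial correlation up to lag 20"
            fit = ARIMA(log_returns, order=order).fit()
            return np.asarray(fit.resid)        # filtered return series R_t
        return np.asarray(log_returns)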

The volatility of the filtered returns is estimated by the SAGES method and by the two dynamic approaches. In the adaptive risk model the SAGES estimator of the volatility θ_t is computed for each time point, and the residuals are assumed to follow the NIG distribution. The default values of the parameters in SAGES are used: r = 0.5, α = 1 and K_ag(u) = {1 − (u − 1/6)_+}_+. Again, we use the same set of intervals (a = 1.25, η_1 = 0.6 and c = 0.01) as in the former section, see Table 1. For the two alternative risk models, we compute the volatility either by exponential smoothing with the "optimal" smoothing parameter η = 0.946 or by fitting a GARCH(1,1) model over a rolling window of fixed length (the past 250 observations). The residuals are assumed to be normally distributed, as is common practice at many financial institutions. The residuals ε_t are calculated from the conditional heteroscedasticity relation

ε_t = R_t / √θ̂_t.

The descriptive statistics of the resulting residual series are also presented in Table 6. They show that all the series are standardized, with mean values close to 0 and variances close to 1. Most residual series have leptokurtic distributions, as indicated by their sample kurtoses. The leptokurtosis is also indicated by the rejection of the Kolmogorov-Smirnov test of normality. The fitted values of the NIG parameters used in the SAGES method are reported in Table 7.
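Such NIG parameters can be fitted to the standardized residuals along the following lines; reading scipy's norminvgauss(a, b, loc, scale) output as α = a/scale, β = b/scale, µ = loc, δ = scale is our assumption about that parametrization.

    from scipy import stats

    def fit_nig(residuals):
        """Fit NIG parameters (alpha, beta, mu, delta) to standardized residuals."""
        a, b, loc, scale = stats.norminvgauss.fit(residuals)
        # convert scipy's (a, b, loc, scale) to the (alpha, beta, mu, delta) notation
        return a / scale, b / scale, loc, scale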

5.2. Risk measure

Two popular risk measures at probability level pr, Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR), are used to evaluate the three risk models. The two risk measures are widely used in risk measurement, risk control and the computation of regulatory capital. VaR refers to the frequency with which losses occur, while CVaR, also called expected shortfall, measures the size of the loss once the loss exceeds the VaR value. The two risk measures are defined as follows:

VaR_{t,pr} = −quantile_pr(R_t) = −√θ̂_t · quantile_pr(ε_t),

CVaR_{t,pr} = IE{ −R_t | −R_t > VaR_{t,pr} }.
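For illustration, the sketch below computes the daily VaR from an NIG quantile of the residuals and the empirical CVaR over the exceedance days; the function names and the scipy parametrization mapping are our own choices.

    import numpy as np
    from scipy import stats

    def var_cvar(returns, theta_hat, nig_params, pr=0.01):
        """Daily VaR_{t,pr} and the empirical CVaR of the exceedance days."""
        alpha, beta, mu, delta = nig_params
        q = stats.norminvgauss.ppf(pr, a=alpha * delta, b=beta * delta,
                                   loc=mu, scale=delta)    # pr-quantile of eps_t
        var = -np.sqrt(np.asarray(theta_hat)) * q           # VaR_{t,pr} for every day t
        losses = -np.asarray(returns)
        exceed = losses > var
        cvar = float(losses[exceed].mean()) if exceed.any() else np.nan
        return var, cvar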


Table 6. Descriptive statistics of the standardized returns (∗ indicates 5% significance)

data              mean       s.d.       skewness    kurtosis    Ljung-Box(20)    KS

S&P500 -0.0000 0.0095 -1.1648 33.9089 31.3059 0.4830 ∗

εt (SAGES ) -0.0026 1.3979 -0.5672 12.4157 100.9510 ∗ 0.0398 ∗

εt (ES) -0.0032 1.0547 -0.5359 7.6162 129.1130 ∗ 0.0280 ∗

εt (GARCH) -0.0025 1.0412 -0.5080 7.6931 135.1386 ∗ 0.0321 ∗

T-note 0.0000 0.0096 0.0013 10.0607 28.7580 0.4823 ∗

εt (SAGES ) 0.0166 1.7041 -5.1646 201.9300 67.2728 ∗ 0.0423 ∗

εt (ES) 0.0204 1.0339 -0.4316 4.5777 23.7897 0.0486 ∗

εt (GARCH) 0.0175 1.0290 0.0272 5.7276 84.9526 ∗ 0.0362 ∗

EUR 0.0000 0.0064 0.0267 4.8639 28.5050 0.4902 ∗

εt (SAGES ) 0.0644 1.4382 0.9507 15.6825 20.3959 0.0502 ∗

εt (ES) 0.0262 1.0383 -0.1208 3.4982 21.4819 0.0328 ∗

εt (GARCH) 0.0221 1.0057 -0.1053 3.4522 24.8326 0.0397 ∗

HSI -0.0000 0.0181 -2.5743 61.9652 30.9487 0.4739 ∗

εt (SAGES ) 0.0023 1.3946 -0.7616 13.2821 60.4526 ∗ 0.0395 ∗

εt (ES) 0.0109 1.0696 -0.7983 10.0774 57.1703 ∗ 0.0421 ∗

εt (GARCH) 0.0109 1.0716 -1.1443 14.7564 62.5271 ∗ 0.0468 ∗

According to the 1997 modification of the Basel market risk framework, financial institutions are allowed to calculate VaR with their own internal risk models. The models are then validated by a backtesting procedure, in which the failure rate is defined as the proportion of times the VaR is exceeded in a given sample, see Kupiec (1995). Under an accurate and effective risk model, the empirical failure rate should be close to the expected risk level pr. The null hypothesis is formulated as:

H0 : IE[N ] = Tpr,

where N denotes the number of failures with respect to the model-based VaR. Under the null hypothesis, N is a binomial random variable with parameters T and pr; hence the likelihood ratio test statistic is

LR = −2 log{ (1 − pr)^{T−N} pr^N } + 2 log{ (1 − N/T)^{T−N} (N/T)^N },

which is χ²(1) distributed, see Jorion (2001).

In our study, we evaluate the performance of the proposed methods in risk management at several extreme probability levels: 1%, 0.5% and 0.1%. It is worth mentioning that these probabilities are often considered for regulatory requirements and internal supervision. Table 8 reports the backtesting results based on the adaptive risk model (SAGES-NIG) and the dynamic but non-adaptive models (ES-N and GARCH-N). As expected, the dynamic but non-adaptive risk models underestimate the risk level, with, e.g., large failure rates compared to the nominal risk level, and hence are not validated. For the S&P 500 data, as an example, the empirical failure rate of the dynamic but non-adaptive models is roughly 2 to 5 times the expected value. One possible reason for this low accuracy is that these models react relatively slowly to possible structural shifts in the markets, especially during the financial crisis period. It is observed that these models give small values of VaR. However, a small value of VaR does not necessarily indicate a low capital reserve for regulatory requirements if the model is not validated; using such models may instead be penalized by the regulator.
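A minimal Python sketch of this likelihood-ratio backtest is given below; the treatment of the degenerate cases N = 0 and N = T is our own choice.

    import numpy as np
    from scipy import stats

    def kupiec_test(exceedances, pr):
        """Kupiec (1995) unconditional coverage test for VaR exceedances."""
        T, N = len(exceedances), int(np.sum(exceedances))
        if N in (0, T):                      # degenerate samples need separate treatment
            return np.nan, np.nan
        phat = N / T
        loglik0 = (T - N) * np.log(1 - pr) + N * np.log(pr)      # under H0: rate = pr
        loglik1 = (T - N) * np.log(1 - phat) + N * np.log(phat)  # under the empirical rate
        lr = -2.0 * (loglik0 - loglik1)
        return lr, float(stats.chi2.sf(lr, df=1))                # LR and chi2(1) p-value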


Table 7. The estimated NIG parameters of the residuals by SAGES

data       α         β          µ          δ
S&P500     0.4946    -0.0540    0.0720     0.6790
T-note     0.1098    -0.0256    0.0579     0.1720
EUR        0.4378    0.0709     -0.0349    0.6050
HSI        0.4855    -0.0692    0.0969     0.6565

Table 8. Risk analysis (∗ indicates 5% significance for LR backtesting)

                  100 ∗ (pr = 1%)            100 ∗ (pr = 0.5%)          100 ∗ (pr = 0.1%)
Data/model        pr      VaR     CVaR       pr      VaR     CVaR       pr      VaR     CVaR

S&P500
SAGES-NIG         1.13    2.95    2.14       0.56    3.71    2.48       0.16∗   5.66    2.52
ES-N              2.02    1.92    2.20       1.29    2.13    2.45       0.57∗   2.56    3.08
GARCH-N           1.90    1.94    2.33       1.30    2.14    2.54       0.58∗   2.57    3.24

T-note
SAGES-NIG         1.33    3.02    1.62       0.37∗   4.81    1.40       0.05    11.07   0.35
ES-N              1.79    1.95    2.24       1.30    2.15    2.38       0.65∗   2.58    2.59
GARCH-N           1.62    2.00    2.33       1.16    2.22    2.42       0.59∗   2.66    2.58

HSI
SAGES-NIG         1.10    5.39    3.96       0.58    6.81    4.88       0.12    10.49   9.05
ES-N              1.84∗   3.46    4.77       1.39∗   3.83    5.29       0.72∗   4.60    6.19
GARCH-N           1.82∗   3.48    4.90       1.36∗   3.85    5.19       0.70∗   4.62    6.01

USDEUR
SAGES-NIG         1.47∗   1.77    1.32       0.61    2.22    1.34       0.15    3.40    1.32
ES-N              1.63∗   1.35    1.46       0.81    1.50    1.58       0.31∗   1.80    1.71
GARCH-N           1.73∗   1.39    1.57       0.71    1.54    1.74       0.25    1.84    1.74

but non-adaptive models generate relatively large values of CVaR, which implies that investors suffer large losses once the VaR level is exceeded. The adaptive model, on the contrary, delivers not only accurate risk levels but also small values of CVaR. The numerical results show that the proposed method performs quite well in risk management.

6. Appendix

This section collects the proofs of the main results in the paper.

Proof of Theorem 2.2

For brevity of notation we omit the subscript t. For L(W, θ, θ∗) = L(W, θ) − L(W, θ∗) it holds that

2L(W, θ, θ∗) = −N log(θ/θ∗) − (1/θ − 1/θ∗) Σ_s Y_s w_s.


Under the measure IP_{θ∗}, the squared returns Y_s can be represented as Y_s = θ∗ ε_s², leading to the formula

2L(W, θ, θ∗) = N log(θ∗/θ) − (θ∗/θ − 1) Σ_s ε_s² w_s
             = N log(1 + u) − u Σ_s ε_s² w_s = N log(1 + u) − Nu − u Σ_s (ε_s² − 1) w_s

with u = θ∗/θ − 1. For any µ such that max_s uµw_s ≤ λ, this yields, by independence of the ε_s's,

log IE_{θ∗} exp{ 2µL(W, θ, θ∗) } = µN log(1 + u) − µNu + Σ_s log IE_{θ∗} exp{ −uµw_s (ε_s² − 1) }
                                  = µN log(1 + u) − µNu + Σ_s κ(−uµw_s).

It is easy to see that the condition (Θ) implies κ(−uµw_s) ≤ κ_0 u²µ²w_s² ≤ κ_0 u²µ²w_s for some κ_0 > 0. This yields

log IE_{θ∗} exp{ 2µL(W, θ, θ∗) } ≤ µN log(1 + u) − µNu + Σ_s κ_0 u²µ²w_s = µN { log(1 + u) − u + κ_0 µu² }.

The condition (Θ) ensures that u = u(θ) = θ∗/θ − 1 is bounded by some constant u∗ for all θ ∈ Θ. The expression log(1 + u) − u + κ_0 µu² is non-positive for all |u| ≤ u∗ and sufficiently small µ, yielding (5).

Lemma 6.1 from Polzehl and Spokoiny (2006) implies that

{NK(θ, θ∗) > z} ⊆ {NK(θ−, θ∗) > z} ∪ {NK(θ+, θ∗) > z}

for some fixed points θ+, θ− depending on z . This and (5) prove (6).

Proof of Theorem 2.3

By Theorem 2.2

IE_{θ∗} |L(W_t, θ̃_t, θ∗)|^r ≤ −∫_{z≥0} z^r dIP_{θ∗}( L(W_t, θ̃_t, θ∗) > z )
    ≤ r ∫_{z≥0} z^{r−1} IP_{θ∗}( L(W_t, θ̃_t, θ∗) > z ) dz
    ≤ 2r ∫_{z≥0} z^{r−1} e^{−µ_0 z} dz = 2r µ_0^{−r} ∫_{z≥0} z^{r−1} e^{−z} dz

and the first assertion is fulfilled. The last assertion is proved similarly.


Proof of Lemma 2.4

Under IP_{θ∗} the squared returns fulfill Y_t = θ∗ ε_t² and, for every k, the estimate θ̃^{(k)}_t can be represented as

θ̃^{(k)}_t = N_k^{−1} Σ_s Y_s w^{(k)}_{st} = θ∗ N_k^{−1} Σ_s ε_s² w^{(k)}_{st},

so that θ̃^{(k)}_t is θ∗ times the estimate computed for θ∗ = 1. The same applies, by a simple induction argument, to the aggregated estimate θ̂^{(k−1)}_t. It remains to note that the Kullback-Leibler divergence K(θ̃^{(k)}_t, θ̂^{(k−1)}_t) is a function of the ratio θ̃^{(k)}_t / θ̂^{(k−1)}_t, in which θ∗ cancels.

Proof of Theorem 2.5

For notational simplicity we omit the subindex t. The proof utilizes the following simple "metric-like" property of K^{1/2}(·,·), shown in Polzehl and Spokoiny (2005), Lemma 5.2: for any θ_0, θ_1, θ_2

K^{1/2}(θ_1, θ_2) ≤ a { K^{1/2}(θ_1, θ_0) + K^{1/2}(θ_2, θ_0) }.    (21)

With the given constants z_k, define for k > 1 the random sets

A_k = { N_k K(θ̃^{(k)}, θ̃^{(k−1)}) ≤ b z_k },    A^{(k)} = A_2 ∩ … ∩ A_k,

where b enters the construction of K_ag: K_ag(t) = 1 for t ≤ b.

Note first that θ̂^{(k)} = θ̃^{(k)} on A^{(k)} for all k ≤ K. This fact can be proved by induction in k. For k = 1, the assertion is trivial because θ̂^{(1)} = θ̃^{(1)}. Now suppose that θ̂^{(k−1)} = θ̃^{(k−1)}. Then it holds on A_k that m^{(k)} = N_k K(θ̃^{(k)}, θ̂^{(k−1)}) = N_k K(θ̃^{(k)}, θ̃^{(k−1)}) ≤ b z_k and thus γ_k = K_ag(m^{(k)}/z_k) ≥ K_ag(b) = 1, yielding θ̂^{(k)} = θ̃^{(k)}. Therefore, it remains to bound the risk of θ̂^{(k)} on the complement Ā^{(k)} of A^{(k)}. Define B_k = A^{(k−1)} \ A^{(k)}. On the event B_k, the index k is the first one for which the condition N_k K(θ̃^{(k)}, θ̃^{(k−1)}) ≤ b z_k is violated. It is obvious that Ā^{(k)} = ⋃_{l<k} B_l. First we bound the probability IP_{θ∗}(B_l). Applying assumption A3 and (21) yields for every l

N_l K(θ̃^{(l)}, θ̃^{(l−1)}) ≤ 2a² N_l { K(θ̃^{(l)}, θ∗) + K(θ̃^{(l−1)}, θ∗) }
                           ≤ 2a² { N_l K(θ̃^{(l)}, θ∗) + u_0^{−1} N_{l−1} K(θ̃^{(l−1)}, θ∗) }.

Therefore, by Theorem 2.2,

IP_{θ∗}(B_l) ≤ IP_{θ∗}( N_l K(θ̃^{(l)}, θ̃^{(l−1)}) > b z_l ) ≤ 2 exp( −u_0 b z_l / (4a²) ).

On the set B_l, it holds θ̂^{(l−1)} = θ̃^{(l−1)} and thus, for every k > l, the aggregated estimate θ̂^{(k)} by construction is a convex combination of θ̃^{(l−1)}, …, θ̃^{(k)}. Convexity of the Kullback-Leibler divergence w.r.t. the second argument, the definition of θ̂^{(k)} and the lemma, see (21), ensure that

K^{1/2}(θ̃^{(k)}, θ̂^{(k)}) 1(B_l) ≤ max_{l′=l−1,…,k−1} K^{1/2}(θ̃^{(k)}, θ̃^{(l′)})
    ≤ a max_{l′=l−1,…,k−1} { K^{1/2}(θ̃^{(k)}, θ∗) + K^{1/2}(θ̃^{(l′)}, θ∗) }
    ≤ 2a max_{l′=l−1,…,k} K^{1/2}(θ̃^{(l′)}, θ∗).

This and (3) imply for every r

IE_{θ∗} K^r(θ̃^{(k)}, θ̂^{(k)}) 1(B_l) ≤ (2a)^{2r} IE_{θ∗} Σ_{l′=l−1}^{k} K^r(θ̃^{(l′)}, θ∗) 1(B_l)
    ≤ (2a)^{2r} Σ_{l′=l−1}^{k} IE_{θ∗}^{1/2} K^{2r}(θ̃^{(l′)}, θ∗) · IP_{θ∗}^{1/2}(B_l)
    ≤ (2a)^{2r} r_r^{1/2} Σ_{l′=l−1}^{k} N_{l′}^{−r} · 2 exp( −u_0 b z_l / (8a²) )
    ≤ C_1 N_l^{−r} r_r^{1/2} exp( −c_2 z_l )

for some fixed constants C_1 and c_2. Therefore,

IE_{θ∗} K^r(θ̃^{(k)}, θ̂^{(k)}) ≤ Σ_{l=2}^{k} IE_{θ∗} K^r(θ̃^{(k)}, θ̂^{(k)}) 1(B_l) ≤ Σ_{l=2}^{k} C_1 N_l^{−r} r_r^{1/2} exp( −c_2 z_l ).

It remains to check that the choice z_k = a_0 + a_1 log ρ^{−1} + a_2 r log(N_K/N_k), with properly selected a_0, a_1 and a_2, provides the required bound IE_{θ∗} | N_k K(θ̃^{(k)}, θ̂^{(k)}) |^r ≤ ρ r_r.

Proof of Theorem 2.6

The proof is based on the following general result.

Lemma 6.1. Let IP and IP_0 be two measures such that the Kullback-Leibler divergence IE log(dIP/dIP_0) satisfies

IE log(dIP/dIP_0) ≤ Δ < ∞.

Then for any random variable ζ with IE_0 ζ < ∞

IE log(1 + ζ) ≤ Δ + IE_0 ζ.

Proof. By simple algebra one can check that, for any fixed y, the maximum of the function f(x) = xy − x log x + x is attained at x = e^y, leading to the inequality xy ≤ x log x − x + e^y. Using this inequality and the representation IE log(1 + ζ) = IE_0{ Z log(1 + ζ) } with Z = dIP/dIP_0, we obtain

IE log(1 + ζ) = IE_0{ Z log(1 + ζ) }
             ≤ IE_0( Z log Z − Z ) + IE_0(1 + ζ)
             = IE_0( Z log Z ) + IE_0 ζ − IE_0 Z + 1.

It remains to note that IE_0 Z = 1 and IE_0( Z log Z ) = IE log Z.

The first assertion of the theorem is just a combination of this result and the condition (12). The second follows in a similar way from Theorem 2.2.

Proof of Theorem 2.7

By convexity of the Kullback-Leibler divergence K(u, v) w.r.t. the first argument,

K(θ̂^{(k)}_t, θ̂^{(k−1)}_t) ≤ γ_k K(θ̃^{(k)}_t, θ̂^{(k−1)}_t).

If K(θ̃^{(k)}_t, θ̂^{(k−1)}_t) ≥ z_k/N_k, then γ_k = 0 and (16) follows. Now, (MD) and (21) yield

K^{1/2}(θ̂^{(k′)}_t, θ̂^{(k)}_t) ≤ a Σ_{l=k+1}^{k′} K^{1/2}(θ̂^{(l)}_t, θ̂^{(l−1)}_t) ≤ a Σ_{l=k+1}^{k′} (z_l/N_l)^{1/2}.

The use of (21) leads to the bound

K^{1/2}(θ̂^{(k′)}_t, θ̂^{(k)}_t) ≤ a (z_k/N_k)^{1/2} Σ_{l=k+1}^{k′} u^{(l−k)/2} ≤ a √u (1 − √u)^{−1} (z_k/N_k)^{1/2},

which proves (17).

Proof of Theorem 2.8

Similarly to the proof of Theorem 2.7,

K^{1/2}(θ̂_t, θ) ≤ a K^{1/2}(θ̃^{(k◦)}_t, θ) + a K^{1/2}(θ̃^{(k◦)}_t, θ̂^{(k◦)}_t) + a Σ_{l=k◦+1}^{k} K^{1/2}(θ̂^{(l)}_t, θ̂^{(l−1)}_t)
    ≤ a K^{1/2}(θ̃^{(k◦)}_t, θ) + a K^{1/2}(θ̃^{(k◦)}_t, θ̂^{(k◦)}_t) + a c_u √(z_{k◦}/N_{k◦}).


This and the elementary inequality log(1 + a + b) ≤ log(1 + a) + log(1 + b) for a, b ≥ 0 imply, similarly to Theorem 2.6, that

IE_{f(·)} log( 1 + N_{k◦}^{1/2} K^{1/2}(θ̂_t, θ) / (a R_{1/2}) )
    ≤ log( 1 + c_u √z_{k◦} / R_{1/2} ) + IE log( 1 + { N_{k◦}^{1/2} K^{1/2}(θ̃^{(k◦)}_t, θ̂^{(k◦)}_t) + N_{k◦}^{1/2} K^{1/2}(θ̃^{(k◦)}_t, θ) } / R_{1/2} )
    ≤ log( 1 + c_u √z_{k◦} / R_{1/2} ) + Δ + α + 1

as required.

Linear filters for the data

The return series reject the null of no serial correlation up to lag 20 in the Ljung-Box test at the 5% significance level. Linear filters are used to remove the serial correlation. The fitted filters are as follows:

• S&P 500: Rt = 0.0002 − 0.3982Rt−1 + 0.9649Rt−2 + 0.1384Rt−3 − 0.3202Rt−4 + εt + 0.4493εt−1 − 1.0000εt−2 − 0.1982εt−3 + 0.3513εt−4 .

• DAX: Rt = −0.0002− 0.0384Rt−1 − 0.0127Rt−2 − 0.0352Rt−3 + εt .

• T-note: Rt = −0.1472Rt−1 + 0.6964Rt−2 − 0.4109Rt−3 − 0.0097Rt−4 + 0.5919Rt−5 + εt + 0.2318εt−1 − 0.6831εt−2 + 0.3597εt−3 + 0.0079εt−4 − 0.5799εt−5 .

• HSI: Rt = 0.0004 + 0.1592Rt−1 − 0.6664Rt−2 + 0.0716Rt−3 + εt − 0.1382εt−1 +0.6606εt−2 .

• USDEUR: Rt = 0.0001 + 0.0196Rt−1 − 0.0196Rt−2 + 0.0038Rt−3 + 0.0814Rt−4 .

References

Baillie, R. T., Bollerslev, T., and Mikkelsen, H. O. (1996). Fractionally integrated generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 74:3–30.

Barndorff-Nielsen, O. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics, 24:1–13.

Belomestny, D. and Spokoiny, V. (2007). Spatial aggregation of local likelihood estimates with applications to classification. The Annals of Statistics, 35:2287–2311.

Bollerslev, T., Chou, R. Y., and Kroner, K. F. (1992). ARCH modeling in finance. Journal of Econometrics, pages 5–59.

Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), pages 211–252.


Chen, Y., Härdle, W., and Jeong, S. (2008). Nonparametric risk management with generalized hyperbolic distributions. Journal of the American Statistical Association, 14:910–923.

Chen, Y., Härdle, W., and Spokoiny, V. (2010). GHICA – risk analysis with GH distributions and independent components. Journal of Empirical Finance, 17:255–269.

Cheng, M., Fan, J., and Spokoiny, V. (2003). Dynamic nonparametric filtering with application to volatility estimation. In Akritas, M. and Politis, D., editors, Recent advances and trends in nonparametric statistics, pages 315–334. Elsevier.

Christoffersen, P. (2003). Elements of Financial Risk Management. Academic Press.

Diebold, F. X. and Inoue, A. (2001). Long memory and regime switching. Journal of Econometrics, 105:131–159.

Eberlein, E. and Keller, U. (1995). Hyperbolic distributions in finance. Bernoulli, 1:281–299.

Embrechts, P., McNeil, A., and Straumann, D. (2002). Correlation and dependence in risk management: properties and pitfalls. In Dempster, M., editor, Risk Management: Value at Risk and Beyond. Cambridge University Press.

Engle, R. F. and Bollerslev, T. (1986). Modeling the persistence of conditional variances. Econometric Reviews, pages 1–50.

Fan, J. and Gu, J. (2003). Semiparametric estimation of value-at-risk. Econometrics Journal, 6:261–290.

Granger, C. W. and Joyeux, R. (1980). An introduction to long memory time series models and fractional differencing. Journal of Time Series Analysis, 1:5–39.

Granger, C. W. J. (1980). Long memory relationships and the aggregation of dynamic models. Journal of Econometrics, 14:227–238.

Granger, C. W. J. and Hyung, N. (2004). Occasional structural breaks and long memory with an application to the S&P 500 absolute stock returns. Journal of Empirical Finance, 11:399–421.

Hosking, J. R. M. (1981). Fractional differencing. Biometrika, 68:165–176.

Jorion, P. (2001). Value at Risk. McGraw-Hill.

Kupiec, P. (1995). Techniques for verifying the accuracy of risk measurement models. Journal of Derivatives, pages 73–84.

Mercurio, D. and Spokoiny, V. (2004). Statistical inference for time inhomogeneous volatility models. The Annals of Statistics, 32:577–602.

Mikosch, T. and Starica, C. (2004). Non-stationarities in financial time series, the long range dependence and the IGARCH effects. Review of Economics and Statistics, 86:378–390.

Polzehl, J. and Spokoiny, V. (2006). Propagation-separation approach for local likelihood estimation. Probability Theory and Related Fields, pages 335–362.

Spokoiny, V. (2010). Local parametric methods in nonparametric estimation. Springer-Verlag Berlin Heidelberg New York.