Nonparametric Regression as an Example of Model Choice


Communications in Statistics—Simulation and Computation®, 37: 274–289, 2008
Copyright © Taylor & Francis Group, LLC
ISSN: 0361-0918 print / 1532-4141 online
DOI: 10.1080/03610910701790285

Regression Analysis

LAURIE DAVIES¹,², URSULA GATHER³, AND HENRIKE WEINERT³

¹Department of Mathematics, University Duisburg-Essen, Essen, Germany
²Department of Mathematics, Tech. Univ. Eindhoven, 5600 MB Eindhoven, The Netherlands
³Department of Statistics, University of Dortmund, Dortmund, Germany

Nonparametric regression can be considered as a problem of model choice. In this article, we present the results of a simulation study in which several nonparametric regression techniques including wavelets and kernel methods are compared with respect to their behavior on different test beds. We also include the taut-string method whose aim is not to minimize the distance of an estimator to some "true" generating function f but to provide a simple adequate approximation to the data. Test beds are situations where a "true" generating f exists and in this situation it is possible to compare the estimates of f with f itself. The measures of performance we use are the $L_2$- and $L_\infty$-norms and the ability to identify peaks.

Keywords Nonparametric regression; Model choice; Approximation.

Mathematics Subject Classification 62G08; 62G05.

1. Introduction

Consider paired data $\mathcal{Y}_n = \{(t_i, y(t_i))\}_{i=1}^n$ where the design points are ordered $0 \le t_1 < \cdots < t_n \le 1$ but not necessarily equidistant. The problem is to use the data to derive a function $f_n$ which can be regarded as an adequate denoised representation of the data. The model we assume for the data is
$$ Y(t_i) = f(t_i) + \sigma\,\varepsilon(t_i), \qquad i = 1, \ldots, n, \qquad (1) $$
which represents a signal $f$ corrupted by noise $\varepsilon$ which we take to be standard Gaussian white noise. Given the data $\mathcal{Y}_n$ and the model (1), the problem of specifying a function $f_n$ based on the data $\mathcal{Y}_n$ becomes a problem of model choice


(where $\sigma$ is treated as a nuisance parameter). The assumption that $\varepsilon(t_i)$, $i = 1, \ldots, n$, is Gaussian is common throughout the literature on nonparametric regression, although heavy-tailed noise such as Cauchy noise could be present in practice, for example in financial data. As we compare methods which have all been suggested under the assumption of Gaussian errors, we adopt this assumption in this article as well but comment briefly on non Gaussian errors in Secs. 4 and 5, for which we have also produced some results. However, if one expects the errors to come from some non Gaussian distribution, simple transformations can help, depending on the type of distribution assumed. All methods can then be applied to the transformed data to improve the results in this situation.

Traditional model selection operates within model (1) by assuming that the data $\mathcal{Y}_n$ were generated by this model. In the context of nonparametric regression the problem of model choice becomes: estimate $f$ by a function $f_n^* \in \mathcal{F}$ that minimizes an expected distance or risk,
$$ E\,d(f, f_n^*) = \inf_{f_n \in \mathcal{F}} E\,d(f, f_n), \qquad (2) $$
where $\mathcal{F}$ is some specified class of functions and $d(\cdot,\cdot)$ is an appropriate loss function. In addition, some model selection rules require the optimization in (2) to be conducted under constraints, whereby some measure of the complexity of a model is included in the term to be minimized. In Sec. 2, we briefly review some well-known methods for signal approximation including wavelet regression (Donoho and Johnstone, 1994), kernel estimation methods (Herrmann, 1997; Polzehl and Spokoiny, 2000, 2003), and minimum description length (MDL) denoising (Rissanen, 2000).

Another approach to nonparametric regression is based on the concept of data approximation described by Davies (1995, 2003). Although the concept makes use of properties of the model, it does not operate solely within it but asks whether the model can be regarded as an adequate approximation to the data. Risk minimization such as (2) is not involved, nor are assumptions made about the existence of a true underlying function $f$. The model with parameters $(f_n, \sigma_n)$ is regarded as an adequate approximation if typical data generated under the model look like the observed data $\mathcal{Y}_n$. Within the set of parameter values $(f_n, \sigma_n)$ which give an adequate approximation, we then select an $f_n$ which minimizes one or more measures of complexity. The "taut-string" nonparametric regression method of Davies and Kovac (2001) and the corresponding nonparametric density procedure (Davies and Kovac, 2004) are examples of this idea. In both cases, the measure of complexity is the number of peaks. In the regression problem, the definition of approximation is based on the residuals (see below), while in nonparametric density estimation, Kuiper metrics of high order are used (see Davies and Kovac, 2004).

2. Methods

2.1. Wavelet Methods (WH and WS)

Wavelets are defined by
$$ \psi_{j,k}(t) = 2^{j/2}\,\psi(2^j t - k), \qquad j, k \ge 0, $$


where $\psi$ is the so-called mother wavelet, which is often chosen to have compact support. For a suitable $\psi$ the $\psi_{j,k}$ form an orthonormal basis of $L_2(\mathbb{R})$, so that every function $f \in L_2(\mathbb{R})$ can be expressed as
$$ f(t) = \sum_{j,k} w_{j,k}\,\psi_{j,k}(t), \qquad w_{j,k} = \int_{-\infty}^{\infty} f(t)\,\psi_{j,k}(t)\,dt, $$
with wavelet coefficients $w_{j,k}$ (Daubechies, 1992).

If the design points $t_i = i/n$ are equidistant and $n = 2^{J+1}$ is a power of 2, then wavelet representations can be used to construct a signal estimate $f_n$ as follows. First, the finite wavelet transform matrix $\mathcal{W}$ is used to produce a vector $w$ of empirical wavelet coefficients via $w = \mathcal{W} y_n$ with $y_n = (y(t_1), \ldots, y(t_n))^\top \in \mathbb{R}^n$ (see Daubechies, 1992; Donoho and Johnstone, 1994; Nason and Silverman, 1994). The $n = 2^{J+1}$ elements of $w$ can be labeled $w_{j,k}$, $j = 0, \ldots, J$, $k = 0, \ldots, 2^j - 1$, and $w_{-1,0}$ to express a wavelet approximation
$$ f_{n,\Gamma}(t_i) = \sum_{(j,k) \in \Gamma} w_{j,k}\,\psi_{j,k}(t_i), \qquad i = 1, \ldots, n, \qquad (3) $$
where $\psi_{-1,0}(t_i) \equiv 1$ and $\Gamma$ is a subset of pairs $(j,k)$. The optimal subset $\Gamma^*$ can be taken as the one which minimizes the empirical $L_2$ risk
$$ E\,d_2(f, f_{n,\Gamma^*}) = \inf_{\Gamma} E\,d_2(f, f_{n,\Gamma}), $$
where $d_2(f, f_{n,\Gamma}) = \sum_{i=1}^n \big(f(t_i) - f_{n,\Gamma}(t_i)\big)^2 / n$. Minimal risk considerations indicate that $\Gamma^*$ can be estimated by including only those wavelets whose coefficients $w_{j,k}$ in (3) exceed a certain threshold. Donoho and Johnstone (1994) suggest a hard thresholding rule where the subset $\Gamma$ in (3) corresponds to those coefficients satisfying
$$ |w_{j,k}| > \hat\sigma \cdot \sqrt{2 \log n}, \qquad (4) $$
using the median absolute deviation $\hat\sigma$ of the wavelet coefficients. The resulting signal construction is the Wavelet Hard Thresholding Estimator (WH). Donoho and Johnstone (1994) also proposed a wavelet signal approximation based on shrinking the wavelet coefficients to zero using so-called soft thresholding. The VisuShrink (WS) method gives a signal construction $f_{n,VS}$ at the points $(t_i)_{i=1}^n$ by
$$ \big(f_{n,VS}(t_1), \ldots, f_{n,VS}(t_n)\big)^\top = \mathcal{W}^\top\, w_{VS}, $$
with a transformed version $w_{VS}$ of the wavelet coefficient vector $w$ given by
$$ w_{VS,(j,k)} = \begin{cases} w_{j,k}, & j < j_0, \\ \operatorname{sgn}(w_{j,k})\,\big(|w_{j,k}| - \hat\sigma\sqrt{2 \log n}\big)_+, & j_0 \le j \le J, \end{cases} $$
for $j_0$ as in Donoho and Johnstone (1994), chosen to prevent extreme cases in the wavelet transform.
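To make the thresholding step concrete, the following sketch applies hard (WH-style) and soft (WS-style) thresholding with the universal threshold (4). It assumes the PyWavelets package; the wavelet name, the function `wavelet_denoise`, and the use of the finest-level detail coefficients for the scale estimate are illustrative choices of ours, and the level cutoff $j_0$ of VisuShrink is omitted. It sketches the thresholding idea only, not the exact procedures compared below.

```python
# Minimal sketch of universal-threshold wavelet denoising (assumes PyWavelets).
import numpy as np
import pywt


def wavelet_denoise(y, wavelet="db6", mode="hard"):
    """Hard (WH-style) or soft (WS-style) thresholding with threshold (4)."""
    coeffs = pywt.wavedec(y, wavelet)              # [cA_J, cD_J, ..., cD_1]
    # Scale estimate: MAD of the finest-level detail coefficients (an assumption).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2.0 * np.log(len(y)))
    # Keep the coarse approximation, threshold all detail levels.
    new_coeffs = [coeffs[0]] + [
        pywt.threshold(c, thresh, mode=mode) for c in coeffs[1:]
    ]
    return pywt.waverec(new_coeffs, wavelet)[: len(y)]


# Example: noisy HeaviSine-like signal on n = 1024 equidistant points.
t = np.linspace(0.0, 1.0, 1024)
y = 4 * np.sin(4 * np.pi * t) + 0.8 * np.random.standard_normal(1024)
f_hard = wavelet_denoise(y, mode="hard")   # WH-style reconstruction
f_soft = wavelet_denoise(y, mode="soft")   # WS-style reconstruction
```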


2.2. Minimum Description Length Denoising

The idea of minimum description length (MDL) is that a prefix coding scheme corresponds to a probability model and that probability models can therefore be compared by the lengths of the code required to encode the data. A good model describes regularities in the data which can be exploited to reduce the code length. The naive MDL principle involves finding that model, from a collection of candidate models, that provides the shortest encoding of the data (Rissanen, 1989). This naive formulation is too simple and the complexity of the model must also be taken into account, which involves encoding the model in some efficient manner. There is no objective manner of doing this and MDL remains necessarily vague and arbitrary.

MDL for signal extraction by wavelets can be seen as a special case of the variable selection problem in multiple linear regression. Let $\Psi$ be a given $n \times n$ design matrix for linear regression. Any subset $\gamma = \{h_1, \ldots, h_k\}$ of columns of $\Psi$ defines an $n \times k$ matrix $Z_\gamma$ and the corresponding linear regression model may be written as
$$ Y_n = Z_\gamma \beta_k + \sigma \varepsilon, \qquad \beta_k \in \mathbb{R}^k, $$
where the $\varepsilon = (\varepsilon(t_1), \ldots, \varepsilon(t_n))^\top \in \mathbb{R}^n$ are taken to be standard Gaussian white noise. The corresponding probability model for $Y_n$ has parameters $(\gamma, \beta_k, \sigma)$ and density
$$ M_\gamma = \Big\{ p_\gamma(y_n \mid \beta_k, \sigma) = (2\pi\sigma^2)^{-n/2} \exp\Big[ -\frac{1}{2\sigma^2} (y_n - Z_\gamma \beta_k)^\top (y_n - Z_\gamma \beta_k) \Big] \Big\}. $$

The length of the code required for encoding $y_n$ is $-\log p_\gamma(y_n \mid \beta_k, \sigma)$, which is nothing more than the encoding of the residuals. For a given $\gamma$, the shortest encoding is attained when the maximum likelihood estimates $\hat\beta_k \equiv \hat\beta_k(y_n)$ and $\hat\sigma_n \equiv \hat\sigma(y_n)$ for $\beta$ and $\sigma$ are used. If the $Z_\gamma \hat\beta_k$ and $\hat\sigma$ come for free, so to speak, then nothing more is to be said, but the complexity of the model has not been taken into account. This can be done by encoding the $\hat\beta_k$ and $\hat\sigma_n$, but this requires a model for the model (see Rissanen, 1987). One way of overcoming the problem of encoding the parameters is to use a universal code or probability model $p^*_\gamma$ for the class of models. One such universal code is based on the Normalized Maximum Likelihood (NML) distribution
$$ p^*_\gamma(y_n) = p_{\mathrm{NML},\gamma}(y_n) = \frac{p_\gamma\big(y_n \mid \hat\beta_k(y_n), \hat\sigma_n\big)}{C_\gamma}, \qquad (5) $$
where
$$ C_\gamma = \int_{\mathbb{R}^n} p_\gamma\big(x_n \mid \hat\beta_k(x_n), \hat\sigma(x_n)\big)\,dx_n. $$
Even in some of the simplest cases (including the present one of a normal distribution), this requires some adjustment to the range of integration to make the latter integral finite. If it is well defined, the NML distribution is a minimax encoding in that it minimizes the maximum regret
$$ -\log p^*_\gamma(y_n) - \big( -\log p_\gamma(y_n \mid \hat\beta_k(y_n), \hat\sigma_n) \big) $$


over all samples $y_n$. Given data $y_n$, the MDL linear model selection chooses a probability model $p_\gamma(\cdot \mid \hat\beta_k, \hat\sigma_n)$ by minimizing the negative logarithm of (5) over $\gamma$:
$$ -\log p_\gamma\big(y_n \mid \hat\beta_k(y_n), \hat\sigma_n\big) + \log C_\gamma \;\equiv\; \text{Fit} + \text{Complexity}. \qquad (6) $$
The above criterion can be interpreted as a penalized likelihood approach to model selection with a model fit component and a model complexity penalty term (Grünwald, 2000; Rissanen, 1996). We are again left with the problem of whether or not we have to encode the optimal $\gamma$, and if so, how.

This may be applied to nonparametric regression by taking the design matrix $\Psi$ to be $\mathcal{W}$. Simplifications are available as in this particular situation it can be shown that the optimal subset $\gamma$ of size $n-k$ corresponds either to eliminating the wavelets with the $k$ largest or the $k$ smallest coefficients. After some manipulation and approximations, Rissanen (2000) ends up with the following procedure. Order the absolute components of the wavelet coefficient vector, $|w|_{(1)} \le \cdots \le |w|_{(n)}$, let $S(k) = \sum_{i=1}^k |w|_{(i)}^2$, and then choose $k$ by minimizing
$$ (n-k)\,\log\frac{S(n) - S(k)}{n-k} + k\,\log\frac{S(k)}{k} + \log k(n-k). \qquad (7) $$
The MDL principle for wavelet model selection hence becomes a form of hard thresholding where the subset of wavelet coefficients used in (3) corresponds to the largest $k$ coefficients of $w$ in absolute value, that is, to those wavelet coefficients satisfying
$$ |w_{j,k}| \ge \lambda = |w|_{(k)}. $$
We refer to Rissanen (2000) for more details. Rissanen (2000) showed that the MDL wavelet threshold $\lambda$ is asymptotically smaller than the hard threshold (4) of Donoho and Johnstone (1994).
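The selection step (7) is easy to state algorithmically: order the absolute coefficients, form the cumulative sums $S(k)$, and pick the $k$ minimizing the criterion. The following is a minimal sketch in that spirit; the function name, the guard against zero cumulative sums, and the exact boundary convention for the returned threshold are our own choices, not taken from Rissanen (2000).

```python
# Sketch of the MDL coefficient-selection step (7); `w` is a vector of
# wavelet coefficients, e.g. produced by a discrete wavelet transform.
import numpy as np


def mdl_threshold(w):
    """Return (k_opt, threshold) approximately minimizing criterion (7)."""
    a = np.sort(np.abs(w))            # |w|_(1) <= ... <= |w|_(n)
    n = len(a)
    S = np.cumsum(a ** 2)             # S(k) = sum of the k smallest |w|^2
    best_k, best_crit = None, np.inf
    for k in range(1, n):             # k = 1, ..., n-1 keeps both log terms defined
        if S[k - 1] <= 0.0 or S[-1] - S[k - 1] <= 0.0:
            continue                   # guard against exactly-zero coefficients
        crit = (
            (n - k) * np.log((S[-1] - S[k - 1]) / (n - k))
            + k * np.log(S[k - 1] / k)
            + np.log(k * (n - k))
        )
        if crit < best_crit:
            best_k, best_crit = k, crit
    return best_k, a[best_k - 1]       # threshold lambda = |w|_(k)
```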

2.3. Kernel Estimators (Plug-in and AWS)

2.3.1. Local Plug-in Approach (PL). We consider the locally adaptive kernel regression estimator
$$ \hat f(t_j, h_{t_j}) = \sum_{i=1}^n y(t_i) \int_{s_{i-1}}^{s_i} \frac{1}{h_{t_j}}\, K\Big(\frac{t_j - u}{h_{t_j}}\Big)\,du, \qquad j = 1, \ldots, n, \qquad (8) $$
with local bandwidths $h_t$ and $s_i = (t_i + t_{i+1})/2$, $i = 1, \ldots, n-1$, $s_0 = 0$, $s_n = 1$. The kernel $K$ has order $k \ge 2$, which means that the first $k-1$ moments of $K$ vanish but the $k$th moment is nonzero. Expressions for the $h_t$ can be obtained by minimizing the pointwise mean squared error
$$ \mathrm{MSE}\big(\hat f_n(t, h_t)\big) = E\big(\hat f_n(t, h_t) - f(t)\big)^2. $$
These depend on the unknown function $f$, and in a first step an initial estimate $\hat f_n$ of $f$ based on a global bandwidth $h$ is obtained, which is plugged into the expressions for the local bandwidths. These in turn are plugged into (8). The complete algorithm is as follows (see Brockmann et al., 1993; Herrmann, 1997).


1. Set $h_0 = (k-1)/n$.
2. Iterate for $i = 1, \ldots, (k+1)(2k+1)$: set $g_i = c\,h_{i-1}\,n^{2/((2k+1)(2k+3))}$ and
$$ h_i = \left( \frac{\hat\sigma_n^2\, C(K)}{n\, \hat I_k(f, g_i)} \right)^{\frac{1}{2k+1}}. $$
3. Set $\hat h = h_{(k+1)(2k+1)}$, the global plug-in bandwidth.
4. Set the pilot bandwidths $g_i = c_i\,\hat h\, n^{2/((2k+1)(2k+3))}$ for $i = 1, 2$ and the local plug-in bandwidths in (8) to
$$ h_{t_j} = \alpha(t_j, g_1) \left( \frac{\hat S(t_j)\, C(K)}{n\, \hat r_k^2(t_j, g_1, g_2)} \right)^{\frac{1}{2k+1}} + \big(1 - \alpha(t_j, g_1)\big)\,\hat h, \qquad j = 1, \ldots, n, $$
for which the following must be specified or computed: a kernel constant $C(K)$; a variance estimator $\hat\sigma_n^2$ for $\sigma^2$; a weight function $w(t)$; an estimator $\hat I_k(f, g)$, depending on a further kernel $M(k)$ and bandwidth $g$, of the integral functional $\int w(t)\,(f^{(k)}(t))^2\,dt$ involving the $k$th derivative $f^{(k)}$ of the signal; a local variance estimator $\hat S(t)$; a local estimator $\hat r_k^2(t, g_1, g_2)$ of $(f^{(k)}(t))^2$ that depends on a kernel $K$ and bandwidths $g_1$ and $g_2$; a weight function $\alpha(t, g_1)$; and iteration constants $c$, $c_1$, and $c_2$.

See Herrmann (1997) for details.
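For orientation, the following sketch evaluates the Gasser–Müller type estimator (8) with a single fixed bandwidth $h$ and an Epanechnikov kernel; it deliberately omits the plug-in bandwidth selection steps 1–4 above, which are the substance of the PL method, and the function and variable names are illustrative.

```python
# Sketch of the estimator (8) with a *fixed* bandwidth h and Epanechnikov kernel.
import numpy as np


def epan_cdf(u):
    """Integral of the Epanechnikov kernel K(v) = 0.75*(1 - v^2) over (-inf, u]."""
    u = np.clip(u, -1.0, 1.0)
    return 0.25 * (2.0 + 3.0 * u - u ** 3)


def gasser_mueller(t, y, grid, h):
    """Evaluate sum_i y_i * int_{s_{i-1}}^{s_i} K((x - u)/h)/h du at each x in grid."""
    s = np.concatenate(([0.0], (t[:-1] + t[1:]) / 2.0, [1.0]))   # s_0, ..., s_n
    est = np.empty(len(grid))
    for j, x in enumerate(grid):
        # Integral of the scaled kernel over [s_{i-1}, s_i] for every i.
        weights = epan_cdf((x - s[:-1]) / h) - epan_cdf((x - s[1:]) / h)
        est[j] = np.sum(y * weights)
    return est
```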

2.3.2. Adaptive Weights Smoothing (AWS). The second kernel estimator we consider is Adaptive Weights Smoothing (AWS) (see Polzehl and Spokoiny, 2000, 2003). Suppose a parametric family
$$ \Big\{ f_\theta(x) = \sum_{i=1}^p \theta_i\,\psi_i(x),\ \theta \in \mathbb{R}^p \Big\} $$
is specified with basis functions $(\psi_i(x))_{i=1}^p$ (e.g., polynomials $\psi_i(x) = x^{i-1}$). The idea is to find the largest neighborhood of every design point $t_i$ in which the underlying signal function $f(t_i)$ can be well approximated by a function $f_\theta$.

Fix $i \in \{1, \ldots, n\}$. The estimate $f_n(t_i)$ of $f$ at $t_i$ is defined as a weighted mean of the observations $y(t_j)$ with nonnegative weights $w_{ij}$, $j = 1, \ldots, n$:
$$ f_n(t_i) = \sum_{j=1}^n w_{ij}\, y(t_j) \Big/ \sum_{j=1}^n w_{ij}. \qquad (9) $$
The final estimator in (9) is found by iteratively computing estimators
$$ f_n^{(k)}(t_i) = \sum_{j=1}^n w_{ij}^{(k)}\, y(t_j) \Big/ \sum_{j=1}^n w_{ij}^{(k)} \qquad (10) $$
based on updated weights $w_{ij}^{(k)}$ determined by three kernel functions $K_l$, $K_s$, and $K_e$ on the positive half-axis with $K_l(0) = K_s(0) = K_e(0) = 1$. Initial weights are set as $w_{ij}^{(0)} = K_l\big( ((t_i - t_j)/h^{(0)})^2 \big)$ using an initial kernel bandwidth estimate $h^{(0)}$.


Then, with a memory parameter $\eta \in [0, 1]$ and a scaling factor $a > 1$, weights are iteratively defined by
$$ w_{ij}^{(k)} = \eta \cdot w_{ij}^{(k-1)} + (1 - \eta) \cdot K_l\Big( \frac{(t_i - t_j)^2}{(a^{k-1} h^{(0)})^2} \Big)\, K_s\big(s_{ij}^{(k)}\big)\, K_e\big(e_{ij}^{(k)}\big), $$
using computed penalties $s_{ij}^{(k)}$ and $e_{ij}^{(k)}$, $j = 1, \ldots, n$. Comparing the estimates $f_n^{(k-1)}(t_i)$ and $f_n^{(k-1)}(t_j)$ from the previous iteration of (10), the statistical penalty $s_{ij}^{(k)}$ is evaluated as a localized maximum likelihood contrast divided by a numerical tuning parameter. The penalty $e_{ij}^{(k)}$ aims to limit the influence of an observation when $t_j$ is deemed to be a high leverage point for the local fit $f_n(t_i)$; this involves another tuning parameter. For a specified maximal bandwidth $h_{\max}$, (10) is computed over $k = 1 + \log(h_{\max}/h^{(0)})/\log a$ iterations.

See Polzehl and Spokoiny (2003) for details.
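A heavily simplified sketch of the AWS iteration (9)–(10) for a local-constant model is given below: only the location kernel and the statistical penalty are used, the memory parameter and the influence penalty $e^{(k)}_{ij}$ are dropped, and the triangular kernels and the tuning constant `lam` are our own choices rather than those of Polzehl and Spokoiny (2000, 2003).

```python
# Simplified adaptive-weights smoothing sketch for a local-constant model.
import numpy as np


def aws_local_constant(t, y, sigma, h0=0.01, hmax=0.25, a=1.25, lam=8.0):
    dist2 = (t[:, None] - t[None, :]) ** 2       # (t_i - t_j)^2
    f_hat = y.copy()                             # start from the raw data
    h = h0
    while h <= hmax:
        # Location kernel K_l and statistical penalty kernel K_s, both triangular.
        k_loc = np.clip(1.0 - dist2 / h ** 2, 0.0, None)
        contrast = (f_hat[:, None] - f_hat[None, :]) ** 2 / (lam * sigma ** 2)
        k_stat = np.clip(1.0 - contrast, 0.0, None)
        w = k_loc * k_stat
        f_hat = (w @ y) / w.sum(axis=1)          # weighted means as in (10)
        h *= a                                   # enlarge the neighbourhood
    return f_hat
```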

2.4. Data Approximation Methods (TS and TV)

The philosophy behind the taut-string method of Davies and Kovac (2001) is the following. Firstly, a precise definition of adequate approximation for the model (1) is given and then, within the class of adequate approximations, a simplest approximation is required. This is approximation followed by regularization. An arbitrary function $f_n$ is regarded as an adequate approximation for the data $\mathcal{Y}_n$ if the residuals $r_n(t_i) = y_n(t_i) - f_n(t_i)$ "look like" white noise with variance $\sigma^2$. The property of white noise which is used is that normalized sums of white noise are again Gaussian random variables with the same variance. This leads to the following condition:
$$ \max_{I \in \mathcal{I}} \frac{1}{\sqrt{|I|}} \Big| \sum_{t_j \in I} \big( y(t_j) - f_n(t_j) \big) \Big| \le \sigma \sqrt{\tau \log n}, \qquad (11) $$
where $|I|$ is the number of points $t_j \in I$, $\mathcal{I}$ is a collection of subintervals of $[0,1]$, and $\tau > 2$ is some constant. The justification of (11) is the following. The particular form is chosen because it forces the function $f_n$ to be close to the data whilst taking the noise level into account. It is a multiresolution scheme as it looks at the deviations over all scales of intervals, from single points to the whole interval. If $\mathcal{I}$ is the set of all intervals and the $y(t_i)$ are generated by (1), then (11) is satisfied asymptotically with $f_n = f$ for $\tau > 2$. This follows from a result on the uniform modulus of continuity of Brownian motion on $[0,1]$ due to Dümbgen and Spokoiny (2001). To make (11) operational we must estimate the noise level $\sigma$ and specify a constant $\tau$. The default value we use is $\tau = 2.5$ together with the estimate
$$ \sigma_n = \frac{1.48}{\sqrt{2}}\; \mathrm{median}\big( |y(t_2) - y(t_1)|, \ldots, |y(t_n) - y(t_{n-1})| \big). \qquad (12) $$

Another possibility of quantifying the noise for equally spaced design points is the following (see Davies and Kovac, 2006). Using the FFT, we first compute the periodogram $I_y(\lambda)$ of the data,
$$ I_y(\lambda) = \frac{1}{2\pi n} \Big| \sum_{i=1}^n y(t_i)\, e^{-\sqrt{-1}\,\lambda i} \Big|^2. $$


If the signal does not contain very high frequencies, then $I_y(\lambda)$ is almost constant, equal to $\sigma^2/(2\pi)$, at such frequencies, and the noise parameter $\sigma^2$ can then be estimated from $I_y(\lambda)$. Whatever version of $\sigma_n$ is used, we are led to the following definition of approximation:
$$ \max_{I \in \mathcal{I}} \frac{1}{\sqrt{|I|}} \Big| \sum_{t_j \in I} \big( y(t_j) - f_n(t_j) \big) \Big| \le \sigma_n \sqrt{\tau \log n}. \qquad (13) $$
In practice, there is little to be gained by taking all intervals, and dyadic multiresolution schemes as for wavelets can be used which are much faster (see Davies and Kovac, 2001). There is no restriction either to equally spaced design points or to powers of two.
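As an illustration, the following sketch checks the adequacy condition (13) on disjoint dyadic blocks together with the difference-based scale estimate (12). The default $\tau = 2.5$ follows the text, while the restriction to disjoint dyadic blocks and the function names are our own simplifications.

```python
# Sketch of the multiresolution adequacy check (13) with scale estimate (12).
import numpy as np


def sigma_hat(y):
    """Scale estimate (12) from first differences."""
    return 1.48 * np.median(np.abs(np.diff(y))) / np.sqrt(2.0)


def is_adequate(y, fn, tau=2.5):
    """Check condition (13) for the residuals y - fn over disjoint dyadic blocks."""
    r = y - fn
    n = len(r)
    bound = sigma_hat(y) * np.sqrt(tau * np.log(n))
    size = 1
    while size <= n:
        for start in range(0, n - size + 1, size):   # disjoint blocks of this scale
            block = r[start:start + size]
            if abs(block.sum()) / np.sqrt(size) > bound:
                return False
        size *= 2
    return True
```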

There are many functions $f_n$ which are adequate approximations in the sense of (13). An extreme example is any function which interpolates the $y(t_i)$. The second step consists of a regularization in which we calculate the simplest approximating function. The two definitions of simple we use are the number of local extremes (TS) and the total variation (TV) of $f_n$. In other words, in one case we wish to minimize the number of local extremes of $f_n$ subject to (13), whereas in the second case we wish to minimize
$$ \sum_{i=2}^n |f_n(t_i) - f_n(t_{i-1})|. \qquad (14) $$
To minimize the number of local extremes subject to (13), Davies and Kovac (2001) developed the taut-string algorithm. Although this does not guarantee the exact solution, it is a fast $O(n)$ algorithm and very often does give the correct solution. This can be checked post facto. Minimizing (14) subject to (13) is a linear programming problem.

Kovac (2007) provides a smooth version of the taut-string method and compares it with several other methods in a small simulation study. Some of those methods are also included in our study (TS, WH with another wavelet basis, PL). Where comparable, the results coincide.

3. Test Beds and Loss Functions

3.1. Test Beds

The test beds we use are those introduced in Donoho and Johnstone (1994), known as Blocks, Bumps, Heavisine, and Doppler. We also include a constant signal as well as a heavily oscillating sine function which terminates with a constant. The signals are defined below and depicted in Figure 1; a short code sketch of the signals follows the list. We normalized the signals such that we get a signal-to-noise ratio of 7 when adding noise with standard deviation 1.

• Doppler:
$$ f(t) = \sqrt{t(1-t)}\, \sin\Big( \frac{2\pi(1+\epsilon)}{t + \epsilon} \Big), $$
with $\epsilon = 0.05$.


Figure 1. The regression functions used as test beds.

• Bumps:
$$ f(t) = \sum_j h_j\, K\Big(\frac{t - t_j}{w_j}\Big), \qquad K(t) = (1 + |t|)^{-4}, $$
with $(t_j) = (10, 13, 15, 23, 25, 40, 44, 65, 76, 78, 81)/100$, $(h_j) = (40, 50, 30, 40, 50, 42, 21, 43, 31, 51, 42)/10$, and $(w_j) = (5, 5, 6, 10, 10, 30, 10, 10, 5, 8, 5)/1000$.

• HeaviSine:
$$ f(t) = 4 \sin(4\pi t) - \operatorname{sgn}(t - 0.3) - \operatorname{sgn}(0.72 - t). $$

• Blocks:
$$ f(t) = \sum_j h_j\, K(t - t_j), \qquad K(t) = \frac{1 + \operatorname{sgn}(t)}{2}, $$
with $(t_j) = (10, 13, 15, 23, 25, 40, 44, 65, 76, 78, 81)/100$ and $(h_j) = (40, -50, 30, -40, 50, -42, 21, 43, -31, 21, -42)/10$.

• Sine:
$$ f(t) = \begin{cases} \sin(100\pi t), & t \le 0.8, \\ 0, & t > 0.8. \end{cases} $$
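For reference, the test-bed signals above can be generated as follows (a sketch; the rescaling assumes that the signal-to-noise ratio of 7 refers to the ratio of the sample standard deviation of the signal to the noise standard deviation, which is our reading of the normalization described above).

```python
# Test-bed signals as defined above; t is a numpy array of design points in [0, 1].
import numpy as np


def doppler(t, eps=0.05):
    return np.sqrt(t * (1 - t)) * np.sin(2 * np.pi * (1 + eps) / (t + eps))


def bumps(t):
    tj = np.array([10, 13, 15, 23, 25, 40, 44, 65, 76, 78, 81]) / 100
    hj = np.array([40, 50, 30, 40, 50, 42, 21, 43, 31, 51, 42]) / 10
    wj = np.array([5, 5, 6, 10, 10, 30, 10, 10, 5, 8, 5]) / 1000
    return sum(h * (1 + np.abs((t - tk) / w)) ** -4 for tk, h, w in zip(tj, hj, wj))


def heavisine(t):
    return 4 * np.sin(4 * np.pi * t) - np.sign(t - 0.3) - np.sign(0.72 - t)


def blocks(t):
    tj = np.array([10, 13, 15, 23, 25, 40, 44, 65, 76, 78, 81]) / 100
    hj = np.array([40, -50, 30, -40, 50, -42, 21, 43, -31, 21, -42]) / 10
    return sum(h * (1 + np.sign(t - tk)) / 2 for tk, h in zip(tj, hj))


def sine(t):
    return np.where(t <= 0.8, np.sin(100 * np.pi * t), 0.0)


def rescale_snr(f, target_snr=7.0):
    """Rescale a sampled signal so its sample standard deviation equals target_snr
    (i.e., SNR 7 when standard Gaussian noise is added) -- assumed convention."""
    return f * target_snr / f.std()
```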


3.2. Lp-Norms

The loss functions we consider are the empirical versions of the $L_2$- and $L_\infty$-norms defined by
$$ d_2(f, g) = \Big( \frac{1}{n} \sum_{i=1}^n \big(f(t_i) - g(t_i)\big)^2 \Big)^{1/2}, \qquad (15) $$
$$ d_\infty(f, g) = \max_{1 \le i \le n} |f(t_i) - g(t_i)| \qquad (16) $$
(see Donoho and Johnstone, 1994; Rissanen, 2000). For any given test bed with function $f$ and for any given procedure resulting in some $f_n$, the measures of performance are the average values of $d_2(f, f_n)$ and $d_\infty(f, f_n)$ over the simulations.

3.3. Identification of Extremes

We introduce a new loss which measures how well the extremes (e.g., peaks and troughs) of an estimate $f_n$ match those of the test signal $f$. There are two possible errors. The reconstruction $f_n$ can fail to have a local extreme of the correct type at a point where the target signal $f$ exhibits one. The second type of error is that $f_n$ exhibits a local extreme at a point where the test bed function does not have one. Both types of error must allow for a certain degree of error in the position of the local extreme. With this in mind, we propose a peak identification loss (PID) as follows.

Let $n_{\mathrm{extr}}$ denote the number of local extremes of the signal function $f$ at the design points and let $n_{\mathrm{est}}$ denote the number of local extremes of a reconstruction $f_n$ of $f$. The number of local extremes of $f$ that are correctly identified by $f_n$ will be denoted by $n_{\mathrm{id}}$. The peak identification loss is now defined by
$$ d_{\mathrm{id}}(f, f_n) = \operatorname{sgn}(n_{\mathrm{est}} - n_{\mathrm{extr}}) \big( (n_{\mathrm{extr}} - n_{\mathrm{id}}) + (n_{\mathrm{est}} - n_{\mathrm{id}}) \big) = \operatorname{sgn}(n_{\mathrm{est}} - n_{\mathrm{extr}}) \big( n_{\mathrm{extr}} + n_{\mathrm{est}} - 2 n_{\mathrm{id}} \big), \qquad (17) $$
where the counts $(n_{\mathrm{extr}} - n_{\mathrm{id}})$ and $(n_{\mathrm{est}} - n_{\mathrm{id}})$ measure the extent of the two errors described above. We use $\operatorname{sgn}(n_{\mathrm{est}} - n_{\mathrm{extr}})$ so that it is possible to see whether too many (positive sign) or too few (negative sign) local extremes are identified.

We still require a definition of the number $n_{\mathrm{id}}$ of correctly identified local extremes of $f$. We use tolerances as follows. Let $t^{pk}_1 < \cdots < t^{pk}_{n_{pk}} < 1$ and $t^{tr}_1 < \cdots < t^{tr}_{n_{tr}} < 1$ denote, respectively, the positions of the peaks and troughs of the target signal $f$ at the design points, so that $n_{pk} + n_{tr} = n_{\mathrm{extr}}$. For identification purposes, each local extreme of $f$ is assigned a corresponding tolerance given by
$$ \mathrm{tol}(t^{\ell}_i) = \frac{\min\big( t^{\ell}_i - t^{\ell}_{i-1},\ t^{\ell}_{i+1} - t^{\ell}_i \big)}{2}, \qquad i = 1, \ldots, n_{\ell},\quad \ell \in \{pk, tr\}, \qquad (18) $$
where $t^{\ell}_0 = 0$, $t^{\ell}_{n_{\ell}+1} = 1$, and the label $\ell$ represents peaks ($pk$) or troughs ($tr$). A peak of $f$ at a position $t^{pk}_i$ is said to be correctly identified by a reconstruction $f_n$ if the position of a peak of $f_n$ is within a distance $\mathrm{tol}(t^{pk}_i)$ of $t^{pk}_i$, $i = 1, \ldots, n_{pk}$. An analogous definition holds for troughs. The count $n_{\mathrm{id}}$ is the number of local extremes $(t^{pk}_i)_{i=1}^{n_{pk}}$ and $(t^{tr}_i)_{i=1}^{n_{tr}}$ correctly identified by the reconstruction $f_n$. The tolerances specified in (18) are not severe.
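A sketch of the PID computation (17) with the tolerances (18) is given below. The rule used to locate local extremes (sign changes of first differences) and the handling of flat stretches are our own choices; the paper does not prescribe them.

```python
# Sketch of the peak identification loss (17) with tolerances (18).
import numpy as np


def local_extremes(t, f):
    """Return positions of local maxima (peaks) and minima (troughs) of f."""
    d = np.sign(np.diff(f))
    change = np.diff(d)
    peaks = t[1:-1][change < 0]
    troughs = t[1:-1][change > 0]
    return peaks, troughs


def tolerances(pos):
    """Tolerances (18): half the distance to the nearest neighbouring extreme."""
    padded = np.concatenate(([0.0], pos, [1.0]))
    gaps = np.diff(padded)
    return np.minimum(gaps[:-1], gaps[1:]) / 2.0


def pid(t, f_true, f_rec):
    pk, tr = local_extremes(t, f_true)
    pk_hat, tr_hat = local_extremes(t, f_rec)
    n_extr, n_est = len(pk) + len(tr), len(pk_hat) + len(tr_hat)
    n_id = 0
    for pos, pos_hat in ((pk, pk_hat), (tr, tr_hat)):
        for p, tol in zip(pos, tolerances(pos)):
            if pos_hat.size and np.min(np.abs(pos_hat - p)) <= tol:
                n_id += 1
    return np.sign(n_est - n_extr) * (n_extr + n_est - 2 * n_id)
```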


4. Simulations

We conducted two simulation studies to compare the performance of the seven methods for nonparametric regression described in Sec. 2: the wavelet methods WH and WS (with Daubechies' Extremal Phase family wavelets, $N = 6$), the MDL method, the kernel methods Plug-in (PL) and AWS, and the data approximation methods taut-string (TS) and total variation (TV). The sample sizes used are $n = 2^7 = 128$, $2^8 = 256$, $2^{10} = 1024$, $2^{11} = 2048$, and $2^{12} = 4096$. In the first study, the design points are taken equidistant: $t_i = i/n$, $i = 1, \ldots, n$. In the second study, we use non equidistant design points, so the $t_i$, $i = 1, \ldots, n$, are generated randomly from a standard uniform distribution. The standard deviations $\sigma$ in (1) were taken to be $\sigma = 0.4$, $0.8$, $1.0$, and $1.4$. We also simulated a situation with non Gaussian noise: we used the same test beds but with standard Cauchy noise with location parameter 0, scale parameter 1, and sample size $n = 1024$.

To measure the performance of the signal approximations we compute risks based on the $L_2$, $L_\infty$, and PID losses described in Secs. 3.2 and 3.3. For each method, test signal function $f$, $\sigma$-value, and sample size $n$, we generated 1,000 independent samples $\mathcal{Y}_j(f, n, \sigma)$ of size $n$ to approximate the risk
$$ R_n(f, f_n, d) = \frac{1}{1000} \sum_{j=1}^{1000} d(f, f_{nj}), $$
using the reconstruction $f_{nj}$ of $f$ based on each sample $\mathcal{Y}_j(f, n, \sigma)$, $j = 1, \ldots, 1000$, and losses $d = d_2^2$, $d_\infty$, $d_{\mathrm{id}}$ given in (15)–(17). For the PID given in (17), it is not meaningful to take a mean of the 1,000 single values because of the sign, so we use the mean of the absolute values of the PID instead.
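The risk approximation itself is a plain Monte Carlo loop; a sketch follows, where `method` stands for any reconstruction routine taking the design points and the observations (its signature is an assumption on our part) and the loss shown is the squared $L_2$ loss of (15).

```python
# Sketch of the Monte Carlo risk approximation used in the study.
import numpy as np


def mc_risk(method, f, n, sigma, n_rep=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(1, n + 1) / n                 # equidistant design points
    f_t = f(t)
    total = 0.0
    for _ in range(n_rep):
        y = f_t + sigma * rng.standard_normal(n)
        fn = method(t, y)                       # reconstruction for this sample
        total += np.mean((f_t - fn) ** 2)       # squared L2 loss, cf. (15)
    return total / n_rep
```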

To calculate the reconstructions for the non equidistant design points we use the same methods as for equidistant ones. Although versions for non equidistant data exist for WH, WS, AWS, and PL, we used the standard versions because they performed better on our test beds. For the MDL method there is no version for non equidistant design points. The methods TS and TV can be applied without change to non equidistant data.

5. Results and Summary

Figures 2 and 3 show the average ranks of the seven methods for every situation of the simulation study. Figures 4–7 show signal reconstructions based on samples from several test bed signals. In Figures 8–10, the results of the simulation studies are depicted. Figure 8 shows the results for the (squared) $L_2$-norm with $\sigma = 0.8$ and equidistant design points for the six test beds and the five sample sizes. For every situation the seven results are normalized such that the smallest result (best) is 0 and the largest result (worst) is 1. The results are ordered from right to left, so the method with the best result is always on the right and the worst on the left. In Figures 9 and 10, the results for the $L_\infty$-norm with $\sigma = 1.0$ and non equidistant design points and the results for the PID with $\sigma = 1.0$ and equidistant design points are shown, respectively. As the simulation results show no large differences in the ranking of the methods over different values of $\sigma$, we report only the results for one choice of $\sigma$.

As could be expected, the results for Cauchy noise are much worse than those for Gaussian noise. The $L_\infty$- and $L_2$-norms give average values at least $10^3$ times larger than those for Gaussian noise.


Figure 2. Average rank of the seven methods for equidistant design points.

Figure 3. Average rank of the seven methods for nonequidistant design points.

Figure 4. Reconstructions of the Doppler function with n = 1,024 and σ = 1.0.

Figure 5. Reconstructions of the Blocks function with n = 1,024 and σ = 1.0.


Figure 6. Reconstructions of the Heavisine function with n = 1,024 and σ = 1.0.

Since Cauchy noise leads to very small as well as very large values being added to the signal functions, all methods except the PL-method tend to exhibit too many extremes, so that the PID in general gives larger values than for Gaussian noise.

Although all results are poor in this situation, we can identify a "best method" under Cauchy noise: the PL-method. This holds for all performance criteria.

We can summarize the results of the simulation studies with Gaussian noise as follows:

• With the exception of TS and TV, the performance with respect to peak identification deteriorates as the sample size increases.

• The results for equidistant and non equidistant design points do not differ greatly.

• The MDL method often produces too many local extremes (see Sec. 2.2 for an explanation).

Figure 7. Reconstructions of the Bumps function with n = 1,024 (non equidistant) and σ = 1.0.


Figure 8. Normalized simulation results (σ = 0.8, equidistant) measured by the squared L₂-norm. For each test bed function and sample size, the results are ordered such that the best result is on the right and the worst on the left.

Figure 9. Normalized simulation results (σ = 1.0, non equidistant) measured by the L∞-norm. For each test bed function and sample size, the results are ordered such that the best result is on the right and the worst on the left.


Figure 10. Normalized simulation results (σ = 0.8, equidistant) measured by the PID. For each test bed function and sample size, the results are ordered such that the best result is on the right and the worst on the left.

• The reconstructions produced by the kernel and wavelet methods (WH, WS, AWS, and PL) often failed to reproduce the magnitudes of the peaks of the Bumps function.

• As the Blocks function is piecewise constant, all the methods apart from TS and TV performed poorly, as they are designed to give smooth reconstructions.

• TS is better than TV.

• There are two types of behavior for the sine function. Either the signal is not recognized at all or the reconstruction is reasonable. The reconstructions improved for larger sample sizes n and smaller σ.

• The MDL method performs extremely badly on the white noise test bed. Calculations show that in this situation about 60% of the wavelet coefficients will be retained. The other wavelet methods and the kernel methods show a wavy line, while TS and TV often give a constant.

• Overall, TS and TV perform best. Of the 720 scenarios of the simulation study, TS is best or second best in 490 cases. Of the remaining 230 cases, it is third best 82 times. It was ranked 5 or 6 83 times, 53 of which on the sine test bed. This can also be seen in Figs. 2–3: TS and TV have the smallest average ranks.

Acknowledgment

This work has been supported by the Collaborative Research Center 'Reduction of Complexity in Multivariate Data Structures' (SFB 475) of the German Research Foundation (DFG).


References

Brockmann, M., Gasser, T., Herrmann, E. (1993). Locally adaptive bandwidth choice for kernel regression estimators. J. Amer. Statist. Assoc. 88:1302–1309.

Daubechies, I. (1992). Ten Lectures on Wavelets. Philadelphia: SIAM.

Davies, P. L. (1995). Data features. Statistica Neerlandica 49:185–245.

Davies, P. L. (2003). Approximating data and statistical procedures – I. Approximating data. Technical Report 7/2003, SFB 475, Department of Statistics, University of Dortmund, Germany.

Davies, P. L., Kovac, A. (2001). Local extremes, runs, strings and multiresolution (with discussion and rejoinder). Ann. Statist. 29:1–65.

Davies, P. L., Kovac, A. (2004). Densities, spectral densities and modality. Ann. Statist. 32:1093–1136.

Davies, P. L., Kovac, A. (2006). Quantifying the cost of simultaneous non-parametric approximation of several samples. Technical Report.

Donoho, D. L., Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81:425–455.

Dümbgen, L., Spokoiny, V. G. (2001). Multiscale testing of qualitative hypotheses. Ann. Statist. 29:124–152.

Grünwald, P. (2000). Model selection based on minimum description length. J. Mathemat. Psychol. 44:133–152.

Herrmann, E. (1997). Local bandwidth choice in kernel regression estimation. J. Computat. Graph. Statist. 6:35–54.

Kovac, A. (2007). Smooth functions and local extreme values. Computat. Statist. Data Anal. 51:5155–5171.

Nason, G. P., Silverman, B. W. (1994). The discrete wavelet transform in S. J. Computat. Graph. Statist. 3:163–191.

Polzehl, J., Spokoiny, V. G. (2000). Adaptive weights smoothing with applications to image restoration. J. Roy. Statist. Soc. B 62:335–354.

Polzehl, J., Spokoiny, V. G. (2003). Varying coefficient regression modeling. Preprint.

Rissanen, J. (1987). Stochastic complexity. J. Roy. Statist. Soc. B 49:223–239.

Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inform. Theor. 42:40–47.

Rissanen, J. (2000). MDL denoising. IEEE Trans. Inform. Theor. 46:2537–2543.
