The Asymptotic Distributions of The Kernel
Estimations of The Conditional Mode and Quantiles
December 23, 2008
THE ISLAMIC UNIVERSITY of GAZA
DEANERY of HIGHER STUDIES
FACULTY of SCIENCE
DEPARTMENT of MATHEMATICS
The Asymptotic Distributions of The Kernel Estimations of The Conditional
Mode and Quantiles
PRESENTED BY
Hossam Othman M. El-sayed
SUPERVISED BY
Dr. Raid Salha
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENT
FOR THE DEGREE OF MASTER OF MATHEMATICS
1429-2008
To my family...
Contents
Table of Contents iii
Acknowledgment iv
Abstract v
List of Figures vi
List of Tables vi
Preface 1
1 Introduction 3
1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Kernel Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Kernel Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Properties and Examples of the Kernels . . . . . . . . . . . . . . . . . . . 14
1.4 The MSE and MISE Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Asymptotic MSE and MISE Approximations . . . . . . . . . . . . . . . . . 18
1.6 Optimal Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.7 Optimal Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 On the Estimation of the Mode 29
2.1 Mode Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2 A Simple Estimation of the Mode . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Nonparametric Regression Estimation . . . . . . . . . . . . . . . . . . . . . 39
2.4 Joint Asymptotic Distribution of the Estimated Conditional Mode . . . . . 43
3 Quantiles Regression 53
3.1 Nonparametric estimation of conditional quantiles . . . . . . . . . . . . . . 54
3.2 Joint Asymptotic Distribution of the Conditional Quantiles . . . . . . . . . 68
3.3 Mode and Median as a Comparison . . . . . . . . . . . . . . . . . . . . . . 83
Bibliography 84
Acknowledgment
First of all, I give my great thanks to the Almighty Allah, who has always helped me and granted me the power and courage to finish this study and given me success in my life.
My gratitude and respect are paid to my supervisor Dr. Raid Salha for all the interesting discussions I had with him.
I am grateful to the Islamic University in Gaza for offering me the opportunity to get the Master Degree of Mathematics, and my thanks go to all the professors who taught me in the mathematics department. I would like to express my deep thanks and appreciation to my family, especially my parents, for their encouragement.
I wish also to thank my colleagues and my friends who provided suggestions for this study.
Finally, I pray to Allah to accept this work.
Abstract
In this thesis, we study the kernel estimation of the conditional probability density func-
tion and two of its aspects, the conditional mode and the conditional quantiles.
For the conditional mode, we study the asymptotic normality of its kernel estimation from
[18] and we study the conditions under which the conditional mode estimated at finite
distinct points is asymptotically normally distributed.
Also, we study the kernel estimation of the conditional quantile from [1] and we study the conditions under which the joint distribution of several conditional quantiles is asymptotically normal.
List of Figures
1.1 Kernel density estimation based on 7 points 14
1.2 Kernel density estimates based on different bandwidths 23
1.3 The Epanechnikov kernel K∗ 27
1.4 Kernel density estimates of the Ethanol data 28
List of Tables
1.1 Common kernel functions 15
1.2 Efficiency of several kernels compared to the optimal kernel K∗ 28
Preface
The probability density function is a fundamental concept in statistics. Suppose we have
a set of observed data points assumed to be a sample from an unknown probability density
function f . The construction of an estimate of the density function from observed data
is known as density estimation.
The classical approach for estimating the density function is called parametric density
estimation. Here one assumes that the data are drawn from a known parametric distrib-
ution which depends only on finitely many parameters, and one uses the data to estimate
the unknown values of these parameters. For example, the normal distribution depends
on two parameters , the mean µ and the variance σ2. The density function f could be
estimated by finding estimates of µ and σ2 from the data, and substituting these estimates
into the formula for the normal density.
Parametric estimates usually depend only on a few parameters, so they are suitable even for small sample sizes n. Another approach to density estimation is nonparametric estimation, for example histograms, the naive estimator, the kernel estimator, etc.
We will concentrate on the kernel estimator. In this case we do not assume that the data are drawn from a known parametric distribution. The data are allowed to decide which function fits them best, without the restrictions imposed by parametric estimation.
For more details see [22].
There are several reasons for using the nonparametric smoothing approaches.
1) They can be employed as a convenient and succinct means of displaying the features
of the data set and hence to aid practical parametric model building.
2) They can be used for diagnostic checking of an estimated parametric model.
3) One may want to conduct inference under only the minimal restrictions imposed in
fully nonparametric structures. For more details see [20].
The main subject of this thesis, is the kernel estimation of the probability density
function, and the conditional distribution function.
Now suppose that (X_i, Y_i) are R×R valued random variables with a common probability density function f. We want to study the relationship between a response variable Y and
a predictor variable X. To explore this relationship, we use the regression analysis to
quantify it.
The conditional distribution function F (Y |X = x) is very important for solving this prob-
lem. In parametric and nonparametric estimation of the conditional distribution function
most investigation of the underlying structure is concerned with the mean regression func-
tion m(x) = E(Y |X = x), the conditional mean of Y given the value x of X. New insight
about the underlying structure can be gained by considering other aspects of the condi-
tional distribution function F (Y |X).
In this thesis, we will study two other aspects of the conditional distribution function,
its mode and quantiles.
This thesis consists of three chapters. In the first chapter, we present some basic definitions and theorems which will be used in the next chapters. Also, we present the idea of the
kernel estimation of the probability density function and some related topics.
In Chapter two, we introduce the kernel estimation for the mode and the conditional mode
function in the case of independent and identically distributed (i.i.d.) random variables.
We will study the asymptotic behavior of the estimators of the mode and the conditional
mode functions.
Finally, in Chapter three we will study the kernel estimation of the conditional quantile and the asymptotic behavior of this estimator.
Chapter 1
Introduction
This chapter contains some basic definitions and facts that we need in the remainder of this thesis. In Section 1.1, we present some preliminaries in probability and statistics.
And in the remaining sections of this chapter, we present the idea of the kernel estimation
and some important subjects related to it.
1.1 Preliminaries
In this section, we will introduce some basic definitions and theorems that will help in the remainder of this thesis.
Definition 1.1.1. [8](σ − Field). Let B be a collection of subsets of C. We say that B
is a σ − Field if
(1) φ ∈ B, (B is not empty).
(2) If A ∈ B, then Ac ∈ B, (B is closed under complements).
(3) If the sequence of sets C_1, C_2, \ldots is in \mathcal{B}, then \bigcup_{i=1}^{\infty} C_i \in \mathcal{B}, (\mathcal{B} is closed under countable unions).
Definition 1.1.2. [8](Probability). Let C be a sample space and let B be a σ − Field
on C. Let P be a real valued function defined on B. Then P is a probability set function
if P satisfies the following three conditions:
1. P(C) \geq 0, for all C \in \mathcal{B}.
2. P(\mathcal{C}) = 1, where \mathcal{C} is the sample space.
3. If \{C_n\} is a sequence of sets in \mathcal{B} and C_m \cap C_n = \emptyset for all m \neq n, then
P\left(\bigcup_{n=1}^{\infty} C_n\right) = \sum_{n=1}^{\infty} P(C_n).
Definition 1.1.3. [8] Consider a random experiment with a sample space C. A function
X, which assigns to each element c ∈ C one and only one number X(c) = x, is called a
random variable. The space or range of X is the set of real numbers D = \{x : x = X(c), c \in \mathcal{C}\}. D will generally be a countable set or an interval of real numbers.
Definition 1.1.4. [6] If X is a discrete random variable, the function given by f(x) =
P (X = x) for each x within the range of X is called the probability distribution of
X.
Definition 1.1.5. [6] If X is a discrete random variable, the function given by
F(x) = P(X \leq x) = \sum_{t \leq x} f(t) \qquad \text{for } -\infty < x < \infty,
where f(t) is the value of the probability distribution of X at t, is called the dis-
tribution function, or the cumulative distribution function, of X and denoted by
(cdf).
Definition 1.1.6. [6] A function with values f(x), defined over the set of all real numbers,
is called a probability density function of the continuous random variable X if and
only if
P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx
Definition 1.1.7. [6] If X is a continuous random variable and the value of its probability
density at t is f(t), then the function given by
F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt \qquad \text{for } -\infty < x < \infty
is called the distribution function, or the cumulative distribution, of X.
Definition 1.1.8. [8] The support of a continuous random variable X consists of all
points x such that fX(x) > 0.
Definition 1.1.9. [8] (Independence). Let the random variables X1 and X2 have the
joint pdf f(x1, x2) and the marginal pdfs f1(x1) and f2(x2) respectively. The random
variables X_1 and X_2 are said to be independent if, and only if, f(x_1, x_2) \equiv f_1(x_1) f_2(x_2). Random variables that are not independent are said to be dependent.
Definition 1.1.10. [8] Let X be a random variable with pdf depending on a parameter \theta. Let X_1, \ldots, X_n be a random sample from the distribution of X and let T denote an estimator of \theta. We say T is an unbiased estimator of \theta if
E(T) = \theta.
If T is not unbiased, we say that T is a biased estimator of \theta.
Theorem 1.1.1. [6]
If \hat{\theta} is an unbiased estimator of \theta and
\mathrm{Var}(\hat{\theta}) = \frac{1}{n\, E\left[\left(\frac{\partial \ln f(X)}{\partial \theta}\right)^{2}\right]},
then \hat{\theta} is a minimum variance unbiased estimator of \theta.
Definition 1.1.11. [6] The statistic \hat{\theta} is a consistent estimator of the parameter \theta if and only if for each c > 0,
\lim_{n \to \infty} P(|\hat{\theta} - \theta| < c) = 1.
Theorem 1.1.2. [6]
If \hat{\theta} is an unbiased estimator of \theta and \mathrm{Var}(\hat{\theta}) \to 0 as n \to \infty, then \hat{\theta} is a consistent estimator of \theta.
Definition 1.1.12. [6] The statistic \hat{\theta} is a sufficient estimator of the parameter \theta if and only if, for each value of \theta, the conditional probability distribution or density of the random sample X_1, X_2, \ldots, X_n given \hat{\theta} is independent of \theta.
Definition 1.1.13. [8](Characteristic Function) The characteristic function of a ran-
dom variable X with distribution function F, denoted by k(u), is defined by
k(u) = \int_{-\infty}^{\infty} e^{-iuy} K(y)\,dy.
Theorem 1.1.3. [8]
The characteristic function of any random variable is a uniformly continuous function.
Theorem 1.1.4. [8](Minkowski’s Inequality)
Let X, Y be two random variables. Then it holds for 1 ≤ p < ∞ that
\left(E|X + Y|^{p}\right)^{1/p} \leq \left(E|X|^{p}\right)^{1/p} + \left(E|Y|^{p}\right)^{1/p}.
Definition 1.1.14. [12] Let r be a positive number such that
k_r = \lim_{u \to 0} \frac{1 - k(u)}{|u|^{r}}
is finite. If there exists a value of r such that kr is non-zero, it is called the characteristic
exponent of the transform k(u), and kr is called the characteristic coefficient.
Definition 1.1.15. [16] If A is any set, we define the Indicator function IA of the set
A to be the function given by
I_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}
Definition 1.1.16. [8](Converge in Probability). Let Xn be a sequence of random
variables and let X be a random variable defined on a sample space. We say Xn converges
in probability to X if for all ε > 0, we have
\lim_{n \to \infty} P[|X_n - X| \geq \varepsilon] = 0,
or equivalently,
\lim_{n \to \infty} P[|X_n - X| < \varepsilon] = 1.
If so, we write X_n \xrightarrow{p} X.
Definition 1.1.17. [8] (Converge in Distribution). Let Xn be a sequence of random
variables and let X be a random variable. Let FXn and FX be, respectively, the cdfs of
Xn and X. Let C(FX) denote the set of all points where FX is continuous. We say that
Xn converge in distribution to X if
\lim_{n \to \infty} F_{X_n}(x) = F_X(x), \quad \text{for all } x \in C(F_X).
We denote this convergence by X_n \xrightarrow{D} X.
Definition 1.1.18. [8] (Convergence with probability 1). Let \{X_n\}_{n=1}^{\infty} be a sequence of random variables on (\Omega, \mathcal{L}, P). We say that X_n converges almost surely to a random variable X (X_n \xrightarrow{a.s.} X), or converges with probability 1 to X, or converges strongly to X, if and only if
P(\{w : X_n(w) \to X(w) \text{ as } n \to \infty\}) = 1,
or equivalently, for all \varepsilon > 0,
\lim_{N \to \infty} P(|X_n - X| < \varepsilon \text{ for all } n \geq N) = 1.
Theorem 1.1.5. [8]
1. If Xn converge to X with probability 1, then Xn converge to X in probability.
2. If Xn converge to X in probability, then Xn converge to X in distribution.
3. Let Xn converge to X in probability and let g be a continuous function on R,
then g(Xn) converge to g(X) in probability.
Example 1.1.1. (Convergence in probability does not imply convergence with probability 1.)
Let \Omega = (0, 1] and let P be the uniform distribution on \Omega. Define the sets A_n by
A_1 = (0, 1/2], \quad A_2 = (1/2, 1],
A_3 = (0, 1/4], \quad A_4 = (1/4, 1/2], \quad A_5 = (1/2, 3/4], \quad A_6 = (3/4, 1],
A_7 = (0, 1/8], \quad A_8 = (1/8, 1/4], \ldots
Let X_n(w) = I_{A_n}(w).
Then P(|X_n - 0| \geq \varepsilon) \to 0 for all \varepsilon > 0, since X_n is 0 except on A_n and P(A_n) \downarrow 0. Thus X_n converges to 0 in probability.
But P(\{w : X_n(w) \to 0\}) = 0 (and not 1), because any w keeps falling in some A_n beyond any n_0; i.e., the sequence X_n(w) looks like 0 \ldots 010 \ldots 010 \ldots 010 \ldots, so X_n does not converge with probability 1 to 0.
Definition 1.1.19. [4] Let A \subseteq \mathbb{R}, let f : A \to \mathbb{R}, and let c \in A. We say that f is continuous at c if, given any neighborhood V_{\varepsilon}(f(c)) of f(c), there exists a neighborhood V_{\delta}(c) of c such that if x is any point of A \cap V_{\delta}(c), then f(x) belongs to V_{\varepsilon}(f(c)).
Definition 1.1.20. [4] A function f : A −→ R is said to be bounded on A if there
exists a constant M > 0 such that |f(x)| ≤ M for all x ∈ A.
Definition 1.1.21. [4] Let A ⊆ R, let f : A −→ R. We say that f is uniformly
continuous on A if for each ε > 0 there is a δ(ε) > 0 such that if x, u ∈ A are any
numbers satisfying |x− u| < δ(ε), then |f(x)− f(u)| < ε.
Definition 1.1.22. [4] Let A ⊆ R, let f : A −→ R. If there exists a constant K > 0
such that
|f(x)− f(u)| ≤ K|x− u|
for all x, u ∈ A, then f is said to be a Lipschitz function (or satisfy a Lipschitz
condition) on A.
Definition 1.1.23. [4] Let f : [a, b] \to \mathbb{R} and let a = x_0 < x_1 < \ldots < x_k = b be any subdivision of [a, b]. Define
p = \sum_{i=1}^{k} [f(x_i) - f(x_{i-1})]^{+}, \qquad n = \sum_{i=1}^{k} [f(x_i) - f(x_{i-1})]^{-},
and t = n + p. Define P = \sup p, N = \sup n, and T = \sup t, the suprema being taken over all subdivisions of [a, b].
If T < \infty, we say that f is of bounded variation over [a, b] and we write f \in BV.
Definition 1.1.24. [4] A function f : [a, b] \to \mathbb{R} is said to be absolutely continuous if, given \varepsilon > 0, there is \delta > 0 such that if \{(x_i, y_i)\}_{i=1}^{n} is a finite pairwise disjoint family of subintervals of [a, b] with \sum_{i=1}^{n} |x_i - y_i| < \delta, then \sum_{i=1}^{n} |f(x_i) - f(y_i)| < \varepsilon.
Theorem 1.1.6. [16]
Every absolutely continuous function is a uniformly continuous function.
Theorem 1.1.7. [16]
If f is an absolutely continuous function on [a, b], then f is of bounded variation.
Definition 1.1.25. [4] A set E is said to be measurable if for each set A we have
M^{*}(A) = M^{*}(A \cap E) + M^{*}(A \cap E^{c}),
where M^{*} is the outer measure, defined by
M^{*}(A) = \inf_{A \subset \bigcup_n I_n} \sum_n L(I_n),
the infimum being taken over all coverings of A by intervals I_n of length L(I_n).
Theorem 1.1.8. [16]
If f : A −→ R is a Lipschitz function, then f is uniformly continuous on A.
Theorem 1.1.9. [13] (Classical Central Limit Theorem):
Let \{X_k, k \geq 1\} be i.i.d. random variables with mean \mu and finite variance \sigma^{2}. Also let
Z_n = \frac{T_n - n\mu}{\sigma\sqrt{n}},
where T_n = \sum_{i=1}^{n} X_i. Then Z_n \xrightarrow{D} N(0, 1).
Theorem 1.1.10. [13] (Liapounov Theorem)
Let \{X_k, k \geq 1\} be independent random variables such that E X_k = \mu_k and \mathrm{Var}\, X_k = \sigma_k^{2}, and for some 0 < \delta \leq 1,
v_k^{2+\delta} = E|X_k - \mu_k|^{2+\delta} < \infty, \quad k \geq 1.
Also let T_n = \sum_{k=1}^{n} X_k, \ \xi_n = E T_n = \sum_{k=1}^{n} \mu_k, \ s_n^{2} = \mathrm{Var}\, T_n = \sum_{k=1}^{n} \sigma_k^{2}, \ Z_n = (T_n - \xi_n)/s_n,
and \rho_n = s_n^{-(2+\delta)} \sum_{k=1}^{n} v_k^{2+\delta}. Then, if \lim_{n \to \infty} \rho_n = 0, we have Z_n \xrightarrow{D} N(0, 1).
Theorem 1.1.11. [13]
Let \{X_k, k \geq 1\} be independent random variables such that P\{a \leq X_k \leq b\} = 1 for some finite scalars a < b. Also let E X_k = \mu_k, \mathrm{Var}\, X_k = \sigma_k^{2}, T_n = \sum_{k=1}^{n} X_k, \xi_n = \sum_{k=1}^{n} \mu_k, and s_n^{2} = \sum_{k=1}^{n} \sigma_k^{2}. Then
Z_n = (T_n - \xi_n)/s_n \xrightarrow{D} N(0, 1) \quad \text{if and only if} \quad s_n \to \infty \text{ as } n \to \infty.
Theorem 1.1.12. [13] (Borel-Cantelli Lemma)
Let \{A_n\} be a sequence of events and denote by P(A_n) the probability that A_n occurs, n \geq 1. Also, let A denote the event that the A_n occur infinitely often (i.o.). Then
\sum_{n \geq 1} P(A_n) < \infty \implies P(A) = 0,
no matter whether the A_n are independent or not. If the A_n are independent, then
\sum_{n \geq 1} P(A_n) = +\infty \implies P(A) = 1.
Lemma 1.1.13. [19]
There exists a universal constant C > 0 such that for each n > 0, \varepsilon_n > 0, and distribution function F,
P\left\{\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \varepsilon_n\right\} \leq C \exp(-2n\varepsilon_n^{2}).
Theorem 1.1.14. [13] ( Cramer-Wold )
Let X, X_1, X_2, \ldots be random vectors in \mathbb{R}^{p}; then X_n \xrightarrow{D} X if and only if, for every fixed \lambda \in \mathbb{R}^{p}, we have \lambda^{t} X_n \xrightarrow{D} \lambda^{t} X.
Theorem 1.1.15. ( Taylor’s Theorem )
Suppose that f is a real-valued function defined on \mathbb{R} and let x \in \mathbb{R}. Assume that f has p continuous derivatives in an interval (x - \delta, x + \delta) for some \delta > 0 and that the (p+1)th derivative of f exists. Then for any sequence (\alpha_n) converging to zero, we have
f(x + \alpha_n) = \sum_{j=0}^{p} \frac{\alpha_n^{j}}{j!} f^{(j)}(x) + o(\alpha_n^{p}).
1.2 Kernel Density Estimation
Suppose X1, X2, . . . , Xn is a sequence of independently and identically distributed (i.i.d.)
random variables with common probability density function f(x). The problem of esti-
mating the function f(x) is of interest for many reasons. For instance, it can be used to calculate probabilities. In addition, if we know f(x), we are able, through its graph, to determine its shape as well as other features of the distribution, such as whether it has one peak or more, whether it is smooth, symmetric, etc.
1.2.1 Kernel Estimator
Let X_1, X_2, \ldots, X_n be i.i.d. random variables with distribution function F(x) = P(X \leq x), which is absolutely continuous,
F(x) = \int_{-\infty}^{x} f(y)\,dy,
with probability density function f(x).
The sample distribution function F_n(x) at a point x is defined as
F_n(x) = \frac{1}{n} \{\text{number of observations } x_1, x_2, \ldots, x_n \text{ falling in } (-\infty, x]\}.
It is natural to take F_n(x) as an estimate of F(x) at a given point x. An estimate of f(x) may be
f_n(x) = \frac{1}{2h_n} \{F_n(x + h_n) - F_n(x - h_n)\}, \qquad (1.2.1)
where h_n is chosen as a positive number.
Equation (1.2.1) can be written as
f_n(x) = \frac{1}{2nh_n} \{\text{number of observations falling in the interval } [x - h_n, x + h_n]\}
= \frac{1}{2nh_n} \sum_{i=1}^{n} I(|X_i - x| \leq h_n)
= \frac{1}{nh_n} \sum_{i=1}^{n} \frac{1}{2} I\left(\left|\frac{X_i - x}{h_n}\right| \leq 1\right)
= \frac{1}{nh_n} \sum_{i=1}^{n} w\left(\frac{X_i - x}{h_n}\right), \qquad (1.2.2)
where
I = \begin{cases} 1 & x - h_n \leq X_i \leq x + h_n, \\ 0 & \text{otherwise}, \end{cases}
and
w\left(\frac{X_i - x}{h_n}\right) = \frac{1}{2} I\left(\left|\frac{X_i - x}{h_n}\right| \leq 1\right) = \begin{cases} \frac{1}{2} & -1 \leq \frac{X_i - x}{h_n} \leq 1, \\ 0 & \text{otherwise}. \end{cases}
Definition 1.2.1. We regard the function that is centered at the estimation point and used to weight nearby data points as a weight function; we call it the kernel function, denote it by K(\cdot), and define the kernel density estimator by
f_n(x) = \frac{1}{nh_n} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h_n}\right). \qquad (1.2.3)
Note that Equation (1.2.3) can be written as
f_n(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i),
where K_h(x) = K(x/h)/h.
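To make the estimator concrete, the following is a minimal Python sketch of (1.2.3), assuming a Gaussian kernel; the names kde and gaussian_kernel and the small example data are ours and not taken from the references.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate f_n(x) = (1/(n*h)) * sum_i K((x - X_i)/h).

    x may be a scalar or an array of evaluation points."""
    x = np.atleast_1d(x)
    n = len(data)
    u = (x[:, None] - np.asarray(data)[None, :]) / h   # scaled differences (x - X_i)/h
    return kernel(u).sum(axis=1) / (n * h)

# Example: estimate the density of a small sample (7 points, as in Figure 1.1)
sample = np.array([-1.3, -0.4, 0.1, 0.2, 0.9, 1.5, 2.1])
print(kde(np.array([0.0, 1.0]), sample, h=0.5))
```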
The kernel estimator can be viewed as a sum of bumps placed at the observations. The kernel function K determines the shape of the bumps, while the bandwidth h_n determines their width; see the illustration in Figure 1.1.
Figure 1.1: Kernel density estimation based on 7 points (see[20])
From Figure 1.1, we have:
(1) The shape of the bumps is defined by the kernel function.
(2) The spread of the bumps is determined by the bandwidth h_n, which is analogous to the bin width of a histogram.
That is, the value of the kernel estimate at the point x is the average of the n kernel ordinates at this point.
1.3 Properties and Examples of the Kernels
In this section, we will consider some properties of the kernels. A kernel is a piecewise continuous, even function, symmetric around zero and integrating to one, i.e.
K(x) = K(-x), \qquad \int_{-\infty}^{\infty} K(x)\,dx = 1. \qquad (1.3.1)
The kernel function need not have bounded support, and in most applications K is a
positive probability density function.
A kernel function K is said to be of order p if its first nonzero moment is the pth moment \mu_p, i.e. if
\mu_j(K) = 0, \ j = 1, 2, \ldots, p - 1; \qquad \mu_p(K) \neq 0,
where
\mu_j(K) = \int_{-\infty}^{\infty} y^{j} K(y)\,dy. \qquad (1.3.2)
Some examples of kernel functions are given in Table 1.1, where I is the indicator function.
Table 1.1: Common kernel functions

Kernel         K(x)
Epanechnikov   (3/4)(1 - x^2) I(|x| <= 1)
Biweight       (15/16)(1 - x^2)^2 I(|x| <= 1)
Triweight      (35/32)(1 - x^2)^3 I(|x| <= 1)
Triangular     (1 - |x|) I(|x| <= 1)
Gaussian       (2\pi)^{-1/2} \exp(-x^2/2)
Uniform        (1/2) I(|x| <= 1)
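For illustration, the kernels of Table 1.1 can be written directly as functions. The sketch below (with function names of our own choosing) also checks numerically that each integrates to one.

```python
import numpy as np

# The kernels of Table 1.1; each is symmetric about zero and integrates to one.
def epanechnikov(x): return np.where(np.abs(x) <= 1, 0.75 * (1 - x**2), 0.0)
def biweight(x):     return np.where(np.abs(x) <= 1, (15/16) * (1 - x**2)**2, 0.0)
def triweight(x):    return np.where(np.abs(x) <= 1, (35/32) * (1 - x**2)**3, 0.0)
def triangular(x):   return np.where(np.abs(x) <= 1, 1 - np.abs(x), 0.0)
def gaussian(x):     return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
def uniform(x):      return np.where(np.abs(x) <= 1, 0.5, 0.0)

# Numerical check that each kernel integrates (approximately) to one
grid = np.linspace(-5, 5, 100001)
for K in (epanechnikov, biweight, triweight, triangular, gaussian, uniform):
    print(K.__name__, np.trapz(K(grid), grid))
```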
Now, we shall introduce some important properties of the kernel estimator. We consider the following conditions, which we will use in proving facts, lemmas and theorems in the remainder of this chapter.
i) The unknown density function f(x) has a continuous second derivative f^{(2)}(x).
ii) The bandwidth h = h_n = h(n) is a sequence of positive numbers and satisfies
\lim_{n \to \infty} h_n = 0 \quad \text{and} \quad \lim_{n \to \infty} n h_n = \infty.
iii) The kernel K is a bounded probability density function of order 2, symmetric about zero.
Definition 1.3.1. The Bias of an estimator fn(x) of a density f(x) is the difference
between the expected value of fn(x) and f(x). That is
Bias(fn(x)) = E(fn(x))− f(x)
In [12], the statistical properties of the kernel estimator were studied. In addition to the above, several other properties were proved there: it was shown that f_n(x) is a consistent estimator of f(x) and that the sequence of estimates f_n(x) is asymptotically normally distributed. It was also proved that if the probability density function f(x) is uniformly continuous and if \lim_{n \to \infty} n h_n^{2} = \infty, then f_n(x) tends uniformly (in probability) to f(x), in the sense that (1.3.3) holds:
\lim_{n \to \infty} P\left(\sup_{-\infty < x < \infty} |f_n(x) - f(x)| < \varepsilon\right) = 1, \quad \forall \varepsilon > 0. \qquad (1.3.3)
1.4 The MSE and MISE Criteria
The important role played by the kernel density estimator makes us concerned with its performance, i.e. its efficiency and accuracy in estimating the true density. We will study two types of error criteria, the mean squared error (MSE) and the mean integrated squared error (MISE).
Definition 1.4.1. The mean squared error (MSE) is used to measure the error when estimating the density function at a single point. It is defined by
MSE\{f_n(x)\} = E\{f_n(x) - f(x)\}^{2}. \qquad (1.4.1)
From its definition, the MSE measures the average squared difference between the density estimator and the true density. In general, any function of the absolute distance |f_n(x) - f(x)| (often called a metric) would serve as a measurement of the goodness of an estimator. But the MSE metric has at least two advantages over other metrics. First, it is tractable analytically. Second, it has an interesting decomposition into variance and
squared bias, provided f(x) is not random, as follows:
MSE(f_n(x)) = E(f_n(x) - f(x))^{2}
= E\{f_n(x)^{2} - 2f(x)f_n(x) + f^{2}(x)\}
= E\{f_n(x)^{2}\} - 2f(x)E\{f_n(x)\} + f^{2}(x)
= \mathrm{Var}\{f_n(x)\} + (E f_n(x))^{2} - 2f(x)E\{f_n(x)\} + f^{2}(x)
= \mathrm{Var}\{f_n(x)\} + (E f_n(x) - f(x))^{2}. \qquad (1.4.2)
Theorem 1.4.1.
Let X be a random variable having a density f, then
MSE(f_n(x)) = n^{-1}\left\{\int_{-\infty}^{\infty} K_{h_n}^{2}(x - y) f(y)\,dy - \left(\int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy\right)^{2}\right\}
\qquad + \left(\int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy - f(x)\right)^{2}. \qquad (1.4.3)
Proof: See [11].
Now, we are interested in considering an error criterion that globally measures the distance
between the estimation of f over the entire real line and f itself.
Definition 1.4.2. An error criterion that measures the distance between f_n(x) and f(x) is the integrated squared error (ISE), given by
ISE\{f_n(x)\} = \int_{-\infty}^{\infty} (f_n(x) - f(x))^{2}\,dx.
Note that the ISE is not an appropriate criterion when we deal with all possible data sets, so we prefer to analyze the expected value of this random quantity, the mean integrated squared error.
Definition 1.4.3. The expected value of the ISE is called the mean integrated squared error (MISE) and is given by
MISE\{f_n(x)\} = E(ISE\{f_n(x)\}) = E \int_{-\infty}^{\infty} (f_n(x) - f(x))^{2}\,dx.
By changing the order of integration we have
MISE(f_n(x)) = \int_{-\infty}^{\infty} MSE\{f_n(x)\}\,dx
= \int_{-\infty}^{\infty} \{E f_n(x) - f(x)\}^{2}\,dx + \int_{-\infty}^{\infty} \mathrm{Var}(f_n(x))\,dx. \qquad (1.4.4)
Theorem 1.4.2.
The MISE of an estimator f_n(x) of a density f(x) is given by
MISE(f_n(x)) = n^{-1} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} K_{h_n}^{2}(x - y) f(y)\,dy\,dx
\quad + (1 - n^{-1}) \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy\right)^{2} dx
\quad - 2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy\, f(x)\,dx + \int_{-\infty}^{\infty} f^{2}(x)\,dx. \qquad (1.4.5)
Proof: See [11].
1.5 Asymptotic MSE and MISE Approximations
Here, we will derive an asymptotic approximation for the MISE which depends on h_n in a simple way. The simple expression of this approximation will exhibit the influence of the bandwidth h_n as a smoothing parameter.
The rate of convergence of the kernel density estimator and the optimal bandwidth can also be obtained from the asymptotic approximation of the MISE.
Before we start our investigation we have to introduce some definitions, theorems and assumptions that are needed throughout our work.
Definition 1.5.1.
i) A function f is of order less than g as x \to \infty if
\lim_{x \to \infty} \frac{f(x)}{g(x)} = 0.
We indicate this by writing f = o(g) ("f is little oh of g").
ii) Let f(x) and g(x) be positive for x sufficiently large. Then f is of at most the order of g as x \to \infty if there is a positive constant M for which
\frac{f(x)}{g(x)} \leq M
for x sufficiently large. We indicate this by writing f = O(g) ("f is big oh of g").
Definition 1.5.2. Given two sequences a_n and b_n such that b_n \geq 0 for all n, we write
a_n = O(b_n) \quad (\text{read: } a_n \text{ is big oh of } b_n)
if there exists a constant M > 0 such that |a_n| \leq M b_n for all n.
We write a_n = o(b_n) as n \to \infty (read: a_n is little oh of b_n) if
\lim_{n \to \infty} \frac{a_n}{b_n} = 0.
Definition 1.5.3. We say that a_n is asymptotically equivalent to b_n, or simply a_n is asymptotic to b_n, and we write a_n \sim b_n, if and only if
\lim_{n \to \infty} \frac{a_n}{b_n} = 1.
Lemma 1.5.1.
Let X be a random variable having a density f. Then the bias of f_n(x) can be expressed as
E(f_n(x)) - f(x) = \frac{1}{2} h_n^{2}\, \mu_2(K)\, f''(x) + o(h_n^{2}). \qquad (1.5.1)
Proof :
Firstly, we assume that
\int_{-\infty}^{\infty} K(z)\,dz = 1, \quad \int_{-\infty}^{\infty} z K(z)\,dz = 0, \quad \int_{-\infty}^{\infty} z^{2} K(z)\,dz < \infty, \quad \text{and} \quad \mu_2(K) = \int_{-\infty}^{\infty} z^{2} K(z)\,dz.
Note that
E(f_n(x)) = \int_{-\infty}^{\infty} \frac{1}{h_n} K\left(\frac{x - y}{h_n}\right) f(y)\,dy.
Let z = \frac{x - y}{h_n} to get
E(f_n(x)) = \int_{-\infty}^{\infty} K(z)\, f(x - z h_n)\,dz.
Since f has continuous derivatives of order 2, we can expand f(x - z h_n) in a Taylor series as follows:
f(x - z h_n) = \sum_{j=0}^{2} \frac{(-z h_n)^{j}}{j!} f^{(j)}(x) + o((z h_n)^{2})
= f(x) - z h_n f'(x) + \frac{z^{2} h_n^{2}}{2} f''(x) + o(h_n^{2}).
Therefore,
E[f_n(x)] = \int_{-\infty}^{\infty} K(z)\left\{f(x) - z h_n f'(x) + \frac{1}{2} z^{2} h_n^{2} f''(x) + o(h_n^{2})\right\} dz
= f(x)\int_{-\infty}^{\infty} K(z)\,dz - h_n f'(x)\int_{-\infty}^{\infty} z K(z)\,dz + \frac{1}{2} h_n^{2} f''(x)\int_{-\infty}^{\infty} z^{2} K(z)\,dz + o(h_n^{2})
= f(x) + \frac{1}{2} h_n^{2} f''(x)\int_{-\infty}^{\infty} z^{2} K(z)\,dz + o(h_n^{2}).
By the assumption, we have the result.
Lemma 1.5.2.
Let X be a random variable having a density f. Then
\mathrm{Var}\{f_n(x)\} = (n h_n)^{-1} R(K) f(x) + o((n h_n)^{-1}), \qquad (1.5.2)
where R(K) = \int_{-\infty}^{\infty} K^{2}(x)\,dx.
Proof :
First, note that
\mathrm{Var}\{f_n(x)\} = \frac{1}{n}\left\{\int_{-\infty}^{\infty} K_{h_n}^{2}(x - y) f(y)\,dy - \left[\int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy\right]^{2}\right\}.
Using the Taylor series expansion of f(x - z h_n) about x, we get
\mathrm{Var}\{f_n(x)\} = \frac{1}{n h_n} \int_{-\infty}^{\infty} K^{2}(z) f(x - z h_n)\,dz - n^{-1}\{E f_n(x)\}^{2}
= \frac{1}{n h_n} \int_{-\infty}^{\infty} K^{2}(z)\{f(x) + o(1)\}\,dz - n^{-1}\{f(x) + o(1)\}^{2}
= \frac{1}{n h_n} R(K) f(x) + o((n h_n)^{-1}).
From the assumption, the result holds.
Now, from the above we have some properties of the bias and the variance:
1) The bias is of order O(h_n^{2}), which together with condition (ii) of Section 1.3 implies that f_n(x) is an asymptotically unbiased estimator.
2) The bias is large whenever the absolute value of the second derivative |f^{(2)}(x)| is large. This occurs for several densities at peaks, where the bias is negative, and at valleys, where the bias is positive.
3) The variance is of order (n h_n)^{-1}, which means that the variance converges to zero by condition (ii) of Section 1.3.
Theorem 1.5.3.
The MISE of an estimator f_n of the unknown density f is given by
MISE\{f_n(x)\} = AMISE\{f_n(x)\} + o\{(n h_n)^{-1} + h_n^{4}\},
where
AMISE\{f_n(x)\} = (n h_n)^{-1} R(K) + \frac{1}{4} h_n^{4}\, \mu_2^{2}(K)\, R(f'') \qquad (1.5.3)
is called the asymptotic MISE of f_n(x), and R(K) = \int_{-\infty}^{\infty} K^{2}(x)\,dx.
Proof :
From Equations (1.5.1) and (1.5.2), applying Equation (1.4.2), we get
MSE\{f_n(x)\} = (n h_n)^{-1} R(K) f(x) + o((n h_n)^{-1}) + \frac{1}{4} h_n^{4} \mu_2^{2}(K)\{f''(x)\}^{2} + o(h_n^{4}) + h_n^{2} \mu_2(K) f''(x)\, o(h_n^{2})
= (n h_n)^{-1} R(K) f(x) + \frac{1}{4} h_n^{4} \mu_2^{2}(K)\{f''(x)\}^{2} + o\{(n h_n)^{-1} + h_n^{4}\}.
From this, the asymptotic MSE is
AMSE\{f_n(x)\} = (n h_n)^{-1} R(K) f(x) + \frac{1}{4} h_n^{4} \mu_2^{2}(K)\{f''(x)\}^{2},
and integrating over x (using \int f(x)\,dx = 1) gives (1.5.3).
1.6 Optimal Bandwidth
The problem of bandwidth selection is very important in density estimation. The next
figure (1.2) shows how the density estimates change with the bandwidth size. Choice of
the appropriate bandwidth is critical to the performance of most nonparametric density
estimators. When the bandwidth is very small, the estimate will be very close to the
original data. Thus it will be very wiggly due to the over fitting. The estimate will be
almost unbiased, but it will have large variation under repeated sampling. If the band-
width is very large, the estimate will be very smooth, lying close to the mean of all the
data. Such an estimate will have small variance, but it will be highly biased. A brief survey of bandwidth selection for kernel density estimation is given in [22] and [23].
One way to select the smoothing parameter is simply to look at plots of the smoothed data for several bandwidths. If the overall trend is the feature of most interest to the investigator, a very smooth estimate may be desirable. If the investigator is interested in local extremes, a less smooth estimate may be preferred.
Subjective choice of the smoothing parameter offers a great deal of flexibility, as well as
a comprehensive look at the data, see [22].
The AMISE (asymptotic MISE) has some useful advantages. Its simplicity as a mathematical expression makes it useful for large sample approximations.
We can also see an important relationship between bias and variance, known as the variance-bias trade-off. It gives us an understanding of the role of the bandwidth h_n.
Figure 1.2: Kernel density estimates based on different bandwidths (see [20])
Figure 1.2 depends on three bandwidths: if we choose h_n = 0.25, we get the solid curve; if we choose h_n = 0.5, the dashed curve; and if we choose h_n = 0.75, the dotted curve.
There are many rules for bandwidth selection, for example Normal Scale Rules, Over-
smoothed bandwidth selection rules, Least Squares Cross-Validation, Biased Cross-Validation,
Estimation of density functionals and Plug-In Bandwidth Selection. For more details see
[11], [22] and [23].
Corollary 1.6.1.
The AMISE-optimal bandwidth, h_{AMISE}, has the closed form
h_{opt} = \left[\frac{R(K)}{\mu_2(K)^{2}\, R(f^{(2)})\, n}\right]^{1/5}. \qquad (1.6.1)
Proof :
By differentiating (1.5.3) with respect to h_n and setting the derivative equal to zero, we find the optimal bandwidth:
\frac{d}{dh_n} AMISE\{f_n(x)\} = -(n h_n^{2})^{-1} R(K) + h_n^{3}\, \mu_2^{2}(K)\, R(f'') = 0
\implies h_n^{5}\, \mu_2^{2}(K)\, R(f'') = n^{-1} R(K)
\implies h_{opt} = \left\{\frac{R(K)}{n\, \mu_2^{2}(K)\, R(f'')}\right\}^{1/5}.
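As a numerical illustration of (1.6.1), the sketch below evaluates h_{opt} for the Gaussian kernel under the assumed reference case that f is a normal density with standard deviation σ, for which R(K) = 1/(2√π), μ₂(K) = 1 and R(f″) = 3/(8√π σ⁵); this recovers the well-known normal scale rule h ≈ 1.06 σ n^{-1/5} mentioned above. The function name h_amise is our own.

```python
import numpy as np

def h_amise(R_K, mu2_K, R_f2, n):
    """AMISE-optimal bandwidth h_opt = [R(K) / (mu2(K)^2 * R(f'') * n)]^(1/5), Eq. (1.6.1)."""
    return (R_K / (mu2_K**2 * R_f2 * n)) ** 0.2

# Assumed reference case: Gaussian kernel, f = N(0, sigma^2)
sigma, n = 1.0, 400
R_K   = 1.0 / (2.0 * np.sqrt(np.pi))
mu2_K = 1.0
R_f2  = 3.0 / (8.0 * np.sqrt(np.pi) * sigma**5)
print(h_amise(R_K, mu2_K, R_f2, n))     # approx (4/3)^(1/5) * sigma * n**(-1/5)
print(1.06 * sigma * n ** (-0.2))       # the normal scale rule, for comparison
```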
When trying to understand what this h_n leads to, we find that it depends on the known kernel function K and on n, and that it is inversely proportional to R(f'')^{1/5}. The quantity R(f'') measures the total curvature of f. So if R(f'') is small, f has little curvature and the bandwidth h_n will be large; on the other hand, h_n will be small if R(f'') is large.
The previous corollary can be used to choose a good bandwidth if R(f'') is known. But f is unknown.
Therefore, if we substitute (1.6.1) into (1.5.3), we obtain the smallest value of the AMISE (since the second derivative is greater than zero) for estimating f using the kernel K:
AMISE\{f_n(x)\} = (n h_{opt})^{-1} R(K) + \frac{1}{4} h_{opt}^{4}\, \mu_2^{2}(K)\, R(f'')
= \frac{5}{4}\, n^{-4/5}\, (R(K))^{4/5}\, (\mu_2(K)^{2} R(f''))^{1/5};
that is, taking the infimum over h_n > 0, we get
\inf_{h_n > 0} AMISE\{f_n\} = \frac{5}{4}\left\{\mu_2(K)^{2}\, R(K)^{4}\, R(f^{(2)})\right\}^{1/5} n^{-4/5}.
Notice that in (1.6.1) the optimal bandwidth depends on the unknown density being estimated, so we cannot use (1.6.1) directly to find the optimal bandwidth h_{opt}. Also from (1.6.1) we can draw the following useful conclusions:
1. The optimal bandwidth converges to zero as the sample size increases, but at a very slow rate.
2. The optimal bandwidth is inversely proportional to R(f'')^{1/5}. Since R(f'') measures the curvature of f, this means that for a density function with little curvature the optimal bandwidth will be large. Conversely, if the density function has a large curvature, the optimal bandwidth will be small.
1.7 Optimal Kernel
In this section, we investigate what effect the shape of kernel function K has on density
estimation. Usually K is taken to be a symmetric, unimodal density function, but there are many kernel functions that satisfy these requirements, and their performance still varies. The best such kernel will be known as the optimal kernel.
Epanechnikov (1969) was the first to consider this problem in the density estimation
context and to give a comparison of common kernels in asymptotic performance terms.
Consider the formula for AMISE\{f_n(x)\} in (1.5.3). In that formula the scaling of K is entangled with the bandwidth h_n, which causes difficulty in optimizing with respect to K. If we choose a re-scaling of K of the form
K_{\delta}(x) = \frac{1}{\delta} K\left(\frac{x}{\delta}\right),
the dependence on K and on h_n can be separated. To see how this can be done, we give the following lemma.
Lemma 1.7.1.
R(K_{\delta}) = \mu_2^{2}(K_{\delta}) is satisfied if and only if \delta = \delta_0 = \{R(K)/\mu_2^{2}(K)\}^{1/5}.
Proof : See [11].
Theorem 1.7.2.
Let R(K_{\delta}) = \mu_2^{2}(K_{\delta}), where \delta = \delta_0 = \{R(K)/\mu_2(K)^{2}\}^{1/5}. Then
AMISE(f_n(x)) = C(K_{\delta_0})\left\{(n h_n)^{-1} + \frac{1}{4} h_n^{4} R(f'')\right\}. \qquad (1.7.1)
Proof :
First, note that R(K) = \delta_0 R(K_{\delta_0}) and, since R(K_{\delta_0}) = \mu_2^{2}(K_{\delta_0}),
\mu_2^{2}(K) = \delta_0^{-5} R(K) = \delta_0 \cdot \delta_0^{-5} R(K_{\delta_0}) = \delta_0^{-4} R(K_{\delta_0}).
Note that
AMISE(f_n(x)) = \frac{1}{n h_n} R(K) + \frac{1}{4} h_n^{4}\, \mu_2^{2}(K)\, R(f'')
= \frac{\delta_0}{n h_n} R(K_{\delta_0}) + \frac{1}{4} h_n^{4}\, \delta_0^{-4}\, R(K_{\delta_0})\, R(f'')
= R(K_{\delta_0})\left\{\frac{1}{n (h_n/\delta_0)} + \frac{1}{4}\left(\frac{h_n}{\delta_0}\right)^{4} R(f'')\right\}.
Writing h_n again for the rescaled bandwidth h_n/\delta_0 and noting that
R(K_{\delta_0}) = \delta_0^{-1} R(K) = \{\delta_0^{-4} R^{4}(K)\cdot \delta_0^{4}\mu_2^{2}(K)\}^{1/5} = \{R^{4}(K)\, \mu_2^{2}(K)\}^{1/5} = C(K_{\delta_0}),
we obtain
AMISE(f_n(x)) = C(K_{\delta_0})\left\{(n h_n)^{-1} + \frac{1}{4} h_n^{4} R(f'')\right\}.
Thus the result holds.
Definition 1.7.1. We say that C(K) is invariant to re-scalings of K if C(K_{\delta_1}) = C(K_{\delta_2}) for any \delta_1, \delta_2 > 0. We call K^{c} = K_{\delta_0} the canonical kernel for the class \{K_{\delta} : \delta > 0\} of rescaled kernels K.
Corollary 1.7.3.
C(K) is invariant to re-scaling of K.
Proof : See [11].
Canonical kernels can also simplify the optimization of the kernel shape. That is, from Equation (1.7.1) it is enough to choose K to minimize C(K_{\delta_0}), subject to
\int_{-\infty}^{\infty} K(x)\,dx = 1, \quad \int_{-\infty}^{\infty} x K(x)\,dx = 0, \quad \int_{-\infty}^{\infty} x^{2} K(x)\,dx = a^{2} < \infty, \quad \text{and } K(x) \geq 0 \text{ for all } x.
The solution to this problem is given by
K_a(x) = \frac{3}{4}\left(1 - \frac{x^{2}}{5a^{2}}\right)\frac{1}{\sqrt{5}\, a}\, I(|x| < \sqrt{5}\, a), \qquad (1.7.2)
where a is an arbitrary scale parameter.
Now, if we choose a^{2} = \frac{1}{5}, we get the simplest form of K_a(x),
K^{*}(x) = \frac{3}{4}(1 - x^{2})\, I(|x| \leq 1). \qquad (1.7.3)
The kernel in (1.7.3) is known as the Epanechnikov kernel, since its optimality properties in density estimation were first described by Epanechnikov (1969).
Now, we introduce the useful ratio \{C(K^{*})/C(K)\}^{5/4}.
Definition 1.7.2. The ratio \{C(K^{*})/C(K)\}^{5/4} represents the ratio of sample sizes necessary to obtain the same minimum AMISE (for a given f) when using K^{*}(x) as when using K, and is called the efficiency of K relative to K^{*}.
Figure 1.3: The Epanechnikov kernel K∗ (see [20])
Kernel         \{C(K^{*})/C(K)\}^{5/4}
Epanechnikov   1.000
Biweight       0.994
Triweight      0.987
Triangular     0.986
Gaussian       0.951
Uniform        0.930

Table 1.2: Efficiency of several kernels compared to the optimal kernel K^{*}.
From Table 1.2, if the efficiency of a kernel K is 0.98, this means that only 98% of the sample size used with K is needed when using the optimal kernel K^{*} to reach the same minimum AMISE.
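The efficiencies in Table 1.2 can be reproduced numerically, assuming C(K) = {R(K)^4 μ₂²(K)}^{1/5} as in Section 1.7; the following sketch (our own code, not from the references) computes {C(K*)/C(K)}^{5/4} for a few of the kernels.

```python
import numpy as np

grid = np.linspace(-10, 10, 200001)   # evaluation grid for numerical integration

def C(Kvals):
    """C(K) = {R(K)^4 * mu2(K)^2}^(1/5), with R(K) = int K^2 and mu2(K) = int x^2 K."""
    R   = np.trapz(Kvals**2, grid)
    mu2 = np.trapz(grid**2 * Kvals, grid)
    return (R**4 * mu2**2) ** 0.2

epan     = np.where(np.abs(grid) <= 1, 0.75 * (1 - grid**2), 0.0)
gauss    = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
uniformK = np.where(np.abs(grid) <= 1, 0.5, 0.0)

C_star = C(epan)
for name, K in [("Epanechnikov", epan), ("Gaussian", gauss), ("Uniform", uniformK)]:
    print(name, round((C_star / C(K)) ** 1.25, 3))   # efficiency {C(K*)/C(K)}^(5/4)
```

Running this gives approximately 1.000, 0.951 and 0.930, matching the corresponding entries of Table 1.2.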
Figure 1.4 shows the kernel density estimates of the Ethanol data based on the same
bandwidth hn = 0.2, but using different kernels. The solid curve stands for the triangular
kernel, the dashed curve for the uniform kernel, and the dotted curve for the normal
kernel.
Figure 1.4: Kernel density estimates of the Ethanol data (see [20])
Chapter 2
On the Estimation of the Mode
The mode is considered one of the measures of central tendency, i.e. of the tendency of the data to concentrate around a particular value. The mode is defined as the most common, or most frequently repeated, value among the observations; the data may have more than one mode, or no mode at all, in which case it cannot be calculated.
The mode can be found by calculation or graphically, and it is not affected by extreme values. It is important to mention the relation between the mean, the median and the mode. These are the most used measures of location, because they are easy to understand. They are equal when the curve is symmetric; when the curve is positively skewed, the mean is larger than the median and both are larger than the mode, while when the curve is negatively skewed, the mode is larger than the median and both are larger than the mean.
In this chapter, we first present the kernel estimation of the mode and of the conditional mode function, and we study them in the case of i.i.d. random variables. We also study the asymptotic behavior of the estimators of the mode and of the conditional mode.
This chapter consists of four sections. In Section 2.1, we introduce the problem of estimating the mode of a probability density function and give some historical notes; we study the conditions under which the estimator of the (unconditional) mode is asymptotically normal. In the next section, we present the simple mode estimator of [3]. Section 2.3 comprises an introduction to the study of the relationship between two variables X and Y, the first of which is called the predictor variable and the second the response variable, and we present the Nadaraya-Watson estimator as one approach to kernel regression estimation. Finally, we study the joint estimation of the conditional mode function taken at k finite distinct points.
2.1 Mode Estimation
The problem of estimating the mode of a probability density function has received con-
siderable attention in the literature. The study of nonparametric mode estimation is now
four decades old, having roots in many papers. In the last few years, an increasing interest
in this topic can be observed. Among the most recent evidence of this growing interest
are the papers by [3].
There are many fields where the knowledge of the mode is of great interest. For example,
the estimation of contours is a natural extension of the estimation of mode points. For
more details see [18] and [20].
Let X_1, X_2, \ldots, X_n be a sequence of i.i.d. random variables with pdf f. Assume that the probability density function f(x) is uniformly continuous in x. It follows that f(x) possesses a mode \theta, defined by
f(\theta) = \max_{x} f(x).
Assume that \theta is unique.
The classical procedure to estimate the mode is as follows: if f(x) is the unknown density and \theta is the mode of f, then \theta is estimated by the location \theta_n that maximizes an estimate f_n of f.
Suppose that f_n(x) is a continuous function and tends to 0 as |x| tends to \infty. Then there is a random variable \theta_n such that
\theta_n = \arg\max_{x} f_n(x). \qquad (2.1.1)
We call \theta_n the sample mode.
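In practice the sample mode (2.1.1) is usually computed by maximizing f_n over a finite grid of candidate points. The following is a minimal sketch, assuming a Gaussian kernel; the function name sample_mode and the simulated data are our own.

```python
import numpy as np

def sample_mode(data, h, grid):
    """Sample mode (2.1.1): the grid point at which the kernel density
    estimate f_n (Gaussian kernel, bandwidth h) is largest."""
    u = (grid[:, None] - data[None, :]) / h
    f_n = np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))
    return grid[np.argmax(f_n)]

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)       # true mode at 2.0
grid = np.linspace(data.min(), data.max(), 2001)
print(sample_mode(data, h=0.3, grid=grid))            # close to 2.0 for this sample
```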
Lemma 2.1.1. (Bochner Lemma)
Suppose K(y) is a measurable function satisfying the following:
1. \sup_{-\infty < y < \infty} |K(y)| < \infty,
2. \int_{-\infty}^{\infty} |K(y)|\,dy < \infty,
3. \lim_{|y| \to \infty} |y K(y)| = 0.
Let g(y) satisfy \int_{-\infty}^{\infty} |g(y)|\,dy < \infty. Let h_n be a sequence of positive constants satisfying the condition
4. \lim_{n \to \infty} h_n = 0.
Define
g_n(x) = \frac{1}{h_n} \int_{-\infty}^{\infty} K\left(\frac{y}{h_n}\right) g(x - y)\,dy.
Then at every point x of continuity of g(\cdot),
\lim_{n \to \infty} g_n(x) = g(x) \int_{-\infty}^{\infty} K(y)\,dy. \qquad (2.1.2)
Proof :
Note first that
g_n(x) - g(x)\int_{-\infty}^{\infty} K(y)\,dy = \frac{1}{h_n}\int_{-\infty}^{\infty} K\left(\frac{y}{h_n}\right) g(x - y)\,dy - g(x)\int_{-\infty}^{\infty} K(y)\,dy
= \int_{-\infty}^{\infty} \{g(x - y) - g(x)\}\, \frac{1}{h_n} K\left(\frac{y}{h_n}\right) dy.
Let \delta > 0, and split the region of integration into the two regions |y| \leq \delta and |y| > \delta. Now let z = \frac{y}{h_n}; then y = z h_n, dy = h_n\,dz, and dz = \frac{1}{h_n}\,dy. Now
\left|g_n(x) - g(x)\int_{-\infty}^{\infty} K(y)\,dy\right| \leq \int_{|y| \leq \delta} |g(x - y) - g(x)|\, \frac{1}{h_n}\left|K\left(\frac{y}{h_n}\right)\right| dy + \int_{|y| > \delta} |g(x - y) - g(x)|\, \frac{1}{h_n}\left|K\left(\frac{y}{h_n}\right)\right| dy
\leq \max_{|y| \leq \delta} |g(x - y) - g(x)| \int_{|z| \leq \delta/h_n} |K(z)|\,dz + \int_{|y| \geq \delta} \frac{|g(x - y)|}{|y|}\left|\frac{y}{h_n} K\left(\frac{y}{h_n}\right)\right| dy + |g(x)|\int_{|y| \geq \delta} \frac{1}{h_n}\left|K\left(\frac{y}{h_n}\right)\right| dy
\leq \max_{|y| \leq \delta} |g(x - y) - g(x)| \int_{-\infty}^{\infty} |K(z)|\,dz + \frac{1}{\delta}\sup_{|z| \geq \delta/h_n} |z K(z)| \int_{-\infty}^{\infty} |g(y)|\,dy + |g(x)|\int_{|z| \geq \delta/h_n} |K(z)|\,dz \qquad \left(\text{since } \frac{1}{|y|} < \frac{1}{\delta}\right),
which tends to 0 as n tends to \infty and then \delta tends to 0.
Lemma 2.1.2.
Consider the formula for f_n(x) as in Equation (1.2.3). Then f_n(x) can be written as
f_n(x) = (2\pi)^{-1} \int_{-\infty}^{\infty} e^{-iux}\, k(u h_n)\, \varphi_n(u)\,du, \qquad (2.1.3)
where k is the transform of the kernel K and
\varphi_n(u) = \int_{-\infty}^{\infty} e^{iux}\,dF_n(x) = n^{-1} \sum_{k=1}^{n} e^{iuX_k}.
Proof : See [12]
Theorem 2.1.3.
Under the conditions of Lemma 2.1.1, if h_n is a function of n satisfying \lim_{n \to \infty} n h_n^{2} = \infty and \lim_{n \to \infty} E[f_n(x)] = f(x), and if the probability density function f(x) is uniformly continuous, then for every \varepsilon > 0,
P\left[\sup_{x} |f_n(x) - f(x)| < \varepsilon\right] \to 1 \quad \text{as } n \to \infty.
Proof :
To prove this theorem we want to show that
\lim_{n \to \infty} E^{1/2}\left[\sup_{-\infty < x < \infty} |f_n(x) - f(x)|^{2}\right] = 0. \qquad (2.1.4)
Since \lim_{n \to \infty} E[f_n(x)] = f(x), it suffices to show that
E^{1/2}\left[\sup_{-\infty < x < \infty} |f_n(x) - E[f_n(x)]|^{2}\right] \to 0 \qquad (2.1.5)
as n \to \infty, since by Lemma 2.1.1 it follows that
\lim_{n \to \infty} \sup_{-\infty < x < \infty} |E[f_n(x)] - f(x)| = 0.
Since
f_n(x) = (2\pi)^{-1} \int_{-\infty}^{\infty} e^{-iux}\, k(u h_n)\, \varphi_n(u)\,du,
then
\sup_{-\infty < x < \infty} |f_n(x) - E[f_n(x)]| \leq (2\pi)^{-1}\left|\int_{-\infty}^{\infty} e^{-iux} k(u h_n)\varphi_n(u)\,du - \int_{-\infty}^{\infty} e^{-iux} k(u h_n) E[\varphi_n(u)]\,du\right|
\leq (2\pi)^{-1}\int_{-\infty}^{\infty} |k(u h_n)|\, |\varphi_n(u) - E[\varphi_n(u)]|\,du \qquad (\text{since } |e^{-iux}| = 1).
Therefore, by Minkowski's inequality, the quantity in (2.1.5) is no greater than
(2\pi)^{-1}\int_{-\infty}^{\infty} |k(u h_n)|\, \sigma[\varphi_n(u)]\,du \leq (n^{1/2} h_n)^{-1}\int_{-\infty}^{\infty} |k(u)|\,du,
which tends to 0. The proof of this theorem is complete.
Theorem 2.1.4.
Under the conditions of the last theorem, if \theta_n is the sample mode and the population mode \theta is unique, then for every \varepsilon > 0,
\lim_{n \to \infty} P(|\theta_n - \theta| < \varepsilon) = 1. \qquad (2.1.6)
Proof :
Since f(x) is a uniformly continuous probability density function with a unique mode \theta, it has the following property: for every \varepsilon > 0 there exists an \eta > 0 such that, for every point x, |\theta - x| \geq \varepsilon implies |f(\theta) - f(x)| \geq \eta.
If the assertion were false, then there would exist an \varepsilon > 0 and a sequence \{x_n\} such that
|f(\theta) - f(x_n)| < \frac{1}{n} \quad \text{and} \quad |\theta - x_n| \geq \varepsilon. \qquad (2.1.7)
Now (2.1.7), and the fact that f(x) \to 0 as x \to \pm\infty, imply that there exists a point \theta' \neq \theta such that f(\theta') = f(\theta), which contradicts the assumption that f(x) has a unique mode \theta.
From this assertion, since f is uniformly continuous, it follows that to prove \theta_n \to \theta in probability it is sufficient to prove that
f(\theta_n) \to f(\theta) \quad \text{in probability as } n \to \infty. \qquad (2.1.8)
Now,
|f(\theta_n) - f(\theta)| = |f(\theta_n) - f_n(\theta_n) + f_n(\theta_n) - f(\theta)| \leq |f(\theta_n) - f_n(\theta_n)| + |f_n(\theta_n) - f(\theta)| \leq \sup_{x}|f(x) - f_n(x)| + \sup_{x}|f_n(x) - f(x)| = 2\sup_{x}|f_n(x) - f(x)|, \qquad (2.1.9)
since
|f_n(\theta_n) - f(\theta)| = \left|\sup_{x} f_n(x) - \sup_{x} f(x)\right| \leq \sup_{x}|f_n(x) - f(x)|. \qquad (2.1.10)
From (2.1.9) and Theorem 2.1.3, we obtain (2.1.8).
Nadaraya (1965) proved the strongest result in this direction. He proved that, under certain conditions, the sample mode \theta_n converges to the population mode \theta with probability 1.
To achieve the asymptotic normality of \theta_n, and therefore to be able to construct asymptotic confidence intervals for \theta, it is generally believed that rather heavy smoothing conditions are needed. The next theorem states conditions on the constants h_n and the kernel K(u) under which the estimated mode \theta_n is asymptotically normally distributed.
Consider a probability density function f(x) with a unique mode at θ. If f(x) has a
continuous second derivative, then by definition of the mode we have
f ′(θ) = 0, f ′′(θ) < 0. (2.1.11)
Similarly, if the estimated probability density function fn(x) is chosen to be twice dif-
ferentiable (that is, the weighting function K(y) is chosen to be twice differentiable),
then
f ′n(θn) = 0, f ′′n(θn) < 0, (2.1.12)
if θn is the mode of fn(x). Then by Taylor’s theorem, we have
0 = f ′n(θn) = f ′n(θ) + (θn − θ)f ′′n(θ?n) (2.1.13)
for some random variable θ?n between θn and θ. From (2.1.13) we can write
θn − θ = −f ′n(θ)/f ′′n(θ?n) (2.1.14)
if the denominator does not vanish. Using (2.1.14) as a basis, we now state conditions
under which the estimated mode θn is asymptotically normally distributed.
Theorem 2.1.5.
Suppose that there exists \delta, 0 < \delta < 1, such that the transform k(u) has a characteristic exponent r \geq 2 and satisfies
1. \int_{-\infty}^{\infty} |u|^{2+\delta}|k(u)|\,du < \infty, \quad \text{and } h_n \text{ is a function of } n \text{ satisfying}
2. \lim_{n \to \infty} n h_n^{5+2\delta} = 0,
3. \lim_{n \to \infty} n h_n^{6} = \infty, \quad \text{and the characteristic function } \varphi(u) \text{ of } X \text{ satisfies}
4. \int_{-\infty}^{\infty} |u|^{2+\delta}|\varphi(u)|\,du < \infty.
Then as n \to \infty,
E\left[\sup_{-\infty < x < \infty} |f_n''(x) - f''(x)|^{2}\right] \to 0, \qquad (2.1.15)
f_n''(\theta_n^{*}) \to f''(\theta) \quad \text{in probability}, \qquad (2.1.16)
(n h_n^{3})^{1/2} f_n'(\theta) \to N(0, f(\theta) J) \quad \text{in distribution}, \qquad (2.1.17)
(n h_n^{3})^{1/2}(\theta_n - \theta) \to N\left(0, f(\theta) J/[f''(\theta)]^{2}\right) \quad \text{in distribution}, \qquad (2.1.18)
where we define
J = \int_{-\infty}^{\infty} K'^{2}(y)\,dy = (2\pi)^{-1}\int_{-\infty}^{\infty} u^{2} k^{2}(u)\,du. \qquad (2.1.19)
Proof :
From (2.1.3), since
f_n(x) = (2\pi)^{-1}\int_{-\infty}^{\infty} e^{-iux}\, k(u h_n)\, \varphi_n(u)\,du,
we have
f_n'(x) = \frac{-i}{2\pi}\int_{-\infty}^{\infty} u\, e^{-iux}\, k(u h_n)\, \varphi_n(u)\,du,
and so
f_n''(x) = \frac{i^{2}}{2\pi}\int_{-\infty}^{\infty} u^{2} e^{-iux} k(u h_n)\varphi_n(u)\,du = \frac{-1}{2\pi}\int_{-\infty}^{\infty} u^{2} e^{-iux} k(u h_n)\varphi_n(u)\,du.
First, we prove (2.1.15):
|f_n''(x) - E[f_n''(x)]| = \frac{1}{2\pi}\left|\int_{-\infty}^{\infty} u^{2} e^{-iux} k(u h_n)\{\varphi_n(u) - E[\varphi_n(u)]\}\,du\right|
\leq (2\pi)^{-1}\int_{-\infty}^{\infty} |k(u h_n)|\, u^{2}\, |\varphi_n(u) - E[\varphi_n(u)]|\,du \qquad (\text{since } |e^{-iux}| = 1).
Let u h_n = v, so that du = \frac{1}{h_n}\,dv and u^{2} = \frac{v^{2}}{h_n^{2}}, to get
E^{1/2}\left[\sup_{-\infty < x < \infty} |f_n''(x) - E[f_n''(x)]|^{2}\right] \leq \int_{-\infty}^{\infty} |k(u h_n)|\, u^{2}\, \sigma[\varphi_n(u)]\,du \leq (n^{1/2} h_n^{3})^{-1}\int_{-\infty}^{\infty} |k(v)|\, v^{2}\,dv,
and
|E[f_n''(x)] - f''(x)| \leq (2\pi)^{-1}\int_{-\infty}^{\infty} |1 - k(u h_n)|\, u^{2}\, |\varphi(u)|\,du.
Both bounds tend to zero under the conditions of the theorem, which proves (2.1.15).
Equation (2.1.16) follows from (2.1.15) and the fact that θ?n tends to θ, since it is between
θn and θ, and θn tends to θ.
To prove (2.1.17), let
f_n'(\theta) = n^{-1}\sum_{k=1}^{n} V_{nk}, \qquad V_{nk} = \frac{1}{h_n^{2}}\, K'\left(\frac{\theta - X_k}{h_n}\right),
so that the V_{nk} are independent and identically distributed as V_n = h_n^{-2} K'\left(\frac{\theta - X}{h_n}\right). Now,
E|V_n|^{m} = \int_{-\infty}^{\infty}\left|\frac{1}{h_n^{2}} K'\left(\frac{\theta - y}{h_n}\right)\right|^{m} f(y)\,dy = \frac{1}{h_n^{2m}}\int_{-\infty}^{\infty}\left|K'\left(\frac{\theta - y}{h_n}\right)\right|^{m} f(y)\,dy.
Let u = \frac{\theta - y}{h_n}, so that y = \theta - u h_n and dy = -h_n\,du. That is,
E|V_n|^{m} = \frac{1}{h_n^{2m-1}}\int_{-\infty}^{\infty} |K'(u)|^{m} f(\theta - u h_n)\,du,
hence
h_n^{2m-1} E|V_n|^{m} \to f(\theta)\int_{-\infty}^{\infty} |K'(y)|^{m}\,dy.
Using Liapounov's condition, it is sufficient to show that, for some \delta > 0,
\frac{E|V_n - E[V_n]|^{2+\delta}}{n^{\delta/2}\, \sigma^{2+\delta}[V_n]} \to 0 \quad \text{as } n \to \infty.
Now,
(n h_n^{3})^{1/2} E[f_n'(\theta)] = (n h_n^{3})^{1/2}\left(\frac{-i}{2\pi}\right)\int_{-\infty}^{\infty} e^{-iu\theta}\{k(u h_n) - 1\}\, u\, \varphi(u)\,du \to 0,
n h_n^{3}\, \mathrm{Var}[f_n'(\theta)] = h_n^{-1}\int_{-\infty}^{\infty} K'^{2}\left(\frac{\theta - y}{h_n}\right) f(y)\,dy - n h_n^{3} E^{2}[f_n'(\theta)] \to f(\theta)\int_{-\infty}^{\infty} K'^{2}(y)\,dy.
Therefore
(n h_n^{3})^{1/2} f_n'(\theta) \to N(0, f(\theta) J),
which is equivalent to
\frac{f_n'(\theta) - E[f_n'(\theta)]}{\sigma[f_n'(\theta)]} \to N(0, 1) \quad \text{in distribution}.
To prove (2.1.18), from Equations (2.1.14), (2.1.16) and (2.1.17) we have
\theta_n - \theta = \frac{-f_n'(\theta)}{f_n''(\theta_n^{*})},
with
(n h_n^{3})^{1/2} f_n'(\theta) \to N(0, f(\theta) J) \quad \text{and} \quad f_n''(\theta_n^{*}) \to f''(\theta) \text{ in probability}.
Then
(n h_n^{3})^{1/2}(\theta_n - \theta) = \frac{-(n h_n^{3})^{1/2} f_n'(\theta)}{f_n''(\theta_n^{*})} \to N\left(0, \frac{f(\theta) J}{[f''(\theta)]^{2}}\right).
2.2 A Simple Estimation of the Mode
The estimator (2.1.1) is increasingly used, although it is difficult to calculate. Indeed,
in addition to the calculation of fn, it involves a numerical step for the computation of
arg max .
As noticed by [5], classical search methods for the arg max perform satisfactorily only when f_n is sufficiently regular. Thus in practice the arg max is usually computed over a finite grid, although this may affect the asymptotic properties of the estimator. Moreover, when the dimension of the sample space is large, or when accurate estimation is needed, the grid size, increasing exponentially with the dimension, leads to time-consuming computations.
Finally, the search grid should be located around high density areas. In high dimension this is a difficult task, and the search grid usually includes low density areas. To solve this problem, [3] proposed a concurrent estimator of the mode, \theta_n^{*}, which is defined by
\theta_n^{*} = \arg\max_{x \in S_n} f_n(x), \qquad (2.2.1)
where S_n = \{X_1, \ldots, X_n\} is the finite sample of d-dimensional data.
The main advantage of using θ?n instead of θn, is that the former is easily computed
in a finite number of operations. Moreover, since the sample points are naturally concen-
trated in high density areas, the set Sn can be regarded as the most natural random grid
for approximating the mode.
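A minimal sketch of (2.2.1), assuming a Gaussian kernel: f_n is evaluated only at the sample points themselves, so the maximization needs a finite number of operations. The function name mode_over_sample and the simulated data are our own.

```python
import numpy as np

def mode_over_sample(data, h):
    """theta*_n of (2.2.1): maximize f_n over the sample points themselves,
    instead of over a grid, so only n evaluations of f_n are needed."""
    u = (data[:, None] - data[None, :]) / h            # (n, n) scaled differences
    f_n_at_sample = np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))
    return data[np.argmax(f_n_at_sample)]

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=500)
print(mode_over_sample(data, h=0.3))                    # an observation near the true mode 2.0
```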
[3] established the strong consistency of \theta_n^{*} towards \theta_n, and provided an almost sure rate of convergence without any differentiability condition on f around the mode.
[2] examined whether maximization over a finite sample alters the rate of convergence of the estimate \theta_n^{*} compared to that of the estimate \theta_n. They proved that the two estimates have the same asymptotic behavior. Another use of \theta_n^{*} is that it may be an appropriate choice for a starting value of an optimization algorithm used to approximate \theta_n.
2.3 Nonparametric Regression Estimation
Kernel smoothing provides a simple way of finding structure in data sets without the imposition of a parametric model. One of the most fundamental settings where kernel smoothing ideas can be applied is the simple regression problem. In this case paired observations on two variables are available, and one is interested in determining an appropriate functional relationship between them. One of the variables, usually denoted by X, is called the predictor variable, and the other, usually denoted by Y, is called the response variable.
A well known result from elementary statistics is that the function m minimizing E\{Y - m(X)\}^{2} is the conditional expectation (mean) function of Y given X, that is,
m(X) = E(Y | X).
This function is usually called the regression function of Y on X. There are now several
approaches to the nonparametric regression problem. Some of the more popular are those
based on kernel functions, spline functions and wavelets. For more details see [22] and [23].
Each of these approaches has its particular strengths and weaknesses, although kernel
regression estimators have the advantage of mathematical and intuitive simplicity. One of the best known kernel estimators is the Nadaraya-Watson estimator.
The Nadaraya-Watson Estimator
Let (X_i, Y_i) be \mathbb{R} \times \mathbb{R} valued independent random variables with a common probability density function f. Also assume that X admits a marginal density g(x).
Suppose that we are given n observations of (X, Y), denoted by (X_1, Y_1), \ldots, (X_n, Y_n). First, we consider the following estimator of the joint density f(x, y) of (X, Y):
f_n(x, y) = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i),
and define the estimator of the marginal pdf of X as
g_n(x) = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i),
where K_{h_n}(x) = K(x/h_n)/h_n.
The Nadaraya-Watson estimator of the conditional density function f(y|x) is given by
f_n(y|x) = \frac{f_n(x, y)}{g_n(x)} = \frac{n^{-1}\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i)}{n^{-1}\sum_{i=1}^{n} K_{h_n}(x - X_i)} = \frac{\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i)}{\sum_{i=1}^{n} K_{h_n}(x - X_i)}.
Now, to estimate m(\cdot), we first compute an estimator of the joint density f(x, y) of (X, Y) and then integrate it according to the formula
m(x) = \frac{\int_{-\infty}^{\infty} y f(x, y)\,dy}{\int_{-\infty}^{\infty} f(x, y)\,dy}. \qquad (2.3.1)
Lemma 2.3.1.
Under the formulas for f_n(x, y) and f(x, y), we have
(1) \int_{-\infty}^{\infty} f_n(x, y)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i).
(2) \int_{-\infty}^{\infty} y f_n(x, y)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i)\, Y_i.
Proof :
Since \int_{-\infty}^{\infty} K(u)\,du = 1 and \int_{-\infty}^{\infty} u K(u)\,du = 0, we have that
(1) \int_{-\infty}^{\infty} f_n(x, y)\,dy = \int_{-\infty}^{\infty} \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i)\int_{-\infty}^{\infty} K_{h_n}(y - Y_i)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i).
(2) \int_{-\infty}^{\infty} y f_n(x, y)\,dy = \int_{-\infty}^{\infty} \frac{y}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i)\int_{-\infty}^{\infty} y\, K_{h_n}(y - Y_i)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i)\, Y_i.
If we substitute these into the numerator and denominator of (2.3.1), we obtain the Nadaraya-Watson kernel estimator of m(\cdot),
m_n(x) = \frac{\sum_{i=1}^{n} K_{h_n}(x - X_i)\, Y_i}{\sum_{i=1}^{n} K_{h_n}(x - X_i)} = \sum_{i=1}^{n} W_{ni}(x)\, Y_i,
where
W_{ni}(x) = \frac{K_{h_n}(x - X_i)}{\sum_{j=1}^{n} K_{h_n}(x - X_j)}, \quad i = 1, \ldots, n,
are the weight functions.
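The following is a minimal Python sketch of the Nadaraya-Watson estimator m_n(x), assuming a Gaussian kernel and simulated data of our own; it is an illustration, not the implementation used in the references.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Nadaraya-Watson estimate m_n(x) = sum_i K_h(x - X_i) Y_i / sum_i K_h(x - X_i),
    using a Gaussian kernel. x may be a scalar or an array of evaluation points."""
    x = np.atleast_1d(x)
    u = (x[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2)                 # the common factor 1/(h*sqrt(2*pi)) cancels in the ratio
    W = K / K.sum(axis=1, keepdims=True)    # weights W_ni(x), summing to one over i
    return W @ Y

rng = np.random.default_rng(2)
X = rng.uniform(0, 2 * np.pi, 300)
Y = np.sin(X) + rng.normal(scale=0.3, size=X.size)
print(nadaraya_watson(np.array([1.0, 3.0]), X, Y, h=0.3))   # roughly sin(1), sin(3)
```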
The bandwidth hn determines the degree of smoothness of mn(·). This can be imme-
diately seen by considering the limits for hn tending to zero or to infinity respectively.
Corollary 2.3.2.
(a) If h_n \to 0, then at an observation X_i,
m_n(X_i) \to \frac{K_{h_n}(0)\, Y_i}{K_{h_n}(0)} = Y_i,
indicating that small bandwidths reproduce the data.
(b) If h_n \to \infty, then
m_n(X_i) \to \frac{\sum_{i=1}^{n} K_{h_n}(0)\, Y_i}{\sum_{i=1}^{n} K_{h_n}(0)} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}.
Proof: See [20]
That is, in (a), if h_n \to 0 then m_n(x) reproduces the data at the observation points, while in (b), if h_n \to \infty then m_n(x) tends to the sample mean, suggesting that a large bandwidth leads to an oversmoothed curve. In general, the bandwidth h_n acts as follows: if h_n is very small, then the weights focus on the few observations that are in a small neighborhood around each X_i; if h_n is very large, then the weights spread over a larger neighborhood around each X_i.
Consequently, the choice of h_n plays an important role in kernel regression. These two limit considerations make it clear that the smoothing parameter h_n, in relation to the sample size n, should converge to zero neither too rapidly nor too slowly.
2.4 Joint Asymptotic Distribution of the Estimated
Conditional Mode
In nonparametric estimation of the regression function, most investigations are concerned with the regression function m(x), the conditional mean of Y given the value x of a predictor X. However, new insights about the underlying structure can be gained by considering other aspects of the conditional distribution f(y|x) of Y given X = x. One of these aspects is the conditional mode function, which will be the topic of this section.
Assume that (X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n) are i.i.d. random variables with joint probability density function f(x, y). The marginal probability density function of X_1 is
g(x) = \int_{-\infty}^{\infty} f(x, y)\,dy,
and the conditional probability density function of Y_1 given X_1 = x is given by
f(y|x) = \frac{f(x, y)}{g(x)}.
We assume that, for each x, f(x, y) is uniformly continuous in y, and it follows that f(\cdot|x) possesses a mode \theta(x) defined by
\theta(x) = \arg\max_{-\infty < y < \infty} f(y|x).
We call \theta(x) the population conditional mode, or the mode function, and we assume that \theta(x) is unique.
Let K be a measurable function and let h_n be a sequence of positive numbers converging to zero. We consider the Nadaraya-Watson estimator f_n(y|x) of the conditional density f(y|x).
If K is chosen such that K(u) tends to zero as u tends to \pm\infty, then for every sample sequence and for each x, f_n(y|x) is a continuous function of y and tends to zero as y tends to \pm\infty. Consequently, there is a random variable \theta_n(x) such that
\theta_n(x) = \arg\max_{-\infty < y < \infty} f_n(y|x).
We call \theta_n(x) the sample conditional mode. [18] considered \theta_n(x) as an estimator of \theta(x) and established conditions under which the estimator is strongly consistent and asymptotically normally distributed. They proved that (n h_n^{4})^{1/2}(\theta_n(x) - \theta(x)) is asymptotically normally distributed with mean zero and variance
\frac{f(x, \theta(x))}{\{f^{(0,2)}(x, \theta(x))\}^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\{K(u) K^{(1)}(v)\}^{2}\,du\,dv,
where K^{(1)}(v) denotes the first derivative of K(v), and f^{(0,2)}(x, \theta(x)) is defined in the following assumptions.
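As an illustration of the estimator θ_n(x), the following sketch computes the arg max of the Nadaraya-Watson conditional density estimate over a finite grid of y values, assuming Gaussian kernels and simulated data of our own choosing.

```python
import numpy as np

def conditional_mode(x, X, Y, h, y_grid):
    """theta_n(x): the y-grid point maximizing the Nadaraya-Watson conditional
    density estimate f_n(y|x), with Gaussian kernels in both coordinates."""
    Kx = np.exp(-0.5 * ((x - X) / h) ** 2)                  # kernel weights in the x-direction
    Ky = np.exp(-0.5 * ((y_grid[:, None] - Y[None, :]) / h) ** 2)
    f_cond = Ky @ Kx                                        # proportional to f_n(y|x) on the grid
    return y_grid[np.argmax(f_cond)]

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, 1000)
Y = X**2 + rng.normal(scale=0.3, size=X.size)               # conditional mode approx x^2
y_grid = np.linspace(-1, 5, 1201)
print(conditional_mode(1.0, X, Y, h=0.2, y_grid=y_grid))    # close to 1.0
```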
In this section, we will discuss this result for the multivariate case. For distinct points x_1, x_2, \ldots, x_k we will establish conditions under which (n h_n^{4})^{1/2}(\theta_n(x_1) - \theta(x_1), \ldots, \theta_n(x_k) - \theta(x_k))^{T}, where T denotes the transpose, is asymptotically multivariate normal with mean zero vector and diagonal covariance matrix B = [b_{ij}] with
b_{ii} = \frac{f(x_i, \theta(x_i))}{\{f^{(0,2)}(x_i, \theta(x_i))\}^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\{K(u) K^{(1)}(v)\}^{2}\,du\,dv, \qquad b_{ij} = 0 \ (i \neq j).
We consider the following assumptions from [18],
(A1) (X_1, Y_1), \ldots, (X_n, Y_n) is a sample of i.i.d. random variables with joint probability density function f(x, y), where the following hold:
(i) g(x), the marginal probability density function of X, is uniformly continuous.
(ii) f^{(i,j)}(x, y) = \frac{\partial^{i+j} f(x, y)}{\partial x^{i}\partial y^{j}} exists and is bounded for 1 \leq i + j \leq 3.
(A2) The kernel K is a Borel function and satisfies the following:
(i) K(u) tends to zero as u tends to \pm\infty.
(ii) K(u) and its first two derivatives are functions of bounded variation.
(iii) \lim_{|u| \to \infty} |u^{2} K^{(i)}(u)| = 0, \ (i = 0, 1).
(iv) \int_{-\infty}^{\infty} u^{i} K(u)\,du = 1 \text{ if } i = 0, \text{ and } = 0 \text{ if } i = 1, 2.
(v) \int_{-\infty}^{\infty} |u|^{3} K(u)\,du < \infty.
(A3) h_n is a sequence of positive numbers tending to zero and satisfies
h_n = n^{-\delta}, \quad \frac{1}{10} < \delta < \frac{1}{8}; \quad \text{i.e. } \lim_{n \to \infty} n h_n^{8} = \infty \text{ and } \lim_{n \to \infty} n h_n^{10} = 0.
To prove our result we will use the following preliminary lemmas from [18] and [20].
Lemma 2.4.1. (Bochner Lemma)
Suppose K_1(u) and K_2(u) are real valued Borel measurable functions satisfying the following conditions:
1. \sup_{u \in \mathbb{R}} |K_i(u)| < \infty, \ (i = 1, 2).
2. \int_{-\infty}^{\infty} |K_i(u)|\,du < \infty, \ (i = 1, 2).
3. \lim_{|u| \to \infty} |u^{2} K_i(u)| = 0, \ (i = 1, 2).
If (x, y) \in C(f), the set of continuity points of f, then for any \eta \geq 0,
\lim_{n \to \infty}\left[h_n^{-2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left|K_1\left(\frac{u}{h_n}\right) K_2\left(\frac{v}{h_n}\right)\right|^{1+\eta} f(x - u, y - v)\,du\,dv\right] = f(x, y)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |K_1(u) K_2(v)|^{1+\eta}\,du\,dv.
Define
f_n^{(0,j)}(x, y) = ∂^j f_n(x, y)/∂y^j = (nh_n^{j+2})^{-1} Σ_{i=1}^{n} K((x − X_i)/h_n) K^{(j)}((y − Y_i)/h_n),
where K^{(j)} denotes the jth derivative of K, (j = 1, 2), and
W_ni = h_n^{-3} K((x − X_i)/h_n) K^{(1)}((y − Y_i)/h_n), (i = 1, 2, ..., n).
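For readers who prefer a computational view of these quantities, the sketch below (Python/NumPy; the Gaussian kernel, whose derivatives are available in closed form, and the simulated data are illustrative assumptions) evaluates f_n^{(0,1)} and f_n^{(0,2)} at a point and performs one Newton-type step of the form suggested by equation (2.4.1) below.

```python
import numpy as np

def K0(v):
    """Gaussian kernel K(v)."""
    return np.exp(-0.5 * v**2) / np.sqrt(2.0 * np.pi)

def K1(v):
    """First derivative of the Gaussian kernel."""
    return -v * K0(v)

def K2(v):
    """Second derivative of the Gaussian kernel."""
    return (v**2 - 1.0) * K0(v)

def f_0j(x, y, X, Y, h, j):
    """f_n^{(0,j)}(x, y) = (n h^{j+2})^{-1} sum_i K((x-X_i)/h) K^{(j)}((y-Y_i)/h)."""
    Kj = {0: K0, 1: K1, 2: K2}[j]
    return np.sum(K0((x - X) / h) * Kj((y - Y) / h)) / (len(X) * h ** (j + 2))

# one Newton-type step implied by (2.4.1):
# near the mode, theta_n(x) - y0 is approximately -f_n^{(0,1)}(x, y0) / f_n^{(0,2)}(x, y0)
rng = np.random.default_rng(1)
X = rng.standard_normal(5000)
Y = X + 0.5 * rng.standard_normal(5000)
x, y0, h = 0.0, 0.2, 0.3          # y0 is an initial guess for the mode at x
step = -f_0j(x, y0, X, Y, h, 1) / f_0j(x, y0, X, Y, h, 2)
print(y0 + step)                  # should move towards the true mode theta(0) = 0
```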
Lemma 2.4.2.
Under the assumptions (A1)(ii), (A2) and (A3), if (x, y) ∈ C(f), then the following are true:
(i) lim_{n→∞} nh_n^4 Var[f_n^{(0,1)}(x, y)] = f(x, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv.
(ii) (nh_n^4)^{1/2} [E f_n^{(0,1)}(x, y) − f^{(0,1)}(x, y)] = o(1).
Lemma 2.4.3.
Under the assumptions of the above lemma, the following is true:
lim_{n→∞} (n^{-1} h_n^4)^{1+δ/2} Σ_{i=1}^{n} E|W_ni − E W_ni|^{2+δ} = 0.
For fixed x, expanding f_n^{(0,1)}(x, θ_n(x)) around θ(x), we obtain
0 = f_n^{(0,1)}(x, θ_n(x)) = f_n^{(0,1)}(x, θ(x)) + (θ_n(x) − θ(x)) f_n^{(0,2)}(x, θ*_n(x)),
where |θ*_n(x) − θ(x)| < |θ_n(x) − θ(x)|. Hence,
θ_n(x) − θ(x) = − f_n^{(0,1)}(x, θ(x)) / f_n^{(0,2)}(x, θ*_n(x)).   (2.4.1)
Lemma 2.4.4.
Under the assumptions (A1), (A2)(ii), (iii) and (A3), if g(x) > 0, then f_n^{(0,2)}(x, θ*_n(x)) converges in probability to f^{(0,2)}(x, θ(x)) as n tends to infinity.
Now we prove an intermediate result in the next theorem.
Theorem 2.4.5.
Suppose that x_1, x_2, ..., x_k are distinct points, where f(x_i, y) > 0 and (x_i, y) ∈ C(f), (i = 1, 2, ..., k). Then, under the assumptions (A1), (A2)(ii), (iii) and (A3), the distribution of the vector
(nh_n^4)^{1/2} (f_n^{(0,1)}(x_1, y) − f^{(0,1)}(x_1, y), ..., f_n^{(0,1)}(x_k, y) − f^{(0,1)}(x_k, y))^T,
where T denotes the transpose, is asymptotically multivariate normal with mean zero vector and diagonal covariance matrix Γ = [γ_ij], where γ_ij = 0 for i ≠ j and
γ_ii = f(x_i, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv, (i = 1, 2, ..., k).
Proof:
Without loss of generality, we consider the special case k = 2; the same arguments apply in the general case. Before starting the proof, we introduce some notation. For i = 1, 2, ..., n and s = 1, 2, we define the following:
V_ni(x_s) = h_n^{-3} K((x_s − X_i)/h_n) K^{(1)}((y − Y_i)/h_n),
W_ni(x_s) = h_n^2 (V_ni(x_s) − E V_ni(x_s)),
W_n(x_s) = Σ_{i=1}^{n} W_ni(x_s),
Z_ni = (W_ni(x_1), W_ni(x_2))^T,
Z_n = n^{-1/2} (W_n(x_1), W_n(x_2))^T,
so that
Z_n = (nh_n^4)^{1/2} (f_n^{(0,1)}(x_1, y) − E f_n^{(0,1)}(x_1, y), f_n^{(0,1)}(x_2, y) − E f_n^{(0,1)}(x_2, y))^T.   (2.4.2)
Let A = [a_rs] be the 2 × 2 diagonal matrix with
a_ss = f(x_s, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv, (s = 1, 2),
and let Z be the bivariate normal random vector with mean vector zero and covariance matrix A.
First, we will show that Z_n converges in distribution to Z. To do that, we will use the Cramer-Wold theorem: it is sufficient to prove that CZ_n^T converges in distribution to CZ^T for any constant C = (c_1, c_2) ∈ R^2, C ≠ 0.
Note that
CZ_n^T = Σ_{i=1}^{n} n^{-1/2} C Z_ni,   E(n^{-1/2} C Z_ni) = 0.
Let ρ_{ni}^{2+δ} = E|n^{-1/2} C Z_ni|^{2+δ}, ρ_n^{2+δ} = Σ_{i=1}^{n} ρ_{ni}^{2+δ}, and σ_n^2 = Var(CZ_n^T).
By Liapounov's theorem, it is sufficient to show that
lim_{n→∞} ρ_n^{2+δ} / σ_n^{2+δ} = 0.   (2.4.3)
Now, the proof of Theorem 2.4.5 will be given via the following lemmas.
Lemma 2.4.6.
Under conditions (A2)(ii), (iii), (iv), if (x_s, y) ∈ C(f), then for s = 1, 2 and r = 1, 2 the following are true:
(a) lim_{n→∞} E W_ni^2(x_s) = f(x_s, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv.
(b) lim_{n→∞} E W_ni(x_s) W_ni(x_r) = 0, (r ≠ s).
Proof:
(a) By the definition of W_ni(x_s),
E W_ni^2(x_s) = h_n^4 (E V_ni^2(x_s) − (E V_ni(x_s))^2),   (2.4.4)
where
h_n^4 E V_ni^2(x_s) = h_n^4 h_n^{-6} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K^2((x_s − u)/h_n) [K^{(1)}((y − v)/h_n)]^2 f(u, v) du dv
= h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u/h_n) K^{(1)}(v/h_n)]^2 f(x_s − u, y − v) du dv.
Now, by an application of the Bochner Lemma, we obtain
lim_{n→∞} h_n^4 E V_ni^2(x_s) = f(x_s, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv.   (2.4.5)
Next,
h_n^4 (E V_ni(x_s))^2 = h_n^2 (h_n E V_ni(x_s))^2 = h_n^2 ( h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u/h_n) K^{(1)}(v/h_n) f(x_s − u, y − v) du dv )^2.
By another application of the Bochner Lemma, we obtain
lim_{n→∞} h_n^4 (E V_ni(x_s))^2 = 0.   (2.4.6)
Combining (2.4.4), (2.4.5) and (2.4.6), (a) holds.
(b) From the definition of W_ni(x_s), we have
E(W_ni(x_1) W_ni(x_2)) = h_n^4 (E V_ni(x_1) V_ni(x_2) − E V_ni(x_1) E V_ni(x_2)).   (2.4.7)
Suppose that x_2 > x_1, and let δ = x_2 − x_1 and δ_n = δ/h_n. Then
h_n^4 E V_ni(x_1) V_ni(x_2) = h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K((x_1 − u)/h_n) K((x_2 − u)/h_n) [K^{(1)}((y − v)/h_n)]^2 f(u, v) du dv
= ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u) K(δ_n + u) [K^{(1)}(v)]^2 f(x_1 − h_n u, y − h_n v) du dv
= ∫_{-∞}^{∞} [K^{(1)}(v)]^2 [ ∫_{-∞}^{∞} K(u) K(δ_n + u) g(x_1 − h_n u) f(y − h_n v | x_1 − h_n u) du ] dv.   (2.4.8)
Next,
∫_{-∞}^{∞} K(u) K(δ_n + u) g(x_1 − h_n u) du = ∫_{|u|<δ_n/2} K(u) K(δ_n + u) g(x_1 − h_n u) du + ∫_{|u|≥δ_n/2} K(u) K(δ_n + u) g(x_1 − h_n u) du
≤ sup_{|u|<δ_n/2} K(δ_n + u) ∫_{-∞}^{∞} K(z) g(x_1 − h_n z) dz + sup_{|u|≥δ_n/2} K(u) ∫_{-∞}^{∞} K(δ_n + z) g(x_1 − h_n z) dz
≤ sup_{|u|≥δ_n/2} K(u) · O(1) + sup_{|u|≥δ_n/2} K(u) · O(1)
= 2 sup_{|u|≥δ_n/2} K(u) · O(1)
≤ (4/δ_n) sup_{|u|≥δ_n/2} |u K(u)| · O(1)
= (4h_n/δ) sup_{|u|≥δ_n/2} |u K(u)| · O(1) = O(h_n).   (2.4.9)
Finally, from (2.4.8) and (2.4.9), we have
lim_{n→∞} h_n^4 E V_ni(x_1) V_ni(x_2) = 0.   (2.4.10)
Similarly,
h_n^4 E V_ni(x_1) E V_ni(x_2) = h_n^2 [ h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u/h_n) K^{(1)}(v/h_n) f(x_1 − u, y − v) du dv ] × [ h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(w/h_n) K^{(1)}(v/h_n) f(x_2 − w, y − v) dw dv ] → 0   (2.4.11)
by an application of the Bochner Lemma. The proof of the lemma is completed by combining (2.4.7), (2.4.10) and (2.4.11).
Lemma 2.4.7.
Under the conditions of the last lemma, we have
lim_{n→∞} σ_n^2 = CAC^T.
Proof:
Since σ_n^2 = Var(CZ_n^T), by the definition of Z_n we have
σ_n^2 = Var(n^{-1/2} c_1 W_n(x_1) + n^{-1/2} c_2 W_n(x_2))
= n^{-1} c_1^2 Var(W_n(x_1)) + n^{-1} c_2^2 Var(W_n(x_2)) + 2 n^{-1} c_1 c_2 Cov(W_n(x_1), W_n(x_2))
= n^{-1} c_1^2 Σ_{i=1}^{n} Var(W_ni(x_1)) + n^{-1} c_2^2 Σ_{i=1}^{n} Var(W_ni(x_2)) + 2 n^{-1} c_1 c_2 Cov(Σ_{i=1}^{n} W_ni(x_1), Σ_{i=1}^{n} W_ni(x_2))
= c_1^2 Var(W_ni(x_1)) + c_2^2 Var(W_ni(x_2)) + 2 n^{-1} c_1 c_2 E(Σ_{i=1}^{n} Σ_{j=1}^{n} W_ni(x_1) W_nj(x_2)).
Since CAC^T is the quadratic form associated with the positive definite matrix A (so that CAC^T > 0), an application of Lemma 2.4.6 implies that
lim_{n→∞} σ_n^2 = ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv [c_1^2 f(x_1, y) + c_2^2 f(x_2, y)] = CAC^T > 0.
Now,
ρ_{ni}^{2+δ} ≤ n^{-(1+δ/2)} |C|^{2+δ} E|Z_ni|^{2+δ}
= n^{-(1+δ/2)} |C|^{2+δ} E|(W_ni(x_1), W_ni(x_2))|^{2+δ}
≤ n^{-(1+δ/2)} |C|^{2+δ} 2^{2+δ} max{E|W_ni(x_1)|^{2+δ}, E|W_ni(x_2)|^{2+δ}}.
Assume that E|W_ni(x_1)|^{2+δ} ≥ E|W_ni(x_2)|^{2+δ}. Then we have
ρ_{ni}^{2+δ} ≤ n^{-(1+δ/2)} |C|^{2+δ} 2^{2+δ} E|W_ni(x_1)|^{2+δ}
= n^{-(1+δ/2)} |C|^{2+δ} 2^{2+δ} E|h_n^2 (V_ni(x_1) − E V_ni(x_1))|^{2+δ}
= |C|^{2+δ} 2^{2+δ} (n^{-1} h_n^4)^{1+δ/2} E|V_ni(x_1) − E V_ni(x_1)|^{2+δ}.
This implies that
ρ_n^{2+δ} = Σ_{i=1}^{n} ρ_{ni}^{2+δ} ≤ |C|^{2+δ} 2^{2+δ} (n^{-1} h_n^4)^{1+δ/2} Σ_{i=1}^{n} E|V_ni(x_1) − E V_ni(x_1)|^{2+δ},
which converges to zero as n tends to infinity by an application of Lemma 2.4.3. Hence Liapounov's condition, lim_{n→∞} ρ_n^{2+δ}/σ_n^{2+δ} = 0, is satisfied, so CZ_n^T is asymptotically normally distributed with mean zero and variance CAC^T.
By the Cramer-Wold theorem, Z_n converges in distribution to Z. Now an application of Lemma 2.4.2(ii) to equation (2.4.2) completes the proof of Theorem 2.4.5.
We are now in a position to prove our main theorem.
Theorem 2.4.8.
Suppose that x_1, x_2, ..., x_k are distinct points, where f(x_i, θ(x_i)) > 0 and (x_i, θ(x_i)) ∈ C(f), (i = 1, 2, ..., k). Then, under the assumptions (A1)-(A3), the distribution of the vector
(nh_n^4)^{1/2} (θ_n(x_1) − θ(x_1), ..., θ_n(x_k) − θ(x_k))^T,
where T denotes the transpose, is asymptotically multivariate normal with mean vector zero and diagonal covariance matrix B = [b_ij], where b_ij = 0 for i ≠ j and
b_ii = (f(x_i, θ(x_i)) / [f^{(0,2)}(x_i, θ(x_i))]^2) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv.
Proof:
By (2.4.1),
(nh_n^4)^{1/2} (θ_n(x_1) − θ(x_1), ..., θ_n(x_k) − θ(x_k))^T = −(nh_n^4)^{1/2} ( f_n^{(0,1)}(x_1, θ(x_1)) / f_n^{(0,2)}(x_1, θ*_n(x_1)), ..., f_n^{(0,1)}(x_k, θ(x_k)) / f_n^{(0,2)}(x_k, θ*_n(x_k)) )^T,
where |θ*_n(x_i) − θ(x_i)| < |θ_n(x_i) − θ(x_i)|, (i = 1, 2, ..., k).
An application of Theorem 2.4.5 and Lemma 2.4.4 completes the proof.
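One practical use of Theorem 2.4.8 is to attach an approximate standard error to θ_n(x) through the diagonal entry b_ii. A minimal Python sketch is given below; the Gaussian kernel, the plug-in kernel estimates of f and f^{(0,2)}, the bandwidth and the simulated data are all illustrative assumptions rather than part of the theorem.

```python
import numpy as np

def K(v):
    return np.exp(-0.5 * v**2) / np.sqrt(2.0 * np.pi)

# For the Gaussian kernel, int K(u)^2 du = 1/(2 sqrt(pi)) and
# int (K^{(1)}(v))^2 dv = 1/(4 sqrt(pi)), so the double integral of
# [K(u) K^{(1)}(v)]^2 equals 1/(8 pi).
KERNEL_CONST = 1.0 / (8.0 * np.pi)

def f_joint(x, y, X, Y, h):
    """Kernel estimate of the joint density f(x, y)."""
    return np.mean(K((x - X) / h) * K((y - Y) / h)) / h**2

def f_02(x, y, X, Y, h):
    """Kernel estimate of f^{(0,2)}(x, y), the second y-derivative of f(x, y)."""
    v = (y - Y) / h
    return np.mean(K((x - X) / h) * (v**2 - 1.0) * K(v)) / h**4

def mode_stderr(x, theta_hat, X, Y, h):
    """Plug-in standard error sqrt(b_ii / (n h^4)) suggested by Theorem 2.4.8."""
    b = f_joint(x, theta_hat, X, Y, h) * KERNEL_CONST / f_02(x, theta_hat, X, Y, h) ** 2
    return np.sqrt(b / (len(X) * h**4))

# illustrative use on simulated data whose conditional mode at x = 0 is 0
rng = np.random.default_rng(2)
X = rng.standard_normal(4000)
Y = X + 0.5 * rng.standard_normal(4000)
print(mode_stderr(0.0, 0.0, X, Y, h=0.35))
```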
Chapter 3
Quantiles Regression
The term quantile is synonymous with percentile; the median is the best-known example of a quantile. We know that the sample median can be defined as the middle value (or the value half-way between the two middle values) of a set of ranked data, i.e. the sample median splits the data into two parts with an equal number of data points in each. Usually, the sample median is taken as an estimator of the population median m, a quantity which splits the distribution into two halves in the sense that, if the random variable Y can be measured on the population, then P(Y ≤ m) = P(Y ≥ m) = 1/2. In particular, for a continuous random variable, m is a solution of the equation F(m) = 1/2, where F(y) = P(Y ≤ y) is the cumulative distribution function.
As an example of the use of the median, consider the distribution of salaries. This is typically skewed to the right, since relatively few people earn large salaries. As a consequence, the sample median provides a better summary of typical salaries than the mean.
More generally, the 25% and 75% sample quantiles can be defined as the values that split the data into proportions of one quarter and three quarters, and three quarters and one quarter, respectively. Correspondingly, in the continuous case, the population lower quartile and upper quartile are the solutions of the equations F(y) = 1/4 and F(y) = 3/4, respectively. Generally, for a proportion α (0 < α < 1), and in the continuous case, the 100α% quantile (equivalently, the 100αth percentile) of F is the value y which solves F(y) = α. Note that we assume that this value is unique.
A further generalization of the concept, to the conditional quantile, emerges when we want to study the relationship between a response variable Y and a covariate X, and to quantify that relationship by regression analysis. The conditional distribution function F(y|X = x) plays a central role in this problem.
In parametric and nonparametric estimation of the conditional distribution function, most investigations of the underlying structure are concerned with the conditional mean function m(x) = E(Y|X = x), the conditional mean of Y given the value x of X. New insights about the underlying structure can be gained by considering other aspects of the conditional distribution function F(y|X = x).
Estimation of conditional quantiles has received particular attention during the last three decades because of its useful applications in various fields such as econometrics, finance, environmental sciences and medicine. For more details see [9].
This chapter consists of three sections. In the first section we discuss the asymptotic normality of the conditional quantiles; in the second section we discuss their joint asymptotic normality; finally, in Section 3.3 we give a comparison between the mode and the median.
3.1 Nonparametric estimation of conditional quantiles
In this section, we introduce the definition of the conditional α-quantile and discuss the asymptotic normality of its kernel estimator.
Let (X, Y) be a bivariate random variable and F(y|X = x) = P(Y ≤ y | X = x) the conditional distribution function of Y given X = x.
Definition 3.1.1. The conditional α-quantile q_α(x) is defined as follows:
q_α(x) = inf{y ∈ R | F(y|x) ≥ α}, 0 < α < 1, x ∈ R.
The quantiles give more complete information about the distribution of Y as a function
of the predictor variable X than the conditional mean alone.
[1] discussed the following two kernel estimators of the conditional cdf F(y|x) and of the α-quantile q_α(x), respectively:
F_n(y|x) = Σ_{i=1}^{n} I_{Y_i ≤ y} K((x − X_i)/h_n) / Σ_{i=1}^{n} K((x − X_i)/h_n),   (3.1.1)
q_{n,α}(x) = inf{y ∈ R | F_n(y|x) ≥ α}, 0 < α < 1.   (3.1.2)
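A direct implementation of (3.1.1) and (3.1.2) is straightforward. The Python sketch below follows only the form of the estimators; the compactly supported Epanechnikov kernel, the simulated data and the bandwidth are illustrative assumptions.

```python
import numpy as np

def epanechnikov(u):
    """Compactly supported kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def F_n(y, x, X, Y, h):
    """Kernel estimator (3.1.1) of the conditional cdf F(y|x)."""
    w = epanechnikov((x - X) / h)
    return np.sum(w * (Y <= y)) / np.sum(w)

def q_n(alpha, x, X, Y, h):
    """Conditional sample quantile (3.1.2): smallest y with F_n(y|x) >= alpha,
    searched over the observed responses."""
    for y in np.sort(Y):
        if F_n(y, x, X, Y, h) >= alpha:
            return y
    return np.max(Y)

# illustrative use: for Y | X = x ~ N(x, 0.25) the conditional median is x
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 1500)
Y = X + 0.5 * rng.standard_normal(1500)
print(q_n(0.5, 0.5, X, Y, h=0.1))        # should be close to 0.5
```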
Now we consider E[F_n(y|x)] and Var[F_n(y|x)], in order to obtain the mean squared error of the estimator.
Lemma 3.1.1.
Let the Y_i be independent random variables (the X_i being treated as fixed). The expectation of the estimator F_n(y|x) is given by
E[F_n(y|x)] = Σ_{i=1}^{n} K((x − X_i)/h_n) F(y|X_i) / Σ_{i=1}^{n} K((x − X_i)/h_n).
Proof:
From the definition of the expectation and Equation (3.1.1), since the denominator of (3.1.1) is non-random, we have
E[F_n(y|x)] = E[ Σ_{i=1}^{n} I_{Y_i ≤ y} K((x − X_i)/h_n) / Σ_{i=1}^{n} K((x − X_i)/h_n) ]
= E[ Σ_{i=1}^{n} I_{Y_i ≤ y} K((x − X_i)/h_n) ] / Σ_{i=1}^{n} K((x − X_i)/h_n)
= Σ_{i=1}^{n} K((x − X_i)/h_n) ∫_{-∞}^{∞} I_{t ≤ y} f(t|X_i) dt / Σ_{i=1}^{n} K((x − X_i)/h_n)
= Σ_{i=1}^{n} K((x − X_i)/h_n) ∫_{-∞}^{y} f(t|X_i) dt / Σ_{i=1}^{n} K((x − X_i)/h_n)
= Σ_{i=1}^{n} K((x − X_i)/h_n) F(y|X_i) / Σ_{i=1}^{n} K((x − X_i)/h_n).   (3.1.3)
Thus the proof of this lemma is completed.
Now, the next lemma gives the variance of F_n(y|x).
Lemma 3.1.2.
Let the Y_i be independent random variables. The variance of the estimator F_n(y|x) is given by
Var[F_n(y|x)] = Σ_{i=1}^{n} K^2((x − X_i)/h_n) [F(y|X_i) − F^2(y|X_i)] / [Σ_{i=1}^{n} K((x − X_i)/h_n)]^2.
Proof:
From the definition of the variance and Equation (3.1.1), and writing K_i = K((x − X_i)/h_n), we have
Var[F_n(y|x)] = E[F_n^2(y|x)] − (E[F_n(y|x)])^2
= E[ (Σ_{i=1}^{n} I_{Y_i ≤ y} K_i)^2 ] / (Σ_{i=1}^{n} K_i)^2 − ( Σ_{i=1}^{n} K_i F(y|X_i) )^2 / (Σ_{i=1}^{n} K_i)^2
= { Σ_{i=1}^{n} K_i^2 E[I_{Y_i ≤ y}^2] + Σ_{i ≠ j} K_i K_j E[I_{Y_i ≤ y}] E[I_{Y_j ≤ y}] − ( Σ_{i=1}^{n} K_i F(y|X_i) )^2 } / (Σ_{i=1}^{n} K_i)^2   (by the independence of the Y_i)
= { Σ_{i=1}^{n} K_i^2 F(y|X_i) + Σ_{i ≠ j} K_i K_j F(y|X_i) F(y|X_j) − Σ_{i=1}^{n} K_i^2 F^2(y|X_i) − Σ_{i ≠ j} K_i K_j F(y|X_i) F(y|X_j) } / (Σ_{i=1}^{n} K_i)^2   (since I_{Y_i ≤ y}^2 = I_{Y_i ≤ y} and E[I_{Y_i ≤ y}] = F(y|X_i))
= Σ_{i=1}^{n} K_i^2 [F(y|X_i) − F^2(y|X_i)] / (Σ_{i=1}^{n} K_i)^2.   (3.1.4)
For further results we need assumptions on the kernel function, the bandwidth and the conditional distribution function. These assumptions will be used throughout this chapter.
(A1) h_n is a sequence of positive numbers satisfying:
(i) h_n → 0 as n → ∞;
(ii) nh_n → ∞ as n → ∞.
(A2) The kernel K is a Borel function and satisfies the following:
(i) K has a compact support;
(ii) K is symmetric;
(iii) K is Lipschitz-continuous;
(iv) ∫ K(u) du = 1;
(v) K is bounded.
(A3) For a fixed y ∈ R, F''(y|x) = ∂^2 F(y|x)/∂x^2 exists in a neighborhood of x.
We assume that (A2)(i, ii) and (A3) are satisfied. Let U_i = (x − X_i)/h_n and x ∈ (h_n, 1 − h_n); then from Lemma (1.5.1) it follows that
E[F_n(y|x)] − F(y|x) = (1/2) h_n^2 μ_2(K) F''(y|x) + o(h_n^2),
where
μ_2(K) = ∫ u^2 K(u) du ≈ Σ_i U_i^2 K(U_i) / Σ_i K(U_i).
Then
E[F_n(y|x)] = F(y|x) + (h_n^2/2) (Σ_i U_i^2 K(U_i) / Σ_i K(U_i)) F''(y|x) + o(h_n^2).   (3.1.5)
Lemma 3.1.3 (Integral approximation of the sum over the kernel function).
Under (A2)(i), the Lipschitz continuity (A2)(iii) and the mean value theorem of integration, it follows that
(i) lim_{n→∞} Σ_{i=1}^{n} (1/(nh_n)) K(U_i) = ∫_{-∞}^{∞} K(u) du,
(ii) lim_{n→∞} Σ_{i=1}^{n} (1/(nh_n)) K^2(U_i) = ∫_{-∞}^{∞} K^2(u) du,
(iii) lim_{n→∞} Σ_{i=1}^{n} (1/(nh_n)) U_i K(U_i) = ∫_{-∞}^{∞} u K(u) du.
Proof:
Let J be the index set of the observations contributing to the sum, with cardinality |J| = O(nh_n). We prove (i); the proofs of (ii) and (iii) are similar. The design points are taken to be equally spaced with spacing 1/n, so that X_i − X_{i−1} = 1/n and hence U_{i−1} − U_i = 1/(nh_n). With ζ_i ∈ [U_i, U_{i−1}] given by the mean value theorem of integration,
| Σ_{i=1}^{n} (1/(nh_n)) K(U_i) − ∫_{-∞}^{∞} K(u) du | ≤ Σ_{i∈J} | (1/(nh_n)) K(U_i) − ∫_{U_i}^{U_{i−1}} K(u) du |
= Σ_{i∈J} | (1/(nh_n)) K(U_i) − (U_{i−1} − U_i) K(ζ_i) |
= Σ_{i∈J} | (1/(nh_n)) K(U_i) − ((X_i − X_{i−1})/h_n) K(ζ_i) |
= Σ_{i∈J} | (1/(nh_n)) K(U_i) − (1/(nh_n)) K(ζ_i) |
= (1/(nh_n)) Σ_{i∈J} | K(U_i) − K(ζ_i) |
≤ (1/(nh_n)) Σ_{i∈J} L |U_i − ζ_i|   (from the Lipschitz condition)
≤ (1/(nh_n)) Σ_{i∈J} O(1/(nh_n))
= O(1/(n^2 h_n^2)) Σ_{i∈J} 1
= O(1/(nh_n)).
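The statement of the lemma is easy to check numerically for an equally spaced design; in the sketch below (Python/NumPy; the equispaced design, the Gaussian kernel and the chosen n, h and x are illustrative assumptions) the normalized kernel sums are compared with the corresponding integrals.

```python
import numpy as np

def K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

n, h, x = 5000, 0.05, 0.5
X = np.arange(1, n + 1) / n                 # equally spaced design on (0, 1]
U = (x - X) / h

# sums of Lemma 3.1.3 versus the limiting integrals
print(np.sum(K(U)) / (n * h), 1.0)                              # int K(u) du = 1
print(np.sum(K(U)**2) / (n * h), 1.0 / (2.0 * np.sqrt(np.pi)))  # int K^2(u) du
print(np.sum(U * K(U)) / (n * h), 0.0)                          # int u K(u) du = 0
```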
Now, we want to approximate the mean square error as in the next Theorem.
Theorem 3.1.4.
Let the Y_i be independent and let (A1)(i, ii), (A2)(i, ii, iii, iv) and (A3) be satisfied. Then it holds, for n → ∞ and x ∈ (h_n, 1 − h_n), that
MSE(F_n(y|x)) ≈ [ (h_n^2/2) F''(y|x) ∫ u^2 K(u) du ]^2 + (1/(nh_n)) (F(y|x) − F^2(y|x)) ∫ K^2(u) du.   (3.1.6)
Proof:
Since MSE(F_n(y|x)) = (E[F_n(y|x)] − F(y|x))^2 + Var[F_n(y|x)], and the bias is given by (3.1.5), it remains only to approximate Var[F_n(y|x)]. Taylor expansion in the x-argument (recall X_i = x − h_n U_i) yields
F(y | x − h_n U_i) = F(y|x) − h_n U_i F'(y|x) + (1/2) h_n^2 U_i^2 F''(y|x) + o(h_n^2),
F^2(y | x − h_n U_i) = F^2(y|x) − 2 h_n U_i F(y|x) F'(y|x) + h_n^2 U_i^2 [F'(y|x)]^2 + h_n^2 U_i^2 F(y|x) F''(y|x) + o(h_n^2).
Let condition (A2)(ii) hold and write A = 1/[Σ_i K(U_i)]^2. Then, from Lemma 3.1.2,
Var[F_n(y|x)] = A Σ_{i=1}^{n} K^2(U_i) [F(y|X_i) − F^2(y|X_i)]
= A [F(y|x) − F^2(y|x)] Σ_i K^2(U_i) − A h_n [F'(y|x) − 2F(y|x)F'(y|x)] Σ_i U_i K^2(U_i) + A O(h_n^2) Σ_i U_i^2 K^2(U_i).
By the symmetry of K, (A2)(ii), and the integral approximation of the last lemma, the second and third terms are of smaller order than the first, so that
Var[F_n(y|x)] ≈ ( Σ_i K^2(U_i) / [Σ_i K(U_i)]^2 ) [F(y|x) − F^2(y|x)] ≈ (1/(nh_n)) [F(y|x) − F^2(y|x)] ∫ K^2(u) du.
So the proof of this theorem is completed.
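For a quick numerical reading of (3.1.6), the sketch below (Python; the Epanechnikov kernel constants and the plug-in values of F(y|x) and F''(y|x) are illustrative assumptions) evaluates the squared-bias and variance terms separately.

```python
import numpy as np

def amse_cdf(n, h, F, Fxx, mu2K=0.2, RK=0.6):
    """Asymptotic MSE approximation (3.1.6) for F_n(y|x).
    F    : plug-in value of F(y|x)
    Fxx  : plug-in value of F''(y|x) = d^2 F(y|x)/dx^2
    mu2K : int u^2 K(u) du  (1/5 for the Epanechnikov kernel)
    RK   : int K(u)^2 du    (3/5 for the Epanechnikov kernel)"""
    bias2 = (0.5 * h**2 * Fxx * mu2K) ** 2
    var = (F - F**2) * RK / (n * h)
    return bias2, var, bias2 + var

print(amse_cdf(n=1000, h=0.2, F=0.5, Fxx=1.0))
```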
Thus the bias of F_n(y|x) depends on the smoothness of the underlying conditional distribution function through F''(y|x). It is now possible to give a formal assessment of the asymptotic mean squared error.
Observe that the mean squared error depends on the second derivative of the conditional distribution function and on the difference F(y|x) − F^2(y|x). This means that the variance of the estimator is highest in the middle of the distribution, since the maximum of F(y|x) − F^2(y|x) is 1/4 and is attained when F(y|x) = 1/2.
From the last theorem it follows that the kernel estimator (3.1.1) is consistent. Next, the asymptotic normality of (nh_n)^{1/2}(F_n(y|x) − E[F_n(y|x)]) and of (nh_n)^{1/2}(F_n(y|x) − F(y|x)) is shown.
Theorem 3.1.5.
Let the conditions of the last theorem be satisfied. Then it holds, for n → ∞, that
(nh_n)^{1/2} (F_n(y|x) − E[F_n(y|x)]) →_d N(0, [F(y|x) − F^2(y|x)] ∫ K^2(u) du).   (3.1.7)
Proof:
To prove this theorem, we verify Liapounov's condition. Writing K_i = K((x − X_i)/h_n), let
Q_{n,i}(x) = (K_i / Σ_j K_j) [I_{Y_i ≤ y} − F(y|X_i)] / (Var[F_n(y|x)])^{1/2}.
Therefore,
Σ_{i=1}^{n} Q_{n,i}(x) = { Σ_{i=1}^{n} (K_i / Σ_j K_j) I_{Y_i ≤ y} − Σ_{i=1}^{n} (K_i / Σ_j K_j) F(y|X_i) } / (Var[F_n(y|x)])^{1/2},
that is,
(F_n(y|x) − E[F_n(y|x)]) / (Var[F_n(y|x)])^{1/2} = Σ_{i=1}^{n} Q_{n,i}(x).
Asymptotic normality follows if Liapounov's condition
lim_{n→∞} Σ_{i=1}^{n} E|Q_{n,i}(x)|^3 = lim_{n→∞} Σ_{i=1}^{n} E| (K_i / Σ_j K_j) [I_{Y_i ≤ y} − F(y|X_i)] |^3 / (Var[F_n(y|x)])^{3/2} = 0
is satisfied. With the integral approximation, it holds for the numerator that
Σ_{i=1}^{n} E| (K_i / Σ_j K_j) [I_{Y_i ≤ y} − F(y|X_i)] |^3 = Σ_{i=1}^{n} |K_i / Σ_j K_j|^3 E|I_{Y_i ≤ y} − F(y|X_i)|^3
≤ Σ_{i=1}^{n} |K_i / Σ_j K_j|^3 = Σ_i K_i^3 / (Σ_i K_i)^3
≈ nh_n ∫ K^3(u) du / (nh_n ∫ K(u) du)^3 = O(1/(n^2 h_n^2)),   (by Lemma 3.1.3)
while for the variance of F_n(y|x) it follows from the last theorem that
Var[F_n(y|x)] = O(1/(nh_n)), and hence (Var[F_n(y|x)])^{3/2} = O(1/(n^{3/2} h_n^{3/2})).
It follows for Liapounov's condition that
lim_{n→∞} Σ_i E|Q_{n,i}(x)|^3 ≤ lim_{n→∞} O(1/(n^2 h_n^2)) / O(1/(n^{3/2} h_n^{3/2})) = lim_{n→∞} O(1/(n^{1/2} h_n^{1/2})) = 0,
since nh_n → ∞. From Liapounov's condition and the variance of F_n(y|x) from the last theorem, the asymptotic normality
(F_n(y|x) − E[F_n(y|x)]) / (Var[F_n(y|x)])^{1/2} →_d N(0, 1)
follows. Therefore,
(nh_n)^{1/2} (F_n(y|x) − E[F_n(y|x)]) →_d N(0, [F(y|x) − F^2(y|x)] ∫ K^2(u) du).
Corollary 3.1.6.
Let the conditions of the last theorem be satisfied and let nh_n^5 → 0 as n → ∞. Then it follows that
(nh_n)^{1/2} (F_n(y|x) − F(y|x)) →_d N(0, [F(y|x) − F^2(y|x)] ∫ K^2(u) du).   (3.1.8)
Proof:
The last theorem gives the asymptotic normality of (nh_n)^{1/2}(F_n(y|x) − E[F_n(y|x)]). We may therefore replace E[F_n(y|x)] by F(y|x) provided that (nh_n)^{1/2}(E[F_n(y|x)] − F(y|x)) converges to zero as n → ∞. From (3.1.5),
E[F_n(y|x)] − F(y|x) = (h_n^2/2) (Σ_i U_i^2 K(U_i) / Σ_i K(U_i)) F''(y|x) + o(h_n^2) = O(h_n^2).
That is,
(nh_n)^{1/2} (E[F_n(y|x)] − F(y|x)) = (nh_n)^{1/2} O(h_n^2) = O((nh_n^5)^{1/2}).
With nh_n^5 → 0 as n → ∞, the asymptotic normality of (nh_n)^{1/2}(F_n(y|x) − F(y|x)) follows.
The above theorems deal with the estimator of the conditional distribution function. Now the behaviour of the estimator of the conditional quantile is analyzed. Assume that F_n(q_{n,α}(x)|x) = F(q_α(x)|x) = α, that q_α(x) is unique, and that the Y_i are independent.
Now, let
H_{n,α}(θ(x)) = Σ_i ( K((x − X_i)/h_n) / Σ_j K((x − X_j)/h_n) ) [α − I_{Y_i ≤ θ(x)}] = Σ_i H_{i,α}(θ(x)).   (3.1.9)
Using the central limit theorem,
(H_{n,α}(θ(x)) − E[H_{n,α}(θ(x))]) / (Var[H_{n,α}(θ(x))])^{1/2} →_d N(0, 1), n → ∞.   (3.1.10)
With Hn,α(θ(x)) the mean squared error of qn,α(x) can be calculated.
Theorem 3.1.7.
Let the conditions of Theorem 3.1.4 be satisfied and let q_{n,α}(x) and q_α(x), given by F_n(q_{n,α}(x)|x) = F(q_α(x)|x) = α, be unique. Then it holds that
MSE[q_{n,α}(x)] ≈ [ (1/2) h_n^2 (F^{(2,0)}(q_α(x)|x) / f(q_α(x)|x)) ∫ u^2 K(u) du ]^2 + (1/(nh_n)) (α(1 − α) / f^2(q_α(x)|x)) ∫ K^2(u) du.   (3.1.11)
Proof:
By the Taylor expansions of the conditional distribution function used in the proof of Theorem 3.1.4, applied with θ(x) = q_{n,α}(x), it follows that
E[H_{n,α}(q_{n,α}(x))] ≈ f(q_α(x)|x) [q_{n,α}(x) − q_α(x)] + (1/2) h_n^2 F^{(2,0)}(q_α(x)|x) Σ_i U_i^2 K(U_i) / Σ_i K(U_i),
and, with the integral approximation,
E[H_{n,α}(q_{n,α}(x))] ≈ f(q_α(x)|x) [q_{n,α}(x) − q_α(x)] + (1/2) h_n^2 F^{(2,0)}(q_α(x)|x) ∫ u^2 K(u) du.
Now,
Var[H_{n,α}(q_{n,α}(x))] = (1/[Σ_i K(U_i)]^2) Σ_i K^2(U_i) [F(q_{n,α}(x) | x − h_n U_i) − F^2(q_{n,α}(x) | x − h_n U_i)]
≈ (1/[Σ_i K(U_i)]^2) α(1 − α) Σ_i K^2(U_i)
≈ (1/(nh_n)) α(1 − α) ∫ K^2(u) du.
Each nh_n H_{i,α}(q_{n,α}(x)) is a bounded random variable and Σ_i Var(nh_n H_{i,α}(q_{n,α}(x))) → ∞ as n → ∞. From this, the asymptotic normality (Theorem 1.1.7, Ch. 1) follows:
(H_{n,α}(q_{n,α}(x)) − E[H_{n,α}(q_{n,α}(x))]) / (Var[H_{n,α}(q_{n,α}(x))])^{1/2} →_d N(0, 1), n → ∞.
Since F_n(q_{n,α}(x)|x) = α, we have H_{n,α}(q_{n,α}(x)) = 0. This implies, for n → ∞,
{ f(q_α(x)|x) [q_{n,α}(x) − q_α(x)] + (1/2) h_n^2 F^{(2,0)}(q_α(x)|x) ∫ u^2 K(u) du } / { (1/(nh_n)) α(1 − α) ∫ K^2(u) du }^{1/2} →_d N(0, 1).   (3.1.12)
From this, the bias and the variance of q_{n,α}(x) can be calculated, which gives (3.1.11).
The bias depends, through F^{(2,0)}(q_α(x)|x), on the smoothness of the quantile function. Because of the division by the conditional density at q_α(x), the steepness of the conditional distribution function also affects the bias and the variance: the flatter the conditional distribution function is near q_α(x) (that is, the smaller f(q_α(x)|x) is), the greater the mean squared error.
From (3.1.12) and the method of proof of Corollary 3.1.6, asymptotic normality can now be established.
Corollary 3.1.8.
Let the conditions of Theorem 3.1.7 be satisfied and let nh_n^5 → 0 as n → ∞. Then (nh_n)^{1/2}(q_{n,α}(x) − q_α(x)) is asymptotically normal with mean
(1/2) (nh_n^5)^{1/2} (F^{(2,0)}(q_α(x)|x) / f(q_α(x)|x)) ∫ u^2 K(u) du
and variance
α(1 − α) ∫ K^2(u) du / f^2(q_α(x)|x);
in particular, since nh_n^5 → 0,
(nh_n)^{1/2} (q_{n,α}(x) − q_α(x)) →_d N( 0, α(1 − α) ∫ K^2(u) du / f^2(q_α(x)|x) ).   (3.1.13)
Proof:
From Equation (3.1.12) we have
{ f(q_α(x)|x) [q_{n,α}(x) − q_α(x)] + (1/2) h_n^2 F^{(2,0)}(q_α(x)|x) ∫ u^2 K(u) du } / { (1/(nh_n)) α(1 − α) ∫ K^2(u) du }^{1/2} →_d N(0, 1).
This implies that, asymptotically, the expectation of the left-hand side is 0 and its variance is 1. From the properties of the expectation and the variance, we obtain
(nh_n)^{1/2} E[q_{n,α}(x) − q_α(x)] → (1/2) (nh_n^5)^{1/2} (F^{(2,0)}(q_α(x)|x) / f(q_α(x)|x)) ∫ u^2 K(u) du   (3.1.14)
and
(nh_n) Var[q_{n,α}(x) − q_α(x)] → α(1 − α) ∫ K^2(u) du / f^2(q_α(x)|x).   (3.1.15)
From (3.1.14) and (3.1.15), (nh_n)^{1/2}(q_{n,α}(x) − q_α(x)) is asymptotically normal with the mean and variance stated in the corollary. Since nh_n^5 → 0, the mean tends to zero and we obtain (3.1.13).
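Result (3.1.13) suggests an approximate large-sample confidence interval for q_α(x). The sketch below (Python/NumPy) is one possible plug-in implementation; the Epanechnikov kernel, the kernel estimate of f(q_α(x)|x) and the normal critical value 1.96 are illustrative assumptions.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

RK = 0.6                                   # int K^2(u) du for the Epanechnikov kernel

def cond_density(y, x, X, Y, h):
    """Kernel estimate of f(y|x), used as a plug-in for the asymptotic variance."""
    wx = epanechnikov((x - X) / h)
    return np.sum(wx * epanechnikov((y - Y) / h) / h) / np.sum(wx)

def quantile_ci(q_hat, alpha, x, X, Y, h, z=1.96):
    """Approximate 95% interval for q_alpha(x) based on (3.1.13)."""
    f_hat = cond_density(q_hat, x, X, Y, h)
    se = np.sqrt(alpha * (1.0 - alpha) * RK / (len(X) * h * f_hat**2))
    return q_hat - z * se, q_hat + z * se
```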
3.2 Joint Asymptotic Distribution of the Conditional Quantiles
Let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be independent and identically distributed two-dimensional random variables with joint density function f(x, y) and joint distribution function F(x, y) = ∫_{-∞}^{x} ∫_{-∞}^{y} f(u, v) dv du. The marginal density function of X is g(x) = ∫_{-∞}^{∞} f(x, y) dy. The conditional density function and the conditional distribution function of Y given X = x are f(y|x) = f(x, y)/g(x) and
F(y|x) = ∫_{-∞}^{y} f(u|x) du = ∫_{-∞}^{y} f(x, u) du / g(x),
respectively. Now, for i = 1, 2, let q_{α_i}(x) denote the α_i-th quantile of the conditional distribution F(y|x), i.e., a root of the equation F(q(x)|x) = α_i, with 0 < α_1 < α_2 < 1.
Let f_n(x, y), g_n(x), f_n(y|x) and F_n(y|x) be the estimators of f(x, y), g(x), f(y|x) and F(y|x), respectively, defined as follows:
f_n(x, y) = (1/(nh_n^2)) Σ_{i=1}^{n} K((x − X_i)/h_n) K((y − Y_i)/h_n),
g_n(x) = ∫_{-∞}^{∞} f_n(x, y) dy = (1/(nh_n)) Σ_{i=1}^{n} K((x − X_i)/h_n),
f_n(y|x) = f_n(x, y) / g_n(x),
F_n(y|x) = ∫_{-∞}^{y} f_n(u|x) du = B_n(x, y) / g_n(x),
where K is a probability density function, h_n is a sequence of positive numbers converging to zero, and
B_n(x, y) = (1/(nh_n)) Σ_{i=1}^{n} G((y − Y_i)/h_n) K((x − X_i)/h_n), with G(y) = ∫_{-∞}^{y} K(u) du.
Now, we consider for i =1,2 two estimators qαi,n(x) of qαi(x) defined by the root of the
equation Fn(q(x)|x) = αi, i = 1, 2. We shall call qαi,n(x) the conditional sample quantiles.
We prove that under some regularity conditions these estimators are strongly consistent
and asymptotically normally distributed.
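The estimators of this section differ from (3.1.1) in that the indicator is replaced by the integrated kernel G. A minimal Python sketch is given below; the Gaussian kernel (whose integral G is the standard normal cdf, computed here with the error function), the grid search for the quantile and the bandwidth are illustrative assumptions.

```python
import numpy as np
from math import erf

def K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def G(u):
    """G(y) = integral of K up to y; the standard normal cdf for the Gaussian kernel."""
    return np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in np.atleast_1d(u)])

def F_n_smooth(y, x, X, Y, h):
    """Smoothed conditional cdf estimator F_n(y|x) = B_n(x, y) / g_n(x)."""
    Bn = np.sum(G((y - Y) / h) * K((x - X) / h)) / (len(X) * h)
    gn = np.sum(K((x - X) / h)) / (len(X) * h)
    return Bn / gn

def q_n_smooth(alpha, x, X, Y, h, grid):
    """Conditional sample quantile: approximate root of F_n(q|x) = alpha on a grid."""
    vals = np.array([F_n_smooth(y, x, X, Y, h) for y in grid])
    idx = min(np.searchsorted(vals, alpha), len(grid) - 1)
    return grid[idx]
```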
Now, we shall assume the following conditions:
(A1) The conditional distribution function satisfies:
(i) F^{(i,j)}(x, y) = ∂^{i+j} F(x, y)/∂x^i ∂y^j exist and are bounded for (i, j) = (1, 2), (2, 0), (2, 1), (3, 0).
(ii) The conditional population quantiles q_{α_i}(x), defined by
F(q_{α_i}(x)|x) = F^{(1,0)}(x, q_{α_i}(x)) / g(x) = α_i, i = 1, 2,
are unique.
(iii) f(x, y) is uniformly continuous.
(A2) The marginal density function of X satisfies:
(i) g^{(i)}(x) = ∫_{-∞}^{∞} ∂^i f(x, y)/∂x^i dy exists for i = 1, 2.
(ii) Both h(x) = ∫_{-∞}^{∞} |∂f(x, y)/∂x| dy and g^{(i)}(x) are bounded for i = 1, 2.
(iii) g(x) is uniformly continuous.
(A3) The kernel K is a Borel function and satisfies the following:
(i) K(u) is a function of bounded variation.
(ii) ∫_{-∞}^{∞} u K(u) du = 0.
(iii) ∫_{-∞}^{∞} u^2 K(u) du < ∞.
(A4) h_n is a sequence of positive numbers satisfying:
(i) h_n = n^{-δ}, 1/5 < δ < 1/4; i.e. lim_{n→∞} nh_n^4 = ∞ and lim_{n→∞} nh_n^5 = 0.
Lemma 3.2.1.
Under the conditions (A2)(i, ii), (A3)(i) and (A4)(i), we have
lim_{n→∞} sup_{x∈R} |g_n(x) − g(x)| = 0
with probability one.
Proof: see [17].
Lemma 3.2.2.
Under the conditions (A1)(i) and (A3)(iii), we have
sup_{x∈R} |E B_n(x, y) − F^{(1,0)}(x, y)| = O(h_n).
Proof:
By the definition of B_n(x, y), we have
E B_n(x, y) = E[ (1/(nh_n)) Σ_{i=1}^{n} G((y − Y_i)/h_n) K((x − X_i)/h_n) ]
= ∫_{-∞}^{∞} ∫_{-∞}^{∞} (1/h_n) G((y − v)/h_n) K((x − u)/h_n) f(u, v) du dv
= h_n ∫_{-∞}^{∞} ∫_{-∞}^{∞} G(u) K(v) f(x − vh_n, y − uh_n) du dv
= ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u) K(v) F^{(1,0)}(x − vh_n, y − uh_n) du dv   (integrating by parts in u)
= ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u) K(v) [F^{(1,0)}(x, y) − vh_n F^{(2,0)}(x, y) − uh_n F^{(1,1)}(x, y) + O(h_n^2)] du dv
= F^{(1,0)}(x, y) + O(h_n).
Then we have the result.
Lemma 3.2.3.
Under the conditions (A1)(i), (A3)(i, iii) and (A4)(i), we have
lim_{n→∞} sup_{x∈R} |B_n(x, y) − F^{(1,0)}(x, y)| = 0
with probability one.
Proof:
By the above lemma, it suffices to show that
lim_{n→∞} sup_{x∈R} |B_n(x, y) − E B_n(x, y)| = 0
with probability one. Let S_n(u, v) be the two-dimensional empirical distribution function defined by
S_n(u, v) = (1/n) Σ_{i=1}^{n} I(u − X_i) I(v − Y_i),
where I(x − y) = 1 if x − y ≥ 0 and I(x − y) = 0 if x − y < 0; that is, I(x − y) = 1 if x ≥ y and 0 if x < y.
Now,
sup_{x∈R} |B_n(x, y) − E B_n(x, y)|
= sup_{x∈R} | ∫∫ (1/h_n) G((y − v)/h_n) K((x − u)/h_n) dS_n(u, v) − ∫∫ (1/h_n) G((y − v)/h_n) K((x − u)/h_n) dF(u, v) |
= sup_{x∈R} | ∫∫ (1/h_n) G((y − v)/h_n) K((x − u)/h_n) d[S_n(u, v) − F(u, v)] |
= h_n^{-1} sup_{x∈R} | ∫∫ [S_n(u, v) − F(u, v)] dG((y − v)/h_n) dK((x − u)/h_n) |   (integrating by parts)
≤ h_n^{-1} μ sup_{(u,v)∈R^2} |S_n(u, v) − F(u, v)|,
where μ = ∫_{-∞}^{∞} |K^{(1)}(t)| dt.
Hence, for any ε > 0, we have
Σ_{n=1}^{∞} P[ sup_{x∈R} |B_n(x, y) − E B_n(x, y)| ≥ ε ] ≤ Σ_{n=1}^{∞} P[ h_n^{-1} μ sup_{(u,v)∈R^2} |S_n(u, v) − F(u, v)| ≥ ε ]
= Σ_{n=1}^{∞} P[ sup_{(u,v)∈R^2} |S_n(u, v) − F(u, v)| ≥ h_n ε / μ ]
< C_1 Σ_{n=1}^{∞} exp(−C_2 ε^2 n h_n^2 / μ^2) < ∞,   (by Lemma 1.1.9)
where C_1 and C_2 are positive constants. Since the series
Σ_{n=1}^{∞} P[ sup_{x∈R} |B_n(x, y) − E B_n(x, y)| ≥ ε ]
converges, the Borel-Cantelli lemma gives the result.
Lemma 3.2.4.
Under the conditions (A1)(i), (A2)(ii), (A3)(i, iii) and (A4)(i), if g(x) > 0, then
lim_{n→∞} sup_{y∈R} |F_n(y|x) − F(y|x)| = 0
with probability one.
Proof:
Since F(y|x) = ∫_{-∞}^{y} f(x, u) du / g(x) = F^{(1,0)}(x, y)/g(x), by Lemma 3.2.1 and Lemma 3.2.3 we have
lim_{n→∞} sup_{y∈R} |F_n(y|x) − F(y|x)| = lim_{n→∞} sup_{y∈R} | B_n(x, y)/g_n(x) − F^{(1,0)}(x, y)/g(x) | = 0
with probability one.
Lemma 3.2.5.
Under the conditions of Lemma 3.2.4, if g(x) > 0, we have
lim_{n→∞} |F(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x)| = 0
with probability one, (i = 1, 2).
Proof:
Since F_n(q_{α_i,n}(x)|x) = α_i = F(q_{α_i}(x)|x),
|F(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x)| = |F(q_{α_i,n}(x)|x) − F_n(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x) + F_n(q_{α_i,n}(x)|x)|
≤ |F(q_{α_i,n}(x)|x) − F_n(q_{α_i,n}(x)|x)| + |F(q_{α_i}(x)|x) − F_n(q_{α_i,n}(x)|x)|
≤ 2 sup_{y∈R} |F_n(y|x) − F(y|x)|
(the second term is in fact zero, since both of its quantities equal α_i). Applying the last lemma, we get
lim_{n→∞} |F(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x)| ≤ 2 lim_{n→∞} sup_{y∈R} |F_n(y|x) − F(y|x)| = 0,
and thus
lim_{n→∞} |F(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x)| = 0.
Now, the following theorem deals with the strong consistency of the estimators qαi,n(x), i =
1, 2.
Theorem 3.2.6.
Under the conditions (A1)(i), (A3)(i, iii) and (A4)(i), if g(x) > 0, then
lim_{n→∞} q_{α_i,n}(x) = q_{α_i}(x), i = 1, 2,
with probability one.
Proof:
We prove the theorem only for i = 1; that is, we want to show that
lim_{n→∞} q_{α_1,n}(x) = q_{α_1}(x)
with probability one. Since q_{α_1}(x) is unique, for any ε > 0 there exists η = η(ε) > 0, defined by
η(ε) = min{ F(q_{α_1}(x) + ε|x) − F(q_{α_1}(x)|x), F(q_{α_1}(x)|x) − F(q_{α_1}(x) − ε|x) },
such that |q_{α_1,n}(x) − q_{α_1}(x)| > ε implies |F(q_{α_1,n}(x)|x) − F(q_{α_1}(x)|x)| > η(ε); by Lemma 3.2.5, this can happen only finitely often with probability one.
Expanding F_n(q_{α_i,n}(x)|x) around q_{α_i}(x), we obtain
F(q_{α_i}(x)|x) = α_i = F_n(q_{α_i,n}(x)|x) = F_n(q_{α_i}(x)|x) + (q_{α_i,n}(x) − q_{α_i}(x)) f_n(q_i|x),
where q_i is some random point between q_{α_i,n}(x) and q_{α_i}(x), i = 1, 2. Hence
q_{α_i,n}(x) − q_{α_i}(x) = [F(q_{α_i}(x)|x) − F_n(q_{α_i}(x)|x)] / f_n(q_i|x),
and so
(nh_n)^{1/2} (q_{α_i,n}(x) − q_{α_i}(x)) = −(nh_n)^{1/2} [F_n(q_{α_i}(x)|x) − F(q_{α_i}(x)|x)] / f_n(q_i|x), i = 1, 2.   (3.2.1)
Since lim_{n→∞} F_n(q_{α_i}(x)|x) = F(q_{α_i}(x)|x) with probability one, we have the result.
Lemma 3.2.7.
Under the conditions (A1)(i), (A3)(i, iii) and (A4)(i), if g(x) > 0, then
f_n(q_i|x) = f(q_{α_i}(x)|x) + o_p(1), i = 1, 2.
Proof:
Since
|f_n(q_i|x) − f(q_{α_i}(x)|x)| ≤ |f_n(q_i|x) − f(q_i|x)| + |f(q_i|x) − f(q_{α_i}(x)|x)|
≤ sup_{y∈R} |f_n(y|x) − f(y|x)| + |f(q_i|x) − f(q_{α_i}(x)|x)| = o_p(1),
where the first term tends to zero by the uniform consistency of f_n(·|x) and the second tends to zero by the continuity of f(·|x) and the consistency of q_i.
To obtain the joint asymptotic distribution of q_{α_1,n}(x) and q_{α_2,n}(x), we define, for i = 1, 2 and j = 1, 2, ..., n, the following:
U*_{nj}(x) = (1/h_n) K((x − X_j)/h_n),
V*_{nij}(x) = (1/h_n) G((q_{α_i}(x) − Y_j)/h_n) K((x − X_j)/h_n),
U_{nj}(x) = (h_n)^{1/2} [U*_{nj}(x) − E U*_{nj}(x)],
V_{nij}(x) = (h_n)^{1/2} [V*_{nij}(x) − E V*_{nij}(x)],
U_n(x) = Σ_{j=1}^{n} U_{nj}(x),   V_{ni}(x) = Σ_{j=1}^{n} V_{nij}(x),
W_{nj} = (U_{nj}(x), V_{n1j}(x), V_{n2j}(x))^T,
n^{1/2} Z_n = (U_n(x), V_{n1}(x), V_{n2}(x))^T,
w_i(x) = F^{(1,0)}(x, q_{α_i}(x)),
n^{1/2} Z*_n = (h_n)^{1/2} ( Σ_{j=1}^{n} [U*_{nj}(x) − g(x)], Σ_{j=1}^{n} [V*_{n1j}(x) − w_1(x)], Σ_{j=1}^{n} [V*_{n2j}(x) − w_2(x)] )^T,
A = ∫_{-∞}^{∞} K^2(u) du ·
[ g(x)    w_1(x)   w_2(x)
  w_1(x)  w_1(x)   w_1(x)
  w_2(x)  w_1(x)   w_2(x) ].
Lemma 3.2.8.
Under the conditions (A1)(i), (A2)(ii) and (A3)(i, iii), the following results hold:
1. lim_{n→∞} E U_{nj}^2(x) = g(x) ∫_{-∞}^{∞} K^2(u) du,
2. lim_{n→∞} E V_{nij}^2(x) = w_i(x) ∫_{-∞}^{∞} K^2(u) du, i = 1, 2,
3. lim_{n→∞} E U_{nj}(x) V_{nij}(x) = w_i(x) ∫_{-∞}^{∞} K^2(u) du, i = 1, 2,
4. lim_{n→∞} E V_{n1j}(x) V_{n2j}(x) = w_1(x) ∫_{-∞}^{∞} K^2(u) du.
Proof:
(1) Since
E U_{nj}^2(x) = h_n [ (1/h_n^2) ∫_{-∞}^{∞} K^2((x − u)/h_n) g(u) du − ( (1/h_n) ∫_{-∞}^{∞} K((x − u)/h_n) g(u) du )^2 ],
we obtain
lim_{n→∞} E U_{nj}^2(x) = lim_{n→∞} (1/h_n) ∫_{-∞}^{∞} K^2((x − u)/h_n) g(u) du − lim_{n→∞} h_n ( (1/h_n) ∫_{-∞}^{∞} K((x − u)/h_n) g(u) du )^2
= g(x) ∫_{-∞}^{∞} K^2(u) du − 0 = g(x) ∫_{-∞}^{∞} K^2(u) du.
(2) Similarly (the squared-mean term vanishes as in (1)),
lim_{n→∞} E V_{nij}^2(x) = lim_{n→∞} (1/h_n) ∫_{-∞}^{∞} ∫_{-∞}^{∞} G^2((q_{α_i}(x) − v)/h_n) K^2((x − u)/h_n) f(u, v) du dv
= lim_{n→∞} ∫_{-∞}^{∞} K^2(u) [ ∫_{-∞}^{∞} G^2((q_{α_i}(x) − v)/h_n) f(v | x − uh_n) dv ] g(x − uh_n) du
= g(x) ∫_{-∞}^{∞} K^2(u) du · ∫_{-∞}^{q_{α_i}(x)} f(v|x) dv
= g(x) F(q_{α_i}(x)|x) ∫_{-∞}^{∞} K^2(u) du = w_i(x) ∫_{-∞}^{∞} K^2(u) du,
since G^2((q_{α_i}(x) − v)/h_n) tends to 1 for v < q_{α_i}(x) and to 0 for v > q_{α_i}(x), and w_i(x) = F^{(1,0)}(x, q_{α_i}(x)) = g(x) F(q_{α_i}(x)|x).
The proofs of (3) and (4) are similar and are omitted.
Lemma 3.2.9.
Under the conditions (A1)(i), (A2)(ii), (A3)(i, iii) and (A4)(ii), Z_n converges in distribution to a trivariate normal random variable with mean vector 0 and covariance matrix A.
Proof:
To prove this lemma, it is sufficient to show that C^T Z_n converges in distribution to C^T Z for any real vector C = (C_1, C_2, C_3)^T ≠ 0, where Z denotes the trivariate normal random variable with mean vector 0 and covariance matrix A.
Now, we define for j = 1, 2, ..., n the following:
σ_{nj}^2 = Var[C^T W_{nj}],   ρ_{nj}^3 = E|C^T W_{nj}|^3,
and let σ_n^2 = Σ_{j=1}^{n} σ_{nj}^2, ρ_n^3 = Σ_{j=1}^{n} ρ_{nj}^3.
Next, for any C ≠ 0, we have
lim_{n→∞} σ_{nj}^2 = lim_{n→∞} Var[C_1 U_{nj}(x) + C_2 V_{n1j}(x) + C_3 V_{n2j}(x)] = C^T A C > 0, j = 1, 2, ..., n.
Using computations similar to those in Lemma 3.2.8, we have
E|U_{n1}(x)|^3 = O(h_n^{-1/2}) and E|V_{ni1}(x)|^3 = O(h_n^{-1/2}), i = 1, 2.
Therefore,
ρ_n^3 = n E|C^T W_{n1}|^3 = n E| C_1 U_{n1}(x) + C_2 V_{n11}(x) + C_3 V_{n21}(x) |^3
≤ n E{ (C_1^2 + C_2^2 + C_3^2) (U_{n1}^2(x) + V_{n11}^2(x) + V_{n21}^2(x)) }^{3/2}   (by the Cauchy-Schwarz inequality)
≤ 3^{3/2} n (C_1^2 + C_2^2 + C_3^2)^{3/2} max{ E|U_{n1}(x)|^3, E|V_{n11}(x)|^3, E|V_{n21}(x)|^3 }
= O(n h_n^{-1/2}).
Hence it follows that lim_{n→∞} ρ_n/σ_n = 0. By Liapounov's version of the central limit theorem we conclude that C^T Z_n = n^{-1/2} Σ_{j=1}^{n} C^T W_{nj} converges in distribution to a univariate normal random variable with mean 0 and variance C^T A C.
An appeal to the Cramer-Wold theorem completes the proof of this lemma.
Lemma 3.2.10.
Under the conditions (A1)(i), (A2)(ii) and (A3)(i, iii), Z*_n converges in distribution to a trivariate normal random variable with mean vector 0 and covariance matrix A.
Proof:
By the definitions of Z_n and Z*_n,
n^{1/2} (Z*_n − Z_n) = (h_n)^{1/2} ( Σ_{j=1}^{n} [U*_{nj}(x) − g(x)] − Σ_{j=1}^{n} [U*_{nj}(x) − E U*_{nj}(x)], Σ_{j=1}^{n} [V*_{n1j}(x) − w_1(x)] − Σ_{j=1}^{n} [V*_{n1j}(x) − E V*_{n1j}(x)], Σ_{j=1}^{n} [V*_{n2j}(x) − w_2(x)] − Σ_{j=1}^{n} [V*_{n2j}(x) − E V*_{n2j}(x)] )^T
= n (h_n)^{1/2} ( E U*_{nj}(x) − g(x), E V*_{n1j}(x) − w_1(x), E V*_{n2j}(x) − w_2(x) )^T.
That is,
Z*_n − Z_n = (nh_n)^{1/2} C_n, where C_n = ( E U*_{nj}(x) − g(x), E V*_{n1j}(x) − w_1(x), E V*_{n2j}(x) − w_2(x) )^T.
Since
E U*_{n1}(x) − g(x) = E[(1/h_n) K((x − X_1)/h_n)] − g(x) = (1/h_n) ∫ K((x − u)/h_n) g(u) du − g(x)
= ∫ K(u) g(x − uh_n) du − g(x)
= ∫ K(u) [g(x) − uh_n g'(x) + (u^2 h_n^2/2) g''(x) + o(h_n^2)] du − g(x)
= (h_n^2/2) g''(x) ∫ u^2 K(u) du + o(h_n^2) = O(h_n^2),
using ∫ K(u) du = 1 and ∫ u K(u) du = 0, and since the other components of C_n can be treated similarly, we obtain C_n = O(h_n^2). Now
Z*_n = Z_n + (Z*_n − Z_n) = Z_n + (nh_n)^{1/2} C_n = Z_n + O((nh_n^5)^{1/2}) = Z_n + o(1),
because nh_n^5 → 0. Then the proof of this lemma is complete.
Now, we will consider the main theorem of this section.
Theorem 3.2.11.
Under the conditions (A1)(i) and (A4)(ii), if g(x) > 0 and f(x, q_{α_i}(x)) > 0, i = 1, 2, then (q_{α_1,n}(x), q_{α_2,n}(x))^T is asymptotically normally distributed with mean vector (q_{α_1}(x), q_{α_2}(x))^T and covariance matrix
B_n = ( ∫_{-∞}^{∞} K^2(u) du / (nh_n g(x)) ) [ b_11  b_12
                                               b_12  b_22 ],
where
b_ij = α_i (1 − α_j) / [ f(q_{α_i}(x)|x) f(q_{α_j}(x)|x) ], 1 ≤ i ≤ j ≤ 2.
Proof:
Let H be the function from R^3 to R^2 defined by
H(y) = (y_2/y_1, y_3/y_1)^T, y = (y_1, y_2, y_3)^T,
and let θ = (g(x), w_1(x), w_2(x))^T. We can write Z*_n = (nh_n)^{1/2} (T_n − θ), where T_n = (T_{n1}, T_{n2}, T_{n3})^T with
T_{n1} = (1/n) Σ_{j=1}^{n} U*_{nj}(x) and T_{ni} = (1/n) Σ_{j=1}^{n} V*_{n(i−1)j}(x), i = 2, 3.
From [17], with n^{1/2} replaced by (nh_n)^{1/2}, we conclude that
(nh_n)^{1/2} [H(T_n) − H(θ)] = (nh_n)^{1/2} ( F_n(q_{α_1}(x)|x) − F(q_{α_1}(x)|x), F_n(q_{α_2}(x)|x) − F(q_{α_2}(x)|x) )^T
converges in distribution to a bivariate normal random variable with mean vector 0 and covariance matrix DAD^T, where D is the matrix of partial derivatives of H evaluated at θ, given by
D = [ −w_1(x)/g^2(x)   1/g(x)   0
      −w_2(x)/g^2(x)   0        1/g(x) ].
Then
DAD^T = ( ∫_{-∞}^{∞} K^2(u) du / g(x) ) × [ α_1(1 − α_1)   α_1(1 − α_2)
                                            α_1(1 − α_2)   α_2(1 − α_2) ].
By (3.2.1), we have the result.
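In applications, the covariance matrix B_n of Theorem 3.2.11 has to be estimated. The Python sketch below assembles one possible plug-in version of it; the Gaussian kernel, the kernel estimates of g(x) and of the conditional density at the two estimated quantiles, and the bandwidth are illustrative assumptions.

```python
import numpy as np

def K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

RK = 1.0 / (2.0 * np.sqrt(np.pi))          # int K^2(u) du for the Gaussian kernel

def g_n(x, X, h):
    """Kernel estimate of the marginal density g(x)."""
    return np.mean(K((x - X) / h)) / h

def f_n_cond(y, x, X, Y, h):
    """Kernel estimate of the conditional density f(y|x)."""
    wx = K((x - X) / h)
    return np.sum(wx * K((y - Y) / h) / h) / np.sum(wx)

def quantile_cov(x, alphas, q_hats, X, Y, h):
    """Plug-in version of the matrix B_n of Theorem 3.2.11 for
    (q_{alpha_1,n}(x), q_{alpha_2,n}(x)); alphas must be ordered, a1 < a2."""
    a1, a2 = alphas
    f1 = f_n_cond(q_hats[0], x, X, Y, h)
    f2 = f_n_cond(q_hats[1], x, X, Y, h)
    core = np.array([[a1 * (1 - a1) / f1**2,      a1 * (1 - a2) / (f1 * f2)],
                     [a1 * (1 - a2) / (f1 * f2),  a2 * (1 - a2) / f2**2]])
    return RK / (len(X) * h * g_n(x, X, h)) * core
```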
3.3 Mode and Median as a Comparison
In Chapters 2 and 3 we have studied two important aspects of the conditional density function: the conditional mode and the conditional quantiles. As measures of the center of the conditional distribution, we now compare the mode and the median.
First, for the mode:
(i) The bias term vanishes, because of condition (A2)(iv) of Section 2.4.
(ii) The variance equals
( f(x, θ(x)) / [f^{(0,2)}(x, θ(x))]^2 ) ∫∫ [K(u) K^{(1)}(v)]^2 du dv / (nh_n^4).
That is, the MSE depends only on the value of the variance.
Now, for the median, putting α = 0.5 in Theorem 3.1.7 we get:
(i) The bias term is (1/2) h_n^2 ( F^{(2,0)}(q_{0.5}(x)|x) / f(q_{0.5}(x)|x) ) ∫ u^2 K(u) du.
(ii) The variance term is (1/(nh_n)) (1/4) / f^2(q_{0.5}(x)|x) ∫ K^2(u) du.
Recall that, for the mean squared error, MSE(x) = Bias^2[f(x)] + Var[f(x)].
For the median, the bias depends through F^{(2,0)}(q_{0.5}(x)|x) on the smoothness of the quantile function. Because of the division by the conditional density at q_{0.5}(x), the steepness of the conditional distribution function also affects the bias and the variance: the flatter the conditional distribution function is near q_{0.5}(x) (the smaller f(q_{0.5}(x)|x) is), the greater the mean squared error.
Also, we note that among the conditional quantiles the median has the largest variance factor, since F(y|x) − F^2(y|x) ≤ 1/4, with equality when F(y|x) = 1/2.
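The comparison can also be explored by simulation. In the Python sketch below, the kernel conditional mode and the kernel conditional median are computed on the same data; the right-skewed conditional model, the Gaussian kernel, the bandwidth and the grid are illustrative assumptions, chosen so that the true conditional mode and median differ.

```python
import numpy as np

def K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def cond_mode(x, X, Y, h, grid):
    """Kernel conditional mode: argmax of the conditional density estimate."""
    wx = K((x - X) / h)
    dens = np.array([np.sum(wx * K((y - Y) / h)) for y in grid])
    return grid[np.argmax(dens)]

def cond_median(x, X, Y, h):
    """Kernel conditional median: inversion of the weighted empirical cdf (3.1.1)."""
    wx = K((x - X) / h)
    order = np.argsort(Y)
    cdf = np.cumsum(wx[order]) / np.sum(wx)
    return Y[order][np.searchsorted(cdf, 0.5)]

# right-skewed conditional law: Y | X = x ~ x + Exp(1),
# so the conditional mode is x and the conditional median is x + log 2
rng = np.random.default_rng(4)
n = 5000
X = rng.uniform(0, 1, n)
Y = X + rng.exponential(1.0, n)
x0, h = 0.5, 0.08
grid = np.linspace(0, 4, 801)
print(cond_mode(x0, X, Y, h, grid), x0)             # compare with the true mode 0.5
print(cond_median(x0, X, Y, h), x0 + np.log(2.0))   # compare with the true median
```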
Bibliography
[1] Abberger, K. (1997). Quantile Smoothing in Financial Time Series. Statistical Papers 38, 125-148.
[2] Abraham, C., Biau, G. and Cadre, B. (2004). On the Asymptotic Properties of a Simple Estimate of the Mode. ESAIM: Probability and Statistics, Vol. 8, 1-11.
[3] Abraham, C., Biau, G. and Cadre, B. (2003). Simple Estimation of the Mode of a Multivariate Density. The Canadian Journal of Statistics, Vol. 31, 23-34.
[4] Bartle, R.G. and Sherbert, D.R. (1991). Introduction to Real Analysis. Eastern Michigan University.
[5] Devroye, L. (1979). Recursive Estimation of the Mode of a Multivariate Density. The Canadian Journal of Statistics, Vol. 7, 159-167.
[6] Freund, J. (1992). Mathematical Statistics. Arizona State University.
[7] Casella, G. and Berger, R.L. (1990). Statistical Inference. Cornell University, North Carolina State University.
[8] Hogg, R., McKean, J. and Craig, A. (2005). Introduction to Mathematical Statistics. University of Iowa, Western Michigan University, University of Iowa.
[9] Yu, K. (2003). Quantile Regression: Applications and Current Research Areas. University of Plymouth, UK.
[10] Loeve, M. (1960). Probability Theory, 2nd Ed. Van Nostrand, Princeton.
[11] Nada, G. (2002). On the Kernel Density Estimation. Islamic University of Gaza, Palestine.
[12] Parzen, E. (1962). On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, Vol. 33, 1065-1076.
[13] Sen, P.K. (1993). Large Sample Methods in Statistics: An Introduction with Applications. New York.
[14] Rao, C.R. (1965). Linear Statistical Inference and Its Applications. Wiley, New York.
[15] Rosenblatt, M. (1956). Remarks on Some Nonparametric Estimates of the Density Function. The Annals of Mathematical Statistics 27, 832-837.
[16] Royden, H.L. (1997). Real Analysis. Stanford University.
[17] Samanta, M. (1988). Non-Parametric Estimation of Conditional Quantiles. Department of Statistics, University of Manitoba, Canada.
[18] Samanta, M. and Thavaneswaran, A. (1990). Nonparametric Estimation of the Conditional Mode. Communications in Statistics - Theory and Methods, Vol. 19, 4515-4524.
[19] Samanta, M. (1973). Nonparametric Estimation of the Mode of a Multivariate Density. South African Statistical Journal, Vol. 7, 109-117.
[20] Salha, R. (2006). Kernel Estimation of the Conditional Quantiles and Mode for Time Series. University of Macedonia.
[21] Schuster, E. (1972). Joint Asymptotic Distribution of the Estimated Regression Function at a Finite Number of Distinct Points. The Annals of Mathematical Statistics, Vol. 43, 84-88.
[22] Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. School of Mathematics, University of Bath, UK.
[23] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. University of New South Wales, Australia.
[24] Watson, G.S. (1964). Smooth Regression Analysis. Sankhya, Series A, Vol. 26, 359-372.
[25] Whittle, P. (1958). On the Smoothing of Probability Density Functions. Journal of the Royal Statistical Society, Series B, 20, 334-343.