The Asymptotic Distributions of The Kernel
Estimations of The Conditional Mode and Quantiles
December 23, 2008
THE ISLAMIC UNIVERSITY of GAZA
DEANERY of HIGHER STUDIES
FACULTY of SCIENCE
DEPARTMENT of MATHEMATICS
The Asymptotic Distributions of The Kernel Estimations of The Conditional
Mode and Quantiles
PRESENTED BY
Hossam Othman M. El-sayed
SUPERVISED BY
Dr. Raid Salha
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENT
FOR THE DEGREE OF MASTER OF MATHEMATICS
1429-2008
To my family...
Contents
Table of Contents iii
Acknowledgment iv
Abstract v
List of Figures vi
List of Tables vi
Preface 1
1 Introduction 3
1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Kernel Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Kernel Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Properties and Examples of the Kernels . . . . . . . . . . . . . . . . . . . 14
1.4 The MSE and MISE Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Asymptotic MSE and MISE Approximations . . . . . . . . . . . . . . . . . 18
1.6 Optimal Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.7 Optimal Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 On the Estimation of the Mode 29
2.1 Mode Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2 A Simple Estimation of the Mode . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Nonparametric Regression Estimation . . . . . . . . . . . . . . . . . . . . . 39
2.4 Joint Asymptotic Distribution of the Estimated Conditional Mode . . . . . 43
3 Quantiles Regression 53
3.1 Nonparametric estimation of conditional quantiles . . . . . . . . . . . . . . 54
3.2 Joint Asymptotic Distribution of the Conditional Quantiles . . . . . . . . . 68
3.3 Mode and Median as a Comparison . . . . . . . . . . . . . . . . . . . . . . 83
Bibliography 84
Acknowledgment
First of all, I give my great thanks to the Almighty Allah, who has always helped me and granted me the power and courage to finish this study and given me success in my life.
My gratitude and respect are paid to my supervisor Dr. Raid Salha for all the interesting discussions I had with him.
I am grateful to the Islamic University in Gaza for offering me the opportunity to get the Master Degree of Mathematics, and my thanks go to all the professors who taught me in the mathematics department. I would like to express my deep thanks and appreciation to my family, especially my parents, for their encouragement.
I wish also to thank my colleagues and my friends who provided suggestions for this study.
Finally, I pray to Allah to accept this work.
Abstract
In this thesis, we study the kernel estimation of the conditional probability density func-
tion and two of its aspects, the conditional mode and the conditional quantiles.
For the conditional mode, we study the asymptotic normality of its kernel estimation from
[18] and we study the conditions under which the conditional mode estimated at finite
distinct points is asymptotically normally distributed.
Also, we study the kernel estimation of the conditional quantile from [1] and we study the conditions under which the joint distribution of several conditional quantiles is asymptotically normal.
List of Figures
1.1 Kernel density estimation based on 7 points 14
1.2 Kernel density estimates based on different bandwidths 23
1.3 The Epanechnikov kernel K∗ 27
1.4 Kernel density estimates of the Ethanol data 28
List of Tables
1.1 Common kernel functions 15
1.2 Efficiency of several kernels compared to the optimal kernel K∗ 28
Preface
The probability density function is a fundamental concept in statistics. Suppose we have
a set of observed data points assumed to be a sample from an unknown probability density
function f . The construction of an estimate of the density function from observed data
is known as density estimation.
The classical approach for estimating the density function is called parametric density
estimation. Here one assumes that the data are drawn from a known parametric distrib-
ution which depends only on finitely many parameters, and one uses the data to estimate
the unknown values of these parameters. For example, the normal distribution depends
on two parameters , the mean µ and the variance σ2. The density function f could be
estimated by finding estimates of µ and σ2 from the data, and substituting these estimates
into the formula for the normal density.
Parametric estimates usually depend only on a few parameters, so they are suitable even for small sample sizes n. Another approach to density estimation is nonparametric estimation, for example histograms, the naive estimator, the kernel estimator, etc.
We will concentrate on the kernel estimator. In this case we do not assume that the data are drawn from a known parametric distribution. The data are allowed to decide which function fits them best, without the restrictions imposed by parametric estimation.
For more details see [22].
There are several reasons for using the nonparametric smoothing approaches.
1) They can be employed as a convenient and succinct means of displaying the features
of the data set and hence to aid practical parametric model building.
2) They can be used for diagnostic checking of an estimated parametric model.
3) One may want to conduct inference under only the minimal restrictions imposed in
fully nonparametric structures. For more details see [20].
The main subject of this thesis, is the kernel estimation of the probability density
function, and the conditional distribution function.
Now suppose that (X_i, Y_i) are R×R valued random variables with a common probability density function f. We want to study the relationship between a response variable Y and
a predictor variable X. To explore this relationship, we use the regression analysis to
quantify it.
The conditional distribution function F (Y |X = x) is very important for solving this prob-
lem. In parametric and nonparametric estimation of the conditional distribution function
most investigation of the underlying structure is concerned with the mean regression func-
tion m(x) = E(Y |X = x), the conditional mean of Y given the value x of X. New insight
about the underlying structure can be gained by considering other aspects of the condi-
tional distribution function F (Y |X).
In this thesis, we will study two other aspects of the conditional distribution function,
its mode and quantiles.
This thesis consists of three chapters. In the first chapter, we present some basic definitions and theorems which will be used in the next chapters. Also, we present the idea of the
kernel estimation of the probability density function and some related topics.
In Chapter two, we introduce the kernel estimation for the mode and the conditional mode
function in the case of independent and identically distributed (i.i.d.) random variables.
We will study the asymptotic behavior of the estimators of the mode and the conditional
mode functions.
Finally, in Chapter three we will study the kernel estimation of the conditional quantile and the asymptotic behavior of this estimator.
Chapter 1
Introduction
This chapter contains some basic definitions and facts that we need in the remainder of this thesis. In Section 1.1, we present some preliminaries in probability and statistics.
And in the remaining sections of this chapter, we present the idea of the kernel estimation
and some important subjects related to it.
1.1 Preliminaries
In this section, we will introduce some basic definitions and theorems that will help in the remainder of this thesis.
Definition 1.1.1. [8](σ − Field). Let B be a collection of subsets of C. We say that B
is a σ − Field if
(1) φ ∈ B, (B is not empty).
(2) If A ∈ B, then Ac ∈ B, (B is closed under complements).
(3) If the sequence of sets C_1, C_2, \ldots is in \mathcal{B}, then \bigcup_{i=1}^{\infty} C_i \in \mathcal{B}, (\mathcal{B} is closed under countable unions).
Definition 1.1.2. [8](Probability). Let C be a sample space and let B be a σ − Field
on C. Let P be a real valued function defined on B. Then P is a probability set function
if P satisfies the following three conditions:
1. P(C) \geq 0, for all C \in \mathcal{B}.
2. P(\mathcal{C}) = 1, where \mathcal{C} is the sample space.
3. If \{C_n\} is a sequence of sets in \mathcal{B} and C_m \cap C_n = \emptyset for all m \neq n, then
P\left(\bigcup_{n=1}^{\infty} C_n\right) = \sum_{n=1}^{\infty} P(C_n).
Definition 1.1.3. [8] Consider a random experiment with a sample space C. A function
X, which assigns to each element c ∈ C one and only one number X(c) = x, is called a
random variable. The space or range of X is the set of real numbers D = \{x : x = X(c), c \in \mathcal{C}\}. D will generally be a countable set or an interval of real numbers.
Definition 1.1.4. [6] If X is a discrete random variable, the function given by f(x) =
P (X = x) for each x within the range of X is called the probability distribution of
X.
Definition 1.1.5. [6] If X is a discrete random variable, the function given by
F(x) = P(X \leq x) = \sum_{t \leq x} f(t) \qquad \text{for } -\infty < x < \infty,
where f(t) is the value of the probability distribution of X at t, is called the dis-
tribution function, or the cumulative distribution function, of X and denoted by
(cdf).
Definition 1.1.6. [6] A function with values f(x), defined over the set of all real numbers,
is called a probability density function of the continuous random variable X if and
only if
P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx
Definition 1.1.7. [6] If X is a continuous random variable and the value of its probability
density at t is f(t), then the function given by
F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt \qquad \text{for } -\infty < x < \infty
is called the distribution function, or the cumulative distribution, of X.
Definition 1.1.8. [8] The support of a continuous random variable X consists of all
points x such that fX(x) > 0.
Definition 1.1.9. [8] (Independence). Let the random variables X1 and X2 have the
joint pdf f(x1, x2) and the marginal pdfs f1(x1) and f2(x2) respectively. The random
variables X_1 and X_2 are said to be independent if, and only if, f(x_1, x_2) \equiv f_1(x_1) f_2(x_2). Random variables that are not independent are said to be dependent.
Definition 1.1.10. [8] Let X be a random variable with pdf depending on a parameter \theta. Let X_1, \ldots, X_n be a random sample from the distribution of X and let T denote an estimator of \theta. We say T is an unbiased estimator of \theta if
E(T) = \theta.
If T is not unbiased, we say that T is a biased estimator of \theta.
Theorem 1.1.1. [6]
If \hat{\theta} is an unbiased estimator of \theta and
\mathrm{Var}(\hat{\theta}) = \frac{1}{n\, E\left[\left(\frac{\partial \ln f(X)}{\partial \theta}\right)^{2}\right]},
then \hat{\theta} is a minimum variance unbiased estimator of \theta.
Definition 1.1.11. [6] The statistic \hat{\theta} is a consistent estimator of the parameter \theta if and only if for each c > 0,
\lim_{n \to \infty} P(|\hat{\theta} - \theta| < c) = 1.
Theorem 1.1.2. [6]
If \hat{\theta} is an unbiased estimator of \theta and \mathrm{Var}(\hat{\theta}) \to 0 as n \to \infty, then \hat{\theta} is a consistent estimator of \theta.
Definition 1.1.12. [6] The statistic \hat{\theta} is a sufficient estimator of the parameter \theta if and only if, for each value of \theta, the conditional probability distribution or density of the random sample X_1, X_2, \ldots, X_n given \hat{\theta} is independent of \theta.
Definition 1.1.13. [8](Characteristic Function) The characteristic function of a ran-
dom variable X with distribution function F, denoted by k(u), is defined by
k(u) = \int_{-\infty}^{\infty} e^{-iuy} K(y)\,dy.
Theorem 1.1.3. [8]
The characteristic function of any random variable is a uniformly continuous function.
Theorem 1.1.4. [8](Minkowski’s Inequality)
Let X, Y be two random variables. Then it holds for 1 ≤ p < ∞ that
\left(E|X + Y|^{p}\right)^{1/p} \leq \left(E|X|^{p}\right)^{1/p} + \left(E|Y|^{p}\right)^{1/p}.
Definition 1.1.14. [12] Let r be a positive number such that
k_r = \lim_{u \to 0} \frac{1 - k(u)}{|u|^{r}}
is finite. If there exists a value of r such that kr is non-zero, it is called the characteristic
exponent of the transform k(u), and kr is called the characteristic coefficient.
Definition 1.1.15. [16] If A is any set, we define the Indicator function IA of the set
A to be the function given by
I_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}
Definition 1.1.16. [8](Converge in Probability). Let Xn be a sequence of random
variables and let X be a random variable defined on a sample space. We say Xn converges
in probability to X if for all ε > 0, we have
\lim_{n \to \infty} P[|X_n - X| \geq \varepsilon] = 0,
or equivalently,
\lim_{n \to \infty} P[|X_n - X| < \varepsilon] = 1.
If so, we write X_n \xrightarrow{p} X.
Definition 1.1.17. [8] (Converge in Distribution). Let Xn be a sequence of random
variables and let X be a random variable. Let FXn and FX be, respectively, the cdfs of
Xn and X. Let C(FX) denote the set of all points where FX is continuous. We say that
Xn converge in distribution to X if
\lim_{n \to \infty} F_{X_n}(x) = F_X(x), \quad \text{for all } x \in C(F_X).
We denote this convergence by X_n \xrightarrow{D} X.
Definition 1.1.18. [8] (Convergence with probability 1). Let \{X_n\}_{n=1}^{\infty} be a sequence of random variables on (\Omega, \mathcal{L}, P). We say that X_n converges almost surely to a random variable X (X_n \xrightarrow{a.s.} X), or converges with probability 1 to X, or converges strongly to X, if and only if
P(\{w : X_n(w) \to X(w) \text{ as } n \to \infty\}) = 1,
or equivalently, for all \varepsilon > 0,
\lim_{N \to \infty} P(|X_n - X| < \varepsilon \text{ for all } n \geq N) = 1.
Theorem 1.1.5. [8]
1. If Xn converge to X with probability 1, then Xn converge to X in probability.
2. If Xn converge to X in probability, then Xn converge to X in distribution.
3. Let Xn converge to X in probability and let g be a continuous function on R,
then g(Xn) converge to g(X) in probability.
Example 1.1.1. (Convergence in probability does not imply convergence with probability 1.)
Let \Omega = (0, 1] and let P be the uniform distribution on \Omega. Define the sets A_n by
A_1 = (0, 1/2], \quad A_2 = (1/2, 1],
A_3 = (0, 1/4], \quad A_4 = (1/4, 1/2], \quad A_5 = (1/2, 3/4], \quad A_6 = (3/4, 1],
A_7 = (0, 1/8], \quad A_8 = (1/8, 1/4], \ldots
Let X_n(w) = I_{A_n}(w).
Then P(|X_n - 0| \geq \varepsilon) \to 0 for all \varepsilon > 0, since X_n is 0 except on A_n and P(A_n) \downarrow 0. Thus X_n converges to 0 in probability.
But P(\{w : X_n(w) \to 0\}) = 0 (and not 1), because any w keeps falling in some A_n beyond any n_0; i.e., the sequence X_n(w) looks like 0 \ldots 010 \ldots 010 \ldots 010 \ldots, so X_n does not converge with probability 1 to 0.
Definition 1.1.19. [4] Let A \subseteq \mathbb{R}, let f : A \to \mathbb{R}, and let c \in A. We say that f is continuous at c if, given any neighborhood V_{\varepsilon}(f(c)) of f(c), there exists a neighborhood V_{\delta}(c) of c such that if x is any point of A \cap V_{\delta}(c), then f(x) belongs to V_{\varepsilon}(f(c)).
Definition 1.1.20. [4] A function f : A −→ R is said to be bounded on A if there
exists a constant M > 0 such that |f(x)| ≤ M for all x ∈ A.
Definition 1.1.21. [4] Let A ⊆ R, let f : A −→ R. We say that f is uniformly
continuous on A if for each ε > 0 there is a δ(ε) > 0 such that if x, u ∈ A are any
numbers satisfying |x− u| < δ(ε), then |f(x)− f(u)| < ε.
Definition 1.1.22. [4] Let A ⊆ R, let f : A −→ R. If there exists a constant K > 0
such that
|f(x)− f(u)| ≤ K|x− u|
for all x, u ∈ A, then f is said to be a Lipschitz function (or satisfy a Lipschitz
condition) on A.
Definition 1.1.23. [4] Let f : [a, b] \to \mathbb{R} and let a = x_0 < x_1 < \ldots < x_k = b be any subdivision of [a, b]. Define
p = \sum_{i=1}^{k} [f(x_i) - f(x_{i-1})]^{+}, \qquad n = \sum_{i=1}^{k} [f(x_i) - f(x_{i-1})]^{-},
and t = n + p. Define P = \sup p, N = \sup n, and T = \sup t, the suprema being taken over all subdivisions of [a, b].
If T < \infty, we say that f is of bounded variation over [a, b] and we write f \in BV.
Definition 1.1.24. [4] A function f : [a, b] \to \mathbb{R} is said to be absolutely continuous if, given \varepsilon > 0, there is \delta > 0 such that if \{(x_i, y_i)\}_{i=1}^{n} is a finite pairwise disjoint family of subintervals of [a, b] with \sum_{i=1}^{n} |x_i - y_i| < \delta, then \sum_{i=1}^{n} |f(x_i) - f(y_i)| < \varepsilon.
Theorem 1.1.6. [16]
Every absolutely continuous function is a uniformly continuous function.
Theorem 1.1.7. [16]
If f is an absolutely continuous function on [a, b], then f is of bounded variation.
Definition 1.1.25. [4] A set E is said to be measurable if for each set A we have
M^{*}(A) = M^{*}(A \cap E) + M^{*}(A \cap E^{c}),
where M^{*} is the outer measure, defined by
M^{*}(A) = \inf_{A \subset \bigcup_n I_n} \sum_n L(I_n),
the infimum being taken over all coverings of A by intervals I_n of length L(I_n).
Theorem 1.1.8. [16]
If f : A −→ R is a Lipschitz function, then f is uniformly continuous on A.
Theorem 1.1.9. [13] (Classical Central Limit Theorem):
Let \{X_k, k \geq 1\} be i.i.d. random variables with mean \mu and finite variance \sigma^{2}. Also let
Z_n = \frac{T_n - n\mu}{\sigma\sqrt{n}},
where T_n = \sum_{i=1}^{n} X_i. Then Z_n \xrightarrow{D} N(0, 1).
Theorem 1.1.10. [13] (Liapounov Theorem)
Let \{X_k, k \geq 1\} be independent random variables such that E X_k = \mu_k and \mathrm{Var}\, X_k = \sigma_k^{2}, and for some 0 < \delta \leq 1,
v_k^{2+\delta} = E|X_k - \mu_k|^{2+\delta} < \infty, \quad k \geq 1.
Also let T_n = \sum_{k=1}^{n} X_k, \ \xi_n = E T_n = \sum_{k=1}^{n} \mu_k, \ s_n^{2} = \mathrm{Var}\, T_n = \sum_{k=1}^{n} \sigma_k^{2}, \ Z_n = (T_n - \xi_n)/s_n,
and \rho_n = s_n^{-(2+\delta)} \sum_{k=1}^{n} v_k^{2+\delta}. Then, if \lim_{n \to \infty} \rho_n = 0, we have Z_n \xrightarrow{D} N(0, 1).
Theorem 1.1.11. [13]
Let \{X_k, k \geq 1\} be independent random variables such that P\{a \leq X_k \leq b\} = 1 for some finite scalars a < b. Also let E X_k = \mu_k, \mathrm{Var}\, X_k = \sigma_k^{2}, T_n = \sum_{k=1}^{n} X_k, \xi_n = \sum_{k=1}^{n} \mu_k, and s_n^{2} = \sum_{k=1}^{n} \sigma_k^{2}. Then
Z_n = (T_n - \xi_n)/s_n \xrightarrow{D} N(0, 1) \quad \text{if and only if} \quad s_n \to \infty \text{ as } n \to \infty.
Theorem 1.1.12. [13] (Borel-Cantelli Lemma)
Let \{A_n\} be a sequence of events and denote by P(A_n) the probability that A_n occurs, n \geq 1. Also, let A denote the event that the A_n occur infinitely often (i.o.). Then
\sum_{n \geq 1} P(A_n) < \infty \implies P(A) = 0,
no matter whether the A_n are independent or not. If the A_n are independent, then
\sum_{n \geq 1} P(A_n) = +\infty \implies P(A) = 1.
Lemma 1.1.13. [19]
There exists a universal constant C > 0 such that for each n > 0, \varepsilon_n > 0, and distribution function F,
P\left\{\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \varepsilon_n\right\} \leq C \exp(-2n\varepsilon_n^{2}).
Theorem 1.1.14. [13] ( Cramer-Wold )
Let X, X_1, X_2, \ldots be random vectors in \mathbb{R}^{p}; then X_n \xrightarrow{D} X if and only if, for every fixed \lambda \in \mathbb{R}^{p}, we have \lambda^{t} X_n \xrightarrow{D} \lambda^{t} X.
Theorem 1.1.15. ( Taylor’s Theorem )
Suppose that f is a real-valued function defined on \mathbb{R} and let x \in \mathbb{R}. Assume that f has p continuous derivatives in an interval (x - \delta, x + \delta) for some \delta > 0 and that the (p+1)th derivative of f exists. Then for any sequence (\alpha_n) converging to zero, we have
f(x + \alpha_n) = \sum_{j=0}^{p} \frac{\alpha_n^{j}}{j!} f^{(j)}(x) + o(\alpha_n^{p}).
1.2 Kernel Density Estimation
Suppose X1, X2, . . . , Xn is a sequence of independently and identically distributed (i.i.d.)
random variables with common probability density function f(x). The problem of esti-
mating the function f(x) is of interest for many reasons. For instance, it can be used to calculate probabilities. In addition, if we know f(x), we are able, through its graph, to determine its shape as well as other features of the distribution, such as whether it has one peak or more, whether it is smooth, symmetric, etc.
1.2.1 Kernel Estimator
Let X_1, X_2, \ldots, X_n be i.i.d. random variables with distribution function F(x) = P(X \leq x), which is absolutely continuous,
F(x) = \int_{-\infty}^{x} f(y)\,dy,
with probability density function f(x).
The sample distribution function F_n(x) at a point x is defined as
F_n(x) = \frac{1}{n} \{\text{number of observations } x_1, x_2, \ldots, x_n \text{ falling in } (-\infty, x]\}.
It is natural to take F_n(x) as an estimate of F(x) at a given point x. An estimate of f(x) may be
f_n(x) = \frac{1}{2h_n} \{F_n(x + h_n) - F_n(x - h_n)\}, \qquad (1.2.1)
where h_n is chosen as a positive number.
Equation (1.2.1) can be written as
f_n(x) = \frac{1}{2nh_n} \{\text{number of observations falling in the interval } [x - h_n, x + h_n]\}
= \frac{1}{2nh_n} \sum_{i=1}^{n} I(|X_i - x| \leq h_n)
= \frac{1}{nh_n} \sum_{i=1}^{n} \frac{1}{2} I\left(\left|\frac{X_i - x}{h_n}\right| \leq 1\right)
= \frac{1}{nh_n} \sum_{i=1}^{n} w\left(\frac{X_i - x}{h_n}\right), \qquad (1.2.2)
where
I = \begin{cases} 1 & x - h_n \leq X_i \leq x + h_n, \\ 0 & \text{otherwise}, \end{cases}
and
w\left(\frac{X_i - x}{h_n}\right) = \frac{1}{2} I\left(\left|\frac{X_i - x}{h_n}\right| \leq 1\right) = \begin{cases} \frac{1}{2} & -1 \leq \frac{X_i - x}{h_n} \leq 1, \\ 0 & \text{otherwise}. \end{cases}
Definition 1.2.1. We regard the function that is centered at the estimation point and used to weight nearby data points as a weight function; we call it the kernel function, denote it by K(\cdot), and define the kernel density estimator by
f_n(x) = \frac{1}{nh_n} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h_n}\right). \qquad (1.2.3)
Note that Equation (1.2.3) can be written as
f_n(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i),
where K_h(x) = K(x/h)/h.
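To make the estimator concrete, the following is a minimal Python sketch of (1.2.3), assuming a Gaussian kernel; the names kde and gaussian_kernel and the small example data are ours and not taken from the references.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate f_n(x) = (1/(n*h)) * sum_i K((x - X_i)/h).

    x may be a scalar or an array of evaluation points."""
    x = np.atleast_1d(x)
    n = len(data)
    u = (x[:, None] - np.asarray(data)[None, :]) / h   # scaled differences (x - X_i)/h
    return kernel(u).sum(axis=1) / (n * h)

# Example: estimate the density of a small sample (7 points, as in Figure 1.1)
sample = np.array([-1.3, -0.4, 0.1, 0.2, 0.9, 1.5, 2.1])
print(kde(np.array([0.0, 1.0]), sample, h=0.5))
```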
The kernel estimator can be viewed as a sum of bumps placed at the observations. The kernel function K determines the shape of the bumps, while the bandwidth h_n determines their width; see the illustration in Figure 1.1.
Figure 1.1: Kernel density estimation based on 7 points (see[20])
From Figure 1.1, we have:
(1) The shape of the bumps is defined by the kernel function.
(2) The spread of the bumps is determined by the bandwidth h_n, which is analogous to the bin width of a histogram.
That is, the value of the kernel estimate at the point x is the average of the n kernel ordinates at this point.
1.3 Properties and Examples of the Kernels
In this section, we will consider some properties of the kernels. A kernel is a piecewise continuous, even function, symmetric around zero and integrating to one, i.e.
K(x) = K(-x), \qquad \int_{-\infty}^{\infty} K(x)\,dx = 1. \qquad (1.3.1)
The kernel function need not have bounded support, and in most applications K is a
positive probability density function.
A kernel function K is said to be of order p if its first nonzero moment is the pth moment \mu_p, i.e. if
\mu_j(K) = 0, \ j = 1, 2, \ldots, p - 1; \qquad \mu_p(K) \neq 0,
where
\mu_j(K) = \int_{-\infty}^{\infty} y^{j} K(y)\,dy. \qquad (1.3.2)
Some examples of kernel functions are given in Table 1.1, where I is the indicator function.
Table 1.1: Common kernel functions

Kernel         K(x)
Epanechnikov   (3/4)(1 - x^2) I(|x| <= 1)
Biweight       (15/16)(1 - x^2)^2 I(|x| <= 1)
Triweight      (35/32)(1 - x^2)^3 I(|x| <= 1)
Triangular     (1 - |x|) I(|x| <= 1)
Gaussian       (2\pi)^{-1/2} \exp(-x^2/2)
Uniform        (1/2) I(|x| <= 1)
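For illustration, the kernels of Table 1.1 can be written directly as functions. The sketch below (with function names of our own choosing) also checks numerically that each integrates to one.

```python
import numpy as np

# The kernels of Table 1.1; each is symmetric about zero and integrates to one.
def epanechnikov(x): return np.where(np.abs(x) <= 1, 0.75 * (1 - x**2), 0.0)
def biweight(x):     return np.where(np.abs(x) <= 1, (15/16) * (1 - x**2)**2, 0.0)
def triweight(x):    return np.where(np.abs(x) <= 1, (35/32) * (1 - x**2)**3, 0.0)
def triangular(x):   return np.where(np.abs(x) <= 1, 1 - np.abs(x), 0.0)
def gaussian(x):     return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
def uniform(x):      return np.where(np.abs(x) <= 1, 0.5, 0.0)

# Numerical check that each kernel integrates (approximately) to one
grid = np.linspace(-5, 5, 100001)
for K in (epanechnikov, biweight, triweight, triangular, gaussian, uniform):
    print(K.__name__, np.trapz(K(grid), grid))
```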
Now, we shall introduce some important properties of the kernel estimator. We consider the following conditions, which we will use in proving facts, lemmas and theorems in the remainder of this chapter.
i) The unknown density function f(x) has a continuous second derivative f^{(2)}(x).
ii) The bandwidth h = h_n = h(n) is a sequence of positive numbers and satisfies
\lim_{n \to \infty} h_n = 0 \quad \text{and} \quad \lim_{n \to \infty} n h_n = \infty.
iii) The kernel K is a bounded probability density function of order 2, symmetric about zero.
Definition 1.3.1. The Bias of an estimator fn(x) of a density f(x) is the difference
between the expected value of fn(x) and f(x). That is
Bias(fn(x)) = E(fn(x))− f(x)
In [12], the statistical properties of the kernel estimator were studied. In addition to the above, several other properties were proved there: it was shown that f_n(x) is a consistent estimator of f(x) and that the sequence of estimates f_n(x) is asymptotically normally distributed. It was also proved that if the probability density function f(x) is uniformly continuous and if \lim_{n \to \infty} n h_n^{2} = \infty, then f_n(x) tends uniformly (in probability) to f(x), in the sense that (1.3.3) holds:
\lim_{n \to \infty} P\left(\sup_{-\infty < x < \infty} |f_n(x) - f(x)| < \varepsilon\right) = 1, \quad \forall \varepsilon > 0. \qquad (1.3.3)
1.4 The MSE and MISE Criteria
The important role played by the kernel density estimator makes us concerned with its performance, i.e. its efficiency and accuracy in estimating the true density. We will study two types of error criteria, the mean squared error (MSE) and the mean integrated squared error (MISE).
Definition 1.4.1. The mean squared error (MSE) is used to measure the error when estimating the density function at a single point. It is defined by
MSE\{f_n(x)\} = E\{f_n(x) - f(x)\}^{2}. \qquad (1.4.1)
From its definition, the MSE measures the average squared difference between the density estimator and the true density. In general, any function of the absolute distance |f_n(x) - f(x)| (often called a metric) would serve as a measurement of the goodness of an estimator. But the MSE metric has at least two advantages over other metrics. First, it is tractable analytically. Second, it has an interesting decomposition into variance and
squared bias, provided f(x) is not random, as follows:
MSE(f_n(x)) = E(f_n(x) - f(x))^{2}
= E\{f_n(x)^{2} - 2f(x)f_n(x) + f^{2}(x)\}
= E\{f_n(x)^{2}\} - 2f(x)E\{f_n(x)\} + f^{2}(x)
= \mathrm{Var}\{f_n(x)\} + (E f_n(x))^{2} - 2f(x)E\{f_n(x)\} + f^{2}(x)
= \mathrm{Var}\{f_n(x)\} + (E f_n(x) - f(x))^{2}. \qquad (1.4.2)
Theorem 1.4.1.
Let X be a random variable having a density f, then
MSE(f_n(x)) = n^{-1}\left\{\int_{-\infty}^{\infty} K_{h_n}^{2}(x - y) f(y)\,dy - \left(\int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy\right)^{2}\right\}
\qquad + \left(\int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy - f(x)\right)^{2}. \qquad (1.4.3)
Proof: See [11].
Now, we are interested in considering an error criterion that globally measures the distance
between the estimation of f over the entire real line and f itself.
Definition 1.4.2. An error criterion that measures the distance between f_n(x) and f(x) is the integrated squared error (ISE), given by
ISE\{f_n(x)\} = \int_{-\infty}^{\infty} (f_n(x) - f(x))^{2}\,dx.
Note that the ISE is not an appropriate criterion when we deal with all possible data sets, so we prefer to analyze the expected value of this random quantity, the mean integrated squared error.
Definition 1.4.3. The expected value of the ISE is called the mean integrated squared error (MISE) and is given by
MISE\{f_n(x)\} = E(ISE\{f_n(x)\}) = E \int_{-\infty}^{\infty} (f_n(x) - f(x))^{2}\,dx.
By changing the order of integration we have
MISE(f_n(x)) = \int_{-\infty}^{\infty} MSE\{f_n(x)\}\,dx
= \int_{-\infty}^{\infty} \{E f_n(x) - f(x)\}^{2}\,dx + \int_{-\infty}^{\infty} \mathrm{Var}(f_n(x))\,dx. \qquad (1.4.4)
Theorem 1.4.2.
The MISE of an estimator f_n(x) of a density f(x) is given by
MISE(f_n(x)) = n^{-1} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} K_{h_n}^{2}(x - y) f(y)\,dy\,dx
\quad + (1 - n^{-1}) \int_{-\infty}^{\infty} \left(\int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy\right)^{2} dx
\quad - 2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy\, f(x)\,dx + \int_{-\infty}^{\infty} f^{2}(x)\,dx. \qquad (1.4.5)
Proof: See [11].
1.5 Asymptotic MSE and MISE Approximations
Here, we will derive an asymptotic approximation for the MISE which depends on h_n in a simple way. The simple expression of this approximation will exhibit the influence of the bandwidth h_n as a smoothing parameter.
The rate of convergence of the kernel density estimator and the optimal bandwidth can also be obtained from the asymptotic approximation of the MISE.
Before we start our investigation we have to introduce some definitions, theorems and assumptions that are needed throughout our work.
Definition 1.5.1.
i) A function f is of order less than g as x \to \infty if
\lim_{x \to \infty} \frac{f(x)}{g(x)} = 0.
We indicate this by writing f = o(g) ("f is little oh of g").
ii) Let f(x) and g(x) be positive for x sufficiently large. Then f is of at most the order of g as x \to \infty if there is a positive constant M for which
\frac{f(x)}{g(x)} \leq M
for x sufficiently large. We indicate this by writing f = O(g) ("f is big oh of g").
Definition 1.5.2. Given two sequences a_n and b_n such that b_n \geq 0 for all n, we write
a_n = O(b_n) \quad (\text{read: } a_n \text{ is big oh of } b_n)
if there exists a constant M > 0 such that |a_n| \leq M b_n for all n.
We write a_n = o(b_n) as n \to \infty (read: a_n is little oh of b_n) if
\lim_{n \to \infty} \frac{a_n}{b_n} = 0.
Definition 1.5.3. We say that a_n is asymptotically equivalent to b_n, or simply a_n is asymptotic to b_n, and we write a_n \sim b_n, if and only if
\lim_{n \to \infty} \frac{a_n}{b_n} = 1.
Lemma 1.5.1.
Let X be a random variable having a density f. Then the bias of f_n(x) can be expressed as
E(f_n(x)) - f(x) = \frac{1}{2} h_n^{2}\, \mu_2(K)\, f''(x) + o(h_n^{2}). \qquad (1.5.1)
Proof :
Firstly, we assume that
\int_{-\infty}^{\infty} K(z)\,dz = 1, \quad \int_{-\infty}^{\infty} z K(z)\,dz = 0, \quad \int_{-\infty}^{\infty} z^{2} K(z)\,dz < \infty, \quad \text{and} \quad \mu_2(K) = \int_{-\infty}^{\infty} z^{2} K(z)\,dz.
Note that
E(f_n(x)) = \int_{-\infty}^{\infty} \frac{1}{h_n} K\left(\frac{x - y}{h_n}\right) f(y)\,dy.
Let z = \frac{x - y}{h_n} to get
E(f_n(x)) = \int_{-\infty}^{\infty} K(z)\, f(x - z h_n)\,dz.
Since f has continuous derivatives of order 2, we can expand f(x - z h_n) in a Taylor series as follows:
f(x - z h_n) = \sum_{j=0}^{2} \frac{(-z h_n)^{j}}{j!} f^{(j)}(x) + o((z h_n)^{2})
= f(x) - z h_n f'(x) + \frac{z^{2} h_n^{2}}{2} f''(x) + o(h_n^{2}).
Therefore,
E[f_n(x)] = \int_{-\infty}^{\infty} K(z)\left\{f(x) - z h_n f'(x) + \frac{1}{2} z^{2} h_n^{2} f''(x) + o(h_n^{2})\right\} dz
= f(x)\int_{-\infty}^{\infty} K(z)\,dz - h_n f'(x)\int_{-\infty}^{\infty} z K(z)\,dz + \frac{1}{2} h_n^{2} f''(x)\int_{-\infty}^{\infty} z^{2} K(z)\,dz + o(h_n^{2})
= f(x) + \frac{1}{2} h_n^{2} f''(x)\int_{-\infty}^{\infty} z^{2} K(z)\,dz + o(h_n^{2}).
By the assumption, we have the result.
Lemma 1.5.2.
Let X be a random variable having a density f. Then
\mathrm{Var}\{f_n(x)\} = (n h_n)^{-1} R(K) f(x) + o((n h_n)^{-1}), \qquad (1.5.2)
where R(K) = \int_{-\infty}^{\infty} K^{2}(x)\,dx.
Proof :
First, note that
\mathrm{Var}\{f_n(x)\} = \frac{1}{n}\left\{\int_{-\infty}^{\infty} K_{h_n}^{2}(x - y) f(y)\,dy - \left[\int_{-\infty}^{\infty} K_{h_n}(x - y) f(y)\,dy\right]^{2}\right\}.
Using the Taylor series expansion of f(x - z h_n) about x, we get
\mathrm{Var}\{f_n(x)\} = \frac{1}{n h_n} \int_{-\infty}^{\infty} K^{2}(z) f(x - z h_n)\,dz - n^{-1}\{E f_n(x)\}^{2}
= \frac{1}{n h_n} \int_{-\infty}^{\infty} K^{2}(z)\{f(x) + o(1)\}\,dz - n^{-1}\{f(x) + o(1)\}^{2}
= \frac{1}{n h_n} R(K) f(x) + o((n h_n)^{-1}).
From the assumption, the result holds.
Now, from the above we have some properties of the bias and the variance:
1) The bias is of order O(h_n^{2}), which together with condition (ii) of Section 1.3 implies that f_n(x) is an asymptotically unbiased estimator.
2) The bias is large whenever the absolute value of the second derivative |f^{(2)}(x)| is large. This occurs for several densities at peaks, where the bias is negative, and at valleys, where the bias is positive.
3) The variance is of order (n h_n)^{-1}, which means that the variance converges to zero by condition (ii) of Section 1.3.
Theorem 1.5.3.
The MISE of an estimator f_n of the unknown density f is given by
MISE\{f_n(x)\} = AMISE\{f_n(x)\} + o\{(n h_n)^{-1} + h_n^{4}\},
where
AMISE\{f_n(x)\} = (n h_n)^{-1} R(K) + \frac{1}{4} h_n^{4}\, \mu_2^{2}(K)\, R(f'') \qquad (1.5.3)
is called the asymptotic MISE of f_n(x), and R(K) = \int_{-\infty}^{\infty} K^{2}(x)\,dx.
Proof :
From Equations (1.5.1) and (1.5.2), applying Equation (1.4.2), we get
MSE\{f_n(x)\} = (n h_n)^{-1} R(K) f(x) + o((n h_n)^{-1}) + \frac{1}{4} h_n^{4} \mu_2^{2}(K)\{f''(x)\}^{2} + o(h_n^{4}) + h_n^{2} \mu_2(K) f''(x)\, o(h_n^{2})
= (n h_n)^{-1} R(K) f(x) + \frac{1}{4} h_n^{4} \mu_2^{2}(K)\{f''(x)\}^{2} + o\{(n h_n)^{-1} + h_n^{4}\}.
From this, the asymptotic MSE is
AMSE\{f_n(x)\} = (n h_n)^{-1} R(K) f(x) + \frac{1}{4} h_n^{4} \mu_2^{2}(K)\{f''(x)\}^{2},
and integrating over x (using \int f(x)\,dx = 1) gives (1.5.3).
1.6 Optimal Bandwidth
The problem of bandwidth selection is very important in density estimation. The next
figure (1.2) shows how the density estimates change with the bandwidth size. Choice of
the appropriate bandwidth is critical to the performance of most nonparametric density
estimators. When the bandwidth is very small, the estimate will be very close to the
original data. Thus it will be very wiggly due to the over fitting. The estimate will be
almost unbiased, but it will have large variation under repeated sampling. If the band-
width is very large, the estimate will be very smooth, lying close to the mean of all the
data. Such an estimate will have small variance, but it will be highly biased. A brief survey of bandwidth selection for kernel density estimation is given in [22] and [23].
One way to select the smoothing parameter is simply to look at plots of the smoothed data for several bandwidths. If the overall trend is the feature of most interest to the investigator, a very smooth estimate may be desirable. If the investigator is interested in local extremes, a less smooth estimate may be preferred.
Subjective choice of the smoothing parameter offers a great deal of flexibility, as well as
a comprehensive look at the data, see [22].
The AMISE (asymptotic MISE) has some useful advantages. Its simplicity as a mathematical expression makes it useful for large sample approximations.
We can also see an important relationship between bias and variance, known as the variance-bias trade-off. It gives us an understanding of the role of the bandwidth h_n.
Figure 1.2: Kernel density estimates based on different bandwidths (see [20])
Figure 1.2 depends on three bandwidths: if we choose h_n = 0.25, we get the solid curve; if we choose h_n = 0.5, the dashed curve; and if we choose h_n = 0.75, the dotted curve.
There are many rules for bandwidth selection, for example Normal Scale Rules, Over-
smoothed bandwidth selection rules, Least Squares Cross-Validation, Biased Cross-Validation,
Estimation of density functionals and Plug-In Bandwidth Selection. For more details see
[11], [22] and [23].
Corollary 1.6.1.
The AMISE-optimal bandwidth, h_{AMISE}, has the closed form
h_{opt} = \left[\frac{R(K)}{\mu_2(K)^{2}\, R(f^{(2)})\, n}\right]^{1/5}. \qquad (1.6.1)
Proof :
By differentiating (1.5.3) with respect to h_n and setting the derivative equal to zero, we find the optimal bandwidth:
\frac{d}{dh_n} AMISE\{f_n(x)\} = -(n h_n^{2})^{-1} R(K) + h_n^{3}\, \mu_2^{2}(K)\, R(f'') = 0
\implies h_n^{5}\, \mu_2^{2}(K)\, R(f'') = n^{-1} R(K)
\implies h_{opt} = \left\{\frac{R(K)}{n\, \mu_2^{2}(K)\, R(f'')}\right\}^{1/5}.
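As a numerical illustration of (1.6.1), the sketch below evaluates h_{opt} for the Gaussian kernel under the assumed reference case that f is a normal density with standard deviation σ, for which R(K) = 1/(2√π), μ₂(K) = 1 and R(f″) = 3/(8√π σ⁵); this recovers the well-known normal scale rule h ≈ 1.06 σ n^{-1/5} mentioned above. The function name h_amise is our own.

```python
import numpy as np

def h_amise(R_K, mu2_K, R_f2, n):
    """AMISE-optimal bandwidth h_opt = [R(K) / (mu2(K)^2 * R(f'') * n)]^(1/5), Eq. (1.6.1)."""
    return (R_K / (mu2_K**2 * R_f2 * n)) ** 0.2

# Assumed reference case: Gaussian kernel, f = N(0, sigma^2)
sigma, n = 1.0, 400
R_K   = 1.0 / (2.0 * np.sqrt(np.pi))
mu2_K = 1.0
R_f2  = 3.0 / (8.0 * np.sqrt(np.pi) * sigma**5)
print(h_amise(R_K, mu2_K, R_f2, n))     # approx (4/3)^(1/5) * sigma * n**(-1/5)
print(1.06 * sigma * n ** (-0.2))       # the normal scale rule, for comparison
```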
When trying to understand what this h_n leads to, we find that it depends on the known kernel function K and on n, and that it is inversely proportional to R(f'')^{1/5}. The quantity R(f'') measures the total curvature of f. So if R(f'') is small, f has little curvature and the bandwidth h_n will be large; on the other hand, h_n will be small if R(f'') is large.
The previous corollary can be used to choose a good bandwidth if R(f'') is known. But f is unknown.
Therefore, if we substitute (1.6.1) into (1.5.3), we obtain the smallest value of the AMISE (since the second derivative is greater than zero) for estimating f using the kernel K:
AMISE\{f_n(x)\} = (n h_{opt})^{-1} R(K) + \frac{1}{4} h_{opt}^{4}\, \mu_2^{2}(K)\, R(f'')
= \frac{5}{4}\, n^{-4/5}\, (R(K))^{4/5}\, (\mu_2(K)^{2} R(f''))^{1/5};
that is, taking the infimum over h_n > 0, we get
\inf_{h_n > 0} AMISE\{f_n\} = \frac{5}{4}\left\{\mu_2(K)^{2}\, R(K)^{4}\, R(f^{(2)})\right\}^{1/5} n^{-4/5}.
Notice that in (1.6.1) the optimal bandwidth depends on the unknown density being estimated, so we cannot use (1.6.1) directly to find the optimal bandwidth h_{opt}. Also from (1.6.1) we can draw the following useful conclusions:
1. The optimal bandwidth converges to zero as the sample size increases, but at a very slow rate.
2. The optimal bandwidth is inversely proportional to R(f'')^{1/5}. Since R(f'') measures the curvature of f, this means that for a density function with little curvature the optimal bandwidth will be large. Conversely, if the density function has a large curvature, the optimal bandwidth will be small.
1.7 Optimal Kernel
In this section, we investigate what effect the shape of kernel function K has on density
estimation. Usually K is taken to be a symmetric, unimodal density function, but there are many kernel functions that satisfy these requirements, and their performance still varies. The best such kernel will be known as the optimal kernel.
Epanechnikov (1969) was the first to consider this problem in the density estimation
context and to give a comparison of common kernels in asymptotic performance terms.
Consider the formula for AMISE\{f_n(x)\} in (1.5.3). In that formula the scaling of K is entangled with the bandwidth h_n, which causes difficulty in optimizing with respect to K. If we choose a re-scaling of K of the form
K_{\delta}(x) = \frac{1}{\delta} K\left(\frac{x}{\delta}\right),
the dependence on K and on h_n can be separated. To see how this can be done, we give the following lemma.
Lemma 1.7.1.
R(K_{\delta}) = \mu_2^{2}(K_{\delta}) is satisfied if and only if \delta = \delta_0 = \{R(K)/\mu_2^{2}(K)\}^{1/5}.
Proof : See [11].
Theorem 1.7.2.
Let R(K_{\delta}) = \mu_2^{2}(K_{\delta}), where \delta = \delta_0 = \{R(K)/\mu_2(K)^{2}\}^{1/5}. Then
AMISE(f_n(x)) = C(K_{\delta_0})\left\{(n h_n)^{-1} + \frac{1}{4} h_n^{4} R(f'')\right\}. \qquad (1.7.1)
Proof :
First, note that R(K) = \delta_0 R(K_{\delta_0}) and, since R(K_{\delta_0}) = \mu_2^{2}(K_{\delta_0}),
\mu_2^{2}(K) = \delta_0^{-5} R(K) = \delta_0 \cdot \delta_0^{-5} R(K_{\delta_0}) = \delta_0^{-4} R(K_{\delta_0}).
Note that
AMISE(f_n(x)) = \frac{1}{n h_n} R(K) + \frac{1}{4} h_n^{4}\, \mu_2^{2}(K)\, R(f'')
= \frac{\delta_0}{n h_n} R(K_{\delta_0}) + \frac{1}{4} h_n^{4}\, \delta_0^{-4}\, R(K_{\delta_0})\, R(f'')
= R(K_{\delta_0})\left\{\frac{1}{n (h_n/\delta_0)} + \frac{1}{4}\left(\frac{h_n}{\delta_0}\right)^{4} R(f'')\right\}.
Writing h_n again for the rescaled bandwidth h_n/\delta_0 and noting that
R(K_{\delta_0}) = \delta_0^{-1} R(K) = \{\delta_0^{-4} R^{4}(K)\cdot \delta_0^{4}\mu_2^{2}(K)\}^{1/5} = \{R^{4}(K)\, \mu_2^{2}(K)\}^{1/5} = C(K_{\delta_0}),
we obtain
AMISE(f_n(x)) = C(K_{\delta_0})\left\{(n h_n)^{-1} + \frac{1}{4} h_n^{4} R(f'')\right\}.
Thus the result holds.
Definition 1.7.1. We say that C(K) is invariant to re-scalings of K if C(K_{\delta_1}) = C(K_{\delta_2}) for any \delta_1, \delta_2 > 0. We call K^{c} = K_{\delta_0} the canonical kernel for the class \{K_{\delta} : \delta > 0\} of rescaled kernels K.
Corollary 1.7.3.
C(K) is invariant to re-scaling of K.
Proof : See [11].
Canonical kernels can also simplify the optimization of the kernel shape. That is, from Equation (1.7.1) it is enough to choose K to minimize C(K_{\delta_0}), subject to
\int_{-\infty}^{\infty} K(x)\,dx = 1, \quad \int_{-\infty}^{\infty} x K(x)\,dx = 0, \quad \int_{-\infty}^{\infty} x^{2} K(x)\,dx = a^{2} < \infty, \quad \text{and } K(x) \geq 0 \text{ for all } x.
The solution to this problem is given by
K_a(x) = \frac{3}{4}\left(1 - \frac{x^{2}}{5a^{2}}\right)\frac{1}{\sqrt{5}\, a}\, I(|x| < \sqrt{5}\, a), \qquad (1.7.2)
where a is an arbitrary scale parameter.
Now, if we choose a^{2} = \frac{1}{5}, we get the simplest form of K_a(x),
K^{*}(x) = \frac{3}{4}(1 - x^{2})\, I(|x| \leq 1). \qquad (1.7.3)
The kernel in (1.7.3) is known as the Epanechnikov kernel, since its optimality properties in density estimation were first described by Epanechnikov (1969).
Now, we introduce the useful ratio \{C(K^{*})/C(K)\}^{5/4}.
Definition 1.7.2. The ratio \{C(K^{*})/C(K)\}^{5/4} represents the ratio of sample sizes necessary to obtain the same minimum AMISE (for a given f) when using K^{*}(x) as when using K, and is called the efficiency of K relative to K^{*}.
Figure 1.3: The Epanechnikov kernel K∗ (see [20])
Kernel         \{C(K^{*})/C(K)\}^{5/4}
Epanechnikov   1.000
Biweight       0.994
Triweight      0.987
Triangular     0.986
Gaussian       0.951
Uniform        0.930

Table 1.2: Efficiency of several kernels compared to the optimal kernel K^{*}.
From Table 1.2, if the efficiency of a kernel K is 0.98, this means that only 98% of the sample size used with K is needed when using the optimal kernel K^{*} to reach the same minimum AMISE.
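The efficiencies in Table 1.2 can be reproduced numerically, assuming C(K) = {R(K)^4 μ₂²(K)}^{1/5} as in Section 1.7; the following sketch (our own code, not from the references) computes {C(K*)/C(K)}^{5/4} for a few of the kernels.

```python
import numpy as np

grid = np.linspace(-10, 10, 200001)   # evaluation grid for numerical integration

def C(Kvals):
    """C(K) = {R(K)^4 * mu2(K)^2}^(1/5), with R(K) = int K^2 and mu2(K) = int x^2 K."""
    R   = np.trapz(Kvals**2, grid)
    mu2 = np.trapz(grid**2 * Kvals, grid)
    return (R**4 * mu2**2) ** 0.2

epan     = np.where(np.abs(grid) <= 1, 0.75 * (1 - grid**2), 0.0)
gauss    = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
uniformK = np.where(np.abs(grid) <= 1, 0.5, 0.0)

C_star = C(epan)
for name, K in [("Epanechnikov", epan), ("Gaussian", gauss), ("Uniform", uniformK)]:
    print(name, round((C_star / C(K)) ** 1.25, 3))   # efficiency {C(K*)/C(K)}^(5/4)
```

Running this gives approximately 1.000, 0.951 and 0.930, matching the corresponding entries of Table 1.2.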
Figure 1.4 shows the kernel density estimates of the Ethanol data based on the same
bandwidth hn = 0.2, but using different kernels. The solid curve stands for the triangular
kernel, the dashed curve for the uniform kernel, and the dotted curve for the normal
kernel.
Figure 1.4: Kernel density estimates of the Ethanol data (see [20])
Chapter 2
On the Estimation of the Mode
The mode is considered one of the measures of central tendency, i.e. of the tendency of the data to concentrate around a particular value. The mode is defined as the most common, or most frequently repeated, value among the observations; the data may have more than one mode, or no mode at all, in which case it cannot be calculated.
The mode can be found by calculation or graphically, and it is not affected by extreme values. It is important to mention the relation between the mean, the median and the mode. These are the most used measures of location, because they are easy to understand. They are equal when the curve is symmetric; when the curve is positively skewed, the mean is larger than the median and both are larger than the mode, while when the curve is negatively skewed, the mode is larger than the median and both are larger than the mean.
In this chapter, we first present the kernel estimation of the mode and of the conditional mode function, and we study them in the case of i.i.d. random variables. We also study the asymptotic behavior of the estimators of the mode and of the conditional mode.
This chapter consists of four sections. In Section 2.1, we introduce the problem of estimating the mode of a probability density function and give some historical notes; we study the conditions under which the estimator of the (unconditional) mode is asymptotically normal. In the next section, we present the simple mode estimator of [3]. Section 2.3 comprises an introduction to the study of the relationship between two variables X and Y, the first of which is called the predictor variable and the second the response variable, and we present the Nadaraya-Watson estimator as one approach to kernel regression estimation. Finally, we study the joint estimation of the conditional mode function taken at k finite distinct points.
2.1 Mode Estimation
The problem of estimating the mode of a probability density function has received con-
siderable attention in the literature. The study of nonparametric mode estimation is now
four decades old, having roots in many papers. In the last few years, an increasing interest
in this topic can be observed. Among the most recent evidence of this growing interest
are the papers by [3].
There are many fields where the knowledge of the mode is of great interest. For example,
the estimation of contours is a natural extension of the estimation of mode points. For
more details see [18] and [20].
Let X_1, X_2, \ldots, X_n be a sequence of i.i.d. random variables with pdf f. Assume that the probability density function f(x) is uniformly continuous in x. It follows that f(x) possesses a mode \theta, defined by
f(\theta) = \max_{x} f(x).
Assume that \theta is unique.
The classical procedure to estimate the mode is as follows: if f(x) is the unknown density and \theta is the mode of f, then \theta is estimated by the location \theta_n that maximizes an estimate f_n of f.
Suppose that f_n(x) is a continuous function and tends to 0 as |x| tends to \infty. Then there is a random variable \theta_n such that
\theta_n = \arg\max_{x} f_n(x). \qquad (2.1.1)
We call \theta_n the sample mode.
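In practice the sample mode (2.1.1) is usually computed by maximizing f_n over a finite grid of candidate points. The following is a minimal sketch, assuming a Gaussian kernel; the function name sample_mode and the simulated data are our own.

```python
import numpy as np

def sample_mode(data, h, grid):
    """Sample mode (2.1.1): the grid point at which the kernel density
    estimate f_n (Gaussian kernel, bandwidth h) is largest."""
    u = (grid[:, None] - data[None, :]) / h
    f_n = np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))
    return grid[np.argmax(f_n)]

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)       # true mode at 2.0
grid = np.linspace(data.min(), data.max(), 2001)
print(sample_mode(data, h=0.3, grid=grid))            # close to 2.0 for this sample
```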
Lemma 2.1.1. (Bochner Lemma)
Suppose K(y) is a measurable function satisfying the following:
1. \sup_{-\infty < y < \infty} |K(y)| < \infty,
2. \int_{-\infty}^{\infty} |K(y)|\,dy < \infty,
3. \lim_{|y| \to \infty} |y K(y)| = 0.
Let g(y) satisfy \int_{-\infty}^{\infty} |g(y)|\,dy < \infty. Let h_n be a sequence of positive constants satisfying the condition
4. \lim_{n \to \infty} h_n = 0.
Define
g_n(x) = \frac{1}{h_n} \int_{-\infty}^{\infty} K\left(\frac{y}{h_n}\right) g(x - y)\,dy.
Then at every point x of continuity of g(\cdot),
\lim_{n \to \infty} g_n(x) = g(x) \int_{-\infty}^{\infty} K(y)\,dy. \qquad (2.1.2)
Proof :
Note first that
g_n(x) - g(x)\int_{-\infty}^{\infty} K(y)\,dy = \frac{1}{h_n}\int_{-\infty}^{\infty} K\left(\frac{y}{h_n}\right) g(x - y)\,dy - g(x)\int_{-\infty}^{\infty} K(y)\,dy
= \int_{-\infty}^{\infty} \{g(x - y) - g(x)\}\, \frac{1}{h_n} K\left(\frac{y}{h_n}\right) dy.
Let \delta > 0, and split the region of integration into the two regions |y| \leq \delta and |y| > \delta. Now let z = \frac{y}{h_n}; then y = z h_n, dy = h_n\,dz, and dz = \frac{1}{h_n}\,dy. Now
\left|g_n(x) - g(x)\int_{-\infty}^{\infty} K(y)\,dy\right| \leq \int_{|y| \leq \delta} |g(x - y) - g(x)|\, \frac{1}{h_n}\left|K\left(\frac{y}{h_n}\right)\right| dy + \int_{|y| > \delta} |g(x - y) - g(x)|\, \frac{1}{h_n}\left|K\left(\frac{y}{h_n}\right)\right| dy
\leq \max_{|y| \leq \delta} |g(x - y) - g(x)| \int_{|z| \leq \delta/h_n} |K(z)|\,dz + \int_{|y| \geq \delta} \frac{|g(x - y)|}{|y|}\left|\frac{y}{h_n} K\left(\frac{y}{h_n}\right)\right| dy + |g(x)|\int_{|y| \geq \delta} \frac{1}{h_n}\left|K\left(\frac{y}{h_n}\right)\right| dy
\leq \max_{|y| \leq \delta} |g(x - y) - g(x)| \int_{-\infty}^{\infty} |K(z)|\,dz + \frac{1}{\delta}\sup_{|z| \geq \delta/h_n} |z K(z)| \int_{-\infty}^{\infty} |g(y)|\,dy + |g(x)|\int_{|z| \geq \delta/h_n} |K(z)|\,dz \qquad \left(\text{since } \frac{1}{|y|} < \frac{1}{\delta}\right),
which tends to 0 as n tends to \infty and then \delta tends to 0.
Lemma 2.1.2.
Consider the formula for f_n(x) as in Equation (1.2.3). Then f_n(x) can be written as
f_n(x) = (2\pi)^{-1} \int_{-\infty}^{\infty} e^{-iux}\, k(u h_n)\, \varphi_n(u)\,du, \qquad (2.1.3)
where k is the transform of the kernel K and
\varphi_n(u) = \int_{-\infty}^{\infty} e^{iux}\,dF_n(x) = n^{-1} \sum_{k=1}^{n} e^{iuX_k}.
Proof : See [12]
Theorem 2.1.3.
Under the conditions of Lemma 2.1.1, if h_n is a function of n satisfying \lim_{n \to \infty} n h_n^{2} = \infty and \lim_{n \to \infty} E[f_n(x)] = f(x), and if the probability density function f(x) is uniformly continuous, then for every \varepsilon > 0,
P\left[\sup_{x} |f_n(x) - f(x)| < \varepsilon\right] \to 1 \quad \text{as } n \to \infty.
Proof :
To prove this theorem we want to show that
\lim_{n \to \infty} E^{1/2}\left[\sup_{-\infty < x < \infty} |f_n(x) - f(x)|^{2}\right] = 0. \qquad (2.1.4)
Since \lim_{n \to \infty} E[f_n(x)] = f(x), it suffices to show that
E^{1/2}\left[\sup_{-\infty < x < \infty} |f_n(x) - E[f_n(x)]|^{2}\right] \to 0 \qquad (2.1.5)
as n \to \infty, since by Lemma 2.1.1 it follows that
\lim_{n \to \infty} \sup_{-\infty < x < \infty} |E[f_n(x)] - f(x)| = 0.
Since
f_n(x) = (2\pi)^{-1} \int_{-\infty}^{\infty} e^{-iux}\, k(u h_n)\, \varphi_n(u)\,du,
then
\sup_{-\infty < x < \infty} |f_n(x) - E[f_n(x)]| \leq (2\pi)^{-1}\left|\int_{-\infty}^{\infty} e^{-iux} k(u h_n)\varphi_n(u)\,du - \int_{-\infty}^{\infty} e^{-iux} k(u h_n) E[\varphi_n(u)]\,du\right|
\leq (2\pi)^{-1}\int_{-\infty}^{\infty} |k(u h_n)|\, |\varphi_n(u) - E[\varphi_n(u)]|\,du \qquad (\text{since } |e^{-iux}| = 1).
Therefore, by Minkowski's inequality, the quantity in (2.1.5) is no greater than
(2\pi)^{-1}\int_{-\infty}^{\infty} |k(u h_n)|\, \sigma[\varphi_n(u)]\,du \leq (n^{1/2} h_n)^{-1}\int_{-\infty}^{\infty} |k(u)|\,du,
which tends to 0. The proof of this theorem is complete.
Theorem 2.1.4.
Under the conditions of the last theorem, if \theta_n is the sample mode and the population mode \theta is unique, then for every \varepsilon > 0,
\lim_{n \to \infty} P(|\theta_n - \theta| < \varepsilon) = 1. \qquad (2.1.6)
Proof :
Since f(x) is a uniformly continuous probability density function with a unique mode \theta, it has the following property: for every \varepsilon > 0 there exists an \eta > 0 such that, for every point x, |\theta - x| \geq \varepsilon implies |f(\theta) - f(x)| \geq \eta.
If the assertion were false, then there would exist an \varepsilon > 0 and a sequence \{x_n\} such that
|f(\theta) - f(x_n)| < \frac{1}{n} \quad \text{and} \quad |\theta - x_n| \geq \varepsilon. \qquad (2.1.7)
Now (2.1.7), and the fact that f(x) \to 0 as x \to \pm\infty, imply that there exists a point \theta' \neq \theta such that f(\theta') = f(\theta), which contradicts the assumption that f(x) has a unique mode \theta.
From this assertion, since f is uniformly continuous, it follows that to prove \theta_n \to \theta in probability it is sufficient to prove that
f(\theta_n) \to f(\theta) \quad \text{in probability as } n \to \infty. \qquad (2.1.8)
Now,
|f(\theta_n) - f(\theta)| = |f(\theta_n) - f_n(\theta_n) + f_n(\theta_n) - f(\theta)| \leq |f(\theta_n) - f_n(\theta_n)| + |f_n(\theta_n) - f(\theta)| \leq \sup_{x}|f(x) - f_n(x)| + \sup_{x}|f_n(x) - f(x)| = 2\sup_{x}|f_n(x) - f(x)|, \qquad (2.1.9)
since
|f_n(\theta_n) - f(\theta)| = \left|\sup_{x} f_n(x) - \sup_{x} f(x)\right| \leq \sup_{x}|f_n(x) - f(x)|. \qquad (2.1.10)
From (2.1.9) and Theorem 2.1.3, we obtain (2.1.8).
Nadaraya (1965) proved the strongest result in this direction. He proved that, under certain conditions, the sample mode \theta_n converges to the population mode \theta with probability 1.
To achieve the asymptotic normality of \theta_n, and therefore to be able to construct asymptotic confidence intervals for \theta, it is generally believed that rather heavy smoothing conditions are needed. The next theorem states conditions on the constants h_n and the kernel K(u) under which the estimated mode \theta_n is asymptotically normally distributed.
Consider a probability density function f(x) with a unique mode at θ. If f(x) has a
continuous second derivative, then by definition of the mode we have
f ′(θ) = 0, f ′′(θ) < 0. (2.1.11)
Similarly, if the estimated probability density function fn(x) is chosen to be twice dif-
ferentiable (that is, the weighting function K(y) is chosen to be twice differentiable),
then
f ′n(θn) = 0, f ′′n(θn) < 0, (2.1.12)
if θn is the mode of fn(x). Then by Taylor’s theorem, we have
0 = f ′n(θn) = f ′n(θ) + (θn − θ)f ′′n(θ?n) (2.1.13)
for some random variable θ?n between θn and θ. From (2.1.13) we can write
θn − θ = −f ′n(θ)/f ′′n(θ?n) (2.1.14)
if the denominator does not vanish. Using (2.1.14) as a basis, we now state conditions
under which the estimated mode θn is asymptotically normally distributed.
Theorem 2.1.5.
Suppose that there exists \delta, 0 < \delta < 1, such that the transform k(u) has a characteristic exponent r \geq 2 and satisfies
1. \int_{-\infty}^{\infty} |u|^{2+\delta}|k(u)|\,du < \infty, \quad \text{and } h_n \text{ is a function of } n \text{ satisfying}
2. \lim_{n \to \infty} n h_n^{5+2\delta} = 0,
3. \lim_{n \to \infty} n h_n^{6} = \infty, \quad \text{and the characteristic function } \varphi(u) \text{ of } X \text{ satisfies}
4. \int_{-\infty}^{\infty} |u|^{2+\delta}|\varphi(u)|\,du < \infty.
Then as n \to \infty,
E\left[\sup_{-\infty < x < \infty} |f_n''(x) - f''(x)|^{2}\right] \to 0, \qquad (2.1.15)
f_n''(\theta_n^{*}) \to f''(\theta) \quad \text{in probability}, \qquad (2.1.16)
(n h_n^{3})^{1/2} f_n'(\theta) \to N(0, f(\theta) J) \quad \text{in distribution}, \qquad (2.1.17)
(n h_n^{3})^{1/2}(\theta_n - \theta) \to N\left(0, f(\theta) J/[f''(\theta)]^{2}\right) \quad \text{in distribution}, \qquad (2.1.18)
where we define
J = \int_{-\infty}^{\infty} K'^{2}(y)\,dy = (2\pi)^{-1}\int_{-\infty}^{\infty} u^{2} k^{2}(u)\,du. \qquad (2.1.19)
Proof :
From (2.1.3), since
f_n(x) = (2\pi)^{-1}\int_{-\infty}^{\infty} e^{-iux}\, k(u h_n)\, \varphi_n(u)\,du,
we have
f_n'(x) = \frac{-i}{2\pi}\int_{-\infty}^{\infty} u\, e^{-iux}\, k(u h_n)\, \varphi_n(u)\,du,
and so
f_n''(x) = \frac{i^{2}}{2\pi}\int_{-\infty}^{\infty} u^{2} e^{-iux} k(u h_n)\varphi_n(u)\,du = \frac{-1}{2\pi}\int_{-\infty}^{\infty} u^{2} e^{-iux} k(u h_n)\varphi_n(u)\,du.
First, we prove (2.1.15):
|f_n''(x) - E[f_n''(x)]| = \frac{1}{2\pi}\left|\int_{-\infty}^{\infty} u^{2} e^{-iux} k(u h_n)\{\varphi_n(u) - E[\varphi_n(u)]\}\,du\right|
\leq (2\pi)^{-1}\int_{-\infty}^{\infty} |k(u h_n)|\, u^{2}\, |\varphi_n(u) - E[\varphi_n(u)]|\,du \qquad (\text{since } |e^{-iux}| = 1).
Let u h_n = v, so that du = \frac{1}{h_n}\,dv and u^{2} = \frac{v^{2}}{h_n^{2}}, to get
E^{1/2}\left[\sup_{-\infty < x < \infty} |f_n''(x) - E[f_n''(x)]|^{2}\right] \leq \int_{-\infty}^{\infty} |k(u h_n)|\, u^{2}\, \sigma[\varphi_n(u)]\,du \leq (n^{1/2} h_n^{3})^{-1}\int_{-\infty}^{\infty} |k(v)|\, v^{2}\,dv,
and
|E[f_n''(x)] - f''(x)| \leq (2\pi)^{-1}\int_{-\infty}^{\infty} |1 - k(u h_n)|\, u^{2}\, |\varphi(u)|\,du.
Both bounds tend to zero under the conditions of the theorem, which proves (2.1.15).
Equation (2.1.16) follows from (2.1.15) and the fact that θ?n tends to θ, since it is between
θn and θ, and θn tends to θ.
To prove (2.1.17), let
f_n'(\theta) = n^{-1}\sum_{k=1}^{n} V_{nk}, \qquad V_{nk} = \frac{1}{h_n^{2}}\, K'\left(\frac{\theta - X_k}{h_n}\right),
so that the V_{nk} are independent and identically distributed as V_n = h_n^{-2} K'\left(\frac{\theta - X}{h_n}\right). Now,
E|V_n|^{m} = \int_{-\infty}^{\infty}\left|\frac{1}{h_n^{2}} K'\left(\frac{\theta - y}{h_n}\right)\right|^{m} f(y)\,dy = \frac{1}{h_n^{2m}}\int_{-\infty}^{\infty}\left|K'\left(\frac{\theta - y}{h_n}\right)\right|^{m} f(y)\,dy.
Let u = \frac{\theta - y}{h_n}, so that y = \theta - u h_n and dy = -h_n\,du. That is,
E|V_n|^{m} = \frac{1}{h_n^{2m-1}}\int_{-\infty}^{\infty} |K'(u)|^{m} f(\theta - u h_n)\,du,
hence
h_n^{2m-1} E|V_n|^{m} \to f(\theta)\int_{-\infty}^{\infty} |K'(y)|^{m}\,dy.
Using Liapounov's condition, it is sufficient to show that, for some \delta > 0,
\frac{E|V_n - E[V_n]|^{2+\delta}}{n^{\delta/2}\, \sigma^{2+\delta}[V_n]} \to 0 \quad \text{as } n \to \infty.
Now,
(n h_n^{3})^{1/2} E[f_n'(\theta)] = (n h_n^{3})^{1/2}\left(\frac{-i}{2\pi}\right)\int_{-\infty}^{\infty} e^{-iu\theta}\{k(u h_n) - 1\}\, u\, \varphi(u)\,du \to 0,
n h_n^{3}\, \mathrm{Var}[f_n'(\theta)] = h_n^{-1}\int_{-\infty}^{\infty} K'^{2}\left(\frac{\theta - y}{h_n}\right) f(y)\,dy - n h_n^{3} E^{2}[f_n'(\theta)] \to f(\theta)\int_{-\infty}^{\infty} K'^{2}(y)\,dy.
Therefore
(n h_n^{3})^{1/2} f_n'(\theta) \to N(0, f(\theta) J),
which is equivalent to
\frac{f_n'(\theta) - E[f_n'(\theta)]}{\sigma[f_n'(\theta)]} \to N(0, 1) \quad \text{in distribution}.
To prove (2.1.18), from Equations (2.1.14), (2.1.16) and (2.1.17) we have
\theta_n - \theta = \frac{-f_n'(\theta)}{f_n''(\theta_n^{*})},
with
(n h_n^{3})^{1/2} f_n'(\theta) \to N(0, f(\theta) J) \quad \text{and} \quad f_n''(\theta_n^{*}) \to f''(\theta) \text{ in probability}.
Then
(n h_n^{3})^{1/2}(\theta_n - \theta) = \frac{-(n h_n^{3})^{1/2} f_n'(\theta)}{f_n''(\theta_n^{*})} \to N\left(0, \frac{f(\theta) J}{[f''(\theta)]^{2}}\right).
2.2 A Simple Estimation of the Mode
The estimator (2.1.1) is increasingly used, although it is difficult to calculate. Indeed,
in addition to the calculation of fn, it involves a numerical step for the computation of
arg max .
As noticed by [5], classical search methods for the arg max perform satisfactorily only when f_n is sufficiently regular. Thus in practice the arg max is usually computed over a finite grid, although this may affect the asymptotic properties of the estimator. Moreover, when the dimension of the sample space is large, or when accurate estimation is needed, the grid size, increasing exponentially with the dimension, leads to time-consuming computations.
Finally, the search grid should be located around high density areas. In high dimension this is a difficult task, and the search grid usually includes low density areas. To solve this problem, [3] proposed a concurrent estimator of the mode, \theta_n^{*}, which is defined by
\theta_n^{*} = \arg\max_{x \in S_n} f_n(x), \qquad (2.2.1)
where S_n = \{X_1, \ldots, X_n\} is the finite sample of d-dimensional data.
The main advantage of using θ?n instead of θn, is that the former is easily computed
in a finite number of operations. Moreover, since the sample points are naturally concen-
trated in high density areas, the set Sn can be regarded as the most natural random grid
for approximating the mode.
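A minimal sketch of (2.2.1), assuming a Gaussian kernel: f_n is evaluated only at the sample points themselves, so the maximization needs a finite number of operations. The function name mode_over_sample and the simulated data are our own.

```python
import numpy as np

def mode_over_sample(data, h):
    """theta*_n of (2.2.1): maximize f_n over the sample points themselves,
    instead of over a grid, so only n evaluations of f_n are needed."""
    u = (data[:, None] - data[None, :]) / h            # (n, n) scaled differences
    f_n_at_sample = np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))
    return data[np.argmax(f_n_at_sample)]

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=500)
print(mode_over_sample(data, h=0.3))                    # an observation near the true mode 2.0
```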
[3] established the strong consistency of \theta_n^{*} towards \theta_n, and provided an almost sure rate of convergence without any differentiability condition on f around the mode.
[2] examined whether maximization over a finite sample alters the rate of convergence of the estimate \theta_n^{*} compared to that of the estimate \theta_n. They proved that the two estimates have the same asymptotic behavior. Another use of \theta_n^{*} is that it may be an appropriate choice for a starting value of an optimization algorithm used to approximate \theta_n.
2.3 Nonparametric Regression Estimation
Kernel smoothing provides a simple way of finding structure in data sets without the imposition of a parametric model. One of the most fundamental settings where kernel smoothing ideas can be applied is the simple regression problem. In this case paired observations on two variables are available, and one is interested in determining an appropriate functional relationship between them. One of the variables, usually denoted by X, is called the predictor variable, and the other, usually denoted by Y, is called the response variable.
A well known result from elementary statistics is that the function m minimizing E\{Y - m(X)\}^{2} is the conditional expectation (mean) function of Y given X, that is,
m(X) = E(Y | X).
This function is usually called the regression function of Y on X. There are now several
approaches to the nonparametric regression problem. Some of the more popular are those
based on kernel functions, spline functions and wavelets. For more details see [22] and [23].
Each of these approaches has its particular strengths and weaknesses, although kernel
regression estimators have the advantage of mathematical and intuitive simplicity. One of the best known kernel estimators is the Nadaraya-Watson estimator.
The Nadaraya-Watson Estimator
Let (X_i, Y_i) be \mathbb{R} \times \mathbb{R} valued independent random variables with a common probability density function f. Also assume that X admits a marginal density g(x).
Suppose that we are given n observations of (X, Y), denoted by (X_1, Y_1), \ldots, (X_n, Y_n). First, we consider the following estimator of the joint density f(x, y) of (X, Y):
f_n(x, y) = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i),
and define the estimator of the marginal pdf of X as
g_n(x) = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i),
where K_{h_n}(x) = K(x/h_n)/h_n.
The Nadaraya-Watson estimator of the conditional density function f(y|x) is given by
f_n(y|x) = \frac{f_n(x, y)}{g_n(x)} = \frac{n^{-1}\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i)}{n^{-1}\sum_{i=1}^{n} K_{h_n}(x - X_i)} = \frac{\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i)}{\sum_{i=1}^{n} K_{h_n}(x - X_i)}.
Now, to estimate m(\cdot), we first compute an estimator of the joint density f(x, y) of (X, Y) and then integrate it according to the formula
m(x) = \frac{\int_{-\infty}^{\infty} y f(x, y)\,dy}{\int_{-\infty}^{\infty} f(x, y)\,dy}. \qquad (2.3.1)
Lemma 2.3.1.
Under the formulas for f_n(x, y) and f(x, y), we have
(1) \int_{-\infty}^{\infty} f_n(x, y)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i).
(2) \int_{-\infty}^{\infty} y f_n(x, y)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i)\, Y_i.
Proof :
Since \int_{-\infty}^{\infty} K(u)\,du = 1 and \int_{-\infty}^{\infty} u K(u)\,du = 0, we have that
(1) \int_{-\infty}^{\infty} f_n(x, y)\,dy = \int_{-\infty}^{\infty} \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i)\int_{-\infty}^{\infty} K_{h_n}(y - Y_i)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i).
(2) \int_{-\infty}^{\infty} y f_n(x, y)\,dy = \int_{-\infty}^{\infty} \frac{y}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i) K_{h_n}(y - Y_i)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i)\int_{-\infty}^{\infty} y\, K_{h_n}(y - Y_i)\,dy = \frac{1}{n}\sum_{i=1}^{n} K_{h_n}(x - X_i)\, Y_i.
If we substitute these into the numerator and denominator of (2.3.1), we obtain the Nadaraya-Watson kernel estimator of m(\cdot),
m_n(x) = \frac{\sum_{i=1}^{n} K_{h_n}(x - X_i)\, Y_i}{\sum_{i=1}^{n} K_{h_n}(x - X_i)} = \sum_{i=1}^{n} W_{ni}(x)\, Y_i,
where
W_{ni}(x) = \frac{K_{h_n}(x - X_i)}{\sum_{j=1}^{n} K_{h_n}(x - X_j)}, \quad i = 1, \ldots, n,
are the weight functions.
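The following is a minimal Python sketch of the Nadaraya-Watson estimator m_n(x), assuming a Gaussian kernel and simulated data of our own; it is an illustration, not the implementation used in the references.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Nadaraya-Watson estimate m_n(x) = sum_i K_h(x - X_i) Y_i / sum_i K_h(x - X_i),
    using a Gaussian kernel. x may be a scalar or an array of evaluation points."""
    x = np.atleast_1d(x)
    u = (x[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2)                 # the common factor 1/(h*sqrt(2*pi)) cancels in the ratio
    W = K / K.sum(axis=1, keepdims=True)    # weights W_ni(x), summing to one over i
    return W @ Y

rng = np.random.default_rng(2)
X = rng.uniform(0, 2 * np.pi, 300)
Y = np.sin(X) + rng.normal(scale=0.3, size=X.size)
print(nadaraya_watson(np.array([1.0, 3.0]), X, Y, h=0.3))   # roughly sin(1), sin(3)
```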
The bandwidth hn determines the degree of smoothness of mn(·). This can be imme-
diately seen by considering the limits for hn tending to zero or to infinity respectively.
Corollary 2.3.2.
(a) If h_n \to 0, then at an observation X_i,
m_n(X_i) \to \frac{K_{h_n}(0)\, Y_i}{K_{h_n}(0)} = Y_i,
indicating that small bandwidths reproduce the data.
(b) If h_n \to \infty, then
m_n(X_i) \to \frac{\sum_{i=1}^{n} K_{h_n}(0)\, Y_i}{\sum_{i=1}^{n} K_{h_n}(0)} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}.
Proof: See [20]
That is, in (a), if h_n \to 0 then m_n(x) reproduces the data at the observation points, while in (b), if h_n \to \infty then m_n(x) tends to the sample mean, suggesting that a large bandwidth leads to an oversmoothed curve. In general, the bandwidth h_n acts as follows: if h_n is very small, then the weights focus on the few observations that are in a small neighborhood around each X_i; if h_n is very large, then the weights spread over a larger neighborhood around each X_i.
Consequently, the choice of h_n plays an important role in kernel regression. These two limit considerations make it clear that the smoothing parameter h_n, in relation to the sample size n, should converge to zero neither too rapidly nor too slowly.
2.4 Joint Asymptotic Distribution of the Estimated
Conditional Mode
In nonparametric estimation of the regression function, most investigations are concerned with the regression function m(x), the conditional mean of Y given the value x of a predictor X. However, new insights about the underlying structure can be gained by considering other aspects of the conditional distribution f(y|x) of Y given X = x. One of these aspects is the conditional mode function, which will be the topic of this section.
Assume that (X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n) are i.i.d. random variables with joint probability density function f(x, y). The marginal probability density function of X_1 is
g(x) = \int_{-\infty}^{\infty} f(x, y)\,dy,
and the conditional probability density function of Y_1 given X_1 = x is given by
f(y|x) = \frac{f(x, y)}{g(x)}.
We assume that, for each x, f(x, y) is uniformly continuous in y, and it follows that f(\cdot|x) possesses a mode \theta(x) defined by
\theta(x) = \arg\max_{-\infty < y < \infty} f(y|x).
We call \theta(x) the population conditional mode, or the mode function, and we assume that \theta(x) is unique.
Let K be a measurable function and let h_n be a sequence of positive numbers converging to zero. We consider the Nadaraya-Watson estimator f_n(y|x) of the conditional density f(y|x).
If K is chosen such that K(u) tends to zero as u tends to \pm\infty, then for every sample sequence and for each x, f_n(y|x) is a continuous function of y and tends to zero as y tends to \pm\infty. Consequently, there is a random variable \theta_n(x) such that
\theta_n(x) = \arg\max_{-\infty < y < \infty} f_n(y|x).
We call \theta_n(x) the sample conditional mode. [18] considered \theta_n(x) as an estimator of \theta(x) and established conditions under which the estimator is strongly consistent and asymptotically normally distributed. They proved that (n h_n^{4})^{1/2}(\theta_n(x) - \theta(x)) is asymptotically normally distributed with mean zero and variance
\frac{f(x, \theta(x))}{\{f^{(0,2)}(x, \theta(x))\}^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\{K(u) K^{(1)}(v)\}^{2}\,du\,dv,
where K^{(1)}(v) denotes the first derivative of K(v), and f^{(0,2)}(x, \theta(x)) is defined in the following assumptions.
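As an illustration of the estimator θ_n(x), the following sketch computes the arg max of the Nadaraya-Watson conditional density estimate over a finite grid of y values, assuming Gaussian kernels and simulated data of our own choosing.

```python
import numpy as np

def conditional_mode(x, X, Y, h, y_grid):
    """theta_n(x): the y-grid point maximizing the Nadaraya-Watson conditional
    density estimate f_n(y|x), with Gaussian kernels in both coordinates."""
    Kx = np.exp(-0.5 * ((x - X) / h) ** 2)                  # kernel weights in the x-direction
    Ky = np.exp(-0.5 * ((y_grid[:, None] - Y[None, :]) / h) ** 2)
    f_cond = Ky @ Kx                                        # proportional to f_n(y|x) on the grid
    return y_grid[np.argmax(f_cond)]

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, 1000)
Y = X**2 + rng.normal(scale=0.3, size=X.size)               # conditional mode approx x^2
y_grid = np.linspace(-1, 5, 1201)
print(conditional_mode(1.0, X, Y, h=0.2, y_grid=y_grid))    # close to 1.0
```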
In this section, we will discuss this result for the multivariate case. For distinct points x_1, x_2, \ldots, x_k we will establish conditions under which (n h_n^{4})^{1/2}(\theta_n(x_1) - \theta(x_1), \ldots, \theta_n(x_k) - \theta(x_k))^{T}, where T denotes the transpose, is asymptotically multivariate normal with mean zero vector and diagonal covariance matrix B = [b_{ij}] with
b_{ii} = \frac{f(x_i, \theta(x_i))}{\{f^{(0,2)}(x_i, \theta(x_i))\}^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\{K(u) K^{(1)}(v)\}^{2}\,du\,dv, \qquad b_{ij} = 0 \ (i \neq j).
We consider the following assumptions from [18],
(A1) (X_1, Y_1), \ldots, (X_n, Y_n) is a sample of i.i.d. random variables with joint probability density function f(x, y), where the following hold:
(i) g(x), the marginal probability density function of X, is uniformly continuous.
(ii) f^{(i,j)}(x, y) = \frac{\partial^{i+j} f(x, y)}{\partial x^{i}\partial y^{j}} exists and is bounded for 1 \leq i + j \leq 3.
(A2) The kernel K is a Borel function and satisfies the following:
(i) K(u) tends to zero as u tends to \pm\infty.
(ii) K(u) and its first two derivatives are functions of bounded variation.
(iii) \lim_{|u| \to \infty} |u^{2} K^{(i)}(u)| = 0, \ (i = 0, 1).
(iv) \int_{-\infty}^{\infty} u^{i} K(u)\,du = 1 \text{ if } i = 0, \text{ and } = 0 \text{ if } i = 1, 2.
(v) \int_{-\infty}^{\infty} |u|^{3} K(u)\,du < \infty.
(A3) h_n is a sequence of positive numbers tending to zero and satisfies
h_n = n^{-\delta}, \quad \frac{1}{10} < \delta < \frac{1}{8}; \quad \text{i.e. } \lim_{n \to \infty} n h_n^{8} = \infty \text{ and } \lim_{n \to \infty} n h_n^{10} = 0.
To prove our result we will use the following preliminary lemmas from [18] and [20].
Lemma 2.4.1. (Bochner Lemma)
Suppose K_1(u) and K_2(u) are real valued Borel measurable functions satisfying the following conditions:
1. \sup_{u \in \mathbb{R}} |K_i(u)| < \infty, \ (i = 1, 2).
2. \int_{-\infty}^{\infty} |K_i(u)|\,du < \infty, \ (i = 1, 2).
3. \lim_{|u| \to \infty} |u^{2} K_i(u)| = 0, \ (i = 1, 2).
If (x, y) \in C(f), the set of continuity points of f, then for any \eta \geq 0,
\lim_{n \to \infty}\left[h_n^{-2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left|K_1\left(\frac{u}{h_n}\right) K_2\left(\frac{v}{h_n}\right)\right|^{1+\eta} f(x - u, y - v)\,du\,dv\right] = f(x, y)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |K_1(u) K_2(v)|^{1+\eta}\,du\,dv.
Define
f_n^{(0,j)}(x, y) = ∂^j f_n(x, y)/∂y^j = (nh_n^{j+2})^{-1} Σ_{i=1}^{n} K((x − X_i)/h_n) K^{(j)}((y − Y_i)/h_n),
where K^{(j)} denotes the jth derivative of K, (j = 1, 2), and
W_ni = h_n^{-3} K((x − X_i)/h_n) K^{(1)}((y − Y_i)/h_n), (i = 1, 2, ..., n).
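For readers who prefer a computational view of these quantities, the sketch below (Python/NumPy; the Gaussian kernel, whose derivatives are available in closed form, and the simulated data are illustrative assumptions) evaluates f_n^{(0,1)} and f_n^{(0,2)} at a point and performs one Newton-type step of the form suggested by equation (2.4.1) below.

```python
import numpy as np

def K0(v):
    """Gaussian kernel K(v)."""
    return np.exp(-0.5 * v**2) / np.sqrt(2.0 * np.pi)

def K1(v):
    """First derivative of the Gaussian kernel."""
    return -v * K0(v)

def K2(v):
    """Second derivative of the Gaussian kernel."""
    return (v**2 - 1.0) * K0(v)

def f_0j(x, y, X, Y, h, j):
    """f_n^{(0,j)}(x, y) = (n h^{j+2})^{-1} sum_i K((x-X_i)/h) K^{(j)}((y-Y_i)/h)."""
    Kj = {0: K0, 1: K1, 2: K2}[j]
    return np.sum(K0((x - X) / h) * Kj((y - Y) / h)) / (len(X) * h ** (j + 2))

# one Newton-type step implied by (2.4.1):
# near the mode, theta_n(x) - y0 is approximately -f_n^{(0,1)}(x, y0) / f_n^{(0,2)}(x, y0)
rng = np.random.default_rng(1)
X = rng.standard_normal(5000)
Y = X + 0.5 * rng.standard_normal(5000)
x, y0, h = 0.0, 0.2, 0.3          # y0 is an initial guess for the mode at x
step = -f_0j(x, y0, X, Y, h, 1) / f_0j(x, y0, X, Y, h, 2)
print(y0 + step)                  # should move towards the true mode theta(0) = 0
```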
Lemma 2.4.2.
Under the assumptions (A1)(ii), (A2) and (A3), if (x, y) ∈ C(f), then the following are true:
(i) lim_{n→∞} nh_n^4 Var[f_n^{(0,1)}(x, y)] = f(x, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv.
(ii) (nh_n^4)^{1/2} [E f_n^{(0,1)}(x, y) − f^{(0,1)}(x, y)] = o(1).
Lemma 2.4.3.
Under the assumptions of the above lemma, the following is true:
lim_{n→∞} (n^{-1} h_n^4)^{1+δ/2} Σ_{i=1}^{n} E|W_ni − E W_ni|^{2+δ} = 0.
For fixed x, expanding f_n^{(0,1)}(x, θ_n(x)) around θ(x), we obtain
0 = f_n^{(0,1)}(x, θ_n(x)) = f_n^{(0,1)}(x, θ(x)) + (θ_n(x) − θ(x)) f_n^{(0,2)}(x, θ*_n(x)),
where |θ*_n(x) − θ(x)| < |θ_n(x) − θ(x)|. Hence,
θ_n(x) − θ(x) = − f_n^{(0,1)}(x, θ(x)) / f_n^{(0,2)}(x, θ*_n(x)).   (2.4.1)
Lemma 2.4.4.
Under the assumptions (A1), (A2)(ii), (iii) and (A3), if g(x) > 0, then f_n^{(0,2)}(x, θ*_n(x)) converges in probability to f^{(0,2)}(x, θ(x)) as n tends to infinity.
Now we prove an intermediate result in the next theorem.
Theorem 2.4.5.
Suppose that x_1, x_2, ..., x_k are distinct points, where f(x_i, y) > 0 and (x_i, y) ∈ C(f), (i = 1, 2, ..., k). Then, under the assumptions (A1), (A2)(ii), (iii) and (A3), the distribution of the vector
(nh_n^4)^{1/2} (f_n^{(0,1)}(x_1, y) − f^{(0,1)}(x_1, y), ..., f_n^{(0,1)}(x_k, y) − f^{(0,1)}(x_k, y))^T,
where T denotes the transpose, is asymptotically multivariate normal with mean zero vector and diagonal covariance matrix Γ = [γ_ij], where γ_ij = 0 for i ≠ j and
γ_ii = f(x_i, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv, (i = 1, 2, ..., k).
Proof:
Without loss of generality, we consider the special case k = 2; the same arguments apply in the general case. Before starting the proof, we introduce some notation. For i = 1, 2, ..., n and s = 1, 2, we define the following:
V_ni(x_s) = h_n^{-3} K((x_s − X_i)/h_n) K^{(1)}((y − Y_i)/h_n),
W_ni(x_s) = h_n^2 (V_ni(x_s) − E V_ni(x_s)),
W_n(x_s) = Σ_{i=1}^{n} W_ni(x_s),
Z_ni = (W_ni(x_1), W_ni(x_2))^T,
Z_n = n^{-1/2} (W_n(x_1), W_n(x_2))^T,
so that
Z_n = (nh_n^4)^{1/2} (f_n^{(0,1)}(x_1, y) − E f_n^{(0,1)}(x_1, y), f_n^{(0,1)}(x_2, y) − E f_n^{(0,1)}(x_2, y))^T.   (2.4.2)
Let A = [a_rs] be the 2 × 2 diagonal matrix with
a_ss = f(x_s, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv, (s = 1, 2),
and let Z be the bivariate normal random vector with mean vector zero and covariance matrix A.
First, we will show that Z_n converges in distribution to Z. To do that, we will use the Cramer-Wold theorem: it is sufficient to prove that CZ_n^T converges in distribution to CZ^T for any constant C = (c_1, c_2) ∈ R^2, C ≠ 0.
Note that
CZ_n^T = Σ_{i=1}^{n} n^{-1/2} C Z_ni,   E(n^{-1/2} C Z_ni) = 0.
Let ρ_{ni}^{2+δ} = E|n^{-1/2} C Z_ni|^{2+δ}, ρ_n^{2+δ} = Σ_{i=1}^{n} ρ_{ni}^{2+δ}, and σ_n^2 = Var(CZ_n^T).
By Liapounov's theorem, it is sufficient to show that
lim_{n→∞} ρ_n^{2+δ} / σ_n^{2+δ} = 0.   (2.4.3)
Now, the proof of Theorem 2.4.5 will be given via the following lemmas.
Lemma 2.4.6.
Under conditions (A2)(ii), (iii), (iv), if (x_s, y) ∈ C(f), then for s = 1, 2 and r = 1, 2 the following are true:
(a) lim_{n→∞} E W_ni^2(x_s) = f(x_s, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv.
(b) lim_{n→∞} E W_ni(x_s) W_ni(x_r) = 0, (r ≠ s).
Proof:
(a) By the definition of W_ni(x_s),
E W_ni^2(x_s) = h_n^4 (E V_ni^2(x_s) − (E V_ni(x_s))^2),   (2.4.4)
where
h_n^4 E V_ni^2(x_s) = h_n^4 h_n^{-6} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K^2((x_s − u)/h_n) [K^{(1)}((y − v)/h_n)]^2 f(u, v) du dv
= h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u/h_n) K^{(1)}(v/h_n)]^2 f(x_s − u, y − v) du dv.
Now, by an application of the Bochner Lemma, we obtain
lim_{n→∞} h_n^4 E V_ni^2(x_s) = f(x_s, y) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv.   (2.4.5)
Next,
h_n^4 (E V_ni(x_s))^2 = h_n^2 (h_n E V_ni(x_s))^2 = h_n^2 ( h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u/h_n) K^{(1)}(v/h_n) f(x_s − u, y − v) du dv )^2.
By another application of the Bochner Lemma, we obtain
lim_{n→∞} h_n^4 (E V_ni(x_s))^2 = 0.   (2.4.6)
Combining (2.4.4), (2.4.5) and (2.4.6), (a) holds.
(b) From the definition of W_ni(x_s), we have
E(W_ni(x_1) W_ni(x_2)) = h_n^4 (E V_ni(x_1) V_ni(x_2) − E V_ni(x_1) E V_ni(x_2)).   (2.4.7)
Suppose that x_2 > x_1, and let δ = x_2 − x_1 and δ_n = δ/h_n. Then
h_n^4 E V_ni(x_1) V_ni(x_2) = h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K((x_1 − u)/h_n) K((x_2 − u)/h_n) [K^{(1)}((y − v)/h_n)]^2 f(u, v) du dv
= ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u) K(δ_n + u) [K^{(1)}(v)]^2 f(x_1 − h_n u, y − h_n v) du dv
= ∫_{-∞}^{∞} [K^{(1)}(v)]^2 [ ∫_{-∞}^{∞} K(u) K(δ_n + u) g(x_1 − h_n u) f(y − h_n v | x_1 − h_n u) du ] dv.   (2.4.8)
Next,
∫_{-∞}^{∞} K(u) K(δ_n + u) g(x_1 − h_n u) du = ∫_{|u|<δ_n/2} K(u) K(δ_n + u) g(x_1 − h_n u) du + ∫_{|u|≥δ_n/2} K(u) K(δ_n + u) g(x_1 − h_n u) du
≤ sup_{|u|<δ_n/2} K(δ_n + u) ∫_{-∞}^{∞} K(z) g(x_1 − h_n z) dz + sup_{|u|≥δ_n/2} K(u) ∫_{-∞}^{∞} K(δ_n + z) g(x_1 − h_n z) dz
≤ sup_{|u|≥δ_n/2} K(u) · O(1) + sup_{|u|≥δ_n/2} K(u) · O(1)
= 2 sup_{|u|≥δ_n/2} K(u) · O(1)
≤ (4/δ_n) sup_{|u|≥δ_n/2} |u K(u)| · O(1)
= (4h_n/δ) sup_{|u|≥δ_n/2} |u K(u)| · O(1) = O(h_n).   (2.4.9)
Finally, from (2.4.8) and (2.4.9), we have
lim_{n→∞} h_n^4 E V_ni(x_1) V_ni(x_2) = 0.   (2.4.10)
Similarly,
h_n^4 E V_ni(x_1) E V_ni(x_2) = h_n^2 [ h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u/h_n) K^{(1)}(v/h_n) f(x_1 − u, y − v) du dv ] × [ h_n^{-2} ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(w/h_n) K^{(1)}(v/h_n) f(x_2 − w, y − v) dw dv ] → 0   (2.4.11)
by an application of the Bochner Lemma. The proof of the lemma is completed by combining (2.4.7), (2.4.10) and (2.4.11).
Lemma 2.4.7.
Under the conditions of the last lemma, we have
lim_{n→∞} σ_n^2 = CAC^T.
Proof:
Since σ_n^2 = Var(CZ_n^T), by the definition of Z_n we have
σ_n^2 = Var(n^{-1/2} c_1 W_n(x_1) + n^{-1/2} c_2 W_n(x_2))
= n^{-1} c_1^2 Var(W_n(x_1)) + n^{-1} c_2^2 Var(W_n(x_2)) + 2 n^{-1} c_1 c_2 Cov(W_n(x_1), W_n(x_2))
= n^{-1} c_1^2 Σ_{i=1}^{n} Var(W_ni(x_1)) + n^{-1} c_2^2 Σ_{i=1}^{n} Var(W_ni(x_2)) + 2 n^{-1} c_1 c_2 Cov(Σ_{i=1}^{n} W_ni(x_1), Σ_{i=1}^{n} W_ni(x_2))
= c_1^2 Var(W_ni(x_1)) + c_2^2 Var(W_ni(x_2)) + 2 n^{-1} c_1 c_2 E(Σ_{i=1}^{n} Σ_{j=1}^{n} W_ni(x_1) W_nj(x_2)).
Since CAC^T is the quadratic form associated with the positive definite matrix A (so that CAC^T > 0), an application of Lemma 2.4.6 implies that
lim_{n→∞} σ_n^2 = ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv [c_1^2 f(x_1, y) + c_2^2 f(x_2, y)] = CAC^T > 0.
Now,
ρ_{ni}^{2+δ} ≤ n^{-(1+δ/2)} |C|^{2+δ} E|Z_ni|^{2+δ}
= n^{-(1+δ/2)} |C|^{2+δ} E|(W_ni(x_1), W_ni(x_2))|^{2+δ}
≤ n^{-(1+δ/2)} |C|^{2+δ} 2^{2+δ} max{E|W_ni(x_1)|^{2+δ}, E|W_ni(x_2)|^{2+δ}}.
Assume that E|W_ni(x_1)|^{2+δ} ≥ E|W_ni(x_2)|^{2+δ}. Then we have
ρ_{ni}^{2+δ} ≤ n^{-(1+δ/2)} |C|^{2+δ} 2^{2+δ} E|W_ni(x_1)|^{2+δ}
= n^{-(1+δ/2)} |C|^{2+δ} 2^{2+δ} E|h_n^2 (V_ni(x_1) − E V_ni(x_1))|^{2+δ}
= |C|^{2+δ} 2^{2+δ} (n^{-1} h_n^4)^{1+δ/2} E|V_ni(x_1) − E V_ni(x_1)|^{2+δ}.
This implies that
ρ_n^{2+δ} = Σ_{i=1}^{n} ρ_{ni}^{2+δ} ≤ |C|^{2+δ} 2^{2+δ} (n^{-1} h_n^4)^{1+δ/2} Σ_{i=1}^{n} E|V_ni(x_1) − E V_ni(x_1)|^{2+δ},
which converges to zero as n tends to infinity by an application of Lemma 2.4.3. Hence Liapounov's condition, lim_{n→∞} ρ_n^{2+δ}/σ_n^{2+δ} = 0, is satisfied, so CZ_n^T is asymptotically normally distributed with mean zero and variance CAC^T.
By the Cramer-Wold theorem, Z_n converges in distribution to Z. Now an application of Lemma 2.4.2(ii) to equation (2.4.2) completes the proof of Theorem 2.4.5.
We are now in a position to prove our main theorem.
Theorem 2.4.8.
Suppose that x_1, x_2, ..., x_k are distinct points, where f(x_i, θ(x_i)) > 0 and (x_i, θ(x_i)) ∈ C(f), (i = 1, 2, ..., k). Then, under the assumptions (A1)-(A3), the distribution of the vector
(nh_n^4)^{1/2} (θ_n(x_1) − θ(x_1), ..., θ_n(x_k) − θ(x_k))^T,
where T denotes the transpose, is asymptotically multivariate normal with mean vector zero and diagonal covariance matrix B = [b_ij], where b_ij = 0 for i ≠ j and
b_ii = (f(x_i, θ(x_i)) / [f^{(0,2)}(x_i, θ(x_i))]^2) ∫_{-∞}^{∞} ∫_{-∞}^{∞} [K(u) K^{(1)}(v)]^2 du dv.
Proof:
By (2.4.1),
(nh_n^4)^{1/2} (θ_n(x_1) − θ(x_1), ..., θ_n(x_k) − θ(x_k))^T = −(nh_n^4)^{1/2} ( f_n^{(0,1)}(x_1, θ(x_1)) / f_n^{(0,2)}(x_1, θ*_n(x_1)), ..., f_n^{(0,1)}(x_k, θ(x_k)) / f_n^{(0,2)}(x_k, θ*_n(x_k)) )^T,
where |θ*_n(x_i) − θ(x_i)| < |θ_n(x_i) − θ(x_i)|, (i = 1, 2, ..., k).
An application of Theorem 2.4.5 and Lemma 2.4.4 completes the proof.
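One practical use of Theorem 2.4.8 is to attach an approximate standard error to θ_n(x) through the diagonal entry b_ii. A minimal Python sketch is given below; the Gaussian kernel, the plug-in kernel estimates of f and f^{(0,2)}, the bandwidth and the simulated data are all illustrative assumptions rather than part of the theorem.

```python
import numpy as np

def K(v):
    return np.exp(-0.5 * v**2) / np.sqrt(2.0 * np.pi)

# For the Gaussian kernel, int K(u)^2 du = 1/(2 sqrt(pi)) and
# int (K^{(1)}(v))^2 dv = 1/(4 sqrt(pi)), so the double integral of
# [K(u) K^{(1)}(v)]^2 equals 1/(8 pi).
KERNEL_CONST = 1.0 / (8.0 * np.pi)

def f_joint(x, y, X, Y, h):
    """Kernel estimate of the joint density f(x, y)."""
    return np.mean(K((x - X) / h) * K((y - Y) / h)) / h**2

def f_02(x, y, X, Y, h):
    """Kernel estimate of f^{(0,2)}(x, y), the second y-derivative of f(x, y)."""
    v = (y - Y) / h
    return np.mean(K((x - X) / h) * (v**2 - 1.0) * K(v)) / h**4

def mode_stderr(x, theta_hat, X, Y, h):
    """Plug-in standard error sqrt(b_ii / (n h^4)) suggested by Theorem 2.4.8."""
    b = f_joint(x, theta_hat, X, Y, h) * KERNEL_CONST / f_02(x, theta_hat, X, Y, h) ** 2
    return np.sqrt(b / (len(X) * h**4))

# illustrative use on simulated data whose conditional mode at x = 0 is 0
rng = np.random.default_rng(2)
X = rng.standard_normal(4000)
Y = X + 0.5 * rng.standard_normal(4000)
print(mode_stderr(0.0, 0.0, X, Y, h=0.35))
```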
Chapter 3
Quantiles Regression
The term quantile is synonymous with percentile; the median is the best-known example of a quantile. We know that the sample median can be defined as the middle value (or the value half-way between the two middle values) of a set of ranked data, i.e. the sample median splits the data into two parts with an equal number of data points in each. Usually, the sample median is taken as an estimator of the population median m, a quantity which splits the distribution into two halves in the sense that, if the random variable Y can be measured on the population, then P(Y ≤ m) = P(Y ≥ m) = 1/2. In particular, for a continuous random variable, m is a solution of the equation F(m) = 1/2, where F(y) = P(Y ≤ y) is the cumulative distribution function.
As an example of the use of the median, consider the distribution of salaries. This is typically skewed to the right, since relatively few people earn large salaries. As a consequence, the sample median provides a better summary of typical salaries than the mean.
More generally, the 25% and 75% sample quantiles can be defined as the values that split the data into proportions of one quarter and three quarters, and three quarters and one quarter, respectively. Correspondingly, in the continuous case, the population lower quartile and upper quartile are the solutions of the equations F(y) = 1/4 and F(y) = 3/4, respectively. Generally, for a proportion α (0 < α < 1), and in the continuous case, the 100α% quantile (equivalently, the 100αth percentile) of F is the value y which solves F(y) = α. Note that we assume that this value is unique.
A further generalization of the concept, to the conditional quantile, emerges when we want to study the relationship between a response variable Y and a covariate X, and to quantify that relationship by regression analysis. The conditional distribution function F(y|X = x) plays a central role in this problem.
In parametric and nonparametric estimation of the conditional distribution function, most investigations of the underlying structure are concerned with the conditional mean function m(x) = E(Y|X = x), the conditional mean of Y given the value x of X. New insights about the underlying structure can be gained by considering other aspects of the conditional distribution function F(y|X = x).
Estimation of conditional quantiles has received particular attention during the last three decades because of its useful applications in various fields such as econometrics, finance, environmental sciences and medicine. For more details see [9].
This chapter consists of three sections. In the first section we discuss the asymptotic normality of the conditional quantiles; in the second section we discuss their joint asymptotic normality; finally, in Section 3.3 we give a comparison between the mode and the median.
3.1 Nonparametric estimation of conditional quantiles
In this section, we introduce the definition of the conditional α-quantile and discuss the asymptotic normality of its kernel estimator.
Let (X, Y) be a bivariate random variable and F(y|X = x) = P(Y ≤ y | X = x) the conditional distribution function of Y given X = x.
Definition 3.1.1. The conditional α-quantile q_α(x) is defined as follows:
q_α(x) = inf{y ∈ R | F(y|x) ≥ α}, 0 < α < 1, x ∈ R.
The quantiles give more complete information about the distribution of Y as a function
of the predictor variable X than the conditional mean alone.
[1] discussed the following two kernel estimators of the conditional cdf F(y|x) and of the α-quantile q_α(x), respectively:
F_n(y|x) = Σ_{i=1}^{n} I_{Y_i ≤ y} K((x − X_i)/h_n) / Σ_{i=1}^{n} K((x − X_i)/h_n),   (3.1.1)
q_{n,α}(x) = inf{y ∈ R | F_n(y|x) ≥ α}, 0 < α < 1.   (3.1.2)
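A direct implementation of (3.1.1) and (3.1.2) is straightforward. The Python sketch below follows only the form of the estimators; the compactly supported Epanechnikov kernel, the simulated data and the bandwidth are illustrative assumptions.

```python
import numpy as np

def epanechnikov(u):
    """Compactly supported kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def F_n(y, x, X, Y, h):
    """Kernel estimator (3.1.1) of the conditional cdf F(y|x)."""
    w = epanechnikov((x - X) / h)
    return np.sum(w * (Y <= y)) / np.sum(w)

def q_n(alpha, x, X, Y, h):
    """Conditional sample quantile (3.1.2): smallest y with F_n(y|x) >= alpha,
    searched over the observed responses."""
    for y in np.sort(Y):
        if F_n(y, x, X, Y, h) >= alpha:
            return y
    return np.max(Y)

# illustrative use: for Y | X = x ~ N(x, 0.25) the conditional median is x
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 1500)
Y = X + 0.5 * rng.standard_normal(1500)
print(q_n(0.5, 0.5, X, Y, h=0.1))        # should be close to 0.5
```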
Now we consider E[F_n(y|x)] and Var[F_n(y|x)], in order to obtain the mean squared error of the estimator.
Lemma 3.1.1.
Let the Y_i be independent random variables (the X_i being treated as fixed). The expectation of the estimator F_n(y|x) is given by
E[F_n(y|x)] = Σ_{i=1}^{n} K((x − X_i)/h_n) F(y|X_i) / Σ_{i=1}^{n} K((x − X_i)/h_n).
Proof:
From the definition of the expectation and Equation (3.1.1), since the denominator of (3.1.1) is non-random, we have
E[F_n(y|x)] = E[ Σ_{i=1}^{n} I_{Y_i ≤ y} K((x − X_i)/h_n) / Σ_{i=1}^{n} K((x − X_i)/h_n) ]
= E[ Σ_{i=1}^{n} I_{Y_i ≤ y} K((x − X_i)/h_n) ] / Σ_{i=1}^{n} K((x − X_i)/h_n)
= Σ_{i=1}^{n} K((x − X_i)/h_n) ∫_{-∞}^{∞} I_{t ≤ y} f(t|X_i) dt / Σ_{i=1}^{n} K((x − X_i)/h_n)
= Σ_{i=1}^{n} K((x − X_i)/h_n) ∫_{-∞}^{y} f(t|X_i) dt / Σ_{i=1}^{n} K((x − X_i)/h_n)
= Σ_{i=1}^{n} K((x − X_i)/h_n) F(y|X_i) / Σ_{i=1}^{n} K((x − X_i)/h_n).   (3.1.3)
Thus the proof of this lemma is completed.
Now, the next lemma gives the variance of F_n(y|x).
Lemma 3.1.2.
Let the Y_i be independent random variables. The variance of the estimator F_n(y|x) is given by
Var[F_n(y|x)] = Σ_{i=1}^{n} K^2((x − X_i)/h_n) [F(y|X_i) − F^2(y|X_i)] / [Σ_{i=1}^{n} K((x − X_i)/h_n)]^2.
Proof:
From the definition of the variance and Equation (3.1.1), and writing K_i = K((x − X_i)/h_n), we have
Var[F_n(y|x)] = E[F_n^2(y|x)] − (E[F_n(y|x)])^2
= E[ (Σ_{i=1}^{n} I_{Y_i ≤ y} K_i)^2 ] / (Σ_{i=1}^{n} K_i)^2 − ( Σ_{i=1}^{n} K_i F(y|X_i) )^2 / (Σ_{i=1}^{n} K_i)^2
= { Σ_{i=1}^{n} K_i^2 E[I_{Y_i ≤ y}^2] + Σ_{i ≠ j} K_i K_j E[I_{Y_i ≤ y}] E[I_{Y_j ≤ y}] − ( Σ_{i=1}^{n} K_i F(y|X_i) )^2 } / (Σ_{i=1}^{n} K_i)^2   (by the independence of the Y_i)
= { Σ_{i=1}^{n} K_i^2 F(y|X_i) + Σ_{i ≠ j} K_i K_j F(y|X_i) F(y|X_j) − Σ_{i=1}^{n} K_i^2 F^2(y|X_i) − Σ_{i ≠ j} K_i K_j F(y|X_i) F(y|X_j) } / (Σ_{i=1}^{n} K_i)^2   (since I_{Y_i ≤ y}^2 = I_{Y_i ≤ y} and E[I_{Y_i ≤ y}] = F(y|X_i))
= Σ_{i=1}^{n} K_i^2 [F(y|X_i) − F^2(y|X_i)] / (Σ_{i=1}^{n} K_i)^2.   (3.1.4)
For further results we need assumptions on the kernel function, the bandwidth and the conditional distribution function. These assumptions will be used throughout this chapter.
(A1) h_n is a sequence of positive numbers satisfying:
(i) h_n → 0 as n → ∞;
(ii) nh_n → ∞ as n → ∞.
(A2) The kernel K is a Borel function and satisfies the following:
(i) K has a compact support;
(ii) K is symmetric;
(iii) K is Lipschitz-continuous;
(iv) ∫ K(u) du = 1;
(v) K is bounded.
(A3) For a fixed y ∈ R, F''(y|x) = ∂^2 F(y|x)/∂x^2 exists in a neighborhood of x.
We assume that (A2)(i, ii) and (A3) are satisfied. Let U_i = (x − X_i)/h_n and x ∈ (h_n, 1 − h_n); then from Lemma (1.5.1) it follows that
E[F_n(y|x)] − F(y|x) = (1/2) h_n^2 μ_2(K) F''(y|x) + o(h_n^2),
where
μ_2(K) = ∫ u^2 K(u) du ≈ Σ_i U_i^2 K(U_i) / Σ_i K(U_i).
Then
E[F_n(y|x)] = F(y|x) + (h_n^2/2) (Σ_i U_i^2 K(U_i) / Σ_i K(U_i)) F''(y|x) + o(h_n^2).   (3.1.5)
Lemma 3.1.3 (Integral approximation of the sum over the kernel function).
Under (A2)(i), the Lipschitz continuity (A2)(iii) and the mean value theorem of integration, it follows that
(i) lim_{n→∞} Σ_{i=1}^{n} (1/(nh_n)) K(U_i) = ∫_{-∞}^{∞} K(u) du,
(ii) lim_{n→∞} Σ_{i=1}^{n} (1/(nh_n)) K^2(U_i) = ∫_{-∞}^{∞} K^2(u) du,
(iii) lim_{n→∞} Σ_{i=1}^{n} (1/(nh_n)) U_i K(U_i) = ∫_{-∞}^{∞} u K(u) du.
Proof:
Let J be the index set of the observations contributing to the sum, with cardinality |J| = O(nh_n). We prove (i); the proofs of (ii) and (iii) are similar. The design points are taken to be equally spaced with spacing 1/n, so that X_i − X_{i−1} = 1/n and hence U_{i−1} − U_i = 1/(nh_n). With ζ_i ∈ [U_i, U_{i−1}] given by the mean value theorem of integration,
| Σ_{i=1}^{n} (1/(nh_n)) K(U_i) − ∫_{-∞}^{∞} K(u) du | ≤ Σ_{i∈J} | (1/(nh_n)) K(U_i) − ∫_{U_i}^{U_{i−1}} K(u) du |
= Σ_{i∈J} | (1/(nh_n)) K(U_i) − (U_{i−1} − U_i) K(ζ_i) |
= Σ_{i∈J} | (1/(nh_n)) K(U_i) − ((X_i − X_{i−1})/h_n) K(ζ_i) |
= Σ_{i∈J} | (1/(nh_n)) K(U_i) − (1/(nh_n)) K(ζ_i) |
= (1/(nh_n)) Σ_{i∈J} | K(U_i) − K(ζ_i) |
≤ (1/(nh_n)) Σ_{i∈J} L |U_i − ζ_i|   (from the Lipschitz condition)
≤ (1/(nh_n)) Σ_{i∈J} O(1/(nh_n))
= O(1/(n^2 h_n^2)) Σ_{i∈J} 1
= O(1/(nh_n)).
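The statement of the lemma is easy to check numerically for an equally spaced design; in the sketch below (Python/NumPy; the equispaced design, the Gaussian kernel and the chosen n, h and x are illustrative assumptions) the normalized kernel sums are compared with the corresponding integrals.

```python
import numpy as np

def K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

n, h, x = 5000, 0.05, 0.5
X = np.arange(1, n + 1) / n                 # equally spaced design on (0, 1]
U = (x - X) / h

# sums of Lemma 3.1.3 versus the limiting integrals
print(np.sum(K(U)) / (n * h), 1.0)                              # int K(u) du = 1
print(np.sum(K(U)**2) / (n * h), 1.0 / (2.0 * np.sqrt(np.pi)))  # int K^2(u) du
print(np.sum(U * K(U)) / (n * h), 0.0)                          # int u K(u) du = 0
```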
Now, we want to approximate the mean square error as in the next Theorem.
Theorem 3.1.4.
Let the Y_i be independent and let (A1)(i, ii), (A2)(i, ii, iii, iv) and (A3) be satisfied. Then it holds, for n → ∞ and x ∈ (h_n, 1 − h_n), that
MSE(F_n(y|x)) ≈ [ (h_n^2/2) F''(y|x) ∫ u^2 K(u) du ]^2 + (1/(nh_n)) (F(y|x) − F^2(y|x)) ∫ K^2(u) du.   (3.1.6)
Proof:
Since MSE(F_n(y|x)) = (E[F_n(y|x)] − F(y|x))^2 + Var[F_n(y|x)], and the bias is given by (3.1.5), it remains only to approximate Var[F_n(y|x)]. Taylor expansion in the x-argument (recall X_i = x − h_n U_i) yields
F(y | x − h_n U_i) = F(y|x) − h_n U_i F'(y|x) + (1/2) h_n^2 U_i^2 F''(y|x) + o(h_n^2),
F^2(y | x − h_n U_i) = F^2(y|x) − 2 h_n U_i F(y|x) F'(y|x) + h_n^2 U_i^2 [F'(y|x)]^2 + h_n^2 U_i^2 F(y|x) F''(y|x) + o(h_n^2).
Let condition (A2)(ii) hold and write A = 1/[Σ_i K(U_i)]^2. Then, from Lemma 3.1.2,
Var[F_n(y|x)] = A Σ_{i=1}^{n} K^2(U_i) [F(y|X_i) − F^2(y|X_i)]
= A [F(y|x) − F^2(y|x)] Σ_i K^2(U_i) − A h_n [F'(y|x) − 2F(y|x)F'(y|x)] Σ_i U_i K^2(U_i) + A O(h_n^2) Σ_i U_i^2 K^2(U_i).
By the symmetry of K, (A2)(ii), and the integral approximation of the last lemma, the second and third terms are of smaller order than the first, so that
Var[F_n(y|x)] ≈ ( Σ_i K^2(U_i) / [Σ_i K(U_i)]^2 ) [F(y|x) − F^2(y|x)] ≈ (1/(nh_n)) [F(y|x) − F^2(y|x)] ∫ K^2(u) du.
So the proof of this theorem is completed.
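For a quick numerical reading of (3.1.6), the sketch below (Python; the Epanechnikov kernel constants and the plug-in values of F(y|x) and F''(y|x) are illustrative assumptions) evaluates the squared-bias and variance terms separately.

```python
import numpy as np

def amse_cdf(n, h, F, Fxx, mu2K=0.2, RK=0.6):
    """Asymptotic MSE approximation (3.1.6) for F_n(y|x).
    F    : plug-in value of F(y|x)
    Fxx  : plug-in value of F''(y|x) = d^2 F(y|x)/dx^2
    mu2K : int u^2 K(u) du  (1/5 for the Epanechnikov kernel)
    RK   : int K(u)^2 du    (3/5 for the Epanechnikov kernel)"""
    bias2 = (0.5 * h**2 * Fxx * mu2K) ** 2
    var = (F - F**2) * RK / (n * h)
    return bias2, var, bias2 + var

print(amse_cdf(n=1000, h=0.2, F=0.5, Fxx=1.0))
```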
Thus the bias of F_n(y|x) depends on the smoothness of the underlying conditional distribution function through F''(y|x). It is now possible to give a formal assessment of the asymptotic mean squared error.
Observe that the mean squared error depends on the second derivative of the conditional distribution function and on the difference F(y|x) − F^2(y|x). This means that the variance of the estimator is highest in the middle of the distribution, since the maximum of F(y|x) − F^2(y|x) is 1/4 and is attained when F(y|x) = 1/2.
From the last theorem it follows that the kernel estimator (3.1.1) is consistent. Next, the asymptotic normality of (nh_n)^{1/2}(F_n(y|x) − E[F_n(y|x)]) and of (nh_n)^{1/2}(F_n(y|x) − F(y|x)) is shown.
Theorem 3.1.5.
Let the conditions of the last theorem be satisfied. Then it holds, for n → ∞, that
(nh_n)^{1/2} (F_n(y|x) − E[F_n(y|x)]) →_d N(0, [F(y|x) − F^2(y|x)] ∫ K^2(u) du).   (3.1.7)
Proof:
To prove this theorem, we verify Liapounov's condition. Writing K_i = K((x − X_i)/h_n), let
Q_{n,i}(x) = (K_i / Σ_j K_j) [I_{Y_i ≤ y} − F(y|X_i)] / (Var[F_n(y|x)])^{1/2}.
Therefore,
Σ_{i=1}^{n} Q_{n,i}(x) = { Σ_{i=1}^{n} (K_i / Σ_j K_j) I_{Y_i ≤ y} − Σ_{i=1}^{n} (K_i / Σ_j K_j) F(y|X_i) } / (Var[F_n(y|x)])^{1/2},
that is,
(F_n(y|x) − E[F_n(y|x)]) / (Var[F_n(y|x)])^{1/2} = Σ_{i=1}^{n} Q_{n,i}(x).
Asymptotic normality follows if Liapounov's condition
lim_{n→∞} Σ_{i=1}^{n} E|Q_{n,i}(x)|^3 = lim_{n→∞} Σ_{i=1}^{n} E| (K_i / Σ_j K_j) [I_{Y_i ≤ y} − F(y|X_i)] |^3 / (Var[F_n(y|x)])^{3/2} = 0
is satisfied. With the integral approximation, it holds for the numerator that
Σ_{i=1}^{n} E| (K_i / Σ_j K_j) [I_{Y_i ≤ y} − F(y|X_i)] |^3 = Σ_{i=1}^{n} |K_i / Σ_j K_j|^3 E|I_{Y_i ≤ y} − F(y|X_i)|^3
≤ Σ_{i=1}^{n} |K_i / Σ_j K_j|^3 = Σ_i K_i^3 / (Σ_i K_i)^3
≈ nh_n ∫ K^3(u) du / (nh_n ∫ K(u) du)^3 = O(1/(n^2 h_n^2)),   (by Lemma 3.1.3)
while for the variance of F_n(y|x) it follows from the last theorem that
Var[F_n(y|x)] = O(1/(nh_n)), and hence (Var[F_n(y|x)])^{3/2} = O(1/(n^{3/2} h_n^{3/2})).
It follows for Liapounov's condition that
lim_{n→∞} Σ_i E|Q_{n,i}(x)|^3 ≤ lim_{n→∞} O(1/(n^2 h_n^2)) / O(1/(n^{3/2} h_n^{3/2})) = lim_{n→∞} O(1/(n^{1/2} h_n^{1/2})) = 0,
since nh_n → ∞. From Liapounov's condition and the variance of F_n(y|x) from the last theorem, the asymptotic normality
(F_n(y|x) − E[F_n(y|x)]) / (Var[F_n(y|x)])^{1/2} →_d N(0, 1)
follows. Therefore,
(nh_n)^{1/2} (F_n(y|x) − E[F_n(y|x)]) →_d N(0, [F(y|x) − F^2(y|x)] ∫ K^2(u) du).
Corollary 3.1.6.
Let the conditions of the last theorem be satisfied and let nh_n^5 → 0 as n → ∞. Then it follows that
(nh_n)^{1/2} (F_n(y|x) − F(y|x)) →_d N(0, [F(y|x) − F^2(y|x)] ∫ K^2(u) du).   (3.1.8)
Proof:
The last theorem gives the asymptotic normality of (nh_n)^{1/2}(F_n(y|x) − E[F_n(y|x)]). We may therefore replace E[F_n(y|x)] by F(y|x) provided that (nh_n)^{1/2}(E[F_n(y|x)] − F(y|x)) converges to zero as n → ∞. From (3.1.5),
E[F_n(y|x)] − F(y|x) = (h_n^2/2) (Σ_i U_i^2 K(U_i) / Σ_i K(U_i)) F''(y|x) + o(h_n^2) = O(h_n^2).
That is,
(nh_n)^{1/2} (E[F_n(y|x)] − F(y|x)) = (nh_n)^{1/2} O(h_n^2) = O((nh_n^5)^{1/2}).
With nh_n^5 → 0 as n → ∞, the asymptotic normality of (nh_n)^{1/2}(F_n(y|x) − F(y|x)) follows.
The above theorems deal with the estimator of the conditional distribution function. Now the behaviour of the estimator of the conditional quantile is analyzed. Assume that F_n(q_{n,α}(x)|x) = F(q_α(x)|x) = α, that q_α(x) is unique, and that the Y_i are independent.
Now, let
H_{n,α}(θ(x)) = Σ_i ( K((x − X_i)/h_n) / Σ_j K((x − X_j)/h_n) ) [α − I_{Y_i ≤ θ(x)}] = Σ_i H_{i,α}(θ(x)).   (3.1.9)
Using the central limit theorem,
(H_{n,α}(θ(x)) − E[H_{n,α}(θ(x))]) / (Var[H_{n,α}(θ(x))])^{1/2} →_d N(0, 1), n → ∞.   (3.1.10)
With Hn,α(θ(x)) the mean squared error of qn,α(x) can be calculated.
Theorem 3.1.7.
Let the conditions of Theorem 3.1.4 be satisfied and let q_{n,α}(x) and q_α(x), given by F_n(q_{n,α}(x)|x) = F(q_α(x)|x) = α, be unique. Then it holds that
MSE[q_{n,α}(x)] ≈ [ (1/2) h_n^2 (F^{(2,0)}(q_α(x)|x) / f(q_α(x)|x)) ∫ u^2 K(u) du ]^2 + (1/(nh_n)) (α(1 − α) / f^2(q_α(x)|x)) ∫ K^2(u) du.   (3.1.11)
Proof:
By the Taylor expansions of the conditional distribution function used in the proof of Theorem 3.1.4, applied with θ(x) = q_{n,α}(x), it follows that
E[H_{n,α}(q_{n,α}(x))] ≈ f(q_α(x)|x) [q_{n,α}(x) − q_α(x)] + (1/2) h_n^2 F^{(2,0)}(q_α(x)|x) Σ_i U_i^2 K(U_i) / Σ_i K(U_i),
and, with the integral approximation,
E[H_{n,α}(q_{n,α}(x))] ≈ f(q_α(x)|x) [q_{n,α}(x) − q_α(x)] + (1/2) h_n^2 F^{(2,0)}(q_α(x)|x) ∫ u^2 K(u) du.
Now,
Var[H_{n,α}(q_{n,α}(x))] = (1/[Σ_i K(U_i)]^2) Σ_i K^2(U_i) [F(q_{n,α}(x) | x − h_n U_i) − F^2(q_{n,α}(x) | x − h_n U_i)]
≈ (1/[Σ_i K(U_i)]^2) α(1 − α) Σ_i K^2(U_i)
≈ (1/(nh_n)) α(1 − α) ∫ K^2(u) du.
Each nh_n H_{i,α}(q_{n,α}(x)) is a bounded random variable and Σ_i Var(nh_n H_{i,α}(q_{n,α}(x))) → ∞ as n → ∞. From this, the asymptotic normality (Theorem 1.1.7, Ch. 1) follows:
(H_{n,α}(q_{n,α}(x)) − E[H_{n,α}(q_{n,α}(x))]) / (Var[H_{n,α}(q_{n,α}(x))])^{1/2} →_d N(0, 1), n → ∞.
Since F_n(q_{n,α}(x)|x) = α, we have H_{n,α}(q_{n,α}(x)) = 0. This implies, for n → ∞,
{ f(q_α(x)|x) [q_{n,α}(x) − q_α(x)] + (1/2) h_n^2 F^{(2,0)}(q_α(x)|x) ∫ u^2 K(u) du } / { (1/(nh_n)) α(1 − α) ∫ K^2(u) du }^{1/2} →_d N(0, 1).   (3.1.12)
From this, the bias and the variance of q_{n,α}(x) can be calculated, which gives (3.1.11).
The bias depends, through F^{(2,0)}(q_α(x)|x), on the smoothness of the quantile function. Because of the division by the conditional density at q_α(x), the steepness of the conditional distribution function also affects the bias and the variance: the flatter the conditional distribution function is near q_α(x) (that is, the smaller f(q_α(x)|x) is), the greater the mean squared error.
From (3.1.12) and the method of proof of Corollary 3.1.6, asymptotic normality can now be established.
Corollary 3.1.8.
Let the conditions of Theorem 3.1.7 be satisfied and let nh_n^5 → 0 as n → ∞. Then (nh_n)^{1/2}(q_{n,α}(x) − q_α(x)) is asymptotically normal with mean
(1/2) (nh_n^5)^{1/2} (F^{(2,0)}(q_α(x)|x) / f(q_α(x)|x)) ∫ u^2 K(u) du
and variance
α(1 − α) ∫ K^2(u) du / f^2(q_α(x)|x);
in particular, since nh_n^5 → 0,
(nh_n)^{1/2} (q_{n,α}(x) − q_α(x)) →_d N( 0, α(1 − α) ∫ K^2(u) du / f^2(q_α(x)|x) ).   (3.1.13)
Proof:
From Equation (3.1.12) we have
{ f(q_α(x)|x) [q_{n,α}(x) − q_α(x)] + (1/2) h_n^2 F^{(2,0)}(q_α(x)|x) ∫ u^2 K(u) du } / { (1/(nh_n)) α(1 − α) ∫ K^2(u) du }^{1/2} →_d N(0, 1).
This implies that, asymptotically, the expectation of the left-hand side is 0 and its variance is 1. From the properties of the expectation and the variance, we obtain
(nh_n)^{1/2} E[q_{n,α}(x) − q_α(x)] → (1/2) (nh_n^5)^{1/2} (F^{(2,0)}(q_α(x)|x) / f(q_α(x)|x)) ∫ u^2 K(u) du   (3.1.14)
and
(nh_n) Var[q_{n,α}(x) − q_α(x)] → α(1 − α) ∫ K^2(u) du / f^2(q_α(x)|x).   (3.1.15)
From (3.1.14) and (3.1.15), (nh_n)^{1/2}(q_{n,α}(x) − q_α(x)) is asymptotically normal with the mean and variance stated in the corollary. Since nh_n^5 → 0, the mean tends to zero and we obtain (3.1.13).
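Result (3.1.13) suggests an approximate large-sample confidence interval for q_α(x). The sketch below (Python/NumPy) is one possible plug-in implementation; the Epanechnikov kernel, the kernel estimate of f(q_α(x)|x) and the normal critical value 1.96 are illustrative assumptions.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

RK = 0.6                                   # int K^2(u) du for the Epanechnikov kernel

def cond_density(y, x, X, Y, h):
    """Kernel estimate of f(y|x), used as a plug-in for the asymptotic variance."""
    wx = epanechnikov((x - X) / h)
    return np.sum(wx * epanechnikov((y - Y) / h) / h) / np.sum(wx)

def quantile_ci(q_hat, alpha, x, X, Y, h, z=1.96):
    """Approximate 95% interval for q_alpha(x) based on (3.1.13)."""
    f_hat = cond_density(q_hat, x, X, Y, h)
    se = np.sqrt(alpha * (1.0 - alpha) * RK / (len(X) * h * f_hat**2))
    return q_hat - z * se, q_hat + z * se
```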
3.2 Joint Asymptotic Distribution of the Conditional Quantiles
Let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be independent and identically distributed two-dimensional random variables with joint density function f(x, y) and joint distribution function F(x, y) = ∫_{-∞}^{x} ∫_{-∞}^{y} f(u, v) dv du. The marginal density function of X is g(x) = ∫_{-∞}^{∞} f(x, y) dy. The conditional density function and the conditional distribution function of Y given X = x are f(y|x) = f(x, y)/g(x) and
F(y|x) = ∫_{-∞}^{y} f(u|x) du = ∫_{-∞}^{y} f(x, u) du / g(x),
respectively. Now, for i = 1, 2, let q_{α_i}(x) denote the α_i-th quantile of the conditional distribution F(y|x), i.e., a root of the equation F(q(x)|x) = α_i, with 0 < α_1 < α_2 < 1.
Let f_n(x, y), g_n(x), f_n(y|x) and F_n(y|x) be the estimators of f(x, y), g(x), f(y|x) and F(y|x), respectively, defined as follows:
f_n(x, y) = (1/(nh_n^2)) Σ_{i=1}^{n} K((x − X_i)/h_n) K((y − Y_i)/h_n),
g_n(x) = ∫_{-∞}^{∞} f_n(x, y) dy = (1/(nh_n)) Σ_{i=1}^{n} K((x − X_i)/h_n),
f_n(y|x) = f_n(x, y) / g_n(x),
F_n(y|x) = ∫_{-∞}^{y} f_n(u|x) du = B_n(x, y) / g_n(x),
where K is a probability density function, h_n is a sequence of positive numbers converging to zero, and
B_n(x, y) = (1/(nh_n)) Σ_{i=1}^{n} G((y − Y_i)/h_n) K((x − X_i)/h_n), with G(y) = ∫_{-∞}^{y} K(u) du.
Now, we consider for i =1,2 two estimators qαi,n(x) of qαi(x) defined by the root of the
equation Fn(q(x)|x) = αi, i = 1, 2. We shall call qαi,n(x) the conditional sample quantiles.
We prove that under some regularity conditions these estimators are strongly consistent
and asymptotically normally distributed.
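The estimators of this section differ from (3.1.1) in that the indicator is replaced by the integrated kernel G. A minimal Python sketch is given below; the Gaussian kernel (whose integral G is the standard normal cdf, computed here with the error function), the grid search for the quantile and the bandwidth are illustrative assumptions.

```python
import numpy as np
from math import erf

def K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def G(u):
    """G(y) = integral of K up to y; the standard normal cdf for the Gaussian kernel."""
    return np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in np.atleast_1d(u)])

def F_n_smooth(y, x, X, Y, h):
    """Smoothed conditional cdf estimator F_n(y|x) = B_n(x, y) / g_n(x)."""
    Bn = np.sum(G((y - Y) / h) * K((x - X) / h)) / (len(X) * h)
    gn = np.sum(K((x - X) / h)) / (len(X) * h)
    return Bn / gn

def q_n_smooth(alpha, x, X, Y, h, grid):
    """Conditional sample quantile: approximate root of F_n(q|x) = alpha on a grid."""
    vals = np.array([F_n_smooth(y, x, X, Y, h) for y in grid])
    idx = min(np.searchsorted(vals, alpha), len(grid) - 1)
    return grid[idx]
```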
Now, we shall assume the following conditions:
(A1) The conditional distribution function satisfies:
(i) F^{(i,j)}(x, y) = ∂^{i+j} F(x, y)/∂x^i ∂y^j exist and are bounded for (i, j) = (1, 2), (2, 0), (2, 1), (3, 0).
(ii) The conditional population quantiles q_{α_i}(x), defined by
F(q_{α_i}(x)|x) = F^{(1,0)}(x, q_{α_i}(x)) / g(x) = α_i, i = 1, 2,
are unique.
(iii) f(x, y) is uniformly continuous.
(A2) The marginal density function of X satisfies:
(i) g^{(i)}(x) = ∫_{-∞}^{∞} ∂^i f(x, y)/∂x^i dy exists for i = 1, 2.
(ii) Both h(x) = ∫_{-∞}^{∞} |∂f(x, y)/∂x| dy and g^{(i)}(x) are bounded for i = 1, 2.
(iii) g(x) is uniformly continuous.
(A3) The kernel K is a Borel function and satisfies the following:
(i) K(u) is a function of bounded variation.
(ii) ∫_{-∞}^{∞} u K(u) du = 0.
(iii) ∫_{-∞}^{∞} u^2 K(u) du < ∞.
(A4) h_n is a sequence of positive numbers satisfying:
(i) h_n = n^{-δ}, 1/5 < δ < 1/4; i.e. lim_{n→∞} nh_n^4 = ∞ and lim_{n→∞} nh_n^5 = 0.
Lemma 3.2.1.
Under the conditions (A2)(i, ii), (A3)(i) and (A4)(i), we have
lim_{n→∞} sup_{x∈R} |g_n(x) − g(x)| = 0
with probability one.
Proof: see [17].
Lemma 3.2.2.
Under the conditions (A1)(i) and (A3)(iii), we have
sup_{x∈R} |E B_n(x, y) − F^{(1,0)}(x, y)| = O(h_n).
Proof:
By the definition of B_n(x, y), we have
E B_n(x, y) = E[ (1/(nh_n)) Σ_{i=1}^{n} G((y − Y_i)/h_n) K((x − X_i)/h_n) ]
= ∫_{-∞}^{∞} ∫_{-∞}^{∞} (1/h_n) G((y − v)/h_n) K((x − u)/h_n) f(u, v) du dv
= h_n ∫_{-∞}^{∞} ∫_{-∞}^{∞} G(u) K(v) f(x − vh_n, y − uh_n) du dv
= ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u) K(v) F^{(1,0)}(x − vh_n, y − uh_n) du dv   (integrating by parts in u)
= ∫_{-∞}^{∞} ∫_{-∞}^{∞} K(u) K(v) [F^{(1,0)}(x, y) − vh_n F^{(2,0)}(x, y) − uh_n F^{(1,1)}(x, y) + O(h_n^2)] du dv
= F^{(1,0)}(x, y) + O(h_n).
Then we have the result.
Lemma 3.2.3.
Under the conditions (A1)(i), (A3)(i, iii) and (A4)(i), we have
lim_{n→∞} sup_{x∈R} |B_n(x, y) − F^{(1,0)}(x, y)| = 0
with probability one.
Proof:
By the above lemma, it suffices to show that
lim_{n→∞} sup_{x∈R} |B_n(x, y) − E B_n(x, y)| = 0
with probability one. Let S_n(u, v) be the two-dimensional empirical distribution function defined by
S_n(u, v) = (1/n) Σ_{i=1}^{n} I(u − X_i) I(v − Y_i),
where I(x − y) = 1 if x − y ≥ 0 and I(x − y) = 0 if x − y < 0; that is, I(x − y) = 1 if x ≥ y and 0 if x < y.
Now,
sup_{x∈R} |B_n(x, y) − E B_n(x, y)|
= sup_{x∈R} | ∫∫ (1/h_n) G((y − v)/h_n) K((x − u)/h_n) dS_n(u, v) − ∫∫ (1/h_n) G((y − v)/h_n) K((x − u)/h_n) dF(u, v) |
= sup_{x∈R} | ∫∫ (1/h_n) G((y − v)/h_n) K((x − u)/h_n) d[S_n(u, v) − F(u, v)] |
= h_n^{-1} sup_{x∈R} | ∫∫ [S_n(u, v) − F(u, v)] dG((y − v)/h_n) dK((x − u)/h_n) |   (integrating by parts)
≤ h_n^{-1} μ sup_{(u,v)∈R^2} |S_n(u, v) − F(u, v)|,
where μ = ∫_{-∞}^{∞} |K^{(1)}(t)| dt.
Hence, for any ε > 0, we have
Σ_{n=1}^{∞} P[ sup_{x∈R} |B_n(x, y) − E B_n(x, y)| ≥ ε ] ≤ Σ_{n=1}^{∞} P[ h_n^{-1} μ sup_{(u,v)∈R^2} |S_n(u, v) − F(u, v)| ≥ ε ]
= Σ_{n=1}^{∞} P[ sup_{(u,v)∈R^2} |S_n(u, v) − F(u, v)| ≥ h_n ε / μ ]
< C_1 Σ_{n=1}^{∞} exp(−C_2 ε^2 n h_n^2 / μ^2) < ∞,   (by Lemma 1.1.9)
where C_1 and C_2 are positive constants. Since the series
Σ_{n=1}^{∞} P[ sup_{x∈R} |B_n(x, y) − E B_n(x, y)| ≥ ε ]
converges, the Borel-Cantelli lemma gives the result.
Lemma 3.2.4.
Under the conditions (A1)(i), (A2)(ii), (A3)(i, iii) and (A4)(i), if g(x) > 0, then
lim_{n→∞} sup_{y∈R} |F_n(y|x) − F(y|x)| = 0
with probability one.
Proof:
Since F(y|x) = ∫_{-∞}^{y} f(x, u) du / g(x) = F^{(1,0)}(x, y)/g(x), by Lemma 3.2.1 and Lemma 3.2.3 we have
lim_{n→∞} sup_{y∈R} |F_n(y|x) − F(y|x)| = lim_{n→∞} sup_{y∈R} | B_n(x, y)/g_n(x) − F^{(1,0)}(x, y)/g(x) | = 0
with probability one.
Lemma 3.2.5.
Under the conditions of Lemma 3.2.4, if g(x) > 0, we have
lim_{n→∞} |F(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x)| = 0
with probability one, (i = 1, 2).
Proof:
Since F_n(q_{α_i,n}(x)|x) = α_i = F(q_{α_i}(x)|x),
|F(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x)| = |F(q_{α_i,n}(x)|x) − F_n(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x) + F_n(q_{α_i,n}(x)|x)|
≤ |F(q_{α_i,n}(x)|x) − F_n(q_{α_i,n}(x)|x)| + |F(q_{α_i}(x)|x) − F_n(q_{α_i,n}(x)|x)|
≤ 2 sup_{y∈R} |F_n(y|x) − F(y|x)|
(the second term is in fact zero, since both of its quantities equal α_i). Applying the last lemma, we get
lim_{n→∞} |F(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x)| ≤ 2 lim_{n→∞} sup_{y∈R} |F_n(y|x) − F(y|x)| = 0,
and thus
lim_{n→∞} |F(q_{α_i,n}(x)|x) − F(q_{α_i}(x)|x)| = 0.
Now, the following theorem deals with the strong consistency of the estimators qαi,n(x), i =
1, 2.
Theorem 3.2.6.
Under the conditions (A1)(i), (A3)(i, iii) and (A4)(i), if g(x) > 0, then
lim_{n→∞} q_{α_i,n}(x) = q_{α_i}(x), i = 1, 2,
with probability one.
Proof:
We prove the theorem only for i = 1; that is, we want to show that
lim_{n→∞} q_{α_1,n}(x) = q_{α_1}(x)
with probability one. Since q_{α_1}(x) is unique, for any ε > 0 there exists η = η(ε) > 0, defined by
η(ε) = min{ F(q_{α_1}(x) + ε|x) − F(q_{α_1}(x)|x), F(q_{α_1}(x)|x) − F(q_{α_1}(x) − ε|x) },
such that |q_{α_1,n}(x) − q_{α_1}(x)| > ε implies |F(q_{α_1,n}(x)|x) − F(q_{α_1}(x)|x)| > η(ε); by Lemma 3.2.5, this can happen only finitely often with probability one.
Expanding F_n(q_{α_i,n}(x)|x) around q_{α_i}(x), we obtain
F(q_{α_i}(x)|x) = α_i = F_n(q_{α_i,n}(x)|x) = F_n(q_{α_i}(x)|x) + (q_{α_i,n}(x) − q_{α_i}(x)) f_n(q_i|x),
where q_i is some random point between q_{α_i,n}(x) and q_{α_i}(x), i = 1, 2. Hence
q_{α_i,n}(x) − q_{α_i}(x) = [F(q_{α_i}(x)|x) − F_n(q_{α_i}(x)|x)] / f_n(q_i|x),
and so
(nh_n)^{1/2} (q_{α_i,n}(x) − q_{α_i}(x)) = −(nh_n)^{1/2} [F_n(q_{α_i}(x)|x) − F(q_{α_i}(x)|x)] / f_n(q_i|x), i = 1, 2.   (3.2.1)
Since lim_{n→∞} F_n(q_{α_i}(x)|x) = F(q_{α_i}(x)|x) with probability one, we have the result.
Lemma 3.2.7.
Under the conditions (A1)(i), (A3)(i, iii) and (A4)(i), if g(x) > 0, then
f_n(q_i|x) = f(q_{α_i}(x)|x) + o_p(1), i = 1, 2.
Proof:
Since
|f_n(q_i|x) − f(q_{α_i}(x)|x)| ≤ |f_n(q_i|x) − f(q_i|x)| + |f(q_i|x) − f(q_{α_i}(x)|x)|
≤ sup_{y∈R} |f_n(y|x) − f(y|x)| + |f(q_i|x) − f(q_{α_i}(x)|x)| = o_p(1),
where the first term tends to zero by the uniform consistency of f_n(·|x) and the second tends to zero by the continuity of f(·|x) and the consistency of q_i.
To obtain the joint asymptotic distribution of q_{α_1,n}(x) and q_{α_2,n}(x), we define, for i = 1, 2 and j = 1, 2, ..., n, the following:
U*_{nj}(x) = (1/h_n) K((x − X_j)/h_n),
V*_{nij}(x) = (1/h_n) G((q_{α_i}(x) − Y_j)/h_n) K((x − X_j)/h_n),
U_{nj}(x) = (h_n)^{1/2} [U*_{nj}(x) − E U*_{nj}(x)],
V_{nij}(x) = (h_n)^{1/2} [V*_{nij}(x) − E V*_{nij}(x)],
U_n(x) = Σ_{j=1}^{n} U_{nj}(x),   V_{ni}(x) = Σ_{j=1}^{n} V_{nij}(x),
W_{nj} = (U_{nj}(x), V_{n1j}(x), V_{n2j}(x))^T,
n^{1/2} Z_n = (U_n(x), V_{n1}(x), V_{n2}(x))^T,
w_i(x) = F^{(1,0)}(x, q_{α_i}(x)),
n^{1/2} Z*_n = (h_n)^{1/2} ( Σ_{j=1}^{n} [U*_{nj}(x) − g(x)], Σ_{j=1}^{n} [V*_{n1j}(x) − w_1(x)], Σ_{j=1}^{n} [V*_{n2j}(x) − w_2(x)] )^T,
A = ∫_{-∞}^{∞} K^2(u) du ·
[ g(x)    w_1(x)   w_2(x)
  w_1(x)  w_1(x)   w_1(x)
  w_2(x)  w_1(x)   w_2(x) ].
Lemma 3.2.8.
Under the conditions (A1)(i), (A2)(ii) and (A3)(i, iii), the following results hold:
1. lim_{n→∞} E U_{nj}^2(x) = g(x) ∫_{-∞}^{∞} K^2(u) du,
2. lim_{n→∞} E V_{nij}^2(x) = w_i(x) ∫_{-∞}^{∞} K^2(u) du, i = 1, 2,
3. lim_{n→∞} E U_{nj}(x) V_{nij}(x) = w_i(x) ∫_{-∞}^{∞} K^2(u) du, i = 1, 2,
4. lim_{n→∞} E V_{n1j}(x) V_{n2j}(x) = w_1(x) ∫_{-∞}^{∞} K^2(u) du.
Proof:
(1) Since
E U_{nj}^2(x) = h_n [ (1/h_n^2) ∫_{-∞}^{∞} K^2((x − u)/h_n) g(u) du − ( (1/h_n) ∫_{-∞}^{∞} K((x − u)/h_n) g(u) du )^2 ],
we obtain
lim_{n→∞} E U_{nj}^2(x) = lim_{n→∞} (1/h_n) ∫_{-∞}^{∞} K^2((x − u)/h_n) g(u) du − lim_{n→∞} h_n ( (1/h_n) ∫_{-∞}^{∞} K((x − u)/h_n) g(u) du )^2
= g(x) ∫_{-∞}^{∞} K^2(u) du − 0 = g(x) ∫_{-∞}^{∞} K^2(u) du.
(2) Similarly (the squared-mean term vanishes as in (1)),
lim_{n→∞} E V_{nij}^2(x) = lim_{n→∞} (1/h_n) ∫_{-∞}^{∞} ∫_{-∞}^{∞} G^2((q_{α_i}(x) − v)/h_n) K^2((x − u)/h_n) f(u, v) du dv
= lim_{n→∞} ∫_{-∞}^{∞} K^2(u) [ ∫_{-∞}^{∞} G^2((q_{α_i}(x) − v)/h_n) f(v | x − uh_n) dv ] g(x − uh_n) du
= g(x) ∫_{-∞}^{∞} K^2(u) du · ∫_{-∞}^{q_{α_i}(x)} f(v|x) dv
= g(x) F(q_{α_i}(x)|x) ∫_{-∞}^{∞} K^2(u) du = w_i(x) ∫_{-∞}^{∞} K^2(u) du,
since G^2((q_{α_i}(x) − v)/h_n) tends to 1 for v < q_{α_i}(x) and to 0 for v > q_{α_i}(x), and w_i(x) = F^{(1,0)}(x, q_{α_i}(x)) = g(x) F(q_{α_i}(x)|x).
The proofs of (3) and (4) are similar and are omitted.
Lemma 3.2.9.
Under the conditions (A1)(i), (A2)(ii), (A3)(i, iii) and (A4)(ii), Z_n converges in distribution to a trivariate normal random variable with mean vector 0 and covariance matrix A.
Proof:
To prove this lemma, it is sufficient to show that C^T Z_n converges in distribution to C^T Z for any real vector C = (C_1, C_2, C_3)^T ≠ 0, where Z denotes the trivariate normal random variable with mean vector 0 and covariance matrix A.
Now, we define for j = 1, 2, ..., n the following:
σ_{nj}^2 = Var[C^T W_{nj}],   ρ_{nj}^3 = E|C^T W_{nj}|^3,
and let σ_n^2 = Σ_{j=1}^{n} σ_{nj}^2, ρ_n^3 = Σ_{j=1}^{n} ρ_{nj}^3.
Next, for any C ≠ 0, we have
lim_{n→∞} σ_{nj}^2 = lim_{n→∞} Var[C_1 U_{nj}(x) + C_2 V_{n1j}(x) + C_3 V_{n2j}(x)] = C^T A C > 0, j = 1, 2, ..., n.
Using computations similar to those in Lemma 3.2.8, we have
E|U_{n1}(x)|^3 = O(h_n^{-1/2}) and E|V_{ni1}(x)|^3 = O(h_n^{-1/2}), i = 1, 2.
Therefore,
ρ_n^3 = n E|C^T W_{n1}|^3 = n E| C_1 U_{n1}(x) + C_2 V_{n11}(x) + C_3 V_{n21}(x) |^3
≤ n E{ (C_1^2 + C_2^2 + C_3^2) (U_{n1}^2(x) + V_{n11}^2(x) + V_{n21}^2(x)) }^{3/2}   (by the Cauchy-Schwarz inequality)
≤ 3^{3/2} n (C_1^2 + C_2^2 + C_3^2)^{3/2} max{ E|U_{n1}(x)|^3, E|V_{n11}(x)|^3, E|V_{n21}(x)|^3 }
= O(n h_n^{-1/2}).
Hence it follows that lim_{n→∞} ρ_n/σ_n = 0. By Liapounov's version of the central limit theorem we conclude that C^T Z_n = n^{-1/2} Σ_{j=1}^{n} C^T W_{nj} converges in distribution to a univariate normal random variable with mean 0 and variance C^T A C.
An appeal to the Cramer-Wold theorem completes the proof of this lemma.
Lemma 3.2.10.
Under the conditions (A1)(i), (A2)(ii) and (A3)(i, iii), Z*_n converges in distribution to a trivariate normal random variable with mean vector 0 and covariance matrix A.
Proof:
By the definitions of Z_n and Z*_n,
n^{1/2} (Z*_n − Z_n) = (h_n)^{1/2} ( Σ_{j=1}^{n} [U*_{nj}(x) − g(x)] − Σ_{j=1}^{n} [U*_{nj}(x) − E U*_{nj}(x)], Σ_{j=1}^{n} [V*_{n1j}(x) − w_1(x)] − Σ_{j=1}^{n} [V*_{n1j}(x) − E V*_{n1j}(x)], Σ_{j=1}^{n} [V*_{n2j}(x) − w_2(x)] − Σ_{j=1}^{n} [V*_{n2j}(x) − E V*_{n2j}(x)] )^T
= n (h_n)^{1/2} ( E U*_{nj}(x) − g(x), E V*_{n1j}(x) − w_1(x), E V*_{n2j}(x) − w_2(x) )^T.
That is,
Z*_n − Z_n = (nh_n)^{1/2} C_n, where C_n = ( E U*_{nj}(x) − g(x), E V*_{n1j}(x) − w_1(x), E V*_{n2j}(x) − w_2(x) )^T.
Since
E U*_{n1}(x) − g(x) = E[(1/h_n) K((x − X_1)/h_n)] − g(x) = (1/h_n) ∫ K((x − u)/h_n) g(u) du − g(x)
= ∫ K(u) g(x − uh_n) du − g(x)
= ∫ K(u) [g(x) − uh_n g'(x) + (u^2 h_n^2/2) g''(x) + o(h_n^2)] du − g(x)
= (h_n^2/2) g''(x) ∫ u^2 K(u) du + o(h_n^2) = O(h_n^2),
using ∫ K(u) du = 1 and ∫ u K(u) du = 0, and since the other components of C_n can be treated similarly, we obtain C_n = O(h_n^2). Now
Z*_n = Z_n + (Z*_n − Z_n) = Z_n + (nh_n)^{1/2} C_n = Z_n + O((nh_n^5)^{1/2}) = Z_n + o(1),
because nh_n^5 → 0. Then the proof of this lemma is complete.
Now, we will consider the main theorem of this section.
Theorem 3.2.11.
Under the conditions (A1)(i) and (A4)(ii), if g(x) > 0 and f(x, q_{α_i}(x)) > 0, i = 1, 2, then (q_{α_1,n}(x), q_{α_2,n}(x))^T is asymptotically normally distributed with mean vector (q_{α_1}(x), q_{α_2}(x))^T and covariance matrix
B_n = ( ∫_{-∞}^{∞} K^2(u) du / (nh_n g(x)) ) [ b_11  b_12
                                               b_12  b_22 ],
where
b_ij = α_i (1 − α_j) / [ f(q_{α_i}(x)|x) f(q_{α_j}(x)|x) ], 1 ≤ i ≤ j ≤ 2.
Proof:
Let H be the function from R^3 to R^2 defined by
H(y) = (y_2/y_1, y_3/y_1)^T, y = (y_1, y_2, y_3)^T,
and let θ = (g(x), w_1(x), w_2(x))^T. We can write Z*_n = (nh_n)^{1/2} (T_n − θ), where T_n = (T_{n1}, T_{n2}, T_{n3})^T with
T_{n1} = (1/n) Σ_{j=1}^{n} U*_{nj}(x) and T_{ni} = (1/n) Σ_{j=1}^{n} V*_{n(i−1)j}(x), i = 2, 3.
From [17], with n^{1/2} replaced by (nh_n)^{1/2}, we conclude that
(nh_n)^{1/2} [H(T_n) − H(θ)] = (nh_n)^{1/2} ( F_n(q_{α_1}(x)|x) − F(q_{α_1}(x)|x), F_n(q_{α_2}(x)|x) − F(q_{α_2}(x)|x) )^T
converges in distribution to a bivariate normal random variable with mean vector 0 and covariance matrix DAD^T, where D is the matrix of partial derivatives of H evaluated at θ, given by
D = [ −w_1(x)/g^2(x)   1/g(x)   0
      −w_2(x)/g^2(x)   0        1/g(x) ].
Then
DAD^T = ( ∫_{-∞}^{∞} K^2(u) du / g(x) ) × [ α_1(1 − α_1)   α_1(1 − α_2)
                                            α_1(1 − α_2)   α_2(1 − α_2) ].
By (3.2.1), we have the result.
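In applications, the covariance matrix B_n of Theorem 3.2.11 has to be estimated. The Python sketch below assembles one possible plug-in version of it; the Gaussian kernel, the kernel estimates of g(x) and of the conditional density at the two estimated quantiles, and the bandwidth are illustrative assumptions.

```python
import numpy as np

def K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

RK = 1.0 / (2.0 * np.sqrt(np.pi))          # int K^2(u) du for the Gaussian kernel

def g_n(x, X, h):
    """Kernel estimate of the marginal density g(x)."""
    return np.mean(K((x - X) / h)) / h

def f_n_cond(y, x, X, Y, h):
    """Kernel estimate of the conditional density f(y|x)."""
    wx = K((x - X) / h)
    return np.sum(wx * K((y - Y) / h) / h) / np.sum(wx)

def quantile_cov(x, alphas, q_hats, X, Y, h):
    """Plug-in version of the matrix B_n of Theorem 3.2.11 for
    (q_{alpha_1,n}(x), q_{alpha_2,n}(x)); alphas must be ordered, a1 < a2."""
    a1, a2 = alphas
    f1 = f_n_cond(q_hats[0], x, X, Y, h)
    f2 = f_n_cond(q_hats[1], x, X, Y, h)
    core = np.array([[a1 * (1 - a1) / f1**2,      a1 * (1 - a2) / (f1 * f2)],
                     [a1 * (1 - a2) / (f1 * f2),  a2 * (1 - a2) / f2**2]])
    return RK / (len(X) * h * g_n(x, X, h)) * core
```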
3.3 Mode and Median as a Comparison
In Chapters 2 and 3 we have studied two important aspects of the conditional density function: the conditional mode and the conditional quantiles. As measures of the center of the conditional distribution, we now compare the mode and the median.
First, for the mode:
(i) The bias term vanishes, because of condition (A2)(iv) of Section 2.4.
(ii) The variance equals
( f(x, θ(x)) / [f^{(0,2)}(x, θ(x))]^2 ) ∫∫ [K(u) K^{(1)}(v)]^2 du dv / (nh_n^4).
That is, the MSE depends only on the value of the variance.
Now, for the median, putting α = 0.5 in Theorem 3.1.7 we get:
(i) The bias term is (1/2) h_n^2 ( F^{(2,0)}(q_{0.5}(x)|x) / f(q_{0.5}(x)|x) ) ∫ u^2 K(u) du.
(ii) The variance term is (1/(nh_n)) (1/4) / f^2(q_{0.5}(x)|x) ∫ K^2(u) du.
Recall that, for the mean squared error, MSE(x) = Bias^2[f(x)] + Var[f(x)].
For the median, the bias depends through F^{(2,0)}(q_{0.5}(x)|x) on the smoothness of the quantile function. Because of the division by the conditional density at q_{0.5}(x), the steepness of the conditional distribution function also affects the bias and the variance: the flatter the conditional distribution function is near q_{0.5}(x) (the smaller f(q_{0.5}(x)|x) is), the greater the mean squared error.
Also, we note that among the conditional quantiles the median has the largest variance factor, since F(y|x) − F^2(y|x) ≤ 1/4, with equality when F(y|x) = 1/2.
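The comparison can also be explored by simulation. In the Python sketch below, the kernel conditional mode and the kernel conditional median are computed on the same data; the right-skewed conditional model, the Gaussian kernel, the bandwidth and the grid are illustrative assumptions, chosen so that the true conditional mode and median differ.

```python
import numpy as np

def K(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def cond_mode(x, X, Y, h, grid):
    """Kernel conditional mode: argmax of the conditional density estimate."""
    wx = K((x - X) / h)
    dens = np.array([np.sum(wx * K((y - Y) / h)) for y in grid])
    return grid[np.argmax(dens)]

def cond_median(x, X, Y, h):
    """Kernel conditional median: inversion of the weighted empirical cdf (3.1.1)."""
    wx = K((x - X) / h)
    order = np.argsort(Y)
    cdf = np.cumsum(wx[order]) / np.sum(wx)
    return Y[order][np.searchsorted(cdf, 0.5)]

# right-skewed conditional law: Y | X = x ~ x + Exp(1),
# so the conditional mode is x and the conditional median is x + log 2
rng = np.random.default_rng(4)
n = 5000
X = rng.uniform(0, 1, n)
Y = X + rng.exponential(1.0, n)
x0, h = 0.5, 0.08
grid = np.linspace(0, 4, 801)
print(cond_mode(x0, X, Y, h, grid), x0)             # compare with the true mode 0.5
print(cond_median(x0, X, Y, h), x0 + np.log(2.0))   # compare with the true median
```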
Bibliography
[1] Abberger, K. (1997). Quantile Smoothing in Financial Time Series. Statistical Papers 38, 125-148.
[2] Abraham, C., Biau, G. and Cadre, B. (2004). On the Asymptotic Properties of a Simple Estimate of the Mode. ESAIM: Probability and Statistics, Vol. 8, 1-11.
[3] Abraham, C., Biau, G. and Cadre, B. (2003). Simple Estimation of the Mode of a Multivariate Density. The Canadian Journal of Statistics, Vol. 31, 23-34.
[4] Bartle, R.G. and Sherbert, D.R. (1991). Introduction to Real Analysis. Eastern Michigan University.
[5] Devroye, L. (1979). Recursive Estimation of the Mode of a Multivariate Density. The Canadian Journal of Statistics, Vol. 7, 159-167.
[6] Freund, J. (1992). Mathematical Statistics. Arizona State University.
[7] Casella, G. and Berger, R.L. (1990). Statistical Inference. Cornell University, North Carolina State University.
[8] Hogg, R., McKean, J. and Craig, A. (2005). Introduction to Mathematical Statistics. University of Iowa, Western Michigan University, University of Iowa.
[9] Yu, K. (2003). Quantile Regression: Applications and Current Research Areas. University of Plymouth, UK.
[10] Loeve, M. (1960). Probability Theory, 2nd Ed. Van Nostrand, Princeton.
[11] Nada, G. (2002). On the Kernel Density Estimation. Islamic University of Gaza, Palestine.
[12] Parzen, E. (1962). On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, Vol. 33, 1065-1076.
[13] Sen, P.K. (1993). Large Sample Methods in Statistics: An Introduction with Applications. New York.
[14] Rao, C.R. (1965). Linear Statistical Inference and Its Applications. Wiley, New York.
[15] Rosenblatt, M. (1956). Remarks on Some Nonparametric Estimates of the Density Function. The Annals of Mathematical Statistics 27, 832-837.
[16] Royden, H.L. (1997). Real Analysis. Stanford University.
[17] Samanta, M. (1988). Non-Parametric Estimation of Conditional Quantiles. Department of Statistics, University of Manitoba, Canada.
[18] Samanta, M. and Thavaneswaran, A. (1990). Nonparametric Estimation of the Conditional Mode. Communications in Statistics - Theory and Methods, Vol. 19, 4515-4524.
[19] Samanta, M. (1973). Nonparametric Estimation of the Mode of a Multivariate Density. South African Statistical Journal, Vol. 7, 109-117.
[20] Salha, R. (2006). Kernel Estimation of the Conditional Quantiles and Mode for Time Series. University of Macedonia.
[21] Schuster, E. (1972). Joint Asymptotic Distribution of the Estimated Regression Function at a Finite Number of Distinct Points. The Annals of Mathematical Statistics, Vol. 43, 84-88.
[22] Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. School of Mathematics, University of Bath, UK.
[23] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. University of New South Wales, Australia.
[24] Watson, G.S. (1964). Smooth Regression Analysis. Sankhya, Series A, Vol. 26, 359-372.
[25] Whittle, P. (1958). On the Smoothing of Probability Density Functions. Journal of the Royal Statistical Society, Series B, 20, 334-343.