Estimation of Functions

An interesting problem in statistics, and one that is generally difficult, is the estimation of a continuous function such as a probability density function.

The statistical properties of an estimator of a function are more complicated than the statistical properties of an estimator of a single parameter or even of a countable set of parameters.

We consider the general case of a real scalar-valued function over real vector-valued arguments (that is, a mapping from $\mathbb{R}^d$ into $\mathbb{R}$).

One of the most common situations in which these properties are relevant is nonparametric probability density estimation.


Notation

We may denote a function by a single letter, $f$, for example, or by the function notation, $f(\cdot)$ or $f(x)$.

When $f(x)$ denotes a function, $x$ is merely a placeholder. The notation $f(x)$, however, may also refer to the value of the function at the point $x$. The meaning is usually clear from the context.

Using the common "hat" notation for an estimator, we use $\hat{f}$ or $\hat{f}(x)$ to denote the estimator of $f$ or of $f(x)$.


More on Notation

The hat notation is also used to denote an estimate, so we must determine from the context whether $\hat{f}$ or $\hat{f}(x)$ denotes a random variable or a realization of a random variable.

The estimate or the estimator of the value of the function at the point $x$ may also be denoted by $\hat{f}(x)$.

Sometimes, to emphasize that we are estimating the ordinate of the function rather than evaluating an estimate of the function, we use the notation $\widehat{f(x)}$.


Optimality

The usual optimality properties that we use in developing a theory of estimation of a finite-dimensional parameter must be extended for estimation of a general function.

As we will see, two of the usual desirable properties of point estimators, namely unbiasedness and maximum likelihood, cannot in general be attained by estimators of functions.


Estimation or Approximation?

There are many similarities between estimation of functions and approximation of functions, but we must be aware of the fundamental differences between the two problems.

Estimation of functions is similar to other estimation problems: we are given a sample of observations; we make certain assumptions about the probability distribution of the sample; and then we develop estimators. The estimators are random variables, and how useful they are depends on properties of their distribution, such as their expected values and their variances.

Approximation of functions is an important aspect of numerical analysis. Functions are often approximated to interpolate functional values between directly computed or known values.


General Methods for Estimating Functions

In the problem of function estimation, we may have observations on the function at specific points in the domain, or we may have indirect measurements of the function, such as observations that relate to a derivative or an integral of the function.

In either case, the problem of function estimation has the competing goals of providing a good fit to the observed data and predicting values at other points. In many cases, a smooth estimate satisfies this latter objective. In other cases, however, the unknown function itself is not smooth.

Functions with different forms may govern the phenomena in different regimes. This presents a very difficult problem in function estimation, but we won't go into it.


General Methods for Estimating Functions

There are various approaches to estimating functions.

Maximum likelihood has limited usefulness for estimating functions because in general the likelihood is unbounded.

A practical approach is to assume that the function is of a particular form and to estimate the parameters that characterize that form. For example, we may assume that the function is exponential, possibly because of physical properties such as exponential decay. We may then use various estimation criteria, such as least squares, to estimate the parameter.


Mixtures of Functions with Prescribed Forms

An extension of this approach is to assume that the function is a mixture of other functions. The mixture can be formed by different functions over different domains or by weighted averages of the functions over the whole domain.

Estimation of the function of interest involves estimation of various parameters as well as the weights, as in the sketch below.
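As a concrete sketch (not from the slides): when the component forms are fixed and only the weights are unknown, the weights enter the model linearly and can be estimated by least squares. The two exponential forms, the sample, and all names below are assumptions for illustration.

```python
import numpy as np

# Hypothetical example: the unknown function is modeled as a weighted
# average of two prescribed forms over the whole domain,
#   f(x) ~ w1 * exp(-x) + w2 * exp(-x/2).
# With the forms fixed, least squares estimates the weights from
# noisy observations of the function.
rng = np.random.default_rng(0)
x = np.linspace(0.1, 5.0, 50)
f_true = 0.3 * np.exp(-x) + 0.7 * np.exp(-x / 2)
y = f_true + rng.normal(scale=0.01, size=x.size)

A = np.column_stack([np.exp(-x), np.exp(-x / 2)])  # the prescribed forms
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
print(weights)  # close to [0.3, 0.7]
```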


Use of Basis Functions

Another approach to function estimation is to represent the function of interest as a linear combination of basis functions, that is, to represent the function in a series expansion.

The basis functions are generally chosen to be orthogonal over the domain of interest, and the observed data are used to estimate the coefficients in the series.
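A minimal sketch of this idea for density estimation: with an orthonormal basis $\{\phi_k\}$ on $[0,1]$, the coefficient $c_k = \int \phi_k p = \mathrm{E}(\phi_k(X))$ can be estimated by a sample mean. The cosine basis, the truncation point $m$, and the Beta(2, 5) sample are assumptions for the example.

```python
import numpy as np

# Orthogonal-series density estimator on [0, 1] with the cosine basis
# phi_0(x) = 1, phi_k(x) = sqrt(2) cos(k pi x); c_k = E(phi_k(X)) is
# estimated by the sample mean of phi_k over the data.
rng = np.random.default_rng(1)
sample = rng.beta(2.0, 5.0, size=1000)
m = 8  # number of basis terms kept

def phi(k, x):
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if k == 0 else np.sqrt(2.0) * np.cos(k * np.pi * x)

coeffs = [phi(k, sample).mean() for k in range(m + 1)]

def p_hat(x):
    return sum(c * phi(k, x) for k, c in enumerate(coeffs))

print(p_hat([0.1, 0.3, 0.7]))  # estimated density at three points
```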


Estimation of a Function at a Point

It is often more practical to estimate the function value at a given point. (Of course, if we can estimate the function at any given point, we can effectively have an estimate at all points.)

One way of forming an estimate of a function at a given point is to take the average at that point of a filtering function that is evaluated in the vicinity of each data point. The filtering function is called a kernel, and the result of this approach is called a kernel estimator.

We must be concerned about the properties of the estimators at specific points and also about properties over the full domain. Global properties over the full domain are often defined in terms of integrals or in terms of suprema or infima.


Kernel Methods

One approach to function estimation and approximation is to use a filter or kernel function to provide local weighting of the observed data. This approach ensures that, at a given point, the observations close to that point influence the estimate more strongly than more distant observations.

A standard method in this approach is to convolve the observations with a unimodal function that decreases rapidly away from a central point.

A kernel has two arguments representing the two points in the convolution, but we typically use a single argument that represents the distance between the two points.


Some Kernels

Some examples of univariate kernel functions are

$$\text{uniform:}\quad K_u(t) = 0.5, \quad \text{for } |t| \le 1,$$

$$\text{quadratic:}\quad K_q(t) = 0.75(1 - t^2), \quad \text{for } |t| \le 1,$$

$$\text{normal:}\quad K_n(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}, \quad \text{for all } t.$$

The kernels with finite support are defined to be 0 outside that range. Often, multivariate kernels are formed as products of these or other univariate kernels.
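These three kernels are straightforward to implement directly (the quadratic kernel is also known as the Epanechnikov kernel); the integration check at the end is only an illustrative Riemann sum.

```python
import numpy as np

# The three univariate kernels above; those with finite support are
# defined to be 0 outside |t| <= 1.
def k_uniform(t):
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.5, 0.0)

def k_quadratic(t):  # also called the Epanechnikov kernel
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def k_normal(t):
    t = np.asarray(t, dtype=float)
    return np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)

# Each kernel integrates to 1; check with a crude Riemann sum.
t = np.linspace(-6.0, 6.0, 120001)
dt = t[1] - t[0]
for k in (k_uniform, k_quadratic, k_normal):
    print(k.__name__, np.sum(k(t)) * dt)
```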


Kernel Methods

In kernel methods, the locality of influence is controlled by a window around the point of interest. The choice of the size of the window is the most important issue in the use of kernel methods.

In practice, for a given choice of the size of the window, the argument of the kernel function is transformed to reflect the size. The transformation is accomplished using a positive definite matrix, $V$, whose determinant measures the volume (size) of the window.


Kernel Methods

To estimate the function $f$ at the point $x$, we first decompose $f$ to have a factor that is a probability density function, $p$,

$$f(x) = g(x)p(x).$$

For a given set of data, $x_1, \dots, x_n$, and a given scaling transformation matrix $V$, the kernel estimator of the function at the point $x$ is

$$\hat{f}(x) = (n|V|)^{-1} \sum_{i=1}^{n} g(x_i)\, K\big(V^{-1}(x - x_i)\big). \tag{1}$$

In the univariate case, the size of the window is just the width $h$. The argument of the kernel is transformed to $s/h$, so the function that is convolved with the function of interest is $K(s/h)/h$. The univariate kernel estimator is

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} g(x_i)\, K\!\left(\frac{x - x_i}{h}\right).$$
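For a density, $g = 1$ and the univariate estimator above reduces to the familiar kernel density estimator. A minimal sketch with the normal kernel; the sample and the bandwidth $h = 0.4$ are arbitrary choices for illustration.

```python
import numpy as np

# Univariate kernel estimator (1/(n h)) * sum_i K((x - x_i)/h),
# the g = 1 (density) case of equation (1).
def kde(x, data, h, kernel):
    x = np.asarray(x, dtype=float)[:, None]        # evaluation points
    data = np.asarray(data, dtype=float)[None, :]  # observations x_i
    return kernel((x - data) / h).sum(axis=1) / (data.shape[1] * h)

k_normal = lambda t: np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(2)
data = rng.normal(size=200)
grid = np.linspace(-4.0, 4.0, 9)
print(kde(grid, data, h=0.4, kernel=k_normal))
```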


Pointwise Properties of Function Estimators

The statistical properties of an estimator of a function at a given point are analogous to the usual statistical properties of an estimator of a scalar parameter. The statistical properties involve expectations or other properties of random variables.


Bias

The bias of the estimator of a function value at the point $x$ is

$$\mathrm{E}(\hat{f}(x)) - f(x).$$

If this bias is zero, we would say that the estimator is unbiased at the point $x$. If the estimator is unbiased at every point $x$ in the domain of $f$, we say that the estimator is pointwise unbiased.

Obviously, in order for $\hat{f}(\cdot)$ to be pointwise unbiased, it must be defined over the full domain of $f$.


Variance

The variance of the estimator at the point $x$ is

$$\mathrm{V}(\hat{f}(x)) = \mathrm{E}\Big(\big(\hat{f}(x) - \mathrm{E}(\hat{f}(x))\big)^2\Big).$$

Estimators with small variance are generally more desirable, and an optimal estimator is often taken to be the one with the smallest variance among a class of unbiased estimators.


Mean Squared Error

The mean squared error, MSE, at the point $x$ is

$$\mathrm{MSE}(\hat{f}(x)) = \mathrm{E}\big((\hat{f}(x) - f(x))^2\big). \tag{2}$$

The mean squared error is the sum of the variance and the square of the bias:

$$\mathrm{MSE}(\hat{f}(x)) = \mathrm{E}\big((\hat{f}(x))^2 - 2\hat{f}(x)f(x) + (f(x))^2\big) = \mathrm{V}(\hat{f}(x)) + \big(\mathrm{E}(\hat{f}(x)) - f(x)\big)^2. \tag{3}$$
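The decomposition (3) is easy to verify numerically. A quick Monte Carlo sketch with a deliberately biased estimator of a scalar (the shrunken sample mean and the normal population are assumptions for illustration):

```python
import numpy as np

# Monte Carlo check of equation (3): MSE = variance + bias^2, using
# the biased estimator 0.9 * xbar of the mean theta = 1.
rng = np.random.default_rng(3)
theta, n, reps = 1.0, 25, 200_000
est = 0.9 * rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)

mse = np.mean((est - theta) ** 2)
var = est.var()
bias_sq = (est.mean() - theta) ** 2
print(mse, var + bias_sq)  # agree up to Monte Carlo error
```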


Mean Squared Error

Sometimes, the variance of an unbiased estimator is much greater than that of an estimator that is only slightly biased, so it is often appropriate to compare the mean squared errors of the two estimators.

In some cases, as we will see, unbiased estimators do not exist, so rather than seek an unbiased estimator with a small variance, we seek an estimator with a small MSE.


Mean Absolute Error

The mean absolute error, MAE, at the point $x$ is similar to the MSE:

$$\mathrm{MAE}(\hat{f}(x)) = \mathrm{E}\big(|\hat{f}(x) - f(x)|\big). \tag{4}$$

It is more difficult to do mathematical analysis of the MAE than of the MSE. Furthermore, the MAE does not have a simple decomposition into other meaningful quantities similar to that of the MSE.


Consistency

Consistency of an estimator refers to the convergence of the expected value of the estimator to what is being estimated as the sample size increases without bound.

If $m$ is a function (maybe a vector-valued function that is an elementwise norm), we can define consistency of an estimator $T_n$ in terms of $m$ if

$$\mathrm{E}(m(T_n - \theta)) \to 0. \tag{5}$$


Rate of Convergence

If convergence does occur, we are interested in the rate of convergence. We define the rate of convergence in terms of a function of $n$, say $r(n)$, such that

$$\mathrm{E}(m(T_n - \theta)) = O(r(n)).$$

A common form of $r(n)$ is $n^{\alpha}$, where $\alpha < 0$.

For example, in the simple case of a univariate population with a finite mean $\mu$ and finite second moment, using the sample mean $\bar{x}$ as the estimator $T_n$ and $m(z) = z^2$, we have

$$\mathrm{E}(m(\bar{x} - \mu)) = \mathrm{E}((\bar{x} - \mu)^2) = \mathrm{MSE}(\bar{x}) = O(n^{-1}).$$
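A simulation sketch of this rate: doubling $n$ should roughly halve the empirical MSE of the sample mean, so $n \cdot \mathrm{MSE}$ stays nearly constant. The population and sample sizes below are arbitrary choices.

```python
import numpy as np

# Empirical illustration that MSE(xbar) = O(1/n): n * MSE stays near
# the population variance (1 here) as n doubles.
rng = np.random.default_rng(4)
reps = 100_000
for n in (10, 20, 40, 80):
    xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
    mse = np.mean(xbar ** 2)
    print(n, mse, n * mse)
```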


Pointwise Consistency

In the estimation of a function, we say that the estimator $\hat{f}$ of the function $f$ is pointwise consistent if

$$\mathrm{E}(\hat{f}(x)) \to f(x) \tag{6}$$

for every $x$ in the domain of $f$. If the convergence in expression (6) is in probability, for example, we say that the estimator is weakly pointwise consistent.

We can also define other kinds of pointwise consistency in function estimation along the lines of other types of consistency.


Global Properties of Estimators of Functions

Often, we are interested in some measure of the statistical properties of an estimator of a function over the full domain of the function. The obvious way of defining statistical properties of an estimator of a function is to integrate the pointwise properties.

Statistical properties of a function, such as the bias of the function, are often defined in terms of a norm of the function. For comparing $\hat{f}(x)$ and $f(x)$, the $L_p$ norm of the error is

$$\left(\int_D |\hat{f}(x) - f(x)|^p \, dx\right)^{1/p}, \tag{7}$$

where $D$ is the domain of $f$. The integral may not exist, of course. Clearly, the estimator $\hat{f}$ must also be defined over the same domain.


Convergence Norms

Three useful measures are the $L_1$ norm, also called the integrated absolute error, or IAE,

$$\mathrm{IAE}(\hat{f}) = \int_D \big|\hat{f}(x) - f(x)\big| \, dx, \tag{8}$$

the square of the $L_2$ norm, also called the integrated squared error, or ISE,

$$\mathrm{ISE}(\hat{f}) = \int_D \big(\hat{f}(x) - f(x)\big)^2 \, dx, \tag{9}$$

and the $L_\infty$ norm, the sup absolute error, or SAE,

$$\mathrm{SAE}(\hat{f}) = \sup \big|\hat{f}(x) - f(x)\big|. \tag{10}$$
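All three measures are easy to approximate on a grid. A sketch comparing a kernel estimate with the standard normal density that generated its data; the sample, bandwidth, and grid are assumptions for the example.

```python
import numpy as np

# Grid approximations of IAE, ISE, and SAE for a normal-kernel density
# estimate versus the true standard normal density.
rng = np.random.default_rng(5)
data = rng.normal(size=500)
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]

h = 0.3
f_hat = np.exp(-((x[:, None] - data[None, :]) / h) ** 2 / 2.0).sum(axis=1)
f_hat /= data.size * h * np.sqrt(2.0 * np.pi)
f = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

err = f_hat - f
print("IAE:", np.sum(np.abs(err)) * dx)  # equation (8)
print("ISE:", np.sum(err**2) * dx)       # equation (9)
print("SAE:", np.max(np.abs(err)))       # equation (10)
```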


Convergence Norms

The $L_1$ measure is invariant under monotone transformations of the coordinate axes, but the measure based on the $L_2$ norm is not.

The $L_\infty$ norm, or SAE, is the measure most often used in general function approximation. In statistical applications, this measure applied to two cumulative distribution functions is the Kolmogorov distance. The measure is not so useful in comparing densities and is not often used in density estimation.


Convergence Measures

Other measures of the difference between $\hat{f}$ and $f$ over the full range of $x$ are the Kullback-Leibler measure,

$$\int_D \hat{f}(x) \log\left(\frac{\hat{f}(x)}{f(x)}\right) dx,$$

and the Hellinger distance,

$$\left(\int_D \big(\hat{f}^{1/p}(x) - f^{1/p}(x)\big)^p \, dx\right)^{1/p}.$$
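A numerical sketch of both measures, taking $p = 2$ in the Hellinger distance (its usual form), with two fully known normal densities standing in for an estimate and its target; all specifics are assumptions.

```python
import numpy as np

# Kullback-Leibler measure and Hellinger distance (p = 2) computed by
# Riemann sums on a grid.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
npdf = lambda t, mu, sd: np.exp(-((t - mu) / sd) ** 2 / 2.0) / (sd * np.sqrt(2.0 * np.pi))

f_hat = npdf(x, 0.5, 1.2)  # plays the role of the estimate
f = npdf(x, 0.0, 1.0)      # the target density

kl = np.sum(f_hat * np.log(f_hat / f)) * dx
hellinger = (np.sum((np.sqrt(f_hat) - np.sqrt(f)) ** 2) * dx) ** 0.5
print(kl, hellinger)
```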


Integrated Bias and Variance

We now want to develop global concepts of bias and variance for estimators of functions. Bias and variance are statistical properties that involve expectations of random variables.

The obvious global measures of bias and variance are just the pointwise measures integrated over the domain. (In the case of the bias, of course, we must integrate the absolute value; otherwise, points of negative bias could cancel out points of positive bias.)


Integrated Bias

Because we are interested in the bias over the domain of the function, we define the integrated absolute bias as

$$\mathrm{IAB}(\hat{f}) = \int_D \big|\mathrm{E}(\hat{f}(x)) - f(x)\big| \, dx \tag{11}$$

and the integrated squared bias as

$$\mathrm{ISB}(\hat{f}) = \int_D \big(\mathrm{E}(\hat{f}(x)) - f(x)\big)^2 \, dx. \tag{12}$$

If the estimator is unbiased, both the integrated absolute bias and the integrated squared bias are 0. This, of course, would mean that the estimator is pointwise unbiased almost everywhere.

Although it is not uncommon to have unbiased estimators of scalar parameters or even of vector parameters with a countable number of elements, it is not likely that an estimator of a function could be unbiased at almost all points in a dense domain.


Integrated Variance

The integrated variance is defined in a similar manner:

$$\mathrm{IV}(\hat{f}) = \int_D \mathrm{V}(\hat{f}(x)) \, dx = \int_D \mathrm{E}\big((\hat{f}(x) - \mathrm{E}(\hat{f}(x)))^2\big) \, dx. \tag{13}$$


Integrated Mean Squared Error

As we suggested before, global unbiasedness is generally not to be expected. An important measure for comparing estimators of functions is therefore based on the mean squared error.

The integrated mean squared error is

$$\mathrm{IMSE}(\hat{f}) = \int_D \mathrm{E}\big((\hat{f}(x) - f(x))^2\big) \, dx = \mathrm{IV}(\hat{f}) + \mathrm{ISB}(\hat{f}). \tag{14}$$

(Compare equations (2) and (3) above.)


Integrated Mean Squared Error

If the expectation integration can be interchanged with the outer integration in the expression above, we have

$$\mathrm{IMSE}(\hat{f}) = \mathrm{E}\left(\int_D (\hat{f}(x) - f(x))^2 \, dx\right) = \mathrm{MISE}(\hat{f}),$$

the mean integrated squared error.

We will assume that this interchange leaves the integrals unchanged, so we will use MISE and IMSE interchangeably.


Integrated Mean Absolute Error

Similarly, for the integrated mean absolute error, we have

$$\mathrm{IMAE}(\hat{f}) = \int_D \mathrm{E}\big(|\hat{f}(x) - f(x)|\big) \, dx = \mathrm{E}\left(\int_D |\hat{f}(x) - f(x)| \, dx\right) = \mathrm{MIAE}(\hat{f}),$$

the mean integrated absolute error.


Mean SAE

The mean sup absolute error, or MSAE, is

$$\mathrm{MSAE}(\hat{f}) = \int_D \mathrm{E}\big(\sup|\hat{f}(x) - f(x)|\big) \, dx. \tag{15}$$

This measure is not very useful unless the variation in the function $f$ is relatively small. For example, if $f$ is a density function, $\hat{f}$ can be a "good" estimator, yet the MSAE may be quite large. On the other hand, if $f$ is a cumulative distribution function (monotonically ranging from 0 to 1), the MSAE may be a good measure of how well the estimator performs.

As mentioned earlier, the SAE is the Kolmogorov distance. The Kolmogorov distance (and, hence, the SAE and the MSAE) does poorly in measuring differences in the tails of a distribution.


Large-Sample Properties

The pointwise consistency properties are extended to the full function in the obvious way. Consistency of the function estimator is defined in terms of

$$\int_D \mathrm{E}\big(m(\hat{f}(x) - f(x))\big) \, dx \to 0.$$

The estimator of the function is said to be mean square consistent or $L_2$ consistent if the MISE converges to 0; that is,

$$\int_D \mathrm{E}\big((\hat{f}(x) - f(x))^2\big) \, dx \to 0.$$

If the convergence is weak, that is, if it is convergence in probability, we say that the function estimator is weakly consistent; if the convergence is strong, that is, if it is convergence almost surely or with probability 1, we say that the function estimator is strongly consistent.


Large-Sample Properties

The estimator of the function is said to be $L_1$ consistent if the mean integrated absolute error (MIAE) converges to 0; that is,

$$\int_D \mathrm{E}\big(|\hat{f}(x) - f(x)|\big) \, dx \to 0.$$

As with the other kinds of consistency, the nature of the convergence in the definition may be expressed in the qualifiers "weak" or "strong".

As we have mentioned above, the integrated absolute error is invariant under monotone transformations of the coordinate axes, but the $L_2$ measures are not. As with most work in $L_1$, however, derivation of various properties of the IAE or MIAE is more difficult than for the analogous properties with respect to $L_2$ criteria.


Large-Sample Properties

If the MISE converges to 0, we are interested in the rate of convergence. To determine this, we seek an expression of the MISE as a function of $n$. We do this by a Taylor series expansion.

In general, if $\hat{\theta}$ is an estimator of $\theta$, the Taylor series for $\mathrm{ISE}(\hat{\theta})$, equation (9), about the true value is

$$\mathrm{ISE}(\hat{\theta}) = \sum_{k=0}^{\infty} \frac{1}{k!} (\hat{\theta} - \theta)^k \, \mathrm{ISE}^{(k)}(\theta), \tag{16}$$

where $\mathrm{ISE}^{(k)}(\theta)$ represents the $k$th derivative of the ISE evaluated at $\theta$.

Taking the expectation in equation (16) yields the MISE. The limit of the MISE as $n \to \infty$ is the asymptotic mean integrated squared error, AMISE. One of the most important properties of an estimator is the order of the AMISE.


Large-Sample Properties

In the case of an unbiased estimator, the first two terms in the Taylor series expansion are zero, and the AMISE is

$$\mathrm{V}(\hat{\theta}) \, \mathrm{ISE}''(\theta)$$

to terms of second order.


Other Global Properties of Estimators of Functions: Roughness

There are often other properties that we would like an estimator of a function to possess. We may want the estimator to weight given functions in some particular way. For example, if we know how the function to be estimated, $f$, weights a given function $r$, we may require that the estimate $\hat{f}$ weight the function $r$ in the same way; that is,

$$\int_D r(x)\hat{f}(x) \, dx = \int_D r(x)f(x) \, dx.$$

We may want to restrict the minimum and maximum values of the estimator. For example, because many functions of interest are nonnegative, we may want to require that the estimator be nonnegative.


Other Global Properties of Estimators of Functions: Roughness

We may want to restrict the variation in the function. This can be thought of as the "roughness" of the function. A reasonable measure of the variation is

$$\int_D \left(f(x) - \int_D f(t) \, dt\right)^2 dx.$$

If the integral $\int_D f(x) \, dx$ is constrained to be some constant (such as 1 in the case that $f(x)$ is a probability density), then the variation can be measured by the square of the $L_2$ norm,

$$S(f) = \int_D (f(x))^2 \, dx. \tag{17}$$


Other Global Properties of Estimators of Functions: Roughness

We may want to restrict the derivatives of the estimator or the smoothness of the estimator. Another intuitive measure of the roughness of a twice-differentiable and integrable univariate function $f$ is the integral of the square of the second derivative:

$$R(f) = \int_D (f''(x))^2 \, dx. \tag{18}$$

Often, in function estimation, we may seek an estimator $\hat{f}$ such that its roughness (by some definition) is small.
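Both roughness measures (17) and (18) are straightforward to approximate numerically. A sketch for the standard normal density on a truncated domain; the grid and truncation are assumptions, and the second derivative is computed by finite differences.

```python
import numpy as np

# Numerical versions of the roughness measures S(f) and R(f),
# applied to the standard normal density.
x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

S = np.sum(f**2) * dx                   # equation (17)
f2 = np.gradient(np.gradient(f, x), x)  # numerical second derivative
R = np.sum(f2**2) * dx                  # equation (18)

# Exact values for the standard normal: S = 1/(2 sqrt(pi)) and
# R = 3/(8 sqrt(pi)); the numerical results should be close.
print(S, 1.0 / (2.0 * np.sqrt(np.pi)))
print(R, 3.0 / (8.0 * np.sqrt(np.pi)))
```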


Nonparametric Probability Density Estimation

Estimation of a probability density function is similar to the estimation of any function, and the properties of the function estimators that we have discussed are relevant for density function estimators.

A density function $p(y)$ is characterized by two properties:

• it is nonnegative everywhere;

• it integrates to 1 (with the appropriate definition of "integrate").


Nonparametric Probability Density Estimation

We consider several nonparametric estimators of a density; that is, estimators of a general nonnegative function that integrates to 1 and for which we make no assumptions about a functional form other than, perhaps, smoothness.

It seems reasonable that we require the density estimate to have the characteristic properties of a density:

• $\hat{p}(y) \ge 0$ for all $y$;

• $\int_{\mathbb{R}^d} \hat{p}(y) \, dy = 1$.


Bona Fide Density Estimator

A probability density estimator that is nonnegative and integrates to 1 is called a bona fide estimator.

Rosenblatt has shown that no unbiased bona fide estimator can exist for all continuous $p$. Rather than requiring an unbiased estimator that cannot be a bona fide estimator, we generally seek a bona fide estimator with small mean squared error, or a sequence of bona fide estimators $\hat{p}_n$ that are asymptotically unbiased; that is,

$$\mathrm{E}_p(\hat{p}_n(y)) \to p(y) \quad \text{for all } y \in \mathbb{R}^d \text{ as } n \to \infty.$$


The Likelihood Function

Suppose that we have a random sample, $y_1, \dots, y_n$, from a population with density $p$. Treating the density $p$ as a variable, we write the likelihood functional as

$$L(p; y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i).$$

The maximum likelihood method of estimation obviously cannot be used directly, because this functional is unbounded in $p$. We may, however, seek an estimator that maximizes some modification of the likelihood.


Modified Maximum Likelihood Estimation

There are two reasonable ways to approach this problem.

One is to restrict the domain of the optimization problem; this is called restricted maximum likelihood. The other is to regularize the estimator by adding a penalty term to the functional to be optimized; this is called penalized maximum likelihood.


Restricted Maximum Likelihood Estimation

We may seek to maximize the likelihood functional subject to the constraint that $p$ be a bona fide density. If we put no further restrictions on the function $p$, however, infinite Dirac spikes at each observation give an unbounded likelihood, so a maximum likelihood estimator cannot exist, subject only to the restriction to the bona fide class.

An additional restriction that $p$ be Lebesgue-integrable over some domain $D$ (that is, $p \in L_1(D)$) does not resolve the problem, because we can construct sequences of finite spikes at each observation that grow without bound.

We therefore must restrict the class further.


Restricted Maximum Likelihood Estimation

Consider a finite-dimensional class, such as the class of step functions that are bona fide density estimators. We assume that the sizes of the regions over which the step function is constant are greater than 0.

For a step function with $m$ regions having constant values $c_1, \dots, c_m$, the likelihood is

$$L(c_1, \dots, c_m; y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i) = \prod_{k=1}^{m} c_k^{n_k}, \tag{19}$$

where $n_k$ is the number of data points in the $k$th region.


continued ...

For the step function to be a bona fide estimator, all $c_k$ must be nonnegative and finite. A maximum therefore exists in the class of step functions that are bona fide estimators.

If $v_k$ is the measure of the volume of the $k$th region (that is, $v_k$ is the length of an interval in the univariate case, the area in the bivariate case, and so on), we have

$$\sum_{k=1}^{m} c_k v_k = 1.$$

We incorporate this constraint together with equation (19) to form the Lagrangian,

$$L(c_1, \dots, c_m) + \lambda\left(1 - \sum_{k=1}^{m} c_k v_k\right).$$


continued ...

Differentiating the Lagrangian function and setting the derivative to zero, we have at the maximum point $c_k = c_k^*$, for any $\lambda$,

$$\frac{\partial L}{\partial c_k} = \lambda v_k.$$

Using the derivative of $L$ from equation (19), we get

$$n_k L = \lambda c_k^* v_k.$$

Summing both sides of this equation over $k$, we have

$$nL = \lambda,$$

and then, substituting, we have

$$n_k L = nL \, c_k^* v_k.$$


continued ...

Therefore, the maximum of the likelihood occurs at

$$c_k^* = \frac{n_k}{n v_k}.$$

The restricted maximum likelihood estimator is therefore

$$\hat{p}(y) = \begin{cases} \dfrac{n_k}{n v_k}, & \text{for } y \in \text{region } k, \\[1ex] 0, & \text{otherwise.} \end{cases} \tag{20}$$
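Equation (20) is exactly a density histogram with predetermined bins. A minimal univariate sketch, checking that the estimate integrates to 1; the bin edges and the normal sample are assumptions for the example.

```python
import numpy as np

# The restricted ML estimator of equation (20): a step function with
# value n_k / (n v_k) on the k-th region (i.e., a density histogram).
rng = np.random.default_rng(6)
y = rng.normal(size=500)
edges = np.linspace(-4.0, 4.0, 17)       # fixed regions (bins)

counts, _ = np.histogram(y, bins=edges)  # n_k per region
v = np.diff(edges)                        # region volumes v_k
c_star = counts / (y.size * v)            # c_k* = n_k / (n v_k)

# Bona fide check: sum_k c_k* v_k = 1 (exactly, when every
# observation falls within the outermost edges).
print(np.sum(c_star * v))
```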


Restricted Maximum Likelihood Estimation

Instead of restricting the density estimate to step functions, we could consider other classes of functions, such as piecewise linear functions. We may also seek other properties, such as smoothness, for the estimated density.

One way of achieving other desirable properties for the estimator is to use a penalizing function to modify the function to be optimized.


continued ...

Instead of the likelihood function, we may use a penalized likelihood function of the form

$$L_p(p; y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i) \, e^{-T(p)},$$

where $T(p)$ is a transform that measures some property that we would like to minimize.

For example, to achieve smoothness, we may use the transform $R(p)$ of equation (18) in the penalizing factor.

To choose a function $\hat{p}$ to maximize $L_p(p)$, we would have to use some finite series approximation to $T(\hat{p})$.
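A sketch of the idea in a deliberately simple setting (not the general functional optimization): maximize the penalized log-likelihood $\sum_i \log p(y_i) - \lambda T(p)$ over a normal family, with $T(p) = R(p)$ of equation (18) computed on a grid. The family, the penalty weight, and the grid search are all assumptions for illustration.

```python
import numpy as np

# Penalized maximum likelihood within a parametric family: the
# roughness penalty nudges the optimum toward a smoother density.
rng = np.random.default_rng(7)
y = rng.normal(0.0, 1.0, size=200)
x = np.linspace(-8.0, 8.0, 2001)
dx = x[1] - x[0]
lam = 5.0  # penalty weight (an arbitrary choice)

def penalized_loglik(mu, sigma):
    loglik = np.sum(-0.5 * ((y - mu) / sigma) ** 2
                    - np.log(sigma * np.sqrt(2.0 * np.pi)))
    p = np.exp(-((x - mu) / sigma) ** 2 / 2.0) / (sigma * np.sqrt(2.0 * np.pi))
    p2 = np.gradient(np.gradient(p, x), x)
    roughness = np.sum(p2**2) * dx        # R(p), equation (18)
    return loglik - lam * roughness

# Crude grid search over sigma with mu fixed at the sample mean; the
# penalized optimum sits at a slightly larger sigma than the MLE.
sigmas = np.linspace(0.5, 2.0, 151)
best = max(sigmas, key=lambda s: penalized_loglik(y.mean(), s))
print(best)
```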
