Additional Topics in Prediction Methodology

Introduction

• The predictive distribution of the random variable Y0 is meant to capture all the information about Y0 that is contained in the training data Yn.

• It does not completely specify Y0, but it does provide a probability distribution over more likely and less likely values of Y0.

• E{Y0|Yn} is the best MSPE (mean squared prediction error) predictor of Y0
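
The optimality of the conditional mean can be seen from a standard decomposition of the MSPE. In the sketch below, d(Yn) denotes an arbitrary predictor built from the training data; this symbol is introduced here for illustration and does not appear on the slides. The cross term vanishes by iterated expectation, assuming finite second moments.

```latex
\begin{aligned}
E\{(Y_0 - d(Y^n))^2\}
  &= E\{(Y_0 - E\{Y_0 \mid Y^n\})^2\}
   + E\{(E\{Y_0 \mid Y^n\} - d(Y^n))^2\} \\
  &\ge E\{(Y_0 - E\{Y_0 \mid Y^n\})^2\},
\end{aligned}
```

with equality exactly when d(Yn) = E{Y0|Yn} almost surely.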

Hierarchical models have two stages

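• X ⊂ R^d: the input space

• f0 = f(x0): known p×1 vector of regression functions evaluated at x0

• F = (fj(xi)): known n×p matrix of regression functions evaluated at the training inputs

• β: unknown p×1 vector of regression coefficients

• R = (R(xi − xj)): known n×n matrix of correlations among the training data Yn

• r0 = (R(xi − x0)): known n×1 vector of correlations of Y0 with Yn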
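
The first-stage equation did not survive in this transcript. The display below is a reconstruction from the definitions above, assuming the usual Gaussian process regression setup with process variance σz²; it should be read as a sketch rather than as the slide's exact formula. The second stage then places a prior on β (and on σz² when it is unknown).

```latex
\begin{pmatrix} Y_0 \\ Y^n \end{pmatrix} \Bigm| \beta, \sigma_z^2
\;\sim\;
N_{1+n}\!\left(
  \begin{pmatrix} f_0^\top \\ F \end{pmatrix} \beta,\;
  \sigma_z^2 \begin{pmatrix} 1 & r_0^\top \\ r_0 & R \end{pmatrix}
\right)
```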

Predictive Distributions when σz², R and r0 are known

Interesting features of (a) and (b)

• The non-informative prior is the limit of the normal prior as its variance tends to infinity (τ² → ∞)

• The non-informative prior is not a proper distribution, but the corresponding predictive distribution is proper.

• The same conditioning argument can be applied to derive the posterior mean for both the non-informative prior and the normal prior.

The mean and variance of the predictive distribution (mean)

• μ0|n(x0) and σ0|n²(x0) depend on x0 only through the regression vector f0 and the correlation vector r0

• μ0|n(x0) is a linear unbiased predictor of Y(x0)

• The continuity and other smoothness properties of μ0|n(x0) are inherited from the correlation function R(·) and the regressors {fj(·)}, j = 1, …, p

• μ0|n(x0) depends on the parameters σz² and τ² (the scale of the prior variance of β) only through their ratio

• μ0|n(x0) interpolates the training data: when x0 = xi, f0 = f(xi) and r0ᵀR⁻¹ = eiᵀ, the ith unit vector, so μ0|n(xi) = Yi

Example: y(x) = e^(−1.4x) cos(7πx/2)

The mean and variance of the predictive distribution (Variance)

• MSPE(μ0|n(x0)) = σ0|n²(x0)

• The variance of the posterior of Y(x0) given Yn should be 0 whenever x0 = xi

• σ0|n²(xi) = 0
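
The interpolation and zero-variance properties can be checked numerically. The following is a minimal sketch, not the slides' own computation. It assumes the non-informative prior on β, a constant mean (p = 1, f(x) = 1), a Gaussian correlation R(h) = exp(−θh²) with θ = 20, seven equally spaced training inputs on [0, 1], and the damped cosine test function y(x) = e^(−1.4x) cos(7πx/2). All of these choices, and the helper names `corr`, `solve` and `predict`, are illustrative assumptions. The mean is f0ᵀβ̂ + r0ᵀR⁻¹(Yn − Fβ̂) and the variance is σz²{1 − r0ᵀR⁻¹r0 + hᵀ(FᵀR⁻¹F)⁻¹h} with h = f0 − FᵀR⁻¹r0.

```python
import math


def y_true(x):
    """Assumed test function: damped cosine on [0, 1]."""
    return math.exp(-1.4 * x) * math.cos(3.5 * math.pi * x)


def corr(h, theta=20.0):
    """Gaussian correlation R(h) = exp(-theta * h^2); theta is an assumed value."""
    return math.exp(-theta * h * h)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def predict(x0, xs, ys, sigma2_z=1.0):
    """Predictive mean and variance at x0 for a constant mean (F is a column of ones)."""
    n = len(xs)
    R = [[corr(xi - xj) for xj in xs] for xi in xs]
    r0 = [corr(xi - x0) for xi in xs]
    Rinv_F = solve(R, [1.0] * n)   # R^{-1} F
    Rinv_y = solve(R, ys)          # R^{-1} Y^n
    Rinv_r0 = solve(R, r0)         # R^{-1} r0
    FtRinvF = sum(Rinv_F)          # F^T R^{-1} F (a scalar because p = 1)
    beta_hat = sum(Rinv_y) / FtRinvF  # generalized least squares estimate of beta
    # Mean: f0^T beta_hat + r0^T R^{-1} (Y^n - F beta_hat)
    mean = beta_hat + sum(a * (y - beta_hat) for a, y in zip(Rinv_r0, ys))
    # Variance: sigma_z^2 * (1 - r0^T R^{-1} r0 + h^2 / (F^T R^{-1} F)), h = f0 - F^T R^{-1} r0
    h = 1.0 - sum(Rinv_r0)
    quad = sum(a * b for a, b in zip(r0, Rinv_r0))
    var = sigma2_z * (1.0 - quad + h * h / FtRinvF)
    return mean, var


xs = [i / 6 for i in range(7)]          # assumed training inputs
ys = [y_true(x) for x in xs]
mean_train, var_train = predict(xs[3], xs, ys)   # at a training point
mean_mid, var_mid = predict(0.25, xs, ys)        # between two training points
print(mean_train - ys[3], var_train, var_mid)
```

At the training input the printed difference and variance should be zero up to rounding error, while the variance at the untried input 0.25 is strictly positive.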

Most important use of Theorem 4.1.1

Predictive Distributions when R and r0 are known

The posterior is a location-shifted and scaled univariate t distribution whose degrees of freedom are increased when there is informative prior information for either β or σz²

Degrees of freedom

• Base value for the degrees of freedom: νi = n − p

• p additional degrees of freedom when the prior for β is informative

• ν0 additional degrees of freedom when the prior for σz² is informative
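
The bookkeeping above can be written as a small function. This is a sketch of the arithmetic only; the function name and arguments are hypothetical, and ν0 is taken here to be the degrees of freedom carried by an informative prior on σz² (an assumption about the lost symbol on the slide).

```python
def degrees_of_freedom(n, p, beta_informative=False, nu0=None):
    """Degrees of freedom of the predictive t distribution.

    n: number of training points; p: number of regression coefficients.
    beta_informative: True when the prior for beta is informative.
    nu0: prior degrees of freedom for sigma_z^2, or None when that
         prior is non-informative (assumed interpretation).
    """
    nu = n - p                 # base value
    if beta_informative:
        nu += p                # informative prior on beta adds p
    if nu0 is not None:
        nu += nu0              # informative prior on sigma_z^2 adds nu0
    return nu


# Illustrative values only: n = 14 training points, p = 3 regressors.
print(degrees_of_freedom(14, 3))
print(degrees_of_freedom(14, 3, beta_informative=True, nu0=5))
```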

Location shift

The same centering value as in Theorem 4.1.1 (known σz²)

The non-informative prior gives the BLUP

Scale factor σi²(x0)

(compare 4.1.15 with 4.1.6)

• An estimate of the scale factor σ0|n²(x0).

• Qi²/νi: estimates σz²

• Qi²: combines information about σz² from the conditional distribution of Yn given σz² with information from the prior of σz²

• σi²(xi) = 0 when xi is any of the training data points.

Predictive Distributions when Correlation Parameters are unknown

• What if the correlations among the observations are unknown (R and r0 are unknown)?
– Assume y(·) has a Gaussian process prior with correlation function R(·|ψ), where ψ is an unknown vector of parameters

• Two issues
– The standard error of the plug-in predictor μ0|n(x0|ψ̂), obtained by substituting an estimate ψ̂ of ψ that comes from MLE or REML
– The Bayesian approach to uncertainty in ψ, which is to model it by a prior distribution

Prediction of Multiple Response Models

• Several outputs are available from a computer experiment

• Several codes are available for computing the same response (fast and slow code)

• Competing responses

• Several stochastic models for the joint response

• Using these models to describe the optimal predictor for one of the several computed responses.

Modeling Multiple Outputs

• Zi(·): marginally mean-zero stationary Gaussian stochastic processes with unknown variance σi² and correlation function Ri(·)

• Stationarity of Zi(·) implies that the correlation between Zi(x1) and Zi(x2) depends only on x1 − x2

• Assume Cov(Zi(x1), Zj(x2)) = σiσjRij(x1 − x2)

• Rij(·): cross-correlation function of Zi(·) and Zj(·)

• Linear model: global mean of the Yi process. fi(·): known regression functions; βi: unknown regression parameters

Selection of correlation and cross-correlation functions is complicated

• Reason: for any input sites x_l^i, the multivariate normally distributed random vector (Z1(x_1^1), …)ᵀ must have a nonnegative definite covariance matrix

• Solution: construct the Zi(·) from a set of elementary processes (usually these processes are mutually independent)

Example by Kennedy and O’Hagan

• Yi(x): prior for the ith code level (i = m is the top-level code). The autoregressive model:
– Yi(x) = ρi−1 Yi−1(x) + δi(x), i = 2, …, m

• The output of each successively higher-level code i at x is related to the output of the less precise code i−1 at x plus the refinement δi(x)

– Cov(Yi(x), Yi−1(w) | Yi−1(x)) = 0 for all w ≠ x

• No additional second-order knowledge of code i at x can be obtained from the lower-level code i−1 if the value of code i−1 at x is known (Markov property on the hierarchy of codes)

• Some applications have no natural hierarchy of computer codes, so a different model is needed.
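
The Markov property of the autoregressive model, and the fact that building processes from independent pieces yields a valid joint covariance, can both be checked for a two-level case. The sketch below assumes Y2(x) = ρ Y1(x) + δ(x) with Y1 and δ independent stationary Gaussian processes; the value of ρ, the variances, and the Gaussian correlation functions are illustrative assumptions, and all function names are hypothetical.

```python
import math

RHO = 0.8            # assumed regression parameter linking the two code levels
VAR1, VARD = 1.0, 0.5  # assumed process variances of Y1 and of the refinement delta


def r1(h):
    """Assumed correlation function of Y1."""
    return math.exp(-10.0 * h * h)


def rd(h):
    """Assumed correlation function of the refinement delta."""
    return math.exp(-30.0 * h * h)


def cov11(x, w):
    return VAR1 * r1(x - w)                      # Cov(Y1(x), Y1(w))


def cov21(x, w):
    return RHO * VAR1 * r1(x - w)                # Cov(Y2(x), Y1(w))


def cov22(x, w):
    return RHO ** 2 * VAR1 * r1(x - w) + VARD * rd(x - w)  # Cov(Y2(x), Y2(w))


def cond_cov_y2_y1(x, w):
    """Cov(Y2(x), Y1(w) | Y1(x)) from the Gaussian conditioning formula."""
    return cov21(x, w) - cov21(x, x) * cov11(x, w) / cov11(x, x)


def joint_cov(xs):
    """Covariance matrix of (Y1(x_1..x_n), Y2(x_1..x_n))."""
    top = [[cov11(a, b) for b in xs] + [cov21(b, a) for b in xs] for a in xs]
    bottom = [[cov21(a, b) for b in xs] + [cov22(a, b) for b in xs] for a in xs]
    return top + bottom


def cholesky(A):
    """Lower Cholesky factor; raises ValueError if A is not positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


sites = [0.0, 0.25, 0.5, 0.75, 1.0]       # assumed input sites
L = cholesky(joint_cov(sites))            # succeeds: the joint covariance is valid
markov_check = cond_cov_y2_y1(0.2, 0.7)   # conditional covariance, expected to vanish
print(len(L), markov_check)
```

Because the joint process is assembled from independent elementary processes, the Cholesky factorization succeeds without any separate check on the cross-correlation function.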

More reasonable Model

• Each constraint function is associated with the objective function plus a refinement
– Yi(x) = ρi Y1(x) + δi(x), i = 2, …, m+1

• Ver Hoef and Barry
– Form models in the environmental sciences
– Include an unknown smooth surface plus a random measurement error
– Moving averages over white noise processes

Morris and Mitchell model

• Prior information about y(x) is specified by a Gaussian process Y(·)

• Prior information about the partial derivatives y(j)(x) is obtained by considering the “derivative” processes of Y(·)
– Y1(·) = Y(·), Y2(·) = Y(1)(·), …, Y1+m(·) = Y(m)(·)

• Natural prior for y(j)(x):

• The covariances between Y(x1), Y(j)(x2) and Y(i)(x1), Y(j)(x2) are:
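
The slide's own displays did not survive in this transcript. For a stationary process with variance σz² and a twice-differentiable correlation function R(·), the standard results for mean-square derivative processes take the form below, with h = x1 − x2. This is offered as a sketch under those assumptions, not as a copy of the original formulas.

```latex
\begin{aligned}
Y^{(j)}(x) &= \frac{\partial Y(x)}{\partial x_j}, \\
\mathrm{Cov}\{Y(x_1),\, Y^{(j)}(x_2)\}
  &= -\,\sigma_z^2\, \frac{\partial R(h)}{\partial h_j}, \\
\mathrm{Cov}\{Y^{(i)}(x_1),\, Y^{(j)}(x_2)\}
  &= -\,\sigma_z^2\, \frac{\partial^2 R(h)}{\partial h_i\, \partial h_j}.
\end{aligned}
```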

Optimal Predictors for Multiple Outputs

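• The best MSPE predictor based on the training data is the conditional mean E{Y0 | Y_1^{n_1} = y_1^{n_1}, …, Y_m^{n_m} = y_m^{n_m}}

• where Y0 = Y1(x0), Y_i^{n_i} = (Yi(x_1^i), …)ᵀ is the vector of training responses for output i, and y_i^{n_i} is its observed value, for i = 1, …, m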

The joint distribution is the multivariate normal distribution

Conditional expectation

…..

• In practice this predictor cannot be used directly: it requires knowledge of the marginal correlation functions, the cross-correlation functions, and the ratios of all the process variances

• Empirical versions are of practical use:
– Assume each of the correlation matrices Ri and cross-correlation matrices Rij is known up to a vector of parameters
– Estimate these parameters using MLE or REML
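
For reference, the conditional expectation of one component of a jointly normal vector has the standard form below. Here Yn stands for all training responses stacked into one vector, and the partitioned symbols μ0, μn, σ00, Σ0n and Σnn are notation assumed for this sketch rather than taken from the slides. The empirical predictor replaces the unknown correlation parameters inside these blocks by their MLE or REML estimates.

```latex
\begin{aligned}
E\{Y_0 \mid Y^n = y^n\} &= \mu_0 + \Sigma_{0n} \Sigma_{nn}^{-1} (y^n - \mu_n), \\
\mathrm{Var}\{Y_0 \mid Y^n = y^n\} &= \sigma_{00} - \Sigma_{0n} \Sigma_{nn}^{-1} \Sigma_{0n}^\top .
\end{aligned}
```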

Example 1

• The 14-point training design is space-filling, so it allows us to learn about the response over the entire input space

• Compare two models
– Using the predictor of y(·) based on y(·) alone
– Using the predictor of y(·) based on (y(·), y(1)(·), y(2)(·))

• The second predictor both fits better visually and has a 24% smaller ERMSPE (empirical root mean squared prediction error)

Thank you!
