Applied Mathematics and Computation 171 (2005) 1264–1281
www.elsevier.com/locate/amc
Exploiting Hessian matrix and trust-region algorithm in hyperparameters estimation of Gaussian process
Yunong Zhang a,*, W.E. Leithead a,b
a Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow G1 1QE, UK
b Hamilton Institute, National University of Ireland, Maynooth, Co. Kildare, Ireland
Abstract
Gaussian process (GP) regression is a Bayesian non-parametric regression model that shows good performance in various applications. However, research results on log-likelihood maximization algorithms are quite rare. Instead of the commonly used conjugate-gradient method, the Hessian matrix is first derived and simplified in this paper, and the trust-region optimization method is then presented for estimating the GP hyperparameters. Numerical experiments verify the theoretical analysis, showing the advantages of using the Hessian matrix and trust-region algorithms. In the GP context, the trust-region optimization method is a robust alternative to the conjugate-gradient method, also in view of future research on approximate and/or parallel GP implementation.
© 2005 Elsevier Inc. All rights reserved.
Keywords: Gaussian process; Log likelihood maximization; Conjugate gradient; Trust region;
Hessian matrix
0096-3003/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.amc.2005.01.113
* Corresponding author.
E-mail addresses: [email protected] (Y. Zhang), [email protected] (W.E. Leithead).
1. Introduction
In the last two decades, neural networks have been widely used for non-linear regression and classification. One attraction of neural networks is their universal approximation ability [1]. However, with a large number of parameters in use, there is a risk of overfitting [2]. In the limiting case of large networks, Neal [3] showed that the prior distribution over non-linear functions implied by a Bayesian neural network falls into a class of probability distributions known as Gaussian processes. This observation motivated the idea of discarding parameterized network models and working directly with Gaussian processes for regression. Following some pioneering work in the late 1990s (e.g., [4–6]), interest has grown quickly. In particular, Gaussian processes have recently been used in various examples and applications [7–12].
The idea of Gaussian processes for regression has actually been in use for a long time and can be traced back to [13–15]. Roughly speaking, a Gaussian process is characterized by a covariance function, and prior knowledge, such as the smoothness of the underlying function, is incorporated into this covariance function. A Gaussian process, specified by the covariance matrix, usually has a small set of hyperparameters $\Theta$, such as length scales and variance scales. In addition to the Monte Carlo approach, another standard and general practice for estimating hyperparameters is to minimize the negative log-likelihood function of the sample data over $\Theta$ [4,6,8,12,14]. As the likelihood is in general highly non-linear and multimodal, efficient and robust optimization algorithms are crucial for developing automated Gaussian processes, without much human intervention, for large-scale high-dimensional regression tasks.
To the best of our knowledge, and as confirmed by other research reports such as [4,8], only steepest-descent or conjugate-gradient methods have been investigated for the negative log-likelihood minimization, without much implementation or comparison detail. The scarcity of investigation into optimization algorithms for Gaussian process regression motivates this research. In this paper, we first derive and simplify the Hessian matrix of the log-likelihood function, which contains much useful information about the Gaussian process. In light of numerical experiments, we then propose the trust-region algorithm as an efficient and robust optimization alternative for Gaussian process regression. Comparison results with the conjugate-gradient method [4] are also presented with analysis and remarks.
The remainder of this paper is organized in five sections. For readability and completeness, preliminaries on Gaussian process regression are given in Section 2. The Hessian matrix is derived and simplified in Section 3. In addition to briefly reviewing other minimization algorithms, Section 4 presents the trust-region method for GP hyperparameters estimation. Numerical experiments and comparative results are discussed in Section 5. Lastly, Section 6 concludes the paper with final remarks.
2. Preliminaries
Let us first briefly state the regression problem: modelling a function $y(x)$ from given noisy data $\mathcal{D} = \{(x_i, t_i)\}_{i=1}^{N}$ (also termed sample/training data), where the input vector $x_i \in \mathbb{R}^L$, the output $t_i \in \mathbb{R}$, $N$ is the number of sample data points, and $L$ is the dimension of the input space. The tuned estimation model may then be used to make predictions at new input points (also termed test/prediction points), e.g., $t_{N+1}$ corresponding to $x_{N+1} \notin \mathcal{D}$. Neural networks have been a widespread parametric model for regression, typically with a large number of parameters such as weights and biases [1]. Their overfitting and Neal's observation, however, have drawn researchers' attention back to the non-parametric Gaussian process for modelling.
2.1. Gaussian process prior
Now let us write a general model of the data $\mathcal{D}$ in the Bayesian regression context:
$$t_i = y(x_i) + \nu_i,$$
where $\nu_i$ is the noise on $y(x_i)$, $i \in \{1,2,\dots,N\}$. The prior over the space of possible functions and the prior over the noise are defined as $P(y|\Theta_y)$ and $P(\nu|\Theta_\nu)$ respectively, where $\Theta_y$ and $\Theta_\nu$ are the hyperparameter sets for the function model and for the noise model. Given such hyperparameters, the probability of the data is obtained by integrating out $y$ and $\nu$:
$$P(t|x,\Theta_y,\Theta_\nu) = \int\!\!\int P(t|x,y,\nu)\,P(y|\Theta_y)\,P(\nu|\Theta_\nu)\,\mathrm{d}y\,\mathrm{d}\nu,$$
where $t := (t_1,t_2,\dots,t_N)$ and $x := (x_1,x_2,\dots,x_N)$. Defining $\hat{t} = (t_1,t_2,\dots,t_N,t_{N+1})$, the conditional distribution of $t_{N+1}$ corresponding to $x_{N+1}$ is
$$P(t_{N+1}|\mathcal{D},\Theta_y,\Theta_\nu,x_{N+1}) = \frac{P(\hat{t}\,|x,\Theta_y,\Theta_\nu,x_{N+1})}{P(t|x,\Theta_y,\Theta_\nu)}, \qquad (1)$$
which is used to make predictions about $t_{N+1}$. The aforementioned analytical integration for calculating $P(t|x,\Theta_y,\Theta_\nu)$ is usually very complicated, if not impossible, and a simpler approach based on Gaussian process priors has often been adopted [4–6,8,9,11,12].
The data vector $t$ is defined as a collection of random variables $t = (t(x_1), t(x_2), \dots)$ having the following Gaussian joint distribution under the standard zero-mean assumption:
$$P(t|x,\Theta) \propto \exp\left(-\frac{1}{2}\,t^{\mathrm T} C^{-1}(\Theta)\,t\right) \qquad (2)$$
for any set of inputs $x$. In (2), the $ij$th entry of the covariance matrix $C(\Theta)$ is the scalar covariance function $c(x_i,x_j;\Theta)$, and $\propto$ stands for equality up to a normalizing factor. It follows from (1) that the conditional Gaussian distribution of $t_{N+1}$ is
$$P(t_{N+1}|\mathcal{D},\Theta,x_{N+1}) \propto \exp\left(-\frac{\hat{t}^{\mathrm T}\hat{C}^{-1}(\Theta)\,\hat{t} - t^{\mathrm T}C^{-1}(\Theta)\,t}{2}\right) \propto \exp\left(-\frac{(t_{N+1}-\bar{t}_{N+1})^2}{2\sigma^2_{N+1}}\right),$$
where the predictive mean is $\bar{t}_{N+1} = k\,C^{-1}(\Theta)\,t$, with the row vector $k := [c(x_1,x_{N+1};\Theta), c(x_2,x_{N+1};\Theta), \dots, c(x_N,x_{N+1};\Theta)]$ being the covariance between the training data and the test point, and the predictive variance is $\sigma^2_{N+1} = c(x_{N+1},x_{N+1};\Theta) - k\,C^{-1}(\Theta)\,k^{\mathrm T}$. Evidently, the vector $C^{-1}(\Theta)\,t$ is independent of the test data and can be understood as representing the model constructed from the covariance function $c(\cdot,\cdot;\Theta)$ and the training data $\mathcal{D}$. Given $\mathcal{D}$, the covariance function (or simply, the hyperparameters $\Theta$) dominates the Gaussian process model.
2.2. Hyperparameters estimation
The following standard covariance function is often used in practice, and it is also adopted throughout this paper to demonstrate the proposed Hessian and trust-region optimization approach to Gaussian process regression:
$$c(x_i,x_j;\Theta) = a\exp\left(-\frac{1}{2}\sum_{l=1}^{L}\left(x_i^{(l)} - x_j^{(l)}\right)^2 d_l\right) + v\,\delta_{ij}$$
or, in matrix notation,
$$c(x_i,x_j;\Theta) = a\exp\left(-\frac{1}{2}(x_i - x_j)^{\mathrm T} D\,(x_i - x_j)\right) + v\,\delta_{ij}, \qquad (3)$$
where $x_i^{(l)}$ denotes the $l$th entry of the vector $x_i \in \mathbb{R}^L$ (i.e., the $i$th datum input), and $\delta_{ij}$ is the Kronecker delta [10]. In (3), hyperparameter $a$ relates to the overall amplitude of the Gaussian process in output space, the matrix $D = \operatorname{diag}(d_1,d_2,\dots,d_L)$ characterizes a set of length scales $d_l$, one per input dimension, and hyperparameter $v$ gives the noise level. The hyperparameter set is assumed to be ordered as $\Theta = [a, d_1, d_2, \dots, d_L, v] \in \mathbb{R}^{L+2}$, with each entry constrained to be non-negative. A covariance function with a non-diagonal singular $D$ [12] is also considered in this research (for more choices of covariance functions, refer to [5,16]).
Given the specific form of the covariance function $c(\cdot,\cdot;\Theta)$, we have to estimate the distribution of the hyperparameters $\Theta$ from the training data $\mathcal{D}$ in order to make an ideal prediction of $t_{N+1}$, namely
$$P(t_{N+1}|\mathcal{D},x_{N+1}) = \int P(t_{N+1}|\mathcal{D},\Theta,x_{N+1})\,P(\Theta|\mathcal{D})\,\mathrm{d}\Theta.$$
But this analytical integration is usually intractable. In addition to numerical integration via Monte Carlo methods [4,10], we can approximate it via the most probable values of the hyperparameters $\Theta_{\mathrm{mp}}$ [5,8,9,12], i.e., $P(t_{N+1}|\mathcal{D},x_{N+1}) \approx P(t_{N+1}|\mathcal{D},\Theta_{\mathrm{mp}},x_{N+1})$. To obtain $\Theta_{\mathrm{mp}}$, it is common practice to minimize the following negative log likelihood of the training data (i.e., maximum likelihood estimation, MLE):
$$\mathcal{L}(\Theta) = \frac{1}{2}\log\det C(\Theta) + \frac{1}{2}\,t^{\mathrm T} C^{-1}(\Theta)\,t. \qquad (4)$$
Simply put, (4) implies, from an engineering point of view, that given specific training data and a covariance function, Gaussian process regression becomes an unconstrained non-linear optimization problem involving high-dimensional matrices and vectors, which we address in the ensuing sections.
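To make the optimization problem concrete, the following is a minimal NumPy sketch of evaluating (4) for the covariance function (3). It illustrates the setting rather than the implementation compared later, and the function names are ours; a Cholesky factorization supplies both the log-determinant and the linear solve.

```python
import numpy as np

def covariance(X, theta):
    """Covariance matrix C(Theta) with entries from Eq. (3):
    a * exp(-0.5 * (xi - xj)^T D (xi - xj)) + v * delta_ij."""
    N, L = X.shape
    a, d, v = theta[0], theta[1:L + 1], theta[L + 1]
    diff = X[:, None, :] - X[None, :, :]            # (N, N, L) pairwise differences
    sq = np.einsum('ijl,l->ij', diff ** 2, d)       # (xi - xj)^T D (xi - xj)
    return a * np.exp(-0.5 * sq) + v * np.eye(N)

def neg_log_likelihood(theta, X, t):
    """Eq. (4): 0.5 * log det C + 0.5 * t^T C^{-1} t, via a Cholesky factor."""
    C = covariance(X, theta)
    R = np.linalg.cholesky(C)                       # C = R R^T, R lower triangular
    alpha = np.linalg.solve(R.T, np.linalg.solve(R, t))   # alpha = C^{-1} t
    return np.sum(np.log(np.diag(R))) + 0.5 * (t @ alpha)
```

The non-negativity constraint on the hyperparameters mentioned above is handled in Section 3.2 through the exponential reparameterization (7).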
3. Analysis of Hessian matrix
In solving the MLE optimization problem, only steepest-descent and conjugate-gradient methods have been studied in the GP literature, together with the first-order information of the log likelihood (4). Gradient-based optimization routines can only guarantee convergence to stationary points (not necessarily minima/maxima) unless the second-order Hessian information is checked [17]. Besides, when available, optimization methods with the Hessian are generally more efficient and effective than those without, especially for banana- or ravine-type problems [18]. In view of the above, the Hessian matrix is investigated.
3.1. Derivation of Hessian matrix
The first-order derivative of the log likelihood (4) with respect to hyperparameter $\Theta_i$ (i.e., the gradient) is [14]
$$\frac{\partial\mathcal{L}}{\partial\Theta_i} = \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial\Theta_i}\right) - \frac{1}{2}\,t^{\mathrm T} C^{-1}\frac{\partial C}{\partial\Theta_i}\,C^{-1} t, \qquad (5)$$
where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, $C(\Theta)$ and $\mathcal{L}(\Theta)$ are abbreviated as $C$ and $\mathcal{L}$ respectively, and the gradient is written $g := \partial\mathcal{L}/\partial\Theta = [\partial\mathcal{L}/\partial\Theta_1, \partial\mathcal{L}/\partial\Theta_2, \dots, \partial\mathcal{L}/\partial\Theta_{L+2}]^{\mathrm T}$ for notational convenience.
Proposition 1. The $ij$th entry of the Hessian matrix $\mathcal{H}$ of (4) can be derived as
$$\mathcal{H}_{ij} := \frac{\partial^2\mathcal{L}}{\partial\Theta_i\,\partial\Theta_j} = \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\right) - \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial\Theta_j}\,C^{-1}\frac{\partial C}{\partial\Theta_i}\right) + t^{\mathrm T} C^{-1}\frac{\partial C}{\partial\Theta_j}\,C^{-1}\frac{\partial C}{\partial\Theta_i}\,C^{-1} t - \frac{1}{2}\,t^{\mathrm T} C^{-1}\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\,C^{-1} t, \quad \forall i,j \in \{1,2,\dots,L+2\}. \qquad (6)$$

Proof. It follows from (5) that
$$\frac{\partial^2\mathcal{L}}{\partial\Theta_i\,\partial\Theta_j} = \frac{1}{2}\,\frac{\partial}{\partial\Theta_j}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial\Theta_i}\right) - \frac{1}{2}\,t^{\mathrm T}\frac{\partial}{\partial\Theta_j}\left(C^{-1}\frac{\partial C}{\partial\Theta_i}\,C^{-1}\right) t = \frac{1}{2}\operatorname{tr}\left(\frac{\partial}{\partial\Theta_j}\left(C^{-1}\frac{\partial C}{\partial\Theta_i}\right)\right) - \frac{1}{2}\,t^{\mathrm T}\left(\frac{\partial(C^{-1})}{\partial\Theta_j}\frac{\partial C}{\partial\Theta_i}\,C^{-1} + C^{-1}\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\,C^{-1} + C^{-1}\frac{\partial C}{\partial\Theta_i}\frac{\partial(C^{-1})}{\partial\Theta_j}\right) t.$$
By the formulas on trace derivatives and quadratic products [19], and noting that the first and third terms in the last bracket are transposes of each other and hence contribute equally to the quadratic form, we have
$$\frac{\partial^2\mathcal{L}}{\partial\Theta_i\,\partial\Theta_j} = \frac{1}{2}\operatorname{tr}\left(\frac{\partial}{\partial\Theta_j}\left(\frac{\partial C}{\partial\Theta_i}\,C^{-1}\right)\right) - \frac{1}{2}\,t^{\mathrm T}\left(C^{-1}\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\,C^{-1} + 2\,C^{-1}\frac{\partial C}{\partial\Theta_i}\frac{\partial(C^{-1})}{\partial\Theta_j}\right) t.$$
Applying the inverse-derivative formula $\partial(C^{-1})/\partial\Theta_j = -C^{-1}(\partial C/\partial\Theta_j)\,C^{-1}$ yields
$$\frac{\partial^2\mathcal{L}}{\partial\Theta_i\,\partial\Theta_j} = \frac{1}{2}\operatorname{tr}\left(\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\,C^{-1} - \frac{\partial C}{\partial\Theta_i}\,C^{-1}\frac{\partial C}{\partial\Theta_j}\,C^{-1}\right) - \frac{1}{2}\,t^{\mathrm T} C^{-1}\left(\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j} - 2\,\frac{\partial C}{\partial\Theta_i}\,C^{-1}\frac{\partial C}{\partial\Theta_j}\right) C^{-1} t.$$
For computational efficiency, matrix operations of order $O(N^3)$ should be avoided as far as possible. This is done by using matrix–vector operations of order $O(N^2)$ and by re-using intermediate results from the gradient computation (5). Exploiting the symmetry of the covariance matrix $C(\Theta)$ and its derivatives then yields (6), which completes the proof. □
Remark 1. The computational cost of the Hessian matrix is of the same order as that of the log-likelihood function and its gradient. In the general implementation of Gaussian processes, the determinant and inverse operations are $O(N^3)$; computing $\mathcal{H}_{ij}$ is also $O(N^3)$, and $C^{-1}(\partial C/\partial\Theta_i)$ can be re-used as an intermediate result of the gradient computation.
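As an illustration of this re-use, the following hedged NumPy sketch evaluates (5) and (6) together; the derivative matrices dC and d2C are assumed given (e.g., from Section 3.2), and forming $C^{-1}$ explicitly is a simplification acceptable in a sketch.

```python
import numpy as np

def gradient_and_hessian(C, dC, d2C, t):
    """Gradient (5) and Hessian (6) of the negative log likelihood (4).
    dC[i] = dC/dTheta_i; d2C[i][j] = d^2C/(dTheta_i dTheta_j); all (N, N)."""
    n = len(dC)
    Cinv = np.linalg.inv(C)                    # one O(N^3) inversion
    b = Cinv @ t                               # C^{-1} t
    A = [Cinv @ dC[i] for i in range(n)]       # C^{-1} dC/dTheta_i, shared with (5)
    g = np.array([0.5 * np.trace(A[i]) - 0.5 * (t @ (A[i] @ b)) for i in range(n)])
    H = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            B = Cinv @ d2C[i][j]
            H[i, j] = H[j, i] = (0.5 * np.trace(B)
                                 - 0.5 * np.sum(A[j] * A[i].T)   # tr(A_j A_i)
                                 + t @ (A[j] @ (A[i] @ b))
                                 - 0.5 * (t @ (B @ b)))
    return g, H
```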
Remark 2. Instead of finite-difference or iteration-type approximations, the explicit Hessian in (6) can be utilized directly and accurately to speed up the MLE procedure through second-order optimization routines such as scaled conjugate gradient [18], Newton methods [17,21], and trust-region methods [22,27]. Moreover, for automated Gaussian processes, the explicit Hessian can be used to check the nature of optimization solutions (positive definiteness being a sufficient condition for a local minimum of (4)), rejecting saddle points and re-starting the MLE automatically. Besides, the Hessian may convey other useful information about the Gaussian process, e.g., whether the number of training data points is adequate for a regression task (this point is shown via numerical experiments).
We will show in the next subsection that the second-order derivative $\partial^2 C/\partial\Theta_i\,\partial\Theta_j$ is usually as simple to calculate as the first-order derivative $\partial C/\partial\Theta_i$, and that, for the standard covariance function, all the terms in Eq. (6) can be simplified further.
3.2. Simplification of Hessian matrix
Return to the covariance function (3). To guarantee hyperparameter positiveness while using unconstrained optimization, we can work with the exponential form of the hyperparameters, i.e., the following modified covariance function (the $ij$th entry of the covariance matrix):
$$c(x_i,x_j;\Theta) = \exp(a)\left[\exp\left(-\frac{1}{2}(x_i - x_j)^{\mathrm T} D\,(x_i - x_j)\right) + \exp(v)\,\delta_{ij}\right] := c_1(i,j) + \omega\,\delta_{ij}, \qquad (7)$$
where $D = \operatorname{diag}(\exp(d_1), \exp(d_2), \dots, \exp(d_L))$, $\omega = \exp(a)\exp(v)$, and $\Theta = [a, d_1, d_2, \dots, d_L, v]$.
The partial derivatives of $C$ with respect to hyperparameter $a$ are
$$\frac{\partial C}{\partial a} = C, \qquad \frac{\partial^2 C}{\partial a\,\partial a} = \frac{\partial C}{\partial a} = C, \qquad \frac{\partial^2 C}{\partial a\,\partial v} = \frac{\partial C}{\partial v}, \qquad \frac{\partial^2 C}{\partial a\,\partial d_k} = \frac{\partial C}{\partial d_k}, \qquad k = 1,2,\dots,L.$$
The partial derivatives of $C$ with respect to $v$ are
$$\frac{\partial C}{\partial v} = \omega I, \qquad \frac{\partial^2 C}{\partial v\,\partial a} = \omega I, \qquad \frac{\partial^2 C}{\partial v\,\partial v} = \omega I, \qquad \frac{\partial^2 C}{\partial v\,\partial d_k} = 0, \qquad k = 1,2,\dots,L.$$
Finally, the partial derivatives of $C$ with respect to $d_k$ are obtained as follows:
$$\frac{\partial C}{\partial d_k} = \left[\frac{\partial C_{ij}}{\partial d_k}\right]_{N\times N} = \left[-\frac{1}{2}\,c_1(i,j)\left(x_i^{(k)} - x_j^{(k)}\right)^2\exp(d_k)\right]_{N\times N},$$
$$\frac{\partial^2 C}{\partial d_k\,\partial d_p} = \begin{cases}\left[\dfrac{\partial C_{ij}}{\partial d_k}\left(1 - \dfrac{1}{2}\left(x_i^{(k)} - x_j^{(k)}\right)^2\exp(d_k)\right)\right]_{N\times N} & \text{if } p = k,\\[2ex] \left[\dfrac{\partial C_{ij}}{\partial d_k}\left(-\dfrac{1}{2}\left(x_i^{(p)} - x_j^{(p)}\right)^2\exp(d_p)\right)\right]_{N\times N} & \text{if } p \neq k,\end{cases} \qquad (8)$$
which need to be computed only for $L \geq p \geq k = 1,2,\dots,L$ owing to the symmetry of $\mathcal{H}$.
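For concreteness, here is a sketch of the derivative matrices in (8); S[k] is our helper name for the matrix of squared coordinate differences scaled by exp(d_k), and c1 is the noise-free part of (7).

```python
import numpy as np

def length_scale_derivatives(X, c1, d):
    """First and second derivatives of C w.r.t. d_k, per Eq. (8).
    X: (N, L) inputs; c1: (N, N) noise-free part of (7); d: log length scales."""
    N, L = X.shape
    S = [(X[:, None, k] - X[None, :, k]) ** 2 * np.exp(d[k]) for k in range(L)]
    dC = [-0.5 * c1 * S[k] for k in range(L)]       # dC/dd_k
    d2C = {}
    for k in range(L):
        for p in range(k, L):                       # only L >= p >= k, by symmetry
            factor = (1.0 - 0.5 * S[k]) if p == k else (-0.5 * S[p])
            d2C[(k, p)] = dC[k] * factor            # d^2C/(dd_k dd_p)
    return dC, d2C
```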
Thus, the $a$-related Hessian entries simplify to
$$\mathcal{H}_{11} = \frac{\partial^2\mathcal{L}}{\partial a\,\partial a} = \frac{1}{2}\,t^{\mathrm T} C^{-1} t, \qquad \mathcal{H}_{1(L+2)} = \frac{\partial^2\mathcal{L}}{\partial a\,\partial v} = \frac{\omega}{2}\,t^{\mathrm T} C^{-1} C^{-1} t,$$
$$\mathcal{H}_{1(k+1)} = \frac{\partial^2\mathcal{L}}{\partial a\,\partial d_k} = \frac{1}{2}\,t^{\mathrm T} C^{-1}\frac{\partial C}{\partial d_k}\,C^{-1} t, \qquad k = 1,2,\dots,L. \qquad (9)$$
The $v$-related Hessian entries are as follows:
$$\mathcal{H}_{(L+2)(k+1)} = \frac{\partial^2\mathcal{L}}{\partial v\,\partial d_k} = -\frac{\omega}{2}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial d_k}\,C^{-1}\right) + \omega\,t^{\mathrm T} C^{-1}\frac{\partial C}{\partial d_k}\,C^{-1} C^{-1} t, \qquad k = 1,2,\dots,L,$$
$$\mathcal{H}_{(L+2)(L+2)} = \frac{\partial^2\mathcal{L}}{\partial v\,\partial v} = \frac{\omega}{2}\operatorname{tr}(C^{-1}) - \frac{\omega^2}{2}\operatorname{tr}(C^{-1}C^{-1}) + \omega^2\,t^{\mathrm T} C^{-1} C^{-1} C^{-1} t - \frac{\omega}{2}\,t^{\mathrm T} C^{-1} C^{-1} t. \qquad (10)$$
It follows from (8) and (6) that the $d$-related Hessian entries are, for all $L \geq p \geq k = 1,2,\dots,L$,
$$\mathcal{H}_{(k+1)(p+1)} = \frac{\partial^2\mathcal{L}}{\partial d_k\,\partial d_p} = \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial^2 C}{\partial d_k\,\partial d_p}\right) - \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial d_p}\,C^{-1}\frac{\partial C}{\partial d_k}\right) + t^{\mathrm T} C^{-1}\frac{\partial C}{\partial d_p}\,C^{-1}\frac{\partial C}{\partial d_k}\,C^{-1} t - \frac{1}{2}\,t^{\mathrm T} C^{-1}\frac{\partial^2 C}{\partial d_k\,\partial d_p}\,C^{-1} t. \qquad (11)$$
Proposition 2. For the specific covariance function in (7), the Hessian matrix $\mathcal{H}(\Theta)$ simplifies to Eqs. (9)–(11), with the instrumental variables in (8).

Proof. Follow the derivation procedure above and utilize the symmetry properties of $C(\Theta)$, its derivative matrices, and the Hessian matrix $\mathcal{H}(\Theta)$. □
Remark 3. Let us re-check the computational cost in the general GP implementation. Note that both the trace of a two-matrix product and a matrix–vector product are $O(N^2)$ operations. To compute the Hessian, $L$ matrix–matrix operations (totalling about $LN^3$ operations) have to be performed, namely computing and storing $C^{-1}(\partial C/\partial d_k)$, $k = 1,2,\dots,L$, as these are used frequently in both the Hessian and the gradient computation. In comparison, computing the log likelihood and its gradient takes roughly $2N^3$–$3N^3$ operations, so that using finite differences to approximate the Hessian would take $2(L+2)N^3$–$4(L+2)N^3$ operations [23], quite apart from accuracy concerns.
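Putting (8)–(11) together, the full simplified Hessian can be assembled as in the following hedged sketch; Cinv and b = C^{-1}t are assumed precomputed (our names), and dC, d2C come from the previous sketch. Once the L products C^{-1}(∂C/∂d_k) are stored, every entry costs only O(N^2), in line with Remark 3.

```python
import numpy as np

def simplified_hessian(Cinv, b, t, dC, d2C, omega, L):
    """Assemble the (L+2)x(L+2) Hessian from Eqs. (9)-(11),
    with Theta ordered as [a, d_1, ..., d_L, v].
    Cinv: C^{-1}; b: C^{-1} t; dC[k], d2C[(k, p)]: derivatives from Eq. (8)."""
    n = L + 2
    H = np.empty((n, n))
    A = [Cinv @ dC[k] for k in range(L)]        # C^{-1} dC/dd_k, reused throughout
    Cb = Cinv @ b                               # C^{-2} t
    H[0, 0] = 0.5 * (t @ b)                                  # Eq. (9), a-a
    H[0, n - 1] = H[n - 1, 0] = 0.5 * omega * (t @ Cb)       # Eq. (9), a-v
    for k in range(L):
        H[0, k + 1] = H[k + 1, 0] = 0.5 * (t @ (A[k] @ b))   # Eq. (9), a-d_k
        H[n - 1, k + 1] = H[k + 1, n - 1] = (                # Eq. (10), v-d_k
            -0.5 * omega * np.sum(A[k] * Cinv)               # tr(A_k C^{-1}), Cinv symmetric
            + omega * (t @ (A[k] @ Cb)))
    H[n - 1, n - 1] = (0.5 * omega * np.trace(Cinv)          # Eq. (10), v-v
                       - 0.5 * omega ** 2 * np.sum(Cinv * Cinv)
                       + omega ** 2 * (Cb @ b) - 0.5 * omega * (t @ Cb))
    for k in range(L):
        for p in range(k, L):                                # Eq. (11), d_k-d_p
            M = d2C[(k, p)]
            H[k + 1, p + 1] = H[p + 1, k + 1] = (
                0.5 * np.sum(Cinv * M)                       # tr(C^{-1} d2C), symmetric
                - 0.5 * np.sum(A[p] * A[k].T)                # tr(A_p A_k)
                + t @ (A[p] @ (A[k] @ b)) - 0.5 * (b @ M @ b))
    return H
```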
4. Trust-region MLE approach
Iterative approaches to optimization can be divided into two categories: line-search methods and trust-region methods [17]. Classical methods, including steepest descent, conjugate gradient and quasi-Newton methods, are line-search methods. In each iteration, they first obtain a search direction and then try different step lengths along this direction for a better solution point.

It is known (and also confirmed in [8]) that steepest-descent methods are generally slow in the GP context. Conjugate-gradient methods are widely used in the exact and approximate implementations of Gaussian processes [4,5]. However, the conjugate-gradient method becomes slow for banana/ravine-type problems and, close to local optima, it requires many more iterations and line searches than second-order methods [17,18,20,23].
Trust-region methods are a class of relatively new and robust second-order optimization algorithms [17,22,27]. To show the advantages of using trust-region methods for the MLE hyperparameters estimation, the standard trust-region algorithm is investigated here. At each iteration, say $k$, the algorithm begins with a local quadratic approximation $q(\cdot)$ of (4):
$$\mathcal{L}(\Theta + s) \approx q(\Theta + s) = \mathcal{L}(\Theta) + g^{\mathrm T} s + \frac{1}{2}\,s^{\mathrm T}\mathcal{H}\,s,$$
where $s$ denotes the next step to be taken. The trust-region algorithm restricts the length of $s$ to within the trust-region radius $\rho$, resulting in the bounded trust-region subproblem:
$$\text{minimize } q(\Theta + s) \quad \text{subject to } \|s\| \leq \rho. \qquad (12)$$
In addition to the robust existence of solutions to (12), there also exist efficient algorithms for solving (12) approximately, such as [24–26] (an exact solution is not required). Once this subproblem is solved, the following improvement ratio for $\mathcal{L}(\Theta)$ decides whether or not to accept $s$ and how to adjust $\rho$:
$$\gamma = \frac{\text{actual decrease}}{\text{predicted decrease}} = \frac{\mathcal{L}(\Theta + s) - \mathcal{L}(\Theta)}{q(\Theta + s) - \mathcal{L}(\Theta)}. \qquad (13)$$
The standard decision for updating $\Theta$ and $\rho$ is
$$\Theta = \begin{cases}\Theta + s & \text{if } \mathcal{L}(\Theta + s) < \mathcal{L}(\Theta),\\ \Theta & \text{otherwise},\end{cases} \qquad \rho = \begin{cases}\min\{c_1\rho,\ \bar{\rho}\} & \text{if } \gamma \geq \gamma_1,\\ c_2\|s\| & \text{if } \gamma \leq \gamma_2,\\ \rho & \text{otherwise},\end{cases}$$
where $\bar{\rho}$ is an upper bound on the trust-region radius $\rho$. Typically, $\gamma_1 = 0.75$ and $c_1 = 2$, implying a good match of the quadratic model $q(\cdot)$ to $\mathcal{L}$ and an increasing $\rho$; or $\gamma_2 = 0.25$ and $c_2 = 0.5$, implying a poor match of $q(\cdot)$ to $\mathcal{L}$ and a decreasing $\rho$; otherwise the match is acceptable and $\rho$ is kept. A sketch of the resulting iteration is given below.
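The iteration just described can be summarized as follows (our reconstruction, not the MATLAB routine used in Section 5); solve_subproblem stands for any approximate solver of (12), e.g., one of [24–26].

```python
import numpy as np

def trust_region_mle(theta, nll, grad_hess, solve_subproblem,
                     rho=1.0, rho_max=100.0,
                     gamma1=0.75, c1=2.0, gamma2=0.25, c2=0.5,
                     gtol=1e-6, max_iter=200):
    """Standard trust-region minimization of the negative log likelihood (4)."""
    L_val = nll(theta)
    for _ in range(max_iter):
        g, H = grad_hess(theta)
        if np.linalg.norm(g) < gtol:                # stationarity test
            break
        s = solve_subproblem(g, H, rho)             # approximate minimizer of (12)
        L_new = nll(theta + s)
        predicted = g @ s + 0.5 * (s @ H @ s)       # q(theta + s) - L(theta)
        gamma = (L_new - L_val) / predicted         # improvement ratio, Eq. (13)
        if L_new < L_val:                           # accept the step
            theta, L_val = theta + s, L_new
        if gamma >= gamma1:                         # good model match: enlarge rho
            rho = min(c1 * rho, rho_max)
        elif gamma <= gamma2:                       # poor match: shrink rho
            rho = c2 * np.linalg.norm(s)
    return theta
```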
Compared to line-search methods, the above trust-region optimization procedure has shown the following advantages when applied to the MLE hyperparameters estimation of Gaussian processes.
Remark 4. As shown in (12), the bounded subproblem always has a solution for decreasing $\rho$, and trust-region methods do not require the Hessian (or its approximation) to be positive definite. They are thus able to handle ill-conditioned problems [22], which is sometimes the situation with our multimodal log-likelihood function. In contrast, second-order line-search methods, such as the quasi-Newton method, must modify the Hessian to be positive definite from iteration to iteration in order to guarantee its invertibility and a decreasing function value while minimizing $\mathcal{L}$ [17,23].
Remark 5. For the theoretical analysis of trust-region methods, fewer and milder conditions are assumed than for line-search methods [17,21,22,27]. For example, it is only required that the norm of the Hessian matrix does not increase at a rate faster than linear. In contrast, for line-search methods, it is required that the condition number of the Hessian does not grow rapidly and is uniformly bounded. In our context, the Hessian may be positive/negative semi-definite or indefinite, with unbounded condition numbers, as the log likelihood is multimodal.
Remark 6. Because of the improvement measure (13) and the adjustment mechanism of the trust-region radius, an approximate solution to the trust-region subproblem (12) is sufficient for the convergence proof [26]. In our comparative experiments, trust-region methods are more robust than line-search methods in terms of the success rate of the MLE solutions. Furthermore, it is shown in [22] that trust-region methods extend to general approximation-based optimization problems. This might be useful for the approximate implementation of Gaussian processes [5,20].
5. Numerical experiments
A series of numerical tests with different numbers of sample points and noise levels is conducted, based on the two regression applications detailed in [12]: the sinusoidal-function example and the Wiener–Hammerstein non-linear-system example. With and without the explicit Hessian, the trust-region algorithms implemented in MATLAB [28] are compared with the Polak–Ribiere conjugate-gradient method [4] under the same initial and terminating conditions. Only general and interesting conclusions are presented below from the abundant numerical observations and comparisons.
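For readers without the original MATLAB setup, a roughly comparable experiment can be sketched with SciPy's optimizers; this is an assumption on our part, since the paper's timings come from the MATLAB Optimization Toolbox [28]. Here 'trust-ncg' plays the role of a trust-region method with the explicit Hessian and 'CG' that of the Polak–Ribiere-type conjugate-gradient baseline; X, t and the objective are those of the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize

# gradient(theta, X, t) and hessian(theta, X, t) are assumed wrappers around
# the gradient/Hessian sketches of Section 3 (our names, not the paper's code).
theta0 = np.zeros(L + 2)                        # common initial hyperparameters
res_tr = minimize(neg_log_likelihood, theta0, args=(X, t),
                  method='trust-ncg', jac=gradient, hess=hessian)
res_cg = minimize(neg_log_likelihood, theta0, args=(X, t),
                  method='CG', jac=gradient)    # Polak-Ribiere-type CG
print(res_tr.nit, res_cg.nit)                   # iteration counts, cf. Tables 1 and 2
```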
5.1. Sinusoidal function
The underlying function is $y(x) = A\sin(a_1 x_1 + a_2 x_2)$, where $A = 1$, $a_1 = 1.0$ and $a_2 = 0.8$. The domain of interest is the rectangle $\{0 \leq x_1 \leq 7,\ 0 \leq x_2 \leq 8\}$, and the training data are the values on a regular $\sqrt{N}\times\sqrt{N}$ lattice of the domain, with noise intensity $v$. The tuned GP model is used to predict the values of $y(x)$ on a $2\sqrt{N}\times 2\sqrt{N}$ lattice of the domain.
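A sketch of the training-data generation, as we read the description above (the seed is arbitrary):

```python
import numpy as np

def sinusoidal_data(N, noise, A=1.0, a1=1.0, a2=0.8, seed=0):
    """Noisy samples of y(x) = A sin(a1 x1 + a2 x2) on a sqrt(N) x sqrt(N) lattice."""
    rng = np.random.default_rng(seed)               # fixed seed for reproducibility
    n = int(np.sqrt(N))
    x1, x2 = np.meshgrid(np.linspace(0, 7, n), np.linspace(0, 8, n))
    X = np.column_stack([x1.ravel(), x2.ravel()])   # (N, 2) training inputs
    t = A * np.sin(a1 * X[:, 0] + a2 * X[:, 1]) + noise * rng.standard_normal(n * n)
    return X, t

X, t = sinusoidal_data(N=289, noise=0.2)            # the case shown in Fig. 1
```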
The regression task with $N = 289$ and $v = 0.2$ is illustrated in Fig. 1. Fig. 2 shows a typical comparison of computational cost under the same initial conditions, while Table 1 shows the comparison of the ten-times-averaged computational cost, including the CPU time and the iteration numbers in brackets.
[Fig. 1. Sinusoidal regression using the Hessian and trust-region algorithm: (a) noisy training data t(x); (b) regression result y(x), both over (x1, x2).]
[Fig. 2. Computation cost under the same random initial state (sinusoidal example): number of iterations and CPU time (seconds) versus N, for CT, UT, CTH, UTH and CG.]
Table 1
Ten-times-averaged computational cost for the sinusoidal-function example (CPU time in seconds; iteration counts in brackets)
N CT UT CTH UTH CG
49 0.6893 (19) 0.4248 (16) 0.2689 (20) 0.1687 (15) 0.9234 (55)
64 0.9141 (21) 0.8515 (23) 0.2955 (17) 0.3421 (22) 1.0423 (37)
81 1.4903 (24) 1.4516 (26) 0.5845 (23) 0.5781 (26) 1.4687 (44)
100 2.1658 (27) 1.6671 (22) 0.7638 (26) 0.5924 (21) 2.4984 (49)
121 3.6310 (31) 2.9969 (27) 1.2847 (31) 1.0529 (26) 3.8423 (50)
144 5.5093 (33) 5.8751 (37) 2.0327 (32) 2.1003 (35) 4.8295 (46)
169 6.7128 (30) 8.3547 (38) 2.4796 (31) 2.6923 (33) 7.3282 (48)
196 9.9048 (33) 7.9421 (26) 3.4923 (28) 3.2342 (26) 9.1127 (48)
225 14.1794 (34) 13.7769 (34) 5.4796 (32) 5.6218 (34) 17.9844 (65)
256 23.2840 (35) 22.3798 (36) 9.1206 (32) 9.2717 (36) 21.1783 (52)
289 26.0578 (36) 29.9000 (39) 10.2234 (34) 13.2611 (38) 22.0466 (48)
324 37.2813 (41) 34.9299 (39) 15.5528 (40) 17.7267 (46) 32.8221 (62)
361 43.7952 (39) 46.9579 (41) 19.0483 (39) 19.8437 (40) 47.0142 (57)
400 66.6562 (47) 63.0625 (44) 27.2343 (44) 26.8908 (43) 60.0358 (73)
441 60.3577 (34) 83.0390 (47) 28.8203 (37) 34.4076 (44) 63.7875 (51)
484 94.1719 (44) 85.9374 (39) 39.2123 (40) 40.9970 (42) 89.2422 (77)
529 112.9249 (42) 144.3187 (54) 52.7749 (44) 59.3784 (49) 101.1625 (66)
576 142.2546 (42) 174.6532 (52) 71.8812 (47) 78.8845 (52) 185.4781 (78)
625 171.0424 (44) 190.0734 (49) 69.1358 (38) 87.2672 (48) 183.4015 (84)
676 210.3343 (45) 239.9610 (51) 104.3015 (47) 121.1297 (54) 214.1891 (84)
729 240.0218 (43) 319.0110 (57) 114.5608 (43) 155.5173 (58) 307.6295 (79)
784 280.1562 (42) 335.4124 (50) 119.0766 (37) 171.6220 (53) 270.2671 (70)
841 322.6843 (41) 438.9391 (56) 174.6562 (46) 185.7595 (48) 334.0406 (81)
900 378.8032 (40) 476.8765 (51) 170.9968 (37) 228.1205 (49) 464.7920 (94)
961 473.6595 (43) 574.4327 (52) 263.8191 (48) 300.0325 (55) 534.1203 (87)
In the figures and tables, CT denotes the constrained trust-region algorithm without the explicit Hessian; UT, the unconstrained trust-region algorithm without the explicit Hessian; CTH, the constrained trust-region algorithm with the explicit Hessian; UTH, the unconstrained trust-region algorithm with the explicit Hessian; and CG, the conjugate-gradient algorithm.
Remark 7. In our GP context, the trust-region algorithms with the explicit Hessian (i.e., CTH and UTH) are in general faster than the trust-region algorithms without the explicit Hessian (i.e., CT and UT). This verifies the computational analysis in Section 3.2 and Remark 3. In addition, without fully utilizing curvature information, the general efficiency of the conjugate-gradient algorithm (CG) lies between CTH/UTH and CT/UT, usually taking more iterations and function evaluations to approach the optimal solutions.
Remark 8. In very difficult ravine-type tasks, the conjugate-gradient method usually converges much more slowly than the trust-region algorithms, and may even fail. For example, using a non-diagonal singular matrix $D$ in (3) improves the regression accuracy but turns the optimization task into a ravine-type problem. The comparison of computational cost in this case is shown in Fig. 3, where the number of CG iterations is limited to 200. This observation is consistent with others' findings about CG and line-search methods [4,8,12,18,20].
5.2. Wiener–Hammerstein system
Consider the identification problem of a transversal Wiener–Hammerstein system [12]. In terms of the measured variables (input $r$ and output $y$ at time instant $s$), the system dynamics of interest can be reformulated as
$$y_i = 0.3\,(H_1 R)^3 + 0.165\,(H_3 R)^3,$$
[Fig. 3. Comparison of computation cost for non-diagonal D (sinusoidal example): CPU time (seconds) and number of iterations versus N, for CT, UT and CG.]
where $R := [\,r(s_i)\ \ r(s_{i-1})\ \ r(s_{i-2})\ \ r(s_{i-3})\,]^{\mathrm T}$ (i.e., $L = 4$), and $H_i$ is the $i$th row of the matrix
$$H = \begin{bmatrix}0.9184 & 0.3674 & 0 & 0\\ 0 & 0.9184 & 0.3674 & 0\\ 0 & 0 & 0.9184 & 0.3674\end{bmatrix}.$$
The output in response to a Gaussian input is measured: data are collected for $0.1N$ seconds with sampling interval 0.1, and Gaussian white noise of standard deviation $v$ is added to the output signal. The prediction performance with $N = 150$ and $v = 0.2$ is shown in Fig. 4, and the optimization comparison is given in Fig. 5 and Table 2. The typical situation is similar to Remark 7: the trust-region algorithms, especially with the explicit Hessian, are good alternatives for the MLE hyperparameters estimation of Gaussian processes.
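Under the stated assumptions (unit-variance Gaussian input r sampled every 0.1 s, noise standard deviation v), the data set can be generated as in the following sketch, which is our reconstruction of the description above:

```python
import numpy as np

def wiener_hammerstein_data(N, noise, seed=0):
    """y_i = 0.3 (H_1 R)^3 + 0.165 (H_3 R)^3, R = [r(s_i), ..., r(s_{i-3})]^T."""
    rng = np.random.default_rng(seed)
    H = np.array([[0.9184, 0.3674, 0.0,    0.0],
                  [0.0,    0.9184, 0.3674, 0.0],
                  [0.0,    0.0,    0.9184, 0.3674]])
    r = rng.standard_normal(N + 3)                  # Gaussian input, one value per 0.1 s
    X = np.array([[r[i + 3], r[i + 2], r[i + 1], r[i]] for i in range(N)])  # rows R^T
    y = 0.3 * (X @ H[0]) ** 3 + 0.165 * (X @ H[2]) ** 3
    t = y + noise * rng.standard_normal(N)          # noisy measured output
    return X, t

X, t = wiener_hammerstein_data(N=150, noise=0.2)    # the case shown in Fig. 4
```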
[Fig. 4. Wiener–Hammerstein example using the Hessian and trust-region algorithm: (a) system output versus time (seconds), training data t and test prediction y; (b) test error versus time (seconds).]
[Fig. 5. Computation cost under the same initial state (Wiener–Hammerstein example): CPU time (seconds) and number of iterations versus N, for CT, UT, CTH, UTH and CG.]
Table 2
Ten-times-averaged computational cost for the Wiener–Hammerstein example (CPU time in seconds; iteration counts in brackets)
N CT UT CTH UTH CG
3 0.2420 (18) 0.2453 (19) 0.1172 (18) 0.1172 (19) 0.5593 (155)
4 0.2265 (17) 0.2579 (20) 0.1095 (17) 0.1110 (18) 0.5107 (167)
7 0.1967 (13) 0.2206 (15) 0.0967 (13) 0.1016 (14) 0.5967 (165)
12 0.2954 (17) 0.2796 (16) 0.1484 (18) 0.1454 (17) 0.6657 (180)
19 0.3687 (17) 0.4828 (23) 0.1798 (17) 0.2469 (24) 0.5749 (102)
28 0.4376 (16) 0.5513 (21) 0.2330 (17) 0.2970 (22) 0.7592 (85)
39 0.6406 (17) 0.9016 (24) 0.3123 (17) 0.3971 (21) 1.1874 (98)
52 1.5860 (30) 1.5437 (30) 0.8234 (30) 0.6906 (26) 1.6531 (94)
67 1.4811 (19) 2.1579 (28) 0.7485 (19) 1.1672 (30) 2.9562 (109)
84 2.3140 (21) 3.2797 (30) 1.1671 (21) 1.4608 (26) 3.7937 (99)
103 5.0031 (31) 5.1827 (32) 2.0781 (26) 2.6532 (33) 5.1969 (88)
124 5.7314 (24) 7.5250 (32) 2.3515 (23) 2.7624 (28) 7.6966 (101)
147 9.6078 (29) 11.8451 (36) 4.0781 (27) 4.9626 (33) 12.5906 (111)
172 12.3187 (27) 15.4998 (33) 6.0766 (27) 7.6905 (34) 16.2812 (107)
199 18.5699 (30) 17.5422 (29) 9.4765 (31) 8.2341 (27) 19.6078 (91)
228 36.3089 (45) 40.6810 (50) 19.7888 (48) 17.1811 (42) 29.8246 (111)
259 34.5827 (32) 41.8322 (39) 19.9953 (35) 20.7262 (37) 37.0437 (92)
292 48.0307 (35) 47.2388 (34) 24.8887 (35) 26.1404 (36) 47.0650 (94)
327 76.0839 (44) 80.3743 (46) 38.9654 (43) 43.5152 (47) 75.5367 (120)
364 123.1177 (57) 120.6036 (56) 74.3760 (65) 64.0106 (56) 84.5962 (118)
403 123.9826 (46) 162.6480 (60) 61.8358 (43) 98.4776 (69) 102.6462 (106)
444 182.7944 (55) 199.8484 (61) 98.4025 (56) 88.9385 (51) 139.9789 (122)
532 241.7160 (49) 271.7094 (55) 141.3059 (53) 157.0452 (59) 186.0429 (103)
579 333.0707 (55) 339.6398 (55) 181.7228 (55) 165.2496 (50) 887.8056 (108)
628 363.6192 (50) 404.6686 (56) 227.4981 (58) 209.7888 (53) 281.3492 (106)
679 491.1429 (54) 630.0538 (73) 292.9498 (62) 302.7325 (64) 384.8370 (122)
732 1219.7418 (120) 1061.0852 (105) 522.9653 (93) 528.0667 (94) 486.7467 (129)
Remark 9. For a more automatic GP implementation, the positiveness of the eigenvalues of the Hessian matrix (i.e., $\min\lambda(\mathcal{H}) > 0$) can be used to reject inappropriate solutions and re-start the MLE optimization if necessary. Moreover, the Hessian might be a good indicator of the adequacy of the training data for a specific GP-regression task. For example, as in Fig. 6, if the data set is not large enough, the likelihood function possibly contains more saddle points and other non-optimal stationary points at which the optimization routine may get stuck. In comparison, after the training-data size increases beyond some value, it becomes much easier, and happens much more frequently, to find an optimum.
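The check suggested in Remark 9 is a one-liner once the explicit Hessian is available; a minimal sketch:

```python
import numpy as np

def is_local_minimum(H, tol=0.0):
    """Accept an MLE solution only if the explicit Hessian of (4) is positive
    definite (min eigenvalue > tol); otherwise restart from a new initial point."""
    return np.linalg.eigvalsh(H).min() > tol
```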
Finally, it is worth mentioning that the CG algorithm remains one of the most useful techniques for solving large-scale optimization problems when explicit matrix computation and storage are impractical [17]; e.g., in our context, when both $L$ and $N$ are extremely large. In addition, as shown in the function-value comparison of Fig. 7, the algorithms may converge to different local optima, and thus a parallel/robust GP implementation may be developed by incorporating different optimization routines.
[Fig. 7. Log-likelihood function values at different MLE solutions, versus N, for CT, UT, CTH, UTH and CG: (a) sinusoidal example; (b) Wiener–Hammerstein example.]
[Fig. 6. Non-optimality percentage of the MLE solutions versus N: (a) sinusoidal example; (b) Wiener–Hammerstein example.]
6. Conclusions
This paper has investigated in detail the log-likelihood optimization in the context of Gaussian processes. The derivation, simplification and utility of the Hessian matrix have been shown theoretically and experimentally, eliminating previous misunderstandings about it. The idea of using trust-region algorithms is also interesting and new for the non-trivial numerical computation of Gaussian process regression, even for general statistical research. For GP optimization with the normal choice of covariance function, the trust-region algorithms with the explicit Hessian perform considerably better than the trust-region algorithms without the Hessian, and a little better than the conjugate-gradient method. For the ravine-type GP optimization that can occur with other choices of covariance function, the trust-region algorithms remarkably outperform the conjugate-gradient method. These conclusions may have immediate consequences in this field, e.g., for approximate GP implementations based on trust-region methods and for parallel/robust GP implementations.
References
[1] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
[2] E.B. Baum, D. Haussler, What size net gives valid generalization?, Neural Computation 1 (1989) 151–160.
[3] R.M. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, vol. 118, Springer, New York, 1996.
[4] C.E. Rasmussen, Evaluation of Gaussian processes and other methods for non-linear
regression, Ph.D. thesis, University of Toronto, 1996.
[5] D.J.C. MacKay, Introduction to Gaussian processes, in: Neural Networks and Machine Learning, NATO ASI Series F: Computer and Systems Sciences, vol. 168, Springer, Berlin, Heidelberg, 1998, pp. 133–165.
[6] C.K.I. Williams, D. Barber, Bayesian classification with Gaussian processes, IEEE Transac-
tions on Pattern Analysis and Machine Intelligence 20 (1998) 1342–1351.
[7] S. Sambu, M. Wallat, T. Graepel, K. Obermayer, Gaussian process regression: active data
selection and test point rejection, in: Proceedings of the IEEE-INNS-ENNS International
Joint Conference on Neural Networks, 3, 2000, pp. 241–246.
[8] T. Yoshioka, S. Ishii, Fast Gaussian process regression using representative data, in:
Proceedings of International Joint Conference on Neural Networks, 1, 2001, pp. 132–137.
[9] D.J. Leith, W.E. Leithead, E. Solak, R. Murray-Smith, Divide and conquer identification
using Gaussian process priors, in: Proceedings of the 41st IEEE Conference on Decision and
Control, 1, 2002, pp. 624–629.
[10] J.Q. Shi, R. Murray-Smith, D.M. Titterington, Bayesian regression and classification using
mixtures of multiple Gaussian processes, International Journal of Adaptive Control and Signal
Processing 17 (2003) 149–161.
[11] E. Solak, R. Murray-Smith, W.E. Leithead, D.J. Leith, C.E. Rasmussen, Derivative
observations in Gaussian process models of dynamic systems, Advances in Neural Information
Processing Systems 15 (2003) 1033–1040.
[12] W.E. Leithead, E. Solak, D.J. Leith, Direct identification of nonlinear structure using
Gaussian process prior models, in: European Control Conference, Cambridge, 2003.
[13] A. O'Hagan, On curve fitting and optimal design for regression, Journal of the Royal Statistical Society B 40 (1978) 1–42.
[14] K.V. Mardia, R.J. Marshall, Maximum likelihood estimation for models of residual
covariance in spatial regression, Biometrika 71 (1984) 135–146.
[15] N.A.C. Cressie, Statistics for Spatial Data, John Wiley and Sons, New York, 1993.
[16] C.J. Paciorek, M.J. Schervish, Nonstationary covariance functions for Gaussian process
regression, Advances in Neural Information Processing Systems 16 (2003).
[17] J. Nocedal, Theory of algorithms for unconstrained optimization, Acta Numerica (1992)
199–242.
[18] M.F. Moller, A scaled conjugate gradient algorithm for fast supervised learning, Neural
Networks 6 (1993) 525–533.
[19] G.H. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins University Press,
Baltimore, 1996.
[20] J. Skilling, Bayesian Numerical Analysis, Physics and Probability, Cambridge University
Press, 1993.
[21] Z. Wu, G.N. Phillips Jr., R. Tapia, Y. Zhang, A fast Newton method for entropy
maximization in statistical phase estimation, Acta Crystallographica A 57 (2001) 681–685.
[22] N. Alexandrov, J.E. Dennis Jr., R.M. Lewis, V. Torczon, A trust region framework for
managing the use of approximation models in optimization, Journal on Structural Optimi-
zation 15 (1998) 16–23.
[23] J.E. Dennis, R.B. Schnabel, Numerical Methods for Unconstrained Optimization and
Nonlinear Equations, Prentice-Hall, Englewood, NJ, 1983.
[24] J.J. Moré, D.C. Sorensen, Computing a trust region step, SIAM Journal on Scientific and Statistical Computing 4 (1983) 553–572.
[25] T. Steihaug, The conjugate gradient method and trust regions in large scale optimization,
SIAM Journal on Numerical Analysis 20 (1983) 626–637.
[26] R.H. Byrd, R.B. Schnabel, G.A. Shultz, Approximate solution of the trust region problem
by minimization over two-dimensional subspaces, Mathematical Programming 40 (1988)
247–263.
[27] A.R. Conn, N.I.M. Gould, Ph.L. Toint, Trust Region Methods, SIAM, Philadelphia, 2000.
[28] The MathWorks Inc., Optimization Toolbox User's Guide, Version 2.1, 2000.