Applied Mathematics and Computation 171 (2005) 1264–1281
www.elsevier.com/locate/amc
Exploiting Hessian matrix and trust-region algorithm in hyperparameters estimation of Gaussian process
Yunong Zhang a,*, W.E. Leithead a,b
a Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow G1 1QE, UK
b Hamilton Institute, National University of Ireland, Maynooth, Co. Kildare, Ireland
Abstract
Gaussian process (GP) regression is a Bayesian non-parametric regression model that shows good performance in various applications. However, research results on log-likelihood maximization algorithms are quite rare. Instead of the commonly used conjugate-gradient method, the Hessian matrix is first derived and simplified in this paper, and the trust-region optimization method is then presented for estimating the GP hyperparameters. Numerical experiments verify the theoretical analysis, showing the advantages of using the Hessian matrix and trust-region algorithms. In the GP context, the trust-region optimization method is a robust alternative to the conjugate-gradient method, also in view of future research on approximate and/or parallel GP implementation.
© 2005 Elsevier Inc. All rights reserved.
Keywords: Gaussian process; Log likelihood maximization; Conjugate gradient; Trust region;
Hessian matrix
0096-3003/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.amc.2005.01.113
* Corresponding author.
E-mail addresses: [email protected] (Y. Zhang), [email protected] (W.E. Leithead).
1. Introduction
In the last two decades, neural networks have been widely used for non-linear regression and classification. One attraction of neural networks is their universal approximation ability [1]. However, with a large number of parameters in use, there is a risk of overfitting [2]. In the limiting case of large networks, Neal [3] showed that the prior distribution over non-linear functions implied by a Bayesian neural network falls into a class of probability distributions known as Gaussian processes. This observation motivated the idea of discarding parameterized network models and working directly with Gaussian processes for regression. Following some pioneering work in the late 1990s (e.g., [4–6]), interest has grown quickly. In particular, Gaussian processes have recently been used in various examples and applications [7–12].
The idea of Gaussian processes for regression has actually been in use for a long time and can be traced back to [13–15]. Roughly speaking, a Gaussian process is characterized by a covariance function, and prior knowledge, such as the smoothness of the underlying function, is incorporated into this covariance function. A Gaussian process, specified by the covariance matrix, usually has a small set of hyperparameters $\Theta$, such as length scales and variance scales. In addition to the Monte Carlo approach, another standard and general practice for estimating hyperparameters is to minimize the negative log-likelihood function of the sample data over $\Theta$ [4,6,8,12,14]. As the likelihood is in general highly non-linear and multimodal, efficient and robust optimization algorithms are crucial for developing automated Gaussian processes, without much human intervention, for large-scale high-dimensional regression tasks.
To the best of our knowledge, and as confirmed by other research reports such as [4,8], only steepest-descent or conjugate-gradient methods have been investigated for the negative log-likelihood minimization, without much implementation or comparison detail. The scarcity of investigation into optimization algorithms for Gaussian process regression motivates this research. In this paper, we first derive and simplify the Hessian matrix of the log-likelihood function, which contains much useful information about the Gaussian process. In light of numerical experiments, we then propose the trust-region algorithm as an efficient and robust optimization alternative for Gaussian process regression. Comparison results with the conjugate-gradient method [4] are also presented with analysis and remarks.
The remainder of this paper is organized in five sections. For readability and completeness, preliminaries on Gaussian process regression are given in Section 2. The Hessian matrix is derived and simplified in Section 3. In addition to briefly reviewing other minimization algorithms, Section 4 presents the trust-region method for GP hyperparameters estimation. Numerical experiments and comparative results are discussed in Section 5. Lastly, Section 6 concludes the paper with final remarks.
2. Preliminaries
Let us first briefly state the regression problem: modelling a function $y(x)$ from given noisy data $\mathcal{D} = \{(x_i, t_i)\}_{i=1}^{N}$ (also termed sample/training data), where the input vector $x_i \in \mathbb{R}^L$, the output $t_i \in \mathbb{R}$, $N$ is the number of sample data points, and $L$ is the dimension of the input space. The tuned estimation model may then be used to make predictions at new input points (also termed test/prediction points), e.g., $t_{N+1}$ corresponding to $x_{N+1} \notin \mathcal{D}$. Neural networks have been a widespread parametric model for regression, typically with a large number of parameters such as weights and biases [1]. Their overfitting and Neal's observation, however, have drawn researchers' attention back to the non-parametric Gaussian process for modelling.
2.1. Gaussian process prior
Now let us write a general model of the data $\mathcal{D}$ in the Bayesian regression context:
$$t_i = y(x_i) + \nu_i,$$
where $\nu_i$ is the noise on $y(x_i)$, $i \in \{1,2,\dots,N\}$. The prior over the space of possible functions and the prior over the noise are defined as $P(y|\Theta_y)$ and $P(\nu|\Theta_\nu)$ respectively, where $\Theta_y$ and $\Theta_\nu$ are the hyperparameter sets for the function model and for the noise model. Given such hyperparameters, the probability of the data is obtained by integrating out $y$ and $\nu$:
$$P(t|x,\Theta_y,\Theta_\nu) = \int\!\!\int P(t|x,y,\nu)\,P(y|\Theta_y)\,P(\nu|\Theta_\nu)\,\mathrm{d}y\,\mathrm{d}\nu,$$
where $t := (t_1,t_2,\dots,t_N)$ and $x := (x_1,x_2,\dots,x_N)$. Defining $\hat{t} = (t_1,t_2,\dots,t_N,t_{N+1})$, the conditional distribution of $t_{N+1}$ corresponding to $x_{N+1}$ is
$$P(t_{N+1}|\mathcal{D},\Theta_y,\Theta_\nu,x_{N+1}) = \frac{P(\hat{t}\,|x,\Theta_y,\Theta_\nu,x_{N+1})}{P(t|x,\Theta_y,\Theta_\nu)}, \qquad (1)$$
which is used to make predictions about $t_{N+1}$. The aforementioned analytical integration for calculating $P(t|x,\Theta_y,\Theta_\nu)$ is usually very complicated, if not impossible, and a simpler approach based on Gaussian process priors has often been adopted [4–6,8,9,11,12].
The data vector $t$ is defined as a collection of random variables $t = (t(x_1), t(x_2), \dots)$ having the following Gaussian joint distribution under the standard zero-mean assumption:
$$P(t|x,\Theta) \propto \exp\left(-\frac{1}{2}\,t^{\mathrm T} C^{-1}(\Theta)\,t\right) \qquad (2)$$
for any set of inputs $x$. In (2), the $ij$th entry of the covariance matrix $C(\Theta)$ is the scalar covariance function $c(x_i,x_j;\Theta)$, and $\propto$ stands for equality up to a normalizing factor. It follows from (1) that the conditional Gaussian distribution of $t_{N+1}$ is
$$P(t_{N+1}|\mathcal{D},\Theta,x_{N+1}) \propto \exp\left(-\frac{\hat{t}^{\mathrm T}\hat{C}^{-1}(\Theta)\,\hat{t} - t^{\mathrm T}C^{-1}(\Theta)\,t}{2}\right) \propto \exp\left(-\frac{(t_{N+1}-\bar{t}_{N+1})^2}{2\sigma^2_{N+1}}\right),$$
where the predictive mean is $\bar{t}_{N+1} = k\,C^{-1}(\Theta)\,t$, with the row vector $k := [c(x_1,x_{N+1};\Theta), c(x_2,x_{N+1};\Theta), \dots, c(x_N,x_{N+1};\Theta)]$ being the covariance between the training data and the test point, and the predictive variance is $\sigma^2_{N+1} = c(x_{N+1},x_{N+1};\Theta) - k\,C^{-1}(\Theta)\,k^{\mathrm T}$. Evidently, the vector $C^{-1}(\Theta)\,t$ is independent of the test data and can be understood as representing the model constructed from the covariance function $c(\cdot,\cdot;\Theta)$ and the training data $\mathcal{D}$. Given $\mathcal{D}$, the covariance function (or simply, the hyperparameters $\Theta$) dominates the Gaussian process model.
2.2. Hyperparameters estimation
The following standard covariance function is often used in practice, and it is also adopted throughout this paper to demonstrate the proposed Hessian and trust-region optimization approach to Gaussian process regression:
$$c(x_i,x_j;\Theta) = a\exp\left(-\frac{1}{2}\sum_{l=1}^{L}\left(x_i^{(l)} - x_j^{(l)}\right)^2 d_l\right) + v\,\delta_{ij}$$
or, in matrix notation,
$$c(x_i,x_j;\Theta) = a\exp\left(-\frac{1}{2}(x_i - x_j)^{\mathrm T} D\,(x_i - x_j)\right) + v\,\delta_{ij}, \qquad (3)$$
where $x_i^{(l)}$ denotes the $l$th entry of the vector $x_i \in \mathbb{R}^L$ (i.e., the $i$th datum input), and $\delta_{ij}$ is the Kronecker delta [10]. In (3), hyperparameter $a$ relates to the overall amplitude of the Gaussian process in output space, the matrix $D = \operatorname{diag}(d_1,d_2,\dots,d_L)$ characterizes a set of length scales $d_l$, one per input dimension, and hyperparameter $v$ gives the noise level. The hyperparameter set is assumed to be ordered as $\Theta = [a, d_1, d_2, \dots, d_L, v] \in \mathbb{R}^{L+2}$, with each entry constrained to be non-negative. A covariance function with a non-diagonal singular $D$ [12] is also considered in this research (for more choices of covariance functions, refer to [5,16]).
Given the specific form of the covariance function $c(\cdot,\cdot;\Theta)$, we have to estimate the distribution of the hyperparameters $\Theta$ from the training data $\mathcal{D}$ in order to make an ideal prediction of $t_{N+1}$, namely
$$P(t_{N+1}|\mathcal{D},x_{N+1}) = \int P(t_{N+1}|\mathcal{D},\Theta,x_{N+1})\,P(\Theta|\mathcal{D})\,\mathrm{d}\Theta.$$
But this analytical integration is usually intractable. In addition to numerical integration via Monte Carlo methods [4,10], we can approximate it via the most probable values of the hyperparameters $\Theta_{\mathrm{mp}}$ [5,8,9,12], i.e., $P(t_{N+1}|\mathcal{D},x_{N+1}) \approx P(t_{N+1}|\mathcal{D},\Theta_{\mathrm{mp}},x_{N+1})$. To obtain $\Theta_{\mathrm{mp}}$, it is common practice to minimize the following negative log likelihood of the training data (i.e., maximum likelihood estimation, MLE):
$$\mathcal{L}(\Theta) = \frac{1}{2}\log\det C(\Theta) + \frac{1}{2}\,t^{\mathrm T} C^{-1}(\Theta)\,t. \qquad (4)$$
Simply put, (4) implies, from an engineering point of view, that given specific training data and a covariance function, Gaussian process regression becomes an unconstrained non-linear optimization problem involving high-dimensional matrices and vectors, which we address in the ensuing sections.
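To make the optimization problem concrete, the following is a minimal NumPy sketch of evaluating (4) for the covariance function (3). It illustrates the setting rather than the implementation compared later, and the function names are ours; a Cholesky factorization supplies both the log-determinant and the linear solve.

```python
import numpy as np

def covariance(X, theta):
    """Covariance matrix C(Theta) with entries from Eq. (3):
    a * exp(-0.5 * (xi - xj)^T D (xi - xj)) + v * delta_ij."""
    N, L = X.shape
    a, d, v = theta[0], theta[1:L + 1], theta[L + 1]
    diff = X[:, None, :] - X[None, :, :]            # (N, N, L) pairwise differences
    sq = np.einsum('ijl,l->ij', diff ** 2, d)       # (xi - xj)^T D (xi - xj)
    return a * np.exp(-0.5 * sq) + v * np.eye(N)

def neg_log_likelihood(theta, X, t):
    """Eq. (4): 0.5 * log det C + 0.5 * t^T C^{-1} t, via a Cholesky factor."""
    C = covariance(X, theta)
    R = np.linalg.cholesky(C)                       # C = R R^T, R lower triangular
    alpha = np.linalg.solve(R.T, np.linalg.solve(R, t))   # alpha = C^{-1} t
    return np.sum(np.log(np.diag(R))) + 0.5 * (t @ alpha)
```

The non-negativity constraint on the hyperparameters mentioned above is handled in Section 3.2 through the exponential reparameterization (7).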
3. Analysis of Hessian matrix
In solving the MLE optimization problem, only steepest-descent and conjugate-gradient methods have been studied in the GP literature, together with the first-order information of the log likelihood (4). Gradient-based optimization routines can only guarantee convergence to stationary points (not necessarily minima/maxima) unless the second-order Hessian information is checked [17]. Besides, when available, optimization methods with the Hessian are generally more efficient and effective than those without, especially for banana- or ravine-type problems [18]. In view of the above, the Hessian matrix is investigated.
3.1. Derivation of Hessian matrix
The first-order derivative of the log likelihood (4) with respect to hyperparameter $\Theta_i$ (i.e., the gradient) is [14]
$$\frac{\partial\mathcal{L}}{\partial\Theta_i} = \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial\Theta_i}\right) - \frac{1}{2}\,t^{\mathrm T} C^{-1}\frac{\partial C}{\partial\Theta_i}\,C^{-1} t, \qquad (5)$$
where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, $C(\Theta)$ and $\mathcal{L}(\Theta)$ are abbreviated as $C$ and $\mathcal{L}$ respectively, and the gradient is written $g := \partial\mathcal{L}/\partial\Theta = [\partial\mathcal{L}/\partial\Theta_1, \partial\mathcal{L}/\partial\Theta_2, \dots, \partial\mathcal{L}/\partial\Theta_{L+2}]^{\mathrm T}$ for notational convenience.
Proposition 1. The $ij$th entry of the Hessian matrix $\mathcal{H}$ of (4) can be derived as
$$\mathcal{H}_{ij} := \frac{\partial^2\mathcal{L}}{\partial\Theta_i\,\partial\Theta_j} = \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\right) - \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial\Theta_j}\,C^{-1}\frac{\partial C}{\partial\Theta_i}\right) + t^{\mathrm T} C^{-1}\frac{\partial C}{\partial\Theta_j}\,C^{-1}\frac{\partial C}{\partial\Theta_i}\,C^{-1} t - \frac{1}{2}\,t^{\mathrm T} C^{-1}\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\,C^{-1} t, \quad \forall i,j \in \{1,2,\dots,L+2\}. \qquad (6)$$

Proof. It follows from (5) that
$$\frac{\partial^2\mathcal{L}}{\partial\Theta_i\,\partial\Theta_j} = \frac{1}{2}\,\frac{\partial}{\partial\Theta_j}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial\Theta_i}\right) - \frac{1}{2}\,t^{\mathrm T}\frac{\partial}{\partial\Theta_j}\left(C^{-1}\frac{\partial C}{\partial\Theta_i}\,C^{-1}\right) t = \frac{1}{2}\operatorname{tr}\left(\frac{\partial}{\partial\Theta_j}\left(C^{-1}\frac{\partial C}{\partial\Theta_i}\right)\right) - \frac{1}{2}\,t^{\mathrm T}\left(\frac{\partial(C^{-1})}{\partial\Theta_j}\frac{\partial C}{\partial\Theta_i}\,C^{-1} + C^{-1}\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\,C^{-1} + C^{-1}\frac{\partial C}{\partial\Theta_i}\frac{\partial(C^{-1})}{\partial\Theta_j}\right) t.$$
By the formulas on trace derivatives and quadratic products [19], and noting that the first and third terms in the last bracket are transposes of each other and hence contribute equally to the quadratic form, we have
$$\frac{\partial^2\mathcal{L}}{\partial\Theta_i\,\partial\Theta_j} = \frac{1}{2}\operatorname{tr}\left(\frac{\partial}{\partial\Theta_j}\left(\frac{\partial C}{\partial\Theta_i}\,C^{-1}\right)\right) - \frac{1}{2}\,t^{\mathrm T}\left(C^{-1}\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\,C^{-1} + 2\,C^{-1}\frac{\partial C}{\partial\Theta_i}\frac{\partial(C^{-1})}{\partial\Theta_j}\right) t.$$
Applying the inverse-derivative formula $\partial(C^{-1})/\partial\Theta_j = -C^{-1}(\partial C/\partial\Theta_j)\,C^{-1}$ yields
$$\frac{\partial^2\mathcal{L}}{\partial\Theta_i\,\partial\Theta_j} = \frac{1}{2}\operatorname{tr}\left(\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j}\,C^{-1} - \frac{\partial C}{\partial\Theta_i}\,C^{-1}\frac{\partial C}{\partial\Theta_j}\,C^{-1}\right) - \frac{1}{2}\,t^{\mathrm T} C^{-1}\left(\frac{\partial^2 C}{\partial\Theta_i\,\partial\Theta_j} - 2\,\frac{\partial C}{\partial\Theta_i}\,C^{-1}\frac{\partial C}{\partial\Theta_j}\right) C^{-1} t.$$
For computational efficiency, matrix operations of order $O(N^3)$ should be avoided as far as possible. This is done by using matrix–vector operations of order $O(N^2)$ and by re-using intermediate results from the gradient computation (5). Exploiting the symmetry of the covariance matrix $C(\Theta)$ and its derivatives then yields (6), which completes the proof. □
Remark 1. The computational cost of the Hessian matrix is of the same order as that of the log-likelihood function and its gradient. In the general implementation of Gaussian processes, the determinant and inverse operations are $O(N^3)$; computing $\mathcal{H}_{ij}$ is also $O(N^3)$, and $C^{-1}(\partial C/\partial\Theta_i)$ can be re-used as an intermediate result of the gradient computation.
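As an illustration of this re-use, the following hedged NumPy sketch evaluates (5) and (6) together; the derivative matrices dC and d2C are assumed given (e.g., from Section 3.2), and forming $C^{-1}$ explicitly is a simplification acceptable in a sketch.

```python
import numpy as np

def gradient_and_hessian(C, dC, d2C, t):
    """Gradient (5) and Hessian (6) of the negative log likelihood (4).
    dC[i] = dC/dTheta_i; d2C[i][j] = d^2C/(dTheta_i dTheta_j); all (N, N)."""
    n = len(dC)
    Cinv = np.linalg.inv(C)                    # one O(N^3) inversion
    b = Cinv @ t                               # C^{-1} t
    A = [Cinv @ dC[i] for i in range(n)]       # C^{-1} dC/dTheta_i, shared with (5)
    g = np.array([0.5 * np.trace(A[i]) - 0.5 * (t @ (A[i] @ b)) for i in range(n)])
    H = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            B = Cinv @ d2C[i][j]
            H[i, j] = H[j, i] = (0.5 * np.trace(B)
                                 - 0.5 * np.sum(A[j] * A[i].T)   # tr(A_j A_i)
                                 + t @ (A[j] @ (A[i] @ b))
                                 - 0.5 * (t @ (B @ b)))
    return g, H
```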
Remark 2. Instead of finite-difference or iteration-type approximations, the explicit Hessian in (6) can be utilized directly and accurately to speed up the MLE procedure through second-order optimization routines such as scaled conjugate gradient [18], Newton methods [17,21], and trust-region methods [22,27]. Moreover, for automated Gaussian processes, the explicit Hessian can be used to check the nature of optimization solutions (positive definiteness being a sufficient condition for a local minimum of (4)), rejecting saddle points and re-starting the MLE automatically. Besides, the Hessian may convey other useful information about the Gaussian process, e.g., whether the number of training data points is adequate for a regression task (this point is shown via numerical experiments).
We will show in the next subsection that the second-order derivative $\partial^2 C/\partial\Theta_i\,\partial\Theta_j$ is usually as simple to calculate as the first-order derivative $\partial C/\partial\Theta_i$, and that, for the standard covariance function, all the terms in Eq. (6) can be simplified further.
3.2. Simplification of Hessian matrix
Return to the covariance function (3). To guarantee hyperparameter positiveness while using unconstrained optimization, we can work with the exponential form of the hyperparameters, i.e., the following modified covariance function (the $ij$th entry of the covariance matrix):
$$c(x_i,x_j;\Theta) = \exp(a)\left[\exp\left(-\frac{1}{2}(x_i - x_j)^{\mathrm T} D\,(x_i - x_j)\right) + \exp(v)\,\delta_{ij}\right] := c_1(i,j) + \omega\,\delta_{ij}, \qquad (7)$$
where $D = \operatorname{diag}(\exp(d_1), \exp(d_2), \dots, \exp(d_L))$, $\omega = \exp(a)\exp(v)$, and $\Theta = [a, d_1, d_2, \dots, d_L, v]$.
The partial derivatives of $C$ with respect to hyperparameter $a$ are
$$\frac{\partial C}{\partial a} = C, \qquad \frac{\partial^2 C}{\partial a\,\partial a} = \frac{\partial C}{\partial a} = C, \qquad \frac{\partial^2 C}{\partial a\,\partial v} = \frac{\partial C}{\partial v}, \qquad \frac{\partial^2 C}{\partial a\,\partial d_k} = \frac{\partial C}{\partial d_k}, \qquad k = 1,2,\dots,L.$$
The partial derivatives of $C$ with respect to $v$ are
$$\frac{\partial C}{\partial v} = \omega I, \qquad \frac{\partial^2 C}{\partial v\,\partial a} = \omega I, \qquad \frac{\partial^2 C}{\partial v\,\partial v} = \omega I, \qquad \frac{\partial^2 C}{\partial v\,\partial d_k} = 0, \qquad k = 1,2,\dots,L.$$
Finally, the partial derivatives of $C$ with respect to $d_k$ are obtained as follows:
$$\frac{\partial C}{\partial d_k} = \left[\frac{\partial C_{ij}}{\partial d_k}\right]_{N\times N} = \left[-\frac{1}{2}\,c_1(i,j)\left(x_i^{(k)} - x_j^{(k)}\right)^2\exp(d_k)\right]_{N\times N},$$
$$\frac{\partial^2 C}{\partial d_k\,\partial d_p} = \begin{cases}\left[\dfrac{\partial C_{ij}}{\partial d_k}\left(1 - \dfrac{1}{2}\left(x_i^{(k)} - x_j^{(k)}\right)^2\exp(d_k)\right)\right]_{N\times N} & \text{if } p = k,\\[2ex] \left[\dfrac{\partial C_{ij}}{\partial d_k}\left(-\dfrac{1}{2}\left(x_i^{(p)} - x_j^{(p)}\right)^2\exp(d_p)\right)\right]_{N\times N} & \text{if } p \neq k,\end{cases} \qquad (8)$$
which need to be computed only for $L \geq p \geq k = 1,2,\dots,L$ owing to the symmetry of $\mathcal{H}$.
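For concreteness, here is a sketch of the derivative matrices in (8); S[k] is our helper name for the matrix of squared coordinate differences scaled by exp(d_k), and c1 is the noise-free part of (7).

```python
import numpy as np

def length_scale_derivatives(X, c1, d):
    """First and second derivatives of C w.r.t. d_k, per Eq. (8).
    X: (N, L) inputs; c1: (N, N) noise-free part of (7); d: log length scales."""
    N, L = X.shape
    S = [(X[:, None, k] - X[None, :, k]) ** 2 * np.exp(d[k]) for k in range(L)]
    dC = [-0.5 * c1 * S[k] for k in range(L)]       # dC/dd_k
    d2C = {}
    for k in range(L):
        for p in range(k, L):                       # only L >= p >= k, by symmetry
            factor = (1.0 - 0.5 * S[k]) if p == k else (-0.5 * S[p])
            d2C[(k, p)] = dC[k] * factor            # d^2C/(dd_k dd_p)
    return dC, d2C
```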
Thus, the $a$-related Hessian entries simplify to
$$\mathcal{H}_{11} = \frac{\partial^2\mathcal{L}}{\partial a\,\partial a} = \frac{1}{2}\,t^{\mathrm T} C^{-1} t, \qquad \mathcal{H}_{1(L+2)} = \frac{\partial^2\mathcal{L}}{\partial a\,\partial v} = \frac{\omega}{2}\,t^{\mathrm T} C^{-1} C^{-1} t,$$
$$\mathcal{H}_{1(k+1)} = \frac{\partial^2\mathcal{L}}{\partial a\,\partial d_k} = \frac{1}{2}\,t^{\mathrm T} C^{-1}\frac{\partial C}{\partial d_k}\,C^{-1} t, \qquad k = 1,2,\dots,L. \qquad (9)$$
The $v$-related Hessian entries are as follows:
$$\mathcal{H}_{(L+2)(k+1)} = \frac{\partial^2\mathcal{L}}{\partial v\,\partial d_k} = -\frac{\omega}{2}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial d_k}\,C^{-1}\right) + \omega\,t^{\mathrm T} C^{-1}\frac{\partial C}{\partial d_k}\,C^{-1} C^{-1} t, \qquad k = 1,2,\dots,L,$$
$$\mathcal{H}_{(L+2)(L+2)} = \frac{\partial^2\mathcal{L}}{\partial v\,\partial v} = \frac{\omega}{2}\operatorname{tr}(C^{-1}) - \frac{\omega^2}{2}\operatorname{tr}(C^{-1}C^{-1}) + \omega^2\,t^{\mathrm T} C^{-1} C^{-1} C^{-1} t - \frac{\omega}{2}\,t^{\mathrm T} C^{-1} C^{-1} t. \qquad (10)$$
It follows from (8) and (6) that the $d$-related Hessian entries are, for all $L \geq p \geq k = 1,2,\dots,L$,
$$\mathcal{H}_{(k+1)(p+1)} = \frac{\partial^2\mathcal{L}}{\partial d_k\,\partial d_p} = \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial^2 C}{\partial d_k\,\partial d_p}\right) - \frac{1}{2}\operatorname{tr}\left(C^{-1}\frac{\partial C}{\partial d_p}\,C^{-1}\frac{\partial C}{\partial d_k}\right) + t^{\mathrm T} C^{-1}\frac{\partial C}{\partial d_p}\,C^{-1}\frac{\partial C}{\partial d_k}\,C^{-1} t - \frac{1}{2}\,t^{\mathrm T} C^{-1}\frac{\partial^2 C}{\partial d_k\,\partial d_p}\,C^{-1} t. \qquad (11)$$
Proposition 2. For the specific covariance function in (7), the Hessian matrix $\mathcal{H}(\Theta)$ simplifies to Eqs. (9)–(11), with the instrumental variables in (8).

Proof. Follow the derivation procedure above and utilize the symmetry properties of $C(\Theta)$, its derivative matrices, and the Hessian matrix $\mathcal{H}(\Theta)$. □
Remark 3. Let us re-check the computational cost in the general GP implementation. Note that both the trace of a two-matrix product and a matrix–vector product are $O(N^2)$ operations. To compute the Hessian, $L$ matrix–matrix operations (totalling about $LN^3$ operations) have to be performed, namely computing and storing $C^{-1}(\partial C/\partial d_k)$, $k = 1,2,\dots,L$, as these are used frequently in both the Hessian and the gradient computation. In comparison, computing the log likelihood and its gradient takes roughly $2N^3$–$3N^3$ operations, so that using finite differences to approximate the Hessian would take $2(L+2)N^3$–$4(L+2)N^3$ operations [23], quite apart from accuracy concerns.
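Putting (8)–(11) together, the full simplified Hessian can be assembled as in the following hedged sketch; Cinv and b = C^{-1}t are assumed precomputed (our names), and dC, d2C come from the previous sketch. Once the L products C^{-1}(∂C/∂d_k) are stored, every entry costs only O(N^2), in line with Remark 3.

```python
import numpy as np

def simplified_hessian(Cinv, b, t, dC, d2C, omega, L):
    """Assemble the (L+2)x(L+2) Hessian from Eqs. (9)-(11),
    with Theta ordered as [a, d_1, ..., d_L, v].
    Cinv: C^{-1}; b: C^{-1} t; dC[k], d2C[(k, p)]: derivatives from Eq. (8)."""
    n = L + 2
    H = np.empty((n, n))
    A = [Cinv @ dC[k] for k in range(L)]        # C^{-1} dC/dd_k, reused throughout
    Cb = Cinv @ b                               # C^{-2} t
    H[0, 0] = 0.5 * (t @ b)                                  # Eq. (9), a-a
    H[0, n - 1] = H[n - 1, 0] = 0.5 * omega * (t @ Cb)       # Eq. (9), a-v
    for k in range(L):
        H[0, k + 1] = H[k + 1, 0] = 0.5 * (t @ (A[k] @ b))   # Eq. (9), a-d_k
        H[n - 1, k + 1] = H[k + 1, n - 1] = (                # Eq. (10), v-d_k
            -0.5 * omega * np.sum(A[k] * Cinv)               # tr(A_k C^{-1}), Cinv symmetric
            + omega * (t @ (A[k] @ Cb)))
    H[n - 1, n - 1] = (0.5 * omega * np.trace(Cinv)          # Eq. (10), v-v
                       - 0.5 * omega ** 2 * np.sum(Cinv * Cinv)
                       + omega ** 2 * (Cb @ b) - 0.5 * omega * (t @ Cb))
    for k in range(L):
        for p in range(k, L):                                # Eq. (11), d_k-d_p
            M = d2C[(k, p)]
            H[k + 1, p + 1] = H[p + 1, k + 1] = (
                0.5 * np.sum(Cinv * M)                       # tr(C^{-1} d2C), symmetric
                - 0.5 * np.sum(A[p] * A[k].T)                # tr(A_p A_k)
                + t @ (A[p] @ (A[k] @ b)) - 0.5 * (b @ M @ b))
    return H
```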
4. Trust-region MLE approach
Iterative approaches to optimization can be divided into two categories: line-search methods and trust-region methods [17]. Classical methods, including steepest descent, conjugate gradient and quasi-Newton methods, are line-search methods. In each iteration, they first obtain a search direction and then try different step lengths along this direction for a better solution point.

It is known (and also confirmed in [8]) that steepest-descent methods are generally slow in the GP context. Conjugate-gradient methods are widely used in the exact and approximate implementations of Gaussian processes [4,5]. However, the conjugate-gradient method becomes slow for banana/ravine-type problems and, close to local optima, it requires many more iterations and line searches than second-order methods [17,18,20,23].
Trust-region methods are a class of relatively new and robust second-order optimization algorithms [17,22,27]. To show the advantages of using trust-region methods for the MLE hyperparameters estimation, the standard trust-region algorithm is investigated here. At each iteration, say $k$, the algorithm begins with a local quadratic approximation $q(\cdot)$ of (4):
$$\mathcal{L}(\Theta + s) \approx q(\Theta + s) = \mathcal{L}(\Theta) + g^{\mathrm T} s + \frac{1}{2}\,s^{\mathrm T}\mathcal{H}\,s,$$
where $s$ denotes the next step to be taken. The trust-region algorithm restricts the length of $s$ to within the trust-region radius $\rho$, resulting in the bounded trust-region subproblem:
$$\text{minimize } q(\Theta + s) \quad \text{subject to } \|s\| \leq \rho. \qquad (12)$$
In addition to the robust existence of solutions to (12), there also exist efficient algorithms for solving (12) approximately, such as [24–26] (an exact solution is not required). Once this subproblem is solved, the following improvement ratio for $\mathcal{L}(\Theta)$ decides whether or not to accept $s$ and how to adjust $\rho$:
$$\gamma = \frac{\text{actual decrease}}{\text{predicted decrease}} = \frac{\mathcal{L}(\Theta + s) - \mathcal{L}(\Theta)}{q(\Theta + s) - \mathcal{L}(\Theta)}. \qquad (13)$$
The standard decision for updating $\Theta$ and $\rho$ is
$$\Theta = \begin{cases}\Theta + s & \text{if } \mathcal{L}(\Theta + s) < \mathcal{L}(\Theta),\\ \Theta & \text{otherwise},\end{cases} \qquad \rho = \begin{cases}\min\{c_1\rho,\ \bar{\rho}\} & \text{if } \gamma \geq \gamma_1,\\ c_2\|s\| & \text{if } \gamma \leq \gamma_2,\\ \rho & \text{otherwise},\end{cases}$$
where $\bar{\rho}$ is an upper bound on the trust-region radius $\rho$. Typically, $\gamma_1 = 0.75$ and $c_1 = 2$, implying a good match of the quadratic model $q(\cdot)$ to $\mathcal{L}$ and an increasing $\rho$; or $\gamma_2 = 0.25$ and $c_2 = 0.5$, implying a poor match of $q(\cdot)$ to $\mathcal{L}$ and a decreasing $\rho$; otherwise the match is acceptable and $\rho$ is kept. A sketch of the resulting iteration is given below.
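The iteration just described can be summarized as follows (our reconstruction, not the MATLAB routine used in Section 5); solve_subproblem stands for any approximate solver of (12), e.g., one of [24–26].

```python
import numpy as np

def trust_region_mle(theta, nll, grad_hess, solve_subproblem,
                     rho=1.0, rho_max=100.0,
                     gamma1=0.75, c1=2.0, gamma2=0.25, c2=0.5,
                     gtol=1e-6, max_iter=200):
    """Standard trust-region minimization of the negative log likelihood (4)."""
    L_val = nll(theta)
    for _ in range(max_iter):
        g, H = grad_hess(theta)
        if np.linalg.norm(g) < gtol:                # stationarity test
            break
        s = solve_subproblem(g, H, rho)             # approximate minimizer of (12)
        L_new = nll(theta + s)
        predicted = g @ s + 0.5 * (s @ H @ s)       # q(theta + s) - L(theta)
        gamma = (L_new - L_val) / predicted         # improvement ratio, Eq. (13)
        if L_new < L_val:                           # accept the step
            theta, L_val = theta + s, L_new
        if gamma >= gamma1:                         # good model match: enlarge rho
            rho = min(c1 * rho, rho_max)
        elif gamma <= gamma2:                       # poor match: shrink rho
            rho = c2 * np.linalg.norm(s)
    return theta
```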
Compared to line-search methods, the above trust-region optimization procedure has shown the following advantages when applied to the MLE hyperparameters estimation of Gaussian processes.
Remark 4. As shown in (12), the bounded subproblem always has a solution for decreasing $\rho$, and trust-region methods do not require the Hessian (or its approximation) to be positive definite. They are thus able to handle ill-conditioned problems [22], which is sometimes the situation with our multimodal log-likelihood function. In contrast, second-order line-search methods, such as the quasi-Newton method, must modify the Hessian to be positive definite from iteration to iteration in order to guarantee its invertibility and a decreasing function value while minimizing $\mathcal{L}$ [17,23].
Remark 5. For the theoretical analysis of trust-region methods, fewer and milder conditions are assumed than for line-search methods [17,21,22,27]. For example, it is only required that the norm of the Hessian matrix does not increase at a rate faster than linear. In contrast, for line-search methods, it is required that the condition number of the Hessian does not grow rapidly and is uniformly bounded. In our context, the Hessian may be positive/negative semi-definite or indefinite, with unbounded condition numbers, as the log likelihood is multimodal.
Remark 6. Because of the improvement measure (13) and the adjustment mechanism of the trust-region radius, an approximate solution to the trust-region subproblem (12) is sufficient for the convergence proof [26]. In our comparative experiments, trust-region methods are more robust than line-search methods in terms of the success rate of the MLE solutions. Furthermore, it is shown in [22] that trust-region methods extend to general approximation-based optimization problems. This might be useful for the approximate implementation of Gaussian processes [5,20].
5. Numerical experiments
A series of numerical tests with different numbers of sample points and noise levels is conducted, based on the two regression applications detailed in [12]: the sinusoidal-function example and the Wiener–Hammerstein non-linear-system example. With and without the explicit Hessian, the trust-region algorithms implemented in MATLAB [28] are compared with the Polak–Ribiere conjugate-gradient method [4] under the same initial and terminating conditions. Only general and interesting conclusions are presented below from the abundant numerical observations and comparisons.
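For readers without the original MATLAB setup, a roughly comparable experiment can be sketched with SciPy's optimizers; this is an assumption on our part, since the paper's timings come from the MATLAB Optimization Toolbox [28]. Here 'trust-ncg' plays the role of a trust-region method with the explicit Hessian and 'CG' that of the Polak–Ribiere-type conjugate-gradient baseline; X, t and the objective are those of the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize

# gradient(theta, X, t) and hessian(theta, X, t) are assumed wrappers around
# the gradient/Hessian sketches of Section 3 (our names, not the paper's code).
theta0 = np.zeros(L + 2)                        # common initial hyperparameters
res_tr = minimize(neg_log_likelihood, theta0, args=(X, t),
                  method='trust-ncg', jac=gradient, hess=hessian)
res_cg = minimize(neg_log_likelihood, theta0, args=(X, t),
                  method='CG', jac=gradient)    # Polak-Ribiere-type CG
print(res_tr.nit, res_cg.nit)                   # iteration counts, cf. Tables 1 and 2
```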
5.1. Sinusoidal function
The underlying function is $y(x) = A\sin(a_1 x_1 + a_2 x_2)$, where $A = 1$, $a_1 = 1.0$ and $a_2 = 0.8$. The domain of interest is the rectangle $\{0 \leq x_1 \leq 7,\ 0 \leq x_2 \leq 8\}$, and the training data are the values on a regular $\sqrt{N}\times\sqrt{N}$ lattice of the domain, with noise intensity $v$. The tuned GP model is used to predict the values of $y(x)$ on a $2\sqrt{N}\times 2\sqrt{N}$ lattice of the domain.
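A sketch of the training-data generation, as we read the description above (the seed is arbitrary):

```python
import numpy as np

def sinusoidal_data(N, noise, A=1.0, a1=1.0, a2=0.8, seed=0):
    """Noisy samples of y(x) = A sin(a1 x1 + a2 x2) on a sqrt(N) x sqrt(N) lattice."""
    rng = np.random.default_rng(seed)               # fixed seed for reproducibility
    n = int(np.sqrt(N))
    x1, x2 = np.meshgrid(np.linspace(0, 7, n), np.linspace(0, 8, n))
    X = np.column_stack([x1.ravel(), x2.ravel()])   # (N, 2) training inputs
    t = A * np.sin(a1 * X[:, 0] + a2 * X[:, 1]) + noise * rng.standard_normal(n * n)
    return X, t

X, t = sinusoidal_data(N=289, noise=0.2)            # the case shown in Fig. 1
```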
The regression task with $N = 289$ and $v = 0.2$ is illustrated in Fig. 1. Fig. 2 shows a typical comparison of computational cost under the same initial conditions, while Table 1 shows the comparison of the ten-times-averaged computational cost, including the CPU time and the iteration numbers in brackets.
[Fig. 1. Sinusoidal regression using the Hessian and trust-region algorithm: (a) noisy training data t(x); (b) regression result y(x), both over (x1, x2).]
[Fig. 2. Computation cost under the same random initial state (sinusoidal example): number of iterations and CPU time (seconds) versus N, for CT, UT, CTH, UTH and CG.]
Table 1
Ten-times-averaged computational cost for the sinusoidal-function example (CPU time in seconds; iteration counts in brackets)
N CT UT CTH UTH CG
49 0.6893 (19) 0.4248 (16) 0.2689 (20) 0.1687 (15) 0.9234 (55)
64 0.9141 (21) 0.8515 (23) 0.2955 (17) 0.3421 (22) 1.0423 (37)
81 1.4903 (24) 1.4516 (26) 0.5845 (23) 0.5781 (26) 1.4687 (44)
100 2.1658 (27) 1.6671 (22) 0.7638 (26) 0.5924 (21) 2.4984 (49)
121 3.6310 (31) 2.9969 (27) 1.2847 (31) 1.0529 (26) 3.8423 (50)
144 5.5093 (33) 5.8751 (37) 2.0327 (32) 2.1003 (35) 4.8295 (46)
169 6.7128 (30) 8.3547 (38) 2.4796 (31) 2.6923 (33) 7.3282 (48)
196 9.9048 (33) 7.9421 (26) 3.4923 (28) 3.2342 (26) 9.1127 (48)
225 14.1794 (34) 13.7769 (34) 5.4796 (32) 5.6218 (34) 17.9844 (65)
256 23.2840 (35) 22.3798 (36) 9.1206 (32) 9.2717 (36) 21.1783 (52)
289 26.0578 (36) 29.9000 (39) 10.2234 (34) 13.2611 (38) 22.0466 (48)
324 37.2813 (41) 34.9299 (39) 15.5528 (40) 17.7267 (46) 32.8221 (62)
361 43.7952 (39) 46.9579 (41) 19.0483 (39) 19.8437 (40) 47.0142 (57)
400 66.6562 (47) 63.0625 (44) 27.2343 (44) 26.8908 (43) 60.0358 (73)
441 60.3577 (34) 83.0390 (47) 28.8203 (37) 34.4076 (44) 63.7875 (51)
484 94.1719 (44) 85.9374 (39) 39.2123 (40) 40.9970 (42) 89.2422 (77)
529 112.9249 (42) 144.3187 (54) 52.7749 (44) 59.3784 (49) 101.1625 (66)
576 142.2546 (42) 174.6532 (52) 71.8812 (47) 78.8845 (52) 185.4781 (78)
625 171.0424 (44) 190.0734 (49) 69.1358 (38) 87.2672 (48) 183.4015 (84)
676 210.3343 (45) 239.9610 (51) 104.3015 (47) 121.1297 (54) 214.1891 (84)
729 240.0218 (43) 319.0110 (57) 114.5608 (43) 155.5173 (58) 307.6295 (79)
784 280.1562 (42) 335.4124 (50) 119.0766 (37) 171.6220 (53) 270.2671 (70)
841 322.6843 (41) 438.9391 (56) 174.6562 (46) 185.7595 (48) 334.0406 (81)
900 378.8032 (40) 476.8765 (51) 170.9968 (37) 228.1205 (49) 464.7920 (94)
961 473.6595 (43) 574.4327 (52) 263.8191 (48) 300.0325 (55) 534.1203 (87)
In the figures and tables, CT denotes the constrained trust-region algorithm without the explicit Hessian; UT, the unconstrained trust-region algorithm without the explicit Hessian; CTH, the constrained trust-region algorithm with the explicit Hessian; UTH, the unconstrained trust-region algorithm with the explicit Hessian; and CG, the conjugate-gradient algorithm.
Remark 7. In our GP context, the trust-region algorithms with the explicit Hessian (i.e., CTH and UTH) are in general faster than the trust-region algorithms without the explicit Hessian (i.e., CT and UT). This verifies the computational analysis in Section 3.2 and Remark 3. In addition, without fully utilizing curvature information, the general efficiency of the conjugate-gradient algorithm (CG) lies between CTH/UTH and CT/UT, usually taking more iterations and function evaluations to approach the optimal solutions.
Remark 8. In very difficult ravine-type tasks, the conjugate-gradient method usually converges much more slowly than the trust-region algorithms, and may even fail. For example, using a non-diagonal singular matrix $D$ in (3) improves the regression accuracy but turns the optimization task into a ravine-type problem. The comparison of computational cost in this case is shown in Fig. 3, where the number of CG iterations is limited to 200. This observation is consistent with others' findings about CG and line-search methods [4,8,12,18,20].
5.2. Wiener–Hammerstein system
Consider the identification problem of a transversal Wiener–Hammerstein system [12]. In terms of the measured variables (input $r$ and output $y$ at time instant $s$), the system dynamics of interest can be reformulated as
$$y_i = 0.3\,(H_1 R)^3 + 0.165\,(H_3 R)^3,$$
[Fig. 3. Comparison of computation cost for non-diagonal D (sinusoidal example): CPU time (seconds) and number of iterations versus N, for CT, UT and CG.]
where $R := [\,r(s_i)\ \ r(s_{i-1})\ \ r(s_{i-2})\ \ r(s_{i-3})\,]^{\mathrm T}$ (i.e., $L = 4$), and $H_i$ is the $i$th row of the matrix
$$H = \begin{bmatrix}0.9184 & 0.3674 & 0 & 0\\ 0 & 0.9184 & 0.3674 & 0\\ 0 & 0 & 0.9184 & 0.3674\end{bmatrix}.$$
The output in response to a Gaussian input is measured: data are collected for $0.1N$ seconds with sampling interval 0.1, and Gaussian white noise of standard deviation $v$ is added to the output signal. The prediction performance with $N = 150$ and $v = 0.2$ is shown in Fig. 4, and the optimization comparison is given in Fig. 5 and Table 2. The typical situation is similar to Remark 7: the trust-region algorithms, especially with the explicit Hessian, are good alternatives for the MLE hyperparameters estimation of Gaussian processes.
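Under the stated assumptions (unit-variance Gaussian input r sampled every 0.1 s, noise standard deviation v), the data set can be generated as in the following sketch, which is our reconstruction of the description above:

```python
import numpy as np

def wiener_hammerstein_data(N, noise, seed=0):
    """y_i = 0.3 (H_1 R)^3 + 0.165 (H_3 R)^3, R = [r(s_i), ..., r(s_{i-3})]^T."""
    rng = np.random.default_rng(seed)
    H = np.array([[0.9184, 0.3674, 0.0,    0.0],
                  [0.0,    0.9184, 0.3674, 0.0],
                  [0.0,    0.0,    0.9184, 0.3674]])
    r = rng.standard_normal(N + 3)                  # Gaussian input, one value per 0.1 s
    X = np.array([[r[i + 3], r[i + 2], r[i + 1], r[i]] for i in range(N)])  # rows R^T
    y = 0.3 * (X @ H[0]) ** 3 + 0.165 * (X @ H[2]) ** 3
    t = y + noise * rng.standard_normal(N)          # noisy measured output
    return X, t

X, t = wiener_hammerstein_data(N=150, noise=0.2)    # the case shown in Fig. 4
```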
[Fig. 4. Wiener–Hammerstein example using the Hessian and trust-region algorithm: (a) system output versus time (seconds), training data t and test prediction y; (b) test error versus time (seconds).]
[Fig. 5. Computation cost under the same initial state (Wiener–Hammerstein example): CPU time (seconds) and number of iterations versus N, for CT, UT, CTH, UTH and CG.]
Table 2
Ten-times-averaged computational cost for the Wiener–Hammerstein example (CPU time in seconds; iteration counts in brackets)
N CT UT CTH UTH CG
3 0.2420 (18) 0.2453 (19) 0.1172 (18) 0.1172 (19) 0.5593 (155)
4 0.2265 (17) 0.2579 (20) 0.1095 (17) 0.1110 (18) 0.5107 (167)
7 0.1967 (13) 0.2206 (15) 0.0967 (13) 0.1016 (14) 0.5967 (165)
12 0.2954 (17) 0.2796 (16) 0.1484 (18) 0.1454 (17) 0.6657 (180)
19 0.3687 (17) 0.4828 (23) 0.1798 (17) 0.2469 (24) 0.5749 (102)
28 0.4376 (16) 0.5513 (21) 0.2330 (17) 0.2970 (22) 0.7592 (85)
39 0.6406 (17) 0.9016 (24) 0.3123 (17) 0.3971 (21) 1.1874 (98)
52 1.5860 (30) 1.5437 (30) 0.8234 (30) 0.6906 (26) 1.6531 (94)
67 1.4811 (19) 2.1579 (28) 0.7485 (19) 1.1672 (30) 2.9562 (109)
84 2.3140 (21) 3.2797 (30) 1.1671 (21) 1.4608 (26) 3.7937 (99)
103 5.0031 (31) 5.1827 (32) 2.0781 (26) 2.6532 (33) 5.1969 (88)
124 5.7314 (24) 7.5250 (32) 2.3515 (23) 2.7624 (28) 7.6966 (101)
147 9.6078 (29) 11.8451 (36) 4.0781 (27) 4.9626 (33) 12.5906 (111)
172 12.3187 (27) 15.4998 (33) 6.0766 (27) 7.6905 (34) 16.2812 (107)
199 18.5699 (30) 17.5422 (29) 9.4765 (31) 8.2341 (27) 19.6078 (91)
228 36.3089 (45) 40.6810 (50) 19.7888 (48) 17.1811 (42) 29.8246 (111)
259 34.5827 (32) 41.8322 (39) 19.9953 (35) 20.7262 (37) 37.0437 (92)
292 48.0307 (35) 47.2388 (34) 24.8887 (35) 26.1404 (36) 47.0650 (94)
327 76.0839 (44) 80.3743 (46) 38.9654 (43) 43.5152 (47) 75.5367 (120)
364 123.1177 (57) 120.6036 (56) 74.3760 (65) 64.0106 (56) 84.5962 (118)
403 123.9826 (46) 162.6480 (60) 61.8358 (43) 98.4776 (69) 102.6462 (106)
444 182.7944 (55) 199.8484 (61) 98.4025 (56) 88.9385 (51) 139.9789 (122)
532 241.7160 (49) 271.7094 (55) 141.3059 (53) 157.0452 (59) 186.0429 (103)
579 333.0707 (55) 339.6398 (55) 181.7228 (55) 165.2496 (50) 887.8056 (108)
628 363.6192 (50) 404.6686 (56) 227.4981 (58) 209.7888 (53) 281.3492 (106)
679 491.1429 (54) 630.0538 (73) 292.9498 (62) 302.7325 (64) 384.8370 (122)
732 1219.7418 (120) 1061.0852 (105) 522.9653 (93) 528.0667 (94) 486.7467 (129)
Remark 9. For a more automatic GP implementation, the positiveness of the eigenvalues of the Hessian matrix (i.e., $\min\lambda(\mathcal{H}) > 0$) can be used to reject inappropriate solutions and re-start the MLE optimization if necessary. Moreover, the Hessian might be a good indicator of the adequacy of the training data for a specific GP-regression task. For example, as in Fig. 6, if the data set is not large enough, the likelihood function possibly contains more saddle points and other non-optimal stationary points at which the optimization routine may get stuck. In comparison, after the training-data size increases beyond some value, it becomes much easier, and happens much more frequently, to find an optimum.
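The check suggested in Remark 9 is a one-liner once the explicit Hessian is available; a minimal sketch:

```python
import numpy as np

def is_local_minimum(H, tol=0.0):
    """Accept an MLE solution only if the explicit Hessian of (4) is positive
    definite (min eigenvalue > tol); otherwise restart from a new initial point."""
    return np.linalg.eigvalsh(H).min() > tol
```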
Finally, it is worth mentioning that the CG algorithm remains one of the most useful techniques for solving large-scale optimization problems when explicit matrix computation and storage are impractical [17]; e.g., in our context, when both $L$ and $N$ are extremely large. In addition, as shown in the function-value comparison of Fig. 7, the algorithms may converge to different local optima, and thus a parallel/robust GP implementation may be developed by incorporating different optimization routines.
[Fig. 7. Log-likelihood function values at different MLE solutions, versus N, for CT, UT, CTH, UTH and CG: (a) sinusoidal example; (b) Wiener–Hammerstein example.]
[Fig. 6. Non-optimality percentage of the MLE solutions versus N: (a) sinusoidal example; (b) Wiener–Hammerstein example.]
6. Conclusions
This paper has investigated in detail the log-likelihood optimization in the context of Gaussian processes. The derivation, simplification and utility of the Hessian matrix have been shown theoretically and experimentally, eliminating previous misunderstandings about it. The idea of using trust-region algorithms is also interesting and new for the non-trivial numerical computation of Gaussian process regression, even for general statistical research. For GP optimization with the normal choice of covariance function, the trust-region algorithms with the explicit Hessian perform considerably better than the trust-region algorithms without the Hessian, and a little better than the conjugate-gradient method. For the ravine-type GP optimization that can occur with other choices of covariance function, the trust-region algorithms remarkably outperform the conjugate-gradient method. These conclusions may have immediate consequences in this field, e.g., for approximate GP implementations based on trust-region methods and for parallel/robust GP implementations.
References
[1] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
[2] E.B. Baum, D. Haussler, What size net gives valid generalization?, Neural Computation 1 (1989) 151–160.
[3] R.M. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, vol. 118, Springer, New York, 1996.
[4] C.E. Rasmussen, Evaluation of Gaussian processes and other methods for non-linear
regression, Ph.D. thesis, University of Toronto, 1996.
[5] D.J.C. MacKay, Introduction to Gaussian processes, in: Neural Networks and Machine Learning, NATO ASI Series F: Computer and Systems Sciences, vol. 168, Springer, Berlin, Heidelberg, 1998, pp. 133–165.
[6] C.K.I. Williams, D. Barber, Bayesian classification with Gaussian processes, IEEE Transac-
tions on Pattern Analysis and Machine Intelligence 20 (1998) 1342–1351.
[7] S. Sambu, M. Wallat, T. Graepel, K. Obermayer, Gaussian process regression: active data
selection and test point rejection, in: Proceedings of the IEEE-INNS-ENNS International
Joint Conference on Neural Networks, 3, 2000, pp. 241–246.
[8] T. Yoshioka, S. Ishii, Fast Gaussian process regression using representative data, in:
Proceedings of International Joint Conference on Neural Networks, 1, 2001, pp. 132–137.
[9] D.J. Leith, W.E. Leithead, E. Solak, R. Murray-Smith, Divide and conquer identification
using Gaussian process priors, in: Proceedings of the 41st IEEE Conference on Decision and
Control, 1, 2002, pp. 624–629.
[10] J.Q. Shi, R. Murray-Smith, D.M. Titterington, Bayesian regression and classification using
mixtures of multiple Gaussian processes, International Journal of Adaptive Control and Signal
Processing 17 (2003) 149–161.
[11] E. Solak, R. Murray-Smith, W.E. Leithead, D.J. Leith, C.E. Rasmussen, Derivative
observations in Gaussian process models of dynamic systems, Advances in Neural Information
Processing Systems 15 (2003) 1033–1040.
[12] W.E. Leithead, E. Solak, D.J. Leith, Direct identification of nonlinear structure using
Gaussian process prior models, in: European Control Conference, Cambridge, 2003.
[13] A. O'Hagan, On curve fitting and optimal design for regression, Journal of the Royal Statistical Society B 40 (1978) 1–42.
[14] K.V. Mardia, R.J. Marshall, Maximum likelihood estimation for models of residual
covariance in spatial regression, Biometrika 71 (1984) 135–146.
[15] N.A.C. Cressie, Statistics for Spatial Data, John Wiley and Sons, New York, 1993.
[16] C.J. Paciorek, M.J. Schervish, Nonstationary covariance functions for Gaussian process
regression, Advances in Neural Information Processing Systems 16 (2003).
[17] J. Nocedal, Theory of algorithms for unconstrained optimization, Acta Numerica (1992)
199–242.
[18] M.F. Moller, A scaled conjugate gradient algorithm for fast supervised learning, Neural
Networks 6 (1993) 525–533.
[19] G.H. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins University Press,
Baltimore, 1996.
[20] J. Skilling, Bayesian Numerical Analysis, Physics and Probability, Cambridge University
Press, 1993.
[21] Z. Wu, G.N. Phillips Jr., R. Tapia, Y. Zhang, A fast Newton method for entropy
maximization in statistical phase estimation, Acta Crystallographica A 57 (2001) 681–685.
[22] N. Alexandrov, J.E. Dennis Jr., R.M. Lewis, V. Torczon, A trust region framework for
managing the use of approximation models in optimization, Journal on Structural Optimi-
zation 15 (1998) 16–23.
[23] J.E. Dennis, R.B. Schnabel, Numerical Methods for Unconstrained Optimization and
Nonlinear Equations, Prentice-Hall, Englewood, NJ, 1983.
[24] J.J. Moré, D.C. Sorensen, Computing a trust region step, SIAM Journal on Scientific and Statistical Computing 4 (1983) 553–572.
[25] T. Steihaug, The conjugate gradient method and trust regions in large scale optimization,
SIAM Journal on Numerical Analysis 20 (1983) 626–637.
[26] R.H. Byrd, R.B. Schnabel, G.A. Shultz, Approximate solution of the trust region problem
by minimization over two-dimensional subspaces, Mathematical Programming 40 (1988)
247–263.
[27] A.R. Conn, N.I.M. Gould, Ph.L. Toint, Trust Region Methods, SIAM, Philadelphia, 2000.
[28] The MathWorks Inc., Optimization Toolbox User's Guide, Version 2.1, 2000.