A First-Order Interior-Point Method for Linearly Constrained Smooth Optimization¹

Paul Tseng², Immanuel M. Bomze³, Werner Schachinger⁴

Abstract: We propose a first-order interior-point method for linearly constrained smooth optimization that unifies and extends the first-order affine-scaling method and the replicator dynamics method for standard quadratic programming. Global convergence and, in the case of a quadratic program, (sub)linear convergence rate and iterate convergence results are derived. Numerical experience on simplex-constrained problems with 1000 variables is reported.

Key words. Linearly constrained optimization, affine scaling, replicator dynamics, interior-point method, global convergence, sublinear convergence rate

1 Introduction

Consider a linearly constrained smooth optimization problem

max_{x∈Λ} f(x),      (1)

where Λ = {x ∈ R^n | Ax = b, x ≥ 0} with A ∈ R^{m×n}, b ∈ R^m, and f : R^n → R is continuously differentiable. We assume without loss of generality that A has rank m. This problem has been well studied and many iterative methods have been proposed for its solution, such as active-set methods and interior-point methods; see, e.g., [1, Chapter 2], [10, Chapter 5], [6, 9, 11, 19].

¹ The research of the first author is supported by the National Science Foundation, Grant No. DMS-0511283.

² Department of Mathematics, University of Washington, Seattle, WA 98195, U.S.A. ([email protected])

³ Department of Statistics and Decision Support Systems, University of Vienna, Brünnerstr. 72, A-1210 Wien, Austria. ([email protected])

⁴ Department of Statistics and Decision Support Systems, University of Vienna, Brünnerstr. 72, A-1210 Wien, Austria. ([email protected])


In what follows, riΛ = {x ∈ R^n | Ax = b, x > 0} denotes the relative interior of Λ and ‖·‖ denotes the 2-norm.

An important special case of (1) is the standard quadratic program (StQP), where Λ is the unit simplex and f is homogeneous quadratic, i.e.,

A = e^T,  b = 1,  f(x) = (1/2) x^T Qx,

with Q ∈ R^{n×n} symmetric and e ∈ R^n the vector of all ones. Applications of StQP include maximum clique, portfolio selection, and graph isomorphism [2, 22]. By adding a nonnegative multiple of ee^T to Q (which changes f by only a constant on Λ), we can assume that

q_{ii} > 0 and q_{ij} ≥ 0  ∀ i, j,      (2)

so that Qx > 0 componentwise for all x ∈ Λ. In [2, 3, 4, 22, 23], a remarkably simple interior-point method called replicator dynamics (RD) was used for solving StQP:

x^{t+1} = X^t g^t / (x^t)^T g^t,   t = 0, 1, . . . ,   x^0 ∈ riΛ,      (3)

where g^t = Qx^t and X^t = Diag(x^t). The assumption (2) implies that g^t > 0 and x^t ∈ riΛ for all t. In fact, (2) is necessary and sufficient for Qx ≥ 0 and x^T Qx > 0 for all x ∈ Λ. This method is reminiscent of the power method for finding the largest (in magnitude) eigenvalue of a square matrix, with the unit Euclidean sphere replaced by the unit simplex. Starting the RD method in riΛ is essential: if it is started at an x^0 on the relative boundary of Λ, then all x^t stay in the relative interior of the face of Λ that contains x^0.
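As a concrete illustration, the following sketch (a minimal NumPy implementation; the test matrix Q, the starting point, and the iteration count are our own, not from the paper) runs the RD iteration (3) on a small StQP:

```python
import numpy as np

def replicator_dynamics(Q, x0, iters=200):
    """Run the RD iteration (3): x <- (x * Qx) / (x' Q x)."""
    x = x0.copy()
    for _ in range(iters):
        g = Q @ x                     # g^t = Q x^t
        x = x * g / (x @ g)           # componentwise scaling keeps x in ri(Lambda)
    return x

# toy example: Q with positive entries, so assumption (2) holds
Q = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.5, 0.3],
              [0.1, 0.3, 0.8]])
x0 = np.ones(3) / 3                   # start at the barycenter of the simplex
x = replicator_dynamics(Q, x0)
print(x, 0.5 * x @ Q @ x)             # final iterate and objective value f(x) = x'Qx/2
```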

The RD method has a long history in mathematical biology, and it connects three different fields: optimization, evolutionary games, and qualitative analysis of dynamical systems; see [3] and references therein. It arises in population genetics under the name selection equations, where it is used to model the time evolution of haploid genotypes, with Q being the (symmetric) fitness matrix and x^t_i representing the relative frequency of allele i in the population at time t (see, e.g., [20, Chapter III]). Since it also serves to model replicating entities in a much more general context, it is nowadays often called replicator dynamics. The continuous-time version of the RD method is known to be a gradient system with respect to the Shahshahani geometry; see [12]. This suggests that the method may be useful for local optimization. In fact, (3) has the remarkable property that, under assumption (2), the generated sequence of iterates {x^t} converges [15] (i.e., has a unique cluster point) and, given that we start in riΛ, the limit x̄ is a first-order stationary point of (1). This contrasts with other interior-point methods for solving (1), for which additional assumptions on Q are required to prove convergence of the generated iterates; see [6, 11, 19, 25, 27]. Additionally, the objective values f(x^t) increase with t, ‖x^t − x̄‖ = O(1/√t), and the convergence rate is linear if and only if strict complementarity holds at x̄. In [4], the RD method (3) was applied to solve medium-sized test problems from portfolio selection and was shown to be superior in performance to classical feasible ascent methods using exact line search, including Rosen's gradient projection method and Wolfe's reduced gradient method. A variant of the RD method that uses exact line search was also considered. Recently, some of the aforementioned results were extended to the case where Λ is a product of simplices, under the name multi-standard quadratic program (MStQP); see [5].

However, in practice the RD method seems slow on large instances of both StQP and MStQP. Can this method be improved and extended to the general problem (1) while retaining its elegant simplicity? Denoting, as always in the sequel, X^t = Diag(x^t), we can rewrite (3) as

x^{t+1} − x^t = X^t[g^t − (x^t)^T g^t e] / (x^t)^T g^t,

so we can interpret the numerator as the search direction and 1/(x^t)^T g^t as the stepsize; see [4, Equation (15)]. What if we do a line search along this search direction instead? This motivates the following interior-point method for solving the special case of (1) where Λ is the unit simplex (f need not be homogeneous quadratic):

x^{t+1} = x^t + α^t d^t,  d^t = X^t r(x^t),  t = 0, 1, . . . ,  x^0 ∈ riΛ,      (4)

with 0 < α^t < −1/min_j r(x^t)_j, where

r(x) = ∇f(x) − x^T∇f(x) e = [I − ex^T]∇f(x).

The restriction on α^t ensures that x^t ∈ riΛ for all t. Note that, for x ∈ Λ, we have min_j r(x)_j < 0 unless r(x) = 0, in which case x is a stationary point of (1).

Among existing interior-point methods, the RD method (4) is most closely related to the first-order affine-scaling (AS) method of Dikin [7] for quadratic programming, extended by Gonzaga and Carlos to linearly constrained smooth convex minimization [11]. Their method, when specialized to (1), has the form

x^{t+1} = x^t + α^t d^t,  d^t = (X^t)² r(x^t),  t = 0, 1, . . . ,  x^0 ∈ riΛ,      (5)

where α^t is chosen by a limited maximization rule on (0, −1/min_j(d^t_j/x^t_j)),

r(x) = ∇f(x) − (x^T X∇f(x)/‖x‖²) e = (I − (1/‖x‖²) ex^T X)∇f(x),

and X = Diag(x). They showed that every cluster point of {x^t} is a stationary point of (1) when f is concave. The proof extends an idea of Dikin [8] and makes use of a constancy property of ∇f on each isocost line segment. Subsequently, Bonnans and Pola [6] and Monteiro and Wang [19] proposed first- and second-order AS trust-region methods based on generalizations of this search direction. In [6, Theorem 2.2], α^t is chosen by an Armijo-type rule and it is shown that every cluster point of the generated iterates is a stationary point provided a certain relaxed first-order optimality system has a unique solution or has isolated solutions, and an additional technical condition holds. In [19], the analysis in [11] is extended to show that every cluster point of the generated iterates is a stationary point provided f is either concave or convex. In general, global convergence analysis for these kinds of interior-point methods, including (4), is nontrivial because the search direction is componentwise proportional to the current iterate.

Upon comparing (4) with (5), we see that they differ mainly in that one scales its direction by X^t while the other scales by (X^t)². Also, r(x) is obtained by subtracting from ∇f(x) componentwise a weighted average of its components. In fact, the two methods (4) and (5) belong to a general class of first-order interior-point methods for the general problem (1) of the form

x^{t+1} = x^t + α^t d^t,  d^t = (X^t)^{2γ} r(x^t),  t = 0, 1, . . . ,  x^0 ∈ riΛ,      (6)

where γ > 0, 0 < α^t < −1/min_j(d^t_j/x^t_j), and

r(x) = ∇f(x) − A^T(AX^{2γ}A^T)^{−1}AX^{2γ}∇f(x) = [I − A^T(AX^{2γ}A^T)^{−1}AX^{2γ}]∇f(x).      (7)

(Here X^{2γ} denotes X raised to the power 2γ, in contrast to the iteration index in X^t; the meaning of the exponent should be clear from the context.) Thus, γ = 1 yields the first-order AS method while γ = 1/2 yields the RD method. Since A has rank m, AX^{2γ}A^T is positive definite, and hence invertible, whenever x > 0.
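For concreteness, a minimal sketch of one iteration of (6)-(7) might look as follows (the toy data and the conservative fixed stepsize fraction are our own illustration, not from the paper; a practical implementation would use the line searches of Section 3):

```python
import numpy as np

def direction(x, grad, A, gamma):
    """Scaled reduced gradient r(x) of (7) and direction d = X^{2*gamma} r(x) of (6)."""
    X2g = x ** (2 * gamma)                       # diagonal of X^{2*gamma}
    M = (A * X2g) @ A.T                          # A X^{2*gamma} A^T
    p = np.linalg.solve(M, (A * X2g) @ grad)     # multiplier estimate
    r = grad - A.T @ p                           # r(x) = [I - A^T M^{-1} A X^{2g}] grad
    return r, X2g * r

# toy data: one simplex constraint (A = e^T, b = 1) and a quadratic objective
n, gamma = 5, 0.5
Q = np.diag(np.arange(1.0, n + 1))
A, x = np.ones((1, n)), np.ones(n) / n
for _ in range(50):
    r, d = direction(x, Q @ x, A, gamma)
    step_max = np.inf if (d >= 0).all() else -1.0 / np.min(d / x)
    x = x + 0.5 * min(1.0, step_max) * d         # conservative fraction of the cap in (6)
print(x, 0.5 * x @ Q @ x)
```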

We will discuss (inexact) line search strategies for choosing the stepsize α^t so as to achieve fast global convergence. We will show that if α^t is chosen by either an Armijo rule or a limited maximization rule, then every cluster point of {x^t} is a stationary point of (1) under a primal nondegeneracy assumption and additional assumptions such as f being concave or convex; see Theorem 1(c). Thus if f is concave, then every cluster point is a global maximizer. In the special case where f is quadratic, we show that {f(x^t)} converges at a sublinear rate of O(1/t^{1/max{γ,2γ−1}}); see Theorem 2. To our knowledge, this is the first rate-of-convergence result for a first-order interior-point method when the objective function f is nonlinear. Moreover, we extend the result in [15] to show that, for γ < 1, {x^t} converges sublinearly and, under primal nondegeneracy, its limit x̄ is a stationary point of (1). If in addition γ ≤ 1/2 and x̄ satisfies strict complementarity, then {f(x^t)} and {x^t} converge linearly. On the other hand, if 1/2 ≤ γ < 1 and x̄ does not satisfy strict complementarity, then {x^t} cannot converge linearly. Why are we interested in a first-order method if its convergence rate can be sublinear? Such methods have much simpler iterations than second-order interior-point methods [9, 27, 30] and hence may be suited for solving very large problems (n ≥ 10000). The case of multiple simplex constraints is a good example: there AX^{2γ}A^T has a block-diagonal structure corresponding to the simplices and r(x) decomposes accordingly. In general, if m is small or AA^T has a nice sparsity structure, then r(x) can be inexpensively computed from ∇f(x). Our analysis and numerical experience suggest that values of γ < 1 are superior to values of γ ≥ 1. Our main contributions are: a unified algorithmic framework, practical stepsize rules, a comprehensive global convergence analysis and, for quadratic f, a convergence rate analysis, and implementation and testing of the method. Our results can be extended to handle upper bound constraints x ≤ u by working with Diag(min{x, u − x}); see, e.g., [27]. For simplicity we do not consider this more general problem here.

2 Properties of search direction

The following lemma shows key feasible ascent properties of the search direction d^t.

Lemma 1 For any γ > 0 and x ∈ riΛ, let X = Diag(x) and d = X^{2γ}r(x), where r is given by (7). Then we have Ad = 0 and

∇f(x)^T d = ‖X^{−γ}d‖² = ‖X^γ r(x)‖².

Moreover, d solves the following subproblem

max_{u∈R^n} {∇f(x)^T u | Au = 0, ‖X^{−γ}u‖ ≤ ‖X^γ r(x)‖}.      (8)

Proof. Let g = ∇f(x) and denote by

P_γ = I − X^γ A^T(AX^{2γ}A^T)^{−1}AX^γ

the matrix of orthogonal projection onto the null space of AX^γ. Then P_γ^T = P_γ = (P_γ)². We have from (7) that

X^{−γ}d = X^γ r(x) = P_γ X^γ g,

so X^{−γ}d is in the null space of AX^γ and hence Ad = (AX^γ)(X^{−γ}d) = 0. Also,

g^T d = (X^γ g)^T(X^{−γ}d) = (X^γ g)^T P_γ X^γ g = ‖P_γ X^γ g‖² = ‖X^{−γ}d‖²,

where the third equality uses P_γ = (P_γ)². The minimum-norm property of orthogonal projection implies that d solves the subproblem (8).
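As a quick numerical sanity check of Lemma 1 (our own illustration, not part of the paper), one can verify the identities Ad = 0 and ∇f(x)^T d = ‖X^{−γ}d‖² at a random strictly positive point:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, gamma = 2, 6, 0.7
A = rng.random((m, n))
x = rng.random(n) + 0.1                       # x > 0
g = rng.random(n)                             # stands in for grad f(x)

X2g = x ** (2 * gamma)
p = np.linalg.solve((A * X2g) @ A.T, (A * X2g) @ g)
r = g - A.T @ p                               # r(x) from (7)
d = X2g * r                                   # d = X^{2*gamma} r(x)

print(np.allclose(A @ d, 0))                  # Ad = 0
print(np.isclose(g @ d, np.sum(d**2 / X2g)))  # grad^T d = ||X^{-gamma} d||^2
```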

We will make use of the following primal nondegeneracy assumption, which is standard in the analysis of AS methods, especially when the objective is nonquadratic; see [6, 7, 11, 19].

Assumption 1 For any x ∈ Λ, the columns of A corresponding to {j | x_j ≠ 0} have rank m.

Assumption 1 is satisfied when Λ is the unit simplex or a Cartesian product of simplices. The following result is well known; see [7, 11].

Lemma 2 Under Assumption 1, AX^{2γ}A^T is nonsingular for all x ∈ Λ and r is a continuous mapping on Λ, where X and r are given by (7).


3 Stepsize rules

For general f, we propose to choose α^t by an Armijo-type rule [1, Section 2.2.1]: α^t is the largest α ∈ {α^t_0 (β)^k}_{k=0,1,...} satisfying

f(x^t + αd^t) ≥ f(x^t) + σα(g^t)^T d^t,      (9)

where g^t = ∇f(x^t), 0 < β, σ < 1 are constants, and

0 < α^t_0 < ∞ if d^t ≥ 0;   0 < α^t_0 < −1/min_j(d^t_j/x^t_j) otherwise.      (10)

Notice that if d^t = 0, then (9) is satisfied by any α ≥ 0 and the Armijo rule yields α^t = α^t_0. Since d^t is a feasible ascent direction at x^t by Lemma 1 and α^t_0 > 0, α^t is well defined and positive.

In the special case where f is a quadratic or cubic function, we can choose α^t by the limited maximization rule:

α^t ∈ arg max_{0≤α≤α^t_0} f(x^t + αd^t).      (11)
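A minimal backtracking sketch of the Armijo-type rule (9)-(10) is given below (the helper names are ours; the default constants β = 0.5, σ = 0.1 follow the setting used later in Section 6, and everything else is our own illustration):

```python
import numpy as np

def armijo_stepsize(f, x, d, g, alpha0, beta=0.5, sigma=0.1, max_backtracks=50):
    """Largest alpha in {alpha0 * beta^k} satisfying the ascent condition (9)."""
    fx, slope = f(x), g @ d            # slope = (g^t)^T d^t >= 0 by Lemma 1
    alpha = alpha0
    for _ in range(max_backtracks):
        if f(x + alpha * d) >= fx + sigma * alpha * slope:
            return alpha
        alpha *= beta                  # backtrack
    return alpha

def alpha_cap(x, d, fraction=0.95):
    """An upper bound satisfying (10): keep x + alpha*d strictly positive."""
    ratios = d / x
    return np.inf if ratios.min() >= 0 else -fraction / ratios.min()
```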

4 Global convergence

In this section we analyze the global convergence of the first-order interior-point method (6), (7). The proof uses ideas from [1, Section 1.2] and [11], [19, Appendix A]. As with the RD and AS methods, the proof is complicated by the fact that the direction mapping x ↦ X^{2γ}r(x) is undefined on the relative boundary of Λ. Even when it is defined and continuous on the relative boundary of Λ, as is the case under Assumption 1, it may be zero at a non-stationary point.

Theorem 1 Assume Λ^0 = {x ∈ Λ | f(x) ≥ f(x^0)} is bounded. Let {x^t} be generated by the method (6), (7) with {α^t} chosen by the Armijo rule (9) and {α^t_0} satisfying (10). Then the following results hold with g^t = ∇f(x^t).

(a) x^t ∈ riΛ for all t, {f(x^t)} is nondecreasing, and {x^t}, {d^t} are bounded.

(b) Assume inf_t α^t_0 > 0. Then {(g^t)^T d^t} → 0, {(X^t)^γ r(x^t)} → 0, and every cluster point x̄ of {x^t} satisfies

Diag(x̄)(∇f(x̄) − A^T p) = 0 for some p ∈ R^m.      (12)

If sup_t α^t_0 < ∞, then {x^{t+1} − x^t} → 0.

(c) Suppose inf_t α^t_0 > 0, sup_t α^t_0 < ∞, Assumption 1 holds, and either (i) f is concave or convex, or (ii) Λ_cs consists of isolated points, or (iii) every x ∈ Λ_cs satisfies strict complementarity (i.e., x_j − r(x)_j ≠ 0 for all j), where

Λ_cs = {x ∈ Λ | Diag(x)r(x) = 0, f(x) = lim_{t→∞} f(x^t)}.

Then every cluster point of {x^t} is a stationary point of (1). Under (ii), {x^t} converges.

(d) If inf_t α^t_0 > 0 and ∇f is Lipschitz continuous on Λ, then inf_t α^t > 0.

Proof. (a) Since x^0 ∈ riΛ, by using (6) and Lemma 1 and an induction argument on t, we have that x^t ∈ riΛ and α^t > 0 for all t. Also, α^t satisfies (9), so (6) implies

f(x^{t+1}) − f(x^t) ≥ σα^t(g^t)^T d^t = σα^t‖(X^t)^{−γ}d^t‖²  ∀t,      (13)

where the equality uses Lemma 1. Thus {f(x^t)} is nondecreasing. Then x^t ∈ Λ^0 for all t. Since Λ^0 is bounded, this implies {x^t} is bounded. Also, (6) implies

‖d^t‖ = ‖(X^t)^{2γ}r(x^t)‖ ≤ max_j (x^t_j)^γ ‖(X^t)^γ r(x^t)‖

and (7) implies

‖(X^t)^γ r(x^t)‖ = ‖P^t_γ (X^t)^γ g^t‖ ≤ ‖(X^t)^γ g^t‖,

where P^t_γ is the matrix of orthogonal projection onto the null space of A(X^t)^γ. Since {x^t} is bounded and ∇f is continuous, so that {g^t} is bounded, this shows that {d^t} is bounded.

(b) Suppose inf_t α^t_0 > 0. Let x̄ be any cluster point of {x^t}. Since x^t ∈ Λ^0 for all t and Λ^0 is closed, x̄ ∈ Λ^0. Since f is continuous and, by (a), {f(x^t)} is nondecreasing, we have {f(x^t)} ↑ f(x̄) and hence {f(x^{t+1}) − f(x^t)} → 0. Then (13) implies

{α^t(g^t)^T d^t} → 0.      (14)

Consider any subsequence {x^t}_{t∈T} (T ⊆ {0, 1, . . .}) converging to x̄. Let ḡ = ∇f(x̄). By further passing to a subsequence if necessary, we may assume that either (i) inf_{t∈T} α^t > 0 or (ii) {α^t}_{t∈T} → 0.


In case (i), we have from (14) that {(g^t)^T d^t}_{t∈T} → 0. In case (ii), we have from inf_t α^t_0 > 0 that α^t < α^t_0 for all t ∈ T sufficiently large, implying that the ascent condition (9) is violated by α = α^t/β, i.e.,

(f(x^t + (α^t/β)d^t) − f(x^t)) / (α^t/β) < σ(g^t)^T d^t.      (15)

Since {d^t} is bounded, by further passing to a subsequence if necessary, we can assume that {d^t}_{t∈T} → some d̄. Since {α^t}_{t∈T} → 0 and f is continuously differentiable, the above inequality yields in the limit that

ḡ^T d̄ ≤ σḡ^T d̄.

Since 0 < σ < 1, this implies ḡ^T d̄ ≤ 0. Thus lim sup_{t∈T, t→∞}(g^t)^T d^t ≤ 0. Since (g^t)^T d^t ≥ 0 for all t, this implies {(g^t)^T d^t}_{t∈T} → 0. Then Lemma 1 implies

{(X^t)^γ r(x^t)}_{t∈T} → 0,

and hence {r(x^t)_{J^c}}_{t∈T} → 0, where J^c = {j | x̄_j > 0}. By (7), the system of linear equations in p ∈ R^m

(g^t − A^T p)_j = r(x^t)_j  ∀j ∈ J^c

has a solution. Let p^t be its least 2-norm solution. Since the coefficient matrix in this system does not change with t, it follows from {g^t}_{t∈T} → ∇f(x̄) and {r(x^t)_{J^c}}_{t∈T} → 0 that {p^t}_{t∈T} converges to some p̄ satisfying (∇f(x̄) − A^T p̄)_j = 0 for all j ∈ J^c. Thus (12) holds.

Since {x^t} is bounded by (a), the above argument shows that {(g^t)^T d^t} → 0, {(X^t)^γ r(x^t)} → 0, and every cluster point x̄ satisfies (12).

By (13), for all t,

(g^t)^T d^t = ‖(X^t)^{−γ}d^t‖² = (1/(α^t)²)‖(X^t)^{−γ}(x^{t+1} − x^t)‖² ≥ (1/(α^t_0)²) ‖x^{t+1} − x^t‖² / max_j(x^t_j)^{2γ}.

If sup_t α^t_0 < ∞, then since {(g^t)^T d^t} → 0 and {x^t} is bounded by (a), this implies {x^{t+1} − x^t} → 0.

(c) Suppose that inf_t α^t_0 > 0, sup_t α^t_0 < ∞, and Assumption 1 holds. Let x̄ be any cluster point of {x^t}. Let X̄ = Diag(x̄), r̄ = r(x̄). We have from {(X^t)^γ r(x^t)} → 0 in (b) that X̄r̄ = 0 or, equivalently,

X̄(ḡ − A^T p̄) = 0,


where ḡ = ∇f(x̄) and p̄ = (AX̄^{2γ}A^T)^{−1}AX̄^{2γ}ḡ. Thus x̄ belongs to Λ_cs, and x̄ is a stationary point of (1) if and only if r̄ ≤ 0.

Suppose that f is concave or convex. We show below that r̄ ≤ 0. The argument is similar to one used by Gonzaga and Carlos [11]; also see [19, Section 3.3]. First, we have the key result that

{r(x^t)} → r̄.      (16)

Its proof is given in Appendix A. If r̄ ≰ 0, then there would exist some j ∈ {1, . . . , n} such that r̄_j > 0. Then X̄r̄ = 0 implies x̄_j = 0. By (16), there exists t̄ such that r(x^t)_j > 0 for all t ≥ t̄, so that d^t_j > 0 for all t ≥ t̄ and hence x^t_j > x^{t̄}_j > 0 for all t > t̄. This contradicts x̄_j = 0.

Suppose that, instead of f being concave or convex, Λ_cs consists of isolated points. Since {x^{t+1} − x^t} → 0, the set of cluster points of {x^t} forms a continuum; since by (b) this set lies in Λ_cs, which consists of isolated points, we obtain {x^t} → x̄. Then (16) holds and the same argument as above yields r̄ ≤ 0.

Suppose instead that every x ∈ Λ_cs satisfies strict complementarity. Let

Λ̄_cs = {x ∈ Λ_cs | r(x)_{J̄_0} = 0, r(x)_{J̄_+} > 0, r(x)_{J̄_−} < 0},

where J̄_0 = {j | r̄_j = 0}, J̄_+ = {j | r̄_j > 0}, J̄_− = {j | r̄_j < 0}. Since every x ∈ Λ_cs satisfies strict complementarity, Λ̄_cs is isolated from the rest of Λ_cs, i.e., there exists a δ > 0 such that (Λ̄_cs + δB) ∩ Λ_cs = Λ̄_cs, where B denotes the unit Euclidean ball centered at the origin. Since the set of cluster points of {x^t} forms a continuum and, by (b), is contained in Λ_cs, this set must in fact be contained in Λ̄_cs. Hence for every j ∈ J̄_+ we have r(x^t)_j > 0 for all t sufficiently large, so d^t_j > 0 for all t sufficiently large, implying lim inf_t x^t_j > 0, a contradiction to x̄_j = 0. Thus J̄_+ = ∅, i.e., r̄ ≤ 0.

(d) Suppose that inf_t α^t_0 > 0 and ∇f is Lipschitz continuous on Λ with Lipschitz constant L ≥ 0. Then it is readily shown using the mean value theorem that

∇f(x)^T(y − x) − (L/2)‖y − x‖² ≤ f(y) − f(x)  ∀x, y ∈ Λ

(see [1, page 667]). For each t ∈ {0, 1, ...}, either α^t = α^t_0 or else (9) is violated by α = α^t/β, i.e., (15) holds. In the second case, applying the above inequality with x = x^t, y = x^t + (α^t/β)d^t and using (15) yields

(α^t/β)(g^t)^T d^t − (L/2)(α^t/β)²‖d^t‖² ≤ f(x^t + (α^t/β)d^t) − f(x^t) < (α^t/β)σ(g^t)^T d^t.


Dividing both sides by α^t/β and rearranging terms yields

(1 − σ)(g^t)^T d^t ≤ (L/2)(α^t/β)‖d^t‖².

By Lemma 1,

(1 − σ)‖d^t‖²/(max_j x^t_j)^{2γ} ≤ (1 − σ)‖(X^t)^{−γ}d^t‖² ≤ (L/2)(α^t/β)‖d^t‖².

Since α^t ≠ α^t_0, we have d^t ≠ 0, so this yields (1 − σ)/(max_j x^t_j)^{2γ} ≤ Lα^t/(2β). Thus in both cases we have

α^t ≥ min{α^t_0, 2β(1 − σ)/(L(max_j x^t_j)^{2γ})}.      (17)

Since inf_t α^t_0 > 0 and {x^t} is bounded, this shows that inf_t α^t > 0.

The assumption in Theorem 1(c)-(d) of (10) and inf_t α^t_0 > 0 is reasonable since, by (a), {d^t} is bounded, so the right-hand side of (10) is uniformly bounded away from zero. Theorem 1(d) will be used in the convergence rate analysis of the next section. Similar to the observation in [1, page 45], Theorem 1(a)-(c) extends to the limited maximization rule (11) or any stepsize rule that yields a larger ascent than the Armijo rule at each iteration.

Corollary 1 Theorem 1(a)-(c) still holds if, in the interior-point method (6), (7), α^t more generally satisfies

0 < α^t ≤ α^t_0,  f(x^t + α^t d^t) ≥ f(x^t + α^t_armijo d^t),

where α^t_armijo is chosen by the Armijo rule (9).

Proof. Theorem 1(a) clearly holds. Theorem 1(b) holds since (13), (14), (15) in its proof still hold with α^t replaced by α^t_armijo. This yields {(g^t)^T d^t} → 0 and {d^t} → 0. The proof of Theorem 1(c) is modified accordingly.

Theorem 1(c) under condition (i) is similar to [11, Section 3] and [19, Theorem 3.14] for the case of AS methods (γ = 1). Theorem 1(c) under condition (ii) is similar to [6, Theorem 2.2] and [30, Theorem 3] for the case of AS methods. In particular, Assumption 1 is equivalent to (H3) in [6], and condition (ii) and Λ_cs are refinements of, respectively, (H1) and the (OS)_I solutions in [6]. When f is quadratic, Λ_cs consists of isolated points if and only if it is a finite set.

The convergence of {x^t} for AS methods has been much studied. In the case of linear f, convergence has been shown for first-order AS methods [14, 18, 28]. In the cases of concave quadratic f, or a more general class of quadratic f and box constraints, convergence has been shown for second-order AS methods [17, 25, 27]. For more general f, convergence has been shown for AS methods under Assumption 1 and condition (ii) in Theorem 1 [30, Theorem 3], [6, Theorem 2.2], and for second-order AS methods assuming f is concave or convex and ∇²f has a constant null space property [26], [19, Theorem 4.12].

5 Sublinear convergence when f is quadratic

In this section we show that, in the special case where f is quadratic, the sequence {f(x^t)} generated by the first-order interior-point method (6), (7) converges sublinearly. The proof, which uses Lemma 1 and Theorem 1, adapts the linear convergence analysis of a second-order AS method with line search [27, Theorem 1]. To our knowledge, this is the first rate-of-convergence result for a first-order interior-point method when f is nonlinear, and it does not assume primal nondegeneracy (Assumption 1). Moreover, by adapting the proof of [15, Theorem 3.2], we show that the generated iterates {x^t} converge sublinearly for γ < 1 and, under primal nondegeneracy, the limit is stationary for (1). If in addition γ ≤ 1/2 and strict complementarity holds, then convergence is linear. But if sup_t α^t_0 < ∞, 1/2 ≤ γ < 1, and strict complementarity fails, then convergence cannot be linear. This suggests γ < 1 may be preferable to γ ≥ 1, which is corroborated by the numerical results in Section 6.

Theorem 2 Assume f(x) = x^T Qx/2 + c^T x for some symmetric Q ∈ R^{n×n} and c ∈ R^n. Assume Λ^0 = {x ∈ Λ | f(x) ≥ f(x^0)} is bounded. Let {x^t} be generated by the method (6), (7) with {α^t} chosen by the Armijo rule (9) and {α^t_0} satisfying (10) and inf_t α^t_0 > 0. Then the following results hold with ω = 1/(γ̄ − 1) and γ̄ = max{1 + γ, 2γ}.

(a) There exist υ ∈ R and C > 0 (depending on x^0) such that

0 ≤ υ − f(x^t) ≤ C t^{−ω}  ∀t ≥ 1.      (18)


(b) Assume γ < 1. Then there exist x̄ ∈ Λ^0 and C′ > 0 (depending on x^0) such that

‖x̄ − x^t‖ ≤ C′ t^{−(1−γ)/(2γ)}  ∀t ≥ 1.

Suppose Assumption 1 also holds. Then x̄ is a stationary point of (1). Moreover, if γ ≤ 1/2 and x̄ − r(x̄) > 0, then {f(x^t)} converges linearly in the quotient sense and {‖x̄ − x^t‖} converges linearly in the root sense. If instead sup_t α^t_0 < ∞, γ ≥ 1/2, and x̄ − r(x̄) ≯ 0, then {‖x̄ − x^t‖} cannot converge linearly.

Proof. We have from Theorem 1(a) and its proof that {f(x^t)} is nondecreasing and (13) holds or, equivalently,

f(x^{t+1}) − f(x^t) ≥ σα^t‖η^t‖²  ∀t,      (19)

where we let η^t = (X^t)^γ r^t, r^t = r(x^t). Thus {f(x^t)} converges to a limit υ and {f(x^{t+1}) − f(x^t)} → 0. Since ∇f is Lipschitz continuous on Λ, Theorem 1(d) implies inf_t α^t > 0. Then (19) implies {η^t} → 0.

For any J ⊆ {1, ..., n}, let

T_J = {t ∈ {0, 1, ...} | x^t_j ≤ |η^t_j|^{1/(1+γ)} ∀j ∈ J,  |r^t_j| < |η^t_j|^{1/(1+γ)} ∀j ∈ J^c},      (20)

where J^c = {1, . . . , n}\J. Since (x^t_j)^γ|r^t_j| = |η^t_j|, so that either x^t_j ≤ |η^t_j|^{1/(1+γ)} or else |r^t_j| < |η^t_j|^{1/(1+γ)}, it follows that each t ∈ {0, 1, ...} belongs to exactly one set T_J for some J (because evidently J ≠ K ⊆ {1, ..., n} implies T_J ∩ T_K = ∅). Since the number of subsets J is finite, there is at least one J such that T_J is infinite.

Consider any J such that T_J is infinite. Then, for each t ∈ T_J, the following linear system in (x, p)

x_J = x^t_J,  Q_j^T x − A_j^T p = −c_j + r^t_j  ∀j ∈ J^c,  x ≥ 0,  Ax = b,      (21)

has a solution, e.g., (x, p) = (x^t, p^t), where p^t = (A(X^t)^{2γ}A^T)^{−1}A(X^t)^{2γ}g^t and Q_j, A_j denote the jth column of Q, A, respectively.


Now, let ‖·‖_ν denote the ν-norm. (We drop the subscript ν for the Euclidean norm, ν = 2.) By the definition (20),

(‖(x^t_J, r^t_{J^c})‖_{1+γ})^{1+γ} ≤ ‖η^t‖_1  ∀t ∈ T_J,      (22)

so {η^t} → 0 yields {(x^t_J, r^t_{J^c})}_{t∈T_J} → 0. Thus, the right-hand side of (21) is uniformly bounded for t ∈ T_J, so an error bound of Hoffman [13] implies that (21) has a solution (y^t, q^t) that is bounded for t ∈ T_J. Since {(x^t_J, r^t_{J^c})}_{t∈T_J} → 0, any cluster point (y, q) of {(y^t, q^t)}_{t∈T_J} satisfies

y_J = 0,  Q_j^T y − A_j^T q = −c_j  ∀j ∈ J^c,  y ≥ 0,  Ay = b.      (23)

Thus, this linear system has a solution. Let Σ_J denote the set of solutions of (23). Since (x^t, p^t) is a solution of (21), an error bound of Hoffman [13] implies there exists (x̄^t, p̄^t) ∈ Σ_J satisfying

‖(x̄^t, p̄^t) − (x^t, p^t)‖ ≤ C_1‖(x^t_J, r^t_{J^c})‖_{1+γ}  ∀t ∈ T_J,      (24)

where C_1 is a constant depending on Q, A, and J only.

We claim that f is constant on each Σ_J. If (y, q) and (y′, q′) both belong to Σ_J, then (23) yields

f(y′) − f(y) = (1/2)(y′ − y)^T Q(y′ − y) + (Qy + c)^T(y′ − y)
            = (1/2)(y′ − y)^T Q(y′ − y) + (Qy + c − A^T q)^T(y′ − y)
            = (1/2)(y′ − y)^T Q(y′ − y),

where the second equality uses A(y′ − y) = 0 and the third equality uses y′_J = y_J and Q_j^T y − A_j^T q = −c_j for all j ∈ J^c. A symmetric argument yields

f(y) − f(y′) = (1/2)(y − y′)^T Q(y − y′).

Combining the above two equalities yields f(y′) = f(y).

For any t ∈ T_J, we have

(Qx̄^t + c)^T(x^t − x̄^t) = (Qx̄^t + c − A^T p̄^t)^T(x^t − x̄^t)
                        = ∑_{j∈J}(Q_j^T x̄^t + c_j − A_j^T p̄^t) x^t_j
                        = ∑_{j∈J}(Q_j^T(x̄^t − x^t) − A_j^T(p̄^t − p^t) + r^t_j) x^t_j,


where the first equality uses A(x^t − x̄^t) = 0, and the second equality uses (x̄^t, p̄^t) ∈ Σ_J. If γ ≤ 1, then this together with

f(x^t) − f(x̄^t) = (1/2)(x^t − x̄^t)^T Q(x^t − x̄^t) + (Qx̄^t + c)^T(x^t − x̄^t)

and the definition η^t = (X^t)^γ r^t yields

|f(x̄^t) − f(x^t)| = |(1/2)(x^t − x̄^t)^T Q(x^t − x̄^t) + ∑_{j∈J}((Q_j^T(x̄^t − x^t) − A_j^T(p̄^t − p^t))x^t_j + x^t_j r^t_j)|
  ≤ (1/2)|(x^t − x̄^t)^T Q(x^t − x̄^t)| + ∑_{j∈J}(|(Q_j^T(x̄^t − x^t) − A_j^T(p̄^t − p^t))x^t_j| + (x^t_j)^{1−γ}|η^t_j|)
  ≤ C_2‖x^t − x̄^t‖² + ∑_{j∈J}(‖(Q_j, −A_j)‖ ‖(x̄^t, p̄^t) − (x^t, p^t)‖ x^t_j + (x^t_j)^{1−γ}|η^t_j|)
  ≤ C_2C_1²‖η^t‖_1^{2/(1+γ)} + ∑_{j∈J}(‖(Q_j, −A_j)‖ C_1‖η^t‖_1^{1/(1+γ)} x^t_j + (x^t_j)^{1−γ}|η^t_j|)
  ≤ C_2C_1²‖η^t‖_1^{2/(1+γ)} + C_1∑_{j∈J}(‖(Q_j, A_j)‖ ‖η^t‖_1^{1/(1+γ)}|η^t_j|^{1/(1+γ)} + |η^t_j|^{2/(1+γ)}),

where C_2 is a constant depending on Q only; the first inequality uses x^t_j|r^t_j| = (x^t_j)^{1−γ}|η^t_j|; the third inequality uses (22) and (24); the last inequality uses (20), γ ≤ 1, and t ∈ T_J. It follows that

|f(x̄^t) − f(x^t)| ≤ C_J ‖η^t‖_1^{2/(1+γ)}  ∀t ∈ T_J,      (25)

where C_J is a constant depending on Q, A, J, γ, and x^0. If γ > 1 instead, then by using x^t_j|r^t_j| = |η^t_j|^{1/γ}|r^t_j|^{1−1/γ} and sup_t ‖r^t‖_∞^{1−1/γ} < ∞, we similarly obtain (25) but with "‖η^t‖_1^{2/(1+γ)}" replaced by "‖η^t‖_1^{1/γ}" (also using 2/(1+γ) > 1/γ). Let C_3 be the maximum of C_J over all J such that T_J is infinite.

Since {f(x^t)} ↑ υ and {η^t} → 0, it follows from (25) that {f(x̄^t)}_{t∈T_J} → υ. Since x̄^t ∈ Σ_J for all t ∈ T_J and f is constant on Σ_J, this implies f(x̄^t) = υ for all t ∈ T_J. This, together with (25), the subsequent remark, and C_J ≤ C_3, yields


υ − f(x^t) = f(x̄^t) − f(x^t) ≤ C_3‖η^t‖_1^{min{2/(1+γ), 1/γ}} = C_3‖η^t‖_1^{2/γ̄}      (26)
           ≤ κ(f(x^{t+1}) − f(x^t))^{1/γ̄}

for all t ∈ T_J with some constant κ > 0, where the second inequality uses (19) and inf_t α^t > 0. The above inequality yields, upon letting Δ^t = υ − f(x^t) and rearranging terms,

Δ^{t+1} ≤ Δ^t − (Δ^t/κ)^{γ̄}.      (27)

This holds for all t ∈ T_J and all J such that T_J is infinite. Since each t ∈ {0, 1, ...} belongs to T_J for some J and there are only finitely many subsets J, (26) and (27) hold for all t sufficiently large, say t ≥ T.

(a) Take C ≥ κ^{γ̄/(γ̄−1)} sufficiently large so that (18) holds for t = 1, . . . , T. Then an induction argument shows that (18) holds for all t ≥ 1. In particular, if (18) holds for some t ≥ 1, then we have from C ≥ κ^{γ̄/(γ̄−1)} that (C/κ)^{γ̄} ≥ C, and hence (27) yields

Δ^{t+1} ≤ Δ^t − (Δ^t/κ)^{γ̄} ≤ C/t^ω − (C/(κt^ω))^{γ̄} ≤ C(1/t^ω − 1/t^{ωγ̄}) ≤ C/(t + 1)^ω,

where the last inequality holds since ωγ̄ = 1 + ω and (1 − 1/(t+1))^ω ≥ 1 − 1/(t+1) (using ω ≤ 1). This proves (18).

(b) Assume γ < 1. Then γ̄ = 1 + γ < 2. Using (6), we have for all t ≥ T that

‖x^{t+1} − x^t‖ = α^t‖(X^t)^γ η^t‖ ≤ α^t‖x^t‖_∞^γ ‖η^t‖ = α^t‖x^t‖_∞^γ ‖η^t‖²/‖η^t‖ ≤ √n ‖x^t‖_∞^γ α^t‖η^t‖²/‖η^t‖_1 ≤ √n ‖x^t‖_∞^γ ((Δ^t − Δ^{t+1})/σ)(C_3/Δ^t)^{γ̄/2},

where the last inequality uses (19) and (26), which hold for all t ≥ T as explained above.


Since x^t lies in the bounded set Λ^0, so that sup_t ‖x^t‖_∞ < ∞, this implies

‖x^{t+1} − x^t‖ ≤ C_4(Δ^t − Δ^{t+1})(Δ^t)^{−γ̄/2} ≤ C_4 ∫_{Δ^{t+1}}^{Δ^t} τ^{−γ̄/2} dτ = (C_4/(1 − γ̄/2))((Δ^t)^{1−γ̄/2} − (Δ^{t+1})^{1−γ̄/2})

for all t ≥ T, where C_4 > 0 is some constant. Thus for any t_2 ≥ t_1 ≥ T, we have

∑_{t=t_1}^{t_2} ‖x^{t+1} − x^t‖ ≤ (C_4/(1 − γ̄/2))((Δ^{t_1})^{1−γ̄/2} − (Δ^{t_2+1})^{1−γ̄/2}) ≤ (C_4/(1 − γ̄/2))(Δ^{t_1})^{1−γ̄/2}.

Since Δ^{t_1} → 0 as t_1 → ∞, this shows that the sequence {x^t} satisfies Cauchy's criterion for convergence and hence has a unique cluster point x̄. Moreover, the triangle inequality yields

‖x^{t_2+1} − x^{t_1}‖ = ‖∑_{t=t_1}^{t_2}(x^{t+1} − x^t)‖ ≤ ∑_{t=t_1}^{t_2}‖x^{t+1} − x^t‖ ≤ (C_4/(1 − γ̄/2))(Δ^{t_1})^{1−γ̄/2},

so taking t_2 → ∞ and then using (18) yields ‖x̄ − x^{t_1}‖ = O((t_1^{−ω})^{1−γ̄/2}). Moreover, γ̄ = 1 + γ and ω = 1/γ, so ω(1 − γ̄/2) = (1 − γ)/(2γ).

Assume also that Assumption 1 holds. Then {r^t} converges to r̄ = r(x̄) (since r is continuous by Lemma 2) and it readily follows from (6) and {x^t} → x̄ that r̄ ≤ 0. Suppose x̄ − r̄ > 0, i.e., there exists J ⊆ {1, . . . , n} such that

r̄_J < x̄_J = 0  and  x̄_{J^c} > r̄_{J^c} = 0.

Then {(x^t, r^t)} → (x̄, r̄) and η^t = (X^t)^γ r^t imply

x^t_j = O(|η^t_j|^{1/γ}) ∀j ∈ J  and  |r^t_j| = O(|η^t_j|) ∀j ∈ J^c.

Repeating the preceding argument yields (25) with "2/(1+γ)" replaced by "min{2, 1/γ}" and (27) with "γ̄" replaced by "max{1, 2γ}". The latter in turn implies that, for γ ≤ 1/2,

Δ^{t+1} ≤ (1 − 1/κ)Δ^t


for all t sufficiently large, so that {Δ^t} → 0 linearly in the quotient sense and, by the above argument, {‖x^t − x̄‖} → 0 linearly in the root sense. Suppose instead sup_t α^t_0 < ∞, γ ≥ 1/2, and r̄_j = x̄_j = 0 for some j. Then sup_t α^t < ∞ and (6) yields

x^{t+1}_j/x^t_j = 1 + α^t(x^t_j)^{2γ−1}r^t_j → 1.

Thus {x^t_j} cannot converge linearly to 0 in the quotient or root sense, and hence neither can {‖x^t − x̄‖}.

As γ decreases, ω increases, while the proof of Theorem 2(a) suggests that C increases (since κ^{γ̄/(γ̄−1)} → ∞ as γ̄ → 1). This cautions against taking γ too small. Theorem 2(b) shows that γ = 1/2 is a good choice in theory. However, the numerical results in Section 6 suggest the resulting method may be prone to roundoff errors. The convergence of {x^t} for γ = 1 remains an open question. Since the proof of Theorem 2(a) is adapted from that of [27, Theorem 1], why can we prove only sublinear convergence and not linear convergence? This is because the amount of ascent f(x^{t+1}) − f(x^t) is only guaranteed to be of the order of ‖(X^t)^γ r^t‖² (see (19)) instead of ‖X^t r^t‖ (see [27, Lemma 1]). To prove linear convergence, we would need to establish not a constant lower bound on α^t, as in Theorem 1(d), but a lower bound of the order of 1/‖(X^t)^γ r^t‖. This may be the price we pay for the simpler iterations of a first-order interior-point method. The linear convergence results in [19, Lemma 4.11] and [27, Theorem 1] hold for second-order AS methods only.

When γ < 1, the ellipsoid associated with d^t (see (8)) tends to be rounder (since x_j^γ < x_j when x_j > 1 and x_j^γ > x_j when x_j < 1). This may give some intuition for the better convergence behavior of {x^t}.

[Figure 1: The ellipsoid associated with d^t, centered at x^t (n = 2). The figure (not reproduced here) sketches, in the (x_1, x_2)-plane, the ellipsoids corresponding to γ < 1, γ = 1, and γ > 1 relative to the feasible line Ax = b.]


6 Numerical Experience

In order to better understand its practical performance, we have implemented in Matlab the first-order interior-point method (6), (7), with α^t chosen by the Armijo-type rule (9), to solve (1) with a simplex constraint (A = e^T, b = 1) and large n. In this section, we describe our implementation and report our numerical experience on test problems with objective functions f from Moré et al. [21], negated for maximization.

In our implementation, we use six different values of γ (γ = .5, .8, .9, 1, 1.1, 1.2). We use the standard setting of β = .5, σ = .1 for the Armijo rule (9), and we choose

α^t_0 = min{α^t_feas, max{10^{−5}, α^{t−1}/β²}},   α^t_feas = 0.95/(−min_j(d^t_j/x^t_j)),

with α^{−1} = ∞. Since e^T d^t = 0, we have α^t_feas > 0 whenever d^t ≠ 0. Thus {α^t_0} satisfies (10) and inf_t α^t_0 > 0. Moreover, if α^{t−1} is small, then so is α^t_0. This can save on function evaluations since often α^t ≈ α^{t−1}. In our experience, this yields much better performance than α^t_0 = .99/‖X^t r(x^t)‖ used in [6, Algorithm 1] for the case of γ = 1.
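In code, this stepsize initialization might look as follows (a sketch under the stated choices β = .5 and α^{−1} = ∞; the variable names are ours):

```python
import numpy as np

def initial_stepsize(d, x, alpha_prev, beta=0.5):
    """Heuristic alpha_0^t = min{alpha_feas^t, max{1e-5, alpha^{t-1} / beta^2}}."""
    ratios = d / x
    # feasibility cap: 0.95 of the step to the boundary (infinite if d >= 0)
    alpha_feas = np.inf if ratios.min() >= 0 else -0.95 / ratios.min()
    return min(alpha_feas, max(1e-5, alpha_prev / beta**2))
```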

For f, we choose 9 test functions with n = 1000 from the set of nonlinear least squares functions used by Moré et al. [21] and negate them. These functions, listed in Table 1 with the numbering from [21, pages 26-28] shown in parentheses, are chosen for their diverse characteristics: convex or nonconvex, sparse or dense Hessian, well-conditioned or ill-conditioned Hessian, and are grouped accordingly. The first three functions ER, DBV, BT are nonconvex, with sparse Hessian. The next two functions TRIG, BAL are nonconvex, with dense Hessian. The sixth function EPS is convex, with sparse Hessian. The seventh function VD is strongly convex, with dense Hessian. The last two functions LR1, LR1Z are convex quadratic with dense Hessian of rank 1. The functions ER and EPS have block-diagonal Hessians, and VD, LR1, LR1Z have ill-conditioned Hessians. Upon negation, these convex functions become concave. The functions and gradients are coded in Matlab using vector operations.

Since the starting points given in [21] may not satisfy the simplex constraint, we use the starting point

x^0 = e/n

for the first-order interior-point method. We terminate the method when the residual ‖min{x^t, −r^t}‖ falls below a tolerance tol > 0. Note that r^t depends on γ. We set tol = 10^{−6} for all problems except DBV, BT, EPS, and LR1Z. For DBV, BT, and EPS this is too tight, and tol = 10^{−4}, tol = 10^{−3}, and tol = 10^{−3} are used instead. For LR1Z this is too loose, and tol = 10^{−7} is used instead. Roundoff error in Matlab occasionally causes this termination criterion never to be met, in which case we quit. In particular, we quit at iteration t if the Armijo ascent condition (9) is still not met when α falls below 10^{−20}.

Table 1 reports the number of iterations (iter), the number of f-evaluations (nf), the CPU time (in seconds), the final objective value (negated), the final residual (resid), the final stepsize (step), and the strict complementarity measure min_j(x_j − r_j(x)) (sc). All runs are performed on an HP DL360 workstation running Red Hat Linux 3.5 and Matlab (Version 7.0). We see from Table 1 that iter and nf vary with γ, but the best performance is obtained at γ = .8, with lower nf and iter on average. The ratio nf/iter is typically below 4, suggesting that the Armijo rule typically uses few f-evaluations before accepting a stepsize. For a simple first-order method, the number of iterations looks quite reasonable on all problems except BT (whose Hessian is tridiagonal), for which, interestingly, γ = .5 yields the best performance. However, the method seems prone to roundoff error when γ = .5. The final stepsize tends to increase with γ. This is consistent with the lower bound (17), which increases with γ (since max_j x^t_j < 1). When γ > 1, a smaller residual may be needed to achieve the same accuracy in the final objective value; see LR1 and LR1Z. For the nonconvex functions BT and TRIG, multiple local minima could exist and, depending on the starting point, the method can converge to different local minima with different objective values.

The sc values in Table 1 suggest that strict complementarity holds for all problems. To see how the method performs on a degenerate problem, we test it on f(x) = −(e^T x)² − ∑_{j=1}^{n−1} x_j², for which (1) has a unique stationary point x̄ = (0, ..., 0, 1)^T that violates strict complementarity at all except one component. On this problem, γ = .5 yields the best performance; e.g., when tol = 10^{−6}, it terminates with iter = 7, nf = 8, cpu = .01, and sc = 2·10^{−8}.


7 Monotonely Transformed Methods

In [22, 23, 24, 4] an interesting exponential variant of (4) is studied for solving the special case of (1) with homogeneous quadratic f and simplex constraints. We extend this variant to allow for general f and a general monotone transformation as follows. Suppose Λ is the unit simplex. Consider the method (6) but, instead of (7), use

r(x) = ψ(∇f(x)) − (e^T X^{2γ}ψ(∇f(x)) / e^T X^{2γ}e) e  with X = Diag(x),      (28)

where ψ : R → R is any strictly increasing function, and ψ(y) means applying ψ to y ∈ R^n componentwise. Taking ψ(τ) = e^τ yields the exponential variant when γ = 1/2. Taking ψ(τ) = τ yields the RD method (4) when γ = 1/2 and the first-order AS method (5) when γ = 1. To avoid numerical overflow, we can replace the exponential by a polynomial when τ exceeds some threshold. This monotone transformation, which is related to payoff monotonic game dynamics [23, Eq. (8)], seems to apply only in the case of simplex constraints.
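A minimal sketch of the transformed direction (28) for the single-simplex case, using ψ(τ) = exp(θτ), might look as follows (the choice of θ and the overflow clipping threshold are our own, not from the paper):

```python
import numpy as np

def transformed_direction(x, grad, gamma=0.5, theta=1.0):
    """Direction d = X^{2*gamma} r(x) with r(x) from (28) and psi(tau) = exp(theta*tau)."""
    psi = np.exp(np.clip(theta * grad, None, 700.0))   # guard against overflow
    X2g = x ** (2 * gamma)
    r = psi - (X2g @ psi) / X2g.sum()                  # subtract the weighted average
    return X2g * r                                     # e^T d = 0 by construction
```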

Lemma 3 Suppose Λ is the unit simplex, i.e., A = e^T, b = 1. For any x ∈ Λ, let X and r(x) be given by (28) with ψ any strictly increasing function. Then the following results hold.

(a) Xr(x) = 0 if and only if X(∇f(x) − p(x)e) = 0, and r(x) ≤ 0 if and only if ∇f(x) − p(x)e ≤ 0, where

p(x) = ψ^{−1}(e^T X^{2γ}ψ(∇f(x)) / e^T X^{2γ}e).

(b) d = X^{2γ}r(x) satisfies e^T d = 0 and ∇f(x)^T d ≥ 0. Moreover, ∇f(x)^T d = 0 if and only if Xr(x) = 0.

Proof. (a) This follows from (28) and the strictly increasing property of ψ.

(b) For any x ∈ Λ, d = X^{2γ}r(x) satisfies e^T d = 0 and

g^T d = g^T X^{2γ}(ψ(g) − (e^T X^{2γ}ψ(g)/e^T X^{2γ}e) e)
      = (g − ρe)^T X^{2γ}ψ(g)
      = (g − ρe)^T X^{2γ}(ψ(g) − ψ(ρ)e)
      ≥ 0,


where we let g = ∇f(x) and ρ = g^T X^{2γ}e / e^T X^{2γ}e. We used the monotonicity of ψ, which implies (y − z)[ψ(y) − ψ(z)] ≥ 0 for any two y, z ∈ R. Moreover, the inequality is strict unless g_j = ρ for all j with x_j ≠ 0. In this case, Xψ(g) = ψ(ρ)x and X^{2γ}ψ(g) = ψ(ρ)X^{2γ}e, so that

Xr(x) = X(ψ(g) − (ψ(ρ)e^T X^{2γ}e / e^T X^{2γ}e) e) = ψ(ρ)x − ψ(ρ)x = 0.

The converse implication Xr(x) = 0 ⇒ g^T d = 0 follows similarly.

Remark 1 In the case of ψ(τ) = exp(θτ) with θ > 0, we have

∇f(x)^T d ≥ (4/θ)‖X^γ(exp(θg/2) − exp(θρ/2)e)‖²

(cf. Lemma 1). This follows from the inequality (y − z)[exp(y) − exp(z)] ≥ 4[exp(y/2) − exp(z/2)]², which holds for any y, z ∈ R.

Extension of the above analysis to the case of multiple simplices is straightforward.

Lemma 4 Let Λ be a Cartesian product of m unit simplices of dimensions n_1, . . . , n_m (so that ∑_{i=1}^m n_i = n), corresponding to A ∈ {0, 1}^{m×n} satisfying A^T e = e ∈ R^n, Ae = (n_1, . . . , n_m)^T, and b = e ∈ R^m. For any x ∈ Λ, let g = ∇f(x) and

r(x) = [I − A^T(AX^{2γ}A^T)^{−1}AX^{2γ}]ψ(g)  with X = Diag(x),      (29)

where ψ : R → R is any strictly increasing function. Then the following results hold.

(a) Xr(x) = 0 if and only if X(g − A^T p(x)) = 0, and r(x) ≤ 0 if and only if g − A^T p(x) ≤ 0, where

p(x) = ψ^{−1}((AX^{2γ}A^T)^{−1}AX^{2γ}ψ(g)).      (30)

(b) d = X^{2γ}r(x) satisfies Ad = 0 and g^T d ≥ 0. Moreover, g^T d = 0 if and only if Xr(x) = 0.


Proof. The proof of (a) is as in Lemma 3 above. For the proof of (b), we analogously let ρ = (AX^{2γ}A^T)^{−1}AX^{2γ}g. Furthermore we observe that A^Tψ(ρ) = ψ(A^Tρ) by the properties of A, and obtain

g^T d = g^T X^{2γ}(ψ(g) − A^T(AX^{2γ}A^T)^{−1}AX^{2γ}ψ(g))
      = (g − A^Tρ)^T X^{2γ}ψ(g)
      = (g − A^Tρ)^T X^{2γ}(ψ(g) − A^Tψ(ρ))
      = (g − A^Tρ)^T X^{2γ}(ψ(g) − ψ(A^Tρ))
      ≥ 0,

where the inequality is justified as in the proof of Lemma 3. Again, the inequality is strict unless X(ψ(g) − A^Tψ(ρ)) = 0, which can be used to show that Xr(x) = 0 if and only if g^T d = 0.

Lemma 4 remains true if different ψ functions are used for different simplices. Lemmas 3 and 4 generalize [4, Theorem 3] for the special case of γ = 1/2 and ψ(τ) = exp(θτ) with θ > 0. By Lemma 4(b), d = X^{2γ}r(x) is a feasible ascent direction at every x ∈ riΛ. Using Lemma 4, we can extend Theorem 1(a)-(c) and Corollary 1 to the monotonely transformed method, assuming furthermore that ψ is continuous.

Theorem 3 Assume Λ is a Cartesian product of m unit simplices, with A and b given as in Lemma 4. Then Theorem 1(a)-(b) still holds. Moreover, Theorem 1(c) under condition (ii) or (iii) still holds if r is instead given by (29) with ψ any continuous strictly increasing function. This remains true when the Armijo rule is replaced by any stepsize rule that yields a larger ascent, as described in Corollary 1.

Proof. It is readily verified that Assumption 1 holds, so that, by Lemma 2 and continuity of ψ, r given by (29) is continuous on Λ. It is then readily seen from its proof that Theorem 1(a) still holds.

The proof of Theorem 1(b) yields that {(g^t)^T d^t}_{t∈T} → 0, where {x^t}_{t∈T} (T ⊆ {0, 1, . . .}) is any subsequence of {x^t} converging to some x̄. Since r is continuous on Λ, this yields in the limit that ḡ^T X̄^{2γ}r(x̄) = 0, where ḡ = ∇f(x̄) and X̄ = Diag(x̄). By Lemma 4(b), X̄r(x̄) = 0. Assuming sup_t α^t_0 < ∞, we prove that {x^{t+1} − x^t} → 0 as follows.


Since {x^t} is bounded and every cluster point x̄ satisfies X̄^γ r(x̄) = 0, the continuity of r implies {d^t} = {(X^t)^{2γ}r(x^t)} → 0. Since sup_t α^t_0 < ∞, so that sup_t α^t < ∞, we get {x^{t+1} − x^t} → 0.

The proof of Theorem 1(c) under condition (ii) or (iii) still holds. The proof shows that every cluster point x̄ of {x^t} satisfies r(x̄) ≤ 0 as well as X̄r(x̄) = 0. By Lemma 4(a), x̄ satisfies X̄(ḡ − A^T p(x̄)) = 0 and ḡ − A^T p(x̄) ≤ 0, where p is given by (30). Thus x̄ is a stationary point of (1).

It is not known whether Theorem 1(c) under condition (i) or Theorem 2 can be extended to the monotonely transformed method.

8 Appendix A

In this section we assume f is concave or convex and prove (16), following the line of analysis in [11] and [19, Appendix A]. Let J = {j | r̄_j = 0}, J^c = {1, . . . , n} \ J, and

Λ̄ = {x ∈ Λ | x_{J^c} = 0, f(x) = f(x̄)}.

Lemma 5 Λ̄ is convex.

Proof. Since X̄r̄ = 0, x̄ satisfies the Karush-Kuhn-Tucker (KKT) optimality condition for the restricted problem

optimize {f(x) | Ax = b, x_J ≥ 0, x_{J^c} = 0},

where "optimize" means "maximize" when f is concave and "minimize" when f is convex. Thus x̄ is an optimal solution of this restricted problem and Λ̄ is its optimal solution set. Since this problem is equivalent to a convex minimization problem, its optimal solution set is convex.

Lemma 6 If f is constant on a convex subset of R^n, then ∇f is constant on this subset.

Proof. See, e.g., [16].

Using Lemmas 5 and 6, we have the following lemma.

Lemma 7 r(x) = r̄ for all x ∈ Λ̄.


Proof. For any x ∈ Λ̄, Lemmas 5 and 6 imply ∇f(x) = ḡ = r̄ + A^T p̄. Thus (7) yields

r(x) = (I − A^T(AX^{2γ}A^T)^{−1}AX^{2γ})(r̄ + A^T p̄) = (I − A^T(AX^{2γ}A^T)^{−1}AX^{2γ})r̄ = r̄,

where the last equality uses r̄_J = 0 and x_{J^c} = 0, so that X^{2γ}r̄ = 0.

Using Theorem 1(b) and Lemma 7, we have the following lemma.

Lemma 8 Every cluster point of {x^t} is in Λ̄.

Proof. We argue by contradiction. Suppose there exists a cluster point x̂ of {x^t} that is not in Λ̄. Since x̂ ∈ Λ and f(x̂) = f(x̄), this implies x̂_j > 0 for some j ∈ J^c.

Since f(x̄) ≥ f(x^0), Λ̄ lies inside the bounded set Λ^0, so Λ̄ is compact. Then, by Lemma 2, r is uniformly continuous over the compact set (Λ̄ + B) ∩ Λ, where B denotes the unit Euclidean ball centered at the origin. Lemma 7 implies that, for all δ > 0 sufficiently small, we have

|r(x)_j| ≥ |r̄_j|/2  ∀ j ∈ J^c, ∀x ∈ (Λ̄ + δB) ∩ Λ.      (31)

Take δ small enough so that δ < x̂_j. Then x̂ ∉ Λ̄ + δB (since |x̂_j − x_j| = x̂_j > δ for all x ∈ Λ̄). By Theorem 1(b), {x^{t+1} − x^t} → 0, so the cluster points of {x^t} form a continuum. Since there exists a cluster point in Λ̄ and another not in Λ̄ + δB, there must exist a cluster point x̃ in Λ̄ + δB but not in Λ̄. Since x̃ ∈ Λ and f(x̃) = f(x̄), the latter implies x̃_{J^c} ≠ 0. Since x̃ is in Λ̄ + δB, (31) implies |r(x̃)_j| ≥ |r̄_j|/2 for all j ∈ J^c. Thus X̃r(x̃) ≠ 0, where X̃ = Diag(x̃), a contradiction of Theorem 1(b).

Since Λ̄ is compact and r is a continuous mapping on Λ, Lemmas 7 and 8 imply {r(x^t)} → r̄. This proves (16).

References

[1] Bertsekas, D. P., Nonlinear Programming, 2nd edition, Athena Scientific, Belmont, 1999.

[2] Bomze, I. M., On standard quadratic optimization problems, J. Global Optim. 13 (1998), 369–387.

[3] Bomze, I. M., Regularity vs. degeneracy in dynamics, games, and optimization: a unified approach to different aspects, SIAM Rev. 44 (2002), 394–414.

[4] Bomze, I. M., Portfolio selection via replicator dynamics and projection of indefinite estimated covariances, Dyn. Contin. Discrete Impuls. Syst. Ser. B 12 (2005), 527–564.

[5] Bomze, I. M. and Schachinger, W., Multi-standard quadratic optimization problems, Technical Report TR-ISDS 2007-12, University of Vienna, 2007, submitted.

[6] Bonnans, J. F. and Pola, C., A trust region interior point algorithm for linearly constrained optimization, SIAM J. Optim. 7 (1997), 717–731.

[7] Dikin, I. I., Iterative solution of problems of linear and quadratic programming, Soviet Math. Dokl. 8 (1967), 674–675.

[8] Dikin, I. I., Letter to the editor, Math. Program. 41 (1988), 393–394.

[9] Forsgren, A., Gill, P. E., and Wright, M. H., Interior methods for nonlinear optimization, SIAM Rev. 44 (2002), 525–597.

[10] Gill, P. E., Murray, W., and Wright, M. H., Practical Optimization, Academic Press, New York, 1981.

[11] Gonzaga, C. C. and Carlos, L. A., A primal affine scaling algorithm for linearly constrained convex programs, Tech. Report ES-238/90, Department of Systems Engineering and Computer Science, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil, December 1990.

[12] Hofbauer, J. and Sigmund, K., Evolutionary Games and Population Dynamics, Cambridge University Press, Cambridge, UK, 1998.

[13] Hoffman, A. J., On approximate solutions of systems of linear inequalities, J. Res. Natl. Bur. Standards 49 (1952), 263–265.

[14] Luo, Z.-Q. and Tseng, P., On the convergence of the affine-scaling algorithm, Math. Program. 56 (1992), 301–319.

[15] Lyubich, Y., Maistrowskii, G. D., and Ol'khovskii, Yu. G., Selection-induced convergence to equilibrium in a single-locus autosomal population, Probl. Inf. Transm. 16 (1980), 66–75.

[16] Mangasarian, O. L., A simple characterization of solution sets of convex programs, Oper. Res. Lett. 7 (1988), 21–26.

[17] Monteiro, R. D. C. and Tsuchiya, T., Global convergence of the affine scaling algorithm for convex quadratic programming, SIAM J. Optim. 8 (1998), 26–58.

[18] Monteiro, R. D. C., Tsuchiya, T., and Wang, Y., A simplified global convergence proof of the affine scaling algorithm, Ann. Oper. Res. 46/47 (1993), 443–482.

[19] Monteiro, R. D. C. and Wang, Y., Trust region affine scaling algorithms for linearly constrained convex and concave programs, Math. Program. 80 (1998), 283–313.

[20] Moran, P. A. P., The Statistical Processes of Evolutionary Theory, Clarendon Press, Oxford, 1962.

[21] Moré, J. J., Garbow, B. S., and Hillstrom, K. E., Testing unconstrained optimization software, ACM Trans. Math. Software 7 (1981), 17–41.

[22] Pelillo, M., Replicator equations, maximal cliques, and graph isomorphism, Neural Comput. 11 (1999), 1933–1955.

[23] Pelillo, M., Matching free trees, maximal cliques, and monotone game dynamics, IEEE Trans. Pattern Anal. Machine Intell. 24 (2002), 1535–1541.

[24] Pelillo, M., Siddiqi, K., and Zucker, S. W., Matching hierarchical structures using association graphs, IEEE Trans. Pattern Anal. Machine Intell. 21 (1999), 1105–1120.

[25] Sun, J., A convergence proof for an affine-scaling algorithm for convex quadratic programming without nondegeneracy assumptions, Math. Program. 60 (1993), 69–79.

[26] Sun, J., A convergence analysis for a convex version of Dikin's algorithm, Ann. Oper. Res. 62 (1996), 357–374.

[27] Tseng, P., Convergence properties of Dikin's affine scaling algorithm for nonconvex quadratic minimization, J. Global Optim. 30 (2004), 285–300.

[28] Tsuchiya, T., Global convergence of the affine scaling methods for degenerate linear programming problems, Math. Program. 52 (1991), 377–404.

[29] Tsuchiya, T. and Muramatsu, M., Global convergence of a long-step affine scaling algorithm for degenerate linear programming problems, SIAM J. Optim. 5 (1995), 525–551.

[30] Ye, Y., On affine scaling algorithms for nonconvex quadratic programming, Math. Program. 56 (1992), 285–300.


Problem    γ     iter     nf        cpu      obj            resid      step       sc

ER (21)    .5    6        7         0.03     498.002        3.4·10^-7  0.47       .001
           .8    5        6         0.01     498.002        5.2·10^-7  7.9·10^3   .001
           .9    5        6         0.01     498.002        6.8·10^-7  4.4·10^4   .001
           1     7        8         0.03     498.002        2.6·10^-7  3.5·10^6   .001
           1.1   8        9         0.02     498.002        8.4·10^-7  5.6·10^7   .001
           1.2   10       11        0.03     498.002        3.7·10^-7  3.5·10^9   .001
DBV (28)   .5    117      352       0.4      4.5·10^-8      9.8·10^-5  59.3       .0003
           .8    99       299       0.42     4.9·10^-8      9.8·10^-5  1.8·10^3   .0003
           .9    107      323       0.45     4.7·10^-8      9.7·10^-5  7.4·10^3   .0003
           1     146      440       0.52     4.5·10^-8      9.8·10^-5  2.9·10^4   .0003
           1.1   191      575       0.82     4.3·10^-8      9.9·10^-5  1.1·10^5   .0003
           1.2   240      722       1.02     4.0·10^-8      9.9·10^-5  4.7·10^5   .0003
BT (30)    .5    3893     3894      4.25     999.030        9.9·10^-4  0.15       .007
           .8    9146     27409     20.68    999.031        9.9·10^-4  0.50       .006
           .9    38124    114346    87.05    999.033        9.9·10^-4  0.69       .006
           1     17559    52665     23.31    999.055        9.9·10^-4  0.45       .01
           1.1   83209    249222    191.06   999.077        9.9·10^-4  0.65       .01
           1.2   527757   1.58·10^6 1192.55  999.081        9.9·10^-4  1.14       .01
TRIG (26)  .5    41       121       0.11     1.1·10^-6      9.0·10^-7  623.6      .0006
           .8    39       115       0.14     9.5·10^-7      9.9·10^-7  3.6·10^4   .0006
           .9    73       214       0.24     9.0·10^-7      9.8·10^-7  8.6·10^4   .0006
           1     95       271       0.22     9.7·10^-7      8.3·10^-7  5.3·10^5   .0006
           1.1   52       143       0.17     8.8·10^-7      8.1·10^-7  2.3·10^6   .0006
           1.2   86       223       0.27     1.1·10^-6      9.9·10^-7  6.4·10^6   .0006
BAL (27)   .5    †6       63        0.02     9.98998·10^8   1.9·10^-5  6.5·10^-21 .001
           .8    †8       84        0.02     9.98998·10^8   1.4·10^-6  6.0·10^-21 .001
           .9    6        7         0.01     9.98998·10^8   6.1·10^-7  113.648    .001
           1     7        8         0.03     9.98998·10^8   3.6·10^-8  1947.55    .001
           1.1   †9       94        0.03     9.98998·10^8   1.0·10^-6  6.4·10^-21 .001
           1.2   8        9         0.02     9.98998·10^8   3.7·10^-7  123466     .001
EPS (22)   .5    120      313       0.43     1.3·10^-6      9.9·10^-4  39.23      .0001
           .8    424      1269      1.89     1.3·10^-6      9.9·10^-4  1503.22    .0002
           .9    644      1929      2.85     2.4·10^-6      9.9·10^-4  5984.43    .0002
           1     987      2958      3.52     3.9·10^-6      9.9·10^-4  23824.5    .0003
           1.1   870      2608      3.81     5.9·10^-6      9.9·10^-4  47423.4    .0003
           1.2   1963     5887      8.76     8.0·10^-6      9.4·10^-4  188796     .0004
VD (25)    .5    †17503   17505     18.78    6.22504·10^22  2.9·10^10  9.5·10^-22 -2·10^10
           .8    22       74        0.04     6.22504·10^22  2.4·10^-9  9.5·10^-14 1
           .9    19       43        0.03     6.22504·10^22  1.6·10^-8  3.7·10^-19 1
           1     19       46        0.04     6.22504·10^22  2.6·10^-8  6.3·10^-19 1
           1.1   20       86        0.04     6.22504·10^22  8.8·10^-8  2.1·10^-20 1
           1.2   18       50        0.05     6.22504·10^22  5.7·10^-9  6.5·10^-19 .9
LR1 (33)   .5    †30595   30624     50.12    3.32834·10^8   1.5·10^-4  1.4·10^-12 .9
           .8    20       21        0.04     3.32834·10^8   9.5·10^-7  2.0·10^-4  .9
           .9    19       20        0.04     3.32834·10^8   9.5·10^-7  2.3·10^-3  .9
           1     19       20        0.03     3.32839·10^8   3.5·10^-7  0.031      1
           1.1   19       20        0.03     3.32911·10^8   5.9·10^-7  0.16       .9
           1.2   20       21        0.03     3.33481·10^8   9.9·10^-7  0.77       .9
LR1Z (34)  .5    †3454    3538      3.96     251.125        145.94     8.8·10^-21 -7.9
           .8    †21      88        0.05     251.125        119.46     9.9·10^-21 -6.5
           .9    †34      129       0.07     251.125        3.7·10^-6  8.8·10^-21 .01
           1     †25      102       0.05     251.125        170.84     8.8·10^-21 -9.3
           1.1   †30      119       0.06     251.125        5.8·10^-7  7.35       .008
           1.2   †65      224       0.13     251.125        3.7·10^-7  41.94      .01

Table 1: Behavior of the first-order interior-point method with inexact line search on test functions from [21], with x^0 = e/n and n = 1000.
† Quit due to roundoff error causing the Armijo ascent condition not to be met when α < 10^-20.
