
Feasible and accurate algorithms for covering semidefinite programs

G. Iyengar*, D. J. Phillips**, and C. Stein***

Department of Industrial Engineering & Operations Research, Columbia University, cliff,[email protected]; Mathematics Department, The College of William & Mary, [email protected]

* Supported in part by NSF grants CCR-00-09972, DMS-01-04282 and ONR grant N000140310514.
** Supported in part by NSF grant DMS-0703532 and a NASA/VSGC New Investigator grant.
*** Supported in part by NSF grants CCF-0728733 and CCF-0915681.

Abstract. In this paper we describe an algorithm to approximately solve a class of semidefinite programs called covering semidefinite programs. This class includes many semidefinite programs that arise in the context of developing algorithms for important optimization problems such as Undirected SPARSEST CUT, wireless multicasting, and pattern classification. We give algorithms for covering SDPs whose dependence on $\varepsilon$ is only $\varepsilon^{-1}$. These algorithms, therefore, have a better dependence on $\varepsilon$ than other combinatorial approaches, with a tradeoff of a somewhat worse dependence on the other parameters. For many reasons, including numerical stability and a variety of implementation concerns, the dependence on $\varepsilon$ is critical, and the algorithms in this paper may be preferable to those of previous work. Our algorithms exploit the structural similarity between covering semidefinite programs, packing semidefinite programs, and packing and covering linear programs.

1 Introduction

Semidefinite programming (SDP) is a powerful tool for designing approximation algorithms for NP-hard problems. Early uses of SDP include Lovász's work on the Shannon capacity of a graph [19] and that of Grötschel et al. on the stable set of a perfect graph [12]. The Goemans and Williamson MAXCUT algorithm [10] demonstrated the power of SDP relaxations, and subsequent SDP-based approximation algorithms include those for SPARSEST CUT [3] and coloring [14]. SDP relaxations are also used in a variety of important applications such as multicast beamforming [25], pattern classification [26] and sparse principal component analysis [7].

Solving SDPs remains a significant theoretical and practical challenge. Interior point algorithms compute an approximate solution for an SDP with $n \times n$ decision matrices with an absolute error $\varepsilon$ in $O(\sqrt{n}(m^3 + n^6)\log(\varepsilon^{-1}))$ time, where $m$ is the number of constraints [23]. These algorithms have a very low dependence on $\varepsilon$, but a high dependence on the other parameters; for example, interior point algorithms require $O(n^{9.5}\log(\varepsilon^{-1}))$ time to solve the SDPs that arise in Undirected SPARSEST CUT. Thus, interior point algorithms have some drawbacks, and there has been significant work on designing faster algorithms to approximately solve the SDPs that arise in important applications.

One general class of SDPs with efficient solutions is known as packing SDPs. (Packing SDPs are analogous to packing linear programs; see [13] for a precise definition.) Examples of packing SDPs include those used to solve MAXCUT, COLORING, Shannon capacity, and sparse principal component analysis. For these problems there are essentially three known solution methods, each with its own advantages and drawbacks. The first method is specialized interior point methods, which have a good dependence on $\varepsilon$ but a poor dependence on the other parameters. A second method, due to Klein and Lu [15, 16], extends ideas of, among others, Plotkin, Shmoys and Tardos [24] for packing linear programs to give algorithms for MAXCUT and coloring that have running times of $O(nm\log^2(n) \cdot \varepsilon^{-2}\log(\varepsilon^{-1}))$ and $O(nm\log^3(n) \cdot \varepsilon^{-4})$, respectively, on graphs with $n$ nodes and $m$ edges. The drawback of these algorithms is the dependence on $\varepsilon$ of at least $\varepsilon^{-2}$; this bound is inherent in these methods [17] and a significant bottleneck in practice on these types of problems [5, 18]. A third approach, due to Iyengar, Phillips and Stein [13], also extends ideas from packing linear programs, but starts from the more recent work of Bienstock and Iyengar [6], who build on techniques of Nesterov [21]. Nesterov [22] has also adapted his method to solve the saddle point problem associated with general SDPs, although his method does not find feasible solutions. These results for packing SDPs have a dependence of only $\varepsilon^{-1}$, but a slightly larger dependence on the other parameters than the algorithms of Klein and Lu. The work in [13] is also significant in that it explicitly defines and approximately solves all packing SDPs in a unified manner.

A related approach for a more general class of SDPs is known as the "multiplicative weights method," which appears in several papers [1, 2] and generalizes techniques used in packing and covering linear programs, e.g., [8, 9]. The work of Arora and Kale [2] gives faster algorithms for solving SDPs (in terms of the problem size), and also extends results to more applications, including Undirected SPARSEST CUT. Their running times achieve a better dependence on $n$ and $m$ than the previous combinatorial algorithms do, but the dependence on $\varepsilon$ grows to $\varepsilon^{-6}$ for Undirected SPARSEST CUT. Moreover, they only satisfy each constraint to within an additive error $\varepsilon$ and thus do not necessarily find a feasible solution to the SDP. Their algorithms do, however, incorporate a rounding step so that finding a feasible solution to the SDP is not required.

The authors are not aware of any other work that efficiently finds approximate solutions to large classes of SDPs.

1.1 New results

In this work, we first define a new class of SDPs that we call covering SDPs. Analogous to covering linear programs, which require that all variables be nonnegative and all constraints be of the form $a^\top x \ge 1$ with $a \ge 0$, in a covering SDP the variable matrix $X$ is positive semidefinite and all constraints are of the form $\langle A, X \rangle \ge 1$ for positive semidefinite $A$. We show that several SDPs, including those used in diverse applications such as Undirected SPARSEST CUT, beamforming, and k-nearest neighbor classification, can be expressed as covering SDPs. We then describe an algorithm for computing a feasible solution to a covering SDP with objective value at most $(1+\varepsilon)$ times the optimal objective value. We call such a solution a relative $\varepsilon$-optimal solution. (We will use the term absolute approximation to refer to an additive approximation bound.) Our algorithm has an $\varepsilon^{-1}$ dependence, and the dependence on the other parameters depends on the specific application. The running times of our algorithm and of previous work, applied to three applications, are listed in Table 1. To obtain these results, we first give new SDP formulations of each problem, showing that they are examples of covering SDPs. We then apply our main theorem (Theorem 1 below).

Problem                  | m        | This paper                            | Previous work
Undirected SPARSEST CUT  | $O(n^3)$ | $\tilde{O}(n^4 \cdot \varepsilon^{-1})$        | $\tilde{O}(\min\{E^2 \cdot \varepsilon^{-4}, n^2 \cdot \varepsilon^{-6}\})$ [2]
Beamforming              | $O(R)$   | $\tilde{O}((n^4 + nR) \cdot \varepsilon^{-1})$ | $\tilde{O}((R + n^2)^{3.5} \cdot \log(\varepsilon^{-1}))$ [25]
k-Nearest Neighbor       | $O(T^2)$ | $\tilde{O}((T^2 n^3 + T^4) \cdot \varepsilon^{-1})$ | $\tilde{O}((n^7 + nT^6) \cdot \log(\varepsilon^{-1}))$ [26]

Table 1. Running time comparison. $n$ = matrix dimension, $E$ = number of edges, $R$ = number of receivers, $T$ = number of inputs. We use $\tilde{O}$ to suppress $\log^k(n)$ factors and all but the highest dependence on $\varepsilon$.

Our algorithm for covering SDPs is not a simple extension of the one for packing SDPs [13]. Several steps that were easy for packing SDPs are not so for covering SDPs and give rise to new technical challenges.

– Computing feasible solutions for covering SDPs is non-trivial, and previous work does not compute truly feasible solutions. For packing SDPs a feasible solution can be constructed by simply scaling the solution of a Lagrangian relaxation, whereas in covering SDPs, both scaling and shifting by a known strictly feasible point are required.

– In both packing and covering, the SDP is solved via a Lagrangian relaxation. In packing, it is sufficient to solve a single Lagrangian relaxation, whereas in covering, we need to solve a sequence of Lagrangian relaxations. The iterative approach is necessary to ensure feasibility. The convergence analysis of this iterative scheme is non-trivial.

– Our algorithms use quadratic prox functions as opposed to the more usual logarithmic functions [2, 13]. Quadratic prox functions avoid the need to compute matrix exponentials and are numerically more stable. The quadratic prox function was motivated by the numerical results in [13].

Relative to barrier interior point methods, our algorithm trades an increased dependence on $\varepsilon$, from $O(\log(\frac{1}{\varepsilon}))$ to $O(\frac{1}{\varepsilon})$, for a considerable improvement in running time with respect to the other problem parameters. In comparison to other "combinatorial" algorithms, our approximation algorithm reduces the dependence on $\varepsilon$ at the expense of an increase in some of the other terms in the running time. At first glance, a decrease in the dependence on $\varepsilon$ may not seem very significant, since for constant $\varepsilon$ the decrease is only by a constant (albeit a potentially large constant) factor. However, experience with implementations of approximation algorithms for packing- and covering-type problems has repeatedly shown that the dependence on $\varepsilon$ is a major bottleneck in designing an efficient algorithm (see, e.g., [4, 5, 11, 13, 18]), as it impacts the numerical stability and the convergence rate of the algorithm. Therefore, reducing the dependence on $\varepsilon$ is an important technical endeavor, as is understanding the tradeoff between the dependence on $\varepsilon$ and the other problem parameters.


2 Preliminaries

A semidefinite program (SDP) is an optimization problem of the form
$$\begin{array}{ll} \min & \langle C, X \rangle \\ \text{subject to} & \langle A_i, X \rangle \ge b_i, \quad i = 1, \ldots, m, \\ & X \succeq 0, \end{array} \tag{1}$$
where $C \in \mathbb{R}^{n\times n}$ and $A_i \in \mathbb{R}^{n\times n}$, $i = 1, \ldots, m$, are given, the decision variable is $X \in \mathbb{R}^{n\times n}$, and $\langle \cdot, \cdot \rangle$ denotes the usual Frobenius inner product $\langle A, B \rangle = \mathrm{Tr}(A^\top B) = \sum_{j=1}^n \sum_{i=1}^n A_{ij}B_{ij}$. The constraint $X \succeq 0$ indicates that the symmetric matrix $X$ is positive semidefinite, i.e., $X$ has nonnegative eigenvalues. We use $S^n$ and $S^n_+$ to denote the space of $n \times n$ symmetric and positive semidefinite matrices, respectively, and omit the superscript $n$ when the dimension is clear. For a matrix $X \in S^n$, we let $\lambda_{\max}(X)$ denote the largest eigenvalue of $X$.
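As a concrete illustration of these definitions, the following minimal Python/NumPy sketch (ours, not part of the paper; the tolerance is an arbitrary choice) evaluates the Frobenius inner product and tests positive semidefiniteness numerically.

import numpy as np

def frobenius_inner(A, B):
    # <A, B> = Tr(A^T B) = sum_ij A_ij * B_ij
    return float(np.sum(A * B))

def is_psd(X, tol=1e-10):
    # X >= 0 iff the symmetric matrix X has nonnegative eigenvalues
    return bool(np.min(np.linalg.eigvalsh(X)) >= -tol)

u = np.array([1.0, 2.0, -1.0])
X = np.outer(u, u)                  # a rank-one PSD matrix
assert is_psd(X)
assert abs(frobenius_inner(np.eye(3), X) - np.trace(X)) < 1e-12   # <I, X> = Tr(X)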

3 The covering SDP

We define the covering SDP as follows:

$$\begin{array}{rl} \nu^* = \min & \langle C, X \rangle \\ \text{subject to} & \langle A_i, X \rangle \ge 1, \quad i = 1, \ldots, m, \\ & X \in \mathcal{X} := \{ X : X \succeq 0,\ \mathrm{Tr}(X) \le \tau \}, \end{array} \tag{2}$$
where $A_i \succeq 0$, $i = 1, \ldots, m$, $C \succeq 0$, and $\mathcal{X} \subset S_+$ is a set over which linear optimization is "easy". We refer to constraints of the form $\langle A, X \rangle \ge 1$ with $A \succeq 0$ as cover constraints. Before describing our algorithms, we state a set of assumptions; each application we consider satisfies them. We assume the following about the covering SDP (2):

(a) We can compute $Y \in \mathcal{X}$ such that $\min_{1\le i\le m}\{\langle A_i, Y \rangle\} \ge 1 + \frac{1}{q(n,m)}$ for some positive function $q(n,m)$, i.e., $Y$ is strictly feasible with margin of feasibility at least $\frac{1}{q(n,m)}$.

(b) We can compute a lower bound $\nu_L \in \mathbb{R}$ for (2) such that
$$\nu_U := \langle C, Y \rangle \le \nu_L \cdot p(n,m)$$
for some positive function $p(n,m)$, i.e., the relative error of $Y$ is bounded above by $p(n,m)$.

Our main result is that we can find ε-optimal solutions to covering SDPs efficiently.

Theorem 1. Suppose a covering SDP (2) satisfies Assumptions (a) and (b). Then a relative $\varepsilon$-optimal solution can be found in
$$O\Big( \tau\, q(n,m)\, \log\big(p(n,m)\tfrac{1}{\varepsilon}\big)\, \big(T_G + \kappa(m)\big)\, \|A\| \sqrt{\ln(m)} \cdot \tfrac{1}{\varepsilon} \Big)$$
time, where $\|A\| = \max_{1\le i\le m} \lambda_{\max}(A_i)$, $T_G$ denotes the running time for computing the negative eigenvalues and the corresponding eigenvectors of a matrix of the form $C - \sum_{i=1}^m v_i A_i$, $v \in \mathbb{R}^m$, and $\kappa(m)$ denotes the running time to calculate $\langle A_i, Z \rangle$ for all $i$ and any $Z \in \mathcal{X}$.


We prove this theorem in Section 4.3. For the results in Table 1 we set $T_G = n^3$ and calculate $\kappa(m)$ as described in the individual sections. In the remainder of this section, we describe how to formulate our applications as covering SDPs.

3.1 Undirected SPARSEST CUT

Let $G = (V, \mathcal{E})$ denote a connected undirected graph with $n = |V|$ nodes and $E = |\mathcal{E}|$ edges. Let $L$ denote the Laplacian of $G$, let $K = nI - J$ denote the Laplacian of the complete graph with unit edge weights, and let $K_{ij}$ denote the Laplacian of the graph with a single undirected edge $(i,j)$.

As in [3], we assume that $G$ is unweighted and connected. In Appendix A.1 we show that the ARV formulation of [3] is equivalent to the following new covering SDP formulation.

USC ARV Formulation:
$$\begin{array}{ll} \min & \langle L, X \rangle \\ \text{s.t.} & \langle K, X \rangle = 1, \\ & \langle K_{ij} + K_{jk} - K_{ik}, X \rangle \ge 0 \quad \forall i,j,k, \\ & X \succeq 0. \end{array}$$

USC Covering SDP Formulation:
$$\begin{array}{ll} \min & \langle L, X \rangle \\ \text{s.t.} & \langle K, X \rangle \ge 1, \\ & \frac{n}{4}\langle K_{ij} + K_{jk} - K_{ik} + 2I,\ X \rangle \ge 1 \quad \forall i,j,k \in V, \\ & X \in \mathcal{X} = \big\{ Y : Y \succeq 0,\ \mathrm{Tr}(Y) \le \tfrac{2}{n} \big\}. \end{array}$$

Both formulations have $m = O(n^3)$ constraints. To obtain the covering SDP formulation we relax the first set of constraints, add a trace constraint $\mathrm{Tr}(X) \le \frac{2}{n}$, and then shift the second set of linear constraints. Although a trace bound of $\frac{1}{n}$ is sufficient for equivalence, we need the bound to be $\frac{2}{n}$ in order to construct a feasible solution that satisfies Assumption (a).

We now show that Assumption (a) is satisfied. Let $Y = \frac{2}{n^2(n-1)}K \succeq 0$. Then $\mathrm{Tr}(Y) = \frac{2}{n^2(n-1)} \cdot n(n-1) = \frac{2}{n}$; thus, $Y \in \mathcal{X}$. For all $i,j,k$,
$$\frac{n}{4}\langle K_{ij} + K_{jk} - K_{ik} + 2I,\ Y \rangle = \frac{1}{2n(n-1)}\langle K_{ij} + K_{jk} - K_{ik},\ K \rangle + 1 = \frac{1}{n-1} + 1.$$
Also, $\langle K, Y \rangle = 2 > 1 + \frac{1}{n}$. Thus, $Y$ satisfies Assumption (a) with $q(n,m) = n$.

Since $G$ is unweighted, the upper bound $\nu_U = \langle L, Y \rangle \le \langle K, Y \rangle = 2$. It was shown in [3] that the ARV formulation has a lower bound of $\frac{1}{\sqrt{\log(n)}}$, i.e., $\nu_L = \frac{1}{\sqrt{\log(n)}}$; thus, $p(n,m) = 2\sqrt{\log(n)}$. Finally, note that $\|A\| = \lambda_{\max}(K) = n$. Then, since $\tau = \frac{2}{n}$, Theorem 1 implies the following result.

Corollary 1. A relative $\varepsilon$-optimal solution to the Undirected SPARSEST CUT SDP can be found in $O\big( n^4\sqrt{\ln(n)} \cdot \frac{1}{\varepsilon} \big)$ time.
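The feasibility margins computed above are easy to check numerically. The following NumPy sketch (ours, for illustration only) builds $K$ and a few of the shifted triangle-inequality constraint matrices for a small $n$ and verifies the claimed margins for $Y = \frac{2}{n^2(n-1)}K$.

import numpy as np

n = 6
I, J = np.eye(n), np.ones((n, n))
K = n * I - J                                   # Laplacian of the complete graph

def edge_laplacian(i, j):
    # Laplacian K_ij of the single undirected edge (i, j)
    M = np.zeros((n, n))
    M[i, i] = M[j, j] = 1.0
    M[i, j] = M[j, i] = -1.0
    return M

Y = 2.0 / (n**2 * (n - 1)) * K                  # the strictly feasible point above
assert abs(np.trace(Y) - 2.0 / n) < 1e-12       # Tr(Y) = 2/n, so Y lies in X
assert abs(np.sum(K * Y) - 2.0) < 1e-12         # <K, Y> = 2 > 1 + 1/n

for (i, j, k) in [(0, 1, 2), (3, 4, 5), (0, 2, 4)]:
    T = (n / 4.0) * (edge_laplacian(i, j) + edge_laplacian(j, k)
                     - edge_laplacian(i, k) + 2 * I)
    assert abs(np.sum(T * Y) - (1 + 1.0 / (n - 1))) < 1e-9   # margin 1/(n-1)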

3.2 Beamforming

Sidiropoulos et al. [25] consider a wireless multicast scenario with a single transmitter with $n$ antenna elements and $R$ receivers, each with a single antenna. Let $h_i \in \mathbb{C}^n$ denote the complex vector that models the propagation loss and the phase shift from each transmit antenna to the receiving antenna of user $i = 1, \ldots, R$. Let $w^* \in \mathbb{C}^n$ denote the beamforming weight vector applied to the $n$ antenna elements. Suppose the transmit signal is zero-mean white noise with unit power and the noise at receiver $i$ is zero-mean with variance $\sigma_i^2$. Then the received signal-to-noise ratio (SNR) at receiver $i$ is $|w^* h_i|^2/\sigma_i^2$. Let $\rho_i$ denote the minimum SNR required at receiver $i$. Then the problem of designing a beamforming vector $w^*$ that minimizes the transmit power subject to constraints on the received SNR of each user can be formulated as the following optimization problem:
$$\begin{array}{ll} \min & \|w\|_2^2, \\ \text{subject to} & |w^* h_i|^2 \ge \rho_i \sigma_i^2, \quad i = 1, \ldots, R. \end{array}$$

In [25], the authors show that this optimization problem is NP-hard and formulate the following SDP relaxation:
$$\begin{array}{ll} \min & \langle I, X \rangle, \\ \text{subject to} & \langle Q_i, X \rangle \ge 1, \quad i = 1, \ldots, R, \\ & X \succeq 0, \end{array} \tag{3}$$
where
$$Q_i := \frac{\gamma}{\rho_i \sigma_i^2}\big( g_i g_i^\top + \bar{g}_i \bar{g}_i^\top \big), \quad g_i = \big( \Re(h_i)^\top\ \Im(h_i)^\top \big)^\top, \quad \bar{g}_i = \big( \Im(h_i)^\top\ {-\Re(h_i)^\top} \big)^\top,$$
$X \in S^{2n}$, and $\gamma$ is chosen to ensure that $\min_{1\le i\le R}\mathrm{Tr}(Q_i) = 1$.

We will now show that (3) is a covering SDP that satisfies Assumptions (a) and (b).

Since $\mathrm{Tr}(Q_i) \ge 1$, it follows that for any $c \ge 2$, $Y = cI$ is strictly feasible for (3) with $q(n,m) = c - 1 = O(1)$ and $\langle I, Y \rangle = cn$. Thus, we can assume without loss of generality that the optimal solution of (3) belongs to the set $\mathcal{X} = \{ X : X \succeq 0,\ \mathrm{Tr}(X) \le cn \}$. The dual of (3) is given by
$$\begin{array}{ll} \max & \sum_{i=1}^R v_i, \\ \text{subject to} & \sum_{i=1}^R v_i Q_i \preceq I, \\ & v \ge 0. \end{array}$$
Let $k$ denote any index such that $\mathrm{Tr}(Q_k) = 1$, and let $v = 2e_k$, where $e_k$ denotes the vector with a 1 in the $k$-th position and zeroes everywhere else. It is easy to check that both non-zero eigenvalues of the rank-2 matrix $Q_i$ are equal to $\gamma\|g_i\|^2/\rho_i\sigma_i^2$. Therefore, $\lambda_{\max}(Q_k) = \frac{1}{2}$, and $v$ is feasible for the dual with objective value 2, i.e., $\nu_L = 2$. Thus, we have $\nu_U = \langle I, Y \rangle = 2n = n \cdot \nu_L$, i.e., $p(n,R) = n$, and Assumption (b) is satisfied. Then, since $\tau = 2n$, Theorem 1 implies the following result.

Corollary 2. A relative $\varepsilon$-optimal feasible solution for the beamforming SDP (3) can be computed in $O\big( n(n^3 + R)\sqrt{\ln(R)} \cdot \frac{1}{\varepsilon} \big)$ time.
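The real lifting behind the matrices $Q_i$ is easy to verify numerically. In the NumPy sketch below (ours; $\gamma$, $\rho_i$, $\sigma_i^2$ are placeholder values), both non-zero eigenvalues of a sample $Q_i$ coincide as claimed, and $\langle Q_i, xx^\top \rangle$ recovers $\gamma|w^* h_i|^2/(\rho_i\sigma_i^2)$ for the stacked real vector $x = (\Re(w)^\top\ \Im(w)^\top)^\top$.

import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma2, gamma = 4, 2.0, 0.5, 1.0            # placeholder problem data

h = rng.normal(size=n) + 1j * rng.normal(size=n)    # channel vector of one user
g    = np.concatenate([h.real,  h.imag])            # g_i
gbar = np.concatenate([h.imag, -h.real])            # \bar g_i
Q = gamma / (rho * sigma2) * (np.outer(g, g) + np.outer(gbar, gbar))

# Both non-zero eigenvalues of the rank-2 matrix Q equal gamma*||g||^2/(rho*sigma2).
eigs = np.sort(np.linalg.eigvalsh(Q))[::-1]
expected = gamma * (g @ g) / (rho * sigma2)
assert np.allclose(eigs[:2], expected) and np.allclose(eigs[2:], 0.0)

# Real lifting: x^T Q x = gamma * |w^* h|^2 / (rho * sigma2).
w = rng.normal(size=n) + 1j * rng.normal(size=n)
x = np.concatenate([w.real, w.imag])
assert np.isclose(x @ Q @ x, gamma * abs(np.vdot(w, h))**2 / (rho * sigma2))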

3.3 k-Nearest Neighbor Classification

Weinberger et al. [26] consider the following model for pattern classification. Let $G = (V, E)$ be a graph where $T = |V|$ and each node $i \in V$ has an input $(v_i, y_i)$, $i = 1, \ldots, T$, where $v_i \in \mathbb{R}^n$ and the $y_i$ are labels from a finite discrete set. Each node has a constant number $k$ of adjacent nodes in $G$, i.e., $G$ is $k$-regular, and if $(i,j) \in E$, then $i$ and $j$ are "near" with respect to some distance function on $v_i$ and $v_j$. The goal is to use the $T$ inputs to derive a linear transformation $H \in \mathbb{R}^{n\times n}$ so that, for input $i$, $Hv_i$ is still near its $k$ nearest neighbors but "far" from inputs $j$ that do not share the same label, i.e., $y_i \ne y_j$. Let $E_D = \{(i,j,\ell) : (i,j) \in E,\ y_i \ne y_\ell\}$. Let $L$ denote the Laplacian associated with $G$ and
$$C = \begin{pmatrix} L & 0 \\ 0 & cI \end{pmatrix},$$
where $0$ denotes an appropriately sized matrix of all zeros and $c > 0$ is a given constant. Also, let $\bar{A}_{ij\ell}$ denote the block diagonal matrix with $K_{i\ell} - K_{ij}$ (the edge Laplacians of $(i,\ell)$ and $(i,j)$, respectively) in the upper block, and let $A_{ij\ell} = \bar{A}_{ij\ell} + \hat{I}$, where $\hat{I}$ denotes the block diagonal matrix with $I$ in the upper $n \times n$ block and zeros everywhere else. In Appendix A.2, we show that the following two formulations are equivalent.

kNN WBS Formulation:
$$\begin{array}{ll} \min & \langle C, X \rangle \\ \text{s.t.} & \langle \bar{A}_{ij\ell}, X \rangle \ge 1, \quad (i,j,\ell) \in E_D, \\ & X \succeq 0. \end{array}$$

kNN Covering SDP Formulation:
$$\begin{array}{ll} \min & \langle C, X \rangle \\ \text{s.t.} & \langle A_{ij\ell}, X \rangle \ge 1, \quad (i,j,\ell) \in E_D, \\ & \mathrm{Tr}(X) \le kT, \\ & X \succeq 0. \end{array}$$

The WBS formulation is due to Weinberger, Blitzer and Saul [26]. To obtain the covering SDP formulation, we add the trace constraint $\mathrm{Tr}(X) \le kT$ and shift the second set of constraints as we did for Undirected SPARSEST CUT. Note that the number of constraints is $m = kT^2 = O(T^2)$. Arguments similar to those used to construct the covering SDP formulations for Undirected SPARSEST CUT and Beamforming show that $\nu_L = 1$ and $\nu_U = p(n,m) = q(n,m) = O(T)$ (see Appendix A.2). Thus, we have the following corollary to Theorem 1.

Corollary 3. A relative $\varepsilon$-optimal solution to the k-Nearest Neighbors covering SDP can be found in $O\big( T^2(n^3 + T^2)\sqrt{\log(T)} \cdot \frac{1}{\varepsilon} \big)$ time.
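To make the construction of the constraint matrices concrete, here is a small Python sketch (ours; the helper name and the use of explicit difference vectors are our own choices) that assembles the block matrix for one triple $(i,j,\ell)$ so that $\langle A, X \rangle$ reproduces the left-hand side of the constraint in the WBS formulation, as detailed in Appendix A.2.

import numpy as np

def knn_constraint_matrix(vi, vj, vl, D, idx):
    # Upper block: d_il d_il^T - d_ij d_ij^T with d_pq = v_p - v_q, so that
    # <A, X> = (v_i-v_l)^T Y (v_i-v_l) - (v_i-v_j)^T Y (v_i-v_j) + z_ijl
    # when X is blocked as (Y, *; *, diag(z)). One unit entry on the slack
    # diagonal selects z_ijl.
    n = len(vi)
    A = np.zeros((n + D, n + D))
    d_il, d_ij = vi - vl, vi - vj
    A[:n, :n] = np.outer(d_il, d_il) - np.outer(d_ij, d_ij)
    A[n + idx, n + idx] = 1.0
    return A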

4 Computing a relative ε-optimal solution for a covering SDP

In this section we describe the steps of our solution algorithm SOLVECOVERSDP (see Figure 1).

– We start with the Lagrangian relaxation of (2),
$$\nu_\omega^* = \min_{X \in \mathcal{X}} \max_{v \in \mathcal{V}} \Big\{ \Big\langle C - \omega\sum_{i=1}^m v_i A_i,\ X \Big\rangle + \omega\sum_{i=1}^m v_i \Big\}, \tag{4}$$
where $\mathcal{V} = \{ v : v \ge 0,\ \sum_{i=1}^m v_i \le 1 \}$ and the penalty multiplier $\omega$ controls the magnitude of the dual variables. Lagrange duality implies that we need $\omega \to \infty$ to ensure strict feasibility.

– In Section 4.2 we show that an adaptation of an algorithm due to Nesterov [21] allows us to compute an $\varepsilon$-saddle-point, i.e., $(\hat{v}, \hat{X})$ such that
$$\max_{v \in \mathcal{V}} \Big\{ \Big\langle C - \omega\sum_{i=1}^m v_i A_i,\ \hat{X} \Big\rangle + \omega\sum_{i=1}^m v_i \Big\} - \min_{X \in \mathcal{X}} \Big\{ \Big\langle C - \omega\sum_{i=1}^m \hat{v}_i A_i,\ X \Big\rangle + \omega\sum_{i=1}^m \hat{v}_i \Big\} \le \varepsilon,$$
in $O\big( \frac{\|A\|\omega\tau}{\varepsilon} \big)$ iterations of a Nesterov non-smooth optimization algorithm [21], where $\|A\| = \max_{1\le i\le m}\lambda_{\max}(A_i)$; thus, large $\omega$ leads to larger running times.

– In Section 4.1 we show that an $\varepsilon$-saddle-point can be converted into a relative $\varepsilon$-optimal solution provided $\omega \ge \frac{1}{g(Y)} \cdot \big( \frac{\nu_U - \nu_L}{\nu_L} \big)$, where $\nu_U$ (resp. $\nu_L$) is an upper (resp. lower) bound on the optimal value $\nu^*$ of the covering SDP (2), and $g(Y) = \min_{1\le i\le m}\langle A_i, Y \rangle - 1$ denotes the feasibility margin of any strictly feasible point $Y$. Assumptions (a) and (b) guarantee that $\frac{\nu_U - \nu_L}{\nu_L} \le p(n,m)$ and that there exists a feasible $Y$ with $g(Y) \ge \frac{1}{q(n,m)}$. Thus, it follows that one can compute an $\varepsilon$-optimal solution in $O\big( \frac{p(n,m)q(n,m)\|A\|}{\varepsilon} \big)$ iterations.

– In Section 4.3 we show that by solving a sequence of covering SDPs we can reduce the overall number of iterations to $O\big( \frac{q(n,m)\|A\|}{\varepsilon} \big)$, i.e., reduce it by a factor of $p(n,m)$.

The running time per iteration is dominated by an eigenvalue computation and the solution of an optimization problem via an active set method.

4.1 Rounding the Lagrangian saddle point problem

Let $g(Y) = \min_{i=1,\ldots,m}\big\{ \langle A_i, Y \rangle - b_i \big\}$ denote the feasibility margin with respect to the cover constraints in (2). Then $Y \in \mathcal{X}$ is strictly feasible for (2) if, and only if, $g(Y) > 0$.

Lemma 1. Let $Y \in \mathcal{X}$ be a strictly feasible solution to (2), so $g(Y) > 0$. Define $\bar\omega := \frac{\langle C, Y \rangle - \nu^*}{g(Y)}$ and assume $\bar\omega > 0$. Choose $\omega \ge \bar\omega$, and suppose $(\hat{X}, \hat{v})$ is a $\delta$-saddle-point for (4), i.e.,
$$\max_{v \in \mathcal{V}} \Big\{ \Big\langle C - \omega\sum_{i=1}^m v_i A_i,\ \hat{X} \Big\rangle + \omega\sum_{i=1}^m v_i \Big\} - \min_{X \in \mathcal{X}} \Big\{ \Big\langle C - \omega\sum_{i=1}^m \hat{v}_i A_i,\ X \Big\rangle + \omega\sum_{i=1}^m \hat{v}_i \Big\} \le \delta.$$
Then $\bar{X} = \frac{\hat{X} + \beta(\hat{X})Y}{1 + \beta(\hat{X})}$, where $\beta(\hat{X}) = g(\hat{X})^-/g(Y)$ and $g(X)^- = \max\{-g(X), 0\}$ denotes the negative part of the margin, is feasible and absolute $\delta$-optimal for (2).

Proof. We first show that $\bar{X}$ is feasible for (2). Since $\bar{X}$ is a convex combination of $\hat{X}$ and $Y$, $\bar{X} \in \mathcal{X}$. The definition of $\beta(\hat{X})$ together with the concavity of $g$ implies
$$g(\bar{X}) \ge \frac{1}{1 + \beta(\hat{X})}\, g(\hat{X}) + \frac{\beta(\hat{X})}{1 + \beta(\hat{X})}\, g(Y) = \frac{1}{1 + \beta(\hat{X})}\big( g(\hat{X}) + g(\hat{X})^- \big) \ge 0.$$
Thus, $\bar{X}$ is feasible for (2). All that remains is to show that $\langle C, \bar{X} \rangle \le \nu^* + \delta$. First, we show that if the penalty $\omega \ge \bar\omega$, then $\nu_\omega^* = \nu^*$. Since $g(X)^- = 0$ for all $X$ feasible for (2), it follows that $\nu_\omega^* \le \nu^*$; therefore, we must show that $\nu_\omega^* \ge \nu^*$ when $\omega \ge \bar\omega$. Fix $X \in \mathcal{X}$ and let $X_\beta = (X + \beta(X)Y)/(1 + \beta(X))$, which means $X = (1 + \beta(X))X_\beta - \beta(X)Y$. Then, by the previous argument, $g(X_\beta) \ge 0$, so $\langle C, X_\beta \rangle \ge \nu^*$. Also,
$$\begin{aligned}
\langle C, X \rangle + \omega g(X)^- - \nu^* &= (1 + \beta(X))\langle C, X_\beta \rangle - (1 + \beta(X))\nu^* - \beta(X)\big( \langle C, Y \rangle - \nu^* \big) + \omega g(X)^- \\
&= (1 + \beta(X))\big( \langle C, X_\beta \rangle - \nu^* \big) - \bar\omega\beta(X)g(Y) + \omega g(X)^- \\
&= (1 + \beta(X))\underbrace{\big( \langle C, X_\beta \rangle - \nu^* \big)}_{\ge 0} + \underbrace{(\omega - \bar\omega)}_{\ge 0}\, g(X)^-. 
\end{aligned} \tag{5}$$

Thus, if $\omega \ge \bar\omega$ then $\nu_\omega^* = \nu^*$.

We can now show that $\bar{X}$ is $\delta$-optimal for (2) when $\omega \ge \bar\omega$. Since $(\hat{X}, \hat{v})$ is a $\delta$-saddle-point, it follows that
$$\max_{v \in \mathcal{V}} \Big\{ \Big\langle C - \omega\sum_{i=1}^m v_i A_i,\ \hat{X} \Big\rangle + \omega\sum_{i=1}^m v_i \Big\} = \langle C, \hat{X} \rangle + \omega g(\hat{X})^- \le \nu_\omega^* + \delta = \nu^* + \delta.$$
Thus, the same argument used in (5) shows that
$$\delta \ge \langle C, \hat{X} \rangle + \omega g(\hat{X})^- - \nu^* = (1 + \beta(\hat{X}))\big( \langle C, \bar{X} \rangle - \nu^* \big) + (\omega - \bar\omega)g(\hat{X})^-.$$
Since $\omega \ge \bar\omega$ and $g(\hat{X})^- \ge 0$, it follows that $\langle C, \bar{X} \rangle - \nu^* \le \frac{\delta}{1 + \beta(\hat{X})} \le \delta$. □

A version of Lemma 1 was established independently by Lu and Monteiro [20].
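To make the rounding in Lemma 1 concrete, the following Python sketch (ours; the generic `constraints` list stands in for the matrices $A_i$) implements the map $\hat{X} \mapsto \bar{X}$.

import numpy as np

def margin(X, constraints):
    # g(X) = min_i <A_i, X> - 1
    return min(np.sum(A * X) for A in constraints) - 1.0

def round_to_feasible(X_hat, Y, constraints):
    # Lemma 1: mix X_hat with the strictly feasible Y just enough to cancel
    # the violated part g(X_hat)^- = max(-g(X_hat), 0). Feasible points are
    # left unchanged, since beta = 0 for them.
    g_neg = max(-margin(X_hat, constraints), 0.0)
    beta = g_neg / margin(Y, constraints)        # requires g(Y) > 0
    return (X_hat + beta * Y) / (1.0 + beta)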

4.2 Solving the saddle point problems

In this section we describe how to use a variant of the Nesterov non-smooth optimization algorithm [21] to compute $\delta$-optimal saddle-points for the minimax problem (4). We assume some familiarity with the Nesterov algorithm [21]. Let $f(v)$ denote the dual function (or, equivalently, the objective function of the $v$-player):
$$f(v) = \sum_{i=1}^m v_i + \omega\tau \min_{X \in \hat{\mathcal{X}}} \Big\{ \Big\langle C - \sum_{i=1}^m v_i A_i,\ X \Big\rangle \Big\},$$
where $\hat{\mathcal{X}} = \{ X \in S^n_+ : \mathrm{Tr}(X) \le 1 \}$. We wish to compute an approximate solution for $\max_{v \in \mathcal{V}} f(v)$.

In order to use the Nesterov algorithm we need to smooth the non-smooth function $f$ using a strongly convex prox function. We smooth $f$ using the spectral quadratic prox function $\frac{1}{2}\sum_{i=1}^n \lambda_i^2(X)$, where $\{\lambda_i(X) : i = 1, \ldots, n\}$ denotes the eigenvalues of $X$. Let
$$f_\alpha(v) := \sum_{i=1}^m v_i + \omega\tau \min_{X \in \hat{\mathcal{X}}} \Big\{ \Big\langle C - \sum_{i=1}^m v_i A_i,\ X \Big\rangle + \frac{\alpha}{2}\sum_{i=1}^n \lambda_i^2(X) \Big\}. \tag{6}$$
The Nesterov algorithm requires that $\alpha = \frac{\delta}{D_x}$, where $D_x = \max_{X \in \hat{\mathcal{X}}} \frac{1}{2}\sum_{i=1}^n \lambda_i^2(X) = \frac{1}{2}$.

To optimize $f_\alpha(v)$ the Nesterov algorithm uses a particular regularized gradient descent method where each step involves solving a problem of the form
$$\max_{v \in \mathcal{V}} \Big\{ g^\top v - \frac{L}{\varepsilon} \cdot d(\bar{v}, v) \Big\}, \tag{7}$$
where $g$ is either a gradient computed at the current iterate (the $y^{(k)}$ computation in Figure 3 in Appendix B.2) or a convex combination of the gradients of $f_\alpha$ computed at all previous iterates (the $z^{(k)}$ computation in Figure 3 in Appendix B.2), and $d(\bar{v}, v)$ denotes the Bregman distance associated with a strongly convex function $\phi(v)$. We use $d(\bar{v}, v) = \sum_{i=1}^m v_i \ln(v_i/\bar{v}_i)$. Using results from [21], it is easy to show that for this choice of the Bregman distance the constant $L = \omega^2\tau^2\|A\|^2$, where $\|A\| = \max_{1\le i\le m}\lambda_{\max}(A_i)$, and (7) can be solved in closed form in $O(m)$ operations.
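The closed form alluded to here follows from the KKT conditions of (7). The sketch below is our own derivation, with $c$ standing for the regularization constant $L/\varepsilon$; it computes the solution in $O(m)$ time.

import numpy as np

def prox_step(g, v_bar, c):
    # Solve max_{v >= 0, sum(v) <= 1} g^T v - c * sum_i v_i * ln(v_i / v_bar_i).
    # Stationarity gives v_i = v_bar_i * exp(g_i/c - 1 - mu/c), where mu >= 0
    # is the multiplier of sum(v) <= 1; mu is active only when the
    # unconstrained point leaves the simplex, in which case v renormalizes.
    w = v_bar * np.exp(g / c - 1.0)
    s = w.sum()
    return w if s <= 1.0 else w / s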

From the envelope theorem it follows that
$$\nabla f_\alpha(v) = \mathbf{1} - \tau\omega\big( \langle A_1, X^* \rangle, \ldots, \langle A_m, X^* \rangle \big)^\top,$$
where $X^* = \operatorname{argmin}_{X \in \hat{\mathcal{X}}}\big\{ \langle C - \sum_{i=1}^m v_i A_i, X \rangle + \frac{\alpha}{2}\sum_{i=1}^n \lambda_i^2(X) \big\}$. We show below how to compute $X^*$.

Let $X = \sum_{i=1}^n \lambda_i u_i u_i^\top$ and $C - \sum_{i=1}^m v_i A_i = \sum_{i=1}^n \theta_i w_i w_i^\top$ denote the eigendecompositions of $X$ and $C - \sum_{i=1}^m v_i A_i$, respectively. Suppose we fix the eigenvalues $\{\lambda_i\}$ of $X$ and optimize over the choice of eigenvectors $\{u_i\}$. Then it is easy to show that
$$\min_{U : U^\top U = I}\Big\{ \Big\langle C - \sum_{i=1}^m v_i A_i,\ X \Big\rangle \Big\} = \sum_{i=1}^n \lambda_{(i)}\theta_{[i]},$$
where $\lambda_{(i)}$ denotes the $i$-th smallest eigenvalue of $X$ and $\theta_{[i]}$ denotes the $i$-th largest eigenvalue of $C - \sum_{i=1}^m v_i A_i$, and the minimum is achieved by setting the eigenvector $u_{(i)}$ corresponding to $\lambda_{(i)}$ equal to the eigenvector $w_{[i]}$ corresponding to $\theta_{[i]}$. Since the prox function $\frac{1}{2}\sum_{i=1}^n \lambda_i^2(X)$ is invariant with respect to the eigenvectors of $X$, it follows that, by suitably relabeling indices, the expression for the function $f_\alpha$ simplifies to
$$f_\alpha(v) = \sum_{i=1}^m v_i + \omega\tau \min\Big\{ \sum_{i=1}^n \theta_i\lambda_i + \frac{\alpha}{2}\sum_{i=1}^n \lambda_i^2 : \sum_{i=1}^n \lambda_i \le 1,\ \lambda_i \ge 0 \Big\}, \tag{8}$$
and $X^* = \sum_{i=1}^n \lambda_i^* w_i w_i^\top$, where $\lambda^*$ achieves the minimum in (8).

Lemma 2. ACTIVESOLVE in Appendix B.1 solves (8) in O(n log n) time.
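Lemma 2's routine admits a compact restatement as a simplex projection. The sketch below (ours, equivalent in spirit to Figure 2, but phrased as a threshold search rather than the paper's explicit active-set loop) solves the inner problem of (8).

import numpy as np

def active_solve(theta, alpha):
    # Minimize sum_i theta_i*lam_i + (alpha/2)*sum_i lam_i^2
    # over sum(lam) <= 1, lam >= 0 (the inner problem in (8)).
    # KKT: lam_i = max(-theta_i/alpha - t, 0), t >= 0, t*(sum(lam) - 1) = 0.
    u = -theta / alpha
    lam = np.maximum(u, 0.0)
    if lam.sum() <= 1.0:
        return lam                               # t = 0: trace constraint inactive
    # Otherwise find t by projecting u onto the simplex (O(n log n) sort).
    s = np.sort(u)[::-1]
    css = np.cumsum(s)
    idx = np.arange(1, len(u) + 1)
    k = np.nonzero(s - (css - 1.0) / idx > 0)[0][-1]
    t = (css[k] - 1.0) / (k + 1)
    return np.maximum(u - t, 0.0)

The optimal $X^*$ in the text is then $\sum_i \lambda_i^* w_i w_i^\top$ over the eigenvectors $w_i$ of $C - \sum_i v_i A_i$.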

The computational effort in calculating $\nabla f_\alpha$ is dominated by the effort required to compute the eigendecomposition of $C - \sum_{i=1}^m v_i A_i$. However, ACTIVESOLVE implies that we only have to compute the negative eigenvalues and the corresponding eigenvectors. Since
$$\mathrm{Tr}(A_k^\top X^*) = \mathrm{Tr}\Big( A_k \sum_{i=1}^n \lambda_i^* u_i^*(u_i^*)^\top \Big) = \sum_{i=1}^n \lambda_i^*(u_i^*)^\top A_k u_i^*,$$
it follows that in order to compute the gradient, we only need to compute the products $(u_i^*)^\top A_k u_i^*$. If the constraint matrices $A_k$ are sparse, we can compute the product $A_k u_i^*$ in $O(s_k)$ time, where $s_k$ denotes the number of non-zero elements in $A_k$, and the product $(u_i^*)^\top A_k u_i^*$ can be computed in $O(n + s_k)$ time. Thus, the $k$-th component of the gradient can be computed in $O(n(s_k + n))$ time, and all $m$ components in $O\big( m(n + n^2) + n\sum_{k=1}^m s_k \big)$ time.

When a constraint matrix is the Laplacian of the complete graph, i.e., $A_k = K$, as in Undirected SPARSEST CUT, although there is no sparsity to exploit, we can exploit the fact that $K = nI - J$, where $J$ is the all-ones matrix. We then have that
$$\mathrm{Tr}(K^\top X^*) = \sum_{i=1}^n \lambda_i^*\,\mathrm{Tr}\big( (nI - J)\, u_i^*(u_i^*)^\top \big).$$
Since $Ju_i^*$ is a vector with identical entries (each the sum of the components of $u_i^*$), the computational cost is $O(n)$ additions. Also, $Iu_i^* = u_i^*$, so the total cost of computing each of the $n$ summands' traces is $O(n)$ additions and multiplications (by the term $n$), plus $O(n)$ additional multiplications and additions to scale and add the main diagonal terms. Thus, the total cost is $O(n^2)$.
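Each summand indeed needs only $O(n)$ work, because $u^\top K u = n\|u\|^2 - (\mathbf{1}^\top u)^2$. A NumPy sketch (ours) of this trick:

import numpy as np

def quad_with_K(u):
    # u^T K u for K = n*I - J in O(n), without forming K:
    # u^T K u = n*||u||^2 - (sum(u))^2.
    n = len(u)
    return n * u.dot(u) - u.sum() ** 2

def trace_K_X(lams, U):
    # Tr(K X*) for X* = sum_i lams[i] * U[:, i] U[:, i]^T.
    return sum(lam * quad_with_K(U[:, i]) for i, lam in enumerate(lams))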

Theorem 2 (Theorem 3 in [21]). The Nesterov non-smooth optimization procedure described in Figure 3 in Appendix B.2 computes a $\delta$-saddle-point in
$$O\Big( \omega\tau\big( T_G + \kappa(m) \big)\|A\|\sqrt{\ln(m)} \cdot \tfrac{1}{\delta} \Big)$$
time, where $\omega$ denotes the penalty multiplier used in defining the saddle-point problem, $\|A\| = \max_{1\le i\le m}\lambda_{\max}(A_i)$, $T_G$ denotes the running time for computing the gradient $\nabla f_\alpha$, and $\kappa(m)$ denotes the running time to compute $\langle A_i, Z \rangle$ for $i = 1, \ldots, m$ and $Z \in \mathcal{X}$.

Note that $T_G$ is typically dominated by the complexity of computing the negative eigenvalues and the corresponding eigenvectors, and is, in the worst case, $O(n^3)$.

Since $\nu_L \le \nu^* \le \nu_U = \langle C, Y \rangle$, Lemma 1 implies that $\omega = \frac{\langle C, Y \rangle - \nu_L}{g(Y)}$ suffices. Since an absolute $(\varepsilon\nu_L)$-optimal solution is a relative $\varepsilon$-optimal solution, Assumption (b) implies that
$$\frac{\omega}{\nu_L} = \Big( \frac{\nu_U - \nu_L}{\nu_L} \Big) \cdot \frac{1}{g(Y)} \le p(n,m)\,q(n,m).$$
Thus, Theorem 2 implies that a relative $\varepsilon$-optimal solution can be computed in $O\big( \tau q(n,m)p(n,m)(T_G + m)\|A\|\sqrt{\ln(m)} \cdot \frac{1}{\varepsilon} \big)$ time. In the next section, we show that bisection can help us improve the overall running time by a factor of $p(n,m)$.

4.3 Bisection on the gap improves running time

Let $Y \in \mathcal{X}$ denote the strictly feasible point specified in Assumption (a), i.e., $\min_{1\le i\le m}\langle A_i, Y \rangle \ge 1 + \frac{1}{q(n,m)}$, so $g(Y) \ge \frac{1}{q(n,m)}$. In this section, we will assume that $q(n,m) \ge 8$. Let $\Delta = \nu_U - \nu_L$. We initialize the algorithm with $\nu_L^{(0)} = \nu_L$ and $\nu_U^{(0)} = \nu_U$. Let $\nu_L^{(t)}$ and $\nu_U^{(t)}$ denote the lower and upper bounds on the optimal value $\nu^*$ of (2), and let $\varepsilon^{(t)} = \frac{\Delta^{(t)}}{\nu_L^{(t)}}$ denote the relative error at the beginning of iteration $t$ of SOLVECOVERSDP. Note that Assumption (b) implies $\varepsilon^{(0)} = O(p(n,m))$. In iteration $t$ we approximately solve the SDP
$$\begin{array}{rl} \nu^{(t)} = \min & \Big\langle \frac{1}{\nu_L^{(t)}}C,\ X \Big\rangle \\ \text{s.t.} & \langle \hat{A}_i, X \rangle \ge q(n,m) + \gamma^{(t)}, \quad i = 1, \ldots, m, \\ & X \in \mathcal{X}, \end{array} \tag{9}$$
where $\gamma^{(t)} = \min\{\alpha^t, \varepsilon^{(t)}\}$ for some $\frac{5}{6} < \alpha < 1$ and $\hat{A}_i = q(n,m)A_i$. Thus, $\|\hat{A}\| = q(n,m)\|A\|$.

Since the right-hand sides of the cover constraints in (9) are not all equal to 1, it is not a covering SDP. However, (9) can be converted into a covering SDP by rescaling; so, we abuse notation and refer to (9) as a covering SDP. In each iteration we solve (9) with slightly different right-hand sides; therefore, we find it more convenient to work with the unscaled version of the problem. Let
$$g^{(t)}(X) \stackrel{\Delta}{=} \min_{i=1,\ldots,m}\big\{ \langle \hat{A}_i, X \rangle - q(n,m) \big\} - \gamma^{(t)}$$
denote the margin of feasibility of $X$ with respect to the cover constraints in (9). In this section, we let $g(X) = \min_{i=1,\ldots,m}\big\{ \langle \hat{A}_i, X \rangle - q(n,m) \big\}$; thus, $g^{(t)}(X) = g(X) - \gamma^{(t)}$.

SOLVECOVERSDP

Inputs: $C$, $A_i$, $X^{(0)}$, $\varepsilon$, $\nu_U^{(0)}$, $\nu_L^{(0)}$
Output: $X^*$

Set $t \leftarrow 0$, $\gamma_t \leftarrow \frac{\nu_U^{(t)} - \nu_L^{(t)}}{\nu_L^{(t)}}$
while $\gamma_t > \varepsilon$ do
    Compute $\hat{X}^{(t)}$, a $\big(\frac{\gamma_t}{3}\big)$-optimal solution to $\nu^{(t)}$ using NESTEROV PROCEDURE
    Set $(\nu_L^{(t+1)}, \nu_U^{(t+1)}) \leftarrow \begin{cases} \big( \nu_L^{(t)},\ \nu_L^{(t)} + \frac{2}{3}\Delta^{(t)} \big) & \text{if } \langle C, \hat{X}^{(t)} \rangle \le \nu_L^{(t)} + \frac{2}{3}\Delta^{(t)} \\[4pt] \Big( \frac{\nu_L^{(t)} + \frac{1}{3}\Delta^{(t)}}{1 + \gamma_t\Delta^{(t)}/\nu_L^{(t)}},\ \nu_U^{(t)} \Big) & \text{otherwise} \end{cases}$
    $t \leftarrow t + 1$, $\gamma_t \leftarrow \frac{\nu_U^{(t)} - \nu_L^{(t)}}{\nu_L^{(t)}}$
return $\hat{X}^{(t)}$

Fig. 1. Our algorithm for solving the covering SDP.
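The control flow of SOLVECOVERSDP is easy to restate in code. The sketch below (ours) treats the Nesterov solve as a black-box callable `approx_solve(gamma)` returning the objective value of a $(\gamma/3)$-optimal iterate, and implements only the bound-update bisection; constants follow Figure 1 and Lemma 3.

def solve_cover_sdp(approx_solve, nu_L, nu_U, eps):
    # Bisection on the gap (Section 4.3): each iteration shrinks the
    # relative gap (nu_U - nu_L)/nu_L by a factor of at least 5/6 (Lemma 3).
    gamma = (nu_U - nu_L) / nu_L
    while gamma > eps:
        delta = nu_U - nu_L
        val = approx_solve(gamma)                # <C, X(t)> of the iterate
        if val <= nu_L + 2.0 * delta / 3.0:
            nu_U = nu_L + 2.0 * delta / 3.0      # shrink the gap from above
        else:
            # otherwise the lower bound can be raised (Lemma 3)
            nu_L = (nu_L + delta / 3.0) / (1.0 + gamma * delta / nu_L)
        gamma = (nu_U - nu_L) / nu_L
    return nu_L, nu_U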

Lemma 3. For $t \ge 0$, let $\hat{X}^{(t)} \in \mathcal{X}$ denote an absolute $\frac{\varepsilon^{(t)}}{3}$-optimal solution, i.e., $\langle C, \hat{X}^{(t)} \rangle \le \nu^{(t)} + \frac{1}{3}\Delta^{(t)}$. Update $(\nu_L^{(t+1)}, \nu_U^{(t+1)})$ as follows:
$$(\nu_L^{(t+1)}, \nu_U^{(t+1)}) = \begin{cases} \big( \nu_L^{(t)},\ \nu_L^{(t)} + \frac{2}{3}\Delta^{(t)} \big), & \text{if } \langle C, \hat{X}^{(t)} \rangle \le \nu_L^{(t)} + \frac{2}{3}\Delta^{(t)}, \\[4pt] \Big( \frac{\nu_L^{(t)} + \frac{1}{3}\Delta^{(t)}}{1 + \gamma^{(t)}\Delta^{(t)}/\nu_L^{(t)}},\ \nu_U^{(t)} \Big), & \text{otherwise.} \end{cases}$$
Then, for all $t \ge 0$:

(i) $\frac{\Delta^{(t+1)}}{\nu_L^{(t+1)}} \le \big( \frac{5}{6} \big)\frac{\Delta^{(t)}}{\nu_L^{(t)}} \le \big( \frac{5}{6} \big)^{t+1}\frac{\Delta^{(0)}}{\nu_L^{(0)}}$, i.e., the gap converges geometrically to 0.

(ii) $\hat{X}^{(t)}$ is feasible for $\nu^{(t+1)}$, and $g^{(t+1)}(\hat{X}^{(t)}) \ge \gamma^{(t)} - \gamma^{(t+1)} \ge (1 - \alpha)\gamma^{(t)}$.

The proof of Lemma 3 is in Appendix B.1.

Theorem 3. Fix $\frac{5}{6} < \alpha < 1$. Then for all $\varepsilon$ satisfying (10), SOLVECOVERSDP computes a relative $\varepsilon$-optimal solution for the covering SDP (2) in
$$O\Big( \tau q(n,m)\log\big( p(n,m)\tfrac{1}{\varepsilon} \big)(T_G + m)\|A\|\sqrt{\ln(m)} \cdot \tfrac{1}{\varepsilon} \Big)$$
time, where $q(n,m)$ is the polynomial that satisfies Assumption (a), and $T_G$ denotes the running time for computing the gradient $\nabla f_\alpha$ in the Nesterov procedure.

Proof. From Lemma 3 (i) it follows that SOLVECOVERSDP terminates after at most
$$T = \Bigg\lceil \frac{\ln\big( \frac{\varepsilon^{(0)}}{\varepsilon} \big)}{\ln\big( \frac{6}{5} \big)} \Bigg\rceil$$
iterations. From the analysis in the previous sections, we know that the running time for computing an absolute $\frac{1}{3}\varepsilon^{(t)}$-optimal solution is
$$O\Big( \frac{\varepsilon^{(t)}}{g^{(t)}(\hat{X}^{(t)})} \cdot \frac{3}{\varepsilon^{(t)}} \Big) = O\Big( \frac{1}{\gamma^{(t-1)}} \Big),$$
where we have ignored polynomial factors. Since $\alpha > \frac{5}{6}$, it follows that $\gamma^{(t+1)} \le \alpha\gamma^{(t)}$, i.e., $\gamma^{(t)}$ decreases geometrically. Thus, the overall running time of SOLVECOVERSDP is $O\big( \frac{1}{\gamma^{(T)}} \big)$, and all that remains is to show that $\gamma^{(T)} = \varepsilon$.

Let $T_\gamma = \inf\{ t : \varepsilon^{(0)}\big( \frac{5}{6} \big)^t < \alpha^t \}$. Since $\varepsilon^{(t)} \le \varepsilon^{(0)}\big( \frac{5}{6} \big)^t$, it follows that for all $t \ge T_\gamma$, $\gamma^{(t)} = \varepsilon^{(t)}$. From the definitions of $T_\gamma$ and $T$ it follows that $T \ge T_\gamma$ if
$$\varepsilon \le \frac{5}{6}\big( \varepsilon^{(0)} \big)^{-\frac{\ln(\frac{1}{\alpha})}{\ln(\frac{6\alpha}{5})}}. \tag{10}$$
For $\alpha = \frac{6}{7} > \frac{5}{6}$, the bound is $\varepsilon \le \frac{5}{6}\big( \varepsilon^{(0)} \big)^{-0.84}$. □


Bibliography

[1] S. ARORA, E. HAZAN, AND S. KALE, Fast algorithms for approximate semidefinite programming using the multiplicative weights update method, in Proceedings of the 46th Annual Symposium on Foundations of Computer Science, 2005.

[2] S. ARORA AND S. KALE, A combinatorial, primal-dual approach to semidefinite programs, in Proceedings of the 39th Annual ACM Symposium on Theory of Computing, 2007.

[3] S. ARORA, S. RAO, AND U. VAZIRANI, Expander flows, geometric embeddings, and graph partitionings, in Proceedings of the 36th Annual ACM Symposium on Theory of Computing, 2004, pp. 222–231.

[4] D. BIENSTOCK, Experiments with a network design algorithm using epsilon-approximate LPs. Talk at ISMP97.

[5] D. BIENSTOCK, Potential function methods for approximately solving linear programming problems: theory and practice, International Series in Operations Research & Management Science, 53, Boston, MA, 2002.

[6] D. BIENSTOCK AND G. IYENGAR, Solving fractional packing problems in $O^*(\frac{1}{\varepsilon})$ iterations, in Proceedings of the 36th Annual ACM Symposium on Theory of Computing, 2004, pp. 146–155.

[7] A. D'ASPREMONT, L. EL GHAOUI, M. I. JORDAN, AND G. R. G. LANCKRIET, A direct formulation for sparse PCA using semidefinite programming, SIAM Rev., 49 (2007), pp. 434–448 (electronic).

[8] L. FLEISCHER, Fast approximation algorithms for fractional covering problems with box constraint, in Proceedings of the 15th ACM-SIAM Symposium on Discrete Algorithms, 2004.

[9] N. GARG AND J. KÖNEMANN, Faster and simpler algorithms for multicommodity flow and other fractional packing problems, in Proceedings of the 39th Annual Symposium on Foundations of Computer Science, 1998, pp. 300–309.

[10] M. X. GOEMANS AND D. P. WILLIAMSON, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, Journal of the ACM, 42 (1995), pp. 1115–1145.

[11] A. V. GOLDBERG, J. D. OLDHAM, S. A. PLOTKIN, AND C. STEIN, An implementation of a combinatorial approximation algorithm for minimum-cost multicommodity flow, in Proceedings of the 4th Conference on Integer Programming and Combinatorial Optimization, 1998, pp. 338–352. Published as Lecture Notes in Computer Science 1412, Springer-Verlag.

[12] M. GRÖTSCHEL, L. LOVÁSZ, AND A. SCHRIJVER, Polynomial algorithms for perfect graphs, in Topics on Perfect Graphs, vol. 88 of North-Holland Math. Stud., North-Holland, Amsterdam, 1984, pp. 325–356.

[13] G. IYENGAR, D. J. PHILLIPS, AND C. STEIN, Approximation algorithms for semidefinite packing problems with applications to MAXCUT and graph coloring, in Proceedings of the 11th Conference on Integer Programming and Combinatorial Optimization, 2005, pp. 152–166. Submitted to SIAM Journal on Optimization.

[14] D. KARGER, R. MOTWANI, AND M. SUDAN, Approximate graph coloring by semidefinite programming, J. ACM, 45 (1998), pp. 246–265.

[15] P. KLEIN AND H.-I. LU, Efficient approximation algorithms for semidefinite programs arising from MAX CUT and COLORING, in Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing (Philadelphia, PA, 1996), New York, 1996, ACM, pp. 338–347.

[16] P. KLEIN AND H.-I. LU, Space-efficient approximation algorithms for MAXCUT and COLORING semidefinite programs, in Algorithms and Computation (Taejon, 1998), vol. 1533 of Lecture Notes in Comput. Sci., Springer, Berlin, 1998, pp. 387–396.

[17] P. KLEIN AND N. YOUNG, On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms, in Integer Programming and Combinatorial Optimization (Graz, 1999), vol. 1610 of Lecture Notes in Comput. Sci., Springer, Berlin, 1999, pp. 320–327.

[18] T. LEONG, P. SHOR, AND C. STEIN, Implementation of a combinatorial multicommodity flow algorithm, in DIMACS Series in Discrete Mathematics and Theoretical Computer Science: Volume 12, Network Flows and Matching, Oct. 1991.

[19] L. LOVÁSZ, On the Shannon capacity of a graph, IEEE Trans. Inform. Theory, 25 (1979), pp. 1–7.

[20] Z. LU, R. MONTEIRO, AND M. YUAN, Convex optimization methods for dimension reduction and coefficient estimation in multivariate linear regression, arXiv preprint arXiv:0904.0691, 2009.

[21] Y. NESTEROV, Smooth minimization of non-smooth functions, Mathematical Programming, 103 (2005), pp. 127–152.

[22] Y. NESTEROV, Smoothing technique and its applications in semidefinite optimization, Mathematical Programming, 110 (2007), pp. 245–259.

[23] Y. NESTEROV AND A. NEMIROVSKI, Interior-Point Polynomial Algorithms in Convex Programming, vol. 13 of SIAM Studies in Applied Mathematics, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1994.

[24] S. PLOTKIN, D. B. SHMOYS, AND É. TARDOS, Fast approximation algorithms for fractional packing and covering problems, Mathematics of Operations Research, 20 (1995), pp. 257–301.

[25] N. SIDIROPOULOS, T. DAVIDSON, AND Z. LUO, Transmit beamforming for physical-layer multicasting, IEEE Transactions on Signal Processing, 54 (2006), p. 2239.

[26] K. WEINBERGER AND L. SAUL, Distance metric learning for large margin nearest neighbor classification, The Journal of Machine Learning Research, 10 (2009), pp. 207–244.


A Application details

A.1 Undirected SPARSEST CUT.

We first consider two equivalent SDP relaxations for Undirected SPARSEST CUT [3], and then show how a related covering SDP can be used to solve the SDP relaxations. As in [3], we assume that $G$ is unweighted. Recall that we assume the graph $G$ is connected. Here are two equivalent Undirected SPARSEST CUT SDP relaxations.

Formulation (ARV):
$$\begin{array}{ll} \min & \langle L, X \rangle \\ \text{s.t.} & \langle K, X \rangle = 1, \\ & \langle K_{ij} + K_{jk}, X \rangle \ge \langle K_{ik}, X \rangle \quad \forall i,j,k, \\ & X \succeq 0. \end{array}$$

Formulation (ARVT):
$$\begin{array}{lll} \min & \langle L, X \rangle & \\ \text{s.t.} & \langle K, X \rangle = 1 & (a) \\ & \langle K_{ij} + K_{jk}, X \rangle \ge \langle K_{ik}, X \rangle \quad \forall i,j,k & (b) \\ & \mathrm{Tr}(X) \le \frac{2}{n} & (c) \\ & X \succeq 0. & \end{array}$$

Formulation (ARV) is the original formulation due to Arora, Rao and Vazirani [3], and (ARVT) is the same formulation with a trace constraint added. We first show that the additional trace bound does not change the objective at optimality.

Lemma 4. Formulation (ARVT) is equivalent to formulation (ARV) in the sense that if one is infeasible then so is the other, and they share the same optimal objective if feasible.

Proof. Since (ARVT) is simply (ARV) with an additional constraint, we must only show that a feasible solution to (ARV) can be transformed into a feasible solution to (ARVT) with the same objective value. Let $X = V^\top V$ be feasible for (ARV), where the positive semidefiniteness of $X$ implies the existence of the decomposition $V$. In order to transform $X$, we shift all the columns of $V$ so they are centered at the origin. That is, let
$$w = \frac{1}{n}V\mathbf{1}, \qquad W = V - w\mathbf{1}^\top,$$
and
$$Y = W^\top W = X - V^\top w\mathbf{1}^\top - \mathbf{1}w^\top V + w^\top w J.$$
We must show that $Y$ is feasible for (ARVT). We first note that any weighted Laplacian matrix $M$ on a graph with $n$ nodes has
$$\langle M, Y \rangle = \langle M, X \rangle - \langle M, V^\top w\mathbf{1}^\top \rangle - \langle \mathbf{1}w^\top V, M \rangle + \langle M, w^\top w J \rangle = \langle M, X \rangle,$$
since $M\mathbf{1} = \mathbf{1}^\top M = 0$. Thus,
$$\langle L, Y \rangle = \langle L, X \rangle, \qquad \langle K_{ij} + K_{jk} - K_{ik}, Y \rangle = \langle K_{ij} + K_{jk} - K_{ik}, X \rangle \ge 0 \quad \forall i,j,k,$$
and $\langle K, Y \rangle = \langle K, X \rangle = 1$. The final equality implies that $\mathrm{Tr}(Y) = 1/n$, since $K = nI - J$ and
$$\langle J, Y \rangle = \langle \mathbf{1}\mathbf{1}^\top, W^\top W \rangle = \langle W\mathbf{1}\mathbf{1}^\top, W^\top \rangle = 0,$$
as $W\mathbf{1} = V\mathbf{1} - w\mathbf{1}^\top\mathbf{1} = nw - nw = 0$. Finally, $Y \succeq 0$ by construction. So a feasible solution to (ARV) implies the existence of a feasible solution to (ARVT) with the same objective, or, equivalently, if (ARVT) is infeasible so is (ARV), and the optimal objective of (ARV) is the same as that of (ARVT). □

We then use the following covering SDP, which is a relaxation of (ARVT):
$$\begin{array}{ll} \min & \langle L, X \rangle \\ \text{s.t.} & \langle T_{ijk}, X \rangle \ge 1 \quad \forall i,j,k \in V, \\ & \langle K, X \rangle \ge 1, \\ & \frac{n}{2}\mathrm{Tr}(X) \le 1, \\ & X \succeq 0, \end{array} \tag{11}$$
where $T_{ijk} = \frac{n}{4}(K_{ij} + K_{jk} - K_{ik} + 2I)$. Note that $T_{ijk} \succeq 0$, and we use the last two constraints to set $\mathcal{X} = \big\{ X \succeq 0 : \mathrm{Tr}(X) \le \frac{2}{n} \big\}$. The trace bound of $\frac{2}{n}$ is needed so that the feasible solution satisfies Assumption (a), as we show below.

By suitably shifting the solution, we can solve (ARV) via (11).

Lemma 5. Formulations (ARV) and (11) share the same objective value at optimality. Moreover, (ARV) is infeasible if, and only if, (11) is.

Proof. Since (11) is a relaxation of (ARVT), Lemma 4 indicates we need only show that a feasible solution to (11) can be converted into one for (ARV) with the same objective. So, suppose $Y$ is a feasible solution to (11) with $\langle K, Y \rangle = \alpha > 1$. Since $\langle M, J \rangle = 0$ for any Laplacian matrix $M$, we can use a shift by $J$ to achieve our result. Let
$$\bar{Y} = \frac{1}{\alpha}Y + \Big( 1 - \frac{1}{\alpha} \Big)J.$$
Then
$$\langle K, \bar{Y} \rangle = \Big\langle K,\ \frac{1}{\alpha}Y + \Big( 1 - \frac{1}{\alpha} \Big)J \Big\rangle = \frac{1}{\alpha}\langle K, Y \rangle = 1.$$
The feasibility of $Y$ for (11) implies that $\langle T_{ijk}, Y \rangle = \big\langle \frac{n}{4}(K_{ij} + K_{jk} - K_{ik} + 2I),\ Y \big\rangle \ge 1$. Thus, we have
$$\langle K_{ij} + K_{jk} - K_{ik}, Y \rangle \ge \frac{4}{n} - 2\,\mathrm{Tr}(Y) \ge \frac{4}{n} - \frac{4}{n} = 0,$$
where the last inequality follows since $\mathrm{Tr}(Y) \le \frac{2}{n}$. Then, since $\alpha > 0$, this implies that
$$\langle K_{ij} + K_{jk} - K_{ik}, \bar{Y} \rangle = \frac{1}{\alpha}\langle K_{ij} + K_{jk} - K_{ik}, Y \rangle \ge 0.$$
So $\bar{Y}$ is feasible for (ARV). Note that the objective at $\bar{Y}$ only improves, since $\langle L, \bar{Y} \rangle = \frac{1}{\alpha}\langle L, Y \rangle < \langle L, Y \rangle$. Thus, if a feasible $Y$ for (11) is not feasible for (ARVT), it can be converted into a feasible solution to (ARV) with an improved objective. If $Y$ is feasible for (ARVT), then Lemma 4 can be used to convert it into a feasible solution to (ARV). □

We then have, as a corollary to Theorem 1, the following result.


Corollary 4. An $\varepsilon$-optimal solution to the Undirected SPARSEST CUT SDP (ARV) can be found in $O(n^4 \cdot \varepsilon^{-1}\log(\varepsilon^{-1}))$ operations.

Proof. We show that Assumption (a) is satisfied with $q(n,m) = n$. Note that $Y = \frac{2}{n^2(n-1)}K \succeq 0$ has
$$\mathrm{Tr}(Y) = \frac{2}{n^2(n-1)} \cdot n(n-1) = \frac{2}{n},$$
and so $Y \in \mathcal{X}$. Moreover,
$$\langle T_{ijk}, Y \rangle = \frac{1}{2n(n-1)}\langle K_{ij} + K_{jk} - K_{ik}, K \rangle + 1 = \frac{1}{n-1} + 1.$$
Also, since $\langle K, K \rangle = n^2(n-1)$, we have $\langle K, Y \rangle = 2 > 1 + \frac{1}{n}$. Note that a trace bound of $\frac{1}{n}$ would not allow a scaled version of $K$ to satisfy Assumption (a).

Thus, $Y$ is feasible for (11), and $\nu_U = \langle L, Y \rangle \le \langle K, Y \rangle = 2$ since $G$ is unweighted. Assumption (b) holds by results in [3], where it was shown that the (ARV) formulation has a lower bound of $\frac{1}{\sqrt{\log(n)}}$, i.e., $\nu_L = \frac{1}{\sqrt{\log(n)}}$; thus, $p(n,m) = 2\sqrt{\log(n)} = \tilde{O}(1)$. Finally, $\mathcal{X} = \{ X \succeq 0 : \mathrm{Tr}(X) \le \frac{2}{n} \}$, which satisfies Assumption (a), and note that $\|A\| = \lambda_{\max}(K) = n$, $T_G = O(n^3)$, and $m = O(n^3)$. □

A.2 Distance metric learning

For a positive integer $k$, the $k$-Nearest Neighbor problem can be stated as follows. We are given a set of vertices $V = \{1, \ldots, T\}$ with $T > k$, and for each $i \in V$ there is a vector $v_i \in \mathbb{R}^n$ and a label $y_i \in \{0, 1, \ldots, R\}$ for some integer $R \ge 1$. We are also given an edge set $E_N = \{(i,j) : \mu(v_i, v_j) \le \alpha\}$, which indicates that inputs $i$ and $j$ are near with respect to some distance function $\mu$ on $\mathbb{R}^n$. Each node $i \in V$ has $k$ neighbors, i.e., for fixed $i$, $|\{(u,v) \in E_N : u = i\}| = k$. We also let $E_D = \{(i,j,\ell) : (i,j) \in E_N,\ y_i \ne y_\ell\}$ denote the triples of a nearest-neighbor pair $(i,j)$ along with an input $\ell$ with a different label than $i$. Finally, let $\hat{L} = |E_N| \le kT$ and $D = |E_D| \le kT^2$. Then, the objective is to find a transformation $H \in \mathbb{R}^{n\times n}$ so that after transforming the vectors to $Hv_i$, the sum of the nearest-neighbor distances is minimized while the distance between inputs with different labels is maximized. To solve this problem, Weinberger et al. [26] formulate the following SDP with the matrix variable $Y = H^\top H$ and additional variables $z_{ij\ell}$:
$$\begin{array}{ll} \min & \sum_{(i,j)\in E_N}(v_i - v_j)^\top Y(v_i - v_j) + c\sum_{(i,j)\in E_N,\,(i,\ell)\in E_D} z_{ij\ell} \\ \text{s.t.} & (v_i - v_\ell)^\top Y(v_i - v_\ell) - (v_i - v_j)^\top Y(v_i - v_j) + z_{ij\ell} \ge 1, \quad (i,j,\ell) \in E_D, \\ & z_{ij\ell} \ge 0, \quad (i,j,\ell) \in E_D, \\ & Y \succeq 0. \end{array} \tag{12}$$
Note that $Y \in S^n$ and that there are $D = O(T^2)$ constraints. Also, $c$ is a given positive constant. To convert this to a covering SDP, we introduce the variable matrix $X \in \mathbb{R}^{(n+D)\times(n+D)}$, where
$$X = \begin{pmatrix} Y & D \\ D & Z \end{pmatrix},$$
so that the blocks $D$ consist of dummy variables and $Z$ is a matrix whose main diagonal is $z \in \mathbb{R}^D_+$ and whose off-diagonal entries are dummy variables. Although we make this formulation change, we will not explicitly use the dummy variables, and so we do not incur additional computational cost when we need to perform expensive calculations such as eigenvalue-eigenvector decompositions. Also, we use $C$ to denote the objective coefficient matrix and $\hat{A}_{ij\ell}$ the constraint matrices, where

$$C = \begin{pmatrix} L & 0 \\ 0 & cI \end{pmatrix} \quad \text{and} \quad \hat{A}_{ij\ell} = \begin{pmatrix} (e_i - e_\ell)(e_i - e_\ell)^\top - (e_i - e_j)(e_i - e_j)^\top & 0 \\ 0 & \mathrm{diag}(e_{ij\ell}) \end{pmatrix}.$$
Then, (12) becomes
$$\begin{array}{ll} \min & \langle C, X \rangle \\ \text{s.t.} & \langle \hat{A}_{ij\ell}, X \rangle \ge 1, \quad (i,j,\ell) \in E_D, \\ & X \succeq 0. \end{array} \tag{13}$$

Also, we require upper and lower bounds on (13). Since
$$\hat{I} = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}$$
is (strictly) feasible, we can set $\nu_U = \hat{L} \le kT = O(T)$. To find a lower bound, we consider the dual of (13),
$$\begin{array}{ll} \max & \sum_{(i,j,\ell)\in E_D} p_{ij\ell} \\ \text{s.t.} & C - \sum_{(i,j,\ell)\in E_D} p_{ij\ell}\hat{A}_{ij\ell} \succeq 0. \end{array}$$
The key is to note that for any given $(i,j,\ell)$, $C - \hat{A}_{ij\ell}$ is the Laplacian formed by removing the edge $(i,j)$ and adding the edge $(i,\ell)$. In particular, $C - \hat{A}_{ij\ell} \succeq 0$, and $\nu_L = 1$. We now convert (13) into a covering SDP. Note that $C \succeq 0$, but $\hat{A}_{ij\ell} \not\succeq 0$, and we do not have a trace bound. To obtain the latter, since $\hat{I}$ is a strictly feasible solution, it is without loss of generality to assume $\mathrm{Tr}(X) \le \hat{L} = O(T)$. Moreover, note that if a solution $X$ has $\mathrm{Tr}(X) < \hat{L}$, since
$$\hat{J} = \begin{pmatrix} J & 0 \\ 0 & 0 \end{pmatrix}$$
has $\langle \hat{A}_{ij\ell}, \hat{J} \rangle = \langle C, \hat{J} \rangle = 0$, we can add $(\hat{L} - \mathrm{Tr}(X))\hat{J}$ to $X$ in a manner analogous to Undirected SPARSEST CUT and keep the objective the same. Thus, we can also let $A_{ij\ell} = \frac{1}{\hat{L}}(\hat{A}_{ij\ell} + \hat{I})$ and obtain the covering SDP
$$\begin{array}{ll} \min & \langle C, X \rangle \\ \text{s.t.} & \langle A_{ij\ell}, X \rangle \ge 1, \quad (i,j,\ell) \in E_D, \\ & \mathrm{Tr}(X) \le \hat{L}, \\ & X \succeq 0. \end{array} \tag{14}$$
We similarly scale $C$ so that $\lambda_{\max}(C) = O(1)$ (note the scaling affects both $\nu_U$ and $\nu_L$ in the same way). Then $\langle A_{ij\ell}, \hat{I} \rangle = \frac{\hat{L}+1}{\hat{L}} \ge 1 + \frac{1}{kT}$, so we have $q(n,m) = \tau = p(n,m) = \hat{L} = O(T)$. Moreover, note that computing the eigenvalues of $C - \sum_{ij\ell} p_{ij\ell}A_{ij\ell}$ takes $O(n^3)$ time, since the matrix is diagonal in the last $D$ coordinates. Finally, there are $m = O(T^2)$ constraints in (14), as there were in (12), and $\|A\| = O(1)$, since each constraint matrix is just an identity plus one edge Laplacian minus another edge Laplacian (scaled by $\frac{1}{\hat{L}}$). We can therefore state the following corollary to Theorem 1.


ACTIVESOLVE

Inputs: $\theta_i$, $i = 1, \ldots, n$; $\alpha$
Output: $\lambda^*$

Set $\lambda_i \leftarrow \max\{-\theta_i/\alpha,\ 0\}$, $k \leftarrow |\{i : \lambda_i > 0\}|$
if $\sum_{i=1}^n \lambda_i \le 1$ then
    Set $\lambda_i^* \leftarrow \lambda_i$ for all $i$
else
    Sort the $\lambda_i$ in ascending order and update indices to the sorted order
    Set $l \leftarrow n - k + 1$, $q \leftarrow \lambda_l$
    while $\sum_{i=l}^n \lambda_i - kq > 1$ do
        $k \leftarrow k - 1$, $l \leftarrow l + 1$
        $q \leftarrow \lambda_l$, $\lambda_l \leftarrow 0$
    $q \leftarrow \frac{1}{k}\big( \sum_{i=l}^n \lambda_i - 1 \big)$
    Set $\lambda_i^* \leftarrow (\lambda_i - q)^+$ for all $i$
return $\lambda^*$

Fig. 2. The active set method for finding the optimal solution to (8).

Corollary 5. An $\varepsilon$-optimal solution to the k-Nearest Neighbors covering SDP can be found in $O\big( T^2(n^3 + T^2)\sqrt{\log(T)} \cdot \varepsilon^{-1} \big)$ time.

B Proofs

B.1 Algorithm details for Section 4.2

Proof (Proof of Lemma 2). The optimization problem in (8) is convex and satisfies the Slater condition. Thus, all optimal solutions satisfy the Karush-Kuhn-Tucker (KKT) conditions, and since the objective function is strongly convex, the optimal point is unique. Thus, all we have to do is show that the output of ACTIVESOLVE satisfies the KKT conditions:
$$\lambda_i^* = \Big( -\frac{\theta_i}{\alpha} - q \Big)^+, \qquad \Big( \sum_{i=1}^n \lambda_i^* - 1 \Big)q = 0, \qquad q \ge 0.$$
If $\sum_{i=1}^n \big( -\frac{\theta_i}{\alpha} \big)^+ \le 1$, ACTIVESOLVE returns $\lambda^* = \frac{1}{\alpha}(-\theta)^+$. Clearly, $\lambda^*$ and $q = 0$ satisfy the KKT conditions.

If $\sum_{i=1}^n \big( -\frac{\theta_i}{\alpha} \big)^+ > 1$, then on termination $q > \lambda_{l-1} \ge 0$, the output is $\lambda_i^* = \big( -\frac{\theta_i}{\alpha} - q \big)^+$, and
$$\sum_{i=1}^n \lambda_i^* = \sum_{i=l}^n \lambda_i - kq = 1.$$
Thus, $\lambda^*$ and $q$ again satisfy the KKT conditions.

The runtime of ACTIVESOLVE is determined by $O(1)$ initialization, an $O(n\log n)$ sorting call, and an $O(n)$ loop. Thus, the total runtime is $O(n\log n)$. □

Proof (Proof of Lemma 3). Consider the two cases specified in the update rules for $\nu_L^{(t)}$ and $\nu_U^{(t)}$.

– $\langle C, \hat{X}^{(t)} \rangle \le \nu_L^{(t)} + \frac{2}{3}\Delta^{(t)}$: In this case, $\frac{\Delta^{(t+1)}}{\nu_L^{(t+1)}} = \frac{2}{3}\frac{\Delta^{(t)}}{\nu_L^{(t)}} \le \frac{5}{6}\frac{\Delta^{(t)}}{\nu_L^{(t)}}$.

– $\langle C, \hat{X}^{(t)} \rangle > \nu_L^{(t)} + \frac{2}{3}\Delta^{(t)}$: In this case, $\nu_L^{(t)} + \frac{1}{3}\Delta^{(t)}$ is a lower bound for $\nu^{(t)}$. Assumption (a) and the fact that $A_i \succeq 0$ imply that $\nu_L^{(t+1)} = \frac{\nu_L^{(t)} + \frac{1}{3}\Delta^{(t)}}{1 + \gamma^{(t)}\Delta^{(t)}/\nu_L^{(t)}} \le \nu^*$. Thus,
$$\nu_L^{(t+1)} \ge \Big( \nu_L^{(t)} + \frac{1}{3}\Delta^{(t)} \Big)\Big( 1 - \frac{\gamma^{(t)}\Delta^{(t)}}{\nu_L^{(t)}} \Big) \ge \nu_L^{(t)} + \Big( \frac{1}{3} - \frac{4}{3q(n,m)} \Big)\Delta^{(t)} \ge \nu_L^{(t)} + \frac{\Delta^{(t)}}{6},$$
where we use the facts that $\nu_L^{(t)}\gamma^{(t)} \le \nu_L^{(t)}\varepsilon^{(t)} = \Delta^{(t)}$, $\gamma^{(t)} \le 1$, and $q(n,m) \ge 8$. Thus,
$$\frac{\Delta^{(t+1)}}{\nu_L^{(t+1)}} \le \frac{\nu_U^{(t)} - \big( \nu_L^{(t)} + \frac{1}{6}\Delta^{(t)} \big)}{\nu_L^{(t)}} \le \frac{5}{6}\frac{\Delta^{(t)}}{\nu_L^{(t)}}.$$

This establishes (i). Since $\varepsilon^{(t)} - \varepsilon^{(t+1)} \ge \frac{1}{6}\varepsilon^{(t)} \ge (1 - \alpha)\varepsilon^{(t)}$, a simple case analysis establishes that $\gamma^{(t)} - \gamma^{(t+1)} \ge (1 - \alpha)\gamma^{(t)}$. This establishes (ii). □

B.2 Nesterov Procedure

NESTEROV PROCEDURE

Inputs: $C$, $A_i$, $\tau$, $\omega$, $\varepsilon$
Outputs: $(\bar{Y}, y)$

Set $N \leftarrow \big\lceil \tau\omega\|A\|\sqrt{\ln(m)} \cdot \frac{1}{\varepsilon} \big\rceil$, $\alpha \leftarrow \frac{\varepsilon}{2}$, $L \leftarrow \omega^2\tau^2\|A\|^2 \cdot \frac{1}{\varepsilon}$, $w^{(0)} \leftarrow \frac{1}{m+1}\mathbf{1}$
for $k \leftarrow 0$ to $N$ do
    Compute the eigendecomposition $C - \sum_{i=1}^m w_i^{(k)}A_i = \sum_{j=1}^n \theta_j u_j u_j^\top$
    Set $\lambda \leftarrow \mathrm{ACTIVESOLVE}(\theta, \alpha)$
    Set $X^{(k)} \leftarrow \sum_{j=1}^n \lambda_j u_j u_j^\top$
    $g_i^{(k)} \leftarrow 1 - \langle A_i, X^{(k)} \rangle$, $i = 1, \ldots, m$
    $\gamma_i^{(k)} \leftarrow 1 + \log(w_i^{(k)})$, $i = 1, \ldots, m$
    $y^{(k)} \leftarrow \operatorname{argmax}_{v \in \mathcal{V}}\big\{ \big( g^{(k)} - \frac{1}{\varepsilon L}\gamma^{(k)} \big)^\top v - \frac{1}{\varepsilon L}d_v(v) \big\}$
    $z^{(k)} \leftarrow \operatorname{argmax}_{v \in \mathcal{V}}\big\{ \big( \sum_{i=0}^k \frac{i+1}{2}g^{(i)} \big)^\top v - \frac{1}{\varepsilon L}d_v(v) \big\}$
    $w^{(k+1)} \leftarrow \big( \frac{2}{k+3} \big)z^{(k)} + \big( \frac{k+1}{k+3} \big)y^{(k)}$
return $\Big( \bar{Y} = \sum_{k=0}^N \frac{2(k+1)}{(N+1)(N+2)}X^{(k)},\ y^{(N)} \Big)$

Fig. 3. Nesterov Procedure.