
Robust Principal Component Analysis:
Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

John Wright†∗, Arvind Ganesh†, Shankar Rao†, and Yi Ma†

†Department of Electrical Engineering, University of Illinois at Urbana-Champaign

Visual Computing Group, Microsoft Research Asia

Abstract. Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. This paper considers the idealized “robust principal component analysis” problem of recovering a low rank matrix A from corrupted observations D = A + E. Here, the error entries E can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns, by solving a simple convex program. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation space and the number of errors E grows in proportion to the total number of entries in the matrix. A by-product of our analysis is the first proportional growth results for the related but somewhat easier problem of completing a low-rank matrix from a small fraction of its entries. We propose an algorithm based on iterative thresholding that, for large matrices, is significantly faster and more scalable than general-purpose solvers. We give simulations and real-data examples corroborating the theoretical results.

∗Corresponding author: John Wright, Room 146 Coordinated Science Laboratory, 1308 West Main Street, Urbana, IL 61801. Email: [email protected]


1 Introduction

The problem of finding and exploiting low-dimensional structure in high-dimensional data is taking on increasing importance in image, audio and video processing, web search, and bioinformatics, where datasets now routinely lie in thousand- or even million-dimensional observation spaces. The curse of dimensionality is in full play here: meaningful inference with a limited number of observations requires some assumption that the data have low intrinsic complexity, e.g., that they are low-rank [15], sparse in some basis [10], or lie on some low-dimensional manifold [27, 2]. Perhaps the simplest useful assumption is that the observations all lie near some low-dimensional subspace. In other words, if we stack all the observations as column vectors of a matrix M ∈ Rm×n, the matrix should be (approximately) low rank. Principal component analysis (PCA) [15, 20] seeks the best (in an ℓ2-sense) such low-rank representation of the given data matrix. It enjoys a number of optimality properties when the data are only mildly corrupted by small noise, and can be stably and efficiently computed via the singular value decomposition.

One major shortcoming of classical PCA is its brittleness with respect to grossly corrupted or outlying observations [20]. Gross errors are ubiquitous in modern applications in imaging and bioinformatics, where some measurements may be arbitrarily corrupted (e.g., due to occlusion or sensor failure) or simply irrelevant to the structure we are trying to identify. A number of natural approaches to robustifying PCA have been explored in the literature. These approaches include influence function techniques [19, 28], multivariate trimming [18], alternating minimization [21], and random sampling techniques [16]. Unfortunately, none of these existing approaches yields a polynomial-time algorithm with strong performance guarantees.¹

In this paper, we consider an idealization of the robust PCA problem, in which the goal is to recover a low-rank matrix A from highly corrupted measurements D = A + E. The errors E can be arbitrary in magnitude, but are assumed to be sparsely supported, affecting only a fraction of the entries of D. This should be contrasted with the classical setting in which the matrix A is perturbed by small (but densely supported) noise. In that setting, classical PCA, computed via the singular value decomposition, remains optimal if the noise is Gaussian. Here, on the other hand, even a small fraction of large errors can cause arbitrary corruption in PCA's estimate of the low rank structure, A.

Our approach to robust PCA is motivated by two recent, and tightly related, lines of research. The first set of results concerns the robust solution of over-determined linear systems of equations in the presence of arbitrary, but sparse errors. These results imply that for generic systems of equations, it is possible to correct a constant fraction of arbitrary errors in polynomial time [7]. This is achieved by employing the ℓ1-norm as a convex surrogate for the highly-nonconvex ℓ0-norm. A parallel (and still emerging) line of work concerns the problem of computing low-rank matrix solutions to underdetermined linear equations [26, 6]. One of the most striking results concerns the exact completion of low-rank matrices from only a small fraction of their entries [6, 8, 5].² There, a similar convex relaxation is employed, replacing the highly non-convex matrix rank with the nuclear norm (or sum of singular values).

¹Random sampling approaches guarantee near-optimal estimates, but have complexity exponential in the rank of the matrix A0. Trimming algorithms have comparatively lower computational complexity, but guarantee only locally optimal solutions.

²A major difference between robust PCA and low-rank matrix completion is that here we do not know which entries are corrupted, whereas in matrix completion the support of the missing entries is given.

The robust PCA problem outlined above combines aspects of both of these lines of work: we wish


to recover a low-rank matrix from large but sparse errors. We will show that combining the solutions to the above problems (nuclear norm minimization for low-rank recovery and ℓ1-minimization for error correction) yields a polynomial-time algorithm for robust PCA that provably succeeds under broad conditions:

With high probability, solving a simple convex program perfectly recovers a generic matrix A ∈ Rm×m of rank as large as C m/log(m), from errors affecting a constant fraction of the m² entries.

Notice that the conclusion holds with high probability only when the dimensionality m is high enough. Thus, we will not be able to truly witness the striking ability of the convex program to recover the low-rank and sparse structures from the data unless m is large enough. This phenomenon has been dubbed the blessing of dimensionality [12]. Hence, we need scalable algorithms to solve the associated convex program. To this end, we show how a near-solution to this convex program can be obtained relatively efficiently via iterative thresholding techniques, similar to those proposed for matrix completion [4]. For large matrices, this algorithm is significantly faster and more scalable than general-purpose convex program solvers.

Finally, our proof implies strong results for the low-rank matrix completion problem, including the first results applicable to the proportional growth setting where the rank of the matrix grows as a constant (non-vanishing) fraction of the dimensionality:

With overwhelming probability, solving a simple convex program perfectly recovers a generic matrix A ∈ Rm×m of rank as large as Cm, from observations consisting of only a fraction ρm² (ρ < 1) of its entries.

1.1 Organization of the paper

This paper is organized as follows. Section 2 formulates the robust principal component analysis problem more precisely and states the main results of this paper, placing these results in the context of existing work. Section 3 begins the proof of the main result by providing sufficient conditions for a desired pair (A, E) to be recoverable by convex programming. These conditions will be stated in terms of the existence of a dual vector satisfying certain constraints. Section 4 then shows that with high probability such a certifying dual vector indeed exists. This establishes our main result. Section 5 explores the implications of this analysis for the related problem of completing a low-rank matrix from an observation consisting of only a small subset of its entries. Section 6 gives a simple but fast algorithm for approximately solving the robust PCA problem that extends existing iterative thresholding techniques, and is more efficient and scalable than generic interior-point methods. In Section 7, we perform simulations and experiments corroborating the theoretical results and suggesting their applicability to real-world problems in computer vision. Finally, in Section 8, we outline several promising directions for future work.

2 Problem Setting and Main Results

We assume that the observed data matrix D ∈ Rm×n was generated from a low-rank matrix A ∈ Rm×n, by corrupting some of its entries. The corruption can be represented as an additive error E ∈ Rm×n, so that D = A + E. Because the error affects only a portion of the entries of D, E is a sparse matrix. The idealized (or noise-free) robust PCA problem can then be formulated as follows:


Problem 2.1 (Robust PCA). Given D = A + E, where A and E are unknown, but A is known to be low rank and E is known to be sparse, recover A.

This problem formulation immediately suggests a conceptual solution: seek the lowest rank A that could have generated the data, subject to the constraint that the errors are sparse: ‖E‖0 ≤ k. The Lagrangian reformulation of this optimization problem is

min_{A,E} rank(A) + γ‖E‖0 subj A + E = D.    (1)

If we could solve this problem for appropriate γ, we might hope to exactly recover the pair (A0, E0) that generated the data D. Unfortunately, (1) is a highly nonconvex optimization problem, and no efficient solution is known.³ We can obtain a tractable optimization problem by relaxing (1), replacing the ℓ0-norm with the ℓ1-norm, and the rank with the nuclear norm ‖A‖∗ = Σ_i σ_i(A), yielding the following convex surrogate:

min_{A,E} ‖A‖∗ + λ‖E‖1 subj A + E = D.    (2)

This relaxation can be motivated by observing that ‖A‖∗ + λ‖E‖1 is the convex envelope of rank(A) + λ‖E‖0 over the set of (A, E) such that ‖A‖2,2 + ‖E‖1,∞ ≤ 1. Moreover, recent advances in our understanding of the nuclear norm heuristic for low-rank solutions to matrix equations [26, 6] and the ℓ1 heuristic for sparse solutions to underdetermined linear systems [7, 13], suggest that there might be circumstances under which solving the tractable problem (2) perfectly recovers the low-rank matrix A0. The main result of this paper will be to show that this is indeed true under surprisingly broad conditions. A sketch of the result is as follows:

For “almost all” pairs (A0, E0) consisting of a low-rank matrix A0 and a sparse matrix E0,

(A0, E0) = arg min_{A,E} ‖A‖∗ + λ‖E‖1 subj A + E = A0 + E0,

and the minimizer is uniquely defined.

That is, under natural probabilistic models for low-rank and sparse matrices, almost all observations D = A0 + E0 generated as the sum of a low-rank matrix A0 and a sparse matrix E0 can be efficiently and exactly decomposed into their generating parts by solving a convex program.⁴

Of course, this is only possible with an appropriate choice of the regularizing parameter λ > 0. From the optimality conditions for the convex program (2), it is not difficult to show that for matrices D ∈ Rm×m, the correct scaling is λ = O(m^{−1/2}). Throughout this paper, unless otherwise stated, we will fix

λ = 1/√m.    (3)

³In a sense, this problem subsumes both the low rank matrix completion problem and the ℓ0-minimization problem, both of which are NP-hard and hard to approximate. While we are not aware of any formal hardness result for (1), it is reasonable to conjecture that it is similarly difficult.

⁴Notice that this is not an “equivalence” result for the intractable optimization (1) and the convex program (2) – rather than asserting that the solutions of these two problems are equal with high probability, we directly prove that the convex program correctly decomposes D = A0 + E0 into (A0, E0). A natural conjecture, however, is that under the conditions of our main result, Theorem 1, (A0, E0) is also the solution to (1) for some choice of γ.


For simplicity, all of our results in this paper will be stated for square matrices D ∈ Rm×m, although there is little difficulty in extending them to non-square matrices.
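For concreteness, the following is a minimal sketch (not the authors' implementation) of how the convex surrogate (2), with the scaling (3), could be posed to an off-the-shelf solver; it assumes the third-party CVXPY package and is only practical for modest m. Section 6 develops a faster iterative thresholding algorithm for large matrices.

```python
import numpy as np
import cvxpy as cp

def rpca_convex(D):
    """Sketch of program (2): min ||A||_* + lambda ||E||_1  subj  A + E = D."""
    m = D.shape[0]
    lam = 1.0 / np.sqrt(m)                      # the scaling fixed in (3)
    A = cp.Variable(D.shape)
    E = cp.Variable(D.shape)
    objective = cp.Minimize(cp.normNuc(A) + lam * cp.sum(cp.abs(E)))
    problem = cp.Problem(objective, [A + E == D])
    problem.solve()                             # generic solver; slow for large m
    return A.value, E.value
```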

It should be clear that not all matrices A0 can be successfully recovered by solving the convex program (2). Consider, e.g., the rank-1 case where U = [e_i] and V = [e_j]. Without additional prior knowledge, the low-rank matrix A = USV* cannot be recovered from even a single gross error. We therefore restrict our attention to matrices A0 whose row and column spaces are not aligned with the standard basis. This can be done probabilistically, by asserting that the marginal distributions of U and V are uniform on the Stiefel manifold W^m_r. In [6], this was called the random orthogonal model.

Definition 2.2 (Random orthogonal model [6]). We consider a matrix A0 to be distributed according to the random orthogonal model of rank r if its left and right singular vectors are independent uniformly distributed m × r matrices with orthonormal columns.⁵ In this model, the nonzero singular values of A0 can be arbitrary.

Our model for errors is similarly natural: each entry of the matrix is independently corrupted with some probability ρs, and the signs of the corruptions are independent Rademacher random variables.

Definition 2.3 (Bernoulli error signs and support). We consider an error matrix E0 to be drawn from the Bernoulli sign and support model with parameter ρs if the entries of sign(E0) are independently distributed, each taking on value 0 with probability 1 − ρs, and ±1 with probability ρs/2 each. In this model, the magnitude of the nonzero entries in E0 can be arbitrary.
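As an illustration of these two models, the following NumPy sketch samples a pair (A0, E0). It is not taken from the paper; the helper names, the singular-value range, and the error amplitude are arbitrary choices made for the example.

```python
import numpy as np

def sample_random_orthogonal(m, r, rng):
    """Definition 2.2: A0 = U S V* with U, V Haar-distributed on the Stiefel manifold."""
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))   # orthogonalize a Gaussian matrix
    V, _ = np.linalg.qr(rng.standard_normal((m, r)))
    s = rng.uniform(1.0, 2.0, size=r)                  # nonzero singular values are arbitrary
    return U @ np.diag(s) @ V.T, U, V

def sample_bernoulli_errors(m, rho_s, rng, amplitude=100.0):
    """Definition 2.3: Bernoulli(rho_s) support with independent Rademacher signs."""
    support = rng.random((m, m)) < rho_s
    signs = rng.choice([-1.0, 1.0], size=(m, m))
    return amplitude * signs * support                 # magnitudes can be arbitrary

rng = np.random.default_rng(0)
A0, U, V = sample_random_orthogonal(m=200, r=5, rng=rng)
E0 = sample_bernoulli_errors(m=200, rho_s=0.05, rng=rng)
D = A0 + E0                                            # observed, corrupted data matrix
```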

Our main result is the following:

Theorem 2.4 (Robust recovery from non-vanishing error fractions). For any p > 0, there exist constants (C⋆0 > 0, ρ⋆s > 0, m0) with the following property: if m > m0, (A0, E0) ∈ Rm×m × Rm×m with the singular spaces of A0 ∈ Rm×m distributed according to the random orthogonal model of rank

r ≤ C⋆0 m / log(m)    (4)

and the signs and support of E0 ∈ Rm×m distributed according to the Bernoulli sign-and-support model with error probability ≤ ρ⋆s, then with probability at least 1 − Cm^{−p},

(A0, E0) = arg min ‖A‖∗ + (1/√m)‖E‖1 subj A + E = A0 + E0,    (5)

and the minimizer is uniquely defined.

In other words, matrices A0 whose singular spaces are distributed according to the random orthogonal model can, with probability approaching one, be efficiently recovered from almost all corruption sign and support patterns without prior knowledge of the pattern of corruption.

Our line of analysis also implies strong results for the matrix completion problem studied in [6, 8, 5].

⁵That is, the left and right singular vectors are independent samples from the Haar measure on the Stiefel manifold W^m_r.


Theorem 2.5 (Matrix completion in proportional growth). There exist numerical constants m0, ρ⋆r, ρ⋆s, C, all > 0, with the following property: if m > m0 and A0 ∈ Rm×m is distributed according to the random orthogonal model of rank

r ≤ ρ⋆r m,    (6)

and Υ ⊂ [m] × [m] is an independently chosen subset of [m] × [m] in which the inclusion of each pair (i, j) is an independent Bernoulli(1 − ρs) random variable with ρs ≤ ρ⋆s, then with probability at least 1 − exp(−Cm),

A0 = arg min ‖A‖∗ subj A(i, j) = A0(i, j) ∀ (i, j) ∈ Υ,    (7)

and the minimizer is uniquely defined.

Finally, in Section 6 we extend existing iterative thresholding techniques for solving equality-constrained ℓ1-norm minimization problems [31] and nuclear norm minimization problems [4] to give an algorithm that produces a near-solution to (2) more efficiently and scalably than off-the-shelf interior point methods.

2.1 Relationship to existing work

During the preparation of this manuscript, we were informed of results by [9] analyzing the same convex program. That paper's main result shows that under the probabilistic model studied here, exact recovery can be guaranteed with high probability when

‖E0‖0 ≤ C m^{1.5} / (log(m) √max(r, log m)).    (8)

While this is an interesting result, even for constant rank r it guarantees correction of only a vanishing fraction o(m^{1.5}) ≪ m² of errors. In contrast, our main result, Theorem 2.4, states that even if r grows proportionally to m/log(m), non-vanishing fractions of errors are corrected with high probability. Both analyses start from the optimality condition for the convex program (2). The key technical component of this improved result is a probabilistic analysis of an iterative refinement technique for producing a dual vector that certifies optimality of the pair (A0, E0). This proof technique is related to techniques used in [7, 30], with additional care required to handle an operator norm constraint arising from the presence of the nuclear norm in (2).

Finally, while Theorem 2.5 is merely a byproduct of our analysis and not the main focus of this paper, it is interesting in light of results by [8]. That work proves that in the probabilistic model considered here, a generic m × m rank r matrix can, with high probability, be efficiently and exactly completed from a subset of only

C m r log⁸(m)    (9)

entries. When r > m/(C log⁸(m)), this bound exceeds the number m² of possible observations. In contrast, our Theorem 2.5 implies that for certain scenarios with r as large as ρr m, the matrix can be completed from a subset of (1 − ρs)m² entries. For matrices of relatively large rank, this is a significant extension of [8]. Our result does not supersede (9) for smaller ranks, however. These results are further contrasted in Section 5.


3 Analysis Framework

In this section, we begin our analysis of the semidefinite program (2). After introducing notation, we provide two sufficient conditions for a pair (A, E) to be the unique optimal solution to (2). The first condition, given in Lemma 3.1 below, is stated in terms of the existence of a dual vector W that certifies optimality of the pair (A, E). The existence (or non-existence) of such a dual vector is itself a random convex programming feasibility problem, and so the condition in this Lemma is not directly amenable to analysis. The next step, outlined in Lemma 3.3, is to show that if there exists some W0 that does not violate the constraints of this feasibility problem too badly, then it can be refined to produce a W∞ that satisfies the constraints and certifies optimality of (A, E). The probabilistic analysis in the following Section 4 will then show that, with high probability, such a W0 exists.

3.1 Notation

For any n ∈ Z+, [n] := {1, . . . , n}. For M ∈ Rm×m and I, J ⊂ [m], M_{I,J} will denote the submatrix of M consisting of those rows indexed by I and those columns indexed by J. We will use • as shorthand for the entire index set: M_{I,•} is the submatrix consisting of those rows indexed by I. M* will denote the transpose of M. For matrices P, Q ∈ Rm×n, ⟨P, Q⟩ := trace[P*Q] will denote the (Frobenius) inner product. The symbol I will denote the identity matrix or identity operator on matrices; the distinction will be clear from context.

‖M‖p,q will denote the operator norm of the matrix M, as a linear map between ℓp and ℓq. Important special cases are the spectral norm ‖M‖2,2, the max row- and column norms

‖M‖2,∞ = max_i ‖M_{i,•}‖2 and ‖M‖1,2 = max_j ‖M_{•,j}‖2,    (10)

and the max element norm

‖M‖1,∞ = max_{i,j} |M_{ij}|.    (11)

We will also often reason about linear operators on matrices. If L : Rm×m → Rm×m is a linear map, ‖L‖F,F will denote its operator norm with respect to the Frobenius norm on matrices:

‖L‖F,F := sup_{M ∈ Rm×m \ {0}} ‖L[M]‖F / ‖M‖F.    (12)

If S is a subspace, πS will denote the projection operator onto that subspace. For U ∈ Rm×r, πU will denote the projection operator onto the range of U. Finally, if I ⊂ [m], πI := Σ_{i∈I} e_i e_i* will denote the projection onto those coordinates indexed by I.

We will let

Ω := {(i, j) | E0,ij ≠ 0}    (13)

denote the set of corrupted entries. By slight abuse of notation, we will identify Ω with the subspace {M | M_{ij} = 0 ∀ (i, j) ∈ Ωc}, so πΩ will denote the projection onto the corrupted elements. We will let Σ ∈ Rm×m be an iid Rademacher (Bernoulli ±1) matrix that defines the signs of the corrupted entries:

sign(E0) = πΩ[Σ].    (14)

Expressing sign(E0) in this way allows us to exploit independence between Ω and Σ.


The symbol ⊗ will denote the Kronecker product between matrices. “vec” will denote the operator that vectorizes a matrix by stacking the columns. For M ∈ Rm×m, this operator stacks the entries M_{ij} in lexicographic order. For x = vec[M] ∈ R^{m²} we write x_Ω as shorthand for x_{L(Ω)}, where L(Ω) = {jm + i | (i, j) ∈ Ω} ⊂ [m²], and as further shorthand, we will occasionally use vec_Ω[M] to denote [vec[M]]_Ω.

In both our analysis and optimization algorithm, we will make frequent use of the soft-thresholding operator, which we denote Sγ:

Sγ[x] = 0 if |x| ≤ γ, and Sγ[x] = sign(x)(|x| − γ) if |x| > γ.    (15)

For matrices X, Sγ[X] will denote the matrix obtained by applying Sγ to each element X_{ij}. Finally, we often find it convenient to use the notation

S^{Ωc}_γ[X] := πΩ⊥[Sγ[X]].    (16)

That is, S^{Ωc}_γ applies the soft threshold to X but only retains those elements not in Ω. Throughout, we will use the symbol W^m_r to refer to the Stiefel manifold of matrices U ∈ Rm×r with orthonormal columns: U*U = I_{r×r}. We will use SO(r) to refer to the special orthogonal group of r × r matrices R such that R*R = RR* = I_{r×r} and det(R) = 1.
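As a concrete reference for this notation, here is a minimal NumPy sketch of the soft-thresholding operator (15) and its restriction (16); it is illustrative only, and `omega_mask` is an assumed boolean array marking the corrupted support Ω.

```python
import numpy as np

def soft_threshold(X, gamma):
    """S_gamma of (15): shrink each entry toward zero by gamma, clipping at zero."""
    return np.sign(X) * np.maximum(np.abs(X) - gamma, 0.0)

def soft_threshold_off_support(X, gamma, omega_mask):
    """S^{Omega^c}_gamma of (16): soft-threshold, then keep only entries outside Omega."""
    return np.where(omega_mask, 0.0, soft_threshold(X, gamma))
```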

3.2 Optimality conditions for (A0, E0)

We begin with a simple sufficient condition for the pair (A0, E0) to be the unique optimal solution to (2). These conditions are stated in terms of a dual vector, the existence of which certifies optimality. Our analysis will then show that under the stated circumstances, such a dual certificate can be produced with high probability.

Lemma 3.1. Let (A0, E0) ∈ Rm×m × Rm×m, with

Ω := supp(E0) ⊆ [m] × [m].    (17)

Let A0 = USV* denote the compact singular value decomposition of A0, and Θ denote the subspace generated by matrices with column space range(U) or row space range(V):

Θ := {UM* | M ∈ Rm×r} + {MV* | M ∈ Rm×r} ⊂ Rm×m.    (18)

Suppose that ‖πΩπΘ‖F,F < 1 and there exists W ∈ Rm×m such that

[UV* + W]_{ij} = λ sign(E0,ij) ∀ (i, j) ∈ Ω,
|[UV* + W]_{ij}| < λ ∀ (i, j) ∈ Ωc,
U*W = 0, WV = 0, ‖W‖2,2 < 1.    (19)

Then the pair (A0, E0) is the unique optimal solution to (2).

Proof. We show that if a Lagrange multiplier vector Y := UV* + W satisfying this system exists, then any nonzero perturbation A0 ↦ A0 + ∆, E0 ↦ E0 − ∆ respecting the constraint A + E = A0 + E0


will increase the objective function. By convexity, we may restrict our interest to ∆ satisfying ‖∆‖1,∞ < min_{(i,j)∈Ω} |E0,ij|. For such ∆,

λ‖E0 − ∆‖1 − λ‖E0‖1 = −λ Σ_{(i,j)∈Ω} sign(E0,ij) ∆_{ij} + λ Σ_{(i,j)∉Ω} |∆_{ij}| ≥ ⟨Y, −∆⟩,

with equality if and only if πΩ⊥[∆] = 0 (since on this set |Y_{ij}| is strictly smaller than λ). Similarly, since Y = UV* + W ∈ ∂‖·‖∗|_{A0},

‖A0 + ∆‖∗ ≥ ‖A0‖∗ + ⟨Y, ∆⟩.    (20)

Moreover, since ‖A0‖∗ = ⟨UV*, A0⟩ = ⟨Y, A0⟩, ‖A0‖∗ + ⟨Y, ∆⟩ = ⟨Y, A0 + ∆⟩. Thus, by Lemma 3.2 below, if equality holds in (20), then ∆ ∈ Θ. Summing the two subgradient bounds, we have

‖A0 + ∆‖∗ + λ‖E0 − ∆‖1 ≥ ‖A0‖∗ + λ‖E0‖1,

with equality only if ∆ ∈ Ω ∩ Θ. If ‖πΩπΘ‖F,F < 1, then Ω ∩ Θ = {0}, and so either

‖A0 + ∆‖∗ + λ‖E0 − ∆‖1 > ‖A0‖∗ + λ‖E0‖1,

or ∆ = 0.
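To make the certificate conditions (19) concrete, the following is a small illustrative NumPy routine (not part of the paper) that checks them numerically for a candidate dual matrix W, up to a tolerance.

```python
import numpy as np

def is_dual_certificate(W, U, V, E0, lam, tol=1e-8):
    """Numerically verify the system (19) for the pair (A0, E0) with A0 = U S V*."""
    omega = E0 != 0
    Y = U @ V.T + W
    eq_on_support = np.allclose(Y[omega], lam * np.sign(E0[omega]), atol=tol)
    strict_off_support = np.all(np.abs(Y[~omega]) < lam - tol)
    annihilates = np.allclose(U.T @ W, 0.0, atol=tol) and np.allclose(W @ V, 0.0, atol=tol)
    spectral_bound = np.linalg.norm(W, 2) < 1.0 - tol
    return eq_on_support and strict_off_support and annihilates and spectral_bound
```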

Lemma 3.2. Consider P ∈ Rm×m, with ‖P‖2,2 = 1 and σmin(P) < 1. Write the full singular value decomposition of P as

P = [U1 U2] diag(I, Σ2) [V1 V2]*, where ‖Σ2‖2,2 < 1.

Let Q ∈ Rm×m have reduced singular value decomposition Q = UΣV*. Then if ⟨P, Q⟩ = ‖Q‖∗, U*U2 = 0 and V*V2 = 0.

Proof. Let r denote the rank of Q. Then

⟨P, Q⟩ = ⟨ diag(I, Σ2), [U1*U ; U2*U] Σ [V*V1 V*V2] ⟩.

Let Y, Z ∈ Rm×r denote the matrices Y := [U1*U ; U2*U], Z := [V1*V ; V2*V], and notice that both Y and Z have orthonormal columns. Then, after the above rotation,

⟨P, Q⟩ = Σ_{i=1}^m σi(P) Σ_{j=1}^r σj(Q) Z_{ij} Y_{ij} = Σ_{j=1}^r σj(Q) Σ_{i=1}^m σi(P) Z_{ij} Y_{ij}.    (21)

For all j, Σ_{i=1}^m σi(P) Z_{ij} Y_{ij} ≤ ‖diag(I, Σ2) Z_{•,j}‖2 ‖Y_{•,j}‖2 ≤ 1, with equality if and only if σi(P) < 1 ⟹ Z_{ij} = 0 and Z_{•,j} = Y_{•,j}. Since whenever ⟨P, Q⟩ = ‖Q‖∗ each of the Σ_{i=1}^m σi(P) Z_{ij} Y_{ij} must be one, this implies that U2*U and V2*V are both zero.

Thus, we can guarantee that (A0, E0) is optimal with high probability by asserting that a certain random convex feasibility problem is satisfiable with high probability. While there are potentially many W satisfying these constraints, most of them lack an explicit expression. As has proven fruitful in a variety of related problems, we will instead consider a putative dual vector W0 that does have an explicit expression: the minimum Frobenius norm solution to the equality constraints in (19). We will see that this vector already satisfies the operator norm constraint with high probability. However, the box constraint due to sign(E0) is likely to be violated. We then describe an iterative procedure that, with high probability, fixes the box constraint, while respecting the equality and operator norm constraints. The output of this iteration is the desired certifying dual vector.


3.3 Iterative construction of the dual certificate

In constructing the dual certificate, we will use the fact that the violations of the inequality constraints, viewed as a matrix, often have sparse rows and sparse columns. To formalize this, fix a small constant c ∈ [0, 1], and define

Ψc := {M ∈ Rm×m | ‖M_{•,j}‖0 ≤ cm ∀ j, ‖M_{i,•}‖0 ≤ cm ∀ i, πΩ[M] = 0}.    (22)

We will show that such matrices are near the nullspace of the equality constraints in (19). The equality constraints in (19) can be expressed as

πΩ[W] = πΩ[λ sign(E0) − UV*] and πΘ[W] = 0.

We will let Γ denote the orthogonal complement of the nullspace of these constraints:

Γ := Θ + Ω.    (23)

Let ξc denote the operator norm of πΓ, with respect to the Frobenius norm on Rm×m, restricted to Ψc:

ξc := sup_{M ∈ Ψc \ {0}} ‖πΓM‖F / ‖M‖F ∈ [0, 1].    (24)

In Section 4.3, we establish a useful probabilistic upper bound for this quantity. Before investigating these properties, we show that if ξc and W0 are well-controlled, we can find a dual vector certifying optimality of (A0, E0).

Lemma 3.3 (Iterative construction of a dual certificate). Suppose for some c ∈ (0, 1) and ε > 0, there exists W0 satisfying πΘ[W0] = 0 and πΩ[W0] = πΩ[λ sign(E0) − UV*], such that

‖UV* + W0‖1,2 + (1/(1 − ξc)) ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F ≤ (λ − ε)√(cm),    (25)

‖UV* + W0‖2,∞ + (1/(1 − ξc)) ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F ≤ (λ − ε)√(cm),    (26)

and

‖W0‖2,2 + (1/(1 − ξc)) ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F < 1.    (27)

Then there exists W∞ satisfying the system (19).

Proof. We construct a convergent sequence W0, W1, . . . whose limit satisfies (19). For each k, set

Wk = Wk−1 − πΓ⊥ S^{Ωc}_{λ−ε}(UV* + Wk−1).    (28)

Notice that πΘ[Wk] = πΘ[Wk−1] and πΩ[Wk] = πΩ[Wk−1], so for all k, πΘ[Wk] = 0 and πΩ[Wk] = πΩ[λ sign(E0) − UV*]. We will further show by induction that the sequence (Wk) satisfies the following properties:

‖UV* + Wk‖1,2 ≤ ‖UV* + W0‖1,2 + ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F Σ_{i=0}^{k−1} ξc^i,    (29)

‖UV* + Wk‖2,∞ ≤ ‖UV* + W0‖2,∞ + ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F Σ_{i=0}^{k−1} ξc^i,    (30)

max_i ‖[S^{Ωc}_{λ−ε}(UV* + Wk−1)]_{i,•}‖0 ≤ cm,    (31)

max_j ‖[S^{Ωc}_{λ−ε}(UV* + Wk−1)]_{•,j}‖0 ≤ cm,    (32)

‖S^{Ωc}_{λ−ε}(UV* + Wk)‖F ≤ ξc^k ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F.    (33)

For k = 0, (29), (30) and (33) hold trivially. The sparsity assertion (31) follows from the assumptions of the lemma: for all i,

‖[S^{Ωc}_{λ−ε}(UV* + W0)]_{i,•}‖0 ≤ ‖[UV* + W0]_{i,•}‖2² / (λ − ε)² ≤ ‖UV* + W0‖1,2² / (λ − ε)² ≤ cm.

The exact same reasoning applied to the columns gives (32). Now, suppose the statements (29)–(33) hold for W0 . . . Wk−1. Then

‖UV* + Wk‖1,2 ≤ ‖UV* + Wk−1‖1,2 + ‖S^{Ωc}_{λ−ε}(UV* + Wk−1)‖F
  ≤ ‖UV* + Wk−1‖1,2 + ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F ξc^{k−1}
  ≤ ‖UV* + W0‖1,2 + ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F Σ_{i=0}^{k−1} ξc^i,

establishing (29) for k. The same reasoning applied to ‖·‖2,∞ establishes (30) for k. Bounding the summation in (29) by 1/(1 − ξc) and applying assumption (25) of the lemma gives that ‖UV* + Wk‖1,2² ≤ (λ − ε)²cm. This implies (31): the number of entries of each column of UV* + Wk that exceed λ − ε in absolute value is at most cm. The same chain of reasoning establishes (32): the rows of S^{Ωc}_{λ−ε}(UV* + Wk) are also cm-sparse.

Finally, notice that

‖S^{Ωc}_{λ−ε}(UV* + Wk)‖F = ‖S^{Ωc}_{λ−ε}( UV* + Wk−1 − πΓ⊥ S^{Ωc}_{λ−ε}(UV* + Wk−1) )‖F
  = ‖S^{Ωc}_{λ−ε}( UV* + Wk−1 − S^{Ωc}_{λ−ε}(UV* + Wk−1) + πΓ S^{Ωc}_{λ−ε}(UV* + Wk−1) )‖F.

Since the entries of UV* + Wk−1 − S^{Ωc}_{λ−ε}(UV* + Wk−1) have magnitude ≤ λ − ε, πΓ S^{Ωc}_{λ−ε}(UV* + Wk−1) dominates S^{Ωc}_{λ−ε}(UV* + Wk) elementwise in absolute value. Hence, ‖S^{Ωc}_{λ−ε}(UV* + Wk)‖F ≤ ‖πΓ S^{Ωc}_{λ−ε}(UV* + Wk−1)‖F ≤ ξc ‖S^{Ωc}_{λ−ε}(UV* + Wk−1)‖F, where we have used that S^{Ωc}_{λ−ε}(UV* + Wk−1) ∈ Ψc. Thus each of the three statements holds for all k, and ‖S^{Ωc}_{λ−ε}(UV* + Wk)‖F decreases geometrically. For all k sufficiently large, ‖UV* + Wk‖1,∞ ≤ λ.

Meanwhile, notice that

‖Wk‖2,2 ≤ ‖Wk−1‖2,2 + ‖πΓ⊥ S^{Ωc}_{λ−ε}(UV* + Wk−1)‖2,2
  ≤ ‖Wk−1‖2,2 + ‖S^{Ωc}_{λ−ε}(UV* + Wk−1)‖F
  ≤ ‖Wk−1‖2,2 + ξc^{k−1} ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F.

By induction, it is easy to show that

‖Wk‖2,2 ≤ ‖W0‖2,2 + ‖S^{Ωc}_{λ−ε}(UV* + W0)‖F Σ_{i=0}^{k−1} ξc^i.

By the second assumption of the lemma, this quantity is bounded by 1 for all k.
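The iteration (28) is simple to implement once the relevant projections are available. The sketch below is illustrative rather than the authors' code: `proj_gamma_perp` is an assumed callable computing πΓ⊥ (for instance via the series representation derived in Lemma 4.10), and `omega_mask` marks the support Ω.

```python
import numpy as np

def refine_dual_certificate(W0, UVt, lam, eps, omega_mask, proj_gamma_perp, n_iter=100):
    """Iterate W_k = W_{k-1} - pi_{Gamma^perp}[ S^{Omega^c}_{lam-eps}(UV* + W_{k-1}) ], eq. (28)."""
    def violations(X):
        # soft threshold at level lam - eps, restricted to entries outside Omega
        shrunk = np.sign(X) * np.maximum(np.abs(X) - (lam - eps), 0.0)
        return np.where(omega_mask, 0.0, shrunk)

    W = W0.copy()
    for _ in range(n_iter):
        viol = violations(UVt + W)
        if np.linalg.norm(viol) < 1e-12:       # box constraint of (19) is satisfied
            break
        W = W - proj_gamma_perp(viol)          # preserves the equality constraints
    return W
```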

4 Probabilistic Analysis of the Initial Dual Vector

In this section, we analyze the minimum Frobenius norm solution to the equality constraints in the optimality condition (19). We show that with high probability, this initial dual vector satisfies the operator norm constraint, and that the violations of the box constraint are small enough that the iteration described in (28) will succeed in producing a dual certificate. The analysis is organized as follows: in Section 4.1, we introduce several tools and preliminary results. Section 4.2 then shows that with overwhelming probability the operator norm of the initial dual vector is bounded by a constant that can be made arbitrarily small by assuming the error probability ρs and rank r are both low enough. Section 4.3 then shows that the restricted operator norm ξc is also bounded by a small constant, establishing that all row- and column-sparse matrices are nearly orthogonal to the subspace spanned by the equality constraints in (19). Section 4.4 analyzes the violations of the box constraint. Finally, Section 4.5 combines these results with the results of Section 3 to prove our main result, Theorem 2.4.

4.1 Tools and preliminaries

In this section, we will often need to refer to the following subspaces:

ΞU := {UM* | M ∈ Rm×r}, ΞV := {MV* | M ∈ Rm×r}, ΞUV := {UMV* | M ∈ Rr×r}.

Notice that πΘ = πΞU + πΞV − πΞUV, and that πΞUV = πΞU πΞV = πΞV πΞU.

In the Bernoulli support model, the number of errors in each row and column concentrates about ρs m. For each j ∈ [m], let

Ij := {i | (i, j) ∈ Ω},    (34)

i.e., Ij contains the indices of the errors in the j-th column. Similarly, for i ∈ [m], set

Ji := {j | (i, j) ∈ Ω}.    (35)

In terms of these quantities, for each η > 0, define the event

EΩ(η) : max_j |Ij| < (1 + η)ρs m and max_i |Ji| < (1 + η)ρs m.

Much of our analysis hinges on the operator norm of πΩπΘ. We will see that on the above event EΩ(η), this quantity can be controlled by bounding norms of submatrices of the singular vectors U


and V. This results in a number of bounds involving the following function of the rank and error probability:

τ(r/m, ρs) := 2 (√(r/m) + √ρs) / (1 − √(r/m)).    (36)

Where τ is used as shorthand, it should be understood as τ(r/m, ρs). We will repeatedly refer to the following good events:

EΩU : ‖πΩπΞU‖F,F ≤ τ(r/m, ρs),
EΩV : ‖πΩπΞV‖F,F ≤ τ(r/m, ρs),
EΩΘ : ‖πΩπΘ‖F,F ≤ 2τ(r/m, ρs).
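The following small NumPy experiment (ours, not the paper's) illustrates the event EΩΘ: it estimates ‖πΩπΘ‖F,F for one random instance by power iteration on the self-adjoint operator πΘπΩπΘ and compares the estimate with 2τ(r/m, ρs). The parameter values are arbitrary.

```python
import numpy as np

def tau(rho_r, rho_s):
    # the function of eq. (36)
    return 2.0 * (np.sqrt(rho_r) + np.sqrt(rho_s)) / (1.0 - np.sqrt(rho_r))

rng = np.random.default_rng(1)
m, r, rho_s = 400, 8, 0.05
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
omega = rng.random((m, m)) < rho_s

def proj_theta(M):                      # projection onto Theta
    return U @ (U.T @ M) + (M @ V) @ V.T - U @ (U.T @ M @ V) @ V.T

def proj_omega(M):                      # projection onto matrices supported on Omega
    return np.where(omega, M, 0.0)

M = rng.standard_normal((m, m))
for _ in range(200):                    # power iteration on T = pi_Theta pi_Omega pi_Theta
    M = proj_theta(proj_omega(proj_theta(M)))
    M /= np.linalg.norm(M)
top_eig = np.sum(M * proj_theta(proj_omega(proj_theta(M))))
print(np.sqrt(top_eig), "<=?", 2 * tau(r / m, rho_s))   # ||pi_Omega pi_Theta|| vs 2*tau
```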

In establishing that these events are overwhelmingly likely, the following result on singular values of Gaussian matrices will prove useful:

Fact 4.1. Let M ∈ Rm×n, m > n, and suppose that the elements of M are independent identically distributed N(0, 1) random variables. Then

P[σmax(M) ≥ √m + √n + t] ≤ exp(−t²/2),
P[σmin(M) ≤ √m − √n − t] ≤ exp(−t²/2).

This result is widely used in the literature, with various estimates of the error exponent. The form stated here can be obtained via the bounds E[σmax(M)] ≤ √m + √n and E[σmin(M)] ≥ √m − √n, in conjunction with [22] Proposition 2.18, Equation (2.35), and the observation that the singular values are 1-Lipschitz.

Lemma 4.2. Fix any η ∈ (0, 1/16). Consider (U, V, Ω) drawn from the random orthogonal model of rank r < m with error probability ρs > 0. Then there exists t⋆(r/m, ρs) > 0 such that

P_{U,V,Ω}[EΩ(η) ∩ EΩU ∩ EΩV ∩ EΩΘ] ≥ 1 − 4m exp(−m t⋆²/2) − 4m exp(−η²ρs²m/2).    (37)

In particular, if for all m larger than some m0 ∈ Z, r/m ≤ ρr < 1, then

P_{U,V,Ω}[EΩ(η) ∩ EΩU ∩ EΩV ∩ EΩΘ] ≥ 1 − exp(−Cm + O(log(m))).    (38)

Proof. Each of the random variables |Ij| is a sum of m independent Bernoulli(ρs) random variables X_{1,j}, X_{2,j}, . . . , X_{m,j}. The partial sum Σ_{i=1}^k X_{i,j} − ρs k is a martingale whose differences are all bounded by 1. So by Azuma's inequality, we have P[|Ij| − ρs m > t] < exp(−t²/2m). The same calculation clearly holds for the Ji, and so setting t = ηρs m,

P[EΩ(η)^c] ≤ 2m P[|Ij| > ρs(1 + η)m] ≤ 2m exp(−η²ρs²m/2).    (39)

So, with high probability each row and column of E0 is (1 + η)ρs m-sparse. We next show that such sparse vectors are nearly orthogonal to the random subspace range(U). The matrix U is uniformly


distributed on W^m_r, and can be realized by orthogonalizing a Gaussian matrix. Let Z ∈ Rm×r be an iid N(0, 1/m) matrix; then U is equal in distribution to Z(Z*Z)^{−1/2}. For each Ij,

‖Z_{Ij,•}(Z*Z)^{−1/2}‖2,2 ≤ ‖Z_{Ij,•}‖2,2 / σmin(Z),

where σmin(Z) denotes the r-th singular value of the m × r matrix Z. Now, for any t > 0,

P[σmin(Z) < 1 − √(r/m) − t] < exp(−t²m/2).    (40)

Meanwhile, on the event EΩ(η),

P_{Z|Ω}[‖Z_{Ij,•}‖2,2 > √(r/m) + √(|Ij|/m) + t] < exp(−t²m/2).    (41)

Hence,

P[ ‖U_{Ij,•}‖2,2 > (√(r/m) + √(1+η) √ρs + t) / (1 − √(r/m) − t) ]
  ≤ P_{U|Ij}[ ‖U_{Ij,•}‖2,2 > (√(r/m) + √(1+η) √ρs + t) / (1 − √(r/m) − t) | |Ij| < (1+η)ρs m ] + P[|Ij| ≥ (1+η)ρs m]
  < 2 exp(−mt²/2) + exp(−η²ρs²m/2).

Set t⋆ ≤ min( 1/3 − (1/3)√(r/m), √(ηρs) ). By the assumption of the lemma, √(1+η) + √η < 4/3, and

(√(r/m) + √(1+η)√ρs + t⋆) / (1 − √(r/m) − t⋆) ≤ (√(r/m) + (√(1+η) + √η)√ρs) / ((2/3)(1 − √(r/m))) ≤ (√(r/m) + (4/3)√ρs) / ((2/3)(1 − √(r/m))) ≤ τ(r/m, ρs).

So, applying a symmetric argument to the V_{Ji}, for the event E1 defined below,

E1 : max( max_j ‖U_{Ij,•}‖2,2, max_i ‖V_{Ji,•}‖2,2 ) ≤ τ(r/m, ρs),    (42)

P_{U,V,Ω}[E1 ∩ EΩ(η)] > 1 − 4m exp(−m t⋆²/2) − 4m exp(−η²ρs²m/2).    (43)

If for all m > m0, r/m ≤ ρr < 1, then t⋆ is bounded away from zero. We next show that E1 implies EΩU, EΩV, and EΩΘ. We can express πΩ[M] in terms of its action on the columns of M: πΩ[M] = Σ_j π_{Ij} M_{•,j} e_j*. So,

πΞU πΩ[M] = πU Σ_j π_{Ij} M_{•,j} e_j*,    (44)

and

‖πΞU πΩ[M]‖F² = Σ_j ‖πU π_{Ij} M_{•,j}‖2² = Σ_j ‖U* π_{Ij} M_{•,j}‖2² ≤ Σ_j ‖U_{Ij,•}‖2,2² ‖M_{•,j}‖2² ≤ (max_j ‖U_{Ij,•}‖2,2)² ‖M‖F²,

and so on E1, ‖πΩπΞU‖F,F = ‖πΞU πΩ‖F,F ≤ τ(r/m, ρs); hence E1 ⟹ EΩU. A symmetric argument establishes that E1 ⟹ EΩV. Now, notice that since

πΩπΘ[M] = πΩπΞU[M] + πΩ[πU⊥ M πV],    (45)

if we choose a basis B ∈ Rm×(m−r) for the orthogonal complement of range(U) and define Ξ_{U⊥V} := {BQV* | Q ∈ R(m−r)×r} ⊂ Rm×m, then

‖πΩπΘ‖F,F ≤ ‖πΩπΞU‖F,F + ‖πΩπ_{Ξ_{U⊥V}}‖F,F.    (46)

Since

‖πΩπ_{Ξ_{U⊥V}}‖F,F = sup_{M ∈ Ξ_{U⊥V} \ {0}} ‖πΩ[M]‖F / ‖M‖F ≤ sup_{M ∈ ΞV \ {0}} ‖πΩ[M]‖F / ‖M‖F = ‖πΩπΞV‖F,F,

E1 ⟹ EΩΘ and the proof is complete.

We will need to understand concentration of Lipschitz functions of matrices that are uniformly distributed (according to the Haar measure) on two manifolds of interest: the Stiefel manifold W^m_r and the group of r × r orthogonal matrices with determinant one, SO(r). This is governed by the concentration function on these spaces:

Definition 4.3. [22] Let (X, d) be a metric space. For a given measure µ on X, the concentration function is defined as

α_{X,d,µ}(t) := sup{ 1 − µ(At) | A ⊂ X, µ(A) ≥ 1/2 },    (47)

where At = {x | d(x, A) < t} is a t-neighborhood of A.

The concentration functions for W^m_r and SO(r) are well known:

Fact 4.4 ([24] Theorems 6.5.1 and 6.7.1). For r < m the manifold W^m_r, with distance d(X, Y) := ‖X − Y‖F, the Haar measure µ has concentration function

α_{W,d,µ}(t) ≤ √(π/8) exp(−mt²/8).    (48)

Similarly, on SO(r) with δ(X, Y) := ‖X − Y‖F, and ν the Haar measure,

α_{SO(r),δ,ν}(t) ≤ √(π/8) exp(−rt²/8).    (49)


Propositions 1.3 and 1.8 of [22] then imply that Lipschitz functions on these spaces concentrate about their medians and expectations:

Corollary 4.5. Suppose r < m, and let f : Rm×r → R with Lipschitz norm

‖f‖lip := sup_{X≠Y} |f(X) − f(Y)| / ‖X − Y‖F.    (50)

Then if U is distributed according to the Haar measure on W^m_r, and median(f) denotes any median,

P[f(U) ≥ median(f) + t] ≤ exp(−mt² / (8‖f‖lip²)).    (51)

Similarly, let g : Rr×r → R with Lipschitz norm

‖g‖lip := sup_{X≠Y} |g(X) − g(Y)| / ‖X − Y‖F.    (52)

Then if R is distributed according to the Haar measure on SO(r), g satisfies the following tail bound:

P[|g(R) − E[g(R)]| ≥ ‖g‖lip √(8π/r) + t] ≤ 2 exp(−rt² / (8‖g‖lip²)).    (53)

Proof. The concentration result (51) is a restatement of [22] Proposition 1.3. For (53), notice first that [22] Proposition 1.3 implies that

P[|g(R) − median(g)| ≥ t] ≤ 2 exp(−rt² / (8‖g‖lip²)).    (54)

If we set

a := ∫₀^∞ 2 exp(−rt² / (8‖g‖lip²)) dt = ‖g‖lip √(8π/r),    (55)

then [22] Proposition 1.8 gives

P[|g(R) − E[g]| ≥ a + t] ≤ 2 exp(−rt² / (8‖g‖lip²)).    (56)

4.2 Bounding the operator norm

In this section, we begin our analysis of the minimum Frobenius norm solution to the equality constraints of the feasibility problem (19). We show that with overwhelming probability this matrix also has operator norm bounded by a small constant. An important byproduct of this analysis is a simple proof that, with the same models studied here, the easier problem of matrix completion – filling missing entries in a low-rank matrix – can be efficiently and exactly solved by convex optimization, even for cases when the rank of the matrix grows in proportion to the dimensionality. Section 5 further elucidates connections to that problem.


4.2.1 General approach

We begin by showing that with high probability the minimum Frobenius norm solution is unique, and giving an explicit representation for the pseudoinverse operator in that case. This operator, denoted H†, is applied to the matrix λ sign[E0] − UV* to give the initial dual vector. We separately bound the norms of the two terms, ‖H†[sign[E0]]‖2,2 and ‖H†[UV*]‖2,2. Both arguments follow in a fairly straightforward manner by reducing to a net and then applying concentration inequalities. Throughout this section, N will denote a 1/2-net for S^{m−1}. By [22] Lemma 3.18, there is such a net with size at most exp(4m). Moving from ‖A‖2,2 = sup_{x,y∈S^{m−1}} x*Ay to our product of nets loses at most a constant factor in the estimate: ‖A‖2,2 ≤ 4 sup_{x,y∈N} x*Ay (see e.g., [29] Proposition 2.6). We will argue that for our A of interest, f(A) = x*Ay concentrates, and union bound over all exp(8m) pairs in N × N.

4.2.2 Representation and uniqueness of W0.

Lemma 4.6 (Representation of the pseudoinverse). Suppose that ‖πΘπΩ‖F,F < 1. Then the operator I − πΩπΘπΩ is invertible, and for any M the optimization problem

min ‖W‖F subj πΘ[W] = 0, πΩ[W] = πΩ[M]    (57)

has unique solution

W = πΘ⊥ πΩ (I − πΩπΘπΩ)^{−1} πΩ[M] = πΘ⊥ πΩ Σ_{k=0}^∞ (πΩπΘπΩ)^k πΩ[M].

Proof. Choose matrices U⊥, V⊥ ∈ Rm×(m−r) whose columns form orthonormal bases for the orthogonal complements of the ranges of U and V, respectively. Let Q denote the matrix

Q := [ V⊥* ⊗ U* ; V* ⊗ U* ; V* ⊗ U⊥* ] (the three blocks stacked vertically).

Notice that the rows of Q are orthonormal, and that they form a basis for the subspace of vectors {vec[M] | M ∈ Θ}. The equality constraint in (57) can therefore be expressed as

[ Q ; IΩ,• ] vec(W) = [ 0 ; vecΩ[M] ].    (58)

Here, I is the m² × m² identity matrix, and IΩ,• is the submatrix of I consisting of those rows indexed by Ω, taken in lexicographic order. The minimum Frobenius norm solution W is simply the minimum ℓ2 norm solution to this system of equations. Define the matrix

P := Q I•,Ω.

Notice that for any matrix M,

Q*P IΩ,• vec[M] = vec[πΘπΩM].


Since Q and IΩ,• each have orthonormal rows, ‖πΘπΩ‖F,F = ‖Q*P IΩ,•‖2,2 = ‖P‖2,2. So, by the assumption of the lemma, ‖P‖2,2 < 1, and the matrix [ I P ; P* I ] is nonsingular. We therefore have the following explicit expression for the unique minimum ℓ2-norm solution to (58):

vec[W] = [ Q* I•,Ω ] [ I P ; P* I ]^{−1} [ 0 ; vecΩ[M] ].    (59)

Applying the Schur complement formula (which is justified since I ≻ P*P), the above is equal to

[ Q* I•,Ω ] [ −P ; I ] (I − P*P)^{−1} vecΩ[M]
  = (I − Q*Q) I•,Ω (I − P*P)^{−1} IΩ,• vec[M]
  = (I − Q*Q) I•,Ω Σ_{k=0}^∞ (IΩ,• Q*Q I•,Ω)^k IΩ,• vec[M]
  = vec[ πΘ⊥ πΩ Σ_{k=0}^∞ (πΩπΘπΩ)^k πΩ[M] ],

yielding the representation in the statement of the lemma.
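As a computational illustration (not from the paper) of the representation in Lemma 4.6, the following NumPy sketch evaluates H†[M] by truncating the Neumann series; it assumes ‖πΩπΘ‖F,F < 1 so that the series converges, and the helper names are ours. The initial dual vector of Section 4 is then H†[λ sign(E0) − UV*].

```python
import numpy as np

def h_dagger(M, U, V, omega_mask, n_terms=50):
    """Minimum Frobenius norm W with pi_Theta[W] = 0, pi_Omega[W] = pi_Omega[M] (Lemma 4.6)."""
    def proj_theta(X):
        return U @ (U.T @ X) + (X @ V) @ V.T - U @ (U.T @ X @ V) @ V.T
    def proj_omega(X):
        return np.where(omega_mask, X, 0.0)

    term = proj_omega(M)                 # k = 0 term of the series
    acc = term.copy()
    for _ in range(n_terms):             # accumulate (pi_Omega pi_Theta pi_Omega)^k pi_Omega[M]
        term = proj_omega(proj_theta(term))
        acc += term
    return acc - proj_theta(acc)         # apply pi_{Theta^perp} = I - pi_Theta
```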

4.2.3 Effect of the singular vectors

We next analyze the part of the initial dual vector W0 induced by UV*. From the previous lemma we have the expression

H†[·] = πΘ⊥ πΩ Σ_{k=0}^∞ (πΩπΘπΩ)^k πΩ[·],    (60)

whenever ‖πΩπΘ‖F,F < 1. Analysis of H†[UV*] is complicated by the fact that UV* and Θ are dependent random variables. Notice, however, that πΘ depends only on the subspace spanned by the columns of U, not on any choice of basis for that subspace. UV*, on the other hand, does depend on the choice of basis. We will use this fact to decouple H† and UV*, and show that the ℓ2 operator norm of H†[UV*] is bounded below a small constant with overwhelming probability.

Lemma 4.7. Let (U, V, Ω) be sampled from the random orthogonal model of rank r with error probability ρs. Suppose that r and ρs satisfy

τ(r/m, ρs) < 1/4.    (61)

Then with probability at least 1 − exp(−Cm + O(log m)), the solution W^{UV}_0 to the optimization problem

min ‖W‖F subj U*W = 0, WV = 0, πΩ[W] = −πΩ[UV*]    (62)

is unique, and satisfies

‖W^{UV}_0‖2,2 ≤ 64 τ(r/m, ρs) (1 + (8/3)√(m/(m−r−1))) + (8/3)√(8π/(m−r−1)).    (63)


Proof. First consider the event

E1 : ‖πΩ[UV*]‖2,2 ≤ 64τ.

Fix x, y ∈ N and write

x* πΩ[UV*] y = ⟨U* πΩ[xy*], V*⟩.

On the event EΩU, ‖U* πΩ[xy*]‖F ≤ τ. So, as a function of V, f(V) := x* πΩ[UV*] y is τ-Lipschitz. Since the distribution of V is invariant under the orthogonal transformation V ↦ −V, f(V) is a symmetric random variable and 0 is a median. Hence, on the event EΩU the tail bound (51) implies that

P_{V|U,Ω}[f(V) > t] < exp(−mt²/(8τ²)).    (64)

Set t = 16τ. A union bound over the ≤ exp(8m) elements of N × N shows that on EΩU,

P_{V|U,Ω}[ ‖πΩ[UV*]‖2,2 > 64τ ] ≤ P_{V|U,Ω}[ sup_{x,y∈N×N} x* πΩ[UV*] y > 16τ ] ≤ exp(−24m),

and hence we can conclude that

P_{U,V,Ω}[E1] ≥ 1 − exp(−24m) − P[E^c_{ΩU}] ≥ 1 − exp(−Cm + O(log m)).

Now, consider the combined event E1 ∩ EΩΘ. On this event, the representation in Lemma 4.6 is valid and

W^{UV}_0 = −πΘ⊥ πΩ Σ_{k=0}^∞ (πΩπΘπΩ)^k πΩ[UV*].    (65)

For any M, πΘ⊥[M] = πU⊥ M πV⊥ and ‖πΘ⊥[M]‖2,2 ≤ ‖M‖2,2, so

‖W^{UV}_0‖2,2 ≤ ‖πΩ Σ_{k=0}^∞ (πΩπΘπΩ)^k πΩ[UV*]‖2,2 ≤ ‖πΩ[UV*]‖2,2 + ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k [UV*]‖2,2.    (66)

We have already addressed the first term. To bound the second, we expand our probability space as follows. Consider Ū and V̄ distributed according to the Haar measure on W^m_{m−1}. Identify U and V with the first r columns of Ū and V̄, respectively, and write Ǔ and V̌ for the remaining m − 1 − r columns. Notice that U and V are indeed distributed according to the random orthogonal model of rank r, and that Ǔ and V̌ follow the random orthogonal model of rank m − r − 1 (although, of course, U and Ǔ are now dependent random variables). Write

‖Σ_{k=1}^∞ (πΩπΘπΩ)^k [UV*]‖2,2 = ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k [ŪV̄* − ǓV̌*]‖2,2
  ≤ ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k [ŪV̄*]‖2,2 + ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k [ǓV̌*]‖2,2.


We next show that with overwhelming probability each of these terms is well-controlled. Notice that if R ∈ SO(m − r − 1) is any orthogonal matrix, then the joint distribution of U, Ǔ, and Ū is invariant under the map

Ū ↦ Ū [ I 0 ; 0 R ].    (67)

This follows from the right orthogonal invariance of the Haar measure on W^m_{m−1} (see e.g., [11] Section 1.4.3). Since this map preserves U and V, it also preserves Θ. Hence, the term of interest, ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k[ǓV̌*]‖2,2, is equal in distribution to ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k[ǓRV̌*]‖2,2. The orthogonal matrix R is independent of all of the other terms in this expression. This independence allows us to estimate the norm by first bounding the operator norm of the map Σ_{k=1}^∞ (πΩπΘπΩ)^k and then applying measure concentration on SO(m − r − 1). For fixed x, y ∈ N, consider the quantity

x* Σ_{k=1}^∞ (πΩπΘπΩ)^k[ǓRV̌*] y = ⟨ Ǔ* ( Σ_{k=1}^∞ (πΩπΘπΩ)^k[xy*] ) V̌, R ⟩ := ⟨M, R⟩.

This is a ‖M‖F-Lipschitz function of R. Since for any unit vector w, Rw is uniformly distributed on the unit sphere, E[e_i* R w] = 0 for all i. So, E[R] = 0 and E_{R|M}⟨M, R⟩ = 0. On the event EΩΘ,

‖M‖F ≤ ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k[xy*]‖F ≤ (‖πΩπΘπΩ‖F,F / (1 − ‖πΩπΘπΩ‖F,F)) ‖xy*‖F ≤ 4τ²/(1 − 4τ²) ≤ 4τ/3.    (68)

Hence, on EΩΘ, by (53),

P_{R|M}[⟨M, R⟩ > γ(m) + t] < 2 exp(−9(m − r − 1)t² / (128 τ²)),    (69)

where γ(m) := (4τ(r/m, ρs)² / (1 − 4τ(r/m, ρs)²)) √(8π/(m − r − 1)) ≤ (1/3)√(8π/(m − r − 1)). For compactness of notation, set β(m, r) := √((m − r − 1)/m). Then, setting t = 64τ/(3β), union bounding over the ≤ exp(8m) pairs in N × N, and then moving from N × N to S^{m−1} × S^{m−1} (losing at most a factor of 4) gives

P_{R|U,V,Ω}[ ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k[ǓRV̌*]‖2,2 > 256τ/(3β(m, r)) + 4γ(m) ] < 2 exp(−24m).    (70)

And so,

P_{U,V,Ω}[ ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k[ǓV̌*]‖2,2 > 256τ/(3β(m, r)) + 4γ(m) ]
  = P_{U,V,Ω,R}[ ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k[ǓRV̌*]‖2,2 > 256τ/(3β(m, r)) + 4γ(m) ]
  ≤ 2 exp(−24m) + P[E^c_{ΩΘ}] ≤ exp(−Cm + O(log m)).


An identical argument, this time randomizing over R̄ := [ I 0 ; 0 R ], shows that

P_{U,V,Ω}[ ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k[ŪV̄*]‖2,2 > 256τ/(3β(m, r)) + 4γ(m) ] ≤ exp(−Cm + O(log m)).    (71)

Combining terms yields the bound.

4.2.4 Effect of the error signs

We next consider the effect of the error signs. As in the previous proposition, we handle the k = 0 and k = 1, . . . , ∞ parts of the representation in Lemma 4.6 separately.

Lemma 4.8. There exists a function φ(ρs) satisfying lim_{ρs↘0} φ(ρs) = 0, such that if E0 is distributed according to the Bernoulli sign and support model with error probability ρs, then

P_{Ω,Σ}[‖sign(E0)‖2,2 ≥ φ(ρs)√m] ≤ exp(−Cm).    (72)

Proof. We first provide a bound on the moment generating function for the iid random variables Y_{ij} := sign(E0,ij), of the form E[exp(tY)] ≤ exp(αt²). The moments of Y are

E[Y^k] = 1 for k = 0, 0 for k odd, and ρs for k > 0 even,

so E[exp(tY)] = 1 + Σ_{k=1}^∞ ρs t^{2k}/(2k)!. Since exp(αt²) = 1 + Σ_{k=1}^∞ α^k t^{2k}/k!, it suffices to choose α such that

ρs ≤ ((2k)!/k!) α^k ∀ k = 1, 2, . . . .

Since (2k)!/k! ≥ k^k, α ≥ max_{k=1,2,...} (1/k) ρs^{1/k} suffices. Consider the function ψ : [1,∞) × [0,1] → R defined by ψ(x, y) = (1/x) y^{1/x}. Then ψ(1, y) = y, and for all y, lim_{x→∞} ψ(x, y) = 0. The only stationary point occurs at x⋆ = log(1/y), and hence for y > 0 its maximum on [1,∞) is ψ(x⋆(y), y) = max( y, (1/log(y^{−1})) y^{1/log(y^{−1})} ). Notice that lim_{y↘0} ψ(x⋆(y), y) = 0, and that

E[exp(tY)] ≤ exp(ψ(x⋆(ρs), ρs) t²).    (73)

Now, for any fixed pair x, y ∈ N, let Z := x* sign(E0) y. Then

E[exp(tZ)] = Π_{ij} E[exp(t x_i y_j Y_{ij})] ≤ Π_{ij} exp(ψ(x⋆(ρs), ρs) t² (x_i y_j)²) = exp(ψ(x⋆(ρs), ρs) t²).

Applying a Chernoff bound (and optimizing the exponent) gives

P[x* sign(E0) y > t] < exp(−t² / (4ψ(x⋆(ρs), ρs))).    (74)


Union bounding (and recognizing that moving from N × N to S^{m−1} loses at most a factor of 4) gives

P[‖sign(E0)‖2,2 ≥ 4t√m] ≤ exp(8m − m t² / (4ψ(x⋆(ρs), ρs))).    (75)

If we set, e.g., t(ρs) = 8ψ^{1/2}(x⋆(ρs), ρs), the probability of failure will be bounded by exp(−8m). Further setting φ(ρs) = 4t(ρs) gives the statement of the lemma.

The above lemma goes part of the way to controlling the component of the initial dual vector induced by the errors, H†[λ sign(E0)]. A straightforward martingale argument, detailed in the following lemma, completes the proof.

Lemma 4.9. Consider (U, V, Ω, Σ) drawn from the random orthogonal model of rank r < m with Bernoulli error probability ρs and random signs, and suppose that r and ρs satisfy

τ(r/m, ρs) ≤ 1/4.    (76)

Then with probability at least 1 − exp(−Cm + O(log m)) in (U, V, Ω, Σ), the solution W^E_0 to the optimization problem

min ‖W‖F subj U*W = 0, WV = 0, πΩ[W] = −πΩ[λ sign(E0)]    (77)

is unique, and satisfies

‖W^E_0‖2,2 ≤ φ(ρs) + 128 τ(r/m, ρs)/3,    (78)

where φ(·) is the function defined in Lemma 4.8.

Proof. On event EΩΘ, the minimizer is unique, and can be expressed as

W^E_0 = πΘ⊥ πΩ Σ_{k=0}^∞ (πΩπΘπΩ)^k πΩ[λ sign(E0)].

Since πΘ⊥ is a contraction in the (2, 2) norm,

‖W^E_0‖2,2 ≤ ‖πΩ Σ_{k=0}^∞ (πΩπΘπΩ)^k πΩ[λ sign(E0)]‖2,2 ≤ ‖πΩ[λ sign(E0)]‖2,2 + ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k πΩ[λ sign(E0)]‖2,2.

Lemma 4.8 controls the first term below φ(ρs) with overwhelming probability. For the second term, it is convenient to treat sign(E0) as the projection of an m × m iid Rademacher matrix Σ onto Ω: sign(E0) = πΩ[Σ]. Fix x, y ∈ N, and notice that

x* Σ_{k=1}^∞ (πΩπΘπΩ)^k πΩ[λ sign(E0)] y = ⟨ λ Σ_{k=1}^∞ (πΩπΘπΩ)^k [xy*], Σ ⟩ := ⟨M, Σ⟩.


On the event EΩΘ, M ∈ Rm×m has Frobenius norm at most (1/√m) · 4τ²/(1 − 4τ²) ≤ 4τ/(3√m). Order the indices (i, j), 1 ≤ i, j ≤ m, arbitrarily, and consider the martingale Z defined by Z0 = 0, Zk = Zk−1 + M_{ik,jk} Σ_{ik,jk}. The k-th martingale difference is bounded by |M_{ik,jk}|, and the overall squared ℓ2-norm of the differences is bounded by ‖M‖F². Hence, by Azuma's inequality,

P[⟨M, Σ⟩ > t] < exp(−9mt²/(32τ²)).    (79)

Set t = 32τ/3. A union bound over the ≤ exp(8m) elements in N × N shows that on the event EΩΘ,

P_{Σ|U,V,Ω}[ ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k πΩ[λ sign(E0)]‖2,2 > 128τ/3 ] ≤ exp(−24m).    (80)

Hence, P_{U,V,Ω,Σ}[ ‖Σ_{k=1}^∞ (πΩπΘπΩ)^k πΩ[λ sign(E0)]‖2,2 > 128τ/3 ] is bounded by

exp(−24m) + P[E^c_{ΩΘ}] ≤ exp(−Cm + O(log m)).

4.3 Controlled feedback for sparse matrices

We next bound the operator norm of the projection πΓ, restricted to row- and column-sparse matrices Ψc.

Lemma 4.10 (Representation of iterates). Let Θ, Γ, Ω be defined as above. Suppose that ‖πΩπΘ‖F,F < 1. Then for all M ∈ Ω⊥,

πΓ[M] = πΩ⊥ πΘ Σ_{k=0}^∞ (πΘπΩπΘ)^k πΘ[M].    (81)

Proof. For M ∈ Ω⊥,

vec[πΓ M] = [Q∗ | I_{•,Ω}] [ I  P ; P∗  I ]^{−1} [ Q ; I_{Ω,•} ] vec[M]
          = [Q∗ | I_{•,Ω}] [ I  P ; P∗  I ]^{−1} [ Q vec[M] ; 0 ].

Under the condition of the lemma, ‖P‖2,2 < 1, so the above inverse is indeed well-defined, and can be expressed via the Schur complement formula:

vec[πΓ M] = Q∗(I − PP∗)^{−1} Q vec[M] − I_{•,Ω} P∗(I − PP∗)^{−1} Q vec[M]
          = πΩ⊥ Q∗(I − PP∗)^{−1} Q vec[M]
          = πΩ⊥ Q∗ ∑_{k=0}^∞ (PP∗)^k Q vec[M].


Recognizing that

Q∗ ∑_{k=0}^∞ (PP∗)^k Q vec[M] = vec[ πΘ ∑_{k=0}^∞ (πΘπΩπΘ)^k πΘ[M] ]

completes the proof.

Lemma 4.11. Fix any c ∈ (0, 1). Consider (U, V, Ω) drawn from the random orthogonal model of rank r < m, with Bernoulli error probability ρs. Further suppose that r and ρs satisfy

τ(r/m, ρs) ≤ 1/4.    (82)

Then with probability at least 1 − exp(−Cm + O(log m)),

ξc ≤ (32/9)(√(r/m) + √c + 2√H(c)),    (83)

where H(c) is the base-e binary entropy function.

Proof. From the above representation, whenever ‖πΩπΘ‖F,F < 1,

ξc = sup_{M ∈ Ψc, ‖M‖F ≤ 1} ‖πΩ⊥ πΘ ∑_{k=0}^∞ (πΘπΩπΘ)^k πΘ[M]‖F
   ≤ ‖∑_{k=0}^∞ (πΘπΩπΘ)^k‖F,F · sup_{M ∈ Ψc, ‖M‖F ≤ 1} ‖πΘ[M]‖F
   ≤ (1/(1 − ‖πΘπΩπΘ‖F,F)) · sup_{M ∈ Ψc, ‖M‖F ≤ 1} ‖πΘ[M]‖F.

Now,

sup_{M ∈ Ψc, ‖M‖F ≤ 1} ‖πΘ[M]‖F ≤ sup_{M ∈ Ψc, ‖M‖F ≤ 1} (‖πU M‖F + ‖M πV‖F)
                                ≤ sup_{|I| ≤ cm} ‖U_{I,•}‖2,2 + sup_{|J| ≤ cm} ‖V_{J,•}‖2,2.

Identify U with the orthogonalization Z(Z∗Z)^{−1/2} of an m × r iid N(0, 1/m) matrix Z. Then for any given I ⊂ [m] of size cm, ‖U_{I,•}‖2,2 ≤ ‖Z_{I,•}‖2,2 / σmin(Z). Now,

P[σmin(Z) < 1 − √(r/m) − t1] < exp(−m t1²/2).    (84)


Meanwhile, for each I of size cm, again

P[‖Z_{I,•}‖2,2 > √(r/m) + √c + t2] < exp(−m t2²/2).    (85)

There are at most exp(mH(c) + O(log m)) such subsets I, where H(·) denotes the base-e binary entropy function, so

P[ max_{|I| = cm} ‖Z_{I,•}‖2,2 > √(r/m) + √c + t2 ] < exp(−m(t2²/2 − H(c)) + O(log m)).    (86)

Choosing t1 = 1/2 − (1/2)√(r/m), setting t2 = 2√H(c), and combining the bounds gives

ξc ≤ 2(√(r/m) + √c + 2√H(c)) / ((1 − 4τ(r/m, ρs)²)(1 − √(r/m))).    (87)

Noticing that √(r/m) < τ and applying the bound τ < 1/4 to the terms in the denominator completes the proof.

4.4 Controlling the initial violations

In this section, we analyze the initial dual vector W0 given by the minimum Frobenius norm solution to the equality constraints of the optimality condition (19), and show that for any constant β, the probability that ‖S^{Ωc}_{5/(6√m)}[UV∗ + W0]‖F > 3β approaches zero. We will treat the parts of UV∗ + W0 = UV∗ + H†[−UV∗] + H†[λ sign(E0)] separately, and use the following simple lemma to combine the bounds:

Lemma 4.12. For all matrices A, B, and α ∈ (0, 1),

‖S^{Ωc}_γ[A + B]‖F ≤ ‖S^{Ωc}_{αγ}[A]‖F + ‖S^{Ωc}_{(1−α)γ}[B]‖F.    (88)

Proof. Notice that for scalars x, |Sγ[x]| is a convex nonnegative function. Hence, for matrices X ∈ R^{m×m}, ‖S^{Ωc}_γ[X]‖F is again convex (see, e.g., [3], Example 3.14), and so

‖S^{Ωc}_γ[A + B]‖F = ‖S^{Ωc}_γ[α(A/α) + (1 − α)(B/(1 − α))]‖F
                   ≤ α‖S^{Ωc}_γ[A/α]‖F + (1 − α)‖S^{Ωc}_γ[B/(1 − α)]‖F
                   = ‖S^{Ωc}_{αγ}[A]‖F + ‖S^{Ωc}_{(1−α)γ}[B]‖F.

The strategy, then, is to bound the expectation of ‖S^{Ωc}_{λ/3}[·]‖F for each of the three terms, using the following lemma. It will turn out that for any prespecified p, we can choose C0 such that if r < C0 m/log m, the expected Frobenius norm of the violations is O(m^{−p}); an application of the Markov inequality then bounds the probability that this quantity deviates above any fixed constant β.
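Spelled out, that final step is just: if E‖S^{Ωc}_{λ/3}[·]‖F = O(m^{−p}), then for any fixed β > 0,

P[‖S^{Ωc}_{λ/3}[·]‖F > β] ≤ E‖S^{Ωc}_{λ/3}[·]‖F / β = O(m^{−p}/β).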


Lemma 4.13. Let X be a symmetric random variable satisfying the (subgaussian) tail bound

P[X ≥ t] ≤ exp(−Ct²).

Let Y .= Sγ(X). Then

E[Y²] ≤ (2/C) exp(−Cγ²).    (89)

Proof. Since X is symmetric, Y is also symmetric, so

E[Y²] = 4∫_0^∞ t P[Y ≥ t] dt ≤ 4∫_0^∞ t exp(−C(t + γ)²) dt
      ≤ 4∫_γ^∞ (s − γ) exp(−Cs²) ds ≤ 4∫_γ^∞ s exp(−Cs²) ds.
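The last integral evaluates in closed form, which gives (89):

4∫_γ^∞ s exp(−Cs²) ds = [−(2/C) exp(−Cs²)]_γ^∞ = (2/C) exp(−Cγ²).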

Lemma 4.14. Let X be a random variable satisfying a tail bound of the form

P[|X| ≥ a + t] ≤ C1 exp(−C2 t²).

Suppose γ > a and let Y .= Sγ(X). Then

E[Y²] ≤ (C1/C2) exp(−C2(γ − a)²).    (90)

Proof.

E[Y²] = 2∫_0^∞ t P[|Y| ≥ t] dt = 2∫_0^∞ t P[|X| ≥ γ + t] dt
      = 2∫_γ^∞ (s − γ) P[|X| ≥ s] ds ≤ 2C1 ∫_γ^∞ (s − γ) exp(−C2(s − a)²) ds
      = 2C1 ∫_{γ−a}^∞ (q + a − γ) exp(−C2 q²) dq ≤ (C1/C2) exp(−C2(γ − a)²).

In the previous section, we were interested in controlling the quadratic products x∗W0 y involving the initial dual vector W0 and arbitrary unit vectors. Here, because the soft thresholding operator Sγ acts elementwise, we instead require control over the individual entries e_i∗ W0 e_j. Because there are only m² pairs (e_i, e_j), much tighter control can be established, as formalized in the following lemma:

Lemma 4.15. Consider U, V drawn from the random orthogonal model of rank r. For any κ > 0, define the events

EU(κ):  max_i ‖U_{i,•}‖2 ≤ √(1/(κ log m)),
EV(κ):  max_i ‖V_{i,•}‖2 ≤ √(1/(κ log m)).


Then if r < C0 m/log(m) for some C0 < 1/(4κ),

P_{U,V}[EU(κ) ∩ EV(κ)] ≥ 1 − 2m exp(−(κ^{−1/2} − 2√C0)² m / (8 log m)).    (91)

On these good events,

max_{i,j} ‖πΘ[e_i e_j∗]‖F ≤ 2/√(κ log m).    (92)

Proof. Notice that f(U) .= ‖U_{i,•}‖2 is a 1-Lipschitz function of U. Since ∑_i ‖U_{i,•}‖2² = ‖U‖F² = r, by symmetry E[‖U_{i,•}‖2²] = r/m. By the Markov inequality, f has a median no larger than 2E[f] ≤ 2√(E[f²]) = 2√(r/m). Invoking Lipschitz concentration on W^m_r and union bounding over the m choices of i,

P[ max_i ‖U_{i,•}‖2 > √(1/(κ log m)) ] ≤ m exp(−(m/8)(√(1/(κ log m)) − 2√(r/m))²).    (93)

An identical calculation applies to EV(κ). Summing the two probabilities of failure completes the proof.
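A quick Monte Carlo illustration of this row-norm bound (our own check, not part of the argument; U is generated as the orthonormalization of an iid N(0, 1/m) matrix, as in the proof of Lemma 4.11, and the values of m, r, κ are arbitrary choices satisfying r < m/(4κ log m)):

import numpy as np

rng = np.random.default_rng(1)
m, r, kappa, trials = 1000, 10, 2.0, 50

exceed = 0
for _ in range(trials):
    Z = rng.standard_normal((m, r)) / np.sqrt(m)
    U, _ = np.linalg.qr(Z)                        # row norms match Z (Z*Z)^{-1/2}
    max_row = np.max(np.linalg.norm(U, axis=1))   # max_i ||U_{i,.}||_2
    exceed += max_row > 1.0 / np.sqrt(kappa * np.log(m))

print("typical scale sqrt(r/m)      :", np.sqrt(r / m))
print("threshold 1/sqrt(kappa log m):", 1.0 / np.sqrt(kappa * np.log(m)))
print("fraction of trials exceeding :", exceed / trials)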

Lemma 4.16. Let (U, V) be distributed according to the random orthogonal model of rank r < m. For any fixed Ω ⊆ [m] × [m] and κ > 0, on the good event EU(κ),

E_{V|U} ‖S^{Ωc}_{1/(3√m)}[UV∗]‖F ≤ (4/√(κ log m)) m^{(1/2 − κ/144)}.    (94)

Proof. Fix U and consider [UV∗]_{ij} = U_{i,•} V_{j,•}∗ as a ‖U_{i,•}‖2-Lipschitz function of V. On EU(κ), the Lipschitz constant is bounded by (κ log m)^{−1/2}, and

P_{V|U}[[UV∗]_{ij} > t] < exp(−κ t² m log m / 8).    (95)

By Lemma 4.13, on EU,

E_{V|U}[(S_{1/(3√m)}[[UV∗]_{ij}])²] ≤ (16/(κ m log m)) exp(−(κ/72) log m).    (96)

Hence, summing over the ≤ m² pairs (i, j) ∈ Ωc,

E_{V|U}[‖S^{Ωc}_{1/(3√m)}[UV∗]‖F²] ≤ (16/(κ log m)) m^{(1 − κ/72)}.    (97)

Applying the Cauchy-Schwarz inequality completes the proof.

Lemma 4.17. Fix any β > 0, κ > 0. Consider (U, V, Ω) drawn from the random orthogonal model of rank r, with Bernoulli error probability ρs, and suppose that r, ρs satisfy

r < C0 m/log m,   C0 < 1/(4κ),   τ(r/m, ρs) < 1/4.    (98)

Then there exist m0, C1 > 0 such that for all m > m0,

P_{U,V,Ω}[ ‖S^{Ωc}_{1/(3√m)}[H†[UV∗]]‖F > β ] ≤ C1 m^{(1/2 − κ/2048)} / (β√(κ log m)) + P_{U,V,Ω}[EU(κ)^c ∪ EV(κ)^c ∪ EΩΘ^c].


Proof. We use a similar splitting trick to that in Lemma 4.7. Again let Û and V̂ be uniformly distributed on W^m_{m−1}. Identify U and V with their first r columns, and let Ū and V̄ denote the remaining m − r − 1 columns, so that UV∗ = ÛV̂∗ − ŪV̄∗. Write

‖S^{Ωc}_{1/(3√m)}[H†[UV∗]]‖F = ‖S^{Ωc}_{1/(3√m)}[H†[ÛV̂∗ − ŪV̄∗]]‖F
                            ≤ ‖S^{Ωc}_{1/(6√m)}[H†[ÛV̂∗]]‖F + ‖S^{Ωc}_{1/(6√m)}[H†[ŪV̄∗]]‖F.

We first address the second term. On the event EΩΘ,

‖S^{Ωc}_{1/(6√m)}[H†[ŪV̄∗]]‖F = ‖S^{Ωc}_{1/(6√m)}[πΘ⊥ πΩ ∑_{k=0}^∞ (πΩπΘπΩ)^k πΩ[ŪV̄∗]]‖F
                             = ‖S^{Ωc}_{1/(6√m)}[πΩ⊥ πΘ⊥ πΩ ∑_{k=0}^∞ (πΩπΘπΩ)^k πΩ[ŪV̄∗]]‖F
                             = ‖S^{Ωc}_{1/(6√m)}[πΘ πΩ ∑_{k=0}^∞ (πΩπΘπΩ)^k πΩ[ŪV̄∗]]‖F.

Here, we have used that πΩ⊥πΘ⊥πΩ = πΩ⊥(I − πΘ)πΩ = −πΩ⊥πΘπΩ. Now, for any R ∈ SO(m − r − 1), the joint distribution of (Û, V̂, Ω) is invariant under the map

Û ↦ Û [ I  0 ; 0  R ].    (99)

Moreover, this map preserves U and V, and therefore H†. Hence, if R is distributed according to the invariant measure on SO(m − r − 1), ‖S^{Ωc}_{1/(6√m)}[H†[ŪV̄∗]]‖F is equal in distribution to ‖S^{Ωc}_{1/(6√m)}[πΘ πΩ ∑_{k=0}^∞ (πΩπΘπΩ)^k πΩ[Ū R V̄∗]]‖F. Consider the (i, j) element of this matrix,

e_i∗ πΘ πΩ ∑_{k=0}^∞ (πΩπΘπΩ)^k πΩ[Ū R V̄∗] e_j = 〈Ū∗(∑_{k=0}^∞ (πΩπΘπΩ)^k πΩπΘ[e_i e_j∗])V̄, R〉 .= 〈M, R〉.

This is a ‖M‖F-Lipschitz function of R. On EU(κ) ∩ EV(κ) ∩ EΩΘ,

‖M‖F ≤ (‖πΩπΘ‖F,F / (1 − ‖πΩπΘπΩ‖F,F)) ‖πΘ[e_i e_j∗]‖F ≤ (4τ/(1 − 4τ²)) · 1/√(κ log m) ≤ 4/(3√(κ log m)).

Let δ .= √((m − r − 1)/m). Since E_R[〈M, R〉] = 0, the tail bound (53) implies that

P_{R|M}[〈M, R〉 > q(m) + t] < 2 exp(−9κδ²m log(m) t²/32),    (100)

where

q(m) = (4/(3δ)) √(8π/(κ m log m)).    (101)


By Lemma 4.14,

E_{R|U,V,Ω}[ S_{1/(6√m)}[〈M, R〉]² ] ≤ (64/(9κδ²m log m)) exp(−(κδ² log m / 32)(1/2 − (4/δ)√(8π/(κ log m)))²).    (102)

For compactness, let

ζ(r/m, ρs, m) .= 1 − (κδ²/32)(1/2 − (4/δ)√(8π/(κ log m)))².    (103)

Then summing over the ≤ m² elements in Ωc gives that on the event EU(κ) ∩ EV(κ) ∩ EΩΘ,

E_{R|U,V,Ω}[ ‖S^{Ωc}_{1/(6√m)}[H†[Ū R V̄∗]]‖F² ] ≤ (64/(9κδ² log m)) m^ζ.    (104)

By the Markov inequality, on EU(κ) ∩ EV(κ) ∩ EΩΘ,

P_{R|U,V,Ω}[ ‖S^{Ωc}_{1/(6√m)}[H†[Ū R V̄∗]]‖F > β/2 ] < 16 m^{ζ/2} / (3βδ√(κ log m)).    (105)

An identical calculation shows that if we set R̃ = [ I  0 ; 0  R ] ∈ SO(m − 1), then on EU(κ) ∩ EV(κ) ∩ EΩΘ,

P_{R̃|U,V,Ω}[ ‖S^{Ωc}_{1/(6√m)}[H†[Û R̃ V̂∗]]‖F > β/2 ] < 16 m^{ζ/2} / (3βδ√(κ log m)).    (106)

So,

P_{U,V,Ω}[ ‖S^{Ωc}_{1/(3√m)}[H†[UV∗]]‖F > β ] < 32 m^{ζ/2} / (3βδ√(κ log m)) + P[(EU(κ) ∩ EV(κ))^c] + P[EΩΘ^c].

Under the assumptions of the lemma, δ = √((m − r − 1)/m) ≥ √(1 − C0/log m − m^{−1}). Hence, there exists mδ > 0 such that for all m > mδ, δ(m, r) > 1/√2, and so ζ ≤ 1 − (κ/64)(1/2 − 16√(π/(κ log m)))². Furthermore, for any κ, there exists mκ such that for m > mκ, 16√(8π/(κ log m)) < 1/4. For m > max(mδ, mκ), ζ < 1 − κ/1024. For such sufficiently large m, the multiplier 32/(3δ) ≤ 32√2/3. Choosing this value for C1 and setting m0 = max(mδ, mκ) gives the statement of the lemma.

Lemma 4.18 (Box violations induced by error). Fix any α ∈ (0, 1). Let (U, V, Ω, Σ) be distributed according to the random orthogonal model of rank r and with error probability ρs, with r and ρs satisfying

τ(r/m, ρs) < 1/4.    (107)

Then on the good event EU(κ) ∩ EV(κ) ∩ EΩΘ,

E_{Σ|U,V,Ω} ‖S^{Ωc}_{α/√m}[H†[λ sign(E0)]]‖F ≤ (2/√(κ log m)) m^{(1/2 − ακ/4)}.    (108)


Proof. We can exploit the independence of Ω and Σ by writing sign(E0) = πΩ[Σ]. From the representation in the previous lemma,

‖S^{Ωc}_{α/√m}[H†[λ sign(E0)]]‖F = ‖S_{α/√m}[∑_{k=1}^∞ (πΘπΩ)^k [λΣ]]‖F.

The (i, j) element of the matrix of interest is

e_i∗ ∑_{k=1}^∞ (πΘπΩ)^k [λΣ] e_j = 〈λ ∑_{k=1}^∞ (πΩπΘ)^k πΘ[e_i e_j∗], Σ〉 .= 〈M, Σ〉.

On EU(κ) ∩ EV(κ), ‖πΘ[e_i e_j∗]‖F ≤ √(1/(κ log m)). On EΩΘ, ‖∑_{k=1}^∞ (πΘπΩ)^k‖F,F ≤ 2τ/(1 − 2τ) ≤ 1, and so ‖M‖F ≤ 1/√(κ m log m). The same Martingale argument as in Lemma 4.9 shows that

P_Σ[〈M, Σ〉 > t] < exp(−t²/(2‖M‖F²)) = exp(−κ m log(m) t²/2).    (109)

Then by Lemma 4.13,

E_{Σ|U,V,Ω}[(S_{α/√m}[〈M, Σ〉])²] ≤ (2/(κ m log m)) exp(−ακ log(m)/2).    (110)

Summing over the ≤ m² elements in Ωc to bound E[‖·‖F²] and then applying the Cauchy-Schwarz inequality establishes the result.

The three lemmas in this section combine to yield the following bound on the Frobenius norm of the violations.

Corollary 4.19 (Control of box violations). Fix any β > 0, p > 0. There exist constants C0(p) > 0, ρ⋆s > 0, and m0 with the following property: if m > m0 and (U, V, Ω, Σ) are distributed according to the random orthogonal model of rank

r < C0(p) m/log m,    (111)

with Bernoulli error probability ρs ≤ ρ⋆s and random signs, then

P_{U,V,Ω,Σ}[ ‖S^{Ωc}_{5/(6√m)}[UV∗ + W0]‖F ≤ 3β ] ≥ 1 − (C/β) m^{−p}.

Proof. By Lemma 4.12,

‖S^{Ωc}_{5/(6√m)}[UV∗ + W0]‖F ≤ ‖S^{Ωc}_{1/(3√m)}[UV∗]‖F + ‖S^{Ωc}_{1/(3√m)}[H†[UV∗]]‖F + ‖S^{Ωc}_{1/(6√m)}[H†[λ sign(E0)]]‖F.

We use the three previous lemmas to estimate each of these terms. Set κ = 2048p + 1024 and C0 = 1/(16κ). From Lemma 4.16 and the Markov inequality, for any Ω, on EU,

P_{V|U}[ ‖S^{Ωc}_{1/(3√m)}[UV∗]‖F ≥ β ] ≤ (4/(β√(κ log m))) m^{(1/2 − κ/144)},


and so

P_{U,V,Ω,Σ}[ ‖S^{Ωc}_{1/(3√m)}[UV∗]‖F ≥ β ] ≤ (4/(β√(κ log m))) m^{(1/2 − κ/144)} + P[EU(κ)^c] = o(m^{−p}).

From Lemma 4.18,

P_{U,V,Ω,Σ}[ ‖S^{Ωc}_{1/(6√m)}[H†[λ sign(E0)]]‖F ≥ β ] ≤ (2/√(κ log m)) m^{(1/2 − κ/24)} + P[(EU(κ) ∩ EV(κ))^c] + P[EΩΘ^c] = o(m^{−p}).

Finally, by Lemma 4.17,

P_{U,V,Ω,Σ}[ ‖S^{Ωc}_{1/(3√m)}[H†[UV∗]]‖F ≥ β ] ≤ (C1/(β√(κ log m))) m^{(1/2 − κ/2048)} + P[(EU(κ) ∩ EV(κ))^c] + P[EΩΘ^c]
                                             = (C1 m^{−p})/(β√(κ log m)) + o(m^{−p}).

Summing the three failure probabilities completes the proof.

4.5 Pulling it all together

We close by fitting the probabilistic lemmas of the previous three sections together to give a proof of our main result, Theorem 2.4.

Proof. We show that C0 and m0 can be selected such that with high probability the conditions of Lemma 3.3 are satisfied. Set ε = 1/(6√m). Fix m0 large enough and c⋆ > 0 small enough that on the good event from Lemma 4.11, as long as C0 ≤ 1,

ξ_{c⋆} ≤ (32/9)(√(1/log m0) + √c⋆ + 2√H(c⋆)) ≤ 1/2.    (112)

We then must show that

‖UV∗ + W0‖1,2 + 2‖S^{Ωc}_{5/(6√m)}[UV∗ + W0]‖F ≤ (5/6)√c⋆.    (113)

Notice that

‖UV∗ + W0‖1,2 ≤ ‖UV∗‖1,2 + ‖W0‖1,2 ≤ max_i ‖V_{i,•}‖2 + ‖W0‖2,2.    (114)

From Lemmas 4.7 and 4.9, for any choice of C0, ‖W0‖2,2 is with overwhelming probability bounded by a linear function of τ. For any fixed C0, lim_{m→∞} τ(r/m, ρs) = 2√ρs. Hence, by choosing m0 large and ρs small, we can bound

‖W0‖2,2 ≤ (5/18)√c⋆    (115)

with overwhelming probability. Meanwhile, on the overwhelmingly likely good event EV,

‖V_{i,•}‖2 ≤ (5/18)√c⋆    (116)


for m sufficiently large. Finally, fix β = (5/36)√c⋆ and choose C0 small enough and m0 large enough that the conditions of Corollary 4.19 are satisfied. Then, with probability at least 1 − Cm^{−p}, (25) is satisfied. The same calculations apply to (26).

Finally, on these good events,

‖W0‖2,2 + (1/(1 − ξc)) ‖S^{Ωc}_{5/(6√m)}[UV∗ + W0]‖F ≤ (5/18)√c⋆ + 2 × (5/36)√c⋆ < 1,    (117)

so (27) holds. By Lemma 3.3, then, there exists a certifying dual vector, and the proof is complete.

5 Implications on Low-Rank Matrix Completion

Our result has strong implications for the low-rank matrix completion problem studied in [6, 8, 5]. In matrix completion, the goal is to recover a low-rank matrix A0 from an observation consisting of a known subset Υ .= [m] × [m] \ Ω of its entries.6 [6] and [8] studied the following convex programming heuristic for the matrix completion problem:

Â = arg min ‖A‖∗   subj   A(i, j) = A0(i, j)  ∀ (i, j) ∈ Υ.    (118)
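For readers who wish to experiment with (118), here is a minimal sketch using the CVXPY modeling package (our own illustration, not code from [6] or [8]; the mask-based encoding of the observation constraint, the generic rank-r test matrix, and all parameter values are assumptions made only for this example):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, r, rho_s = 40, 2, 0.1

# A generic rank-r matrix to complete, and a random set Upsilon of observed entries.
A0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, m))
observed = (rng.random((m, m)) >= rho_s).astype(float)   # 1 on Upsilon, 0 elsewhere

A = cp.Variable((m, m))
prob = cp.Problem(cp.Minimize(cp.normNuc(A)),
                  [cp.multiply(observed, A) == cp.multiply(observed, A0)])
prob.solve()
print(np.linalg.norm(A.value - A0, 'fro') / np.linalg.norm(A0, 'fro'))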

Duality considerations in [6] give the following characterization of the matrices A0 that can be recovered by solving (118):

Lemma 5.1 ([6], Lemma 3.1). Let A0 ∈ R^{m×m} be a rank-r matrix with reduced singular value decomposition USV∗. As above, let

Θ .= span({UM∗ | M ∈ R^{m×r}} ∪ {MV∗ | M ∈ R^{m×r}}) ⊂ R^{m×m}.    (119)

Suppose there exists a matrix Y such that

πΘ[Y] = UV∗,   πΥ⊥[Y] = 0,   ‖πΘ⊥[Y]‖2,2 < 1.    (120)

Then A0 is the unique solution to the semidefinite program

min ‖A‖∗   subj   πΥ[A] = πΥ[A0].    (121)

Candes and collaborators then consider the minimum Frobenius norm solution Y0 to the system of equations (120) and show that if the number of observations |Υ| is large enough, ‖Y0‖2,2 is bounded below one with high probability. Recall that in Section 4, we analyzed the operator norm of the minimum Frobenius norm solution to a similar system of equations. In fact, we will see that Y0 is exactly equal to UV∗ + W0^{UV}, where W0^{UV} is the part of the initial (RPCA) dual vector that was induced by the singular vectors of A0. Using Lemma 4.7, the operator norm of W0^{UV} can be bounded below one with overwhelming probability, even in a proportional growth setting. This yields Theorem 2.5, which we formally prove below:

6 In [6], Ω denotes this set.


Proof. Notice that (120) is feasible if and only if the system

πΘ[W] = 0,   πΥ⊥[W] = πΥ⊥[−UV∗],   ‖W‖2,2 < 1    (122)

is feasible in W .= Y − UV∗. Here, Υ⊥ is the set of missing elements from the matrix to be completed; we can identify this set with the set of corrupted entries Ω in the low-rank recovery problem that is the main focus of this paper. This is again a random subset of [m] × [m] in which the inclusion of each pair (i, j) is an independent Bernoulli(ρs) random variable. Hence, if we can show that the matrix W0^{UV} defined in Lemma 4.7 as the minimum Frobenius norm solution to the equality constraints πΘ[W] = 0, πΩ[W] = πΩ[−UV∗] has operator norm bounded below 1, we will have further established that A0 uniquely solves (118), and hence can be efficiently recovered by nuclear norm minimization. Lemma 4.7 immediately implies Theorem 2.5 as follows.

First, notice that max(ρr, ρs) < 1/289 implies τ(ρr, ρs) < 1/4. Under this condition, with probability at least 1 − exp(−Cm + O(log m)), the minimizer W0^{UV} is uniquely defined and satisfies

‖W0^{UV}‖2,2 ≤ 64τ(ρr, ρs)(1 + (8/3)√(1/(1 − ρr − m^{−1}))) + (8/3)√(2π/((1 − ρr)m − 1)).    (123)

Choose m0 large enough that the second term is < 1/2 and 1 + (8/3)√(1/(1 − ρr − m^{−1})) ≤ 4. Then on the above good event, ‖W0^{UV}‖2,2 < 256τ(ρr, ρs) + 1/2. For ρr, ρs sufficiently small, this is bounded below one: for example, max(ρr, ρs) < 1/2049² suffices.7

Thus, matrices A0 distributed according to the random orthogonal model of rank as large as r = Cm can be recovered from incomplete subsets Υ of size ≤ (1 − ρs)m². This is the first result to suggest that matrix completion should succeed in such a proportional growth scenario. The previous best result for A0 distributed according to the random orthogonal model, due to [8], showed that

|Υ| = C m r log⁸(m)    (124)

observations suffice. When r ≥ m/(C log⁸(m)), this result becomes vacuous, since in this case the prescribed number of measurements exceeds m².

The fact that our result holds in proportional growth may seem surprising in light of the discussion in [8]: as noted there, if all one assumes is that the singular spaces of A0 are incoherent with the standard basis, then |Υ| = O(mr log m) measurements are necessary. It is important to note, though, that the singular spaces of matrices A0 distributed according to the random orthogonal model satisfy rich regularity properties in addition to incoherence. In particular, the submatrix norm considerations used in bounding ‖πΘπΩ‖F,F in our proof of Theorem 2.4 can be viewed as a kind of restricted isometry property for the singular vectors U and V. Furthermore, our proofs make heavy use of the independence of U and V in the random orthogonal model. Independence and orthogonal invariance allow us to introduce an auxiliary randomization R over the choice of basis for U, decoupling H† (which only depends on the subspace spanned by U and not on any choice of basis for the space) from UV∗, which very clearly does depend on the choice of basis. Finally, one should also note that our Theorem 2.5 only supersedes (124) for relatively large rank r. For smaller (say, fixed) rank r, (124) remains the strongest available result for matrix completion in the random orthogonal model.

7 These estimates are clearly pessimistic: larger constant fractions ρr are possible. However, the argument given here already suffices to establish that matrix completion succeeds in a new regime of matrices whose rank is a constant fraction of the dimensionality.


6 Iterative Thresholding for Corrupted Low-rank Matrix Recovery

There are a number of possible approaches to solving the robust PCA semidefinite program (2). For small problem sizes, interior point methods offer superior accuracy and convergence rates. However, off-the-shelf interior point solvers become impractical for data matrices larger than about 70 × 70, due to the O(m⁶) complexity of solving the associated Newton system. In this section, we propose an alternative first-order method with far better scaling behavior, with only O(m³) complexity per iteration. It can easily handle matrices up to 1,000 × 1,000 on a desktop PC. We will use this algorithm for all of our experiments in the next section.

Our approach is a straightforward combination of iterative thresholding algorithms for nuclear norm minimization [4] and ℓ1-norm minimization [31]. As such, it solves a slightly modified version of (2), in which a small multiple of the Frobenius norm is appended:

min ‖A‖∗ + λ‖E‖1 + (1/(2τ))‖A‖F² + (1/(2τ))‖E‖F²   subj   A + E = D.    (125)

Here, τ is a large constant; as τ → ∞, we expect the solution to coincide with that of (2).8

8 This can most likely be shown by a similar argument to Theorem 3.1 of [4].

Our algorithm for solving (125) maintains an estimate of a dual variable Y ∈ R^{m×n}, consisting of the Lagrange multipliers for the equality constraint D = A + E. It proceeds by soft thresholding the singular values of Y to obtain an estimate of A, and soft thresholding the entries of Y themselves to obtain an estimate of E. This is made precise in Algorithm 1 below.

Algorithm 1: Iterative Thresholding for Corrupted Low-rank Matrix Recovery

Input: Observation matrix D ∈ R^{m×n}, weights λ, τ.
while not converged:
    (U, S, V) ← svd(Y_{k−1})
    A_k ← U [S − τI]_+ V∗
    E_k ← sign(Y_{k−1}) ∘ [|Y_{k−1}| − λτ 11∗]_+
    Y_k ← Y_{k−1} + δ_k (D − A_k − E_k)
end
Output: A.
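For concreteness, a NumPy sketch of Algorithm 1 follows. It is our own rendering, not the authors' reference code: the initialization Y0 = 0, the constant step size δ, and the stopping rule are choices we have filled in for illustration (the convergence result below only requires 0 < δk < 1).

import numpy as np

def rpca_iterative_thresholding(D, lam=None, tau=1e4, delta=0.9,
                                max_iter=10000, tol=1e-6):
    """Iterative thresholding for min ||A||_* + lam ||E||_1
    + (1/(2 tau))(||A||_F^2 + ||E||_F^2)  subject to  A + E = D   (cf. (125))."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(D.shape))   # lambda = m^{-1/2} in the square case
    Y = np.zeros_like(D)                    # assumed initialization of the dual variable
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(max_iter):
        # Soft-threshold the singular values of Y (singular value shrinkage D_tau).
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        A = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # Soft-threshold the entries of Y (entrywise shrinkage S_{lam*tau}).
        E = np.sign(Y) * np.maximum(np.abs(Y) - lam * tau, 0.0)
        # Dual ascent step on the multipliers for the constraint A + E = D.
        residual = D - A - E
        Y = Y + delta * residual
        if np.linalg.norm(residual, 'fro') <= tol * np.linalg.norm(D, 'fro'):
            break
    return A, E

With λ = m^{−1/2} and τ = 10,000, this matches the parameter choices reported for the experiments in Section 7, although runtimes with a plain dense SVD will of course differ from those in Table 1.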

We prove that the iteration in Algorithm 1 converges to the global optimum of (125). The proof is a straightforward modification of the arguments of [4]. Let

φτ(A, E) .= τ‖A‖∗ + λτ‖E‖1 + (1/2)‖A‖F² + (1/2)‖E‖F².

Lemma 6.1. Let (Z1, Z2) ∈ ∂φτ(A, E) and (Z′1, Z′2) ∈ ∂φτ(A′, E′). Then

〈Z1 − Z′1, A − A′〉 + 〈Z2 − Z′2, E − E′〉 ≥ ‖A − A′‖F² + ‖E − E′‖F².

Proof. Since (Z1, Z2) ∈ ∂φτ(A, E), Z1 ∈ ∂(τ‖A‖∗ + (1/2)‖A‖F²) and Z2 ∈ ∂(λτ‖E‖1 + (1/2)‖E‖F²). We can therefore write Z1 = τ(UV∗ + Q) + A, where USV∗ is a compact singular value decomposition of A, ‖Q‖2,2 ≤ 1, U∗Q = 0, and QV = 0. A similar expression can be given for Z′1. So,

〈Z1 − Z′1, A − A′〉 = τ〈UV∗ + Q − U′V′∗ − Q′, A − A′〉 + ‖A − A′‖F²
                   = τ(‖A‖∗ + ‖A′‖∗ − 〈UV∗ + Q, A′〉 − 〈U′V′∗ + Q′, A〉) + ‖A − A′‖F²,

where we have used that 〈UV∗, A〉 = tr[V U∗USV∗] = tr[S] = ‖A‖∗. Since |〈UV∗ + Q, A′〉| ≤ ‖UV∗ + Q‖2,2 ‖A′‖∗ ≤ ‖A′‖∗ (and likewise for |〈U′V′∗ + Q′, A〉|), the right-hand side above is ≥ ‖A − A′‖F². Similarly, Z2 = λτW + E, where Wij = sign(Eij) for Eij ≠ 0, and Wij ∈ [−1, 1] if Eij = 0. So,

〈Z2 − Z′2, E − E′〉 = λτ〈W − W′, E − E′〉 + ‖E − E′‖F²
                   = λτ(〈W, E〉 + 〈W′, E′〉 − 〈W, E′〉 − 〈W′, E〉) + ‖E − E′‖F²
                   = λτ(‖E‖1 + ‖E′‖1 − 〈W, E′〉 − 〈W′, E〉) + ‖E − E′‖F².    (126)

Since again |〈W, E′〉| ≤ ‖E′‖1 and |〈W′, E〉| ≤ ‖E‖1, (126) is bounded below by ‖E − E′‖F². Combining the bounds for the two parts completes the proof.

As in [4], let Dα denote the singular value shrinkage operator, which maps a matrix A with SVD A = USV∗ to U[S − αI]_+ V∗.

Lemma 6.2. The thresholding operator ψα,β : (A, E) → (DαA, SβE) solves the optimization problem

min_{B1,B2}  α‖B1‖∗ + β‖B2‖1 + (1/2)‖A − B1‖F² + (1/2)‖E − B2‖F².    (127)

Proof. This follows from the optimality properties of D and S individually.

Theorem 6.3. Suppose that 0 < δk < 1 for all k. Then the sequence (Ak, Ek) obtained by Algorithm 1 converges to the unique global optimum of (125).

Proof. Suppose that (A⋆, E⋆) is optimal for (125), and Y⋆ is the matrix of Lagrange multipliers for the constraint A + E = D. Then there exists (Z1, Z2) ∈ ∂φτ(A⋆, E⋆) such that

Z1 − Y⋆ = 0   and   Z2 − Y⋆ = 0.

Similarly, since (Ak, Ek) = ψτ,λτ(Yk−1, Yk−1), by Lemma 6.2 there exists (Zk1, Zk2) ∈ ∂φτ(Ak, Ek) such that9

Zk1 − Yk−1 = 0   and   Zk2 − Yk−1 = 0.

Combining the two, we can write

Yk−1 − Y⋆ = Zk1 − Z1   and   Yk−1 − Y⋆ = Zk2 − Z2.

This implies, via Lemma 6.1, that

〈Ak − A⋆, Yk−1 − Y⋆〉 = 〈Ak − A⋆, Zk1 − Z1〉 ≥ ‖Ak − A⋆‖F²,
〈Ek − E⋆, Yk−1 − Y⋆〉 = 〈Ek − E⋆, Zk2 − Z2〉 ≥ ‖Ek − E⋆‖F².

Now,

‖Yk − Y⋆‖F² = ‖Yk−1 − Y⋆ + δk(D − Ak − Ek)‖F²
            = ‖Yk−1 − Y⋆‖F² + 2δk〈Yk−1 − Y⋆, D − Ak − Ek〉 + δk²‖D − Ak − Ek‖F²
            ≤ ‖Yk−1 − Y⋆‖F² − 2δk〈Yk−1 − Y⋆, Ak − A⋆〉 − 2δk〈Yk−1 − Y⋆, Ek − E⋆〉 + δk²‖D − Ak − Ek‖F²
            ≤ ‖Yk−1 − Y⋆‖F² − 2δk‖Ak − A⋆‖F² − 2δk‖Ek − E⋆‖F² + 2δk²‖Ak − A⋆‖F² + 2δk²‖Ek − E⋆‖F².

9 Noticing that the objective function in (127) can be written as φτ(B1, B2) − 〈(B1, B2), (A, E)〉 + (1/2)‖(A, E)‖F².


Hence, if 0 < δk < 1, the sequence ‖Yk − Y⋆‖F is nonincreasing. This sequence therefore converges to a limit (possibly 0), which in turn forces ‖Ak − A⋆‖F → 0 and ‖Ek − E⋆‖F → 0, establishing the result.

7 Simulations and Experiments

In this section, we first perform several simulations corroborating our theoretical results and clarifying their implications. We then sketch applications to two computer vision tasks involving the recovery of intrinsically low-dimensional data from gross corruption: background estimation from video and face alignment under varying illumination.10

7.1 Simulation: proportional growth.

We demonstrate the efficacy of the proposed iterative thresholding algorithm on randomly generated matrices. For simplicity, we restrict our examples to square matrices. We draw A0 according to the independent random orthogonal model described earlier. We generate E0 as a sparse matrix whose support is chosen uniformly at random, and whose non-zero entries are independent and uniformly distributed in the range [−500, 500]. We apply the proposed algorithm to the matrix D .= A0 + E0 to recover A and E.

The results are presented in Table 1. For these experiments, we choose λ = m^{−1/2} and τ = 10,000. We observe that the proposed algorithm successfully recovers A0 even when 10% of its entries are corrupted, and the relative error in recovering A0 decreases as the dimensionality m increases. We note that although the algorithm in some cases overestimates the rank of A, the extra nonzero singular values of A tend to have very small magnitude.

  m  | rank(A0) | ‖E0‖0  | ‖A−A0‖F/‖A0‖F | rank(A) | ‖E‖0   | #iterations | time (s)
 100 |    5     |    500 |  3.6 × 10⁻⁵   |    5    |    511 |   10,000    |    120
 200 |   10     |  2,000 |  2.2 × 10⁻⁵   |   10    |  2,011 |    8,617    |    675
 400 |   20     |  8,000 |  6.7 × 10⁻⁶   |   20    |  8,014 |   10,000    |  6,364
 800 |   40     | 32,000 |  5.2 × 10⁻⁶   |   40    | 32,081 |    6,775    | 31,344

 100 |    5     |  1,000 |  7.1 × 10⁻⁵   |    5    |  1,005 |   10,000    |    141
 200 |   10     |  4,000 |  2.3 × 10⁻⁵   |   10    |  4,022 |   10,000    |    842
 400 |   20     | 16,000 |  1.1 × 10⁻⁵   |   23    | 16,091 |   10,000    |  6,713
 800 |   40     | 64,000 |  5.8 × 10⁻⁶   |   74    | 64,552 |   10,000    | 35,981

Table 1: Proportional growth. Here the rank of the matrix grows in proportion (5%) to the dimensionality m, and the number of corrupted measurements grows in proportion to the number of entries m² (5% in the top block, 10% in the bottom block).
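A sketch of the data-generation protocol described above (our own code; rpca_iterative_thresholding refers to the sketch given after Algorithm 1, the choice of unit singular values for A0 is an assumption the text does not pin down, and the problem size is kept small so the example runs quickly):

import numpy as np

rng = np.random.default_rng(0)
m, rank_frac, err_frac = 100, 0.05, 0.05
r, k = int(rank_frac * m), int(err_frac * m * m)

# A0 from the random orthogonal model: random orthonormal factors,
# here with unit singular values for simplicity.
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
A0 = U @ V.T

# E0: support uniform at random, nonzero entries uniform in [-500, 500].
E0 = np.zeros((m, m))
E0.flat[rng.choice(m * m, size=k, replace=False)] = rng.uniform(-500, 500, size=k)

D = A0 + E0
A, E = rpca_iterative_thresholding(D, lam=m ** -0.5, tau=1e4)   # sketch above
print(np.linalg.norm(A - A0, 'fro') / np.linalg.norm(A0, 'fro'))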

10 Note that it might seem to be overkill to cast these relatively simple vision tasks as robust PCA problems. There may exist easier and faster solutions that harness special domain information or additional structures in the visual data. Here, we merely use these intuitive examples and data to illustrate how our algorithm can be used as a general tool to effectively extract low-dimensional and sparse structures from real data.


7.2 Simulation: phase transition wrt (ρr, ρs).

We examine how the rank of A and the proportion of errors in E affect the performance of our algorithm. We fix m = 200 and then construct a random matrix pair (A0, E0) satisfying rank(A0) = ρr m and ‖E0‖0 = ρs m², where both ρr and ρs vary between 0 and 1. The parameters λ and τ are the same as in the previous experiment. For each choice of ρr and ρs, we apply the proposed algorithm to D, and term a given experiment a success if the recovered A satisfies ‖A − A0‖F/‖A0‖F < 0.05. Each experiment is repeated over 10 trials and the results are plotted in Figure 1 (left). The color of each cell reflects the empirical recovery rate: white denotes perfect recovery in all experiments, and black denotes failure in all experiments. We see that there is a relatively sharp phase transition between success and failure of the algorithm on the line ρr + ρs = 0.25. To verify this behavior, we repeat the experiment, but only vary ρr and ρs between 0 and 0.3. These results, shown in Figure 1 (right), confirm that the phase transition remains fairly sharp even at higher resolution. These results are conservative, and can potentially be improved by increasing the maximum number of iterations (here chosen to be 5,000), or by varying the choices of λ and τ.

Figure 1: Phase transition wrt (ρr, ρs). Left: (ρr, ρs) ∈ [0, 1]². Right: (ρr, ρs) ∈ [0, 0.3]².

7.3 Experiment: background modeling from video.

Background modeling or subtraction from video sequences is a popular approach to detecting activity in the scene, and finds application in video surveillance from static cameras. The key problem here is to guarantee robustness to partial occlusions and illumination changes in the scene. Background subtraction fits nicely into our model: if the individual frames are stacked as columns of a matrix D, then D can be expressed as the sum of a low-rank background matrix and a sparse error matrix representing the activity in the scene. We illustrate this idea with two examples (see Figures 2 and 3). In Figure 2, the video sequence consists of 200 frames of a scene in an airport. There is no significant change in illumination in the video, but a lot of activity in the foreground. We observe that our algorithm is very effective in separating the background from the activity. In Figure 3, we have 550 frames from a scene in a lobby. There is little activity in the video, but the illumination changes drastically towards the end of the sequence. We see that our algorithm is once again able to recover the background, irrespective of the illumination change.


Figure 2: Background modeling. (a) Video sequence of a scene in an airport. The size of each frame is 72 × 88 pixels, and a total of 200 frames were used. (b) Static background recovered by our algorithm. (c) Sparse error recovered by our algorithm, representing activity in the frame.

7.4 Experiment: removing shadows and specularities from face images.

Face recognition is another domain in computer vision where low-dimensional linear models have received a great deal of attention, due to the work of [1]. The key observation is that, under certain idealized circumstances, images of the same face under varying illumination lie near an approximately nine-dimensional linear subspace known as the harmonic plane. However, since faces are neither perfectly convex nor Lambertian, face images taken under directional illumination often suffer from self-shadowing, specularities, or saturations in brightness.

Suppose that we have a matrix D whose columns represent perfectly aligned training images of a person's face under various illumination conditions. Our algorithm offers a principled way of removing the shadows and specularities in the images, since these artifacts are concentrated on small portions of the face images, i.e., they are sparse in the image domain. Figure 4 illustrates the results of our algorithm on images from the Extended Yale B database [17]. We observe that our algorithm removes the specularities in the eyes and the shadows around the nose region. This technique is potentially useful for pre-processing training images in face recognition systems to remove such deviations from the linear model.


Figure 3: Background modeling. (a) Video sequence of a lobby scene with changing illumination. The size of each frame is 64 × 80 pixels, and a total of 550 frames were used. (b) Static background recovered by our algorithm. (c) Sparse error. The background is correctly recovered even when the illumination in the room changes drastically, as in the frame on the last row.

8 Discussion and Open Problems

Our results give strong evidence for the efficacy of convex programming approaches to recovering low-rank matrices from non-ideal observations. However, there remain many fascinating open questions in this area. In this section, we outline several promising directions for future work.

8.1 Robust PCA with proportional growth in rank

Our Theorem 2.4 shows that matrices of rank as large as O(m/log m) can be recovered from almost all corruption patterns affecting O(m²) of their entries, even with no prior knowledge of the identities of the corrupted elements. Moreover, our analysis showed that matrix completion is possible for matrices of rank as large as O(m), even with O(m²) elements missing. It is tempting to conjecture that for robust PCA, it is similarly possible to remove the logarithmic factor:

Conjecture 8.1. There exist ρ⋆r > 0, ρ⋆s > 0 such that if A0 is distributed according to the random orthogonal model of rank

r ≤ ρ⋆r m    (128)

and E0 is distributed according to the Bernoulli sign-and-support model with error probability ρs ≤ ρ⋆s, then with probability approaching one as m → ∞,

(A0, E0) = arg min_{A,E} ‖A‖∗ + (1/√m)‖E‖1   subj   A + E = A0 + E0,    (129)

and the minimizer is unique.

Figure 4: Removing shadows and specularities from face images. (a) Cropped and aligned images of a person's face under different illuminations from the Extended Yale B database. The size of each image is 96 × 84 pixels, and a total of 29 different illuminations were used for each person. (b) Images recovered by our algorithm. (c) The sparse errors returned by our algorithm correspond to specularities in the eyes, shadows around the nose region, or brightness saturations on the face.

The logarithmic factor in our result arises from the (rather crude) use of the Frobenius norm to bound the ℓ1 → ℓ2, ℓ2 → ℓ∞ and ℓ2 → ℓ2 operator norms in Lemma 3.3. Of the terms in Lemma 3.3, only ‖S[UV∗ + W0]‖F does not scale properly in proportional growth (it scales as O(√m)). A sharper analysis of the same iterative refinement procedure, one that does not depend on this Frobenius bound, might be able to achieve a proportional growth result. The main difficulty seems to involve quantifying the interactions between the soft thresholding operator and the ℓ2 → ℓ2 operator norm, and between the projection operator πΓ and the ℓ1 → ℓ2 operator norm.


8.2 Geometric interpretations

Another possible avenue for improving the result and gaining additional insight is via the geometry of the norm

‖(A, E)‖ .= ‖A‖∗ + (1/√m)‖E‖1.    (130)

The unit ball of this norm is

B = conv(B∗ × {0}, {0} × √m B1),    (131)

and its image under the linear map of interest (A, E) ↦ A + E is the convex body

C = conv(B∗ ∪ √m B1).    (132)

The pairs (A0, E0) that can be recovered by minimizing ‖·‖ are exactly those that map to points on the boundary of C. Our main result can be viewed as a partial characterization of the boundary of this object. For ℓ1-norm minimization, Donoho and Tanner have used such a geometric interpretation to give a sharp characterization of the phase transition for correct recovery of sparse solutions, building on a long line of work in discrete geometry (see [14] and references therein). For our problem, either by leveraging results in geometry or by more careful probabilistic analysis, it may be possible to better understand the boundary of C, and hence better characterize the performance of the Robust PCA semidefinite program.

8.3 Robust PCA with noise

Most matrices arising in real applications are not perfectly low-rank. That is, the singular values of A0 may be compressible, rather than being perfectly sparse. Alternatively, we can imagine that the matrix D was generated by adding both sparse errors and small but dense noise (e.g., Gaussian) to a perfectly low-rank matrix A0:

D = A0 + E0 + Z0,    (133)

where Z0 is a Gaussian matrix whose entries have small variance. While our simulations and real-data experiments suggest that the convex program (2) still recovers interesting low-rank solutions even with noisy data, our theorems do not rigorously pertain to this case. [5] have recently shown that by solving a slightly modified semidefinite program, low-rank matrices can be stably reconstructed from noisy, incomplete measurements: the error in the recovered matrix is proportional to the noise level. We believe that similar results are possible for the robust PCA problem considered here. For instance, one could start with the following relaxed version of the convex program:

min ‖A‖∗ + λ‖E‖1   subj   ‖D − A − E‖F² ≤ ε².    (134)

A natural conjecture is that if ‖Z0‖F² < ε², the recovered matrix A is close to the true low-rank matrix A0 in the following sense: ‖A − A0‖F² ≤ cε² for some constant c.
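For concreteness, the relaxed program (134) can be prototyped directly with the CVXPY modeling package. The sketch below is our own illustration, with placeholder data and an arbitrary choice of ε, not a formulation taken from the paper:

import cvxpy as cp
import numpy as np

m, lam, eps = 40, 40 ** -0.5, 1e-1
rng = np.random.default_rng(0)
D = rng.standard_normal((m, 2)) @ rng.standard_normal((2, m))   # placeholder data

A = cp.Variable((m, m))
E = cp.Variable((m, m))
prob = cp.Problem(cp.Minimize(cp.normNuc(A) + lam * cp.sum(cp.abs(E))),
                  [cp.norm(D - A - E, 'fro') <= eps])
prob.solve()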

8.4 Simultaneous error correction and matrix completion

In many real-world applications, we are simultaneously confronted with both missing and corrupted data. In this case the system must be capable of both filling in the missing entries and detecting and correcting any gross errors that may be present in the available observations. Formally, we imagine that only a subset Υ ⊂ [m] × [m] of the entries of a low-rank matrix A0 is available, and that, in addition, a smaller subset Ω ⊂ Υ is subject to large corruption. A natural heuristic for recovering A0 is then to solve the convex program

min ‖A‖∗ + λ‖E‖1   subj   Aij + Eij = Dij  ∀ (i, j) ∈ Υ.    (135)

Suppose that A0 is drawn from the random orthogonal model of rank O(m/log m), Υ is a random subset in which each entry occurs independently with probability ρobs, and E0 follows the Bernoulli sign-and-support model with error probability ρs. In this case, our analysis already implies that for ρobs large and ρs small, (135) successfully recovers A0 with high probability. However, for a given ρs, characterizing the number of measurements |Υ| needed to achieve robustness to almost all corruption patterns of support size ρs m² is an open question deserving further analysis.

8.5 More scalable algorithms

While the iterative thresholding algorithm presented in Section 6 represents a significant improvement over generic semidefinite programming solvers, on a current desktop computer its applicability is limited to matrices of size m = O(10³). For larger-scale problems, it would be interesting to look for more efficient optimization techniques that exploit special structure in this semidefinite program. The brunt of the computation in Algorithm 1 occurs in computing and thresholding the singular values of the dual variable Y, which has complexity O(m³). It may be possible to dramatically reduce the overall computation time by cleverly approximating or bypassing this operator. Parallel implementation on GPUs or distributed implementation across multiple machines are other interesting avenues for future improvements. Finally, in the spirit of the recent work of [25] on sparse reconstruction and [23] on rank minimization, it may be possible to find efficient greedy algorithms for robust PCA whose error correction capability approaches that of the semidefinite program (2).
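One concrete instance of approximating this operator is to replace the full SVD with a truncated SVD when only a few singular values of Y exceed τ. The sketch below is our own suggestion, not a method proposed in this paper; it uses SciPy's svds and is valid only while the number of singular values above the threshold stays below k:

import numpy as np
from scipy.sparse.linalg import svds

def shrink_singular_values_truncated(Y, tau, k=50):
    # Compute only the k largest singular triplets of Y and soft-threshold them.
    # Requires k < min(Y.shape); correct only if at most k singular values exceed tau.
    U, s, Vt = svds(Y, k=k)
    return (U * np.maximum(s - tau, 0.0)) @ Vt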

Acknowledgement

The authors would like to thank the following individuals for enlightening discussions and useful suggestions: Emmanuel Candes and Ben Recht of the California Institute of Technology, Maryam Fazel of the University of Washington, Harm Derksen of the University of Michigan, and Zhouchen Lin of Microsoft Research Asia. This work was partially supported by the grants NSF IIS 08-49292, NSF ECCS 07-01676, and ONR N00014-09-1-0230. JW was also partially supported by a Microsoft Fellowship.

References

[1] R. Basri and D. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218–233, 2003.

[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] J. Cai, E. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. preprint, 2008.

[5] E. Candes and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 2009.

[6] E. Candes and B. Recht. Exact matrix completion via convex optimization. preprint, 2008.

[7] E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12), 2005.

[8] E. Candes and T. Tao. The power of convex relaxation: Near-optimal matrix completion. preprint, 2009.

[9] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Sparse and low-rank matrix decompositions. In IFAC Symposium on System Identification, 2009.

[10] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.

[11] Y. Chikuse. Statistics on Special Manifolds. Springer-Verlag, 2003.

[12] D. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 2000.

[13] D. Donoho. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.

[14] D. Donoho and J. Tanner. Counting faces of randomly projected polytopes when the projection radically lowers dimension. Journal of the American Mathematical Society, 22(1):1–53, 2009.

[15] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936.

[16] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24:381–385, 1981.

[17] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 2001.

[18] R. Gnanadesikan and J. Kettenring. Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 28:81–124, 1972.

[19] P. Huber. Robust Statistics. Wiley and Sons, 1981.

[20] I. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.

[21] Q. Ke and T. Kanade. Robust ℓ1-norm factorization in the presence of outliers and missing data by alternative convex programming. 2005.

[22] M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, 2001.

[23] K. Lee and Y. Bresler. Efficient and guaranteed rank minimization by atomic decomposition. preprint, 2009.

[24] V. Milman and G. Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces. Springer-Verlag, 1986.

[25] D. Needell and J. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. 2008.

[26] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum rank solution of matrix equations via nuclear norm minimization. Submitted to SIAM Review, 2008.

[27] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[28] F. De La Torre and M. Black. A framework for robust subspace learning. International Journal of Computer Vision, 54:117–142, 2003.

[29] R. Vershynin. Spectral norms of products of random and deterministic matrices. preprint, 2009.

[30] J. Wright and Y. Ma. Dense error correction via ℓ1-minimization. Submitted to IEEE Transactions on Information Theory, 2008.

[31] W. Yin, E. Hale, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. preprint, 2008.
