Page 1:

Low Rank Approximation, Lecture 9: Manifold optimization

Daniel Kressner, Chair for Numerical Algorithms and HPC

Institute of Mathematics, EPFL, [email protected]

1

Page 2:

Manifold optimization

General setting: aim at solving the optimization problem

min_{X ∈ M_r} f(X),

where M_r is a manifold of rank-r matrices or tensors.

Goal: modify classical optimization algorithms (line search, Newton, quasi-Newton, ...) to produce iterates that stay on M_r.

Advantages over ALS:
- No need to solve subproblems, at least for first-order methods;
- Can draw on concepts from classical smooth optimization (line search strategies, convergence analysis, ...).

Two valuable resources:
- Absil/Mahony/Sepulchre: Optimization Algorithms on Matrix Manifolds. PUP, 2008. Available from https://press.princeton.edu/absil
- Manopt, a Matlab toolbox for optimization on manifolds. Available from https://manopt.org/

2

Page 3:

Manifolds

For open sets U ⊂ M, V ⊂ R^d, a chart is a bijective function ϕ : U → V. An atlas of M into R^d is a collection of charts (U_α, ϕ_α) such that:
- ⋃_α U_α = M;
- for any α, β with U_α ∩ U_β ≠ ∅, the change of coordinates

  ϕ_β ∘ ϕ_α^{-1} : R^d → R^d

  is smooth (C^∞) on its domain ϕ_α(U_α ∩ U_β).

Illustration taken from Wikipedia.

3

Page 4:

Manifolds

In the following, we assume that the atlas is maximal. A proper definition of a smooth manifold M needs further properties (the topology induced by the maximal atlas is Hausdorff and second-countable). See [Lee'2003] and [Absil et al.'2008].

Properties of M:
- finite-dimensional vector spaces are always manifolds;
- d = dimension of M;
- M does not need to be connected (in the context of smooth optimization it makes sense to consider connected manifolds only);
- a function f : M → R is differentiable at a point x ∈ M if and only if

  f ∘ ϕ^{-1} : ϕ(U) ⊂ R^d → R

  is differentiable at ϕ(x) for some chart (U, ϕ) with x ∈ U.

4

Page 5:

Manifolds: First examples

Lemma. Let M be a smooth manifold and N ⊂ M an open subset. Then N is a smooth manifold (of equal dimension).

Proof: Given an atlas for M, obtain an atlas for N by selecting charts (U, ϕ) with U ⊂ N.

Example: GL(n,R), the set of real invertible n × n matrices, is a smooth manifold.

EFY. Show that R^{m×n}_*, the set of real m × n matrices of full rank min{m, n}, is a smooth manifold.

EFY. Show that the set of n × n symmetric positive definite matrices is a smooth manifold.

Two main classes of matrix manifolds:
- embedded submanifolds of R^{m×n}; Example: Stiefel manifold of orthonormal bases.
- quotient manifolds; Example: Grassmann manifold R^{m×n}_* / GL(n,R).

We will focus on embedded submanifolds (much easier to work with).

5

Page 6:

Immersions and submersions

Let M_1, M_2 be smooth manifolds and F : M_1 → M_2. Let x ∈ M_1 and y = F(x) ∈ M_2. Choose charts ϕ_1, ϕ_2 around x, y. Then the coordinate representation of F is given by

F̂ := ϕ_2 ∘ F ∘ ϕ_1^{-1} : R^{d_1} → R^{d_2}.

- F is called smooth if F̂ is smooth (that is, C^∞).
- The rank of F at x ∈ M_1 is defined as the rank of DF̂(ϕ_1(x)) (the Jacobian of F̂ at ϕ_1(x)).
- F is called an immersion if its rank equals d_1 at every x ∈ M_1.
- F is called a submersion if its rank equals d_2 at every x ∈ M_1.

6

Page 7:

Embedded submanifolds

A subset N ⊂ M is called an embedded submanifold of dimension k in M if for each point p ∈ N there is a chart (U, ϕ) in M such that all elements of U ∩ N are obtained by varying the first k coordinates only. (See Chapter 5 of [Lee'2003] for more details.)

Theorem. Let M, N be smooth manifolds and let F : M → N be a smooth map with constant rank ℓ. Then each level set

F^{-1}(y) := {x ∈ M : F(x) = y}

is a closed embedded submanifold of codimension ℓ in M.

Corollaries:
- If F : M → N is a submersion, then each level set is a closed embedded submanifold of codimension equal to the dimension of N.
- In fact, by the open submanifold lemma, one only needs to check the full rank condition of a submersion for points in the level set (replace M by the open set on which F has full rank).

7

Page 8:

The Stiefel manifold

For m ≥ n, consider the set of all m × n matrices with orthonormal columns:

St(m,n) := {X ∈ R^{m×n} : X^T X = I_n}.

Corollary. St(m,n) is an embedded submanifold of R^{m×n}.

Proof: Define F : R^{m×n} → symm(n) as F : X ↦ X^T X, where symm(n) denotes the set of n × n symmetric matrices. At X ∈ St(m,n), consider the Jacobian

DF(X) : H ↦ X^T H + H^T X.

Given symmetric Y ∈ R^{n×n}, set H = XY/2. Then DF(X)[H] = Y; thus DF(X) is surjective.

EFY. What is the dimension of the Stiefel manifold?
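The surjectivity argument can be checked numerically. A minimal NumPy sketch (not from the slides; all names are mine), drawing a random X ∈ St(m,n) via a QR factorization and verifying that H = XY/2 is mapped to Y by the Jacobian:

import numpy as np

rng = np.random.default_rng(0)
m, n = 7, 3

# Random point on St(m, n): orthonormal columns via QR
X, _ = np.linalg.qr(rng.standard_normal((m, n)))

# Random symmetric right-hand side Y in symm(n)
B = rng.standard_normal((n, n))
Y = (B + B.T) / 2

# Candidate preimage under DF(X): H -> X^T H + H^T X
H = X @ Y / 2
print(np.allclose(X.T @ H + H.T @ X, Y))   # True: DF(X) maps H to Y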

8

Page 9:

The manifold of rank-k matrices

The locality of the definition of embedded submanifolds implies the following lemma (Lemma 5.5 in [Lee'2003]).

Lemma. Let N be a subset of a smooth manifold M. Suppose every point p ∈ N has a neighborhood U ⊂ M such that U ∩ N is an embedded submanifold of U. Then N is an embedded submanifold of M.

Theorem. Given m ≥ n, the set

M_k = {A ∈ R^{m×n} : rank(A) = k}

is an embedded submanifold of R^{m×n} for every 0 ≤ k ≤ n.

9

Page 10:

The manifold of rank-k matrices

Choose an arbitrary A_0 ∈ M_k. After a suitable permutation, we may assume w.l.o.g. that

A_0 = [ A_11  A_12 ; A_21  A_22 ],   A_11 ∈ R^{k×k} invertible.

This property remains true in an open neighborhood U ⊂ R^{m×n} of A_0. Factorize A ∈ U as

A = [ I  0 ; A_21 A_11^{-1}  I ] · [ A_11  0 ; 0  A_22 − A_21 A_11^{-1} A_12 ] · [ I  A_11^{-1} A_12 ; 0  I ].

Define F : U → R^{(m−k)×(n−k)} as F : A ↦ A_22 − A_21 A_11^{-1} A_12. Then

F^{-1}(0) = U ∩ M_k.

10

Page 11:

The manifold of rank-k matrices

For arbitrary Y ∈ R^{(m−k)×(n−k)}, we obtain that

DF(A)[ [ 0  0 ; 0  Y ] ] = Y.

Thus, F is a submersion. In turn, U ∩ M_k is an embedded submanifold of U. By the lemma, M_k is an embedded submanifold of R^{m×n}.

EFY. What is the dimension of M_k?

EFY. Is M_k connected?

EFY. Prove that the set of symmetric rank-k matrices is an embedded submanifold of R^{n×n}. Is this manifold connected?

11

Page 12:

Tangent space

In the following, much of the discussion is restricted to submanifolds M embedded in a vector space V with inner product 〈·, ·〉 and induced norm ‖·‖.

Given a smooth curve γ : R → M with x = γ(0), we call γ'(0) ∈ V a tangent vector at x. The tangent space T_x M ⊂ V is the set of all tangent vectors at x.

Lemma. T_x M is a subspace of V.

Proof. If v_1, v_2 are tangent vectors then there are smooth curves γ_1, γ_2 such that x = γ_1(0) = γ_2(0) and γ_1'(0) = v_1, γ_2'(0) = v_2. To show that αv_1 + βv_2 for α, β ∈ R is again a tangent vector, consider a chart (U, ϕ) around x such that ϕ(x) = 0. Define

γ(t) = ϕ^{-1}(α ϕ(γ_1(t)) + β ϕ(γ_2(t)))

for t sufficiently close to 0. Then γ(0) = x and γ'(0) = αv_1 + βv_2.

EFY. Prove that the dimension of T_x M equals the dimension of M using a coordinate chart.

12

Page 13:

Tangent space

Application of the definition to the Stiefel manifold. Let

γ(t) = X + tY + O(t^2)

be a smooth curve with X ∈ St(m,n). To ensure that γ(t) ∈ St(m,n), we require

I_n = γ(t)^T γ(t) = (X + tY)^T (X + tY) + O(t^2) = I_n + t(X^T Y + Y^T X) + O(t^2).

Thus, X^T Y + Y^T X = 0 characterizes the tangent space:

T_X St(m,n) = {Y ∈ R^{m×n} : X^T Y = −Y^T X}
            = {XW + X_⊥ W_⊥ : W ∈ R^{n×n}, W = −W^T, W_⊥ ∈ R^{(m−n)×n}},

where the columns of X_⊥ form a basis of span(X)^⊥.

13

Page 14:

Tangent space

When M is defined (at least locally) as a level set of a constant-rank function F : V → R^N, we have

T_x M = ker(DF(x)).

Proof. Let v ∈ T_x M, that is, there is a curve γ : R → M such that γ(0) = x and γ'(0) = v. Then, by the chain rule,

DF(x)[v] = DF(x)[γ'(0)] = ∂/∂t F(γ(t))|_{t=0} = 0,

because F is constant on M. Thus, T_x M ⊂ ker(DF(x)), which completes the proof by counting dimensions.

14

Page 15:

Tangent space of M_k

Recall that M_k was obtained as a level set of the local submersion

F : A ↦ A_22 − A_21 A_11^{-1} A_12.

Given A ∈ M_k, consider the SVD

A = (U  U_⊥) · [ Σ  0 ; 0  0 ] · (V  V_⊥)^T.

We have

DF([ Σ  0 ; 0  0 ])[H] = H_22.

Thus, H is in the kernel if and only if H_22 = 0. In terms of A this implies

T_A M_k = ker(DF(A)) = (U  U_⊥) · [ R^{k×k}  R^{k×(n−k)} ; R^{(m−k)×k}  0 ] · (V  V_⊥)^T
        = {U M V^T + U_p V^T + U V_p^T : M ∈ R^{k×k}, U_p^T U = V_p^T V = 0}.

EFY. Compute the tangent space for the embedded submanifold of rank-k symmetric matrices.
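As a quick numerical illustration of this parametrization of T_A M_k (my own sketch, not part of the lecture), one can build a tangent vector U M V^T + U_p V^T + U V_p^T and check that its (2,2) block in the coordinates (U, U_⊥) × (V, V_⊥) vanishes, i.e., that it lies in ker(DF(A)):

import numpy as np

rng = np.random.default_rng(1)
m, n, k = 8, 6, 3

# Rank-k matrix A and a full SVD to get U, U_perp, V, V_perp
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
Uf, _, Vft = np.linalg.svd(A)
U, U_perp = Uf[:, :k], Uf[:, k:]
V, V_perp = Vft.T[:, :k], Vft.T[:, k:]

# Tangent vector xi = U M V^T + U_p V^T + U V_p^T with U_p^T U = 0, V_p^T V = 0
M = rng.standard_normal((k, k))
U_p = (np.eye(m) - U @ U.T) @ rng.standard_normal((m, k))
V_p = (np.eye(n) - V @ V.T) @ rng.standard_normal((n, k))
xi = U @ M @ V.T + U_p @ V.T + U @ V_p.T

# In the coordinates (U, U_perp) x (V, V_perp) the (2,2) block must vanish
print(np.allclose(U_perp.T @ xi @ V_perp, 0))   # True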

15

Page 16:

Riemannian manifold and gradient

For a submanifold M embedded in a vector space V: the inner product 〈·, ·〉 on V induces an inner product on T_x M. This turns M into a Riemannian manifold.¹

The (Riemannian) gradient of a smooth f : M → R at x ∈ M is defined as the unique element grad f(x) ∈ T_x M that satisfies

〈grad f(x), ξ〉 = Df(x)[ξ]   ∀ ξ ∈ T_x M.

EFY. Prove that the Riemannian gradient satisfies the steepest ascent property

grad f(x) / ‖grad f(x)‖ = arg max_{ξ ∈ T_x M, ‖ξ‖ = 1} Df(x)[ξ].

¹ In general, for a Riemannian manifold one needs an inner product on T_x M that varies smoothly with respect to x.

16

Page 17:

Riemannian gradient

For a submanifold M embedded in a vector space V: the (Euclidean) gradient of f in V admits the decomposition

∇f(x) = P_x ∇f(x) + P_x^⊥ ∇f(x),

where P_x, P_x^⊥ are the orthogonal projections onto T_x M and T_x^⊥ M. For every ξ ∈ T_x M we have

Df(x)[ξ] = 〈∇f(x), ξ〉 = 〈P_x ∇f(x), ξ〉 + 〈P_x^⊥ ∇f(x), ξ〉 = 〈P_x ∇f(x), ξ〉,

where we used that P_x^⊥ ∇f(x) ⊥ ξ. Hence,

grad f(x) = P_x ∇f(x).

The Riemannian gradient is the orthogonal projection of the Euclidean gradient onto the tangent space.

17

Page 18:

Riemannian gradient

Example: Given a symmetric n × n matrix A, consider the trace optimization problem

min_{X ∈ St(n,k)} trace(X^T A X).

Study the first-order perturbation

trace((X + H)^T A (X + H)) − trace(X^T A X)
  = trace(H^T A X) + trace(X^T A H) + O(‖H‖^2)
  = 2 〈H, AX〉 + O(‖H‖^2).

The Euclidean gradient at X is given by 2AX. Note that skew(W) = (W − W^T)/2 is the orthogonal projection onto skew-symmetric matrices. Thus,

P_X(Z) = (I − XX^T)Z + X · skew(X^T Z),

grad f(X) = P_X(∇f(X)) = 2(I − XX^T)AX + 2X · skew(X^T A X) = 2(AX − X X^T A X).
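A hedged sanity check of this formula (my own sketch, not the lecture's code): for a random symmetric A, a random X ∈ St(n,k), and a random tangent vector ξ, the inner product 〈grad f(X), ξ〉 should match a finite-difference approximation of Df(X)[ξ]:

import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 3

A = rng.standard_normal((n, n)); A = (A + A.T) / 2     # symmetric matrix
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X, X_perp = Q[:, :k], Q[:, k:]                         # X in St(n, k)

f = lambda Y: np.trace(Y.T @ A @ Y)
grad = 2 * (A @ X - X @ (X.T @ A @ X))                 # Riemannian gradient from the slide

# Random tangent vector xi = X W + X_perp W_perp with W skew-symmetric
W = rng.standard_normal((k, k)); W = W - W.T
xi = X @ W + X_perp @ rng.standard_normal((n - k, k))

t = 1e-6
fd = (f(X + t * xi) - f(X)) / t                        # finite-difference Df(X)[xi]
print(np.sum(grad * xi), fd)                           # the two numbers agree up to O(t)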

18

Page 19:

Riemannian gradient

Example: For A ∈ M_k consider the SVD A = UΣV^T with Σ ∈ R^{k×k}. Define the orthogonal projections onto span(U), span(V), and their complements:

P_U = UU^T,   P_U^⊥ = I − UU^T,   P_V = VV^T,   P_V^⊥ = I − VV^T.

Recall that

T_A M_k = {U M V^T + U_p V^T + U V_p^T : M ∈ R^{k×k}, U_p^T U = V_p^T V = 0}.

The three terms of the sum are orthogonal to each other and can thus be considered separately. The orthogonal projection onto T_A M_k is given by

P_A(Z) = P_U Z P_V + P_U^⊥ Z P_V + P_U Z P_V^⊥.

EFY. Compute the Riemannian gradient of f(A) = ‖A − B‖_F^2 on M_k for given B ∈ R^{m×n}.
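The projection formula translates directly into code. A small NumPy sketch (function and variable names are mine), using the equivalent form P_A(Z) = P_U Z + Z P_V − P_U Z P_V and checking idempotency:

import numpy as np

def project_tangent(U, V, Z):
    # Orthogonal projection of Z onto T_A M_k for A = U Sigma V^T (thin SVD);
    # equals P_U Z P_V + P_U^perp Z P_V + P_U Z P_V^perp
    return U @ (U.T @ Z) + Z @ V @ V.T - U @ (U.T @ Z @ V) @ V.T

rng = np.random.default_rng(3)
m, n, k = 8, 6, 3
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
U, _, Vt = np.linalg.svd(A, full_matrices=False)
U, V = U[:, :k], Vt[:k, :].T

Z = rng.standard_normal((m, n))
PZ = project_tangent(U, V, Z)
print(np.allclose(project_tangent(U, V, PZ), PZ))   # idempotent: True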

19

Page 20:

Line search: Concepts from the Euclidean case

min_{x ∈ R^N} f(x)

Line search is an optimization algorithm of the form

x_{j+1} = x_j + α_j η_j

with search direction η_j and step size α_j > 0.
- First-order optimal choice of η_j: η_j = −∇f(x_j), i.e., gradient descent. Motivation for other choices: faster local convergence (Newton-type methods), exact gradient computation too expensive, ...
- Gradient-related search directions: 〈η_j, ∇f(x_j)〉 < δ < 0 for all j.

20

Page 21:

Line search: Concepts from the Euclidean case

- Exact line search chooses

  α_j = arg min_α f(x_j + α η_j).

  Only in exceptional cases is this a simple optimization problem, e.g., admitting a closed-form solution.

  EFY. Derive the closed-form solution for exact line search applied to

  min_{x ∈ R^n} (1/2) x^T A x + b^T x

  for symmetric positive definite A ∈ R^{n×n} and b ∈ R^n.

- Alternative: Armijo rule. Let β ∈ (0,1) (typically β = 1/2) and c ∈ (0,1) (e.g., c = 10^{-4}) be fixed parameters. Determine the largest α_j ∈ {1, β, β^2, β^3, ...} such that

  f(x_j + α_j η_j) − f(x_j) ≤ c α_j ∇f(x_j)^T η_j

  holds. (Such an α_j always exists provided that η_j is a descent direction, i.e., when 〈η_j, ∇f(x_j)〉 < 0.)

More details in [J. Nocedal and S. J. Wright. Numerical Optimization. Second edition. Springer Series in Operations Research and Financial Engineering. Springer, 2006].
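For concreteness, a minimal Euclidean Armijo backtracking routine in NumPy (a sketch with β = 1/2 and c = 10^{-4} as above; armijo_step and the toy test function are my own names, not from Nocedal/Wright):

import numpy as np

def armijo_step(f, grad_f, x, eta, beta=0.5, c=1e-4, max_backtracks=50):
    # Largest alpha in {1, beta, beta^2, ...} (up to max_backtracks trials) with
    # f(x + alpha*eta) - f(x) <= c * alpha * grad_f(x)^T eta; eta must be a descent direction.
    fx, slope = f(x), float(np.dot(grad_f(x), eta))
    alpha = 1.0
    for _ in range(max_backtracks):
        if f(x + alpha * eta) - fx <= c * alpha * slope:
            break
        alpha *= beta
    return alpha

# Toy usage: one gradient-descent step for f(x) = 0.5 * ||x||^2
f = lambda x: 0.5 * float(np.dot(x, x))
grad_f = lambda x: x
x = np.array([3.0, -4.0])
eta = -grad_f(x)
alpha = armijo_step(f, grad_f, x, eta)
print(alpha, f(x + alpha * eta))   # alpha = 1.0 is accepted, new value 0.0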

21

Page 22:

Line search: Extension to manifolds

min_{x ∈ M} f(x)

Cannot use the line search x_{j+1} = x_j + α_j η_j, simply because addition is not well defined in M.

Idea: Search along a smooth curve γ(α) ∈ M with γ(0) = x_j and γ'(0) = η_j ∈ T_{x_j} M. Step in the direction x_j + α η_j ∈ x_j + T_{x_j} M and go back to the manifold via a retraction.

22

Page 23:

Retraction

Tangent bundle: TM := ⋃_{x ∈ M} T_x M.

Definition. A mapping R : TM → M is called a retraction on M if for every X_0 ∈ M there exists a neighborhood U around (X_0, 0) ∈ TM such that:
1. U ⊂ dom(R) and R|_U : U → M is smooth.
2. R(x, 0) = x for all (x, 0) ∈ U.
3. DR(x, 0)[0, ξ] = ξ for all (x, ξ) ∈ U.

We will write R_x = R(x, ·) : T_x M → M in the following.

Intuition behind the definition:
Property 2 = the retraction does nothing to elements on the manifold.
Property 3 = the retraction preserves the direction of curves. Equivalent characterization: for every tangent vector ξ ∈ T_x M, the curve γ : α ↦ R_x(αξ) satisfies γ'(0) = ξ.

EFY. What is a retraction for the manifold of invertible n × n matrices (trick question)?

23

Page 24:

Retraction

Exponential maps are the most natural choice of retraction from a theoretical point of view but are often too expensive or too cumbersome to compute.

In practice, for matrix manifolds retractions are often built from matrix decompositions and metric projections.

Example St(n, k): Given Y ∈ R^{n×k}_* (i.e., rank(Y) = k), the economy-sized QR decomposition

Y = XR,   X^T X = I_k,   R upper triangular,

is unique provided that the diagonal elements of R are positive. This defines a diffeomorphism

φ : St(n, k) × triu_+(k) → R^{n×k}_*,   φ : (X, R) ↦ XR,

where triu_+(k) denotes the set of k × k upper triangular matrices with positive diagonal elements. Applying φ^{-1} just means computing the QR decomposition. Note that

dim St(n, k) + dim triu_+(k) = dim R^{n×k}_*.
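A small sketch of the resulting QR-based retraction R_X(η) = qf(X + η) on St(n, k), where qf returns the Q factor belonging to the R factor with positive diagonal; the checks at the end mirror Properties 2 and 3 of the definition (all names are mine):

import numpy as np

def qf(Y):
    # Q factor of the economy-size QR decomposition, normalized so that diag(R) > 0
    Q, R = np.linalg.qr(Y)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s                       # flips the sign of columns where needed

def retract_qr(X, eta):
    return qf(X + eta)

rng = np.random.default_rng(4)
n, k = 9, 3
X = qf(rng.standard_normal((n, k)))    # a point on St(n, k)

# Random tangent vector at X
W = rng.standard_normal((k, k)); W = W - W.T
xi = X @ W + (np.eye(n) - X @ X.T) @ rng.standard_normal((n, k))

print(np.allclose(retract_qr(X, np.zeros_like(X)), X))        # Property 2: R_X(0) = X
t = 1e-6
print(np.linalg.norm((retract_qr(X, t * xi) - X) / t - xi))   # Property 3: small, O(t)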

24

Page 25:

Retraction

Abstract setting: Let M be an embedded submanifold of a vector space V and N a smooth manifold such that

dim(M) + dim(N) = dim(V).

Assume there is a diffeomorphism

φ : M × N → V_*,   (x, y) ↦ φ(x, y),

for some open subset V_* of V. Moreover, assume there exists a neutral element id ∈ N such that φ(x, id) = x for all x ∈ M.

Lemma. Under the above assumptions,

R_x(η) := π_1(φ^{-1}(x + η))

is a retraction on M, where π_1 is the projection onto the first component: π_1(x, y) = x.

25

Page 26:

Retraction

Proof of lemma. We need to verify the three properties of a retraction.

Property 1: Follows immediately from the assumptions; R_x(ξ) is defined and smooth for all ξ in a neighborhood of 0 ∈ T_x M.

Property 2: R_x(0) = π_1(φ^{-1}(x)) = π_1(x, id) = x.

Property 3: Differentiating x = π_1 ∘ φ^{-1}(φ(x, id)) we obtain for any ξ ∈ T_x M that

ξ = D(π_1 ∘ φ^{-1})[Dφ(x, id)[ξ, 0]] = D(π_1 ∘ φ^{-1})(x)[ξ] = DR_x(0)[ξ].

26

Page 27:

Retraction

For z ∈ V sufficiently close to M, the metric projection is well defined:

P_M(z) := arg min_{x ∈ M} ‖z − x‖.

Corollary (Lewis/Malick'2008). The map

R_x(η) := P_M(x + η)

defines a retraction.

Examples of retractions based on metric projection:
- For St(n, k), the polar factor Y(Y^T Y)^{-1/2} of Y ∈ R^{n×k}_* defines a retraction.
- For the rank-k matrix manifold M_k, the best rank-k approximation T_k defines a retraction.

There are other choices; see [Absil/Oseledets'2015: Low-rank retractions: a survey and new results].

EFY. For all examples discussed so far, develop algorithms that efficiently realize the retraction by exploiting the structure of x + η.

EFY. Find a retraction for the manifold of symmetric rank-k matrices.
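For M_k, the metric-projection retraction is a truncated SVD of X + η. A plain dense NumPy sketch (my own names; it deliberately does not exploit the low-rank-plus-tangent structure of X + η asked for in the EFY above):

import numpy as np

def retract_svd(X, eta, k):
    # Metric-projection retraction on M_k: best rank-k approximation of X + eta,
    # computed with a plain dense SVD
    U, s, Vt = np.linalg.svd(X + eta, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(5)
m, n, k = 8, 6, 2
X = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # a point on M_k
eta = 1e-1 * rng.standard_normal((m, n))                        # small step (a tangent vector in a line search)
Y = retract_svd(X, eta, k)
print(np.linalg.matrix_rank(Y))   # 2: the iterate stays on M_k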

27

Page 28:

Riemannian line search

x_{j+1} = R_{x_j}(α_j η_j).

Assumption: the sequence η_j is bounded and gradient-related:

lim sup_{j→∞} 〈grad f(x_j), η_j〉 < 0.

Canonical choice: η_j = −grad f(x_j).

Extension of the Armijo rule: Let β ∈ (0,1) and c ∈ (0,1) (e.g., c = 10^{-4}) be fixed parameters. Determine the largest α_j ∈ {1, β, β^2, β^3, ...} such that

f(R_{x_j}(α_j η_j)) − f(x_j) ≤ c α_j 〈grad f(x_j), η_j〉   (1)

holds.

EFY. Show that the Armijo condition (1) can always be satisfied for sufficiently small α_j.

28

Page 29:

Riemannian line search

1: for j = 0, 1, 2, ... do
2:   Pick η_j ∈ T_{x_j} M such that the sequence η_j is gradient-related.
3:   Choose α_j ∈ {1, β, β^2, β^3, ...} such that the Armijo condition is satisfied.
4:   Set x_{j+1} = R_{x_j}(α_j η_j).
5: end for

Convergence theory in Section 4.3 of [Absil'2008]. We call x_* ∈ M a critical point of f if grad f(x_*) = 0.

Theorem. Every accumulation point of x_j is a critical point of the cost function f.

More can be said if the manifold (or at least a level set) is compact.

Corollary. Assume that L = {x ∈ M : f(x) ≤ f(x_0)} is compact. Then ‖grad f(x_j)‖ → 0 as j → ∞.

Note that M_k is not compact and it is not clear a priori whether L is compact.
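Putting the pieces together, here is a hedged sketch of the Riemannian Armijo line search on the Stiefel manifold for the trace example of page 18, using the gradient formula from page 18 and the QR retraction from page 24 (all function names and parameters are mine; a toy implementation, not the lecture's code):

import numpy as np

def qf(Y):
    # Same helper as in the QR-retraction sketch above
    Q, R = np.linalg.qr(Y)
    s = np.sign(np.diag(R)); s[s == 0] = 1.0
    return Q * s

def riemannian_gd_stiefel(A, X0, beta=0.5, c=1e-4, maxit=500, tol=1e-8):
    # Armijo line search with eta_j = -grad f(X_j) and the QR retraction
    # for f(X) = trace(X^T A X) on St(n, k)
    f = lambda X: np.trace(X.T @ A @ X)
    X = X0
    for _ in range(maxit):
        grad = 2 * (A @ X - X @ (X.T @ A @ X))
        if np.linalg.norm(grad) < tol:
            break
        eta = -grad
        alpha, fX, slope = 1.0, f(X), np.sum(grad * eta)
        while f(qf(X + alpha * eta)) - fX > c * alpha * slope and alpha > 1e-12:
            alpha *= beta                       # Armijo backtracking
        X = qf(X + alpha * eta)                 # retract back to St(n, k)
    return X

rng = np.random.default_rng(6)
n, k = 20, 3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
X = riemannian_gd_stiefel(A, qf(rng.standard_normal((n, k))))
# Generically the iterates approach the invariant subspace of the k smallest eigenvalues
print(np.sort(np.linalg.eigvalsh(X.T @ A @ X)))
print(np.sort(np.linalg.eigvalsh(A))[:k])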

29

Page 30:

Application to low-rank matrix and tensor completion

30

Page 31:

Matrix Completion

[Illustration: given the sampled entries P_Ω A, recover the full matrix A?]

P_Ω : R^{m×n} → R^{m×n},   (P_Ω X)_{ij} = X_{ij} if (i, j) ∈ Ω, and 0 else.
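In code, P_Ω is simply an entrywise mask. A tiny NumPy sketch (encoding Ω as a boolean array is my choice, not prescribed by the slides):

import numpy as np

def P_Omega(X, mask):
    # Sampling operator: keeps X[i, j] for (i, j) in Omega, zeros elsewhere;
    # Omega is encoded as a boolean array of the same shape as X
    return np.where(mask, X, 0.0)

rng = np.random.default_rng(8)
A = rng.standard_normal((5, 4))
mask = rng.random((5, 4)) < 0.3          # roughly 30% observed entries
print(P_Omega(A, mask))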

Applications: image reconstruction, image inpainting, the Netflix problem.

Low-rank matrix completion:

min_X rank(X),   X ∈ R^{m×n}
subject to P_Ω X = P_Ω A

31

Page 32:

Low-rank matrix completion (NP-hard):

min_X rank(X),   X ∈ R^{m×n}
subject to P_Ω X = P_Ω A

Nuclear norm relaxation (convex, but expensive):

min_X ‖X‖_* = Σ_i σ_i,   X ∈ R^{m×n}
subject to P_Ω X = P_Ω A

Robust low-rank completion (assume the rank is known):

min_X (1/2) ‖P_Ω X − P_Ω A‖_F^2,   X ∈ R^{m×n}
subject to rank(X) = k

Huge body of work! Overview: http://perception.csl.illinois.edu/matrix-rank/

32

Page 33:

Setting

minimize_X f(X) := (1/2) ‖P_Ω(X − A)‖_F^2
subject to X ∈ M_k := {X ∈ R^{m×n} : rank(X) = k}

P_Ω : R^{m×n} → R^{m×n},   X_{ij} ↦ X_{ij} if (i, j) ∈ Ω, 0 if (i, j) ∉ Ω.

The Riemannian gradient is given by

grad f(X) = P_{T_X M_k}(P_Ω(X − A))

with the orthogonal projection P_{T_X M_k} : R^{m×n} → T_X M_k.
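A sketch of this gradient computation in NumPy for X stored in factored form X = U diag(s) V^T (my own names; for clarity the residual is formed densely here, whereas an implementation with the O(|Ω|k) cost stated on the next page would evaluate X only at the entries in Ω):

import numpy as np

def completion_grad(U, s, Vt, A, mask):
    # Riemannian gradient of f(X) = 0.5*||P_Omega(X - A)||_F^2 at X = U diag(s) Vt:
    # tangent-space projection (page 19) of the Euclidean gradient P_Omega(X - A)
    X = (U * s) @ Vt
    R = np.where(mask, X - A, 0.0)           # Euclidean gradient, nonzero only on Omega
    V = Vt.T
    M = U.T @ R @ V                          # k x k coefficient block
    Up = R @ V - U @ M                       # P_U^perp R V
    Vp = R.T @ U - V @ M.T                   # P_V^perp R^T U
    return U @ M @ Vt + Up @ Vt + U @ Vp.T   # element of T_X M_k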

33

Page 34:

Geometric nonlinear CG for matrix completion

Input: Initial guess X_0 ∈ M_k.
  η_0 ← −grad f(X_0)
  α_0 ← argmin_α f(X_0 + α η_0)
  X_1 ← R_{X_0}(α_0 η_0)
  for i = 1, 2, ... do
    Compute gradient: ξ_i ← grad f(X_i)
    Conjugate direction by PR+ updating rule: η_i ← −ξ_i + β_i T_{X_{i−1}→X_i} η_{i−1}
    Initial step size from linearized line search: α_i ← argmin_α f(X_i + α η_i)
    Armijo backtracking for sufficient decrease: find the smallest integer m ≥ 0 such that
      f(X_i) − f(R_{X_i}(2^{−m} α_i η_i)) ≥ −1·10^{−4} 〈ξ_i, 2^{−m} α_i η_i〉
    Obtain next iterate: X_{i+1} ← R_{X_i}(2^{−m} α_i η_i)
  end for

Cost/iteration: O((m + n)k^2 + |Ω|k) ops.

34

Page 35:

Vector transport

The conjugate gradient method requires combining gradients from subsequent iterates:

grad f(X) ∈ T_X M_k,   grad f(Y) ∈ T_Y M_k   ⇒   grad f(X) + grad f(Y) ???

Can be addressed by vector transport:

T_{X→Y} : T_X M_k → T_Y M_k,   T_{X→Y}(ξ) = P_{T_Y M_k}(ξ).

[Illustration: a tangent vector ξ ∈ T_X M_k is transported to T_{X→Y}(ξ) ∈ T_Y M_k on M_k.]

Can be implemented in O((m + n)k^2) ops.

35

Page 36:

Numerical experiments

- Comparison to LMAFit [Wen/Yin/Zhang'2010]. http://lmafit.blogs.rice.edu/
- Oversampling factor OS = |Ω|/(k(2n − k)).
- Purely academic example A = A_L A_R^T with A_L, A_R = randn.

36

Page 37:

Influence of n

[Plot: relative residual vs. iteration. Convergence curves for k = 40, OS = 3, with n = 1000, 2000, 4000, 8000, 16000, 32000.]

- Dashed lines: LMAFit. Solid lines: nonlinear CG.
- time(1 iteration of nonlinear CG) ≈ 2 × time(1 iteration of LMAFit)

37

Page 38:

Influence of rank

[Plot: relative residual vs. iteration. Convergence curves for n = 8000, OS = 3, with k = 10, 20, 30, 40, 50, 60.]

- Dashed lines: LMAFit. Solid lines: nonlinear CG.
- time(1 iteration of nonlinear CG) ≈ 2 × time(1 iteration of LMAFit)

38

Page 39:

Numerical experiments

- Comparison to LMAFit [Wen/Yin/Zhang'2010]. http://lmafit.blogs.rice.edu/
- Oversampling factor OS = |Ω|/(k(2n − k)) = 8.
- The 8000 × 8000 matrix A is obtained from evaluating f(x, y) = 1 / (1 + |x − y|^2) on [0,1] × [0,1].

39

Page 40:

Influence of rank

- Hom: Start with k = 1 and subsequently increase k, using the previous result as initial guess.

40

Page 41:

Tensor Completion

Low-rank tensor completion:

min_X rank(X),   X ∈ R^{n_1×n_2×...×n_d}
subject to P_Ω X = P_Ω A

Applications:
- Completion of multidimensional data, e.g., hyperspectral images, CT scans
- Compression of multivariate functions with singularities
- ...

41

Page 42:

Manifold of Tensors of fixed multilinear rank

M_k := {X ∈ R^{n_1×...×n_d} : rank(X) = k},

dim(M_k) = ∏_{j=1}^d k_j + Σ_{i=1}^d (k_i n_i − k_i^2).

- M_k is a smooth manifold. Discussed for more general formats in [Holtz/Rohwedder/Schneider'2012], [Uschmajew/Vandereycken'2012].
- Riemannian with the metric induced by the standard inner product 〈X, Y〉 = 〈X_(1), Y_(1)〉 (sum of element-wise products).

Manifold structure used in
- dynamical low-rank approximation [Koch/Lubich'2010], [Arnold/Jahnke'2012], [Lubich/Rohwedder/Schneider/Vandereycken'2012], [Khoromskij/Oseledets/Schneider'2012], ...
- best multilinear approximation [Eldén/Savas'2009], [Ishteva/Absil/Van Huffel/De Lathauwer'2011], [Curtef/Dirr/Helmke'2012]

42

Page 43:

Gradients and tangent space T_X M_k

Every ξ in the tangent space T_X M_k at X = C ×_1 U ×_2 V ×_3 W can be written as

ξ = S ×_1 U ×_2 V ×_3 W + C ×_1 U_⊥ ×_2 V ×_3 W + C ×_1 U ×_2 V_⊥ ×_3 W + C ×_1 U ×_2 V ×_3 W_⊥

for some S ∈ R^{k_1×k_2×k_3}, U_⊥ ∈ R^{n_1×k_1} with U_⊥^T U = 0, ...

Again, we obtain the Riemannian gradient of the objective function

f(X) := (1/2) ‖P_Ω X − P_Ω A‖_F^2

by projecting the Euclidean gradient onto the tangent space:

grad f(X) = P_{T_X M_k}(P_Ω X − P_Ω A).

43

Page 44:

Retraction

Candidate for retraction: metric projection

R_X(ξ) = P_{M_k}(X + ξ) = arg min_{Z ∈ M_k} ‖X + ξ − Z‖.

No closed-form solution available.
- Replaced by HOSVD truncation.
- Seems to work fine.
- HOSVD truncation is a retraction [K./Steinlechner/Vandereycken'13].
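A minimal dense sketch of the HOSVD truncation used in place of the metric projection (my own helper names and test sizes; a real implementation would exploit the structure of X + ξ):

import numpy as np

def unfold(T, mode):
    # Mode-n matricization of a 3rd-order tensor
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_truncate(T, ranks):
    # Truncate a 3rd-order tensor to multilinear rank `ranks` via the HOSVD
    U = [np.linalg.svd(unfold(T, i), full_matrices=False)[0][:, :ranks[i]]
         for i in range(3)]
    # Core C = T x_1 U1^T x_2 U2^T x_3 U3^T, then map back: C x_1 U1 x_2 U2 x_3 U3
    C = np.einsum('abc,ai,bj,ck->ijk', T, U[0], U[1], U[2])
    return np.einsum('ijk,ai,bj,ck->abc', C, U[0], U[1], U[2])

rng = np.random.default_rng(7)
ranks = (2, 3, 2)
# A tensor of multilinear rank (2, 3, 2) plus a small perturbation
C0 = rng.standard_normal(ranks)
A = np.einsum('ijk,ai,bj,ck->abc', C0,
              rng.standard_normal((10, 2)), rng.standard_normal((11, 3)),
              rng.standard_normal((12, 2)))
X = hosvd_truncate(A + 1e-3 * rng.standard_normal(A.shape), ranks)
print(np.linalg.norm(X - A) / np.linalg.norm(A))   # small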

44

Page 45:

Geometric Nonlinear CG for Tensor Completion

Input: Initial guess X_0 ∈ M_k.
  η_0 ← −grad f(X_0)
  α_0 ← argmin_α f(X_0 + α η_0)
  X_1 ← R_{X_0}(α_0 η_0)
  for i = 1, 2, ... do
    Compute gradient: ξ_i ← grad f(X_i)
    Conjugate direction by PR+ updating rule: η_i ← −ξ_i + β_i T_{X_{i−1}→X_i} η_{i−1}
    Initial step size from linearized line search: α_i ← argmin_α f(X_i + α η_i)
    Armijo backtracking for sufficient decrease: find the smallest integer m ≥ 0 such that
      f(X_i) − f(R_{X_i}(2^{−m} α_i η_i)) ≥ −1·10^{−4} 〈ξ_i, 2^{−m} α_i η_i〉
    Obtain next iterate: X_{i+1} ← R_{X_i}(2^{−m} α_i η_i)
  end for

Cost/iteration: O(n k^d + |Ω| k^{d−1}) ops.

45

Page 46:

Reconstruction of CT scan

199 × 199 × 150 tensor from the scaled CT data set "INCISIX" (taken from the OSIRIX MRI/CT database [www.osirix-viewer.com/datasets/]).

[Figure panels: slice of original tensor; HOSVD approximation of rank 21; sampled tensor (6.7%); low-rank completion of rank 21.]

Compares very well with existing results w.r.t. low-rank recovery and speed, e.g., [Gandy/Recht/Yamada'2011].

46

Page 47:

Hyperspectral image

Set of photographs (204 × 268 px) taken across a large range of wavelengths; 33 samples from ultraviolet to infrared [image data: Foster et al.'2004]. Stacked into a tensor of size 204 × 268 × 33.

[Figure panels: 10% of the original hyperspectral image tensor, 16th slice (tensor size [204, 268, 33]); completed tensor, 16th slice (final rank k = [50 50 6]).]

Here: only 10% of the entries are known; [Signoretti et al.'2011] use 50%.

47

Page 48:

How many samples do we need?

Matrix case: O(n · log^β n) samples suffice! [Candès/Tao'2009] ⇒ Completion of a tensor by applying matrix completion to a matricization: O(n^2 log(n)). Gives an upper bound!

Tensor case: Certainly |Ω| ≪ O(n^2). In all cases of convergence: exact reconstruction.

Conjecture: |Ω| = O(n · log^β n)

[Plot: log(smallest size of Ω needed to converge) vs. log(n), with linear fit y = 1.2·x + 4 and reference lines O(n^2), O(n log n), O(n).]

48

Page 49:

Robustness of Convergence

[Plot: relative error on Ω and noise levels vs. iteration, for noisy completion with n = 100, k = [4, 5, 6].]

- Random 100 × 100 × 100 tensor of multilinear rank (4, 5, 6) perturbed by white noise.
- Upon convergence, reconstruction up to the noise level.

49

Page 50:

Final remarks on Riemannian low-rank optimization

- Only discussed first-order methods. Fine for well-conditioned problems but slow convergence for ill-conditioned problems.
- Second-order methods (Newton-like) require the Riemannian Hessian: painful and:
  - not of much help for well-conditioned problems (low-rank matrix completion);
  - linearized equations hard to solve efficiently for low-rank matrix and tensor manifolds.
- Low-rank matrices/tensors can also be viewed as products of quotient manifolds. Requires a careful choice of metric to stay robust w.r.t. small singular values σ_k [Ngo/Saad'2012], [Kasai/Mishra, ICML'2016].
- Lots of open problems concerning the convergence analysis of low-rank Riemannian optimization!

50