
Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration

Jason Altschuler, Jonathan Weed, Philippe Rigollet

Our contributions

Simple and practical algorithm to approximate the OT distance between distributions on $n$ points in $\tilde{O}(n^2 \varepsilon^{-4})$ time.

Based on matrix scaling, an approach pioneered by [Cuturi 2013]

We give:

- New analysis of a 50-year-old algorithm (Sinkhorn scaling)
- New algorithm with much better performance in practice (Greenkhorn scaling)

Same near-linear time convergence guarantee!

Optimal transport

[Figure: example transport between distributions; images courtesy M. Cuturi]

Statistical primitive with many applications in machine learning and optimization

Given:
  $C \in \mathbb{R}^{n \times n}_+$ : cost matrix
  $r, c \in \Delta_n$ : probability distributions

OT distance: $\min \langle C, P \rangle$ s.t. $P \in U_{r,c} := \{ P \in \mathbb{R}^{n \times n}_+ : P\mathbf{1} = r,\ P^\top \mathbf{1} = c \}$ (the transport polytope)

Goal: find $\hat{P} \in U_{r,c}$ satisfying $\langle C, \hat{P} \rangle \le \min_{P \in U_{r,c}} \langle C, P \rangle + \varepsilon$
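Since the OT distance is a linear program, small instances can be solved exactly and used as ground truth when checking an $\varepsilon$-approximation. A minimal sketch (not from the poster; the name exact_ot is chosen here for illustration) using scipy.optimize.linprog:

```python
# Exact OT on small instances via linear programming (dense, O(n^2) variables).
import numpy as np
from scipy.optimize import linprog

def exact_ot(C, r, c):
    """Solve min <C, P> over the transport polytope U_{r,c} exactly."""
    n = C.shape[0]
    # Flatten P row-major; encode P @ 1 = r and P.T @ 1 = c as equalities.
    A_rows = np.kron(np.eye(n), np.ones(n))   # row-sum constraints
    A_cols = np.kron(np.ones(n), np.eye(n))   # column-sum constraints
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([r, c])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun
```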

Algorithm

1. Approximately solve the program with an entropic penalty:
   $\min_{P \in U_{r,c}} \langle C, P \rangle - \eta^{-1} H(P)$   (✽)
2. Round the approximate solution to $U_{r,c}$ (see paper!); a sketch of this rounding step appears below.

Step 1 is alternating Bregman projection onto polytopes (Sinkhorn scaling) or greedy Bregman projection onto simpler polytopes (Greenkhorn scaling).
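A minimal NumPy sketch of the rounding step, following Algorithm 2 of the paper (the name round_to_polytope is chosen here for illustration):

```python
# Round a nonnegative matrix F into U_{r,c}: shrink overfull rows and columns,
# then add a rank-one correction to restore the missing mass.
import numpy as np

def round_to_polytope(F, r, c):
    """Round F to a member of U_{r,c} (Algorithm 2 in the paper)."""
    F = F * np.minimum(r / F.sum(axis=1), 1.0)[:, None]  # shrink overfull rows
    F = F * np.minimum(c / F.sum(axis=0), 1.0)[None, :]  # shrink overfull columns
    err_r = r - F.sum(axis=1)   # remaining row deficit (entrywise >= 0)
    err_c = c - F.sum(axis=0)   # remaining column deficit (entrywise >= 0)
    if err_r.sum() > 0:
        # sum(err_r) == sum(err_c), so this lands exactly in U_{r,c}.
        F = F + np.outer(err_r, err_c) / err_r.sum()
    return F
```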


Matrix scaling

[Figure: diagonal rescaling $XAY$ of $A$ to prescribed row/column sums, yielding $\Pi_{\mathcal{S}}(A)$]

Matrix primitive with many applications in theoretical computer science and numerical linear algebra

Given:
  $A \in \mathbb{R}^{n \times n}_+$ : matrix
  $r, c \in \mathbb{R}^n_+$ : desired row/column sums

Goal: find positive diagonal matrices $X, Y$ such that
  $XAY\mathbf{1} = r, \qquad (XAY)^\top \mathbf{1} = c$

$\Pi_{\mathcal{S}}(A) := XAY$ is the Sinkhorn projection.

[Sinkhorn 1967] showed that it can be computed by alternating rescalings of rows and columns, but did not give effective bounds on the rate of convergence.
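A minimal NumPy sketch of this alternating rescaling, not the authors' reference code; the stopping rule uses the $\ell_1$ marginal error that appears in the analysis below, and the defaults are illustrative:

```python
# Sinkhorn scaling: alternately rescale all rows, then all columns.
import numpy as np

def sinkhorn(A, r, c, eps=1e-3, max_iter=10000):
    """Approximate the Sinkhorn projection of A onto U_{r,c}."""
    x = np.ones(A.shape[0])                  # diagonal of X
    y = np.ones(A.shape[1])                  # diagonal of Y
    for _ in range(max_iter):
        x = r / (A @ y)                      # make row sums of XAY equal r
        y = c / (A.T @ x)                    # make column sums of XAY equal c
        P = x[:, None] * A * y[None, :]      # current scaling XAY
        err = np.abs(P.sum(1) - r).sum() + np.abs(P.sum(0) - c).sum()
        if err <= eps:                       # l1 error on both marginals
            break
    return P
```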

Sinkhorn scaling

Potential-based analysis (based on the dual program). The nonnegative potential is
  $f(A) = K(r \,\|\, A\mathbf{1}) + K(c \,\|\, A^\top\mathbf{1})$,
where $K$ denotes Kullback-Leibler divergence; it starts at $\tilde{O}(1)$ and is driven toward $0$.

At each step, rescale all rows or all columns [cost per iteration: $O(n^2)$ time].

While $K(r \,\|\, A\mathbf{1}) + K(c \,\|\, A^\top\mathbf{1}) \ge \varepsilon^2$, the potential decreases at each step by $\ge \varepsilon^2$ (large improvement), so at most $\tilde{O}(\varepsilon^{-2})$ such steps can occur.

After $\tilde{O}(\varepsilon^{-2})$ iterations (number of iterations):
  $\|r - A\mathbf{1}\|_1 + \|c - A^\top\mathbf{1}\|_1 \lesssim \sqrt{K(r \,\|\, A\mathbf{1}) + K(c \,\|\, A^\top\mathbf{1})} \le \varepsilon$   (Pinsker)

Total runtime: $O(n^2)$ cost per iteration $\cdot\ \tilde{O}(\varepsilon^{-2})$ iterations $= \tilde{O}(n^2\varepsilon^{-2})$.

Greenkhorn scaling

Potential-based analysis (based on the dual program). The nonnegative potential is
  $f(A) = \rho(r \,\|\, A\mathbf{1}) + \rho(c \,\|\, A^\top\mathbf{1})$,
where $\rho$ is an appropriate generalization of $K$; it starts at $\tilde{O}(1)$ and is driven toward $0$.

At each step, rescale the worst row or column [cost per iteration: $O(n)$ time].

While $\rho(r \,\|\, A\mathbf{1}) + \rho(c \,\|\, A^\top\mathbf{1}) \ge \varepsilon^2$, the potential decreases at each step by $\ge \frac{1}{2n}\bigl[\rho(r \,\|\, A\mathbf{1}) + \rho(c \,\|\, A^\top\mathbf{1})\bigr] \ge \varepsilon^2/n$ (large improvement), so at most $\tilde{O}(n\varepsilon^{-2})$ such steps can occur.

After $\tilde{O}(n\varepsilon^{-2})$ iterations (number of iterations):
  $\|r - A\mathbf{1}\|_1 + \|c - A^\top\mathbf{1}\|_1 \lesssim \sqrt{\rho(r \,\|\, A\mathbf{1}) + \rho(c \,\|\, A^\top\mathbf{1})} \le \varepsilon$   (variant of Pinsker)

Total runtime: $O(n)$ cost per iteration $\cdot\ \tilde{O}(n\varepsilon^{-2})$ iterations $= \tilde{O}(n^2\varepsilon^{-2})$.
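A minimal NumPy sketch of Greenkhorn, under the assumption that $\rho(a \,\|\, b) = \sum_i \bigl(b_i - a_i + a_i \log(a_i/b_i)\bigr)$, the generalized KL divergence used in the paper's analysis; a linear scan selects the worst row or column, so each iteration costs $O(n)$:

```python
# Greenkhorn scaling: greedily rescale only the single worst row or column.
import numpy as np

def rho(a, b):
    """Entrywise rho(a,b) = b - a + a*log(a/b), with the 0*log(0) = 0 convention."""
    with np.errstate(divide="ignore", invalid="ignore"):
        logterm = np.where(a > 0, a * np.log(a / b), 0.0)
    return b - a + logterm

def greenkhorn(A, r, c, n_iter=10000):
    A = A.copy()
    row_sums, col_sums = A.sum(axis=1), A.sum(axis=0)
    for _ in range(n_iter):
        row_gains, col_gains = rho(r, row_sums), rho(c, col_sums)
        i, j = np.argmax(row_gains), np.argmax(col_gains)
        if row_gains[i] >= col_gains[j]:
            scale = r[i] / row_sums[i]            # fix the worst row...
            col_sums += (scale - 1.0) * A[i, :]   # O(n) update of column sums
            A[i, :] *= scale
            row_sums[i] = r[i]
        else:
            scale = c[j] / col_sums[j]            # ...or the worst column
            row_sums += (scale - 1.0) * A[:, j]   # O(n) update of row sums
            A[:, j] *= scale
            col_sums[j] = c[j]
    return A
```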

From entropy to scaling

Theorem [Cuturi 2013]: The penalized program (✽) has a unique solution:
  $\operatorname{argmin}_{P \in U_{r,c}} \langle C, P \rangle - \eta^{-1} H(P) = \Pi_{\mathcal{S}}(A)$, where $A = \exp(-\eta C)$ (entrywise)
and $\Pi_{\mathcal{S}}(\cdot)$ is the Sinkhorn (Bregman) projection onto $U_{r,c}$.

With $\eta \approx \varepsilon^{-1} \log n$, penalized OT reduces to matrix scaling.
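Wiring the reduction together, a sketch reusing the illustrative sinkhorn and round_to_polytope helpers above; the exact choice of $\eta$ and the scaling accuracy are tuned against $\|C\|_\infty$ in the paper, so the constants here are placeholders:

```python
# End-to-end approximate OT: entropic penalty -> matrix scaling -> rounding.
import numpy as np

def approx_ot(C, r, c, eps):
    n = C.shape[0]
    eta = np.log(n) / eps            # eta ~ eps^{-1} log n, per the theorem
    A = np.exp(-eta * C)             # entrywise exponential of the cost
    # (In practice one works in the log domain to avoid underflow for large eta.)
    B = sinkhorn(A, r, c, eps=eps)   # approximate Sinkhorn projection onto U_{r,c}
    P_hat = round_to_polytope(B, r, c)   # step 2: round into U_{r,c}
    return np.sum(C * P_hat)         # <C, P_hat>
```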

Empirical results

[Figures not recoverable from this transcript.]

References

J. Altschuler, J. Weed, P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30, 2017.

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, 2013.

R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 1967.


