
Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm

Giulia Luise¹, Saverio Salzo², Massimiliano Pontil¹,², Carlo Ciliberto³

¹ Department of Computer Science, University College London, UK
² CSML, Istituto Italiano di Tecnologia, Genova, Italy
³ Department of Electrical and Electronic Engineering, Imperial College London, UK


Outline

1. Introduction: Goal and Contributions

2. Setting and problem statement

3. Approach

4. Convergence analysis

5. Experiments


Introduction: Goal and Contributions

Goal and contributions

We propose a novel method to compute the barycenter of a set of probability distributions with respect to the Sinkhorn divergence that:

• does not fix the support beforehand

• handles both discrete and continuous measures

• admits a convergence analysis.


Introduction: Goal and Contributions

Goal and contributions

Our analysis hinges on the following contributions:

• We show that the gradient of the Sinkhorn divergence is Lipschitz continuous on the space of probability measures with respect to the Total Variation norm.

• We characterize the sample complexity of an empirical estimator approximating the Sinkhorn gradients.

• A byproduct of our analysis is the generalization of the Frank-Wolfe algorithm to settings where the objective functional is defined only on a set with empty interior, which is the case for the Sinkhorn divergence barycenter problem.


Setting and problem statement

Setting and Notation

$\mathcal{X} \subset \mathbb{R}^d$ is a compact set.

$c : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric cost function, e.g. $c(x, y) = \|x - y\|_2^2$.

$\mathcal{M}_1^+(\mathcal{X})$ is the space of probability measures on $\mathcal{X}$.

$\mathcal{M}(\mathcal{X})$ is the Banach space of finite signed measures on $\mathcal{X}$.


Setting and problem statement

Entropic Regularized Optimal Transport

For any $\alpha, \beta \in \mathcal{M}_1^+(\mathcal{X})$, the Optimal Transport problem with entropic regularization is defined as follows:

$$\mathrm{OT}_\varepsilon(\alpha, \beta) = \min_{\pi \in \Pi(\alpha, \beta)} \int_{\mathcal{X}^2} c(x, y)\, d\pi(x, y) + \varepsilon\, \mathrm{KL}(\pi \,|\, \alpha \otimes \beta), \qquad \varepsilon \ge 0, \qquad (1)$$

where:

$\mathrm{KL}(\pi \,|\, \alpha \otimes \beta)$ is the Kullback-Leibler divergence between the transport plan $\pi$ and the product distribution $\alpha \otimes \beta$;

$\Pi(\alpha, \beta) = \{\pi \in \mathcal{M}_1^+(\mathcal{X}^2) : P_{1\#}\pi = \alpha,\ P_{2\#}\pi = \beta\}$ is the transport polytope (with $P_i : \mathcal{X} \times \mathcal{X} \to \mathcal{X}$ the projection onto the $i$-th component and $\#$ the push-forward).
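For discrete measures, (1) can be evaluated with the classical Sinkhorn-Knopp fixed-point iterations on the Gibbs kernel. Below is a minimal numpy sketch (our own illustration, not the authors' code); the name `ot_eps` is hypothetical, and the log-domain stabilization needed for small $\varepsilon$ is omitted.

```python
import numpy as np

def ot_eps(a, b, C, eps, iters=1000):
    """Entropic OT (1) between discrete measures with weight vectors a, b
    and cost matrix C, via Sinkhorn-Knopp iterations (a sketch)."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    v = np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)                  # enforce first marginal
        v = b / (K.T @ u)                # enforce second marginal
    pi = u[:, None] * K * v[None, :]     # primal plan in Pi(a, b)
    mask = pi > 0                        # guard log(0) from underflow
    kl = np.sum(pi[mask] * np.log(pi[mask] / np.outer(a, b)[mask]))
    return np.sum(pi * C) + eps * kl     # value of the objective in (1)
```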


Setting and problem statement

Sinkhorn Divergences

To remove the bias induced by the KL term, [Genevay et al., 2018] proposed to subtract the autocorrelation terms $\frac{1}{2}\mathrm{OT}_\varepsilon(\alpha, \alpha)$ and $\frac{1}{2}\mathrm{OT}_\varepsilon(\beta, \beta)$ from $\mathrm{OT}_\varepsilon(\alpha, \beta)$ in order to get a divergence

$$S_\varepsilon(\alpha, \beta) = \mathrm{OT}_\varepsilon(\alpha, \beta) - \tfrac{1}{2}\mathrm{OT}_\varepsilon(\alpha, \alpha) - \tfrac{1}{2}\mathrm{OT}_\varepsilon(\beta, \beta), \qquad (2)$$

which is nonnegative, convex, and metrizes the weak convergence (see [Feydy et al., 2019]).
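In code, (2) is a direct combination of three entropic OT values; a sketch reusing the hypothetical `ot_eps` above, with precomputed cost matrices:

```python
def s_eps(a, b, C_ab, C_aa, C_bb, eps):
    # eq. (2): debiased Sinkhorn divergence from three OT_eps evaluations
    return (ot_eps(a, b, C_ab, eps)
            - 0.5 * ot_eps(a, a, C_aa, eps)
            - 0.5 * ot_eps(b, b, C_bb, eps))
```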

In the following we study the barycenter problem with respect to this Sinkhorn divergence.


Setting and problem statement

Barycenter Problem

Barycenters of probabilities are useful in a range of applications, such as texture mixing, Bayesian inference, and imaging.

The barycenter problem w.r.t. the Sinkhorn divergence is formulated as follows: given input measures $\beta_1, \ldots, \beta_m \in \mathcal{M}_1^+(\mathcal{X})$ and weights $\omega_1, \ldots, \omega_m \ge 0$ such that $\sum_{j=1}^m \omega_j = 1$, solve

$$\min_{\alpha \in \mathcal{M}_1^+(\mathcal{X})} B_\varepsilon(\alpha), \quad \text{with} \quad B_\varepsilon(\alpha) = \sum_{j=1}^m \omega_j\, S_\varepsilon(\alpha, \beta_j). \qquad (3)$$
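For discrete candidates, the objective in (3) is just a weighted sum of divergences; a sketch continuing the hypothetical helpers above, where `inputs` packs each $\beta_j$'s weight vector and cost matrices:

```python
def b_eps(a, C_aa, inputs, omegas, eps):
    # eq. (3): B_eps(alpha) = sum_j omega_j * S_eps(alpha, beta_j)
    return sum(w * s_eps(a, b, C_ab, C_aa, C_bb, eps)
               for w, (b, C_ab, C_bb) in zip(omegas, inputs))
```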


Setting and problem statement

Approach: Frank-Wolfe algorithm

Classic methods to approach the barycenter problem:

1. fix the support of the barycenter beforehand and optimize the weights only (convergence analysis available)

OR

2. alternately optimize over weights and support points (no convergence guarantees)

Our approach via Frank-Wolfe:

− It iteratively populates the target barycenter, one point at a time;

− It does not require the support to be fixed beforehand;

− There is no hyperparameter tuning.


Approach

Frank-Wolfe Algorithm on Banach spaces

Let $\mathcal{W}$ be a Banach space, $\mathcal{W}^*$ its topological dual, and $D \subset \mathcal{W}^*$ a nonempty, convex, closed, bounded set. Let $G : D \to \mathbb{R}$ be convex, with some smoothness properties.

Theorem

Suppose in addition that $\nabla G$ is $L$-Lipschitz continuous with $L > 0$. Let $(w_k)_{k \in \mathbb{N}}$ be obtained according to Alg. 1. Then, for every integer $k \ge 1$,

$$G(w_k) - \min G \le \frac{2}{k + 2}\, L\, (\mathrm{diam}\, D)^2 + \Delta_k, \qquad (4)$$

where $\Delta_k$ accounts for the inexactness of the linear minimization steps.
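For reference, a generic sketch of the Frank-Wolfe iteration behind this bound (the names `grad` and `lmo` are ours); `lmo` stands for the linear minimization oracle over $D$, whose inexactness would enter through $\Delta_k$:

```python
def frank_wolfe(grad, lmo, w0, n_steps):
    """Generic Frank-Wolfe sketch: linearize G at w, let the oracle pick
    s = argmin_{s in D} <grad G(w), s>, then step toward s."""
    w = w0
    for k in range(n_steps):
        s = lmo(grad(w))
        gamma = 2.0 / (k + 2.0)   # standard step size
        w = w + gamma * (s - w)   # convex combination keeps w in D
    return w
```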


Approach

Can Frank-Wolfe be applied?

Optimization domain. $\mathcal{M}_1^+(\mathcal{X})$ is convex, closed, and bounded in the Banach space $\mathcal{M}(\mathcal{X})$. ✓

Objective functional. The objective functional $B_\varepsilon$ is convex, since it is a convex combination of $S_\varepsilon(\cdot, \beta_j)$, $j = 1, \ldots, m$. ✓

Lipschitz continuity of the gradient. This is the most critical condition.


Approach

Lipschitz continuity of Sinkhorn potentials

This is one of the main contributions of the paper.

Theorem

The gradient $\nabla S_\varepsilon$ is Lipschitz continuous, i.e. for all $\alpha, \alpha', \beta, \beta' \in \mathcal{M}_1^+(\mathcal{X})$,

$$\|\nabla S_\varepsilon(\alpha, \beta) - \nabla S_\varepsilon(\alpha', \beta')\|_\infty \lesssim \|\alpha - \alpha'\|_{TV} + \|\beta - \beta'\|_{TV}. \qquad (5)$$

It follows that $\nabla B_\varepsilon$ is also Lipschitz continuous, and hence our framework is suitable for applying the FW algorithm.


Approach

How the algorithm works - I

The inner step in the FW algorithm amounts to:

$$\mu_{k+1} \in \operatorname*{argmin}_{\mu \in \mathcal{M}_1^+(\mathcal{X})} \sum_{j=1}^m \omega_j\, \langle \nabla S_\varepsilon(\cdot, \beta_j)(\alpha_k),\, \mu \rangle. \qquad (6)$$

Note that:

• by the Bauer maximum principle, solutions of (6) are achieved at the extreme points of the optimization domain;

• the extreme points of $\mathcal{M}_1^+(\mathcal{X})$ are Dirac deltas.

Hence (6) is equivalent to

$$\mu_{k+1} = \delta_{x_{k+1}} \quad \text{with} \quad x_{k+1} \in \operatorname*{argmin}_{x \in \mathcal{X}} \sum_{j=1}^m \omega_j\, \nabla S_\varepsilon(\cdot, \beta_j)(\alpha_k)(x). \qquad (7)$$
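A minimal numpy sketch of how (7) could be solved over a finite candidate grid, assuming the squared-Euclidean cost and the expression of the gradient through Sinkhorn potentials, $\nabla S_\varepsilon(\cdot, \beta_j)(\alpha_k) = f^{\alpha_k \beta_j} - f^{\alpha_k \alpha_k}$ (cf. [Feydy et al., 2019]); all function names are hypothetical, and the symmetric term is handled in a simplified way:

```python
import numpy as np
from scipy.special import logsumexp

def pairwise_cost(X, Y):
    # c(x, y) = ||x - y||_2^2 for all pairs of rows of X and Y
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def sinkhorn_potentials(a, X, b, Y, eps, iters=500):
    # dual potentials (f, g) of OT_eps between discrete measures (log domain)
    C = pairwise_cost(X, Y)
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(iters):
        f = -eps * logsumexp((g[None, :] - C) / eps, b=b[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps, b=a[:, None], axis=0)
    return f, g

def extrapolate(g, b, Y, grid, eps):
    # evaluate the potential at arbitrary new points via the softmin formula
    C = pairwise_cost(grid, Y)
    return -eps * logsumexp((g[None, :] - C) / eps, b=b[None, :], axis=1)

def select_support_point(a, X, betas, omegas, grid, eps):
    # eq. (7): grid point minimizing the weighted sum of gradient potentials
    f_s, g_s = sinkhorn_potentials(a, X, a, X, eps)
    p = 0.5 * (f_s + g_s)                 # symmetric potential of OT(alpha, alpha)
    f_aa = extrapolate(p, a, X, grid, eps)
    total = np.zeros(len(grid))
    for w, (b, Y) in zip(omegas, betas):
        _, g = sinkhorn_potentials(a, X, b, Y, eps)
        total += w * (extrapolate(g, b, Y, grid, eps) - f_aa)
    return grid[np.argmin(total)]
```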


Approach

How the algorithm works - II

Once the new support point $x_{k+1}$ has been obtained, the FW update corresponds to

$$\alpha_{k+1} = \alpha_k + \frac{2}{k + 2}\,\bigl(\delta_{x_{k+1}} - \alpha_k\bigr) = \frac{k}{k + 2}\,\alpha_k + \frac{2}{k + 2}\,\delta_{x_{k+1}}. \qquad (8)$$

Weights and support points are updated simultaneously at each iteration.
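Concretely, (8) only rescales the current weight vector and appends one atom; a sketch of the full loop, continuing the hypothetical helpers above (the driver assumes `grid`, `betas`, `omegas`, `eps` are defined as before) and starting from a single Dirac:

```python
def fw_update(points, weights, x_new, k):
    # eq. (8): alpha_{k+1} = k/(k+2) * alpha_k + 2/(k+2) * delta_{x_new}
    weights = np.append(weights * (k / (k + 2.0)), 2.0 / (k + 2.0))
    points = np.vstack([points, x_new[None, :]])
    return points, weights

# hypothetical driver: grow the barycenter one support point per iteration
points, weights = grid[[0]], np.array([1.0])   # alpha_0 = delta at a grid point
for k in range(50):
    x_new = select_support_point(weights, points, betas, omegas, grid, eps)
    points, weights = fw_update(points, weights, x_new, k)
```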


Convergence analysis

Convergence analysis: finite case

Theorem

Suppose that $\beta_1, \ldots, \beta_m \in \mathcal{M}_1^+(\mathcal{X})$ have finite support and let $\alpha_k$ be the $k$-th iterate of our algorithm. Then,

$$B_\varepsilon(\alpha_k) - \min_{\alpha \in \mathcal{M}_1^+(\mathcal{X})} B_\varepsilon(\alpha) \le \frac{C_\varepsilon}{k + 2}, \qquad (9)$$

where $C_\varepsilon$ is a constant depending on $\varepsilon$ and on the domain $\mathcal{X}$.

What if the input measures $\beta_1, \ldots, \beta_m \in \mathcal{M}_1^+(\mathcal{X})$ are continuous and we only have access to samples?


Convergence analysis

Sample complexity of Sinkhorn Potentials

FW can be applied when only an approximation of the gradient is available.

Hence we need to quantify the approximation error between $\nabla S_\varepsilon(\cdot, \beta)$ and $\nabla S_\varepsilon(\cdot, \hat{\beta})$ in terms of the sample size of $\hat{\beta}$:

Theorem (Sample Complexity of Sinkhorn Potentials)

Suppose that $c$ is smooth. Then, for any $\alpha, \beta \in \mathcal{M}_1^+(\mathcal{X})$ and any empirical measure $\hat{\beta}$ of a set of $n$ points independently sampled from $\beta$, we have, for every $\tau \in (0, 1]$,

$$\bigl\|\nabla_1 S_\varepsilon(\alpha, \beta) - \nabla_1 S_\varepsilon(\alpha, \hat{\beta})\bigr\|_\infty \le \frac{C_\varepsilon \log\frac{3}{\tau}}{\sqrt{n}} \qquad (10)$$

with probability at least $1 - \tau$.


Convergence analysis

Convergence analysis: general case

Using the sample complexity of the Sinkhorn gradient, we can characterize the convergence of our algorithm in the general setting.

Theorem

Suppose that $c$ is smooth. Let $n \in \mathbb{N}$ and let $\hat{\beta}_1, \ldots, \hat{\beta}_m$ be empirical distributions with $n$ support points each, independently sampled from $\beta_1, \ldots, \beta_m$. Let $\alpha_k$ be the $k$-th iterate of our algorithm applied to $\hat{\beta}_1, \ldots, \hat{\beta}_m$. Then for any $\tau \in (0, 1]$, the following holds with probability at least $1 - \tau$:

$$B_\varepsilon(\alpha_k) - \min_{\alpha \in \mathcal{M}_1^+(\mathcal{X})} B_\varepsilon(\alpha) \le \frac{C_\varepsilon \log\frac{3m}{\tau}}{\min(k, \sqrt{n})}. \qquad (11)$$


Experiments

Barycenter of nested ellipses

Barycenter of 30 randomly generated nested ellipses on a 50 × 50 grid, similarly to [Cuturi and Doucet, 2014]. Each image is interpreted as a probability distribution in 2D.


Experiments

Barycenters of continuous measures

Barycenter of 5 Gaussian distributions with randomly generated means and covariances.

Figure: the scatter plot shows the output of our method; the level sets of its density show the true Wasserstein barycenter.

FW recovers both the mean and covariance of the target barycenter.


Experiments

Matching of a distribution

“Barycenter” of a single measure $\beta \in \mathcal{M}_1^+(\mathcal{X})$.

The solution of this problem is $\beta$ itself, so we can interpret the intermediate iterates as compressed versions of the original measure.

FW prioritizes the support points with higher weight.


Experiments

Summary

• We proposed a novel method to compute Sinkhorn barycenters with free support via the Frank-Wolfe algorithm.

• We proved convergence rates both in the case of finite and of continuous measures.

• We proved two new results on Sinkhorn divergences (Lipschitz continuity and sample complexity of the gradient) that are instrumental for the convergence analysis of the method.


References I

Cuturi, M. and Doucet, A. (2014). Fast computation of Wasserstein barycenters. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 685–693, Beijing, China. PMLR.

Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-I., Trouvé, A., and Peyré, G. (2019). Interpolating between optimal transport and MMD using Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics (AISTATS).

Genevay, A., Peyré, G., and Cuturi, M. (2018). Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617.
