Time Series Alignment with Global Invariances · 2020-02-11


Time Series Alignment with Global Invariances

Titouan Vayer 1 2  Laetitia Chapel 1 2  Nicolas Courty 1 2  Rémi Flamary 3  Yann Soullard 4 5 2  Romain Tavenard 4 5 2

Abstract

In this work we address the problem of comparing time series while taking into account both feature space transformation and temporal variability. The proposed framework combines a latent global transformation of the feature space with the widely used Dynamic Time Warping (DTW). The latent global transformation captures the feature invariance while the DTW (or its smooth counterpart soft-DTW) deals with the temporal shifts. We cast the problem as a joint optimization over the global transformation and the temporal alignments. The versatility of our framework allows for several variants depending on the invariance class at stake. Among our contributions we define a differentiable loss for time series and present two algorithms for the computation of time series barycenters under our new geometry. We illustrate the interest of our approach on both simulated and real world data.

1. Introduction

Learning under distribution shift is a paradigm of major interest in the machine learning literature. Indeed, the training and the test data are often subject to collection bias (Torralba & Efros, 2011) or can be collected under heterogeneous conditions, because of different times of measurement, contexts or even measurement modalities (e.g. when different sensors are used to measure related quantities). In this setting, when it comes to generalizing to out-of-distribution samples, machine learning algorithms are notoriously weak (Ben-David et al., 2010) as they rely on the correlations that are found in the training data (Arjovsky et al., 2019). Dedicated paradigms, such as domain adaptation (Kouw & Loog, 2019), directly take this problem into account in the learning process. Another approach is to learn, based on some prior knowledge, with respect to some invariance classes in order to be more robust to irrelevant feature transformations (Battaglia et al., 2018; Goodfellow et al., 2009). In this work, we aim at tackling this problem in the time series context through the definition of similarity measures that naturally encode desirable invariances. More precisely, we introduce similarity measures that are able to deal with both temporal and feature space transformations.

1 Université de Bretagne Sud, Vannes, France. 2 UMR IRISA, Rennes, France. 3 Université Côte d'Azur, Nice, France. 4 Université de Rennes, Rennes, France. 5 UMR LETG, Rennes, France. Correspondence to: Romain Tavenard <romain.tavenard@univ-rennes2.fr>.

There exist many frameworks to register different spaces under some classes of invariance. In the shape analysis community, matching objects under rigid transformations is a widely-studied problem. Iterative Closest Point (ICP, Chen & Medioni (1992)) is a standard algorithm for such a task. It alternates two simple steps: (i) matching points using nearest neighbor search and (ii) registering shapes together based on the obtained matches, which is known as the orthogonal Procrustes problem and has a closed form solution (Goodall, 1991). An extension of this algorithm presented in Alvarez-Melis et al. (2019a) uses optimal transport to match points at the first step, and it has been recently adapted to tree objects by considering a dedicated invariance class for the registration step (Alvarez-Melis et al., 2019b).

In the time series context, in order to deal with both local and global temporal distortions, a similarity measure called Dynamic Time Warping (DTW, Sakoe & Chiba (1978)) has been introduced. It is invariant (up to sampling artefacts) to any monotonically increasing temporal map that would align starting and end times. It was initially introduced for speech processing applications and is now widely used in a variety of contexts such as human activity recognition (Chang et al., 2019), satellite image analysis (Wegner Maus et al., 2019) or medical applications (Huang & Lu, 2020).

In this work, we aim at tackling both temporal and feature space invariances. To do so, we state the problem as a joint optimization over temporal alignments and feature space transformations, as depicted in Figure 1. That general framework allows the use of either DTW or its smoothed counterpart softDTW as an alignment procedure. Similarly, though rigid transformations of the feature space seem a reasonable invariance class, we show that our method can be used in conjunction with other families of transformations. Such a framework allows considering the case when time series differ both in length and feature space dimensionality.

arXiv:2002.03848v1 [stat.ML] 10 Feb 2020

Figure 1. DTW-GI aligns time series by optimizing on temporal alignment (through Dynamic Time Warping, yielding a path π*) and feature space transformation (denoted f here, with π*, f* = argmin_{π,f} ⟨W_π, C(x, f(y))⟩). Time series represented here are color-coded trajectories, whose starting (resp. end) point is depicted in blue (resp. red).

We introduce two different optimization procedures that could be used to tackle this problem and show experimentally that they lead to effectively invariant similarity measures. Our method can also be used to compute meaningful barycenters even when the time series at stake do not lie in the same feature space. Finally, we showcase the versatility of our method and the importance of jointly learning feature space transformations and temporal alignments on two real-world applications, namely time series forecasting for human motion and cover song identification.

2. Dynamic Time Warping (DTW)

Dynamic Time Warping (DTW, Sakoe & Chiba (1978)) is an algorithm used to assess similarity between time series, with extensions to multivariate time series proposed in (Ten Holt et al., 2007; Wöllmer et al., 2009). In its standard form, given two multivariate time series x ∈ R^{Tx×p} and y ∈ R^{Ty×p} of the same dimensionality p, DTW is defined as:

DTW(x, y) = min_{π ∈ A(x,y)} ∑_{(i,j) ∈ π} d(x_i, y_j)    (1)

where A(x,y) is the set of all admissible alignments between x and y and d is a ground metric. In most cases, d is the squared Euclidean distance, i.e. d(x_i, y_j) = ‖x_i − y_j‖².

An alignment π is a sequence of pairs of time frames which is considered to be admissible iff (i) it matches the first (and respectively last) indexes of time series x and y together, (ii) it is monotonically increasing and (iii) it is connected (i.e. every index from one time series must be matched with at least one index from the other time series). Efficient computation of the above-defined similarity measure can be performed in quadratic time using dynamic programming, relying on the following recurrence formula:

DTW(x_{→t1}, y_{→t2}) = d(x_{t1}, y_{t2}) + min { DTW(x_{→t1}, y_{→t2−1}), DTW(x_{→t1−1}, y_{→t2}), DTW(x_{→t1−1}, y_{→t2−1}) }    (2)
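The recurrence above can be sketched in a few lines of NumPy. This is an illustrative quadratic-time implementation (the function name is ours, not from the paper's released code):

```python
import numpy as np

def dtw(x, y):
    """DTW cost between two series (squared Euclidean ground metric),
    computed with the recurrence of Equation (2)."""
    Tx, Ty = len(x), len(y)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.sum((x[i - 1] - y[j - 1]) ** 2)  # d(x_i, y_j)
            D[i, j] = cost + min(D[i, j - 1], D[i - 1, j], D[i - 1, j - 1])
    return D[Tx, Ty]

# Border constraints plus monotonicity make DTW invariant to time warps:
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [0.0], [1.0], [1.0], [2.0]])  # same profile, warped in time
```

Here `dtw(x, y)` is 0 even though the two series have different lengths, since repeated samples of y can all be matched to the corresponding sample of x at no cost.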

Many variants of this similarity measure have been introduced. For example, the set of admissible alignment paths can be restricted to those lying around the diagonal using the so-called Itakura parallelogram or Sakoe-Chiba band, or a maximum path length can be enforced (Zhang et al., 2017). Most notably, a differentiable variant of DTW, coined soft-DTW, has been introduced in Cuturi & Blondel (2017) and is based on previous works on alignment kernels (Cuturi et al., 2007). It replaces the min operation in Equation (2) by a soft-min operator min^γ whose smoothness is controlled by a parameter γ > 0, resulting in the DTW_γ distance:

DTW_γ(x, y) = min^γ_{π ∈ A(x,y)} ∑_{(i,j) ∈ π} d(x_i, y_j)    (3)
            = −γ log ∑_{π ∈ A(x,y)} e^{−∑_{(i,j) ∈ π} d(x_i, y_j)/γ}.

In the limit case γ = 0, min^γ reduces to a hard min operator and DTW_γ is defined to be equivalent to the DTW algorithm.
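A soft-min version follows the same dynamic program, with min replaced by min^γ. This is a sketch with an explicit, numerically stabilized log-sum-exp (helper names are ours):

```python
import numpy as np

def soft_min(values, gamma):
    """min^γ(a_1, ..., a_n) = -γ log Σ_i exp(-a_i / γ), computed stably."""
    a = -np.asarray(values, dtype=float) / gamma
    m = np.max(a)
    return -gamma * (m + np.log(np.sum(np.exp(a - m))))

def soft_dtw(x, y, gamma):
    """softDTW: the DTW recurrence of Equation (2), min replaced by soft-min."""
    Tx, Ty = len(x), len(y)
    R = np.full((Tx + 1, Ty + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.sum((x[i - 1] - y[j - 1]) ** 2)
            R[i, j] = cost + soft_min([R[i, j - 1], R[i - 1, j], R[i - 1, j - 1]], gamma)
    return R[Tx, Ty]
```

Since the soft-min aggregates over all admissible paths, soft_dtw lower-bounds the hard DTW cost, and it approaches it as γ → 0.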

3. DTW with Global Invariances

Despite their widespread use, DTW and softDTW are not able to deal with time series of different dimensionality or to encode feature transformations that may arise between time series. In the following, we introduce a new similarity measure aiming at aligning time series in this complex setting and provide ways to compute associated alignments. We also derive a Fréchet mean formulation that allows computing barycenters under this new geometry.


Figure 2. Example alignments between 2D time series (trajectories in the plane), using DTW (left) and DTW-GI (right). Color coding corresponds to timestamps. Our DTW-GI method jointly estimates temporal alignment and global rotation between time series. On the contrary, standard DTW alignment fails at capturing feature space distortions and therefore produces mostly erroneous alignments (matching in red), except at the beginning and end of the time series, whose alignments are preserved thanks to DTW border constraints (cf. Section 2).

3.1. Definitions

Let x ∈ R^{Tx×px} and y ∈ R^{Ty×py} be two time series. In the following, we assume px ≥ py without loss of generality. In order to allow comparison between time series x and y, we will optimize over a family of functions F that map y onto the feature space in which x lies. More formally, we define Dynamic Time Warping with Global Invariances (DTW-GI) as the solution of the following joint optimization problem:

DTW-GI(x, y) = min_{f ∈ F, π ∈ A(x,y)} ∑_{(i,j) ∈ π} d(x_i, f(y_j)),    (4)

where F is a family of functions from R^{py} to R^{px}. Note that this problem can also be written as:

DTW-GI(x, y) = min_{f ∈ F, π ∈ A(x,y)} ⟨W_π, C(x, f(y))⟩    (5)

where f(y) is a shortcut notation for the transformation f applied to all observations in y, ⟨·, ·⟩ denotes the Frobenius inner product, W_π is defined as:

∀i ≤ Tx, j ≤ Ty,  (W_π)_{i,j} = 1 if (i, j) ∈ π, 0 otherwise    (6)

and C(x, f(y)) is the cross-similarity matrix of squared Euclidean distances between samples from x and f(y), respectively. This definition can be extended to the softDTW case of Equation (3) as follows:

DTW_γ-GI(x, y) = min_{f ∈ F} min^γ_{π ∈ A(x,y)} ⟨W_π, C(x, f(y))⟩    (7)
              = min_{f ∈ F} −γ log ∑_{π ∈ A(x,y)} e^{−⟨W_π, C(x, f(y))⟩/γ}.

Note that, due to the use of a soft-min operator, Equation (7) is no longer a joint optimization.
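To make Equations (5) and (6) concrete, here is a small numerical check (the toy path and series are ours) that the path-summed cost of Equation (4) equals the Frobenius inner product ⟨W_π, C⟩:

```python
import numpy as np

def path_matrix(path, Tx, Ty):
    """Binary alignment matrix W_pi of Equation (6)."""
    W = np.zeros((Tx, Ty))
    for (i, j) in path:
        W[i, j] = 1.0
    return W

# toy admissible alignment between a 3-step and a 4-step series (0-indexed pairs)
path = [(0, 0), (1, 1), (1, 2), (2, 3)]
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [0.5], [1.5], [2.0]])
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared-Euclidean cross-similarity
W = path_matrix(path, len(x), len(y))
lhs = np.sum(W * C)                                 # Frobenius inner product <W_pi, C>
rhs = sum(C[i, j] for (i, j) in path)               # path-summed cost of Equation (4)
assert np.isclose(lhs, rhs)
```

Both expressions evaluate to 0.5 here (0 + 0.25 + 0.25 + 0), illustrating that the two formulations are interchangeable.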

These similarity measures estimate both the temporal alignment and the feature space transformation between time series simultaneously, allowing the alignment of time series when the similarity should be defined up to a global transformation. For instance, Figure 2 shows two temporal alignments between two series in 2D that have been rotated in their feature space. In this case DTW-GI, whose invariance class is the space of rotations, recovers the proper alignment whereas DTW fails.

Properties of DTW-GI. By definition, DTW-GI and softDTW-GI are invariant under any global transformation T(·) such that {f ∘ T | f ∈ F} = F (i.e. F is stable under T), which motivates the name (soft)DTW with Global Invariances. It is also straightforward to see that DTW-GI(x, x) = 0 for any time series x as soon as F contains the identity map.

3.2. Optimization

Optimization of the above-defined losses can be performed in several ways, depending on the nature of F. We now present one optimization scheme for each loss.

3.2.1. GRADIENT DESCENT

We first consider optimization of the softDTW-GI loss in the case where F is a parametric family of functions that are differentiable with respect to their parameters. Optimizing problem (7) can then be done with a gradient descent on the parameters of f. Since softDTW is smooth (contrary to DTW), this strategy can be used to compute gradients of DTW_γ-GI w.r.t. the parameter θ of f_θ. Complexity for this approach is driven by (i) that of a softDTW computation and (ii) that of computing f_θ(y). If we denote the latter c_f, the overall complexity for this approach is hence O(n_iter(Tx Ty px + c_f)). Note that when Riemannian optimization is involved, an extra complexity term has to be added, corresponding to the cost of projecting gradients onto the considered manifold. This cost is O(p_y³) for example when optimization is performed on the Stiefel manifold (Wen & Yin, 2013), which is an important case for our applications, as discussed in more detail in the following.

3.2.2. BLOCK COORDINATE DESCENT (BCD)

When DTW-GI is concerned, we introduce another strategy that consists in alternating minimization over (i) the temporal alignment and (ii) the feature space transformation in Equation (5). We will refer to this strategy as Block Coordinate Descent (BCD) in the following.

Optimization over the alignment path given a fixed transformation f solely consists in a DTW alignment, as described in Section 2. For a fixed alignment path, the optimization problem then becomes:

min_{f ∈ F} ⟨W_π, C(x, f(y))⟩.    (8)

Recall that C is a matrix of squared distances, which means that the problem above is a weighted least squares problem. Depending on F, there can exist a closed form solution for this problem (e.g. when F is the set of affine maps with no further constraints). Let us first note that the matrix C can be rewritten as:

C(x, f(y)) = u_x 1_{Ty}^T + 1_{Tx} v_{f(y)}^T − 2 x f(y)^T    (9)

where u_x = (‖x_1‖², …, ‖x_{Tx}‖²)^T and v_{f(y)} = (‖f(y_1)‖², …, ‖f(y_{Ty})‖²)^T. Note that the optimization problem reduces to the linear term on the right if F is a set of norm-preserving operations.
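Written with broadcasting, the rank-one terms u_x 1^T and 1 v_{f(y)}^T of Equation (9) become row/column broadcasts. A quick numerical check of the decomposition (random data and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # Tx = 5, px = 3
fy = rng.normal(size=(4, 3))  # f(y), with Ty = 4

# direct cross-similarity matrix of squared Euclidean distances
C = ((x[:, None, :] - fy[None, :, :]) ** 2).sum(-1)

# decomposed form of Equation (9): u_x 1^T + 1 v_{f(y)}^T - 2 x f(y)^T
u_x = (x ** 2).sum(axis=1)    # squared norms of the x_i
v_fy = (fy ** 2).sum(axis=1)  # squared norms of the f(y_j)
C_dec = u_x[:, None] + v_fy[None, :] - 2 * x @ fy.T
assert np.allclose(C, C_dec)
```

This is simply the expansion ‖x_i − f(y_j)‖² = ‖x_i‖² + ‖f(y_j)‖² − 2⟨x_i, f(y_j)⟩ applied entrywise.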

Estimating f in the Stiefel manifold. Let us consider the case where F is the set of linear maps whose linear operator is an orthonormal matrix, hence lying on the Stiefel manifold that we denote V_{py,px} in the following. This invariance class encodes rigid transformations of the features. In this case, the optimization problem becomes:

min_{P ∈ V_{py,px}} ⟨W_π, u_x 1_{Ty}^T + 1_{Tx} v_{f(y)}^T − 2 x P y^T⟩    (10)

and we have v_{f(y)} = (‖y_1‖², …, ‖y_{Ty}‖²)^T = v_y since the considered applications are norm-preserving. Overall, we get the following optimization problem:

min_{P ∈ V_{py,px}} ⟨W_π, u_x 1_{Ty}^T + 1_{Tx} v_y^T⟩ − 2 ⟨W_π, x P y^T⟩    (11)

whose solution is equivalent to solving:

max_{P ∈ V_{py,px}} ⟨W_π, x P y^T⟩ = max_{P ∈ V_{py,px}} ⟨x^T W_π y, P⟩    (12)

since the term ⟨W_π, u_x 1_{Ty}^T + 1_{Tx} v_y^T⟩ does not depend on P.

As described in Jaggi (2013), the latter problem can be solved exactly using Singular Value Decomposition (SVD): if UΣV^T = M is the SVD of a matrix M of shape (p_y, p_x), then S* = UV^T is a solution to the linear problem sup_{S ∈ V_{py,px}} ⟨S, M⟩. Note that an extension of this method to the case where F is an affine map whose linear part lies in the Stiefel manifold is straightforward, as discussed for example in Lawrence et al. (2019).
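The SVD step can be sketched as follows. This is a generic solver for the linear problem over matrices with orthonormal columns; the check against random feasible points is ours:

```python
import numpy as np

def stiefel_argmax(M):
    """Solve max <P, M> over matrices P with orthonormal columns
    (P^T P = I) via SVD: P* = U V^T, cf. Equation (12)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 2))
P = stiefel_argmax(M)
assert np.allclose(P.T @ P, np.eye(2))  # P lies on the Stiefel manifold

# sanity check: no random feasible point achieves a larger inner product
for _ in range(100):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 2)))  # random orthonormal columns
    assert np.sum(P * M) >= np.sum(Q * M) - 1e-9
```

The optimum value is the nuclear norm of M (the sum of its singular values), which is why the SVD-based solution is exact.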

Interestingly, this optimization strategy, where we alternate between time series alignment (i.e. time correspondences between both time series) and feature space transform optimization, can be seen as a variant of the Iterative Closest Point (ICP) method in image registration (Chen & Medioni, 1992), in which nearest neighbors are replaced by matches resulting from DTW alignment. Its overall complexity is then O(n_iter(Tx Ty px + px² py)). This complexity is equal to that of the gradient descent when px = O(py). However, in practice, the number of iterations required is much lower for this BCD variant, making it a very competitive optimization scheme, as discussed in Section 4.

3.3. Barycenters

Let us now assume we are given a set {x^(i)}_i of time series of possibly different lengths and dimensionalities. A barycenter of this set in the DTW-GI sense is a solution to the following optimization problem:

min_{b ∈ R^{T×p}} ∑_i w_i min_{f_i ∈ F} DTW(x^(i), f_i(b)),    (13)

where the weights {w_i}_i as well as the barycenter length T and dimensionality p are provided as input to the problem. Note that, with this formulation, when F is the Stiefel manifold, p is supposed to be lower than or equal to the dimensionality of any time series in the set {x^(i)}_i.

In terms of optimization, as for similarity estimation, two schemes can be used. First, softDTW-GI barycenters can be estimated through gradient descent (and when the set of series to be averaged is large, a stochastic variant relying on minibatches can easily be implemented). Second, when BCD is used for time series alignment, barycenters can be estimated using an approach similar to DTW Barycenter Averaging (DBA, Petitjean et al. (2011)), which consists in alternating between barycentric coordinate estimation and DTW-GI alignments.

4. Experiments

In this section, we provide an experimental study of DTW-GI (and its soft counterpart) on simulated data and real-world datasets. Unless otherwise specified, the set F of feature space transforms is the set of affine maps whose linear part lies in the Stiefel manifold. In all our experiments, the tslearn (Tavenard et al., 2017) implementation is used for baseline methods and gradient descent on the Stiefel manifold is performed using GeoOpt (Kochurov et al., 2019; Becigneul & Ganea, 2019) in conjunction with PyTorch. Open source code of our method will be released upon publication and is provided as supplementary material.

4.1. Timings

We are first interested in a quantitative evaluation of the temporal complexity of our methods. Note that the theoretical complexities of DTW and softDTW are the same, hence any difference observed in this series of experiments between DTW-GI and softDTW-GI is solely due to their optimization schemes discussed in Section 3.2. In these experiments, the number of iterations for BCD as well as the number of gradient steps for the gradient descent optimizer are set to 5,000. The BCD algorithm used for DTW-GI is stopped as soon as it reaches a local minimum, while early stopping is used for the gradient descent variant with a patience parameter set to 100 iterations.

Figure 3. Computing time as a function of time series length (left) and dimensionality (right), for DTW, softDTW, DTW-GI and softDTW-GI. Solid lines correspond to mean values and shaded areas correspond to 20th (resp. 80th) percentiles.

We first study the computation time as a function of the length of the time series involved. To do so, we generate random time series in dimension 8 and vary their length from 8 to 1,024 timestamps. Figure 3 (left) shows a clear quadratic trend for all 4 methods presented. Note that DTW-GI and its BCD optimizer clearly outperform the gradient descent strategy used for softDTW-GI because the latter requires more iterations before early stopping can be triggered. Building on this, we now turn our focus to the impact of feature space dimensionality p (with a fixed time series length of 32). Baselines are asymptotically linear with respect to p. Since feature space registration is performed through optimization on the Stiefel manifold, both our optimization schemes rely on Singular Value Decomposition, which leads to an O(p³) complexity that can also be observed for both methods in Figure 3 (right).

4.2. Rotational invariance

We now evaluate the ability of our method to recover invariance to rotation. To do so, we rely on a synthetic dataset of noisy spiral-like 2d trajectories. For increasing values of an angle θ, we generate pairs of spirals rotated by θ with additive Gaussian noise. Alignments between a reference time series and variants that are subject to an increasing rotation θ are computed and repeated 50 times per angle. The ratio of each distance to the distance obtained when θ = 0 is reported in Figure 4. One can clearly see that the GI counterparts of DTW and softDTW are invariant to rotation in the 2d feature space, while DTW and softDTW are not.

Figure 4. Illustration of the rotation invariance provided by DTW-GI: distance ratio as a function of the rotation angle (from 0 to 2π), for DTW, softDTW, DTW-GI and softDTW-GI. Mean distance ratios are reported as solid lines and shaded areas correspond to 20th (resp. 80th) percentiles.

4.3. Barycenter computation

So as to better grasp the notion of similarity captured by our methods, we compute barycenters using the strategy presented in Section 3.3. Barycenters are computed for 3 different datasets: the first two are made of 2d trajectories of rotated and noisy spirals or folia, and the third one is composed of both 2- and 3-dimensional spirals (see samples in the left part of Figure 5). For each dataset, we provide barycenters obtained by two baseline methods. DTW Barycenter Averaging (DBA, Petitjean et al. (2011)) is used for DTW while softDTW resorts to a gradient descent scheme to compute the barycenters. Their GI counterparts use the same algorithms but rely on the alignments obtained from DTW-GI and softDTW-GI respectively. Note that the baselines cannot be used for the third dataset since the features of the time series do not lie in the same ambient space.

Figure 5. Barycenter computation using (i) DTW and softDTW baseline approaches, (ii) their rotation-invariant counterparts DTW-GI and softDTW-GI. Each row corresponds to a different dataset, and the last one contains both 2d and 3d trajectories, hence cannot be tackled by any baseline method. Trajectories are color-coded from blue (beginning of the series) to red (end of the series).

For the 2d spiral dataset, all the reconstructed barycenters can be considered meaningful. This is because of the quasi-symmetries of the data at stake, which allow DTW and softDTW to recover sensible barycenters. Note however that the outer loop of the spiral (the one that suffers the most from the rotation) is better reconstructed using the DTW-GI and softDTW-GI variants. When it comes to the folia trajectories, which are more impacted by rotations, baseline barycenters fail to capture the inherent structure of the trajectories at stake, while both our methods generate smooth and representative barycenters. DTW-GI and softDTW-GI are even able to recover barycenters when datasets are made of series that do not lie in the same space, as shown in the third row of Figure 5. Finally, in all three settings considered, temporal alignments successfully capture the irregular sampling of the samples to be averaged (denser towards the center of the spiral / loop of the folium).

4.4. Time series forecasting

To further illustrate the benefit of our approach, we consider a time series forecasting problem (Le Guen & Thome, 2019), where the goal is to infer the future of a partial time series. In this setting, we suppose that we have access to a training set of full time series X, with x^(i) ∈ X a time series of length T and dimensionality p, and another test set of partial time series Y where each y ∈ Y is of length T′ < T and dimensionality p. The goal is to predict the values for timestamps T′ to T for each test time series. We will denote by x_{→T′} the beginning of the time series x (up to time T′) and x_{T′→} its end (from time T′ to time T).

Let d(y, x_i) denote a dissimilarity measure between time series y and x_i associated with a transformation f_i ∈ F : R^{px} → R^{py} that maps the features of x_i onto the features of y. This function aims at capturing the desired invariances in the feature space, as described in the previous section. A typical example is when d is the softDTW-GI cost: the f_i are then the Stiefel linear maps which capture the possible rigid transformation between the features. We propose to predict the future of a time series y as follows:

ŷ_{T′→} = ∑_i a_d(y_{→T′}, x^(i)_{→T′}) f_i(x^(i)_{T′→})    (14)

where a_d is the attention kernel:

a_d(y, x_i) = e^{−λ d(y, x_i)} / ∑_j e^{−λ d(y, x_j)}    (15)

with λ > 0. The prediction is based on the known timestamps of the time series of the training set and on transformations f_i that aim at capturing the latent transformation between training and test time series. The attention kernel gives more importance to time series that are close to the time series we want to forecast w.r.t. the notion of dissimilarity d. Note that for large values of λ, the softmax in Equation (15) converges to a hard max and the proposed approach corresponds to a nearest neighbor imputation.
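The attention kernel of Equation (15) is a softmax over negated dissimilarities. A small sketch of its behavior (the dissimilarity values below are hypothetical, chosen for illustration):

```python
import numpy as np

def attention_weights(dists, lam):
    """Attention kernel of Equation (15): a_d(y, x_i) ∝ exp(-λ d(y, x_i)).
    Shifting by the min before exponentiating leaves the softmax unchanged
    but avoids underflow for large λ."""
    a = np.exp(-lam * (np.asarray(dists, dtype=float) - np.min(dists)))
    return a / a.sum()

# hypothetical dissimilarities between the partial test series and 3 training series
dists = [2.0, 0.5, 4.0]
w = attention_weights(dists, lam=1.0)
assert np.isclose(w.sum(), 1.0)   # a proper convex combination
assert w[1] == w.max()            # the closest training series gets the largest weight

# large λ: the softmax converges to a hard max (nearest-neighbor imputation)
w_hard = attention_weights(dists, lam=100.0)
assert w_hard[1] > 0.999
```

The forecast of Equation (14) is then the w-weighted combination of the registered training futures f_i(x^(i)_{T′→}).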


Figure 6. Examples of the forecasted subseries, comparing ground truth with L2, L2+Procrustes, softDTW, softDTW+Procrustes and softDTW-GI predictions. Training samples x^(i)_{T′→} are depicted in the top row while the ground truth y_{T′→} and structured predictions ŷ_{T′→} are represented in the bottom row. For each movement an arrow indicates the orientation of the subject. The beginning of the movement is displayed in shaded blue while the end is displayed in bold red.

λ      | L2            | L2+Procrustes | softDTW       | softDTW+Procrustes | softDTW-GI
10⁻³   | 98.24 ± 1.14  | 16.50 ± 1.25  | 75.41 ± 0.90  | 16.46 ± 1.24       | 12.34 ± 0.65
10⁻²   | 97.21 ± 1.11  | 16.12 ± 1.00  | 65.39 ± 0.69  | 15.71 ± 0.94       |  8.30 ± 1.26
10⁻¹   | 83.65 ± 0.74  | 12.63 ± 0.65  | 65.39 ± 0.69  | 11.32 ± 0.58       | 13.41 ± 1.05
10⁰    | 65.39 ± 0.69  | 19.62 ± 1.17  | 65.39 ± 0.69  | 19.96 ± 0.91       | 17.61 ± 2.23

Table 1. Average prediction loss on test subjects for different values of λ.

Dataset and methodology   We use the Human3.6M dataset (Ionescu et al., 2014), which consists of 3.6 million video frames of human movements recorded in a controlled indoor motion capture setting. This dataset is composed of 7 actors performing 15 activities (“Walking”, “Sitting”, ...) twice. We are interested in forecasting the 3D positions of the subject's joints evolving over time. We follow the same data partition as Coskun et al. (2017): the training set has 5 subjects (S1, S5, S6, S7 and S8) and the remaining 2 subjects (S9 and S11) compose the test set. In our experiments, we set the limit frames as T′ = 100, T = 200, which corresponds to predicting two seconds of movement given the previous two seconds. To emulate possible changes in signal acquisition (e.g. rotations of the camera), we randomly rotate the train subjects w.r.t. the z-axis. We consider the movements of type “Walking”, “WalkDog” and “WalkTogether” for the training set and “Walking” for the test set. The top row of Figure 6 illustrates samples of movements resulting from this procedure, and the resulting dataset is provided as supplementary material.

Competing structured prediction methods   Since our motions are in 3D, we look for global transformations of the features f_i : R³ → R³. We use softDTW-GI as our similarity measure and its associated map f_i as described in Equation (7). We compare softDTW-GI to four baselines that correspond to different pairs of time series similarity measure and feature space invariances. The first two baselines do not encode any feature space invariance and are based on ℓ2 and softDTW similarities respectively. They are denoted L2 and softDTW and use the identity map for all f_i. We also consider a Procrustes baseline (Goodall, 1991) defined as:

$$d\big(y_{\rightarrow T'}, x^{(i)}_{\rightarrow T'}\big) = \min_{P \in V_{3,3},\, b \in \mathbb{R}^3} \big\| \big(x^{(i)}_{\rightarrow T'} P^T + b\big) - y_{\rightarrow T'} \big\|_2^2 \qquad (16)$$

The corresponding transformation f_i is the affine map based on the optimal P⋆, b⋆ found by the previous problem. We denote this baseline L2+Procrustes. The last baseline is computed by first registering series using the Procrustes procedure defined above and then using the similarity measure d(y_{→T′}, x^{(i)}_{→T′}) = DTW_γ(y_{→T′}, x^{(i)}_{→T′} P^{⋆T} + b^⋆). It is denoted softDTW+Procrustes.
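The minimization in Equation (16) admits a closed-form SVD solution (the classical Kabsch-Umeyama argument, cf. Lawrence et al. (2019)). The sketch below, with names of our choosing, assumes the two series have the same length, i.e. a point-to-point correspondence per timestamp, as the Procrustes baseline does.

```python
import numpy as np

def procrustes_register(x, y):
    """Closed-form solution of Eq. (16): find an orthogonal P and a
    translation b minimizing ||x P^T + b - y||^2, for two series of
    equal length matched timestamp by timestamp."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    # SVD of the cross-covariance of the centered, matched point sets
    U, _, Vt = np.linalg.svd((y - my).T @ (x - mx))
    P = U @ Vt                 # optimal orthogonal map
    b = my - mx @ P.T          # optimal translation
    dist = np.sum(((x @ P.T + b) - y) ** 2)
    return dist, P, b
```

Note that the SVD solution optimizes over all orthogonal matrices, so P may include a reflection; restricting to proper rotations would require flipping the sign of the last singular vector when det(P) < 0.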

Results   Qualitative and quantitative results are provided in Figure 6 and Table 1 respectively. We evaluate, for each test subject, the ℓ2 reconstruction loss ‖y_{T′→} − ŷ_{T′→}‖₂ between the ground truth time series and its prediction. Table 1 displays the average loss on the test subjects for different values of the soft-max parameter λ ∈ {10⁻³, ..., 10⁰}. Figure 6 presents examples of reconstructed movements.

In all cases we observe that softDTW and L2 lead to the worst reconstruction losses. Moreover, one can observe qualitatively that these methods struggle to cope with the difference in orientation of the subjects, thus tending to shrink the prediction. On the contrary, by capturing the possible spatial variability, L2+Procrustes and softDTW+Procrustes perform reasonably well but do struggle with the dynamics of the movement, as shown in Figure 6. This can be explained by the fact that the optimal transformations found by the Procrustes analysis assume a one-to-one correspondence of the points (i.e. y_{→T′}(t) corresponds to x^{(i)}_{→T′}(t) at time t) and do not account for the temporal shifts between them. As a result, both methods lead to unrealistic transformations when the dynamics of the movements differ. On the opposite, the softDTW-GI method leads to the best results, highlighting the benefits of our approach over methods that either discard the temporal variability of the movements (L2, L2+Procrustes) or their spatial variability (softDTW). More importantly, the results advocate for a joint realignment of time and space, since it leads to better results than a two-step procedure such as softDTW+Procrustes, which first finds the feature transformation and then aligns series in time.

Figure 7. Cover song identification using the covers80 dataset. Methods are compared in terms of recall. [Recall-vs-rank curves for DTW (OTI), MR1=27.7; DTW-GI (Stiefel), MR1=28.1; DTW-GI (OTI), MR1=23.7.]
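For intuition, the joint realignment advocated above can be sketched as a block coordinate descent that alternates temporal alignment and feature registration. This is a hard-DTW toy variant with helper names of our choosing, not the paper's soft-DTW/Stiefel implementation:

```python
import numpy as np

def _procrustes(X, Y):
    # Closed-form orthogonal registration of matched point sets (SVD)
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    U, _, Vt = np.linalg.svd((Y - my).T @ (X - mx))
    P = U @ Vt
    return P, my - mx @ P.T

def _dtw(C):
    # Classic O(nm) dynamic program on a pairwise cost matrix;
    # returns the optimal cost and the alignment path
    n, m = C.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(n):
        for j in range(m):
            D[i + 1, j + 1] = C[i, j] + min(D[i, j], D[i, j + 1], D[i + 1, j])
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda s: D[s])
    return D[n, m], path[::-1]

def dtw_gi(x, y, n_iter=10):
    # Block coordinate descent: (i) align in time for the current map
    # f(x) = x P^T + b, then (ii) re-fit P, b on the matched pairs
    P, b = np.eye(x.shape[1]), np.zeros(x.shape[1])
    for _ in range(n_iter):
        xt = x @ P.T + b
        cost, path = _dtw(((xt[:, None, :] - y[None, :, :]) ** 2).sum(-1))
        ia, ib = map(list, zip(*path))
        P, b = _procrustes(x[ia], y[ib])
    return cost, P, b
```

Each step decreases (or leaves unchanged) the joint objective, which contrasts with the two-step softDTW+Procrustes baseline that fixes the registration once and for all.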

4.5. Cover song identification

Cover song identification is the task of retrieving, for a given query song, its covers (i.e. different versions of the same song) in a training set. State-of-the-art methods rely on anchor matches in the songs, on temporal alignments, or both. In most related works, chroma or harmonic pitch class profile (HPCP) features are chosen, as they capture the harmonic characteristics of the songs at stake (Heo et al., 2017).

For this experiment, we use the covers80 dataset (Ellis & Cotton, 2007), which consists of 80 cover pairs of pop songs, and we evaluate the performance in terms of recall. Since the selection of features is not our main focus, we choose to extract chroma energy normalized statistics (CENS, Müller et al. (2005)) over half-second windows. We compare variants of our method to a baseline that consists of a DTW alignment between songs transposed to the same key using the Optimal Transposition Index (OTI, Serra et al. (2008)). The OTI computes a transposition based on the average energy in each semitone band. Note that we also compared our approach to the Smith–Waterman algorithm in place of DTW, but we report it in the supplementary material since it did not lead to significant improvement over the DTW baseline.

Figure 7 presents recall scores for the compared methods as well as the mean rank of the first correctly identified cover (MR1), a standard evaluation metric for the task (used in MIREX¹, for example). The presented results show that DTW-GI with feature space optimization on the Stiefel manifold obtains results comparable to the baseline. Interestingly, the OTI strategy can be adapted to our context: using the BCD optimization scheme, we are able to compute the optimal transposition index along the alignment path (instead of computing it on averaged features) at each iteration of the algorithm. This leads to a significant improvement in performance and illustrates both the versatility of our method and the importance of performing joint feature space transformation and temporal alignment.
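The baseline OTI of Serra et al. (2008) can be sketched in a few lines: it is the circular shift of the 12 semitone bands that best matches the time-averaged chroma profiles of two songs. The function name below is ours.

```python
import numpy as np

def optimal_transposition_index(chroma_a, chroma_b):
    """Sketch of the OTI (Serra et al., 2008): the circular shift of
    the 12 semitone bands that best matches the time-averaged chroma
    profiles of two songs (inputs are arrays of shape frames x 12)."""
    ga, gb = chroma_a.mean(axis=0), chroma_b.mean(axis=0)
    # Score every possible transposition (circular shift) of song b
    scores = [np.dot(ga, np.roll(gb, k)) for k in range(12)]
    return int(np.argmax(scores))
```

The variant used in DTW-GI (OTI) differs in that, at each BCD iteration, the transposition is re-estimated on the chroma frames matched by the current alignment path rather than on these global averages.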

5. Conclusion and Perspectives

We propose in this paper a novel similarity measure that can compare time series across different spaces in order to handle both temporal and feature space invariances. This work extends the well-known Dynamic Time Warping algorithm to deal with time series from different spaces thanks to the introduction of a joint optimization over temporal alignments and space transformations. In addition, we provide a formulation for the computation of the barycenter of a set of time series under our new geometry, which is, to the best of our knowledge, the first barycenter formulation for a set of heterogeneous time series. Another important special case of our approach allows for performing temporal alignment of time series with invariance to rotations in the feature space.

We illustrate our approach on several datasets. First, we use simulated time series to study the computational complexity of our approach and to illustrate invariance to rotations. Then, we apply our approach to two real-life datasets for human motion prediction and cover song identification, where invariant similarity measures are shown to improve performance.

Extensions of this work will consider scenarios where the features of the series do not lie in a Euclidean space, which would allow covering the case of structured data such as graphs evolving over time, for example. Future works also include the use of our methods in more elaborate models where, following ideas from Cai et al. (2019) and Iwana et al. (2020), softDTW-GI could be used as a feature extractor in neural networks. It could also serve as a loss to train heterogeneous time series forecasting models (Le Guen & Thome, 2019; Cuturi & Blondel, 2017).

¹https://www.music-ir.org/mirex/wiki/2019:Audio_Cover_Song_Identification

References

Alvarez-Melis, D., Jegelka, S., and Jaakkola, T. S. Towards optimal transport with global invariances. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2019a.

Alvarez-Melis, D., Mroueh, Y., and Jaakkola, T. S. Unsupervised hierarchy matching with optimal transport over hyperbolic spaces. arXiv preprint arXiv:1911.02536, 2019b.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Battaglia, P., Hamrick, J. B. C., Bapst, V., Sanchez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G. E., Vaswani, A., Allen, K., Nash, C., Langston, V. J., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., and Pascanu, R. Relational inductive biases, deep learning, and graph networks. arXiv, 2018.

Becigneul, G. and Ganea, O.-E. Riemannian adaptive optimization methods. In Proceedings of the International Conference on Learning Representations, 2019.

Ben-David, S., Lu, T., Luu, T., and Pál, D. Impossibility theorems for domain adaptation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 129–136, 2010.

Cai, X., Xu, T., Yi, J., Huang, J., and Rajasekaran, S. DTWNet: a dynamic time warping network. In Neural Information Processing Systems, pp. 11636–11646. Curran Associates, Inc., 2019.

Chang, C.-Y., Huang, D.-A., Sui, Y., Fei-Fei, L., and Niebles, J. C. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3546–3555, 2019.

Chen, Y. and Medioni, G. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145–155, 1992.

Coskun, H., Achilles, F., DiPietro, R., Navab, N., and Tombari, F. Long short-term memory Kalman filters: Recurrent neural estimators for pose regularization. In Proceedings of the International Conference on Computer Vision, 2017.

Cuturi, M. and Blondel, M. Soft-DTW: a differentiable loss function for time-series. In Proceedings of the International Conference on Machine Learning, pp. 894–903, 2017.

Cuturi, M., Vert, J.-P., Birkenes, O., and Matsui, T. A kernel for time series based on global alignments. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pp. II-413. IEEE, 2007.

Ellis, D. P. and Cotton, C. V. The 2007 LabROSA cover song detection system. 2007.

Goodall, C. Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society: Series B (Methodological), 53(2):285–321, 1991.

Goodfellow, I., Lee, H., Le, Q. V., Saxe, A., and Ng, A. Y. Measuring invariances in deep networks. In Neural Information Processing Systems, pp. 646–654. Curran Associates, Inc., 2009.

Heo, H., Kim, H. J., Kim, W. S., and Lee, K. Cover song identification with metric learning using distance as a feature. In Proceedings of the International Society for Music Information Retrieval Conference, 2017.

Huang, S.-F. and Lu, H.-P. Classification of temporal data using dynamic time warping and compressed learning. Biomedical Signal Processing and Control, 57:101781, 2020.

Ionescu, C., Papava, D., Olaru, V., and Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.

Iwana, B., Frinken, V., and Uchida, S. DTW-NN: A novel neural network for time series recognition using dynamic alignment between inputs and weights. Knowledge-Based Systems, 188, 2020.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the International Conference on Machine Learning, 2013.

Kochurov, M., Kozlukov, S., Karimov, R., and Yanush, V. Geoopt: Adaptive Riemannian optimization in PyTorch, 2019. https://github.com/geoopt/geoopt.

Kouw, W. M. and Loog, M. A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

Lawrence, J., Bernal, J., and Witzgall, C. A purely algebraic justification of the Kabsch-Umeyama algorithm. Journal of Research of the National Institute of Standards and Technology, 124:1–6, 2019.

Le Guen, V. and Thome, N. Shape and time distortion loss for training deep time series forecasting models. In Neural Information Processing Systems, 2019.

Müller, M., Kurth, F., and Clausen, M. Audio matching via chroma-based statistical features. In Proceedings of the International Society for Music Information Retrieval Conference, 2005.

Petitjean, F., Ketterlin, A., and Gançarski, P. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678–693, 2011.

Sakoe, H. and Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):43–49, 1978.

Serra, J., Gómez, E., and Herrera, P. Transposing chroma representations to a common key. In Proceedings of the IEEE Conference on The Use of Symbols to Represent Music and Multimedia Objects, pp. 45–48, 2008.

Tavenard, R., Faouzi, J., Vandewiele, G., Divo, F., Androz, G., Holtz, C., Payne, M., Yurchak, R., Rußwurm, M., Kolar, K., and Woods, E. tslearn: A machine learning toolkit dedicated to time-series data, 2017. https://github.com/rtavenar/tslearn.

Ten Holt, G. A., Reinders, M. J., and Hendriks, E. Multi-dimensional dynamic time warping for gesture recognition. In Annual Conference of the Advanced School for Computing and Imaging, volume 300, 2007.

Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1521–1528, 2011.

Wegner Maus, V., Câmara, G., Appel, M., and Pebesma, E. dtwSat: Time-weighted dynamic time warping for satellite image time series analysis in R. Journal of Statistical Software, 88(5):1–31, 2019.

Wen, Z. and Yin, W. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.

Wöllmer, M., Al-Hames, M., Eyben, F., Schuller, B., and Rigoll, G. A multidimensional dynamic time warping algorithm for efficient multimodal fusion of asynchronous data streams. Neurocomputing, 73(1-3):366–380, 2009.

Zhang, Z., Tavenard, R., Bailly, A., Tang, X., Tang, P., and Corpetti, T. Dynamic time warping under limited warping path length. Information Sciences, 393:91–107, 2017.


Figure 8. Cover song identification using the covers80 dataset. Methods are compared in terms of recall. [Recall-vs-rank curves for SW (OTI), MR1=30.2; SW-GI (OTI), MR1=30.5; DTW-GI (OTI), MR1=23.7.]

A. An Extra Cover Song Identification Experiment

In Section 4.5, we experimented with Dynamic Time Warping as the base metric to align time series. The Smith–Waterman algorithm can also be used for this task. We show in Figure 8 that our method applied to DTW with key transposition along the alignment (denoted “DTW-GI (OTI)”) outperforms both its Smith–Waterman counterpart and a baseline that uses Smith–Waterman in conjunction with a pre-processing step based on OTI.