
Clustering and Recognition of Spatiotemporal Features through Interpretable Embedding of Sequence to Sequence Recurrent Neural Networks

Kun Su 1, Eli Shlizerman 1,2

1 Department of Electrical & Computer Engineering, 2 Department of Applied Mathematics

University of Washington, Seattle, WA 98195

suk4,[email protected]

Abstract

Encoder-decoder recurrent neural network models (RNN Seq2Seq) have achieved great success in ubiquitous areas of computation and applications. They have been shown to successfully model data with both temporal and spatial dependencies for translation and prediction tasks. In this study, we propose an embedding approach to visualize and interpret the representation of data by these models. Furthermore, we show that the embedding is an effective method for unsupervised learning and can be utilized to estimate the optimality of model training. In particular, we demonstrate that embedding space projections of the decoder states of an RNN Seq2Seq model trained on sequence prediction are organized in clusters capturing similarities and differences in the dynamics of these sequences. Such performance corresponds to an unsupervised clustering of any spatio-temporal features and can be employed for time-dependent problems such as temporal segmentation, clustering of dynamic activity, self-supervised classification, action recognition, failure prediction, etc. We test and demonstrate the application of the embedding methodology to time sequences of 3D human body poses. We show that the methodology provides a high-quality unsupervised categorization of movements.

Introduction

Recurrent Sequence to Sequence (Seq2Seq) network models use internal states to process sequences of inputs (Hochreiter and Schmidhuber 1997; Luong, Pham, and Manning 2015; Sutskever, Vinyals, and Le 2014; Cho et al. 2014). The specialty of Seq2Seq is that these models consist of encoder and decoder components. The encoder typically processes input sequences and constructs a latent representation of the sequences. In addition, the encoder passes the last internal state to the decoder as an "initialization" of the decoder. With this information the decoder transforms or generates novel sequences with a similar distribution. Such an architecture and its variants, e.g. attention-based Seq2Seq (Cho et al. 2014), showed compelling performance in applications of machine translation (Sutskever, Vinyals, and Le 2014), speech recognition (Graves, Mohamed, and Hinton 2013), and human motion prediction (Gui et al. 2018).

While Seq2Seq and its variants achieve strong performance in various applications, a consistent interpretation of how the encoder-decoder structure is able to embed general time-series data (i.e., multidimensional ordered sequences), and how such an interpretation can be used to estimate the performance of the model, is an active research topic. In this paper, we propose a dimension reduction approach to visualize and interpret the representation of general time-series data within a Seq2Seq model. The main contribution of this paper is to provide constructive insights into the properties which allow the encoder and the decoder components to operate optimally, and to propose a low dimensional visualization method for the representation that the encoder and the decoder construct. With our embedding method we find a remarkable property of Seq2Seq: a network trained to predict the future evolution of a sequence self-organizes the hidden units representation into separate identities (clusters or classes). The clusters are embedded in the embedding space as attractors to which embedded encoder trajectories lead. We show on human body pose data how the organization of the attractors provides an easy unsupervised means for clustering sequential data into distinct identities.

Interpretability has been an important aspect of any artificial neural network (ANN) and is expected to provide generic methodologies to assess the capabilities of different models and evaluate their performance on given tasks. Associating interpretability with various types of ANN is often a challenging undertaking, as typical state-of-the-art network models are high dimensional and include many components and processing stages (layers). The problem becomes more challenging when dealing with RNN sequence models and with unsupervised tasks such as synthesis and translation, in which encoder-decoder networks were found to be prevalent. In these tasks, interpretability aims to capture the model synthesis procedures and to demonstrate how these improve over training; since learning is unsupervised, interpretation methodologies have the potential to provide a framework that assists with enhancing the optimization and learning.

Typically, existing interpretation methods applicable to ANNs are application oriented. For Convolutional Neural Networks (CNN), models were explained in the image domain (Alain and Bengio 2016; Kim et al. 2017; Zeiler and Fergus 2014). For RNN, interpretation and visualization tools focus on natural language processing applications (Karpathy, Johnson, and Fei-Fei 2015; Collins, Sohl-Dickstein, and Sussillo 2016; Foerster et al. 2017; Strobelt et al. 2019).



Our proposed work is based on and inspired by these works and extends the methodology to generic sequences, of which textual sequences are a sub-class with a special time dependence (semantics). Generic multidimensional time series are spatio-temporal sequences including nontrivial correlations in space and time. Beyond text data, there are works inspired by neuronal networks investigating the dynamics of RNN (Recanatesi et al. 2018; Farrell et al. 2019). In our work, we aim to provide an interpretation of encoder-decoder (Seq2Seq) network models for general spatiotemporal data.

We test our methods on prediction tasks of synthetic data and of typical movements of human body joints. There are several RNN-based Seq2Seq models that achieve success on human motion prediction (Martinez, Black, and Romero 2017) and outperform previous non-Seq2Seq based RNN models such as ERD (Fragkiadaki et al. 2015) and S-RNN (Jain et al. 2016). Recently, Generative Adversarial Networks (Gui et al. 2018) have achieved better performance on this task, with the predictor network being an RNN Seq2Seq.

We show that Seq2Seq optimization with gradient descent based methods forms a low dimensional embedding of internal states. The embedding can be mapped and visualized through Proper Orthogonal Decomposition (POD) of concatenated encoder and decoder internal states: the interpretable embedding. Within this embedding, the decoder evolution for each distinct sequence (decoder trajectory) is separable from other distinct sequences. Furthermore, each distinct decoder trajectory preserves both spatial and temporal properties of the sequence. The encoder trajectory, initiated from various starting points, connects them in the interpretable embedding space with the appropriate decoder trajectory. Monitoring the interpretable embedding space and the projected trajectories in it during training shows the effect of training on the data representation and assists in identifying an optimal regime between under- and over-fitting. We construct synthetic data examples to demonstrate the construction of the interpretable embedding space. Next, we apply the construction of the interpretable embedding and analyze Seq2Seq performance on a human joints movements dataset: H3.6M, which contains 15 different types of real body movement sequences, such as walking, eating, etc. (Ionescu et al. 2014). To show the generality of and an example application for the proposed approach, we apply it to the CMU motion capture dataset and perform an unsupervised action recognition task.

Setup and Spatiotemporal States of the Seq2Seq Model: The RNN Seq2Seq model is utilized for sequence prediction (synthesis of a new sequence based on an input sequence) or translation (mapping the input sequence to a new representation) (Fig. 1). For general time series prediction, given a sequence of spatiotemporal data as input, Seq2Seq predicts the future sequence. We stack the input sequence to the encoder as a matrix and call it the encoder input matrix X ∈ R^(Te×M), where each row x_t ∈ R^(1×M) is a time step of the input sequence at time t, Te is the number of time steps and M is the number of dimensions in the input data. Similarly, we construct the target sequences as the target output matrix Y ∈ R^(Td×M), where Td is the number of output sequence steps to be predicted.
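As a concrete illustration, a minimal sketch of how such input/target pairs could be sliced from a long recording (NumPy; the helper name make_pair and the slicing scheme are our own, not from the paper):

```python
import numpy as np

def make_pair(seq, Te, Td, start=0):
    """Slice a long (T, M) time series into an encoder input X (Te x M)
    and a prediction target Y (Td x M) beginning at `start`."""
    X = seq[start : start + Te]
    Y = seq[start + Te : start + Te + Td]
    return X, Y

# Example: a 2D series of 200 steps split into a 50/50 input/target pair.
seq = np.random.randn(200, 2)
X, Y = make_pair(seq, Te=50, Td=50)
```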

Figure 1: Seq2Seq architecture: We consider the inputs, encoder states, decoder states and outputs as spatiotemporal matrices X, E, D, Y respectively. Here the recurrent network block is a Gated Recurrent Unit (GRU).

In addition, forward propagation of the input in RNN Seq2Seq computes the internal states of the encoder at each time step. We concatenate and denote them as the encoder states matrix E ∈ R^(Te×N), where each row e_t ∈ R^(1×N) represents the states of all internal units at time t and N is the number of hidden units. Similarly, we define the decoder states matrix D ∈ R^(Td×N). Typically, there is an additional fully connected linear map transforming the decoder states to the dimensions of the output space. We denote the decoder output matrix Ŷ ∈ R^(Td×M), where ŷ_t is the predicted output at time t. Fig. 1 demonstrates the structure of the components of the RNN Seq2Seq matrices. In the figure, the encoder and the decoder are single layer GRU units that may or may not share parameters with each other, depending on the application. Our approach is applicable to general types of units such as LSTMs/GRUs and any number of layers. Additionally, as demonstrated, the decoder uses the output of the previous time step as the input to the current time step in both training and testing, except for the initial time step, in which it receives its input from the last step of the encoder. For the time series prediction task, the cost function is typically defined as the MSE between target outputs and decoder outputs, J = (1/Td) ∑_{t=1}^{Td} (ŷ_t − y_t)²; however, other norms or cost functions can be considered.
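The forward pass described above can be sketched as follows. This is a minimal, single-sequence PyTorch sketch under our own naming; the paper publishes no code, so the seeding of the decoder with the last input frame and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Single-layer GRU encoder-decoder that also returns the state
    matrices E (Te x N) and D (Td x N) used for the embedding."""
    def __init__(self, M, N):
        super().__init__()
        self.encoder = nn.GRUCell(M, N)
        self.decoder = nn.GRUCell(M, N)
        self.out = nn.Linear(N, M)   # linear map from states to outputs

    def forward(self, X, Td):
        # X: (Te, M) input sequence.
        h = X.new_zeros(1, self.encoder.hidden_size)  # zero initial state
        E = []
        for t in range(X.shape[0]):                   # encoder pass
            h = self.encoder(X[t:t+1], h)
            E.append(h)
        D, Y_hat = [], []
        y = X[-1:]               # assumed seed: last encoder input frame
        for _ in range(Td):      # decoder feeds back its own output
            h = self.decoder(y, h)
            y = self.out(h)
            D.append(h)
            Y_hat.append(y)
        return torch.cat(Y_hat), torch.cat(E), torch.cat(D)

model = Seq2Seq(M=2, N=16)
X = torch.randn(50, 2)
Y_hat, E, D = model(X, Td=50)
loss = ((Y_hat - X) ** 2).mean()   # MSE cost J (here target Y == X)
```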

Proper Orthogonal Decomposition of Spatiotemporal Matrices: Since forward propagation within RNN Seq2Seq can be represented through spatiotemporal matrices, we propose to apply the POD method as a dimensionality reduction algorithm to construct a low dimensional interpretable embedding (Shlizerman, Schroder, and Kutz 2012). Specifically, we use the Singular Value Decomposition (SVD) to decompose the matrices into orthogonal spatial modes (PCs), time dependent coefficients, and singular values (scaling) associated with each mode. In particular, given a matrix A ∈ R^(T×N) we center the matrix around the origin, obtaining A_c, and then apply SVD such that A_c = UΣVᵀ, where U ∈ R^(T×T) is an orthogonal matrix of time dependent coefficients, Σ ∈ R^(T×N) is the matrix of singular values, and V ∈ R^(N×N) is the matrix of spatially dependent components. To determine the number of PCs, we compute the singular value energy (SVE), SVE = ∑_{i=1}^{n} σ_i², where σ_i is the singular value corresponding to PC mode i. We compute the number of modes such that 90% or 99% of the energy is retained (SVE_k = ∑_{i=1}^{k} σ_i², with SVE_k/SVE ≥ p, where k is the number of retained modes and p is the energy percentage). With the number of dominant spatial features, we can truncate A_c by projecting it onto the PC modes to get a low dimensional matrix A_(PC=n) ∈ R^(T×n), where (PC=n) denotes that n principal components are retained. If we choose n = 2 or 3, we can visualize the representation in 2D or 3D. The axes are the orthogonal PC modes.
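In code, the centering, SVD, energy-threshold truncation and projection amount to a few NumPy lines (a sketch with our own function name; the 0.90/0.99 thresholds are the ones quoted above):

```python
import numpy as np

def pod_embedding(A, p=0.90):
    """Center A (T x N), apply SVD, and keep the smallest number of
    PC modes whose singular value energy reaches a fraction p."""
    Ac = A - A.mean(axis=0)                       # center around the origin
    U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)   # SVE_k / SVE
    k = int(np.searchsorted(energy, p)) + 1       # modes retaining p energy
    return Ac @ Vt[:k].T, Vt[:k], k               # A_(PC=k), basis, k

A = np.random.randn(100, 16)        # e.g., stacked encoder/decoder states
A_pc, basis, k = pod_embedding(A, p=0.99)
```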

Clustering: While visualization of projected dynamics can be informative (Maaten and Hinton 2008), 2D or 3D visualized dynamics do not reveal the intricacies in the representation of different datasets. In particular, here we would like to evaluate the separability of projections of distinct trajectories in the interpretable embedding space. We thereby propose to augment the embedding with clustering approaches such as K-means++, an extended version of the standard K-means algorithm (Arthur and Vassilvitskii 2007), or agglomerative clustering, a bottom-up hierarchical clustering approach. Since the number of trajectories and time steps are known, we can use the Adjusted Rand Index (ARI) to evaluate the clustering performance.
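Both clustering options and the ARI are available in scikit-learn; a sketch on placeholder projections (shapes mirror the two-type synthetic example below; note that recent scikit-learn, ≥ 1.2, uses metric= where older versions used affinity=):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

K = 2                                    # number of trajectory types
D_pc = np.random.randn(K * 50, 3)        # placeholder projected decoder states
labels = np.repeat(np.arange(K), 50)     # known type of every time step

pred = KMeans(n_clusters=K, init="k-means++", n_init=10).fit_predict(D_pc)
print("K-means++ ARI:", adjusted_rand_score(labels, pred))

# Agglomerative alternative with cosine similarity, as used later in the paper.
agg = AgglomerativeClustering(n_clusters=K, metric="cosine", linkage="single")
print("Agglomerative ARI:", adjusted_rand_score(labels, agg.fit_predict(D_pc)))
```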

Interpretable Embedding for Seq2Seq Networks

The generic goal of an RNN Seq2Seq model is to continue (predict) the evolution of each given input sequence chosen from a (test) dataset which includes K different types of multidimensional time series. Such a goal is challenging as it requires the network to generate a sequence which superimposes the typical dynamics of that particular type and the individual dynamics of the given tested input sequence. We show that the methods described above can be applied to construct an interpretable embedding for the RNN Seq2Seq model, depicted in Fig. 2.

To construct the embedding space basis, we concatenate the matrices E and D of each single forward propagation into a state matrix

S_k = [E_k; D_k] ∈ R^((Te+Td)×N),  S = [S_1; ...; S_K].

Concatenation of the forward propagation evolution for all considered time series in the dataset results in the global state matrix denoted as S. POD applied to S provides the PC modes, which are the axes of the interpretable embedding space. Dimension reduction of the embedding space is performed by considering only the n PC_n modes whose singular values account for a particular fraction of the total SVE (e.g. 90% or 99%) and truncating the rest of the modes.
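A sketch of this construction, reusing pod_embedding from above (K, Te, Td, N and the random placeholders are illustrative):

```python
import numpy as np

K, Te, Td, N = 3, 50, 50, 16
E_list = [np.random.randn(Te, N) for _ in range(K)]   # placeholder E_k
D_list = [np.random.randn(Td, N) for _ in range(K)]   # placeholder D_k

S_k = [np.vstack([E, D]) for E, D in zip(E_list, D_list)]  # S_k = [E_k; D_k]
S = np.vstack(S_k)                                         # global state matrix

_, basis, n = pod_embedding(S, p=0.99)   # PC_n axes of the embedding space
mean = S.mean(axis=0)
E_proj = [(E - mean) @ basis.T for E in E_list]   # encoder trajectories
D_proj = [(D - mean) @ basis.T for D in D_list]   # decoder trajectories
```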

To inspect the propagation of a single time series through the network, we can project the matrices E, D or S onto the low dimensional space spanned by the PC_n modes. As we show below, we find that the dimension of the embedding space can be very low even for high dimensional data. We depict the structure of such projections onto the PC3 embedding space in Fig. 2, bottom. Encoder trajectories (black, left) start from initial points projected to the embedding space (we choose initial states as zeros and therefore the trajectories are always initiated at the origin) and evolve in different directions in the space.

Figure 2: POD is performed on E_k and D_k, obtained for each type of data k and stacked to S. Three distinct trajectories are shown in the low dimensional embedding space. The encoder trajectories (black) start from the origin and diverge in different directions. The decoder trajectories (attractors; red) are hence placed in separate locations in the space. The last point of each encoder trajectory (green) connects the encoder and the decoder trajectories.

Decoder trajectories (red, middle) appear as attractors in the space, and clustering approaches (e.g. K-means) are used to determine how separable they are in the space. The encoder and the decoder connect via a single time step which corresponds to a point (green, right). The composition of the three types of projections corresponds to an interpretation of the propagation in RNN Seq2Seq. In particular, we show that the encoder trajectory takes the sequence from an initial point and evolves it to the corresponding starting point on the decoder trajectory (we call it an attractor or cluster). The decoder continues the evolution from there. As we show below, it appears that gradient descent training succeeds in organizing the decoder attractors in the embedding space such that they are easily clustered. Such an arrangement explains the uniqueness of RNN Seq2Seq training, in which the cost function minimizes the error between the decoder output and the actual output and there is no minimization on the encoder. Therefore, the decoder is trained to predict different features for different types of inputs (training to optimize clustering of types of data and capture unique features), with the encoder trajectory (and not only its last time step) being a sequential constraint that connects the cluster to the initial state. In practice, the longer the encoding sequence the better the prediction (clustering property), and such an interpretation is evident from the low dimensional embedding space as well.

It is possible to monitor the construction of and the changes in the embedding space and the projected trajectories within it during training. Specifically, at each training iteration i, for each type of data, we construct the matrices E_i, D_i, S_i and perform the aforementioned procedures.
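A hedged monitoring loop illustrating this procedure, reusing the earlier sketches (train_step and probe_sequences are hypothetical placeholders for one gradient update and a fixed set of probe inputs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

history = []
for i in range(5000):
    loss = train_step(model)          # hypothetical: one gradient update
    if i % 100 == 0:                  # periodically rebuild the embedding
        Ds = [model(X, Td=50)[2].detach().numpy() for X in probe_sequences]
        D_pc, _, _ = pod_embedding(np.vstack(Ds), p=0.99)
        truth = np.repeat(np.arange(len(Ds)), Ds[0].shape[0])
        pred = KMeans(n_clusters=len(Ds), n_init=10).fit_predict(D_pc)
        history.append((i, loss, adjusted_rand_score(truth, pred)))
```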


Figure 3: Projected trajectories in the embedding space spanned by PC3 for the encoder states, decoder states and decoder outputs matrices. Dashed and solid lines represent X_circle and X_ellipse respectively. Opaque and transparent points denote the starting and the ending points of the trajectory.

Interpretable Embedding Space Applied to Synthetic Data

We first create two types of synthetic 2D trajectories which follow the circumference of (i) a unit circle, X_circle, and (ii) an ellipse, X_ellipse. The encoder input is binned into Te time steps such that X_circle, X_ellipse ∈ R^(Te×2) with Te = 50, such that it completes a full period. Given full circle or ellipse dynamics as the encoder input, the objective is to predict another full circle or ellipse with the decoder, i.e., the target sequence Y is expected to be the same as the input sequence (Y ≡ X), and the decoder output matrix Ŷ has the same dimensions as Y and X. The cost function J = (1/Td) ∑_{t=1}^{Td} (ŷ_t − y_t)² is the MSE between the target and the prediction, with Td = Te. For both the encoder and the decoder we use a single layer GRU with 16 neurons (E, D ∈ R^(Te×16)) and the ADAM optimizer to train the model for 5000 iterations, where the cost function almost converges to zero such that the model can predict the trajectory with high accuracy.
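Generating such inputs is straightforward; a sketch (the ellipse semi-axes are our own illustrative choice, as the paper does not state them):

```python
import numpy as np

Te = 50
t = np.linspace(0.0, 2.0 * np.pi, Te, endpoint=False)  # one full period
X_circle = np.stack([np.cos(t), np.sin(t)], axis=1)    # (Te, 2) unit circle
a, b = 2.0, 1.0                                        # assumed semi-axes
X_ellipse = np.stack([a * np.cos(t), b * np.sin(t)], axis=1)
# Prediction target: another full revolution, i.e., Y == X.
Y_circle, Y_ellipse = X_circle.copy(), X_ellipse.copy()
```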

We show in Fig. 3 the projections of forward propagation in RNN Seq2Seq trained on (i) the circle, (ii) the ellipse, and (iii) both circle and ellipse. For each model, we obtain the PC3 embedding space from the matrix S and examine the projections of the encoder states matrix E and the decoder states matrix D onto the space; we also show the decoder output matrix Ŷ projected onto the x-y space. We observe that the projections of the encoder and the decoder states are deformed and do not necessarily preserve the same form as the input and the output; however, the linear transformation of the decoder deforms the trajectory to be correctly represented in the x-y space. Notably, all encoder projections start near the origin since our initial states are zero by default.

The last state of the encoder in NLP applications is typically considered to contain the information of all previous states, an assumption that gives rise to the attention-based Seq2Seq model. However, our results indicate that the full encoder sequence is important and the last encoder state simply provides a starting point for the decoder to continue from.

Figure 4: Snapshots of the evolution of trajectories projected to the interpretable embedding space as they undergo training. We use similar line identifiers and colors as in Fig. 3. At the beginning of the training, the two trajectories are very similar and positioned close to each other. As training proceeds, the two types of trajectories emerge and separate from each other.

Such an effect of the encoder is easily observed when considering continuous time series (as we show in the prediction of human movement data) and deviates from the interpretation of the encoder role in textual semantic sequences.

We focus on the model trained on both the circle and the ellipse to better understand how RNN Seq2Seq is able to predict both spatio-temporal series in an unsupervised way without any guidance. Projections onto the PC embedding space are shown in the right column of Fig. 3. Encoder projections indicate that the two trajectories, corresponding to distinct types of output sequences, start from points near the origin but diverge as the sequence evolves and end up at more distant points. The projected trajectories of the decoder states start from these distinct points and continue to perform prediction in completely separable shapes. Effectively, we observe that the decoder states projections are clustered in the embedding space. Application of agglomerative clustering with cosine similarity indeed verifies these observations.

To visualize how Seq2Seq learns to differentiate the two shapes with training, we keep track of the evolution of the states representation and depict in Fig. 4 the three projections as in Fig. 3 at iterations i = 10, 100, 1000, 3000. In the beginning of training (i = 10), all projections of the two shapes are similar. Projections of the decoder states appear to be distinct for a few initial points, but as the sequence evolves forward, the trajectories converge to the same point. When the model undergoes additional training, after i = 100, Seq2Seq appears to have learned a single pattern (ellipse like), but is unable to generate two distinct predicted shapes. At i = 1000 the two trajectories separate and both obtain accurate predictions. The evolution during training reveals that the model learns one general pattern first and then gradually develops two separable patterns.

Notably, during training the loss is inversely proportional to the clustering performance (ARI). This means that with training the RNN Seq2Seq model learns to recognize the type of the input sequence.


Figure 5: Top left: Decoder states embedded projection with one-hot encoding. Top right: The inverse correlation between loss (red) and clustering performance (blue) of the unit circle and ellipse, with one-hot encoding (solid line) and without (dashed line). Bottom left: decoder representation of centered (dashed) and shifted (dotted) circles. Bottom right: decoder representation of the same circle with different rotation frequencies: Te = 50 (dashed) and Te = 25 (dashed-dotted).

We conjecture that separability is correlated with successful training, and test the conjecture by repeating the same prediction task with an additional one-hot encoding label for the circle and the ellipse. Indeed, with the additional labels, the model learns these two sequences much faster and they appear separable in the embedding space (Fig. 5, top). We also verify that the RNN Seq2Seq model can encode both spatial and temporal features in the embedding space. For spatial features, we train the model to learn centered and shifted circles. While both trajectories are circles, RNN Seq2Seq and the projections of its decoder states to the embedding space distinguish the trajectories well (Fig. 5, bottom left). For temporal features, we train the network to predict two centered unit circles, sampled with different rates of Te = 50 and Te = 25 (different rotation speeds on the circle). While both circle trajectories coincide in x-y space, RNN Seq2Seq and the projections of its decoder states to the embedding space result in separable attractors with an apparent temporal feature difference between the first and second cycles (Fig. 5, bottom right). These investigations lead us to conclude that the embedding space can be effectively used with clustering to represent and assist in the evaluation of different types of learned spatial and temporal features.
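The one-hot augmentation used in this experiment can be sketched as appending a constant label to every frame of the input (the helper name and the encoding layout are illustrative assumptions):

```python
import numpy as np

def add_onehot(X, k, K=2):
    """Append a one-hot type label to each time step, turning an input
    X in R^{Te x M} into R^{Te x (M+K)}."""
    onehot = np.zeros((X.shape[0], K))
    onehot[:, k] = 1.0
    return np.hstack([X, onehot])

X_circle_lab = add_onehot(X_circle, k=0)    # from the earlier synthetic sketch
X_ellipse_lab = add_onehot(X_ellipse, k=1)
```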

Human Body Joints Movements Data

To test the proposed interpretable embedding methodology on realistic data, we use Human 3.6 Million (H3.6M), currently one of the largest publicly available motion capture data sets (Ionescu et al. 2014). It includes 7 actors performing 15 activities such as walking, sitting, and posing. Each movement is repeated in 2 different trials. We use different people for training and testing: the motion of 6 of the actors is used as the training set and the motion of the remaining actor as the testing set. Human pose is represented as an exponential map representation of each joint, with special preprocessing of global translation and rotation. The number of features of body joints dynamics is M = 54.

Figure 6: Different human joint movement types and their corresponding decoder states projected to the interpretable embedding space.

Figure 7: Left: Examples of the evolution of the number of dominant modes needed to reach 90% and 99% SVE in the encoder and decoder, for walking and sitting respectively. Right: evolution of the absolute changes of singular values along training, for the "walking" decoder states matrix.

For consistency with the previous synthetic example, we provide Te = 50 frames of input data and predict the next Td = 50 frames. As a result, the actual input matrix, actual output matrix and decoder output matrix are X ∈ R^(Te×54) and Y, Ŷ ∈ R^(Td×54). The Seq2Seq model was shown to be successful in such a prediction task (Martinez, Black, and Romero 2017). We use a similar setup, with a single layer GRU of N = 1024 units for both the encoder and the decoder. The encoder and decoder states matrices are E ∈ R^(Te×N) and D ∈ R^(Td×N). The cost function is the mean square error between the ground truth and the predicted output, J = (1/Td) ∑_{t=1}^{Td} (ŷ_t − y_t)². We use a batch size of B = 15 such that every training iteration can contain all actions. We train the model with various gradient descent based optimizers and show our results for ADAM (the fastest converging optimizer for this model). Here we show results with a network setup similar to previous work; however, our investigations indicate that our method is applicable to other variants of the RNN Seq2Seq model (e.g. non-parameter-sharing RNNs, multi-layer RNNs, etc.).
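Under the stated hyperparameters, training reduces to standard MSE regression (a sketch reusing the Seq2Seq class above; batching over the 15 actions is omitted for brevity, and sample_clip is a hypothetical data loader):

```python
import torch

model = Seq2Seq(M=54, N=1024)        # from the earlier sketch
opt = torch.optim.Adam(model.parameters())

for i in range(10000):
    X, Y = sample_clip()             # hypothetical: (50, 54) input/target pair
    Y_hat, E, D = model(X, Td=50)
    loss = ((Y_hat - Y) ** 2).mean() # cost J
    opt.zero_grad()
    loss.backward()
    opt.step()
```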

We visualize the projections onto the PC3 embedding space for every type of action. In order to see the pattern for the full action, we train the model for prediction and then continuously perform forward propagation on encoder and decoder sequences in a sliding window starting from the very beginning.


In Fig. 6, we show examples of the joints evolution (connected with lines) alongside the decoder states projection onto the PC3 embedding space. We clearly observe that each individual action has its own trajectory evolution (attractor) pattern, even in 3D. For example, as expected, walking data corresponds to a periodic circular pattern with a rather fixed period, while actions such as sitting correspond to non-periodic trajectories in the embedding space. To get a better understanding of what the appropriate dimension of the embedding space should be, we monitor the number of dominant PC modes needed to reach 90% and 99% SVE for the encoder and decoder matrices POD, for every type of action during training (Fig. 7).

We observe that a low number of dominant modes (< 10 and < 20 on average for 90% and 99% respectively) is needed to span most of the energy. The number of modes increases with training and with the energy threshold (accuracy of the embedding). The rate of the increase depends on the type of movement. For example, as shown in Fig. 7, the number of dominant modes for "sitting" is much smaller than for walking. Furthermore, the number of dominant modes required to reach 90% SVE increases more slowly than the number of modes required to reach 99%. These results indicate that the model learns additional details with training; however, most of the main features (those captured at 90% SVE) are learned in the first iterations. Such an observation is supported by analyzing the gradients of the singular values during training (Fig. 7, right). As training proceeds, additional singular values are gradually optimized. Notably, we observe that encoder states require more modes than decoder states. The reason stems from the encoder trajectory starting from the origin and connecting to the decoder attractors in various parts of the embedding space. Such trajectories hence include mixed characteristics of the attractor and of the path to it, resulting in more irregular trajectories requiring additional modes to be represented.

Table 1: Clustering results (ARI, %)

                 Training iterations   dim=3   dim=1024
Encoder states   10                    52      61
                 4000                  56      79
                 10000                 59      86
Decoder states   10                    29      32
                 4000                  97      100
                 10000                 94      96
Joints data      -                     59      80 (dim=54)

To understand how Seq2Seq forms distinct attractors and differentiates various actions, we construct the matrices E, D, Ŷ for all 15 actions for which the model was trained. We apply K-means++ clustering to D to evaluate the separability of Seq2Seq in the interpretable embedding space and the dimension of the space which provides an efficient clustering property (Table 1).

We compare clustering of the encoder and the decoder trajectories with respect to ARI at different training iterations and for different dimensions (dim=3 and dim=1024) of the embedding space. In addition, we cluster the body joints data (in dim=3 and dim=54). Only clustering of the decoder attractors is able to reach 100% ARI, for dim=1024 of the embedding space. The next best clustering performance is for the decoder attractors in dim=3 (97% ARI, which is significantly higher than the joints data for dim=3 and dim=54) at 4000 iterations. After this number of iterations the attractors start to approach each other instead of diverging. Encoder trajectories are not clustered well in any dimension and their clustering does not significantly change with training. In Fig. 8 we visualize the clusters (for joints coordinates, an overfitted trained model, and a model trained on all movements) in 3D. We show that even visually, the decoder attractors are scattered in the space, indicating an almost perfect clustering property of the PC3 space. Increasing the number of PCs to 10 is enough to achieve a perfect clustering result.

To understand how training shapes the decoder attractors and their clustering, we mark each distinct movement attractor with a different color and monitor their representation in the PC3 embedding space at various training iterations (Fig. 9). One of our key observations is that the clustering performance reaches a peak after considerable training (~4000 iterations), which is at the same time that the validation loss reaches its minimum point. After that, the validation loss increases as clustering becomes inferior. Such behavior is known as over-fitting and eventually leads to a model performing well on certain actions only. We demonstrate such a case by training Seq2Seq with more "walking" action data and then testing the model with all actions. Indeed, no matter what type of action is given as an input, the decoder states will always be similar to a circular trajectory that represents the "walking" action. Hence, we propose that the clustering property of the embedding space can be used as an indication of sufficient and optimal training of the model. To show the importance of the separability of clusters, similarly to the synthetic case, we compare the results with one-hot encoded data. We find that the model converges much faster and achieves stable high clustering performance very early. In both cases, the loss is strongly correlated with clustering performance, i.e., when overfitting starts to occur, clustering performance starts to deteriorate.

Unsupervised action recognition

To demonstrate the generality of and a practical application for our methodology, we utilize the embedding space of RNN Seq2Seq and its clustering property to perform unsupervised action recognition on the CMU human body motion capture dataset. We evaluate our methods in two cases: (i) we randomly choose sequences from up to eight different actions (walking, running, jumping, soccer, basketball, wash window, directing traffic and basketball signal) and concatenate them together (up to 14400 frames, 2 mins), following the same rule as (Li et al. 2018); see Fig. 10.


Figure 8: Visualization of clustering in 3D. Left: Joints coordinates projected data. Middle: Overfitting for walking data. Right: Decoder states at training iteration 2000.

Figure 9: Left: projection of D on the basis of S at different training iterations; each color represents an action type. Right: comparison of loss (red) with clustering (blue) during training, with one-hot encoding (dashed) and without one-hot encoding (solid).

(ii) Since manually concatenated sequences have discontinuities in the prediction, in the second case we choose trials 1 to 14 from subject 86¹, where each sequence contains continuous multiple actions with variable duration. On average each sequence contains 8000 frames, and each action sequence has manually annotated time segments (Barbic et al. 2004). For each of the two cases, we train various RNN Seq2Seq predictive models on the future sequence prediction task with no usage of labels in training. Our results indicate that the Adversarial Geometry-Aware Encoder-Decoder (AGED) model (Gui et al. 2018), shown to be one of the state-of-the-art motion prediction models based on an RNN Seq2Seq predictor, obtains the best prediction results. We therefore apply our embedding approach in conjunction with the trained AGED model to perform action recognition. Specifically, all sequences, which include a variety of movements composed using approach (i) or (ii), are scanned, with AGED forward propagation performed to collect the decoder states. The decoder states are then projected to the embedding space. Agglomerative clustering with single linkage and cosine similarity is then used to generate the clustering (labeling similar segments as belonging to a cluster).

¹ http://mocap.cs.cmu.edu/search.php?subjectnumber=86

Table 2: Clustering comparison

Method                                       Accuracy (%)
LRR (Liu, Lin, and Yu 2010)                  65.08
SSC (Elhamifar and Vidal 2009)               76.65
ACA (Zhou, De la Torre, and Hodgins 2008)    84.50
TSSC-LC (Clopton et al. 2017)                86.08
Ours                                         94.75

Figure 10: Interpretable embedding of decoder states facilitates unsupervised action clustering of a sequence of human activities with varying durations and individuals.

With the embedding space and the decoder attractors within it, unsupervised activity recognition of sequences can be performed. The only information required is the number of unique actions (clusters) that we would like to obtain (see Fig. 10 and the supplementary materials for examples). Testing the algorithm on the composed sequences produces the following performance: in case (i), the approach achieves above 98% frame-level accuracy on average; in case (ii), we evaluate accuracy at the frame level as well and compare with previously reported methods (Clopton et al. 2017). Our method reaches an accuracy of 94.75% on average and outperforms previous methods.
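The whole pipeline can be summarized in a short sketch (assuming a trained predictor with the interface of the earlier Seq2Seq sketch; the window stride and the per-window averaging of decoder states are our own simplifications, not the paper's exact procedure):

```python
import numpy as np
import torch
from sklearn.cluster import AgglomerativeClustering

def segment_actions(seq, model, Te=50, Td=50, stride=25, n_actions=8):
    """Scan a long (T, M) sequence, collect decoder states per window,
    embed them with POD, and cluster windows into action labels."""
    starts, feats = [], []
    for s in range(0, len(seq) - Te, stride):
        X = torch.as_tensor(seq[s:s + Te], dtype=torch.float32)
        _, _, D = model(X, Td)                         # decoder states (Td, N)
        feats.append(D.detach().numpy().mean(axis=0))  # one vector per window
        starts.append(s)
    D_pc, _, _ = pod_embedding(np.vstack(feats), p=0.99)
    agg = AgglomerativeClustering(n_clusters=n_actions,
                                  metric="cosine", linkage="single")
    return starts, agg.fit_predict(D_pc)               # one label per window
```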

Conclusion

We propose a novel construction of an interpretable embedding for the hidden states of the Seq2Seq model. The embedding clarifies the roles of the encoder and the decoder components of Seq2Seq: encoder embedded trajectories direct the evolution from the origin to the decoder trajectories, represented as attractors. Our findings indicate a remarkable property of these networks: a network trained to predict the future evolution of a sequence self-organizes the hidden units representation into separate identities. The identities are revealed through our proposed embedding and clustering. We demonstrate the construction and the utilization of the embedding space on both synthetic and human body joints datasets. We show that the embedding can be used to inspect training and determine the goodness of fit. Furthermore, we present an algorithm for unsupervised clustering of any spatio-temporal features. It utilizes training of Seq2Seq to predict future actions and analyzes the learned representation with the interpretable embedding to generate clustering. We show that such an approach allows performing action recognition on human body pose data, i.e. generating an unsupervised time segmentation and clustering of human movements. The algorithm achieves significantly more robust performance and accuracy than previously proposed approaches.


References

[Alain and Bengio 2016] Alain, G., and Bengio, Y. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.

[Arthur and Vassilvitskii 2007] Arthur, D., and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035. Society for Industrial and Applied Mathematics.

[Barbic et al. 2004] Barbic, J.; Safonova, A.; Pan, J.-Y.; Faloutsos, C.; Hodgins, J. K.; and Pollard, N. S. 2004. Segmenting motion capture data into distinct behaviors. In Proceedings of Graphics Interface 2004, 185–194. Canadian Human-Computer Communications Society.

[Cho et al. 2014] Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[Clopton et al. 2017] Clopton, L.; Mavroudi, E.; Tsakiris, D. M.; Ali, D. H.; and Vidal, D. R. 2017. Temporal subspace clustering for unsupervised action segmentation.

[Collins, Sohl-Dickstein, and Sussillo 2016] Collins, J.; Sohl-Dickstein, J.; and Sussillo, D. 2016. Capacity and trainability in recurrent neural networks. arXiv preprint arXiv:1611.09913.

[Elhamifar and Vidal 2009] Elhamifar, E., and Vidal, R. 2009. Sparse subspace clustering. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2790–2797. IEEE.

[Farrell et al. 2019] Farrell, M. S.; Recanatesi, S.; Lajoie, G.; and Shea-Brown, E. 2019. Dynamic compression and expansion in a classifying recurrent network. bioRxiv 564476.

[Foerster et al. 2017] Foerster, J. N.; Gilmer, J.; Sohl-Dickstein, J.; Chorowski, J.; and Sussillo, D. 2017. Input switched affine networks: An RNN architecture designed for interpretability. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1136–1145. JMLR.org.

[Fragkiadaki et al. 2015] Fragkiadaki, K.; Levine, S.; Felsen, P.; and Malik, J. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, 4346–4354.

[Graves, Mohamed, and Hinton 2013] Graves, A.; Mohamed, A.-r.; and Hinton, G. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE.

[Gui et al. 2018] Gui, L.-Y.; Wang, Y.-X.; Liang, X.; and Moura, J. M. 2018. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV), 786–803.

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

[Ionescu et al. 2014] Ionescu, C.; Papava, D.; Olaru, V.; and Sminchisescu, C. 2014. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7):1325–1339.

[Jain et al. 2016] Jain, A.; Zamir, A. R.; Savarese, S.; and Saxena, A. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5308–5317.

[Karpathy, Johnson, and Fei-Fei 2015] Karpathy, A.; Johnson, J.; and Fei-Fei, L. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.

[Kim et al. 2017] Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F.; and Sayres, R. 2017. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). arXiv preprint arXiv:1711.11279.

[Li et al. 2018] Li, C.; Zhang, Z.; Sun Lee, W.; and Hee Lee, G. 2018. Convolutional sequence to sequence model for human dynamics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Liu, Lin, and Yu 2010] Liu, G.; Lin, Z.; and Yu, Y. 2010. Robust subspace segmentation by low-rank representation. In ICML, volume 1, 8.

[Luong, Pham, and Manning 2015] Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

[Maaten and Hinton 2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.

[Martinez, Black, and Romero 2017] Martinez, J.; Black, M. J.; and Romero, J. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2891–2900.

[Recanatesi et al. 2018] Recanatesi, S.; Farrell, M.; Lajoie, G.; Deneve, S.; Rigotti, M.; and Shea-Brown, E. 2018. Signatures and mechanisms of low-dimensional neural predictive manifolds. bioRxiv 471987.

[Shlizerman, Schroder, and Kutz 2012] Shlizerman, E.; Schroder, K.; and Kutz, J. N. 2012. Neural activity measures and their dynamics. SIAM Journal on Applied Mathematics 72(4):1260–1291.

[Strobelt et al. 2019] Strobelt, H.; Gehrmann, S.; Behrisch, M.; Perer, A.; Pfister, H.; and Rush, A. M. 2019. Seq2Seq-Vis: A visual debugging tool for sequence-to-sequence models. IEEE Transactions on Visualization and Computer Graphics 25(1):353–363.

[Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.

[Zeiler and Fergus 2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 818–833. Springer.

[Zhou, De la Torre, and Hodgins 2008] Zhou, F.; De la Torre, F.; and Hodgins, J. K. 2008. Aligned cluster analysis for temporal segmentation of human motion. In 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, 1–7. IEEE.