
Semi-supervised Learning with Encoder-Decoder Recurrent Neural Networks: Experiments with Motion Capture Sequences

Félix G. Harvey (a), Christopher Pal (a)

(a) Département de génie informatique et génie logiciel, École Polytechnique Montréal, Montréal, Québec, Canada, H3T 1J4

Abstract

Recent work on sequence-to-sequence translation using Recurrent Neural Networks (RNNs) based on Long Short-Term Memory (LSTM) architectures has shown great potential for learning useful representations of sequential data. A one-to-many encoder-decoder(s) scheme allows a single encoder to provide representations serving multiple purposes. In our case, we present an LSTM encoder network able to produce representations used by two decoders: one that reconstructs, and one that classifies if the training sequence has an associated label. This allows the network to learn representations that are useful for both discriminative and reconstructive tasks at the same time. This paradigm is well suited for semi-supervised learning with sequences, and we test our proposed approach on an action recognition task using motion capture (MOCAP) sequences. We find that semi-supervised feature learning can improve state-of-the-art movement classification accuracy on a public MOCAP dataset, on which we defined a new realistic partition based on subjects. Further, we find that even when using only labeled data and a primarily discriminative objective, the addition of a reconstructive decoder can serve as a form of regularization that reduces over-fitting and improves test set accuracy.

Keywords: Semi-Supervised, LSTM, Encoder-Decoder, Action Recognition, MOCAP, HDM05

Email addresses: [email protected] (Félix G. Harvey), [email protected] (Christopher Pal)

Preprint submitted to Pattern Recognition January 27, 2020

arXiv:1511.06653v3 [cs.CV] 25 Jul 2016

1. Introduction

For many tasks, only a small amount of labeled data is available compared to a much larger amount of unlabeled data. In these cases, semi-supervised learning may be preferred to supervised learning, as it uses all the available data for training and has good regularization and optimization properties [12, 4]. A common technique for semi-supervised learning is to train in two phases: unsupervised pre-training, followed by supervised fine-tuning [18, 4, 12, 34]. The unsupervised pre-training task often consists of training a variant of an auto-encoder (e.g. a denoising auto-encoder) to reconstruct the data. This helps the network bring its initial parameters into a good region of the high-dimensional parameter space before the non-convex optimization of the supervised task. One feature of this work is that our model forces its parameters to stay in such regions as training proceeds, by having reconstructive objectives in one or two of its modules.

Recent advances in Recurrent Encoder-Decoder networks have given models the ability to perform both supervised learning [31, 9] and unsupervised learning [29]. These architectures are often based on the idea that recurrent neural networks (RNNs) can model temporal dependencies and that, when handling a sequence, the last hidden state of an RNN can contain information about the whole sequence. This representation therefore has a fixed length, even if sequences do not. The separation between the encoder and the decoder network(s) allows one to easily add, modify or re-purpose decoders for desired tasks. Using multiple decoders forces the encoder to learn rich, multipurpose representations. This can also allow semi-supervised training in a single phase. In our case, we combine the unsupervised pre-training phase with the supervised fine-tuning phase to jointly train a classifier decoder and a reconstruction decoder, which both use the representation provided by the encoder. We used a particular type of RNN known as a Long Short-Term Memory (LSTM) model to encode and decode sequences, and a multilayer perceptron to classify them. We also studied the effect of adding a per-frame decoder that uses frame representations provided by the frame encoder (see sections 4 and 5). We show that our architecture improves movement recognition accuracy on a newly defined realistic partitioning of a popular public Motion Capture (MOCAP) dataset.

Our main contributions are:

• The introduction of a novel architecture of the Recurrent Encoder-Decoder type that allows for supervised and unsupervised learning in a common model.

• The definition and execution of experiments using a more realistic validation and test set partitioning of a widely used public MOCAP dataset, thereby facilitating more informative future evaluations.

• An improvement over our implementation of current state-of-the-art techniques for action recognition on such well-defined experiments.

• The provision of further evidence of the benefits of semi-supervised learning, even when using a completely labeled dataset.

2. MOCAP

Motion capture (MOCAP) technologies allow one to track and save the movements of an actor wearing a special suit with multiple markers on it. The recorded positions of these markers at each timestep make it possible to apply these movements to virtual characters in order to simulate realistic motions. These technologies find use in multiple areas, such as video games, movies and health. As MOCAP is used more extensively in these applications, the ability to search easily through databases of sequences becomes desirable. While sequences could be labeled or annotated to facilitate search, in practice they are often not labeled, or labeled only at a coarse semantic level. A key element for labeling MOCAP sequences is human action recognition. This challenge can be seen as sequence classification, when only one action is performed in the sequence, or as sequence-to-sequence translation, when we want a fuller description of what is happening or when multiple actions are performed. This work focuses on sequence classification, even though it is a potential first step towards the long-term goal of sequence translation from MOCAP to natural language.

2.1. Datasets

One challenge with the application of deep learning to MOCAP data is the lack of strongly labeled data. For this work, we used the two biggest publicly available MOCAP datasets that we are aware of. The first is the HDM05 public dataset [25]. It contains 2329 labeled cuts that are very well suited for action recognition. There are about 100 classes of movements, which can be reduced to 65 when the number of repetitions or the side of the limb starting the action (left, right) is ignored. In this work, we use the same 65 classes defined by Cho & Chen [8]. The second dataset we use is the CMU Graphics Lab Motion Capture Database (footnote 1). This is, to our knowledge, the biggest public MOCAP dataset in terms of number of frames. It contains 2148 weakly labeled or unlabeled sequences. This dataset can hardly be used for supervised learning, as the labeling of sequences, if any, was only made to give high-level indications, and does not seem to have followed any stable conventions throughout the dataset. Works by Zhu et al. [35], Ijjina et al. [22], and Barnachon et al. [2] all use different custom class definitions to get some quantitative results on CMU for classification. In the present work, we use this dataset for unsupervised learning only. Table 3 gives more information on the datasets used.

2.2. Previous Work on Action Recognition

Interesting work has been done on action recognition in MOCAP sequences in recent years, much of which depends on well-designed, hand-crafted features.

For example, Chaudhry et al. [6] created bio-inspired features based on the findings of Hung et al. [21] on the neural encoding of shapes and, using Support Vector Machines (SVMs), obtained good results on the classification of 11 actions from HDM05. Ijjina et al. [22] use joint distance metrics based on domain knowledge (about actions found in a particular dataset) to create principled features that are then used as inputs to a neural classifier (pre-trained as a stacked auto-encoder). They reach good accuracy for 3 custom classes in the CMU dataset. Using this prior domain knowledge especially helps when the dataset is somewhat specialized and may contain actions of a certain type. However, if the goal is to have a generic action classifier that handles at least as many actions as are found in HDM05, it might be more appropriate to learn those features with a more complex architecture. Barnachon et al. [2] use a learnt vocabulary of key poses (based on a K-Means variant) and use distances between histograms of sub-actions in order to classify ongoing actions. They present good accuracy (96.67%) on a custom subset of 33 actions from HDM05 (where training samples are taken at random). In our case, we wish to perform classification on the 65 HDM05 actions.

Footnote 1: http://mocap.cs.cmu.edu/

End-to-end neural approaches have also been tried on HDM05 and CMU, in which case discriminating features are learnt throughout the training of a neural network. Cho & Chen [8] obtained good movement classification rates on simple sequences (cuts) of the HDM05 dataset using a simple Multi-Layer Perceptron (MLP) and auto-encoder hybrid. Chen & Koskela [7] tested multiple types of features, using a fast technique called Extreme Learning Machines, to classify, again, HDM05 cuts. Results were very good in both cases, with accuracies of over 95% and 92% with 65 and 40 action classes respectively. Their models were trained at the frame level, and sequence classification was done by majority voting. Other work by Du et al. [11] treated the classification of simple sequences, with the same action classes as Cho & Chen [8], using a hierarchical network that handles parts of the body separately in its first layer (i.e. torso, arms and legs), and concatenates some of these parts in each layer until the whole body is treated in the last hidden layer. They worked with RNNs to use context information, instead of concatenating features of some previous frames at each timestep like Cho & Chen [8] and Chen & Koskela [7]. This led to better results, and their classification accuracy on simple sequences reached 96.92%. Finally, Zhu et al. [35] have a similar but less constrained recurrent architecture that is regularized by a weight penalty based on the $\ell_{2,1}$ norm (Cotter et al. [10]), which encourages parts of the network to focus on the most meaningful interactions between joints or features. They report 97.25% accuracy on HDM05 for classification of simple sequences, with 65 classes.

2.3. Defining a Good Test Set

Based on their results, the aforementioned methods seemingly solve the problem of action recognition on the HDM05 dataset. However, closer inspection reveals an underlying problem with these results (except for those of Chaudhry et al. [6]), which lies in the definition of the validation and test sets. The experiments were performed using 10-fold cross-validation with 10 balanced partitions of shuffled sequences. This means that takes or frames recorded with a particular actor could be found in the training set as well as in the test set. This configuration is therefore unrepresentative of typical realistic situations, where new takes are recorded with new actors. If an actor is asked to repeat the same movement five times in five different takes, these takes will probably be very similar, and shuffling the frames or sequences will introduce an undesired bias into the test set. Chaudhry et al. [6], on the other hand, isolate their validation and test sets based on the subjects performing the actions. This type of evaluation is more representative of reality and a better measure of a given method's generalization performance. Moreover, it seems that in previous works [8, 7, 11, 35] there is no proper test set: each fold is used for early stopping as well as for evaluating the network. Having an unseen test set would be a better way to assess the generalization capacity of the network.

Table 1: Accuracies (Acc.) with different test sets, using techniques from Cho & Chen [8], Du et al. [11] and Zhu et al. [35].

TECHNIQUE     TEST SET                       ACC. (%)
CHO & CHEN    RANDOM 10%, BALANCED           95.61
CHO & CHEN    RANDOM 40%, BALANCED           94.13
CHO & CHEN    ACTORS ['TR', 'DG']            64.36
CHO & CHEN    ACTORS ['TR', 'DG'] - (PP)     81.64
DU ET AL.     RANDOM 10%, BALANCED           92.98
DU ET AL.     ACTORS ['TR', 'DG'] - (PP)     70.63
ZHU ET AL.    RANDOM 10%, BALANCED           94.53
ZHU ET AL.    ACTORS ['TR', 'DG'] - (PP)     81.64

Table 1 shows the results of our attempts at re-creating the results of previous state-of-the-art works [8, 11, 35] on the HDM05 dataset. It also shows accuracies for these techniques when using partitions based on actors. In this setting, we use the actors with initials 'tr' and 'dg' as test subjects, and the actor 'bk' as a validation subject. In this scenario, we always train the network once on the three partitions (with early stopping w.r.t. subject 'bk'), then start a new training run with the training and validation sets combined. In this second phase, early stopping is done when accuracy on the previous training set reaches the same level as in the first phase. This is done in order to maximize the amount of training data while preventing over-fitting. Moreover, since using three out of five actors from HDM05 for validation and testing leaves about 40% of the sequences in the training set, we tested the method of Cho & Chen [8] again with a balanced, shuffled partition having the same proportions (40%, 20%, 40%) for each set, to see whether this was the only factor behind the declining results. Finally, we applied our own pre-processing (PP) of the data to these techniques with our newly defined actor-based partitions, to make fairer comparisons later. The main difference in our pre-processing is that we allow the hips of the actor not to be always parallel to the floor. This makes it easier to recognize some movements, such as the cartWheel motion shown in Figure 1. More details on pre-processing are given in section 5.1. As we can see, results using a realistic partition of HDM05 are significantly lower, but our own pre-processing method helps. Since the techniques of Cho & Chen [8] and Zhu et al. [35] yielded the best results with our actor-based partitions and our pre-processing method, the baseline for the rest of this work will be the 81.64% accuracy reached by those methods.
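The actor-based partition described in this section can be sketched as follows. This is a minimal illustration and not the authors' code: the `(actor, label, sequence)` record layout and the name `split_by_actor` are assumptions, and the training actor initials in the toy records are placeholders. Only the test initials 'tr' and 'dg' and the validation initial 'bk' come from the text.

```python
from collections import defaultdict

TEST_ACTORS = {"tr", "dg"}   # test subjects named in the text
VALID_ACTORS = {"bk"}        # validation subject named in the text


def split_by_actor(records):
    """Partition (actor, label, sequence) records so that no actor spans two sets."""
    parts = defaultdict(list)
    for actor, label, sequence in records:
        if actor in TEST_ACTORS:
            parts["test"].append((actor, label, sequence))
        elif actor in VALID_ACTORS:
            parts["valid"].append((actor, label, sequence))
        else:
            parts["train"].append((actor, label, sequence))
    return parts


# Toy records; "a1" and "a2" are placeholder initials for the training actors.
records = [
    ("a1", "walk", [0.1, 0.2]),
    ("a2", "walk", [0.1, 0.3]),
    ("bk", "jump", [0.5, 0.4]),
    ("tr", "walk", [0.2, 0.2]),
    ("dg", "jump", [0.6, 0.4]),
]
parts = split_by_actor(records)
train_actors = {actor for actor, _, _ in parts["train"]}
test_actors = {actor for actor, _, _ in parts["test"]}
```

Splitting on the actor field rather than on shuffled sequences is what keeps near-identical repetitions by the same person out of both the training and test sets.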

Figure 1: Comparison of pre-processing methods for a cartWheel movement from HDM05. Top: same as Cho & Chen [8]. Bottom: our own.

3. Recurrent Neural Networks Review

We briefly describe here the building blocks of our architecture, which are recurrent neural networks and associated advances.

3.1. Vanilla Recurrent Neural Networks

At their core, RNNs are artificial neural networks in which hidden layer units have connections to themselves through time. This means that at each timestep, hidden layers receive the lower layers' current outputs as well as their own output from the previous timestep. This allows the network to use past context information to better model temporal sequences, which is why RNNs have often proven to be very powerful on multiple sequential problems, such as speech recognition [15, 28, 14], handwriting recognition [17], text generation [30, 16], or in our case MOCAP action recognition [11, 35]. The forward pass of the hidden layer of an RNN, to get the hidden state $h_t$ at time $t$, is very similar to that of an ordinary feed-forward neural net, except for the added previous hidden state $h_{t-1}$ as an input:

$$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{1}$$

where $W_{xh}$ and $W_{hh}$ are weight matrices and $b_h$ is the bias vector for the hidden layer. The operator $\sigma()$ is a differentiable activation function such as a sigmoid or a tanh operation.
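As a concrete reading of equation (1), the recurrence can be written in a few lines of NumPy. This is only a sketch under stated assumptions: tanh is picked as the activation, the initial hidden state is taken to be zero (the text does not specify it), and the function names, toy sizes and random weights are illustrative.

```python
import numpy as np


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of equation (1), with tanh chosen as the activation."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)


def rnn_forward(xs, W_xh, W_hh, b_h):
    """Apply the recurrence over a (T, n_in) sequence from a zero initial state."""
    h = np.zeros(W_hh.shape[0])  # assumed initial state
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return np.stack(states)  # shape (T, n_hidden)


# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n_in, n_hidden, T = 3, 4, 5
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)
xs = rng.normal(size=(T, n_in))
states = rnn_forward(xs, W_xh, W_hh, b_h)
```

The last row of `states` is the fixed-length vector that an encoder would pass on as a summary of the sequence.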

3.2. Bi-Directional Recurrent Networks

RNNs make use of past context in order to model a sequence up to a certain point. In many problems, however, future context may be available and useful. In those cases, using bi-directional RNNs (BRNNs) can make the network more powerful. The hidden layers of BRNNs have two sets of units. One set handles the sequence in chronological order, while the other handles it in reverse order. The output of such a layer is the concatenation of the hidden activations of both sets. Outputs of bi-directional recurrent layers can therefore contain information about the whole sequence (past and future) at each timestep.
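The concatenation of forward and backward activations can be illustrated with a small NumPy sketch. It assumes simple tanh recurrent units and zero initial states rather than the LSTM cells used later in the paper, and all names, sizes and weights are hypothetical.

```python
import numpy as np


def run_rnn(xs, W_xh, W_hh, b_h):
    """Plain tanh recurrence over a (T, n_in) sequence, zero initial state."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        out.append(h)
    return np.stack(out)


def bidirectional_layer(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at each timestep."""
    h_f = run_rnn(xs, *fwd_params)               # chronological order
    h_b = run_rnn(xs[::-1], *bwd_params)[::-1]   # reverse order, re-aligned to time t
    return np.concatenate([h_f, h_b], axis=1)    # shape (T, 2 * n_hidden)


rng = np.random.default_rng(0)
n_in, n_hidden, T = 3, 4, 6


def random_params():
    return (rng.normal(scale=0.5, size=(n_hidden, n_in)),
            rng.normal(scale=0.5, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden))


fwd, bwd = random_params(), random_params()
xs = rng.normal(size=(T, n_in))
out = bidirectional_layer(xs, fwd, bwd)
```

Changing only the last frame of `xs` leaves the forward half of earlier outputs untouched but alters their backward half, which is the sense in which each timestep sees both past and future.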

Figure 2: Unfolded bi-directional RNN. $h^f$ and $h^b$ are the hidden states of the forward and the backward sets respectively. The [] symbol represents the concatenation of both hidden states at time $t$. Based on figure 3.5 in [16].

3.3. Long-Short-Term Memory

One known problem with RNNs is that they can be very hard to train to model long-term dependencies because of the vanishing gradient problem [3, 19]. One of the most popular and effective methods to counter this problem is the use of Long Short-Term Memory networks (LSTMs), as presented by Hochreiter & Schmidhuber [20]. Instead of simple hidden units, these recurrent networks have memory cells with input, output, and forget gates that determine whether information is added to, released from, and kept in the cell at each timestep. This enables the recurrent network to keep past context information internally for a long time, therefore allowing it to model long-term dependencies. One LSTM cell is shown in Figure 3. In our context, we do not use in-cell connections (also called peepholes), as they have not been found to be useful in recent experiments [5]. Therefore, gates, cell values, and hidden outputs are calculated as follows:

$$i = \mathrm{sigmoid}(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \tag{2}$$

$$o = \mathrm{sigmoid}(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \tag{3}$$

$$f = \mathrm{sigmoid}(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \tag{4}$$

$$c_t = f \odot c_{t-1} + i \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{5}$$

$$h_t = o \odot \tanh(c_t) \tag{6}$$

where $W$ and $b$ are weight matrices and bias vectors respectively (footnote 2). The operation $\odot$ is an element-wise multiplication.
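Equations (2) to (6) translate almost line for line into NumPy. The sketch below assumes a dictionary of parameters keyed by names such as `W_xi`; this layout, the helper `init_params` and the toy sizes are illustrative choices, not part of the original work.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step without peepholes, following equations (2)-(6)."""
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])   # input gate, eq. (2)
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])   # output gate, eq. (3)
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])   # forget gate, eq. (4)
    c_t = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # eq. (5)
    h_t = o * np.tanh(c_t)                                         # eq. (6)
    return h_t, c_t


def init_params(n_in, n_hidden, rng):
    """Small random weights and zero biases, for illustration only."""
    p = {}
    for gate in "iofc":
        p[f"W_x{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p[f"W_h{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    return p


rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hidden=4, rng=rng)
h_t, c_t = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), params)
```

Because the cell state is updated additively through the forget gate, a gate value close to 1 lets `c_prev` pass through nearly unchanged, which is how the cell retains context over many timesteps.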

3.4. Recurrent Encoder-Decoders

A major advantage and key attribute of Recurrent Encoder-Decoders is their ability to transform variable-length sequences into a fixed-size vector in the encoder, then use one or more decoders to decode this vector for different purposes. Using an RNN as an encoder allows one to obtain this representation of the whole input sequence. Cho et al. [9] as well as Sutskever et al. [31] have used this approach for sequence-to-sequence translation, with some differences in the choice of hidden units and in the use of an additional summary vector (and set of weights) in the case of Cho et al. [9]. Both approaches need an end-of-sequence symbol to allow input and target sequences to have different lengths. They are trained to maximize the conditional probability of the target sequence given the input sequence. Our approach is more closely related to the one used by Srivastava et al. [29], in which they perform unsupervised learning by either reconstructing the sequence, predicting the next frames, or both.

Footnote 2: We will use these same two symbols ($W$ and $b$) without re-defining them throughout the rest of this paper, as they will always have the same definition even though they do not contain the same values nor have the same dimensions in each layer.

Figure 3: The LSTM cell used for this work. It is similar to figure 4.2 in [16], but without in-cell connections (peepholes).

Figure 4: The FR-SRC variant of the architecture studied. This network produces 3 types of outputs w.r.t. a sequence $X = [x_1, ..., x_T]$ and its parameters $\theta$. The set $\theta_{SC}$ includes all the weights and biases used to compute class probabilities. The hidden states of the frame encoder, sequence encoder and sequence reconstructive decoder are denoted here by $h^{FE}$, $h^{SE}$ and $h^{SD}$ respectively. The sequence representation $c$ is created with the hidden state of the sequence encoder at time $T$, and $h^c$ represents the sequence classifier's fully connected layers (the softmax activation is not explicitly shown here).

4. Our Model

4.1. The Architecture

Figure 4 shows an overview of the FR-SRC variant of the proposed architecture. The model is composed of 5 main components: a per-frame encoder, a per-frame reconstructive decoder, a sequence encoder, a sequence reconstructive decoder, and a sequence classifier. Each decoder, along with the classifier, produces an output used to calculate a cost. Each of these components is added to produce ever more meaningful features as we go up the layers, by having multiple costs influence different modules more directly, in a way loosely similar to ladder networks [26]. Another module that we have used for some experiments and that is not shown in the figure is the per-frame classifier, which tries to classify the action based on single frames. This module takes the per-frame encodings to produce probabilities of actions.

The frame auto-encoder's role is to learn robust per-frame features in an unsupervised manner by reconstructing the clean version ($x_t$) of a corrupted frame ($\tilde{x}_t$) at time $t$ [33]. The reconstructive error ($l_{FR,t}$) we use is the well-known mean squared error. We apply it to each frame, then average over the frames to get $l_{FR}$:

$$h^{FE}_t = z(\tilde{x}_t) \tag{7}$$

$$\hat{x}_t = g(h^{FE}_t) \tag{8}$$

$$l_{FR,t} = \frac{1}{2} \| \hat{x}_t - x_t \|^2 \tag{9}$$

$$l_{FR} = \frac{1}{T} \sum_{t=1}^{T} l_{FR,t} \tag{10}$$

In equation 7, $z()$ is the encoding function learnt by the bottom feed-forward layers of the per-frame auto-encoder, while $g()$ (eq. 8) is the decoding function of the module, learnt by its upper layers. In further equations, $H^{FE}$ will stand for the sequence of features $[h^{FE}_1, ..., h^{FE}_T]$, and we will drop the corruption sign over $x$ ($\tilde{x}$), as we will show equations for a test setting, where the frames are not corrupted.
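The denoising objective of equations (7) to (10) can be sketched as below. The function name and the single tied-weight tanh layer used as the toy encoder and decoder are assumptions made for brevity. The 69-dimensional frames and the noise standard deviation of 0.05 are the values given in section 5.1.

```python
import numpy as np


def frame_reconstruction_loss(X, z, g, noise_std, rng):
    """Mean over frames of 0.5 * ||g(z(x_noisy)) - x_clean||^2, eqs. (7)-(10)."""
    X_noisy = X + rng.normal(scale=noise_std, size=X.shape)  # corrupted frames
    H = z(X_noisy)                                           # eq. (7)
    X_hat = g(H)                                             # eq. (8)
    per_frame = 0.5 * np.sum((X_hat - X) ** 2, axis=1)       # eq. (9)
    return per_frame.mean()                                  # eq. (10)


# Toy single-layer encoder/decoder with tied weights (illustrative only).
rng = np.random.default_rng(0)
T, dim, n_hidden = 30, 69, 8
W = rng.normal(scale=0.1, size=(dim, n_hidden))


def encode(frames):
    return np.tanh(frames @ W)


def decode(features):
    return np.tanh(features @ W.T)


X = rng.uniform(-1.0, 1.0, size=(T, dim))
loss = frame_reconstruction_loss(X, encode, decode, noise_std=0.05, rng=rng)
```

Note that the target of the reconstruction is the clean frame `X`, while only the encoder input is corrupted.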

The per-frame classifier uses $h^{FE}_t$ as an input to yield belief values on movement classes for every frame:

$$a_t = W h^{FE}_t + b \tag{11}$$

These activations are then summed over all frames and a softmax operation is applied to the result, yielding class probabilities $P(y_k)$ given all the frames $x_t$ of the whole sequence $X$, and the parameters of the frame encoder $\theta_{FE}$:

$$a_f = \sum_{t=1}^{T} a_t \tag{12}$$

$$P(y_k | X, \theta_{FE}) = s_{f,k} = \frac{e^{a_{f,k}}}{\sum_{i=1}^{K} e^{a_{f,i}}} \tag{13}$$

This is similar to the operation used by Du et al. [11] to classify sequences based on a sequence of activations, but differs in that we do not use outputs from recurrent layers.

We then use the negative log-likelihood of the correct class as our frame-based classification error ($l_{FC}$):

$$l_{FC} = -\log(P(Y = y_k | X, \theta_{FE})) \tag{14}$$

The combination of the frame auto-encoder and the frame classifier gives something very similar to Cho & Chen's [8] approach, except that each frame's input does not contain information about a previous frame. When per-frame reconstruction is not used, the model still encodes frames with $z()$ before outputting probabilities with a softmax.
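A short NumPy sketch of equations (11) to (14) follows. The function names, toy sizes and random inputs are assumptions, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the softmax.

```python
import numpy as np


def sequence_class_probs(H_fe, W, b):
    """Sum per-frame linear activations over time, then apply a softmax."""
    a_t = H_fe @ W.T + b       # eq. (11), one row of activations per frame
    a_f = a_t.sum(axis=0)      # eq. (12)
    a_f = a_f - a_f.max()      # stability shift; the softmax output is unchanged
    e = np.exp(a_f)
    return e / e.sum()         # eq. (13)


def frame_classification_loss(H_fe, W, b, k):
    """Negative log-likelihood of the correct class k, eq. (14)."""
    return -np.log(sequence_class_probs(H_fe, W, b)[k])


# Toy example: T frames of n_feat features, K classes (illustrative sizes).
rng = np.random.default_rng(0)
T, n_feat, K = 10, 6, 5
H_fe = rng.normal(size=(T, n_feat))
W = rng.normal(scale=0.1, size=(K, n_feat))
b = np.zeros(K)
probs = sequence_class_probs(H_fe, W, b)
loss = frame_classification_loss(H_fe, W, b, k=2)
```

Summing activations before the softmax means every frame contributes evidence to a single sequence-level decision, instead of voting with separate per-frame predictions.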

The LSTM encoder's purpose is to encode the whole sequence of learnt features into a fixed-length summary vector that models temporal dependencies, and which can be used for supervised or unsupervised tasks.

$$c(X) = \tanh(W_{sc} h^{SE}_T + b) \tag{15}$$

Here, $c(X)$ is the output of a fully connected layer that has the weight matrix $W_{sc}$. It uses the last hidden state of the LSTM encoder $h^{SE}_T$ as an input. The encoder itself takes $H^{FE}$ as an input sequence. See equations 2 to 6 for calculations of the hidden state of the LSTM encoder.

If the sequence reconstructive decoder is present, it learns to reconstruct the feature sequence $H^{FE}$ that was fed to the LSTM encoder. As explained by Srivastava et al. [29], the LSTM decoder can use its own previous prediction at each timestep to predict the current output, making it a conditional decoder. This is what we use in this work. As we can see from Figure 4, the summary vector $c(X)$ is also fed to the LSTM decoder at each timestep. This vector can therefore serve multiple purposes, and it is up to the network to learn how to use it, even though we can guide it by assigning weights to the different costs. With the output $\hat{H}^{FE} = [\hat{h}^{FE}_1, ..., \hat{h}^{FE}_T]$ of the decoder, we can calculate our feature sequence reconstruction error ($l_{SR}$):

$$h^{D}_t = \tanh(W_{ih} \hat{h}^{FE}_{t-1} + W_{hh} h^{D}_{t-1} + W_{ch} c(X) + b) \tag{16}$$

$$\hat{h}^{FE}_t = \tanh(W h^{D}_t + b) \tag{17}$$

$$l_{SR,t} = \frac{1}{2} \| \hat{h}^{FE}_t - h^{FE}_t \|^2 \tag{18}$$

$$l_{SR} = \frac{1}{T} \sum_{t=1}^{T} l_{SR,t} \tag{19}$$

In our case, $\hat{h}^{FE}_0$ is initialized to a zero vector to handle the first frame ($t = 1$).

The sequence classifier is a simple feed-forward MLP that outputs class probabilities based on the summary vector. This is the main task of interest, and the sequence classifier is therefore always used in our experiments. We again use the negative log-likelihood as the sequence classification error ($l_{SC}$):

$$h^{C} = W c(X) + b \tag{20}$$

$$a_{seq} = W h^{C} + b \tag{21}$$

$$P(y_k | X, \theta_{SC}) = s_{seq,k} = \frac{e^{a_{seq,k}}}{\sum_{i=1}^{K} e^{a_{seq,i}}} \tag{22}$$

$$l_{SC} = -\log(P(Y = y_k | X, \theta_{SC})) \tag{23}$$

Using a sequence classification ratio $r$, we can define different models with different loss functions, enabling some or all of the modules of the architecture. To emphasize the task of interest, we always weigh the sequence classification error from the summary vector ($l_{SC}$) against the mean of the other errors used, as shown in Table 2. Setting $r$ to 1 results in a Sequence Classifier (SC) network only. Adding feature sequence reconstruction to this model yields a Sequence Reconstructive Classifier (SRC). Adding frame reconstruction to the SC instead gives a Frame Reconstructive-Sequence Classifier (FR-SC), while adding frame reconstruction to the SRC yields a Frame Reconstructive SRC (FR-SRC). Finally, using all modules gives a Frame Reconstructive Classifier-SRC (FRC-SRC).

Since these loss functions, as well as all activation functions of the network, are differentiable with respect to each of its parameters, we can employ stochastic gradient descent (SGD) and back-propagation through time (BPTT) to train the network.

Table 2: Variants of the architecture and their loss functions.

MODEL      LOSS FUNCTION
SC         l_SC
SRC        r * l_SC + (1 − r) * l_SR
FR-SC      r * l_SC + (1 − r) * l_FR
FR-SRC     r * l_SC + (1 − r) * (l_SR + l_FR) / 2
FRC-SRC    r * l_SC + (1 − r) * (l_FC + l_SR + l_FR) / 3
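The loss functions of Table 2 all follow one pattern: the sequence classification error weighted by r, plus the mean of the other active errors weighted by (1 − r). A minimal sketch of that pattern is given below; the names `combined_loss`, `variant_loss` and `AUXILIARY` and the example loss values are hypothetical.

```python
# Auxiliary losses used by each variant, following Table 2.
AUXILIARY = {
    "SC": (),
    "SRC": ("l_SR",),
    "FR-SC": ("l_FR",),
    "FR-SRC": ("l_SR", "l_FR"),
    "FRC-SRC": ("l_FC", "l_SR", "l_FR"),
}


def combined_loss(r, l_sc, other_losses=()):
    """r * l_SC + (1 - r) * mean(other losses); with no other losses, just l_SC."""
    if not other_losses:
        return l_sc
    return r * l_sc + (1.0 - r) * sum(other_losses) / len(other_losses)


def variant_loss(model, r, losses):
    """Total loss for a named variant, given a dict of individual loss values."""
    aux = [losses[name] for name in AUXILIARY[model]]
    return combined_loss(r, losses["l_SC"], aux)


# Made-up loss values, for illustration only.
losses = {"l_SC": 2.0, "l_SR": 1.0, "l_FR": 3.0, "l_FC": 2.0}
fr_src = variant_loss("FR-SRC", 0.75, losses)   # 0.75 * 2.0 + 0.25 * (1.0 + 3.0) / 2
```

Averaging the auxiliary errors keeps the weight of the classification term equal to r no matter how many reconstruction or frame-level modules are enabled.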

5. Experiments

5.1. Data

The data in these experiments come from the open HDM05 and CMU MOCAP datasets. Both are recorded at 120 frames per second (fps) and contain the positions of more than 30 markers. In our case, we use 23 markers common to the two datasets. We work with the C3D file format, which contains a series of positions for each marker. Our pre-processing of the data consists mainly of orienting, centering and scaling the point cloud of every frame given by the files. The orientation process is a change of basis of all 3D positions so that the actor's hips always face the same direction, while allowing them not to be parallel to the floor. We then center the hips of the actor at position (0, 0, 0) and scale so that every marker is always in the interval [−1, 1]. This helps in handling actors of different sizes. Since we use 23 markers, each frame vector has a dimension of 69. To speed up training, we use only 1 frame out of 4 to create shorter but still fluid sequences, yielding a rate of 30 frames per second. Other specifications at this frame rate are shown in Table 3.

Table 3: Specifications of the two publicly available datasets used, when the frame rate is reduced to 30 fps.

                         HDM05    CMU
NUMBER OF SEQUENCES      2 329    2 531
MIN. LENGTH (FRAMES)     6        3
MEAN LENGTH (FRAMES)     66       467
MAX LENGTH (FRAMES)      226      5 737
NUMBER OF ACTORS         5        144

Our final experiments using sliding windows use windows of 30 frames (1 second) with an offset of 15 frames. Preliminary tests were conducted with shorter and longer windows on a subset of HDM05, and 30 frames seemed to be the best choice, although the choice is not critical. When classifying sequences longer than 30 frames, we use a simple majority voting strategy over the windows to select the movement class. In all experiments, we add Gaussian noise with mean 0 and a standard deviation of 0.05 to the markers' positions for training. We use minibatches of size 8 when handling HDM05 data only, and minibatches of size 32 when using CMU and HDM05. Sequences shorter than 30 frames are zero-padded. We use binary masks so that outputs and costs are computed on valid frames only.
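The frame-rate reduction, windowing, padding and voting steps described above can be sketched as follows. The window length (30), offset (15) and subsampling factor (4) come from the text. The function names are hypothetical, and dropping an incomplete trailing window on long sequences is an assumption, since the text does not say how the tail is handled.

```python
from collections import Counter

import numpy as np

WINDOW, OFFSET, STEP = 30, 15, 4   # values stated in the text


def downsample(frames, step=STEP):
    """Keep 1 frame out of `step` (120 fps becomes 30 fps for step 4)."""
    return frames[::step]


def sliding_windows(frames, window=WINDOW, offset=OFFSET):
    """Return (windows, masks); a sequence shorter than `window` is zero-padded."""
    T, dim = frames.shape
    if T < window:
        padded = np.zeros((window, dim))
        padded[:T] = frames
        mask = np.zeros(window, dtype=bool)
        mask[:T] = True                      # True marks valid (non-padded) frames
        return padded[None], mask[None]
    starts = range(0, T - window + 1, offset)   # assumption: incomplete tail is dropped
    windows = np.stack([frames[s:s + window] for s in starts])
    masks = np.ones(windows.shape[:2], dtype=bool)
    return windows, masks


def majority_vote(window_predictions):
    """Pick the class predicted for the largest number of windows."""
    return Counter(window_predictions).most_common(1)[0][0]


raw = np.zeros((400, 69))                       # 400 frames at 120 fps, 69 values each
windows, masks = sliding_windows(downsample(raw))
short_windows, short_masks = sliding_windows(np.ones((10, 69)))
label = majority_vote(["walk", "jump", "walk"])
```

The boolean masks let a loss or accuracy computation skip padded frames, so short sequences do not contribute spurious reconstruction error.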

5.2. Network Specifications

The frame encoder is closely related to the one used by Cho & Chen [8], as it

has two hidden layers of [1024, 512] units. Two extra layers of [1024, 69] units are

used by the reconstruction decoder with tied weights with the encoder. The frame

20

classifier only has a special softmax layer applied to the output of the frame en-

coder. This layer applies the softmax operation on the sum of its linear activations

for each frame, as shown in equations 11, 12 and 13. The LSTM encoder, has 3

hidden layers of [512, 512, 256] LSTM memory cells. As explained in section 3.2,

the output of a single bi-directional recurrent layer can contain, at each timestep

information for the whole sequence. We therefore use bi-directionality only in the

first LSTM layer of the sequence encoder. This means that the second layer of the

LSTM encoder has an input of size 2∗512 containing past and future information.

The c layer, outputting the summary vector is of size 1024, and the hc layer is of

size 512. A normal softmax layer is placed on top of hc to output probabilities.

This means the feed forward layers on top of the LSTM encoder have the same

size as those used at frame level. Each layer of the LSTM decoder has a number

of units equal to size of the output of its corresponding layer in the encoder. This

leads to [256, 512, 1024] memory cells. All non-linear activations used in the net-

work consist of the tanh() function except for the input, output and forget gates

of the memory cells that use sigmoid activations.
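
The layer sizes above can be checked with a few lines of bookkeeping. The sketch below only restates the numbers given in this section; the dictionary keys and function names are illustrative, and the input dimension of 69 is an assumption inferred from the 69-unit output of the reconstruction decoder.

```python
# Layer sizes as stated in the text; the keys are illustrative names.
FRAME_DIM = 69  # assumed input size, inferred from the 69-unit reconstruction output

SPEC = {
    "frame_encoder": [1024, 512],
    "frame_decoder": [1024, 69],      # weights tied with the frame encoder
    "lstm_encoder": [512, 512, 256],  # only the first layer is bi-directional
    "summary_c": 1024,
    "hidden_hc": 512,
}

def lstm_encoder_input_sizes(spec):
    """Input size seen by each LSTM encoder layer.

    The first layer reads frame-encoder features; because it is bi-directional,
    the second layer receives forward and backward outputs concatenated.
    """
    sizes = [spec["frame_encoder"][-1]]
    for i, units in enumerate(spec["lstm_encoder"][:-1]):
        sizes.append(2 * units if i == 0 else units)
    return sizes

def lstm_decoder_sizes(spec):
    """Each decoder layer matches the output size of its encoder counterpart."""
    enc = spec["lstm_encoder"]
    enc_outputs = [2 * enc[0]] + enc[1:]  # the first layer outputs 2 * 512
    return enc_outputs[::-1]

print(lstm_encoder_input_sizes(SPEC))
print(lstm_decoder_sizes(SPEC))
```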

For feed-forward layers, the weights are drawn uniformly from

$[-\sqrt{1/\mathrm{fan}_{in}}, \sqrt{1/\mathrm{fan}_{in}}]$, while we use orthonormal initialization for recurrent

weight matrices. All biases are initialized at 0, except for LSTM forget gates

which are initialized to 1, as proposed by Gers et al. [13] and Jozefowicz et al.

[23]. The learning rate is initialized to 0.04, and is halved when the validation ac-

curacy is not improved in three consecutive epochs, until it reaches below 0.0001.

We use early stopping with a tolerance of 25 epochs. We use a 0.9 momentum

value.
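
A minimal NumPy sketch of the initialization and learning-rate schedule described above is given below. It is not the original Theano code: the function and class names are ours, the QR-based construction is one common way to obtain an orthonormal matrix, and resetting the patience counter after each halving is an assumption.

```python
import numpy as np

def uniform_fan_in(fan_in, fan_out, rng):
    """Feed-forward weights drawn uniformly from [-sqrt(1/fan_in), sqrt(1/fan_in)]."""
    bound = np.sqrt(1.0 / fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def orthonormal(n, rng):
    """Square recurrent matrix with orthonormal columns, via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # flip column signs so the diagonal of R is positive

def lstm_biases(n_cells):
    """All biases start at 0 except the forget gate, which starts at 1."""
    return {"input": np.zeros(n_cells), "forget": np.ones(n_cells),
            "output": np.zeros(n_cells), "cell": np.zeros(n_cells)}

class HalvingSchedule:
    """Halve the learning rate after `patience` epochs without validation improvement."""

    def __init__(self, lr=0.04, patience=3, min_lr=1e-4):
        self.lr, self.patience, self.min_lr = lr, patience, min_lr
        self.best, self.bad_epochs = -np.inf, 0

    def step(self, val_accuracy):
        """Update the schedule with the latest validation accuracy; return the rate."""
        if val_accuracy > self.best:
            self.best, self.bad_epochs = val_accuracy, 0
        else:
            self.bad_epochs += 1
            # Stop halving once the rate has dropped below min_lr.
            if self.bad_epochs >= self.patience and self.lr >= self.min_lr:
                self.lr /= 2.0
                self.bad_epochs = 0  # assumption: the counter restarts after a halving
        return self.lr
```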

5.3. Preliminary Study of Ratios and Windows’ Widths

Before conducting experiments with all variants (Table 2) of the architecture

on the HDM05 and CMU datasets, we tested the network using the FR-SC model.

These preliminary experiments aimed at exploring how the network would per-

form using different sequence classification cost ratios, and when feeding whole

sequences of movements instead of sliding windows. Those tests were performed

in two training phases. First, we used the data from two actors for training, one

for the validation set and two for the test set. After this first phase, we combined

both the training and the validation sets and started a second training phase.

Early stopping was performed based on monitoring the loss of the original, smaller

training set to identify when it had reached the level obtained when the validation

set was used for early stopping. Training in this way maximizes the amount of

data available to the method, but allows early stopping to be used in a heuristic but

principled way. Results are shown in Table 4.

Table 4: Accuracies on test set with different classification loss ratios and sliding windows’ width.

MODEL      WIDTH       RATIO   ACCURACY (%)
BASELINE   1           0.50    81.64
SC         30          1.00    81.97
FR-SC      30          0.75    84.67
FR-SC      30          0.50    84.02
FR-SC      UNLIMITED   0.75    79.13

We first tried with the ratio r = 1.0, which turns the network into an SC model,

since only lsce is used. This supervised-only network beat the baseline accuracy

of 81.64% with a score of 81.97%. This

may indicate that using sliding windows and majority voting (compared to Zhu

et al. [35]), even when using recurrence, can help classification. We then tried ratios

r = 0.75 and r = 0.5. The latter is the one used in our baseline [8]. The best

accuracy was obtained with r = 0.75, showing that a higher weight on the

(supervised) task of interest helps. These three experiments were conducted with

the sliding windows + majority voting strategy. We followed the exploration by

trying to take full advantage of the Recurrent Encoder-Decoder architecture by

handling whole sequences (no sliding windows). This implies encoding whole

sequences into the fixed-size vector c(X). We used the best ratio from the first

three tests, and obtained a lower accuracy (79.13%). This might be because

more LSTM cells would be needed in the sequence encoder/decoders to

learn to model dependencies on many more temporal scales.
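
The two-phase training procedure described at the start of this section can be sketched as follows. The callables `train_epoch`, `eval_valid_acc` and `eval_train_loss` are hypothetical placeholders for a real training loop, `max_epochs` is an assumption, and the 25-epoch tolerance is the one given in section 5.2. Whether the weights are re-initialized before the second phase is not stated in the text, so the sketch leaves that to the caller.

```python
def phase_one(train_epoch, eval_valid_acc, eval_train_loss, tolerance=25, max_epochs=1000):
    """Train on the training set and early-stop on validation accuracy.

    Returns the training loss recorded at the best validation epoch, which
    becomes the stopping target for phase two.
    """
    best_acc, target_loss, since_best = float("-inf"), None, 0
    for _ in range(max_epochs):
        train_epoch()
        acc = eval_valid_acc()
        if acc > best_acc:
            best_acc, target_loss, since_best = acc, eval_train_loss(), 0
        else:
            since_best += 1
            if since_best >= tolerance:
                break
    return target_loss

def phase_two(train_epoch_merged, eval_train_loss, target_loss, max_epochs=1000):
    """Train on training + validation data and stop once the loss on the
    original training set reaches the level recorded in phase one.

    Returns the epoch at which training stopped.
    """
    for epoch in range(1, max_epochs + 1):
        train_epoch_merged()
        if eval_train_loss() <= target_loss:
            return epoch
    return max_epochs
```

The point of the second phase is that no held-out data is needed to decide when to stop: the target loss from phase one plays the role of the validation signal.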

5.4. Regularization by Reconstruction

The experiments we conducted here used the three separate sets from HDM05

(first phase of training described in section 5.3). Examining these results, we are

able to see the regularization effects of adding different types of reconstructive

modules and losses to the network’s composite error function. Table 5 shows

these effects. As we can see, adding the frame reconstructive module (FR-SC)

substantially reduces over-fitting. Figure 5 clearly shows this effect during training.

Adding the feature sequence reconstruction loss (FR-SRC) and then the frame-based

classification (FRC-SRC) also has a beneficial impact on over-fitting compared

to the SC model, but to a lesser extent. This, however, might be due to the

fact that these two bigger networks (FRC-SRC, FR-SRC) have a lot more capacity

due to their LSTM decoder and may therefore tend to over-fit in this limited

data setting (we use here only 40% of HDM05 for training). In order to validate

this hypothesis, further tests with more training data were conducted on these

networks as well as on the FR-SC.

Table 5: Regularization effects of the chosen model (and corresponding loss function) on the accuracies with HDM05 data only.

MODEL     TRAIN (%)   VALID (%)   TEST (%)
SC        99.00       67.26       78.18
FR-SC     98.77       77.71       82.51
FR-SRC    97.43       71.20       79.80
FRC-SRC   98.77       70.22       78.73
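
One way to read the regularization effect off Table 5 is the gap between training and validation accuracy. The short computation below uses only the numbers reported in the table; the function and variable names are ours, and the gap is just one possible summary of over-fitting.

```python
# Accuracies (%) copied from Table 5: (train, valid, test).
TABLE_5 = {
    "SC":      (99.00, 67.26, 78.18),
    "FR-SC":   (98.77, 77.71, 82.51),
    "FR-SRC":  (97.43, 71.20, 79.80),
    "FRC-SRC": (98.77, 70.22, 78.73),
}

def generalization_gaps(table):
    """Train-minus-validation accuracy per model, in percentage points."""
    return {model: round(train - valid, 2)
            for model, (train, valid, _) in table.items()}

gaps = generalization_gaps(TABLE_5)
# Print models from the smallest gap (least over-fitting) to the largest.
for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model:8s} gap = {gap:.2f}")
```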

5.5. Training Reconstructive and Semi-supervised Models with More Data

The final experiments here were performed with two training phases, as in section

5.3. Importantly, because they also use the validation set for training and perform

early stopping using the original training set, they involve a 50% larger training

set compared to the experiments of Table 5. Table 6 shows our results for the

movement classification task using our more representative test set for HDM05.

We compare our results with our implementation of the baseline techniques from

Cho & Chen [8] and Zhu et al. [35] on the same test set. Since the CMU data,

in terms of number of frames, outnumbers HDM05 by a significant factor, we divide

errors on reconstruction of unlabeled data by this factor in order to keep our

classification error ratio valid. Gaussian noise is re-generated for each example,

so no sequences are exactly the same.

Figure 5: Visualization of the effect of adding per-frame reconstructive decoders on over-fitting throughout training.

Experiments performed with the combined

datasets (HDM05+CMU) used pre-trained networks to accelerate training, i.e., we

used the networks already trained on HDM05. On HDM05 only, the best model

was the FRC-SRC, which used all 4 losses, supporting our main hypotheses that

we can obtain higher quality representations of the data when using specialized

modules with associated costs in the architecture and that the added network ca-

pacity is useful with bigger training sets. We were surprised by the low perfor-

mance of the FR-SRC on HDM05 only. It seems that in the single experiment

with FR-SRC on HDM05 only, the network might have got stuck during training

in a bad local minimum of the loss function and that adding the unlabeled CMU

data was enough for it to step out of this minimum, a known advantage of semi-

supervised learning. Of course, multiple tests in each setting would help gather

more robust results and standard deviations. This would be of great interest for the

case of the competing FR-SRC and FRC-SRC models, which yielded close results

with the combined datasets.

Table 6: Impact of the chosen model (and corresponding loss function) on the test accuracy.

MODEL      DATASET      ACCURACY (%)
BASELINE   HDM05        81.64
FR-SC      HDM05        84.67
FR-SC      HDM05+CMU    84.23
FR-SRC     HDM05        80.24
FR-SRC     HDM05+CMU    85.64
FRC-SRC    HDM05        85.53
FRC-SRC    HDM05+CMU    85.10

Indeed, the frame-based classification module might

not be as useful as other modules. Estimating probabilities of an action based on a

single frame without context might be a task too hard for the network. Therefore,

the per-frame encoding layers might try to reduce the very high loss on frame-

based classification by (often unsuccessfully) producing discriminative features at

the expense of higher reconstruction loss, resulting in less useful features to send

to the LSTM encoder.
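
The rescaling of reconstruction errors on unlabeled data mentioned earlier in this section can be illustrated as follows. This is a sketch under stated assumptions: the text only says that unlabeled reconstruction errors are divided by the factor by which CMU outnumbers HDM05 in frames, and that r = 1.0 leaves only the classification loss. The convex combination with weight r, the function names and the frame counts in the usage lines are hypothetical.

```python
def unlabeled_scale(n_frames_labeled, n_frames_unlabeled):
    """Factor by which the unlabeled set outnumbers the labeled one, in frames."""
    return n_frames_unlabeled / n_frames_labeled

def composite_loss(r, cls_loss, rec_labeled, rec_unlabeled, factor):
    """Weighted sum of classification and reconstruction terms.

    ASSUMPTION: the ratio r enters as a convex combination. The text only
    guarantees that r = 1.0 leaves the classification loss alone.
    """
    reconstruction = rec_labeled + rec_unlabeled / factor
    return r * cls_loss + (1.0 - r) * reconstruction

# Hypothetical frame counts, for illustration only.
factor = unlabeled_scale(n_frames_labeled=100, n_frames_unlabeled=1000)
print(composite_loss(0.75, cls_loss=2.0, rec_labeled=1.0, rec_unlabeled=10.0, factor=factor))
```

Without the division, the much larger unlabeled set would dominate the summed reconstruction error and the effective weight of the classification term would fall below the chosen ratio.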

Figure 6 shows the confusion matrix generated with the best performing network.

Interestingly, the network is confused between actions

PunchLFront (26) and PunchLSide (27) but not between PunchRFront (28) and

PunchRSide (29). We propose that this may be due to the fact that actors were

right-handed and that movements may have been clearer when using their strong

hand.
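
For readers who wish to reproduce this kind of analysis, a confusion matrix and its most confused class pairs can be computed as in the sketch below. The function names are ours, the labels in the usage lines are toy values rather than HDM05 results, and the row/column convention (true class in rows) is an assumption about Figure 6.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count matrix where rows index the true class and columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def most_confused_pairs(matrix, top=3):
    """Off-diagonal cells with the largest counts, as (true, predicted, count)."""
    cells = [(t, p, c) for t, row in enumerate(matrix)
             for p, c in enumerate(row) if t != p and c > 0]
    return sorted(cells, key=lambda cell: -cell[2])[:top]

# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(most_confused_pairs(cm))
```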

Figure 6: Confusion matrix on HDM05 classification for the best performing net-

work (FR-SRC trained on HDM05 + CMU).

5.6. Clustering HDM05

Using the FR-SRC network that got the best results on HDM05 classification,

we clustered the summary vectors it produced for the

test set, unseen during training. We used a Gaussian Mixture Model (GMM) [27]

initialised with K-means++ [1], where K was found by using 10% of the set and a

validation set to find the best likelihood. This system found 30 clusters that we can

visualize in Figure 7. Note that feature vectors have 1024 dimensions and clusters

were found in that space, while we used the t-SNE algorithm [24] to create a 2D

visualization. Some clusters were annotated after manual inspection to give an

idea of what movements the network clustered.

Figure 7: 2D visualization of clusters found by the FR-SRC network. Some of those were annotated after manual inspection of individual sub-sequences inside clusters.

We can see that such a trained

network could help accelerate labeling MOCAP sequences of movements since

sequences in the most well defined clusters could be labeled in batch. However,

manual annotation seems to suggest that HDM05 actions have a considerable impact

on the clustered actions, since almost all clusters could be associated with one

or two HDM05 labels.
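
The clustering pipeline of this section can be approximated with scikit-learn as sketched below. This is not the original code: the diagonal covariance, the random 10% held-out split used to score each K, the candidate values of K and the t-SNE perplexity are all assumptions, and the function names are ours.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

def fit_gmm(features, k, seed=0):
    """Fit a GMM whose means are initialised with K-means++ seeding."""
    centers, _ = kmeans_plusplus(features, n_clusters=k, random_state=seed)
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          means_init=centers, random_state=seed)
    return gmm.fit(features)

def select_k(features, candidates, valid_fraction=0.1, seed=0):
    """Pick K by average log-likelihood on a held-out split (assumed protocol)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_valid = max(1, int(valid_fraction * len(features)))
    valid, train = features[order[:n_valid]], features[order[n_valid:]]
    scores = {k: fit_gmm(train, k, seed).score(valid) for k in candidates}
    return max(scores, key=scores.get), scores

def embed_2d(features, perplexity=30.0, seed=0):
    """Project summary vectors to 2D with t-SNE, for visualization only."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(features)
```

As in the text, clusters are found in the original feature space and t-SNE is applied afterwards, so the 2D layout does not influence cluster assignments.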

6. Conclusions and Discussion

Recurrent Encoder-Decoder architectures with multiple decoders provide an

attractive framework for semi-supervised, multipurpose representation learning.

Our experiments show that even our simplest configuration using an RNN is able

to outperform our implementation of the state-of-the-art for HDM05 movement

classification with a realistic partition of data. We also found that we were able

to push those results higher through enabling the various modules of the proposed

architecture. Our results indicate that the inclusion of reconstructive decoders

appears to have a regularizing effect and reduces over-fitting.

To properly evaluate this technique (and others), we have defined a realistic

test set on the public HDM05 dataset that we hope can serve as a benchmarking

set in future works on MOCAP classification. We found the definition of the

evaluation unclear in certain previous works, as we could not reach similar

results using the same architectures. Nevertheless, we showed that with the

same gradient-based learning method, our architecture yielded better results with

well defined training, validation and test sets.

Additionally, we showed that such a network is well suited for clustering

as learnt representations compress reconstructive and discriminative information

about sequences. This could help label datasets in batch or create a motion search

engine based on a distance metric in that learnt feature space.

As future work we are interested in exploring alternative decoders, such as

next-frame(s) predictors, to provide even richer features. Dynamic sequence classification

loss ratios could also be tested in order to mimic two-phase semi-supervised

learning (starting with more weight on the reconstructive objective and

progressing towards more weight on the discriminative objective).

Acknowledgements

We thank Ubisoft for their support for this research as well as the authors of

the Theano framework [32].

References

[1] Arthur, David and Vassilvitskii, Sergei. k-means++: The advantages of care-

ful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium

on Discrete algorithms, pp. 1027–1035. Society for Industrial and Applied

Mathematics, 2007.

[2] Barnachon, Mathieu, Bouakaz, Saïda, Boufama, Boubakeur, and Guillou,

Erwan. Ongoing human action recognition with motion capture. Pattern

Recognition, 47(1):238–247, 2014.

[3] Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term

dependencies with gradient descent is difficult. Neural Networks, IEEE

Transactions on, 5(2):157–166, 1994.

[4] Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, Larochelle, Hugo, et al.

Greedy layer-wise training of deep networks. Advances in neural informa-

tion processing systems, 19:153, 2007.

[5] Breuel, Thomas M. Benchmarking of lstm networks. arXiv preprint

arXiv:1508.02774, 2015.

[6] Chaudhry, Rizwan, Ofli, Ferda, Kurillo, Gregorij, Bajcsy, Ruzena, and Vidal,

René. Bio-inspired dynamic 3d discriminative skeletal features for human

action recognition. In Computer Vision and Pattern Recognition Workshops

(CVPRW), 2013 IEEE Conference on, pp. 471–478. IEEE, 2013.

[7] Chen, Xi and Koskela, Markus. Classification of rgb-d and motion capture

sequences using extreme learning machine. In Image Analysis, pp. 640–651.

Springer, 2013.

[8] Cho, Kyunghyun and Chen, Xi. Classifying and visualizing motion cap-

ture sequences using deep neural networks. arXiv preprint arXiv:1306.3874,

2013.

[9] Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau,

Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learn-

ing phrase representations using rnn encoder-decoder for statistical machine

translation. arXiv preprint arXiv:1406.1078, 2014.

[10] Cotter, Shane F, Rao, Bhaskar D, Engan, Kjersti, and Kreutz-Delgado, Ken-

neth. Sparse solutions to linear inverse problems with multiple measurement

vectors. IEEE Transactions on Signal Processing, 53(7):2477–2488, 2005.

[11] Du, Yong, Wang, Wei, and Wang, Liang. Hierarchical recurrent neural net-

work for skeleton based action recognition. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, pp. 1110–1118,

2015.

[12] Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, Manzagol, Pierre-

Antoine, Vincent, Pascal, and Bengio, Samy. Why does unsupervised pre-

training help deep learning? The Journal of Machine Learning Research,

11:625–660, 2010.

[13] Gers, Felix A, Schmidhuber, Jürgen, and Cummins, Fred. Learning to forget:

Continual prediction with lstm. Neural computation, 12(10):2451–2471,

2000.

[14] Graves, Alex, Jaitly, Navdeep, and Mohamed, Abdel-rahman. Hybrid speech

recognition with deep bidirectional lstm. In Automatic Speech Recognition

and Understanding (ASRU), 2013 IEEE Workshop on, pp. 273–278. IEEE,

2013.

[15] Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech

recognition with deep recurrent neural networks. In Acoustics, Speech and

Signal Processing (ICASSP), 2013 IEEE International Conference on, pp.

6645–6649. IEEE, 2013.

[16] Graves, Alex. Supervised sequence labelling with recurrent neural networks,

volume 385. Springer, 2012.

[17] Graves, Alex, Liwicki, Marcus, Bunke, Horst, Schmidhuber, Jürgen, and

Fernández, Santiago. Unconstrained on-line handwriting recognition with

recurrent neural networks. In Advances in Neural Information Processing

Systems, pp. 577–584, 2008.

[18] Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye. A fast learning

algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[19] Hochreiter, Sepp. The vanishing gradient problem during learning recurrent

neural nets and problem solutions. International Journal of Uncertainty,

Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.

[20] Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory.

Neural computation, 9(8):1735–1780, 1997.

[21] Hung, Chia-Chun, Carlson, Eric T, and Connor, Charles E. Medial axis

shape coding in macaque inferotemporal cortex. Neuron, 74(6):1099–1113,

2012.

[22] Ijjina, Earnest Paul et al. Classification of human actions using pose-based

features and stacked auto encoder. Pattern Recognition Letters, 2016.

[23] Jozefowicz, Rafal, Zaremba, Wojciech, and Sutskever, Ilya. An empirical

exploration of recurrent network architectures. In Proceedings of the 32nd

International Conference on Machine Learning (ICML-15), pp. 2342–2350,

2015.

[24] Maaten, Laurens van der and Hinton, Geoffrey. Visualizing data using t-sne.

Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[25] Müller, Meinard, Röder, Tido, Clausen, Michael, Eberhardt, Bernhard,

Krüger, Björn, and Weber, Andreas. Documentation mocap database hdm05,

2007.

[26] Rasmus, Antti, Berglund, Mathias, Honkala, Mikko, Valpola, Harri, and

Raiko, Tapani. Semi-supervised learning with ladder networks. In Advances

in Neural Information Processing Systems, pp. 3532–3540, 2015.

[27] Reynolds, Douglas. Gaussian mixture models. Encyclopedia of biometrics,

pp. 827–832, 2015.

[28] Sak, Haşim, Senior, Andrew, and Beaufays, Françoise. Long short-term

memory based recurrent neural network architectures for large vocabulary

speech recognition. arXiv preprint arXiv:1402.1128, 2014.

[29] Srivastava, Nitish, Mansimov, Elman, and Salakhutdinov, Ruslan. Un-

supervised learning of video representations using lstms. arXiv preprint

arXiv:1502.04681, 2015.

[30] Sutskever, Ilya, Martens, James, and Hinton, Geoffrey E. Generating text

with recurrent neural networks. In Proceedings of the 28th International

Conference on Machine Learning (ICML-11), pp. 1017–1024, 2011.

[31] Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence

learning with neural networks. In Advances in neural information processing

systems, pp. 3104–3112, 2014.

[32] Team, The Theano Development, Al-Rfou, Rami, Alain, Guillaume, Alma-

hairi, Amjad, Angermueller, Christof, Bahdanau, Dzmitry, Ballas, Nicolas,

Bastien, Frédéric, Bayer, Justin, Belikov, Anatoly, et al. Theano: A python

framework for fast computation of mathematical expressions. arXiv preprint

arXiv:1605.02688, 2016.

[33] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-

Antoine. Extracting and composing robust features with denoising autoen-

coders. In Proceedings of the 25th international conference on Machine

learning, pp. 1096–1103. ACM, 2008.

[34] Yu, Dong, Deng, Li, and Dahl, G. Roles of pre-training and fine-tuning in

context-dependent dbn-hmms for real-world speech recognition. In Proc.

NIPS Workshop on Deep Learning and Unsupervised Feature Learning,

2010.

[35] Zhu, Wentao, Lan, Cuiling, Xing, Junliang, Zeng, Wenjun, Li, Yanghao,

Shen, Li, and Xie, Xiaohui. Co-occurrence feature learning for skeleton

based action recognition using regularized deep lstm networks. In Thirtieth

AAAI Conference on Artificial Intelligence, 2016.
