Semi-supervised Learning with Encoder-Decoder Recurrent Neural Networks: Experiments with Motion
Capture Sequences
Felix G. Harvey (a), Christopher Pal (a)
(a) Departement de genie informatique et genie logiciel, Ecole Polytechnique Montreal,
Montreal, Quebec, Canada, H3T 1J4
Abstract
Recent work on sequence to sequence translation using Recurrent Neural Net-
works (RNNs) based on Long Short Term Memory (LSTM) architectures has
shown great potential for learning useful representations of sequential data. A
one-to-many encoder-decoder(s) scheme allows for a single encoder to provide
representations serving multiple purposes. In our case, we present an LSTM en-
coder network able to produce representations used by two decoders: one that re-
constructs, and one that classifies if the training sequence has an associated label.
This allows the network to learn representations that are useful for both discrimi-
native and reconstructive tasks at the same time. This paradigm is well suited for
semi-supervised learning with sequences and we test our proposed approach on an
action recognition task using motion capture (MOCAP) sequences. We find that
semi-supervised feature learning can improve state-of-the-art movement classification
accuracy on a public MOCAP dataset, on which we defined a new realistic
partition based on subjects. Further, we find that even when using only labeled
data and a primarily discriminative objective, the addition of a reconstructive
decoder can serve as a form of regularization that reduces over-fitting and improves
test set accuracy.
Keywords: Semi-Supervised, LSTM, Encoder-Decoder, Action Recognition,
MOCAP, HDM05
Preprint submitted to Pattern Recognition, January 27, 2020
arXiv:1511.06653v3 [cs.CV] 25 Jul 2016
1. Introduction
It is often the case that for a given task a small amount of labeled data is
available compared to a much larger amount of unlabeled data. In these cases,
semi-supervised learning may be preferred to supervised learning as it uses all
the available data for training, and has good regularization and optimization prop-
erties [12, 4]. A common technique for semi-supervised learning is to perform
training in two phases: unsupervised pre-training, followed by supervised
fine-tuning [18, 4, 12, 34]. The unsupervised pre-training task often consists of
training a variant of an auto-encoder (e.g. a denoising auto-encoder) to reconstruct
the data. This helps the network bring its initial parameters into a good region of
the high-dimensional parameter space before the non-convex optimization of
the supervised task begins. One of the features of this work is that our model forces
its parameters to stay in such regions as training occurs, by having reconstructive
objectives in one or two of its modules.
Recent advances in Recurrent Encoder-Decoder networks have afforded mod-
els the ability to perform both supervised learning [31, 9] and unsupervised learn-
ing [29]. These architectures are often based on the idea that recurrent neural
networks (RNNs) can model temporal dependencies and that when handling a se-
quence, the last hidden state of an RNN can contain information about the whole
sequence. Therefore, this representation has a fixed length, even if sequences do
not. The separation between the encoder and the decoder network(s) allows one
to easily add, modify or re-purpose decoders for desired tasks. Using multiple
decoders forces the encoder to learn rich, multipurpose representations. This can
also allow semi-supervised training in a single phase. In our case, we combine the
unsupervised pre-training phase with the supervised fine-tuning phase to jointly
train a classifier decoder and a reconstruction decoder, which both use the rep-
resentation provided by the encoder. We used a particular type of RNN known
as a Long-Short-Term-Memory (LSTM) model to encode and decode sequences,
and a multilayer perceptron to classify them. We also studied the effect of adding
a per-frame decoder that uses frame representations provided by the frame en-
coder (see sections 4 and 5). We show that our architecture improves movement
recognition accuracy on a newly defined realistic partitioning of a popular public
Motion Capture (MOCAP) dataset.
Our main contributions are:
• The introduction of a novel architecture of the Recurrent Encoder-Decoder
type that allows for supervised and unsupervised learning in a common
model.
• The definition and execution of experiments using a more realistic vali-
dation and test set partitioning of a widely used public MOCAP dataset,
thereby facilitating more informative future evaluations.
• An improvement over our implementation of current state-of-the-art tech-
niques for action recognition on such well defined experiments.
• The provision of further evidence of the benefits of semi-supervised learn-
ing, even when using a completely labeled dataset.
2. MOCAP
Motion capture (MOCAP) technologies allow one to track and save move-
ments of an actor wearing a special suit with multiple markers on it. The recorded
positions of these markers at each timestep make it possible to apply these move-
ments to virtual characters in order to simulate realistic motions. These tech-
nologies find use in multiple areas, such as video games, movies and health. As
MOCAP is used more extensively in these applications, easy search through
databases of sequences becomes a desirable tool. While sequences could be
labeled or annotated to facilitate search, in practice, sequences are often not
labeled, or are labeled only at a coarse semantic level. A key element for labeling MOCAP
sequences is human action recognition. This challenge can be seen as sequence
classification, when only one action is performed in the sequence, or as sequence
to sequence translation, when we want a fuller description of what is happening or
when multiple actions are performed. This work focuses on sequence classifica-
tion, even though it is a potential first step towards the long-term goal of sequence
translation from MOCAP to natural language.
2.1. Datasets
One challenge with the application of deep learning on MOCAP data is the
lack of strongly labeled data. For this work, we used the two biggest publicly
available MOCAP datasets that we are aware of. The first is the HDM05 pub-
lic dataset [25]. It contains 2329 labeled cuts that are very well suited for action
recognition. There are about 100 classes of movements, which can be reduced
to 65 when the number of repetitions or the side of the limb starting the action
(left, right) are ignored. In this work, we use the same 65 classes defined by
Cho & Chen [8]. The second dataset we use is the CMU Graphics Lab Motion
Capture Database (footnote 1). This is, to our knowledge, the biggest public MOCAP dataset
in terms of number of frames. It contains 2148 weakly labeled or unlabeled
sequences. This dataset can hardly be used for supervised learning, as the labeling
of sequences, if any, was only made to give high-level indications and does not
seem to have followed any stable conventions throughout the dataset. Works by
Zhu et al. [35], Ijjina et al. [22], and Barnachon et al. [2] all use different custom
class definitions to get some quantitative results on CMU for classification.
In the present work, we use this dataset for unsupervised learning only. Table 3
gives more information on the datasets used.
2.2. Previous Work on Action Recognition
Interesting work has been done on action recognition in MOCAP sequences
in recent years, much of which depends on well-designed, hand-crafted
features.
For example, Chaudhry et al. [6] created bio-inspired features based on the
findings of Hung et al. [21] on the neural encoding of shapes and, using Support
Vector Machines (SVMs), obtained good results on classification of 11 actions
from HDM05. Ijjina et al. [22] use joint-distance metrics based
on domain knowledge (about actions found in a particular dataset) to create
principled features that are then used as inputs to a neural classifier (pre-trained
as a stacked auto-encoder). They reach good accuracy for 3 custom classes in
the CMU dataset. Using this prior domain knowledge especially helps when the
dataset is somewhat specialized and may contain actions of a certain type. However,
if the goal is to have a generic action classifier that handles at least as many
actions as found in HDM05, it might be more appropriate to learn those features
with a more complex architecture. Barnachon et al. [2] use a learnt vocabulary
of key poses (based on a K-Means variant) and use distances between histograms
of sub-actions in order to classify ongoing actions. They present good accuracy
(96.67%) on a custom subset of 33 actions from HDM05 (where training samples
are taken at random). In our case, we wish to perform classification on the 65
HDM05 actions.
Footnote 1: http://mocap.cs.cmu.edu/
End-to-end neural approaches have also been tried on HDM05 and CMU, in
which case discriminating features are learnt throughout the training of a neural
network. Cho & Chen [8] have obtained good movement classification rates on
simple sequences (cuts) on the HDM05 dataset using a simple Multi-Layer
Perceptron (MLP)+Auto-encoder hybrid. Chen & Koskela [7] tested multiple types
of features, using a fast technique known as the Extreme Learning Machine to
classify, again, HDM05 cuts. Results were very good in both cases, with accuracies
of over 95% and 92% with 65 and 40 action classes respectively. Their models
were trained at the frame level, and sequence classification was done by majority
voting. Other work by Du et al. [11] treated the simple sequences’ classification
problem with the same action classes as Cho & Chen [8] with a hierarchical net-
work handling in its first layer parts of the body separately (i.e. torso, arms and
legs), and concatenating some of these parts in each layer until the whole body is
treated in the last hidden layer. They worked with RNNs to use context informa-
tion, instead of concatenating features of some previous frames at each timestep
like Cho & Chen [8] and Chen & Koskela [7]. This led to better results, and their
classification accuracy on simple sequences reached 96.92%. Finally, Zhu et al.
[35] have a similar, but less constrained, recurrent architecture that is regularized
by a weight penalty based on the $\ell_{2,1}$ norm (Cotter et al. [10]), which encourages
parts of the network to focus on the most meaningful interactions between joints or features.
They report 97.25% accuracy on HDM05 for classification of simple sequences,
with 65 classes.
2.3. Defining a Good Test Set
Based on their results, the aforementioned methods seemingly solve the prob-
lem of action recognition in the HDM05 dataset. However, upon closer inspection
it becomes clear that there is an underlying problem with these results (except
for Chaudhry et al. [6]), which lies in the definition of the validation and test
sets. The experiments were performed using 10-fold cross validation with 10 bal-
anced partitions of shuffled sequences. This means that takes or frames recorded
with a particular actor could be found in the training set as well as in the test
set. This configuration is therefore unrepresentative of typical realistic situations
where new takes are recorded with new actors. If an actor is asked to repeat
the same movement five times in five different takes, then these will probably be very
similar, and shuffling the frames or sequences will introduce an undesired bias in the
test set. Chaudhry et al. [6], on the other hand, isolate their validation and test
sets based on subjects performing the actions. This type of evaluation is therefore
more representative of reality and a better measure of a given method’s general-
ization performance. Moreover, it seems that in previous works [8, 7, 11, 35] there
is no proper test set. Each fold is used for early stopping as well as for evaluating
the network. Having an unseen test set would then be a better way to once again
assess the generalization capacity of the network.

Table 1: Accuracies (Acc.) with different test sets, and using techniques from Cho
& Chen [8], Du et al. [11] and Zhu et al. [35].

TECHNIQUE     TEST SET                       ACC. (%)
CHO & CHEN    RANDOM 10%, BALANCED           95.61
CHO & CHEN    RANDOM 40%, BALANCED           94.13
CHO & CHEN    ACTORS ['TR', 'DG']            64.36
CHO & CHEN    ACTORS ['TR', 'DG'] - (PP)     81.64
DU ET AL.     RANDOM 10%, BALANCED           92.98
DU ET AL.     ACTORS ['TR', 'DG'] - (PP)     70.63
ZHU ET AL.    RANDOM 10%, BALANCED           94.53
ZHU ET AL.    ACTORS ['TR', 'DG'] - (PP)     81.64

Table 1 shows the results of
our attempts at re-creating results of previous state-of-the-art works [8, 11, 35]
on the HDM05 dataset. It also shows accuracies for these techniques when using
partitions based on actors. In this setting, we use actors with initials ’tr’ and ’dg’
as test subjects, and the actor ’bk’ as a validation subject. In this scenario, we
always train the network once on the three partitions (with early stopping w.r.t
subject ’bk’), then start a new training with the train and validation sets com-
bined. In this second phase, early stopping is done when accuracy on the previous
training set reaches the same level as in the first phase. This is done in order to
maximize the amount of training data, while preventing over-fitting. Moreover,
since using three out of five actors from HDM05 for validation and testing leaves
about 40% of the sequences in the training set, we tested again the method from
Cho & Chen [8] with a balanced, shuffled partition having the same proportions
(40%, 20%, 40%) for each set to see if this was the only factor influencing the
declining results. Finally, we applied our own pre-processing (PP) of the data
with these techniques with our newly defined actor-based partitions to make fairer
comparisons later. The main difference with our pre-processing of the data is that
we allow the hips of the actor not to be always parallel to the floor. This makes
it easier to recognize some movements like the cartWheel motion, as shown in
Figure 1. More details on pre-processing are given in section 5.1. As we can see,
results using a realistic partition of HDM05 are significantly lower, but our own
pre-processing method of the data helps. Since the techniques of Cho & Chen
[8] and Zhu et al. [35] yielded the best results with our actor-based partitions and
with our pre-processing method, the baseline for the rest of this work will be the
81.64% accuracy reached by those methods.
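The actor-based partition described above can be sketched as follows. This is a minimal illustration, assuming each sequence is tagged with the initials of the actor who performed it. The function name `split_by_actor`, the sequence identifiers and the training-actor labels `a1` and `a2` are hypothetical placeholders; only the initials 'tr', 'dg' and 'bk' come from the text.

```python
from collections import defaultdict


def split_by_actor(samples, test_actors=("tr", "dg"), val_actors=("bk",)):
    """Partition (sequence_id, actor) pairs so that no actor spans two sets."""
    parts = defaultdict(list)
    for seq_id, actor in samples:
        if actor in test_actors:
            parts["test"].append(seq_id)
        elif actor in val_actors:
            parts["val"].append(seq_id)
        else:
            parts["train"].append(seq_id)
    return dict(parts)


# Hypothetical sequence ids; 'a1' and 'a2' stand in for the remaining actors.
samples = [("s0", "a1"), ("s1", "bk"), ("s2", "dg"), ("s3", "a2"), ("s4", "tr")]
parts = split_by_actor(samples)
```

Unlike a shuffled 10-fold split, every take from a given actor ends up in exactly one set, so repeated takes of the same movement by the same person cannot appear in both training and test data.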
Figure 1: Comparison of pre-processing methods for a cartWheel movement from
HDM05. UP: Same as Cho & Chen [8]. DOWN: our own.
3. Recurrent Neural Networks Review
We will quickly describe here the building blocks of our architecture, which
are recurrent neural networks and associated advances.
3.1. Vanilla Recurrent Neural Networks
At their core, RNNs are artificial neural networks in which hidden layer units
have connections towards themselves through time. This means that at each
timestep, hidden layers receive lower layers’ current outputs as well as their own
output from the previous timestep. This allows the network to use past context
information to better model temporal sequences. This is why RNNs have often
proven over the years to be very powerful on multiple sequential problems, such
as speech recognition [15, 28, 14], handwriting recognition [17], text generation
[30, 16], or in our case MOCAP action recognition [11, 35]. The forward pass
of the hidden layer of an RNN, to get the hidden state h at time t, is very similar
to the one of an ordinary feed-forward neural net, except for the added previous
hidden state $h_{t-1}$ as an input:
$$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{1}$$
where $W_{xh}$ and $W_{hh}$ are weight matrices and $b_h$ is the bias vector for the hidden
layer. The operator $\sigma(\cdot)$ is a differentiable activation function such as a sigmoid or
a tanh operation.
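As an illustration of equation 1, the forward pass over a whole sequence can be sketched in NumPy. This is a minimal sketch, not the implementation used in this work: it assumes tanh as the activation, a zero initial hidden state, and illustrative layer sizes; the function name `rnn_forward` is hypothetical.

```python
import numpy as np


def rnn_forward(X, W_xh, W_hh, b_h):
    """Apply h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) over X of shape (T, n_in)."""
    h = np.zeros(W_hh.shape[0])  # assumption: the initial hidden state is zero
    states = []
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)  # shape (T, n_hidden)


rng = np.random.default_rng(0)
T, n_in, n_hidden = 5, 3, 4  # illustrative sizes only
X = rng.standard_normal((T, n_in))
W_xh = 0.1 * rng.standard_normal((n_hidden, n_in))
W_hh = 0.1 * rng.standard_normal((n_hidden, n_hidden))
H = rnn_forward(X, W_xh, W_hh, np.zeros(n_hidden))
```

The last row of `H` is the hidden state after the final timestep, which is the quantity an encoder would use to summarize the sequence.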
3.2. Bi-Directional Recurrent Networks
RNNs make use of past context in order to model a sequence up to a certain
point. In many problems however, future context may be available and useful.
In those cases, using bi-directional RNNs (BRNNs) can make the network more
powerful. BRNN hidden layers have two sets of units. One set handles the
sequence in chronological order, while the other handles it in reverse order. The
output of such a layer is the concatenation of the hidden activations of both sets.
Outputs of bi-directional recurrent layers can contain information about all the
sequence at each timestep (past and future).
Figure 2: Unfolded bi-directional RNN. hf and hb are hidden states of the forward
and the backward sets respectively. The [] symbol represents the concatenation
operation of both hidden states at time t. Based on figure 3.5 in [16].
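The bi-directional layer can be sketched by running two independent recurrences and concatenating their states at each timestep. This is a simplified illustration with plain tanh units and a zero initial state, both of which are assumptions; the names `run_rnn` and `birnn_forward` are hypothetical.

```python
import numpy as np


def run_rnn(X, W_xh, W_hh, b_h):
    """Plain tanh recurrence over X of shape (T, n_in); returns (T, n_hidden)."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        out.append(h)
    return np.stack(out)


def birnn_forward(X, fwd_params, bwd_params):
    h_f = run_rnn(X, *fwd_params)              # chronological order
    h_b = run_rnn(X[::-1], *bwd_params)[::-1]  # reverse order, re-aligned to time t
    return np.concatenate([h_f, h_b], axis=1)  # shape (T, 2 * n_hidden)


rng = np.random.default_rng(1)
T, n_in, n_hidden = 5, 3, 4  # illustrative sizes only


def make_params():
    return (0.5 * rng.standard_normal((n_hidden, n_in)),
            0.5 * rng.standard_normal((n_hidden, n_hidden)),
            np.zeros(n_hidden))


fwd, bwd = make_params(), make_params()
X = rng.standard_normal((T, n_in))
out = birnn_forward(X, fwd, bwd)
```

At the first timestep, the forward half of `out` has seen only the first frame, while the backward half has already processed the whole sequence in reverse, which is why each output can carry both past and future context.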
3.3. Long-Short-Term Memory
One known problem with RNNs is that they can be very hard to train to model
long-term dependencies because of the vanishing gradient problem [3, 19]. One of
the most popular and effective methods to counter this problem is the use of Long-
Short-Term Memory networks (LSTMs), as presented by Hochreiter & Schmidhu-
ber [20]. These recurrent networks have, instead of simple hidden units, memory
cells having input, output, and forget gates that determine whether information
is added to, released from, and kept in the cell at each timestep. This enables
the recurrent network to keep past context information for a long time internally,
therefore allowing it to model long time dependencies. One LSTM cell is shown in
Figure 3. In our context, we do not use in-cell connections (also called peepholes)
as they have not been found to be useful in recent experiments [5]. Therefore,
gates, cell values, and hidden outputs are calculated as follows:
$$i = \mathrm{sigmoid}(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \tag{2}$$
$$o = \mathrm{sigmoid}(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \tag{3}$$
$$f = \mathrm{sigmoid}(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \tag{4}$$
$$c_t = f \odot c_{t-1} + i \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{5}$$
$$h_t = o \odot \tanh(c_t) \tag{6}$$
where $W$ and $b$ are weight matrices and bias vectors respectively (see footnote 2). The operation
$\odot$ is an element-wise multiplication.
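Equations 2 to 6 translate almost line by line into a single-timestep function. The sketch below is only an illustration under stated assumptions: the parameter dictionary layout, the helper `init_params`, the weight scale and the layer sizes are all hypothetical choices made for the example.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step without peephole connections (equations 2 to 6)."""
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])  # input gate
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])  # output gate
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])  # forget gate
    c_t = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    h_t = o * np.tanh(c_t)
    return h_t, c_t


def init_params(n_in, n_hidden, rng):
    """Hypothetical helper: small random weights and zero biases."""
    p = {}
    for g in "iofc":
        p[f"W_x{g}"] = 0.1 * rng.standard_normal((n_hidden, n_in))
        p[f"W_h{g}"] = 0.1 * rng.standard_normal((n_hidden, n_hidden))
        p[f"b_{g}"] = np.zeros(n_hidden)
    return p


rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hidden=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.standard_normal((5, 3)):  # a 5-frame toy sequence
    h, c = lstm_step(x_t, h, c, params)
```

Because the cell state `c` is updated additively and gated by `f`, information can persist across many timesteps without being squashed at every step.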
3.4. Recurrent Encoder-Decoders
A major advantage and key attribute of Recurrent Encoder-Decoders is their
ability to transform variable-length sequences into a fixed-size vector in the
encoder, then use one or more decoders to decode this vector for different purposes.
Using an RNN as an encoder allows one to obtain this representation of the whole
input sequence. Cho et al. [9] as well as Sutskever et al. [31] have used this
approach for sequence-to-sequence translation, with some differences in the choice
of hidden units and in the use of an additional summary vector (and set of weights)
in the case of Cho et al. [9]. Both approaches need a symbol of end-of-sequence
to allow input and target sequences to have different lengths. They are trained to
maximize the conditional probability of the target sequence given the input
sequence. Our approach is more closely related to the one used by Srivastava et al.
[29] in which they perform unsupervised learning, by either reconstructing the
sequence, predicting the next frames, or both.
Footnote 2: We will use these two same symbols (W and b) without re-defining them throughout the rest
of this paper as they will always have the same definition even though they do not contain the same
values nor have the same dimensions in each layer.
Figure 3: The LSTM cell used for this work. It doesn't have in-cell connections
(peepholes). This is similar to figure 4.2 in [16], but we do not use in-cell connections.
Figure 4: The FR-SRC variant of the architecture studied. This network produces
3 types of outputs w.r.t. a sequence $X = [x_1, ..., x_T]$ and its parameters $\theta$. The set
$\theta_{SC}$ includes all the weights and biases used to compute class probabilities. The
hidden states of the frame encoder, sequence encoder and sequence reconstructive
decoder are denoted here by $h^{FE}$, $h^{SE}$ and $h^{SD}$ respectively. The sequence
representation $c$ is created with the hidden state of the sequence encoder at time
$T$, and $h_c$ represents the sequence classifier's fully connected layers (the softmax
activation is not explicitly shown here).
4. Our Model
4.1. The Architecture
Figure 4 shows an overview of the FR-SRC variant of the proposed archi-
tecture. The model is composed of 5 main components: a per-frame encoder, a
per-frame reconstructive decoder, a sequence encoder, a sequence reconstructive
decoder, and a sequence classifier. Each decoder, along with the classifier,
produces an output used to calculate a cost. Each of these components is added to
produce ever more meaningful features as we go up the layers, by having multiple
costs influence different modules more directly, in a way loosely similar to
ladder networks [26]. Another module that we have used for some experiments and
that is not shown in the figure is the per-frame classifier, which tries to classify
the action based on single frames. This module takes the per-frame encodings to
produce probabilities of actions.
The frame auto-encoder's role is to learn robust per-frame features in an
unsupervised manner by reconstructing the clean version ($x_t$) of a corrupted frame ($\tilde{x}_t$)
at time $t$ [33]. The reconstruction error ($l_{FR,t}$) we use is the well-known mean
squared error, which we apply to each frame before calculating its average over
the frames to get $l_{FR}$:
$$h^{FE}_t = z(\tilde{x}_t) \tag{7}$$
$$\hat{x}_t = g(h^{FE}_t) \tag{8}$$
$$l_{FR,t} = \frac{1}{2} \lVert x_t - \hat{x}_t \rVert^2 \tag{9}$$
$$l_{FR} = \frac{1}{T} \sum_{t=1}^{T} l_{FR,t} \tag{10}$$
In equation 7, $z(\cdot)$ is the encoding function learnt by the bottom feed-forward
layers of the per-frame auto-encoder, while $g(\cdot)$ (eq. 8) is the decoding function
of the module learnt by its upper layers. In further equations, $H^{FE}$ will stand for
the sequence of features $[h^{FE}_1, ..., h^{FE}_T]$ and we will drop the corruption sign
over $x$ ($\tilde{x}$), as we will show equations for a test setting, where the frames are not
corrupted.
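The frame reconstruction loss of equations 7 to 10 can be sketched as a small denoising auto-encoder. This is a simplified illustration, not the network used in the experiments: it assumes a single hidden layer of arbitrary size, tanh activations on both layers, tied encoder and decoder weights, and additive Gaussian noise with standard deviation 0.05 as the corruption. The frame dimension of 69 corresponds to 23 markers with 3 coordinates each.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_hidden = 6, 69, 16  # n_hidden is an illustrative size only

W = rng.uniform(-0.1, 0.1, (n_hidden, d))
b_enc, b_dec = np.zeros(n_hidden), np.zeros(d)


def z(x):
    """Encoder: maps a (possibly corrupted) frame to its features h^FE_t."""
    return np.tanh(W @ x + b_enc)


def g(h):
    """Decoder with tied weights: reuses the transpose of the encoder matrix."""
    return np.tanh(W.T @ h + b_dec)


def frame_reconstruction_loss(X, noise_std=0.05):
    """Mean over frames of 0.5 * ||x_t - g(z(x_t + noise))||^2."""
    X_noisy = X + rng.normal(0.0, noise_std, X.shape)
    per_frame = [0.5 * np.sum((x - g(z(x_n))) ** 2) for x, x_n in zip(X, X_noisy)]
    return float(np.mean(per_frame))


X = rng.uniform(-1.0, 1.0, (T, d))  # toy frames scaled to [-1, 1]
loss = frame_reconstruction_loss(X)
```

Note that the target of the reconstruction is the clean frame `x`, while only the encoder input is corrupted; this is what makes the auto-encoder a denoising one.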
The per-frame classifier uses $h^{FE}_t$ as an input to yield belief values on
movement classes for every frame:
$$a_t = W h^{FE}_t + b \tag{11}$$
These activations are then summed over all frames and a softmax operation is
applied on the result, yielding class probabilities $P(y_k)$ given all the frames $x_t$ of
the whole sequence $X$, and the parameters of the frame encoder $\theta_{FE}$:
$$a_f = \sum_{t=1}^{T} a_t \tag{12}$$
$$P(y_k \mid X, \theta_{FE}) = s_{f,k} = \frac{e^{a_{f,k}}}{\sum_{i=1}^{K} e^{a_{f,i}}} \tag{13}$$
This is similar to the operation used by Du et al. [11] to classify sequences based
on a sequence of activations but differs in the fact that we do not use outputs from
recurrent layers.
We then use the negative log-likelihood of the correct class as our frame-based
classification error ($l_{FC}$):
$$l_{FC} = -\log P(Y = y_k \mid X, \theta_{FE}) \tag{14}$$
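Equations 11 to 14 can be sketched in a few lines of NumPy. The function names, the toy sizes and the max-subtraction trick for numerical stability are assumptions added for the example; the latter does not change the softmax output.

```python
import numpy as np


def sequence_probs_from_frames(H_fe, W, b):
    """Sum per-frame linear activations, then apply one softmax (eqs. 11 to 13)."""
    A = H_fe @ W.T + b     # a_t for every frame, shape (T, K)
    a_f = A.sum(axis=0)    # eq. 12: sum over frames
    a_f = a_f - a_f.max()  # stability shift; leaves the softmax unchanged
    e = np.exp(a_f)
    return e / e.sum()     # eq. 13


def negative_log_likelihood(probs, k):
    """Eq. 14 for the correct class index k."""
    return float(-np.log(probs[k]))


rng = np.random.default_rng(0)
T, n_feat, K = 10, 8, 5  # illustrative sizes only
H_fe = rng.standard_normal((T, n_feat))
W, b = 0.1 * rng.standard_normal((K, n_feat)), np.zeros(K)
probs = sequence_probs_from_frames(H_fe, W, b)
l_fc = negative_log_likelihood(probs, k=2)
```

Summing activations before the softmax lets every frame vote with a confidence-weighted score, rather than taking a hard per-frame decision first.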
The combination of the frame auto-encoder and the frame classifier gives some-
thing very similar to Cho & Chen’s [8] approach, except that each frame’s input
does not contain information about a previous frame. When per-frame recon-
struction is not used, the model still encodes frames with z() before outputting
probabilities with a softmax.
The LSTM encoder’s purpose is to encode the whole sequence of learnt fea-
tures into a fixed length summary vector that models temporal dependencies, and
which can be used for supervised or unsupervised tasks.
$$c(X) = \tanh(W_{sc} h^{SE}_T + b) \tag{15}$$
Here, $c(X)$ is the output of a fully connected layer that has the weight matrix $W_{sc}$.
It uses the last hidden state of the LSTM encoder, $h^{SE}_T$, as an input. The encoder
itself takes $H^{FE}$ as an input sequence. See equations 2 to 6 for calculations of the
hidden state of the LSTM encoder.
If the sequence reconstructive decoder is present, it learns to reconstruct the
feature sequence HFE that was fed to the LSTM encoder. As explained by Sri-
vastava et al. [29], the LSTM decoder can use its own previous prediction at each
timestep to predict the current output, making it a conditional decoder. This is
what we use in this work. As we can see from Figure 4, the summary vector
c(X) is also fed at each timestep to the LSTM decoder. This vector can therefore
serve multiple purposes, and it is up to the network to learn how it will use it
even though we can guide it through assignment of weights on the different costs.
With the output $\hat{H}^{FE} = [\hat{h}^{FE}_1, ..., \hat{h}^{FE}_T]$ from the decoder, we can calculate our
feature sequence reconstruction error ($l_{SR}$):
$$h^D_t = \tanh(W_{ih} \hat{h}^{FE}_{t-1} + W_{hh} h^D_{t-1} + W_{ch} c(X) + b) \tag{16}$$
$$\hat{h}^{FE}_t = \tanh(W h^D_t + b) \tag{17}$$
$$l_{SR,t} = \frac{1}{2} \lVert h^{FE}_t - \hat{h}^{FE}_t \rVert^2 \tag{18}$$
$$l_{SR} = \frac{1}{T} \sum_{t=1}^{T} l_{SR,t} \tag{19}$$
In our case, $\hat{h}^{FE}_0$ is initialized to a zero vector to handle the first frame ($t = 1$).
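The conditional decoding loop can be sketched as follows. This sketch follows equation 16 literally, with a plain tanh recurrence; the decoder used in this work is LSTM-based, so the code only illustrates how the previous prediction and the summary vector are fed back at each timestep. The parameter names, the zero initial decoder state and the sizes are assumptions.

```python
import numpy as np


def decode(c, T, p):
    """Generate T feature vectors, feeding back the previous prediction and c."""
    h_d = np.zeros(p["W_hh"].shape[0])       # assumption: zero initial decoder state
    h_hat_prev = np.zeros(p["W"].shape[0])   # prediction at t = 0 is a zero vector
    outputs = []
    for _ in range(T):
        h_d = np.tanh(p["W_ih"] @ h_hat_prev + p["W_hh"] @ h_d
                      + p["W_ch"] @ c + p["b_h"])        # eq. 16
        h_hat_prev = np.tanh(p["W"] @ h_d + p["b_out"])  # eq. 17
        outputs.append(h_hat_prev)
    return np.stack(outputs)


def sequence_reconstruction_loss(H_fe, H_hat):
    """Eqs. 18 and 19: mean over frames of half the squared error."""
    return float(np.mean(0.5 * np.sum((H_fe - H_hat) ** 2, axis=1)))


rng = np.random.default_rng(0)
T, n_feat, n_dec, n_c = 7, 6, 5, 4  # illustrative sizes only
p = {
    "W_ih": 0.1 * rng.standard_normal((n_dec, n_feat)),
    "W_hh": 0.1 * rng.standard_normal((n_dec, n_dec)),
    "W_ch": 0.1 * rng.standard_normal((n_dec, n_c)),
    "b_h": np.zeros(n_dec),
    "W": 0.1 * rng.standard_normal((n_feat, n_dec)),
    "b_out": np.zeros(n_feat),
}
c = rng.standard_normal(n_c)             # stands in for the summary vector c(X)
H_fe = rng.uniform(-1, 1, (T, n_feat))   # stands in for the encoder's input features
H_hat = decode(c, T, p)
l_sr = sequence_reconstruction_loss(H_fe, H_hat)
```

Since the decoder never sees the true features at test time, the only sequence-specific information it receives is `c`, which pushes the encoder to pack the whole sequence into that vector.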
The sequence classifier is a simple feed-forward MLP that outputs class prob-
abilities based on the summary vector. This is the main task of interest, and the
sequence classifier is therefore always used in our experiments. We again use the
negative log-likelihood as the sequence classification error ($l_{SC}$):
$$h^C = W c(X) + b \tag{20}$$
$$a_{seq} = W h^C + b \tag{21}$$
$$P(y_k \mid X, \theta_{SC}) = s_{seq,k} = \frac{e^{a_{seq,k}}}{\sum_{i=1}^{K} e^{a_{seq,i}}} \tag{22}$$
$$l_{SC} = -\log P(Y = y_k \mid X, \theta_{SC}) \tag{23}$$
Using a sequence classification ratio $r$, we can define different models with
different loss functions, enabling some or all of the modules of the architecture. To
emphasize the task of interest, we always weigh the sequence classification
error from the summary vector ($l_{SC}$) against the mean of the other errors in use,
as shown in Table 2. Setting $r$ to 1 will result in a Sequence Classifier (SC)
network only. Adding feature sequence reconstruction to this model will yield
a Sequence Reconstructive Classifier (SRC). Adding instead frame reconstruction
to the SC will give a Frame Reconstructive-Sequence Classifier (FR-SC), while
adding frame reconstruction to the SRC will yield a Frame Reconstructive SRC
(FR-SRC). Finally, using all modules will give a Frame Reconstructive Classifier-SRC
(FRC-SRC).
Since these loss functions as well as all activation functions of the network
are differentiable with respect to each of its parameters, we can employ stochastic
gradient descent (SGD) and back-propagation through time (BPTT) to train the
network.
Table 2: Variants of the architecture and their loss functions.

MODEL      LOSS FUNCTION
SC         $l_{SC}$
SRC        $r \cdot l_{SC} + (1 - r) \cdot l_{SR}$
FR-SC      $r \cdot l_{SC} + (1 - r) \cdot l_{FR}$
FR-SRC     $r \cdot l_{SC} + (1 - r) \cdot \frac{l_{SR} + l_{FR}}{2}$
FRC-SRC    $r \cdot l_{SC} + (1 - r) \cdot \frac{l_{FC} + l_{SR} + l_{FR}}{3}$
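All five variants in Table 2 follow one pattern: the sequence classification loss weighted by r, plus (1 - r) times the mean of whichever auxiliary losses are enabled. A minimal sketch of that pattern, with a hypothetical function name and toy loss values:

```python
def combined_loss(r, l_sc, aux_losses=()):
    """r * l_SC + (1 - r) * mean(auxiliary losses); plain l_SC when none are used."""
    if not aux_losses:
        return l_sc
    return r * l_sc + (1.0 - r) * sum(aux_losses) / len(aux_losses)


# Toy values, chosen only to show how the variants differ.
l_sc, l_sr, l_fr, l_fc = 2.0, 1.0, 3.0, 0.5
sc = combined_loss(1.0, l_sc)
fr_sc = combined_loss(0.75, l_sc, [l_fr])
fr_src = combined_loss(0.75, l_sc, [l_sr, l_fr])
frc_src = combined_loss(0.75, l_sc, [l_fc, l_sr, l_fr])
```

Taking the mean of the auxiliary losses keeps the total weight on non-classification objectives fixed at (1 - r), no matter how many modules are switched on.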
5. Experiments
5.1. Data
The data in these experiments come from the open HDM05 and CMU MO-
CAP datasets. They both are recorded at 120 frames per second (fps) and contain
more than 30 markers’ positions. In our case, we use 23 common markers be-
tween the two datasets. We work with the C3D file format, which contains series
of positions for each marker. Our preprocessing of the data consists mainly of
orienting, centering and scaling the point cloud of every frame given by the files.
The orientation process is a basis change of all 3D positions so that the actor’s hips
are always facing the same direction, while allowing them to not be parallel to the
floor. We then center the hips of the actor at position (0,0,0) and scale so every
marker is always in the interval [−1, 1]. This can help handle different actors
with different sizes. Since we use 23 markers, each frame vector has a dimension
of 69. To speed up training, we use only 1 frame out of 4 to create shorter, but
still fluid sequences, yielding a 30 frames-per-second rate. Other specifications
with this frame rate are shown in Table 3.

Table 3: Specifications of the two publicly available datasets used, when frame
rate is reduced to 30 fps.

                         HDM05    CMU
NUMBER OF SEQUENCES      2 329    2 531
MIN. LENGTH (FRAMES)     6        3
MEAN LENGTH (FRAMES)     66       467
MAX. LENGTH (FRAMES)     226      5 737
NUMBER OF ACTORS         5        144

Our final experiments using sliding
windows use windows of 30 frames (1 second) with an offset of 15 frames.
Preliminary tests were conducted with shorter and longer windows on a subset of
liminary tests were conducted with shorter and longer windows on a subset of
HDM05, and 30 frames seemed like an optimal choice, even though not critical.
When classifying sequences longer than 30 frames, we use a simple majority vot-
ing strategy on the windows to select the movement class. In all experiments,
we use an additive Gaussian noise with a standard deviation of 0.05 and mean 0
on markers’ positions for training. We use minibatches of size 8 when handling
HDM05 only data, and minibatches of size 32 when using CMU and HDM05.
Sequences shorter than 30 frames are zero-padded. We use binary masks to apply
calculations of outputs and cost evaluations with valid frames only.
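The windowing, zero-padding, masking and majority-voting steps described above can be sketched as follows. This is one possible reading, not the exact procedure used here: in particular, the sketch drops trailing frames that do not fill a complete window, which the text does not specify, and the function names are hypothetical.

```python
from collections import Counter

import numpy as np


def sliding_windows(seq, width=30, offset=15):
    """Cut seq (T, d) into windows; short sequences are zero-padded and masked."""
    T, d = seq.shape
    windows, masks = [], []
    for start in range(0, max(T - width, 0) + 1, offset):
        chunk = seq[start:start + width]
        window = np.zeros((width, d))
        window[:len(chunk)] = chunk
        mask = np.zeros(width, dtype=int)
        mask[:len(chunk)] = 1  # 1 marks a valid frame, 0 marks padding
        windows.append(window)
        masks.append(mask)
    return np.stack(windows), np.stack(masks)


def majority_vote(window_predictions):
    """Pick the class predicted for the largest number of windows."""
    return Counter(window_predictions).most_common(1)[0][0]


long_w, long_m = sliding_windows(np.ones((66, 69)))    # a 66-frame sequence
short_w, short_m = sliding_windows(np.ones((20, 69)))  # shorter than one window
label = majority_vote([3, 1, 3])                       # toy per-window predictions
```

The masks are what allow losses and outputs to be computed over valid frames only, so that padded zeros do not count as motion data.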
5.2. Network Specifications
The frame encoder is closely related to the one used by Cho & Chen [8], as it
has two hidden layers of [1024, 512] units. Two extra layers of [1024, 69] units are
used by the reconstruction decoder with tied weights with the encoder. The frame
classifier only has a special softmax layer applied to the output of the frame
encoder. This layer applies the softmax operation on the sum of its linear activations
for each frame, as shown in equations 11, 12 and 13. The LSTM encoder has 3
hidden layers of [512, 512, 256] LSTM memory cells. As explained in section 3.2,
the output of a single bi-directional recurrent layer can contain, at each timestep,
information for the whole sequence. We therefore use bi-directionality only in the
first LSTM layer of the sequence encoder. This means that the second layer of the
LSTM encoder has an input of size 2∗512 containing past and future information.
The c layer, outputting the summary vector is of size 1024, and the hc layer is of
size 512. A normal softmax layer is placed on top of hc to output probabilities.
This means the feed forward layers on top of the LSTM encoder have the same
size as those used at frame level. Each layer of the LSTM decoder has a number
of units equal to the size of the output of its corresponding layer in the encoder. This
leads to [256, 512, 1024] memory cells. All non-linear activations used in the net-
work consist of the tanh() function except for the input, output and forget gates
of the memory cells that use sigmoid activations.
For feed-forward layers, weights are drawn uniformly from
$[-\sqrt{1/\mathrm{fan}_{in}}, \sqrt{1/\mathrm{fan}_{in}}]$, while we use orthonormal initialization for recurrent
weight matrices. All biases are initialized at 0, except for LSTM forget gates
which are initialized to 1, as proposed by Gers et al. [13] and Jozefowicz et al.
[23]. The learning rate is initialized to 0.04, and is halved when the validation ac-
curacy is not improved in three consecutive epochs, until it reaches below 0.0001.
We use early stopping with a tolerance of 25 epochs. We use a 0.9 momentum
value.
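The initialization and learning-rate schedule above can be sketched in NumPy. The function names are hypothetical, using a QR decomposition to obtain an orthonormal matrix is one common choice rather than a detail taken from this work, and the stopping condition of the schedule is one reading of "until it reaches below 0.0001".

```python
import numpy as np


def init_feedforward(fan_in, fan_out, rng):
    """Uniform weights in [-sqrt(1/fan_in), sqrt(1/fan_in)]."""
    bound = np.sqrt(1.0 / fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))


def init_recurrent(n, rng):
    """Orthonormal matrix taken from the QR decomposition of a random matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def update_learning_rate(lr, epochs_without_improvement, patience=3, floor=1e-4):
    """Halve lr after `patience` stagnant epochs, unless it is already below floor."""
    if epochs_without_improvement >= patience and lr >= floor:
        return lr / 2.0
    return lr


rng = np.random.default_rng(0)
W_ff = init_feedforward(fan_in=69, fan_out=1024, rng=rng)
W_rec = init_recurrent(512, rng)
b_forget = np.ones(512)  # forget-gate biases start at 1, all other biases at 0
lr = update_learning_rate(0.04, epochs_without_improvement=3)
```

An orthonormal recurrent matrix preserves the norm of the hidden state under the linear part of the recurrence, which is a common way to limit vanishing or exploding signals at the start of training.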
5.3. Preliminary Study of Ratios and Windows’ Widths
Before conducting experiments with all variants (Table 2) of the architecture
on the HDM05 and CMU datasets, we tested the network using the FR-SC model.
These preliminary experiments aimed at exploring how the network would per-
form using different sequence classification cost ratios, and when feeding whole
sequences of movements instead of sliding windows. Those tests were performed
in two training phases. First, we used the data from two actors for training, one
for the validation set and two for the test set. After this first phase, we combined
both the training and the validation sets and started a second training phase.
Early stopping was performed based on monitoring the loss of the original, smaller
training set to identify when it had reached the level obtained when the validation
set was used for early stopping. Training in this way maximizes the amount of
data available to the method, but allows early stopping to be used in a heuristic but
principled way. Results are shown in Table 4. We first tried with the ratio r = 1.0,

Table 4: Accuracies on the test set with different classification loss ratios and sliding
window widths.
MODEL      WIDTH       RATIO   ACCURACY (%)
BASELINE   1           0.50    81.64
SC         30          1.00    81.97
FR-SC      30          0.75    84.67
FR-SC      30          0.50    84.02
FR-SC      UNLIMITED   0.75    79.13

which turns the network into an SC model, since only $l_{sce}$ is used. This supervised-only
network beat the baseline accuracy of 81.64% with a score of 81.97%. This
may indicate that using sliding windows and majority voting (compared to Zhu
et al. [35]) can help classification even when using recurrence. We then tried ratios
r = 0.75 and r = 0.5. The latter is the one used in our baseline [8]. The best
accuracy was obtained with r = 0.75, suggesting that a higher weight on the
(supervised) task of interest helps. These three experiments were conducted with
the sliding windows + majority voting strategy. We followed the exploration by
trying to take full advantage of the Recurrent Encoder-Decoder architecture by
handling whole sequences (no sliding windows). This implies encoding whole
sequences into the fixed sized vector c(X). We used the best ratio from the first
three tests, and obtained a lower accuracy of 79.13%. This might be because
more LSTM cells would be needed in the sequence encoder/decoders to
learn to model dependencies on many more temporal scales.
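
To make the sliding windows + majority voting strategy concrete, here is a minimal Python sketch. The window width of 30 frames comes from Table 4; the stride, the handling of sequences shorter than one window, the tie-breaking rule, and the `classify_window` callable standing in for the trained network are all assumptions made for illustration.

```python
from collections import Counter


def sliding_windows(sequence, width=30, stride=1):
    """Yield consecutive windows of `width` frames; a short sequence gives a single window."""
    if len(sequence) <= width:
        yield sequence
        return
    for start in range(0, len(sequence) - width + 1, stride):
        yield sequence[start:start + width]


def classify_sequence(sequence, classify_window, width=30, stride=1):
    """Label a whole sequence with the most frequent per-window prediction.

    `classify_window` is a hypothetical callable mapping one window to a label.
    Ties are broken in favour of the label that was predicted first.
    """
    votes = Counter(classify_window(w) for w in sliding_windows(sequence, width, stride))
    return votes.most_common(1)[0][0]
```

Handling whole sequences, as in the UNLIMITED row of Table 4, corresponds to skipping the windowing step and classifying the full sequence once.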
5.4. Regularization by Reconstruction
The experiments we conducted here used the three separate sets from HDM05
(first phase of training described in section 5.3). Examining these results, we are
able to see the regularization effects of adding different types of reconstructive
modules and losses to the network’s composite error function. Table 5 shows
these effects. As we can see, adding the frame reconstructive module (FR-SC)
helps a lot to reduce over-fitting. Figure 5 clearly shows this effect during training.
Adding the feature sequence reconstruction loss (FR-SRC) and then the frame-based
classification (FRC-SRC) also has a beneficial impact on over-fitting compared
to the SC model, but to a lesser extent. This, however, might be due to the
fact that these two bigger networks (FRC-SRC, FR-SRC) have a lot more capac-
ity due to their LSTM decoder and may therefore tend to over-fit in this limited
data setting (we use here only 40% of HDM05 for training). In order to vali-
date this hypothesis, further tests with more training data were conducted on these
networks as well as on the FR-SC.

Table 5: Regularization effects of the chosen model (and corresponding loss function)
on the accuracies with HDM05 data only.
MODEL      TRAIN (%)   VALID (%)   TEST (%)
SC         99.00       67.26       78.18
FR-SC      98.77       77.71       82.51
FR-SRC     97.43       71.20       79.80
FRC-SRC    98.77       70.22       78.73

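
The composite error function referred to in this section weights the sequence classification loss by the ratio r, so that r = 1.0 leaves only the supervised term. The sketch below is one way to write this down; how the remaining weight (1 - r) is shared among the reconstruction and frame-classification losses is not stated here, so the equal split used below is an assumption, as is the function name.

```python
def composite_loss(l_sce, other_losses, r=0.75):
    """Weight the sequence classification loss by r and the other active losses by (1 - r).

    Assumption: the (1 - r) share is spread equally over `other_losses`
    (e.g. frame reconstruction, sequence reconstruction, frame classification).
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    rest = sum(other_losses) / len(other_losses) if other_losses else 0.0
    return r * l_sce + (1.0 - r) * rest
```

With an empty `other_losses` list and r = 1.0 the function reduces to the SC model of Table 5, while each additional decoder contributes one more entry to the list.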
5.5. Training Reconstructive and Semi-supervised models with more data
The final experiments were performed with two training phases, as in section 5.3.
Importantly, because they also use the validation set for training and perform
early stopping using the original training set, they involve a 50% larger training
set than the experiments of Table 5. Table 6 shows our results for the
movement classification task using our more representative test set for HDM05.
We compare our results with our implementation of the baseline techniques from
Cho & Chen [8] and Zhu et al. [35] on the same test set. Since the CMU data,
in terms of number of frames, outnumbers HDM05 by a significant factor, we divide
errors on reconstruction of unlabeled data by this factor in order to keep our
classification error ratio valid. Gaussian noise is re-generated for each example,
so no sequences are exactly the same. Experiments performed with the combined
Figure 5: Visualization of the effect of adding per-frame reconstructive decoders
on over-fitting throughout training.
datasets (HDM05+CMU) used pre-trained networks to accelerate training, i.e., we
used the networks already trained on HDM05. On HDM05 only, the best model
was the FRC-SRC, which used all 4 losses, supporting our main hypotheses that
we can obtain higher quality representations of the data when using specialized
modules with associated costs in the architecture and that the added network ca-
pacity is useful with bigger training sets. We were surprised by the low perfor-
mance of the FR-SRC on HDM05 only. It seems that in the single experiment
with FR-SRC on HDM05 only, the network might have got stuck during training
in a bad local minimum of the loss function and that adding the unlabeled CMU
data was enough for it to step out of this minimum, a known advantage of semi-
supervised learning. Of course, multiple tests in each setting would help gather
more robust results and standard deviations. This would be of great interest for the
case of the competing FR-SRC and FRC-SRC models which yielded close results

Table 6: Impact of the chosen model (and corresponding loss function) on the test
accuracy.
MODEL      DATASET      ACCURACY (%)
BASELINE   HDM05        81.64
FR-SC      HDM05        84.67
FR-SC      HDM05+CMU    84.23
FR-SRC     HDM05        80.24
FR-SRC     HDM05+CMU    85.64
FRC-SRC    HDM05        85.53
FRC-SRC    HDM05+CMU    85.10

with the combined datasets. Indeed, the frame-based classification module might
not be as useful as other modules. Estimating probabilities of an action based on a
single frame without context might be a task too hard for the network. Therefore,
the per-frame encoding layers might try to reduce the very high loss on frame-
based classification by (often unsuccessfully) producing discriminative features at
the expense of higher reconstruction loss, resulting in less useful features to send
to the LSTM encoder.
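
The two-phase protocol used for the experiments in this section stops the second phase once the loss on the original, smaller training set reaches the level it had when validation-based early stopping selected the phase-one model. A minimal sketch of that stopping rule follows; the function name and the choice of "epoch with the best validation accuracy" as the phase-one reference point are assumptions.

```python
def phase_two_stop_epoch(phase_one_train_losses, phase_one_val_accs, phase_two_train_losses):
    """Return the first phase-two epoch (0-based) at which to stop, or None.

    The target is the training loss recorded in phase one at the epoch with the
    best validation accuracy (assumption). Phase-two losses must be measured on
    the original training set only, not on the merged train + validation set.
    """
    best_epoch = max(range(len(phase_one_val_accs)), key=phase_one_val_accs.__getitem__)
    target = phase_one_train_losses[best_epoch]
    for epoch, loss in enumerate(phase_two_train_losses):
        if loss <= target:
            return epoch
    return None
```

A return value of `None` means the target level was never reached, in which case some other budget (for example a maximum number of epochs) would have to end training.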
Figure 6 shows the confusion matrix generated with the best performing network.
Interestingly, the network confuses actions
PunchLFront (26) and PunchLSide (27) but not PunchRFront (28) and
PunchRSide (29). We propose that this may be due to the fact that actors were
right handed and that movements may have been clearer when using their strong
hand.
Figure 6: Confusion matrix on HDM05 classification for the best performing net-
work (FR-SRC trained on HDM05 + CMU).
5.6. Clustering HDM05
Using the FR-SRC network that got the best results on HDM05 classification,
we performed clustering on the summary vectors it produced for the
test set, unseen during training. We used a Gaussian Mixture Model (GMM) [27]
initialised with K-means++ [1], where K was found by using 10% of the set and a
validation set to find the best likelihood. This system found 30 clusters that we can
visualize in Figure 7. Note that feature vectors have 1024 dimensions and clusters
were found in that space, while we used the t-SNE algorithm [24] to create a 2D
visualization. Some clusters were annotated after manual inspection to give an
idea of what movements the network clustered. We can see that such a trained
Figure 7: 2D visualization of clusters found by the FR-SRC network. Some of
those were annotated after manual inspection of individual sub-sequences inside
clusters.
network could help accelerate labeling MOCAP sequences of movements since
sequences in the most well defined clusters could be labeled in batch. However,
manual annotation seems to suggest that HDM05 actions have a considerable impact
on the clustered actions, since almost all clusters could be associated with one
or two HDM05 labels.
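
The clustering step described in this section can be sketched with scikit-learn as follows. This is an illustrative reconstruction rather than the procedure actually used: the diagonal covariance, the random 10% held-out split used to score each candidate K, the candidate range, and the helper names are all assumptions.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus
from sklearn.mixture import GaussianMixture


def fit_gmm(features, k, seed=0):
    """Fit a GMM whose means are initialised by k-means++ seeding."""
    centres, _ = kmeans_plusplus(features, n_clusters=k, random_state=seed)
    gmm = GaussianMixture(
        n_components=k,
        covariance_type="diag",  # assumption: cheaper than full covariance in 1024-D
        means_init=centres,
        random_state=seed,
    )
    return gmm.fit(features)


def select_k(features, candidate_ks, held_out_fraction=0.1, seed=0):
    """Pick K by mean log-likelihood on a held-out split, then refit on all features."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_held = max(1, int(held_out_fraction * len(features)))
    held, fit = features[order[:n_held]], features[order[n_held:]]
    # GaussianMixture.score returns the average log-likelihood per sample.
    scores = {k: fit_gmm(fit, k, seed).score(held) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, fit_gmm(features, best_k, seed)
```

Here `features` would be the array of summary vectors c(X), one row per test sub-sequence; the fitted model's `predict` method then assigns each vector to a cluster, and a separate t-SNE projection is only needed for the 2D visualization.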
6. Conclusions and Discussion
Recurrent Encoder-Decoder architectures with multiple decoders provide an
attractive framework for semi-supervised, multipurpose representation learning.
Our experiments show that even our simplest configuration using an RNN is able
to outperform our implementation of the state-of-the-art for HDM05 movement
classification with a realistic partition of data. We also found that we were able
to push those results higher through enabling the various modules of the proposed
architecture. Our results indicate that the inclusion of reconstructive decoders
appears to have a regularizing effect and reduces over-fitting.
To properly evaluate this technique (and others) we have defined a realistic
test set on the public HDM05 dataset that we hope can serve as a realistic benchmarking
set in future works on MOCAP classification. We found the definition of the
evaluation unclear in certain previous works, as we could not reach similar
results using the same architectures. Nevertheless, we showed that with the
same gradient-based learning method, our architecture yielded better results with
well defined training, validation and test sets.
Additionally, we showed that such a network is well suited for clustering
as learnt representations compress reconstructive and discriminative information
about sequences. This could help label datasets in batch or create a motion search
engine based on a distance metric in that learnt feature space.
As future work we are interested in exploring alternative decoders, such as
next-frame(s) predictors, to provide even richer features. Dynamic sequence classification
loss ratios could also be tested in order to mimic two-phase semi-supervised
learning (starting with more weight on the reconstructive objective and
progressing towards more weight on the discriminative objective).
Acknowledgements
We thank Ubisoft for their support for this research as well as the authors of
the Theano framework [32].
References
[1] Arthur, David and Vassilvitskii, Sergei. k-means++: The advantages of care-
ful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium
on Discrete algorithms, pp. 1027–1035. Society for Industrial and Applied
Mathematics, 2007.
[2] Barnachon, Mathieu, Bouakaz, Saïda, Boufama, Boubakeur, and Guillou,
Erwan. Ongoing human action recognition with motion capture. Pattern
Recognition, 47(1):238–247, 2014.
[3] Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term
dependencies with gradient descent is difficult. Neural Networks, IEEE
Transactions on, 5(2):157–166, 1994.
[4] Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, Larochelle, Hugo, et al.
Greedy layer-wise training of deep networks. Advances in neural informa-
tion processing systems, 19:153, 2007.
[5] Breuel, Thomas M. Benchmarking of lstm networks. arXiv preprint
arXiv:1508.02774, 2015.
[6] Chaudhry, Rizwan, Ofli, Ferda, Kurillo, Gregorij, Bajcsy, Ruzena, and Vidal,
René. Bio-inspired dynamic 3d discriminative skeletal features for human
action recognition. In Computer Vision and Pattern Recognition Workshops
(CVPRW), 2013 IEEE Conference on, pp. 471–478. IEEE, 2013.
[7] Chen, Xi and Koskela, Markus. Classification of rgb-d and motion capture
sequences using extreme learning machine. In Image Analysis, pp. 640–651.
Springer, 2013.
[8] Cho, Kyunghyun and Chen, Xi. Classifying and visualizing motion cap-
ture sequences using deep neural networks. arXiv preprint arXiv:1306.3874,
2013.
[9] Cho, Kyunghyun, Van Merriënboer, Bart, Gülçehre, Çağlar, Bahdanau,
Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learn-
ing phrase representations using rnn encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078, 2014.
[10] Cotter, Shane F, Rao, Bhaskar D, Engan, Kjersti, and Kreutz-Delgado, Ken-
neth. Sparse solutions to linear inverse problems with multiple measurement
vectors. IEEE Transactions on Signal Processing, 53(7):2477–2488, 2005.
[11] Du, Yong, Wang, Wei, and Wang, Liang. Hierarchical recurrent neural net-
work for skeleton based action recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1110–1118,
2015.
[12] Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, Manzagol, Pierre-
Antoine, Vincent, Pascal, and Bengio, Samy. Why does unsupervised pre-
training help deep learning? The Journal of Machine Learning Research,
11:625–660, 2010.
[13] Gers, Felix A, Schmidhuber, Jürgen, and Cummins, Fred. Learning to forget:
Continual prediction with lstm. Neural computation, 12(10):2451–2471,
2000.
[14] Graves, Alex, Jaitly, Navdeep, and Mohamed, Abdel-rahman. Hybrid speech
recognition with deep bidirectional lstm. In Automatic Speech Recognition
and Understanding (ASRU), 2013 IEEE Workshop on, pp. 273–278. IEEE,
2013.
[15] Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech
recognition with deep recurrent neural networks. In Acoustics, Speech and
Signal Processing (ICASSP), 2013 IEEE International Conference on, pp.
6645–6649. IEEE, 2013.
[16] Graves, Alex. Supervised sequence labelling with recurrent neural networks,
volume 385. Springer, 2012.
[17] Graves, Alex, Liwicki, Marcus, Bunke, Horst, Schmidhuber, Jürgen, and
Fernández, Santiago. Unconstrained on-line handwriting recognition with
recurrent neural networks. In Advances in Neural Information Processing
Systems, pp. 577–584, 2008.
[18] Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye. A fast learning
algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[19] Hochreiter, Sepp. The vanishing gradient problem during learning recurrent
neural nets and problem solutions. International Journal of Uncertainty,
Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.
[20] Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural
computation, 9(8):1735–1780, 1997.
[21] Hung, Chia-Chun, Carlson, Eric T, and Connor, Charles E. Medial axis
shape coding in macaque inferotemporal cortex. Neuron, 74(6):1099–1113,
2012.
[22] Ijjina, Earnest Paul et al. Classification of human actions using pose-based
features and stacked auto encoder. Pattern Recognition Letters, 2016.
[23] Jozefowicz, Rafal, Zaremba, Wojciech, and Sutskever, Ilya. An empirical
exploration of recurrent network architectures. In Proceedings of the 32nd
International Conference on Machine Learning (ICML-15), pp. 2342–2350,
2015.
[24] Maaten, Laurens van der and Hinton, Geoffrey. Visualizing data using t-sne.
Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[25] Müller, Meinard, Röder, Tido, Clausen, Michael, Eberhardt, Bernhard,
Krüger, Björn, and Weber, Andreas. Documentation mocap database hdm05,
2007.
[26] Rasmus, Antti, Berglund, Mathias, Honkala, Mikko, Valpola, Harri, and
Raiko, Tapani. Semi-supervised learning with ladder networks. In Advances
in Neural Information Processing Systems, pp. 3532–3540, 2015.
[27] Reynolds, Douglas. Gaussian mixture models. Encyclopedia of biometrics,
pp. 827–832, 2015.
[28] Sak, Haşim, Senior, Andrew, and Beaufays, Françoise. Long short-term
memory based recurrent neural network architectures for large vocabulary
speech recognition. arXiv preprint arXiv:1402.1128, 2014.
[29] Srivastava, Nitish, Mansimov, Elman, and Salakhutdinov, Ruslan. Un-
supervised learning of video representations using lstms. arXiv preprint
arXiv:1502.04681, 2015.
[30] Sutskever, Ilya, Martens, James, and Hinton, Geoffrey E. Generating text
with recurrent neural networks. In Proceedings of the 28th International
Conference on Machine Learning (ICML-11), pp. 1017–1024, 2011.
[31] Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence
learning with neural networks. In Advances in neural information processing
systems, pp. 3104–3112, 2014.
[32] Team, The Theano Development, Al-Rfou, Rami, Alain, Guillaume, Alma-
hairi, Amjad, Angermueller, Christof, Bahdanau, Dzmitry, Ballas, Nicolas,
Bastien, Frederic, Bayer, Justin, Belikov, Anatoly, et al. Theano: A python
framework for fast computation of mathematical expressions. arXiv preprint
arXiv:1605.02688, 2016.
[33] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-
Antoine. Extracting and composing robust features with denoising autoen-
coders. In Proceedings of the 25th international conference on Machine
learning, pp. 1096–1103. ACM, 2008.
[34] Yu, Dong, Deng, Li, and Dahl, G. Roles of pre-training and fine-tuning in
context-dependent dbn-hmms for real-world speech recognition. In Proc.
NIPS Workshop on Deep Learning and Unsupervised Feature Learning,
2010.
[35] Zhu, Wentao, Lan, Cuiling, Xing, Junliang, Zeng, Wenjun, Li, Yanghao,
Shen, Li, and Xie, Xiaohui. Co-occurrence feature learning for skeleton
based action recognition using regularized deep lstm networks. In Thirtieth
AAAI Conference on Artificial Intelligence, 2016.