An Incremental Learning approach using Sequence Models
Alvaro Conde Lemos Neto
Belo Horizonte - Minas Gerais
July, 2021
Final dissertation presented to the Graduate Program in Electrical Engineering of the Federal University of Minas Gerais in partial fulfillment of the requirements for the degree of Master in Electrical Engineering.
Federal University of Minas Gerais - UFMG
Graduate Program in Electrical Engineering - PPGEE
Machine Intelligence and Data Science Laboratory - MINDS
Supervisor: Cristiano Leite de Castro
Belo Horizonte - Minas Gerais
July, 2021
Alvaro Conde Lemos Neto
An Incremental Learning approach using Sequence Models / Alvaro Conde Lemos Neto. – Belo Horizonte - Minas Gerais, July, 2021.
63 p. : ill. (some in color) ; 30 cm.
Supervisor: Cristiano Leite de Castro
Dissertation (Master's) – Federal University of Minas Gerais - UFMG, Graduate Program in Electrical Engineering - PPGEE, Machine Intelligence and Data Science Laboratory - MINDS, July, 2021.
1. Machine Learning. 2. Incremental Learning. 3. Recurrent Neural Networks.
I dedicate this work to my grandmother Estela, whose thirst for knowledge, kindness to
others and ability to overcome challenges with grace constantly inspires me
Acknowledgements
First and foremost, I am extremely grateful to my supervisor, Prof. Cristiano
Leite de Castro. His invaluable advice, continuous support, patience, and friendship were
decisive during my Master's degree research.
I am also deeply grateful to Rodrigo Amador Coelho for sharing his expertise in
the field of evolving data streams with me, which enriched this research and led to a great
partnership in two scientific publications. Looking forward to more to come! I extend my
gratitude to all colleagues at the MINDS lab.
I would like to offer my special thanks to Prof. Luciano de Errico and Pedro Dias,
who guided me during all my undergraduate years and helped to shape the researcher I
am today. I am also grateful for the friendship and contributions of all colleagues
at the LabCOM lab.
I’m deeply thankful to Prof. Andre Paim and Prof. Luiz Bambirra for all the
contributions and insights made during my dissertation defense.
I extend my thanks to Inter and Avenue Code, for on many occasions giving me the
flexibility to pursue my research, and to Bwtech for helping me join the post-graduation
program.
I would like to thank my parents Cristina and Jorge, my sister Maria, my brother
Gabriel, and all my family, without whom this would not have been possible. I extend my
gratitude to my dear friends, the family I built over the years.
I would also like to express my gratitude to my girlfriend and soon-to-be wife,
Thais, who during all these years has been a source of happiness, love, and friendship,
which has been fuel for me to finish this work.
Finally, I extend my sincere gratitude to Lena, who took me into her family and
many times made the coffee that kept me awake writing this work at night.
“I am learning all the time. The tombstone will be my diploma.”
— Eartha Kitt
Abstract
Due to Big Data and the Internet of Things, machine learning algorithms targeted
specifically to model evolving data streams have gained attention from both academia and
industry. Although most of the solutions proposed in the literature have been reported
as successful in learning from non-stationary streaming settings, their complexity and
need for extra resources may constrain their deployment in real applications. Aiming at
less complexity without losing performance, this work proposes incremental variants of
Recurrent Neural Networks with minor changes, which can tackle evolving data stream
problems such as concept drift and the stability-plasticity dilemma without needing either
a dedicated drift detector or a memory management system. Results achieved on benchmark
datasets show that the proposed methods outperform the other methods in most of the
experiments, while obtaining competitive results in the remaining ones.
Keywords: Machine Learning, Incremental Learning, Neural Networks, Recurrent Neural
Networks, Long Short-Term Memory, Gated Recurrent Unit
List of Figures
Figure 1 – A generic schema for incremental learning algorithms [Gama et al., 2014]
Figure 2 – Types of concept drifts. Circles represent data points; colors represent different classes; and dotted lines represent decision boundaries [Gama et al., 2014]
Figure 3 – How concept drifts can occur over time [Gama et al., 2014]
Figure 4 – An RNN with two layers
Figure 5 – An LSTM cell (adapted from Olah [2015])
Figure 6 – A GRU cell (adapted from Olah [2015])
Figure 7 – Data stream sub-batching
Figure 8 – Collection of predictions of the same sample
Figure 9 – Chess dataset
Figure 10 – Electricity dataset
Figure 11 – Weather dataset
Figure 12 – Sine 1 dataset
Figure 13 – Sine 2 dataset
Figure 14 – SEA dataset
Figure 15 – Stagger dataset
Figure 16 – Spam dataset
List of Tables
Table 1 – Example of a dataset for a sentiment classification problem, where inputs are movie criticisms and outputs are the rating of those criticisms. Adapted from Ng [2018]
Table 2 – Overall performance achieved by all tested methods
List of abbreviations and acronyms
ADWIN ADaptive sliding Window
ARF Adaptive Random Forests
FFNN Feed Forward Neural Network
GRU Gated Recurrent Unit
ISM Incremental Sequence Model
LSTM Long Short-Term Memory
ML Machine Learning
MINDS Machine Intelligence and Data Science
RNN Recurrent Neural Network
SAM-kNN Self Adjusting Memory k Nearest Neighbor
SGD Stochastic Gradient Descent
SVM Support Vector Machines
UFMG Universidade Federal de Minas Gerais
XGBoost eXtreme Gradient Boosting
Contents
1 Introduction
2 State-of-the-art
3 Theoretical Background
3.1 Incremental Learning foundations
3.2 Sequence Models
3.2.1 Recurrent Neural Network
3.2.1.1 Learning
3.2.1.2 The problem of long-term dependencies
3.2.2 Long Short-Term Memory Neural Network
3.2.3 Gated Recurrent Unit Neural Network
4 Incremental Sequence Models
4.1 Sub-batching
4.1.1 Input transformation
4.1.2 Output transformation
4.2 Statefulness
5 Experimental Methodology
5.1 Methods
5.2 Performance Evaluation
5.3 Datasets
6 Results
6.1 Overall performance analysis
6.2 Non-i.i.d. and real datasets
6.3 i.i.d. and synthetic datasets
6.4 i.i.d. and real datasets
7 Conclusion
References
Chapter 1
Introduction
In recent years, Big Data has changed from just a buzzword into a real problem
for many companies. The increase in the processing and memory capabilities of computers,
more sensors generating data, and the rise of the Internet of Things gave businesses
the opportunity to generate more insights by understanding their products and clients
better, but also brought a burden to IT departments. More data require more storage,
more processing power, and techniques to process the data in a distributed manner,
because in many cases it does not fit into a single machine [Mehta et al., 2017,
De Francisci Morales et al., 2016, Zliobaite et al., 2016].
This challenge is reflected in the Machine Learning (ML) field as well. Usually, ML
models are designed to handle finite datasets, so the learning process involves solving
complex optimization problems, as in Support Vector Machines (SVMs) [Cortes and
Vapnik, 1995, Awad and Khanna, 2015]. In addition, most ML models are trained with
chunks of data that are assumed to be stationary, which is rarely the case in the real
world. Instead, information keeps coming in an infinite stream of data that may change
over time due to a phenomenon known as concept drift [Ditzler et al., 2015].
To take that into account, research has been conducted over the past decades in the
field of Incremental Learning, which encompasses models that continuously learn from
streaming datasets while also dealing with concept drift [Gama et al., 2014]. Although those
models solve the Big Data challenges imposed on ML, most of them are structured in
a complex way around classic supervised learning batch algorithms. On the other hand,
Recurrent Neural Networks (RNNs) - especially their variant Long Short-Term Memory
(LSTM) [Hochreiter and Schmidhuber, 1997] - seem to fit perfectly for streaming scenarios:
their hidden state and, more importantly, their memory cell were built to deal with the
stability-plasticity dilemma and to avoid catastrophic forgetting [Gepperth and Hammer,
2016], which are problems inherent to concept drift.
The goal of this work is to tackle the problem of supervised learning from streaming
data by leveraging RNN’s intrinsic characteristics. By applying minor changes in LSTM
and other RNN architectures, the author believes that a family of incremental learning
algorithms can be proposed, one that does not need explicit drift detection, memory
management, nor modifies the weight update rule of RNNs, as reported in other past
studies [Mirza et al., 2020, Marschall et al., 2020, Xu et al., 2020].
The contributions of this work are:
1. The use of RNN’s intrinsic ability to deal with the stability-plasticity dilemma.
2. Modeling data streams as infinite non-i.i.d. sequences, leveraging the power of
RNNs to track time dependencies and the built-in online learning characteristics of
neural networks.
3. The sub-batching mechanism, an effective use of sliding windowing for data aug-
mentation.
4. The statefulness mechanism, a way to keep the internal state of RNNs with the
most updated context of the sequence.
5. The implementation and open sourcing of the pyISM1 package, containing a family
of Incremental Sequence Models (ISMs) with a scikit-multiflow API.
6. The presentation of an early version of this work at the XXIII Brazilian Conference
on Automation, as well as the invitation and subsequent acceptance of an extended
version of that same early work for publication in a special edition of the Journal of
Control, Automation and Electrical Systems.
The rest of this dissertation is organized as follows: Chapter 2 reviews the state of the art
of incremental learning, while Chapter 3 presents its theory. Chapter 4 describes the
proposed approach, and Chapter 5 presents the experimental methodology. Chapter
6 shows the findings of the comparative analysis of online methods, and final discussions
and conclusions are presented in Chapter 7.
1 https://github.com/alvarolemos/pyism
Chapter 2
State-of-the-art
Incremental Learning is a trending and difficult subject, but it is not a new one.
According to Gama et al. [2014], its origins go back to the Perceptron [Rosenblatt, 1958]
(also known as the first artificial neuron), because of its ability to update its weights with
the current example. Due to the diverse range of applications of data streaming learning
algorithms, many other methods have been proposed since then and they are available
today in popular computational frameworks targeted specifically for evolving data streams
(e.g.: MOA1, scikit-multiflow2).
Gama et al. [2014] proposed a generic schema for incremental learning systems
that is composed of four main components: memory management, learning algorithm,
loss estimation, and change detection, as can be seen in Figure 1.
Figure 1 – A generic schema for incremental learning algorithms [Gama et al., 2014]
A brief description of those modules follows:
• Memory Management: responsible for receiving incoming data samples (x(t),y(t)),
and specifying when and which of those samples should be used for learning and
which should be forgotten.
1 https://moa.cms.waikato.ac.nz/
2 https://scikit-multiflow.github.io/
• Learning Algorithm: responsible for modeling a mapping function that
generalizes inputs x(t) to outputs y(t).
• Loss Estimation: the module that compares the predictions ŷ(t) from the learning
algorithm with their associated ground truth y(t) and outputs a loss metric.
• Change Detection: a module that explicitly detects concept drifts and, in many
systems, triggers an update of the learning algorithm and of the memory management
module.
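The interplay of these four modules can be sketched as a single test-then-train loop. The sketch below is purely illustrative: the class names, the toy majority-class model, and the mean-loss threshold detector are the author's placeholders, not components of any cited framework (a real system would use, e.g., ADWIN for change detection).

```python
class MajorityClassModel:
    """Toy Learning Algorithm module: predicts the most frequent label seen."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else 0
    def partial_fit(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
    def reset(self):
        self.counts = {}

class ThresholdDetector:
    """Toy Change Detection module: flags a drift when the mean of the last
    `window` losses exceeds `threshold` (illustrative, not ADWIN)."""
    def __init__(self, window=30, threshold=0.6):
        self.losses, self.window, self.threshold = [], window, threshold
    def update(self, loss):
        self.losses = (self.losses + [loss])[-self.window:]
        if len(self.losses) == self.window and \
                sum(self.losses) / self.window > self.threshold:
            self.losses = []  # reset after signaling a drift
            return True
        return False

class IncrementalLearner:
    """Wires the four modules of the generic schema together."""
    def __init__(self, model, detector, memory_size=200):
        self.model, self.detector = model, detector
        self.memory, self.memory_size = [], memory_size  # Memory Management
        self.drifts = 0
    def step(self, x, y):
        y_pred = self.model.predict(x)
        loss = float(y_pred != y)        # Loss Estimation (0/1 loss)
        if self.detector.update(loss):   # Change Detection
            self.drifts += 1
            self.model.reset()           # learning algorithm update trigger
            self.memory.clear()          # forgetting mechanism trigger
        self.memory = (self.memory + [(x, y)])[-self.memory_size:]
        self.model.partial_fit(x, y)     # Learning Algorithm
        return y_pred

# Stream whose label flips from 0 to 1 at t = 100 (an abrupt drift)
learner = IncrementalLearner(MajorityClassModel(), ThresholdDetector())
for t in range(200):
    learner.step(x=t, y=0 if t < 100 else 1)
print(learner.drifts)  # exactly one drift is flagged shortly after the flip
```

After the flip, the 0/1 losses accumulate in the detector's window until the mean crosses the threshold; the model is then reset and relearns the new concept, so no further drift is signaled.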
Many approaches have been proposed in the literature that can be represented
by the aforementioned schema. They can easily be divided into two major categories:
active approaches, which explicitly detect concept drifts, and passive approaches, those
that continuously adapt themselves without explicit awareness of occurring drifts [Losing
et al., 2016].
Most methods of the active approach are agnostic to the drift detection mechanism
so that they consider drift detectors to be independent of the learning model. This is the
case of many state-of-the-art methods, like the Oza Bagging ADWIN Classifier, which is a
mixture of the Online Bagging algorithm [Oza, 2005] with the ADaptive sliding Window
(ADWIN) [Bifet and Gavalda, 2007] drift detection mechanism. Another method that
follows the same structure is the Adaptive Random Forests (ARF) [Gomes et al., 2017b],
which is an incremental version of the original Random Forest algorithm [Breiman, 2001].
For every incoming sample from a data stream, ARF runs a drift detection mechanism for
each tree of the ensemble and, depending on the value obtained, it either warns a possible
drift (which triggers the training of a background tree) or detects it; in which case the
trained background tree is used to replace the existing one. Similar recent work is the
Adaptive XGBoost [Montiel et al., 2020], which adapts the XGBoost algorithm [Chen and
Guestrin, 2016] for evolving data streams. Adaptive XGBoost and ARF were reported to
achieve good results when combined with the ADWIN drift detector.
For the passive approach, Self Adjusting Memory k Nearest Neighbor (SAM-kNN)
[Losing et al., 2016] was reported to show good performance over data stream classification
problems. SAM-kNN is an ensemble of two models, one trained with the most recent
samples from the stream, called the Short Term Memory, and another one trained with
past samples, called Long Term Memory. The drift detection occurs passively through
the training of the Short Term Memory model with different subsets of a window of
recent samples; this process is analogous to a hyperparameter tuning taking place for
every incoming sample. For a more detailed review of ensemble-based incremental learning
methods, the reader is referred to the studies of Gomes et al. [2017a] and Krawczyk et al. [2017].
In the scope of Artificial Neural Networks (ANNs), online extensions of the ELM
(Extreme Learning Machines) topology have been proposed [Li et al., 2019, Shao and
Er, 2016]. The Forgetting Parameters Extreme Learning Machine (FP-ELM) method performs
incremental training with L2-norm regularization and employs a forgetting factor
on the subset of observations learned at the previous time instant, so that this
subset can be reused along with the current sample. A more recent study on the theme of
recurrent nets achieved promising results by introducing the covariance of the present and
one-time step past input vectors into the gating structure of LSTMs and GRUs (Gated
Recurrent Unit) [Mirza et al., 2020].
In the next chapter, the theoretical foundations of incremental learning will be
introduced, as well as how recurrent neural networks work.
Chapter 3
Theoretical Background
This chapter presents the foundations of incremental learning. This includes
formalizing concepts such as data streams, how incremental learning compares to batch
learning, and what concept drift is. The prequential evaluation is also shown, an
evaluation methodology tailored specifically to the incremental learning setting. After that,
sequence models are introduced, specifically vanilla recurrent neural networks and
some of their variants, which paves the way for the proposed method in
the following chapter.
3.1 Incremental Learning foundations
Data stream
In contrast to a batch dataset with N samples that is bounded from sample x(0)
to x(N−1), a data stream S is unbounded (i.e., it does not have a known beginning or end).
Such a dataset presents new instances x(t) over time to its consumer systems, along with
their true labels y(t), where x(t) is a feature vector made available at time t; in this
case, a consumer system would be a supervised learning model.
There may be some delay between the times x(t) and y(t) are made available, but
the scope of this work is restricted to scenarios where y(t) is presented immediately after
x(t), which is the case for most existing works focused on incremental learning [Gomes
et al., 2017b].
Learning modes
An ML supervised model can be trained in two distinct learning modes:
offline learning and incremental (or online) learning. In the former, the whole dataset is
available at training time, whereas in the latter, the model processes data as they
come in a real-time stream that may be infinite. According to Gama et al. [2014], in this
online scenario two main issues must be considered:
1. If the dataset is infinite, the learning algorithm would need infinite memory to
accommodate it, which is unrealistic;
2. Even if the algorithm were able to keep a long history of past data, the data distribution
may change as time passes, which would make past data stale with respect to the
current data distribution.
Thus, an incremental learning algorithm is subject to a trade-off: it has to
keep past information so that the model does not learn outliers, but it cannot keep too much
information, because the system has memory constraints and also needs to be able to
learn new concepts [Gepperth and Hammer, 2016, Hoi et al., 2018].
Furthermore, when dealing with non-stationary data streams, a change in the
relation between the input data and the target variable can occur at any time. This is a
phenomenon called concept drift and will be introduced next.
Concept drift
A challenge that all incremental learning algorithms are subject to is the change
in the relation between the input data and the target variable p(x, y) (the joint probability
function), which can occur at any time. This event, known in the literature as a concept
drift, is formally defined in Equation 3.1
∃x : p(x, y)t0 ≠ p(x, y)t1    (3.1)

where t0 and t1 are two different time instants and p(x, y)t = p(y|x)t p(x)t [Gama et al.,
2014].
Depending on which components of the aforementioned relation change, the
concept drift can be distinguished between the two following types (also illustrated in
Figure 2):
• Real concept drift: changes that affect p(y|x), thus affecting the decision boundary
• Virtual concept drift: changes that affect only the input data distribution p(x), but
not affecting the decision boundary
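The real/virtual distinction can be made concrete with a tiny synthetic stream. In the sketch below (an illustrative toy, not one of the benchmark generators used later), p(x) stays fixed while p(y|x) flips abruptly at a chosen instant, i.e., a real concept drift:

```python
import random

def drifting_stream(n=1000, drift_at=500, seed=0):
    """Synthetic stream with an abrupt *real* concept drift: p(x) is fixed
    (uniform on the unit square), but at t = drift_at the decision rule
    y = 1 if x0 + x1 > 1 flips, changing p(y|x)."""
    rng = random.Random(seed)
    for t in range(n):
        x = (rng.random(), rng.random())
        above = x[0] + x[1] > 1.0
        y = int(above) if t < drift_at else int(not above)
        yield t, x, y

samples = list(drifting_stream())
# same input region, opposite labels before vs. after the drift point
pre = {y for t, x, y in samples if t < 500 and x[0] + x[1] > 1.0}
post = {y for t, x, y in samples if t >= 500 and x[0] + x[1] > 1.0}
print(pre, post)  # {1} {0}
```

A virtual drift would instead keep the labeling rule and shift only the sampling of x (e.g., drawing x from a different region of the square).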
Another important aspect of concept drifts is when and how they happen. Since
they occur over time, they can occur abruptly, monotonically/incrementally, gradually
Figure 2 – Types of concept drifts. Circles represent data points; colors represent different classes; and dotted lines represent decision boundaries [Gama et al., 2014]
and, due to seasonal effects, they can even reoccur. Figure 3 illustrates these possibilities,
as well as outliers, which should not be confused with true drifts, since they are anomalies
that occur at random.
Figure 3 – How concept drifts can occur over time [Gama et al., 2014]
How an algorithm adapts to a concept drift is a challenge: react too quickly and
old information is lost; react too slowly and concept drifts are not caught at all. This
trade-off is known as the stability-plasticity dilemma [Mermillod et al., 2013].
Performance evaluation
The most commonly used technique to evaluate the performance of a traditional
batch learning algorithm is k-fold cross-validation [James et al., 2013]. It works in the
following fashion:
1. uniformly split the dataset into k parts (folds)
2. use k − 1 folds to train a model
3. use the remaining fold to evaluate the model’s performance with a metric suitable
for the problem
4. average the evaluated k performance metrics in order to obtain the final metric
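The four steps can be sketched with plain index arithmetic (stdlib only; the function name is the author's own). Note that folds other than the first train on indices that come after the test fold, which is exactly why the temporal order of a stream would be violated:

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs following the steps above:
    uniform split into k folds, k-1 folds for training, 1 for testing."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, test

folds = list(kfold_indices(10, 5))
print(len(folds), folds[0])  # 5 folds; the first test fold is [0, 1]
```

Step 4 would then average the k per-fold metrics into the final metric.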
Despite its wide adoption in the batch setting, k-fold cross-validation is not
applicable to streaming, since the temporal order of data would be shuffled.
Another evaluation technique that is popular for the batch setting is holdout
[James et al., 2013], where a larger fraction of the dataset is used for training, and the
28 Chapter 3. Theoretical Background
remaining portion is used for testing. Compared to k-fold cross-validation, it is much
less time-consuming and is suitable for streaming scenarios, since the test portion can be
restricted to time indexes ahead of those used during training. Nevertheless, there
are two problems when using holdout for incremental learning:
1. Since part of the dataset is, as the name suggests, held out, not all samples are used
during training
2. Due to concept drift, it is possible that the training and testing sets belong to
different concepts
Because of those limitations, a technique called prequential evaluation [Dawid,
1984, Gama et al., 2009] is a more appropriate choice for evolving data stream problems.
It works as follows:
1. an initial batch with samples ranging from t0 to ti is used to pretrain the model
2. new batches are presented to the model, which makes predictions
3. these predictions and their true labels are used to calculate the prequential error
with an evaluation metric of choice (suitable for the problem in question), which is
stored in the system, and the current performance is incrementally updated
4. the same batch used for predictions is now used for training
5. this procedure keeps running indefinitely while there are new batches of data coming
Since new batches are used for predictions first, and only then for training, this technique
is also known as interleaved test-then-train. It is tailored for evolving data stream scenarios
and solves the aforementioned problems associated with the holdout technique: the
whole dataset is used for training and, even though at some point a batch from a new
concept may be used for testing, the overall performance over time is smoother.
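The steps above can be condensed into a short test-then-train loop. This is a minimal sketch assuming a hypothetical model object with predict/partial_fit methods (in the spirit of the scikit-multiflow API) and accuracy as the metric of choice:

```python
def prequential_accuracy(stream, model, pretrain=100, batch_size=10):
    """Interleaved test-then-train: each batch is first used for
    prediction, then for training."""
    correct = total = 0
    batch = []
    for i, (x, y) in enumerate(stream):
        if i < pretrain:                  # step 1: pretrain on the initial batch
            model.partial_fit(x, y)
            continue
        batch.append((x, y))
        if len(batch) == batch_size:
            for xb, yb in batch:          # steps 2-3: test first...
                correct += int(model.predict(xb) == yb)
                total += 1
            for xb, yb in batch:          # step 4: ...then train on the same batch
                model.partial_fit(xb, yb)
            batch = []
    return correct / total                # prequential accuracy

class LastLabelModel:
    """Toy model that predicts the last label it was trained on."""
    def __init__(self):
        self.last = None
    def predict(self, x):
        return self.last
    def partial_fit(self, x, y):
        self.last = y

# Stream with an abrupt label flip at t = 200
stream = [(t, 0) for t in range(200)] + [(t, 1) for t in range(200, 400)]
acc = prequential_accuracy(stream, LastLabelModel())
print(acc)  # 290/300: only the batch right at the drift is misclassified
```

Because testing always precedes training, the single batch straddling the drift is the only one the toy model gets wrong; every sample still ends up being used for training.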
3.2 Sequence Models
Before diving into the mechanics of sequence models, it is important to define
what sequences are. A sequence is a set of feature vectors {x(0), x(1), ..., x(T−2), x(T−1)}
ordered along some dimension. Typically, the ordering takes place in the time dimension
and, as such, each element of the sequence is commonly referred to as a time step. Since
each time step is a feature vector, the tth time step of sequence x can be expressed as
x(t) = [x(t)_0, x(t)_1, ..., x(t)_{M−2}, x(t)_{M−1}], where M is the number of features.
Examples of sequences are:
• a tweet, containing a sequence of words, each represented by a feature vector of word
embeddings
• a musical sheet, containing a sequence of musical notes, and each of which can be
represented by a feature vector of intensity and tone
• a time series, composed of a sequence of data points, each of which represented
either by a single value, or multiple values (for multivariate time series)
Just as traditional supervised learning models are not trained with a single
sample, sequence models should not be trained with only one sequence. The difference
is that, while traditional models expect 2D datasets with N samples of M
features each, sequence models expect 3D datasets with N sequences of T time steps, each
of which has M features.
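The 2D-versus-3D expectation is easiest to see as array shapes; the sizes below are purely illustrative:

```python
import numpy as np

N, T, M = 32, 10, 4  # sequences, time steps, features (illustrative sizes)

tabular = np.zeros((N, M))        # traditional model input: (samples, features)
sequential = np.zeros((N, T, M))  # sequence model input: (sequences, time steps, features)

print(tabular.shape, sequential.shape)  # (32, 4) (32, 10, 4)
print(sequential[0, 3].shape)           # (4,): one time step is itself a feature vector
```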
Another difference is that traditional supervised learning algorithms expect inde-
pendent and identically distributed (i.i.d.) datasets, so the order in which their samples are
presented to the model during training is not relevant. On the other hand, due to the
ordered nature of sequences, sequence models expect non-i.i.d. sequences.
One last characteristic specific to sequence models is that they impose no restriction
on sequence length: sequences can be of any size.
Now that sequences and the main goals of sequence models have been defined, it is
time to depict the most popular family of such models: recurrent neural networks.
3.2.1 Recurrent Neural Network
Recurrent Neural Networks (RNNs) [Elman, 1990] are a family of neural nets
specialized in processing sequences. The main component of an RNN is the cell, which can
be thought of as a layer of a feedforward neural network (FFNN); as such, there are
hidden cells and output cells. Both hidden and output cells compute Equation 3.2, while
Equation 3.3 is computed only by output cells.
h(t) = tanh(Wh(t−1) +Ux(t) + b) (3.2)
y(t) = g(V h(t) + c) (3.3)
where:
• x(t) is the tth element of a sequence
• h(t) is the output of hidden layer associated with the tth element of a sequence
• h(t−1) is the output of hidden layer associated with the previous time step
• U is the input weight matrix of a hidden cell
• W is the hidden weight matrix of a hidden cell
• b is the bias vector of a hidden cell
• V is the input weight matrix of an output cell
• c is the bias vector of an output cell
• g is a generic output activation function. Usually it is one of the following:
– linear, for regression problems: g(x) = x
– sigmoid, for binary classification problems: g(x) = 1 / (1 + e^(−x))
– softmax, for multiclass classification problems: g_i(x) = e^(x_i) / Σ_{j=1..K} e^(x_j), ∀i ∈ {1, 2, ..., K}
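Equations 3.2 and 3.3 can be sketched directly in numpy. The function name and the dimensions below are illustrative assumptions; a sigmoid output (binary classification) is used for g:

```python
import numpy as np

def rnn_forward(x_seq, W, U, b, V, c):
    """Run Equations 3.2 and 3.3 over one sequence.
    x_seq: (T, M) sequence; returns hidden states (T, H) and outputs (T, K)."""
    h = np.zeros(W.shape[0])  # initial hidden state h^(-1)
    hs, ys = [], []
    for x_t in x_seq:
        h = np.tanh(W @ h + U @ x_t + b)        # Equation 3.2
        y = 1.0 / (1.0 + np.exp(-(V @ h + c)))  # Equation 3.3, sigmoid g
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T, M, H, K = 5, 3, 4, 1  # time steps, features, hidden units, outputs
hs, ys = rnn_forward(rng.normal(size=(T, M)),
                     W=rng.normal(size=(H, H)), U=rng.normal(size=(H, M)),
                     b=np.zeros(H), V=rng.normal(size=(K, H)), c=np.zeros(K))
print(hs.shape, ys.shape)  # (5, 4) (5, 1)
```

Note that the same W, U, b, V, c are reused at every time step, which is the parameter sharing discussed below.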
As in an FFNN, a cell's output can be connected to another cell's input. As an
example, h(t) from Equation 3.2 could be the input of Equation 3.3. However, unlike
FFNNs, which do not have the concept of time steps, a cell representing time step t − 1
can have its output h(t−1) connected to the cell of time step t. Figure 4 illustrates these
two cases. It also makes explicit that RNNs can have multiple layers, each of them
composed of multiple cells, one for each time step.
Figure 4 – An RNN with two layers
There are two important design aspects of RNN cells that ensure the already
presented main goals of sequence models:
• By having h(t−1) as an input of the cell at time step t, the non-i.i.d. characteristic
of sequences is enforced. In other words, the (hidden) state at time step t − 1 is fed
into time step t
• Even though an RNN is composed of multiple cells, all of them share the same
parameters U, W, b, V, c. This characteristic makes it possible for a single RNN
to work with sequences of varying length
3.2.1.1 Learning
The learning procedure of an RNN is not much different from the way neural
networks in general are trained: labeled sequences {(x(0), y(0)), (x(1), y(1)), ..., (x(T−1), y(T−1))}
are fed to the network and propagated forward by computing Equations 3.2
and 3.3, which leads to a sequence of predictions {ŷ(0), ŷ(1), ..., ŷ(T−1)}. These predictions
and their associated ground truths are fed to a loss function in order to obtain the loss
vector L = [L_0, L_1, ..., L_{N−1}]ᵀ, where L_i is the loss of the ith sequence, obtained by
summing the losses of all of its time steps: L_i = Σ_{t=1..T} L_i^(t).
With L in hand, the model's parameters U, W, b, V, and c can be updated
with any gradient-based approach. Equation 3.4 exemplifies these updates using the Stochastic
Gradient Descent (SGD) [Bishop, 2006] algorithm:

W ← W − η ∂L/∂W
U ← U − η ∂L/∂U
V ← V − η ∂L/∂V
b ← b − η ∂L/∂b
c ← c − η ∂L/∂c    (3.4)

where η is the learning rate and ∂L/∂θ is the component of the gradient of loss L with
respect to parameter θ. Even though SGD was used to exemplify these updates, other
approaches are valid, like batch gradient descent, RMSProp, Adam [Kingma and Ba, 2014],
and many others.
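The update pattern of Equation 3.4 can be seen on a toy one-parameter loss L(w) = w², whose gradient is 2w (a deliberately simple stand-in for the RNN's shared parameters):

```python
eta, w = 0.1, 10.0          # learning rate and initial parameter
for _ in range(100):
    grad = 2 * w            # dL/dw for the toy loss L(w) = w**2
    w = w - eta * grad      # same pattern as Equation 3.4
print(round(w, 6))          # 0.0: the parameter converges to the minimum
```

Each step multiplies w by (1 − 2η) = 0.8, so the parameter decays geometrically toward the minimizer at zero.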
3.2.1.2 The problem of long-term dependencies
The RNN presented so far is usually referred to as the Vanilla RNN, since it is one
of the simplest RNN architectures in the literature. Such simplicity usually leads to poor
performance when dealing with long-term dependencies.
To better understand this phenomenon, consider a sentiment classifier application
that receives as inputs movie criticisms (made of sequences of words) and outputs the
aforementioned criticisms’ classifications (Positive or Negative). A dataset sample is
presented in Table 1.
Table 1 – Example of a dataset for a sentiment classification problem, where inputs are movie criticisms and outputs are the rating of those criticisms. Adapted from Ng [2018]

Criticism | Classification
Film with very good performance and excellent script | Positive
Mediocre production | Negative
Best movie of the year | Positive
That movie was the greatest | Positive
It lacked a good script, good actors, a good scenario and good special effects | Negative
Supposing the model in question is a well-trained classifier, it is straightforward to
see why it would correctly classify the first, third, and fourth samples as positive: U will
give high relevance to the words good, excellent, best, and greatest, in such a way that this
will be propagated in the forward pass via W until a positive classification is predicted. The
same is true for the second sample: the word mediocre will activate the cell and culminate
in a negative classification.
The last sample, though, is tricky: lacked has to activate the cell in such a way that
all the following occurrences of good do not reverse the classification. An exaggerated
version of that criticism sheds light on why this is difficult:

It lacked a good script, good actors, a good scenario, good special effects, a good
soundtrack, a good director, good producers, a good marketing campaign, good costumes
and good photography
As much as the word lacked activates the cell in a way that propagates a
negative representation along the time steps, there are so many occurrences of the
word good after it that their repeated contributions will, at some point, outshine the
negative contribution of lacked.
Long-term dependencies do not impose challenges only during prediction, but
also during training. Consider the gradient ∂L/∂W of the loss with respect to the hidden state
weight in Equation 3.5, and its component ∂h(t)/∂h(k) in Equation 3.6:

∂L/∂W = ∑_{t=1}^{T} ∑_{k=1}^{t} (∂L(t)/∂h(t)) (∂h(t)/∂h(k)) (∂⁺h(k)/∂W)    (3.5)

∂h(t)/∂h(k) = ∏_{i=k}^{t−1} Wᵀ diag(tanh′(x(i−1)))    (3.6)
Equation 3.6 contains products of tanh′(·), a function bounded in the
range (0, 1]. This means that for k ≪ t there will be many products of values in that
range, so ∂h(t)/∂h(k) → 0. Consequently, the contributions of time steps at the beginning
of the sequence (as is the case of lacked) will have almost no influence on the gradient
∂L/∂W of Equation 3.5; thus, any updates to W that would reinforce that words similar to
lacked are negative would not take place. This is a well-known phenomenon called vanishing
gradients, which was detailed in Pascanu et al. [2013].
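A quick numerical illustration of this effect (using an arbitrary fixed pre-activation of 1.0): each tanh′ factor in the product of Equation 3.6 is at most 1 and typically well below it, so the product decays geometrically with the gap t − k.

```python
import numpy as np

def tanh_prime(x):
    """Derivative of tanh: 1 - tanh(x)^2, bounded in (0, 1]."""
    return 1.0 - np.tanh(x) ** 2

# The factor each time step contributes to the product of Equation 3.6,
# evaluated at a fixed pre-activation of 1.0 (any nonzero value works).
factor = tanh_prime(1.0)                   # roughly 0.42
products = [factor ** gap for gap in (1, 5, 10, 20)]
# The gradient component shrinks geometrically as the gap t - k grows.
```

With a gap of 20 time steps, the product is already around 10⁻⁸, i.e., the early time step has essentially no influence on the update.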
3.2.2 Long Short-Term Memory Neural Network
Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] is a more
elaborate RNN architecture that is tailored to solve the problem of vanishing gradients.
An LSTM cell also receives as inputs both the current time step and the previous step’s
representation, but that representation is composed of two parts: the hidden state h(t),
and the cell state s(t). To better understand what they are and how they work, consider
an LSTM cell’s computations in Equation 3.7, as well as its graphical representation in
Figure 5.
f(t) = σ(Wf h(t−1) + Uf x(t) + bf)
i(t) = σ(Wi h(t−1) + Ui x(t) + bi)
o(t) = σ(Wo h(t−1) + Uo x(t) + bo)
s̃(t) = tanh(W h(t−1) + U x(t) + b)
s(t) = f(t) ⊙ s(t−1) + i(t) ⊙ s̃(t)
h(t) = o(t) ⊙ tanh(s(t))
ŷ(t) = softmax(V h(t) + c)
(3.7)
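A minimal NumPy sketch of one forward step following Equation 3.7 (weights are randomly initialized just to make the example runnable; the softmax output layer and all training logic are omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step following Equation 3.7 (softmax output omitted)."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x + p["bf"])     # forget gate
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x + p["bi"])     # input gate
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x + p["bo"])     # output gate
    s_tilde = np.tanh(p["W"] @ h_prev + p["U"] @ x + p["b"])  # candidate cell state
    s = f * s_prev + i * s_tilde                              # new cell state
    h = o * np.tanh(s)                                        # new hidden state
    return h, s

rng = np.random.default_rng(0)
n_h, n_x = 4, 3  # illustrative sizes
p = {k: rng.normal(size=(n_h, n_h)) for k in ("Wf", "Wi", "Wo", "W")}
p.update({k: rng.normal(size=(n_h, n_x)) for k in ("Uf", "Ui", "Uo", "U")})
p.update({k: np.zeros(n_h) for k in ("bf", "bi", "bo", "b")})
h, s = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), p)
```

Note how the new cell state s is a gated sum, not a product; this is the key to the gradient behavior discussed at the end of this section.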
In step 1 of Figure 5, the previous time step's memory cell s(t−1) carries
information from t − 1. Going back to the exemplified long criticism, let us first consider
the case where the word x(t−1) at that time step was lacked. In that case, this information
should pass. The component responsible for that is f(t), computed in step 2.

Note that f(t) is activated by a sigmoid function, which means that its elements
are bounded to the range (0, 1). Since it has its own parameters Wf, Uf and bf, it can
learn that h(t−1) brings relevant information, resulting in a vector f(t) composed mostly
of values close to 1 and yielding, in the element-wise product of step 3, a result close to s(t−1).
Mathematically, it means that f(t) ⊙ s(t−1) ≈ s(t−1) and that f(t) selectively did not forget
the content of s(t−1). Because of that, it is called the forget gate, and its selective memory
mechanism helps to solve the forward-pass problem of long-term sequences.
Now consider that the current input x(t) is a representation of an occurrence of
the word good, from that same long criticism example. In step 4, a temporary cell state s̃(t) is
computed, holding information related to the word good. In step 5, the input gate i(t) is
34 Chapter 3. Theoretical Background
𝞼 𝞼 tanh 𝞼
tanh
+
2
5
6
4
73
8
9
10
x(t)
h(t-1)
s(t-1)
f (t)
i (t) o(t)
s(t)
s (t)~
s(t)
h(t)
ŷ(t)
softmax
111
Figure 5 – An LSTM cell (adapted from Olah [2015])
computed, and its responsibility is to selectively read new information. This is possible
because it also has its own parameters Wi, Ui and bi and, by knowing that h(t−1) brings
relevant information and that x(t) somewhat contradicts it, the element-wise product in step 6
will not let much of this new information pass through. With the remembered (or not
forgotten) information of step 3 and the new information of step 6, the current time
step's cell state is computed in step 7.

Since the sum in step 7 may result in a vector whose elements go beyond
the range [−1, 1], and the hidden state h(t) should be limited to this range, in step 8 the cell
state is normalized with tanh to guarantee that, and then subjected to a last element-wise product
with the output gate, which selectively writes the new information to the final
hidden state in step 10. Step 11 is computed only if the cell is part of the outermost layer of the LSTM
neural net.
Summarizing, the gates f(t), i(t) and o(t) enable information to be selectively forgotten,
read and written to the cell's states s(t) and h(t), which solves the long-term dependency
problem during the forward pass. During the backward pass, the vanishing gradients problem is
also solved, especially due to the cell state's derivatives, which involve a summation instead
of a product, as is the case in Equation 3.6. These derivations are explained in detail
in Hochreiter and Schmidhuber [1997].
3.2.3 Gated Recurrent Unit Neural Network
Although convincing, many may argue that the intuitions behind the LSTM
architecture are arbitrary. In fact, there are many variants of the LSTM, all of which aim
to solve the vanishing gradients problem. One of the most famous variants is the Gated
Recurrent Unit (GRU) neural network [Chung et al., 2014].
The main difference between the two is that while LSTMs have a forget gate to
selectively block old information and an input gate to selectively let new information flow,
GRUs use a single gate to accomplish both: the same gate u(t) used for reading is
also used to forget, by multiplying the state in an element-wise manner by (1 − u(t)).
Also, the cell state and the hidden state are merged. With this design choice, GRUs have fewer
parameters to optimize during training, making them faster. The GRU cell's equations are listed
in Equation 3.8, while Figure 6 illustrates it.
u(t) = σ(Wu h(t−1) + Uu x(t))
r(t) = σ(Wr h(t−1) + Ur x(t))
h̃(t) = tanh(W (r(t) ⊙ h(t−1)) + U x(t))
h(t) = (1 − u(t)) ⊙ h(t−1) + u(t) ⊙ h̃(t)
ŷ(t) = softmax(V h(t) + c)
(3.8)
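As with the LSTM, one GRU step of Equation 3.8 can be sketched in NumPy (random weights for illustration only; the softmax output layer is omitted, and there are no bias terms, matching Equation 3.8):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU time step following Equation 3.8 (softmax output omitted)."""
    u = sigmoid(p["Wu"] @ h_prev + p["Uu"] @ x)             # update gate
    r = sigmoid(p["Wr"] @ h_prev + p["Ur"] @ x)             # reset gate
    h_tilde = np.tanh(p["W"] @ (r * h_prev) + p["U"] @ x)   # candidate state
    return (1.0 - u) * h_prev + u * h_tilde                 # one gate reads and forgets

rng = np.random.default_rng(1)
n_h, n_x = 4, 3  # illustrative sizes
p = {k: rng.normal(size=(n_h, n_h)) for k in ("Wu", "Wr", "W")}
p.update({k: rng.normal(size=(n_h, n_x)) for k in ("Uu", "Ur", "U")})
h = gru_step(rng.normal(size=n_x), np.zeros(n_h), p)
```

The single gate u(t) appearing in both terms of the last line is exactly the read/forget merge described above.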
Figure 6 – A GRU cell (adapted from Olah [2015])
Now that incremental learning and its foundations have been formalized, and the
intuitions behind RNNs and their variants have been shown, the ideas proposed to apply them
to streaming scenarios are presented in the next chapter.
Chapter 4
Incremental Sequence Models
Chapter 2 showed that most solutions proposed to deal with incremental learning
problems rely on enhanced versions of classic batch supervised learning models, with
complex mechanisms to comply with the requirements of evolving data streams. On the other
hand, Chapter 3 suggested that RNNs have inherent properties, such as the modeling of time
dependencies and selective forgetting, that make them naturally attractive for
streaming scenarios.

Accordingly, this chapter brings our ideas on how to adapt RNNs to deal with
evolving data streams and proposes the Incremental Sequence Models (ISMs), which are
simple adaptations of RNNs for incremental learning problems.
4.1 Sub-batching
The incremental learning frameworks mentioned in Chapter 2 have incremental
learning model implementations with APIs that expect batches of 2D datasets, in which an
n × m batch is composed of the number of instances (n) and the number of
features (m). However, as discussed in Chapter 3, RNNs expect 3D datasets of shape n × t × m,
where the new dimension t represents the number of time steps. That said, in order to
adapt an RNN-based algorithm to scenarios that expect 2D datasets, batches of samples
must be transformed into batches of sequences.
There are two simple dimensional adjustments that would easily provide the desired
shape. The first of them would be to consider each sample of the 2D dataset a
sequence of one time step. Its dimension would change from N × M to N × 1 × M (n = N,
t = 1 and m = M). Although it would work, training a sequence model on sequences of
length 1 does not make much sense: its main strengths of modeling time dependencies and
selectively forgetting, reading, and writing would be lost.
The second approach would be to consider the whole batch a single sequence,
resulting in a dimension of 1 × N × M (n = 1, t = N and m = M). In contrast to the first
approach, this one leverages the power of RNNs, but training a supervised learning model
with a single instance would lead to overfitting.
A better approach, named sub-batching, was used. It is composed of two parts:
an input transformation, which turns a 2D batch of samples into a 3D batch of sequences, and an
output transformation, which converts a 3D batch of predictions back into 2D, enabling the use
of RNNs that comply with the 2D dataset APIs of popular incremental learning frameworks.
4.1.1 Input transformation
To transform a 2D batch of samples into a 3D batch of sequences, the samples
are sorted in ascending order by their time of arrival. Then, the batch is broken down into
multiple smaller batches (or sub-batches) in a sliding-window manner. Figure 7 illustrates
this procedure. In blue, we have the current batch, of length 10, which is split into
sub-batches of length 5 (in red), each of which is presented to the model as a sample
sequence. Hence, according to this example, a 10 × M batch of ten samples becomes a
6 × 5 × M batch of six sample sub-batches, where M is the number of features. The batch
and sub-batch sizes (10 and 5, respectively) were chosen merely as an example, since they are
parameters of the ISM models.
Figure 7 – Data stream sub-batching
To better understand the sub-batching procedure, let us formalize it mathematically.
The input of this procedure is a batch of N samples, represented by the N × M matrix X
in Equation 4.1, where xn = [xn,0 xn,1 · · · xn,M−1] is the nth sample, a
feature vector of size M.

X = [x0 x1 · · · xN−1]ᵀ    (4.1)
The desired output consists of the sub-batches of X, which are the matrices stacked, along
the first dimension, into the B × T × M 3D tensor S in Equation 4.2.

S = [S0 S1 · · · SB−1]    (4.2)

where the number of sub-batches B is given by B = N − T + 1 and T is the sub-batch size.
Sub-batches S0, S1, S2 and SB−1 are represented in Equation 4.3.

S0 = [x0 x1 x2 · · · xT−1]ᵀ,  S1 = [x1 x2 x3 · · · xT]ᵀ,  S2 = [x2 x3 x4 · · · xT+1]ᵀ,  · · · ,  SB−1 = [xN−T xN−T+1 xN−T+2 · · · xN−1]ᵀ    (4.3)
Another way to represent S is by Equation 4.4, where each row is one sub-batch.

S =
| x0     x1      x2      · · ·  xT−1 |
| x1     x2      x3      · · ·  xT   |
| x2     x3      x4      · · ·  xT+1 |
| ⋮      ⋮       ⋮       ⋱     ⋮    |
| xN−T   xN−T+1  xN−T+2  · · ·  xN−1 |
(4.4)
Sub-batching was implemented using several vectorization features from Python's
NumPy1 package and was inspired by Azman [2020]. The first feature is index slicing: given
the index matrix I in Equation 4.5 of the output tensor S, one can simply obtain
the desired result by computing S = X[I].

1 https://numpy.org
I =
| 0        1          2          · · ·  T − 1 |
| 1        2          3          · · ·  T     |
| 2        3          4          · · ·  T + 1 |
| ⋮        ⋮          ⋮          ⋱     ⋮     |
| N − T    N − T + 1  N − T + 2  · · ·  N − 1 |
(4.5)
The index matrix I can be obtained by Equation 4.6.

I =
| 0      0      0      · · ·  0     |   | 0  1  2  · · ·  T − 1 |
| 1      1      1      · · ·  1     |   | 0  1  2  · · ·  T − 1 |
| 2      2      2      · · ·  2     | + | 0  1  2  · · ·  T − 1 |
| ⋮      ⋮      ⋮      ⋱     ⋮     |   | ⋮  ⋮  ⋮  ⋱     ⋮    |
| N − T  N − T  N − T  · · ·  N − T |   | 0  1  2  · · ·  T − 1 |
(4.6)
It is easy to notice that the columns of the first term of Equation 4.6 are all equal
to [0 1 2 · · · (N − T)]ᵀ, while the rows of the second term are all equal to [0 1 2 · · · (T − 1)]. Because of
that, another NumPy vectorization feature, called broadcasting, was used to compute I. It is a
native feature of NumPy's main data structure, the ndarray: during an element-wise
operation between arrays with diverging dimensions, the arrays' columns or rows are repeated in
order to obtain arrays with the same dimensions in both terms of the operation. That said,
a broadcast version of Equation 4.6 was used and is presented in Equation 4.7.

I = [0 1 2 · · · (N − T)]ᵀ + [0 1 2 · · · (T − 1)]    (4.7)
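The whole input transformation can then be sketched with the broadcast index matrix of Equation 4.7 (this mirrors the described NumPy approach, not necessarily the exact pyISM implementation):

```python
import numpy as np

def sub_batch(X, T):
    """Turn an (N, M) batch into an (N - T + 1, T, M) batch of sliding-window sequences."""
    N = X.shape[0]
    # Equation 4.7: a column vector plus a row vector broadcasts to the (B, T) index matrix I.
    I = np.arange(N - T + 1)[:, None] + np.arange(T)[None, :]
    return X[I]  # Equation: S = X[I]

X = np.arange(10 * 2).reshape(10, 2)  # N=10 samples, M=2 features
S = sub_batch(X, T=5)                 # B=6 sub-batches of length 5, as in Figure 7
```

The shapes match the running example: a 10 × M batch becomes a 6 × 5 × M tensor.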
4.1.2 Output transformation
The same transformation procedure used to transform X into S is also applied
to X's associated labels Y, represented by Equation 4.8.

Y = [y0 y1 · · · yN−1]ᵀ    (4.8)
By applying sub-batching to Y, Ys is obtained, as represented by Equation 4.9.

Ys =
| y0     y1      y2      · · ·  yT−1 |
| y1     y2      y3      · · ·  yT   |
| y2     y3      y4      · · ·  yT+1 |
| ⋮      ⋮       ⋮       ⋱     ⋮    |
| yN−T   yN−T+1  yN−T+2  · · ·  yN−1 |
(4.9)

Like S, Ys is also a 3D tensor, of dimension B × T × C, where the first two
dimensions B and T are the number of sub-batches and the sequence length (the same as in S), and C is
the dimension of each single output yn, which can be, for instance, the number of classes of a multiclass
classification dataset.
Although sequence models like RNNs produce outputs with the 3D shape of Ys,
this is not the case for most consumers of incremental learning models. Instead, they
expect 2D batches of predictions of shape N × C. That said, Ys (Equation 4.10) has to
be transformed into Y (Equation 4.11).

Ys =
| y0,0      y1,0        y2,0        · · ·  yT−1,0   |
| y1,1      y2,1        y3,1        · · ·  yT,1     |
| y2,2      y3,2        y4,2        · · ·  yT+1,2   |
| ⋮         ⋮           ⋮           ⋱     ⋮        |
| yN−T,B−1  yN−T+1,B−1  yN−T+2,B−1  · · ·  yN−1,B−1 |
(4.10)
Y = [y0 y1 · · · yN−1]ᵀ    (4.11)
where each vector yn,b of Ys is the prediction of input sample xn in the bth sub-batch.

This notation was used to make explicit that although both y1,1 and y1,0 are
predictions of the input x1, they are different. So, the challenge here is to obtain an
aggregated version yn, composed of all the yn,b. All predictions of the same
sample were collected by taking the elements of each diagonal that starts at the
bottom-left and ends at the top-right. Figure 8 pictures this procedure for N = 10 and
B = 6.
y0,0 y1,0 y2,0 y3,0 y4,0
y1,1 y2,1 y3,1 y4,1 y5,1
y2,2 y3,2 y4,2 y5,2 y6,2
y3,3 y4,3 y5,3 y6,3 y7,3
y4,4 y5,4 y6,4 y7,4 y8,4
y5,5 y6,5 y7,5 y8,5 y9,5
Figure 8 – Collection of predictions of the same sample
After collecting all predictions of the same sample, they are aggregated and Y
is obtained. The aggregation function chosen was the arithmetic mean, but other functions
could be used, such as an exponentially weighted average.
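This diagonal collection and averaging can be sketched as follows (scalar predictions for simplicity; the sample matrix is constructed so every anti-diagonal is constant, making the expected aggregate obvious):

```python
import numpy as np

def aggregate_predictions(Ys):
    """Average all predictions of the same sample from a (B, T) prediction matrix.

    Ys[b, j] is the prediction of sample x_{b+j}, so all predictions of sample n
    lie on the anti-diagonal b + j = n (bottom-left to top-right, as in Figure 8).
    """
    B, T = Ys.shape
    N = B + T - 1
    return np.array([
        np.mean([Ys[b, n - b] for b in range(B) if 0 <= n - b < T])
        for n in range(N)
    ])

# Ys[b, j] encodes the sample index b + j, so each anti-diagonal is constant
# and the aggregate must recover [0, 1, ..., 9].
B, T = 6, 5
Ys = (np.arange(B)[:, None] + np.arange(T)[None, :]).astype(float)
y = aggregate_predictions(Ys)
```

Replacing `np.mean` with a weighted average yields the exponentially weighted variant mentioned above.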
4.2 Statefulness
In Chapter 3, the power of RNNs in modeling sequences was highlighted. Among
the main components of their architectures that enable information to be carried through
their cells are their states: the hidden state h(t) and, for LSTMs, the memory cell s(t). It is
important to note, though, that those states are reset for every sequence presented to the model.
In other words, the components within a sequence are non-i.i.d., but the sequences themselves are i.i.d.
This may be acceptable for applications like the exemplified sentiment classifier, where
each movie criticism is in fact independent of the others, but for evolving data streams,
this may not be the case.
The idea here is to think of a data stream as a sequence of infinite length for which
it is always relevant to have a context of the past. This way, a concept drift
can be detected automatically, without the need for complex drift detection mechanisms
like those mentioned in Chapter 2.

To accomplish that, the state of the last sample xN−1 of each batch is stored and
replicated for every sub-batch of the next batch. By doing so, the following problems
are solved:
1. No need to keep a history of past samples that may violate the system’s memory
constraints, since only the RNN cell’s weights and the last values (meaning, at time
t) of the cell states should be kept in memory;
2. No need to implement a specific mechanism to detect a concept drift, since they
should be captured automatically by the memory cell.
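The state hand-off between consecutive batches can be illustrated with a toy stand-in for the RNN (the decayed running sum below is not a real recurrent cell; it is just enough to show the mechanics of storing the last state and replaying it for the next batch's sub-batches):

```python
def run_batch(sub_batches, h0):
    """Process a batch's sub-batches, each starting from the carried-over state h0.

    The 'RNN' here is a stand-in: its state is a decayed running sum, which is
    enough to show the state hand-off mechanics between consecutive batches.
    """
    h = h0
    for seq in sub_batches:
        h = h0                     # every sub-batch is replayed from the carried state
        for x in seq:
            h = 0.9 * h + 0.1 * x  # toy recurrent state update
    return h                       # state after the batch's last sample

batch1 = [[1.0, 2.0], [2.0, 3.0]]    # two sub-batches of two samples each
state = run_batch(batch1, h0=0.0)    # the state of the last sample is stored...
batch2 = [[3.0, 4.0]]
state = run_batch(batch2, h0=state)  # ...and replicated for the next batch's sub-batches
```

Only the scalar `state` survives between batches, mirroring item 1 above: no history of past samples is kept.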
The ideas proposed in this chapter could be extended to distinct supervised
Machine Learning problems, such as regression, classification and forecasting.
Chapter 5
Experimental Methodology
This chapter will present the Incremental Sequence Model variants that were
created, the state-of-the-art methods used for comparison, and their respective parameterization.
The procedure used to obtain the performance evaluation metrics will also be detailed,
as well as the datasets used in the experiments.
5.1 Methods
To evaluate the effectiveness of the proposed algorithm in evolving data stream
scenarios, experiments were conducted using six variants of the Incremental Sequence
Model. Three RNN cell types were used: vanilla RNN, LSTM and GRU, each of them
with two settings: with and without statefulness. The model names were prefixed with I to
denote they are Incremental versions of base models (e.g. IRNN). Also, for the variants
with the statefulness feature activated, their names were suffixed with -ST (e.g. IRNN-ST).
In summary, the six setups used were:
• IRNN-ST: ISM with vanilla RNN cells and statefulness turned on
• IRNN: ISM with vanilla RNN cells and statefulness turned off
• ILSTM-ST: ISM with LSTM cells and statefulness turned on
• ILSTM: ISM with LSTM cells and statefulness turned off
• IGRU-ST: ISM with GRU cells and statefulness turned on
• IGRU: ISM with GRU cells and statefulness turned off
The ISMs were compared with well-known methods from the literature: Oza Bagging
ADWIN [Oza, 2005], Adaptive Random Forest [Gomes et al., 2017b] and SAM-kNN
[Losing et al., 2016], using the Prequential Evaluation [Gama et al., 2009] described
in Section 3.1. All ISM variants had the same hyperparameter settings in all experiments:
• 3 RNN layers, with 150, 100 and 50 units respectively, all of them using the hyperbolic
tangent activation function
• Adam was chosen as the optimizer due to its consistent performance on different
problems [Schmidt et al., 2021]. It was parameterized with TensorFlow’s default
argument values: η = 0.001, β1 = 0.9, β2 = 0.999 and epsilon = 1e−7
• the loss function was set to binary cross-entropy
The other incremental learning methods were configured with their default hyper-
parameters as suggested in the Scikit-multiflow framework1.
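The prequential (test-then-train) protocol of Gama et al. [2009] can be sketched as below; the `MajorityClass` model is a tiny made-up stand-in with a scikit-multiflow-style `predict`/`partial_fit` interface, not one of the evaluated methods:

```python
import numpy as np

def prequential(model, batches):
    """Test-then-train: each batch is first used for evaluation, then for training."""
    accuracies = []
    for X, y in batches:
        y_pred = model.predict(X)                # test on the unseen batch first
        accuracies.append(np.mean(y_pred == y))  # prequential batch accuracy
        model.partial_fit(X, y)                  # only then train on it
    return accuracies

class MajorityClass:
    """Illustrative stand-in: always predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, X):
        label = max(self.counts, key=self.counts.get) if self.counts else 0
        return np.full(len(X), label)
    def partial_fit(self, X, y):
        for label in y:
            self.counts[label] = self.counts.get(label, 0) + 1

batches = [(np.zeros((4, 2)), np.array([1, 1, 1, 0])),
           (np.zeros((4, 2)), np.array([1, 1, 0, 1]))]
accs = prequential(MajorityClass(), batches)
```

Every method in the experiments, ISM or comparison, is evaluated inside a loop of this shape.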
5.2 Performance Evaluation
To increase statistical significance, all experiments were repeated multiple times
and their results were evaluated in two distinct formats:

1. Accuracy evolution plots, to graphically show how well the methods perform during
the whole stream, especially when concept drifts occur

2. Overall accuracy, to summarize a method's performance on a given dataset
Both evaluation formats are based on the batch accuracy vector a^{m,d}_r represented
in Equation 5.1.

a^{m,d}_r = [ a^{m,d}_{r,0}  a^{m,d}_{r,1}  · · ·  a^{m,d}_{r,B−1} ]    (5.1)

where a^{m,d}_{r,b} is the accuracy on the bth batch of the rth experiment run for the mth model
on the dth dataset.
Since the evaluations are specific to a model and dataset pair, all of their batch
accuracies can be represented by the matrix A^{m,d} in Equation 5.2.

A^{m,d} =
| a^{m,d}_{0,0}    a^{m,d}_{0,1}    · · ·  a^{m,d}_{0,B−1}   |
| a^{m,d}_{1,0}    a^{m,d}_{1,1}    · · ·  a^{m,d}_{1,B−1}   |
| ⋮                ⋮                ⋱     ⋮                 |
| a^{m,d}_{R−1,0}  a^{m,d}_{R−1,1}  · · ·  a^{m,d}_{R−1,B−1} |
(5.2)
1 Available online on: scikit-multiflow.github.io
With Equation 5.2, the average a^{m,d}_avg and standard deviation a^{m,d}_std of each batch can be
calculated, as seen in Equations 5.3 and 5.4, respectively.

a^{m,d}_avg = [ (1/R) ∑_{r=0}^{R−1} a^{m,d}_{r,0}   (1/R) ∑_{r=0}^{R−1} a^{m,d}_{r,1}   · · ·   (1/R) ∑_{r=0}^{R−1} a^{m,d}_{r,B−1} ]    (5.3)

a^{m,d}_std = [ √( (1/R) ∑_{r=0}^{R−1} (a^{m,d}_{r,0} − a^{m,d}_{avg,0})² )   · · ·   √( (1/R) ∑_{r=0}^{R−1} (a^{m,d}_{r,B−1} − a^{m,d}_{avg,B−1})² ) ]    (5.4)
From Equations 5.3 and 5.4, the accuracy evolution vector a^{m,d} can be obtained,
as shown in Equation 5.5, and the accuracy evolution plot can be drawn.

a^{m,d} = a^{m,d}_avg ± a^{m,d}_std    (5.5)

The overall accuracy a^{m,d}_overall also comes from Equations 5.3 and 5.4, as can be
seen in Equation 5.6.

a^{m,d}_overall = avg(a^{m,d}_avg) ± avg(a^{m,d}_std)    (5.6)

where avg(a) is a simplified notation for (1/B) ∑_{b=0}^{B−1} a_b.
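Equations 5.3, 5.4 and 5.6 map directly to NumPy reductions over the runs axis (the accuracy values below are invented for illustration; note that `np.std` with its default `ddof=0` matches the division by R in Equation 5.4):

```python
import numpy as np

# Batch accuracies for one (model, dataset) pair: R=3 runs, B=4 batches.
A = np.array([
    [0.80, 0.85, 0.90, 0.95],
    [0.78, 0.86, 0.88, 0.96],
    [0.82, 0.84, 0.92, 0.94],
])

a_avg = A.mean(axis=0)                   # Equation 5.3: per-batch mean over runs
a_std = A.std(axis=0)                    # Equation 5.4: per-batch std over runs (ddof=0)
overall = (a_avg.mean(), a_std.mean())   # Equation 5.6: overall accuracy pair
```

The `a_avg ± a_std` pair drives the accuracy evolution plots, while `overall` fills one cell of Table 2.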
5.3 Datasets
Experiments were performed on four synthetic and four real datasets. Stream
generators2 were used to create the synthetic datasets, and the real datasets are available
at a GitHub repository3.
• Chess: consists of the real game records of one player from December 2007
to March 2010. Since a chess player learns from their previous games, the dataset is
considered not independent and identically distributed (non-i.i.d.). This dataset
has 8 attributes, 2 classes and 503 instances, which were divided into a first batch of
153 samples and subsequent batches of 50 samples. The sub-batch size was set to 30
samples for ILSTM.
2 Available at: scikit-multiflow.github.io
3 Available at: github.com/ogozuacik/concept-drift-datasets-scikit-multiflow
• Electricity: is formed from real data collected in the Australian electricity market.
In this market, prices fluctuate according to demand, and data is sampled every
30 minutes. Since electricity demand is seasonal, this is a non-i.i.d. dataset, and this
characteristic was taken into account in the experimental setup: a sub-batch size of 48
samples (one day's worth of data) and batches of 336 samples (one week of data), except
for the first batch of 960 samples, which adds up to all 45312 instances of the
dataset, which has 8 dimensions and 2 classes.
• Weather: is formed by 50 years of real daily measurements of several meteorological
variables, which characterizes this dataset as non-i.i.d. It has 2 classes, 8 attributes
and 18159 instances, divided into a first batch of 1369 samples and subsequent batches
of 730 samples (two years of data). For ILSTM, the sub-batch size was set to 365
samples (one year of data).
• Sine 1: is a synthetic and i.i.d. dataset with a total of 10000 instances, 2 dimensions,
two different concepts with 5000 instances each and an abrupt change of concept.
In the first concept, all points below the curve y = sin(x) are classified as positive;
after the drift, the classification is reversed. The experimental setup was: 300
samples for the initial batch, 100 samples for the subsequent ones and 60 samples
for ILSTM's sub-batches.
• Sine 2: is a synthetic and i.i.d. dataset with a total of 10000 instances of 4 dimensions,
two of which represent only noise; it also has two different concepts with
5000 instances each and a gradual change. The gradual change of concept was carried
out by a sigmoid function f(t) = 1/(1 + e^(−4(t−p)/w)), where p = 5000 is the position
at which the change happens and w = 500 is the width of the transition. For this
dataset, we followed the same experimental setup as for Sine 1.
• SEA: is a synthetic and i.i.d. dataset with 60000 instances, 3 dimensions and 2
classes. The dataset presents four different concepts with 15000 instances each; 10%
of the data is noise. The experimental setup was the same as for the Sine 1 dataset.
• Stagger: is a synthetic and i.i.d. dataset with 30000 instances, 3 attributes, and two
abrupt drifts taking place in the transition regions between three different concepts with
10000 instances each. It has boolean values, generated by three different functions,
where each concept represents one of these functions. The experimental setup was
the same as for the Sine 1 dataset.
• Spam: is a real dataset containing a collection of legitimate (ham) and spam messages,
with 6213 instances and 499 attributes. Since the messages are not correlated, this is
an i.i.d. dataset. The experimental setup was: 313 samples for the initial batch, 100
samples for the subsequent ones, and 60 samples for ILSTM's sub-batches.
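As an illustration, a Sine-1-style stream with an abrupt drift can be generated as follows. The uniform [0, 1] feature range is an assumption of this sketch; the experiments themselves used scikit-multiflow's stream generators:

```python
import numpy as np

def sine1_stream(n=10_000, drift_at=5_000, seed=0):
    """Sine-1-style stream: label = point below y = sin(x); reversed after the drift."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))        # assumed feature range
    y = (X[:, 1] < np.sin(X[:, 0])).astype(int)   # first concept
    y[drift_at:] = 1 - y[drift_at:]               # abrupt concept reversal
    return X, y

X, y = sine1_stream()
```

Replacing the hard switch at `drift_at` with the sigmoid weighting f(t) described for Sine 2 would yield a gradual drift instead.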
Chapter 6
Results
This chapter will present the results of the experimental setup detailed in Chapter 5.
First, the overall performance will be shown in Section 6.1, and then a more thorough
analysis will be made for the following three groups of datasets:

• non-i.i.d. and real datasets, composed of Chess, Electricity and Weather, in Section 6.2

• i.i.d. and synthetic datasets, composed of Sine 1, Sine 2, SEA and Stagger, in Section 6.3

• i.i.d. and real datasets, composed of Spam, in Section 6.4
6.1 Overall performance analysis
The overall performance for all experiments can be seen in Table 2. Each row
represents the results of all methods for a single dataset, and the one with the highest
average accuracy is highlighted in bold red. Columns show the results of a single method
across all datasets. A cell in this table represents the overall accuracy (as defined in Equation
5.6) of a method over the 5 experiment runs on the dataset.
Table 2 – Overall performance achieved by all tested methods
Dataset     | IRNN          | IRNN-ST       | IGRU          | IGRU-ST       | ILSTM         | ILSTM-ST      | ARF           | SAM-KNN       | OB-ADWIN
Chess       | 77.27 ± 4.32  | 75.00 ± 3.95  | 78.37 ± 4.39  | 77.70 ± 3.43  | 79.67 ± 3.86  | 78.37 ± 3.73  | 77.67 ± 3.06  | 74.50 ± 2.86  | 76.67 ± 4.05
Electricity | 76.14 ± 9.73  | 75.47 ± 9.90  | 78.76 ± 10.06 | 81.50 ± 9.19  | 75.10 ± 11.22 | 73.33 ± 12.19 | 77.97 ± 10.38 | 64.75 ± 9.26  | 69.58 ± 7.77
Weather     | 73.99 ± 3.44  | 73.99 ± 3.92  | 75.79 ± 3.47  | 76.25 ± 3.14  | 75.68 ± 3.27  | 75.73 ± 3.33  | 73.90 ± 3.82  | 73.43 ± 5.60  | 75.32 ± 3.45
Sine 1      | 98.50 ± 7.84  | 96.11 ± 11.57 | 97.67 ± 9.61  | 74.69 ± 25.15 | 98.35 ± 8.07  | 97.31 ± 9.00  | 96.97 ± 8.27  | 96.80 ± 13.72 | 92.90 ± 20.71
Sine 2      | 94.36 ± 9.11  | 92.95 ± 8.79  | 94.20 ± 8.71  | 94.56 ± 8.90  | 94.57 ± 9.19  | 94.49 ± 8.51  | 89.44 ± 10.05 | 86.01 ± 17.52 | 86.63 ± 13.70
SEA         | 83.68 ± 3.91  | 82.83 ± 4.52  | 84.68 ± 3.43  | 84.09 ± 3.46  | 84.45 ± 3.88  | 84.14 ± 4.09  | 87.39 ± 3.61  | 87.16 ± 2.74  | 86.44 ± 2.69
Stagger     | 61.90 ± 28.23 | 72.44 ± 29.27 | 82.89 ± 23.89 | 85.14 ± 23.99 | 98.72 ± 5.56  | 98.58 ± 6.07  | 99.64 ± 3.13  | 99.08 ± 5.90  | 97.98 ± 9.14
Spam        | 88.88 ± 14.25 | 88.86 ± 14.24 | 91.19 ± 16.59 | 89.83 ± 19.08 | 87.96 ± 17.00 | 86.78 ± 17.89 | 90.15 ± 10.28 | 91.42 ± 8.26  | 89.89 ± 9.13
The proposed ISM methods achieved the highest average accuracy for the non-i.i.d.
and real datasets. For the i.i.d. and synthetic ones, they achieved the highest average
accuracy for the Sine 1 and Sine 2 datasets, whereas for the remaining ones (SEA and
Stagger), the state-of-the-art methods used for comparison achieved the best results.
The comparison methods also obtained the highest average accuracy for the i.i.d. and real
dataset (Spam). Still, the ISM models obtained comparable results.
In the following three sections, a more in-depth analysis of the experimental results
will be made. Instead of an aggregate view (as presented in Table 2), accuracy-evolution-over-time
plots will be used, representing the accuracy evolution vectors from Equation
5.5. In such plots, a thick gray line represents the average accuracy for each training
batch (across all experiments), the lighter gray band represents the standard deviation and,
for the synthetic datasets, dashed red lines mark the timestamps where a concept drift
occurred; there is no drift information for the real datasets.
6.2 Non-i.i.d. and real datasets
For the non-i.i.d. and real datasets, the ISM methods obtained the best overall
accuracy average results. The accuracy over time plots for the Chess, Electricity and
Weather datasets can be seen in Figures 9, 10 and 11, respectively.
6.3 i.i.d. and synthetic datasets
The proposed model also ranked among the top results for two of the i.i.d. synthetic
datasets, Sine 1 and Sine 2; their accuracy evolution plots can be seen in
Figures 12 and 13, respectively.
Since those are synthetic datasets, their concept drift time indexes are known
and are highlighted in red in the aforementioned plots. It is evident that, for the proposed
methods, the accuracy drawdown that usually occurs after a drift tends to be smaller
than for the other methods.
For the SEA dataset, although the results of the ISMs were very close, the
comparison methods obtained the top results. As Figure 14 shows, the quicker accuracy
recovery of the ISM methods was also present in these experiments.
The Stagger dataset was another on which the comparison methods obtained the
best results. Figure 15 shows that ILSTM and ILSTM-ST had comparable results, while the
other ISM variants performed very poorly.
6.4 i.i.d. and real datasets
For the Spam dataset, the only dataset in this category, the state-of-the-art models
obtained the best average results. Figure 16 shows its accuracy evolution plot, where it is
Figure 9 – Chess dataset
evident that the ISM methods had higher variance than the others.
Figure 10 – Electricity dataset
Figure 11 – Weather dataset
Figure 12 – Sine 1 dataset
Figure 13 – Sine 2 dataset
Figure 14 – SEA dataset
Figure 15 – Stagger dataset
Figure 16 – Spam dataset
Chapter 7
Conclusion
The proposed method achieved good results, with simplicity as its main strength.
In contrast to other incremental learning algorithms, it does not need a complex memory
structure, nor explicit mechanisms for drift detection or forgetting. Instead, it uses the
RNN's intrinsic ability to deal with the stability-plasticity dilemma, which I believe is
the main contribution of this work.
The successful results of this work relied on other contributions too. The first of
them was the idea of considering data streams as infinite non-i.i.d. sequences and feeding
them to RNNs. It leveraged the power of RNNs to track time dependencies and the built-in
online learning characteristics of neural networks. Although it may be a strong assumption
to consider all data streams to be non-i.i.d., it is reasonable for most real-world problems
involving streaming data.
The second important contribution was the sub-batching mechanism, a sliding-window
form of data augmentation that makes it possible to generate a batch of sequences
out of a batch of samples. It proved effective in practice, and a vectorized
implementation was demonstrated.
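As an illustration, the sliding-window idea behind sub-batching can be vectorized in NumPy roughly as follows. This is a minimal sketch with hypothetical names, not the pyISM implementation:

```python
import numpy as np

def sub_batch(batch, window):
    """Slide a window of length `window` over a batch of samples to build
    overlapping sequences, without a Python loop.
    batch: array of shape (n_samples, n_features)
    returns: array of shape (n_samples - window + 1, window, n_features)
    """
    n = batch.shape[0]
    # (1, window) + (n - window + 1, 1) broadcasts to an index matrix
    # whose row i is [i, i+1, ..., i+window-1]
    idx = np.arange(window)[None, :] + np.arange(n - window + 1)[:, None]
    return batch[idx]

X = np.arange(12, dtype=float).reshape(6, 2)  # a batch of 6 samples, 2 features
seqs = sub_batch(X, window=3)                 # 4 overlapping sequences of length 3
```

With a window of length w, a batch of n samples thus yields n − w + 1 overlapping sequences, each suitable as one RNN input.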
The third relevant contribution was statefulness, a way of keeping the internal
state of the RNNs updated with the most recent context of the sequence. This feature
obtained good results, but it is important to note that the experiments with it disabled
also performed very well. It is therefore not a silver bullet, and a deeper study could be
conducted to understand this characteristic better.
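To make the idea concrete, the sketch below contrasts stateful and stateless processing with a toy Elman cell in plain NumPy, using untrained random weights. It is a simplified illustration of carrying state across batch boundaries, not the actual ISM models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 2, 4
Wx = rng.normal(size=(n_in, n_hidden)) * 0.1      # input-to-hidden weights
Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1  # hidden-to-hidden weights

def rnn_forward(x_seq, h0):
    """Run a minimal Elman cell over one sequence, starting from state h0."""
    h = h0
    for x in x_seq:
        h = np.tanh(x @ Wx + h @ Wh)
    return h

batch_1 = rng.normal(size=(5, n_in))
batch_2 = rng.normal(size=(5, n_in))

# Stateful: the final state of batch_1 seeds batch_2, so the context of the
# sequence is preserved across batch boundaries.
state = rnn_forward(batch_1, np.zeros(n_hidden))
stateful = rnn_forward(batch_2, state)

# Stateless: the state is reset to zeros for every batch, discarding context.
stateless = rnn_forward(batch_2, np.zeros(n_hidden))
```

The two final states differ, showing that the carried context does influence the predictions made on later batches.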
The fourth important contribution was the implementation and open sourcing of
the pyISM1 Python package, containing a family of ISM models with a scikit-multiflow
API. It is based on TensorFlow and is extensible to any type of RNN layer implemented
with the Keras API. The implementation followed industry-level software engineering
practices such as composition, extensibility, and test coverage.
1 https://github.com/alvarolemos/pyism
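A prequential (test-then-train) loop over a stream, in the scikit-multiflow style that the package follows, might look like the sketch below. `MajorityClass` is a hypothetical stand-in estimator used only to keep the example self-contained; it is not part of pyISM:

```python
class MajorityClass:
    """Stand-in estimator exposing a scikit-multiflow-style API
    (partial_fit / predict), as the pyISM models do."""
    def __init__(self):
        self.counts = {}

    def partial_fit(self, X, y):
        for label in y:
            self.counts[label] = self.counts.get(label, 0) + 1
        return self

    def predict(self, X):
        if not self.counts:
            return [0] * len(X)          # default before any training
        top = max(self.counts, key=self.counts.get)
        return [top] * len(X)

# Prequential evaluation over a toy stream: each sample is first used to
# test the current model, then immediately used to train it.
model = MajorityClass()
stream = [([0.1], 0), ([0.9], 1), ([0.8], 1), ([0.2], 0)]
hits = 0
for x, y in stream:
    hits += int(model.predict([x])[0] == y)  # test first...
    model.partial_fit([x], [y])              # ...then train
accuracy = hits / len(stream)
```

The same loop shape applies to any estimator with this interface, which is what makes the scikit-multiflow API convenient for stream-learning experiments.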
The last important contributions were the presentation of an early version of this
work at the XXIII Brazilian Conference on Automation, as well as the invitation to, and
subsequent acceptance of, an extended version of that same work in a special issue of the
Journal of Control, Automation and Electrical Systems.
Although good results were achieved, the proposed methods have limitations: as
suspected, their applicability is limited to non-i.i.d. problems. Also, although not
measured during the experiments, the proposed method is built on complex models, so
computational cost constraints may exist.
Finally, the author believes that there is room for the following future work:
• Go beyond RNNs, since they are not the only existing sequence models. A starting
point would be the Temporal Convolutional Network (TCN) [Bai et al., 2018], which
has shown competitive results and promises to be faster than RNNs.
• Test the proposed method in different learning tasks, such as regression and forecasting.
• Conduct a deeper analysis of the results using statistical tests, since there were
overlaps in the overall accuracy results, as seen in Table 2.
• Re-run the experiments while measuring resource-consumption metrics, such as
memory usage and elapsed time.
• Leverage different approaches to present the results, like drawdown charts.
References
M. Awad and R. Khanna. Support vector machines for classification. In Efficient Learning
Machines, pages 39–66. Springer, 2015.
S. K. Azman. Fast and robust sliding window vectorization with numpy, Jun 2020. URL
https://towardsdatascience.com/fast-and-robust-sliding-window-vectorization-with-numpy-3ad950ed62f5.
S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and
recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
A. Bifet and R. Gavalda. Learning from time-changing data with adaptive windowing. In
Proceedings of the 2007 SIAM international conference on data mining, pages 443–448.
SIAM, 2007.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of
the 22nd ACM SIGKDD international conference on knowledge discovery and data mining,
pages 785–794, 2016.
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent
neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297,
1995.
A. P. Dawid. Present position and potential developments: Some personal views: Statistical
theory, the prequential approach. Journal of the Royal Statistical Society: Series A
(General), 147(2):278–290, 1984.
G. De Francisci Morales, A. Bifet, L. Khan, J. Gama, and W. Fan. Iot big data stream
mining. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge
discovery and data mining, pages 2119–2120, 2016.
G. Ditzler, M. Roveri, C. Alippi, and R. Polikar. Learning in nonstationary environments:
A survey. IEEE Computational Intelligence Magazine, 10(4):12–25, 2015.
J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
J. Gama, R. Sebastiao, and P. P. Rodrigues. Issues in evaluation of stream learning
algorithms. In Proceedings of the 15th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 329–338, 2009.
J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept
drift adaptation. ACM computing surveys (CSUR), 46(4):1–37, 2014.
A. Gepperth and B. Hammer. Incremental learning algorithms and applications. 2016.
H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet. A survey on ensemble learning
for data stream classification. ACM Computing Surveys (CSUR), 50(2):1–36, 2017a.
H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes,
and T. Abdessalem. Adaptive random forests for evolving data stream classification.
Machine Learning, 106(9-10):1469–1495, 2017b.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
S. C. Hoi, D. Sahoo, J. Lu, and P. Zhao. Online learning: A comprehensive survey. arXiv
preprint arXiv:1802.02871, 2018.
G. James, D. Witten, T. Hastie, and R. Tibshirani. An introduction to statistical learning,
volume 112. Springer, 2013.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Wozniak. Ensemble learning
for data stream analysis: A survey. Information Fusion, 37:132–156, 2017.
L. Li, R. Sun, S. Cai, K. Zhao, and Q. Zhang. A review of improved extreme learning
machine methods for data stream classification. Multimedia Tools and Applications, 78
(23):33375–33400, 2019.
V. Losing, B. Hammer, and H. Wersing. Knn classifier with self adjusting memory for
heterogeneous concept drift. In 2016 IEEE 16th international conference on data mining
(ICDM), pages 291–300. IEEE, 2016.
O. Marschall, K. Cho, and C. Savin. A unified framework of online learning algorithms
for training recurrent neural networks. Journal of Machine Learning Research, 21(135):
1–34, 2020.
S. Mehta et al. Concept drift in streaming data classification: Algorithms, platforms and
issues. Procedia computer science, 122:804–811, 2017.
M. Mermillod, A. Bugaiska, and P. Bonin. The stability-plasticity dilemma: Investigating
the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in
psychology, 4:504, 2013.
A. H. Mirza, M. Kerpicci, and S. S. Kozat. Efficient online learning with improved lstm
neural networks. Digital Signal Processing, 102:102742, 2020.
J. Montiel, R. Mitchell, E. Frank, B. Pfahringer, T. Abdessalem, and A. Bifet. Adaptive
xgboost for evolving data streams. arXiv preprint arXiv:2005.07353, 2020.
A. Ng. Sentiment classification, 2018. URL
https://www.coursera.org/learn/nlp-sequence-models/lecture/Jxuhl/sentiment-classification.
C. Olah. Understanding LSTM networks. colah's blog, 2015.
N. C. Oza. Online bagging and boosting. In 2005 IEEE international conference on
systems, man and cybernetics, volume 3, pages 2340–2345. IEEE, 2005.
R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural
networks. In International conference on machine learning, pages 1310–1318, 2013.
F. Rosenblatt. The perceptron: a probabilistic model for information storage and organi-
zation in the brain. Psychological review, 65(6):386, 1958.
R. M. Schmidt, F. Schneider, and P. Hennig. Descending through a crowded valley-
benchmarking deep learning optimizers. In International Conference on Machine
Learning, pages 9367–9376. PMLR, 2021.
Z. Shao and M. J. Er. An online sequential learning algorithm for regularized extreme
learning machine. Neurocomputing, 173:778–788, 2016.
R. Xu, Y. Cheng, Z. Liu, Y. Xie, and Y. Yang. Improved long short-term memory based
anomaly detection with concept drift adaptive method for supporting iot services. Future
Generation Computer Systems, 112:228–242, 2020.
I. Zliobaite, M. Pechenizkiy, and J. Gama. An overview of concept drift applications. In
Big data analysis: new algorithms for a new society, pages 91–114. Springer, 2016.