An Incremental Learning approach using Sequence Models
Alvaro Conde Lemos Neto
Belo Horizonte - Minas Gerais
July, 2021
Final dissertation presented to the Graduate Program in Electrical Engineering of the Federal University of Minas Gerais in partial fulfillment of the requirements for the degree of Master in Electrical Engineering.
Federal University of Minas Gerais - UFMG
Graduate Program in Electrical Engineering - PPGEE
Machine Intelligence and Data Science Laboratory - MINDS
Supervisor: Cristiano Leite de Castro
Belo Horizonte - Minas Gerais
July, 2021
Alvaro Conde Lemos Neto
An Incremental Learning approach using Sequence Models / Alvaro Conde Lemos Neto. – Belo Horizonte - Minas Gerais, July, 2021.
63 p. : ill. (some in color) ; 30 cm.
Supervisor: Cristiano Leite de Castro
Dissertation (Master's) – Federal University of Minas Gerais - UFMG, Graduate Program in Electrical Engineering - PPGEE, Machine Intelligence and Data Science Laboratory - MINDS, July, 2021.
1. Machine Learning. 2. Incremental Learning. 3. Recurrent Neural Networks.
I dedicate this work to my grandmother Estela, whose thirst for knowledge, kindness to
others and ability to overcome challenges with grace constantly inspires me
Acknowledgements
First and foremost, I am extremely grateful to my supervisor, Prof. Cristiano
Leite de Castro. His invaluable advice, continuous support, patience, and friendship were
decisive during my Master's degree research.
I am also deeply grateful to Rodrigo Amador Coelho for sharing his expertise in
the field of evolving data streams with me, which enriched this research and led to a great
partnership in two scientific publications. Looking forward to more to come! I extend my
gratitude to all colleagues at the MINDS lab.
I would like to offer my special thanks to Prof. Luciano de Errico and Pedro Dias,
who guided me during all my undergraduate years and helped to shape the researcher I
am today. I am also grateful for the friendship and contributions of all colleagues
at the LabCOM lab.
I’m deeply thankful to Prof. Andre Paim and Prof. Luiz Bambirra for all the
contributions and insights made during my dissertation defense.
I extend my thanks to Inter and Avenue Code, for on many occasions giving me the
flexibility to pursue my research, and to Bwtech for helping me join the post-graduation
program.
I would like to thank my parents Cristina and Jorge, my sister Maria, my brother
Gabriel, and all my family, without whom this would not have been possible. I extend my
gratitude to my dear friends, the family I built over the years.
I would also like to express my gratitude to my girlfriend and soon-to-be wife,
Thais, who during all these years has been a source of happiness, love, and friendship,
which has been fuel for me to finish this work.
Finally, I extend my sincere gratitude to Lena, who took me into her family and
many times made the coffee that kept me awake writing this work at night.
“I am learning all the time. The tombstone will be my diploma.”
— Eartha Kitt
Abstract
Due to Big Data and the Internet of Things, machine learning algorithms targeted
specifically to model evolving data streams have gained attention from both academia and
industry. Although most of the solutions proposed in the literature have been reported
as successful in learning from non-stationary streaming settings, their complexity and
need for extra resources may constrain their deployment in real applications. Aiming at
less complexity without losing performance, this work proposes incremental variants of
Recurrent Neural Networks with minor changes, which can tackle evolving data stream
problems such as concept drift and the stability-plasticity dilemma without needing either
a dedicated drift detector or a memory management system. Results achieved on benchmark
datasets show that the proposed methods outperform the other methods in most of the
experiments, while obtaining competitive results in the remaining ones.
Keywords: Machine Learning, Incremental Learning, Neural Networks, Recurrent Neural
Networks, Long Short-Term Memory, Gated Recurrent Unit
List of Figures
Figure 1 – A generic schema for incremental learning algorithms [Gama et al., 2014]
Figure 2 – Types of concept drifts. Circles represent data points; colors represent different classes; and dotted lines represent decision boundaries [Gama et al., 2014]
Figure 3 – How concept drifts can occur over time [Gama et al., 2014]
Figure 4 – An RNN with two layers
Figure 5 – An LSTM cell (adapted from Olah [2015])
Figure 6 – A GRU cell (adapted from Olah [2015])
Figure 7 – Data stream sub-batching
Figure 8 – Collection of predictions of the same sample
Figure 9 – Chess dataset
Figure 10 – Electricity dataset
Figure 11 – Weather dataset
Figure 12 – Sine 1 dataset
Figure 13 – Sine 2 dataset
Figure 14 – SEA dataset
Figure 15 – Stagger dataset
Figure 16 – Spam dataset
List of Tables
Table 1 – Example of a dataset for a sentiment classification problem, where inputs are movie criticisms and outputs are the rating of those criticisms. Adapted from Ng [2018]
Table 2 – Overall performance achieved by all tested methods
List of abbreviations and acronyms
ADWIN ADaptive sliding Window
ARF Adaptive Random Forests
FFNN Feed Forward Neural Network
GRU Gated Recurrent Unit
ISM Incremental Sequence Model
LSTM Long Short-Term Memory
ML Machine Learning
MINDS Machine Intelligence and Data Science
RNN Recurrent Neural Network
SAM-kNN Self Adjusting Memory k Nearest Neighbor
SGD Stochastic Gradient Descent
SVM Support Vector Machines
UFMG Universidade Federal de Minas Gerais
XGBoost eXtreme Gradient Boosting
Contents
1 Introduction
2 State-of-the-art
3 Theoretical Background
3.1 Incremental Learning foundations
3.2 Sequence Models
3.2.1 Recurrent Neural Network
3.2.1.1 Learning
3.2.1.2 The problem of long-term dependencies
3.2.2 Long Short-Term Memory Neural Network
3.2.3 Gated Recurrent Unit Neural Network
4 Incremental Sequence Models
4.1 Sub-batching
4.1.1 Input transformation
4.1.2 Output transformation
4.2 Statefulness
5 Experimental Methodology
5.1 Methods
5.2 Performance Evaluation
5.3 Datasets
6 Results
6.1 Overall performance analysis
6.2 Non-i.i.d. and real datasets
6.3 i.i.d. and synthetic datasets
6.4 i.i.d. and real datasets
7 Conclusion
References
Chapter 1
Introduction
In recent years, Big Data has changed from just a buzzword into a real problem
for many companies. The increase in the processing and memory capabilities of computers,
more sensors generating data, and the rise of the Internet of Things gave businesses
the opportunity to generate more insights by understanding their products and clients
better, but also brought a burden to IT departments. More data require more storage,
more processing power, and techniques to process the data in a distributed manner,
because in many cases it does not fit into a single machine [Mehta et al., 2017,
De Francisci Morales et al., 2016, Zliobaite et al., 2016].
This challenge is reflected in the Machine Learning (ML) field as well. Usually, ML
models are designed to handle finite datasets, so the learning process involves solving
complex optimization problems, as in Support Vector Machines (SVMs) [Cortes and
Vapnik, 1995, Awad and Khanna, 2015]. In addition, most ML models are trained with
chunks of data that are assumed to be stationary, which is rarely the case in the real
world. Instead, information keeps coming in an infinite stream of data that may change
over time due to a phenomenon known as concept drift [Ditzler et al., 2015].
To take that into account, research has been conducted over the past decades in the
field of Incremental Learning, which encompasses models that continuously learn from
streaming datasets while also dealing with concept drift [Gama et al., 2014]. Although those
models solve the Big Data challenges imposed on ML, most of them are structured in
a complex way around classic supervised learning batch algorithms. On the other hand,
Recurrent Neural Networks (RNNs) - especially their variant Long Short-Term Memory
(LSTM) [Hochreiter and Schmidhuber, 1997] - seem to fit perfectly for streaming scenarios:
their hidden state and, more importantly, their memory cell were built to deal with the
stability-plasticity dilemma and to avoid catastrophic forgetting [Gepperth and Hammer,
2016], which are problems inherent to concept drift.
The goal of this work is to tackle the problem of supervised learning from streaming
data by leveraging RNN’s intrinsic characteristics. By applying minor changes in LSTM
and other RNN architectures, the author believes that a family of incremental learning
algorithms can be proposed, one that does not need explicit drift detection, memory
management, nor modifies the weight update rule of RNNs, as reported in other past
studies [Mirza et al., 2020, Marschall et al., 2020, Xu et al., 2020].
The contributions of this work are:
1. The use of RNN’s intrinsic ability to deal with the stability-plasticity dilemma.
2. Modeling data streams as infinite non-i.i.d. sequences, leveraging the power of
RNNs to track time dependencies and the built-in online learning characteristics of
neural networks.
3. The sub-batching mechanism, an effective use of sliding windowing for data aug-
mentation.
4. The statefulness mechanism, a way to keep the internal state of RNNs with the
most updated context of the sequence.
5. The implementation and open sourcing of the pyISM1 package, containing a family
of Incremental Sequence Models (ISMs) with a scikit-multiflow API.
6. The presentation of an early version of this work at the XXIII Brazilian Conference
on Automation, as well as the invitation and subsequent acceptance of an extended
version of that same early work for publication in a special edition of the Journal of
Control, Automation and Electrical Systems.
The rest of this dissertation is organized as follows: Chapter 2 reviews the state of the art
of incremental learning, while Chapter 3 presents its theory. Chapter 4 describes the
proposed approach, and Chapter 5 presents the experimental methodology. Chapter
6 shows the findings of the comparative analysis of online methods, and final discussions
and conclusions are presented in Chapter 7.
1 https://github.com/alvarolemos/pyism
Chapter 2
State-of-the-art
Incremental Learning is a trending and difficult subject, but it is not a new one.
According to Gama et al. [2014], its origins go back to the Perceptron [Rosenblatt, 1958]
(also known as the first artificial neuron), because of its ability to update its weights with
the current example. Due to the diverse range of applications of data streaming learning
algorithms, many other methods have been proposed since then and they are available
today in popular computational frameworks targeted specifically for evolving data streams
(e.g.: MOA1, scikit-multiflow2).
Gama et al. [2014] proposed a generic schema for incremental learning systems
that is composed of four main components: memory management, learning algorithm,
loss estimation, and change detection, as can be seen in Figure 1.
Figure 1 – A generic schema for incremental learning algorithms [Gama et al., 2014]
A brief description of those modules follows:
• Memory Management: responsible for receiving incoming data samples (x(t),y(t)),
and specifying when and which of those samples should be used for learning and
which should be forgotten.
1 https://moa.cms.waikato.ac.nz/
2 https://scikit-multiflow.github.io/
• Learning Algorithm: responsible for modeling a mapping function that
generalizes inputs x(t) to outputs y(t).
• Loss Estimation: the module that compares the predictions ŷ(t) from the learning
algorithm with their associated ground truth y(t) and outputs a loss metric.
• Change Detection: a module that explicitly detects concept drifts and, in many
systems, triggers an update of the learning algorithm and of the memory management
module.
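The interplay of these four modules can be sketched as a single test-then-train loop. The sketch below is purely illustrative: the class names, the toy majority-class model, and the mean-loss threshold detector are the author's placeholders, not components of any cited framework (a real system would use, e.g., ADWIN for change detection).

```python
class MajorityClassModel:
    """Toy Learning Algorithm module: predicts the most frequent label seen."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else 0
    def partial_fit(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
    def reset(self):
        self.counts = {}

class ThresholdDetector:
    """Toy Change Detection module: flags a drift when the mean of the last
    `window` losses exceeds `threshold` (illustrative, not ADWIN)."""
    def __init__(self, window=30, threshold=0.6):
        self.losses, self.window, self.threshold = [], window, threshold
    def update(self, loss):
        self.losses = (self.losses + [loss])[-self.window:]
        if len(self.losses) == self.window and \
                sum(self.losses) / self.window > self.threshold:
            self.losses = []  # reset after signaling a drift
            return True
        return False

class IncrementalLearner:
    """Wires the four modules of the generic schema together."""
    def __init__(self, model, detector, memory_size=200):
        self.model, self.detector = model, detector
        self.memory, self.memory_size = [], memory_size  # Memory Management
        self.drifts = 0
    def step(self, x, y):
        y_pred = self.model.predict(x)
        loss = float(y_pred != y)        # Loss Estimation (0/1 loss)
        if self.detector.update(loss):   # Change Detection
            self.drifts += 1
            self.model.reset()           # learning algorithm update trigger
            self.memory.clear()          # forgetting mechanism trigger
        self.memory = (self.memory + [(x, y)])[-self.memory_size:]
        self.model.partial_fit(x, y)     # Learning Algorithm
        return y_pred

# Stream whose label flips from 0 to 1 at t = 100 (an abrupt drift)
learner = IncrementalLearner(MajorityClassModel(), ThresholdDetector())
for t in range(200):
    learner.step(x=t, y=0 if t < 100 else 1)
print(learner.drifts)  # exactly one drift is flagged shortly after the flip
```

After the flip, the 0/1 losses accumulate in the detector's window until the mean crosses the threshold; the model is then reset and relearns the new concept, so no further drift is signaled.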
Many approaches have been proposed in the literature that can be represented
by the aforementioned schema. They can easily be divided into two major categories:
active approaches, which explicitly detect concept drifts, and passive approaches, those
that continuously adapt themselves without explicit awareness of occurring drifts [Losing
et al., 2016].
Most methods of the active approach are agnostic to the drift detection mechanism
so that they consider drift detectors to be independent of the learning model. This is the
case of many state-of-the-art methods, like the Oza Bagging ADWIN Classifier, which is a
mixture of the Online Bagging algorithm [Oza, 2005] with the ADaptive sliding Window
(ADWIN) [Bifet and Gavalda, 2007] drift detection mechanism. Another method that
follows the same structure is the Adaptive Random Forests (ARF) [Gomes et al., 2017b],
which is an incremental version of the original Random Forest algorithm [Breiman, 2001].
For every incoming sample from a data stream, ARF runs a drift detection mechanism for
each tree of the ensemble and, depending on the value obtained, it either warns a possible
drift (which triggers the training of a background tree) or detects it; in which case the
trained background tree is used to replace the existing one. Similar recent work is the
Adaptive XGBoost [Montiel et al., 2020], which adapts the XGBoost algorithm [Chen and
Guestrin, 2016] for evolving data streams. Adaptive XGBoost and ARF were reported to
achieve good results when combined with the ADWIN drift detector.
For the passive approach, Self Adjusting Memory k Nearest Neighbor (SAM-kNN)
[Losing et al., 2016] was reported to show good performance over data stream classification
problems. SAM-kNN is an ensemble of two models, one trained with the most recent
samples from the stream, called the Short Term Memory, and another one trained with
past samples, called Long Term Memory. The drift detection occurs passively through
the training of the Short Term Memory model with different subsets of a window of
recent samples; this process is analogous to a hyperparameter tuning taking place for
every incoming sample. For a more detailed review of ensemble-based incremental learning
methods, the reader is referred to the studies of Gomes et al. [2017a] and Krawczyk et al. [2017].
In the scope of Artificial Neural Networks (ANNs), online extensions of the ELM
(Extreme Learning Machines) topology have been proposed [Li et al., 2019, Shao and
Er, 2016]. The Forgetting Parameters Extreme Learning Machine (FP-ELM) method performs
incremental training with L2-norm regularization and employs a forgetting factor
on the subset of observations learned at the previous time instant, so that this
subset can be reused along with the current sample. A more recent study on the theme of
recurrent nets achieved promising results by introducing the covariance of the present and
one-time step past input vectors into the gating structure of LSTMs and GRUs (Gated
Recurrent Unit) [Mirza et al., 2020].
In the next chapter, the theoretical foundations of incremental learning will be
introduced, as well as how recurrent neural networks work.
Chapter 3
Theoretical Background
This chapter presents the foundations of incremental learning. This includes
formalizing concepts such as data streams, how incremental learning compares to batch
learning, and what concept drift is. The prequential evaluation is also shown, an
evaluation methodology tailored specifically to the incremental learning setting. After that,
sequence models are introduced, specifically vanilla recurrent neural networks and
some of their variants, which paves the way for the proposed method in
the following chapter.
3.1 Incremental Learning foundations
Data stream
In contrast to a batch dataset with N samples that is bounded from sample x(0)
to x(N−1), a data stream S is unbounded (i.e., it does not have a known beginning or end).
Such a dataset presents new instances x(t) over time to its consumer systems, along with
their true labels y(t), where x(t) is a feature vector made available at time t; in this
case, a consumer system would be a supervised learning model.
There may be some delay between the times x(t) and y(t) are made available, but
the scope of this work is restricted to scenarios where y(t) is presented immediately after
x(t), which is the case for most existing works focused on incremental learning [Gomes
et al., 2017b].
Learning modes
An ML supervised model can be trained in two distinct learning modes:
offline learning and incremental (or online) learning. In the former, the whole dataset is
available at training time, whereas in the latter, the model processes data as they
come in a real-time stream that may be infinite. According to Gama et al. [2014], in this
online scenario two main issues must be considered:
1. If the dataset is infinite, the learning algorithm would need infinite memory to
accommodate it, which is unrealistic;
2. Even if the algorithm were able to keep a long history of past data, the data distribution
may change as time passes, which would make past data stale with respect to the
current data distribution.
Thus, an incremental learning algorithm is subject to a trade-off: it has to
keep past information so that the model does not learn outliers, but it cannot keep too much
information, because the system has memory constraints and also needs to be able to
learn new concepts [Gepperth and Hammer, 2016, Hoi et al., 2018].
Furthermore, when dealing with non-stationary data streams, a change in the
relation between the input data and the target variable can occur at any time. This is a
phenomenon called concept drift and will be introduced next.
Concept drift
A challenge that all incremental learning algorithms are subject to is the change
in the relation between the input data and the target variable p(x, y) (the joint probability
function), which can occur at any time. This event, known in the literature as a concept
drift, is formally defined in Equation 3.1
∃x : p(x, y)t0 ≠ p(x, y)t1    (3.1)

where t0 and t1 are two different time instants and p(x, y)t = p(y|x)t p(x)t [Gama et al.,
2014].
Depending on which components of the aforementioned relation change, the
concept drift can be distinguished between the two following types (also illustrated in
Figure 2):
• Real concept drift: changes that affect p(y|x), thus affecting the decision boundary
• Virtual concept drift: changes that affect only the input data distribution p(x), but
not affecting the decision boundary
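The real/virtual distinction can be made concrete with a tiny synthetic stream. In the sketch below (an illustrative toy, not one of the benchmark generators used later), p(x) stays fixed while p(y|x) flips abruptly at a chosen instant, i.e., a real concept drift:

```python
import random

def drifting_stream(n=1000, drift_at=500, seed=0):
    """Synthetic stream with an abrupt *real* concept drift: p(x) is fixed
    (uniform on the unit square), but at t = drift_at the decision rule
    y = 1 if x0 + x1 > 1 flips, changing p(y|x)."""
    rng = random.Random(seed)
    for t in range(n):
        x = (rng.random(), rng.random())
        above = x[0] + x[1] > 1.0
        y = int(above) if t < drift_at else int(not above)
        yield t, x, y

samples = list(drifting_stream())
# same input region, opposite labels before vs. after the drift point
pre = {y for t, x, y in samples if t < 500 and x[0] + x[1] > 1.0}
post = {y for t, x, y in samples if t >= 500 and x[0] + x[1] > 1.0}
print(pre, post)  # {1} {0}
```

A virtual drift would instead keep the labeling rule and shift only the sampling of x (e.g., drawing x from a different region of the square).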
Another important aspect of concept drifts is when and how they happen. Since
they occur over time, they can occur abruptly, monotonically/incrementally, gradually
Figure 2 – Types of concept drifts. Circles represent data points; colors represent different classes; and dotted lines represent decision boundaries [Gama et al., 2014]
and, due to seasonal effects, they can even reoccur. Figure 3 illustrates these possibilities,
as well as outliers, which should not be confused with true drifts, since they are anomalies
that occur at random.
Figure 3 – How concept drifts can occur over time [Gama et al., 2014]
How an algorithm adapts to a concept drift is a challenge: react too quickly and
old information is lost; react too slowly and concept drifts are not caught at all. This
trade-off is known as the stability-plasticity dilemma [Mermillod et al., 2013].
Performance evaluation
The most commonly used technique to evaluate the performance of a traditional
batch learning algorithm is k-fold cross-validation [James et al., 2013]. It works in the
following fashion:
1. uniformly split the dataset into k parts (folds)
2. use k − 1 folds to train a model
3. use the remaining fold to evaluate the model’s performance with a metric suitable
for the problem
4. average the evaluated k performance metrics in order to obtain the final metric
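The four steps can be sketched with plain index arithmetic (stdlib only; the function name is the author's own). Note that folds other than the first train on indices that come after the test fold, which is exactly why the temporal order of a stream would be violated:

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs following the steps above:
    uniform split into k folds, k-1 folds for training, 1 for testing."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, test

folds = list(kfold_indices(10, 5))
print(len(folds), folds[0])  # 5 folds; the first test fold is [0, 1]
```

Step 4 would then average the k per-fold metrics into the final metric.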
Despite its wide adoption in the batch setting, k-fold cross-validation is not
applicable to streaming, since the temporal order of data would be shuffled.
Another evaluation technique that is popular for the batch setting is holdout
[James et al., 2013], where a larger fraction of the dataset is used for training, and the
28 Chapter 3. Theoretical Background
remaining portion is used for testing. Compared to k-fold cross-validation, it is much
less time-consuming and is suitable for streaming scenarios, since the test portion can be
restricted to time indexes ahead of those used during training. Nevertheless, there
are two problems when using holdout for incremental learning:
1. Since part of the dataset is, as the name suggests, held out, not all samples are used
during training
2. Due to concept drift, it is possible that the training and testing sets belong to
different concepts
Because of those limitations, a technique called prequential evaluation [Dawid,
1984, Gama et al., 2009] is a more appropriate choice for evolving data stream problems.
It works as follows:
1. an initial batch with samples ranging from t0 to ti is used to pretrain the model
2. new batches are presented to the model, which makes predictions
3. these predictions and their true labels are used to calculate the prequential error
with an evaluation metric of choice (suitable for the problem in question), which is
stored in the system, and the current performance is incrementally updated
4. the same batch used for predictions is now used for training
5. this procedure keeps running indefinitely while there are new batches of data coming
Since new batches are used for predictions first, and only then for training, this technique
is also known as interleaved test-then-train. It is tailored for evolving data stream scenarios
and solves the aforementioned problems associated with the holdout technique: the
whole dataset is used for training and, even though at some point a batch from a new
concept may be used for testing, the overall performance over time is smoother.
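The steps above can be condensed into a short test-then-train loop. This is a minimal sketch assuming a hypothetical model object with predict/partial_fit methods (in the spirit of the scikit-multiflow API) and accuracy as the metric of choice:

```python
def prequential_accuracy(stream, model, pretrain=100, batch_size=10):
    """Interleaved test-then-train: each batch is first used for
    prediction, then for training."""
    correct = total = 0
    batch = []
    for i, (x, y) in enumerate(stream):
        if i < pretrain:                  # step 1: pretrain on the initial batch
            model.partial_fit(x, y)
            continue
        batch.append((x, y))
        if len(batch) == batch_size:
            for xb, yb in batch:          # steps 2-3: test first...
                correct += int(model.predict(xb) == yb)
                total += 1
            for xb, yb in batch:          # step 4: ...then train on the same batch
                model.partial_fit(xb, yb)
            batch = []
    return correct / total                # prequential accuracy

class LastLabelModel:
    """Toy model that predicts the last label it was trained on."""
    def __init__(self):
        self.last = None
    def predict(self, x):
        return self.last
    def partial_fit(self, x, y):
        self.last = y

# Stream with an abrupt label flip at t = 200
stream = [(t, 0) for t in range(200)] + [(t, 1) for t in range(200, 400)]
acc = prequential_accuracy(stream, LastLabelModel())
print(acc)  # 290/300: only the batch right at the drift is misclassified
```

Because testing always precedes training, the single batch straddling the drift is the only one the toy model gets wrong; every sample still ends up being used for training.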
3.2 Sequence Models
Before diving into the mechanics of sequence models, it is important to define
what sequences are. A sequence is a set of feature vectors {x(0), x(1), ..., x(T−2), x(T−1)}
ordered along some dimension. Typically, the ordering takes place in the time dimension
and, as such, each element of the sequence is commonly referred to as a time step. Since
each time step is a feature vector, the tth time step of sequence x can be expressed as
x(t) = [x(t)_0, x(t)_1, ..., x(t)_{M−2}, x(t)_{M−1}], where M is the number of features.
Examples of sequences are:
• a tweet, containing a sequence of words, each represented by a feature vector of word
embeddings
• a musical sheet, containing a sequence of musical notes, and each of which can be
represented by a feature vector of intensity and tone
• a time series, composed of a sequence of data points, each of which represented
either by a single value, or multiple values (for multivariate time series)
Just as traditional supervised learning models are not trained with a single
sample, sequence models should not be trained with only one sequence. The difference
is that, while traditional models expect 2D datasets with N samples of M
features each, sequence models expect 3D datasets with N sequences of T time steps, each
of which has M features.
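The 2D-versus-3D expectation is easiest to see as array shapes; the sizes below are purely illustrative:

```python
import numpy as np

N, T, M = 32, 10, 4  # sequences, time steps, features (illustrative sizes)

tabular = np.zeros((N, M))        # traditional model input: (samples, features)
sequential = np.zeros((N, T, M))  # sequence model input: (sequences, time steps, features)

print(tabular.shape, sequential.shape)  # (32, 4) (32, 10, 4)
print(sequential[0, 3].shape)           # (4,): one time step is itself a feature vector
```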
Another difference is that traditional supervised learning algorithms expect inde-
pendent and identically distributed (i.i.d.) datasets, so the order in which their samples are
presented to the model during training is not relevant. On the other hand, due to the
ordered nature of sequences, sequence models expect non-i.i.d. sequences.
One last characteristic specific to sequence models is that they impose no restriction
on sequence length: sequences can be of any size.
Now that sequences and the main goals of sequence models have been defined, it is
time to depict the most popular family of such models: recurrent neural networks.
3.2.1 Recurrent Neural Network
Recurrent Neural Networks (RNNs) [Elman, 1990] are a family of neural nets
specialized in processing sequences. The main component of an RNN is the cell, which can
be thought of as a layer of a feedforward neural network (FFNN); as such, there are
hidden cells and output cells. Both hidden and output cells compute Equation 3.2, while
Equation 3.3 is computed only by output cells.
h(t) = tanh(Wh(t−1) +Ux(t) + b) (3.2)
y(t) = g(V h(t) + c) (3.3)
where:
• x(t) is the tth element of a sequence
• h(t) is the output of hidden layer associated with the tth element of a sequence
• h(t−1) is the output of hidden layer associated with the previous time step
• U is the input weight matrix of a hidden cell
• W is the hidden weight matrix of a hidden cell
• b is the bias vector of a hidden cell
• V is the input weight matrix of an output cell
• c is the bias vector of an output cell
• g is a generic output activation function. Usually it is one of the following:
– linear, for regression problems: g(x) = x
– sigmoid, for binary classification problems: g(x) = 1 / (1 + e^(−x))
– softmax, for multiclass classification problems: g_i(x) = e^(x_i) / Σ_{j=1..K} e^(x_j), ∀i ∈ {1, 2, ..., K}
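Equations 3.2 and 3.3 can be sketched directly in numpy. The function name and the dimensions below are illustrative assumptions; a sigmoid output (binary classification) is used for g:

```python
import numpy as np

def rnn_forward(x_seq, W, U, b, V, c):
    """Run Equations 3.2 and 3.3 over one sequence.
    x_seq: (T, M) sequence; returns hidden states (T, H) and outputs (T, K)."""
    h = np.zeros(W.shape[0])  # initial hidden state h^(-1)
    hs, ys = [], []
    for x_t in x_seq:
        h = np.tanh(W @ h + U @ x_t + b)        # Equation 3.2
        y = 1.0 / (1.0 + np.exp(-(V @ h + c)))  # Equation 3.3, sigmoid g
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T, M, H, K = 5, 3, 4, 1  # time steps, features, hidden units, outputs
hs, ys = rnn_forward(rng.normal(size=(T, M)),
                     W=rng.normal(size=(H, H)), U=rng.normal(size=(H, M)),
                     b=np.zeros(H), V=rng.normal(size=(K, H)), c=np.zeros(K))
print(hs.shape, ys.shape)  # (5, 4) (5, 1)
```

Note that the same W, U, b, V, c are reused at every time step, which is the parameter sharing discussed below.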
As in an FFNN, a cell's output can be connected to another cell's input. As an
example, h(t) from Equation 3.2 could be the input of Equation 3.3. However, unlike
FFNNs, which do not have the concept of time steps, a cell representing time step t − 1
can have its output h(t−1) connected to the cell of time step t. Figure 4 illustrates these
two cases. It also makes explicit that RNNs can have multiple layers, each of them
composed of multiple cells, one for each time step.
Figure 4 – An RNN with two layers
There are two important design aspects of RNN cells that ensure the already
presented main goals of sequence models:
• By having h(t−1) as an input of the cell at time step t, the non-i.i.d. characteristic
of sequences is enforced. In other words, the (hidden) state at time step t − 1 is fed
into time step t
• Even though an RNN is composed of multiple cells, all of them share the same
parameters U, W, b, V, c. This characteristic makes it possible for a single RNN
to work with sequences of varying length
3.2.1.1 Learning
The learning procedure of an RNN is not much different from the way neural
networks in general are trained: labeled sequences {(x(0), y(0)), (x(1), y(1)), ..., (x(T−1), y(T−1))}
are fed to the network and propagated forward by computing Equations 3.2
and 3.3, which leads to a sequence of predictions {ŷ(0), ŷ(1), ..., ŷ(T−1)}. These predictions
and their associated ground truths are fed to a loss function in order to obtain the loss
vector L = [L_0, L_1, ..., L_{N−1}]ᵀ, where L_i is the loss of the ith sequence, obtained by
summing the losses of all of its time steps: L_i = Σ_{t=1..T} L_i^(t).
With L in hand, the model's parameters U, W, b, V, and c can be updated
with any gradient-based approach. Equation 3.4 exemplifies these updates using the Stochastic
Gradient Descent (SGD) [Bishop, 2006] algorithm:

W ← W − η ∂L/∂W
U ← U − η ∂L/∂U
V ← V − η ∂L/∂V
b ← b − η ∂L/∂b
c ← c − η ∂L/∂c    (3.4)

where η is the learning rate and ∂L/∂θ is the component of the gradient of loss L with
respect to parameter θ. Even though SGD was used to exemplify these updates, other
approaches are valid, like batch gradient descent, RMSProp, Adam [Kingma and Ba, 2014],
and many others.
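The update pattern of Equation 3.4 can be seen on a toy one-parameter loss L(w) = w², whose gradient is 2w (a deliberately simple stand-in for the RNN's shared parameters):

```python
eta, w = 0.1, 10.0          # learning rate and initial parameter
for _ in range(100):
    grad = 2 * w            # dL/dw for the toy loss L(w) = w**2
    w = w - eta * grad      # same pattern as Equation 3.4
print(round(w, 6))          # 0.0: the parameter converges to the minimum
```

Each step multiplies w by (1 − 2η) = 0.8, so the parameter decays geometrically toward the minimizer at zero.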
3.2.1.2 The problem of long-term dependencies
The RNN presented so far is usually referred to as the Vanilla RNN, since it is one
of the simplest RNN architectures in the literature. Such simplicity usually leads to poor
performance when dealing with long-term dependencies.
To better understand this phenomenon, consider a sentiment classifier application
that receives as inputs movie criticisms (made of sequences of words) and outputs the
aforementioned criticisms’ classifications (Positive or Negative). A dataset sample is
presented in Table 1.
Table 1 – Example of a dataset for a sentiment classification problem, where inputs are movie criticisms and outputs are the rating of those criticisms. Adapted from Ng [2018]

Criticism | Classification
Film with very good performance and excellent script | Positive
Mediocre production | Negative
Best movie of the year | Positive
That movie was the greatest | Positive
It lacked a good script, good actors, a good scenario and good special effects | Negative
Supposing the model in question is a well-trained classifier, it is straightforward to
see why it would correctly classify the first, third, and fourth samples as positive: U will
give high relevance to the words good, excellent, best, and greatest, in such a way that this
will be propagated in the forward pass via W until a positive classification is predicted. The
same is true for the second sample: the word mediocre will activate the cell and culminate
in a negative classification.
The last sample, though, is tricky: lacked has to activate the cell in such a way that
all the following occurrences of good do not reverse the classification. An exaggerated
version of that criticism sheds light on why this is difficult:

It lacked a good script, good actors, a good scenario, good special effects, a good
soundtrack, a good director, good producers, a good marketing campaign, good costumes
and good photography
As much as the word lacked activates the cell in a way that propagates a
negative representation along the time steps, there are so many occurrences of the
word good after it that their repeated contributions will, at some point, outshine the
negative contribution of lacked.
Long-term dependencies do not impose challenges only during prediction, but
also during training. Consider the gradient ∂L/∂W of the loss with respect to the hidden state
weight in Equation 3.5, and its component ∂h(t)/∂h(k) in Equation 3.6:

∂L/∂W = ∑_{t=1}^{T} ∑_{k=1}^{t} (∂L(t)/∂h(t)) (∂h(t)/∂h(k)) (∂⁺h(k)/∂W)    (3.5)

∂h(t)/∂h(k) = ∏_{i=k}^{t−1} Wᵀ diag(tanh′(x(i−1)))    (3.6)
Equation 3.6 contains products of tanh′(·), a function bounded in the
range (0, 1]. This means that for k ≪ t there will be many products of values in that
range, so ∂h(t)/∂h(k) → 0. Consequently, the contributions of time steps at the beginning
of the sequence (as is the case of lacked) will have almost no influence on the gradient
∂L/∂W of Equation 3.5; thus, any updates to W that would reinforce that words similar to
lacked are negative would not take place. This is a well-known phenomenon called vanishing
gradients, which was detailed in Pascanu et al. [2013].
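A quick numerical illustration of this effect (using an arbitrary fixed pre-activation of 1.0): each tanh′ factor in the product of Equation 3.6 is at most 1 and typically well below it, so the product decays geometrically with the gap t − k.

```python
import numpy as np

def tanh_prime(x):
    """Derivative of tanh: 1 - tanh(x)^2, bounded in (0, 1]."""
    return 1.0 - np.tanh(x) ** 2

# The factor each time step contributes to the product of Equation 3.6,
# evaluated at a fixed pre-activation of 1.0 (any nonzero value works).
factor = tanh_prime(1.0)                   # roughly 0.42
products = [factor ** gap for gap in (1, 5, 10, 20)]
# The gradient component shrinks geometrically as the gap t - k grows.
```

With a gap of 20 time steps, the product is already around 10⁻⁸, i.e., the early time step has essentially no influence on the update.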
3.2.2 Long Short-Term Memory Neural Network
Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] is a more
elaborate RNN architecture that is tailored to solve the problem of vanishing gradients.
An LSTM cell also receives as inputs both the current time step and the previous step’s
representation, but that representation is composed of two parts: the hidden state h(t),
and the cell state s(t). To better understand what they are and how they work, consider
an LSTM cell’s computations in Equation 3.7, as well as its graphical representation in
Figure 5.
f(t) = σ(Wf h(t−1) + Uf x(t) + bf)
i(t) = σ(Wi h(t−1) + Ui x(t) + bi)
o(t) = σ(Wo h(t−1) + Uo x(t) + bo)
s̃(t) = tanh(W h(t−1) + U x(t) + b)
s(t) = f(t) ⊙ s(t−1) + i(t) ⊙ s̃(t)
h(t) = o(t) ⊙ tanh(s(t))
ŷ(t) = softmax(V h(t) + c)
(3.7)
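A minimal NumPy sketch of one forward step following Equation 3.7 (weights are randomly initialized just to make the example runnable; the softmax output layer and all training logic are omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step following Equation 3.7 (softmax output omitted)."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x + p["bf"])     # forget gate
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x + p["bi"])     # input gate
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x + p["bo"])     # output gate
    s_tilde = np.tanh(p["W"] @ h_prev + p["U"] @ x + p["b"])  # candidate cell state
    s = f * s_prev + i * s_tilde                              # new cell state
    h = o * np.tanh(s)                                        # new hidden state
    return h, s

rng = np.random.default_rng(0)
n_h, n_x = 4, 3  # illustrative sizes
p = {k: rng.normal(size=(n_h, n_h)) for k in ("Wf", "Wi", "Wo", "W")}
p.update({k: rng.normal(size=(n_h, n_x)) for k in ("Uf", "Ui", "Uo", "U")})
p.update({k: np.zeros(n_h) for k in ("bf", "bi", "bo", "b")})
h, s = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), p)
```

Note how the new cell state s is a gated sum, not a product; this is the key to the gradient behavior discussed at the end of this section.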
In step 1 of Figure 5, the previous time step's memory cell s(t−1) carries
information from t − 1. Going back to the exemplified long criticism, let us first consider
the case where the word x(t−1) at that time step was lacked. In that case, this information
should pass. The component responsible for that is f(t), computed in step 2.

Note that f(t) is activated by a sigmoid function, which means that its elements
are bounded to the range (0, 1). Since it has its own parameters Wf, Uf and bf, it can
learn that h(t−1) brings relevant information, resulting in a vector f(t) composed mostly
of values close to 1 and yielding, in the element-wise product of step 3, a result close to s(t−1).
Mathematically, it means that f(t) ⊙ s(t−1) ≈ s(t−1) and that f(t) selectively did not forget
the content of s(t−1). Because of that, it is called the forget gate, and its selective memory
mechanism helps to solve the forward-pass problem of long-term sequences.
Now consider that the current input x(t) is a representation of an occurrence of
the word good, from that same long criticism example. In step 4, a temporary cell state s̃(t) is
computed, holding information related to the word good. In step 5, the input gate i(t) is
34 Chapter 3. Theoretical Background
𝞼 𝞼 tanh 𝞼
tanh
+
2
5
6
4
73
8
9
10
x(t)
h(t-1)
s(t-1)
f (t)
i (t) o(t)
s(t)
s (t)~
s(t)
h(t)
ŷ(t)
softmax
111
Figure 5 – An LSTM cell (adapted from Olah [2015])
computed, and its responsibility is to selectively read new information. This is possible
because it also has its own parameters Wi, Ui and bi and, by knowing that h(t−1) brings
relevant information and that x(t) somewhat contradicts it, the element-wise product in step 6
will not let much of this new information pass through. With the remembered (or not
forgotten) information of step 3 and the new information of step 6, the current time
step's cell state is computed in step 7.

Since the sum in step 7 may result in a vector whose elements go beyond
the range [−1, 1], and the hidden state h(t) should be limited to this range, in step 8 the cell
state is normalized with tanh to guarantee that, and then subjected to a last element-wise product
with the output gate, which selectively writes the new information to the final
hidden state in step 10. Step 11 is computed only if the cell is part of the outermost layer of the LSTM
neural net.
Summarizing, the gates f(t), i(t) and o(t) enable information to be selectively forgotten,
read and written to the cell's states s(t) and h(t), which solves the long-term dependency
problem during the forward pass. During the backward pass, the vanishing gradients problem is
also solved, especially due to the cell state's derivatives, which involve a summation instead
of a product, as is the case in Equation 3.6. These derivations are explained in detail
in Hochreiter and Schmidhuber [1997].
3.2.3 Gated Recurrent Unit Neural Network
Although convincing, many may argue that the intuitions behind the LSTM
architecture are arbitrary. In fact, there are many variants of the LSTM, all of which aim
to solve the vanishing gradients problem. One of the most famous variants is the Gated
Recurrent Unit (GRU) neural network [Chung et al., 2014].
The main difference between the two is that while LSTMs have a forget gate to
selectively block old information and an input gate to selectively let new information flow,
GRUs use a single gate to accomplish both: the same gate u(t) used for reading is
also used to forget, by multiplying the state in an element-wise manner by (1 − u(t)).
Also, the cell state and the hidden state are merged. With this design choice, GRUs have fewer
parameters to optimize during training, making them faster. The GRU cell's equations are listed
in Equation 3.8, while Figure 6 illustrates it.
u(t) = σ(Wu h(t−1) + Uu x(t))
r(t) = σ(Wr h(t−1) + Ur x(t))
h̃(t) = tanh(W (r(t) ⊙ h(t−1)) + U x(t))
h(t) = (1 − u(t)) ⊙ h(t−1) + u(t) ⊙ h̃(t)
ŷ(t) = softmax(V h(t) + c)
(3.8)
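As with the LSTM, one GRU step of Equation 3.8 can be sketched in NumPy (random weights for illustration only; the softmax output layer is omitted, and there are no bias terms, matching Equation 3.8):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU time step following Equation 3.8 (softmax output omitted)."""
    u = sigmoid(p["Wu"] @ h_prev + p["Uu"] @ x)             # update gate
    r = sigmoid(p["Wr"] @ h_prev + p["Ur"] @ x)             # reset gate
    h_tilde = np.tanh(p["W"] @ (r * h_prev) + p["U"] @ x)   # candidate state
    return (1.0 - u) * h_prev + u * h_tilde                 # one gate reads and forgets

rng = np.random.default_rng(1)
n_h, n_x = 4, 3  # illustrative sizes
p = {k: rng.normal(size=(n_h, n_h)) for k in ("Wu", "Wr", "W")}
p.update({k: rng.normal(size=(n_h, n_x)) for k in ("Uu", "Ur", "U")})
h = gru_step(rng.normal(size=n_x), np.zeros(n_h), p)
```

The single gate u(t) appearing in both terms of the last line is exactly the read/forget merge described above.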
Figure 6 – A GRU cell (adapted from Olah [2015])
Now that incremental learning and its foundations have been formalized, and the
intuitions behind RNNs and their variants have been shown, the ideas proposed to apply them
to streaming scenarios are presented in the next chapter.
Chapter 4
Incremental Sequence Models
Chapter 2 showed that most solutions proposed to deal with incremental learning
problems rely on enhanced versions of classic batch supervised learning models, with
complex mechanisms to comply with the requirements of evolving data streams. On the other
hand, Chapter 3 suggested that RNNs have inherent properties, such as the modeling of time
dependencies and selective forgetting, that make them naturally attractive for
streaming scenarios.

Accordingly, this chapter brings our ideas on how to adapt RNNs to deal with
evolving data streams and proposes the Incremental Sequence Models (ISMs), which are
simple adaptations of RNNs for incremental learning problems.
4.1 Sub-batching
The incremental learning frameworks mentioned in Chapter 2 have incremental
learning model implementations with APIs that expect batches of 2D datasets, in which an
n × m batch is composed of the number of instances (n) and the number of
features (m). However, as discussed in Chapter 3, RNNs expect 3D datasets of shape n × t × m,
where the new dimension t represents the number of time steps. That said, in order to
adapt an RNN-based algorithm to scenarios that expect 2D datasets, batches of samples
must be transformed into batches of sequences.
There are two simple dimensional adjustments that would easily provide the desired
shape. The first of them would be to consider each sample of the 2D dataset a
sequence of one time step. Its dimension would change from N × M to N × 1 × M (n = N,
t = 1 and m = M). Although it would work, training a sequence model on sequences of
length 1 does not make much sense: its main strengths of modeling time dependencies and
selectively forgetting, reading, and writing would be lost.
The second approach would be to consider the whole batch a single sequence,
resulting in a dimension of 1 × N × M (n = 1, t = N and m = M). In contrast to the first
approach, this one leverages the power of RNNs, but training a supervised learning model
with a single instance would lead to overfitting.
A better approach, named sub-batching, was used. It is composed of two parts:
an input transformation, which turns a 2D batch of samples into a 3D batch of sequences, and an
output transformation, which converts a 3D batch of predictions back into 2D, enabling the use
of RNNs that comply with the 2D dataset APIs of popular incremental learning frameworks.
4.1.1 Input transformation
To transform a 2D batch of samples into a 3D batch of sequences, the samples
are sorted in ascending order by their time of arrival. Then, the batch is broken down into
multiple smaller batches (or sub-batches) in a sliding-window manner. Figure 7 illustrates
this procedure. In blue, we have the current batch, of length 10, which is split into
sub-batches of length 5 (in red), each of which is presented to the model as a sample
sequence. Hence, according to this example, a 10 × M batch of ten samples becomes a
6 × 5 × M batch of six sample sub-batches, where M is the number of features. The batch
and sub-batch sizes (10 and 5, respectively) were chosen merely as an example, since they are
parameters of the ISM models.
Figure 7 – Data stream sub-batching
To better understand the sub-batching procedure, let us formalize it mathematically.
The input of this procedure is a batch of N samples, represented by the N × M matrix X
in Equation 4.1, where xn = [xn,0 xn,1 · · · xn,M−1] is the nth sample, a
feature vector of size M.

X = [x0 x1 · · · xN−1]ᵀ    (4.1)
The desired output consists of the sub-batches of X, which are the matrices stacked, along
the first dimension, into the B × T × M 3D tensor S in Equation 4.2.

S = [S0 S1 · · · SB−1]    (4.2)

where the number of sub-batches B is given by B = N − T + 1 and T is the sub-batch size.
Sub-batches S0, S1, S2 and SB−1 are represented in Equation 4.3.

S0 = [x0 x1 x2 · · · xT−1]ᵀ,  S1 = [x1 x2 x3 · · · xT]ᵀ,  S2 = [x2 x3 x4 · · · xT+1]ᵀ,  · · · ,  SB−1 = [xN−T xN−T+1 xN−T+2 · · · xN−1]ᵀ    (4.3)
Another way to represent S is by Equation 4.4, where each row is one sub-batch.

S =
| x0     x1      x2      · · ·  xT−1 |
| x1     x2      x3      · · ·  xT   |
| x2     x3      x4      · · ·  xT+1 |
| ⋮      ⋮       ⋮       ⋱     ⋮    |
| xN−T   xN−T+1  xN−T+2  · · ·  xN−1 |
(4.4)
Sub-batching was implemented using several vectorization features from Python's
NumPy1 package and was inspired by Azman [2020]. The first feature is index slicing: given
the index matrix I in Equation 4.5 of the output tensor S, one can simply obtain
the desired result by computing S = X[I].

1 https://numpy.org
I =
| 0        1          2          · · ·  T − 1 |
| 1        2          3          · · ·  T     |
| 2        3          4          · · ·  T + 1 |
| ⋮        ⋮          ⋮          ⋱     ⋮     |
| N − T    N − T + 1  N − T + 2  · · ·  N − 1 |
(4.5)
The index matrix I can be obtained by Equation 4.6.

I =
| 0      0      0      · · ·  0     |   | 0  1  2  · · ·  T − 1 |
| 1      1      1      · · ·  1     |   | 0  1  2  · · ·  T − 1 |
| 2      2      2      · · ·  2     | + | 0  1  2  · · ·  T − 1 |
| ⋮      ⋮      ⋮      ⋱     ⋮     |   | ⋮  ⋮  ⋮  ⋱     ⋮    |
| N − T  N − T  N − T  · · ·  N − T |   | 0  1  2  · · ·  T − 1 |
(4.6)
It is easy to notice that the columns of the first term of Equation 4.6 are all equal
to [0 1 2 · · · (N − T)]ᵀ, while the rows of the second term are all equal to [0 1 2 · · · (T − 1)]. Because of
that, another NumPy vectorization feature, called broadcasting, was used to compute I. It is a
native feature of NumPy's main data structure, the ndarray: during an element-wise
operation between arrays with diverging dimensions, the arrays' columns or rows are repeated in
order to obtain arrays with the same dimensions in both terms of the operation. That said,
a broadcast version of Equation 4.6 was used and is presented in Equation 4.7.

I = [0 1 2 · · · (N − T)]ᵀ + [0 1 2 · · · (T − 1)]    (4.7)
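The whole input transformation can then be sketched with the broadcast index matrix of Equation 4.7 (this mirrors the described NumPy approach, not necessarily the exact pyISM implementation):

```python
import numpy as np

def sub_batch(X, T):
    """Turn an (N, M) batch into an (N - T + 1, T, M) batch of sliding-window sequences."""
    N = X.shape[0]
    # Equation 4.7: a column vector plus a row vector broadcasts to the (B, T) index matrix I.
    I = np.arange(N - T + 1)[:, None] + np.arange(T)[None, :]
    return X[I]  # Equation: S = X[I]

X = np.arange(10 * 2).reshape(10, 2)  # N=10 samples, M=2 features
S = sub_batch(X, T=5)                 # B=6 sub-batches of length 5, as in Figure 7
```

The shapes match the running example: a 10 × M batch becomes a 6 × 5 × M tensor.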
4.1.2 Output transformation
The same transformation procedure used to transform X into S is also applied
to X's associated labels Y, represented by Equation 4.8.

Y = [y0 y1 · · · yN−1]ᵀ    (4.8)
By applying sub-batching to Y, Ys is obtained, as represented by Equation 4.9.

Ys =
| y0     y1      y2      · · ·  yT−1 |
| y1     y2      y3      · · ·  yT   |
| y2     y3      y4      · · ·  yT+1 |
| ⋮      ⋮       ⋮       ⋱     ⋮    |
| yN−T   yN−T+1  yN−T+2  · · ·  yN−1 |
(4.9)

Like S, Ys is also a 3D tensor, of dimension B × T × C, where the first two
dimensions B and T are the number of sub-batches and the sequence length (the same as in S), and C is
the dimension of each single output yn, which can be, for instance, the number of classes of a multiclass
classification dataset.
Although sequence models like RNNs produce outputs with the 3D shape of Ys,
this is not the case for most consumers of incremental learning models. Instead, they
expect 2D batches of predictions of shape N × C. That said, Ys (Equation 4.10) has to
be transformed into Y (Equation 4.11).

Ys =
| y0,0      y1,0        y2,0        · · ·  yT−1,0   |
| y1,1      y2,1        y3,1        · · ·  yT,1     |
| y2,2      y3,2        y4,2        · · ·  yT+1,2   |
| ⋮         ⋮           ⋮           ⋱     ⋮        |
| yN−T,B−1  yN−T+1,B−1  yN−T+2,B−1  · · ·  yN−1,B−1 |
(4.10)
Y = [y0 y1 · · · yN−1]ᵀ    (4.11)
where each vector yn,b of Ys is the prediction of input sample xn in the bth sub-batch.

This notation was used to make explicit that although both y1,1 and y1,0 are
predictions of the input x1, they are different. So, the challenge here is to obtain an
aggregated version yn, composed of all the yn,b. All predictions of the same
sample were collected by taking the elements of each diagonal that starts at the
bottom-left and ends at the top-right. Figure 8 pictures this procedure for N = 10 and
B = 6.
y0,0 y1,0 y2,0 y3,0 y4,0
y1,1 y2,1 y3,1 y4,1 y5,1
y2,2 y3,2 y4,2 y5,2 y6,2
y3,3 y4,3 y5,3 y6,3 y7,3
y4,4 y5,4 y6,4 y7,4 y8,4
y5,5 y6,5 y7,5 y8,5 y9,5
Figure 8 – Collection of predictions of the same sample
After collecting all predictions of the same sample, they are aggregated and Y
is obtained. The aggregation function chosen was the arithmetic mean, but other functions
could be used, such as an exponentially weighted average.
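This diagonal collection and averaging can be sketched as follows (scalar predictions for simplicity; the sample matrix is constructed so every anti-diagonal is constant, making the expected aggregate obvious):

```python
import numpy as np

def aggregate_predictions(Ys):
    """Average all predictions of the same sample from a (B, T) prediction matrix.

    Ys[b, j] is the prediction of sample x_{b+j}, so all predictions of sample n
    lie on the anti-diagonal b + j = n (bottom-left to top-right, as in Figure 8).
    """
    B, T = Ys.shape
    N = B + T - 1
    return np.array([
        np.mean([Ys[b, n - b] for b in range(B) if 0 <= n - b < T])
        for n in range(N)
    ])

# Ys[b, j] encodes the sample index b + j, so each anti-diagonal is constant
# and the aggregate must recover [0, 1, ..., 9].
B, T = 6, 5
Ys = (np.arange(B)[:, None] + np.arange(T)[None, :]).astype(float)
y = aggregate_predictions(Ys)
```

Replacing `np.mean` with a weighted average yields the exponentially weighted variant mentioned above.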
4.2 Statefulness
In Chapter 3, the power of RNNs in modeling sequences was highlighted. Among
the main components of their architectures that enable information to be carried through
their cells are their states: the hidden state h(t) and, for LSTMs, the memory cell s(t). It is
important to note, though, that those states are reset for every sequence presented to the model.
In other words, the components within a sequence are non-i.i.d., but the sequences themselves are i.i.d.
This may be acceptable for applications like the exemplified sentiment classifier, where
each movie criticism is in fact independent of the others, but for evolving data streams,
this may not be the case.
The idea here is to think of a data stream as a sequence of infinite length for which
it is always relevant to have a context of the past. This way, a concept drift
can be detected automatically, without the need for complex drift detection mechanisms
like those mentioned in Chapter 2.

To accomplish that, the state of the last sample xN−1 of each batch is stored and
replicated for every sub-batch of the next batch. By doing so, the following problems
are solved:
1. No need to keep a history of past samples that may violate the system’s memory
constraints, since only the RNN cell’s weights and the last values (meaning, at time
t) of the cell states should be kept in memory;
2. No need to implement a specific mechanism to detect a concept drift, since they
should be captured automatically by the memory cell.
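The state hand-off between consecutive batches can be illustrated with a toy stand-in for the RNN (the decayed running sum below is not a real recurrent cell; it is just enough to show the mechanics of storing the last state and replaying it for the next batch's sub-batches):

```python
def run_batch(sub_batches, h0):
    """Process a batch's sub-batches, each starting from the carried-over state h0.

    The 'RNN' here is a stand-in: its state is a decayed running sum, which is
    enough to show the state hand-off mechanics between consecutive batches.
    """
    h = h0
    for seq in sub_batches:
        h = h0                     # every sub-batch is replayed from the carried state
        for x in seq:
            h = 0.9 * h + 0.1 * x  # toy recurrent state update
    return h                       # state after the batch's last sample

batch1 = [[1.0, 2.0], [2.0, 3.0]]    # two sub-batches of two samples each
state = run_batch(batch1, h0=0.0)    # the state of the last sample is stored...
batch2 = [[3.0, 4.0]]
state = run_batch(batch2, h0=state)  # ...and replicated for the next batch's sub-batches
```

Only the scalar `state` survives between batches, mirroring item 1 above: no history of past samples is kept.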
The ideas proposed in this chapter could be extended to distinct supervised
Machine Learning problems, such as regression, classification and forecasting.
Chapter 5
Experimental Methodology
This chapter will present the Incremental Sequence Model variants that were
created, the state-of-the-art methods used for comparison, and their respective parameterization.
The procedure used to obtain the performance evaluation metrics will also be detailed,
as well as the datasets used in the experiments.
5.1 Methods
To evaluate the effectiveness of the proposed algorithm in evolving data stream
scenarios, experiments were conducted using six variants of the Incremental Sequence
Model. Three RNN cell types were used: vanilla RNN, LSTM and GRU, each of them
with two settings: with and without statefulness. The model names were prefixed with I to
denote they are Incremental versions of base models (e.g. IRNN). Also, for the variants
with the statefulness feature activated, their names were suffixed with -ST (e.g. IRNN-ST).
In summary, the six setups used were:
• IRNN-ST: ISM with vanilla RNN cells and statefulness turned on
• IRNN: ISM with vanilla RNN cells and statefulness turned off
• ILSTM-ST: ISM with LSTM cells and statefulness turned on
• ILSTM: ISM with LSTM cells and statefulness turned off
• IGRU-ST: ISM with GRU cells and statefulness turned on
• IGRU: ISM with GRU cells and statefulness turned off
The ISMs were compared with well-known methods from the literature: Oza Bagging
ADWIN [Oza, 2005], Adaptive Random Forest [Gomes et al., 2017b] and SAM-kNN
[Losing et al., 2016], using the Prequential Evaluation [Gama et al., 2009] described
in Section 3.1. All ISM variants had the same hyperparameter settings in all experiments:
• 3 RNN layers, with 150, 100 and 50 units respectively, all of them using the hyperbolic
tangent activation function
• Adam was chosen as the optimizer due to its consistent performance on different
problems [Schmidt et al., 2021]. It was parameterized with TensorFlow’s default
argument values: η = 0.001, β1 = 0.9, β2 = 0.999 and epsilon = 1e−7
• the loss function was set to binary cross-entropy
The other incremental learning methods were configured with their default hyper-
parameters as suggested in the Scikit-multiflow framework1.
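The prequential (test-then-train) protocol of Gama et al. [2009] can be sketched as below; the `MajorityClass` model is a tiny made-up stand-in with a scikit-multiflow-style `predict`/`partial_fit` interface, not one of the evaluated methods:

```python
import numpy as np

def prequential(model, batches):
    """Test-then-train: each batch is first used for evaluation, then for training."""
    accuracies = []
    for X, y in batches:
        y_pred = model.predict(X)                # test on the unseen batch first
        accuracies.append(np.mean(y_pred == y))  # prequential batch accuracy
        model.partial_fit(X, y)                  # only then train on it
    return accuracies

class MajorityClass:
    """Illustrative stand-in: always predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, X):
        label = max(self.counts, key=self.counts.get) if self.counts else 0
        return np.full(len(X), label)
    def partial_fit(self, X, y):
        for label in y:
            self.counts[label] = self.counts.get(label, 0) + 1

batches = [(np.zeros((4, 2)), np.array([1, 1, 1, 0])),
           (np.zeros((4, 2)), np.array([1, 1, 0, 1]))]
accs = prequential(MajorityClass(), batches)
```

Every method in the experiments, ISM or comparison, is evaluated inside a loop of this shape.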
5.2 Performance Evaluation
To increase statistical significance, all experiments were repeated multiple times
and their results were evaluated in two distinct formats:

1. Accuracy evolution plots, to graphically show how well the methods perform during
the whole stream, especially when concept drifts occur

2. Overall accuracy, to summarize a method's performance on a given dataset
Both evaluation formats are based on the batch accuracy vector a^{m,d}_r represented
in Equation 5.1.

a^{m,d}_r = [ a^{m,d}_{r,0}  a^{m,d}_{r,1}  · · ·  a^{m,d}_{r,B−1} ]    (5.1)

where a^{m,d}_{r,b} is the accuracy on the bth batch of the rth experiment run for the mth model
on the dth dataset.
Since the evaluations are specific to a model and dataset pair, all of their batch
accuracies can be represented by the matrix A^{m,d} in Equation 5.2.

A^{m,d} =
| a^{m,d}_{0,0}    a^{m,d}_{0,1}    · · ·  a^{m,d}_{0,B−1}   |
| a^{m,d}_{1,0}    a^{m,d}_{1,1}    · · ·  a^{m,d}_{1,B−1}   |
| ⋮                ⋮                ⋱     ⋮                 |
| a^{m,d}_{R−1,0}  a^{m,d}_{R−1,1}  · · ·  a^{m,d}_{R−1,B−1} |
(5.2)
1 Available online on: scikit-multiflow.github.io
With Equation 5.2, the average a^{m,d}_avg and standard deviation a^{m,d}_std of each batch can be
calculated, as seen in Equations 5.3 and 5.4, respectively.

a^{m,d}_avg = [ (1/R) ∑_{r=0}^{R−1} a^{m,d}_{r,0}   (1/R) ∑_{r=0}^{R−1} a^{m,d}_{r,1}   · · ·   (1/R) ∑_{r=0}^{R−1} a^{m,d}_{r,B−1} ]    (5.3)

a^{m,d}_std = [ √( (1/R) ∑_{r=0}^{R−1} (a^{m,d}_{r,0} − a^{m,d}_{avg,0})² )   · · ·   √( (1/R) ∑_{r=0}^{R−1} (a^{m,d}_{r,B−1} − a^{m,d}_{avg,B−1})² ) ]    (5.4)
From Equations 5.3 and 5.4, the accuracy evolution vector a^{m,d} can be obtained,
as shown in Equation 5.5, and the accuracy evolution plot can be drawn.

a^{m,d} = a^{m,d}_avg ± a^{m,d}_std    (5.5)

The overall accuracy a^{m,d}_overall also comes from Equations 5.3 and 5.4, as can be
seen in Equation 5.6.

a^{m,d}_overall = avg(a^{m,d}_avg) ± avg(a^{m,d}_std)    (5.6)

where avg(a) is a simplified notation for (1/B) ∑_{b=0}^{B−1} a_b.
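Equations 5.3, 5.4 and 5.6 map directly to NumPy reductions over the runs axis (the accuracy values below are invented for illustration; note that `np.std` with its default `ddof=0` matches the division by R in Equation 5.4):

```python
import numpy as np

# Batch accuracies for one (model, dataset) pair: R=3 runs, B=4 batches.
A = np.array([
    [0.80, 0.85, 0.90, 0.95],
    [0.78, 0.86, 0.88, 0.96],
    [0.82, 0.84, 0.92, 0.94],
])

a_avg = A.mean(axis=0)                   # Equation 5.3: per-batch mean over runs
a_std = A.std(axis=0)                    # Equation 5.4: per-batch std over runs (ddof=0)
overall = (a_avg.mean(), a_std.mean())   # Equation 5.6: overall accuracy pair
```

The `a_avg ± a_std` pair drives the accuracy evolution plots, while `overall` fills one cell of Table 2.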
5.3 Datasets
Experiments were performed on four synthetic and four real datasets. Stream
generators2 were used to create the synthetic datasets, and the real datasets are available
at a GitHub repository3.
• Chess: consists of the real game records of one player from December 2007
to March 2010. Since a chess player learns from their previous games, the dataset is
considered not independent and identically distributed (non-i.i.d.). This dataset
has 8 attributes, 2 classes and 503 instances, which were divided into a first batch of
153 samples and subsequent batches of 50 samples. The sub-batch size was set to 30
samples for ILSTM.
2 Available at: scikit-multiflow.github.io
3 Available at: github.com/ogozuacik/concept-drift-datasets-scikit-multiflow
• Electricity: is formed from real data collected in the Australian electricity market.
In this market, prices fluctuate according to demand, and data is sampled every
30 minutes. Since electricity demand is seasonal, this is a non-i.i.d. dataset, and this
characteristic was taken into account in the experimental setup: a sub-batch size of 48
samples (one day's worth of data) and batches of 336 samples (one week of data), except
for the first batch of 960 samples, which adds up to all 45312 instances of the
dataset, which has 8 dimensions and 2 classes.
• Weather: is formed by 50 years of real daily measurements of several meteorological
variables, which characterizes this dataset as non-i.i.d. It has 2 classes, 8 attributes
and 18159 instances, divided into a first batch of 1369 samples and subsequent batches
of 730 samples (two years of data). For ILSTM, the sub-batch size was set to 365
samples (one year of data).
• Sine 1: is a synthetic and i.i.d. dataset with a total of 10000 instances, 2 dimensions,
two different concepts with 5000 instances each and an abrupt change of concept.
In the first concept, all points below the curve y = sin(x) are classified as positive;
after the drift, the classification is reversed. The experimental setup was: 300
samples for the initial batch, 100 samples for the subsequent ones and 60 samples
for ILSTM's sub-batches.
• Sine 2: is a synthetic and i.i.d. dataset with a total of 10000 instances of 4 dimensions,
two of which represent only noise; it also has two different concepts with
5000 instances each and a gradual change. The gradual change of concept was carried
out by a sigmoid function f(t) = 1/(1 + e^(−4(t−p)/w)), where p = 5000 is the position
at which the change happens and w = 500 is the width of the transition. For this
dataset, we followed the same experimental setup as for Sine 1.
• SEA: is a synthetic and i.i.d. dataset with 60000 instances, 3 dimensions and 2
classes. The dataset presents four different concepts with 15000 instances each; 10%
of the data is noise. The experimental setup was the same as for the Sine 1 dataset.
• Stagger: is a synthetic and i.i.d. dataset with 30000 instances, 3 attributes, and two
abrupt drifts taking place in the transition regions between three different concepts with
10000 instances each. It has boolean values, generated by three different functions,
where each concept represents one of these functions. The experimental setup was
the same as for the Sine 1 dataset.
• Spam: is a real dataset containing a collection of legitimate (ham) and spam messages,
with 6213 instances and 499 attributes. Since the messages are not correlated, this is
an i.i.d. dataset. The experimental setup was: 313 samples for the initial batch, 100
samples for the subsequent ones, and 60 samples for ILSTM's sub-batches.
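As an illustration, a Sine-1-style stream with an abrupt drift can be generated as follows. The uniform [0, 1] feature range is an assumption of this sketch; the experiments themselves used scikit-multiflow's stream generators:

```python
import numpy as np

def sine1_stream(n=10_000, drift_at=5_000, seed=0):
    """Sine-1-style stream: label = point below y = sin(x); reversed after the drift."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))        # assumed feature range
    y = (X[:, 1] < np.sin(X[:, 0])).astype(int)   # first concept
    y[drift_at:] = 1 - y[drift_at:]               # abrupt concept reversal
    return X, y

X, y = sine1_stream()
```

Replacing the hard switch at `drift_at` with the sigmoid weighting f(t) described for Sine 2 would yield a gradual drift instead.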
Chapter 6
Results
This chapter will present the results of the experimental setup detailed in Chapter 5.
First, the overall performance will be shown in Section 6.1, and then a more thorough
analysis will be made for the following three groups of datasets:

• non-i.i.d. and real datasets, composed of Chess, Electricity and Weather, in Section 6.2

• i.i.d. and synthetic datasets, composed of Sine 1, Sine 2, SEA and Stagger, in Section 6.3

• i.i.d. and real datasets, composed of Spam, in Section 6.4
6.1 Overall performance analysis
The overall performance for all experiments can be seen in Table 2. Each row
represents the results of all methods for a single dataset, and the one with the highest
average accuracy is highlighted in bold red. Columns show the results of a single method
across all datasets. A cell in this table represents the overall accuracy (as defined in Equation
5.6) of a method over the 5 experiment runs on the dataset.
Table 2 – Overall performance achieved by all tested methods
Dataset     | IRNN          | IRNN-ST       | IGRU          | IGRU-ST       | ILSTM         | ILSTM-ST      | ARF           | SAM-KNN       | OB-ADWIN
Chess       | 77.27 ± 4.32  | 75.00 ± 3.95  | 78.37 ± 4.39  | 77.70 ± 3.43  | 79.67 ± 3.86  | 78.37 ± 3.73  | 77.67 ± 3.06  | 74.50 ± 2.86  | 76.67 ± 4.05
Electricity | 76.14 ± 9.73  | 75.47 ± 9.90  | 78.76 ± 10.06 | 81.50 ± 9.19  | 75.10 ± 11.22 | 73.33 ± 12.19 | 77.97 ± 10.38 | 64.75 ± 9.26  | 69.58 ± 7.77
Weather     | 73.99 ± 3.44  | 73.99 ± 3.92  | 75.79 ± 3.47  | 76.25 ± 3.14  | 75.68 ± 3.27  | 75.73 ± 3.33  | 73.90 ± 3.82  | 73.43 ± 5.60  | 75.32 ± 3.45
Sine 1      | 98.50 ± 7.84  | 96.11 ± 11.57 | 97.67 ± 9.61  | 74.69 ± 25.15 | 98.35 ± 8.07  | 97.31 ± 9.00  | 96.97 ± 8.27  | 96.80 ± 13.72 | 92.90 ± 20.71
Sine 2      | 94.36 ± 9.11  | 92.95 ± 8.79  | 94.20 ± 8.71  | 94.56 ± 8.90  | 94.57 ± 9.19  | 94.49 ± 8.51  | 89.44 ± 10.05 | 86.01 ± 17.52 | 86.63 ± 13.70
SEA         | 83.68 ± 3.91  | 82.83 ± 4.52  | 84.68 ± 3.43  | 84.09 ± 3.46  | 84.45 ± 3.88  | 84.14 ± 4.09  | 87.39 ± 3.61  | 87.16 ± 2.74  | 86.44 ± 2.69
Stagger     | 61.90 ± 28.23 | 72.44 ± 29.27 | 82.89 ± 23.89 | 85.14 ± 23.99 | 98.72 ± 5.56  | 98.58 ± 6.07  | 99.64 ± 3.13  | 99.08 ± 5.90  | 97.98 ± 9.14
Spam        | 88.88 ± 14.25 | 88.86 ± 14.24 | 91.19 ± 16.59 | 89.83 ± 19.08 | 87.96 ± 17.00 | 86.78 ± 17.89 | 90.15 ± 10.28 | 91.42 ± 8.26  | 89.89 ± 9.13
The proposed ISM methods achieved the highest average accuracy for the non-i.i.d.
and real datasets. For the i.i.d. and synthetic ones, they achieved the highest average
accuracy for the Sine 1 and Sine 2 datasets, whereas for the remaining ones (SEA and
Stagger), the state-of-the-art methods used for comparison achieved the best results.
The comparison methods also obtained the highest average accuracy for the i.i.d. and real
dataset (Spam). Still, the ISM models obtained comparable results.
In the following three sections, a more in-depth analysis of the experimental results
will be made. Instead of an aggregate view (as presented in Table 2), accuracy-evolution-over-time
plots will be used, representing the accuracy evolution vectors from Equation
5.5. In such plots, a thick gray line represents the average accuracy for each training
batch (across all experiments), the lighter gray band represents the standard deviation and,
for the synthetic datasets, dashed red lines mark the timestamps where a concept drift
occurred; there is no drift information for the real datasets.
6.2 Non-i.i.d. and real datasets
For the non-i.i.d. and real datasets, the ISM methods obtained the best overall
accuracy average results. The accuracy over time plots for the Chess, Electricity and
Weather datasets can be seen in Figures 9, 10 and 11, respectively.
6.3 i.i.d. and synthetic datasets
The proposed model also ranked among the top results for two of the i.i.d. synthetic
datasets, Sine 1 and Sine 2; their accuracy evolution plots can be seen in
Figures 12 and 13, respectively.
Since those are synthetic datasets, their concept drift time indexes are known
and are highlighted in red in the aforementioned plots. It is evident that, for the proposed
methods, the accuracy drawdown that usually occurs after a drift tends to be smaller
than for the other methods.
For the SEA dataset, although the results of the ISMs were very close, the
comparison methods obtained the top results. As Figure 14 shows, the quicker accuracy
recovery of the ISM methods was also present in these experiments.
The Stagger dataset was another on which the comparison methods obtained the
best results. Figure 15 shows that ILSTM and ILSTM-ST had comparable results, while the
other ISM variants performed very poorly.
6.4 i.i.d. and real datasets
For the Spam dataset, the only dataset in this category, the state-of-the-art models
obtained the best average results. Figure 16 shows its accuracy evolution plot, where it is
Figure 9 – Chess dataset
evident that the ISM methods had higher variance than the others.
Figure 10 – Electricity dataset
Figure 11 – Weather dataset
Figure 12 – Sine 1 dataset
Figure 13 – Sine 2 dataset
Figure 14 – SEA dataset
Figure 15 – Stagger dataset
Figure 16 – Spam dataset
Chapter 7
Conclusion
The proposed method achieved good results, with simplicity as its main strength.
In contrast to other incremental learning algorithms, it does not need a complex memory
structure, nor explicit mechanisms for drift detection or forgetting. Instead, it uses the
RNN's intrinsic ability to deal with the stability-plasticity dilemma, which I believe is
the main contribution of this work.
The successful results of this work relied on other contributions too. The first of
them was the idea of considering data streams as infinite non-i.i.d. sequences and feeding
them to RNNs. It leveraged the power of RNNs to track time dependencies and the built-in
online learning characteristics of neural networks. Although it may be a strong assumption
to consider all data streams to be non-i.i.d., it is reasonable for most real-world problems
involving streaming data.
The second important contribution was the sub-batching mechanism, a sliding-window
form of data augmentation that makes it possible to generate a batch of sequences
out of a batch of samples. It proved effective in practice, and a vectorized
implementation was demonstrated.
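As an illustration, the sliding-window idea behind sub-batching can be vectorized in NumPy roughly as follows. This is a minimal sketch with hypothetical names, not the pyISM implementation:

```python
import numpy as np

def sub_batch(batch, window):
    """Slide a window of length `window` over a batch of samples to build
    overlapping sequences, without a Python loop.
    batch: array of shape (n_samples, n_features)
    returns: array of shape (n_samples - window + 1, window, n_features)
    """
    n = batch.shape[0]
    # (1, window) + (n - window + 1, 1) broadcasts to an index matrix
    # whose row i is [i, i+1, ..., i+window-1]
    idx = np.arange(window)[None, :] + np.arange(n - window + 1)[:, None]
    return batch[idx]

X = np.arange(12, dtype=float).reshape(6, 2)  # a batch of 6 samples, 2 features
seqs = sub_batch(X, window=3)                 # 4 overlapping sequences of length 3
```

With a window of length w, a batch of n samples thus yields n − w + 1 overlapping sequences, each suitable as one RNN input.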
The third relevant contribution was statefulness, a way of keeping the internal
state of the RNNs updated with the most recent context of the sequence. This feature
obtained good results, but it is important to note that the experiments with it disabled
also performed very well. It is therefore not a silver bullet, and a deeper study could be
conducted to understand this characteristic better.
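To make the idea concrete, the sketch below contrasts stateful and stateless processing with a toy Elman cell in plain NumPy, using untrained random weights. It is a simplified illustration of carrying state across batch boundaries, not the actual ISM models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 2, 4
Wx = rng.normal(size=(n_in, n_hidden)) * 0.1      # input-to-hidden weights
Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1  # hidden-to-hidden weights

def rnn_forward(x_seq, h0):
    """Run a minimal Elman cell over one sequence, starting from state h0."""
    h = h0
    for x in x_seq:
        h = np.tanh(x @ Wx + h @ Wh)
    return h

batch_1 = rng.normal(size=(5, n_in))
batch_2 = rng.normal(size=(5, n_in))

# Stateful: the final state of batch_1 seeds batch_2, so the context of the
# sequence is preserved across batch boundaries.
state = rnn_forward(batch_1, np.zeros(n_hidden))
stateful = rnn_forward(batch_2, state)

# Stateless: the state is reset to zeros for every batch, discarding context.
stateless = rnn_forward(batch_2, np.zeros(n_hidden))
```

The two final states differ, showing that the carried context does influence the predictions made on later batches.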
The fourth important contribution was the implementation and open sourcing of
the pyISM1 Python package, containing a family of ISM models with a scikit-multiflow
API. It is based on TensorFlow and is extensible to any type of RNN layer implemented
with the Keras API. The implementation followed industry-level software engineering
practices such as composition, extensibility, and test coverage.
1 https://github.com/alvarolemos/pyism
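A prequential (test-then-train) loop over a stream, in the scikit-multiflow style that the package follows, might look like the sketch below. `MajorityClass` is a hypothetical stand-in estimator used only to keep the example self-contained; it is not part of pyISM:

```python
class MajorityClass:
    """Stand-in estimator exposing a scikit-multiflow-style API
    (partial_fit / predict), as the pyISM models do."""
    def __init__(self):
        self.counts = {}

    def partial_fit(self, X, y):
        for label in y:
            self.counts[label] = self.counts.get(label, 0) + 1
        return self

    def predict(self, X):
        if not self.counts:
            return [0] * len(X)          # default before any training
        top = max(self.counts, key=self.counts.get)
        return [top] * len(X)

# Prequential evaluation over a toy stream: each sample is first used to
# test the current model, then immediately used to train it.
model = MajorityClass()
stream = [([0.1], 0), ([0.9], 1), ([0.8], 1), ([0.2], 0)]
hits = 0
for x, y in stream:
    hits += int(model.predict([x])[0] == y)  # test first...
    model.partial_fit([x], [y])              # ...then train
accuracy = hits / len(stream)
```

The same loop shape applies to any estimator with this interface, which is what makes the scikit-multiflow API convenient for stream-learning experiments.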
The last important contributions were the presentation of an early version of this
work at the XXIII Brazilian Conference on Automation, as well as the invitation to, and
subsequent acceptance of, an extended version of that same work in a special issue of the
Journal of Control, Automation and Electrical Systems.
Although good results were achieved, the proposed methods have limitations: as
suspected, their applicability is limited to non-i.i.d. problems. Also, although not
measured during the experiments, the proposed method is built on complex models, so
computational cost constraints may exist.
Finally, the author believes that there is room for the following future work:
• Go beyond RNNs, since they are not the only existing sequence models. A starting
point would be the Temporal Convolutional Network (TCN) [Bai et al., 2018], which
has shown competitive results and promises to be faster than RNNs.
• Test the proposed method in different learning tasks, such as regression and forecasting.
• Conduct a deeper analysis of the results using statistical tests, since there were
overlaps in the overall accuracy results, as seen in Table 2.
• Re-run the experiments while measuring resource-consumption metrics, such as
memory usage and elapsed time.
• Leverage different approaches to present the results, like drawdown charts.
References
M. Awad and R. Khanna. Support vector machines for classification. In Efficient Learning
Machines, pages 39–66. Springer, 2015.
S. K. Azman. Fast and robust sliding window vectorization with numpy, Jun 2020. URL
https://towardsdatascience.com/fast-and-robust-sliding-window-vectorization-with-numpy-3ad950ed62f5.
S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and
recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
A. Bifet and R. Gavalda. Learning from time-changing data with adaptive windowing. In
Proceedings of the 2007 SIAM international conference on data mining, pages 443–448.
SIAM, 2007.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of
the 22nd ACM SIGKDD international conference on knowledge discovery and data mining,
pages 785–794, 2016.
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent
neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297,
1995.
A. P. Dawid. Present position and potential developments: Some personal views: Statistical
theory, the prequential approach. Journal of the Royal Statistical Society: Series A
(General), 147(2):278–290, 1984.
G. De Francisci Morales, A. Bifet, L. Khan, J. Gama, and W. Fan. Iot big data stream
mining. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge
discovery and data mining, pages 2119–2120, 2016.
G. Ditzler, M. Roveri, C. Alippi, and R. Polikar. Learning in nonstationary environments:
A survey. IEEE Computational Intelligence Magazine, 10(4):12–25, 2015.
J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
J. Gama, R. Sebastiao, and P. P. Rodrigues. Issues in evaluation of stream learning
algorithms. In Proceedings of the 15th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 329–338, 2009.
J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept
drift adaptation. ACM computing surveys (CSUR), 46(4):1–37, 2014.
A. Gepperth and B. Hammer. Incremental learning algorithms and applications. 2016.
H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet. A survey on ensemble learning
for data stream classification. ACM Computing Surveys (CSUR), 50(2):1–36, 2017a.
H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes,
and T. Abdessalem. Adaptive random forests for evolving data stream classification.
Machine Learning, 106(9-10):1469–1495, 2017b.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
S. C. Hoi, D. Sahoo, J. Lu, and P. Zhao. Online learning: A comprehensive survey. arXiv
preprint arXiv:1802.02871, 2018.
G. James, D. Witten, T. Hastie, and R. Tibshirani. An introduction to statistical learning,
volume 112. Springer, 2013.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Wozniak. Ensemble learning
for data stream analysis: A survey. Information Fusion, 37:132–156, 2017.
L. Li, R. Sun, S. Cai, K. Zhao, and Q. Zhang. A review of improved extreme learning
machine methods for data stream classification. Multimedia Tools and Applications, 78
(23):33375–33400, 2019.
V. Losing, B. Hammer, and H. Wersing. Knn classifier with self adjusting memory for
heterogeneous concept drift. In 2016 IEEE 16th international conference on data mining
(ICDM), pages 291–300. IEEE, 2016.
O. Marschall, K. Cho, and C. Savin. A unified framework of online learning algorithms
for training recurrent neural networks. Journal of Machine Learning Research, 21(135):
1–34, 2020.
S. Mehta et al. Concept drift in streaming data classification: Algorithms, platforms and
issues. Procedia computer science, 122:804–811, 2017.
M. Mermillod, A. Bugaiska, and P. Bonin. The stability-plasticity dilemma: Investigating
the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in
psychology, 4:504, 2013.
A. H. Mirza, M. Kerpicci, and S. S. Kozat. Efficient online learning with improved lstm
neural networks. Digital Signal Processing, 102:102742, 2020.
J. Montiel, R. Mitchell, E. Frank, B. Pfahringer, T. Abdessalem, and A. Bifet. Adaptive
xgboost for evolving data streams. arXiv preprint arXiv:2005.07353, 2020.
A. Ng. Sentiment classification, 2018. URL
https://www.coursera.org/learn/nlp-sequence-models/lecture/Jxuhl/sentiment-classification.
C. Olah. Understanding LSTM networks. colah's blog, 2015.
N. C. Oza. Online bagging and boosting. In 2005 IEEE international conference on
systems, man and cybernetics, volume 3, pages 2340–2345. IEEE, 2005.
R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural
networks. In International conference on machine learning, pages 1310–1318, 2013.
F. Rosenblatt. The perceptron: a probabilistic model for information storage and organi-
zation in the brain. Psychological review, 65(6):386, 1958.
R. M. Schmidt, F. Schneider, and P. Hennig. Descending through a crowded valley-
benchmarking deep learning optimizers. In International Conference on Machine
Learning, pages 9367–9376. PMLR, 2021.
Z. Shao and M. J. Er. An online sequential learning algorithm for regularized extreme
learning machine. Neurocomputing, 173:778–788, 2016.
R. Xu, Y. Cheng, Z. Liu, Y. Xie, and Y. Yang. Improved long short-term memory based
anomaly detection with concept drift adaptive method for supporting iot services. Future
Generation Computer Systems, 112:228–242, 2020.
I. Zliobaite, M. Pechenizkiy, and J. Gama. An overview of concept drift applications. In
Big data analysis: new algorithms for a new society, pages 91–114. Springer, 2016.