Biclustering-based imputation in longitudinal data
Inês de Almeida Nolasco
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Professor Alexandra Sofia Martins de Carvalho, Professor Sara Alexandra Cordeiro Madeira
Examination Committee
Chairperson: Prof. João Fernando Cardoso Silva Sequeira
Supervisor: Prof. Alexandra Sofia Martins de Carvalho
Member of the Committee: Prof. Pedro Filipe Zeferino Tomás
May 2015
Resumo
Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disease that affects motor abilities. The degeneration of ALS patients progresses rapidly, and they usually die within a few years, mainly due to the impairment of respiratory functions. Identifying in time the need to start non-invasive ventilation is therefore vital. This problem is approached through a longitudinal analysis of clinical data following the same patients over a period of time. However, these studies, and the data on which they are based, suffer greatly from missing values, i.e., values that for some reason are absent, which makes extracting knowledge from the data considerably difficult. In this work, we address the problem of missing values in longitudinal data by applying biclustering-based techniques in order to find groups of patients that share the same trend in the evolution of the variables over time, and to impute the missing values based on that information. This approach is applied, together with other baseline methods for handling missing values, to synthetic data and to the real-world case of predicting NIV in ALS patients. The results indicate that the use of bicluster-based imputation improves the results on longitudinal data.
Keywords: Missing values, Longitudinal data, Bicluster-based imputation,
Amyotrophic lateral sclerosis, Non-invasive ventilation
Abstract
Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disorder that affects motor abilities. ALS patients deteriorate rapidly and die within a few years, mainly from respiratory failure, so being able to identify when Non-Invasive Ventilation (NIV) should be started is vital. The prediction of NIV is approached through the analysis of longitudinal data consisting of clinical follow-ups of patients over time. However, such data are very prone to missing values. In this work, the problem of missing values in longitudinal data is addressed by applying biclustering techniques in order to find trends in the data and thus enhance the imputation of missing values. The proposed approach was tested together with several baseline imputation methods on both synthetic and real-world data (an ALS dataset). The results indicate that the use of biclustering-based imputation generally improves the results.
Keywords: Missing Values, Longitudinal data, Bicluster-based imputation, ALS disease
Contents
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 Longitudinal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.1 Baseline methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Imputation in longitudinal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Biclustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Classification and performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Method 12
3.1 Biclustering-based imputation in longitudinal data . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Imputation methods applied . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 Results 17
4.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.2 Evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Real-World data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.2 Biclustering results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.3 Datasets imputation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.4 Classification results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5 Conclusions and Future work 36
Bibliography 38
A Synthetic data results 39
B Dataset statistical description 41
C Real-world data classification results 44
List of Tables
4.1 Results of Wilcoxon signed-rank tests for NaiveBayes classifier . . . . . . . . . . . . . . . 33
4.2 Results of Wilcoxon signed-rank tests for KNN classifier . . . . . . . . . . . . . . . . . . . 33
4.3 Results of Wilcoxon signed-rank tests for DT classifier . . . . . . . . . . . . . . . . . . . . 34
4.4 Results of Wilcoxon signed-rank tests for linearSVM classifier . . . . . . . . . . . . . . . . 34
4.5 Comparison of 3 symbols and 5 symbols in the classification results. . . . . . . . . . . . . 35
A.1 Results for synthetic datasets of size 1000*150. . . . . . . . . . . . . . . . . . . . . . . . . 39
A.2 Results for synthetic datasets of size 2000*200. . . . . . . . . . . . . . . . . . . . . . . . . 40
A.3 Results for synthetic datasets of size 5000*200. . . . . . . . . . . . . . . . . . . . . . . . . 40
B.1 Statistical description for each feature in the 1st time-point. . . . . . . . . . . . . . . . . . 41
B.2 Statistical description for each feature in the 2nd time-point. . . . . . . . . . . . . . . . . . 42
B.3 Statistical description for each feature in the 3rd time-point. . . . . . . . . . . . . . . . . . 42
B.4 Statistical description for each feature in the 4th time-point. . . . . . . . . . . . . . . . . . 43
B.5 Statistical description for each feature in the 5th time-point. . . . . . . . . . . . . . . . . . 43
C.1 Naive Bayes classification results for ALS data for unbalanced data. . . . . . . . . . . . . 44
C.2 Naive Bayes classification results for ALS data for balanced(SMOTE300) data. . . . . . . 45
C.3 Naive Bayes classification results for ALS data with SMOTE500. . . . . . . . . . . . . . . 45
C.4 LinearSVM classification results for unbalanced dataset. . . . . . . . . . . . . . . . . . . . 46
C.5 LinearSVM classification results for balanced datasets. . . . . . . . . . . . . . . . . . . . . 46
C.6 LinearSVM classification results for SMOTE500 data. . . . . . . . . . . . . . . . . . . . . 47
C.7 Decision Trees classification results for unbalanced data. . . . . . . . . . . . . . . . . . . 47
C.8 Decision Trees classification results for balanced data (SMOTE300). . . . . . . . . . . . . 48
C.9 Decision Trees classification results for SMOTE500 data. . . . . . . . . . . . . . . . . . . 48
C.10 K-Nearest-neighbor classification results on unbalanced data. . . . . . . . . . . . . . . . . 49
C.11 K-Nearest-neighbor classification results for balanced data (SMOTE300). . . . . . . . . . 49
C.12 K-Nearest-neighbor classification results for SMOTE500 data. . . . . . . . . . . . . . . . 50
List of Figures
3.1 An e-CCC bicluster example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Biclustering illustration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Illustration of biclustering-based imputation. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Imputation by bicluster pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Synthetic datasets description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Imputation methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Mean imputation error - MED and EM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Mean imputation error for different amount of missing values. . . . . . . . . . . . . . . . . 21
4.5 Robustness of imputation methods to missing values . . . . . . . . . . . . . . . . . . . . . 22
4.6 Comparison of imputation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.7 ALS dataset: Number of instances per class. . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.8 Summary description of missing values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.9 Proportion of missing values in class EVOL and noEvol . . . . . . . . . . . . . . . . . . . 25
4.10 Bicluster categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.11 Number of missings per bicluster group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.12 Real-world data datasets description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.13 Classification results with NaiveBayes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.14 Classification results with linearSVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.15 Classification results with K-nearest-neighbor. . . . . . . . . . . . . . . . . . . . . . . . . 32
4.16 Classification results with Decision Trees. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Chapter 1
Introduction
1.1 Motivation
Amyotrophic lateral sclerosis (ALS) is a motor-neuron disease that involves the degeneration of upper and lower motor neurons, causing muscle weakness and atrophy throughout the body. ALS forces people to need continuous care, and basic life functions become compromised near the final stages of the disease. Without any known cure, patient care is reduced to symptom relief, improving quality of life and increasing life expectancy as much as possible. With adequate care, patients' average life expectancy is between 2 and 5 years. The lives of both patients and their families worsen when respiratory difficulties appear and respiratory assistance, named non-invasive ventilation (NIV), becomes needed. It is a delicate stage, since from this point onward the patient becomes much more dependent on machines and caregivers, and it is also a process that involves high costs.
Several studies based on clinical follow-ups of patients focus on predicting the moment at which patients will need NIV. Clinical follow-ups are usually presented as longitudinal data, which consist of subjects being observed at several moments over some period of time. A recurring problem in this kind of study is missing values, i.e., observations that, for some reason, were not made or were not written down, so that no value is known for that feature of that person at that moment. Missing values imply that some information is missing, which hinders the extraction of knowledge from the data. Intuitively, depending on the magnitude and characteristics of these missing values, the conclusions drawn from the analysis of the data may be distorted. Studying how these missing values affect the conclusions drawn, and understanding which methods better deal with missing data, has become a pressing matter in the specific context of studies predicting NIV and, in general, in studies that use longitudinally structured data. As an example, in the ongoing work by Andre Carreiro et al. [2], the authors are able to predict whether NIV will be needed by the time of the sixth clinical evaluation, based on the five previous evaluations. It is this author's belief that improving the predictive performance of these studies may significantly help ALS patients to better cope with the conditions inflicted by this disease.
To tackle the problem of missing values in longitudinal data, in the specific context of the ALS disease, the present work explores the application of biclustering techniques with the objective of finding trends in the data and thus developing a novel and more accurate way to deal with missing values in longitudinal data. Although the concept of biclustering is not new, it has only recently been explored further, and new and robust algorithms have been developed. Due to this advance, it is now possible to explore the application of these techniques in other contexts, such as missing data.
1.2 Problem formulation
The problem of predicting the need for NIV in ALS patients is structured as a classification problem. Subjects, evaluated on ALS-related features at several moments, are labeled as Evol or noEvol according to whether or not they evolve from not needing NIV to needing NIV between the 5th and the 6th evaluation moment. This work tackles the problem of missing values in the specific context of predicting the need for NIV in ALS patients, aiming to improve the predictive power of these studies. In that sense, it is important to (1) understand the implications that missing values have on classification accuracy, (2) understand the relation between the data structure (longitudinal) and the capacity of the methods to deal with missing values, and, finally, (3) use the derived knowledge to enhance the accuracy of the classification problem by treating missing values in an intelligent and specially designed way.
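As a minimal sketch of this labeling rule (a hypothetical helper, not the actual preprocessing used in this work), per-patient NIV flags over the six evaluations could be mapped to a class as follows:

```python
# Hypothetical helper: niv_flags[i] is True if the patient was on NIV at
# evaluation i+1 (six evaluations in total).
def label(niv_flags):
    """'Evol' if the patient moved from no NIV at the 5th evaluation
    to NIV at the 6th; 'noEvol' otherwise."""
    return "Evol" if (not niv_flags[4]) and niv_flags[5] else "noEvol"

print(label([False, False, False, False, False, True]))   # Evol
print(label([False, False, False, False, False, False]))  # noEvol
```

The classifiers would then predict this label from the features observed at the first five evaluations.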
1.3 Dissertation outline
The present work is organized as follows. Chapter 2 consists of a literature review and a description of crucial concepts: the definitions of missing values, longitudinal data, and biclustering are presented, together with a survey of state-of-the-art work on each topic.
Chapter 3 describes the implementation of biclustering-based imputation, followed by a description of the methodology used to apply the other baseline methods.
Chapter 4 presents and discusses the results. The synthetic and ALS datasets are described, and the imputation evaluation, missing-value analysis, biclustering, and classification results are presented and discussed.
The Appendices provide a statistical description of each feature of the ALS dataset at each time point, as well as the extensive results for the synthetic and ALS data.
Chapter 2
Background
2.1 Longitudinal data
Longitudinal studies, from which longitudinal data result, are designed to repeatedly observe the same subjects and measure the same features of interest over long periods of time. The main advantage of this design is that it allows one to understand the evolution or change in behavior of the subjects while excluding time-invariant characteristics that could cloud the conclusions.
Regarding the moments at which the measurements take place, these may all be fixed, i.e., regular moments equal for every subject, or they may differ for each subject, with different intervals in between. The difference between this and other time-aware designs is not always clear; in particular, time-series data also consist of successive measurements made over some time interval, yet the fields of research using one design or the other are very different. Time-series data usually present a large number of time points, and the rate at which the observations are made is much faster. For instance, the data acquired from a sensor during a certain period of time may be seen as a time series.
2.2 Missing values
Missing data occur when no data values are stored for some variables. An example of missing data is the non-response to some questions of a survey. This may have different reasons; as Little and Rubin exemplify in [8], a person may not answer because she refuses to, or because she does not know the answer. These two different reasons represent distinct mechanisms that lead to data being missing and should be treated differently when analyzing the data.
Missing data can then be classified according to the mechanism that causes it to be missing. In [1] a formalization is proposed in the following way: Y is the variable under observation, which may have missing values; X denotes the other observed variables, without missing values; and R is a response indicator which takes the value 1 for a missing value of Y and 0 for an observed value of Y. Given this, missing data is classified as:
Not missing at random (NMAR) If the reason why the data is missing is related to the value itself, then the data is said to be not missing at random. This means that the probability of observing a missing value in Y is not independent of the variable Y itself, i.e., P(R=1|X,Y) = P(R=1|Y).
Missing at random (MAR) Data missing at random has missing values unrelated to the value itself, but related to other variables under observation. Therefore, MAR data has the probability of having missing values in Y described as P(R=1|X,Y) = P(R=1|X).
Missing completely at random (MCAR) The values missing in this category are completely independent of any dimension of the data. Given this, the probability of missing values in MCAR data can be described as P(R=1|X,Y) = P(R=1).
For an illustrative example, consider a survey which asks people about income, gender, age, and schooling level: a person with low income may be more likely to refuse to answer the income question, i.e., P(R=1|X,Y) = P(R=1|Y), which falls into the NMAR category. Another case would be a person with a low schooling level being more likely not to answer the income question, i.e., P(R=1|X,Y) = P(R=1|X), falling into the MAR category. If, otherwise, the response or non-response to the income question is completely independent of any variable, then we face truly MCAR data.
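The three mechanisms can be illustrated with a small simulation (hypothetical income/schooling variables and probabilities, chosen for illustration only):

```python
import random

random.seed(0)

# Hypothetical sample: schooling level (X, fully observed) and income (Y,
# which may go missing). Income depends on schooling.
n = 1000
schooling = [random.randint(1, 4) for _ in range(n)]            # X: 1 (low) .. 4 (high)
income = [1000 * s + random.gauss(0, 200) for s in schooling]   # Y

def mask(values, p_missing):
    """Replace each value with None, with a per-element missing probability."""
    return [None if random.random() < p else v for v, p in zip(values, p_missing)]

# MCAR: P(R=1 | X, Y) = P(R=1) -- a constant probability for every record
mcar = mask(income, [0.2] * n)

# MAR: P(R=1 | X, Y) = P(R=1 | X) -- depends only on the observed schooling
mar = mask(income, [0.4 if s == 1 else 0.1 for s in schooling])

# NMAR: P(R=1 | X, Y) = P(R=1 | Y) -- low incomes are more often hidden
nmar = mask(income, [0.4 if y < 2000 else 0.1 for y in income])

for name, col in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    observed = [v for v in col if v is not None]
    print(name, round(sum(observed) / len(observed)))
```

Under MCAR the observed mean stays close to the true mean, while under MAR and NMAR the observed mean is biased upward, since low incomes are deleted more often.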
2.2.1 Baseline methods
If we cannot consider data to be at least MAR (it being therefore NMAR), then the missing-data mechanism must be modeled and taken into account in the analysis performed on the data. This means we must derive exactly how the missing values depend on the variable's value and/or on other variables under observation. In a large number of studies this may be impossible. As Donders and van der Heijden point out in [4], if data is NMAR then no universal method for dealing with missing data exists. Some new tools are being developed in this context, but they are out of the scope of this thesis and will not be addressed here. The methods discussed henceforward are based on the assumption that the data is at least MAR.
Listwise deletion Deletes all instances in which missing values appear. This generates a complete
data set with fewer instances on which the usual estimation process may be performed.
Pairwise deletion Only deletes instances that have missing values on the variables under analysis, but
uses the same instances when analyzing other variables without missing values.
These methods are usually categorized as complete-dataset approaches, since the general strategy is to clean the dataset of missing values in order to obtain a "complete" dataset on which to perform the desired analysis. Their main problem is the resulting bias in descriptive statistics, especially when the data is only MAR; another aspect worth considering is the decrease in statistical power that follows from the lower number of sample instances. Regarding the applicability of these methods, the authors in [3] are straightforward: they may be applied only if the amount of available data is large or if the missing values correspond to a small percentage of the data. Otherwise, and in cases where the missing information is necessary, other methods that predict the exact missing value should be applied.
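A toy sketch of the difference between the two deletion strategies (hypothetical records and field names):

```python
# Toy dataset: each record may have missing (None) entries
records = [
    {"age": 34, "fvc": 80},
    {"age": 51, "fvc": None},   # missing fvc
    {"age": None, "fvc": 65},   # missing age
    {"age": 47, "fvc": 72},
]

# Listwise deletion: drop every record containing any missing value
listwise = [r for r in records if all(v is not None for v in r.values())]

# Pairwise deletion: for each variable, use all records where THAT
# variable is observed, even if other variables of the record are missing
def observed(var):
    return [r[var] for r in records if r[var] is not None]

print(len(listwise))            # 2 complete records remain
print(len(observed("age")))     # 3 values usable for age statistics
print(len(observed("fvc")))     # 3 values usable for fvc statistics
```

Pairwise deletion keeps more values per variable, but different statistics end up being computed on different subsets of the sample.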
Imputation methods
Following the need to predict missing values, imputation methods were developed. These are based on the assumption that the inferences drawn from a representative sample should be similar if the same analysis is performed on a different, but also representative, sample. Therefore, as A. Rogier et al. [4] state, if we replace one subject of the sample with a subject drawn at random from the population under study, the analysis results should not be very different.
R. Little and D. Rubin [8] also propose an intuitive motivation for applying these methods: if in our study we have some variable Yj which is highly correlated with some other observed variable Yk, and some values are missing in Yj, then it is very tempting to predict those Yj values from Yk.
Taking that into account, the overall scheme of these methods is to impute, in the place of the missing values, values that make sense to be there. It is in the approach used to decide which values to impute that the methods differ.
The general positive point of this kind of method is that, after the imputation process is done, we may work with a full-sized sample and perform the exact same analysis we would have done if no missing values had ever occurred. This is, however, not without problems, which are discussed throughout this section.
Hot-deck imputation If a value is missing in the responses of some subject, a sub-sample of subjects with similar characteristics is formed and the missing value is imputed with a value taken at random from that subgroup. This approach results in biased estimators and misleadingly small standard errors, since no new information is being added. Another problem with this method is the difficulty, or even impossibility, of obtaining a subset of subjects with similar characteristics. In practice, ad hoc strategies are employed.
Mean imputation In the place of missing values, this method imputes the estimated sample mean of the variable that shows missing values. This approach may be generalized to use any metric of frequency or central tendency to infer missing values. However, the same problems pointed out before exist: as we are not imputing new information (the imputed values are based on the sample and do not increase the associated uncertainty) and as we are using a full-sized sample, the standard errors of the estimated parameters will be misleadingly small. The bias problem also remains, especially if the data is not MCAR.
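A minimal sketch of mean imputation on toy values, which also shows the artificial shrinkage of the sample spread mentioned above:

```python
import statistics

values = [10.0, 12.0, None, 11.0, None, 13.0, 9.0]

observed = [v for v in values if v is not None]
mean = statistics.mean(observed)

# Mean imputation: fill every missing slot with the observed mean
imputed = [mean if v is None else v for v in values]

# The sample standard deviation shrinks: the imputed values sit exactly
# at the mean and add no spread, so dispersion is underestimated
print(statistics.stdev(observed) > statistics.stdev(imputed))  # True
```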
Regression imputation Based on the sample, a regression function is constructed with the variables with missing values as the dependent ones and the other observed variables as the predictors. The missing slots are then imputed with values resulting from that function. Several kinds of regression functions may be derived and should be used according to the type of data at hand, for instance linear or logistic regressions. The problems mentioned above are still not solved by general regression imputation methods, since the biased estimators and small standard errors persist.
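A minimal sketch of regression imputation with a simple least-squares line (toy data; real applications would typically use richer regression models over several predictors):

```python
# Toy data: Y is roughly 2*X, with one value of Y missing
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, None, 8.1, 9.9]

# Fit a least-squares line on the observed (x, y) pairs
pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
n = len(pairs)
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
slope = (sum((x - mx) * (y - my) for x, y in pairs)
         / sum((x - mx) ** 2 for x, _ in pairs))
intercept = my - slope * mx

# Impute each missing Y with the value predicted from its X
imputed = [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]
print(round(imputed[2], 2))   # 6.0 for this roughly linear toy data
```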
The stated imputation methods may be considered part of the single-imputation group, meaning that the prediction of values to impute is performed, and the values imputed, only once. Some argue that this strategy underestimates the imputation error, since the imputed data does not reflect the uncertainty created by the missing values, i.e., the imputed values carry no uncertainty about the choice of value to impute, thus leading to overconfidence in the resulting estimations. Therefore, a mechanism that reflects the uncertainty in the choice of the value to impute is needed. Donald B. Rubin proposed, in [14], such a method, named multiple imputation, that tackles this specific problem.
Multiple imputation This method belongs to the imputation methods category, since the rationale and overall approach are the same. It, however, addresses the problems shown above, and its results have been shown to be substantially better [14]. It starts by constructing a regression function in the same way as in regression imputation. In this case, however, an error term is introduced into the regression function, so that each time the function is used to derive a value for the dependent variable, a different value is obtained. The imputation process is then performed several times, i.e., a value is derived from the current regression function and imputed in the place of the missing value, and this procedure is repeated several times, generating several different datasets. An analysis may then be performed on each dataset and the estimated parameters averaged across all the datasets analyzed, so that the resulting statistics no longer suffer from underestimated standard errors.
This method works more as a tool than as a standalone method, and it can be applied even when using other kinds of models on the data, i.e., not only regression models.
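A schematic sketch of the multiple-imputation cycle just described (toy data; a full implementation would also draw the regression parameters themselves from their posterior, which this sketch omits for brevity):

```python
import random
import statistics

random.seed(1)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, None, 8.0, None, 12.1]   # roughly Y = 2*X, two values missing

# Fit a simple regression on the observed pairs (as in regression imputation)
pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
mx = sum(x for x, _ in pairs) / len(pairs)
my = sum(y for _, y in pairs) / len(pairs)
slope = (sum((x - mx) * (y - my) for x, y in pairs)
         / sum((x - mx) ** 2 for x, _ in pairs))
intercept = my - slope * mx
residual_sd = statistics.stdev(y - (slope * x + intercept) for x, y in pairs)

# Multiple imputation: generate m completed datasets, adding random noise
# to each prediction so that every dataset differs
m = 20
completed_means = []
for _ in range(m):
    filled = [slope * x + intercept + random.gauss(0, residual_sd) if y is None else y
              for x, y in zip(xs, ys)]
    completed_means.append(statistics.mean(filled))

# Pool: average the estimate over the m datasets; the spread between them
# reflects the uncertainty that single imputation would hide
print(round(statistics.mean(completed_means), 1))
print(statistics.stdev(completed_means) > 0)   # True: imputations differ
```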
Expectation-maximization imputation As a multiple-imputation technique, the Expectation-Maximization (EM) algorithm is generally used. It provides a methodological way of applying some specific method, e.g., regression imputation, and improving its results. For an imputation process, it generally works as follows: an E step (expectation) derives some descriptive model from the available data, and an imputation is made using that model. Then the M step (maximization) follows, which means that, having a complete imputed dataset, the parameters defining the previous model may be re-estimated and a new model derived. The two steps are repeated until the model converges. This way, by successive iterations, it is possible to start with simple models describing the data and finally achieve enhanced and more accurate models, which lead in the end to better imputed values. In [16] the author developed an EM imputation algorithm for Matlab.
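The E/M alternation can be sketched as follows (a deliberately minimal one-parameter Gaussian model, not the Matlab implementation cited; with a single mean the iteration converges immediately, whereas richer models such as regressions make the repeated re-estimation meaningful):

```python
import statistics

values = [4.0, None, 6.0, 5.0, None, 7.0]

# Start from a simple model: the mean of the observed values
mu = statistics.mean(v for v in values if v is not None)

for _ in range(50):
    # E step: impute missing entries with the current model's expectation
    filled = [mu if v is None else v for v in values]
    # M step: re-estimate the model parameter from the completed data
    new_mu = statistics.mean(filled)
    if abs(new_mu - mu) < 1e-9:   # converged
        break
    mu = new_mu

print(mu)   # converges to the observed mean, 5.5
```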
An important category of imputation procedures are the so-called model-based imputation methods, i.e., methods in which the imputed values are obtained through some function that models the data at hand. The simplest model-based imputation methods are the regression imputation methods, in which the variables are modeled through a regression function. These methods are usually applied in an iterative way that generally repeats two main steps: first, learn a model from the observed data; second, impute the missing values according to the learnt model. As mentioned before, an example of such an iterative process is the EM algorithm.
Few advances have taken place in this area; as Joseph Schafer and John Graham conclude in [15], multiple imputation with regression and maximum likelihood with the EM algorithm are still considered the best approaches to deal with missing data. Recently, the focus has turned to the development of methods able to model more complex datasets and exploit their specific aspects. Complex datasets may refer to datasets that mix together a variety of different variable types, such as categorical, binary, scalar, or continuous. It may also refer to the relations between variables, or to the complexity of the dataset design, as is the case with longitudinal data.
To deal with variables of different types, an important algorithm was developed in [11]: the imputation and variance estimation software (IVEWARE), which implements a sequential regression imputation approach. It consists of an iterative process that estimates missing values by fitting a sequence of regression models and drawing values from the corresponding predictive distributions. A drawback of this algorithm is that it is not sufficiently robust to outliers, producing poor results when outliers are present. An improvement on this algorithm, the Iterative Robust Model-based Imputation (IRMI), was presented in [17]; it follows the same approach as IVEWARE but yields more robust imputed values.
In [12], Michael Richman et al. introduced and evaluated the use of machine learning algorithms, such as support vector machines (SVM) and neural networks (NN), to model missing data in the imputation process. In particular, the idea was to use SVM and NN to construct the regression functions from which the values to impute would be derived. Comparing these algorithms against more basic imputation approaches led to the conclusion that the methods tested are better in general and, in the case of nonlinearly correlated variables, substantially better than the other basic approaches tried.
Specifically for categorical data, regression trees have been successfully used, as described in [18]. In [13], however, Vanessa Romero and Antonio Salmeron explored the use of Bayesian networks (BN) for multivariate imputation of categorical variables, with better results than those obtained using regression trees. BN are useful tools to obtain a joint probability distribution over all features. The overall idea of their work is to learn a BN for the set of variables in the dataset and then impute the missing values with values that increase the likelihood of the data according to the derived joint probability distribution.
Some recent studies, such as [3] and [5], revisit the hot-deck imputation idea, where subgroups of the data (also called classes) are formed following some coherence rule, i.e., they are formed such that, on selected attributes, the data from same-group objects behave according to that coherence. It is thus expected that imputing missing values (whatever technique is applied) considering only data from the same group should perform better than applying the same imputation method to the global dataset. In [3] the authors use the concept of biclustering to generate those classes, with the Mean Squared Residue metric as the coherence rule.
2.2.2 Imputation in longitudinal data
Throughout the course of this work it became clear that the imputation process is very sensitive to the data's structure and design. The works mentioned above are an example of that, where specific tools are used to tackle specific aspects of the data.
The same conjecture appears to hold when dealing with longitudinal data. Indeed, knowing the exact structure of a dataset and, in this case, understanding that attributes evolve over time is of interest when developing methods to deal with missing values in longitudinal data. The following methods are extensions of the baseline methods rather than new contributions to missing-data management.
Individual mean imputation The missing value is imputed with the mean of the values observed at the distinct time points for the same instance.
Time mean imputation The missing value is imputed with the mean of the values observed in distinct instances at the same time point. As before, imputation with other metrics, e.g., the median, is possible and should be considered according to the dataset's characteristics.
Last Value Carried Forward (LVCF) A missing value at time point t is imputed with the previous (t-1) observed value of that instance. Several modifications of this approach may be found; for instance, instead of the previous value one might use the nearest, the latest, or the next (t+1) observed value. The decision about which to use always depends on prior knowledge about how the variables evolve over time. Nonetheless, these methods assume a more or less constant value over time and, if that is not the case, more sophisticated approaches should be considered.
Linear interpolation imputation A missing value in time point t is imputed with the average of the
previous (t-1) and next (t+1) observed values. This approach assumes a linear development in
time of the variables under imputation.
Longitudinal regression methods In [19] the authors propose two methods: individual longitudinal
regression imputation and population longitudinal regression imputation. Both methods are a ver-
sion of a simple regression imputation method; however, when choosing the predictors to compute
the regression functions, they take time into account as an important component. Individual
longitudinal regression imputation derives a regression function between time and the attribute
with missings in the same instance. This must be performed for each object separately, since it
is assumed that the dependency of the variable on time is not the same for every object in the
dataset. In population longitudinal regression imputation the regression function uses not only time
as a predictor but other attributes from other individuals as well.
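The baseline longitudinal strategies listed above can be sketched in a few lines. The sketch below is illustrative only: it is written in Python with NumPy rather than the Matlab used in this work, and the function and parameter names are invented for the example.

```python
import numpy as np

def impute_longitudinal(X, method="lvcf"):
    """Impute NaNs in a (subjects x time points) matrix with one of the
    baseline longitudinal strategies. Illustrative sketch, not thesis code."""
    X = X.astype(float).copy()
    n, t = X.shape
    for i in range(n):
        for j in range(t):
            if not np.isnan(X[i, j]):
                continue
            if method == "individual_mean":      # mean over the subject's observed time points
                obs = X[i, ~np.isnan(X[i])]
                if obs.size:
                    X[i, j] = obs.mean()
            elif method == "time_mean":          # mean over subjects at the same time point
                obs = X[~np.isnan(X[:, j]), j]
                if obs.size:
                    X[i, j] = obs.mean()
            elif method == "lvcf" and j > 0:     # last value carried forward
                X[i, j] = X[i, j - 1]
            elif method == "linear" and 0 < j < t - 1:
                prev, nxt = X[i, j - 1], X[i, j + 1]
                if not (np.isnan(prev) or np.isnan(nxt)):
                    X[i, j] = (prev + nxt) / 2   # average of t-1 and t+1
    return X
```

Note that LVCF leaves the first time point missing and linear interpolation leaves the boundaries missing, which is exactly why combinations of methods, as discussed later, are of interest.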
Related to this subject, the work by J. Honaker and G. King is worth mentioning. In [7] they developed
the AMELIA II algorithm, which applies multiple imputation to time series cross sectional (TSCS) data.
Although TSCS data differs from longitudinal data, in particular in the number of time points available and
in the number of subjects, the problems in applying a regular multiple imputation method are similar, and
the solutions found, even if not directly applicable to the longitudinal case, are worth exploring.
Several studies try to understand whether the use of specially developed methods for longitudinal data
yields better imputed values than the general baseline methods. In this context two studies, [5]
and [19], arrive at the same conclusion: if we are dealing with longitudinal data, then the longitudinal
aspects should be taken into account in the imputation method; in both cases the longitudinal methods
applied yielded better imputations than the baseline methods.
2.3 Biclustering
Clustering stands for grouping together objects that are in some sense similar to each other or that have
more in common with each other than with the rest of the objects. The rules that define which objects
belong to the same cluster may be very different, and may even be applied to discretized versions of the
data or to the real values directly, depending on the problem at hand. In the simplest cases, measures of
distance between objects are used. Clustering may also be applied in more than one dimension. Consider
a data matrix A of size n×m, where rows represent objects and columns are attributes of interest.
Clustering applied to objects will group together objects that present the closest values over all attributes,
resulting in a sub-matrix of size s×m containing all the selected objects (s < n). But clustering may also
be applied to the columns (attributes), grouping together attributes that behave in a similar way for all
objects and resulting in a sub-matrix n×p with n objects and p attributes (p < m). In some contexts,
clustering simultaneously in both dimensions (rows and columns) is also of interest. This means that a
subset of objects and attributes is selected to form a sub-matrix of size k × t with k < n and t < m,
called a bicluster. In [9] the authors express the underlying difference between these clustering
approaches: clustering methods derive a global model for the data, since each object belonging to a
cluster is selected using all attributes, or vice-versa. Biclustering, however, produces local models, since
objects and attributes belonging to a bicluster are selected considering only a subset of attributes and
objects.
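The three kinds of sub-matrix can be made concrete with plain NumPy indexing. The index sets below are arbitrary examples, chosen only to illustrate the shapes involved.

```python
import numpy as np

A = np.arange(30).reshape(5, 6)           # data matrix: 5 objects x 6 attributes

# Row clustering: select objects using ALL m attributes -> s x m sub-matrix
row_cluster = A[[0, 2], :]                # shape (2, 6), s < n

# Column clustering: select attributes using ALL n objects -> n x p sub-matrix
col_cluster = A[:, [1, 4]]                # shape (5, 2), p < m

# Biclustering: select a subset of BOTH objects and attributes -> k x t
bicluster = A[np.ix_([0, 2], [1, 4])]     # shape (2, 2): a local model
```

The last line captures the "local model" idea: both the row subset and the column subset are chosen together, rather than one dimension being conditioned on the entirety of the other.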
Time-series data biclustering
Biclustering has also been shown to be an interesting technique to apply to time-series data, with
the objective of finding objects that behave in a similar manner over the same time points, i.e., objects
that present the same trends over time. For that purpose, consider the application of biclustering to a
data matrix A where, instead of different attributes, columns represent different time slices of the
same feature. If the goal is to find local trends over time, then it makes sense for the columns,
representing time points, selected for each bicluster to be contiguous. This restriction was further
explored by the authors in [10], who proposed an algorithm able to find such biclusters in linear time,
the so-called Contiguous Column Coherent Biclustering (CCC-Biclustering) algorithm. In this algorithm
a discretized version (A) of the data matrix (A') is generated, and rows and columns are grouped together
following two rules: first, columns must be contiguous; and second, discretized symbols in each column
must be the same in every row selected.
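The two rules can be illustrated by a brute-force enumeration over a small discretized matrix. This sketch only shows what a CCC-bicluster is; the actual CCC-Biclustering algorithm of [10] uses suffix trees to find all maximal biclusters in linear time, which the quadratic loop below does not attempt to reproduce.

```python
from collections import defaultdict

def ccc_biclusters(D, min_rows=2, min_cols=2):
    """Brute-force illustration of the CCC-bicluster definition: groups of
    rows sharing the exact same symbols over a contiguous run of columns.
    D is a list of rows of discretized symbols. Illustrative sketch only."""
    n, t = len(D), len(D[0])
    found = set()
    for c0 in range(t):
        for c1 in range(c0 + min_cols, t + 1):          # contiguous columns only
            groups = defaultdict(list)
            for r in range(n):
                groups[tuple(D[r][c0:c1])].append(r)    # same symbols in every row
            for pattern, rows in groups.items():
                if len(rows) >= min_rows:
                    found.add((tuple(rows), c0, c1, pattern))
    # keep only column-maximal biclusters (same rows, not extendable)
    maximal = [b for b in found
               if not any(r == b[0] and c0 <= b[1] and c1 >= b[2]
                          and (c0, c1) != (b[1], b[2])
                          for (r, c0, c1, _p) in found)]
    return maximal
```

Running it on a 3×3 symbolic matrix where the first two rows agree on the first two columns returns exactly that bicluster.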
2.4 Classification and performance metrics
Synthetic minority over sampling technique (SMOTE)
To mitigate the effects of unbalanced data on classification results, Nitesh V. Chawla et al. developed
the Synthetic Minority Over-sampling Technique (SMOTE), which artificially balances the dataset
between the two classes by over-sampling the minority class through the introduction of synthetic in-
stances. This tool allows the user to define the desired oversampling as a percentage of data
to be created. As an example, a SMOTE of 100% would result in twice as many minority-class
instances.
Performance metrics
In order to evaluate classifications, several metrics are used; here a brief description of some of these
metrics is presented. When dealing with a classification problem, typically a positive class and a
negative class are defined, e.g., in the context of medical studies the positive class is usually assigned
to the sick patients and the negative class to the healthy persons.
True positive and True Negative rates A true positive is an instance that has been correctly classified
as positive and a true negative is an instance that is correctly classified as negative. The TP
rate and TN rate measure the proportion of actual positives or negatives that
are correctly identified as such; in other words, it is the number of instances correctly classified
as positive (TP) or negative (TN) divided by the total number of positive (P) or negative (N)
instances. This is defined as:

TP rate = TP / P (2.1)

TN rate = TN / N (2.2)

True Positive rate is also usually called recall and serves as a measure of the capacity of classifiers
to retrieve relevant information, i.e., how many relevant instances it is able to find from all relevant
instances.
Precision Is the fraction of the number of instances correctly classified as positive (TP) over the
total number of instances classified as positive, either correctly (TP) or incorrectly (FP). This metric
measures whether the classification is producing more correctly classified positive instances than
incorrect ones.

Precision = TP / (TP + FP) (2.3)
K-statistics Metric that compares the observed accuracy with the accuracy of random chance; this
gives a general idea of whether the classifier behaved as a random system or not.

K-statistics = (totalAccuracy − randomAccuracy) / (1 − randomAccuracy) (2.4)

totalAccuracy = (TP + TN) / (TP + TN + FP + FN) (2.5)

randomAccuracy = ((TN + FP) × (TN + FN) + (FN + TP) × (FP + TP)) / (total × total) (2.6)
F-measure incorporates both Precision and Recall, measuring the performance from both
aspects. Its definition is as follows:

F-measure = 2 × (Precision × Recall) / (Precision + Recall) (2.7)
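These metrics can all be derived from the four confusion-matrix counts. The following sketch (using the standard harmonic-mean form of the F-measure; division-by-zero guards omitted for brevity) shows the computation:

```python
def classification_metrics(TP, TN, FP, FN):
    """Compute the metrics of Eqs. (2.1)-(2.7) from confusion-matrix counts.
    Illustrative sketch; assumes all denominators are non-zero."""
    P, N = TP + FN, TN + FP
    total = P + N
    tp_rate = TP / P                                     # recall, Eq. (2.1)
    tn_rate = TN / N                                     # Eq. (2.2)
    precision = TP / (TP + FP)                           # Eq. (2.3)
    total_acc = (TP + TN) / total                        # Eq. (2.5)
    random_acc = ((TN + FP) * (TN + FN)
                  + (FN + TP) * (FP + TP)) / total ** 2  # Eq. (2.6)
    kappa = (total_acc - random_acc) / (1 - random_acc)  # Eq. (2.4)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)  # Eq. (2.7)
    return dict(tp_rate=tp_rate, tn_rate=tn_rate, precision=precision,
                kappa=kappa, f_measure=f_measure)
```

For instance, with TP = 8, TN = 5, FP = 5, FN = 2 the random accuracy is exactly 0.5, so a total accuracy of 0.65 yields a kappa of 0.3.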
Chapter 3
Method
3.1 Biclustering-based imputation in longitudinal data
Following the idea of imputation inside classes or subgroups introduced in Section 2.2.1, this work
further explores the use of biclustering techniques, applied to our longitudinal dataset, in order
to create similarity groups and perform group-dependent imputation. The idea is that if the imputation
is carried out based on groups that share some similarity, the imputed values are expected to be more
accurate than if the imputation were made considering the whole dataset. Considering time-
dependent aspects in the imputation process, the idea of grouping persons according to some similarity
can be described as looking for local trends in the data, i.e., we want to group together patients that,
for some feature, show the same evolution over time. Any missing value is then imputed
taking into account the group trend for that feature. The biclustering strategy used in the course of this
work is CCC-Biclustering which, as previously explained (in Section 2.3), only finds biclusters with
contiguous time points. However, a slightly different version of this algorithm has to be used, e-
CCC-Biclustering, which allows the biclustering of objects with approximate similarity instead of an exact
one, i.e., selected patients in the defined time points do not need to be precisely equal among themselves,
but only similar to some defined degree that may admit mismatches or missing values. In the context
of this work, allowing missing values inside biclusters is imperative to be able to group together subjects
that present missing values in some time points, and to generate biclusters as shown in Figure 3.1.
However, we are not interested in allowing other mismatches/errors in the bicluster computation, so
the e-CCC-Biclustering algorithm was modified to allow only missing values as errors. Before applying
the biclustering algorithm, the data needs to be processed in order to generate one-feature matrices for
each longitudinal feature in the original dataset. Each data matrix consists of the observations of the
considered feature for all patients at different time points, and it constitutes the base data matrix on which
biclustering is performed. In Figure 3.2 the strategy used to bicluster this data is presented step by step.
After the generation of each one-feature matrix, as described in Section 2.3, the matrix needs
to be discretized, since this algorithm applies biclustering over a discretized version of the data.

Figure 3.1: Illustration of an e-CCC-bicluster containing samples with missing values.

Figure 3.2: Bicluster computation workflow. First, construct the one-feature matrices by separating the dataset into sub-datasets of only one feature (but with all samples and all time points). Then apply the modified e-CCC-Biclustering algorithm to find biclusters in samples that have the same discretized values, with only missing values as the possible differences.

There are several options regarding the discretization method, but the one that seems most convenient for our
problem is discretization into n symbols performed with equal width by subject.
This discretization approach looks into the values of each subject across time and creates n
bins of equal width that correspond one-to-one to the symbols of the discretization alphabet. Alphabets
of 3 symbols are usually used, in particular the following sequence of letters in ascending or-
der: (D,N,U). But other alphabets are possible; in this work, in order to understand the effect of
discretization on the results of biclustering-based imputation, discretization with 5 symbols is also
used, corresponding to the following alphabet: (A,B,C,D,E).
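The per-subject equal-width discretization can be sketched as follows (an illustrative Python sketch, not the implementation used in this work):

```python
import numpy as np

def discretize_by_subject(X, alphabet=("D", "N", "U")):
    """Equal-width discretization per subject (row): each subject's observed
    range is split into len(alphabet) equal-width bins, mapped to symbols in
    ascending order. NaNs stay missing. Illustrative sketch only."""
    n_bins = len(alphabet)
    out = np.full(X.shape, None, dtype=object)
    for i, row in enumerate(X):
        obs = row[~np.isnan(row)]
        lo, hi = obs.min(), obs.max()
        width = (hi - lo) / n_bins or 1.0        # guard against constant rows
        for j, v in enumerate(row):
            if not np.isnan(v):
                out[i, j] = alphabet[min(int((v - lo) / width), n_bins - 1)]
    return out
```

Because the bins are computed per subject, two patients with very different absolute values but the same shape of evolution receive the same symbol sequence, which is precisely what makes trends comparable across subjects.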
A point to consider is that not all resulting biclusters are of interest: they may be too small or trivial1,
they may be statistically insignificant, or they may simply not make much sense in the context of the
problem. For instance, we are interested in finding local trends in six-time-point longitudinal data,
and a bicluster of only two time points is not very consistent with that. To understand the
implications of these aspects on the imputations performed, and what kind of metric would be interesting
to apply, the use of four different sets of biclusters in the imputation process is considered: (1) ALL:
all non-trivial biclusters; (2) SIG: only significant biclusters, i.e., with an associated p-value lower than
0.05; (3) TP: all non-trivial biclusters with at least 3 time points, i.e., biclusters that present at least 3
columns; and (4) SIGeTP: only significant biclusters with at least 3 time points.
1Herein, trivial means a bicluster with only one time point or with only one sample/person.
Once the biclusters are formed, the imputation may be performed inside the biclusters exactly as it
would be done on a whole dataset, except that it is performed on the smaller group of data that forms the
bicluster. This process receives as input each one-feature matrix (representing each longitudinal feature)
with missing values and a description of the biclusters found after running the e-CCC-Biclustering algorithm.
It starts by associating each missing value with one single bicluster from which the imputation is to
be performed. After the one-to-one relation between missing value and bicluster is computed, the local
imputation process takes place and a single value to impute is found. This value is then imputed, in
the one-feature matrix, in the place of the missing value. A schematic representation of this process is
presented in Figure 3.3. Since several biclusters may contain the same missing value and a single
bicluster must be selected, in any of the four cases enumerated above the biclusters are
evaluated according to their statistical significance and the most significant one is selected. Furthermore,
some missing values do not fall inside any bicluster. In these cases, the missing value will either remain
missing, i.e., will not be imputed, or is imputed with an additional method that is applied
to the whole one-feature matrix.
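The association step, selecting the most significant covering bicluster per missing value, can be sketched as follows. The (rows, cols, p_value) tuple layout is an assumed representation for the example, not the data structure of this work's implementation:

```python
def assign_biclusters(missing_positions, biclusters):
    """Map each missing (row, col) position to the containing bicluster with
    the lowest p-value, or None when no bicluster covers it. `biclusters` is
    a list of (rows, cols, p_value) tuples. Illustrative sketch only."""
    assignment = {}
    for pos in missing_positions:
        r, c = pos
        covering = [b for b in biclusters if r in b[0] and c in b[1]]
        # most statistically significant = lowest p-value
        assignment[pos] = min(covering, key=lambda b: b[2]) if covering else None
    return assignment
```

Positions mapped to None are exactly the ones that either stay missing or fall back to the whole-matrix imputation method.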
Figure 3.3: Illustration of the biclustering-based imputation process. For each missing value, it finds a bicluster that contains it. Next, it takes the sub-matrix of the data contained in that bicluster and performs imputation on that missing value.
3.2 Imputation methods applied
In this section, a description of the imputation procedures that are applied and compared is presented.
Expectation maximization Imputation using the EM approach is performed with the Matlab software, specif-
ically the EM imputation implementation described in [16]. This approach receives as input a matrix
with missing values encoded as Not a Number (NaN) and outputs the same matrix with
the missing values imputed.
Median cross subjects This approach was implemented in the Matlab environment. The procedure
computes, for each one-feature matrix, the median for each time point across all subjects, then
imputes each missing value with the corresponding computed median.
Median longitudinal A variation on the previous implementation was also developed, where the median
values to impute are computed separately, not only for each feature, but also for each subject, across
all time points.
Bicluster-based imputation Following the strategy described in Section 3.1, three imputation methods
applied to the biclusters were explored: imputation with median across persons, imputation with EM,
and imputation by bicluster pattern. The first two approaches are simply direct applications of the
previously introduced implementations and differ only in that they are applied to a much
more restricted group, the bicluster. The last method uses the information of the bicluster pattern
and the local values from the same person to predict the value to impute. The strategy used is:
first, select the letter of the bicluster pattern corresponding to that missing value, and then apply
reverse discretization to it, i.e., based on the other values available for that person, compute the value
interval that letter represents and impute with the mean value of that interval. An example
of this process is presented in Figure 3.4.
Figure 3.4: Illustration of the biclustering-based imputation using the "by pattern" approach. First, determine which discretization letter corresponds to the missing value. Second, compute the mean value of the interval that letter represents for the given subject. The imputation is then performed with this mean value.
Chapter 4
Results
The goal of this chapter is two-fold: (1) evaluate the effectiveness of the imputation methods in longitu-
dinal data; (2) evaluate how general the conclusions are, by testing the imputation algorithms in several
datasets.
To properly evaluate the imputation methods, synthetic datasets were generated. Synthetic datasets
are essential in this work, since they make it possible to compare the imputations against ground truth,
allowing reasoning about aspects that are impossible with real data. For example, to evaluate imputation
methods, one should check whether the predicted values are close to the real ones, i.e., the values that
should be there but went missing. Using synthetic data, the missing values are known a priori and thus an
assessment of this kind is possible. To tackle the second evaluation point, since the previous method-
ology is impossible to apply to real-world data, a classification approach is used over the ALS dataset.
It seems adequate to state that if a dataset resulting from an imputation method yields better results in
the classification problem than other imputed datasets, and if the classifications are performed in exactly
the same conditions, then this imputation method must be better than the others tested for this particular
context. Following this idea, several imputations were applied to the real-world dataset and the resulting
complete datasets were classified. This chapter provides detailed information on each one of these
procedures.
4.1 Synthetic data
The main advantage of testing methods on synthetic data is that the experiments may be performed in a
controlled environment, since this data may be as well defined as needed and cleaned of other factors
that obstruct and occlude conclusions. Also, this kind of data should be designed with a clear idea of
the questions and hypotheses one wants to test, and thus with parameters defined to construct
a dataset that naturally shows some answers. As previously mentioned, we want to test whether specially
designed imputation methods for longitudinal data perform better than baseline ones when applied to
this kind of structured data; in particular, we want to test whether the developed bicluster-based imputation
method performs better. With this in mind, a couple of parameters should be tuned when designing the
synthetic dataset: the number of biclusters found in the data and the total number of missing values.
These aspects directly influence the performance of imputation methods and thus should be carefully
defined.
In [6] the authors developed a generator of synthetic data for biclustering (BiGen), which is used
in this work. BiGen creates a data matrix where biclusters are planted according to a multitude of
parameters that the user may control, of which the ones of interest are: size of the dataset, distribution
and type of data values, number of biclusters to plant, coherence type and size of biclusters, and
the total number of missing values.
Regarding the settings referring to biclusters, the coherence type is the most important, since it
defines the type of similarity that objects will share and be grouped upon. As mentioned, the
objective of applying biclustering algorithms is to group together objects that show the same
trend in time. For this, the coherence type defined is "order preserving across rows", which specifies that
objects are grouped together if the values in each object follow the same trend across columns. Although
a strictly longitudinal dataset is not designed, by defining the bicluster coherence as explained we are
able to approximate the planted biclusters to the ones we would find in a truly longitudinal dataset.
An also important aspect to consider is the strategy used to generate missing values. BiGen is able
to insert a defined percentage of missing values at random positions. To understand the influence
of the amount of missing values on the imputation process, several percentages were used. A version
of the data without missing values is also available, which acts as the ground truth in the evaluation
stage. A more complete description of the data generated is presented next, in Section 4.1.1.
After imputation, the resulting datasets are evaluated using two metrics: the number of missing values
imputed and the mean imputation error, which is the mean difference between the ground-truth values
and the imputed ones, as described next:
MeanImputationError = Σ |RealValue − PredictedValue| / totalNumberValuesPredicted (4.1)
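Equation (4.1) only averages over positions that were actually imputed (some methods leave values missing), which the following sketch makes explicit (array names are illustrative):

```python
import numpy as np

def mean_imputation_error(truth, imputed, missing_mask):
    """Eq. (4.1): mean absolute difference between ground-truth and imputed
    values, computed only over positions that were actually imputed."""
    predicted = missing_mask & ~np.isnan(imputed)   # imputed positions only
    return np.abs(truth[predicted] - imputed[predicted]).mean()
```

Positions that were missing but left unimputed (still NaN) are excluded from the average; they are accounted for separately by the first metric, the number of missing values imputed.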
4.1.1 Data description
Using the BiGen generator, differently sized matrices were generated, with sizes 1000×150,
2000 × 200 and 5000 × 200. Each generated matrix consists of integer values ranging from 0 to 20
and is planted with biclusters and missing values. The planted biclusters are set to be "order preserving
across rows" and their size is defined by a uniform distribution for both rows and columns, for which
the user defines the minimum and maximum values. The defined sizes were not extremely big, in order
to simulate what happens with the real-world data and considering the dataset size. After the biclusters
are planted, missing values were also included in each matrix in different percentages: 10%, 20%,
30% and 50%. This generates the final datasets where imputations are to be performed. In Figure 4.1
a summary and description of each dataset is presented.
Figure 4.1: Description of datasets generated by BiGen, before imputation is performed. Each dataset is here described by dataset size, percentage of missing values, bicluster parameters and number of missing values found in biclusters.

For each one of these datasets, 9 imputation strategies are applied, resulting in 9 differently imputed
datasets. The applied imputation strategies follow the procedures mentioned in Section 3.2, which
are described next.
BICmed Only missing values inside biclusters are imputed with median across rows, the remaining are
not imputed with any method. The portion of missing values imputed with this method corresponds
to the total number of missing values found in biclusters.
BICem Only missing values inside biclusters are imputed with EM. This method is able to impute every
missing value inside biclusters leaving the missing values not grouped in biclusters not imputed.
Refer to Figure 4.1 to see the corresponding number of imputations.
BICmed MED Applies median across rows inside biclusters and the remaining are imputed with median
across rows applied to the whole dataset.
BICem MED Missing values inside biclusters are imputed with EM and the remaining are imputed with
median across rows.
BICem EM Missing values inside biclusters are imputed with EM and remaining are imputed with EM
applied to the whole one-feature matrix.
MED Imputation without biclustering, applies median across rows to the whole matrix.
MEDL Applies longitudinal median to each line of the whole matrix. Note that if a whole line of the
matrix is missing, this method does not impute any value in that line.
MEDL MED First a longitudinal median is applied to each line of the matrix. Then the few lines that
were entirely missing are imputed with median across rows.
EM Imputation with EM applied to the whole matrix. All missing values are imputed.
For a visual aid on the description of each imputation, refer to Figure 4.2, where each imputation
process is described by the imputation method that it applies (in green).
Figure 4.2: Imputation strategies description. Each imputation method that is used appears in green; methods not used are shown in grey. As an example, BICem MED imputes missing values inside biclusters with the EM approach, followed by applying median imputation to the remaining missing values.
In short, each one of the 10 original datasets is imputed by 9 different approaches, originating 90
different datasets that are evaluated and compared. The results of such evaluation are presented in the
next section.
4.1.2 Evaluation results
As mentioned above, each imputation approach applied to each dataset is evaluated by two metrics:
percentage of missing values imputed and mean imputation error. In the present section the results of
this evaluation are presented.
Of the several questions one may want to answer using these results, the most interesting is
whether the bicluster-based imputation approach derives better imputed values with respect to alterna-
tive methods. To answer this, it is crucial to compare the imputation approaches that use bicluster-
based imputation on one portion of the data and an additional method on the remaining missing values
with the imputation approaches that use the same additional method to impute the whole dataset. By doing
this it is possible to directly understand whether the use of bicluster-based imputation, even on a small por-
tion of data, results in a better imputation than if this approach were not used. The specific approaches
that fall in this category are MED versus BICmed MED or BICem MED, and EM versus BICem EM.
Such a comparison may be observed in Figure 4.3, where these approaches
are compared with respect to their mean imputation error. The conclusion is straightforward: when using
bicluster-based imputation methods, whichever imputation method is used inside the bicluster, the
mean error is smaller than if no bicluster imputation is applied. Moreover, these results are consistent
across all nine synthetic datasets, independently of size and percentage of missing values, i.e., the relative
relations between mean imputation errors for the methods under analysis are maintained, confirming that
these conclusions are independent of the percentage of missing values or the dataset size.
These imputation methods are also robust to the amount of missing values in the synthetic datasets.
As can be seen from Figures 4.4 and 4.5, in general, for all data sizes, the mean imputation error barely
increases even with a dramatic increase of missing values (from 10% to 50%).
Figure 4.3: Comparison between datasets imputed with bicluster-based imputation plus median or EM on the remaining missing values and datasets imputed only with median or EM. All nine synthetic datasets present the same relative results: bicluster-based imputation enhances the imputation results.

Figure 4.4: Mean imputation error for the smallest dataset (1000×150) with differing amounts of missing values. All methods perform almost equally well for different amounts of missing values, even when this amount rises to 50%. This result is consistent across the datasets of different sizes.

Finally, it is also possible to find which of the tested imputation approaches shows the most promising
results in terms of mean imputation error. In Figure 4.6 the mean imputation error is represented, for
the smallest dataset (1000×150), for all the imputation approaches applied. As before, these results are
also consistent for all dataset sizes and amounts of missing values, thus only the results from the datasets
of size 1000×150 are shown here. Analyzing these results leads to the conclusion that the BICem MED
and BICem EM methods consistently achieve better imputations, i.e., the predicted values are closer
to the real ones. The complete results from which these analyses are performed may be consulted in
Appendix A.
4.2 Real-World data
For the real-world data, the ALS dataset, the imputation methods were indirectly evaluated through the
classification results. The classification problem, as described in Section 1.2, consists in predicting
whether or not a patient will need assisted ventilation (NIV) by the time of the sixth visit, using all
previous observations.
Figure 4.5: Mean imputation error for all the datasets with size 1000 × 150 and percentages of missing values 10% and 50%.
Figure 4.6: Mean imputation error in the smallest dataset (1000 × 150) for all methods and different amounts of missing values. For all datasets (sizes and amounts of missing values) the best method was BICem MED, followed closely by BICem EM. BICem MED: EM imputation inside each bicluster followed by median imputation across rows of the whole dataset for the rest of the missing values. BICem EM: EM imputation inside each bicluster followed by EM imputation in the whole dataset for the rest of the missing values. See the text above for the definition of all methods.
Classification is performed in a supervised way, but since the two classes, EVOL and noEVOL, are
seriously unbalanced, it was necessary to apply the Synthetic Minority Oversampling Technique (SMOTE)
(as described in Section 2.4) in order to achieve a better balance.
Concerning the classifiers, the ongoing work by Andre Carreiro [2] selects Naive Bayes, since
it is the one that yields the best results. However, the Naive Bayes (NB) implementation in the WEKA data
mining software is also known to be a classifier that works particularly well with missing values, so it is
not expected that classification results can be significantly improved by using better imputation methods.
For this reason, and in order to be able to highlight which imputations really improve the classification
process, different classifiers were also used, namely Decision Trees (DT), K-Nearest-Neighbor (KNN) and
linear Support Vector Machine (LinearSVM).
NB was applied with a kernel estimator; regarding the default method for dealing with missing values,
this NB implementation in WEKA simply omits the conditional probabilities of the features with missing
values in test instances. KNN was applied with 1 neighbor; as the default missing-value treatment, missing
values are assigned the maximum distance when comparing instances with missing values. DT was
performed with a confidence factor of 0.25 and without Laplace smoothing; this implementation simply
does not consider the values of the attributes with missing values to compute gain and entropy. LinearSVM
was performed with a complexity of 1.0; missing values are treated by imputing global means/modes.
The classification process was performed with a cross-validation setup, where each dataset (SMOTE
300%, SMOTE 500% and Original) was divided into five folds, of which four were used for training and one
for testing. These experiments, for each classifier and dataset, were repeated 10 times. Classifications
are evaluated with the TP rate, TN rate, Precision, K-statistics and F-measure. Conclusions from these
results should always be drawn from an evaluation of the several metrics. However, as the F-measure
balances the influence of each class and integrates both precision and recall into a single number, it was
given chief importance.
4.2.1 Data description
This work was built upon clinical data containing information regarding ALS patient follow-ups collected
by the Neuromuscular Unit at the Molecular Medicine Institute of Lisbon. As mentioned, this dataset is
constructed in a longitudinal fashion, where each patient is observed at several moments through time.
Although observations do not follow a strict plan, they tend to average 3 months between consecutive ob-
servations. The dataset contains demographic information, patient characteristics, neuropsychological
analysis, motor evaluations and also respiratory tests, where the NIV requirement is included. In short,
each patient evaluation consists in the observation of 34 different features. A statistical description of the
dataset is presented in Appendix B. The features may be differentiated according to their evolution through
time as static or longitudinal, the static ones being the features that stay constant over time and the
longitudinal ones the features that show some trend. Of the 34 features, 22 are longitudinal and
so are the focus of this work. In the context of the presented problem, each patient's follow-up is labeled
with Evol or noEvol according to whether an evolution in the NIV indicator exists or not. The higher the number
of follow-ups, the easier it should be to perceive and exploit trends in the data. Therefore, only patients
that presented at least five follow-ups were considered. Of these, only the patients that did not evolve
from not needing NIV to needing NIV before the fifth moment are of interest to the classification problem
at hand. Although other setups could be considered, this was the best option for a balance between the
number of resulting patients and the number of follow-ups, since more follow-ups result in fewer patients
fulfilling the needed conditions. After filtering out the uninteresting patients, the resulting dataset consists
of 159 patients observed in 34 different features at 5 different moments, which takes the form of a matrix
of size 159×170, as depicted in Figure 3.2.
The resulting dataset is quite unbalanced. It contains 31 EVOL samples and 128 noEVOL samples
that may be observed in Figure 4.7, where the percentage of patients labeled with Evol consists only in
approximately 20% of the cases.
Figure 4.7: Number of instances per class, Evol or noEvol, in the ALS dataset.
Missing Values Analysis
Approximately 40% of the values in the present dataset are missing. As illustrated in Figure 4.8, these missing values occur in approximately 80% of the features, and there is no single patient that does not present at least one missing value.
Figure 4.8: Amount of missing values per feature, per patient, and in the whole dataset. Missing values are represented in green.
These missing values are distributed unevenly between the two classes: of the total number of values belonging to class Evol, approximately 80% are missing, against 20% missing values in class noEvol. This aspect is depicted in Figure 4.9 and represents a problem for the classification.
Regarding the mechanisms of missing values, i.e., the reasons why data is missing, there is no known justification. In the longitudinal features, missing values occur in what seems to be a random fashion, without any identified pattern and without prior knowledge indicating one. Data is simply missing because it was either not observed or not annotated, and no consistent mechanism creating these missing values was found. In the static features, however, some of the missing values may be considered "false" missings, since a value that is missing at some time-point can readily be filled in with values from other time-points. The missing values that cannot be instantly filled in are those for which no value is observed at any time-point. These cases are pre-imputed with the median across rows. Appendix B presents a characterization of the number of missing values per feature and per time-point. For each time-point, features 11 to 33 are the longitudinal ones and are also the ones presenting the greatest amount of missing data.
Figure 4.9: Proportion of missing values in classes Evol and noEvol. 80% of the values in the Evol samples are missing.
4.2.2 Biclustering results
As previously mentioned, the modified version of the biclustering algorithm, e-CCC-Biclustering, was applied to the longitudinal features, which were previously transformed into one-feature matrices. The results of this procedure are presented here. The discretization described in Chapter 3.1, needed when applying this algorithm, was performed with two different numbers of symbols, 3 and 5, corresponding to the alphabets U, N, D and A, B, C, D, E. The reason for using these two settings was to understand their influence on the biclustering, imputation and classification results. It is expected that a better discretization, i.e., one in which the error made when discretizing values is smaller, should lead to better imputed values; however, the biclustering results are expected to be worse, i.e., a lower number of interesting biclusters found.
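As a sketch of the kind of discretization involved, a 3-symbol U/N/D alphabet can be obtained by labeling each transition between consecutive time-points as Up, No-change or Down. The threshold-on-differences rule below is an illustrative assumption; the exact rule is the one defined in Chapter 3.1.

```python
# Hedged sketch of a 3-symbol (U/N/D) discretization of one patient's
# time series: each transition between consecutive time-points is
# labeled Up, No-change or Down given a threshold t. The threshold rule
# is an illustrative assumption, not the exact rule of Chapter 3.1.

def discretize_row(values, t=0.5):
    symbols = []
    for prev, curr in zip(values, values[1:]):
        diff = curr - prev
        if diff > t:
            symbols.append('U')      # value went up
        elif diff < -t:
            symbols.append('D')      # value went down
        else:
            symbols.append('N')      # no relevant change
    return symbols

print(discretize_row([34, 33, 33, 30, 31]))   # → ['D', 'N', 'D', 'U']
```

A 5-symbol alphabet (A to E) refines the same idea with finer bins, which lowers the discretization error but tends to fragment the patterns that the biclustering can find.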
To understand the ability of the e-CCC-Biclustering algorithm to find and group together patients that
show the same trends, it is necessary to analyze the amount and importance of the found biclusters.
Although the trivial biclusters have already been filtered out, not all resulting biclusters are of interest for
the present problem, and thus a characterization relating the size and significance of the biclusters is
needed. For this reason, the biclusters to consider are grouped into four different categories (ALL, SIG, TP, and TPeSIG), as introduced in Chapter 3.1. The amount of biclusters in each of these categories, distributed by feature, is presented in Figure 4.10.
This analysis allows for a characterization of the biclusters found and supports the general idea that, as expected, biclusters that are both significant and have three or more time-points are scarce. Another interesting observation, obtained by comparing this analysis with the number of missing values in each feature (presented in Appendix B), is that the higher the number of missing values, the lower the number of biclusters found, for both the 3- and 5-symbol discretizations. This was expected, since missings lead to a loss of information: the higher the number of missing values, the more scattered the data becomes, increasing the difficulty of finding interesting biclusters.
Figure 4.10: Distribution of the biclusters through the bicluster categories. ALL: all biclusters after filtering out the trivial ones. SIG: significant biclusters. TP: biclusters with 3 or more time-points. TPeSIG: significant biclusters with three or more time-points.
An important aspect to analyze, since it is highly correlated with the imputation results, is the amount of missing values that are caught in biclusters. In Figure 4.11, the percentage of missing values falling in each bicluster category is presented for biclusters from both the 3- and 5-symbol discretizations.
Figure 4.11: Representation of the number of missing values belonging to each set of biclusters.
As may be observed, the total number of missings grouped in biclusters does not exceed 30%, and that is considering all the non-trivial biclusters. If a more restricted but also more interesting group of biclusters is considered, only approximately 5% of the missings are caught. Also, regarding the effect of the discretization options on the capability of finding biclusters and on their quality, one may observe from all the previous analyses that using 3 or 5 symbols does not result in a concrete difference.
4.2.3 Datasets imputation results
As presented in Chapter 3.2, several imputation methods and their combinations were applied to this data, resulting in several imputed datasets. This section describes in detail each dataset created, stating the imputation method or strategy used as well as the specific settings applied. Here, each dataset is identified by the imputation method used for its creation.
ORI The original dataset; imputation with the proposed methods is not performed, and the missings are instead left to the default missing-value treatment implemented by WEKA for each classifier. This dataset is the baseline for comparing the proposed imputation methods with the default ones.
MED The missing values were imputed with the median of all values of the same feature across all patients in the dataset. This method is capable of imputing all missings in the dataset.
MEDL Missing values were imputed with the median of the values from the same patient and feature across all time-points. This may be seen as a median taken horizontally, in contrast with the MED method, which may be seen as a median taken vertically. In the specific cases where, for a single patient, observations at all time-points of the same feature are missing, this method is incapable of predicting any value to impute. The percentage of missings imputed is about 64%.
MEDL MED This dataset is imputed with the previous imputation approach, and additionally MED is applied. This strategy allows the remaining 36% of missings to be imputed.
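The MED ("vertical") and MEDL ("horizontal") strategies above can be sketched on a one-feature matrix of shape patients x time-points, with NaN marking a missing value. This is a minimal NumPy sketch, not the exact implementation used in this work:

```python
import numpy as np

X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 4.0],
              [np.nan, np.nan, np.nan]])  # patient with no observations at all

def impute_med(X):
    # MED: median per time-point, taken vertically across all patients
    out = X.copy()
    col_median = np.nanmedian(out, axis=0)
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = col_median[cols]
    return out

def impute_medl(X):
    # MEDL: median taken horizontally across a patient's own time-points;
    # rows with no observed value at all cannot be imputed and stay missing
    out = X.copy()
    for row in out:
        if np.all(np.isnan(row)):
            continue
        row[np.isnan(row)] = np.nanmedian(row)
    return out
```

The fully missing third row is exactly the case in which MEDL fails and the MEDL MED combination falls back to MED.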
EM The EM imputation method implementation described in Chapter 3.2 is applied to the entire dataset. This approach is able to impute every missing value in the dataset.
EMfeature The same EM implementation is applied to each of the one-feature matrices, whose structure is introduced in Chapter 3.1. In the few cases where an entire row or an entire column of values is missing in these matrices, the EM imputation algorithm cannot predict any value for the corresponding feature. These cases correspond to 4% of the missings, which are left missing.
EMfeature MED In the special cases where EMfeature is not capable of predicting the values, the cross-persons-median imputation is applied, allowing a complete dataset to be obtained.
BIC3TPem This dataset is constructed by imputing missings with the bicluster-based imputation strategy using EM. The biclustering procedure is performed on a discretized version of the data with an alphabet of 3 symbols, and from the found biclusters only those with 3 or more time-points are considered. This approach is able to impute all missing values belonging to this set of biclusters; the others remain missing. The amount of missing values imputed amounts only to approximately 6% of the missing values in the original dataset.
BIC3TPem MED This dataset is the same as the previous one, with the remaining missings imputed with MED imputation.
BIC3TPem EMfeature This dataset is the version of BIC3TPem where the remaining missings are imputed with the EMfeature approach. As before, this approach is not able to take care of every missing, so that in total only 68% of the missings are imputed.
BIC3TPem EM This dataset is also a version of BIC3TPem, where the additional method to impute the remaining missings is EM. This strategy is able to impute every missing in the dataset.
BIC3ALLmed Missings are imputed with bicluster-based imputation with median, using all non-trivial biclusters found on the discretized version of the data with 3 symbols. The missings belonging to these biclusters amount to about 30% of the total missing values and are all imputed.
BIC3ALLmed MED Here the remaining missings from BIC3ALLmed are imputed with MED, which results in a complete dataset with no missing values.
BIC3SIGmed Missings are imputed with bicluster-based imputation with median, using only the significant biclusters. The biclusters are found on a discretized version of the data with 3 symbols. The number of missings imputed corresponds to the number of missings existing in the selected biclusters, about 9% of the total.
BIC3SIGmed MED The remaining missings from BIC3SIGmed are imputed with MED; all missings are imputed.
BIC3SIGTPmed The bicluster-based imputation is here performed using median as the internal imputation method and using biclusters that are both significant and have 3 or more time-points. The discretization is also performed with 3 symbols. The amount of missings this strategy is able to impute corresponds to the total number of missings found in the selected biclusters, about 5%.
BIC3SIGTPmed MED The remaining missings from BIC3SIGTPmed are imputed with MED, resulting in a complete dataset.
BIC3TPmed Biclustering-based imputation is applied with the median imputation approach; the data is discretized with an alphabet of 3 symbols, and the biclusters selected to perform the imputation are those with at least 3 time-points. The amount of missings imputed with this strategy is about 6% of the total.
BIC3TPmed MED To deal with the remaining missings from BIC3TPmed, MED is applied. This allows for a complete imputation of every missing in the dataset.
BIC3TPpattern In this dataset, the biclustering-based imputation by pattern is applied to the missings belonging to the biclusters with at least 3 time-points. Here a discretization with an alphabet of 3 symbols is used. This strategy is able to impute every missing belonging to the selected biclusters, about 6% of the total amount of missings in the original dataset.
BIC3TPpattern MED The previous dataset is additionally imputed with the MED approach to deal with the remaining missings. This results in a complete dataset.
BIC5TPem This dataset was imputed with biclustering-based imputation using the EM approach. The discretization was performed with an alphabet of 5 symbols, and the selected biclusters are those presenting at least 3 time-points. This strategy is able to impute 6% of the missings.
BIC5TPem MED The remaining missings from the BIC5TPem dataset are here imputed with the MED procedure. The resulting dataset has no missings.
BIC5TPem EMfeature The remaining missings from the BIC5TPem dataset are here imputed with the EMfeature procedure. The resulting dataset still has about 2% of missing values.
BIC5TPem EM The remaining missings from the BIC5TPem dataset are here imputed with the EM procedure. This results in a complete dataset.
BIC5ALLmed Missings were imputed using the biclustering-based imputation with median. Biclustering is here performed on a discretized version of the data with 5 symbols, and the imputation process uses all non-trivial biclusters found. The amount of missings this strategy is able to impute corresponds to the number of missings existing in the selected biclusters, about 30%.
BIC5ALLmed MED The remaining missings in the previous dataset are here imputed with MED, generating a completely imputed dataset.
BIC5SIGmed Missings are imputed with bicluster-based imputation with median, using only the significant biclusters. Here the biclusters are found on a discretized version of the data with 5 symbols. The number of missings imputed corresponds to the number of missings existing in the selected biclusters, about 9%.
BIC5SIGmed MED The remaining missings from BIC5SIGmed are imputed with MED; all missings are imputed.
BIC5SIGTPmed The bicluster-based imputation is here performed using median as the internal imputation method and using biclusters that are both significant and have 3 or more time-points. The discretization is here performed with 5 symbols. The amount of missings this strategy is able to impute corresponds to the total number of missings found in the selected biclusters, about 5%.
BIC5SIGTPmed MED The remaining missings from BIC5SIGTPmed are imputed with MED, resulting in a complete dataset.
BIC5TPmed Biclustering-based imputation is applied with the median imputation approach; the data is discretized with an alphabet of 5 symbols, and the biclusters selected to perform the imputation are those with at least 3 time-points. The amount of missings imputed with this strategy is about 6% of the total.
BIC5TPmed MED To deal with the remaining missings from BIC5TPmed, MED is applied. This allows for a complete imputation of every missing in the dataset.
BIC5TPpattern In this dataset, the biclustering-based imputation by pattern is applied to the missings belonging to the biclusters with at least 3 time-points. Here a discretization with an alphabet of 5 symbols is used. This strategy is able to impute every missing belonging to the selected biclusters, about 6% of the total amount of missings in the original dataset.
BIC5TPpattern MED The previous dataset is additionally imputed with the MED approach, in order to
deal with the remaining missings. This results in a complete dataset.
An organized visual representation of these descriptions is presented in Figure 4.12. Therein, for each dataset, the methods and settings used appear in green, together with the information of how many missings each method is able to impute.
Figure 4.12: For each dataset, the imputation approaches used (green), the order of their application, and the number of missings each method is able to impute.
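As a minimal sketch of the bicluster-based imputation with median used by the BIC*med strategies, assume, for illustration, that a missing entry inside a bicluster is replaced by the median of the observed values in the same time-point of that bicluster; the precise procedure is the one defined in Chapter 3.2.

```python
import numpy as np

def bicluster_median_impute(X, rows, cols):
    """Impute the missings that fall inside a bicluster (rows x cols)
    with the column-wise median of the bicluster's observed values."""
    out = X.copy()
    sub = out[np.ix_(rows, cols)]          # bicluster submatrix (a copy)
    col_median = np.nanmedian(sub, axis=0)
    nan_r, nan_c = np.where(np.isnan(sub))
    sub[nan_r, nan_c] = col_median[nan_c]
    out[np.ix_(rows, cols)] = sub          # write the submatrix back
    return out

# One-feature matrix (patients x time-points) and a hypothetical
# bicluster grouping patients 0-2 over time-points 0-2:
X = np.array([[1.0, 2.0, 3.0, 9.0],
              [1.0, np.nan, 3.0, 0.0],
              [1.0, 2.0, np.nan, 5.0],
              [7.0, 7.0, 7.0, 7.0]])
Y = bicluster_median_impute(X, rows=[0, 1, 2], cols=[0, 1, 2])
```

Missings outside the selected biclusters are left untouched, which is why the BIC* strategies alone impute only a small fraction of the missings and are combined with MED, EMfeature or EM.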
4.2.4 Classification results
Because of the extreme class unbalance, applying SMOTE was imperative, and only with SMOTE at 300% was it possible to obtain a balanced dataset. As is the usual procedure, SMOTE at 500% was also applied in order to obtain the inversely unbalanced dataset, i.e., one where class Evol has as many more instances than class noEvol had in the original dataset. These two procedures were applied to each dataset described above.
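The SMOTE step can be sketched as follows: each synthetic minority sample is an interpolation between a minority instance and one of its k nearest minority neighbors, and SMOTE N% creates N/100 synthetic samples per instance. This is a minimal pure-Python version of the classic algorithm, not the exact implementation used here.

```python
import random

def smote(minority, n_percent=300, k=3, rng=random.Random(0)):
    """Minimal SMOTE sketch: returns n_percent/100 synthetic samples per
    minority instance, interpolated towards a random nearest neighbor."""
    synthetic = []
    per_sample = n_percent // 100
    for x in minority:
        # k nearest neighbors among the remaining minority samples
        neighbors = sorted(
            (m for m in minority if m is not x),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbors)
            gap = rng.random()              # random point on the segment x -> nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (8.0, 8.0)]
new = smote(minority, n_percent=300)        # 3 synthetic samples per instance
```

With 31 Evol against 128 noEvol samples, SMOTE 300% brings the minority class to roughly 124 instances (approximately balanced), while SMOTE 500% brings it to 186, inverting the unbalance.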
After this stage, all the datasets described, together with those created by SMOTE, are classified as explained before. The resulting classifications are evaluated through the metrics TP rate, TN rate, Precision, Recall, K-statistics and F-measure. The extensive results of these metrics for each dataset and classifier are presented in Appendix C.
Being unable to directly evaluate the imputation methods on these data, the focus here is to understand which methods work better with which classifiers, in order to help improve the classification process addressed in other works. In Figures 4.13, 4.14, 4.15 and 4.16, the F-measure for each classifier on the balanced datasets (with SMOTE 300%) is represented. From these it is possible not only to observe which method is best for each classifier but, more importantly, to examine some aspects of the relation between the particularities of the imputation methods and the classifiers' performance. It is also important to keep in mind that the results concerning the original dataset (ORI) serve to evaluate the default handling of missings implemented by each classifier, and to compare it with the imputation methods under test. The default strategies are described in Chapter 4.2.
Figure 4.13: F-measure for all balanced datasets classified with NaiveBayes.
Figure 4.14: F-measure for all balanced datasets classified with linearSVM.
Figure 4.15: F-measure for all balanced datasets classified with K-nearest-neighbor.
Figure 4.16: F-measure for all balanced datasets classified with Decision Trees.
Using the Naive Bayes classifier, the imputations applying the median to the whole dataset (MED) and the bicluster-based imputation with median (for example, the datasets imputed with BIC3SIG and BIC5SIG) improve over WEKA's default method for dealing with missing values (ORI). Also, when using KNN, the default method for dealing with missing values in WEKA performs worse than the bicluster-based imputation approaches with median (BIC3SIG and BIC5SIG), EM applied to the whole dataset (EM), and the median applied to the whole dataset (MED). Regarding Decision Trees (DT), it is noticeable that the bicluster-based imputation procedure with EM improves the results over the EM imputation applied feature by feature. As to the linear SVM, the biclustering-based imputation methods using the by-pattern approach (for example, the datasets imputed with BIC3bypattern) and the median applied to the whole dataset (MED) improve results over the default missing-value treatment of WEKA. The bicluster-based imputation with the by-pattern approach also improves results over EM applied feature by feature (EMbyfeature) and over the median applied to the whole dataset (MED).
These conclusions are supported by the results of the Wilcoxon signed-rank test, which compares the results of two experiments and determines whether the values are significantly different. The results of the applied tests, in the form of p-values, are presented for each classifier in Tables 4.1, 4.2, 4.3 and 4.4. A p-value lower than 0.05 indicates that the mean F-score values of the two experiments are significantly different, and thus conclusions about performance may be drawn.
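This testing step can be sketched with `scipy.stats.wilcoxon`; the F-score lists below are hypothetical, whereas the real ones come from the repeated cross-validation runs.

```python
from scipy.stats import wilcoxon

# Paired F-scores of two experiments over matched cross-validation runs
# (hypothetical numbers for illustration):
fscores_ori = [0.84, 0.85, 0.83, 0.86, 0.84, 0.85, 0.83, 0.84, 0.86, 0.85]
fscores_med = [0.87, 0.88, 0.86, 0.89, 0.87, 0.88, 0.87, 0.86, 0.88, 0.87]

# Two-sided Wilcoxon signed-rank test on the paired differences
stat, p_value = wilcoxon(fscores_ori, fscores_med)
significant = p_value < 0.05    # reject "same median F-score" at the 5% level
```

The test is preferred over a paired t-test here because it does not assume normally distributed F-score differences.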
Regarding the use of 3 or 5 symbols in the discretization process, comparing the classification results of the several BIC3 and BIC5 methods suggests that these options do not introduce much difference here either. Table 4.5 presents this comparison.
Table 4.1: P-values of the Wilcoxon signed-rank test applied to the F-score results of the experiments defined in the left column, together with the mean F-score values of each experiment.

Naive Bayes
Experiment                          Mean1   Mean2   P-value
MED vs BIC3bypattern MED            0,898   0,8947  0,326751396
MED vs BIC5bypattern MED            0,898   0,8935  0,239517679
MED vs BIC3SIGTP MED                0,898   0,9007  0,567822337
ORI vs BIC3SIG                      0,872   0,8777  5,18E-01
ORI vs BIC5SIG                      0,872   0,8792  3,52E-01
ORI vs BIC3bypattern                0,872   0,8754  0,797367247
ORI vs BIC5bypattern                0,872   0,8722  0,791341672
ORI vs MED                          0,872   0,898   2,36E-03
ORI vs EM                           0,872   0,8802  0,400191719
ORI vs EMbyfeature                  0,872   0,8754  0,667968919
EM vs BIC3em EM                     0,8802  0,8654  0,000728388
EM vs BIC5em EM                     0,8802  0,8671  0,00199244
EMbyfeature vs BIC3em EMbyfeature   0,8754  0,8701  0,820018494
EMbyfeature vs BIC5em EMbyfeature   0,8754  0,8763  0,979542991
Table 4.2: P-values of the Wilcoxon signed-rank test applied to the F-score results of the experiments defined in the left column, together with the mean F-score values of each experiment.

KNN
Experiment                          Mean1   Mean2   P-value
MED vs BIC3bypattern MED            0,8613  0,8549  0,2829474
MED vs BIC5bypattern MED            0,8613  0,8552  0,484913
MED vs BIC3SIGTP MED                0,8613  0,8508  0,0438265
ORI vs BIC3SIG                      0,6696  0,7272  9,60E-06
ORI vs BIC5SIG                      0,6696  0,7262  1,94E-05
ORI vs BIC3bypattern                0,6696  0,6648  0,9484468
ORI vs BIC5bypattern                0,6696  0,6685  0,8658498
ORI vs MED                          0,6696  0,8613  7,56E-10
ORI vs EM                           0,6696  0,7059  0,0008672
ORI vs EMbyfeature                  0,6696  0,6798  0,2871624
EM vs BIC3em EM                     0,7059  0,7045  0,5722664
EM vs BIC5em EM                     0,7059  0,6988  0,7500602
EMbyfeature vs BIC3em EMbyfeature   0,6798  0,6656  0,2679884
EMbyfeature vs BIC5em EMbyfeature   0,6798  0,6858  0,8431294
Table 4.3: P-values of the Wilcoxon signed-rank test applied to the F-score results of the experiments defined in the left column, together with the mean F-score values of each experiment.

DT
Experiment                          Mean1   Mean2   P-value
MED vs BIC3bypattern MED            0,8218  0,8259  0,9881
MED vs BIC5bypattern MED            0,8218  0,8185  0,7466
MED vs BIC3SIGTP MED                0,8218  0,8203  0,7315
ORI vs BIC3SIG                      0,8072  0,8157  0,5657
ORI vs BIC5SIG                      0,8072  0,8165  0,4418
ORI vs BIC3bypattern                0,8072  0,8153  0,5115
ORI vs BIC5bypattern                0,8072  0,82    0,2012
ORI vs MED                          0,8072  0,8218  0,1809
ORI vs EM                           0,8072  0,786   0,064
ORI vs EMbyfeature                  0,8072  0,8066  0,6535
EM vs BIC3em EM                     0,786   0,795   0,6154
EM vs BIC5em EM                     0,786   0,7795  0,5464
EMbyfeature vs BIC3em EMbyfeature   0,8066  0,7836  0,088
EMbyfeature vs BIC5em EMbyfeature   0,8066  0,8321  0,0164
Table 4.4: P-values of the Wilcoxon signed-rank test applied to the F-score results of the experiments defined in the left column, together with the mean F-score values of each experiment.

linear SVM
Experiment                          Mean1   Mean2   P-value
MED vs BIC3bypattern MED            0,8727  0,8872  0,0083
MED vs BIC5bypattern MED            0,8727  0,872   0,9953
MED vs BIC3SIGTP MED                0,8727  0,8918  0,0005
ORI vs BIC3SIG                      0,841   0,8442  0,8135
ORI vs BIC5SIG                      0,841   0,8461  0,7981
ORI vs BIC3bypattern                0,841   0,8671  0,0274
ORI vs BIC5bypattern                0,841   0,8588  0,1309
ORI vs MED                          0,841   0,8727  0,0093
ORI vs EM                           0,841   0,8546  0,2249
ORI vs EMbyfeature                  0,841   0,8434  0,8853
EM vs BIC3em EM                     0,8546  0,8618  0,225
EM vs BIC5em EM                     0,8546  0,861   0,3426
EMbyfeature vs BIC3em EMbyfeature   0,8434  0,846   0,8056
EMbyfeature vs BIC5em EMbyfeature   0,8434  0,8601  0,0144
Table 4.5: F-measure results for each classifier and each bicluster-based imputation method, using 3 and 5 symbols in the discretization.

Method            Symbols     NB            KNN           Linear SVM    DT
BIC-n-TPem        3 symbols   0,868055785   0,662322251   0,865307774   0,866344002
BIC-n-TPem        5 symbols   0,868142974   0,618738883   0,854791296   0,871225781
BIC-n-ALLmed      3 symbols   0,879851286   0,676202215   0,855354761   0,881434958
BIC-n-ALLmed      5 symbols   0,874813669   0,680651013   0,850386687   0,882611443
BIC-n-SIGmed      3 symbols   0,877722672   0,727191065   0,844211513   0,886912739
BIC-n-SIGmed      5 symbols   0,879193357   0,726209807   0,846123113   0,887166384
BIC-n-TPmed       3 symbols   0,871242938   0,677981705   0,865453804   0,881113131
BIC-n-TPmed       5 symbols   0,867237839   0,634922946   0,851689888   0,882896434
BIC-n-SIGTPmed    3 symbols   0,871348082   0,684007823   0,851011579   0,868566242
BIC-n-SIGTPmed    5 symbols   0,868506938   0,69033395    0,849971234   0,877556996
BIC-n-TPpattern   3 symbols   0,875402185   0,664789098   0,867100562   0,877213641
BIC-n-TPpattern   5 symbols   0,872178449   0,668481014   0,858814724   0,863707615
Chapter 5
Conclusions and Future work
In this work, the problem of missing values in longitudinal data was studied. Synthetic datasets were generated to test the performance of imputation methods against a ground truth, and to test the influence of the amount of missing values on the imputation methods. The problem was further evaluated on a real-world dataset of ALS patients, where the purpose is to predict the evolution of the need for Non-Invasive Ventilation (NIV) for assisted breathing.
The problem of imputing missing values in longitudinal data is approached here through the use of biclustering algorithms. The application of biclustering algorithms made it possible to find trends in the data, from which better imputations can be performed. Using biclusters to impute local portions of the data shows an improvement in the quality of the imputation on the synthetic data, as well as in the performance of the classifications on the real-world data. The tested methods are robust to the number of missing values, even when this amount rises dramatically to 50% of the total data.
Regarding the ALS data, the biclustering algorithm requires discretization before being applied, and it was found that using 3 or 5 symbols in the discretization does not change the results significantly.
For the future, it is of interest to analyze how the proposed biclustering-based imputation approaches deal with different mechanisms of missing values, i.e., to compare datasets with MCAR, MAR and NMAR data. Also of interest is the application of the same methods to other real-world datasets, such as an Alzheimer's disease one, to confirm the conclusions drawn in this work.
The prediction of NIV in ALS patients based on the present dataset may be considered an extreme case, where the amount of missing values is huge and their distribution between the two classes is very unbalanced. The generally good classification results obtained here make us believe that this work is on the right track and has contributed positively to the solution of the problem at hand.
Bibliography
[1] Paul D Allison. Missing data, volume 136. Sage publications, 2001.
[2] Andre V. Carreiro, Susana Pinto, Alexandra M. Carvalho, Mamede de Carvalho, and Sara C.
Madeira. Predicting non-invasive ventilation in ALS patients using time windows. In ACM SIGKDD
Workshop on Healthcare Informatics (HI-KDD 2014), 2014.
[3] Fabrício Olivetti de França, Guilherme Palermo Coelho, and Fernando J Von Zuben. Predicting
missing values with biclustering: A coherence-based approach. Pattern Recognition, 46(5):1255–
1266, 2013.
[4] A Rogier T Donders, Geert JMG van der Heijden, Theo Stijnen, and Karel GM Moons. Review: a
gentle introduction to imputation of missing values. Journal of clinical epidemiology, 59(10):1087–
1091, 2006.
[5] Jean Mundahl Engels and Paula Diehr. Imputation of missing longitudinal data: a comparison of
methods. Journal of clinical epidemiology, 56(10):968–976, 2003.
[6] Rui Henriques, Francisco L. Ferreira, and Sara C. Madeira. Bigen: Synthetic data generator for
biclustering. Submitted for publication, available in: http://web.ist.utl.pt/rmch/software/bigen/, 2015.
[7] James Honaker and Gary King. What to do about missing values in time-series cross-section data.
American Journal of Political Science, 54(2):561–581, 2010.
[8] Roderick Little and Donald B Rubin. Statistical analysis with missing data. John Wiley & Sons,
2014.
[9] Sara C Madeira and Arlindo L Oliveira. Biclustering algorithms for biological data analysis: a survey.
Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 1(1):24–45, 2004.
[10] Sara C Madeira and Arlindo L Oliveira. A linear time biclustering algorithm for time series gene
expression data. In Algorithms in Bioinformatics, pages 39–52. Springer, 2005.
[11] Fernando Jorge Tusell Palmer and María Jesús Bárcena Ruiz. Multivariate data imputation using trees. Technical report, Universidad del País Vasco, Departamento de Economía Aplicada III (Econometría y Estadística), 2002.
[12] Trivellore E Raghunathan, James M Lepkowski, John Van Hoewyk, and Peter Solenberger. A
multivariate technique for multiply imputing missing values using a sequence of regression models.
Survey methodology, 27(1):85–96, 2001.
[13] Michael B Richman, Theodore B Trafalis, and Indra Adrianto. Missing data imputation through ma-
chine learning algorithms. In Artificial Intelligence Methods in the Environmental Sciences, pages
153–169. Springer, 2009.
[14] Vanessa Romero and Antonio Salmeron. Multivariate imputation of qualitative missing data us-
ing bayesian networks. In Soft methodology and random information systems, pages 605–612.
Springer, 2004.
[15] Donald B Rubin. Multiple imputation for nonresponse in surveys, volume 81. John Wiley & Sons,
2004.
[16] Joseph L Schafer and John W Graham. Missing data: our view of the state of the art. Psychological
methods, 7(2):147, 2002.
[17] Tapio Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance
matrices and imputation of missing values. Journal of Climate, 14(5):853–871, 2001.
[18] Matthias Templ, Alexander Kowarik, and Peter Filzmoser. Iterative stepwise regression imputation
using standard and robust methods. Computational Statistics & Data Analysis, 55(10):2793–2806,
2011.
[19] Jos Twisk and Wieke de Vente. Attrition in longitudinal studies: how to deal with missing data.
Journal of clinical epidemiology, 55(4):329–337, 2002.
Appendix A
Synthetic data results
Table A.1: Extensive results of the evaluation performed on all datasets of size 1000*150.

Dataset size: 1000*150. Mean imputation error = (real value - imputed value) / n imputations.

Missing %   Imputation mode   % of missings imputed   Mean imputation error
10%         BICmed            20.32666667             1,413742211
10%         BICem             20.32666667             0.937845142
10%         BICmed MED        100                     4,271733333
10%         BICem MED         100                     4,175002542
10%         BICem EM          100                     4,211056034
10%         MED               100                     4,983866667
10%         MEDL              100                     5,033466667
10%         MEDL MED          100                     5,033466667
10%         EM                100                     4,755692469
20%         BICmed            21,54                   1,324899412
20%         BICem             21,54                   0.933288337
20%         BICmed MED        100                     4,262833333
20%         BICem MED         100                     4,177121824
20%         BICem EM          100                     4,194946974
20%         MED               100                     5,006033333
20%         MEDL              100                     5,048933333
20%         MEDL MED          100                     5,048933333
20%         EM                100                     4,757333401
30%         BICmed            21,09111111             1,337372247
30%         BICem             21,09111111             1,017368599
30%         BICmed MED        100                     4,268144444
30%         BICem MED         100                     4,201484852
30%         BICem EM          100                     4,234740331
30%         MED               100                     5,000688889
30%         MEDL              100                     5,060088889
30%         MEDL MED          100                     5,060088889
30%         EM                100                     4,827003504
50%         BICmed            21,84133333             1,430407179
50%         BICem             21,84133333             1,269626416
50%         BICmed MED        100                     4,259079952
50%         BICem MED         100                     4,221598019
50%         BICem EM          100                     4,273330267
50%         MED               100                     4,98176
50%         MEDL              100                     5,053426667
50%         MEDL MED          100                     5,053426667
50%         EM                100                     4,854577885
Table A.2: Extensive results of the evaluation performed on all datasets of size 2000*200.

Dataset size: 2000*200. Mean imputation error = (real value - imputed value) / n imputations.

Missing %   Imputation mode   % of missings imputed   Mean imputation error
10%         BICmed            25,3175                 1,126542905
10%         BICem             25,3175                 0.643620933
10%         BICmed MED        100                     4,0512125
10%         BICem MED         100                     3,926985166
10%         BICem EM          100                     3,96411498
10%         MED               100                     4,950925
10%         MEDL              100                     5,010925
10%         MEDL MED          100                     5,010925
10%         EM                100                     4,493694918
20%         BICmed            24,39125                1,171834162
20%         BICem             24,39125                0.715224544
20%         BICmed MED        100                     4,0920625
20%         BICem MED         100                     3,979542287
20%         BICem EM          100                     4,021799629
20%         MED               100                     4,981625
20%         MEDL              100                     5,031825
20%         MEDL MED          100                     5,031825
20%         EM                100                     4,598826337
30%         BICmed            23,74666667             1,202133633
30%         BICem             23,74666667             0.755448609
30%         BICmed MED        100                     4,127716667
30%         BICem MED         100                     4,022413642
30%         BICem EM          100                     4,06061872
30%         MED               100                     4,959583333
30%         MEDL              100                     5,0123
30%         MEDL MED          100                     5,0123
30%         EM                100                     4,63213825
50%         BICmed            25,9685                 1,171063019
50%         BICem             25,9685                 0.869324979
50%         BICmed MED        100                     4,0334
50%         BICem MED         100                     3,952184797
50%         BICem EM          100                     4,036433586
50%         MED               100                     4,968145
50%         MEDL              100                     5,024245
50%         MEDL MED          100                     5,024245
50%         EM                100                     4,63234984
Table A.3: Extensive results of the evaluation performed on all datasets of size 5000*200. The mean imputation error is computed as (real value - imputed value) / n imputations.

Missing %  Imputation mode  % of missings imputed  Mean imputation error
10%  BICmed  47.245  1.16156207
10%  BICem  47.245  0.618308819
10%  BICmed MED  100  3.25001
10%  BICem MED  100  2.993413285
10%  BICem EM  100  3.028054068
10%  MED  100  4.89789
10%  MEDL  100  5.01538
10%  MEDL MED  100  5.01538
10%  EM  100  3.56259588
Appendix B
Dataset statistical description
Table B.1: Statistical description for each feature in the 1st time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 1  159  311.36  186.020  0  0.0  0  0
Gender 1  159  1.48  0.501  0  0.0  0  0
BMI 1  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 1  147  1.93  0.264  12  7.5
Ageatonset 1  159  57.67  12.476  0  0.0  1  0
Onsetform 1  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 1  159  20.159  28.3256  0  0.0  0  11
ALSFRS 1  136  34.16  4.867  23  14.5  4  0
ALSFRSR 1  127  42.14  5.111  32  20.1  4  0
ALSFRSb 1  135  10.70  2.038  24  15.1  3  0
R 1  129  11.81  0.512  30  18.9
SpO2mean 1  29  95.3159  1.31259  130  81.8  0  0
SpO2min 1  27  87.30  4.866  132  83.0  0  0
SpO290 1  29  0.4883  0.82623  130  81.8  0  6
Dips4 1  22  11.41  14.212  137  86.2  0  2
Dipsh4 1  27  2.1489  3.40944  132  83.0  0  3
Dips3 1  22  18.00  22.112  137  86.2  0  2
Dipsh3 1  27  3.7044  4.77662  132  83.0  0  3
Pattern 1  28  1.79  0.787  131  82.4  0  0
VC 1  113  99.1356  18.48219  46  28.9  4  1
FVC 1  115  99.4253  20.30843  44  27.7  4  0
MIP 1  106  63.6730  29.70809  53  33.3  0  1
MEP 1  105  79.9995  27.84390  54  34.0  0  0
P01 1  97  95.9935  36.65919  62  39.0  2  4
PO2 1  97  86.985  8.2239  62  39.0  0  0
PCO2 1  97  38.843  3.7038  62  39.0  1  3
peso 1  97  68.21  12.253  62  39.0  0  1
PhrenMeanLat 1  72  7.96014  0.896103  87  54.7  0  1
PhrenMeanAmpl 1  73  0.63336  0.225156  86  54.1  0  0
PhrenMeanArea 1  64  2.58992  1.606443  95  59.7  0  1
Table B.2: Statistical description for each feature in the 2nd time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 2  159  311.36  186.020  0  0.0  0  0
Gender 2  159  1.48  0.501  0  0.0  0  0
BMI 2  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 2  147  1.93  0.264  12  7.5
Ageatonset 2  159  57.67  12.476  0  0.0  1  0
Onsetform 2  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 2  159  20.159  28.3256  0  0.0  0  11
ALSFRS 2  144  32.85  5.062  15  9.4  4  0
ALSFRSR 2  136  40.88  5.189  23  14.5  4  0
ALSFRSb 2  144  10.41  2.518  15  9.4  5  0
R 2  136  11.82  0.560  23  14.5
SpO2mean 2  54  95.1794  1.30176  105  66.0  0  0
SpO2min 2  50  87.54  4.841  109  68.6  2  0
SpO290 2  54  0.7996  2.37402  105  66.0  0  8
Dips4 2  45  14.38  15.370  114  71.7  0  1
Dipsh4 2  51  1.6363  1.77859  108  67.9  0  2
Dips3 2  45  25.38  21.805  114  71.7  0  1
Dipsh3 2  51  2.9722  2.51773  108  67.9  0  1
Pattern 2  54  1.89  0.904  105  66.0  0  0
VC 2  58  99.6247  23.00548  101  63.5  2  1
FVC 2  61  99.8059  22.50889  98  61.6  1  1
MIP 2  58  66.1457  26.94055  101  63.5  0  2
MEP 2  58  77.5350  29.71665  101  63.5  0  1
P01 2  52  95.4683  28.13483  107  67.3  0  3
PO2 2  50  87.804  10.4409  109  68.6  0  0
PCO2 2  50  38.958  3.5280  109  68.6  0  0
peso 2  43  71.88  13.590  116  73.0  0  1
PhrenMeanLat 2  61  8.11811  0.910869  98  61.6  0  0
PhrenMeanAmpl 2  61  0.66836  0.253015  98  61.6  0  0
PhrenMeanArea 2  59  2.68703  1.543984  100  62.9  0  2
Table B.3: Statistical description for each feature in the 3rd time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 3  159  311.36  186.020  0  0.0  0  0
Gender 3  159  1.48  0.501  0  0.0  0  0
BMI 3  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 3  147  1.93  0.264  12  7.5
Ageatonset 3  159  57.67  12.476  0  0.0  1  0
Onsetform 3  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 3  159  20.159  28.3256  0  0.0  0  11
ALSFRS 3  146  31.23  5.507  13  8.2  4  0
ALSFRSR 3  136  39.38  5.562  23  14.5  4  0
ALSFRSb 3  146  10.21  2.497  13  8.2  5  0
R 3  137  11.77  0.572  22  13.8
SpO2mean 3  50  94.9854  1.48872  109  68.6  2  0
SpO2min 3  47  85.543  5.9425  112  70.4  0  0
SpO290 3  50  2.5596  8.45576  109  68.6  0  6
Dips4 3  44  11.393  11.2819  115  72.3  0  1
Dipsh4 3  46  1.4774  1.46939  113  71.1  0  1
Dips3 3  41  20.71  17.982  118  74.2  0  3
Dipsh3 3  43  2.6888  2.28933  116  73.0  0  1
Pattern 3  46  1.93  0.879  113  71.1  0  0
VC 3  69  93.4180  18.48484  90  56.6  0  1
FVC 3  71  93.8775  18.94539  88  55.3  0  0
MIP 3  67  62.9934  31.24111  92  57.9  0  0
MEP 3  67  74.3136  35.37268  92  57.9  0  2
P01 3  58  101.7045  45.27073  101  63.5  0  6
PO2 3  49  88.345  8.8202  110  69.2  0  0
PCO2 3  49  38.769  3.5044  110  69.2  0  0
peso 3  59  66.44  10.978  100  62.9  0  0
PhrenMeanLat 3  58  7.97474  0.931877  101  63.5  3  0
PhrenMeanAmpl 3  58  0.57466  0.209979  101  63.5  1  1
PhrenMeanArea 3  58  2.26060  0.808356  101  63.5  0  1
Table B.4: Statistical description for each feature in the 4th time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 4  159  311.36  186.020  0  0.0  0  0
Gender 4  159  1.48  0.501  0  0.0  0  0
BMI 4  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 4  147  1.93  0.264  12  7.5
Ageatonset 4  159  57.67  12.476  0  0.0  1  0
Onsetform 4  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 4  159  20.159  28.3256  0  0.0  0  11
ALSFRS 4  149  29.56  6.483  10  6.3  4  0
ALSFRSR 4  140  37.59  6.617  19  11.9  4  0
ALSFRSb 4  149  9.77  3.003  10  6.3  5  0
R 4  141  11.62  0.762  18  11.3  3  0
SpO2mean 4  52  95.0142  1.35310  107  67.3  1  0
SpO2min 4  49  85.949  5.5474  110  69.2  1  0
SpO290 4  52  0.9500  1.98168  107  67.3  0  4
Dips4 4  47  17.9606  27.28716  112  70.4  0  7
Dipsh4 4  50  4.0286  13.36912  109  68.6  0  8
Dips3 4  45  28.7538  36.35080  114  71.7  0  4
Dipsh3 4  46  5.4630  14.07666  113  71.1  0  7
Pattern 4  47  2.09  0.775  112  70.4  0  0
VC 4  69  92.7926  24.18406  90  56.6  1  1
FVC 4  70  93.4324  24.76286  89  56.0  3  1
MIP 4  68  62.6638  28.22980  91  57.2  0  1
MEP 4  68  69.8210  35.54146  91  57.2  0  3
P01 4  56  95.5238  37.99597  103  64.8  0  2
PO2 4  53  88.604  9.7377  106  66.7  0  0
PCO2 4  53  38.430  2.8873  106  66.7  0  0
peso 4  55  67.53  10.315  104  65.4  0  0
PhrenMeanLat 4  43  8.08884  1.225198  116  73.0  1  2
PhrenMeanAmpl 4  43  0.56709  0.266939  116  73.0  0  0
PhrenMeanArea 4  42  2.59405  2.075788  117  73.6  0  1
Table B.5: Statistical description for each feature in the 5th time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 5  159  311.36  186.020  0  0.0  0  0
Gender 5  159  1.48  0.501  0  0.0  0  0
BMI 5  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 5  147  1.93  0.264  12  7.5
Ageatonset 5  159  57.67  12.476  0  0.0  1  0
Onsetform 5  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 5  159  20.159  28.3256  0  0.0  0  11
ALSFRS 5  145  27.10  7.421  14  8.8  1  0
ALSFRSR 5  133  35.13  7.741  26  16.4  1  0
ALSFRSb 5  145  9.26  3.442  14  8.8  0  0
R 5  134  11.46  1.088  25  15.7  11  0
SpO2mean 5  48  94.6413  1.53173  111  69.8  0  1
SpO2min 5  47  83.834  8.7846  112  70.4  3  0
SpO290 5  48  1.3644  2.27144  111  69.8  0  3
Dips4 5  43  13.21  14.064  116  73.0  0  5
Dipsh4 5  46  1.7589  1.93567  113  71.1  0  5
Dips3 5  41  24.32  20.323  118  74.2  0  4
Dipsh3 5  42  3.1281  2.75294  117  73.6  0  2
Pattern 5  42  2.31  0.749  117  73.6  0  0
VC 5  70  88.9146  23.68825  89  56.0  0  0
FVC 5  72  88.8629  24.89035  87  54.7  0  0
MIP 5  68  57.8096  31.50190  91  57.2  0  1
MEP 5  68  69.3697  36.41167  91  57.2  0  2
P01 5  59  95.6593  33.96141  100  62.9  0  2
PO2 5  55  86.720  9.0579  104  65.4  0  3
PCO2 5  54  39.909  4.5258  105  66.0  0  1
peso 5  54  66.69  13.037  105  66.0  0  0
PhrenMeanLat 5  31  8.36355  0.945720  128  80.5  0  0
PhrenMeanAmpl 5  31  0.49226  0.215406  128  80.5  0  0
PhrenMeanArea 5  29  2.04276  0.839246  130  81.8  0  0
Appendix C
Real-world data classification results
Table C.1: Extensive results of the classification with Naive Bayes on all unbalanced datasets.

Naive Bayes, unbalanced original dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [19 4; 6 2]  0.759353846  0.385714  0.833288108  0.121873547  0.79131053
EM  [20 4; 6 3]  0.7784  0.418571  0.847134471  0.178652475  0.808623045
EMfeature  [20 4; 6 3]  0.777507692  0.401905  0.843564887  0.163081145  0.806343065
EMfeature MED  [20 3; 6 3]  0.775753846  0.453333  0.855688425  0.20072535  0.8113231
MED  [20 3; 6 3]  0.778861538  0.44  0.853066447  0.193586039  0.811957115
MEDL  [20 4; 6 2]  0.778  0.332381  0.828447452  0.101975689  0.799359827
MEDL MED  [20 4; 6 2]  0.766307692  0.315238  0.821871433  0.078350923  0.79044669
BIC3TPem  [20 3; 5 3]  0.791538462  0.485238  0.865021719  0.247713047  0.824749956
BIC3TPem MED  [20 4; 6 3]  0.781476923  0.421905  0.849028411  0.183107267  0.811903642
BIC3em EMfeature  [20 3; 6 3]  0.763230769  0.438095  0.849759509  0.172790245  0.80239782
BIC3em EM  [20 3; 5 3]  0.7876  0.494286  0.86611893  0.249982414  0.822749604
BIC3ALLmed  [20 4; 6 3]  0.782338462  0.414762  0.848283225  0.172507451  0.811516256
BIC3ALLmed MED  [20 3; 5 3]  0.790923077  0.451429  0.858227435  0.214861181  0.82041486
BIC3SIGmed  [20 4; 6 2]  0.767353846  0.389524  0.838987802  0.142603997  0.798778411
BIC3SIGmed MED  [20 3; 6 3]  0.774984615  0.441429  0.852493859  0.190561951  0.809551681
BIC3SIGTPmed  [20 3; 6 3]  0.7792  0.494762  0.866190854  0.23513914  0.817816327
BIC3SIGTPmed MED  [20 3; 5 3]  0.786276923  0.510476  0.870143142  0.261388026  0.823279476
BIC3TPmed  [20 3; 6 3]  0.785476923  0.491905  0.865679138  0.243872027  0.820919194
BIC3TPmed MED  [20 3; 6 3]  0.783230769  0.496667  0.866378262  0.243991787  0.819718719
BIC3TPpattern  [19 3; 6 3]  0.7568  0.472381  0.85786653  0.193203588  0.8008173
BIC3TPpattern MED  [20 4; 6 3]  0.767353846  0.414762  0.845138465  0.158012042  0.801367184
BIC5TPem  [20 3; 6 3]  0.780153846  0.437619  0.851686676  0.195913526  0.812093062
BIC5TPem MED  [20 4; 6 3]  0.778923077  0.401905  0.84513878  0.159470076  0.807489584
BIC5em EMfeature  [20 4; 6 2]  0.773323077  0.402381  0.842200217  0.15914093  0.803556164
BIC5em EM  [20 3; 6 3]  0.779476923  0.454286  0.855175839  0.20996597  0.813330358
BIC5ALLmed  [20 3; 6 3]  0.765876923  0.443333  0.852300287  0.177778462  0.803849208
BIC5ALLmed MED  [20 3; 6 3]  0.778215385  0.446667  0.854800471  0.198473262  0.812108513
BIC5SIGmed  [20 4; 6 2]  0.765753846  0.389524  0.838745483  0.140759305  0.797757701
BIC5SIGmed MED  [20 3; 6 3]  0.774984615  0.441429  0.852493859  0.190561951  0.809551681
BIC5SIGTPmed  [20 4; 6 3]  0.772984615  0.418095  0.84755954  0.166200792  0.806229261
BIC5SIGTPmed MED  [20 3; 6 3]  0.778215385  0.485714  0.863367209  0.228104376  0.815926962
BIC5TPmed  [20 4; 6 3]  0.771938462  0.422381  0.848651232  0.165459457  0.805614636
BIC5TPmed MED  [20 4; 6 3]  0.782892308  0.424286  0.850565969  0.180799328  0.812390009
BIC5TPpattern  [19 3; 6 3]  0.760123077  0.46381  0.855827601  0.190347297  0.802258994
BIC5TPpattern MED  [20 4; 6 2]  0.774369231  0.389048  0.841284676  0.140319173  0.803480379
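All measures reported in the classification tables can be derived from a 2x2 confusion matrix laid out as [TP FN; FP TN]. A minimal sketch of those derivations (the printed values are cross-validation averages, so recomputing them from the rounded matrices will not match exactly; the function name is illustrative):

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Derive TP rate, TN rate, precision, Cohen's kappa and F-measure
    from a 2x2 confusion matrix [tp fn; fp tn]."""
    tp_rate = tp / (tp + fn)             # sensitivity / recall
    tn_rate = tn / (tn + fp)             # specificity
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    total = tp + fn + fp + tn
    p_obs = (tp + tn) / total            # observed agreement
    # chance agreement: product of marginals for each class
    p_exp = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total**2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return tp_rate, tn_rate, precision, kappa, f_measure

print(metrics_from_confusion(5, 0, 0, 5))  # perfect classifier -> all 1.0
```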
Table C.2: Extensive results of the classification with Naive Bayes on all balanced (SMOTE300) datasets.

Naive Bayes, SMOTE300-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [23 5; 2 21]  0.914307692  0.818123  0.836446874  0.731892386  0.871985167
EM  [23 4; 3 21]  0.902646154  0.8466  0.861431688  0.749686365  0.880152473
EMfeature  [23 4; 3 21]  0.900215385  0.8385  0.854574477  0.739254342  0.875392201
EMfeature MED  [24 3; 2 22]  0.9392  0.867633  0.883367705  0.807542612  0.908907953
MED  [24 3; 2 22]  0.918030769  0.870133  0.88229149  0.788600741  0.897995435
MEDL  [22 3; 4 22]  0.853323077  0.87  0.874146101  0.723012158  0.861221629
MEDL MED  [22 3; 4 22]  0.842246154  0.870867  0.872792748  0.712804007  0.855108815
BIC3TPem  [23 4; 3 21]  0.882953846  0.845033  0.858294561  0.728277207  0.868055785
BIC3TPem MED  [23 3; 2 22]  0.915661538  0.8684  0.881014196  0.784570584  0.896438741
BIC3em EMfeature  [22 3; 3 21]  0.875076923  0.8619  0.870042716  0.737054533  0.870070106
BIC3em EM  [22 4; 3 21]  0.8784  0.844167  0.857525114  0.722727737  0.865437676
BIC3ALLmed  [23 4; 2 21]  0.903415385  0.845833  0.861502901  0.749742041  0.879851286
BIC3ALLmed MED  [24 3; 2 21]  0.929692308  0.866  0.880949738  0.796343133  0.902972339
BIC3SIGmed  [23 4; 3 21]  0.897907692  0.8475  0.861657499  0.745752559  0.877722672
BIC3SIGmed MED  [24 3; 2 22]  0.922061538  0.877333  0.888499647  0.799744223  0.903244594
BIC3SIGTPmed  [23 4; 3 21]  0.888523077  0.844267  0.858145675  0.733049567  0.871348082
BIC3SIGTPmed MED  [24 3; 2 22]  0.921138462  0.8717  0.884005297  0.79330533  0.900673783
BIC3TPmed  [23 4; 3 21]  0.892461538  0.838633  0.854922227  0.731441137  0.871242938
BIC3TPmed MED  [24 3; 2 22]  0.922707692  0.880533  0.891523541  0.803668993  0.905355394
BIC3TPpattern  [23 4; 3 21]  0.886153846  0.858  0.868622636  0.74427619  0.875402185
BIC3TPpattern MED  [23 3; 2 22]  0.910215385  0.871633  0.882978509  0.782244336  0.894693771
BIC5TPem  [23 4; 3 21]  0.886923077  0.839367  0.854388226  0.726695495  0.868142974
BIC5TPem MED  [24 3; 2 22]  0.9196  0.870067  0.882364753  0.790163404  0.899099445
BIC5em EMfeature  [22 3; 3 22]  0.870676923  0.8821  0.886254909  0.752414683  0.876312116
BIC5em EM  [23 4; 3 21]  0.879169231  0.847367  0.859851537  0.726798081  0.867064537
BIC5ALLmed  [23 4; 3 21]  0.901938462  0.8353  0.852583804  0.73776577  0.874813669
BIC5ALLmed MED  [24 3; 2 22]  0.9188  0.882167  0.891817929  0.801346337  0.903790384
BIC5SIGmed  [23 4; 3 21]  0.901015385  0.846633  0.861820271  0.748069377  0.879193357
BIC5SIGmed MED  [24 3; 2 22]  0.9212  0.879767  0.889970919  0.801327231  0.903676155
BIC5SIGTPmed  [23 4; 3 21]  0.893292308  0.831367  0.848652865  0.725125495  0.868506938
BIC5SIGTPmed MED  [24 3; 2 22]  0.931292308  0.868433  0.882658748  0.800429609  0.904640953
BIC5TPmed  [23 4; 3 21]  0.889384615  0.833767  0.849987446  0.723515544  0.867237839
BIC5TPmed MED  [24 3; 2 22]  0.921230769  0.880633  0.89153427  0.802166326  0.904502317
BIC5TPpattern  [23 4; 3 21]  0.888430769  0.846633  0.859770569  0.735423553  0.872178449
BIC5TPpattern MED  [23 3; 2 21]  0.914153846  0.863567  0.876956201  0.77824718  0.893515351
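SMOTE300 and SMOTE500 denote datasets where the minority class was oversampled with SMOTE before training: synthetic minority samples are generated by interpolating each seed sample with one of its k nearest minority-class neighbours. A minimal sketch of that interpolation step (assuming a purely numeric feature matrix; the thesis presumably relied on an existing implementation, and all names below are illustrative):

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from the minority-class matrix
    by interpolating seeds with one of their k nearest neighbours."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distances from the seed to every minority sample
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the seed itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        out.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(out)
```

Each synthetic point lies on the segment between two real minority samples, so oversampling by 300% simply means calling this with n_new equal to three times the minority-class size.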
Table C.3: Extensive results of the classification with Naive Bayes on all SMOTE500 datasets.

Naive Bayes, SMOTE500-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [24 5; 1 33]  0.941876923  0.869784  0.830195536  0.793282822  0.881115378
EM  [24 4; 2 33]  0.938923077  0.880597  0.846193662  0.805633871  0.889158553
EMfeature  [24 5; 2 32]  0.939753846  0.867681  0.832812199  0.791423491  0.881874423
EMfeature MED  [24 3; 1 34]  0.947569231  0.919915  0.893456926  0.859217694  0.918507216
MED  [24 3; 1 34]  0.955415385  0.909673  0.882049026  0.853889165  0.916137547
MEDL  [23 4; 3 33]  0.885815385  0.895704  0.856483239  0.776978205  0.869204047
MEDL MED  [23 3; 3 35]  0.883415385  0.929602  0.899019522  0.814649553  0.889073057
BIC3TPem  [24 5; 2 32]  0.929723077  0.867738  0.830314962  0.782529648  0.876159189
BIC3TPem MED  [24 4; 2 33]  0.926461538  0.896174  0.862847576  0.813118861  0.892187433
BIC3em EMfeature  [23 4; 3 33]  0.896061538  0.899474  0.861903525  0.790257853  0.877187157
BIC3em EM  [24 5; 2 33]  0.924923077  0.874154  0.836483493  0.785819185  0.877349176
BIC3ALLmed  [25 4; 1 34]  0.958646154  0.901081  0.872355763  0.846481713  0.912316508
BIC3ALLmed MED  [24 5; 1 33]  0.942861538  0.874154  0.839092332  0.801469725  0.887211563
BIC3SIGmed  [25 4; 1 33]  0.959353846  0.896714  0.867251384  0.842068566  0.910052679
BIC3SIGmed MED  [24 5; 2 32]  0.9296  0.870953  0.83429986  0.786347461  0.878282065
BIC3SIGTPmed  [24 5; 2 32]  0.934430769  0.87256  0.836405644  0.792234939  0.881711101
BIC3SIGTPmed MED  [24 4; 1 33]  0.948276923  0.890327  0.858639045  0.824991222  0.899992947
BIC3TPmed  [24 5; 2 32]  0.935138462  0.871465  0.835527879  0.791584228  0.88135531
BIC3TPmed MED  [24 4; 1 33]  0.941969231  0.894595  0.86337775  0.824707427  0.899502151
BIC3TPpattern  [24 5; 2 33]  0.932061538  0.877383  0.841566438  0.795879096  0.883505602
BIC3TPpattern MED  [24 4; 2 33]  0.931169231  0.898321  0.865677256  0.819603048  0.89587354
BIC5TPem  [24 5; 2 32]  0.9304  0.869872  0.832913843  0.785776529  0.878054501
BIC5TPem MED  [24 4; 2 33]  0.933353846  0.893997  0.861085893  0.816612053  0.894559656
BIC5em EMfeature  [23 3; 2 34]  0.909323077  0.906927  0.87271515  0.810503314  0.88894686
BIC5em EM  [24 5; 2 33]  0.925753846  0.878478  0.841311434  0.791632947  0.880683042
BIC5ALLmed  [24 4; 1 33]  0.949138462  0.879545  0.846057786  0.813078374  0.893774081
BIC5ALLmed MED  [25 4; 1 33]  0.956861538  0.897297  0.868571972  0.840709757  0.909138282
BIC5SIGmed  [24 5; 2 32]  0.934246154  0.870953  0.834845169  0.790373415  0.880791668
BIC5SIGmed MED  [25 4; 1 33]  0.962430769  0.897781  0.869073038  0.845984831  0.912320736
BIC5SIGTPmed  [24 5; 2 32]  0.926615385  0.868791  0.831083549  0.781132933  0.875226818
BIC5SIGTPmed MED  [24 4; 1 33]  0.942030769  0.899417  0.868502425  0.830309265  0.902437027
BIC5TPmed  [24 5; 2 32]  0.924184615  0.86825  0.829931596  0.778420558  0.873544055
BIC5TPmed MED  [24 4; 2 34]  0.933446154  0.90266  0.871195343  0.826726604  0.899926117
BIC5TPpattern  [24 5; 2 33]  0.929661538  0.876856  0.840461274  0.793129002  0.881751431
BIC5TPpattern MED  [24 4; 2 33]  0.933415385  0.899417  0.867781199  0.822891995  0.897795054
Table C.4: Extensive results of the classification with linear SVM on all unbalanced datasets.

Linear SVM, unbalanced original dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [23 4; 2 3]  0.917015385  0.403333333  0.860196879  0.346695225  0.886443268
EM  [20 4; 6 3]  0.7784  0.418571429  0.847134471  0.178652475  0.808623045
EMfeature  [24 4; 2 2]  0.928184615  0.384285714  0.863791694  0.347986762  0.893576999
EMfeature MED  [24 4; 2 2]  0.927261538  0.371428571  0.861194134  0.334454912  0.891368236
MED  [24 4; 2 2]  0.931876923  0.365714286  0.860454795  0.334695945  0.893452577
MEDL  [24 4; 2 2]  0.935138462  0.315238095  0.851640713  0.285036116  0.890575929
MEDL MED  [24 4; 2 2]  0.932861538  0.312380952  0.850709229  0.278412325  0.888993917
BIC3TPem  [20 3; 5 3]  0.791538462  0.485238095  0.865021719  0.247713047  0.824749956
BIC3TPem MED  [24 4; 1 2]  0.948461538  0.37952381  0.865234056  0.3716016  0.904034095
BIC3em EMfeature  [24 4; 2 2]  0.923292308  0.34952381  0.855378602  0.308040467  0.887289353
BIC3em EM  [20 3; 5 3]  0.7876  0.494285714  0.86611893  0.249982414  0.822749604
BIC3ALLmed  [24 4; 2 2]  0.918707692  0.348095238  0.855131494  0.295742343  0.884835715
BIC3ALLmed MED  [23 4; 3 2]  0.899938462  0.292857143  0.841632223  0.214320666  0.86845477
BIC3SIGmed  [24 4; 2 2]  0.929692308  0.358571429  0.859035767  0.32351958  0.891648415
BIC3SIGmed MED  [24 4; 2 2]  0.921784615  0.36  0.857945521  0.31025777  0.887307088
BIC3SIGTPmed  [24 4; 2 2]  0.929723077  0.38952381  0.864183937  0.357997981  0.894783568
BIC3SIGTPmed MED  [24 4; 2 2]  0.935015385  0.375238095  0.862885484  0.346634613  0.896320977
BIC3TPmed  [24 4; 1 3]  0.942215385  0.41047619  0.870495755  0.397986319  0.903742755
BIC3TPmed MED  [24 4; 2 2]  0.923292308  0.349047619  0.856223652  0.301277235  0.887439836
BIC3TPpattern  [24 4; 2 3]  0.926676923  0.414285714  0.869266661  0.371775197  0.895933742
BIC3TPpattern MED  [24 4; 2 2]  0.927384615  0.367142857  0.860248063  0.323755351  0.891068823
BIC5TPem  [20 3; 6 3]  0.780153846  0.437619048  0.851686676  0.195913526  0.812093062
BIC5TPem MED  [24 4; 2 2]  0.933476923  0.341428571  0.855428003  0.312710432  0.89188189
BIC5em EMfeature  [24 4; 2 2]  0.926369231  0.355238095  0.857981484  0.310275592  0.889943742
BIC5em EM  [20 3; 6 3]  0.779476923  0.454285714  0.855175839  0.20996597  0.813330358
BIC5ALLmed  [23 4; 2 2]  0.912492308  0.334761905  0.851583144  0.275843321  0.879928852
BIC5ALLmed MED  [23 4; 3 2]  0.895323077  0.29  0.840150385  0.204860813  0.865857599
BIC5SIGmed  [24 4; 2 2]  0.925815385  0.349047619  0.856500556  0.307031179  0.888658568
BIC5SIGmed MED  [24 4; 2 2]  0.921076923  0.350952381  0.855639617  0.302596828  0.885922665
BIC5SIGTPmed  [24 4; 2 2]  0.936738462  0.398095238  0.867319103  0.374854403  0.8997085
BIC5SIGTPmed MED  [24 4; 2 2]  0.9296  0.384761905  0.864195614  0.345708237  0.89441422
BIC5TPmed  [24 4; 2 2]  0.929630769  0.37  0.860190039  0.338705783  0.892707922
BIC5TPmed MED  [23 4; 2 2]  0.912461538  0.312380952  0.847846261  0.248771504  0.87743495
BIC5TPpattern  [24 4; 2 2]  0.930523077  0.38047619  0.862939724  0.345030257  0.894345072
BIC5TPpattern MED  [24 4; 2 2]  0.931261538  0.350952381  0.857634513  0.313537077  0.891589528
Table C.5: Extensive results of the classification with linear SVM on all balanced datasets (SMOTE300).

Linear SVM, SMOTE300-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [21 4; 4 22]  0.831846154  0.858092308  0.856905389  0.689507624  0.841040025
EM  [22 4; 4 21]  0.851353846  0.857033333  0.862688215  0.708340679  0.854554406
EMfeature  [21 4; 4 21]  0.831015385  0.8586  0.862274915  0.689378901  0.843389121
EMfeature MED  [22 3; 3 22]  0.871661538  0.866833333  0.871661538  0.738485021  0.870840314
MED  [22 3; 3 22]  0.868646154  0.8748  0.880466579  0.743408851  0.872709502
MEDL  [22 4; 4 21]  0.847630769  0.859  0.863553476  0.706230731  0.853140047
MEDL MED  [23 4; 3 21]  0.889169231  0.852566667  0.863431063  0.741823922  0.874012375
BIC3TPem  [22 4; 3 21]  0.877076923  0.845866667  0.857353804  0.723330192  0.865307774
BIC3TPem MED  [23 4; 2 21]  0.906123077  0.8563  0.868807092  0.762966132  0.885779373
BIC3em EMfeature  [22 4; 4 21]  0.853169231  0.832166667  0.844099757  0.685336466  0.846023485
BIC3em EM  [22 4; 3 21]  0.874861538  0.8411  0.853027742  0.716286892  0.861780233
BIC3ALLmed  [22 4; 4 21]  0.851569231  0.8571  0.864308855  0.708476341  0.855354761
BIC3ALLmed MED  [22 4; 4 21]  0.859230769  0.851633333  0.859230769  0.710752613  0.857625484
BIC3SIGmed  [21 4; 4 21]  0.838153846  0.8497  0.856151309  0.687781498  0.844211513
BIC3SIGmed MED  [23 4; 3 21]  0.882523077  0.8498  0.861287642  0.732827886  0.869463099
BIC3SIGTPmed  [22 4; 4 21]  0.845815385  0.855433333  0.861447809  0.701107476  0.851011579
BIC3SIGTPmed MED  [23 3; 2 22]  0.902769231  0.8758  0.88441281  0.778916018  0.891843053
BIC3TPmed  [22 3; 3 21]  0.865415385  0.862866667  0.869844169  0.72822778  0.865453804
BIC3TPmed MED  [23 3; 3 22]  0.901384615  0.8709  0.879777013  0.772551274  0.888498803
BIC3TPpattern  [22 4; 3 21]  0.873261538  0.8548  0.863835821  0.728234383  0.867100562
BIC3TPpattern MED  [23 3; 3 22]  0.895661538  0.8733  0.88184491  0.769334896  0.887224954
BIC5TPem  [22 4; 4 21]  0.853692308  0.853066667  0.86104419  0.706723947  0.854791296
BIC5TPem MED  [22 4; 3 21]  0.877230769  0.853166667  0.862624068  0.730614773  0.868053319
BIC5em EMfeature  [22 4; 4 21]  0.861046154  0.855733333  0.863883921  0.716497772  0.860083126
BIC5em EM  [22 4; 3 21]  0.868738462  0.845933333  0.857620259  0.714740158  0.860976359
BIC5ALLmed  [22 4; 4 21]  0.847538462  0.8522  0.858755073  0.699679466  0.850386687
BIC5ALLmed MED  [22 4; 4 21]  0.858369231  0.855533333  0.862732691  0.71386559  0.858480697
BIC5SIGmed  [22 4; 4 21]  0.839661538  0.852166667  0.858286683  0.691787188  0.846123113
BIC5SIGmed MED  [22 3; 3 22]  0.875661538  0.866766667  0.875131374  0.742536748  0.873144762
BIC5SIGTPmed  [22 4; 4 21]  0.845107692  0.853866667  0.860379002  0.698807204  0.849971234
BIC5SIGTPmed MED  [23 3; 3 22]  0.898861538  0.8693  0.879184491  0.768523226  0.886951535
BIC5TPmed  [22 3; 4 21]  0.842923077  0.861266667  0.866857275  0.703811655  0.851689888
BIC5TPmed MED  [23 4; 3 21]  0.889815385  0.857166667  0.867342594  0.747251451  0.876490983
BIC5TPpattern  [22 4; 3 21]  0.867046154  0.843533333  0.854006981  0.710700158  0.858814724
BIC5TPpattern MED  [23 4; 3 21]  0.883261538  0.853866667  0.863674358  0.737546115  0.871973226
Table C.6: Extensive results of the classification with linear SVM on all datasets with SMOTE500.

Linear SVM, SMOTE500-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [21 3; 5 36]  0.815846154  0.926072874  0.885020751  0.750654403  0.845281592
EM  [22 3; 4 34]  0.846953846  0.919871977  0.882984838  0.771036284  0.862279903
EMfeature  [21 3; 4 34]  0.830492308  0.926358464  0.889423018  0.763893059  0.856412721
EMfeature MED  [22 3; 4 35]  0.856953846  0.931621622  0.898548363  0.793851281  0.875467306
MED  [23 3; 3 35]  0.879046154  0.929004267  0.896317591  0.809822459  0.886046445
MEDL  [21 4; 5 33]  0.800830769  0.898975818  0.848762672  0.705475589  0.822045514
MEDL MED  [22 3; 4 34]  0.861015385  0.912403983  0.874492599  0.774639172  0.865593654
BIC3TPem  [21 3; 4 34]  0.829692308  0.92685633  0.89065033  0.763945543  0.856366861
BIC3TPem MED  [22 3; 3 34]  0.873415385  0.909644381  0.873412602  0.782591251  0.871279864
BIC3em EMfeature  [21 4; 4 33]  0.830246154  0.885576102  0.837536425  0.716106723  0.831139838
BIC3em EM  [22 3; 4 35]  0.858646154  0.929530583  0.896758395  0.792836428  0.875256445
BIC3ALLmed  [22 3; 3 34]  0.868  0.916714083  0.880200455  0.785760317  0.872164247
BIC3ALLmed MED  [21 3; 4 34]  0.835784615  0.925206259  0.888653803  0.767388527  0.858819051
BIC3SIGmed  [22 3; 3 34]  0.875169231  0.910199147  0.873699687  0.784523743  0.872357104
BIC3SIGmed MED  [21 3; 4 34]  0.831169231  0.919886202  0.880060662  0.756944458  0.852772388
BIC3SIGTPmed  [22 2; 4 35]  0.843169231  0.94029872  0.910259007  0.791838875  0.873153186
BIC3SIGTPmed MED  [23 3; 3 35]  0.891538462  0.92799431  0.898367197  0.820014971  0.893212296
BIC3TPmed  [21 2; 4 35]  0.837415385  0.936514936  0.904617087  0.782270998  0.867224299
BIC3TPmed MED  [23 3; 3 34]  0.888246154  0.909658606  0.875069594  0.795317927  0.879145177
BIC3TPpattern  [21 3; 4 35]  0.831261538  0.928990043  0.894267732  0.767910003  0.858729663
BIC3TPpattern MED  [23 3; 3 35]  0.885138462  0.93056899  0.901961205  0.817503594  0.891211
BIC5TPem  [21 3; 5 34]  0.820430769  0.924694168  0.88573759  0.753067938  0.849529591
BIC5TPem MED  [22 3; 4 34]  0.859446154  0.909089616  0.872297509  0.769580124  0.862924365
BIC5em EMfeature  [22 4; 4 34]  0.848676923  0.905376956  0.863855153  0.75509628  0.853357166
BIC5em EM  [21 3; 4 34]  0.838461538  0.924153627  0.886931295  0.768584361  0.86016153
BIC5ALLmed  [21 3; 4 34]  0.830430769  0.923584637  0.886111201  0.760798464  0.854856678
BIC5ALLmed MED  [21 3; 4 34]  0.838061538  0.921465149  0.88285645  0.765089456  0.857949821
BIC5SIGmed  [21 3; 4 34]  0.833538462  0.923086771  0.884413169  0.762880109  0.856291057
BIC5SIGmed MED  [22 3; 3 34]  0.876738462  0.916088193  0.88119076  0.792896453  0.877073945
BIC5SIGTPmed  [21 2; 4 35]  0.834461538  0.939758179  0.908737083  0.783538586  0.86776736
BIC5SIGTPmed MED  [23 3; 3 34]  0.898523077  0.919857752  0.888636957  0.816405359  0.891625283
BIC5TPmed  [21 3; 4 35]  0.826615385  0.932190612  0.897921647  0.767430875  0.857810402
BIC5TPmed MED  [23 3; 3 34]  0.888153846  0.913485064  0.879714325  0.799852575  0.881747808
BIC5TPpattern  [22 3; 4 34]  0.840676923  0.926315789  0.890984644  0.773169855  0.86278146
BIC5TPpattern MED  [23 3; 3 34]  0.886769231  0.91230441  0.8781807  0.797389335  0.88050436
Table C.7: Extensive results of the classification with Decision Trees on all unbalanced datasets.

Decision Trees, unbalanced original dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6; 1 0]  0.972123077  0.045714  0.801965656  0.021911594  0.878323212
EM  [24 6; 2 1]  0.941569231  0.110952  0.8150202  0.05575752  0.870902911
EMfeature  [25 6; 1 0]  0.968892308  0.043333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4; 4 3]  0.847846154  0.410476  0.856753506  0.256414358  0.849040243
MED  [22 4; 4 3]  0.848646154  0.410476  0.856911608  0.257259789  0.849554427
MEDL  [22 4; 3 2]  0.871723077  0.290952  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4; 4 2]  0.845630769  0.285714  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5; 2 1]  0.923569231  0.147143  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4; 4 2]  0.827723077  0.360476  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5; 3 1]  0.879169231  0.210952  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5; 3 1]  0.867846154  0.237619  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [24 4; 2 2]  0.918738462  0.318571  0.849598354  0.258671563  0.881434958
BIC3ALLmed MED  [22 4; 4 2]  0.844184615  0.352381  0.845410829  0.192827084  0.842152791
BIC3SIGmed  [25 6; 1 0]  0.979723077  0.056667  0.811166823  0.044596699  0.886912739
BIC3SIGmed MED  [22 4; 4 2]  0.858769231  0.390476  0.854423115  0.251034768  0.854297461
BIC3SIGTPmed  [24 6; 2 1]  0.936153846  0.106667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4; 4 3]  0.836953846  0.406667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5; 2 2]  0.931107692  0.255714  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4; 4 2]  0.837938462  0.36381  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5; 1 1]  0.945292308  0.139524  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4; 4 2]  0.850553846  0.325238  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6; 1 1]  0.942984615  0.097143  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4; 4 2]  0.845384615  0.337143  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4; 3 2]  0.878892308  0.300952  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5; 3 1]  0.899938462  0.13381  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4; 2 2]  0.923538462  0.310952  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4; 4 2]  0.848184615  0.37381  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6; 1 0]  0.975846154  0.077143  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4; 4 2]  0.856461538  0.397143  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6; 1 1]  0.953015385  0.102381  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4; 4 2]  0.845630769  0.385714  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5; 1 1]  0.953876923  0.152857  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4; 4 2]  0.8332  0.328095  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5; 2 1]  0.920338462  0.135238  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4; 4 2]  0.839476923  0.330476  0.839110003  0.172724697  0.837167678
Table C.8: Extensive results of the classification with Decision Trees on all balanced datasets (SMOTE300).

Decision Trees, SMOTE300-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6; 1 0]  0.972123077  0.045714  0.801965656  0.021911594  0.878323212
EM  [24 6; 2 1]  0.941569231  0.110952  0.8150202  0.05575752  0.870902911
EMfeature  [25 6; 1 0]  0.968892308  0.043333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4; 4 3]  0.847846154  0.410476  0.856753506  0.256414358  0.849040243
MED  [22 4; 4 3]  0.848646154  0.410476  0.856911608  0.257259789  0.849554427
MEDL  [22 4; 3 2]  0.871723077  0.290952  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4; 4 2]  0.845630769  0.285714  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5; 2 1]  0.923569231  0.147143  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4; 4 2]  0.827723077  0.360476  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5; 3 1]  0.879169231  0.210952  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5; 3 1]  0.867846154  0.237619  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [24 4; 2 2]  0.918738462  0.318571  0.849598354  0.258671563  0.881434958
BIC3ALLmed MED  [22 4; 4 2]  0.844184615  0.352381  0.845410829  0.192827084  0.842152791
BIC3SIGmed  [25 6; 1 0]  0.979723077  0.056667  0.811166823  0.044596699  0.886912739
BIC3SIGmed MED  [22 4; 4 2]  0.858769231  0.390476  0.854423115  0.251034768  0.854297461
BIC3SIGTPmed  [24 6; 2 1]  0.936153846  0.106667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4; 4 3]  0.836953846  0.406667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5; 2 2]  0.931107692  0.255714  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4; 4 2]  0.837938462  0.36381  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5; 1 1]  0.945292308  0.139524  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4; 4 2]  0.850553846  0.325238  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6; 1 1]  0.942984615  0.097143  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4; 4 2]  0.845384615  0.337143  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4; 3 2]  0.878892308  0.300952  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5; 3 1]  0.899938462  0.13381  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4; 2 2]  0.923538462  0.310952  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4; 4 2]  0.848184615  0.37381  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6; 1 0]  0.975846154  0.077143  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4; 4 2]  0.856461538  0.397143  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6; 1 1]  0.953015385  0.102381  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4; 4 2]  0.845630769  0.385714  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5; 1 1]  0.953876923  0.152857  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4; 4 2]  0.8332  0.328095  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5; 2 1]  0.920338462  0.135238  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4; 4 2]  0.839476923  0.330476  0.839110003  0.172724697  0.837167678
Table C.9: Extensive results of the classification with Decision Trees on all datasets with SMOTE500.Decision Trees
SMOTE500Confusion matrices’ Tprate Tnrate Precision Kappa statistic F measure
ORI  [25 6;1 0]  0.972123077  0.045714  0.801965656  0.021911594  0.878323212
EM  [24 6;2 1]  0.941569231  0.110952  0.8150202  0.05575752  0.870902911
EMfeature  [25 6;1 0]  0.968892308  0.043333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4;4 3]  0.847846154  0.410476  0.856753506  0.256414358  0.849040243
MED  [22 4;4 3]  0.848646154  0.410476  0.856911608  0.257259789  0.849554427
MEDL  [22 4;3 2]  0.871723077  0.290952  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4;4 2]  0.845630769  0.285714  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5;2 1]  0.923569231  0.147143  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4;4 2]  0.827723077  0.360476  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5;3 1]  0.879169231  0.210952  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5;3 1]  0.867846154  0.237619  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [22 4;4 2]  0.844184615  0.352381  0.845410829  0.192827084  0.842152791
BIC3ALLmed MED  [24 4;2 2]  0.918738462  0.318571  0.849598354  0.258671563  0.881434958
BIC3SIGmed  [22 4;4 2]  0.858769231  0.390476  0.854423115  0.251034768  0.854297461
BIC3SIGmed MED  [25 6;1 0]  0.979723077  0.056667  0.811166823  0.044596699  0.886912739
BIC3SIGTPmed  [24 6;2 1]  0.936153846  0.106667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4;4 3]  0.836953846  0.406667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5;2 2]  0.931107692  0.255714  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4;4 2]  0.837938462  0.36381  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5;1 1]  0.945292308  0.139524  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4;4 2]  0.850553846  0.325238  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6;1 1]  0.942984615  0.097143  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4;4 2]  0.845384615  0.337143  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4;3 2]  0.878892308  0.300952  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5;3 1]  0.899938462  0.13381  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4;2 2]  0.923538462  0.310952  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4;4 2]  0.848184615  0.37381  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6;1 0]  0.975846154  0.077143  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4;4 2]  0.856461538  0.397143  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6;1 1]  0.953015385  0.102381  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4;4 2]  0.845630769  0.385714  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5;1 1]  0.953876923  0.152857  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4;4 2]  0.8332  0.328095  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5;2 1]  0.920338462  0.135238  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4;4 2]  0.839476923  0.330476  0.839110003  0.172724697  0.837167678
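The measures in these tables follow the standard definitions over a binary confusion matrix, whose layout here appears to be [TP FP; FN TN] (an inference from the reported rates; the tabulated values are averaged over cross-validation folds, so recomputing them from the rounded matrices gives only approximate agreement). A minimal sketch of the definitions, not the thesis's actual evaluation code:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification measures from confusion-matrix counts.

    Layout assumption (inferred from the tables): [TP FP; FN TN].
    """
    total = tp + fp + fn + tn
    tp_rate = tp / (tp + fn)            # sensitivity / recall
    tn_rate = tn / (tn + fp)            # specificity
    precision = tp / (tp + fp)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_observed = (tp + tn) / total
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return tp_rate, tn_rate, precision, kappa, f_measure

# Example: the MED row's matrix [22 4;4 3] gives TP rate ~0.846 and
# TN rate ~0.429, close to the fold-averaged values reported above.
print(classification_metrics(22, 4, 4, 3))
```

Because the positive (no-NIV) class dominates, TP rate and F-measure stay high even for near-trivial classifiers, which is why the kappa statistic and TN rate are the more discriminating columns in these tables.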
Table C.10: Extensive results of the classification with K-Nearest-Neighbor on all unbalanced datasets.
K-Nearest-Neighbor, unbalanced original dataset
Dataset  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6;1 0]  0.972123077  0.045714286  0.801965656  0.021911594  0.878323212
EM  [24 6;2 1]  0.941569231  0.110952381  0.8150202  0.05575752  0.870902911
EMfeature  [25 6;1 0]  0.968892308  0.043333333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4;4 3]  0.847846154  0.41047619  0.856753506  0.256414358  0.849040243
MED  [22 4;4 3]  0.848646154  0.41047619  0.856911608  0.257259789  0.849554427
MEDL  [22 4;3 2]  0.871723077  0.290952381  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4;4 2]  0.845630769  0.285714286  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5;2 1]  0.923569231  0.147142857  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4;4 2]  0.827723077  0.36047619  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5;3 1]  0.879169231  0.210952381  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5;3 1]  0.867846154  0.237619048  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [24 4;2 2]  0.918738462  0.318571429  0.849598354  0.258671563  0.881434958
BIC3ALLmed MED  [22 4;4 2]  0.844184615  0.352380952  0.845410829  0.192827084  0.842152791
BIC3SIGmed  [25 6;1 0]  0.979723077  0.056666667  0.811166823  0.044596699  0.886912739
BIC3SIGmed MED  [22 4;4 2]  0.858769231  0.39047619  0.854423115  0.251034768  0.854297461
BIC3SIGTPmed  [24 6;2 1]  0.936153846  0.106666667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4;4 3]  0.836953846  0.406666667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5;2 2]  0.931107692  0.255714286  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4;4 2]  0.837938462  0.363809524  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5;1 1]  0.945292308  0.13952381  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4;4 2]  0.850553846  0.325238095  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6;1 1]  0.942984615  0.097142857  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4;4 2]  0.845384615  0.337142857  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4;3 2]  0.878892308  0.300952381  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5;3 1]  0.899938462  0.133809524  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4;2 2]  0.923538462  0.310952381  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4;4 2]  0.848184615  0.373809524  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6;1 0]  0.975846154  0.077142857  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4;4 2]  0.856461538  0.397142857  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6;1 1]  0.953015385  0.102380952  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4;4 2]  0.845630769  0.385714286  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5;1 1]  0.953876923  0.152857143  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4;4 2]  0.8332  0.328095238  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5;2 1]  0.920338462  0.135238095  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4;4 2]  0.839476923  0.33047619  0.839110003  0.172724697  0.837167678
Table C.11: Extensive results of the classification with K-Nearest-Neighbor on all balanced datasets (SMOTE300).
K-Nearest-Neighbor, SMOTE300-balanced dataset
Dataset  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6;1 0]  0.972123077  0.045714286  0.801965656  0.021911594  0.878323212
EM  [24 6;2 1]  0.941569231  0.110952381  0.8150202  0.05575752  0.870902911
EMfeature  [25 6;1 0]  0.968892308  0.043333333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4;4 3]  0.847846154  0.41047619  0.856753506  0.256414358  0.849040243
MED  [22 4;4 3]  0.848646154  0.41047619  0.856911608  0.257259789  0.849554427
MEDL  [22 4;3 2]  0.871723077  0.290952381  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4;4 2]  0.845630769  0.285714286  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5;2 1]  0.923569231  0.147142857  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4;4 2]  0.827723077  0.36047619  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5;3 1]  0.879169231  0.210952381  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5;3 1]  0.867846154  0.237619048  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [24 4;2 2]  0.918738462  0.318571429  0.849598354  0.258671563  0.881434958
BIC3ALLmed MED  [22 4;4 2]  0.844184615  0.352380952  0.845410829  0.192827084  0.842152791
BIC3SIGmed  [25 6;1 0]  0.979723077  0.056666667  0.811166823  0.044596699  0.886912739
BIC3SIGmed MED  [22 4;4 2]  0.858769231  0.39047619  0.854423115  0.251034768  0.854297461
BIC3SIGTPmed  [24 6;2 1]  0.936153846  0.106666667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4;4 3]  0.836953846  0.406666667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5;2 2]  0.931107692  0.255714286  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4;4 2]  0.837938462  0.363809524  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5;1 1]  0.945292308  0.13952381  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4;4 2]  0.850553846  0.325238095  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6;1 1]  0.942984615  0.097142857  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4;4 2]  0.845384615  0.337142857  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4;3 2]  0.878892308  0.300952381  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5;3 1]  0.899938462  0.133809524  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4;2 2]  0.923538462  0.310952381  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4;4 2]  0.848184615  0.373809524  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6;1 0]  0.975846154  0.077142857  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4;4 2]  0.856461538  0.397142857  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6;1 1]  0.953015385  0.102380952  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4;4 2]  0.845630769  0.385714286  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5;1 1]  0.953876923  0.152857143  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4;4 2]  0.8332  0.328095238  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5;2 1]  0.920338462  0.135238095  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4;4 2]  0.839476923  0.33047619  0.839110003  0.172724697  0.837167678
Table C.12: Extensive results of the classification with K-Nearest-Neighbor on all SMOTE500 datasets.
K-Nearest-Neighbor, SMOTE500
Dataset  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6;1 0]  0.972123077  0.045714286  0.801965656  0.021911594  0.878323212
EM  [24 6;2 1]  0.941569231  0.110952381  0.8150202  0.05575752  0.870902911
EMfeature  [25 6;1 0]  0.968892308  0.043333333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4;4 3]  0.847846154  0.41047619  0.856753506  0.256414358  0.849040243
MED  [22 4;4 3]  0.848646154  0.41047619  0.856911608  0.257259789  0.849554427
MEDL  [22 4;3 2]  0.871723077  0.290952381  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4;4 2]  0.845630769  0.285714286  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5;2 1]  0.923569231  0.147142857  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4;4 2]  0.827723077  0.36047619  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5;3 1]  0.879169231  0.210952381  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5;3 1]  0.867846154  0.237619048  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [22 4;4 2]  0.844184615  0.352380952  0.845410829  0.192827084  0.842152791
BIC3ALLmed MED  [24 4;2 2]  0.918738462  0.318571429  0.849598354  0.258671563  0.881434958
BIC3SIGmed  [22 4;4 2]  0.858769231  0.39047619  0.854423115  0.251034768  0.854297461
BIC3SIGmed MED  [25 6;1 0]  0.979723077  0.056666667  0.811166823  0.044596699  0.886912739
BIC3SIGTPmed  [24 6;2 1]  0.936153846  0.106666667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4;4 3]  0.836953846  0.406666667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5;2 2]  0.931107692  0.255714286  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4;4 2]  0.837938462  0.363809524  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5;1 1]  0.945292308  0.13952381  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4;4 2]  0.850553846  0.325238095  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6;1 1]  0.942984615  0.097142857  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4;4 2]  0.845384615  0.337142857  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4;3 2]  0.878892308  0.300952381  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5;3 1]  0.899938462  0.133809524  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4;2 2]  0.923538462  0.310952381  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4;4 2]  0.848184615  0.373809524  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6;1 0]  0.975846154  0.077142857  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4;4 2]  0.856461538  0.397142857  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6;1 1]  0.953015385  0.102380952  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4;4 2]  0.845630769  0.385714286  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5;1 1]  0.953876923  0.152857143  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4;4 2]  0.8332  0.328095238  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5;2 1]  0.920338462  0.135238095  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4;4 2]  0.839476923  0.33047619  0.839110003  0.172724697  0.837167678