Biclustering-based imputation in longitudinal data
Inês de Almeida Nolasco
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Professor Alexandra Sofia Martins de Carvalho, Professor Sara Alexandra Cordeiro Madeira
Examination Committee
Chairperson: Prof. João Fernando Cardoso Silva Sequeira
Supervisor: Prof. Alexandra Sofia Martins de Carvalho
Member of the Committee: Prof. Pedro Filipe Zeferino Tomás
May 2015
Resumo
Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disease that affects motor abilities. The degeneration of ALS patients progresses rapidly, and they usually die within a few years, mainly due to the impairment of respiratory functions. Identifying in time the need to start non-invasive ventilation is therefore vital. This problem is approached through a longitudinal analysis of clinical data following the same patients over a period of time. However, these studies, and the data on which they are based, suffer greatly from missing values, i.e., values that for some reason are absent, which makes extracting knowledge from the data considerably difficult. In this work, we address the problem of missing values in longitudinal data by applying biclustering-based techniques in order to find groups of patients that share the same trend in the evolution of the variables over time, and to impute the missing values based on that information. This approach is applied, together with other baseline methods for handling missing values, to synthetic data and to the real-world case of predicting NIV in ALS patients. The results indicate that the use of bicluster-based imputation improves the results on longitudinal data.
Keywords: Missing values, Longitudinal data, Bicluster-based imputation,
Amyotrophic lateral sclerosis, Non-invasive ventilation
Abstract
Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disorder that affects motor abilities. ALS patients deteriorate rapidly and die within a few years, mainly from respiratory failure, so being able to identify when Non-Invasive Ventilation (NIV) should be started is vital. The prediction of NIV is approached through the analysis of longitudinal data consisting of clinical follow-ups of patients over time. However, such data are very prone to missing values. In this work, the problem of missing values in longitudinal data is addressed by applying biclustering techniques in order to find trends in the data and thus enhance the imputation of missing values. The proposed approach was tested together with several baseline imputation methods on both synthetic and real-world data (an ALS dataset). The results indicate that the use of biclustering-based imputation generally improves the results.
Keywords: Missing Values, Longitudinal data, Bicluster-based imputation, ALS disease
Contents
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 Longitudinal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.1 Baseline methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Imputation in longitudinal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Biclustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Classification and performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Method 12
3.1 Biclustering-based imputation in longitudinal data . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Imputation methods applied . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 Results 17
4.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.2 Evaluation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Real-World data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.2 Biclustering results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.3 Datasets imputation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.4 Classification results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5 Conclusions and Future work 36
Bibliography 38
A Synthetic data results 39
B Dataset statistical description 41
C Real-world data classification results 44
List of Tables
4.1 Results of Wilcoxon signed-rank tests for NaiveBayes classifier . . . . . . . . . . . . . . . 33
4.2 Results of Wilcoxon signed-rank tests for KNN classifier . . . . . . . . . . . . . . . . . . . 33
4.3 Results of Wilcoxon signed-rank tests for DT classifier . . . . . . . . . . . . . . . . . . . . 34
4.4 Results of Wilcoxon signed-rank tests for linearSVM classifier . . . . . . . . . . . . . . . . 34
4.5 Comparison of 3 symbols and 5 symbols in the classification results. . . . . . . . . . . . . 35
A.1 Results for synthetic datasets of size 1000*150. . . . . . . . . . . . . . . . . . . . . . . . . 39
A.2 Results for synthetic datasets of size 2000*200. . . . . . . . . . . . . . . . . . . . . . . . . 40
A.3 Results for synthetic datasets of size 5000*200. . . . . . . . . . . . . . . . . . . . . . . . . 40
B.1 Statistical description for each feature in the 1st time-point. . . . . . . . . . . . . . . . . . 41
B.2 Statistical description for each feature in the 2nd time-point. . . . . . . . . . . . . . . . . . 42
B.3 Statistical description for each feature in the 3rd time-point. . . . . . . . . . . . . . . . . . 42
B.4 Statistical description for each feature in the 4th time-point. . . . . . . . . . . . . . . . . . 43
B.5 Statistical description for each feature in the 5th time-point. . . . . . . . . . . . . . . . . . 43
C.1 Naive Bayes classification results for ALS data for unbalanced data. . . . . . . . . . . . . 44
C.2 Naive Bayes classification results for ALS data for balanced(SMOTE300) data. . . . . . . 45
C.3 Naive Bayes classification results for ALS data with SMOTE500. . . . . . . . . . . . . . . 45
C.4 LinearSVM classification results for unbalanced dataset. . . . . . . . . . . . . . . . . . . . 46
C.5 LinearSVM classification results for balanced datasets. . . . . . . . . . . . . . . . . . . . . 46
C.6 LinearSVM classification results for SMOTE500 data. . . . . . . . . . . . . . . . . . . . . 47
C.7 Decision Trees classification results for unbalanced data. . . . . . . . . . . . . . . . . . . 47
C.8 Decision Trees classification results for balanced data (SMOTE300). . . . . . . . . . . . . 48
C.9 Decision Trees classification results for SMOTE500 data. . . . . . . . . . . . . . . . . . . 48
C.10 K-Nearest-neighbor classification results on unbalanced data. . . . . . . . . . . . . . . . . 49
C.11 K-Nearest-neighbor classification results for balanced data (SMOTE300). . . . . . . . . . 49
C.12 K-Nearest-neighbor classification results for SMOTE500 data. . . . . . . . . . . . . . . . 50
List of Figures
3.1 An e-CCC bicluster example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Biclustering illustration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Illustration of biclustering-based imputation. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Imputation by bicluster pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1 Synthetic datasets description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Imputation methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Mean imputation error - MED and EM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Mean imputation error for different amount of missing values. . . . . . . . . . . . . . . . . 21
4.5 Robustness of imputation methods to missing values . . . . . . . . . . . . . . . . . . . . . 22
4.6 Comparison of imputation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.7 ALS dataset: Number of instances per class. . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.8 Summary description of missing values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.9 Proportion of missing values in class EVOL and noEvol . . . . . . . . . . . . . . . . . . . 25
4.10 Bicluster categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.11 Number of missings per bicluster group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.12 Real-world data datasets description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.13 Classification results with NaiveBayes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.14 Classification results with linearSVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.15 Classification results with K-nearest-neighbor. . . . . . . . . . . . . . . . . . . . . . . . . 32
4.16 Classification results with Decision Trees. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Chapter 1
Introduction
1.1 Motivation
Amyotrophic lateral sclerosis (ALS) is a motor-neuron disease that involves the degeneration of upper and lower motor neurons, causing muscle weakness and atrophy throughout the body. ALS forces people to need continuous care, and basic life functions become compromised near the final stages of the disease. Without any known cure, patient care is reduced to symptom relief, improving quality of life and increasing life expectancy as much as possible. With adequate care, patients' average life expectancy is between 2 and 5 years. The lives of both patients and their families worsen when respiratory difficulties appear and respiratory assistance, named non-invasive ventilation (NIV), becomes needed. It is a delicate stage, since from this point onward the patient becomes much more dependent on machines and caregivers, and it is also a process that involves high costs.
Several studies based on clinical follow-ups of patients focus on predicting the moment at which patients will need NIV. Clinical follow-ups are usually presented as longitudinal data, which consist of subjects being observed at several moments over some period of time. A recurring problem in this kind of study is missing values, i.e., observations that, for some reason, were not made or were not written down, so that no value is known for that feature of that person at that moment. Missing values imply that some information is missing, which hinders the extraction of knowledge from the data. Intuitively, depending on the magnitude and characteristics of these missing values, the conclusions drawn from the analysis of the data may be distorted. Studying how these missing values affect the conclusions drawn, and understanding which methods better deal with missing data, has become a pressing matter in the specific context of studies predicting NIV and, in general, in studies that use longitudinally structured data. As an example, in the ongoing work by Andre Carreiro et al. [2], the authors are able to predict whether NIV will be needed by the time of the sixth clinical evaluation, based on the five previous evaluations. It is this author's belief that improving the predictive performance of these studies may significantly help ALS patients to better cope with the conditions inflicted by this disease.
To tackle the problem of missing values in longitudinal data, in the specific context of the ALS disease, the present work explores the application of biclustering techniques with the objective of finding trends in the data and thus developing a novel and more accurate way to deal with missing values in longitudinal data. Although the concept of biclustering is not new, it has only recently been explored further, and new and robust algorithms have been developed. Due to this advance, it is now possible to explore the application of these techniques in other contexts, such as missing data.
1.2 Problem formulation
The problem of predicting the need for NIV in ALS patients is structured as a classification problem. Subjects, evaluated on ALS-related features at several moments, are labeled as Evol or noEvol according to whether or not they evolve from not needing NIV to needing NIV between the 5th and the 6th evaluation moment. This work tackles the problem of missing values in the specific context of predicting the need for NIV in ALS patients, aiming to improve the predictive power of these studies. In that sense, it is important to (1) understand the implications that missing values have on classification accuracy, (2) understand the relation between the data structure (longitudinal) and the capacity of the methods to deal with missing values, and, finally, (3) use the derived knowledge to enhance the accuracy of the classification problem by treating missing values in an intelligent and specially designed way.
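As a minimal sketch of this labeling rule (a hypothetical helper, not the actual preprocessing used in this work), per-patient NIV flags over the six evaluations could be mapped to a class as follows:

```python
# Hypothetical helper: niv_flags[i] is True if the patient was on NIV at
# evaluation i+1 (six evaluations in total).
def label(niv_flags):
    """'Evol' if the patient moved from no NIV at the 5th evaluation
    to NIV at the 6th; 'noEvol' otherwise."""
    return "Evol" if (not niv_flags[4]) and niv_flags[5] else "noEvol"

print(label([False, False, False, False, False, True]))   # Evol
print(label([False, False, False, False, False, False]))  # noEvol
```

The classifiers would then predict this label from the features observed at the first five evaluations.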
1.3 Dissertation outline
The present work is organized as follows. Chapter 2 consists of a literature review and a description of crucial concepts: the definitions of missing values, longitudinal data, and biclustering are presented, together with a survey of state-of-the-art work on each topic.
Chapter 3 describes the implementation of biclustering-based imputation, followed by a description of the methodology used to apply the other baseline methods.
Chapter 4 presents and discusses the results. The synthetic and ALS datasets are described, and the imputation evaluation, missing-value analysis, biclustering, and classification results are presented and discussed.
The Appendices provide a statistical description of each feature of the ALS dataset at each time point, as well as the extensive results for the synthetic and ALS data.
Chapter 2
Background
2.1 Longitudinal data
Longitudinal studies, from which longitudinal data result, are designed to repeatedly observe the same subjects and measure the same features of interest over long periods of time. The main advantage of this design is that it allows one to understand the evolution or change in behavior of the subjects while excluding time-invariant characteristics that could cloud the conclusions.
Regarding the moments at which the measurements take place, these may all be fixed, i.e., regular moments equal for every subject, or they may differ for each subject, with different intervals in between. The difference between this and other time-aware designs is not always clear; in particular, time-series data also consist of successive measurements made over some time interval, yet the fields of research using one design or the other are very different. Time-series data usually present a large number of time points, and the rate at which the observations are made is much faster. For instance, the data acquired from a sensor during a certain period of time may be seen as a time series.
2.2 Missing values
Missing data occur when no data values are stored for some variables. An example of missing data is the non-response to some questions of a survey. This may have different reasons; as Little and Rubin exemplify in [8], a person may not answer because she refuses to, or because she does not know the answer. These two different reasons represent distinct mechanisms that lead to data being missing and should be treated differently when analyzing the data.
Missing data can then be classified according to the mechanism that causes it to be missing. In [1] a formalization is proposed in the following way: Y is the variable under observation, which may have missing values; X denotes the other observed variables, without missing values; and R is a response indicator which takes the value 1 for a missing value of Y and 0 for an observed value of Y. Given this, missing data is classified as:
Not missing at random (NMAR) If the reason why the data is missing is related to the value itself, then the data is said to be not missing at random. This means that the probability of observing a missing value in Y is not independent of the variable Y itself, i.e., P(R=1|X,Y) = P(R=1|Y).
Missing at random (MAR) Data missing at random has missing values unrelated to the value itself, but related to other variables under observation. Therefore, MAR data has the probability of having missing values in Y described as P(R=1|X,Y) = P(R=1|X).
Missing completely at random (MCAR) The values missing in this category are completely independent of any dimension of the data. Given this, the probability of missing values in MCAR data can be described as P(R=1|X,Y) = P(R=1).
For an illustrative example, consider a survey which asks people about income, gender, age, and schooling level: a person with low income may be more likely to refuse to answer the income question, i.e., P(R=1|X,Y) = P(R=1|Y), which falls into the NMAR category. Another case would be a person with a low schooling level being more likely not to answer the income question, i.e., P(R=1|X,Y) = P(R=1|X), falling into the MAR category. If, otherwise, the response or non-response to the income question is completely independent of any variable, then we face truly MCAR data.
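The three mechanisms can be illustrated with a small simulation (hypothetical income/schooling variables and probabilities, chosen for illustration only):

```python
import random

random.seed(0)

# Hypothetical sample: schooling level (X, fully observed) and income (Y,
# which may go missing). Income depends on schooling.
n = 1000
schooling = [random.randint(1, 4) for _ in range(n)]            # X: 1 (low) .. 4 (high)
income = [1000 * s + random.gauss(0, 200) for s in schooling]   # Y

def mask(values, p_missing):
    """Replace each value with None, with a per-element missing probability."""
    return [None if random.random() < p else v for v, p in zip(values, p_missing)]

# MCAR: P(R=1 | X, Y) = P(R=1) -- a constant probability for every record
mcar = mask(income, [0.2] * n)

# MAR: P(R=1 | X, Y) = P(R=1 | X) -- depends only on the observed schooling
mar = mask(income, [0.4 if s == 1 else 0.1 for s in schooling])

# NMAR: P(R=1 | X, Y) = P(R=1 | Y) -- low incomes are more often hidden
nmar = mask(income, [0.4 if y < 2000 else 0.1 for y in income])

for name, col in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    observed = [v for v in col if v is not None]
    print(name, round(sum(observed) / len(observed)))
```

Under MCAR the observed mean stays close to the true mean, while under MAR and NMAR the observed mean is biased upward, since low incomes are deleted more often.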
2.2.1 Baseline methods
If we cannot consider data to be at least MAR (it being therefore NMAR), then the missing-data mechanism must be modeled and taken into account in the analysis performed on the data. This means we must derive exactly how the missing values depend on the variable's value and/or on other variables under observation. In a large number of studies this may be impossible. As Donders and van der Heijden point out in [4], if data is NMAR then no universal method for dealing with missing data exists. Some new tools are being developed in this context, but they are out of the scope of this thesis and will not be addressed here. The methods discussed henceforward are based on the assumption that the data is at least MAR.
Listwise deletion Deletes all instances in which missing values appear. This generates a complete
data set with fewer instances on which the usual estimation process may be performed.
Pairwise deletion Only deletes instances that have missing values on the variables under analysis, but
uses the same instances when analyzing other variables without missing values.
These methods are usually categorized as complete-dataset approaches, since the general strategy is to clean the dataset of missing values in order to obtain a "complete" dataset on which to perform the desired analysis. Their main problem is the resulting bias in descriptive statistics, especially when the data is only MAR; another aspect worth considering is the decrease in statistical power that follows from the lower number of sample instances. Regarding the applicability of these methods, the authors in [3] are straightforward: they may be applied only if the amount of available data is large or if the missing values correspond to a small percentage of the data. Otherwise, and in cases where the missing information is necessary, other methods that predict the exact missing value should be applied.
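A toy sketch of the difference between the two deletion strategies (hypothetical records and field names):

```python
# Toy dataset: each record may have missing (None) entries
records = [
    {"age": 34, "fvc": 80},
    {"age": 51, "fvc": None},   # missing fvc
    {"age": None, "fvc": 65},   # missing age
    {"age": 47, "fvc": 72},
]

# Listwise deletion: drop every record containing any missing value
listwise = [r for r in records if all(v is not None for v in r.values())]

# Pairwise deletion: for each variable, use all records where THAT
# variable is observed, even if other variables of the record are missing
def observed(var):
    return [r[var] for r in records if r[var] is not None]

print(len(listwise))            # 2 complete records remain
print(len(observed("age")))     # 3 values usable for age statistics
print(len(observed("fvc")))     # 3 values usable for fvc statistics
```

Pairwise deletion keeps more values per variable, but different statistics end up being computed on different subsets of the sample.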
Imputation methods
Following the need to predict missing values, imputation methods were developed. These are based on the assumption that the inferences drawn from a representative sample should be similar if the same analysis is performed on a different, but also representative, sample. Therefore, as A. Rogier et al. [4] state, if we replace one subject of the sample with a subject drawn at random from the population under study, the analysis results should not be very different.
R. Little and D. Rubin [8] also propose an intuitive motivation for applying these methods: if in our study we have some variable Yj which is highly correlated with some other observed variable Yk, and some values are missing in Yj, then it is very tempting to predict those Yj values from Yk.
Taking that into account, the overall scheme of these methods is to impute, in the place of the missing values, values that make sense to be there. It is in the approach used to decide which values to impute that the methods differ.
The general positive point of this kind of method is that, after the imputation process is done, we may work with a full-sized sample and perform the exact same analysis we would have done if no missing values had ever occurred. This is, however, not without problems, which are discussed throughout this section.
Hot-deck imputation If a value is missing in the responses of some subject, a sub-sample of subjects with similar characteristics is formed and the missing value is imputed with a value taken at random from that subgroup. This approach results in biased estimators and misleadingly small standard errors, since no new information is being added. Another problem with this method is the difficulty, or even impossibility, of obtaining a subset of subjects with similar characteristics. In practice, ad hoc strategies are employed.
Mean imputation In the place of missing values, this method imputes the estimated sample mean of the variable that shows missing values. This approach may be generalized to use any metric of frequency or central tendency to infer missing values. However, the same problems pointed out before exist: as we are not imputing new information (the imputed values are based on the sample and do not increase the associated uncertainty) and as we are using a full-sized sample, the standard errors of the estimated parameters will be misleadingly small. The bias problem also remains, especially if the data is not MCAR.
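A minimal sketch of mean imputation on toy values, which also shows the artificial shrinkage of the sample spread mentioned above:

```python
import statistics

values = [10.0, 12.0, None, 11.0, None, 13.0, 9.0]

observed = [v for v in values if v is not None]
mean = statistics.mean(observed)

# Mean imputation: fill every missing slot with the observed mean
imputed = [mean if v is None else v for v in values]

# The sample standard deviation shrinks: the imputed values sit exactly
# at the mean and add no spread, so dispersion is underestimated
print(statistics.stdev(observed) > statistics.stdev(imputed))  # True
```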
Regression imputation Based on the sample, a regression function is constructed with the variables with missing values as the dependent ones and the other observed variables as the predictors. The missing slots are then imputed with values resulting from that function. Several kinds of regression functions may be derived and should be used according to the type of data at hand, for instance linear or logistic regressions. The problems mentioned above are still not solved by general regression imputation methods, since the biased estimators and small standard errors persist.
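A minimal sketch of regression imputation with a simple least-squares line (toy data; real applications would typically use richer regression models over several predictors):

```python
# Toy data: Y is roughly 2*X, with one value of Y missing
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, None, 8.1, 9.9]

# Fit a least-squares line on the observed (x, y) pairs
pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
n = len(pairs)
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
slope = (sum((x - mx) * (y - my) for x, y in pairs)
         / sum((x - mx) ** 2 for x, _ in pairs))
intercept = my - slope * mx

# Impute each missing Y with the value predicted from its X
imputed = [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]
print(round(imputed[2], 2))   # 6.0 for this roughly linear toy data
```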
The stated imputation methods may be considered part of the single-imputation group, meaning that the prediction of values to impute is performed, and the values imputed, only once. Some argue that this strategy underestimates the imputation error, since the imputed data does not reflect the uncertainty created by the missing values, i.e., the imputed values carry no uncertainty about the choice of value to impute, thus leading to overconfidence in the resulting estimations. Therefore, a mechanism that reflects the uncertainty in the choice of the value to impute is needed. Donald B. Rubin proposed, in [14], such a method, named multiple imputation, that tackles this specific problem.
Multiple imputation This method belongs to the imputation methods category, since the rationale and overall approach are the same. It, however, addresses the problems shown above, and its results have been shown to be substantially better [14]. It starts by constructing a regression function in the same way as in regression imputation. In this case, however, an error term is introduced into the regression function, so that each time the function is used to derive a value for the dependent variable, a different value is obtained. The imputation process is then performed several times, i.e., a value is derived from the current regression function and imputed in the place of the missing value, and this procedure is repeated several times, generating several different datasets. An analysis may then be performed on each dataset and the estimated parameters averaged across all the datasets analyzed, so that the resulting statistics no longer suffer from underestimated standard errors.
This method works more as a tool than as a standalone method, and it can be applied even when using other kinds of models on the data, i.e., not only regression models.
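A schematic sketch of the multiple-imputation cycle just described (toy data; a full implementation would also draw the regression parameters themselves from their posterior, which this sketch omits for brevity):

```python
import random
import statistics

random.seed(1)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, None, 8.0, None, 12.1]   # roughly Y = 2*X, two values missing

# Fit a simple regression on the observed pairs (as in regression imputation)
pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
mx = sum(x for x, _ in pairs) / len(pairs)
my = sum(y for _, y in pairs) / len(pairs)
slope = (sum((x - mx) * (y - my) for x, y in pairs)
         / sum((x - mx) ** 2 for x, _ in pairs))
intercept = my - slope * mx
residual_sd = statistics.stdev(y - (slope * x + intercept) for x, y in pairs)

# Multiple imputation: generate m completed datasets, adding random noise
# to each prediction so that every dataset differs
m = 20
completed_means = []
for _ in range(m):
    filled = [slope * x + intercept + random.gauss(0, residual_sd) if y is None else y
              for x, y in zip(xs, ys)]
    completed_means.append(statistics.mean(filled))

# Pool: average the estimate over the m datasets; the spread between them
# reflects the uncertainty that single imputation would hide
print(round(statistics.mean(completed_means), 1))
print(statistics.stdev(completed_means) > 0)   # True: imputations differ
```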
Expectation-maximization imputation As a multiple-imputation technique, the Expectation-Maximization (EM) algorithm is generally used. It provides a methodological way of applying some specific method, e.g., regression imputation, and improving its results. For an imputation process, it generally works as follows: an E step (expectation) derives some descriptive model from the available data, and an imputation is made using that model. Then the M step (maximization) follows, which means that, having a complete imputed dataset, the parameters defining the previous model may be re-estimated and a new model derived. The two steps are repeated until the model converges. This way, by successive iterations, it is possible to start with simple models describing the data and finally achieve enhanced and more accurate models, which lead in the end to better imputed values. In [16] the author developed an EM imputation algorithm for Matlab.
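The E/M alternation can be sketched as follows (a deliberately minimal one-parameter Gaussian model, not the Matlab implementation cited; with a single mean the iteration converges immediately, whereas richer models such as regressions make the repeated re-estimation meaningful):

```python
import statistics

values = [4.0, None, 6.0, 5.0, None, 7.0]

# Start from a simple model: the mean of the observed values
mu = statistics.mean(v for v in values if v is not None)

for _ in range(50):
    # E step: impute missing entries with the current model's expectation
    filled = [mu if v is None else v for v in values]
    # M step: re-estimate the model parameter from the completed data
    new_mu = statistics.mean(filled)
    if abs(new_mu - mu) < 1e-9:   # converged
        break
    mu = new_mu

print(mu)   # converges to the observed mean, 5.5
```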
An important category of imputation procedures are the so-called model-based imputation methods, i.e., methods in which the imputed values are obtained through some function that models the data at hand. The simplest model-based imputation methods are the regression imputation methods, in which the variables are modeled through a regression function. These methods are usually applied in an iterative way that generally repeats two main steps: first, learn a model from the observed data; second, impute the missing values according to the learnt model. As mentioned before, an example of such an iterative process is the EM algorithm.
Few advances have taken place in this area; as Joseph Schafer and John Graham conclude in [15], multiple imputation with regression and maximum likelihood with the EM algorithm are still considered the best approaches to deal with missing data. Recently, the focus has turned to the development of methods able to model more complex datasets and exploit their specific aspects. Complex datasets may refer to datasets that mix together a variety of different variable types, such as categorical, binary, scalar, or continuous. It may also refer to the relations between variables, or to the complexity of the dataset design, as is the case with longitudinal data.
To deal with variables of different types, an important algorithm was developed in [11]: the imputation and variance estimation software (IVEWARE), which implements a sequential regression imputation approach. It consists of an iterative process that estimates missing values by fitting a sequence of regression models and drawing values from the corresponding predictive distributions. A drawback of this algorithm is that it is not sufficiently robust to outliers, producing poor results when outliers are present. An improvement on this algorithm, the Iterative Robust Model-based Imputation (IRMI), was presented in [17]; it follows the same approach as IVEWARE but yields more robust imputed values.
In [12], Michael Richman et al. introduced and evaluated the use of machine learning algorithms, such as support vector machines (SVM) and neural networks (NN), to model missing data in the imputation process. In particular, the idea was to use SVM and NN to construct the regression functions from which the values to impute would be derived. Comparing these algorithms against more basic imputation approaches led to the conclusion that the methods tested are better in general and, in the case of nonlinearly correlated variables, substantially better than the other basic approaches tried.
Specifically for categorical data, regression trees have been successfully used, as described in [18]. In [13], however, Vanessa Romero and Antonio Salmeron explored the use of Bayesian networks (BN) for multivariate imputation of categorical variables, with better results than those obtained using regression trees. BN are useful tools to obtain a joint probability distribution over all features. The overall idea of their work is to learn a BN for the set of variables in the dataset and then impute the missing values with values that increase the likelihood of the data according to the derived joint probability distribution.
Some recent studies, such as [3] and [5], revisit the hot-deck imputation idea, where subgroups of the data (also called classes) are formed following some coherence rule, i.e., they are formed such that, on selected attributes, the data from same-group objects behave according to that coherence. It is thus expected that imputing missing values (whatever technique is applied) considering only data from the same group should perform better than applying the same imputation method to the global dataset. In [3] the authors use the concept of biclustering to generate those classes, with the Mean Squared Residue metric as the coherence rule.
2.2.2 Imputation in longitudinal data
Throughout the course of this work it became clear that the imputation process is very sensitive to the data's structure and design. The works mentioned above are an example of that, where specific tools are used to tackle specific aspects of the data.
The same conjecture appears to hold when dealing with longitudinal data. Indeed, knowing the exact structure of a dataset and, in this case, understanding that attributes evolve over time is of interest when developing methods to deal with missing values in longitudinal data. The following methods are extensions of the baseline methods rather than new contributions to missing-data management.
Individual mean imputation The missing value is imputed with the mean of the values observed at the distinct time points for the same instance.
Time mean imputation The missing value is imputed with the mean of the values observed in distinct instances at the same time point. As before, imputation with other metrics, e.g., the median, is possible and should be considered according to the dataset's characteristics.
Last Value Carried Forward (LVCF) A missing value at time point t is imputed with the previous (t-1) observed value of that instance. Several modifications of this approach may be found; for instance, instead of the previous value one might use the nearest, the latest, or the next (t+1) observed value. The decision about which to use always depends on prior knowledge about how the variables evolve over time. Nonetheless, these methods assume a more or less constant value over time and, if that is not the case, more sophisticated approaches should be considered.
Linear interpolation imputation A missing value in time point t is imputed with the average of the
previous (t-1) and next (t+1) observed values. This approach assumes a linear development in
time of the variables under imputation.
Longitudinal regression methods In [19] the authors propose two methods: individual longitudinal
regression imputation and population longitudinal regression imputation. Both methods are a ver-
sion of a simple regression imputation method; however, when choosing the predictors to compute
the regression functions, they take time into account as an important component. Individual
longitudinal regression imputation derives a regression function between time and the attribute
with missings in the same instance. This must be performed for each object separately, since it
is assumed that the dependency of the variable on time is not the same for every object in the
dataset. In population longitudinal regression imputation the regression function uses not only time
as a predictor but other attributes from other individuals as well.
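The baseline longitudinal strategies listed above can be sketched in a few lines. The sketch below is illustrative only: it is written in Python with NumPy rather than the Matlab used in this work, and the function and parameter names are invented for the example.

```python
import numpy as np

def impute_longitudinal(X, method="lvcf"):
    """Impute NaNs in a (subjects x time points) matrix with one of the
    baseline longitudinal strategies. Illustrative sketch, not thesis code."""
    X = X.astype(float).copy()
    n, t = X.shape
    for i in range(n):
        for j in range(t):
            if not np.isnan(X[i, j]):
                continue
            if method == "individual_mean":      # mean over the subject's observed time points
                obs = X[i, ~np.isnan(X[i])]
                if obs.size:
                    X[i, j] = obs.mean()
            elif method == "time_mean":          # mean over subjects at the same time point
                obs = X[~np.isnan(X[:, j]), j]
                if obs.size:
                    X[i, j] = obs.mean()
            elif method == "lvcf" and j > 0:     # last value carried forward
                X[i, j] = X[i, j - 1]
            elif method == "linear" and 0 < j < t - 1:
                prev, nxt = X[i, j - 1], X[i, j + 1]
                if not (np.isnan(prev) or np.isnan(nxt)):
                    X[i, j] = (prev + nxt) / 2   # average of t-1 and t+1
    return X
```

Note that LVCF leaves the first time point missing and linear interpolation leaves the boundaries missing, which is exactly why combinations of methods, as discussed later, are of interest.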
Related to this subject, the work by J. Honaker and G. King is worth mentioning. In [7] they developed
the AMELIA II algorithm, which applies multiple imputation to time series cross sectional (TSCS) data.
Although TSCS data differs from longitudinal data, in particular in the number of time points available and
in the number of subjects, the problems in applying a regular multiple imputation method are similar, and
the solutions found, even if not directly applicable to the longitudinal case, are worth exploring.
Several studies try to understand whether the use of specially developed methods for longitudinal data
yields better imputed values than the general baseline methods. In this context two studies, [5]
and [19], arrive at the same conclusion: if we are dealing with longitudinal data, then the longitudinal
aspects should be taken into account in the imputation method; in both cases the longitudinal methods
applied yielded better imputations than the baseline methods.
2.3 Biclustering
Clustering stands for grouping together objects that are in some sense similar to each other or that have
more in common with each other than with the rest of the objects. The rules that define which objects
belong to the same cluster may be very different, and may even be applied to discretized versions of the
data or to the real values directly, depending on the problem at hand. In the simplest cases, measures of
distance between objects are used. Clustering may also be applied in more than one dimension. Consider
a data matrix A of size n×m, where rows represent objects and columns are attributes of interest.
Clustering applied to objects will group together objects that present the closest values over all attributes,
resulting in a sub-matrix of size s×m containing all the selected objects (s < n). But clustering may also
be applied to the columns (attributes), grouping together attributes that behave in a similar way for all
objects and resulting in a sub-matrix n×p with n objects and p attributes (p < m). In some contexts,
clustering simultaneously in both dimensions (rows and columns) is also of interest. This means that a
subset of objects and attributes is selected to form a sub-matrix of size k × t with k < n and t < m,
called a bicluster. In [9] the authors express the underlying difference between these clustering
approaches: clustering methods derive a global model for the data, since each object belonging to a
cluster is selected using all attributes, or vice-versa. Biclustering, however, produces local models, since
objects and attributes belonging to a bicluster are selected considering only a subset of attributes and
objects.
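The three kinds of sub-matrix can be made concrete with plain NumPy indexing. The index sets below are arbitrary examples, chosen only to illustrate the shapes involved.

```python
import numpy as np

A = np.arange(30).reshape(5, 6)           # data matrix: 5 objects x 6 attributes

# Row clustering: select objects using ALL m attributes -> s x m sub-matrix
row_cluster = A[[0, 2], :]                # shape (2, 6), s < n

# Column clustering: select attributes using ALL n objects -> n x p sub-matrix
col_cluster = A[:, [1, 4]]                # shape (5, 2), p < m

# Biclustering: select a subset of BOTH objects and attributes -> k x t
bicluster = A[np.ix_([0, 2], [1, 4])]     # shape (2, 2): a local model
```

The last line captures the "local model" idea: both the row subset and the column subset are chosen together, rather than one dimension being conditioned on the entirety of the other.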
Time-series data biclustering
Biclustering has also been shown to be an interesting technique to apply to time-series data, with
the objective of finding objects that behave in a similar manner over the same time points, i.e., objects
that present the same trends over time. For that purpose, consider the application of biclustering to a
data matrix A where, instead of different attributes, columns represent different time slices of the
same feature. If the goal is to find local trends over time, then it makes sense for the columns,
representing time points, selected for each bicluster to be contiguous. This restriction was further
explored by the authors in [10], who proposed an algorithm able to find such biclusters in linear time,
the so-called Contiguous Column Coherent Biclustering (CCC-Biclustering) algorithm. In this algorithm
a discretized version (A) of the data matrix (A') is generated, and rows and columns are grouped together
following two rules: first, columns must be contiguous; and second, discretized symbols in each column
must be the same in every row selected.
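The two rules can be illustrated by a brute-force enumeration over a small discretized matrix. This sketch only shows what a CCC-bicluster is; the actual CCC-Biclustering algorithm of [10] uses suffix trees to find all maximal biclusters in linear time, which the quadratic loop below does not attempt to reproduce.

```python
from collections import defaultdict

def ccc_biclusters(D, min_rows=2, min_cols=2):
    """Brute-force illustration of the CCC-bicluster definition: groups of
    rows sharing the exact same symbols over a contiguous run of columns.
    D is a list of rows of discretized symbols. Illustrative sketch only."""
    n, t = len(D), len(D[0])
    found = set()
    for c0 in range(t):
        for c1 in range(c0 + min_cols, t + 1):          # contiguous columns only
            groups = defaultdict(list)
            for r in range(n):
                groups[tuple(D[r][c0:c1])].append(r)    # same symbols in every row
            for pattern, rows in groups.items():
                if len(rows) >= min_rows:
                    found.add((tuple(rows), c0, c1, pattern))
    # keep only column-maximal biclusters (same rows, not extendable)
    maximal = [b for b in found
               if not any(r == b[0] and c0 <= b[1] and c1 >= b[2]
                          and (c0, c1) != (b[1], b[2])
                          for (r, c0, c1, _p) in found)]
    return maximal
```

Running it on a 3×3 symbolic matrix where the first two rows agree on the first two columns returns exactly that bicluster.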
2.4 Classification and performance metrics
Synthetic minority over sampling technique (SMOTE)
To mitigate the effects of unbalanced data on classification results, Nitesh V. Chawla et al. developed
the Synthetic Minority Over-sampling Technique (SMOTE), which artificially balances the dataset
between the two classes by over-sampling the minority class through the introduction of synthetic in-
stances. This tool allows the user to define the desired oversampling as a percentage of data
to be created. As an example, a SMOTE of 100% would result in twice as many minority-class
instances.
Performance metrics
In order to evaluate classifications, several metrics are used; here a brief description of some of these
metrics is presented. When dealing with a classification problem, typically a positive class and a
negative class are defined, e.g., in the context of medical studies the positive class is usually assigned
to the sick patients and the negative class to the healthy persons.
True positive and True Negative rates A true positive is an instance that has been correctly classified
as positive and a true negative is an instance that is correctly classified as negative. The TP
rate and TN rate measure the proportion of actual positives or negatives that
are correctly identified as such; in other words, it is the number of instances correctly classified
as positive (TP) or negative (TN) divided by the total number of positive (P) or negative (N)
instances. This is defined as:

TP rate = TP / P (2.1)

TN rate = TN / N (2.2)

True Positive rate is also usually called recall and serves as a measure of the capacity of classifiers
to retrieve relevant information, i.e., how many relevant instances it is able to find from all relevant
instances.
Precision Is the fraction of the number of instances correctly classified as positive (TP) over the
total number of instances classified as positive, either correctly (TP) or incorrectly (FP). This metric
measures whether the classification is producing more correctly classified positive instances than
incorrect ones.

Precision = TP / (TP + FP) (2.3)
K-statistics Metric that compares the observed accuracy with the accuracy of random chance; this
gives a general idea of whether the classifier behaved as a random system or not.

K-statistics = (totalAccuracy − randomAccuracy) / (1 − randomAccuracy) (2.4)

totalAccuracy = (TP + TN) / (TP + TN + FP + FN) (2.5)

randomAccuracy = ((TN + FP) × (TN + FN) + (FN + TP) × (FP + TP)) / (total × total) (2.6)
F-measure incorporates both Precision and Recall, measuring the performance from both
aspects. Its definition is as follows:

F-measure = 2 × (Precision × Recall) / (Precision + Recall) (2.7)
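These metrics can all be derived from the four confusion-matrix counts. The following sketch (using the standard harmonic-mean form of the F-measure; division-by-zero guards omitted for brevity) shows the computation:

```python
def classification_metrics(TP, TN, FP, FN):
    """Compute the metrics of Eqs. (2.1)-(2.7) from confusion-matrix counts.
    Illustrative sketch; assumes all denominators are non-zero."""
    P, N = TP + FN, TN + FP
    total = P + N
    tp_rate = TP / P                                     # recall, Eq. (2.1)
    tn_rate = TN / N                                     # Eq. (2.2)
    precision = TP / (TP + FP)                           # Eq. (2.3)
    total_acc = (TP + TN) / total                        # Eq. (2.5)
    random_acc = ((TN + FP) * (TN + FN)
                  + (FN + TP) * (FP + TP)) / total ** 2  # Eq. (2.6)
    kappa = (total_acc - random_acc) / (1 - random_acc)  # Eq. (2.4)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)  # Eq. (2.7)
    return dict(tp_rate=tp_rate, tn_rate=tn_rate, precision=precision,
                kappa=kappa, f_measure=f_measure)
```

For instance, with TP = 8, TN = 5, FP = 5, FN = 2 the random accuracy is exactly 0.5, so a total accuracy of 0.65 yields a kappa of 0.3.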
Chapter 3
Method
3.1 Biclustering-based imputation in longitudinal data
Following the idea of imputation inside classes or subgroups introduced in Section 2.2.1, this work
further explores the use of biclustering techniques, applied to our longitudinal dataset, in order
to create similarity groups and perform group-dependent imputation. The idea is that if the imputation
is carried out based on groups that share some similarity, the imputed values are expected to be more
accurate than if the imputation were made considering the whole dataset. Considering time-
dependent aspects in the imputation process, the idea of grouping persons according to some similarity
can be described as looking for local trends in the data, i.e., we want to group together patients that,
for some feature, show the same evolution over time. Any missing value is then imputed
taking into account the group trend for that feature. The biclustering strategy used in the course of this
work is CCC-Biclustering which, as previously explained (in Section 2.3), only finds biclusters with
contiguous time points. However, a slightly different version of this algorithm has to be used, e-
CCC-Biclustering, which allows the biclustering of objects with approximate similarity instead of an exact
one, i.e., selected patients in the defined time points do not need to be precisely equal among themselves,
but only similar to some defined degree that may admit mismatches or missing values. In the context
of this work, allowing missing values inside biclusters is imperative to be able to group together subjects
that present missing values in some time points, and to generate biclusters as shown in Figure 3.1.
However, we are not interested in allowing other mismatches/errors in the bicluster computation, so
the e-CCC-Biclustering algorithm was modified to allow only missing values as errors. Before applying
the biclustering algorithm, the data needs to be processed in order to generate one-feature matrices for
each longitudinal feature in the original dataset. Each data matrix consists of the observations of the
considered feature for all patients at different time points, and it constitutes the base data matrix on which
biclustering is performed. In Figure 3.2 the strategy used to bicluster this data is presented step by step.
After the generation of each one-feature matrix, as described in Section 2.3, the matrix needs
to be discretized, since this algorithm applies biclustering over a discretized version of the data.

Figure 3.1: Illustration of an e-CCC-bicluster containing samples with missing values.

Figure 3.2: Bicluster computation workflow. First, construct the one-feature matrices by separating the dataset into sub-datasets of only one feature (but with all samples and all time points). Then apply the modified e-CCC-Biclustering algorithm to find biclusters in samples that have the same discretized values, with only missing values as the possible differences.

There are several options regarding the discretization method, but the one that seems most convenient for our
problem is discretization into n symbols performed with equal width by subject.
This discretization approach looks into the values of each subject across time and creates n
bins of equal width that correspond one-to-one to the symbols of the discretization alphabet. Alphabets
of 3 symbols are usually used, in particular the following sequence of letters in ascending or-
der: (D,N,U). But other alphabets are possible; in this work, in order to understand the effect of
discretization on the results of biclustering-based imputation, discretization with 5 symbols is also
used, corresponding to the following alphabet: (A,B,C,D,E).
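The per-subject equal-width discretization can be sketched as follows (an illustrative Python sketch, not the implementation used in this work):

```python
import numpy as np

def discretize_by_subject(X, alphabet=("D", "N", "U")):
    """Equal-width discretization per subject (row): each subject's observed
    range is split into len(alphabet) equal-width bins, mapped to symbols in
    ascending order. NaNs stay missing. Illustrative sketch only."""
    n_bins = len(alphabet)
    out = np.full(X.shape, None, dtype=object)
    for i, row in enumerate(X):
        obs = row[~np.isnan(row)]
        lo, hi = obs.min(), obs.max()
        width = (hi - lo) / n_bins or 1.0        # guard against constant rows
        for j, v in enumerate(row):
            if not np.isnan(v):
                out[i, j] = alphabet[min(int((v - lo) / width), n_bins - 1)]
    return out
```

Because the bins are computed per subject, two patients with very different absolute values but the same shape of evolution receive the same symbol sequence, which is precisely what makes trends comparable across subjects.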
A point to consider is that not all resulting biclusters are of interest: they may be too small or trivial1,
they may be statistically insignificant, or they may simply not make much sense in the context of the
problem. For instance, we are interested in finding local trends in six-time-point longitudinal data,
and a bicluster of only two time points is not very consistent with that. To understand the
implications of these aspects on the imputations performed, and what kind of metric would be interesting
to apply, the use of four different sets of biclusters in the imputation process is considered: (1) ALL:
all non-trivial biclusters; (2) SIG: only significant biclusters, i.e., with an associated p-value lower than
0.05; (3) TP: all non-trivial biclusters with at least 3 time points, i.e., biclusters that present at least 3
columns; and (4) SIGeTP: only significant biclusters with at least 3 time points.
1Herein, trivial means a bicluster with only one time point or with only one sample/person.
Once the biclusters are formed, the imputation may be performed inside the biclusters exactly as it
would be done on a whole dataset, except that it is performed on the smaller group of data that forms the
bicluster. This process receives as input each one-feature matrix (representing each longitudinal feature)
with missing values and a description of the biclusters found after running the e-CCC-Biclustering algorithm.
It starts by associating each missing value with one single bicluster from which the imputation is to
be performed. After the one-to-one relation between missing value and bicluster is computed, the local
imputation process takes place and a single value to impute is found. This value is then imputed, in
the one-feature matrix, in the place of the missing value. A schematic representation of this process is
presented in Figure 3.3. Since several biclusters may contain the same missing value and a single
bicluster must be selected, in any of the four cases enumerated above the biclusters are
evaluated according to their statistical significance and the most significant one is selected. Furthermore,
some missing values do not fall inside any bicluster. In these cases, the missing value will either remain
missing, i.e., will not be imputed, or is imputed with an additional method that is applied
to the whole one-feature matrix.
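The association step, selecting the most significant covering bicluster per missing value, can be sketched as follows. The (rows, cols, p_value) tuple layout is an assumed representation for the example, not the data structure of this work's implementation:

```python
def assign_biclusters(missing_positions, biclusters):
    """Map each missing (row, col) position to the containing bicluster with
    the lowest p-value, or None when no bicluster covers it. `biclusters` is
    a list of (rows, cols, p_value) tuples. Illustrative sketch only."""
    assignment = {}
    for pos in missing_positions:
        r, c = pos
        covering = [b for b in biclusters if r in b[0] and c in b[1]]
        # most statistically significant = lowest p-value
        assignment[pos] = min(covering, key=lambda b: b[2]) if covering else None
    return assignment
```

Positions mapped to None are exactly the ones that either stay missing or fall back to the whole-matrix imputation method.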
Figure 3.3: Illustration of the biclustering-based imputation process. For each missing value, it finds a bicluster that contains it. Next, it takes the sub-matrix of the data contained in that bicluster and performs imputation on that missing value.
3.2 Imputation methods applied
In this section, a description of the imputation procedures that are applied and compared is presented.
Expectation maximization Imputation using the EM approach is performed with the Matlab software, specif-
ically the EM imputation implementation described in [16]. This approach receives as input a matrix
with missing values encoded as Not a Number (NaN) and outputs the same matrix with
the missing values imputed.
Median cross subjects This approach was implemented in the Matlab environment. The procedure
computes, for each one-feature matrix, the median for each time point across all subjects, then
imputes each missing value with the corresponding computed median.
Median longitudinal A variation on the previous implementation was also developed, where the median
values to impute are computed separately, not only for each feature, but also for each subject, across
all time points.
Bicluster-based imputation Following the strategy described in Section 3.1, three imputation methods
applied to the biclusters were explored: imputation with median across persons, imputation with EM,
and imputation by bicluster pattern. The first two approaches are simply direct applications of the
previously introduced implementations and differ only in that they are applied to a much
more restricted group, the bicluster. The last method uses the information of the bicluster pattern
and the local values from the same person to predict the value to impute. The strategy used is:
first, select the letter of the bicluster pattern corresponding to that missing value, and then apply
reverse discretization to it, i.e., based on the other values available for that person, compute the value
interval that letter represents and impute with the mean value of that interval. An example
of this process is presented in Figure 3.4.
Figure 3.4: Illustration of the biclustering-based imputation using the "by pattern" approach. First, determine which discretization letter corresponds to the missing value. Second, compute the mean value of the interval that letter represents for the given subject. The imputation is then performed with this mean value.
Chapter 4
Results
The goal of this chapter is two-fold: (1) evaluate the effectiveness of the imputation methods in longitu-
dinal data; (2) evaluate how general the conclusions are, by testing the imputation algorithms in several
datasets.
To properly evaluate the imputation methods, synthetic datasets were generated. Synthetic datasets
are essential in this work, since they make it possible to compare the imputations against ground truth,
allowing reasoning about aspects that are impossible with real data. For example, to evaluate imputation
methods, one should check whether the predicted values are close to the real ones, i.e., the values that
should be there but went missing. Using synthetic data, the missing values are known a priori and thus an
assessment of this kind is possible. To tackle the second evaluation point, since the previous method-
ology is impossible to apply to real-world data, a classification approach is used over the ALS dataset.
It seems adequate to state that if a dataset resulting from an imputation method yields better results in
the classification problem than other imputed datasets, and if the classifications are performed in exactly
the same conditions, then this imputation method must be better than the others tested for this particular
context. Following this idea, several imputations were applied to the real-world dataset and the resulting
complete datasets were classified. This chapter provides detailed information on each one of these
procedures.
4.1 Synthetic data
The main advantage of testing methods on synthetic data is that the experiments may be performed in a
controlled environment, since this data may be as well defined as needed and cleaned of other factors
that obstruct and occlude conclusions. Also, this kind of data should be designed with a clear idea of
the questions and hypotheses one wants to test, and thus with parameters defined to construct
a dataset that naturally shows some answers. As previously mentioned, we want to test whether specially
designed imputation methods for longitudinal data perform better than baseline ones when applied to
this kind of structured data; in particular, we want to test whether the developed bicluster-based imputation
method performs better. With this in mind, a couple of parameters should be tuned when designing the
synthetic dataset: the number of biclusters found in the data and the total number of missing values.
These aspects directly influence the performance of imputation methods and thus should be carefully
defined.
In [6] the authors developed a generator of synthetic data for biclustering (BiGen), which is used
in this work. BiGen creates a data matrix where biclusters are planted according to a multitude of
parameters that the user may control, of which the ones of interest are: size of the dataset, distribution
and type of data values, number of biclusters to plant, coherence type and size of biclusters, and
the total number of missing values.
Regarding the settings referring to biclusters, the coherence type is the most important, since it
defines the type of similarity that objects will share and be grouped upon. As mentioned, the
objective of applying biclustering algorithms is to group together objects that show the same
trend in time. For this, the coherence type defined is "order preserving across rows", which specifies that
objects are grouped together if the values in each object follow the same trend across columns. Although
a strictly longitudinal dataset is not designed, by defining the bicluster coherence as explained we are
able to approximate the planted biclusters to the ones we would find in a truly longitudinal dataset.
An also important aspect to consider is the strategy used to generate missing values. BiGen is able
to insert a defined percentage of missing values at random positions. To understand the influence
of the amount of missing values on the imputation process, several percentages were used. A version
of the data without missing values is also available, which acts as the ground truth in the evaluation
stage. A more complete description of the data generated is presented next, in Section 4.1.1.
After imputation, the resulting datasets are evaluated using two metrics: the number of missing values
imputed and the mean imputation error, which is the mean difference between the ground-truth values
and the imputed ones, as described next:
MeanImputationError = Σ |RealValue − PredictedValue| / totalNumberValuesPredicted (4.1)
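Equation (4.1) only averages over positions that were actually imputed (some methods leave values missing), which the following sketch makes explicit (array names are illustrative):

```python
import numpy as np

def mean_imputation_error(truth, imputed, missing_mask):
    """Eq. (4.1): mean absolute difference between ground-truth and imputed
    values, computed only over positions that were actually imputed."""
    predicted = missing_mask & ~np.isnan(imputed)   # imputed positions only
    return np.abs(truth[predicted] - imputed[predicted]).mean()
```

Positions that were missing but left unimputed (still NaN) are excluded from the average; they are accounted for separately by the first metric, the number of missing values imputed.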
4.1.1 Data description
Using the BiGen generator, differently sized matrices were generated, with sizes 1000×150,
2000 × 200 and 5000 × 200. Each generated matrix consists of integer values ranging from 0 to 20
and is planted with biclusters and missing values. The planted biclusters are set to be "order preserving
across rows" and their size is defined by a uniform distribution for both rows and columns, for which
the user defines the minimum and maximum values. The defined sizes were not extremely big, in order
to simulate what happens with the real-world data and considering the dataset size. After the biclusters
are planted, missing values were also included in each matrix in different percentages: 10%, 20%,
30% and 50%. This generates the final datasets where imputations are to be performed. In Figure 4.1
a summary and description of each dataset is presented.
Figure 4.1: Description of datasets generated by BiGen, before imputation is performed. Each dataset is here described by dataset size, percentage of missing values, bicluster parameters and number of missing values found in biclusters.

For each one of these datasets, 9 imputation strategies are applied, resulting in 9 differently imputed
datasets. The applied imputation strategies follow the procedures mentioned in Section 3.2, which
are described next.
BICmed Only missing values inside biclusters are imputed with median across rows, the remaining are
not imputed with any method. The portion of missing values imputed with this method corresponds
to the total number of missing values found in biclusters.
BICem Only missing values inside biclusters are imputed with EM. This method is able to impute every
missing value inside biclusters leaving the missing values not grouped in biclusters not imputed.
Refer to Figure 4.1 to see the corresponding number of imputations.
BICmed MED Applies median across rows inside biclusters and the remaining are imputed with median
across rows applied to the whole dataset.
BICem MED Missing values inside biclusters are imputed with EM and the remaining are imputed with
median across rows.
BICem EM Missing values inside biclusters are imputed with EM and remaining are imputed with EM
applied to the whole one-feature matrix.
MED Imputation without biclustering, applies median across rows to the whole matrix.
MEDL Applies longitudinal median to each line of the whole matrix. Note that if a whole line of the
matrix is missing, this method does not impute any value in that line.
MEDL MED First a longitudinal median is applied to each line of the matrix. Then the few lines that
were entirely missing are imputed with median across rows.
EM Imputation with EM applied to the whole matrix. All missing values are imputed.
For a visual aid on the description of each imputation, refer to Figure 4.2, where each imputation
process is described by the imputation method that it applies (in green).
Figure 4.2: Imputation strategies description. Each imputation method that is used appears in green; methods not used are shown in grey. As an example, BICem MED imputes missing values inside biclusters with the EM approach, followed by applying median imputation to the remaining missing values.
In short, each one of the 10 original datasets is imputed by 9 different approaches, originating 90
different datasets that are evaluated and compared. The results of such evaluation are presented in the
next section.
4.1.2 Evaluation results
As mentioned above, each imputation approach applied to each dataset is evaluated by two metrics:
percentage of missing values imputed and mean imputation error. In the present section the results of
this evaluation are presented.
Of the several questions one may want to answer using these results, the most interesting is
whether the bicluster-based imputation approach derives better imputed values with respect to alterna-
tive methods. To answer this, it is crucial to compare the imputation approaches that use bicluster-
based imputation on one portion of the data and an additional method on the remaining missing values
with the imputation approaches that use the same additional method to impute the whole dataset. By doing
this it is possible to directly understand whether the use of bicluster-based imputation, even on a small por-
tion of data, results in a better imputation than if this approach were not used. The specific approaches
that fall in this category are MED versus BICmed MED or BICem MED, and EM versus BICem EM.
Such a comparison may be observed in Figure 4.3, where these approaches
are compared with respect to their mean imputation error. The conclusion is straightforward: when using
bicluster-based imputation methods, whichever imputation method is used inside the bicluster, the
mean error is smaller than if no bicluster imputation is applied. Moreover, these results are consistent
across all nine synthetic datasets, independently of size and percentage of missing values, i.e., the relative
relations between mean imputation errors for the methods under analysis are maintained, confirming that
these conclusions are independent of the percentage of missing values or the dataset size.
These imputation methods are also robust to the amount of missing values in the synthetic datasets.
As can be seen from Figures 4.4 and 4.5, in general, for all data sizes, the mean imputation error barely
increases even with a dramatic increase of missing values (from 10% to 50%).
Figure 4.3: Comparison between datasets imputed with bicluster-based imputation plus median or EM on the remaining missing values and datasets imputed only with median or EM. All nine synthetic datasets present the same relative results: bicluster-based imputation enhances the imputation results.

Figure 4.4: Mean imputation error for the smallest dataset (1000×150) with differing amounts of missing values. All methods perform almost equally well for different amounts of missing values, even when this amount rises to 50%. This result is consistent across the datasets of different sizes.

Finally, it is also possible to find which of the tested imputation approaches shows the most promising
results in terms of mean imputation error. In Figure 4.6 the mean imputation error is represented, for
the smallest dataset (1000×150), for all the imputation approaches applied. As before, these results are
also consistent for all dataset sizes and amounts of missing values, thus only the results from the datasets
of size 1000×150 are shown here. Analyzing these results leads to the conclusion that the BICem MED
and BICem EM methods consistently achieve better imputations, i.e., the predicted values are closer
to the real ones. The complete results from which these analyses are performed may be consulted in
Appendix A.
4.2 Real-World data
For the real-world data, the ALS dataset, the imputation methods were indirectly evaluated through the
classification results. The classification problem, as described in Section 1.2, consists in predicting
whether or not a patient will need assisted ventilation (NIV) by the time of the sixth visit, using all
previous observations.
Figure 4.5: Mean imputation error for all the datasets with size 1000 × 150 and percentages of missing values 10% and 50%.
Figure 4.6: Mean imputation error in the smallest dataset (1000 × 150) for all methods and different amounts of missing values. For all datasets (sizes and amounts of missing values) the best method was BICem MED, followed closely by BICem EM. BICem MED: EM imputation inside each bicluster followed by median imputation across rows of the whole dataset for the rest of the missing values. BICem EM: EM imputation inside each bicluster followed by EM imputation in the whole dataset for the rest of the missing values. See the text above for the definition of all methods.
Classification is performed in a supervised way, but since the two classes, EVOL and noEVOL, are
seriously unbalanced, it was necessary to apply the Synthetic Minority Oversampling Technique (SMOTE)
(as described in Section 2.4) in order to achieve a better balance.
Concerning the classifiers, the ongoing work by Andre Carreiro [2] selects Naive Bayes, since
it is the one that yields the best results. However, the Naive Bayes (NB) implementation in the WEKA data
mining software is also known to be a classifier that works particularly well with missing values, so it is
not expected that classification results can be significantly improved by using better imputation methods.
For this reason, and in order to be able to highlight which imputations really improve the classification
process, different classifiers were also used, namely Decision Trees (DT), K-Nearest-Neighbor (KNN) and
linear Support Vector Machine (LinearSVM).
NB was applied with a kernel estimator; regarding the default method for dealing with missing values,
this NB implementation in WEKA simply omits the conditional probabilities of the features with missing
values in test instances. KNN was applied with 1 neighbor; as the default missing-value treatment, missing
values are assigned the maximum distance when comparing instances with missing values. DT was
performed with a confidence factor of 0.25 and without Laplace smoothing; this implementation simply
does not consider the values of the attributes with missing values to compute gain and entropy. LinearSVM
was performed with a complexity of 1.0; missing values are treated by imputing global means/modes.
The classification process was performed with a cross-validation setup, where each dataset (SMOTE
300%, SMOTE 500% and Original) was divided into five folds, of which four were used for training and one
for testing. These experiments, for each classifier and dataset, were repeated 10 times. Classifications
are evaluated with the TP rate, TN rate, Precision, K-statistics and F-measure. Conclusions from these
results should always be drawn from an evaluation of the several metrics. However, as the F-measure
balances the influence of each class and integrates both precision and recall into a single number, it was
given chief importance.
4.2.1 Data description
This work was built upon clinical data containing information regarding ALS patient follow-ups collected
by the Neuromuscular Unit at the Molecular Medicine Institute of Lisbon. As mentioned, this dataset is
constructed in a longitudinal fashion, where each patient is observed at several moments through time.
Although observations do not follow a strict plan, they tend to average 3 months between consecutive ob-
servations. The dataset contains demographic information, patient characteristics, neuropsychological
analysis, motor evaluations and also respiratory tests, where the NIV requirement is included. In short,
each patient evaluation consists in the observation of 34 different features. A statistical description of the
dataset is presented in Appendix B. The features may be differentiated according to their evolution through
time as static or longitudinal, the static ones being the features that stay constant over time and the
longitudinal ones the features that show some trend. Of the 34 features, 22 are longitudinal and
so are the focus of this work. In the context of the presented problem, each patient's follow-up is labeled
with Evol or noEvol according to whether an evolution in the NIV indicator exists or not. The higher the number
of follow-ups, the easier it should be to perceive and exploit trends in the data. Therefore, only patients
that presented at least five follow-ups were considered. Of these, only the patients that did not evolve
from not needing NIV to needing NIV before the fifth moment are of interest to the classification problem
at hand. Although other setups could be considered, this was the best option for a balance between the
number of resulting patients and the number of follow-ups, since more follow-ups result in fewer patients
fulfilling the needed conditions. After filtering out the uninteresting patients, the resulting dataset consists
of 159 patients observed in 34 different features at 5 different moments, which takes the form of a matrix
of size 159×170, as depicted in Figure 3.2.
The resulting dataset is quite unbalanced. It contains 31 EVOL samples and 128 noEVOL samples
that may be observed in Figure 4.7, where the percentage of patients labeled with Evol consists only in
approximately 20% of the cases.
Figure 4.7: Number of instances per class, Evol or noEvol, in the ALS dataset.
Missing Values Analysis
Approximately 40% of the values in the present dataset are missing. As illustrated in Figure 4.8, these missing values occur in approximately 80% of the features, and there is no single patient that does not present at least one missing value.
Figure 4.8: Amount of missing values per feature, per patient, and in the whole dataset. Missing values are represented in green.
These missing values are distributed unevenly between the two classes: of the total number of values belonging to class Evol, approximately 80% are missing, against 20% missing values in class noEvol. This aspect is depicted in Figure 4.9 and represents a problem for the classification.
Regarding the mechanisms of missing values, i.e., the reasons why data is missing, there is no known justification. In the longitudinal features, missing values occur in what seems to be a random fashion, without any identified pattern and without prior knowledge indicating one. Data is simply missing because it was either not observed or not annotated, and no consistent mechanism creating these missing values was found. In the static features, however, some of the missing values may be considered "false" missings, since a value that is missing at some time-point can readily be filled in with values from other time-points. The missing values that cannot be instantly filled in are those for which no value is observed at any time-point. These cases are pre-imputed with the median across rows. Appendix B presents a characterization of the number of missing values per feature and per time-point. For each time-point, features 11 to 33 are the longitudinal ones and are also the ones presenting the greatest amount of missing data.
Figure 4.9: Proportion of missing values in classes Evol and noEvol. 80% of the values in the Evol samples are missing.
4.2.2 Biclustering results
As previously mentioned, the modified version of the biclustering algorithm, e-CCC-Biclustering, was applied to the longitudinal features, which were previously transformed into one-feature matrices. The results of this procedure are presented here. The discretization described in Chapter 3.1, needed when applying this algorithm, was performed with two different numbers of symbols, 3 and 5, corresponding to the alphabets U, N, D and A, B, C, D, E. The reason for using these two settings was to understand their influence on the biclustering, imputation and classification results. It is expected that a better discretization, i.e., one in which the error made when discretizing values is smaller, should lead to better imputed values; however, the biclustering results are expected to be worse, i.e., a lower number of interesting biclusters found.
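As a sketch of the kind of discretization involved, a 3-symbol U/N/D alphabet can be obtained by labeling each transition between consecutive time-points as Up, No-change or Down. The threshold-on-differences rule below is an illustrative assumption; the exact rule is the one defined in Chapter 3.1.

```python
# Hedged sketch of a 3-symbol (U/N/D) discretization of one patient's
# time series: each transition between consecutive time-points is
# labeled Up, No-change or Down given a threshold t. The threshold rule
# is an illustrative assumption, not the exact rule of Chapter 3.1.

def discretize_row(values, t=0.5):
    symbols = []
    for prev, curr in zip(values, values[1:]):
        diff = curr - prev
        if diff > t:
            symbols.append('U')      # value went up
        elif diff < -t:
            symbols.append('D')      # value went down
        else:
            symbols.append('N')      # no relevant change
    return symbols

print(discretize_row([34, 33, 33, 30, 31]))   # → ['D', 'N', 'D', 'U']
```

A 5-symbol alphabet (A to E) refines the same idea with finer bins, which lowers the discretization error but tends to fragment the patterns that the biclustering can find.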
To understand the ability of the e-CCC-Biclustering algorithm to find and group together patients that
show the same trends, it is necessary to analyze the amount and importance of the found biclusters.
Although the trivial biclusters have already been filtered out, not all resulting biclusters are of interest for
the present problem, and thus a characterization relating the size and significance of the biclusters is
needed. For this reason, the biclusters to consider are grouped into four different categories (ALL, SIG, TP, and TPeSIG), as introduced in Chapter 3.1. The amount of biclusters in each of these categories, distributed by feature, is presented in Figure 4.10.
This analysis allows for a characterization of the biclusters found and supports the general idea that, as expected, biclusters that are both significant and have three or more time-points are scarce. Another interesting observation, obtained by comparing this analysis with the number of missing values in each feature (presented in Appendix B), is that the higher the number of missing values, the lower the number of biclusters found, for both the 3- and 5-symbol discretizations. This was expected, since missings lead to a loss of information: the higher the number of missing values, the more scattered the data becomes, increasing the difficulty of finding interesting biclusters.
Figure 4.10: Distribution of the biclusters through the bicluster categories. ALL: all biclusters after filtering out the trivial ones. SIG: significant biclusters. TP: biclusters with 3 or more time-points. TPeSIG: significant biclusters with three or more time-points.
An important aspect to analyze, since it is highly correlated with the imputation results, is the amount of missing values that are caught in biclusters. In Figure 4.11, the percentage of missing values falling in each bicluster category is presented for biclusters from both the 3- and 5-symbol discretizations.
Figure 4.11: Representation of the number of missing values belonging to each set of biclusters.
As may be observed, the total number of missings grouped in biclusters does not exceed 30%, and that is considering all the non-trivial biclusters. If a more restricted but also more interesting group of biclusters is considered, only approximately 5% of the missings are caught. Also, regarding the effect of the discretization options on the capability of finding biclusters and on their quality, one may observe from all the previous analyses that using 3 or 5 symbols does not result in a concrete difference.
4.2.3 Datasets imputation results
As presented in Chapter 3.2, several imputation methods and their combinations were applied to this data, resulting in several imputed datasets. This section describes in detail each dataset created, stating the imputation method or strategy used as well as the specific settings applied. Here, each dataset is identified by the imputation method used for its creation.
ORI The original dataset; imputation with the proposed methods is not performed, and the missings are instead left to the default missing-value treatment implemented by WEKA for each classifier. This dataset is the baseline for comparing the proposed imputation methods with the default ones.
MED The missing values were imputed with the median of all values of the same feature across all patients in the dataset. This method is capable of imputing all missings in the dataset.
MEDL Missing values were imputed with the median of the values from the same patient and feature across all time-points. This may be seen as a median taken horizontally, in contrast with the MED method, which may be seen as a median taken vertically. In the specific cases where, for a single patient, observations at all time-points of the same feature are missing, this method is incapable of predicting any value to impute. The percentage of missings imputed is about 64%.
MEDL MED This dataset is imputed with the previous imputation approach, and additionally MED is applied. This strategy allows the remaining 36% of missings to be imputed.
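The MED ("vertical") and MEDL ("horizontal") strategies above can be sketched on a one-feature matrix of shape patients x time-points, with NaN marking a missing value. This is a minimal NumPy sketch, not the exact implementation used in this work:

```python
import numpy as np

X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 4.0],
              [np.nan, np.nan, np.nan]])  # patient with no observations at all

def impute_med(X):
    # MED: median per time-point, taken vertically across all patients
    out = X.copy()
    col_median = np.nanmedian(out, axis=0)
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = col_median[cols]
    return out

def impute_medl(X):
    # MEDL: median taken horizontally across a patient's own time-points;
    # rows with no observed value at all cannot be imputed and stay missing
    out = X.copy()
    for row in out:
        if np.all(np.isnan(row)):
            continue
        row[np.isnan(row)] = np.nanmedian(row)
    return out
```

The fully missing third row is exactly the case in which MEDL fails and the MEDL MED combination falls back to MED.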
EM The EM imputation method implementation described in Chapter 3.2 is applied to the entire dataset. This approach is able to impute every missing value in the dataset.
EMfeature The same EM implementation is applied to each of the one-feature matrices, whose structure is introduced in Chapter 3.1. In the few cases where an entire row or an entire column of values is missing in these matrices, the EM imputation algorithm cannot predict any value for the corresponding feature. These cases correspond to 4% of the missings, which are left missing.
EMfeature MED In the special cases where EMfeature is not capable of predicting the values, the cross-persons-median imputation is applied, allowing a complete dataset to be obtained.
BIC3TPem This dataset is constructed by imputing missings with the bicluster-based imputation strategy using EM. The biclustering procedure is performed on a discretized version of the data with an alphabet of 3 symbols, and from the found biclusters only those with 3 or more time-points are considered. This approach is able to impute all missing values belonging to this set of biclusters; the others remain missing. The amount of missing values imputed amounts only to approximately 6% of the missing values in the original dataset.
BIC3TPem MED This dataset is the same as the previous one, with the remaining missings imputed with MED imputation.
BIC3TPem EMfeature This dataset is the version of BIC3TPem where the remaining missings are imputed with the EMfeature approach. As before, this approach is not able to take care of every missing, so that in total only 68% of the missings are imputed.
BIC3TPem EM This dataset is also a version of BIC3TPem, where the additional method to impute the remaining missings is EM. This strategy is able to impute every missing in the dataset.
BIC3ALLmed Missings are imputed with bicluster-based imputation with median, using all non-trivial biclusters found on the discretized version of the data with 3 symbols. The missings belonging to these biclusters amount to about 30% of the total missing values and are all imputed.
BIC3ALLmed MED Here the remaining missings from BIC3ALLmed are imputed with MED, which results in a complete dataset with no missing values.
BIC3SIGmed Missings are imputed with bicluster-based imputation with median, using only the significant biclusters. The biclusters are found on a discretized version of the data with 3 symbols. The number of missings imputed corresponds to the number of missings existing in the selected biclusters, about 9% of the total.
BIC3SIGmed MED The remaining missings from BIC3SIGmed are imputed with MED; all missings are imputed.
BIC3SIGTPmed The bicluster-based imputation is here performed using median as the internal imputation method and using biclusters that are both significant and have 3 or more time-points. The discretization is also performed with 3 symbols. The amount of missings this strategy is able to impute corresponds to the total number of missings found in the selected biclusters, about 5%.
BIC3SIGTPmed MED The remaining missings from BIC3SIGTPmed are imputed with MED, resulting in a complete dataset.
BIC3TPmed Biclustering-based imputation is applied with the median imputation approach; the data is discretized with an alphabet of 3 symbols, and the biclusters selected to perform the imputation are those with at least 3 time-points. The amount of missings imputed with this strategy is about 6% of the total.
BIC3TPmed MED To deal with the remaining missings from BIC3TPmed, MED is applied. This allows for a complete imputation of every missing in the dataset.
BIC3TPpattern In this dataset, the biclustering-based imputation by pattern is applied to the missings belonging to the biclusters with at least 3 time-points. Here a discretization with an alphabet of 3 symbols is used. This strategy is able to impute every missing belonging to the selected biclusters, about 6% of the total amount of missings in the original dataset.
BIC3TPpattern MED The previous dataset is additionally imputed with the MED approach to deal with the remaining missings. This results in a complete dataset.
BIC5TPem This dataset was imputed with biclustering-based imputation using the EM approach. The discretization was performed with an alphabet of 5 symbols, and the selected biclusters are those presenting at least 3 time-points. This strategy is able to impute 6% of the missings.
BIC5TPem MED The remaining missings from the BIC5TPem dataset are here imputed with the MED procedure. The resulting dataset has no missings.
BIC5TPem EMfeature The remaining missings from the BIC5TPem dataset are here imputed with the EMfeature procedure. The resulting dataset still has about 2% of missing values.
BIC5TPem EM The remaining missings from the BIC5TPem dataset are here imputed with the EM procedure. This results in a complete dataset.
BIC5ALLmed Missings were imputed using the biclustering-based imputation with median. Biclustering is here performed on a discretized version of the data with 5 symbols, and the imputation process uses all non-trivial biclusters found. The amount of missings this strategy is able to impute corresponds to the number of missings existing in the selected biclusters, about 30%.
BIC5ALLmed MED The remaining missings in the previous dataset are here imputed with MED, generating a completely imputed dataset.
BIC5SIGmed Missings are imputed with bicluster-based imputation with median, using only the significant biclusters. Here the biclusters are found on a discretized version of the data with 5 symbols. The number of missings imputed corresponds to the number of missings existing in the selected biclusters, about 9%.
BIC5SIGmed MED The remaining missings from BIC5SIGmed are imputed with MED; all missings are imputed.
BIC5SIGTPmed The bicluster-based imputation is here performed using median as the internal imputation method and using biclusters that are both significant and have 3 or more time-points. The discretization is here performed with 5 symbols. The amount of missings this strategy is able to impute corresponds to the total number of missings found in the selected biclusters, about 5%.
BIC5SIGTPmed MED The remaining missings from BIC5SIGTPmed are imputed with MED, resulting in a complete dataset.
BIC5TPmed Biclustering-based imputation is applied with the median imputation approach; the data is discretized with an alphabet of 5 symbols, and the biclusters selected to perform the imputation are those with at least 3 time-points. The amount of missings imputed with this strategy is about 6% of the total.
BIC5TPmed MED To deal with the remaining missings from BIC5TPmed, MED is applied. This allows for a complete imputation of every missing in the dataset.
BIC5TPpattern In this dataset, the biclustering-based imputation by pattern is applied to the missings belonging to the biclusters with at least 3 time-points. Here a discretization with an alphabet of 5 symbols is used. This strategy is able to impute every missing belonging to the selected biclusters, about 6% of the total amount of missings in the original dataset.
BIC5TPpattern MED The previous dataset is additionally imputed with the MED approach, in order to
deal with the remaining missings. This results in a complete dataset.
An organized visual representation of these descriptions is presented in Figure 4.12. Therein, for each dataset, the methods and settings used appear in green, together with the information of how many missings each method is able to impute.
Figure 4.12: For each dataset, the imputation approaches used (green), the order of their application, and the number of missings each method is able to impute.
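As a minimal sketch of the bicluster-based imputation with median used by the BIC*med strategies, assume, for illustration, that a missing entry inside a bicluster is replaced by the median of the observed values in the same time-point of that bicluster; the precise procedure is the one defined in Chapter 3.2.

```python
import numpy as np

def bicluster_median_impute(X, rows, cols):
    """Impute the missings that fall inside a bicluster (rows x cols)
    with the column-wise median of the bicluster's observed values."""
    out = X.copy()
    sub = out[np.ix_(rows, cols)]          # bicluster submatrix (a copy)
    col_median = np.nanmedian(sub, axis=0)
    nan_r, nan_c = np.where(np.isnan(sub))
    sub[nan_r, nan_c] = col_median[nan_c]
    out[np.ix_(rows, cols)] = sub          # write the submatrix back
    return out

# One-feature matrix (patients x time-points) and a hypothetical
# bicluster grouping patients 0-2 over time-points 0-2:
X = np.array([[1.0, 2.0, 3.0, 9.0],
              [1.0, np.nan, 3.0, 0.0],
              [1.0, 2.0, np.nan, 5.0],
              [7.0, 7.0, 7.0, 7.0]])
Y = bicluster_median_impute(X, rows=[0, 1, 2], cols=[0, 1, 2])
```

Missings outside the selected biclusters are left untouched, which is why the BIC* strategies alone impute only a small fraction of the missings and are combined with MED, EMfeature or EM.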
4.2.4 Classification results
Because of the extreme class unbalance, applying SMOTE was imperative, and only with SMOTE at 300% was it possible to obtain a balanced dataset. As is the usual procedure, SMOTE at 500% was also applied in order to obtain the inversely unbalanced dataset, i.e., one where class Evol has as many more instances than class noEvol had in the original dataset. These two procedures were applied to each dataset described above.
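The SMOTE step can be sketched as follows: each synthetic minority sample is an interpolation between a minority instance and one of its k nearest minority neighbors, and SMOTE N% creates N/100 synthetic samples per instance. This is a minimal pure-Python version of the classic algorithm, not the exact implementation used here.

```python
import random

def smote(minority, n_percent=300, k=3, rng=random.Random(0)):
    """Minimal SMOTE sketch: returns n_percent/100 synthetic samples per
    minority instance, interpolated towards a random nearest neighbor."""
    synthetic = []
    per_sample = n_percent // 100
    for x in minority:
        # k nearest neighbors among the remaining minority samples
        neighbors = sorted(
            (m for m in minority if m is not x),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbors)
            gap = rng.random()              # random point on the segment x -> nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (8.0, 8.0)]
new = smote(minority, n_percent=300)        # 3 synthetic samples per instance
```

With 31 Evol against 128 noEvol samples, SMOTE 300% brings the minority class to roughly 124 instances (approximately balanced), while SMOTE 500% brings it to 186, inverting the unbalance.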
After this stage, all the datasets described, together with those created by SMOTE, are classified as explained before. The resulting classifications are evaluated through the metrics TP rate, TN rate, Precision, Recall, K-statistics and F-measure. The extensive results of these metrics for each dataset and classifier are presented in Appendix C.
Being unable to directly evaluate the imputation methods on these data, the focus here is to understand which methods work better with which classifiers, in order to help improve the classification process addressed in other works. In Figures 4.13, 4.14, 4.15 and 4.16, the F-measure for each classifier on the balanced datasets (with SMOTE 300%) is represented. From these it is possible not only to observe which method is best for each classifier but, more importantly, to examine some aspects of the relation between the particularities of the imputation methods and the classifiers' performance. It is also important to keep in mind that the results concerning the original dataset (ORI) serve to evaluate the default handling of missings implemented by each classifier, and to compare it with the imputation methods under test. The default strategies are described in Chapter 4.2.
Figure 4.13: F-measure for all balanced datasets classified with NaiveBayes.
Figure 4.14: F-measure for all balanced datasets classified with linearSVM.
Figure 4.15: F-measure for all balanced datasets classified with K-nearest-neighbor.
Figure 4.16: F-measure for all balanced datasets classified with Decision Trees.
Using the Naive Bayes classifier, the imputations applying the median to the whole dataset (MED) and the bicluster-based imputation with median (for example, the datasets imputed with BIC3SIG and BIC5SIG) improve over WEKA's default method for dealing with missing values (ORI). Also, when using KNN, the default method for dealing with missing values in WEKA performs worse than the bicluster-based imputation approaches with median (BIC3SIG and BIC5SIG), EM applied to the whole dataset (EM), and the median applied to the whole dataset (MED). Regarding Decision Trees (DT), it is noticeable that the bicluster-based imputation procedure with EM improves the results over the EM imputation applied feature by feature. As to the linear SVM, the biclustering-based imputation methods using the by-pattern approach (for example, the datasets imputed with BIC3bypattern) and the median applied to the whole dataset (MED) improve results over the default missing-value treatment of WEKA. The bicluster-based imputation with the by-pattern approach also improves results over EM applied feature by feature (EMbyfeature) and over the median applied to the whole dataset (MED).
These conclusions are supported by the results of the Wilcoxon signed-rank test, which compares the results of two experiments and determines whether the values are significantly different. The results of the applied tests, in the form of p-values, are presented for each classifier in Tables 4.1, 4.2, 4.3 and 4.4. A p-value lower than 0.05 indicates that the mean F-score values of the two experiments are significantly different, and thus conclusions about performance may be drawn.
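This testing step can be sketched with `scipy.stats.wilcoxon`; the F-score lists below are hypothetical, whereas the real ones come from the repeated cross-validation runs.

```python
from scipy.stats import wilcoxon

# Paired F-scores of two experiments over matched cross-validation runs
# (hypothetical numbers for illustration):
fscores_ori = [0.84, 0.85, 0.83, 0.86, 0.84, 0.85, 0.83, 0.84, 0.86, 0.85]
fscores_med = [0.87, 0.88, 0.86, 0.89, 0.87, 0.88, 0.87, 0.86, 0.88, 0.87]

# Two-sided Wilcoxon signed-rank test on the paired differences
stat, p_value = wilcoxon(fscores_ori, fscores_med)
significant = p_value < 0.05    # reject "same median F-score" at the 5% level
```

The test is preferred over a paired t-test here because it does not assume normally distributed F-score differences.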
Regarding the use of 3 or 5 symbols in the discretization process, comparing the classification results of the several BIC3 and BIC5 methods suggests that these options do not introduce much difference here either. Table 4.5 presents this comparison.
Table 4.1: P-values of the Wilcoxon signed-rank test applied to the F-score results of the experiments defined in the left column, together with the mean F-score values of each experiment.

Naive Bayes
Experiment                          Mean1   Mean2   P-value
MED vs BIC3bypattern MED            0,898   0,8947  0,326751396
MED vs BIC5bypattern MED            0,898   0,8935  0,239517679
MED vs BIC3SIGTP MED                0,898   0,9007  0,567822337
ORI vs BIC3SIG                      0,872   0,8777  5,18E-01
ORI vs BIC5SIG                      0,872   0,8792  3,52E-01
ORI vs BIC3bypattern                0,872   0,8754  0,797367247
ORI vs BIC5bypattern                0,872   0,8722  0,791341672
ORI vs MED                          0,872   0,898   2,36E-03
ORI vs EM                           0,872   0,8802  0,400191719
ORI vs EMbyfeature                  0,872   0,8754  0,667968919
EM vs BIC3em EM                     0,8802  0,8654  0,000728388
EM vs BIC5em EM                     0,8802  0,8671  0,00199244
EMbyfeature vs BIC3em EMbyfeature   0,8754  0,8701  0,820018494
EMbyfeature vs BIC5em EMbyfeature   0,8754  0,8763  0,979542991
Table 4.2: P-values of the Wilcoxon signed-rank test applied to the F-score results of the experiments defined in the left column, together with the mean F-score values of each experiment.

KNN
Experiment                          Mean1   Mean2   P-value
MED vs BIC3bypattern MED            0,8613  0,8549  0,2829474
MED vs BIC5bypattern MED            0,8613  0,8552  0,484913
MED vs BIC3SIGTP MED                0,8613  0,8508  0,0438265
ORI vs BIC3SIG                      0,6696  0,7272  9,60E-06
ORI vs BIC5SIG                      0,6696  0,7262  1,94E-05
ORI vs BIC3bypattern                0,6696  0,6648  0,9484468
ORI vs BIC5bypattern                0,6696  0,6685  0,8658498
ORI vs MED                          0,6696  0,8613  7,56E-10
ORI vs EM                           0,6696  0,7059  0,0008672
ORI vs EMbyfeature                  0,6696  0,6798  0,2871624
EM vs BIC3em EM                     0,7059  0,7045  0,5722664
EM vs BIC5em EM                     0,7059  0,6988  0,7500602
EMbyfeature vs BIC3em EMbyfeature   0,6798  0,6656  0,2679884
EMbyfeature vs BIC5em EMbyfeature   0,6798  0,6858  0,8431294
Table 4.3: P-values of the Wilcoxon signed-rank test applied to the F-score results of the experiments defined in the left column, together with the mean F-score values of each experiment.

DT
Experiment                          Mean1   Mean2   P-value
MED vs BIC3bypattern MED            0,8218  0,8259  0,9881
MED vs BIC5bypattern MED            0,8218  0,8185  0,7466
MED vs BIC3SIGTP MED                0,8218  0,8203  0,7315
ORI vs BIC3SIG                      0,8072  0,8157  0,5657
ORI vs BIC5SIG                      0,8072  0,8165  0,4418
ORI vs BIC3bypattern                0,8072  0,8153  0,5115
ORI vs BIC5bypattern                0,8072  0,82    0,2012
ORI vs MED                          0,8072  0,8218  0,1809
ORI vs EM                           0,8072  0,786   0,064
ORI vs EMbyfeature                  0,8072  0,8066  0,6535
EM vs BIC3em EM                     0,786   0,795   0,6154
EM vs BIC5em EM                     0,786   0,7795  0,5464
EMbyfeature vs BIC3em EMbyfeature   0,8066  0,7836  0,088
EMbyfeature vs BIC5em EMbyfeature   0,8066  0,8321  0,0164
Table 4.4: P-values of the Wilcoxon signed-rank test applied to the F-score results of the experiments defined in the left column, together with the mean F-score values of each experiment.

linear SVM
Experiment                          Mean1   Mean2   P-value
MED vs BIC3bypattern MED            0,8727  0,8872  0,0083
MED vs BIC5bypattern MED            0,8727  0,872   0,9953
MED vs BIC3SIGTP MED                0,8727  0,8918  0,0005
ORI vs BIC3SIG                      0,841   0,8442  0,8135
ORI vs BIC5SIG                      0,841   0,8461  0,7981
ORI vs BIC3bypattern                0,841   0,8671  0,0274
ORI vs BIC5bypattern                0,841   0,8588  0,1309
ORI vs MED                          0,841   0,8727  0,0093
ORI vs EM                           0,841   0,8546  0,2249
ORI vs EMbyfeature                  0,841   0,8434  0,8853
EM vs BIC3em EM                     0,8546  0,8618  0,225
EM vs BIC5em EM                     0,8546  0,861   0,3426
EMbyfeature vs BIC3em EMbyfeature   0,8434  0,846   0,8056
EMbyfeature vs BIC5em EMbyfeature   0,8434  0,8601  0,0144
Table 4.5: F-measure results for each classifier and each bicluster-based imputation method, using 3 and 5 symbols in the discretization.

Method            Symbols     NB            KNN           Linear SVM    DT
BIC-n-TPem        3 symbols   0,868055785   0,662322251   0,865307774   0,866344002
BIC-n-TPem        5 symbols   0,868142974   0,618738883   0,854791296   0,871225781
BIC-n-ALLmed      3 symbols   0,879851286   0,676202215   0,855354761   0,881434958
BIC-n-ALLmed      5 symbols   0,874813669   0,680651013   0,850386687   0,882611443
BIC-n-SIGmed      3 symbols   0,877722672   0,727191065   0,844211513   0,886912739
BIC-n-SIGmed      5 symbols   0,879193357   0,726209807   0,846123113   0,887166384
BIC-n-TPmed       3 symbols   0,871242938   0,677981705   0,865453804   0,881113131
BIC-n-TPmed       5 symbols   0,867237839   0,634922946   0,851689888   0,882896434
BIC-n-SIGTPmed    3 symbols   0,871348082   0,684007823   0,851011579   0,868566242
BIC-n-SIGTPmed    5 symbols   0,868506938   0,69033395    0,849971234   0,877556996
BIC-n-TPpattern   3 symbols   0,875402185   0,664789098   0,867100562   0,877213641
BIC-n-TPpattern   5 symbols   0,872178449   0,668481014   0,858814724   0,863707615
Chapter 5
Conclusions and Future work
In this work, the problem of missing values in longitudinal data was studied. Synthetic datasets were generated to test the performance of imputation methods against a ground truth, and to test the influence of the amount of missing values on the imputation methods. The problem was further evaluated on a real-world dataset of ALS patients, where the purpose is to predict the evolution of the need for Non-Invasive Ventilation (NIV) for assisted breathing.
The problem of imputing missing values in longitudinal data is approached here through the use of biclustering algorithms. The application of biclustering algorithms made it possible to find trends in the data, from which better imputations can be performed. Using biclusters to impute local portions of the data shows an improvement in the quality of the imputation on the synthetic data, as well as in the performance of the classifications on the real-world data. The tested methods are robust to the number of missing values, even when this amount rises dramatically to 50% of the total data.
Regarding the ALS data, the biclustering algorithm requires discretization before being applied, and it was found that using 3 or 5 symbols in the discretization does not change the results significantly.
For the future, it is of interest to analyze how the proposed biclustering-based imputation approaches deal with different mechanisms of missing values, i.e., to compare datasets with MCAR, MAR and NMAR data. Also of interest is the application of the same methods to other real-world datasets, such as an Alzheimer's disease one, to confirm the conclusions drawn in this work.
The prediction of NIV in ALS patients based on the present dataset may be considered an extreme case, where the amount of missing values is huge and their distribution between the two classes is very unbalanced. The generally good classification results obtained here make us believe that this work is on the right track and has contributed positively to the solution of the problem at hand.
Bibliography
[1] Paul D Allison. Missing data, volume 136. Sage publications, 2001.
[2] Andre V. Carreiro, Susana Pinto, Alexandra M. Carvalho, Mamede de Carvalho, and Sara C.
Madeira. Predicting non-invasive ventilation in ALS patients using time windows. In ACM SIGKDD
Workshop on Healthcare Informatics (HI-KDD 2014), 2014.
[3] Fabrício Olivetti de França, Guilherme Palermo Coelho, and Fernando J Von Zuben. Predicting
missing values with biclustering: A coherence-based approach. Pattern Recognition, 46(5):1255–
1266, 2013.
[4] A Rogier T Donders, Geert JMG van der Heijden, Theo Stijnen, and Karel GM Moons. Review: a
gentle introduction to imputation of missing values. Journal of clinical epidemiology, 59(10):1087–
1091, 2006.
[5] Jean Mundahl Engels and Paula Diehr. Imputation of missing longitudinal data: a comparison of
methods. Journal of clinical epidemiology, 56(10):968–976, 2003.
[6] Rui Henriques, Francisco L. Ferreira, and Sara C. Madeira. Bigen: Synthetic data generator for
biclustering. Submitted for publication, available in: http://web.ist.utl.pt/rmch/software/bigen/, 2015.
[7] James Honaker and Gary King. What to do about missing values in time-series cross-section data.
American Journal of Political Science, 54(2):561–581, 2010.
[8] Roderick Little and Donald B Rubin. Statistical analysis with missing data. John Wiley & Sons,
2014.
[9] Sara C Madeira and Arlindo L Oliveira. Biclustering algorithms for biological data analysis: a survey.
Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 1(1):24–45, 2004.
[10] Sara C Madeira and Arlindo L Oliveira. A linear time biclustering algorithm for time series gene
expression data. In Algorithms in Bioinformatics, pages 39–52. Springer, 2005.
[11] Fernando Jorge Tusell Palmer and María Jesús Bárcena Ruiz. Multivariate data imputation using trees. Technical report, Universidad del País Vasco, Departamento de Economía Aplicada III (Econometría y Estadística), 2002.
[12] Trivellore E Raghunathan, James M Lepkowski, John Van Hoewyk, and Peter Solenberger. A
multivariate technique for multiply imputing missing values using a sequence of regression models.
Survey methodology, 27(1):85–96, 2001.
[13] Michael B Richman, Theodore B Trafalis, and Indra Adrianto. Missing data imputation through ma-
chine learning algorithms. In Artificial Intelligence Methods in the Environmental Sciences, pages
153–169. Springer, 2009.
[14] Vanessa Romero and Antonio Salmeron. Multivariate imputation of qualitative missing data us-
ing bayesian networks. In Soft methodology and random information systems, pages 605–612.
Springer, 2004.
[15] Donald B Rubin. Multiple imputation for nonresponse in surveys, volume 81. John Wiley & Sons,
2004.
[16] Joseph L Schafer and John W Graham. Missing data: our view of the state of the art. Psychological
methods, 7(2):147, 2002.
[17] Tapio Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance
matrices and imputation of missing values. Journal of Climate, 14(5):853–871, 2001.
[18] Matthias Templ, Alexander Kowarik, and Peter Filzmoser. Iterative stepwise regression imputation
using standard and robust methods. Computational Statistics & Data Analysis, 55(10):2793–2806,
2011.
[19] Jos Twisk and Wieke de Vente. Attrition in longitudinal studies: how to deal with missing data.
Journal of clinical epidemiology, 55(4):329–337, 2002.
Appendix A
Synthetic data results
Table A.1: Extensive results of the evaluation performed on all datasets of size 1000*150.

Dataset size: 1000*150. Mean imputation error = (real value - imputed value) / n imputations.

Missing %   Imputation mode   % of missings imputed   Mean imputation error
10%         BICmed            20.32666667             1,413742211
10%         BICem             20.32666667             0.937845142
10%         BICmed MED        100                     4,271733333
10%         BICem MED         100                     4,175002542
10%         BICem EM          100                     4,211056034
10%         MED               100                     4,983866667
10%         MEDL              100                     5,033466667
10%         MEDL MED          100                     5,033466667
10%         EM                100                     4,755692469
20%         BICmed            21,54                   1,324899412
20%         BICem             21,54                   0.933288337
20%         BICmed MED        100                     4,262833333
20%         BICem MED         100                     4,177121824
20%         BICem EM          100                     4,194946974
20%         MED               100                     5,006033333
20%         MEDL              100                     5,048933333
20%         MEDL MED          100                     5,048933333
20%         EM                100                     4,757333401
30%         BICmed            21,09111111             1,337372247
30%         BICem             21,09111111             1,017368599
30%         BICmed MED        100                     4,268144444
30%         BICem MED         100                     4,201484852
30%         BICem EM          100                     4,234740331
30%         MED               100                     5,000688889
30%         MEDL              100                     5,060088889
30%         MEDL MED          100                     5,060088889
30%         EM                100                     4,827003504
50%         BICmed            21,84133333             1,430407179
50%         BICem             21,84133333             1,269626416
50%         BICmed MED        100                     4,259079952
50%         BICem MED         100                     4,221598019
50%         BICem EM          100                     4,273330267
50%         MED               100                     4,98176
50%         MEDL              100                     5,053426667
50%         MEDL MED          100                     5,053426667
50%         EM                100                     4,854577885
Table A.2: Extensive results of the evaluation performed on all datasets of size 2000*200.

Dataset size: 2000*200. Mean imputation error = (real value - imputed value) / n imputations.

Missing %   Imputation mode   % of missings imputed   Mean imputation error
10%         BICmed            25,3175                 1,126542905
10%         BICem             25,3175                 0.643620933
10%         BICmed MED        100                     4,0512125
10%         BICem MED         100                     3,926985166
10%         BICem EM          100                     3,96411498
10%         MED               100                     4,950925
10%         MEDL              100                     5,010925
10%         MEDL MED          100                     5,010925
10%         EM                100                     4,493694918
20%         BICmed            24,39125                1,171834162
20%         BICem             24,39125                0.715224544
20%         BICmed MED        100                     4,0920625
20%         BICem MED         100                     3,979542287
20%         BICem EM          100                     4,021799629
20%         MED               100                     4,981625
20%         MEDL              100                     5,031825
20%         MEDL MED          100                     5,031825
20%         EM                100                     4,598826337
30%         BICmed            23,74666667             1,202133633
30%         BICem             23,74666667             0.755448609
30%         BICmed MED        100                     4,127716667
30%         BICem MED         100                     4,022413642
30%         BICem EM          100                     4,06061872
30%         MED               100                     4,959583333
30%         MEDL              100                     5,0123
30%         MEDL MED          100                     5,0123
30%         EM                100                     4,63213825
50%         BICmed            25,9685                 1,171063019
50%         BICem             25,9685                 0.869324979
50%         BICmed MED        100                     4,0334
50%         BICem MED         100                     3,952184797
50%         BICem EM          100                     4,036433586
50%         MED               100                     4,968145
50%         MEDL              100                     5,024245
50%         MEDL MED          100                     5,024245
50%         EM                100                     4,63234984
Table A.3: Extensive results of the evaluation performed on all datasets of size 5000*200. The mean imputation error is computed as (real value - imputed value) / n imputations.

Missing %  Imputation mode  % of missings imputed  Mean imputation error
10%  BICmed  47.245  1.16156207
10%  BICem  47.245  0.618308819
10%  BICmed MED  100  3.25001
10%  BICem MED  100  2.993413285
10%  BICem EM  100  3.028054068
10%  MED  100  4.89789
10%  MEDL  100  5.01538
10%  MEDL MED  100  5.01538
10%  EM  100  3.56259588
Appendix B
Dataset statistical description
Table B.1: Statistical description for each feature in the 1st time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 1  159  311.36  186.020  0  0.0  0  0
Gender 1  159  1.48  0.501  0  0.0  0  0
BMI 1  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 1  147  1.93  0.264  12  7.5
Ageatonset 1  159  57.67  12.476  0  0.0  1  0
Onsetform 1  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 1  159  20.159  28.3256  0  0.0  0  11
ALSFRS 1  136  34.16  4.867  23  14.5  4  0
ALSFRSR 1  127  42.14  5.111  32  20.1  4  0
ALSFRSb 1  135  10.70  2.038  24  15.1  3  0
R 1  129  11.81  0.512  30  18.9
SpO2mean 1  29  95.3159  1.31259  130  81.8  0  0
SpO2min 1  27  87.30  4.866  132  83.0  0  0
SpO290 1  29  0.4883  0.82623  130  81.8  0  6
Dips4 1  22  11.41  14.212  137  86.2  0  2
Dipsh4 1  27  2.1489  3.40944  132  83.0  0  3
Dips3 1  22  18.00  22.112  137  86.2  0  2
Dipsh3 1  27  3.7044  4.77662  132  83.0  0  3
Pattern 1  28  1.79  0.787  131  82.4  0  0
VC 1  113  99.1356  18.48219  46  28.9  4  1
FVC 1  115  99.4253  20.30843  44  27.7  4  0
MIP 1  106  63.6730  29.70809  53  33.3  0  1
MEP 1  105  79.9995  27.84390  54  34.0  0  0
P01 1  97  95.9935  36.65919  62  39.0  2  4
PO2 1  97  86.985  8.2239  62  39.0  0  0
PCO2 1  97  38.843  3.7038  62  39.0  1  3
peso 1  97  68.21  12.253  62  39.0  0  1
PhrenMeanLat 1  72  7.96014  0.896103  87  54.7  0  1
PhrenMeanAmpl 1  73  0.63336  0.225156  86  54.1  0  0
PhrenMeanArea 1  64  2.58992  1.606443  95  59.7  0  1
Table B.2: Statistical description for each feature in the 2nd time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 2  159  311.36  186.020  0  0.0  0  0
Gender 2  159  1.48  0.501  0  0.0  0  0
BMI 2  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 2  147  1.93  0.264  12  7.5
Ageatonset 2  159  57.67  12.476  0  0.0  1  0
Onsetform 2  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 2  159  20.159  28.3256  0  0.0  0  11
ALSFRS 2  144  32.85  5.062  15  9.4  4  0
ALSFRSR 2  136  40.88  5.189  23  14.5  4  0
ALSFRSb 2  144  10.41  2.518  15  9.4  5  0
R 2  136  11.82  0.560  23  14.5
SpO2mean 2  54  95.1794  1.30176  105  66.0  0  0
SpO2min 2  50  87.54  4.841  109  68.6  2  0
SpO290 2  54  0.7996  2.37402  105  66.0  0  8
Dips4 2  45  14.38  15.370  114  71.7  0  1
Dipsh4 2  51  1.6363  1.77859  108  67.9  0  2
Dips3 2  45  25.38  21.805  114  71.7  0  1
Dipsh3 2  51  2.9722  2.51773  108  67.9  0  1
Pattern 2  54  1.89  0.904  105  66.0  0  0
VC 2  58  99.6247  23.00548  101  63.5  2  1
FVC 2  61  99.8059  22.50889  98  61.6  1  1
MIP 2  58  66.1457  26.94055  101  63.5  0  2
MEP 2  58  77.5350  29.71665  101  63.5  0  1
P01 2  52  95.4683  28.13483  107  67.3  0  3
PO2 2  50  87.804  10.4409  109  68.6  0  0
PCO2 2  50  38.958  3.5280  109  68.6  0  0
peso 2  43  71.88  13.590  116  73.0  0  1
PhrenMeanLat 2  61  8.11811  0.910869  98  61.6  0  0
PhrenMeanAmpl 2  61  0.66836  0.253015  98  61.6  0  0
PhrenMeanArea 2  59  2.68703  1.543984  100  62.9  0  2
Table B.3: Statistical description for each feature in the 3rd time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 3  159  311.36  186.020  0  0.0  0  0
Gender 3  159  1.48  0.501  0  0.0  0  0
BMI 3  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 3  147  1.93  0.264  12  7.5
Ageatonset 3  159  57.67  12.476  0  0.0  1  0
Onsetform 3  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 3  159  20.159  28.3256  0  0.0  0  11
ALSFRS 3  146  31.23  5.507  13  8.2  4  0
ALSFRSR 3  136  39.38  5.562  23  14.5  4  0
ALSFRSb 3  146  10.21  2.497  13  8.2  5  0
R 3  137  11.77  0.572  22  13.8
SpO2mean 3  50  94.9854  1.48872  109  68.6  2  0
SpO2min 3  47  85.543  5.9425  112  70.4  0  0
SpO290 3  50  2.5596  8.45576  109  68.6  0  6
Dips4 3  44  11.393  11.2819  115  72.3  0  1
Dipsh4 3  46  1.4774  1.46939  113  71.1  0  1
Dips3 3  41  20.71  17.982  118  74.2  0  3
Dipsh3 3  43  2.6888  2.28933  116  73.0  0  1
Pattern 3  46  1.93  0.879  113  71.1  0  0
VC 3  69  93.4180  18.48484  90  56.6  0  1
FVC 3  71  93.8775  18.94539  88  55.3  0  0
MIP 3  67  62.9934  31.24111  92  57.9  0  0
MEP 3  67  74.3136  35.37268  92  57.9  0  2
P01 3  58  101.7045  45.27073  101  63.5  0  6
PO2 3  49  88.345  8.8202  110  69.2  0  0
PCO2 3  49  38.769  3.5044  110  69.2  0  0
peso 3  59  66.44  10.978  100  62.9  0  0
PhrenMeanLat 3  58  7.97474  0.931877  101  63.5  3  0
PhrenMeanAmpl 3  58  0.57466  0.209979  101  63.5  1  1
PhrenMeanArea 3  58  2.26060  0.808356  101  63.5  0  1
Table B.4: Statistical description for each feature in the 4th time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 4  159  311.36  186.020  0  0.0  0  0
Gender 4  159  1.48  0.501  0  0.0  0  0
BMI 4  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 4  147  1.93  0.264  12  7.5
Ageatonset 4  159  57.67  12.476  0  0.0  1  0
Onsetform 4  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 4  159  20.159  28.3256  0  0.0  0  11
ALSFRS 4  149  29.56  6.483  10  6.3  4  0
ALSFRSR 4  140  37.59  6.617  19  11.9  4  0
ALSFRSb 4  149  9.77  3.003  10  6.3  5  0
R 4  141  11.62  0.762  18  11.3  3  0
SpO2mean 4  52  95.0142  1.35310  107  67.3  1  0
SpO2min 4  49  85.949  5.5474  110  69.2  1  0
SpO290 4  52  0.9500  1.98168  107  67.3  0  4
Dips4 4  47  17.9606  27.28716  112  70.4  0  7
Dipsh4 4  50  4.0286  13.36912  109  68.6  0  8
Dips3 4  45  28.7538  36.35080  114  71.7  0  4
Dipsh3 4  46  5.4630  14.07666  113  71.1  0  7
Pattern 4  47  2.09  0.775  112  70.4  0  0
VC 4  69  92.7926  24.18406  90  56.6  1  1
FVC 4  70  93.4324  24.76286  89  56.0  3  1
MIP 4  68  62.6638  28.22980  91  57.2  0  1
MEP 4  68  69.8210  35.54146  91  57.2  0  3
P01 4  56  95.5238  37.99597  103  64.8  0  2
PO2 4  53  88.604  9.7377  106  66.7  0  0
PCO2 4  53  38.430  2.8873  106  66.7  0  0
peso 4  55  67.53  10.315  104  65.4  0  0
PhrenMeanLat 4  43  8.08884  1.225198  116  73.0  1  2
PhrenMeanAmpl 4  43  0.56709  0.266939  116  73.0  0  0
PhrenMeanArea 4  42  2.59405  2.075788  117  73.6  0  1
Table B.5: Statistical description for each feature in the 5th time-point.

Feature  N  Mean  Std. deviation  Missings (count)  Missings (%)  Extremes (low)  Extremes (high)
Name 5  159  311.36  186.020  0  0.0  0  0
Gender 5  159  1.48  0.501  0  0.0  0  0
BMI 5  133  25.4286066615  3.78767643893  26  16.4  0  2
MND 5  147  1.93  0.264  12  7.5
Ageatonset 5  159  57.67  12.476  0  0.0  1  0
Onsetform 5  158  1.23  0.425  1  0.6
@1stsymptoms1stvisit 5  159  20.159  28.3256  0  0.0  0  11
ALSFRS 5  145  27.10  7.421  14  8.8  1  0
ALSFRSR 5  133  35.13  7.741  26  16.4  1  0
ALSFRSb 5  145  9.26  3.442  14  8.8  0  0
R 5  134  11.46  1.088  25  15.7  11  0
SpO2mean 5  48  94.6413  1.53173  111  69.8  0  1
SpO2min 5  47  83.834  8.7846  112  70.4  3  0
SpO290 5  48  1.3644  2.27144  111  69.8  0  3
Dips4 5  43  13.21  14.064  116  73.0  0  5
Dipsh4 5  46  1.7589  1.93567  113  71.1  0  5
Dips3 5  41  24.32  20.323  118  74.2  0  4
Dipsh3 5  42  3.1281  2.75294  117  73.6  0  2
Pattern 5  42  2.31  0.749  117  73.6  0  0
VC 5  70  88.9146  23.68825  89  56.0  0  0
FVC 5  72  88.8629  24.89035  87  54.7  0  0
MIP 5  68  57.8096  31.50190  91  57.2  0  1
MEP 5  68  69.3697  36.41167  91  57.2  0  2
P01 5  59  95.6593  33.96141  100  62.9  0  2
PO2 5  55  86.720  9.0579  104  65.4  0  3
PCO2 5  54  39.909  4.5258  105  66.0  0  1
peso 5  54  66.69  13.037  105  66.0  0  0
PhrenMeanLat 5  31  8.36355  0.945720  128  80.5  0  0
PhrenMeanAmpl 5  31  0.49226  0.215406  128  80.5  0  0
PhrenMeanArea 5  29  2.04276  0.839246  130  81.8  0  0
Appendix C
Real-world data classification results
Table C.1: Extensive results of the classification with Naive Bayes on all unbalanced datasets.

Naive Bayes, unbalanced original dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [19 4; 6 2]  0.759353846  0.385714  0.833288108  0.121873547  0.79131053
EM  [20 4; 6 3]  0.7784  0.418571  0.847134471  0.178652475  0.808623045
EMfeature  [20 4; 6 3]  0.777507692  0.401905  0.843564887  0.163081145  0.806343065
EMfeature MED  [20 3; 6 3]  0.775753846  0.453333  0.855688425  0.20072535  0.8113231
MED  [20 3; 6 3]  0.778861538  0.44  0.853066447  0.193586039  0.811957115
MEDL  [20 4; 6 2]  0.778  0.332381  0.828447452  0.101975689  0.799359827
MEDL MED  [20 4; 6 2]  0.766307692  0.315238  0.821871433  0.078350923  0.79044669
BIC3TPem  [20 3; 5 3]  0.791538462  0.485238  0.865021719  0.247713047  0.824749956
BIC3TPem MED  [20 4; 6 3]  0.781476923  0.421905  0.849028411  0.183107267  0.811903642
BIC3em EMfeature  [20 3; 6 3]  0.763230769  0.438095  0.849759509  0.172790245  0.80239782
BIC3em EM  [20 3; 5 3]  0.7876  0.494286  0.86611893  0.249982414  0.822749604
BIC3ALLmed  [20 4; 6 3]  0.782338462  0.414762  0.848283225  0.172507451  0.811516256
BIC3ALLmed MED  [20 3; 5 3]  0.790923077  0.451429  0.858227435  0.214861181  0.82041486
BIC3SIGmed  [20 4; 6 2]  0.767353846  0.389524  0.838987802  0.142603997  0.798778411
BIC3SIGmed MED  [20 3; 6 3]  0.774984615  0.441429  0.852493859  0.190561951  0.809551681
BIC3SIGTPmed  [20 3; 6 3]  0.7792  0.494762  0.866190854  0.23513914  0.817816327
BIC3SIGTPmed MED  [20 3; 5 3]  0.786276923  0.510476  0.870143142  0.261388026  0.823279476
BIC3TPmed  [20 3; 6 3]  0.785476923  0.491905  0.865679138  0.243872027  0.820919194
BIC3TPmed MED  [20 3; 6 3]  0.783230769  0.496667  0.866378262  0.243991787  0.819718719
BIC3TPpattern  [19 3; 6 3]  0.7568  0.472381  0.85786653  0.193203588  0.8008173
BIC3TPpattern MED  [20 4; 6 3]  0.767353846  0.414762  0.845138465  0.158012042  0.801367184
BIC5TPem  [20 3; 6 3]  0.780153846  0.437619  0.851686676  0.195913526  0.812093062
BIC5TPem MED  [20 4; 6 3]  0.778923077  0.401905  0.84513878  0.159470076  0.807489584
BIC5em EMfeature  [20 4; 6 2]  0.773323077  0.402381  0.842200217  0.15914093  0.803556164
BIC5em EM  [20 3; 6 3]  0.779476923  0.454286  0.855175839  0.20996597  0.813330358
BIC5ALLmed  [20 3; 6 3]  0.765876923  0.443333  0.852300287  0.177778462  0.803849208
BIC5ALLmed MED  [20 3; 6 3]  0.778215385  0.446667  0.854800471  0.198473262  0.812108513
BIC5SIGmed  [20 4; 6 2]  0.765753846  0.389524  0.838745483  0.140759305  0.797757701
BIC5SIGmed MED  [20 3; 6 3]  0.774984615  0.441429  0.852493859  0.190561951  0.809551681
BIC5SIGTPmed  [20 4; 6 3]  0.772984615  0.418095  0.84755954  0.166200792  0.806229261
BIC5SIGTPmed MED  [20 3; 6 3]  0.778215385  0.485714  0.863367209  0.228104376  0.815926962
BIC5TPmed  [20 4; 6 3]  0.771938462  0.422381  0.848651232  0.165459457  0.805614636
BIC5TPmed MED  [20 4; 6 3]  0.782892308  0.424286  0.850565969  0.180799328  0.812390009
BIC5TPpattern  [19 3; 6 3]  0.760123077  0.46381  0.855827601  0.190347297  0.802258994
BIC5TPpattern MED  [20 4; 6 2]  0.774369231  0.389048  0.841284676  0.140319173  0.803480379
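All measures reported in the classification tables can be derived from a 2x2 confusion matrix laid out as [TP FN; FP TN]. A minimal sketch of those derivations (the printed values are cross-validation averages, so recomputing them from the rounded matrices will not match exactly; the function name is illustrative):

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Derive TP rate, TN rate, precision, Cohen's kappa and F-measure
    from a 2x2 confusion matrix [tp fn; fp tn]."""
    tp_rate = tp / (tp + fn)             # sensitivity / recall
    tn_rate = tn / (tn + fp)             # specificity
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    total = tp + fn + fp + tn
    p_obs = (tp + tn) / total            # observed agreement
    # chance agreement: product of marginals for each class
    p_exp = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total**2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return tp_rate, tn_rate, precision, kappa, f_measure

print(metrics_from_confusion(5, 0, 0, 5))  # perfect classifier -> all 1.0
```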
Table C.2: Extensive results of the classification with Naive Bayes on all balanced (SMOTE300) datasets.

Naive Bayes, SMOTE300-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [23 5; 2 21]  0.914307692  0.818123  0.836446874  0.731892386  0.871985167
EM  [23 4; 3 21]  0.902646154  0.8466  0.861431688  0.749686365  0.880152473
EMfeature  [23 4; 3 21]  0.900215385  0.8385  0.854574477  0.739254342  0.875392201
EMfeature MED  [24 3; 2 22]  0.9392  0.867633  0.883367705  0.807542612  0.908907953
MED  [24 3; 2 22]  0.918030769  0.870133  0.88229149  0.788600741  0.897995435
MEDL  [22 3; 4 22]  0.853323077  0.87  0.874146101  0.723012158  0.861221629
MEDL MED  [22 3; 4 22]  0.842246154  0.870867  0.872792748  0.712804007  0.855108815
BIC3TPem  [23 4; 3 21]  0.882953846  0.845033  0.858294561  0.728277207  0.868055785
BIC3TPem MED  [23 3; 2 22]  0.915661538  0.8684  0.881014196  0.784570584  0.896438741
BIC3em EMfeature  [22 3; 3 21]  0.875076923  0.8619  0.870042716  0.737054533  0.870070106
BIC3em EM  [22 4; 3 21]  0.8784  0.844167  0.857525114  0.722727737  0.865437676
BIC3ALLmed  [23 4; 2 21]  0.903415385  0.845833  0.861502901  0.749742041  0.879851286
BIC3ALLmed MED  [24 3; 2 21]  0.929692308  0.866  0.880949738  0.796343133  0.902972339
BIC3SIGmed  [23 4; 3 21]  0.897907692  0.8475  0.861657499  0.745752559  0.877722672
BIC3SIGmed MED  [24 3; 2 22]  0.922061538  0.877333  0.888499647  0.799744223  0.903244594
BIC3SIGTPmed  [23 4; 3 21]  0.888523077  0.844267  0.858145675  0.733049567  0.871348082
BIC3SIGTPmed MED  [24 3; 2 22]  0.921138462  0.8717  0.884005297  0.79330533  0.900673783
BIC3TPmed  [23 4; 3 21]  0.892461538  0.838633  0.854922227  0.731441137  0.871242938
BIC3TPmed MED  [24 3; 2 22]  0.922707692  0.880533  0.891523541  0.803668993  0.905355394
BIC3TPpattern  [23 4; 3 21]  0.886153846  0.858  0.868622636  0.74427619  0.875402185
BIC3TPpattern MED  [23 3; 2 22]  0.910215385  0.871633  0.882978509  0.782244336  0.894693771
BIC5TPem  [23 4; 3 21]  0.886923077  0.839367  0.854388226  0.726695495  0.868142974
BIC5TPem MED  [24 3; 2 22]  0.9196  0.870067  0.882364753  0.790163404  0.899099445
BIC5em EMfeature  [22 3; 3 22]  0.870676923  0.8821  0.886254909  0.752414683  0.876312116
BIC5em EM  [23 4; 3 21]  0.879169231  0.847367  0.859851537  0.726798081  0.867064537
BIC5ALLmed  [23 4; 3 21]  0.901938462  0.8353  0.852583804  0.73776577  0.874813669
BIC5ALLmed MED  [24 3; 2 22]  0.9188  0.882167  0.891817929  0.801346337  0.903790384
BIC5SIGmed  [23 4; 3 21]  0.901015385  0.846633  0.861820271  0.748069377  0.879193357
BIC5SIGmed MED  [24 3; 2 22]  0.9212  0.879767  0.889970919  0.801327231  0.903676155
BIC5SIGTPmed  [23 4; 3 21]  0.893292308  0.831367  0.848652865  0.725125495  0.868506938
BIC5SIGTPmed MED  [24 3; 2 22]  0.931292308  0.868433  0.882658748  0.800429609  0.904640953
BIC5TPmed  [23 4; 3 21]  0.889384615  0.833767  0.849987446  0.723515544  0.867237839
BIC5TPmed MED  [24 3; 2 22]  0.921230769  0.880633  0.89153427  0.802166326  0.904502317
BIC5TPpattern  [23 4; 3 21]  0.888430769  0.846633  0.859770569  0.735423553  0.872178449
BIC5TPpattern MED  [23 3; 2 21]  0.914153846  0.863567  0.876956201  0.77824718  0.893515351
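SMOTE300 and SMOTE500 denote datasets where the minority class was oversampled with SMOTE before training: synthetic minority samples are generated by interpolating each seed sample with one of its k nearest minority-class neighbours. A minimal sketch of that interpolation step (assuming a purely numeric feature matrix; the thesis presumably relied on an existing implementation, and all names below are illustrative):

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from the minority-class matrix
    by interpolating seeds with one of their k nearest neighbours."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distances from the seed to every minority sample
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the seed itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        out.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(out)
```

Each synthetic point lies on the segment between two real minority samples, so oversampling by 300% simply means calling this with n_new equal to three times the minority-class size.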
Table C.3: Extensive results of the classification with Naive Bayes on all SMOTE500 datasets.

Naive Bayes, SMOTE500-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [24 5; 1 33]  0.941876923  0.869784  0.830195536  0.793282822  0.881115378
EM  [24 4; 2 33]  0.938923077  0.880597  0.846193662  0.805633871  0.889158553
EMfeature  [24 5; 2 32]  0.939753846  0.867681  0.832812199  0.791423491  0.881874423
EMfeature MED  [24 3; 1 34]  0.947569231  0.919915  0.893456926  0.859217694  0.918507216
MED  [24 3; 1 34]  0.955415385  0.909673  0.882049026  0.853889165  0.916137547
MEDL  [23 4; 3 33]  0.885815385  0.895704  0.856483239  0.776978205  0.869204047
MEDL MED  [23 3; 3 35]  0.883415385  0.929602  0.899019522  0.814649553  0.889073057
BIC3TPem  [24 5; 2 32]  0.929723077  0.867738  0.830314962  0.782529648  0.876159189
BIC3TPem MED  [24 4; 2 33]  0.926461538  0.896174  0.862847576  0.813118861  0.892187433
BIC3em EMfeature  [23 4; 3 33]  0.896061538  0.899474  0.861903525  0.790257853  0.877187157
BIC3em EM  [24 5; 2 33]  0.924923077  0.874154  0.836483493  0.785819185  0.877349176
BIC3ALLmed  [25 4; 1 34]  0.958646154  0.901081  0.872355763  0.846481713  0.912316508
BIC3ALLmed MED  [24 5; 1 33]  0.942861538  0.874154  0.839092332  0.801469725  0.887211563
BIC3SIGmed  [25 4; 1 33]  0.959353846  0.896714  0.867251384  0.842068566  0.910052679
BIC3SIGmed MED  [24 5; 2 32]  0.9296  0.870953  0.83429986  0.786347461  0.878282065
BIC3SIGTPmed  [24 5; 2 32]  0.934430769  0.87256  0.836405644  0.792234939  0.881711101
BIC3SIGTPmed MED  [24 4; 1 33]  0.948276923  0.890327  0.858639045  0.824991222  0.899992947
BIC3TPmed  [24 5; 2 32]  0.935138462  0.871465  0.835527879  0.791584228  0.88135531
BIC3TPmed MED  [24 4; 1 33]  0.941969231  0.894595  0.86337775  0.824707427  0.899502151
BIC3TPpattern  [24 5; 2 33]  0.932061538  0.877383  0.841566438  0.795879096  0.883505602
BIC3TPpattern MED  [24 4; 2 33]  0.931169231  0.898321  0.865677256  0.819603048  0.89587354
BIC5TPem  [24 5; 2 32]  0.9304  0.869872  0.832913843  0.785776529  0.878054501
BIC5TPem MED  [24 4; 2 33]  0.933353846  0.893997  0.861085893  0.816612053  0.894559656
BIC5em EMfeature  [23 3; 2 34]  0.909323077  0.906927  0.87271515  0.810503314  0.88894686
BIC5em EM  [24 5; 2 33]  0.925753846  0.878478  0.841311434  0.791632947  0.880683042
BIC5ALLmed  [24 4; 1 33]  0.949138462  0.879545  0.846057786  0.813078374  0.893774081
BIC5ALLmed MED  [25 4; 1 33]  0.956861538  0.897297  0.868571972  0.840709757  0.909138282
BIC5SIGmed  [24 5; 2 32]  0.934246154  0.870953  0.834845169  0.790373415  0.880791668
BIC5SIGmed MED  [25 4; 1 33]  0.962430769  0.897781  0.869073038  0.845984831  0.912320736
BIC5SIGTPmed  [24 5; 2 32]  0.926615385  0.868791  0.831083549  0.781132933  0.875226818
BIC5SIGTPmed MED  [24 4; 1 33]  0.942030769  0.899417  0.868502425  0.830309265  0.902437027
BIC5TPmed  [24 5; 2 32]  0.924184615  0.86825  0.829931596  0.778420558  0.873544055
BIC5TPmed MED  [24 4; 2 34]  0.933446154  0.90266  0.871195343  0.826726604  0.899926117
BIC5TPpattern  [24 5; 2 33]  0.929661538  0.876856  0.840461274  0.793129002  0.881751431
BIC5TPpattern MED  [24 4; 2 33]  0.933415385  0.899417  0.867781199  0.822891995  0.897795054
Table C.4: Extensive results of the classification with linear SVM on all unbalanced datasets.

Linear SVM, unbalanced original dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [23 4; 2 3]  0.917015385  0.403333333  0.860196879  0.346695225  0.886443268
EM  [20 4; 6 3]  0.7784  0.418571429  0.847134471  0.178652475  0.808623045
EMfeature  [24 4; 2 2]  0.928184615  0.384285714  0.863791694  0.347986762  0.893576999
EMfeature MED  [24 4; 2 2]  0.927261538  0.371428571  0.861194134  0.334454912  0.891368236
MED  [24 4; 2 2]  0.931876923  0.365714286  0.860454795  0.334695945  0.893452577
MEDL  [24 4; 2 2]  0.935138462  0.315238095  0.851640713  0.285036116  0.890575929
MEDL MED  [24 4; 2 2]  0.932861538  0.312380952  0.850709229  0.278412325  0.888993917
BIC3TPem  [20 3; 5 3]  0.791538462  0.485238095  0.865021719  0.247713047  0.824749956
BIC3TPem MED  [24 4; 1 2]  0.948461538  0.37952381  0.865234056  0.3716016  0.904034095
BIC3em EMfeature  [24 4; 2 2]  0.923292308  0.34952381  0.855378602  0.308040467  0.887289353
BIC3em EM  [20 3; 5 3]  0.7876  0.494285714  0.86611893  0.249982414  0.822749604
BIC3ALLmed  [24 4; 2 2]  0.918707692  0.348095238  0.855131494  0.295742343  0.884835715
BIC3ALLmed MED  [23 4; 3 2]  0.899938462  0.292857143  0.841632223  0.214320666  0.86845477
BIC3SIGmed  [24 4; 2 2]  0.929692308  0.358571429  0.859035767  0.32351958  0.891648415
BIC3SIGmed MED  [24 4; 2 2]  0.921784615  0.36  0.857945521  0.31025777  0.887307088
BIC3SIGTPmed  [24 4; 2 2]  0.929723077  0.38952381  0.864183937  0.357997981  0.894783568
BIC3SIGTPmed MED  [24 4; 2 2]  0.935015385  0.375238095  0.862885484  0.346634613  0.896320977
BIC3TPmed  [24 4; 1 3]  0.942215385  0.41047619  0.870495755  0.397986319  0.903742755
BIC3TPmed MED  [24 4; 2 2]  0.923292308  0.349047619  0.856223652  0.301277235  0.887439836
BIC3TPpattern  [24 4; 2 3]  0.926676923  0.414285714  0.869266661  0.371775197  0.895933742
BIC3TPpattern MED  [24 4; 2 2]  0.927384615  0.367142857  0.860248063  0.323755351  0.891068823
BIC5TPem  [20 3; 6 3]  0.780153846  0.437619048  0.851686676  0.195913526  0.812093062
BIC5TPem MED  [24 4; 2 2]  0.933476923  0.341428571  0.855428003  0.312710432  0.89188189
BIC5em EMfeature  [24 4; 2 2]  0.926369231  0.355238095  0.857981484  0.310275592  0.889943742
BIC5em EM  [20 3; 6 3]  0.779476923  0.454285714  0.855175839  0.20996597  0.813330358
BIC5ALLmed  [23 4; 2 2]  0.912492308  0.334761905  0.851583144  0.275843321  0.879928852
BIC5ALLmed MED  [23 4; 3 2]  0.895323077  0.29  0.840150385  0.204860813  0.865857599
BIC5SIGmed  [24 4; 2 2]  0.925815385  0.349047619  0.856500556  0.307031179  0.888658568
BIC5SIGmed MED  [24 4; 2 2]  0.921076923  0.350952381  0.855639617  0.302596828  0.885922665
BIC5SIGTPmed  [24 4; 2 2]  0.936738462  0.398095238  0.867319103  0.374854403  0.8997085
BIC5SIGTPmed MED  [24 4; 2 2]  0.9296  0.384761905  0.864195614  0.345708237  0.89441422
BIC5TPmed  [24 4; 2 2]  0.929630769  0.37  0.860190039  0.338705783  0.892707922
BIC5TPmed MED  [23 4; 2 2]  0.912461538  0.312380952  0.847846261  0.248771504  0.87743495
BIC5TPpattern  [24 4; 2 2]  0.930523077  0.38047619  0.862939724  0.345030257  0.894345072
BIC5TPpattern MED  [24 4; 2 2]  0.931261538  0.350952381  0.857634513  0.313537077  0.891589528
Table C.5: Extensive results of the classification with linear SVM on all balanced datasets (SMOTE300).

Linear SVM, SMOTE300-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [21 4; 4 22]  0.831846154  0.858092308  0.856905389  0.689507624  0.841040025
EM  [22 4; 4 21]  0.851353846  0.857033333  0.862688215  0.708340679  0.854554406
EMfeature  [21 4; 4 21]  0.831015385  0.8586  0.862274915  0.689378901  0.843389121
EMfeature MED  [22 3; 3 22]  0.871661538  0.866833333  0.871661538  0.738485021  0.870840314
MED  [22 3; 3 22]  0.868646154  0.8748  0.880466579  0.743408851  0.872709502
MEDL  [22 4; 4 21]  0.847630769  0.859  0.863553476  0.706230731  0.853140047
MEDL MED  [23 4; 3 21]  0.889169231  0.852566667  0.863431063  0.741823922  0.874012375
BIC3TPem  [22 4; 3 21]  0.877076923  0.845866667  0.857353804  0.723330192  0.865307774
BIC3TPem MED  [23 4; 2 21]  0.906123077  0.8563  0.868807092  0.762966132  0.885779373
BIC3em EMfeature  [22 4; 4 21]  0.853169231  0.832166667  0.844099757  0.685336466  0.846023485
BIC3em EM  [22 4; 3 21]  0.874861538  0.8411  0.853027742  0.716286892  0.861780233
BIC3ALLmed  [22 4; 4 21]  0.851569231  0.8571  0.864308855  0.708476341  0.855354761
BIC3ALLmed MED  [22 4; 4 21]  0.859230769  0.851633333  0.859230769  0.710752613  0.857625484
BIC3SIGmed  [21 4; 4 21]  0.838153846  0.8497  0.856151309  0.687781498  0.844211513
BIC3SIGmed MED  [23 4; 3 21]  0.882523077  0.8498  0.861287642  0.732827886  0.869463099
BIC3SIGTPmed  [22 4; 4 21]  0.845815385  0.855433333  0.861447809  0.701107476  0.851011579
BIC3SIGTPmed MED  [23 3; 2 22]  0.902769231  0.8758  0.88441281  0.778916018  0.891843053
BIC3TPmed  [22 3; 3 21]  0.865415385  0.862866667  0.869844169  0.72822778  0.865453804
BIC3TPmed MED  [23 3; 3 22]  0.901384615  0.8709  0.879777013  0.772551274  0.888498803
BIC3TPpattern  [22 4; 3 21]  0.873261538  0.8548  0.863835821  0.728234383  0.867100562
BIC3TPpattern MED  [23 3; 3 22]  0.895661538  0.8733  0.88184491  0.769334896  0.887224954
BIC5TPem  [22 4; 4 21]  0.853692308  0.853066667  0.86104419  0.706723947  0.854791296
BIC5TPem MED  [22 4; 3 21]  0.877230769  0.853166667  0.862624068  0.730614773  0.868053319
BIC5em EMfeature  [22 4; 4 21]  0.861046154  0.855733333  0.863883921  0.716497772  0.860083126
BIC5em EM  [22 4; 3 21]  0.868738462  0.845933333  0.857620259  0.714740158  0.860976359
BIC5ALLmed  [22 4; 4 21]  0.847538462  0.8522  0.858755073  0.699679466  0.850386687
BIC5ALLmed MED  [22 4; 4 21]  0.858369231  0.855533333  0.862732691  0.71386559  0.858480697
BIC5SIGmed  [22 4; 4 21]  0.839661538  0.852166667  0.858286683  0.691787188  0.846123113
BIC5SIGmed MED  [22 3; 3 22]  0.875661538  0.866766667  0.875131374  0.742536748  0.873144762
BIC5SIGTPmed  [22 4; 4 21]  0.845107692  0.853866667  0.860379002  0.698807204  0.849971234
BIC5SIGTPmed MED  [23 3; 3 22]  0.898861538  0.8693  0.879184491  0.768523226  0.886951535
BIC5TPmed  [22 3; 4 21]  0.842923077  0.861266667  0.866857275  0.703811655  0.851689888
BIC5TPmed MED  [23 4; 3 21]  0.889815385  0.857166667  0.867342594  0.747251451  0.876490983
BIC5TPpattern  [22 4; 3 21]  0.867046154  0.843533333  0.854006981  0.710700158  0.858814724
BIC5TPpattern MED  [23 4; 3 21]  0.883261538  0.853866667  0.863674358  0.737546115  0.871973226
Table C.6: Extensive results of the classification with linear SVM on all datasets with SMOTE500.

Linear SVM, SMOTE500-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [21 3; 5 36]  0.815846154  0.926072874  0.885020751  0.750654403  0.845281592
EM  [22 3; 4 34]  0.846953846  0.919871977  0.882984838  0.771036284  0.862279903
EMfeature  [21 3; 4 34]  0.830492308  0.926358464  0.889423018  0.763893059  0.856412721
EMfeature MED  [22 3; 4 35]  0.856953846  0.931621622  0.898548363  0.793851281  0.875467306
MED  [23 3; 3 35]  0.879046154  0.929004267  0.896317591  0.809822459  0.886046445
MEDL  [21 4; 5 33]  0.800830769  0.898975818  0.848762672  0.705475589  0.822045514
MEDL MED  [22 3; 4 34]  0.861015385  0.912403983  0.874492599  0.774639172  0.865593654
BIC3TPem  [21 3; 4 34]  0.829692308  0.92685633  0.89065033  0.763945543  0.856366861
BIC3TPem MED  [22 3; 3 34]  0.873415385  0.909644381  0.873412602  0.782591251  0.871279864
BIC3em EMfeature  [21 4; 4 33]  0.830246154  0.885576102  0.837536425  0.716106723  0.831139838
BIC3em EM  [22 3; 4 35]  0.858646154  0.929530583  0.896758395  0.792836428  0.875256445
BIC3ALLmed  [22 3; 3 34]  0.868  0.916714083  0.880200455  0.785760317  0.872164247
BIC3ALLmed MED  [21 3; 4 34]  0.835784615  0.925206259  0.888653803  0.767388527  0.858819051
BIC3SIGmed  [22 3; 3 34]  0.875169231  0.910199147  0.873699687  0.784523743  0.872357104
BIC3SIGmed MED  [21 3; 4 34]  0.831169231  0.919886202  0.880060662  0.756944458  0.852772388
BIC3SIGTPmed  [22 2; 4 35]  0.843169231  0.94029872  0.910259007  0.791838875  0.873153186
BIC3SIGTPmed MED  [23 3; 3 35]  0.891538462  0.92799431  0.898367197  0.820014971  0.893212296
BIC3TPmed  [21 2; 4 35]  0.837415385  0.936514936  0.904617087  0.782270998  0.867224299
BIC3TPmed MED  [23 3; 3 34]  0.888246154  0.909658606  0.875069594  0.795317927  0.879145177
BIC3TPpattern  [21 3; 4 35]  0.831261538  0.928990043  0.894267732  0.767910003  0.858729663
BIC3TPpattern MED  [23 3; 3 35]  0.885138462  0.93056899  0.901961205  0.817503594  0.891211
BIC5TPem  [21 3; 5 34]  0.820430769  0.924694168  0.88573759  0.753067938  0.849529591
BIC5TPem MED  [22 3; 4 34]  0.859446154  0.909089616  0.872297509  0.769580124  0.862924365
BIC5em EMfeature  [22 4; 4 34]  0.848676923  0.905376956  0.863855153  0.75509628  0.853357166
BIC5em EM  [21 3; 4 34]  0.838461538  0.924153627  0.886931295  0.768584361  0.86016153
BIC5ALLmed  [21 3; 4 34]  0.830430769  0.923584637  0.886111201  0.760798464  0.854856678
BIC5ALLmed MED  [21 3; 4 34]  0.838061538  0.921465149  0.88285645  0.765089456  0.857949821
BIC5SIGmed  [21 3; 4 34]  0.833538462  0.923086771  0.884413169  0.762880109  0.856291057
BIC5SIGmed MED  [22 3; 3 34]  0.876738462  0.916088193  0.88119076  0.792896453  0.877073945
BIC5SIGTPmed  [21 2; 4 35]  0.834461538  0.939758179  0.908737083  0.783538586  0.86776736
BIC5SIGTPmed MED  [23 3; 3 34]  0.898523077  0.919857752  0.888636957  0.816405359  0.891625283
BIC5TPmed  [21 3; 4 35]  0.826615385  0.932190612  0.897921647  0.767430875  0.857810402
BIC5TPmed MED  [23 3; 3 34]  0.888153846  0.913485064  0.879714325  0.799852575  0.881747808
BIC5TPpattern  [22 3; 4 34]  0.840676923  0.926315789  0.890984644  0.773169855  0.86278146
BIC5TPpattern MED  [23 3; 3 34]  0.886769231  0.91230441  0.8781807  0.797389335  0.88050436
Table C.7: Extensive results of the classification with Decision Trees on all unbalanced datasets.

Decision Trees, unbalanced original dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6; 1 0]  0.972123077  0.045714  0.801965656  0.021911594  0.878323212
EM  [24 6; 2 1]  0.941569231  0.110952  0.8150202  0.05575752  0.870902911
EMfeature  [25 6; 1 0]  0.968892308  0.043333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4; 4 3]  0.847846154  0.410476  0.856753506  0.256414358  0.849040243
MED  [22 4; 4 3]  0.848646154  0.410476  0.856911608  0.257259789  0.849554427
MEDL  [22 4; 3 2]  0.871723077  0.290952  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4; 4 2]  0.845630769  0.285714  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5; 2 1]  0.923569231  0.147143  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4; 4 2]  0.827723077  0.360476  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5; 3 1]  0.879169231  0.210952  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5; 3 1]  0.867846154  0.237619  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [24 4; 2 2]  0.918738462  0.318571  0.849598354  0.258671563  0.881434958
BIC3ALLmed MED  [22 4; 4 2]  0.844184615  0.352381  0.845410829  0.192827084  0.842152791
BIC3SIGmed  [25 6; 1 0]  0.979723077  0.056667  0.811166823  0.044596699  0.886912739
BIC3SIGmed MED  [22 4; 4 2]  0.858769231  0.390476  0.854423115  0.251034768  0.854297461
BIC3SIGTPmed  [24 6; 2 1]  0.936153846  0.106667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4; 4 3]  0.836953846  0.406667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5; 2 2]  0.931107692  0.255714  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4; 4 2]  0.837938462  0.36381  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5; 1 1]  0.945292308  0.139524  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4; 4 2]  0.850553846  0.325238  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6; 1 1]  0.942984615  0.097143  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4; 4 2]  0.845384615  0.337143  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4; 3 2]  0.878892308  0.300952  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5; 3 1]  0.899938462  0.13381  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4; 2 2]  0.923538462  0.310952  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4; 4 2]  0.848184615  0.37381  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6; 1 0]  0.975846154  0.077143  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4; 4 2]  0.856461538  0.397143  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6; 1 1]  0.953015385  0.102381  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4; 4 2]  0.845630769  0.385714  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5; 1 1]  0.953876923  0.152857  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4; 4 2]  0.8332  0.328095  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5; 2 1]  0.920338462  0.135238  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4; 4 2]  0.839476923  0.330476  0.839110003  0.172724697  0.837167678
Table C.8: Extensive results of the classification with Decision Trees on all balanced datasets (SMOTE300).

Decision Trees, SMOTE300-balanced dataset
Method  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6; 1 0]  0.972123077  0.045714  0.801965656  0.021911594  0.878323212
EM  [24 6; 2 1]  0.941569231  0.110952  0.8150202  0.05575752  0.870902911
EMfeature  [25 6; 1 0]  0.968892308  0.043333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4; 4 3]  0.847846154  0.410476  0.856753506  0.256414358  0.849040243
MED  [22 4; 4 3]  0.848646154  0.410476  0.856911608  0.257259789  0.849554427
MEDL  [22 4; 3 2]  0.871723077  0.290952  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4; 4 2]  0.845630769  0.285714  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5; 2 1]  0.923569231  0.147143  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4; 4 2]  0.827723077  0.360476  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5; 3 1]  0.879169231  0.210952  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5; 3 1]  0.867846154  0.237619  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [24 4; 2 2]  0.918738462  0.318571  0.849598354  0.258671563  0.881434958
BIC3ALLmed MED  [22 4; 4 2]  0.844184615  0.352381  0.845410829  0.192827084  0.842152791
BIC3SIGmed  [25 6; 1 0]  0.979723077  0.056667  0.811166823  0.044596699  0.886912739
BIC3SIGmed MED  [22 4; 4 2]  0.858769231  0.390476  0.854423115  0.251034768  0.854297461
BIC3SIGTPmed  [24 6; 2 1]  0.936153846  0.106667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4; 4 3]  0.836953846  0.406667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5; 2 2]  0.931107692  0.255714  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4; 4 2]  0.837938462  0.36381  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5; 1 1]  0.945292308  0.139524  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4; 4 2]  0.850553846  0.325238  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6; 1 1]  0.942984615  0.097143  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4; 4 2]  0.845384615  0.337143  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4; 3 2]  0.878892308  0.300952  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5; 3 1]  0.899938462  0.13381  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4; 2 2]  0.923538462  0.310952  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4; 4 2]  0.848184615  0.37381  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6; 1 0]  0.975846154  0.077143  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4; 4 2]  0.856461538  0.397143  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6; 1 1]  0.953015385  0.102381  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4; 4 2]  0.845630769  0.385714  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5; 1 1]  0.953876923  0.152857  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4; 4 2]  0.8332  0.328095  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5; 2 1]  0.920338462  0.135238  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4; 4 2]  0.839476923  0.330476  0.839110003  0.172724697  0.837167678
Table C.9: Extensive results of the classification with Decision Trees on all datasets with SMOTE500.Decision Trees
SMOTE500Confusion matrices’ Tprate Tnrate Precision Kappa statistic F measure
ORI  [25 6;1 0]  0.972123077  0.045714  0.801965656  0.021911594  0.878323212
EM  [24 6;2 1]  0.941569231  0.110952  0.8150202  0.05575752  0.870902911
EMfeature  [25 6;1 0]  0.968892308  0.043333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4;4 3]  0.847846154  0.410476  0.856753506  0.256414358  0.849040243
MED  [22 4;4 3]  0.848646154  0.410476  0.856911608  0.257259789  0.849554427
MEDL  [22 4;3 2]  0.871723077  0.290952  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4;4 2]  0.845630769  0.285714  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5;2 1]  0.923569231  0.147143  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4;4 2]  0.827723077  0.360476  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5;3 1]  0.879169231  0.210952  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5;3 1]  0.867846154  0.237619  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [22 4;4 2]  0.844184615  0.352381  0.845410829  0.192827084  0.842152791
BIC3ALLmed MED  [24 4;2 2]  0.918738462  0.318571  0.849598354  0.258671563  0.881434958
BIC3SIGmed  [22 4;4 2]  0.858769231  0.390476  0.854423115  0.251034768  0.854297461
BIC3SIGmed MED  [25 6;1 0]  0.979723077  0.056667  0.811166823  0.044596699  0.886912739
BIC3SIGTPmed  [24 6;2 1]  0.936153846  0.106667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4;4 3]  0.836953846  0.406667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5;2 2]  0.931107692  0.255714  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4;4 2]  0.837938462  0.36381  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5;1 1]  0.945292308  0.139524  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4;4 2]  0.850553846  0.325238  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6;1 1]  0.942984615  0.097143  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4;4 2]  0.845384615  0.337143  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4;3 2]  0.878892308  0.300952  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5;3 1]  0.899938462  0.13381  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4;2 2]  0.923538462  0.310952  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4;4 2]  0.848184615  0.37381  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6;1 0]  0.975846154  0.077143  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4;4 2]  0.856461538  0.397143  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6;1 1]  0.953015385  0.102381  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4;4 2]  0.845630769  0.385714  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5;1 1]  0.953876923  0.152857  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4;4 2]  0.8332  0.328095  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5;2 1]  0.920338462  0.135238  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4;4 2]  0.839476923  0.330476  0.839110003  0.172724697  0.837167678
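The measures in these tables follow the standard definitions over a binary confusion matrix, whose layout here appears to be [TP FP; FN TN] (an inference from the reported rates; the tabulated values are averaged over cross-validation folds, so recomputing them from the rounded matrices gives only approximate agreement). A minimal sketch of the definitions, not the thesis's actual evaluation code:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification measures from confusion-matrix counts.

    Layout assumption (inferred from the tables): [TP FP; FN TN].
    """
    total = tp + fp + fn + tn
    tp_rate = tp / (tp + fn)            # sensitivity / recall
    tn_rate = tn / (tn + fp)            # specificity
    precision = tp / (tp + fp)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_observed = (tp + tn) / total
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return tp_rate, tn_rate, precision, kappa, f_measure

# Example: the MED row's matrix [22 4;4 3] gives TP rate ~0.846 and
# TN rate ~0.429, close to the fold-averaged values reported above.
print(classification_metrics(22, 4, 4, 3))
```

Because the positive (no-NIV) class dominates, TP rate and F-measure stay high even for near-trivial classifiers, which is why the kappa statistic and TN rate are the more discriminating columns in these tables.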
Table C.10: Extensive results of the classification with K-Nearest-Neighbor on all unbalanced datasets.
K-Nearest-Neighbor, unbalanced original dataset
Dataset  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6;1 0]  0.972123077  0.045714286  0.801965656  0.021911594  0.878323212
EM  [24 6;2 1]  0.941569231  0.110952381  0.8150202  0.05575752  0.870902911
EMfeature  [25 6;1 0]  0.968892308  0.043333333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4;4 3]  0.847846154  0.41047619  0.856753506  0.256414358  0.849040243
MED  [22 4;4 3]  0.848646154  0.41047619  0.856911608  0.257259789  0.849554427
MEDL  [22 4;3 2]  0.871723077  0.290952381  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4;4 2]  0.845630769  0.285714286  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5;2 1]  0.923569231  0.147142857  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4;4 2]  0.827723077  0.36047619  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5;3 1]  0.879169231  0.210952381  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5;3 1]  0.867846154  0.237619048  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [24 4;2 2]  0.918738462  0.318571429  0.849598354  0.258671563  0.881434958
BIC3ALLmed MED  [22 4;4 2]  0.844184615  0.352380952  0.845410829  0.192827084  0.842152791
BIC3SIGmed  [25 6;1 0]  0.979723077  0.056666667  0.811166823  0.044596699  0.886912739
BIC3SIGmed MED  [22 4;4 2]  0.858769231  0.39047619  0.854423115  0.251034768  0.854297461
BIC3SIGTPmed  [24 6;2 1]  0.936153846  0.106666667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4;4 3]  0.836953846  0.406666667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5;2 2]  0.931107692  0.255714286  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4;4 2]  0.837938462  0.363809524  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5;1 1]  0.945292308  0.13952381  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4;4 2]  0.850553846  0.325238095  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6;1 1]  0.942984615  0.097142857  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4;4 2]  0.845384615  0.337142857  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4;3 2]  0.878892308  0.300952381  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5;3 1]  0.899938462  0.133809524  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4;2 2]  0.923538462  0.310952381  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4;4 2]  0.848184615  0.373809524  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6;1 0]  0.975846154  0.077142857  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4;4 2]  0.856461538  0.397142857  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6;1 1]  0.953015385  0.102380952  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4;4 2]  0.845630769  0.385714286  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5;1 1]  0.953876923  0.152857143  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4;4 2]  0.8332  0.328095238  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5;2 1]  0.920338462  0.135238095  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4;4 2]  0.839476923  0.33047619  0.839110003  0.172724697  0.837167678
Table C.11: Extensive results of the classification with K-Nearest-Neighbor on all balanced datasets (SMOTE300).
K-Nearest-Neighbor, SMOTE300-balanced dataset
Dataset  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6;1 0]  0.972123077  0.045714286  0.801965656  0.021911594  0.878323212
EM  [24 6;2 1]  0.941569231  0.110952381  0.8150202  0.05575752  0.870902911
EMfeature  [25 6;1 0]  0.968892308  0.043333333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4;4 3]  0.847846154  0.41047619  0.856753506  0.256414358  0.849040243
MED  [22 4;4 3]  0.848646154  0.41047619  0.856911608  0.257259789  0.849554427
MEDL  [22 4;3 2]  0.871723077  0.290952381  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4;4 2]  0.845630769  0.285714286  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5;2 1]  0.923569231  0.147142857  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4;4 2]  0.827723077  0.36047619  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5;3 1]  0.879169231  0.210952381  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5;3 1]  0.867846154  0.237619048  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [24 4;2 2]  0.918738462  0.318571429  0.849598354  0.258671563  0.881434958
BIC3ALLmed MED  [22 4;4 2]  0.844184615  0.352380952  0.845410829  0.192827084  0.842152791
BIC3SIGmed  [25 6;1 0]  0.979723077  0.056666667  0.811166823  0.044596699  0.886912739
BIC3SIGmed MED  [22 4;4 2]  0.858769231  0.39047619  0.854423115  0.251034768  0.854297461
BIC3SIGTPmed  [24 6;2 1]  0.936153846  0.106666667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4;4 3]  0.836953846  0.406666667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5;2 2]  0.931107692  0.255714286  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4;4 2]  0.837938462  0.363809524  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5;1 1]  0.945292308  0.13952381  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4;4 2]  0.850553846  0.325238095  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6;1 1]  0.942984615  0.097142857  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4;4 2]  0.845384615  0.337142857  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4;3 2]  0.878892308  0.300952381  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5;3 1]  0.899938462  0.133809524  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4;2 2]  0.923538462  0.310952381  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4;4 2]  0.848184615  0.373809524  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6;1 0]  0.975846154  0.077142857  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4;4 2]  0.856461538  0.397142857  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6;1 1]  0.953015385  0.102380952  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4;4 2]  0.845630769  0.385714286  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5;1 1]  0.953876923  0.152857143  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4;4 2]  0.8332  0.328095238  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5;2 1]  0.920338462  0.135238095  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4;4 2]  0.839476923  0.33047619  0.839110003  0.172724697  0.837167678
Table C.12: Extensive results of the classification with K-Nearest-Neighbor on all SMOTE500 datasets.
K-Nearest-Neighbor, SMOTE500
Dataset  Confusion matrix  TP rate  TN rate  Precision  Kappa statistic  F-measure
ORI  [25 6;1 0]  0.972123077  0.045714286  0.801965656  0.021911594  0.878323212
EM  [24 6;2 1]  0.941569231  0.110952381  0.8150202  0.05575752  0.870902911
EMfeature  [25 6;1 0]  0.968892308  0.043333333  0.807197349  0.013246617  0.879661011
EMfeature MED  [22 4;4 3]  0.847846154  0.41047619  0.856753506  0.256414358  0.849040243
MED  [22 4;4 3]  0.848646154  0.41047619  0.856911608  0.257259789  0.849554427
MEDL  [22 4;3 2]  0.871723077  0.290952381  0.836502848  0.171445341  0.852288855
MEDL MED  [22 4;4 2]  0.845630769  0.285714286  0.830667691  0.13700804  0.836685431
BIC3TPem  [24 5;2 1]  0.923569231  0.147142857  0.818445683  0.078892837  0.866344002
BIC3TPem MED  [21 4;4 2]  0.827723077  0.36047619  0.843866046  0.182455711  0.833074286
BIC3em EMfeature  [23 5;3 1]  0.879169231  0.210952381  0.821647935  0.100942008  0.848010013
BIC3em EM  [22 5;3 1]  0.867846154  0.237619048  0.826578608  0.101691782  0.844116056
BIC3ALLmed  [22 4;4 2]  0.844184615  0.352380952  0.845410829  0.192827084  0.842152791
BIC3ALLmed MED  [24 4;2 2]  0.918738462  0.318571429  0.849598354  0.258671563  0.881434958
BIC3SIGmed  [22 4;4 2]  0.858769231  0.39047619  0.854423115  0.251034768  0.854297461
BIC3SIGmed MED  [25 6;1 0]  0.979723077  0.056666667  0.811166823  0.044596699  0.886912739
BIC3SIGTPmed  [24 6;2 1]  0.936153846  0.106666667  0.812776714  0.050067168  0.868566242
BIC3SIGTPmed MED  [21 4;4 3]  0.836953846  0.406666667  0.856232731  0.232601269  0.843817099
BIC3TPmed  [24 5;2 2]  0.931107692  0.255714286  0.838649806  0.216812133  0.881113131
BIC3TPmed MED  [21 4;4 2]  0.837938462  0.363809524  0.847095515  0.191874983  0.839194025
BIC3TPpattern  [24 5;1 1]  0.945292308  0.13952381  0.820209406  0.100556343  0.877213641
BIC3TPpattern MED  [22 4;4 2]  0.850553846  0.325238095  0.839921536  0.180074303  0.843358664
BIC5TPem  [24 6;1 1]  0.942984615  0.097142857  0.812792951  0.044722103  0.871225781
BIC5TPem MED  [22 4;4 2]  0.845384615  0.337142857  0.84244269  0.177917085  0.841834911
BIC5em EMfeature  [23 4;3 2]  0.878892308  0.300952381  0.83898927  0.194438138  0.85640648
BIC5em EM  [23 5;3 1]  0.899938462  0.133809524  0.811627963  0.035618404  0.851063923
BIC5ALLmed  [24 4;2 2]  0.923538462  0.310952381  0.848390499  0.262723617  0.882611443
BIC5ALLmed MED  [22 4;4 2]  0.848184615  0.373809524  0.84950177  0.221096381  0.846366357
BIC5SIGmed  [25 6;1 0]  0.975846154  0.077142857  0.81430531  0.065505856  0.887166384
BIC5SIGmed MED  [22 4;4 2]  0.856461538  0.397142857  0.855250016  0.255604726  0.853319557
BIC5SIGTPmed  [24 6;1 1]  0.953015385  0.102380952  0.814609055  0.072319513  0.877556996
BIC5SIGTPmed MED  [22 4;4 2]  0.845630769  0.385714286  0.852295039  0.228364063  0.845788041
BIC5TPmed  [24 5;1 1]  0.953876923  0.152857143  0.823453135  0.133449834  0.882896434
BIC5TPmed MED  [21 4;4 2]  0.8332  0.328095238  0.83830946  0.159646652  0.831984233
BIC5TPpattern  [24 5;2 1]  0.920338462  0.135238095  0.815366006  0.0632581  0.863707615
BIC5TPpattern MED  [21 4;4 2]  0.839476923  0.33047619  0.839110003  0.172724697  0.837167678