
EXECUTIVE SUMMARY

As data analysis suffers from the curse of missing data, there is a need for methods that reliably assist our analysis. A great deal of research has been conducted over the past decades to develop methods that are helpful for analysis.

The primary objective of this project is to evaluate techniques for analyzing missing data. The evaluation involves several steps:

• Understanding the behavior or mechanism of missing data

• Studying and analyzing the available methods

• Choosing a few methods to be tested

• Coding each method in MATLAB

• Running a test data set on each method

Reweighting has been chosen as the method of design. The reweighting introduced in this project differs slightly from the method available in the literature. The approaches used in the reweighting method include:

• Minimizing the sum of squared residuals by applying weights

• Regressing the observed data with the weighted data

Reweighting applies Bayesian Decision Theory, Multivariate Density and Discriminant Functions in its development. The results are then tested using hypothesis testing to compare the reweighted distribution with the complete data distribution. The reweighting technique has been successfully implemented. Nevertheless, its robustness with different types of data has not been proved. In addition, the naïve way of assigning weights has limited its actual performance.

It is recommended that a better way of producing weights be implemented. With such a systematic and principled way of weight assignment, a better result should be expected.


ACKNOWLEDGEMENTS

A million thanks to my supervisor, Dr Robert F Harrison, for his precious ideas and helping hands during the period of the project.

Not to be forgotten, my family and friends, who have given moral support in the battle of completing this work.


TABLE OF CONTENTS

ABSTRACT 1

EXECUTIVE SUMMARY 2

ACKNOWLEDGEMENTS 3

CHAPTER 1 – INTRODUCTION 6

CHAPTER 2 - ANALYSIS WITH MISSING DATA 7

2.1 MISSING DATA MECHANISM 8

CHAPTER 3 – REQUIREMENT AND ANALYSIS 11

3.1 DATA SET 11

3.2 MISSING DATA TECHNIQUES 12

3.2.1 DELETION 13

3.2.1.1 LISTWISE DELETION 14

3.2.1.2 PAIRWISE DELETION 14

3.2.1.3 CASEWISE DELETION 14

3.2.2 IMPUTATION 15

3.2.2.1 SIMPLE MEAN IMPUTATION 15

3.2.2.2 REGRESSION MEAN IMPUTATION 16

3.2.2.3 LINEAR REGRESSION 16

3.2.2.4 LOGISTIC REGRESSION 18

3.2.2.5 GENERALIZED LINEAR REGRESSION 19

3.2.2.6 HOT DECK IMPUTATION 19

3.2.2.7 COLD DECK IMPUTATION 20

3.2.3 REWEIGHTING 20

3.2.4 FULL INFORMATION MAXIMUM LIKELIHOOD 21

3.2.5 MULTIPLE IMPUTATION 21

CHAPTER 4 – METHOD OF APPROACH 22

4.1 BAYESIAN DECISION THEORY 22

4.2 MULTIVARIATE DENSITY 23

4.3 DISCRIMINANT FUNCTION 23

4.4 HYPOTHESIS TEST 24


4.5 REWEIGHTING 24

CHAPTER 5 – RELATED WORKS 27

CHAPTER 6 – RESULTS 29

6.1 OVERALL MEAN IMPUTATION 29

6.2 CLASS MEAN IMPUTATION 29

6.3 LINEAR REGRESSION 30

6.4 REWEIGHTING 30

6.5 FURTHER WORKS ON REWEIGHTING 31

CHAPTER 7 – CONCLUSION 45

REFERENCES 46

APPENDIX A

MATLAB CODE FOR OVERALL MEAN IMPUTATION 47

APPENDIX B

MATLAB CODE FOR CLASS MEAN IMPUTATION 50

APPENDIX C

MATLAB CODE FOR LINEAR REGRESSION 53

APPENDIX D

MATLAB CODE FOR REWEIGHTING 56


1 INTRODUCTION

A complete set of data is crucially needed in order to perform statistical analysis. However, it is nearly impossible to have "perfect" data sets, as missing values have always been a constraint in analysis. Missing values appear for several reasons, for example non-respondents in a survey, data missing due to intermittent faults suffered by a machine, 'wild' values that are impossible to observe being removed, and costly data being omitted from a data set.

In such cases, we would not be able to come up with reliable, high-confidence estimates, as missing values affect properties such as the means, variances and percentiles of the data set. We can opt either to ignore the missing values or to fill in some sensible replacement values, depending on the mechanism of the missing data.

Research has been done on how to analyze a data set that suffers from missing values. Basically, there are three options available to treat missing data: listwise deletion (LD), imputation and maximum likelihood. The first two options are categorized as complete case analysis while the last is incomplete case analysis. Each option will be described further in Chapter 3.

Which technique is appropriate depends on the mechanism of the data's 'missingness'. The mechanisms are missing at random (MAR), missing completely at random (MCAR) and missing not at random (MNAR), and will be described in detail in Chapter 2. In the analysis, however, missing cases will be assumed to be MCAR.

The goal of this project is to examine the techniques available and the strengths and weaknesses of each of them. Finally, one technique, reweighting, will be used as the method of choice. Specific details of the design approach are documented in Chapter 4.

To evaluate the performance of the different methods, simulations will be done. The analysis of the simulated methods is presented in Chapter 6. Note that only continuous data are considered in this project.


2 ANALYSIS WITH MISSING DATA

Analyzing missing data has been a major challenge in statistical analysis for decades. When the amount of missing data is significant, it results in biased estimation and can lead to wrong inferences.

Earlier, data sets with missing values were analyzed by simply ignoring the missing values. By pretending that the missing samples were not present, they could be dropped from the analysis, consequently reducing the sample size. Traditionally, deletion methods have been used. These methods were quite successful in the presence of large amounts of data, where deleting unobserved cases would not cause serious harm to the estimation. This is still the default way of handling missing data in many statistical packages.

However, in 1987, researchers, namely Little & Rubin and Lepkowski, Landis & Stehouwer, started to address better ways of analyzing missing data. The concern was raised because of the significant amount of information lost by deletion methods. Imputation methods were then suggested to fill in the missing values so that the sample size would not be distorted.

Initially, the simplest method, mean imputation, was favoured among researchers. However, due to its effect on the population distribution, more imputation techniques were studied: cold deck imputation, hot deck imputation, regression imputation and, most recently, similar response pattern imputation.

Extending the idea of single imputation, the multiple imputation method was later proposed by Rubin (1987). Imputation, however, requires complicated computation, especially multiple imputation, which reduces the efficiency of the method. Besides, it also requires supplementary data to be introduced.

Later on, due to the complexity of imputation methods, full information maximum likelihood (FIML) became an attractive alternative for handling missing data. Even so, this method has not been successfully practiced: convergence issues have caused failures in its implementation, and convergence failure on binary data in particular has limited its capability.


2.1 MISSING DATA MECHANISM

Missing data in statistical analysis may appear in several patterns depending on the reasons. A few reasons that contribute to missing data were introduced earlier. First, in a survey, there are total non-response cases (Figure 2.1), in which no answers at all are observed for a sample, leading to an empty row of data. Possible causes of non-response in a survey are:

• The respondent is absent

• The respondent refuses to answer the survey questions due to time constraints

  X1 X2 X3 X4 X5
1  x  x  x  x  x
2  x  x  x  x  x
3  ?  ?  ?  ?  ?
4  x  x  x  x  x
5  x  x  x  x  x

Figure 2.1 Total non-response

In fields where data collection is costly, for example getting data from interfaces, it is very likely that those values will be left out, resulting in a data set with partial non-response on a component. This creates an empty column in the data set, as illustrated in Figure 2.2; it is also known as univariate non-response. For multivariate non-response, certain variables are missing for certain cases, as in Figure 2.3.

  X1 X2 X3 X4 X5
1  x  x  x  ?  x
2  x  x  x  ?  x
3  x  x  x  ?  x
4  x  x  x  ?  x
5  x  x  x  ?  x

Figure 2.2 Partial Non-Response: Univariate Pattern

  X1 X2 X3 X4 X5
1  x  x  x  ?  ?
2  x  x  x  ?  ?
3  x  x  x  ?  ?
4  x  x  x  x  x
5  x  x  x  x  x

Figure 2.3 Partial Non-Response: Multivariate Pattern


There are also partial non-response cases (Figure 2.4) where only certain items are missing for certain variables. This is possibly due to the removal of weird, "wild" values, or to respondents not answering all questions.

  X1 X2 X3 X4 X5
1  x  ?  x  x  x
2  x  x  ?  x  x
3  x  x  x  x  ?
4  x  ?  x  x  x
5  ?  x  x  x  ?

Figure 2.4 Partial Non-Response: Haphazard Pattern

The pattern of missing data is monotone when cases are removed for all subsequent variables. This normally occurs through attrition in longitudinal surveys. The monotone non-response pattern is shown in Figure 2.5.

  X1 X2 X3 X4 X5
1  x  x  x  x  x
2  x  x  x  x  ?
3  x  x  x  ?  ?
4  x  x  ?  ?  ?
5  x  ?  ?  ?  ?

Figure 2.5 Non-Response: Monotone Pattern

There are three mechanisms for the occurrence of missing data: two in which data are missing at random, and one in which they are not. Little and Rubin [1], however, distinguish between missing completely at random (MCAR) and missing at random (MAR).

A missing case is categorized as MCAR if it satisfies the condition that the probability of being missing is independent of the value of any variable. Mathematically, this can be expressed as P(Y | y missing) = P(Y | y observed), where Y is the random variable under study. Since "missingness" is uncorrelated with any variable, the distribution of Y is not affected by the missing values.

In the case of MAR, P(Y | y missing, Z) = P(Y | y observed, Z), where Z is a set of observed variables. In other words, the probability that Y is missing is not correlated with the value of Y after controlling for the observed variables Z, and the distribution of Y is not altered once Z is accounted for. It is clear that MCAR is a stronger condition than MAR.


If neither the MAR nor the MCAR condition holds, the "missingness" is considered MNAR. As the name suggests, the pattern of missingness is not random and is associated with the variable on which the data are missing.


3 REQUIREMENTS AND ANALYSIS

As discussed earlier, there are a number of techniques to treat missing data. How well each works depends on where it is being used. For example, in fields where the amount of available data is constrained, deletion methods and full information maximum likelihood might not produce good results. Even if the results are not misleading, the level of confidence in the inferences made is low, or the inferences are less reliable.

As decisions are crucial in industries like health monitoring and plant control, producing a highly accurate estimate is a big challenge, since it is very unlikely that a perfect set of data will be present. Misleading results would hardly be tolerated, as they deal with life and would be costly. As such, the most appropriate technique has to be used to analyze such cases.

A few techniques, namely overall mean imputation, class mean imputation, linear regression and reweighting, will be used and their results compared. How well each method works is determined by its classification result as compared to the complete data result.

3.1 DATA SET

A set of data is created using the MATLAB random number generator. For consistency, the generator is initially set to run from a specified state.

To reflect a real case, the data were generated to be multivariate. For the purpose of the analysis, data with only three variables and 500 cases are generated. The data distribution was assumed to be multivariate Gaussian with the same mean and variance for each variable. The variance of the data set plays a significant role in our analysis later.

The data were partitioned into two classes, Class 1 and Class 2, for simplicity. The first 250 cases are categorized as Class 1 and the remainder as Class 2. Bayesian Decision Theory is applied to determine the decision boundary; the theory will be covered in detail in the next chapter.

To create missing cells in the data set, the random generator is used once again to give a missing-at-random mechanism. As we will see later, the percentage of missing data has to be considered in technique selection as well. Since the missing cells are randomly distributed, the amount of missing data in each class is not necessarily the same.

The missing cells were first assigned NaN (Not a Number). Those cells were then manipulated according to the technique applied, as different techniques approach the missing cells differently. Deletion techniques drop any case with missing data on the variable being studied, and as a consequence the sample size becomes smaller. Imputation, on the other hand, fills the NaN cells with predicted values computed by a variety of techniques. Reweighting alters the data resulting from listwise deletion by assigning weights to the data; the alteration is based on minimizing the squared residuals.
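The data-preparation steps above can be sketched as follows. This is a minimal Python/NumPy illustration rather than the project's actual MATLAB code; the zero mean, identity covariance and 10% missing rate are assumptions for the sketch, while the dimensions and class split follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, mirroring MATLAB's fixed generator state

# 500 cases, 3 variables from a multivariate Gaussian (mean/cov assumed here);
# first 250 cases are Class 1, the rest Class 2.
n_cases, n_vars = 500, 3
data = rng.multivariate_normal(mean=np.zeros(n_vars),
                               cov=np.eye(n_vars), size=n_cases)
labels = np.repeat([1, 2], n_cases // 2)

# Inject MCAR missingness: each cell is masked independently of all values.
missing_rate = 0.10  # illustrative rate; the project varies this percentage
mask = rng.random(data.shape) < missing_rate
incomplete = data.copy()
incomplete[mask] = np.nan  # missing cells marked NaN, as in the report

print(np.isnan(incomplete).mean())  # observed fraction of missing cells
```

Because the mask is drawn independently of the data values, the mechanism is MCAR by construction, and the per-class amount of missingness varies slightly, as noted above.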

3.2 MISSING DATA TECHNIQUES

The way missing data are handled determines the quality of our analysis and whether it will lead to the predicted solution or not. If the prediction made is wrong or biased, we are prone to trouble, especially in cases like analyzing health data for patients at a hospital, where a wrong prediction might put the institution's reputation, and the patient, in danger. Therefore, in such critical areas, missing data must be dealt with carefully so that the prediction is as close as possible to the expected solution.

There are several ways to handle missing data. We can opt either to remove incomplete cases entirely from the data set or to impute values into the missing holes to get a complete data set; both options are known as complete case analysis. In contrast to complete case analysis, we can also analyze data with missing values directly, without having to complete the data set. Such incomplete case analysis is a model-based approach, while complete case analysis is sampling-based.

The effectiveness of a technique is determined by how much the outcome deviates from the complete case data; the percentage deviation from the complete case mean will be a performance indicator. In the case of imputation, if the mean after imputation is close to the complete case mean, we should expect that the imputed values are close to the actual values, and the classification result using the imputed data set should be similar to the complete data result. The same inference applies to the other methods: the closer the resulting data to the complete case, the lower the likelihood that the classification will be misleading.


3.2.1 DELETION

Deletion or data removal is one of the techniques to achieve complete data set. This is

the easiest to be done and most commonly set as default technique for statistical software

packages. Removing data will reduce the amount of data available for analysis. If the sample

is small, the amount of data after elimination will be getting smaller. Though data deletion

appears to be simple and less time consuming, insufficient data will have higher tendency of

14

inaccurate estimation especially when the sample size is small. Thus, this is neither practical

nor intelligent way of handling data sets with missing values.

Deletion can be done in several ways including listwise deletion, pairwise deletion

and casewise deletion.

3.2.1.1 LISTWISE DELETION

Listwise deletion is a method of data removal in which any case having missing data on any variable is dropped from the analysis. This obviously causes a significant amount of information loss if the pattern of 'missingness' is haphazard. However, this naïve method of data treatment still has an advantage if the missing values occur at random: it does not suffer from biased estimates, as the whole case is dropped rather than 'wild' substitutions being made into the missing holes.

3.2.1.2 PAIRWISE DELETION

Pairwise deletion works in the sense that if a case has missing values for some variable, it is automatically dropped from calculations relating to that variable. It is so named because the correlation of each pair of variables is calculated using only the cases with complete data for that pair. This method of deletion leads to covariance entries based on different sample sizes for different pairs of variables, since different cases are deleted from each calculation.

3.2.1.3 CASEWISE DELETION

In casewise deletion, cases having missing values on the variables of interest are excluded entirely from the analysis; correlations are calculated using only the cases selected for analysis. It is possible to end up with nothing if the missing values are distributed randomly across all variables. Casewise deletion decreases the amount of data available for calculation, which can make the analysis misleading, especially for a small sample size, due to the insufficient amount of data.
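The difference between the first two deletion schemes can be illustrated with a small Python sketch on toy values (the data below are made up): listwise deletion drops whole rows, while pairwise deletion computes each covariance from only the cases complete for that particular pair.

```python
import numpy as np

# Toy data set with NaN marking missing cells (values are illustrative).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 11.0, 12.0]])

# Listwise deletion: drop every case (row) containing any missing value.
listwise = X[~np.isnan(X).any(axis=1)]

# Pairwise deletion: each pairwise covariance uses only the cases complete
# for that pair, so different entries rest on different numbers of cases.
def pairwise_cov(X, i, j):
    ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
    xi, xj = X[ok, i], X[ok, j]
    return np.mean((xi - xi.mean()) * (xj - xj.mean()))

print(listwise.shape)         # (2, 3): only two fully observed cases survive
print(pairwise_cov(X, 0, 2))  # uses the three cases where X1 and X3 are both observed
```

Note how pairwise deletion retains three cases for the (X1, X3) pair that listwise deletion would have discarded, at the price of inconsistent sample sizes across entries.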


3.2.2 IMPUTATION

Imputation appears more appropriate in the effort to analyze data with missing values. By imputing values into the missing holes, the available statistical tools can be used, as a complete data set is now available. Imputation, if properly applied, will produce a reliable data set for analysis. The problems we face are how to impute values that mimic the true values which are missing, and whether those values truly represent the blanks.

In the early years of imputation, the easiest way to impute missing data was by single value substitution, or single imputation (SI). Later, in 1987, Rubin introduced the multiple imputation (MI) method, which substitutes a set of likely values for the holes. MI combines the results of several SIs to obtain the final value to be imputed into the missing cell.

There are a few commonly used ways to obtain the substitutions, namely mean imputation, regression imputation, cold deck imputation, hot deck imputation and similar response pattern imputation (SRPI). Each of these techniques is discussed next.

Imputation in general suffers from distribution distortion, as the variance is affected. This means that good estimation of individual values does not guarantee good overall estimation of the parameters under study. This can, however, be countered by using class mean imputation or multiple regression imputation.

3.2.2.1 SIMPLE MEAN IMPUTATION

Simple mean imputation is done by substituting the mean of the variable of interest for each missing value. This is, however, not a recommended way to impute values, as it causes the distribution of the variable to shrink and the variance to be wrongly estimated. It is the easiest form of imputation but does not guarantee correct inferences.

Mean imputation can be further classified into two kinds: overall mean imputation and class mean imputation. Overall mean imputation substitutes the overall data mean for the missing values, whereas class mean imputation replaces them with the class mean.

Class mean imputation is an improved version of overall mean imputation, aiming to reduce the shrinkage of the data variance. Besides, class mean imputation promotes an unbiased estimate of the mean.
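The two variants can be illustrated with a short Python sketch on made-up values: each missing cell receives either the grand mean or the mean of its own class.

```python
import numpy as np

# Toy example (hypothetical values): one variable observed over two classes,
# with NaN marking missing cells.
x      = np.array([1.0, 2.0, np.nan, 10.0, np.nan, 12.0])
labels = np.array([1,   1,   1,      2,    2,      2])

# Overall mean imputation: every hole gets the grand mean of the observed values.
overall = x.copy()
overall[np.isnan(overall)] = np.nanmean(x)

# Class mean imputation: each hole gets the mean of its own class,
# which reduces the shrinkage of the within-class variance.
classwise = x.copy()
for c in (1, 2):
    in_c = labels == c
    classwise[in_c & np.isnan(x)] = np.nanmean(x[in_c])

print(overall)    # both holes filled with the grand mean 6.25
print(classwise)  # class-1 hole -> 1.5, class-2 hole -> 11.0
```

The class-wise fill keeps each imputed value near its own class, which is why class mean imputation distorts the class distributions less than the overall mean.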


3.2.2.2 REGRESSION MEAN IMPUTATION

Regression mean imputation is a better way to impute values than simple mean imputation, since the information in the joint distribution between the variables of interest is utilized. Three kinds of regression can be performed: linear regression, logistic regression and generalized linear regression. Linear regression is suitable for continuous variables, while logistic regression works for binary outcome data. In the case of multiple unordered categories, generalized linear regression is an attractive choice.

The first step in regression imputation is to regress the incomplete variables on an auxiliary variable, that is, a variable for which information is available prior to sampling. In this manner, the information in the joint distribution between the auxiliary variables and the variables of interest is taken into account. A precise regression model leads to unbiased estimates of the mean. Nevertheless, this method can still yield misleading inferences due to inaccurate regression coefficients resulting from the small variability of the imputations.

3.2.2.3 LINEAR REGRESSION

Linear regression fits two or more variables to a linear equation that minimizes the sum of squared residuals; the best fit is the line that fits most of the data. The missing values are then predicted from the fitted line. Consider the case of two variables satisfying the linear equation:

Y = a + bX (3.1)

where Y is the variable of interest (dependent variable) and X is the auxiliary (independent) variable.

a, the intercept of the linear equation, is the predicted mean for variable Y. From equation (3.1), we can deduce the equation for a:

a = Y_ave − b X_ave (3.2)

b, the regression coefficient, determines the association between variables X and Y. A positive b indicates a positive association between X and Y, that is, increasing X causes an increase in Y; a negative b implies that an increase in X results in a decrease in Y; b equal to zero means there is no linear association between X and Y. (The bounded measure of the strength of association is the correlation r, introduced below.) b is calculated using the formula:

b = Σxy / Σx² = C_xy / S_x² (3.3)

where C_xy is the covariance of X and Y and S_x² is the variance of X.

Once the linear regression equation has been obtained, we need to analyze how well the data fit the line. Therefore r, which measures the degree of the linear fit, is introduced. r is mathematically expressed as:

r = S_xy / √(S_xx · S_yy) (3.4)

where S_xy is the covariance of X and Y, S_xx is the variance of X and S_yy is the variance of Y. A perfect linear fit occurs when r = 1, whereas r = 0 means there is no linear fit between the dependent and independent variables.

The idea of linear regression can be further extended to multiple linear regression by introducing more independent variables to be regressed against the dependent variable, further reducing the sum of squared residuals. The best model is determined using all the regressions performed.

To construct a multiple linear regression, equation (3.1) is expanded to include more independent variables:

Y = a + b1 X1 + b2 X2 + ... + bn Xn (3.5)

The calculation of b1, b2, ..., bn is similar to the calculation of b shown earlier. Below are the illustrations for linear regression and multiple linear regression.


Figure 3.1 Linear Regression

Figure 3.2 Multiple Linear Regressions
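The quantities a, b and r from equations (3.2)–(3.4) can be computed directly. The following is a minimal Python sketch with made-up data (the project itself used MATLAB):

```python
import numpy as np

# Illustrative data: Y depends roughly linearly on X (values are made up).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# b = C_xy / S_x^2 (equation 3.3), a = Y_ave - b * X_ave (equation 3.2)
b = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
a = Y.mean() - b * X.mean()

# r (equation 3.4): degree of linear fit, 1 for a perfect line
r = np.cov(X, Y, ddof=1)[0, 1] / np.sqrt(np.var(X, ddof=1) * np.var(Y, ddof=1))

# A missing Y would then be predicted from the fitted line.
y_hat = a + b * 2.5
print(b, a, r)  # slope near 2, intercept near 0, r close to 1
```

For multiple regression (equation 3.5), the same least-squares idea applies with X replaced by a matrix of predictors, e.g. via `np.linalg.lstsq`.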

There is a limitation to the linear regression discussed above: it can only be used to analyze a single dependent variable. For this reason, generalized linear regression may be opted for; its details are addressed in section 3.2.2.5.

3.2.2.4 LOGISTIC REGRESSION

Instead of linear regression, logistic regression is used for data with dichotomous outcomes. Logistic regression gives better flexibility compared to linear regression, as it exhibits the S shape of the sigmoid function:

p = exp(Y) / (1 + exp(Y)) (3.6)

The p generated from equation (3.6) is the probability of the value of the variable being predicted, and hence takes a value between 0 and 1. Substituting equation (3.5) into (3.6) gives:


p = exp(a + b1 X1 + ... + bn Xn) / (1 + exp(a + b1 X1 + ... + bn Xn)) (3.7)

Again, the values of b1, b2, ..., bn are calculated in the same way as for linear regression. From the logistic regression model obtained, one can predict values to fill in the missing holes. Since logistic regression works for binary or dichotomous outcomes, the predicted value is the probability that the dependent variable Y equals 1. Figure 3.3 below illustrates the logistic regression model.

Figure 3.3 Logistic Regression
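Equations (3.6) and (3.7) amount to passing a linear predictor through the sigmoid. A small Python sketch with hypothetical coefficients (a, b1 and b2 below are made up for illustration):

```python
import math

# Equation 3.6: the sigmoid maps the linear predictor to a probability in (0, 1).
def sigmoid(y):
    return math.exp(y) / (1.0 + math.exp(y))

# Hypothetical coefficients for two predictors, as in equation 3.7.
a, b1, b2 = -1.0, 0.8, 0.5

def predict(x1, x2):
    return sigmoid(a + b1 * x1 + b2 * x2)

print(predict(0.0, 0.0))  # sigmoid(-1), about 0.27
print(predict(2.0, 1.0))  # sigmoid(1.1), about 0.75
```

The output is read as the probability that the dichotomous Y equals 1, which is the value imputed for a missing binary cell.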

3.2.2.5 GENERALIZED LINEAR REGRESSION

The generalized linear model approach seeks to overcome the limitation of multiple linear regression by allowing more dependent variables to be analyzed in parallel. The procedure for producing the regression coefficients is similar to that discussed in the linear regression section; it differs only in that the dependent variable Y is now a matrix instead of a vector, and the regression coefficients are likewise tabulated as a matrix.

The advanced ability of this model compared to multiple linear regression enables the combination or transformation of multiple linear dependent variables. Whereas multiple linear regression is a univariate method, generalized linear regression is a multivariate method that uses the correlation information of the dependent variables.

3.2.2.6 HOT DECK IMPUTATION

In contrast to mean imputation, the hot deck imputation technique does not distort the distribution of the sample, because different observed values are substituted into the missing holes instead of the population mean. The idea of hot deck imputation is to draw a prediction from a currently observed set of values, called the donor, for the same variable that is almost similar to the set with missing data, the client. The values to be imputed are drawn from the donor using methods such as nearest neighbour imputation, similar response pattern imputation (SRPI), longitudinal imputation and cross-sectional imputation. Updates to the empty cells result in continuous updates to the sample as well.

The following procedures describe the steps to apply hot deck imputation:

1. Determine the initial deck or donor to be used.

2. Select sample cases.

3. Categorize sample into subclasses.

4. Amend the hot deck values to reflect subclasses by choosing records with complete

observations.

5. Substitute missing values with chosen values from donor and update the hot deck

value.
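One common way to realise the final substitution step is nearest-neighbour matching. The Python sketch below uses illustrative values and assumes Euclidean distance on the observed variables; it is one possible donor-selection rule, not the only one mentioned above.

```python
import numpy as np

# Toy donor pool: complete cases act as the "deck"; the client case has a
# missing value on the last variable (all values illustrative).
donors = np.array([[1.0, 2.0, 3.0],
                   [1.1, 2.1, 3.2],
                   [9.0, 8.0, 7.0]])
client = np.array([1.08, 2.08, np.nan])

# Nearest-neighbour hot deck: match the client to the donor that is closest
# on the observed variables, then copy the donor's value into the hole.
obs = ~np.isnan(client)
dists = np.linalg.norm(donors[:, obs] - client[obs], axis=1)
donor = donors[np.argmin(dists)]
imputed = client.copy()
imputed[~obs] = donor[~obs]

print(imputed)  # hole filled with the nearest donor's value
```

Because the imputed value is a genuinely observed one, the sample distribution is distorted far less than by mean substitution.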

3.2.2.7 COLD DECK IMPUTATION

Cold deck imputation, on the other hand, utilizes previous data from similar cases or surveys to obtain the values for imputation. An example of cold deck imputation is using the previous year's data to fill the missing gaps. Since the deck data come from history, they are not updated each time missing data are filled in; the deck values are always static, or fixed. All the procedures described for hot deck imputation apply except the fifth.

3.2.3 REWEIGHTING

The essence of reweighting is similar to regression: to produce a data set that minimizes the sum of squared residuals. Introducing a weight column in the data set helps to achieve this aim. The role of the weights is to attenuate high-value residuals while keeping the lower residuals unchanged. Once reweighting has been completed, the original data set is regressed against the new set, applying the same regression procedures as before.
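One systematic way to produce such residual-attenuating weights is iteratively reweighted least squares. The report's own weight assignment is simpler, so the Python sketch below is only an illustration of the attenuation idea under an assumed Huber-style weighting rule, not the project's method.

```python
import numpy as np

# Illustrative data: a line plus noise, with one gross outlier.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)
y[5] += 20.0  # the outlier that plain least squares would chase

A = np.column_stack([np.ones_like(x), x])
w = np.ones_like(y)
for _ in range(10):
    # Weighted least squares: minimise sum of w_i * residual_i^2.
    W = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * W[:, None], y * W, rcond=None)
    r = y - A @ coef
    s = np.median(np.abs(r)) + 1e-12
    # Attenuate large residuals, keep small ones at full weight.
    w = 1.0 / np.maximum(np.abs(r) / s, 1.0)

print(coef)  # intercept and slope stay close to (1, 2) despite the outlier
```

Each pass refits the line with weights derived from the previous residuals, so the outlier's influence shrinks toward zero while well-fitting points keep weight 1.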


3.2.4 FULL INFORMATION MAXIMUM LIKELIHOOD (FIML)

All the methods described beforehand share one common goal: to obtain a complete set of data for analysis. The difference is only the approach. Section 3.2.1 opts for deletion of missing values from the data set, whereas the methods of section 3.2.2 achieve the goal by imputing values to fill the gaps.

The FIML approach is to maximize the likelihood function of a model given the observed data, under the assumption that the data distribution is multivariate normal. An attractive advantage FIML offers is that it does not lead to biased estimation regardless of the number of missing values. In return, however, it demands a relatively large amount of data. Other than this drawback, FIML is considered impressive, as it still accepts even data sets that do not fully exhibit the criteria of a multivariate normal distribution.

3.2.5 MULTIPLE IMPUTATIONS

The idea of multiple imputation comes from the fact that there are several possible values that one can impute. By taking a combination of all the imputations performed as the actual value, we account for all the uncertainty present in the prediction of the missing values. Generally, multiple imputation repeats a single imputation between two and five times to produce predicted data sets. Upon completion of the multiple imputations, we are able to figure out the best inferences for the missing values.

A few of the methods introduced earlier, such as mean imputation and regression imputation, may be performed with multiple imputations. However, SRPI and multiple imputation might not be an appropriate combination, as SRPI will draw the same value in each imputation.
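A minimal sketch of the idea follows (not the project's MATLAB code): a stochastic single imputation is repeated m times and the resulting estimates are pooled. The use of mean-plus-Gaussian-noise as the single-imputation step, and the noise scale, are simplifying assumptions for illustration.

```python
import random
import statistics

def multiply_impute(values, m=5, seed=0):
    """Impute missing values (None) m times with mean + random noise, then pool.

    Each single imputation draws mean + Gaussian noise, so the m completed
    data sets differ; pooling their means reflects imputation uncertainty.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    completed_means = []
    for _ in range(m):
        completed = [v if v is not None else rng.gauss(mu, sigma) for v in values]
        completed_means.append(statistics.mean(completed))
    # Pooled point estimate: the average over the m completed data sets.
    return statistics.mean(completed_means)

data = [1.0, 2.0, None, 4.0, None, 3.0]
pooled = multiply_impute(data)
```

Note that a deterministic method such as SRPI would produce the same completed data set in every repetition, which is why it combines poorly with multiple imputation.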


4 METHOD OF APPROACH

This project mainly concentrated on continuous data analysis. Throughout the project, a few imputation techniques and reweighting were evaluated to demonstrate how well they performed in the presence of different percentages of missing data.

An analysis could be either a one-stage or a two-stage process. A one-stage process only involved the imputation procedure, while a two-stage process had an additional preprocessing stage.

The analyses of overall mean imputation and class mean imputation were one-stage processes, while regression imputation and reweighting were two-stage. Each method was tested with different values of data variance to observe its degree of robustness. The missing data percentage was varied as well for the same purpose.

Prior to a thorough discussion of the methods, sections 4.1, 4.2 and 4.3 cover the elementary knowledge used in the development of this project: Bayesian Decision Theory, Multivariate Density and the Discriminant Function.

4.1 BAYESIAN DECISION THEORY

Bayesian decision theory is one of the methods applied to deal with pattern classification problems. It makes use of Bayes' Theorem to arrive at a decision. The essence of Bayes' Theorem is:

'Given prior knowledge that an event will occur and the likelihood of the event occurring, the conditional probability of the event occurring given a set of data can be predicted.'

Mathematically, the expression is as such:

P(ωj|x) = p(x|ωj)P(ωj) / p(x), that is, posterior = (likelihood × prior) / evidence (4.1)

Let us consider a simple example: predicting the absence of a student from a class on a particular day. It is known that the student will be absent if it is raining, with probability a. The probability that it is raining on that day is b. With such information, we have to predict the probability that the student will be absent on that day. In this case, we have to make a decision between two possibilities, absent or present. The decision will be absence on Wednesday if the posterior of absence, P(absent|Wednesday), exceeds P(present|Wednesday). Otherwise, the decision will be presence on Wednesday.

It is very likely that the decision made is wrong at times. This occurs when:

P(error|x) = P(ω1|x) if we decide ω2
P(error|x) = P(ω2|x) if we decide ω1

Such error has to be minimized in order to avoid wrong predictions. To minimize the error, we have to decide ω1 when P(ω1|x) > P(ω2|x), and vice versa. The error equation hence becomes:

P(error|x) = min[P(ω1|x), P(ω2|x)] (4.2)

Since prior knowledge of the data set is needed to apply Bayesian Decision Theory, a

prior probability of 0.5 has been used in the analysis.
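The decision rule of equations 4.1 and 4.2 can be sketched as follows; this is an illustrative Python fragment (the project itself was coded in MATLAB), and the likelihood values for the student example are hypothetical:

```python
def bayes_decide(likelihoods, priors):
    """Return the index of the class with the largest posterior, plus P(error|x).

    likelihoods: p(x|class) for each class; priors: P(class) for each class.
    The evidence p(x) is the same for every class, so it only normalizes.
    """
    evidence = sum(p * w for p, w in zip(likelihoods, priors))
    posteriors = [p * w / evidence for p, w in zip(likelihoods, priors)]
    decision = max(range(len(posteriors)), key=lambda i: posteriors[i])
    p_error = min(posteriors)  # two-class case: P(error|x) = min posterior
    return decision, p_error

# Student example with assumed likelihoods p(x|absent) = 0.9 and
# p(x|present) = 0.3, and the equal priors of 0.5 used in the analysis.
decision, p_error = bayes_decide([0.9, 0.3], [0.5, 0.5])
```

With equal priors the decision reduces to comparing the likelihoods, which is the situation in this project.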

4.2 MULTIVARIATE DENSITY

As mentioned earlier, throughout this project the data was assumed to be random multivariate Gaussian with dimension d. For simplicity, d = 3 is used for the analysis. Thus, the normal multivariate density can be expressed as:

p(x) = 1/((2π)^(d/2) |Σ|^(1/2)) exp(−(1/2)(x − µ)ᵗ Σ⁻¹ (x − µ)) (4.3)

where x is the d-component column vector, µ is the d-component mean vector, Σ is the d-by-d covariance matrix, |Σ| is the determinant of the covariance matrix and Σ⁻¹ is the inverse of the covariance matrix.
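Equation 4.3 can be evaluated directly. The sketch below is an illustrative Python version (the project used MATLAB) restricted to a diagonal covariance matrix, a simplifying assumption that avoids a general matrix inverse: |Σ| becomes the product of the variances and the quadratic form a sum of squared standardized deviations.

```python
import math

def mvn_density_diag(x, mu, var):
    """Multivariate normal density of eq. 4.3 for a diagonal covariance.

    x, mu: length-d vectors; var: the d diagonal entries of the covariance.
    """
    d = len(x)
    det = math.prod(var)  # determinant of a diagonal matrix
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (d / 2) * math.sqrt(det))

p = mvn_density_diag([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
# At the mean of a standard trivariate normal the density is 1/(2*pi)^(3/2).
```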

4.3 DISCRIMINANT FUNCTION

The discriminant function is one way to perform classification. For a dichotomizer case, we can easily classify the data with a single discriminant function g(x), where g(x) = g1(x) − g2(x). Using the discriminant function, the decision is ω1 if g1 > g2, and vice versa.

gi(x) can be defined as P(ωi|x), that is, the probability of getting ωi given x. Using Bayes' theorem,

gi(x) = P(ωi|x) = p(x|ωi)P(ωi) / ∑j p(x|ωj)P(ωj)

We can now represent g(x) as g(x) = P(ω1|x) − P(ω2|x). Taking the natural logarithm for the right-hand side of g(x) gives:

g(x) = ln[p(x|ω1)/p(x|ω2)] + ln[P(ω1)/P(ω2)] (4.4)

This equation will be used to determine the class of a tested value. The test will be done on

both complete and altered data to compare the outcome. From equation 4.3, we can thus

manipulate equation 4.4 into:

gi(x) = −(1/2) xᵗ Σi⁻¹ x + µiᵗ Σi⁻¹ x − (1/2) µiᵗ Σi⁻¹ µi − (1/2) ln|Σi| + ln P(ωi) (4.5)

Equation 4.5 is used as the discriminant function in the analysis. A vector x of dimension 1-by-d will be used to verify that the classification with predicted values in the missing holes matches the classification result with complete data. Given a vector of test values x, if g1(x) > g2(x) the test value belongs to class 1, and vice versa.
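The classification test can be sketched as follows. This is an illustrative Python fragment (the project used MATLAB) that, like the density sketch above, assumes diagonal covariance matrices; the class parameters and the test vector are hypothetical, with the equal priors of 0.5 used in the analysis.

```python
import math

def discriminant(x, mu, var, prior):
    """g_i(x) in the spirit of eq. 4.5, for a diagonal covariance:
    -1/2 (x-mu)^t Sigma^{-1} (x-mu) - 1/2 ln|Sigma| + ln P(omega_i)."""
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    logdet = sum(math.log(v) for v in var)
    return -0.5 * quad - 0.5 * logdet + math.log(prior)

# Hypothetical class parameters (d = 3) and a test vector near class 1.
mu1, mu2 = [0.3, 0.3, 0.3], [1.6, 1.5, 1.6]
var1, var2 = [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]
x = [0.3, 0.3, 0.3]
g1 = discriminant(x, mu1, var1, 0.5)
g2 = discriminant(x, mu2, var2, 0.5)
decision = 1 if g1 > g2 else 2  # class 1 if g1(x) > g2(x), and vice versa
```

In the project this comparison is run twice, once with the complete-data parameters and once with the parameters estimated after imputation or reweighting, and the two decisions are compared.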

4.4 HYPOTHESIS TEST

Hypothesis testing is an essential tool in statistical analysis. It can be used to test the means of two distributions. In order to begin hypothesis testing, a hypothesis has to be defined.

The null hypothesis represents a statement that is presumed true until the data provide sufficient evidence to reject it. In this case, the null hypothesis is that the new distribution is the same as the distribution of the complete data set.

By using the ttest2 command in MATLAB, we are able to perform this hypothesis test in this project. The outputs of the test are a significance level and a confidence interval, which help us in making inferences.
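The quantity behind MATLAB's ttest2 is the pooled two-sample t statistic. A minimal Python sketch (for illustration only; it omits the p-value and confidence interval that ttest2 also reports):

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled two-sample t statistic for testing equality of two means.

    Under the null hypothesis that both samples share the same mean, t
    follows a Student t distribution with len(a) + len(b) - 2 degrees of
    freedom; a large |t| is evidence against the null.
    """
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Identical samples give t = 0: the null of equal means is not rejected.
t = two_sample_t([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
```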

4.5 REWEIGHTING

Reweighting was the main method of interest in this project; the other methods used in the simulation served comparison purposes only. The major work done for this project related to this method. Even though its effectiveness was not well proven, the approaches of this method were attractive. If well developed, it might become a useful technique for analyzing missing data.


Reweighting comprises producing a new data set and regression. As it was a two-stage process, preprocessing took place to remove NaNs from the data set, which means that the analysis was performed only with observed data. With the first stage completed, a set of rules was then established to define weight assignments for the data set.

The purpose of the weight assignments was to weight the data while keeping the distribution very similar to the original. With those weights, the sum of squared residuals would be reduced, indicating that the data was closer to the true value.

A residual was defined as the deviation of an observed value from the mean of the data. The total residual over all observed samples should equal zero, since the numbers were random with a specified mean and variance. A minimum sum of squared errors guaranteed a good fit to the mean line.

Residuals with significant values were penalized by assigning lower weights, while the lower residuals were kept unchanged, that is, given a weight of one. The rules used in this analysis were as follows:

• Weight = 0 for missing data and 0.8 < ratio of residual ≤ 1.0

• Weight = 0.25 for 0.6 < ratio of residual ≤ 0.8

• Weight = 0.5 for 0.4 < ratio of residual ≤ 0.6

• Weight = 0.75 for 0.2 < ratio of residual ≤ 0.4

• Weight = 1 for 0 ≤ ratio of residual ≤ 0.2

The ratio of residual was simply the ratio of the residual to the standard deviation. There was no specific calculation behind the selection of the weights above. The most important guidelines for weight selection were that applying the weights should minimize the sum of squared residuals while leaving the distribution unchanged. It is believed that a more thorough weight computation would show better performance.

By multiplying the weight with the squared residual of each sample, we obtained a new set of squared residuals. Taking the square root of each gave a new set of residuals, from which a new set of data was calculated. Next, the preprocessed data was regressed against the new data, producing another new data set. This new data set was the result of the reweighting and regression manipulations. Subsequently, the analysis continued by computing the means and covariance matrices for each class.
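The weight rules and the residual manipulation described above can be illustrated with a minimal Python sketch (the project itself was implemented in MATLAB; the sample data is hypothetical). The thresholds are the ones listed in the rules; ratios above 0.8, like missing entries, receive weight zero.

```python
import math
import statistics

def weight_for(ratio):
    """Weight rules of section 4.5: larger residual ratios get smaller weights."""
    if ratio > 0.8:
        return 0.0
    if ratio > 0.6:
        return 0.25
    if ratio > 0.4:
        return 0.5
    if ratio > 0.2:
        return 0.75
    return 1.0

def reweight(data):
    """Attenuate large residuals: new residual = sqrt(weight * residual^2)."""
    mu = statistics.mean(data)
    sd = statistics.stdev(data)
    out = []
    for x in data:
        r = x - mu
        w = weight_for(abs(r) / sd)
        new_r = math.copysign(math.sqrt(w * r * r), r)
        out.append(mu + new_r)
    return out

data = [1.0, 2.0, 3.0, 4.0, 100.0]
reweighted = reweight(data)
# The outlier's residual ratio exceeds 0.8, so its value is pulled to the mean.
```

In the project, the preprocessed (observed-only) data would then be regressed against the reweighted set to obtain the final data used for classification.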


Each technique's performance was assessed by adding a test value to verify any misclassification. The code written is flexible in that it prompts the user to enter test values, so testing more values during technique assessment was less of a hassle.


5 RELATED WORKS

The reweighting method has appeared in several publications, for example by Karla Nobrega and David Haziza. The idea proposed by them is basically similar to the idea of reweighting presented here. The concepts are:

1. Only consider observed data.

2. Adjust the sampling weights of the observed data to compensate for the deleted cases.

The weight assignment presented in this project is just simple guessing, whereas the assignment suggested by Nobrega and Haziza is more appropriate. In their work, to begin the reweighting procedure, a model needs to be estimated to reflect the unobserved data.

The weight is then computed using the following formula:

wi* = wi / p̂i

Hence the estimated population total Y* will be:

Ŷ* = ∑(i ∈ sr) wi* yi = ∑(i ∈ sr) (wi / p̂i) yi

where p̂i is the estimated probability that unit i is observed, wi is the initial weight of the data and yi is the observed data. p̂i ≈ pi guarantees an unbiased estimate of the population total Y.

In the case of the reweighting method presented in this project, as stated earlier, no calculation was done to obtain the appropriate weights. Since the analysis was conducted using such a simple intuitive assignment, expectations on its performance should not be high.
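Under the formula above, each observed case's contribution is inflated by the inverse of its estimated probability of being observed. A minimal sketch with assumed values and response probabilities:

```python
def ipw_total(y, w, p_hat):
    """Estimated population total: sum over respondents of (w_i / p_hat_i) * y_i."""
    return sum(wi / pi * yi for yi, wi, pi in zip(y, w, p_hat))

# Observed values, initial design weights, and assumed response probabilities.
# A unit observed with probability 0.5 stands in for two such units.
y = [10.0, 20.0, 30.0]
w = [1.0, 1.0, 1.0]
p_hat = [0.5, 1.0, 0.5]
total = ipw_total(y, w, p_hat)
```

This shows why only observed data is needed: the adjusted weights make the respondents represent the deleted cases instead of imputing values for them.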

Another major difference is that the reweighting in this project also applies the weights to the residuals of the observed data. The weighted residuals are then used to predict a new set of data. This new set of data is used as a regression model to find the regression coefficients. The original observed data is finally regressed against the weighted new data.

As far as reweighting is concerned, techniques introduced by Nobrega and Haziza

have been one of the established methods of reweighting. They have agreed that reweighting

method has advantages over other methods like imputation in the sense that it does not


promote the creation of artificial data. Besides its simplicity, the availability of a number of software packages to compute the estimates reduces the time needed to estimate the population model.

However, reweighting cannot escape the curse of dimensionality. If a large number of variables is present and the missing mechanism is item non-response, a large number of adjusted weights have to be worked out.

Another citation about reweighting was found in the Newborn Lung Project at the University of Wisconsin, USA. The method presented was similar to Nobrega's and Haziza's: the weight is simply the inverse of the probability of observing the data. In that project, the performance of reweighting was measured by calculating the deviation of the reweighted data from the complete data case. It was found that the reweighted data did not perform well in terms of mean deviation, but the reverse held for variance deviation. Overall, data reweighting was not the method of choice in that study.


6 RESULTS

6.1 Overall Mean Imputation

The performance of overall mean imputation in the presence of 20% missing values is presented in Table 6.1. The performance is measured by how much the mean deviated from the actual complete data case over varying values of variance.

From the table, it can be concluded that in the presence of 20% missing data, the mean of the imputed data deviated between 2.5% and 6.5% compared to the complete case. This is considered an acceptable range, as no misclassification occurred.

An increase of the missing values to 40%, however, showed poor performance. The result is indicated in Table 6.2, from which it can be concluded that the deviation from the complete case was too severe, leading to misclassification for all values of variance tested. The percentage of mean deviation with 40% missing ranged from 12% to 24%.

At this point, we should expect that with a higher percentage of missing data, the performance of the overall mean imputation method would be poor as well (refer to Table 6.3).

From the above details, the overall mean imputation method generally does not guarantee reliable imputed values. This is critical especially when the data is not well separated, that is, when the means of class 1 and class 2 are close to each other. As mentioned earlier, the resulting distribution will shrink because the mean value is used to replace the missing ones. The more mean substitution takes place, the more severe the variance shrinkage.

The last three rows in the table show the results of the hypothesis test between the new data and the original data. A value of 1 for h indicates that the null hypothesis can be rejected at the confidence level and within the confidence interval specified. This means that if a chosen test value lies in the range of the specified confidence interval, the result of the analysis with the complete case should match the result after applying the overall mean imputation technique.

6.2 Class Mean Imputation

As discussed in the previous chapter, class mean imputation is a better option for mean imputation. Even though it still results in a decrease in variance, the decrease is much less severe, as the class data is taken into account. This is indicated by the percentage of error in the class means in Tables 6.4, 6.5 and 6.6.

From those tables, it can be concluded that the deviation between the data with imputed values and the complete-case data is in the range of 0.5% to 3.5%. This indicates that the method produces reliable predictions to fill in the missing gaps. However, misclassification still occurred for missing percentages of 40% and 50%. This would not happen if class 1 and class 2 were well discriminated.

6.3 Linear Regression

The performance of the linear regression method is shown in Tables 6.7 to 6.9. It shows no misclassification even in the presence of a large amount of missing data.

In terms of percentage of mean deviation, for 20% missing data the deviation ranged from 0.5% to 10%. For 40% missing data the deviation lay in the interval from 10% to 18%, while for 50% missing data the interval was between 19% and 32%. This is quite an interesting finding: it shows that the percentage of mean deviation is not a good indicator of the performance of missing data techniques. Even though the percentage of deviation shows a considerable value, misclassification does not occur.

6.4 Reweighting

The results of the evaluation of the reweighting method are illustrated in Tables 6.10 to 6.12. They strengthen the finding from linear regression that the percentage of mean deviation does not serve as a measure of a technique's performance.

The situation for reweighting is far more complex to understand. In Tables 6.11 and 6.12, the results at low variance show misclassification, which contradicts the expectation mentioned earlier. It is noticed that at lower variance the percentage of mean deviation is high, and the value reduces significantly when the variance of the data is increased. Theoretically, misclassification tends to occur at higher variance. At this stage, this is left as an unsolved puzzle.

With 20% missing data, the percentage of deviation varies between 0.5% and 9% for data variance higher than 1. At those variances, there is also a very slight increase in the percentage of deviation when the missing data is at 40% and 50%. A percentage between 25% and 58% is observed for variance equal to 1 for all of 20%, 40% and 50% missing data.

As the tables of results show, this method does not seem to be the best. However, given more technical care in the weight assignments, it could perform at least equivalently to linear regression.

6.5 Further works on Reweighting

Given the unexpected findings on reweighting discussed above, it should be beneficial to work further on it. One obvious item to rework is the weight assignment. Since in this project the weights were assigned intuitively, the method can be improved by introducing a more technical and systematic way of calculating the weights. With such an approach, the results might turn out to be more impressive. Thorough research in statistics might help in this case.

Another area that has not been covered is the performance of this method on binary data. This would be a critical area to evaluate, as it would show how robust the reweighting method is to different types of data.


METHOD: OVERALL MEAN IMPUTATION
(Each row lists Variables 1-3 for σ = 1 | σ = 5 | σ = 10 | σ = 15.)

Complete data Class 1, Mean: 0.31452 0.30575 0.326 | 1.5726 1.5288 1.63 | 3.1452 3.0575 3.26 | 4.7178 4.5863 4.89
Complete data Class 1, Std Dev: 0.2699 0.26247 0.25607 | 1.4571 1.4757 1.3957 | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data Class 2, Mean: 0.3247 0.30282 0.33435 | 1.6235 1.5141 1.6717 | 3.247 3.0282 3.3435 | 4.8705 4.5423 5.0152
Complete data Class 2, Std Dev: 0.2699 0.26247 0.25607 | 1.4194 1.36 1.4975 | 2.8388 2.72 2.9951 | 4.2582 4.0799 4.4926
Test Value: 0.3 0.3 0.3 | 1.5 1.5 1.5 | 3 3 3 | 4.5 4.5 4.5
Mean Imputed data Class 1, Mean: 0.3014 0.29114 0.32762 | 1.507 1.4557 1.6381 | 3.014 2.9114 3.2762 | 4.521 4.367 4.9143
Mean Imputed data Class 1, Std Dev: 0.26053 0.26533 0.25373 | 1.3026 1.3267 1.2686 | 2.6053 2.6533 2.5373 | 3.9079 3.98 3.8059
Mean Imputed data Class 1, % Mean Error: 4.17% 4.78% 0.50% | 4.17% 4.78% 0.50% | 4.17% 4.78% 0.50% | 4.17% 4.78% 0.50%
Mean Imputed data Class 2, Mean: 0.3124 0.29128 0.31772 | 1.562 1.4564 1.5886 | 3.124 2.9128 3.1772 | 4.6861 4.3693 4.7659
Mean Imputed data Class 2, Std Dev: 0.2579 0.25414 0.25699 | 1.2895 1.2707 1.2849 | 2.579 2.5414 2.5699 | 3.8686 3.8122 3.8548
Mean Imputed data Class 2, % Mean Error: 3.79% 3.81% 4.97% | 3.79% 3.81% 4.97% | 3.79% 3.81% 4.97% | 3.79% 3.81% 4.97%
Classification with complete data: C2 | C2 | C2 | C2
Classification with imputed data: C2 | C2 | C2 | C2
Null hypothesis: mean imputed distribution is equal to the complete case distribution
h: 0 0 0 | 0 0 0 | 0 0 0 | 0 0 0
Significance level: 0.769 0.4428 0.6646 | 0.769 0.4428 0.6646 | 0.769 0.4428 0.6646 | 0.769 0.4438 0.6466
Confidence interval: [-0.0411, Inf] [-0.0465, 0.0203] [-0.0415, 0.0265] | [-0.2057, Inf] [-0.2324, 0.1017] [-0.2073, 0.1323] | [-0.4660, 0.2118] [-0.4649, 0.2034] [-0.4146, 0.2645] | [-0.6171, Inf] [-0.6973, 0.3051] [-0.6219, 0.3968]

Table 6.1 Missing data rate 20% with Overall Imputation Method


METHOD: OVERALL MEAN IMPUTATION
(Each row lists Variables 1-3 for σ = 1 | σ = 5 | σ = 10 | σ = 15.)

Complete data Class 1, Mean: 0.31452 0.30575 0.326 | 1.5726 1.5288 1.63 | 3.1452 3.0575 3.26 | 4.7178 4.5863 4.89
Complete data Class 1, Std Dev: 0.2699 0.26247 0.25607 | 1.4571 1.4757 1.3957 | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data Class 2, Mean: 0.3247 0.30282 0.33435 | 1.6235 1.5141 1.6717 | 3.247 3.0282 3.3435 | 4.8705 4.5423 5.0152
Complete data Class 2, Std Dev: 0.2699 0.26247 0.25607 | 1.4194 1.36 1.4975 | 2.8388 2.72 2.9951 | 4.2582 4.0799 4.4926
Test Value: 0.2 0.2 0.2 | 1.5 1.5 1.5 | 3 3 3 | 4.5 4.5 4.5
Mean Imputed data Class 1, Mean: 0.26143 0.2527 0.26845 | 1.3071 1.2635 1.3423 | 2.6405 2.345 2.8482 | 3.9214 3.7905 4.0268
Mean Imputed data Class 1, Std Dev: 0.22707 0.23694 0.22343 | 1.1353 1.1847 1.1171 | 2.2692 2.2563 2.3075 | 3.406 3.5541 3.3514
Mean Imputed data Class 1, % Mean Error: 16.88% 17.35% 17.65% | 16.88% 17.35% 17.65% | 16.05% 23.30% 12.63% | 16.88% 17.35% 17.65%
Mean Imputed data Class 2, Mean: 0.28286 0.25514 0.26502 | 1.4143 1.2757 1.3251 | 2.79 2.4657 2.7664 | 4.2429 3.8271 3.9752
Mean Imputed data Class 2, Std Dev: 0.23593 0.22891 0.23154 | 1.1797 1.1445 1.1577 | 2.3168 2.2204 2.3343 | 3.539 3.4336 3.4732
Mean Imputed data Class 2, % Mean Error: 12.89% 15.75% 20.74% | 12.89% 15.75% 20.73% | 14.07% 18.58% 17.26% | 12.89% 15.75% 20.74%
Classification with complete data: C2 | C2 | C2 | C2
Classification with imputed data: C1 | C1 | C1 | C1
Null hypothesis: mean imputed distribution is equal to the complete case distribution
h: 1 1 1 | 1 1 1 | 1 1 1 | 1 1 1
Significance level: 0.0046 0.0021 1.53E-04 | 0.0046 0.0021 1.53E-04 | 0.0046 0.0021 1.53E-04 | 0.0046 0.0021 1.53E-04
Confidence interval: [-0.4472, 0.4641] [-0.4082, 0.4652] [-0.3011, 0.6170] | [-0.3996, -0.0733] [-0.4110, -0.0912] [-0.4775, -0.1524] | [-0.7992, -0.1466] [-0.8220, 0.1824] [-0.9551, -0.3048] | [-1.1989, -0.2198] [-1.2330, -0.2376] [-1.4326, -0.4571]

Table 6.2 Missing data rate at 40% with Overall Imputation Method


METHOD: OVERALL MEAN IMPUTATION
(Each row lists Variables 1-3 for σ = 1 | σ = 5 | σ = 10 | σ = 15.)

Complete data Class 1, Mean: 0.31452 0.30575 0.326 | 1.5726 1.5288 1.63 | 3.1452 3.0575 3.26 | 4.7178 4.5863 4.89
Complete data Class 1, Std Dev: 0.2699 0.26247 0.25607 | 1.4571 1.4757 1.3957 | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data Class 2, Mean: 0.3247 0.30282 0.33435 | 1.6235 1.5141 1.6717 | 3.247 3.0282 3.3435 | 4.8705 4.5423 5.0152
Complete data Class 2, Std Dev: 0.2699 0.26247 0.25607 | 1.4194 1.36 1.4975 | 2.8388 2.72 2.9951 | 4.2582 4.0799 4.4926
Test Value: 0.3 0.3 0.3 | 1.5 1.5 1.5 | 3 3 3 | 4.5 4.5 4.5
Mean Imputed data Class 1, Mean: 0.2342 0.21332 0.23577 | 1.171 1.0666 1.1788 | 2.342 2.1332 2.3577 | 3.5131 3.1998 3.5365
Mean Imputed data Class 1, Std Dev: 0.21487 0.21467 0.20597 | 1.0743 1.0733 1.0298 | 2.1487 2.1467 2.0597 | 3.223 3.22 3.0895
Mean Imputed data Class 1, % Mean Error: 25.54% 30.23% 27.68% | 25.54% 30.23% 27.68% | 25.54% 30.23% 27.68% | 25.54% 30.23% 27.68%
Mean Imputed data Class 2, Mean: 0.2545 0.23198 0.2408 | 1.2725 1.1599 1.204 | 2.545 2.3198 2.408 | 3.8174 3.4797 3.612
Mean Imputed data Class 2, Std Dev: 0.22456 0.21503 0.21737 | 1.1228 1.0752 1.0868 | 2.2456 2.1503 2.1737 | 3.3684 3.2255 3.2605
Mean Imputed data Class 2, % Mean Error: 21.62% 23.39% 27.98% | 21.62% 23.39% 27.98% | 21.62% 23.39% 27.98% | 21.62% 23.39% 27.98%
Classification with complete data: C2 | C2 | C2 | C2
Classification with imputed data: C1 | C1 | C1 | C1
Null hypothesis: mean imputed distribution is equal to the complete case distribution
h: 0 0 0 | 0 0 0 | 1 1 1 | 1 1 1
Significance level: 4.47E-06 3.48E-07 1.78E-08 | 4.47E-06 3.48E-07 1.78E-08 | 4.47E-06 3.48E-07 1.78E-08 | 4.47E-06 3.48E-07 1.78E-08
Confidence interval: [-0.1073, -0.0432] [-0.1129, -0.0504] [-0.1236, -0.0601] | [-0.5364, -0.2162] [-0.5643, -0.2520] [-0.6182, -0.3007] | [-1.0727, -0.4325] [-1.1286, -0.5041] [-1.2364, -0.6014] | [-1.6091, -0.6487] [-1.6930, -0.7561] [-1.8546, -0.9021]

Table 6.3 Missing data rate at 50% with Overall Mean Imputation


METHOD: CLASS MEAN IMPUTATION
(Each row lists Variables 1-3 for σ = 1 | σ = 5 | σ = 10 | σ = 15.)

Complete data Class 1, Mean: 0.31452 0.30575 0.326 | 1.5726 1.5288 1.63 | 3.1452 3.0575 3.26 | 4.7178 4.5863 4.89
Complete data Class 1, Std Dev: 0.29142 0.29514 0.27915 | 1.4571 1.4757 1.3957 | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data Class 2, Mean: 0.3247 0.30282 0.33435 | 1.6235 1.5141 1.6717 | 3.247 3.0282 3.3435 | 4.8705 4.5423 5.0152
Complete data Class 2, Std Dev: 0.28388 0.272 0.29951 | 1.4194 1.36 1.4975 | 2.8388 2.72 2.9951 | 4.2582 4.0799 4.4926
Test Value: 0.2 0.2 0.2 | 1.5 1.5 1.5 | 3 3 3 | 4.5 4.5 4.5
Mean Imputed data Class 1, Mean: 0.3014 0.29114 0.32762 | 1.507 1.4557 1.6381 | 3.014 2.9114 3.2762 | 3.7382 4.0809 3.6809
Mean Imputed data Class 1, Std Dev: 0.26053 0.26533 0.25373 | 1.3026 1.3267 1.2686 | 2.6053 2.6533 2.5373 | 3.8788 3.9744 3.62
Mean Imputed data Class 1, % Mean Error: 3.24% 0.96% 2.56% | 3.24% 0.96% 2.56% | 3.24% 0.96% 2.56% | 3.24% 0.96% 2.56%
Mean Imputed data Class 2, Mean: 0.3124 0.29128 0.31772 | 1.562 1.4564 1.5886 | 3.124 2.9128 3.1772 | 4.6628 4.2808 4.7409
Mean Imputed data Class 2, Std Dev: 0.2579 0.25414 0.25699 | 1.2895 1.2707 1.2849 | 2.579 2.5414 2.5699 | 4.1036 3.7482 3.7074
Mean Imputed data Class 2, % Mean Error: 4.17% 3.81% 4.97% | 3.79% 3.81% 4.97% | 3.79% 3.81% 4.97% | 4.26% 5.76% 5.47%
Classification with complete data: C2 | C2 | C2 | C1
Classification with imputed data: C2 | C2 | C2 | C1
Null hypothesis: mean imputed distribution is equal to the complete case distribution
h: 0 0 0 | 0 0 0 | 0 0 0 | 1 1 1
Significance level: 0.769 0.4428 0.6646 | 0.769 0.4428 0.6646 | 0.769 0.4428 0.6646 | 0.0296 0.1201 0.09
Confidence interval: [-0.0411, Inf] [-0.0465, 0.0203] [-0.0415, 0.0265] | [-0.2057, Inf] [-0.2324, 0.1017] [-0.2073, 0.1323] | [-0.4660, 0.2118] [-0.4649, 0.2034] [-0.4146, 0.2645] | [-1.1698, -0.0611] [-1.004, 0.1161] [-0.9964, 0.0723]

Table 6.4 Missing data rate at 20% with Class Mean Imputation


METHOD: CLASS MEAN IMPUTATION
(Each row lists Variables 1-3 for σ = 1 | σ = 5 | σ = 10 | σ = 15.)

Complete data Class 1, Mean: 0.31452 0.30575 0.326 | 1.5726 1.5288 1.63 | 3.1452 3.0575 3.26 | 4.7178 4.5863 4.89
Complete data Class 1, Std Dev: 0.29142 0.29514 0.27915 | 1.4571 1.4757 1.3957 | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data Class 2, Mean: 0.3247 0.30282 0.33435 | 1.6235 1.5141 1.6717 | 3.247 3.0282 3.3435 | 4.8705 4.5423 5.0152
Complete data Class 2, Std Dev: 0.28388 0.272 0.29951 | 1.4194 1.36 1.4975 | 2.8388 2.72 2.9951 | 4.2582 4.0799 4.4926
Test Value: 0.2 0.2 0.2 | 1.5 1.5 1.5 | 3 3 3 | 4.5 4.5 4.5
Mean Imputed data Class 1, Mean: 0.30926 0.29102 0.31783 | 1.5463 1.4551 1.5892 | 3.0926 2.9102 3.1783 | 4.6389 4.3654 4.7675
Mean Imputed data Class 1, Std Dev: 0.21706 0.21749 0.21384 | 1.0853 1.0875 1.0692 | 2.1706 2.1749 2.1384 | 3.2558 3.2624 3.2076
Mean Imputed data Class 1, % Mean Error: 1.67% 4.82% 2.51% | 1.67% 4.82% 2.50% | 1.67% 4.82% 2.51% | 1.67% 4.82% 2.51%
Mean Imputed data Class 2, Mean: 0.33823 0.30274 0.3267 | 1.6911 1.5137 1.6335 | 3.3823 3.0274 3.267 | 5.0734 4.5411 4.9004
Mean Imputed data Class 2, Std Dev: 0.24127 0.21162 0.22037 | 1.2063 1.0581 1.1019 | 2.4127 2.1162 2.2037 | 3.619 3.1743 3.3056
Mean Imputed data Class 2, % Mean Error: 4.17% 0.03% 2.29% | 4.16% 0.03% 2.29% | 4.17% 0.03% 2.29% | 4.17% 0.03% 2.29%
Classification with complete data: C2 | C2 | C2 | C2
Classification with imputed data: C1 | C1 | C1 | C1
Null hypothesis: mean imputed distribution is equal to the complete case distribution
h: 0 0 0 | 0 0 0 | 0 0 0 | 0 0 0
Significance level: 0.8017 0.6416 0.6248 | 0.8017 0.6416 0.6248 | 0.8017 0.6416 0.6248 | 0.8017 0.6416 0.6248
Confidence interval: [-0.0282, 0.0364] [-0.0386, 0.0238] [-0.0396, 0.0238] | [-0.1408, 0.1821] [-0.1930, 0.1190] [-0.1982, 0.1191] | [-0.2816, 0.3642] [-0.3860, 0.2379] [-0.3964, 0.2382] | [-0.4224, 0.5464] [-0.5790, 0.3569] [-0.5946, 0.3573]

Table 6.5 Missing data rate at 40% with Class Mean Imputation


METHOD: CLASS MEAN IMPUTATION
(Each row lists Variables 1-3 for σ = 1 | σ = 5 | σ = 10 | σ = 15.)

Complete data Class 1, Mean: 0.31452 0.30575 0.326 | 1.5726 1.5288 1.63 | 3.1452 3.0575 3.26 | 4.7178 4.5863 4.89
Complete data Class 1, Std Dev: 0.29142 0.29514 0.27915 | 1.4571 1.4757 1.3957 | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data Class 2, Mean: 0.3247 0.30282 0.33435 | 1.6235 1.5141 1.6717 | 3.247 3.0282 3.3435 | 4.8705 4.5423 5.0152
Complete data Class 2, Std Dev: 0.28388 0.272 0.29951 | 1.4194 1.36 1.4975 | 2.8388 2.72 2.9951 | 4.2582 4.0799 4.4926
Test Value: 0.3 0.3 0.3 | 1.5 1.5 1.5 | 3 3 3 | 4.5 4.5 4.5
Mean Imputed data Class 1, Mean: 0.31305 0.29513 0.31466 | 1.5652 1.4756 1.5733 | 3.1305 2.9513 3.1466 | 4.6957 4.4269 4.7199
Mean Imputed data Class 1, Std Dev: 0.20159 0.20402 0.19295 | 1.008 1.0201 0.96475 | 2.0159 2.0402 1.9295 | 3.0239 3.0603 2.8942
Mean Imputed data Class 1, % Mean Error: 0.47% 3.47% 3.48% | 3.24% 0.96% 2.56% | 0.47% 3.47% 3.48% | 0.47% 3.48% 3.48%
Mean Imputed data Class 2, Mean: 0.33216 0.30608 0.33346 | 1.6608 1.5304 1.6673 | 3.3216 3.0608 3.3346 | 4.9824 4.5912 5.0019
Mean Imputed data Class 2, Std Dev: 0.2167 0.19987 0.21145 | 1.0835 0.99934 1.0572 | 2.167 1.9987 2.1145 | 3.2505 2.998 3.1717
Mean Imputed data Class 2, % Mean Error: 2.30% 1.08% 0.27% | 2.30% 1.08% 0.26% | 2.30% 1.08% 0.27% | 2.30% 1.08% 0.27%
Classification with complete data: C2 | C2 | C2 | C2
Classification with imputed data: C1 | C1 | C1 | C1
Null hypothesis: mean imputed distribution is equal to the complete case distribution
h: 0 0 0 | 0 0 0 | 0 0 0 | 0 0 0
Significance level: 0.8507 0.8131 0.6986 | 0.8507 0.8131 0.6986 | 0.8507 0.8131 0.6986 | 0.8507 0.8131 0.6986
Confidence interval: [-0.0282, 0.0342] [-0.0342, 0.0269] [-0.0371, 0.0249] | [-0.1410, 0.1710] [-0.1711, 0.1343] [-0.1855, 0.1243] | [-0.2821, 0.3420] [-0.3422, 0.2686] [-0.3710, 0.2487] | [-0.4231, 0.5129] [-0.5133, 0.4029] [-0.5565, 0.3730]

Table 6.6 Missing data rate at 50% with Class Mean Imputation

38

METHOD: LINEAR REGRESSION IMPUTATION
(columns: Variable 1 / Variable 2 / Variable 3, under σ=1 | σ=5 | σ=10 | σ=15)

Complete data, Class 1
  Mean        0.31452 0.30575 0.326   | 1.5726 1.5288 1.63    | 3.1452 3.0575 3.26   | 4.7178 4.5863 4.89
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4571 1.4757 1.3957  | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data, Class 2
  Mean        0.3247 0.30282 0.33435  | 1.6235 1.5141 1.6717  | 3.247 3.0282 3.3435  | 4.8705 4.5423 5.0152
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4194 1.36 1.4975    | 2.8388 2.72 2.9951   | 4.2582 4.0799 4.4926
Test Value    0.3 0.3 0.3             | 1.5 1.5 1.5           | 3 3 3                | 4.5 4.5 4.5
Imputed data, Class 1
  Mean        0.31791 0.30411 0.31442 | 1.5895 1.5204 1.5721  | 3.1791 3.0408 3.1443 | 4.7686 4.561 4.7164
  Std. Dev.   0.26725 0.26865 0.25623 | 1.3362 1.3433 1.2811  | 2.6724 2.6865 2.5622 | 4.0086 4.0298 3.8433
  Mean Error  1.08% 0.54% 3.55%       | 1.07% 0.55% 3.55%     | 1.08% 0.55% 3.55%    | 1.08% 0.55% 3.55%
Imputed data, Class 2
  Mean        0.33496 0.27327 0.31317 | 1.675 1.3665 1.566    | 3.3503 2.7332 3.1321 | 5.0256 4.1 4.6983
  Std. Dev.   0.26557 0.23173 0.25669 | 1.3278 1.1586 1.2834  | 2.6554 2.3172 2.5667 | 3.9831 3.4757 3.85
  Mean Error  3.16% 9.76% 6.33%       | 3.17% 9.75% 6.32%     | 3.18% 9.74% 6.32%    | 3.18% 9.74% 6.32%
Null hypothesis h    0 0 0 | 0 0 0 | 0 0 0 | 0 0 0
Significance level   0.2197 0.1336 0.1381 | 0.3265 0.2207 0.2117 | 0.4262 0.3092 0.2836 | 0.5139 0.3914 0.3493
Confidence interval  [-0.0553, 0.0127] [-0.0589, 0.0078] [-0.0606, 0.0084] | [-0.2547, 0.0848] [-0.2707, 0.0626] [-0.2815, 0.0625] | [-0.4766, 0.2015] [-0.5055, 0.1603] [-0.5311, 0.1557] | [-0.6772, 0.3390] [-0.7172, 0.2811] [-0.7600, 0.2690]
Result: Mean Imputed distribution is equal to Complete Case Distribution (h = 0 for all σ)

Table 6.7  Missing data rate at 20% with linear regression

METHOD: LINEAR REGRESSION IMPUTATION
(columns: Variable 1 / Variable 2 / Variable 3, under σ=1 | σ=5 | σ=10 | σ=15)

Complete data, Class 1
  Mean        0.31452 0.30575 0.326   | 1.5726 1.5288 1.63    | 3.1452 3.0575 3.26   | 4.7178 4.5863 4.89
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4571 1.4757 1.3957  | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data, Class 2
  Mean        0.3247 0.30282 0.33435  | 1.6235 1.5141 1.6717  | 3.247 3.0282 3.3435  | 4.8705 4.5423 5.0152
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4194 1.36 1.4975    | 2.8388 2.72 2.9951   | 4.2582 4.0799 4.4926
Test Value    0.3 0.3 0.3             | 1.5 1.5 1.5           | 3 3 3                | 4.5 4.5 4.5
Imputed data, Class 1
  Mean        0.28137 0.27279 0.28492 | 1.4069 1.3641 1.4249  | 2.8139 2.7283 2.85   | 4.221 4.0926 4.2753
  Std. Dev.   0.23815 0.24016 0.2327  | 1.1907 1.2007 1.1634  | 2.3814 2.4014 2.3266 | 3.5721 3.602 3.4898
  Mean Error  10.54% 10.78% 12.60%    | 10.54% 10.77% 12.58%  | 10.53% 10.77% 12.58% | 10.53% 10.76% 12.57%
Imputed data, Class 2
  Mean        0.26788 0.25818 0.27402 | 1.3395 1.2912 1.3701  | 2.6792 2.5827 2.7401 | 4.019 3.8744 4.11
  Std. Dev.   0.23355 0.22214 0.24164 | 1.1677 1.1105 1.2082  | 2.3353 2.221 2.4164  | 3.5028 3.3313 3.6246
  Mean Error  17.50% 14.74% 18.04%    | 17.49% 14.72% 18.04%  | 17.49% 14.71% 18.05% | 17.48% 14.70% 18.05%
Null hypothesis h    1 1 1 | 1 1 1 | 1 1 1 | 1 1 1
Significance level   5.37E-04 0.0049 1.96E-04 | 4.84E-04 0.0042 1.84E-04 | 4.47E-04 0.038 1.73E-04 | 4.21E-04 0.0035 1.66E-04
Confidence interval  [-0.0902, -0.0251] [-0.0784, -0.0140] [-0.0947, -0.0295] | [-0.4537, -0.1278] [-0.3964, -0.0743] [-0.4749, -0.1489] | [-0.9113, -0.2592] [-0.7988, -0.1542] [-0.9525, -0.3002] | [-1.3713, -0.3930] [-1.2048, -0.2376] [-1.4318, -0.4532]
Result: Null hypothesis rejected for all σ (h = 1)

Table 6.8  Missing data rate at 40% with linear regression

METHOD: LINEAR REGRESSION IMPUTATION
(columns: Variable 1 / Variable 2 / Variable 3, under σ=1 | σ=5 | σ=10 | σ=15)

Complete data, Class 1
  Mean        0.31452 0.30575 0.326   | 1.5726 1.5288 1.63    | 3.1452 3.0575 3.26   | 4.7178 4.5863 4.89
  Std. Dev.   0.29142 0.29514 0.27915 | 1.4571 1.4757 1.3957  | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data, Class 2
  Mean        0.3247 0.30282 0.33435  | 1.6235 1.5141 1.6717  | 3.247 3.0282 3.3435  | 4.8705 4.5423 5.0152
  Std. Dev.   0.28388 0.272 0.29951   | 1.4194 1.36 1.4975    | 2.8388 2.72 2.9951   | 4.2582 4.0799 4.4926
Test Value    0.2 0.2 0.2             | 1.5 1.5 1.5           | 3 3 3                | 4.5 4.5 4.5
Imputed data, Class 1
  Mean        0.25285 0.22079 0.25613 | 1.2645 1.1042 1.2808  | 2.5293 2.2087 2.5616 | 3.7941 3.3133 3.8425
  Std. Dev.   0.22358 0.22552 0.2115  | 1.1178 1.1275 1.0574  | 2.2354 2.2548 2.1147 | 3.353 3.3822 3.172
  Mean Error  19.61% 27.79% 21.43%    | 19.59% 27.77% 21.42%  | 19.58% 27.76% 21.42% | 19.58% 27.76% 21.42%
Imputed data, Class 2
  Mean        0.26167 0.20792 0.26347 | 1.3083 1.0396 1.3174  | 2.6167 2.0792 2.6348 | 3.925 3.1188 3.9523
  Std. Dev.   0.22997 0.19598 0.22639 | 1.1498 0.9799 1.1319  | 2.2996 1.9598 2.2638 | 3.4494 2.9397 3.3956
  Mean Error  19.41% 31.34% 21.20%    | 19.41% 31.34% 21.19%  | 19.41% 31.34% 21.20% | 19.41% 31.34% 21.19%
Null hypothesis h    1 1 1 | 1 1 1 | 1 1 1 | 1 1 1
Significance level   1.48E-04 1.69E-08 1.58E-05 | 1.49E-04 1.70E-08 1.59E-05 | 1.49E-04 1.71E-08 1.59E-05 | 1.50E-04 1.71E-08 1.59E-05
Confidence interval  [-0.0945, -0.0302] [-0.1210, -0.0589] [-0.1022, -0.0385] | [-0.4722, -0.1510] [-0.6046, -0.2944] [-0.5110, -0.1926] | [-0.9443, -0.3019] [-1.2091, -0.5887] [-1.0218, -0.3852] | [-1.4164, -0.4528] [-1.8135, -0.8829] [-1.5327, -0.5778]
Result: Null hypothesis rejected for all σ (h = 1)

Table 6.9  Missing data rate at 50% with linear regression

METHOD: REWEIGHTING
(columns: Variable 1 / Variable 2 / Variable 3, under σ=1 | σ=5 | σ=10 | σ=15)

Complete data, Class 1
  Mean        0.31452 0.30575 0.326   | 1.5726 1.5288 1.63    | 3.1452 3.0575 3.26   | 4.7178 4.5863 4.89
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4571 1.4757 1.3957  | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data, Class 2
  Mean        0.3247 0.30282 0.33435  | 1.6235 1.5141 1.6717  | 3.247 3.0282 3.3435  | 4.8705 4.5423 5.0152
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4194 1.36 1.4975    | 2.8388 2.72 2.9951   | 4.2582 4.0799 4.4926
Test Value    0.3 0.3 0.3             | 1.5 1.5 1.5           | 3 3 3                | 4.5 4.5 4.5
Reweighted data, Class 1
  Mean        0.42829 0.44765 0.41436 | 1.6172 1.6647 1.6446  | 3.2396 3.2633 3.2563 | 4.8345 4.8308 4.901
  Std. Dev.   0.32514 0.33841 0.31359 | 1.4334 1.4346 1.3809  | 2.8573 2.9101 2.7906 | 4.3091 4.4244 4.1683
  Mean Error  36.17% 46.41% 27.10%    | 2.84% 8.89% 0.90%     | 3.00% 6.73% 0.11%    | 2.47% 5.33% 0.22%
Reweighted data, Class 2
  Mean        0.43954 0.47639 0.43657 | 1.6747 1.4356 1.7143  | 3.3064 2.7783 3.4028 | 4.9118 4.1596 5.0878
  Std. Dev.   0.34593 0.32609 0.34044 | 1.3791 1.2722 1.4965  | 2.7812 2.6048 3.0112 | 4.2176 3.9131 4.5317
  Mean Error  35.37% 57.32% 30.57%    | 3.15% 5.18% 2.55%     | 1.83% 8.25% 1.77%    | 0.85% 8.43% 1.45%
Null hypothesis h    1 1 1 | 0 0 0 | 0 0 0 | 0 0 0
Significance level   4.87E-08 4.11E-04 4.14E-06 | 0.6158 0.7853 0.7679 | 0.6875 0.9068 0.8861 | 0.7837 0.8073 0.886
Confidence interval  [0.0735, 0.1551] [0.1174, 0.1980] [0.00549, 0.1357] | [-0.1394, 0.2353] [-0.1545, 0.2120] [-0.1614, 0.2185] | [-0.2981, 0.4519] [-0.3915, 0.3474] [-0.3534, 0.4091] | [-0.4855, 0.6435] [-0.6251, 0.4868] [-0.5301, 0.6136]
Result: Reweighted distribution is equal to Complete Case Distribution for σ = 5, 10, 15 (h = 0); null hypothesis rejected at σ = 1

Table 6.10  Missing data rate at 20% with Reweighting

METHOD: REWEIGHTING
(columns: Variable 1 / Variable 2 / Variable 3, under σ=1 | σ=5 | σ=10 | σ=15)

Complete data, Class 1
  Mean        0.31452 0.30575 0.326   | 1.5726 1.5288 1.63    | 3.1452 3.0575 3.26   | 4.7178 4.5863 4.89
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4571 1.4757 1.3957  | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data, Class 2
  Mean        0.3247 0.30282 0.33435  | 1.6235 1.5141 1.6717  | 3.247 3.0282 3.3435  | 4.8705 4.5423 5.0152
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4194 1.36 1.4975    | 2.8388 2.72 2.9951   | 4.2582 4.0799 4.4926
Test Value    0.3 0.3 0.3             | 1.5 1.5 1.5           | 3 3 3                | 4.5 4.5 4.5
Reweighted data, Class 1
  Mean        0.4236 0.4594 0.41767   | 1.686 1.5455 1.5342   | 3.3768 2.9717 3.0338 | 5.0465 4.3671 4.6049
  Std. Dev.   0.32952 0.33992 0.33667 | 1.444 1.373 1.3551    | 2.8805 2.8121 2.7438 | 5.0465 4.3671 4.6049
  Mean Error  34.68% 50.25% 28.12%    | 7.21% 1.09% 5.88%     | 7.36% 2.81% 6.94%    | 6.97% 4.78% 5.83%
Reweighted data, Class 2
  Mean        0.42237 0.45395 0.4579  | 1.8174 1.5074 1.6398  | 3.5906 2.927 3.2783  | 5.3849 4.4126 4.8523
  Std. Dev.   0.34326 0.3234 0.33321  | 1.4243 1.3473 1.4213  | 2.8745 2.7559 2.8359 | 4.309 4.113 4.3168
  Mean Error  30.08% 49.91% 36.95%    | 11.94% 0.44% 1.91%    | 10.58% 3.34% 1.95%   | 10.56% 2.86% 3.25%
Null hypothesis h    1 1 1 | 0 0 0 | 0 0 0 | 0 0 0
Significance level   1.63E-04 3.80E-14 5.56E-07 | 0.4222 0.7521 0.985 | 0.1711 0.6494 0.4847 | 0.1813 0.5727 0.4379
Confidence interval  [0.0403, 0.1269] [0.1275, 0.2147] [0.0688, 0.1563] | [-0.1223, 0.2918] [-0.1700, 0.2352] [-0.2042, 0.2082] | [-0.1245, 0.6996] [-0.4970, 0.3100] [-0.5548, 0.2634] | [-0.1970, 1.0401] [-0.7811, 0.4323] [-0.8379, 0.3898]
Result: Reweighted distribution is equal to Complete Case Distribution for σ = 5, 10, 15 (h = 0); null hypothesis rejected at σ = 1

Table 6.11  Missing data rate at 40% with Reweighting

METHOD: REWEIGHTING
(columns: Variable 1 / Variable 2 / Variable 3, under σ=1 | σ=5 | σ=10 | σ=15)

Complete data, Class 1
  Mean        0.31452 0.30575 0.326   | 1.5726 1.5288 1.63    | 3.1452 3.0575 3.26   | 4.7178 4.5863 4.89
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4571 1.4757 1.3957  | 2.9142 2.9514 2.7915 | 4.3714 4.4271 4.1872
Complete data, Class 2
  Mean        0.3247 0.30282 0.33435  | 1.6235 1.5141 1.6717  | 3.247 3.0282 3.3435  | 4.8705 4.5423 5.0152
  Std. Dev.   0.2699 0.26247 0.25607  | 1.4194 1.36 1.4975    | 2.8388 2.72 2.9951   | 4.2582 4.0799 4.4926
Test Value    0.3 0.3 0.3             | 1.5 1.5 1.5           | 3 3 3                | 4.5 4.5 4.5
Reweighted data, Class 1
  Mean        0.42829 0.44765 0.41436 | 1.7654 1.6111 1.6628  | 3.556 3.0867 3.2416  | 5.3109 4.5696 4.8451
  Std. Dev.   0.32514 0.33841 0.31359 | 1.4388 1.4413 1.3989  | 2.8492 2.9815 2.8582 | 4.2982 4.5254 4.3025
  Mean Error  36.17% 46.41% 27.10%    | 12.26% 5.38% 2.01%    | 13.06% 0.96% 0.56%   | 12.57% 0.36% 0.92%
Reweighted data, Class 2
  Mean        0.43954 0.47639 0.43657 | 1.6502 1.4824 1.4776  | 3.2698 2.9157 2.9521 | 4.8961 4.3977 4.4346
  Std. Dev.   0.34593 0.32609 0.34044 | 1.3909 1.3541 1.3422  | 2.8057 2.7458 2.6852 | 4.2155 4.0979 4.0205
  Mean Error  35.37% 57.32% 30.57%    | 1.64% 2.09% 11.61%    | 0.70% 3.72% 11.71%   | 0.53% 3.18% 11.58%
Null hypothesis h    1 1 1 | 0 0 0 | 0 0 0 | 0 0 0
Significance level   1.14E-04 7.57E-11 4.06E-04 | 0.3218 0.8168 0.464 | 0.3278 0.85 0.3541 | 0.3525 0.8076 0.3458
Confidence interval  [0.0447, 0.1362] [0.1076, 0.1986] [0.0741, 0.1655] | [-0.1076, 0.3271] [-0.1892, 0.2399] [-0.2969, 0.1355] | [-0.2179, 0.6514] [-0.4741, 0.3908] [-0.6387, 0.2289] | [-0.3434, 0.9620] [-0.7302, 0.5689] [-0.9637, 0.3381]
Result: Reweighted distribution is equal to Complete Case Distribution for σ = 5, 10, 15 (h = 0); null hypothesis rejected at σ = 1

Table 6.12  Missing data rate at 50% with Reweighting

METHOD COMPARISON
(columns: σ = 1 / 5 / 10 / 15 for each method: Overall Mean Imputation | Class Mean Imputation | Linear Regression | Reweighting)

Misclassification (Y = yes, N = no)
  20% missing   N N N N | N N N N | N N N N | N N N N
  40% missing   Y Y Y Y | Y Y Y Y | Y N N N | N N N N
  50% missing   Y Y Y Y | Y Y Y Y | Y N N N | N N N N

Table 6.13  Methods Comparison

7 CONCLUSIONS

In this project, various techniques for treating missing data have been studied in
detail, and several of them have been used in the evaluation. The goal is to find the most
reliable method for a given set of cases. Beforehand, the mechanism of missingness plays
an important role in selecting the method to be used. Three known mechanisms have been
discussed: Missing Completely At Random (MCAR), Missing At Random (MAR) and
Missing Not At Random (MNAR). Knowing their properties indicates which methods are
suitable for a given problem.

Deletion is clearly the least favorable method, as it can lead to biased estimates.
However, it may still be acceptable when enough complete cases remain to represent the
sampled population. Deletion is the default in many software packages because of its
simplicity and fast computation.
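The listwise-deletion behaviour described above can be sketched in a few lines. This is an illustrative Python/NumPy sketch rather than the Matlab of the appendices, and the data matrix `X` is made up for the example:

```python
import numpy as np

# Hypothetical 5x3 data matrix; NaN marks a missing cell.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 2.0, 1.0],
              [5.0, 5.0, 5.0]])

# Listwise (case) deletion: keep only rows with no missing values.
complete = X[~np.isnan(X).any(axis=1)]
print(complete.shape[0])  # 3 complete cases remain out of 5
```

Two of five cases are discarded here even though each is only missing one of three values, which is exactly the information loss the text warns about.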

Imputation methods should be chosen when the remaining complete cases are not
sufficient to support our decisions. In contrast to deletion, imputation predicts values to
fill the missing cells, so that sufficient data remain to support the analysis. With the
variety of imputation methods available, one can select the most appropriate depending
on the missingness mechanism. Some imputation methods, such as hot deck imputation,
appear to be reliable, but again this depends on the missingness mechanism.
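The contrast between overall mean imputation and class mean imputation, the two simplest methods evaluated in this project, can be sketched as follows (a Python sketch with invented data; the Matlab versions are in Appendices A and B):

```python
import numpy as np

# Hypothetical data: one variable with NaN for missing, and class labels 0/1.
x = np.array([1.0, np.nan, 3.0, 10.0, np.nan, 12.0])
cls = np.array([0, 0, 0, 1, 1, 1])

# Overall mean imputation: fill every gap with the grand mean of the observed values.
overall = x.copy()
overall[np.isnan(overall)] = np.nanmean(x)

# Class mean imputation: fill each gap with the mean of its own class only.
classwise = x.copy()
for c in (0, 1):
    mask = (cls == c) & np.isnan(classwise)
    classwise[mask] = np.nanmean(x[cls == c])

print(overall[1], classwise[1])  # 6.5 (grand mean) vs 2.0 (class-0 mean)
```

The class-conditional fill stays much closer to the class distribution, which is why Tables 6.6 to 6.13 treat the two as distinct methods.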

With multiple imputation, confidence in the predicted values is higher, and
misleading conclusions are therefore less likely. However, multiple imputation adds
complexity to the analysis; if time is a constraint, it should not be the method of choice.
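The idea of multiple imputation can be illustrated with a minimal sketch: each gap is filled several times with a random draw, and the per-imputation estimates are pooled. This Python sketch uses a deliberately crude imputation model (observed mean plus noise) purely for illustration; it is not the full Rubin procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([2.0, np.nan, 4.0, np.nan, 6.0])
obs = x[~np.isnan(x)]

# Draw M completed data sets: fill each gap with the observed mean
# plus noise scaled by the observed standard deviation.
M = 20
estimates = []
for _ in range(M):
    filled = x.copy()
    filled[np.isnan(filled)] = obs.mean() + rng.normal(0.0, obs.std(), size=2)
    estimates.append(filled.mean())

# Pool by averaging the per-imputation estimates; the spread of
# `estimates` reflects the extra uncertainty due to imputation.
pooled = float(np.mean(estimates))
print(pooled)
```

Running the imputation M times is what makes the method more trustworthy, and also what makes it slower, matching the trade-off stated above.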

Reweighting uses the observed data for making inferences. The missing values are
taken into account by assigning weights to the observed data; they are not simply
ignored, but are instead represented by the observed cases. Combining reweighting with
regression is not difficult in practice, since plenty of prediction software is available.
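The core reweighting idea can be illustrated with a minimal inverse-response-rate sketch (Python; the data and the uniform weighting scheme are invented for the example, and are simpler than the binned weights used in Appendix D):

```python
import numpy as np

# Hypothetical sample: 10 cases, 4 of them missing (NaN).
x = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, np.nan, np.nan])
observed = x[~np.isnan(x)]

# Response rate: 6 of the 10 cases were observed.
response_rate = observed.size / x.size

# Give every observed case the inverse of the response rate as its weight,
# so the 6 respondents also stand in for the 4 nonrespondents.
w = np.full(observed.size, 1.0 / response_rate)

# The weighted total estimates the total over all 10 cases.
weighted_total = float((w * observed).sum())
print(weighted_total)  # 27 * (10/6) = 45.0
```

No value is fabricated for the missing cells; the observed cases are simply up-weighted, which is the property the paragraph above emphasizes.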




APPENDIX A – OVERALL MEAN IMPUTATION

rand('seed',1);
x=rand(500,3)*sqrt(10)+5;               % complete random data
[m,n]=size(x);
d=(abs(x-(5*ones(m,n)))).^2;
mu=mean(d)                              % mean of the set of data
Mu=(mu'*ones(1,m))';                    % vector for mu
sigma=cov(d);                           % covariance for the set of data

% CLASS 1
%==============================================
c1=d(1:m/2,:);                          % data for class 1
muc1=mean(c1)                           % mean for class 1
sigmac1=cov(c1);                        % sigma for c1
Muc1=(muc1'*ones(1,m/2))';
varc1=std(c1);
prc1=0.5;                               % prior probability of class 1

% CLASS 2
%=====================================================
c2=d((m/2)+1:m,:);                      % data for class 2
muc2=mean(c2)                           % mean for class 2
sigmac2=cov(c2);                        % sigma for c2
Muc2=(muc2'*ones(1,m/2))';
varc2=std(c2);
prc2=1-prc1;                            % prior probability of class 2

%==========================================================================
% CREATING MISSING DATA
mis=250;
idx1=unique(floor(rand(mis,1)*499+1));  % missing data for variable 1
idx1=idx1(randperm(length(idx1)));
idx1=idx1(1:100);
idx2=unique(floor(rand(mis,1)*499+1));  % missing data for variable 2
idx2=idx2(randperm(length(idx2)));
idx2=idx2(1:100);
idx3=unique(floor(rand(mis,1)*499+1));  % missing data for variable 3
idx3=idx3(randperm(length(idx3)));
idx3=idx3(1:100);
dm=d;
dm(idx1,1)=0;
dm(idx2,2)=0;
dm(idx3,3)=0;
dmiss=dm;

%======================================================================
% OVERALL MEAN IMPUTATION
mudmiss=mean(dmiss);
dmiss(idx1,1)=mudmiss(1);
dmiss(idx2,2)=mudmiss(2);
dmiss(idx3,3)=mudmiss(3);
doverall_mean=dmiss;
mu_doverall_mean=mean(doverall_mean);
stdv_doverall_mean=std(doverall_mean);
sigma_doverall_mean=cov(doverall_mean);

%%% CLASS 1
c1_doverall_mean=doverall_mean(1:(m/2),:);
muc1_doverall_mean=mean(c1_doverall_mean);
var_c1_doverall_mean=std(c1_doverall_mean);
sigmac1_doverall_mean=cov(c1_doverall_mean);

%%% CLASS 2
c2_doverall_mean=doverall_mean((m/2)+1:m,:);
muc2_doverall_mean=mean(c2_doverall_mean);
var_c2_doverall_mean=std(c2_doverall_mean);
sigmac2_doverall_mean=cov(c2_doverall_mean);

% T TEST
% Hypothesis testing
% Null Hypothesis : There is no difference with complete case result
% h=0; cannot reject the null hypothesis at alpha level of significance
% h=1; can reject null hypothesis at significance level
[h1,sig1,ci1]=ttest2(doverall_mean(:,1),d(:,1),0.05)
[h2,sig2,ci2]=ttest2(doverall_mean(:,2),d(:,2),0.05)
[h3,sig3,ci3]=ttest2(doverall_mean(:,3),d(:,3),0.05)

% TEST
test=input(' enter test values ');
Acc1=(-0.5.*(test)*inv(sigmac1)*(test)');
Bcc1=test*inv(sigmac1)*muc1';
Ccc1=-0.5*muc1*inv(sigmac1)*muc1';
Dcc1=log(det(sigmac1));
g1=Acc1+Bcc1+Ccc1-Dcc1+log(prc1);
Acc2=(-0.5.*(test)*inv(sigmac2)*(test)');
Bcc2=test*inv(sigmac2)*muc2';
Ccc2=-0.5*muc2*inv(sigmac2)*muc2';
Dcc2=log(det(sigmac2));
g2=Acc2+Bcc2+Ccc2-Dcc2+log(prc2);
res=g1-g2;
if res>0
    disp('complete case = c1')
else
    disp('complete case = c2')
end

Ac1=(-0.5.*(test)*inv(sigmac1_doverall_mean)*(test)');
Bc1=test*inv(sigmac1_doverall_mean)*muc1_doverall_mean';
Cc1=-0.5*muc1_doverall_mean*inv(sigmac1_doverall_mean)*muc1_doverall_mean';
Dc1=log(det(sigmac1_doverall_mean));
g1mdt=Ac1+Bc1+Cc1-Dc1+log(prc1);
Ac2=(-0.5.*(test)*inv(sigmac2_doverall_mean)*(test)');
Bc2=test*inv(sigmac2_doverall_mean)*muc2_doverall_mean';
Cc2=-0.5*muc2_doverall_mean*inv(sigmac2_doverall_mean)*muc2_doverall_mean';
Dc2=log(det(sigmac2_doverall_mean));
g2mdt=Ac2+Bc2+Cc2-Dc2+log(prc2);
resmdt=g1mdt-g2mdt;
if resmdt>0
    disp('overall mean imputation = c1')
else
    disp('overall mean imputation = c2')
end

APPENDIX B – CLASS MEAN IMPUTATION

rand('seed',1);
x=rand(400,3)*sqrt(15)+5;               % complete random data
xreg=rand(500,3)+sqrt(20)+5;
[m,n]=size(x);
d=(abs(x-(5*ones(m,n)))).^2;
mu=mean(d)                              % mean of the set of data
Mu=(mu'*ones(1,m))';                    % vector for mu
sigma=cov(d);                           % covariance for the set of data

% CLASS 1
%==============================================
c1=d(1:m/2,:);                          % data for class 1
muc1=mean(c1)                           % mean for class 1
sigmac1=cov(c1);                        % sigma for c1
Muc1=(muc1'*ones(1,m/2))';
varc1=std(c1);
pc1=exp(-0.5.*(c1-Muc1)*inv(sigmac1).*(c1-Muc1))/(2*pi)^(0.5*m/2)*(det(sigmac1)^0.5);  % p(x|c1)
prc1=0.5;                               % prior probability of class 1

% CLASS 2
%=====================================================
c2=d((m/2)+1:m,:);                      % data for class 2
muc2=mean(c2)                           % mean for class 2
sigmac2=cov(c2);                        % sigma for c2
Muc2=(muc2'*ones(1,m/2))';
varc2=std(c2);
pc2=exp(-0.5.*(c2-Muc2)*inv(sigmac2).*(c2-Muc2))/(2*pi)^(0.5*m/2)*(det(sigmac2)^0.5);  % p(x|c2)
prc2=1-prc1;                            % prior probability of class 2

%==========================================================================
% CREATING MISSING DATA
mis=200;
idx1=unique(floor(rand(mis,1)*(m-1)+1));  % missing data for variable 1
idx1=idx1(randperm(length(idx1)));
idx1=sort(idx1(1:100));
idx2=unique(floor(rand(mis,1)*(m-1)+1));  % missing data for variable 2
idx2=idx2(randperm(length(idx2)));
idx2=sort(idx2(1:100));
idx3=unique(floor(rand(mis,1)*(m-1)+1));  % missing data for variable 3
idx3=idx3(randperm(length(idx3)));
idx3=sort(idx3(1:100));
dm=d;
dm(idx1,1)=0;
dm(idx2,2)=0;
dm(idx3,3)=0;
dmiss=dm;

%======================================================================
% CLASS MEAN IMPUTATION
%%% CLASS 1
c1_dmiss_class_imp=dmiss(1:m/2,:);
c1idx1=max(find(idx1<=m/2));            % number of missing cases in class 1, var 1
c1idx2=max(find(idx2<=m/2));            % number of missing cases in class 1, var 2
c1idx3=max(find(idx3<=m/2));            % number of missing cases in class 1, var 3
idx11=idx1(1:c1idx1);
idx12=idx2(1:c1idx2);
idx13=idx3(1:c1idx3);
c1_dmiss_class_imp(idx11,1)=muc1(1);    % class 1 mean imputation, var 1
c1_dmiss_class_imp(idx12,2)=muc1(2);    % class 1 mean imputation, var 2
c1_dmiss_class_imp(idx13,3)=muc1(3);    % class 1 mean imputation, var 3
muc1_dmiss_class_imp=mean(c1_dmiss_class_imp);
varc1_dmiss_class_imp=std(c1_dmiss_class_imp);
sigmac1_dmiss_class_imp=cov(c1_dmiss_class_imp);

%%% CLASS 2
c2_dmiss_class_imp=dmiss((m/2)+1:m,1:n);
idx21=idx1(c1idx1+1:100);
idx22=idx2(c1idx2+1:100);
idx23=idx3(c1idx3+1:100);
dmiss(idx21,1)=muc2(1);                 % class 2 mean imputation, var 1
dmiss(idx22,2)=muc2(2);                 % class 2 mean imputation, var 2
dmiss(idx23,3)=muc2(3);                 % class 2 mean imputation, var 3
c2_dmiss_class_imp=dmiss((m/2)+1:m,1:n);
muc2_dmiss_class_imp=mean(c2_dmiss_class_imp);
var_c2_dmiss_class_imp=std(c2_dmiss_class_imp);
sigmac2_dmiss_class_imp=cov(c2_dmiss_class_imp);

% WHOLE DATA AFTER CLASS MEAN IMPUTATION
dmiss_class_imp=[c1_dmiss_class_imp; c2_dmiss_class_imp];

% T TEST
% Hypothesis testing
% Null Hypothesis : There is no difference with complete case result
% h=0; cannot reject the null hypothesis at alpha level of significance
% h=1; can reject null hypothesis at significance level
[h1,sig1,ci1]=ttest2(dmiss_class_imp(:,1),d(:,1),0.01)
[h2,sig2,ci2]=ttest2(dmiss_class_imp(:,2),d(:,2),0.01)
[h3,sig3,ci3]=ttest2(dmiss_class_imp(:,3),d(:,3),0.01)

% TEST
%=====================================
test=input(' enter test values ');
% g1=ln(p(x|c1))+ln(P(c1))=ln(ptc1)+ln(prc1)
% p(x|c1)=exp(-0.5(test-Muc1)*inv(Muc1)*(test-Muc1))/(2*pi^(0.5*m/2)*(det(sigmac1^0.5)))
Acc1=(-0.5.*(test)*inv(sigmac1)*(test)');
Bcc1=test*inv(sigmac1)*muc1';
Ccc1=-0.5*muc1*inv(sigmac1)*muc1';
Dcc1=log(det(sigmac1));
g1=Acc1+Bcc1+Ccc1-Dcc1+log(prc1);
Acc2=(-0.5.*(test)*inv(sigmac2)*(test)');
Bcc2=test*inv(sigmac2)*muc2';
Ccc2=-0.5*muc2*inv(sigmac2)*muc2';
Dcc2=log(det(sigmac2));
g2=Acc2+Bcc2+Ccc2-Dcc2+log(prc2);
res=g1-g2;
if res>0
    disp('complete case = c1')
else
    disp('complete case = c2')
end
%=====================================
Ac1=(-0.5.*(test)*inv(sigmac1_dmiss_class_imp)*(test'));
Bc1=test*inv(sigmac1_dmiss_class_imp)*muc1_dmiss_class_imp';
Cc1=-0.5*muc1_dmiss_class_imp*inv(sigmac1_dmiss_class_imp)*muc1_dmiss_class_imp';
Dc1=log(det(sigmac1_dmiss_class_imp));
g1reg=Ac1+Bc1+Cc1-Dc1+log(prc1);
Ac2=(-0.5.*(test)*inv(sigmac2_dmiss_class_imp)*(test'));
Bc2=test*inv(sigmac2_dmiss_class_imp)*muc2_dmiss_class_imp';
Cc2=-0.5*muc2_dmiss_class_imp*inv(sigmac2_dmiss_class_imp)*muc2_dmiss_class_imp';
Dc2=log(det(sigmac2_dmiss_class_imp));
g2reg=Ac2+Bc2+Cc2-Dc2+log(prc2);
resreg=g1reg-g2reg;
if resreg>0
    disp('class mean imputation = c1')
else
    disp('class mean imputation = c2')
end
%=============================================

APPENDIX C – REGRESSION IMPUTATION

rand('seed',1);
x=rand(500,3)*sqrt(1)+5;                % complete random data
xaux=rand(400,3)*sqrt(15)+5;
[m,n]=size(x);
d=(abs(x-(5*ones(m,n)))).^2;
mu=mean(d)                              % mean of the set of data
Mu=(mu'*ones(1,m))';                    % vector for mu
sigma=cov(d);                           % covariance for the set of data
var=std(d);

% CLASS 1
%==============================================
c1=d(1:m/2,:);                          % data for class 1
muc1=mean(c1)                           % mean for class 1
sigmac1=cov(c1);                        % sigma for c1
varc1=std(c1);
Muc1=(muc1'*ones(1,m/2))';
pc1=exp(-0.5.*(c1-Muc1)*inv(sigmac1).*(c1-Muc1))/(2*pi)^(0.5*m/2)*(det(sigmac1)^0.5);  % p(x|c1)
prc1=0.5;                               % prior probability of class 1

% CLASS 2
%=====================================================
c2=d((m/2)+1:m,:);                      % data for class 2
muc2=mean(c2)                           % mean for class 2
sigmac2=cov(c2);                        % sigma for c2
varc2=std(c2);
Muc2=(muc2'*ones(1,m/2))';
pc2=exp(-0.5.*(c2-Muc2)*inv(sigmac2).*(c2-Muc2))/(2*pi)^(0.5*m/2)*(det(sigmac2)^0.5);  % p(x|c2)
prc2=1-prc1;                            % prior probability of class 2

%==========================================================================
% INCOMPLETE DATA
mis=200;
%idx=floor(rand(50,3)*499+1.5);         % case missing
idx1=unique(floor(rand(mis,5)*499+1));  % missing data for variable 1
idx1=idx1(randperm(length(idx1)));
idx1=idx1(1:100);
idx2=unique(floor(rand(mis,1)*499+1));  % missing data for variable 2
idx2=idx2(randperm(length(idx2)));
idx2=idx2(1:100);
idx3=unique(floor(rand(mis,1)*499+1));  % missing data for variable 3
idx3=idx3(randperm(length(idx3)));
idx3=idx3(1:100);
dm=d;
dm(idx1,1)=0;                           % inserting missing data into variable 1
dm(idx2,2)=0;                           % inserting missing data into variable 2
dm(idx3,3)=0;                           % inserting missing data into variable 3

%================================================================
% PREPROCESS DATA
dmm=(dm(dm>0));
p=length(dmm);
dmm1=dmm(1:p/3);
dmm2=dmm((p/3)+1:2*p/3);
dmm3=dmm((2*p/3)+1:p);
dmm=[dmm1 dmm2 dmm3];
sigmam=cov(dmm);                        % new covariance with missing data
varm=std(dmm);                          % standard deviation with missing data
mum=mean(dmm);                          % new mean with missing data

% CLASS 1 WITH MISSING DATA
c1m=dm(1:m/2,:);                        % data for class 1
muc1m=mean(c1m)                         % new mean for class 1
sigmac1m=cov(c1m);                      % sigma for c1
varc1=std(c1m);                         % standard deviation for c1m
Muc1m=(muc1m'*ones(1,m/2))';
prc1m=prc1;                             % prior probability of class 1

% CLASS 2 WITH MISSING DATA
c2m=dm((m/2)+1:m,:);                    % data for class 2
muc2m=mean(c2m)                         % new mean for class 2
sigmac2m=cov(c2m);                      % sigma for c2
varc2m=std(c2m);
Muc2m=(muc2m'*ones(1,m/2))';
prc2m=1-prc1m;                          % prior probability of class 2

% REGRESSION
% Y=bX
bd1=regress(dmm1,xaux);
bd2=regress(dmm2,xaux);
bd3=regress(dmm3,xaux);
dreg1=x*bd1;
dreg2=x*bd2;
dreg3=x*bd3;
dreg=[dreg1 dreg2 dreg3];
dm(idx1,1)=dreg1(idx1);
dm(idx2,2)=dreg2(idx2);
dm(idx3,3)=dreg3(idx3);
dnew=dm;
mudnew=mean(dnew);
sigmadnew=cov(dnew);
c1reg=dnew(1:m/2,:);
muc1reg=mean(c1reg);
sigmac1reg=cov(c1reg);
varc1reg=std(c1reg);
c2reg=dnew((m/2)+1:m,:);
muc2reg=mean(c2reg);
sigmac2reg=cov(c2reg);
varc2reg=std(c2reg);

% T TEST
% Hypothesis testing
% h=0; cannot reject the null hypothesis at alpha level of significance
% h=1; can reject null hypothesis at significance level
[h1,sig1,ci1]=ttest2(dnew(:,1),d(:,1),0.05)
[h2,sig2,ci2]=ttest2(dnew(:,2),d(:,2),0.05)
[h3,sig3,ci3]=ttest2(dnew(:,3),d(:,3),0.05)

% TEST
test=input(' enter test values ');
% g1=ln(p(x|c1))+ln(P(c1))=ln(ptc1)+ln(prc1)
% p(x|c1)=exp(-0.5(test-Muc1)*inv(Muc1)*(test-Muc1))/(2*pi^(0.5*m/2)*(det(sigmac1^0.5)))
Acc1=(-0.5.*(test)*inv(sigmac1)*(test)');
Bcc1=test*inv(sigmac1)*muc1';
Ccc1=-0.5*muc1*inv(sigmac1)*muc1';
Dcc1=log(det(sigmac1));
g1=Acc1+Bcc1+Ccc1-Dcc1+log(prc1);
Acc2=(-0.5.*(test)*inv(sigmac2)*(test)');
Bcc2=test*inv(sigmac2)*muc2';
Ccc2=-0.5*muc2*inv(sigmac2)*muc2';
Dcc2=log(det(sigmac2));
g2=Acc2+Bcc2+Ccc2-Dcc2+log(prc2);
res=g1-g2;
if res>0
    disp('complete case = c1')
else
    disp('complete case = c2')
end
%=====================================
Ac1=(-0.5.*(test)*inv(sigmac1reg)*(test)');
Bc1=test*inv(sigmac1reg)*muc1reg';
Cc1=-0.5*muc1reg*inv(sigmac1reg)*muc1reg';
Dc1=log(det(sigmac1reg));
g1reg=Ac1+Bc1+Cc1-Dc1+log(prc1m);
Ac2=(-0.5.*(test)*inv(sigmac2reg)*(test)');
Bc2=test*inv(sigmac2reg)*muc2reg';
Cc2=-0.5*muc2reg*inv(sigmac2reg)*muc2reg';
Dc2=log(det(sigmac2reg));
g2reg=Ac2+Bc2+Cc2-Dc2+log(prc2m);
resreg=g1reg-g2reg;
if resreg>0
    disp('regression = c1')
else
    disp('regression = c2')
end

APPENDIX D – REWEIGHTING rand('seed',1); x=rand(500,3)*sqrt(1)+5; % complete random data xreg=rand(400,3)*sqrt(1)+5; [m,n]=size(x); d=(abs(x-(5*ones(m,n)))).^2; rd=x-5*ones(m,n); mu=mean(d) % mean of the set of data Mu=(mu'*ones(1,m))'; % vector for mu sigma=cov(d); % covariance for the set of data var=std(d); % CLASS 1 %============================================== c1=d(1:m/2,:); % data for class 1 muc1=mean(c1) % mean for class 1 sigmac1=cov(c1); % sigma for c1 varc1=std(c1); Muc1=(muc1'*ones(1,m/2))'; pc1=exp(-0.5.*(c1-Muc1)*inv(sigmac1).*(c1-Muc1))/(2*pi)^(0.5*m/2)*(det(sigmac1)^0.5);% p(x|c1 ) prc1=0.5; % prior probability of class 1 % CLASS 2 %================================================== === c2=d((m/2)+1:m,:); % data for class 2 muc2=mean(c2) % mean for class 2 sigmac2=cov(c2); % sigma for c2 varc2=std(c2); Muc2=(muc2'*ones(1,m/2))'; pc2=exp(-0.5.*(c2-Muc2)*inv(sigmac2).*(c2-Muc2))/(2*pi)^(0.5*m/2)*(det(sigmac2)^0.5);% p(x|c2 ) prc2=1-prc1; % prior probability of class 2 %================================================== ======================== %INCOMPLETE DATA mis=200; %idx=floor(rand(50,3)*499+1.5); % case missing idx1=unique(floor(rand(mis,5)*499+1)); % missing da ta for variable 1 idx1=idx1(randperm(length(idx1))); idx11=idx1(1:50); idx12=idx1(51:100); idx2=unique(floor(rand(mis,1)*499+1)); % missing da ta for variable 2 idx2=idx2(randperm(length(idx2))); idx21=idx2(1:50); idx22=idx2(51:100);


idx3=unique(floor(rand(mis,1)*499+1)); % missing data for variable 3
idx3=idx3(randperm(length(idx3)));
idx3=idx3(1:100);
idx31=idx3(1:50);
idx32=idx3(51:100);

dm=d;
dm(idx11,1)=0; % inserting missing data into variable 1
dm(idx12,1)=0; % inserting missing data into variable 1
dm(idx21,2)=0; % inserting missing data into variable 2
dm(idx22,2)=0; % inserting missing data into variable 2
dm(idx31,3)=0; % inserting missing data into variable 3
dm(idx32,3)=0; % inserting missing data into variable 3
%================================================================
% PREPROCESS DATA
dm1=dm(1:m,1);
dm2=dm(1:m,2);
dm3=dm(1:m,3);
dm1=dm1(dm1>0);   % keep only observed entries
dm2=dm2(dm2>0);
dm3=dm3(dm3>0);
dmm=[dm1 dm2 dm3];
rddmm=sqrt(dmm);
%========================
% REWEIGHTING
weight=rddmm;
[v11,w11]=find(rddmm(:,1)<=0.2);
[v12,w12]=find(rddmm(:,1)>0.2&rddmm(:,1)<=0.4);
[v13,w13]=find(rddmm(:,1)>0.4&rddmm(:,1)<=0.6);
[v14,w14]=find(rddmm(:,1)>0.6&rddmm(:,1)<=0.8);
[v15,w15]=find(rddmm(:,1)>0.8&rddmm(:,1)<=1.0);
idxv11=v11;
idxv12=v12;
idxv13=v13;
idxv14=v14;
idxv15=v15;
weight(idxv11,1)=1;
weight(idxv12,1)=0.75;
weight(idxv13,1)=0.6;
weight(idxv14,1)=0.45;
weight(idxv15,1)=0.15;
%==================================
[v21,w21]=find(rddmm(:,2)<=0.2);
[v22,w22]=find(rddmm(:,2)>0.2&rddmm(:,2)<=0.4);
[v23,w23]=find(rddmm(:,2)>0.4&rddmm(:,2)<=0.6);
[v24,w24]=find(rddmm(:,2)>0.6&rddmm(:,2)<=0.8);
[v25,w25]=find(rddmm(:,2)>0.8&rddmm(:,2)<=1.0);
idxv21=v21;
idxv22=v22;
idxv23=v23;


idxv24=v24;
idxv25=v25;
weight(idxv21,2)=1;
weight(idxv22,2)=0.75;
weight(idxv23,2)=0.6;
weight(idxv24,2)=0.45;
weight(idxv25,2)=0.15;
%====================================
[v31,w31]=find(rddmm(:,3)<=0.2);
[v32,w32]=find(rddmm(:,3)>0.2&rddmm(:,3)<=0.4);
[v33,w33]=find(rddmm(:,3)>0.4&rddmm(:,3)<=0.6);
[v34,w34]=find(rddmm(:,3)>0.6&rddmm(:,3)<=0.8);
[v35,w35]=find(rddmm(:,3)>0.8&rddmm(:,3)<=1.0);
idxv31=v31;
idxv32=v32;
idxv33=v33;
idxv34=v34;
idxv35=v35;
weight(idxv31,3)=1;
weight(idxv32,3)=0.75;
weight(idxv33,3)=0.6;
weight(idxv34,3)=0.45;
weight(idxv35,3)=0.15;

newdmm=(weight.*1);             % weighted data
newx=5*ones(400,3)+newdmm;
rdnew=newx-5*ones(400,n);
sqrdnew=rdnew.^2;

% CLASS 1
c1rew=sqrdnew(1:200,:);
muc1rew=mean(c1rew);
sigmac1rew=cov(c1rew);
varc1rew=std(c1rew);

% CLASS 2
c2rew=sqrdnew(201:400,:);
muc2rew=mean(c2rew);
sigmac2rew=cov(c2rew);
varc2rew=std(c2rew);

% T TEST
% Hypothesis testing
% h=0: cannot reject the null hypothesis at the alpha level of significance
% h=1: can reject the null hypothesis at the significance level
[h1,sig1,ci1]=ttest2(sqrdnew(:,1),d(:,1),0.05)
[h2,sig2,ci2]=ttest2(sqrdnew(:,2),d(:,2),0.05)
[h3,sig3,ci3]=ttest2(sqrdnew(:,3),d(:,3),0.05)

% TEST
test=input(' enter test values ');
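The banded weight assignment above (five intervals of rddmm, each mapped to a fixed weight per column) can be written more compactly. The following sketch is not part of the original listing; it assumes MATLAB R2015a+ for discretize, and its bands are left-closed, so behaviour at the exact band boundaries differs slightly from the <=-based comparisons above:

```matlab
% Same cut-points and weights as the banded assignment above.
edges = [0 0.2 0.4 0.6 0.8 1.0];   % band boundaries
w     = [1 0.75 0.6 0.45 0.15];    % weight assigned to each band
bin   = discretize(rddmm, edges);  % band index per entry (NaN outside [0,1])
weight = rddmm;                    % out-of-range entries keep their raw value
inband = ~isnan(bin);
weight(inband) = w(bin(inband));   % replace in-band entries by their band weight
```

This removes the per-column find/index bookkeeping and makes the cut-points and weights a single pair of vectors that can be tuned in one place.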


% g1 = ln(p(x|c1)) + ln(P(c1))
% p(x|c1) = exp(-0.5*(test-muc1)*inv(sigmac1)*(test-muc1)') / ((2*pi)^(m/2)*det(sigmac1)^0.5)
Acc1=-0.5*test*inv(sigmac1)*test';
Bcc1=test*inv(sigmac1)*muc1';
Ccc1=-0.5*muc1*inv(sigmac1)*muc1';
Dcc1=log(det(sigmac1));
g1=Acc1+Bcc1+Ccc1-Dcc1+log(prc1);

Acc2=-0.5*test*inv(sigmac2)*test';
Bcc2=test*inv(sigmac2)*muc2';
Ccc2=-0.5*muc2*inv(sigmac2)*muc2';
Dcc2=log(det(sigmac2));
g2=Acc2+Bcc2+Ccc2-Dcc2+log(prc2);

res=g1-g2;
if res>0
    disp('complete case = c1')
else
    disp('complete case = c2')
end
%=====================================
Ac1=-0.5*test*inv(sigmac1rew)*test';
Bc1=test*inv(sigmac1rew)*muc1rew';
Cc1=-0.5*muc1rew*inv(sigmac1rew)*muc1rew';
Dc1=log(det(sigmac1rew));
g1rew=Ac1+Bc1+Cc1-Dc1+log(prc1);

Ac2=-0.5*test*inv(sigmac2rew)*test';
Bc2=test*inv(sigmac2rew)*muc2rew';
Cc2=-0.5*muc2rew*inv(sigmac2rew)*muc2rew';
Dc2=log(det(sigmac2rew));
g2rew=Ac2+Bc2+Cc2-Dc2+log(prc2);

resrew=g1rew-g2rew;
if resrew>0
    disp('reweighting = c1')
else
    disp('reweighting = c2')
end
%==========================================================================