Dept of Biomedical Engineering, Medical Informatics Linköpings universitet, Linköping, Sweden http://www.imt.liu.se A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining Amir R Razavi, Hans Gill, Hans Åhlfeldt, Nosrat Shahsavar Department of Biomedical Engineering, Division of Medical Informatics Linköpings universitet, Linköping, Sweden


A Data Pre-processing Method in Data Mining

• Outline
  – Introduction
  – Dataset and variables
  – Data pre-processing
  – Data mining algorithm (DTI)
  – Results
  – Discussion

Introduction

• Abundance of data in medicine and availability of comprehensive registers

• Difficulty of analysing huge amounts of data with traditional methods

• Need for efficient data mining methods

Introduction

• Applying data mining methods to a breast cancer register

• Pre-processing is an essential part of knowledge discovery in databases

• Finding an efficient pre-processing approach is essential for successful data mining

Methods

• Dataset

• Data pre-processing
  – Data combination and selection
  – Cleaning data
  – Replacing missing values
  – Dimension reduction

• Decision Tree Induction (DTI)

• Performance comparison

Dataset

• 3949 female patients, 1986 to 1995, with follow-up to 2003

• Data from three registers (regional, tumour marker and death registers), with more than 150 variables overall

Variables

Predictor Set                 Outcome Set ‡
Age                           Distant metastasis, first five years
Quadrant                      Distant metastasis, more than 5 years
Side                          Loco-regional recurrence, first five years
Tumor size *                  Loco-regional recurrence, more than 5 years
Lymph node involvement *
Lymph node involvement †
Periglandular growth *
Multiple tumors *
Estrogen receptor
Progesterone receptor
S-phase fraction
DNA index
DNA ploidy

* from pathology report, † N0: Not palpable LN metastasis, ‡ all periods are time after diagnosis

Data Pre-processing – Data Selection

• After combining data from the different registers, important variables (predictors/outcomes) were selected in consultation with domain experts:
  – The number of predictors was reduced from more than 150
  – Four important outcomes for breast cancer were chosen

Data Pre-processing – Cleaning Data

• Cleaning the data of outliers and errors, for example:
  – Duration between diagnosis of the disease and the recurrence
  – Age
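The kind of check described on this slide could be sketched as follows. This is only an illustration: the field names (`age`, `diagnosis_date`, `recurrence_date`) and the age limits are hypothetical, and the slides do not say whether faulty records were dropped or corrected.

```python
from datetime import date

def is_plausible(record, min_age=18, max_age=110):
    """Return True if a record passes two simple sanity checks.

    The age limits are illustrative assumptions, not values from the study.
    """
    age_ok = min_age <= record["age"] <= max_age
    recurrence = record.get("recurrence_date")
    # A recurrence cannot be recorded before the diagnosis itself.
    duration_ok = recurrence is None or recurrence >= record["diagnosis_date"]
    return age_ok and duration_ok

# Hypothetical records: the second has an impossible age,
# the third has a recurrence dated before the diagnosis.
records = [
    {"age": 54, "diagnosis_date": date(1990, 3, 1), "recurrence_date": date(1993, 6, 1)},
    {"age": 540, "diagnosis_date": date(1991, 5, 1), "recurrence_date": None},
    {"age": 61, "diagnosis_date": date(1992, 1, 1), "recurrence_date": date(1989, 1, 1)},
]
clean = [r for r in records if is_plausible(r)]
```

Here the suspicious records are simply filtered out; in practice they might instead be sent back for manual review.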

Data Pre-processing - Replacing Missing Values

• EM (expectation maximization) algorithm
  – Dempster et al., 1977
  – A two-step iterative approach that estimates the parameters of a model starting from an initial guess. Each iteration consists of two steps:
    • An expectation step that finds the distribution of the missing data based on the known values of the observed variables and the current estimate of the parameters.
    • A maximization step that re-estimates the parameters, with the missing data replaced by their expected values.
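The two steps can be illustrated with a minimal sketch. It assumes the rows are drawn from a single multivariate normal distribution, at least two columns, and no column that is entirely missing. The slides do not state which model or software was used, so the function name `em_impute` and every detail below are assumptions.

```python
import numpy as np

def em_impute(X, n_iter=50, tol=1e-6):
    """Fill NaNs in X, assuming rows come from one multivariate normal."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    missing = np.isnan(X)
    # Initial guess: column means of the observed entries.
    mu = np.nanmean(X, axis=0)
    filled = np.where(missing, mu, X)
    sigma = np.cov(filled, rowvar=False) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        # Accumulates the conditional covariance of the missing parts.
        extra = np.zeros((p, p))
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed in this row: fall back to the current mean.
                filled[i] = mu
                extra += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            coef = np.linalg.solve(s_oo, s_mo.T).T
            # E-step: expected value of the missing entries given the observed ones.
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T
        # M-step: re-estimate mean and covariance from the completed data.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + extra) / n
        change = np.max(np.abs(new_mu - mu))
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return filled, mu, sigma
```

The observed entries are never modified; only the missing cells are overwritten on each pass.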

Data Pre-processing - Dimension Reduction

• Canonical Correlation Analysis (CCA)
  – It investigates the relationship between two sets of variables.
  – A canonical correlation is the correlation of two canonical variates, one representing a set of independent variables, the other a set of dependent variables.
  – A canonical variate is a linear combination of a set of original variables.

Data Pre-processing - Dimension Reduction

– The aim is to create a number of canonical solutions, each consisting of a linear combination of one set of variables:

$$U_i = a_1 X_1 + a_2 X_2 + \dots + a_m X_m$$

and a linear combination of the other set of variables:

$$V_i = b_1 Y_1 + b_2 Y_2 + \dots + b_n Y_n$$

– The goal is to determine the coefficients (the a's and b's) that maximize the correlation between the canonical variates $U_i$ and $V_i$.
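One standard way to obtain such coefficients is to whiten both variable sets and take a singular value decomposition of the whitened cross-covariance. The sketch below assumes both covariance matrices are invertible. The function names are illustrative, and the slides do not say which implementation the authors used.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(S)
    return v @ np.diag(1.0 / np.sqrt(w)) @ v.T

def cca(X, Y):
    """Return coefficient matrices A, B and canonical correlations rho.

    Column i of A holds the a-coefficients of U_i, column i of B the
    b-coefficients of V_i; rho[i] is the correlation between U_i and V_i.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1)
    Syy = Y.T @ Y / (n - 1)
    Sxy = X.T @ Y / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U_, rho, Vt = np.linalg.svd(Kx @ Sxy @ Ky, full_matrices=False)
    A = Kx @ U_
    B = Ky @ Vt.T
    return A, B, rho
```

The canonical variates themselves are then the centered data multiplied by `A` and `B`, and the correlations come back sorted from largest to smallest.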

Data Pre-processing - Dimension Reduction

– To find the important variables in each set (predictors and outcomes), the magnitude of the loadings was used.

– Variables with an absolute loading greater than or equal to 0.3 were assumed to be important and were entered into the next step for data mining.

– A loading shows how each original variable contributes to each canonical variate.
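As a rough illustration of this selection rule, the sketch below takes a loading to be the correlation between an original variable and a canonical variate (a common definition, assumed here) and keeps variables whose absolute loading is at least 0.3. For simplicity it looks at a single variate; the variable names and the data are made up.

```python
import numpy as np

def loadings(X, variate):
    """Correlation of each column of X with one canonical variate."""
    X = np.asarray(X, dtype=float)
    variate = np.asarray(variate, dtype=float)
    Xc = X - X.mean(axis=0)
    vc = variate - variate.mean()
    return (Xc.T @ vc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(vc))

def select_important(names, load, threshold=0.3):
    """Keep the variables whose absolute loading reaches the threshold."""
    return [name for name, value in zip(names, load) if abs(value) >= threshold]

# Toy data: three hypothetical variables and a variate equal to the first one.
X = np.array([
    [1.0,  2.0, -2.0],
    [2.0, -1.0, -1.0],
    [3.0, -2.0,  0.0],
    [4.0, -1.0,  2.0],
    [5.0,  2.0,  1.0],
])
variate = X[:, 0]
load = loadings(X, variate)
kept = select_important(["var_a", "var_b", "var_c"], load)
```

In this toy case the second variable is uncorrelated with the variate and is dropped, while the other two are kept.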

Data Pre-processing - Dimension Reduction

• Variables with their loadings

Data Mining Algorithm

• Decision Tree Induction (DTI)
  – A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.
  – Each internal node denotes a test on variables, each branch stands for an outcome of the test, leaf nodes represent an outcome, and the uppermost node in a tree is the root node.
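The structure described above can be sketched as a small data type. The tree below is entirely made up for illustration: the variable names, thresholds and labels are hypothetical and carry no clinical meaning. It is not the tree induced in this study, and no induction algorithm is shown.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str  # the classification returned at this leaf

@dataclass
class Node:
    variable: str     # variable tested at this internal node
    threshold: float  # cases with value <= threshold follow the `low` branch
    low: "Tree"
    high: "Tree"

Tree = Union[Leaf, Node]

def classify(tree, case):
    """Walk from the root to a leaf, following the outcome of each test."""
    while isinstance(tree, Node):
        tree = tree.low if case[tree.variable] <= tree.threshold else tree.high
    return tree.label

def count_leaves(tree):
    if isinstance(tree, Leaf):
        return 1
    return count_leaves(tree.low) + count_leaves(tree.high)

def tree_size(tree):
    """Total number of nodes, internal nodes plus leaves."""
    if isinstance(tree, Leaf):
        return 1
    return 1 + tree_size(tree.low) + tree_size(tree.high)

# Hypothetical example tree (made-up thresholds).
example = Node(
    "tumor_size", 20.0,
    Leaf("no recurrence"),
    Node("lymph_nodes", 3.0, Leaf("no recurrence"), Leaf("recurrence")),
)
```

For a binary tree like this one, the number of nodes is always twice the number of leaves minus one, assuming "tree size" is taken to mean the total node count.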

Resulting Decision Tree

Performance comparison

• Sensitivity = $\dfrac{TP}{TP + FN}$

• Specificity = $\dfrac{TN}{TN + FP}$

• Accuracy = $\dfrac{TP + TN}{TP + TN + FP + FN}$

• Number of leaves and tree size

TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
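These three measures follow directly from the four confusion-matrix counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def performance(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Made-up counts, for illustration only.
scores = performance(tp=8, tn=6, fp=4, fn=2)
```

With these counts the sensitivity is 8/10, the specificity 6/10 and the accuracy 14/20.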

Performance Comparison

• Comparing different approaches

                    Without           With replacing    With
                    pre-processing    missing values    pre-processing
Accuracy            54%               57%               67%
Sensitivity         83%               82%               80%
Specificity         41%               46%               63%
Number of leaves    137               196               14
Tree size           273               391               27

Discussion

• Effective data pre-processing is a very important step in knowledge discovery
  – Real-world data are usually:
    • Incomplete
    • Noisy
    • Inconsistent
    • Not collected for data mining

Discussion

• Replacing missing values before dimension reduction
  – Providing more information to CCA for dimension reduction

• Running CCA prior to DTI
  – Reducing the number of variables while increasing accuracy of classification
  – Considerable increase in the interpretability of DTI

Discussion

• In medical studies, often no pre-processing is done before DTI

• Proper pre-processing, including consulting with domain experts, replacing missing values and dimension reduction, prepares the data for better data mining by DTI

• Our approach increases the accuracy and interpretability of DTI

Future Work

• Increase the efficiency of knowledge discovery in medical registers.

• Validate the results of our methodology (pre-processing prior to data mining) with domain experts for the prediction of cancer recurrence.

• Investigate how to use the discovered knowledge and integrate it into the clinical workflow.

• Improve the quality of the registers by adding and completing important predictors.

Thanks for your attention
