
Page 1:

Supervised classification performance (prediction) assessment

Dr. Huiru Zheng and Dr. Francisco Azuaje

School of Computing and Mathematics

Faculty of Engineering

University of Ulster, N.Ireland, UK

Page 2:

Building prediction models

• Different models, tools and applications.
• The problem of prediction (classification).

[Diagram: Data (event, process, condition, properties) → Prediction model → Predictions (category, values, action, response)]

Page 3:

Building prediction models – supervised learning methods

Training phase: a set of cases and their respective labels are used to build a classification model.

[Diagram: training cases (A, C) → Prediction model → predicted labels C']

Page 4:

Test phase: the trained classifier is used to predict new cases.

[Diagram: new cases (A, with true labels C withheld) → Prediction model → predicted labels C*]

Prediction models, such as ANNs, aim to achieve an ability to generalise: the capacity to correctly classify cases or problems unseen during training.

Quality indicator: Accuracy during the test phase

Page 5:

Building prediction models – Assessing their quality

A classifier will be able to generalise if:
a) its architecture and learning parameters have been properly defined, and
b) enough training data are available.

• The second condition is difficult to achieve due to resource and time constraints.

• Key limitations appear when dealing with small-data samples, which is a common feature observed in many studies.

• A small test data set may contribute to an inaccurate performance assessment.

Page 6:

Key questions

• How to measure classification quality?
• How can I select training and test cases?
• How many experiments?
• How to estimate prediction accuracy?
• Effects on small vs. large data sets?

The problem of Data Sampling

Page 7:

What is Accuracy?

Page 8:

What is Accuracy?

Accuracy = (No. of correct predictions) / (No. of predictions)
         = (TP + TN) / (TP + TN + FP + FN)
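A minimal sketch of this formula in Python (not part of the original slides); the counts reuse classifier A from Examples (1) on the next slide:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Classifier A from Examples (1): TP=25, TN=25, FP=25, FN=25
print(accuracy(25, 25, 25, 25))   # 0.5
```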

Page 9:

Examples (1)

• Clearly, B, C, D are all better than A

• Is B better than C, D?

• Is C better than B, D?

• Is D better than B, C?

classifier   TP   TN   FP   FN   Accuracy
A            25   25   25   25   50%
B            50   25   25    0   75%
C            25   50    0   25   75%
D            37   37   13   13   74%

Accuracy may not tell the whole story

Page 10:

Examples (2)

• Clearly, D is better than A

• Is B better than A, C, D?

classifier   TP    TN    FP   FN   Accuracy
A            25    75    75   25   50%
B             0   150     0   50   75%
C            50     0   150    0   25%
D            30   100    50   20   65%

Page 11:

What is Sensitivity (recall)?

Sensitivity = (No. of correct positive predictions) / (No. of positives)
            = TP / (TP + FN)

Sensitivity is also called the true positive rate. The true negative rate is termed specificity.

Page 12:

What is Precision?

Precision = (No. of correct positive predictions) / (No. of positive predictions)
          = TP / (TP + FP)

(with respect to positives)

Page 13:

Precision-Recall Trade-off

• A predicts better than B if A has better recall and precision than B

• There is a trade-off between recall and precision

• In some applications, once you reach a satisfactory precision, you optimize for recall

• In some applications, once you reach a satisfactory recall, you optimize for precision.

[Plot: precision vs. recall trade-off curve]

Page 14:

Comparing Prediction Performance

• Accuracy is the obvious measure
  – But it conveys the right intuition only when the positive and negative populations are roughly equal in size
• Recall and precision together form a better measure
  – But what do you do when A has better recall than B and B has better precision than A?

Page 15:

Some Alternate measures

• F-Measure - Take the harmonic mean of recall and precision

• Adjusted accuracy – a weighted combination of sensitivity and specificity

• ROC curve - Receiver Operating Characteristic analysis

F = (2 × recall × precision) / (recall + precision)   (wrt positives)
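As a short sketch (not from the slides), recall, precision and the F-measure can be computed directly from the confusion counts; applying them to classifier B of Examples (2) shows how a 75% accuracy can hide a recall of zero:

```python
def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def f_measure(tp: int, fp: int, fn: int) -> float:
    r, p = recall(tp, fn), precision(tp, fp)
    return 2 * r * p / (r + p) if r + p else 0.0

# Classifier B from Examples (2): TP=0, TN=150, FP=0, FN=50 (accuracy 75%)
print(f_measure(tp=0, fp=0, fn=50))   # 0.0 - the F-measure exposes the weakness
```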

Page 16:

Adjusted Accuracy

• Weigh by the importance of the classes

classifier   TP    TN    FP   FN   Accuracy   Adj. Accuracy
A            25    75    75   25   50%        50%
B             0   150     0   50   75%        50%
C            50     0   150    0   25%        50%
D            30   100    50   20   65%        63%

Adjusted accuracy = α × Sensitivity + β × Specificity

where α + β = 1; typically α = β = 0.5.

But how should the values of α and β be chosen?
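A small illustrative sketch of the weighted form above (not from the slides), applied to classifier D of the table; it reproduces the 63% adjusted accuracy shown there:

```python
def adjusted_accuracy(tp, tn, fp, fn, alpha=0.5, beta=0.5):
    """alpha * sensitivity + beta * specificity, with alpha + beta = 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return alpha * sensitivity + beta * specificity

# Classifier D: TP=30, TN=100, FP=50, FN=20
print(round(adjusted_accuracy(30, 100, 50, 20), 2))   # 0.63
```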

Page 17:

ROC Curves

• By changing the decision threshold t, we get a range of sensitivities and specificities for a classifier

• A predicts better than B if A has better sensitivities than B at most specificities

• Leads to ROC curve that plots sensitivity vs. (1 – specificity)

• Then the larger the area under the ROC curve, the better

[Plot: ROC curve, sensitivity vs. (1 – specificity)]
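A minimal sketch of the threshold sweep behind an ROC curve; the scores and labels below are hypothetical illustration data, not from the slides:

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])   # classifier outputs
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0])     # 1 = positive class

points = [(0.0, 0.0)]
for t in sorted(set(scores), reverse=True):          # decreasing thresholds t
    pred = scores >= t
    tp = int(np.sum(pred & (labels == 1)));  fn = int(np.sum(~pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)));  tn = int(np.sum(~pred & (labels == 0)))
    points.append((fp / (fp + tn), tp / (tp + fn)))   # (1 - specificity, sensitivity)

# Area under the ROC curve via the trapezoid rule: the larger, the better.
auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(round(auc, 3))   # about 0.81 for this toy data
```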

Page 18:

Key questions

• How to measure classification quality?
• How can I select training and test cases?
• How many experiments?
• How to estimate prediction accuracy?
• Effects on small vs. large data sets?

The problem of Data Sampling

Page 19:

Data sampling techniques

Main goals:

Reduction of the estimation bias (estimates that are too optimistic or too conservative)

Reduction of the variance introduced by a small data set

Page 20:

Other important goals:

a) to establish differences between data sampling techniques when applied to small and larger datasets,
b) to study the response of these methods to the size and number of train-test sets, and
c) to discuss criteria for the selection of sampling techniques.

Page 21:

Three Data Sampling Techniques

• cross-validation

• leave-one-out

• bootstrap.

Page 22:

k-fold cross validation

N samples, p training samples, q test samples (q = N – p).

The data are randomly divided into training and test sets. This process is repeated k times and the classification performance is the average of the individual test estimates.

[Diagram: Experiment 1, Experiment 2, ..., Experiment k – each experiment randomly partitions the N samples into training and test sets]
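A minimal sketch of the repeated random splitting described above, assuming scikit-learn is available; the iris data and the nearest-neighbour classifier are stand-ins for the slides' data and BP-ANN:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                     # placeholder dataset
splitter = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)   # k = 10 runs

accuracies = []
for train_idx, test_idx in splitter.split(X):
    clf = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

# The performance estimate is the average of the k individual test estimates.
print(round(float(np.mean(accuracies)), 3))
```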

Page 23:

k-fold cross validation

The classifier may not be able to accurately predict new cases if the amount of data used for training is too small. At the same time, the quality assessment may not be accurate if the portion of data used for testing is too small.

Splitting procedure: how should the training portion (p%) and the test portion (q%) be chosen?

Page 24:

The Leave-One-Out Method

• Given N cases available in a dataset, a classifier is trained on (N–1) cases and then tested on the case that was left out.

• This is repeated N times until every case in the dataset has been included once as a cross-validation instance.

• The results are averaged across the N test cases to estimate the classifier’s prediction performance.

[Diagram: Experiment 1, Experiment 2, ..., Experiment N – in each experiment a different single case is held out for testing and the remaining N–1 cases are used for training]
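A minimal leave-one-out sketch, again assuming scikit-learn with placeholder data; each of the N runs tests on exactly one held-out case:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                     # placeholder dataset

hits = []
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    hits.append(clf.score(X[test_idx], y[test_idx]))  # 1.0 or 0.0 per case

# Averaging over the N single-case tests gives the accuracy estimate.
print(round(float(np.mean(hits)), 3))
```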

Page 25:

The Bootstrap Method

• A training dataset is generated by sampling with replacement N times from the available N cases.
• The classifier is trained on this set and then tested on the original dataset.
• This process is repeated several times, and the classifier's accuracy estimate is the average of these individual estimates.

[Diagram: from the data {Case 1, Case 2, Case 3, Case 4, Case 5}, one bootstrap training set, Training (1), is {Case 1, Case 1, Case 3, Case 3, Case 5} (drawn with replacement); the corresponding Test (1) set is the original {Case 1, ..., Case 5}]
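A minimal sketch of the bootstrap procedure as described on this slide (train on a resample drawn with replacement, test on the original dataset); scikit-learn and the iris data are assumptions, not part of the original material:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)                     # placeholder dataset
N = len(X)

accuracies = []
for _ in range(200):                                  # number of bootstrap runs
    idx = rng.integers(0, N, size=N)                  # sample N cases with replacement
    clf = KNeighborsClassifier().fit(X[idx], y[idx])
    accuracies.append(clf.score(X, y))                # test on the original dataset

print(round(float(np.mean(accuracies)), 3))           # average of the individual estimates
```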

Page 26:

An example

• 88 cases categorised into four classes: Ewing family of tumors (EWS, 30), rhabdomyosarcoma (RMS, 11), Burkitt lymphomas (BL, 19) and neuroblastomas (NB, 28).
• Represented by the expression values of 2308 genes with suspected roles in processes relevant to these tumors.
• PCA was applied to reduce the dimensionality of the cases; the 10 dominant components per case were used to train the networks (see the sketch after this list).
• All of the classifiers (BP-ANN) were trained using the same learning parameters.
• The BP-ANN architectures comprised 10 input nodes, 8 hidden nodes and 4 output nodes.
• Each output node encodes one of the tumor classes.
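A hedged sketch of the preprocessing and network layout described above, assuming scikit-learn; the real 88 × 2308 expression matrix is not reproduced here, so random data stands in for it, and MLPClassifier (a back-propagation-trained network) stands in for the original BP-ANN implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(88, 2308))              # stand-in for the 88 cases x 2308 genes
y = rng.integers(0, 4, size=88)              # stand-in for the EWS/RMS/BL/NB labels

X10 = PCA(n_components=10).fit_transform(X)  # keep the 10 dominant components per case

# 10 inputs, 8 hidden nodes; the 4 output nodes (one per tumour class) are implicit
# in the multi-class output layer.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X10, y)
print(round(clf.score(X10, y), 3))           # training accuracy only, for illustration
```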

Page 27:

Analysing the k-fold cross validation

The cross-validation results were analysed for three different data splitting methods:

a) 50% of the available cases were used for training the classifiers and the remaining 50% for testing,
b) 75% for training and 25% for testing,
c) 95% for training and 5% for testing.
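A short sketch of how the three splitting schemes could be compared, assuming scikit-learn's ShuffleSplit and placeholder data in place of the tumour dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # placeholder dataset
for test_size in (0.50, 0.25, 0.05):         # 50%-50%, 75%-25%, 95%-5% splittings
    splitter = ShuffleSplit(n_splits=100, test_size=test_size, random_state=0)
    accs = [KNeighborsClassifier().fit(X[tr], y[tr]).score(X[te], y[te])
            for tr, te in splitter.split(X)]
    print(f"train {1 - test_size:.0%} / test {test_size:.0%}: {np.mean(accs):.3f}")
```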

Page 28:

Tumour classification

Cross-validation method based on a 50%-50% splitting.

A: 10 train-test runs, B: 25, C: 50, D: 100, E: 500 (interval size equal to 0.01), F: 1000, G: 2000, H: 3000, I: 4000, J: 5000 train-test runs.

[Plot: classification accuracy (approx. 0.71–0.81) vs. number of train-test runs A–J]

Page 29:

Tumour classification: cross-validation method based on a 75%-25% splitting.

A: 10 train-test runs, B: 25, C: 50, D: 100, E: 500, F: 1000 (interval size equal to 0.01), G: 2000, H: 3000, I: 4000, J: 5000 train-test runs.

[Plot: classification accuracy (approx. 0.71–0.80) vs. number of train-test runs A–J]

Page 30:

Tumour classification: cross-validation method based on a 95%-5% splitting.

A: 10 train-test runs, B: 25, C: 50, D: 100, E: 500, F: 1000, G: 2000, H: 3000, I: 4000, J: 5000 train-test runs (interval size equal to 0.01).

[Plot: classification accuracy (approx. 0.55–0.95) vs. number of train-test runs A–J]

Page 31:

Tumour classification

• The 50%-50% cross-validation produced the most conservative accuracy estimates.

• The 95%-5% cross-validation method produced the most optimistic cross-validation accuracy estimates.

• The leave-one-out method produced the highest accuracy estimate for this dataset (0.79).

• The estimation of higher accuracy values may be linked to an increase in the size of the training datasets.

Page 32:

Tumour classification: bootstrap method.

A: 100 train-test runs, B: 200, C: 300, D: 400, E: 500, F: 600, G: 700, H: 800, I: 900 (interval size equal to 0.01), J: 1000 train-test runs.

[Plot: classification accuracy (approx. 0.725–0.770) vs. number of train-test runs A–J]

Page 33:

Final remarks

• The problem of estimating prediction quality should be carefully addressed and deserves further investigation.
• Sampling techniques can be implemented to assess classification quality factors (such as accuracy) of classifiers (such as ANNs).
• In general there is variability among the three techniques.
• These experiments suggest that it is possible to achieve lower-variance estimates for different numbers of train-test runs.

Page 34:

Final remarks (II)

• Furthermore, one may identify conservative and optimistic accuracy predictors, whose overall estimates may be significantly different.
• This effect is more noticeable in small-sample applications.
• The predicted accuracy of a classifier is generally proportional to the size of the training dataset.
• The bootstrap method may be applied to generate conservative and robust accuracy estimates, based on a relatively small number of train-test experiments.

Page 35:

Final remarks (III)

• This presentation highlights the importance of applying more rigorous procedures for data selection and classification quality assessment.
• In general, the application of more than one sampling technique may provide the basis for accurate and reliable predictions.