
Page 1:

Supervised classification performance (prediction) assessment

Dr. Huiru Zheng and Dr. Francisco Azuaje

School of Computing and Mathematics

Faculty of Engineering

University of Ulster, N.Ireland, UK

Page 2:

Building prediction models

• Different models, tools and applications.
• The problem of prediction (classification).

[Diagram: Data (event, process, condition, properties) → Prediction model → Predictions (category, values, action, response)]

Page 3:

Building prediction models – supervised learning methods

Training phase: a set of cases and their respective labels are used to build a classification model.

[Diagram: training cases (A, C) → Prediction model → predicted labels C']

Page 4:

Test phase: the trained classifier is used to predict new cases.

[Diagram: new cases (A, with true labels C withheld) → Prediction model → predicted labels C*]

Prediction models, such as ANNs, aim to achieve an ability to generalise: the capacity to correctly classify cases or problems unseen during training.

Quality indicator: Accuracy during the test phase

Page 5:

Building prediction models – Assessing their quality

A classifier will be able to generalise if:
a) its architecture and learning parameters have been properly defined, and
b) enough training data are available.

• The second condition is difficult to achieve due to resource and time constraints.

• Key limitations appear when dealing with small-data samples, which is a common feature observed in many studies.

• A small test data set may contribute to an inaccurate performance assessment.

Page 6:

Key questions

• How to measure classification quality?
• How can I select training and test cases?
• How many experiments?
• How to estimate prediction accuracy?
• Effects on small vs. large data sets?

The problem of Data Sampling

Page 7:

What is Accuracy?

Page 8:

What is Accuracy?

Accuracy = (No. of correct predictions) / (No. of predictions)
         = (TP + TN) / (TP + TN + FP + FN)
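A minimal sketch of this formula in Python (not part of the original slides); the counts reuse classifier A from Examples (1) on the next slide:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Classifier A from Examples (1): TP=25, TN=25, FP=25, FN=25
print(accuracy(25, 25, 25, 25))   # 0.5
```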

Page 9:

Examples (1)

• Clearly, B, C, D are all better than A

• Is B better than C, D?

• Is C better than B, D?

• Is D better than B, C?

classifier   TP   TN   FP   FN   Accuracy
A            25   25   25   25   50%
B            50   25   25    0   75%
C            25   50    0   25   75%
D            37   37   13   13   74%

Accuracy may not tell the whole story

Page 10:

Examples (2)

• Clearly, D is better than A

• Is B better than A, C, D?

classifier   TP    TN    FP   FN   Accuracy
A            25    75    75   25   50%
B             0   150     0   50   75%
C            50     0   150    0   25%
D            30   100    50   20   65%

Page 11:

What is Sensitivity (recall)?

Sensitivity = (No. of correct positive predictions) / (No. of positives)
            = TP / (TP + FN)

Sensitivity is also called the true positive rate. The true negative rate is termed specificity.

Page 12:

What is Precision?

Precision = (No. of correct positive predictions) / (No. of positive predictions)
          = TP / (TP + FP)

(with respect to positives)

Page 13:

Precision-Recall Trade-off

• A predicts better than B if A has better recall and precision than B

• There is a trade-off between recall and precision

• In some applications, once you reach a satisfactory precision, you optimize for recall

• In some applications, once you reach a satisfactory recall, you optimize for precision.

[Plot: precision vs. recall trade-off curve]

Page 14:

Comparing Prediction Performance

• Accuracy is the obvious measure
  – But it conveys the right intuition only when the positive and negative populations are roughly equal in size
• Recall and precision together form a better measure
  – But what do you do when A has better recall than B and B has better precision than A?

Page 15:

Some Alternate measures

• F-Measure - Take the harmonic mean of recall and precision

• Adjusted accuracy – a weighted combination of sensitivity and specificity

• ROC curve - Receiver Operating Characteristic analysis

F = (2 × recall × precision) / (recall + precision)   (wrt positives)
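As a short sketch (not from the slides), recall, precision and the F-measure can be computed directly from the confusion counts; applying them to classifier B of Examples (2) shows how a 75% accuracy can hide a recall of zero:

```python
def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def f_measure(tp: int, fp: int, fn: int) -> float:
    r, p = recall(tp, fn), precision(tp, fp)
    return 2 * r * p / (r + p) if r + p else 0.0

# Classifier B from Examples (2): TP=0, TN=150, FP=0, FN=50 (accuracy 75%)
print(f_measure(tp=0, fp=0, fn=50))   # 0.0 - the F-measure exposes the weakness
```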

Page 16:

Adjusted Accuracy

• Weigh by the importance of the classes

classifier   TP    TN    FP   FN   Accuracy   Adj. Accuracy
A            25    75    75   25   50%        50%
B             0   150     0   50   75%        50%
C            50     0   150    0   25%        50%
D            30   100    50   20   65%        63%

Adjusted accuracy = α × Sensitivity + β × Specificity

where α + β = 1; typically α = β = 0.5.

But how should the values of α and β be chosen?
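A small illustrative sketch of the weighted form above (not from the slides), applied to classifier D of the table; it reproduces the 63% adjusted accuracy shown there:

```python
def adjusted_accuracy(tp, tn, fp, fn, alpha=0.5, beta=0.5):
    """alpha * sensitivity + beta * specificity, with alpha + beta = 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return alpha * sensitivity + beta * specificity

# Classifier D: TP=30, TN=100, FP=50, FN=20
print(round(adjusted_accuracy(30, 100, 50, 20), 2))   # 0.63
```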

Page 17:

ROC Curves

• By changing the decision threshold t, we get a range of sensitivities and specificities for a classifier

• A predicts better than B if A has better sensitivities than B at most specificities

• Leads to ROC curve that plots sensitivity vs. (1 – specificity)

• Then the larger the area under the ROC curve, the better

[Plot: ROC curve, sensitivity vs. (1 – specificity)]
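A minimal sketch of the threshold sweep behind an ROC curve; the scores and labels below are hypothetical illustration data, not from the slides:

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])   # classifier outputs
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0])     # 1 = positive class

points = [(0.0, 0.0)]
for t in sorted(set(scores), reverse=True):          # decreasing thresholds t
    pred = scores >= t
    tp = int(np.sum(pred & (labels == 1)));  fn = int(np.sum(~pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)));  tn = int(np.sum(~pred & (labels == 0)))
    points.append((fp / (fp + tn), tp / (tp + fn)))   # (1 - specificity, sensitivity)

# Area under the ROC curve via the trapezoid rule: the larger, the better.
auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(round(auc, 3))   # about 0.81 for this toy data
```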

Page 18:

Key questions

• How to measure classification quality?
• How can I select training and test cases?
• How many experiments?
• How to estimate prediction accuracy?
• Effects on small vs. large data sets?

The problem of Data Sampling

Page 19:

Data sampling techniques

Main goals:

Reduction of the estimation bias (estimates that are too optimistic or too conservative)

Reduction of the variance introduced by a small data set

Page 20:

Other important goals:

a) to establish differences between data sampling techniques when applied to small and larger datasets,
b) to study the response of these methods to the size and number of train-test sets, and
c) to discuss criteria for the selection of sampling techniques.

Page 21:

Three Data Sampling Techniques

• cross-validation

• leave-one-out

• bootstrap.

Page 22:

k-fold cross validation

N samples, p training samples, q test samples (q = N – p).

The data are randomly divided into training and test sets. This process is repeated k times and the classification performance is the average of the individual test estimates.

[Diagram: Experiment 1, Experiment 2, ..., Experiment k – each experiment randomly partitions the N samples into training and test sets]
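A minimal sketch of the repeated random splitting described above, assuming scikit-learn is available; the iris data and the nearest-neighbour classifier are stand-ins for the slides' data and BP-ANN:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                     # placeholder dataset
splitter = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)   # k = 10 runs

accuracies = []
for train_idx, test_idx in splitter.split(X):
    clf = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

# The performance estimate is the average of the k individual test estimates.
print(round(float(np.mean(accuracies)), 3))
```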

Page 23:

k-fold cross validation

The classifier may not be able to accurately predict new cases if the amount of data used for training is too small. At the same time, the quality assessment may not be accurate if the portion of data used for testing is too small.

Splitting procedure: how should the training portion (p%) and the test portion (q%) be chosen?

Page 24:

The Leave-One-Out Method

• Given N cases available in a dataset, a classifier is trained on (N–1) cases and then tested on the case that was left out.

• This is repeated N times until every case in the dataset has been included once as a cross-validation instance.

• The results are averaged across the N test cases to estimate the classifier’s prediction performance.

[Diagram: Experiment 1, Experiment 2, ..., Experiment N – in each experiment a different single case is held out for testing and the remaining N–1 cases are used for training]
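A minimal leave-one-out sketch, again assuming scikit-learn with placeholder data; each of the N runs tests on exactly one held-out case:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                     # placeholder dataset

hits = []
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    hits.append(clf.score(X[test_idx], y[test_idx]))  # 1.0 or 0.0 per case

# Averaging over the N single-case tests gives the accuracy estimate.
print(round(float(np.mean(hits)), 3))
```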

Page 25:

The Bootstrap Method

• A training dataset is generated by sampling with replacement N times from the available N cases.
• The classifier is trained on this set and then tested on the original dataset.
• This process is repeated several times, and the classifier's accuracy estimate is the average of these individual estimates.

[Diagram: from the data {Case 1, Case 2, Case 3, Case 4, Case 5}, one bootstrap training set, Training (1), is {Case 1, Case 1, Case 3, Case 3, Case 5} (drawn with replacement); the corresponding Test (1) set is the original {Case 1, ..., Case 5}]
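A minimal sketch of the bootstrap procedure as described on this slide (train on a resample drawn with replacement, test on the original dataset); scikit-learn and the iris data are assumptions, not part of the original material:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)                     # placeholder dataset
N = len(X)

accuracies = []
for _ in range(200):                                  # number of bootstrap runs
    idx = rng.integers(0, N, size=N)                  # sample N cases with replacement
    clf = KNeighborsClassifier().fit(X[idx], y[idx])
    accuracies.append(clf.score(X, y))                # test on the original dataset

print(round(float(np.mean(accuracies)), 3))           # average of the individual estimates
```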

Page 26:

An example

• 88 cases categorised into four classes: Ewing family of tumors (EWS, 30), rhabdomyosarcoma (RMS, 11), Burkitt lymphomas (BL, 19) and neuroblastomas (NB, 28).
• Represented by the expression values of 2308 genes with suspected roles in processes relevant to these tumors.
• PCA was applied to reduce the dimensionality of the cases; the 10 dominant components per case were used to train the networks (see the sketch after this list).
• All of the classifiers (BP-ANN) were trained using the same learning parameters.
• The BP-ANN architectures comprised 10 input nodes, 8 hidden nodes and 4 output nodes.
• Each output node encodes one of the tumor classes.
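A hedged sketch of the preprocessing and network layout described above, assuming scikit-learn; the real 88 × 2308 expression matrix is not reproduced here, so random data stands in for it, and MLPClassifier (a back-propagation-trained network) stands in for the original BP-ANN implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(88, 2308))              # stand-in for the 88 cases x 2308 genes
y = rng.integers(0, 4, size=88)              # stand-in for the EWS/RMS/BL/NB labels

X10 = PCA(n_components=10).fit_transform(X)  # keep the 10 dominant components per case

# 10 inputs, 8 hidden nodes; the 4 output nodes (one per tumour class) are implicit
# in the multi-class output layer.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X10, y)
print(round(clf.score(X10, y), 3))           # training accuracy only, for illustration
```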

Page 27:

Analysing the k-fold cross validation

The cross-validation results were analysed for three different data splitting methods:

a) 50% of the available cases were used for training the classifiers and the remaining 50% for testing,
b) 75% for training and 25% for testing,
c) 95% for training and 5% for testing.
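A short sketch of how the three splitting schemes could be compared, assuming scikit-learn's ShuffleSplit and placeholder data in place of the tumour dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # placeholder dataset
for test_size in (0.50, 0.25, 0.05):         # 50%-50%, 75%-25%, 95%-5% splittings
    splitter = ShuffleSplit(n_splits=100, test_size=test_size, random_state=0)
    accs = [KNeighborsClassifier().fit(X[tr], y[tr]).score(X[te], y[te])
            for tr, te in splitter.split(X)]
    print(f"train {1 - test_size:.0%} / test {test_size:.0%}: {np.mean(accs):.3f}")
```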

Page 28:

Tumour classification

Cross-validation method based on a 50%-50% splitting.

A: 10 train-test runs, B: 25, C: 50, D: 100, E: 500 (interval size equal to 0.01), F: 1000, G: 2000, H: 3000, I: 4000, J: 5000 train-test runs.

[Plot: classification accuracy (approx. 0.71–0.81) vs. number of train-test runs A–J]

Page 29:

Tumour classification: cross-validation method based on a 75%-25% splitting.

A: 10 train-test runs, B: 25, C: 50, D: 100, E: 500, F: 1000 (interval size equal to 0.01), G: 2000, H: 3000, I: 4000, J: 5000 train-test runs.

[Plot: classification accuracy (approx. 0.71–0.80) vs. number of train-test runs A–J]

Page 30:

Tumour classification: cross-validation method based on a 95%-5% splitting.

A: 10 train-test runs, B: 25, C: 50, D: 100, E: 500, F: 1000, G: 2000, H: 3000, I: 4000, J: 5000 train-test runs (interval size equal to 0.01).

[Plot: classification accuracy (approx. 0.55–0.95) vs. number of train-test runs A–J]

Page 31:

Tumour classification

• The 50%-50% cross-validation produced the most conservative accuracy estimates.

• The 95%-5% cross-validation method produced the most optimistic cross-validation accuracy estimates.

• The leave-one-out method produced the highest accuracy estimate for this dataset (0.79).

• The estimation of higher accuracy values may be linked to an increase in the size of the training datasets.

Page 32:

Tumour classification: bootstrap method.

A: 100 train-test runs, B: 200, C: 300, D: 400, E: 500, F: 600, G: 700, H: 800, I: 900 (interval size equal to 0.01), J: 1000 train-test runs.

[Plot: classification accuracy (approx. 0.725–0.770) vs. number of train-test runs A–J]

Page 33:

Final remarks

• The problem of estimating prediction quality should be carefully addressed and deserves further investigation.
• Sampling techniques can be implemented to assess classification quality factors (such as accuracy) of classifiers (such as ANNs).
• In general there is variability among the three techniques.
• These experiments suggest that it is possible to achieve lower-variance estimates for different numbers of train-test runs.

Page 34:

Final remarks (II)

• Furthermore, one may identify conservative and optimistic accuracy predictors, whose overall estimates may be significantly different.
• This effect is more noticeable in small-sample applications.
• The predicted accuracy of a classifier is generally proportional to the size of the training dataset.
• The bootstrap method may be applied to generate conservative and robust accuracy estimates, based on a relatively small number of train-test experiments.

Page 35:

Final remarks (III)

• This presentation highlights the importance of applying more rigorous procedures for data selection and classification quality assessment.
• In general, the application of more than one sampling technique may provide the basis for accurate and reliable predictions.