Applied Machine Learning
Lecture 6-2: Evaluation methods
Selpi ([email protected])
These slides are a further development of Richard Johansson's slides
February 11, 2020
Overview
introduction
evaluating classification systems
evaluating scorers and rankers
evaluating regressors
detour: training losses and evaluation metrics
application-specific evaluation: a sample
why evaluate?
- predictive systems are rarely perfect
- so we need to measure how well our systems are doing
  - and compare alternatives
- selecting a meaningful evaluation protocol is crucial in machine learning projects
  - this is something we emphasize for thesis projects
- what is the performance needed to be "useful"?
how do we evaluate? intrinsic and extrinsic
- intrinsic evaluation: measure the performance in isolation, using some automatically computed metric
- extrinsic evaluation: I changed my predictor; how does this affect the performance of my "downstream task"?
  - how much more money do I make?
  - how many more clicks do I get?
is there already an established evaluation procedure?
- for many classical prediction problems, there might be a well-established procedure
- some of which can be quite specific to the type of application:
  - word error rate for speech recognition systems
  - BLEU score for machine translation systems
- in this lecture, we'll look at some of the most common generally applicable evaluation methods that are
  - intrinsic
  - automatic
- we'll come back to some application-specific metrics at the end
high-level setup
qualitative analysis
- it can be useful to look at "interesting" instances
  - but not as an argument for why your system is good
  - or for systematic comparison
- error analysis: an art more than a science
  - we need to understand our data
  - can help me understand what data I lack, or help me improve features
evaluating classification systems
accuracy and error rate
$\text{accuracy} = \frac{\text{nbr. correct}}{N} \qquad \text{error rate} = \frac{\text{nbr. incorrect}}{N}$
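A minimal sketch of computing these in scikit-learn (the label lists are made up for illustration):

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]   # made-up gold labels
y_pred = [1, 0, 0, 1, 0, 1]   # made-up predictions

accuracy = accuracy_score(y_true, y_pred)   # nbr. correct / N
error_rate = 1 - accuracy                   # nbr. incorrect / N
print(accuracy, error_rate)
```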
example
considering one class
- let's say we're interested in one class in particular
  - "spotting problems" or "finding the needles in the haystack"
evaluation metrics derived from a confusion matrix
- several evaluation metrics are defined from the cells in a confusion matrix
  - depending on the scientific field
- these metrics often come in pairs
  - overgeneration vs undergeneration

[figure: confusion matrix with TP, FP, FN, TN cells, showing the "relevant" and "predicted" sets]
see https://en.wikipedia.org/wiki/Confusion_matrix
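A minimal sketch of how such a confusion matrix can be computed in scikit-learn; the toy label vectors are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up gold labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions

# For binary labels {0, 1}, rows are the true class and columns the
# predicted class, i.e. [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
```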
precision and recall
- the precision and recall are commonly used for evaluating classifiers

  $P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}$

- in scikit-learn:
  - sklearn.metrics.precision_score
  - sklearn.metrics.recall_score
  - sklearn.metrics.classification_report

[figure: confusion matrix with the "relevant" and "predicted" sets]
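A minimal sketch of how these scikit-learn functions are called, using the same made-up labels as above; by default the scores are computed for the positive class (label 1):

```python
from sklearn.metrics import precision_score, recall_score, classification_report

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up gold labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions

print(precision_score(y_true, y_pred))        # TP / (TP + FP)
print(recall_score(y_true, y_pred))           # TP / (TP + FN)
print(classification_report(y_true, y_pred))  # P, R, F per class
```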
precision / recall tradeoff

[figure: confusion matrices illustrating the precision/recall tradeoff]
F-score
- the F-score is the harmonic mean of the precision and recall

  $F = \frac{2 \cdot P \cdot R}{P + R}$

- in scikit-learn: sklearn.metrics.f1_score
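A minimal sketch, again with made-up labels, showing that the formula and sklearn.metrics.f1_score agree:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up gold labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(2 * p * r / (p + r))        # harmonic mean of P and R
print(f1_score(y_true, y_pred))   # same value
```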
other similar pairs of metrics
- sensitivity/specificity (in medicine)

  $Sen = \frac{TP}{TP + FN} \qquad Spe = \frac{TN}{TN + FP}$

- true positive rate / false positive rate

  $TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$
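These pairs can also be read off the confusion matrix cells; a minimal sketch with made-up binary labels (ravel() gives TN, FP, FN, TP in that order for labels {0, 1}):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up gold labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # = recall = true positive rate
specificity = tn / (tn + fp)
fpr = fp / (fp + tn)           # = 1 - specificity
print(sensitivity, specificity, fpr)
```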
example (continued)
evaluating scorers and rankers
what if it's more important to find the malignant tumors?
- how can we adjust our decisions?
breast cancer example
[figure: scatter plot of the breast cancer data, Clump Thickness (x-axis) against Single Epithelial Cell Size (y-axis)]
ranking and scoring systems
- for instance, search engines
- but also some classifiers (e.g. decision_function, predict_proba)
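A minimal sketch of getting scores rather than hard decisions from a scikit-learn classifier; the synthetic dataset and the choice of LogisticRegression are just for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data
clf = LogisticRegression().fit(X, y)

# (for a real evaluation, the scores would come from held-out data)
scores = clf.decision_function(X)    # real-valued scores
probs = clf.predict_proba(X)[:, 1]   # probability of the positive class
ranking = scores.argsort()[::-1]     # instance indices ranked best-first
```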
intuition: evaluating rankers
evaluating a search engine: quality of the first page
- when I use a search engine, how good are the results on the first page?
- precision at k

  $P@k = \frac{\text{nbr relevant items in top } k}{k}$
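A minimal sketch of precision at k for a ranked result list; the relevance labels (1 = relevant) are made up and ordered from the top result downwards:

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(relevance[:k]) / k

ranked_relevance = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # made-up relevance judgements
print(precision_at_k(ranked_relevance, 5))          # 3 relevant in top 5 -> 0.6
```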
evaluating a search engine (another example)
precision / recall curve
- is there a way to evaluate a ranker or scorer that does not depend on the threshold?
- in a precision/recall curve, we compute the precision for different recall levels
  - that is, for different thresholds
- we visualize the relationship using a plot
  - good = curve is near the top right
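A minimal sketch of computing and plotting such a curve with scikit-learn; the synthetic data and classifier are illustrative only:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data
scores = LogisticRegression().fit(X, y).decision_function(X)

# one precision/recall point per threshold
precision, recall, thresholds = precision_recall_curve(y, scores)
plt.plot(recall, precision)
plt.xlabel('recall')
plt.ylabel('precision')
plt.show()
```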
P / R curve (example)

[figure: a sequence of slides showing the precision/recall curve for the example, precision (y-axis) against recall (x-axis), both from 0.0 to 1.0]
comparing P/R curves
[figure: precision/recall curves for different systems plotted together for comparison]
average precision
[figure: precision/recall curve for the example, summarized by its average precision]
AP = 0.903
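A minimal sketch of computing average precision in scikit-learn (synthetic data; the value will not match the 0.903 from the slide):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data
scores = LogisticRegression().fit(X, y).decision_function(X)

print(average_precision_score(y, scores))   # summarizes the P/R curve in one number
```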
ROC curve (Receiver Operating Characteristic)
[source]
Area Under Curve score
[figure: ROC curve for the example, true positive rate (y-axis) against false positive rate (x-axis)]
AUC = 0.992
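A minimal sketch of computing the ROC curve and its AUC with scikit-learn (synthetic data; the value will not match the 0.992 from the slide):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data
scores = LogisticRegression().fit(X, y).decision_function(X)

fpr, tpr, thresholds = roc_curve(y, scores)   # points on the ROC curve
print(roc_auc_score(y, scores))               # area under that curve
```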
evaluating regressors
example
[figures: regression example, predictions plotted against the data, with a panel showing the errors]
mean squared error and mean absolute error
$MSE = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad MAE = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$

($N$ = number of instances, $\hat{y}_i$ = prediction for instance $i$)
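A minimal sketch using the corresponding scikit-learn functions (the numeric values are made up):

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = [3.0, 5.0, 2.5, 7.0]   # made-up true values
y_pred = [2.5, 5.0, 4.0, 8.0]   # made-up predictions

print(mean_squared_error(y_true, y_pred))    # MSE
print(mean_absolute_error(y_true, y_pred))   # MAE
```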
R²: the coefficient of determination
- the MSE and MAE scores depend on the scale
- they don't directly tell whether the regressor behaves "well"
- the coefficient of determination ($R^2$) is an evaluation score that does not depend on the scale

  $R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$

- in case of a perfect regression, this score is 1
- if completely messed up, the score is 0 or negative
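A minimal sketch using sklearn.metrics.r2_score, with the same made-up values as above:

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 2.5, 7.0]   # made-up true values
y_pred = [2.5, 5.0, 4.0, 8.0]   # made-up predictions

print(r2_score(y_true, y_pred))   # 1.0 for perfect predictions, <= 0 if useless
```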
MSE and R² for the example

[figures: two regression fits for the example; one with MSE = 3298.02, R² = 0.75, the other with MSE = 1106.66, R² = 0.92]
what to do about ordinal scales
- see e.g. Baccianella et al. (2009), Evaluation Measures for Ordinal Regression
detour: training losses and evaluation metrics
regression and classification evaluation

$MSE = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad \text{error} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[y_i \ne \hat{y}_i]$

- for regression, MSE is often used
  - as the quality metric when we evaluate
  - as the loss function when we train
- but typically, for classification what we compute during training and testing is different
  - such as hinge loss or log loss during training
loss functions
[figure: log loss, hinge loss, and zero-one loss plotted as functions of the classifier's margin (x-axis from -3 to 3)]
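A minimal sketch of these three losses as functions of the margin $m = y \cdot \text{score}$ with $y \in \{-1, +1\}$; the exact scaling of the log loss is an assumption (chosen so that it equals 1 at $m = 0$, as in typical versions of this plot):

```python
import numpy as np

def zero_one_loss(m):
    return (m <= 0).astype(float)      # 1 if misclassified, else 0

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)    # used e.g. by linear SVMs

def log_loss(m):
    return np.log2(1.0 + np.exp(-m))   # used e.g. by logistic regression

m = np.linspace(-3, 3, 7)
print(zero_one_loss(m), hinge_loss(m), log_loss(m))
```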
application-specific evaluation: a sample
application-specific quality metrics
- as we discussed, there may be an established evaluation protocol for the task you're working on
- or you might design a new one
humans in the loop
- evaluating with people can be an alternative to automatic evaluation
  - if we don't have data for evaluating automatically
  - or if we want to evaluate something that is hard to measure
- expensive!
- reliable?
- replicable?
to conclude
Next lecture
- Neural networks