Evaluation of Learning Models

Literature:
T. Mitchell, Machine Learning, chapter 5
I.H. Witten and E. Frank, Data Mining, chapter 5


Fayyad's KDD Methodology

[Figure: the KDD process. Selection → Preprocessing & cleaning → Transformation & feature selection → Data Mining → Interpretation/Evaluation, turning data into target data, processed data, transformed data, patterns, and finally knowledge.]


Overview of the lecture

• Evaluating hypotheses (errors, accuracy)
• Comparing hypotheses
• Comparing learning algorithms (hold-out methods)
• Performance measures
• Varia (Occam's razor, a warning)
• No free lunch


Evaluating Hypotheses: two definitions of error

The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

error_D(h) ≡ Pr_{x ∈ D}[ f(x) ≠ h(x) ]


Two definitions of error (2)

The sample error of hypothesis h with respect to target function f and data sample S is the proportion of examples h misclassifies:

error_S(h) ≡ (1/n) Σ_{x ∈ S} δ(f(x) ≠ h(x))

where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise, and n = |S|.
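The sample error is just the misclassification rate on S. A minimal sketch in Python, where f, h, and S are hypothetical stand-ins for the target function, the hypothesis, and the sample:

```python
def sample_error(f, h, S):
    """Fraction of examples in S that hypothesis h misclassifies."""
    return sum(1 for x in S if f(x) != h(x)) / len(S)

# Hypothetical target concept and learned hypothesis on integer inputs:
f = lambda x: x >= 5   # target function
h = lambda x: x >= 6   # hypothesis that gets the boundary slightly wrong
S = list(range(10))
print(sample_error(f, h, S))  # h misclassifies only x = 5, so 0.1
```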


Two definitions of error (3)

How well does error_S(h) estimate error_D(h)?


Problems estimating error

1. Bias: if S is the training set, error_S(h) is optimistically biased:

bias ≡ E[error_S(h)] − error_D(h)

For an unbiased estimate, h and S must be chosen independently.

2. Variance: even with an unbiased S, error_S(h) may still vary from error_D(h).


Example

Hypothesis h misclassifies 12 of the 40 examples in S:

error_S(h) = 12/40 = 0.30

What is error_D(h)?


Estimators

Experiment:
1. Choose sample S of size n according to distribution D.
2. Measure error_S(h).

error_S(h) is a random variable (i.e., the result of an experiment).
error_S(h) is an unbiased estimator for error_D(h).
Given an observed error_S(h), what can we conclude about error_D(h)?


Confidence intervals

If
• S contains n examples, drawn independently of h and of each other
• n ≥ 30

then with approximately 95% probability, error_D(h) lies in the interval

error_S(h) ± 1.96 √( error_S(h)(1 − error_S(h)) / n )
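Plugging the earlier example (12 of 40 examples misclassified) into this interval is straightforward; a small sketch:

```python
import math

def error_interval(error_s, n, z=1.96):
    """Approximate confidence interval for error_D(h), given error_S(h) on n examples."""
    margin = z * math.sqrt(error_s * (1 - error_s) / n)
    return (error_s - margin, error_s + margin)

# error_S(h) = 12/40 = 0.30 from the earlier example:
lo, hi = error_interval(0.30, 40)
print(f"({lo:.3f}, {hi:.3f})")  # approximately (0.158, 0.442)
```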


Confidence intervals (2)

If
• S contains n examples, drawn independently of h and of each other
• n ≥ 30

then with approximately N% probability, error_D(h) lies in the interval

error_S(h) ± z_N √( error_S(h)(1 − error_S(h)) / n )

where

N%:  50%   68%   80%   90%   95%   98%   99%
z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58


error_S(h) is a random variable

Rerun the experiment with different randomly drawn S (of size n).

The probability of observing r misclassified examples is

P(r) = ( n! / (r!(n − r)!) ) · error_D(h)^r · (1 − error_D(h))^(n−r)


Binomial probability distribution

Probability P(r) of r heads in n coin flips, if p = Pr(heads):

P(r) = ( n! / (r!(n − r)!) ) · p^r (1 − p)^(n−r)

[Figure: the binomial distribution for n = 10 and p = 0.3]
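The plotted distribution can be reproduced directly from the formula; a quick sketch using Python's math.comb for the binomial coefficient:

```python
from math import comb

def binom_pmf(r, n, p):
    """P(r): probability of exactly r heads (or misclassifications) in n trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# The plotted case, n = 10 and p = 0.3; the mode is at r = 3:
print(round(binom_pmf(3, 10, 0.3), 4))  # 0.2668
# Sanity check: the probabilities over r = 0 .. n sum to 1.
print(round(sum(binom_pmf(r, 10, 0.3) for r in range(11)), 10))  # 1.0
```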


Binomial probability distribution (2)

Expected, or mean, value of X:

E[X] = Σ_{i=0}^{n} i · P(i) = np

Variance of X:

Var(X) = E[(X − E[X])²] = np(1 − p)

Standard deviation of X:

σ_X = √( E[(X − E[X])²] ) = √( np(1 − p) )


Normal distribution approximates binomial

error_S(h) follows a binomial distribution, with

mean μ_{error_S(h)} = error_D(h)
standard deviation σ_{error_S(h)} = √( error_D(h)(1 − error_D(h)) / n )

Approximate this by a Normal distribution with

mean μ_{error_S(h)} = error_D(h)
standard deviation σ_{error_S(h)} ≈ √( error_S(h)(1 − error_S(h)) / n )


Normal probability distribution

p(x) = ( 1 / √(2πσ²) ) · e^( −(1/2)((x − μ)/σ)² )

The probability that X falls into the interval (a, b) is given by

∫_a^b p(x) dx

Expected, or mean, value of X: E[X] = μ
Variance of X: Var(X) = σ²
Standard deviation of X: σ_X = σ


Normal probability distribution (2)

80% of the area (probability) lies in μ ± 1.28σ
N% of the area (probability) lies in μ ± z_N σ

N%:  50%   68%   80%   90%   95%   98%   99%
z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58


Confidence intervals, more correctly

If
• S contains n examples, drawn independently of h and of each other
• n ≥ 30

then with approximately 95% probability, error_S(h) lies in the interval

error_D(h) ± 1.96 √( error_D(h)(1 − error_D(h)) / n )

and error_D(h) approximately lies in the interval

error_S(h) ± 1.96 √( error_S(h)(1 − error_S(h)) / n )


Central Limit Theorem

Consider a set of independent, identically distributed random variables Y_1 … Y_n, all governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean

Ȳ ≡ (1/n) Σ_{i=1}^{n} Y_i

Central Limit Theorem: as n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean μ and variance σ²/n.
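The theorem can be illustrated by simulation; a sketch, assuming uniform(0, 1) variables (mean 1/2, variance 1/12) as the underlying distribution:

```python
import random
import statistics

random.seed(0)
n, trials = 100, 2000
# Each trial: the sample mean of n i.i.d. uniform(0, 1) draws.
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(trials)]

print(round(statistics.fmean(means), 2))         # close to mu = 0.5
print(round(statistics.variance(means) * n, 2))  # close to sigma^2 = 1/12, about 0.08
```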


Comparing Hypotheses: difference between hypotheses

Test h_1 on sample S_1, test h_2 on S_2.

1. Pick the parameter to estimate:

d ≡ error_D(h_1) − error_D(h_2)

2. Choose an estimator:

d̂ ≡ error_S1(h_1) − error_S2(h_2)

3. Determine the probability distribution that governs the estimator:

σ_d̂ ≈ √( error_S1(h_1)(1 − error_S1(h_1)) / n_1 + error_S2(h_2)(1 − error_S2(h_2)) / n_2 )


Difference between hypotheses (2)

4. Find the interval (L, U) such that N% of the probability mass falls in the interval:

d̂ ± z_N √( error_S1(h_1)(1 − error_S1(h_1)) / n_1 + error_S2(h_2)(1 − error_S2(h_2)) / n_2 )
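Steps 1 to 4 combine into a short computation; a sketch with hypothetical sample errors for h_1 and h_2:

```python
import math

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Interval for d = error_D(h1) - error_D(h2), from the two sample errors."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return (d_hat - z * sigma, d_hat + z * sigma)

# Hypothetical numbers: h1 errs 30% on 100 test examples, h2 errs 20% on 100.
lo, hi = difference_interval(0.30, 100, 0.20, 100)
print(f"({lo:.3f}, {hi:.3f})")  # (-0.019, 0.219): contains 0, not significant at 95%
```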


Paired t test to compare h_A, h_B

1. Partition the data into k disjoint test sets T_1, T_2, …, T_k of equal size, where this size is at least 30.

2. For i from 1 to k, do:

δ_i ← error_Ti(h_A) − error_Ti(h_B)

3. Return the value δ̄, where

δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_i


Paired t test to compare h_A, h_B (2)

N% confidence interval estimate for d:

δ̄ ± t_{N, k−1} · s_δ̄

where

s_δ̄ ≡ √( (1 / (k(k − 1))) Σ_{i=1}^{k} (δ_i − δ̄)² )

Note: δ̄ is approximately Normally distributed.
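The procedure above is easy to code; a sketch, where the per-fold differences and the looked-up t value are hypothetical inputs:

```python
import math

def paired_t_interval(deltas, t_value):
    """Confidence interval for d from paired per-fold differences.

    deltas:  delta_i = error_Ti(hA) - error_Ti(hB) for each of the k test sets.
    t_value: t_{N, k-1}, looked up for the desired confidence level N.
    """
    k = len(deltas)
    mean = sum(deltas) / k
    s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
    return (mean - t_value * s, mean + t_value * s)

# Hypothetical differences over k = 5 folds; t for 95% with 4 degrees of freedom is 2.776.
lo, hi = paired_t_interval([0.02, 0.05, 0.01, 0.04, 0.03], 2.776)
print(f"({lo:.4f}, {hi:.4f})")  # (0.0104, 0.0496): above 0, so hA errs more than hB
```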


Comparing learning algorithms: L_A and L_B

What we'd like to estimate:

E_{S ⊂ D}[ error_D(L_A(S)) − error_D(L_B(S)) ]

where L(S) is the hypothesis output by learner L using training set S.

I.e., the expected difference in true error between the hypotheses output by learners L_A and L_B, when trained using randomly selected training sets S drawn according to distribution D.


Comparing learning algorithms L_A and L_B (2)

But, given limited data D_0, what is a good estimator?

We could partition D_0 into training set S_0 and test set T_0, and measure

error_T0( L_A(S_0) ) − error_T0( L_B(S_0) )

Even better: repeat this many times and average the results.


Comparing learning algorithms L_A and L_B (3): k-fold cross-validation

1. Partition data D_0 into k disjoint test sets T_1, T_2, …, T_k of equal size, where this size is at least 30.

2. For i from 1 to k, do: use T_i as the test set, and the remaining data as the training set S_i.

3. Return the average of the errors on the test sets.
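A minimal sketch of this loop; train_fn and error_fn are hypothetical hooks (any learner that returns a hypothesis, and any error measure), and the toy learner below simply predicts the majority class of its training set:

```python
import random

def k_fold_error(data, k, train_fn, error_fn):
    """Average test error over k folds. train_fn(train_set) returns a
    hypothesis; error_fn(h, test_set) returns its error rate on the test set."""
    folds = [data[i::k] for i in range(k)]  # k disjoint test sets T_1 ... T_k
    total = 0.0
    for i in range(k):
        train_set = [x for j in range(k) if j != i for x in folds[j]]
        total += error_fn(train_fn(train_set), folds[i])
    return total / k

# Toy run on shuffled (feature, label) pairs:
random.seed(1)
data = [(x, x < 60) for x in range(90)]  # 60 positive, 30 negative labels
random.shuffle(data)
majority = lambda tr: sum(y for _, y in tr) * 2 > len(tr)  # constant hypothesis
err = lambda h, te: sum(1 for _, y in te if y != h) / len(te)
print(k_fold_error(data, 3, majority, err))  # the 30 negatives give average error 1/3
```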


Practical Aspects: a note on parameter tuning

It is important that the test data is not used in any way to create the classifier.

Some learning schemes operate in two stages:
• Stage 1: build the basic structure
• Stage 2: optimize parameter settings

The test data cannot be used for parameter tuning! The proper procedure uses three sets: training data, validation data, and test data. The validation data is used to optimize parameters.


Holdout estimation, stratification

What shall we do if the amount of data is limited?
• The holdout method reserves a certain amount for testing and uses the remainder for training. Usually: one third for testing, the rest for training.
• Problem: the samples might not be representative. Example: a class might be missing in the test data.
• An advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets.
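A sketch of a stratified holdout split; the (features, label) pair representation is an assumption made here for illustration:

```python
import random
from collections import defaultdict

def stratified_holdout(data, test_fraction=1/3, seed=0):
    """Holdout split keeping class proportions roughly equal in both subsets.
    data: list of (features, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example in data:
        by_class[example[1]].append(example)  # group examples by class label
    train, test = [], []
    for examples in by_class.values():
        rng.shuffle(examples)
        cut = round(len(examples) * test_fraction)  # per-class test share
        test.extend(examples[:cut])
        train.extend(examples[cut:])
    return train, test

data = [(x, "pos" if x % 3 else "neg") for x in range(90)]  # 60 pos, 30 neg
train, test = stratified_holdout(data)
print(len(test), sum(1 for _, y in test if y == "neg"))  # 30 test examples, 10 negative
```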


More on cross-validation

• The standard method for evaluation is stratified ten-fold cross-validation.
• Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate; there is also some theoretical evidence for this.
• Stratification reduces the estimate's variance.
• Even better: repeated stratified cross-validation, e.g. ten-fold cross-validation repeated ten times with the results averaged (this further reduces the variance).


Estimation of the accuracy of a learning algorithm

• 10-fold cross-validation gives a pessimistic estimate of the accuracy of the hypothesis built on all the training data, provided that the law "the more training data, the better" holds.
• For model selection, 10-fold cross-validation often works fine.
• Another method is leave-one-out, or jackknife (N-fold cross-validation with N = training set size).
• The standard deviation is also essential for comparing learning algorithms.


Performance Measures: issues in evaluation

• Statistical reliability of estimated differences in performance
• Choice of performance measure:
  - number of correct classifications
  - accuracy of probability estimates
  - error in numeric predictions
• Costs assigned to different types of errors: many practical applications involve costs


Counting the costs

In practice, different types of classification errors often incur different costs.

Examples:
• Predicting when cows are in heat ("in estrus"): "not in estrus" is correct 97% of the time
• Loan decisions
• Oil-slick detection
• Fault diagnosis
• Promotional mailing


Taking costs into account

The confusion matrix:

                       predicted class
                       yes              no
actual class    yes    true positive    false negative
                no     false positive   true negative

There are many other types of costs, e.g. the costs of collecting training data.
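Counting the four cells is a one-liner each; a sketch on hypothetical boolean labels (True meaning "yes"):

```python
def confusion_matrix(actual, predicted):
    """Counts (TP, FN, FP, TN) for a binary yes/no classification task."""
    tp = sum(a and p for a, p in zip(actual, predicted))          # yes predicted yes
    fn = sum(a and not p for a, p in zip(actual, predicted))      # yes predicted no
    fp = sum(not a and p for a, p in zip(actual, predicted))      # no predicted yes
    tn = sum(not a and not p for a, p in zip(actual, predicted))  # no predicted no
    return tp, fn, fp, tn

# Hypothetical labels:
actual    = [True, True, True, False, False, False]
predicted = [True, False, True, True, False, False]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 2)
```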


Lift charts

• In practice, costs are rarely known.
• Decisions are usually made by comparing possible scenarios.
• Example: promotional mailout
  - Situation 1: classifier predicts that 0.1% of all households will respond
  - Situation 2: classifier predicts that 0.4% of the 10000 most promising households will respond
• A lift chart allows for a visual comparison.


Generating a lift chart

Instances are sorted according to their predicted probability of being a true positive:

Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
…      …                       …

In a lift chart, the x axis is the sample size and the y axis is the number of true positives.
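The chart's points follow directly from the sorted table; a sketch using the four rows above as data:

```python
def lift_chart_points(predictions):
    """Cumulative true positives after sorting by predicted probability.

    predictions: list of (predicted_probability, actual_is_positive) pairs.
    Returns (sample_size, true_positives) points for the lift chart.
    """
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    points, tp = [], 0
    for i, (_, positive) in enumerate(ranked, start=1):
        tp += positive
        points.append((i, tp))
    return points

# The table above as data:
preds = [(0.95, True), (0.93, True), (0.93, False), (0.88, True)]
print(lift_chart_points(preds))  # [(1, 1), (2, 2), (3, 2), (4, 3)]
```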


A hypothetical lift chart

[Figure: a hypothetical lift chart]


Probabilities, reliability

• In order to generate a lift chart, we need information indicating that one classification is more probably/reliably of the class of interest than another classification.
• The Naïve Bayes classifier, and also nearest-neighbor classifiers, can output such information.
• If we start to fill our subset with the most probable examples, the subset will contain a larger proportion of the desired elements.


Summary of measures


Varia: model selection criteria

Model selection criteria attempt to find a good compromise between:
A. the complexity of a model
B. its prediction accuracy on the training data

Reasoning: a good model is a simple model that achieves high accuracy on the given data.

This is also known as Occam's razor: the best theory is the smallest one that describes all the facts.


Warning

Suppose you are gathering hypotheses that each have a probability of 95% of having an error level below 10%.

What if you have found 100 hypotheses satisfying this condition?

Then the probability that all of them have an error below 10% is equal to (0.95)^100 ≈ 0.0059, corresponding to about 0.6%. So the probability of having at least one hypothesis with an error above 10% is about 99.4%!
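The arithmetic can be checked directly:

```python
# Probability that all 100 hypotheses have error below 10%, given that each one
# individually has a 95% chance of that (and assuming independence):
p_all_good = 0.95 ** 100
p_at_least_one_bad = 1 - p_all_good
print(f"{p_all_good:.4f}")          # 0.0059
print(f"{p_at_least_one_bad:.1%}")  # 99.4%
```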


No Free Lunch!!

Theorem (no free lunch). For any two learning algorithms L_A and L_B, the following is true, independently of the sampling distribution and the number of training instances n:

Uniformly averaged over all target functions F,

E(error_S(h_A) | F, n) = E(error_S(h_B) | F, n)

The same holds for a fixed training set D.

See: Mitchell, ch. 2.


No Free Lunch (2)

Sketch of proof: if all functions are possible, then for each function F for which L_A outperforms L_B, a function F′ can be found for which the opposite conclusion holds.

Conclusion: without assumptions on the underlying functions and the hypothesis space, no generalization is possible!

In order to generalize, machine learning algorithms need some kind of bias, an inductive bias: either not all possible functions are in the hypothesis space (restriction bias), or not all possible functions will be found because of the search strategy (preference or search bias) (Mitchell, ch. 2, 3).


No free lunch (3)

The other way around: for any ML algorithm there exist data sets on which it performs well, and there exist data sets on which it performs badly!

We hope that the latter sets do not occur too often in real life.


Summary of notions
• True error, sample error
• Bias of sample error
• Accuracy, confidence intervals
• Central limit theorem
• Paired t test
• k-fold cross-validation, leave-one-out, holdout
• Stratification
• Training set, validation set, and test set
• Confusion matrix, TP, FP, TN, FN
• Lift chart, ROC curve, recall-precision curve
• Occam's razor
• No free lunch
• Inductive bias; restriction bias; search bias