Chapter 5: Credibility


Introduction

• Performance on the training set is not a good indicator of performance on an independent set.
• We need to predict performance bounds.
• Quality training data is difficult to obtain---it is not always available in abundance.
• Performance prediction based on limited training data is controversial---the repeated cross-validation technique is the most useful one in these situations.
• The cost of misclassification is also an important criterion.
• Statistical tests are also needed to validate the conclusions.


Training and Testing

• Error rate of a classifier
• The error rate on training data (resubstitution error) is not a good indicator of the error rate on test data!! (overfitting)
• Test data --- data not used in the training phase
• Training data, validation data, test data


Predicting Performance

• % success rate = 100% - % error rate
• Confidence interval: when the test set is not large, we report the resulting error rate (or success rate) together with a confidence interval (see the sketch below).
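A minimal sketch of the kind of confidence interval this slide alludes to, using the normal-approximation (Wilson-style) interval for an observed success rate f on n test instances; the example numbers are made up.

```python
import math

def success_rate_interval(f, n, z=1.96):
    """Normal-approximation (Wilson-style) confidence interval for the true
    success rate, given an observed success rate f on n test instances.
    z = 1.96 corresponds to roughly 95% confidence (an assumption here)."""
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Example: 75% success observed on 1000 test instances
print(success_rate_interval(0.75, 1000))   # roughly (0.72, 0.78)
```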


Cross Validation

• Holdout: hold back 1/3 of the available data for testing and use the remainder for training.
• Stratified holdout: the training data should be a good representative of the overall data---each class should be represented in the same proportion as its size.
• Repeated holdout method---repeat the random selection several times and average the resulting error rates.
• Three-fold cross-validation---divide the data into three equal partitions; make three iterations, each time choosing one of the three partitions (folds) as test data and the other two as training data.
• 10-fold cross-validation: use 9 of the 10 folds to train and the remaining one for testing; the 10 error estimates are averaged to yield an overall error estimate (see the sketch below).
• Sometimes we may repeat the 10-fold cross-validation several times with different random partitions into 10 folds.
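A minimal sketch of stratified 10-fold cross-validation as described above, assuming scikit-learn is available; the iris dataset and the decision-tree classifier are placeholders.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

error_rates = []
for train_idx, test_idx in folds.split(X, y):
    # Train on 9 folds, test on the held-out fold
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    error_rates.append(1.0 - clf.score(X[test_idx], y[test_idx]))

# The 10 per-fold error estimates are averaged into one overall estimate
print("estimated error rate:", np.mean(error_rates))
```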


Other Estimates

• Leave-one-out cross-validation: n-fold cross-validation where n is the number of instances in the dataset. Each instance in turn is left out for testing, and the remaining n-1 instances are used for training. The results of all n judgments are averaged to give the final error estimate.
• Bootstrap error estimation (sampling with replacement): a dataset of n instances is sampled n times, with replacement, to give another dataset of n instances for training. The instances that were never picked for the training set form the test set. This is also referred to as the 0.632 bootstrap---because each instance has a probability of about 0.632 of being chosen for the training set.
– The error estimate obtained over the test set is a pessimistic estimate of the true error rate, because the training set contains only about 63% of the distinct instances, whereas 10-fold cross-validation uses 90% of the data for training.
– The final error rate is computed as E = 0.632 * (error rate over test instances) + 0.368 * (error rate over training instances).
– Repeat the bootstrap procedure several times and average the error rates (see the sketch below).
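A minimal sketch of the 0.632 bootstrap estimate described above, assuming scikit-learn is available; the dataset, classifier, and number of repetitions are placeholders.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(y)
rng = np.random.default_rng(1)

estimates = []
for _ in range(50):                           # repeat the bootstrap and average
    train = rng.integers(0, n, size=n)        # sample n instances with replacement
    test = np.setdiff1d(np.arange(n), train)  # instances never picked form the test set
    clf = DecisionTreeClassifier().fit(X[train], y[train])
    test_err = 1.0 - clf.score(X[test], y[test])
    train_err = 1.0 - clf.score(X[train], y[train])
    estimates.append(0.632 * test_err + 0.368 * train_err)

print("0.632 bootstrap error estimate:", np.mean(estimates))
```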


Comparing Data Mining Methods

• If a new learning algorithm is proposed, its proponents must show that it improves on the state of the art for the problem at hand and demonstrate that the observed improvement is not just a chance effect of the estimation process.
• A technique cannot be thrown out because it does poorly on one dataset; its average performance over different datasets must be considered.
• Determine whether the mean of one set of samples---cross-validation estimates for the various datasets sampled from the domain---is significantly greater than, or significantly less than, the mean of another. Student's t-test and the paired t-test are the preferred tools (see the sketch below).
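A minimal sketch of a paired t-test on matched cross-validation accuracies of two learning schemes, assuming scipy is available; the accuracy values are made up for illustration.

```python
from scipy import stats

# Matched accuracy estimates for two schemes on the same datasets/folds
acc_scheme_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
acc_scheme_b = [0.77, 0.78, 0.80, 0.76, 0.81, 0.75, 0.79, 0.78, 0.82, 0.77]

t_stat, p_value = stats.ttest_rel(acc_scheme_a, acc_scheme_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value suggests the difference in mean accuracy is unlikely to be chance
```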


Predicting Probabilities

• 0-1 loss function
• When a classification is made with a probability, it is not a 0-1 situation.
• Quadratic loss function: if <p1, p2, ..., pk> is the predicted probability vector over the k classes for an instance, and <a1, a2, ..., ak> is the actual outcome vector, in which the entry for the class the instance belongs to is 1 and the rest are 0, then the quadratic loss is Σj (pj - aj)². If i is the correct class (ai = 1), this can be rewritten as 1 - 2pi + Σj pj². When the test set contains several instances, the loss function is summed over all of them.
• Informational loss function: -log2 pi, where the ith class is the correct one (see the sketch below).
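A minimal sketch of the quadratic and informational loss functions for a single instance; the probability vector is made up for illustration.

```python
import math

p = [0.6, 0.3, 0.1]   # predicted class probabilities
a = [1, 0, 0]         # actual outcome vector: the first class is the correct one

quadratic_loss = sum((pj - aj) ** 2 for pj, aj in zip(p, a))
# Equivalent form when class i is correct (a_i = 1): 1 - 2*p_i + sum_j p_j^2
quadratic_loss_alt = 1 - 2 * p[0] + sum(pj ** 2 for pj in p)

informational_loss = -math.log2(p[0])   # -log2 of the probability of the true class

print(quadratic_loss, quadratic_loss_alt, informational_loss)   # 0.26, 0.26, ~0.737
```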


Counting the Cost

• What is the cost of making a wrong decision?
• The cost of missing a threat versus the cost of false positives?
• Confusion matrix: true positives (TP) and true negatives (TN) are correct predictions; false positives (FP) and false negatives (FN) are incorrect ones.
• True positive rate = TP / (TP + FN)
• False positive rate = FP / (FP + TN)
• Overall success rate = (TP + TN) / (TP + TN + FP + FN)
• Error rate = 1 - success rate
• Multiclass prediction---use a confusion matrix with c rows and c columns. In Table 5.4(a) there are 100 instances of class a, 60 of b, and 40 of c. Of these, 88 + 40 + 12 = 140 were correctly predicted---a success rate of 70%. The predictor predicted 120 instances as class a, 60 as b, and 20 as c.
• The question is: is this an intelligent prediction or merely chance? A random predictor that assigns classes in the same 6:3:1 proportions as the learner's predictions gives the results in Table 5.4(b): it gets 82 instances correct, as opposed to 140 by the learning technique. Is this difference significant?
• Kappa statistic: 140 - 82 = 58 extra successes out of a possible total of 200 - 82 = 118, or 49.2%. The maximum value of kappa is 100%. Note that kappa is still not a cost-sensitive measure (see the sketch below).

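A minimal sketch of the kappa calculation for the three-class example above; the totals come from the slide (140 correct by the learner, 200 instances, predicted totals 120/60/20, actual totals 100/60/40).

```python
def kappa(observed_correct, expected_correct, total):
    """Extra successes beyond chance, as a fraction of the maximum possible."""
    return (observed_correct - expected_correct) / (total - expected_correct)

# Chance agreement: a random predictor keeps the predicted totals (120, 60, 20)
# while the actual class totals are (100, 60, 40)
predicted_totals = [120, 60, 20]
actual_totals = [100, 60, 40]
total = 200
expected = sum(p * a / total for p, a in zip(predicted_totals, actual_totals))  # 82

print(kappa(140, expected, total))   # about 0.492, i.e. 49.2%
```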


Cost-sensitive classification

• Benefits of TP and TN; costs of FP and FN.
• Sometimes the cost of running a learning technique may also be taken into account.
• Suppose a predictor assigns a class-a instance to classes a, b, and c with probabilities pa, pb, and pc; under default (unit) misclassification costs, the expected cost of predicting a is pb + pc, i.e. 1 - pa (see the sketch below).
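A minimal sketch of cost-sensitive classification: choose the class with the lowest expected cost, given predicted class probabilities and a cost matrix. The probabilities and the cost matrix below are made up for illustration.

```python
import numpy as np

probs = np.array([0.5, 0.3, 0.2])   # predicted P(a), P(b), P(c) for one instance

# cost[i, j] = cost of predicting class j when the true class is i
cost = np.array([[0, 1, 1],
                 [10, 0, 1],        # misclassifying a true b as a is very costly here
                 [1, 1, 0]])

expected_cost = probs @ cost        # expected cost of each possible prediction
# With the default 0/1 cost matrix this would reduce to 1 - p_j for class j
print(expected_cost)
print("predict class index:", int(np.argmin(expected_cost)))
```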


Cost-sensitive Learning

• Take costs into consideration at training time.
• Generate training data with a different proportion of yes and no instances.
• For example, if we want to avoid errors on the no instances because false positives are penalized 10 times as heavily as false negatives, we can make the number of no instances 10 times the number of yes instances in the training set.
• One way to vary the proportion of training instances is to duplicate instances in the training dataset (see the sketch below).
• Another is to assign weights to different instances and build cost-sensitive trees.
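A minimal sketch of rebalancing a training set by duplicating instances, as described above; the tiny dataset and the 10:1 ratio are illustrative.

```python
import numpy as np

def oversample(X, y, target_class, factor):
    """Duplicate every instance of target_class so it appears `factor` times."""
    idx = np.where(y == target_class)[0]
    extra = np.repeat(idx, factor - 1)            # factor-1 additional copies
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

X = np.array([[1.0], [2.0]])
y = np.array([0, 1])                              # 0 = "no", 1 = "yes"
X_new, y_new = oversample(X, y, target_class=0, factor=10)
print(np.bincount(y_new))                         # [10  1]: "no" outnumbers "yes" 10 to 1
```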


Lift Charts

• Lift factor
• Table 5.6: 150 instances, of which 50 are yes and 100 are no, so the overall success rate is 33%. The 150 instances are sorted by the probability of yes predicted by the learning scheme. For example, for instance 1 the learning scheme predicts a success probability of 0.95; for the next one, 0.93; and so on. When the actual class is no but the technique predicts yes, it is a false positive.
• If we were to choose only 10 samples, we take the topmost 10. Of these, 2 are actually negative, so the success rate would be 80%. Compared with the overall average success rate of 33%, this is a lift factor of 80/33, or about 2.4: lift = (TPs/Ns) / (TPt/Nt) (see the sketch below).
• Lift chart: Figure 5.1---% sample size (proportion of the total test data) vs. number of respondents. The diagonal shows the expected number of respondents if a random sample is taken; the upper curve shows a more intelligent choice of samples.

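A minimal sketch of the lift-factor formula with the numbers from the example above (8 of the top 10 are positive, versus 50 positives in all 150 instances).

```python
def lift_factor(tp_sample, n_sample, tp_total, n_total):
    """Lift = (TPs/Ns) / (TPt/Nt): the sample's success rate over the overall rate."""
    return (tp_sample / n_sample) / (tp_total / n_total)

print(lift_factor(8, 10, 50, 150))   # 0.8 / 0.333... = 2.4
```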


ROC Curves

• Receiver operating characteristic---how does a signal receiver respond to noise plus signal?
• ROC curves depict the performance of a sample: the % of positives in the sample relative to all positives in the test data, plotted against the % of negatives in the sample relative to all negatives in the test data (see the sketch below).
• Figure 5.3: ROC curves for two learning methods.
• By combining both techniques with a weighting factor, we can get the best of both: the top of the convex hull.
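A minimal sketch of computing ROC-curve points by ranking test instances by their predicted probability of being positive; the scores and labels are made up for illustration.

```python
import numpy as np

scores = np.array([0.95, 0.93, 0.90, 0.70, 0.60, 0.40, 0.30, 0.10])
labels = np.array([1,    1,    0,    1,    0,    1,    0,    0])    # 1 = positive

order = np.argsort(-scores)              # take instances in decreasing score order
tp = np.cumsum(labels[order] == 1)       # positives included so far
fp = np.cumsum(labels[order] == 0)       # negatives included so far
tpr = tp / labels.sum()                  # fraction of all positives captured
fpr = fp / (len(labels) - labels.sum())  # fraction of all negatives included

for x, y in zip(fpr, tpr):
    print(f"FPR={x:.2f}  TPR={y:.2f}")
```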


Recall-precision curves

• Example: A1 locates 100 documents, of which 20 are relevant; A2 locates 400 documents, of which 80 are relevant. Which one is better? It depends on the costs of false positives and false negatives.
• Recall = # of documents retrieved that are relevant / total # of documents that are relevant (e.g., if the total # of relevant documents is 100, then A1: recall = 0.2; A2: recall = 0.8).
• Precision = # of documents retrieved that are relevant / total # of documents retrieved (e.g., A1: precision = 20/100 = 0.2; A2: precision = 80/400 = 0.2) (see the sketch below).
• Summary: Table 5.7, page 172.
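A minimal sketch of recall and precision for the A1/A2 example above, assuming 100 relevant documents exist in total.

```python
def recall(relevant_retrieved, total_relevant):
    return relevant_retrieved / total_relevant

def precision(relevant_retrieved, total_retrieved):
    return relevant_retrieved / total_retrieved

print("A1: recall =", recall(20, 100), " precision =", precision(20, 100))   # 0.2, 0.2
print("A2: recall =", recall(80, 100), " precision =", precision(80, 400))   # 0.8, 0.2
```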


Cost curves

• A single classifier corresponds to a straight line that shows how its performance varies as the class distribution changes. Works best with two classes.
• Fig. 5.4(a): expected error plotted against the probability of one of the classes (+ and -).
• If p(+) < 0.2, always predicting - is better than method A; if p(+) > 0.65, always predicting + is better than A (see the sketch below).
• Figure 5.4(b): taking costs into consideration.
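A minimal sketch, under assumed error rates, of the straight line a single classifier traces on an error-vs-p(+) plot like Fig. 5.4(a): expected error = FN rate * p(+) + FP rate * (1 - p(+)). The FN and FP rates below are made up; the trivial "always -" and "always +" predictors have error p(+) and 1 - p(+) respectively.

```python
import numpy as np

fn_rate = 0.4   # assumed fraction of positives the classifier misclassifies
fp_rate = 0.1   # assumed fraction of negatives the classifier misclassifies

for p_plus in np.linspace(0.0, 1.0, 11):
    clf_error = fn_rate * p_plus + fp_rate * (1 - p_plus)
    always_minus = p_plus          # errs on every positive instance
    always_plus = 1 - p_plus       # errs on every negative instance
    best_trivial = min(always_minus, always_plus)
    print(f"p(+)={p_plus:.1f}  classifier={clf_error:.2f}  best trivial={best_trivial:.2f}")
```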


Evaluating Numeric Prediction

• Applies to numeric prediction (not nominal values).
• Metrics:
– Mean squared error (MSE)
– Root mean squared error (RMSE)
– Mean absolute error (MAE)
– Relative squared error---relative to what the error would have been if a simple predictor had been used, say the average value of the training data
– Relative absolute error
– Correlation coefficient---the statistical correlation between the actual and predicted values
• Table 5.8
• Choose the classifier that gives the best results in terms of the chosen metric (see the sketch below).
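A minimal sketch of the numeric-prediction metrics listed above; the actual and predicted values are made up, and the "simple predictor" baseline here is taken to be the mean of the actual values.

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
predicted = np.array([2.5, 5.5, 2.0, 6.0, 5.0])

mse = np.mean((predicted - actual) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(predicted - actual))

# Relative errors compare against a simple predictor that always outputs the mean
baseline = np.mean(actual)
rse = np.sum((predicted - actual) ** 2) / np.sum((baseline - actual) ** 2)
rae = np.sum(np.abs(predicted - actual)) / np.sum(np.abs(baseline - actual))

corr = np.corrcoef(actual, predicted)[0, 1]

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} RSE={rse:.3f} RAE={rae:.3f} r={corr:.3f}")
```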