Analyzing Road Side Breath Test Data with WEKA
![Page 1: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/1.jpg)
ANALYZING ROAD-SIDE BREATH TEST DATA
![Page 2: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/2.jpg)
GROUP MEMBERS
• Micheal Abaho
• Yogesh Shinde
• Natasha Thakur
• Mingyang Chen
• Huw Fulcher
• Kai Wang
![Page 3: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/3.jpg)
OBJECTIVE
• To understand how attributes explain intoxication in pulled-over drivers
• Analyze the dataset
• Determine which attributes to classify intoxication with
• Perform classification using the dataset
• Assess the success of classification in explaining intoxication
![Page 4: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/4.jpg)
DATASET
• Acquired from data.gov.uk
• 2014 data on roadside breath tests
• Approximately 300,000 records
![Page 5: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/5.jpg)
ATTRIBUTES
• Reason for test: Suspicion of Alcohol, Road Traffic Collision, Moving Traffic Violation and Other
• Month: Jan to Dec
• Year: 2014
• Week Type: Weekday and Weekend
• Time Band: 12am-4am, 4am-8am, 8am-12pm, 12pm-4pm, 4pm-8pm, 8pm-12am and Unknown
• Age Band for Drivers: 16-19, 20-24, 25-29, 30-39, 40-49, 50-59, 60-69, 70-98 and Other
• Gender for Drivers: Male and Female
• Breath Alcohol Level
![Page 6: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/6.jpg)
EXPLORATORY ANALYSIS
![Page 7: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/7.jpg)
PRE-PROCESSING DATA
• Removing year
• Removing outliers
• Creating decision variable
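The "decision variable" step can be sketched as follows. The slides do not say which cut-off was used, so the threshold here is an assumption: 35 micrograms of alcohol per 100 ml of breath, the legal limit in England and Wales. The field names (`BreathAlcoholLevel`, `Intoxicated`) are hypothetical as well.

```python
# Assumed cut-off: the slides do not state the threshold that was used.
# 35 micrograms per 100 ml of breath is the legal limit in England and Wales
# and serves here only as an illustrative default.
ASSUMED_LIMIT = 35

def add_decision_variable(records, limit=ASSUMED_LIMIT):
    """Return copies of the records without 'Year' and with an 'Intoxicated' label."""
    labelled = []
    for record in records:
        # Year is constant (2014) in this dataset, so it carries no information.
        row = {key: value for key, value in record.items() if key != "Year"}
        row["Intoxicated"] = "Yes" if row["BreathAlcoholLevel"] > limit else "No"
        labelled.append(row)
    return labelled

# Two made-up records, for illustration only.
sample = [
    {"Reason": "Suspicion of Alcohol", "Year": 2014, "BreathAlcoholLevel": 52},
    {"Reason": "Road Traffic Collision", "Year": 2014, "BreathAlcoholLevel": 0},
]
labelled_sample = add_decision_variable(sample)
```

Turning the numeric reading into a Yes/No label is what lets the later slides treat the problem as binary classification.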
![Page 8: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/8.jpg)
REASON*
Intoxicated = 0.0735 * Reason=Suspicion of Alcohol + 0.0365 * Reason=Other + -0.0428 * Reason=Moving Traffic Violation + 0.1132
![Page 9: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/9.jpg)
MONTH
Intoxicated = -0.0453 * Month=Jan + -0.0224 * Month=Feb + -0.0173 * Month=Mar + -0.0147 * Month=Apr + -0.0086 * Month=May + -0.0952 * Month=Jun + -0.0189 * Month=Jul + -0.013 * Month=Sep + -0.0179 * Month=Oct + -0.0295 * Month=Nov + -0.1249 * Month=Dec + 0.1669
![Page 10: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/10.jpg)
WEEKTYPE
![Page 11: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/11.jpg)
TIMEBAND*
Intoxicated = 0.1009 * TimeBand=12am-4am + 0.0733 * TimeBand=4am-8am + -0.0368 * TimeBand=4pm-8pm + -0.0539 * TimeBand=12pm-4pm + -0.0598 * TimeBand=8am-12pm + 0.118
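To read an equation of this form, treat each `TimeBand=...` term as an indicator that is 1 when the test falls in that band and 0 otherwise. The sketch below evaluates the TimeBand equation for a single band. It assumes that Intoxicated is coded 1 for Yes and 0 for No (the slides do not state the coding), and that bands missing from the equation contribute only the intercept.

```python
# Coefficients copied from the TimeBand equation above.
TIMEBAND_COEFFICIENTS = {
    "12am-4am": 0.1009,
    "4am-8am": 0.0733,
    "4pm-8pm": -0.0368,
    "12pm-4pm": -0.0539,
    "8am-12pm": -0.0598,
}
INTERCEPT = 0.118

def predicted_intoxication(time_band):
    """Intercept plus the coefficient of the one indicator that is switched on."""
    # Bands that do not appear in the equation add nothing to the intercept.
    return INTERCEPT + TIMEBAND_COEFFICIENTS.get(time_band, 0.0)

for band in ["12am-4am", "8am-12pm", "8pm-12am"]:
    print(band, round(predicted_intoxication(band), 4))
```

Under that coding, the value can be read as the estimated share of intoxicated drivers among those tested in the band.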
![Page 12: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/12.jpg)
AGE* + GENDER*
![Page 13: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/13.jpg)
CLASSIFICATION OF THE DATASET
![Page 14: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/14.jpg)
EVALUATION MEASURE
• A classifier predicts all data instances of a dataset as either positive or negative.
• This classification (or prediction) produces four outcomes – true positive, true negative, false positive and false negative.
![Page 15: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/15.jpg)
WHAT ARE TP, FP, FN AND TN?
• True Positive (TP): an instance that is correctly predicted to belong to the class.
• True Negative (TN): an instance that is correctly predicted to not belong to the class.
• False Positive (FP): an instance that is incorrectly predicted to belong to the class.
• False Negative (FN): an instance that is incorrectly predicted to not belong to the class.
![Page 16: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/16.jpg)
CONFUSION MATRIX
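• A confusion matrix is a two-by-two table formed by counting how many instances fall into each of the four outcomes of a classifier: TP, FN, FP and TN.
• Rows are the observed classes and columns are the predicted ("classified as") classes.

| Observed \ Predicted | Class A | Class B |
| --- | --- | --- |
| Class A | TP | FN |
| Class B | FP | TN |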
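A minimal sketch of how the four counts are obtained from observed and predicted labels. The label lists are toy data made up for illustration, and `confusion_counts` is a hypothetical helper, not a WEKA function.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FN, FP, TN with one class treated as the positive class."""
    tp = fn = fp = tn = 0
    for observed, guess in zip(actual, predicted):
        if observed == positive and guess == positive:
            tp += 1  # positive, and predicted positive
        elif observed == positive:
            fn += 1  # positive, but predicted negative
        elif guess == positive:
            fp += 1  # negative, but predicted positive
        else:
            tn += 1  # negative, and predicted negative
    return tp, fn, fp, tn

# Toy labels, not taken from the breath test dataset.
actual = ["Yes", "No", "No", "Yes", "No", "No"]
predicted = ["Yes", "No", "Yes", "No", "No", "No"]
tp, fn, fp, tn = confusion_counts(actual, predicted, positive="Yes")
print(f"{tp} {fn}\n{fp} {tn}")
```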
![Page 17: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/17.jpg)
MEASURES FROM THE CONFUSION MATRIX
• Error rate (ERR) is calculated as the number of all incorrect predictions divided by the total number of instances in the dataset.
• The best error rate is 0.0, whereas the worst is 1.0.
• Accuracy (ACC) is calculated as the number of all correct predictions divided by the total number of instances in the dataset.
• The best accuracy is 1.0, whereas the worst is 0.0.
![Page 18: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/18.jpg)
• True positive rate (TPR), also called sensitivity, is calculated as the number of correct positive predictions divided by the total number of positives.
• The best sensitivity is 1.0, whereas the worst is 0.0.
• False positive rate (FPR) is calculated as the number of incorrect positive predictions divided by the total number of negatives.
• The best false positive rate is 0.0 whereas the worst is 1.0.
![Page 19: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/19.jpg)
• Precision (PREC) is calculated as the number of correct positive predictions divided by the total number of positive predictions.
• The best precision is 1.0, whereas the worst is 0.0.
• Recall is the proportion of actual positives that were predicted positive; it is the same quantity as the true positive rate.
• F-measure is the harmonic mean of precision and recall.
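The measures defined on these slides can be computed directly from the four counts. The sketch below uses the counts reported on the regression results slide of this deck, treating "No" as the positive class; `metrics` is a hypothetical helper name, and no guard against empty classes is included.

```python
def metrics(tp, fn, fp, tn):
    """Evaluation measures derived from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # same quantity as the true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "tpr": recall,
        "fpr": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Counts from the regression results slide, with "No" as the positive class.
no_class = metrics(tp=285403, fn=773, fp=36923, tn=456)
for name, value in no_class.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and error rate always sum to 1, which is a quick sanity check on any reported pair.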
![Page 20: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/20.jpg)
J48
![Page 21: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/21.jpg)
CLASSIFICATION BASED ON TREES (J48)
• J48 is WEKA's open-source Java implementation of the C4.5 algorithm.
• C4.5 is a program that creates a decision tree from a set of labelled input data.
• It first constructs a very large tree by considering all attribute values, then narrows down the decision rules by pruning.
• Pruning reduces the size of a decision tree by removing sections of the tree that provide little power to classify instances.
• An entropy-based measure, information gain (normalized as the gain ratio in C4.5), is used to pick the best attribute on which to split each node.
• The resulting tree has a root node, intermediate nodes and leaf nodes: each internal node holds a decision (a test on an attribute), and following those decisions down to a leaf gives the predicted class.
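To make the split criterion concrete, here is a minimal sketch of entropy and plain information gain on a toy table. It is not J48's code: C4.5 normalizes the gain into a gain ratio, which is left out here, and the four rows are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Toy rows, not taken from the breath test dataset.
toy = [
    {"TimeBand": "12am-4am", "Gender": "Male", "Intoxicated": "Yes"},
    {"TimeBand": "12am-4am", "Gender": "Female", "Intoxicated": "Yes"},
    {"TimeBand": "8am-12pm", "Gender": "Male", "Intoxicated": "No"},
    {"TimeBand": "8am-12pm", "Gender": "Female", "Intoxicated": "No"},
]
gain_time = information_gain(toy, "TimeBand", "Intoxicated")
gain_gender = information_gain(toy, "Gender", "Intoxicated")
```

In this toy table TimeBand separates the classes perfectly while Gender tells us nothing, so a tree builder would split on TimeBand first.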
![Page 22: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/22.jpg)
EXPERIMENT WORK AND OUTCOME
• Attributes: Reason, AgeBand, TimeBand, Gender
• Object: Driver
• Class: Yes/No for intoxication
• Test Mode: 10-fold cross-validation
• Pruned tree
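The 10-fold cross-validation test mode can be pictured with the sketch below: the data is cut into ten folds, and each fold is held out once for testing while the other nine are used for training. This is an illustrative stand-alone version rather than WEKA's implementation; `k_fold_indices` and the seed are hypothetical, and no stratification by class is attempted.

```python
import random

def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), "training instances,", len(test), "test instances")
```

The reported accuracy is then the share of correctly classified test instances pooled over all ten folds.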
![Page 23: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/23.jpg)
J48 CLASSIFICATION OUTPUT
Summary
J48: Pruned Tree
No (323555.0/37379.0)
Number of Leaves: 1
Size of the tree: 1
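A tree with a single leaf predicts the same class for every driver. WEKA prints a leaf as the label followed by the number of instances reaching the leaf and the number of those it misclassifies, so the accuracy of this one-leaf tree on the training data can be read straight from the output. A small sketch of that arithmetic:

```python
# Leaf printed by J48: No (323555.0/37379.0)
# First number: instances reaching the leaf; second: instances misclassified.
total_instances = 323555.0
misclassified = 37379.0

baseline_accuracy = (total_instances - misclassified) / total_instances
print(f"Accuracy of always predicting No: {baseline_accuracy:.4%}")
```

This is the majority-class baseline: a high accuracy figure here reflects the class imbalance, not an ability to detect intoxicated drivers.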
![Page 24: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/24.jpg)
Confusion Matrix (Predicted vs. Actual)
Detailed Accuracy By Class
![Page 25: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/25.jpg)
JRIP
![Page 26: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/26.jpg)
RULE BASED CLASSIFICATION (JRIP)
Decision Tree and Decision Table (classify rule)
![Page 27: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/27.jpg)
RULE BASED CLASSIFICATION (JRIP)
• Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
• Optimized version of IREP (Incremental Reduced Error Pruning); reduced error pruning is a very common and effective technique found in decision tree algorithms
![Page 28: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/28.jpg)
RULE BASED CLASSIFICATION (JRIP)
• The training data is split into a growing set and a pruning set
• Growing set: greedily add conditions until the rule is perfect (covers no negative examples)
• Pruning set: delete conditions until a better rule is found
• The rule set is generated by repeatedly growing and pruning rules
• Optimization stage
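The grow and prune steps for a single rule can be sketched as below. This is a simplification, not JRip's code: RIPPER grows rules with FOIL's information gain and prunes with its own rule-value metric, whereas this sketch substitutes plain precision for both. The optimization stage is omitted, and the rows and helper names are invented for illustration.

```python
def covers(rule, row):
    """A rule maps attribute -> required value and covers rows matching all of them."""
    return all(row[attr] == value for attr, value in rule.items())

def precision(rule, rows, target, positive):
    """Share of the rows covered by the rule that belong to the positive class."""
    covered = [r for r in rows if covers(rule, r)]
    if not covered:
        return 0.0
    return sum(r[target] == positive for r in covered) / len(covered)

def grow_rule(rows, attributes, target, positive):
    """Greedily add the condition that most improves precision until the rule is pure."""
    rule = {}
    while precision(rule, rows, target, positive) < 1.0:
        candidates = [
            {**rule, attr: value}
            for attr in attributes if attr not in rule
            for value in {r[attr] for r in rows if covers(rule, r)}
        ]
        if not candidates:
            break
        best = max(candidates, key=lambda c: precision(c, rows, target, positive))
        if precision(best, rows, target, positive) <= precision(rule, rows, target, positive):
            break  # no candidate improves the rule
        rule = best
    return rule

def prune_rule(rule, rows, target, positive):
    """Drop trailing conditions while precision on the pruning set does not get worse."""
    conditions = list(rule.items())
    while len(conditions) > 1:
        current = precision(dict(conditions), rows, target, positive)
        shorter = precision(dict(conditions[:-1]), rows, target, positive)
        if shorter < current:
            break
        conditions = conditions[:-1]
    return dict(conditions)

def row(time_band, reason, label):
    return {"TimeBand": time_band, "Reason": reason, "Intoxicated": label}

# Toy growing and pruning sets, not taken from the breath test dataset.
growing_set = [
    row("12am-4am", "Suspicion of Alcohol", "Yes"),
    row("12am-4am", "Other", "No"),
    row("8am-12pm", "Suspicion of Alcohol", "No"),
    row("8am-12pm", "Other", "No"),
]
pruning_set = [
    row("12am-4am", "Suspicion of Alcohol", "Yes"),
    row("12am-4am", "Other", "Yes"),
    row("8am-12pm", "Other", "No"),
]
grown = grow_rule(growing_set, ["TimeBand", "Reason"], "Intoxicated", "Yes")
pruned = prune_rule(grown, pruning_set, "Intoxicated", "Yes")
```

On this toy data the grown rule needs two conditions to be perfect on the growing set, and pruning drops the second one because the shorter rule does just as well on the held-out pruning set.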
![Page 29: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/29.jpg)
RULE OF JRIP
![Page 30: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/30.jpg)
PERFORMANCE OF JRIP
![Page 31: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/31.jpg)
COMPARE WITH J48
![Page 32: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/32.jpg)
WHICH CLASSIFICATION ALGORITHM?
• Accuracy: both J48 and JRip achieve high accuracy in our case
• Speed: 4.26 s for JRip; 1.14 s for J48
• Robustness: handling of noisy or missing data
• Scalability: behaviour as the dataset becomes large
![Page 33: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/33.jpg)
REGRESSION
![Page 34: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/34.jpg)
WHAT IT IS AND WHAT IT DOES
• Determines how a dependent variable is affected by one or more independent variables.
• Dependent variable: the result, i.e. the quantity being predicted.
• Independent variable: a predictor.
• Regression equation (in its simplest form): Y = a + bX + e
• Y is the dependent variable, X is the independent variable, a + bX is the expected value of Y, and e is the error.
• The aim is to find values of a and b such that the error e is small.
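One common way to make the error small is ordinary least squares, which picks a and b to minimize the sum of squared errors. The slides do not name the criterion, so least squares is an assumption here, and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares estimates of a and b in y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance  # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

# Made-up points lying close to a straight line.
xs = [0, 1, 2, 3]
ys = [1.0, 3.1, 4.9, 7.0]
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"a = {a:.2f}, b = {b:.2f}")
```

With the fitted a and b, the residuals (the observed e values) sum to zero, which is a property of least-squares fits that include an intercept.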
![Page 35: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/35.jpg)
THE REGRESSION MODEL DERIVED
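[Figure: regression line y = a + bx + e, with y on the vertical axis, x on the horizontal axis, a as the intercept and e as the error.]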
![Page 36: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/36.jpg)
LOGISTIC REGRESSION
Why this regression?
1. Predictive analysis of a dichotomous dependent variable.
• E.g. in our case we are building a model that predicts whether someone is intoxicated or not, i.e. what factors like violating traffic rules, age band and time band tell us about the probability that a person is intoxicated when they're stopped by police.
2. We discover additional trends in the data, namely how each predictor affects the dependent variable, without having to run other tests.
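For a dichotomous outcome, logistic regression keeps the linear form a + bX but applies it to the log-odds of the outcome rather than to the outcome itself. Using the same symbols as the regression slides, and writing p for the probability of the modelled class (a symbol introduced here, not in the slides), the standard form is:

```latex
p = P(Y = 1 \mid X) = \frac{1}{1 + e^{-(a + bX)}}
\qquad\Longleftrightarrow\qquad
\ln\frac{p}{1 - p} = a + bX
```

Each coefficient b is therefore a change in log-odds, and e^b is the factor by which the odds are multiplied when the corresponding predictor is switched on.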
![Page 37: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/37.jpg)
RESULTS AND EVALUATION – REGRESSION MODEL
| Actual \ Classified | No | Yes | Precision | Recall | F-Measure | ROC Area |
| --- | --- | --- | --- | --- | --- | --- |
| No | 285403 | 773 | 0.885 | 0.997 | 0.938 | 0.726 |
| Yes | 36923 | 456 | 0.371 | 0.012 | 0.024 | 0.726 |
| Weighted Average | | | 0.826 | 0.883 | 0.832 | 0.726 |

| Measure | Value | Percentage |
| --- | --- | --- |
| Correctly Classified Instances | 285859 | 88.3494 % |
| Incorrectly Classified Instances | 37696 | 11.6506 % |
| Mean absolute error | 0.1886 | |
| Root mean squared error | 0.3075 | |
| Relative absolute error | | 92.2893 % |
| Root relative squared error | | 96.1835 % |
| Total Number of Instances | 323555 | |
![Page 38: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/38.jpg)
| Attribute | Coefficient | Odds |
| --- | --- | --- |
| Reason=Suspicion of Alcohol | -0.486 | 0.6151 |
| Reason=Moving Traffic Violation | 0.5511 | 1.7352 |
| TimeBand=12am-4am | -0.849 | 0.4278 |
| TimeBand=4am-8am | -0.6143 | 0.541 |
| TimeBand=8am-12pm | 0.6492 | 1.914 |
| AgeBand=16-19 | 0.1976 | 1.2184 |
| AgeBand=25-29 | -0.2082 | 0.8121 |
| AgeBand=70-98 | 0.86 | 2.3632 |
| Gender=Male | -0.1297 | 0.8784 |
| Gender=Female | 0.1268 | 1.1352 |
| Intercept | 2.3189 | |

From Y = a + bX: Y = 2.3189 - 0.486 * (Sus_Alc) + 0.5511 * (Mov_Traf) - 0.849 * (TimeBand=12am-4am) + …
Regression equation predicting whether someone is intoxicated or not.
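The coefficients and the "Odds" column are linked: each odds value is the exponential of its coefficient. The sketch below checks this for a few rows and then evaluates the equation for one hypothetical driver. Two assumptions apply. First, attributes not included in the sum are treated as switched off. Second, the slides do not say which class the equation models; the signs (suspicion of alcohol lowers the odds) suggest the "No" class, but that is an inference rather than something the source states.

```python
from math import exp

# A subset of the rows from the coefficient table above.
coefficients = {
    "Reason=Suspicion of Alcohol": -0.486,
    "Reason=Moving Traffic Violation": 0.5511,
    "TimeBand=12am-4am": -0.849,
    "Gender=Male": -0.1297,
}
INTERCEPT = 2.3189

# The "Odds" column is the exponential of each coefficient.
odds = {name: exp(value) for name, value in coefficients.items()}

def probability(active_indicators):
    """Logistic transform of the intercept plus the coefficients of the active indicators."""
    log_odds = INTERCEPT + sum(coefficients[name] for name in active_indicators)
    return 1.0 / (1.0 + exp(-log_odds))

# Hypothetical driver: male, stopped on suspicion of alcohol between 12am and 4am.
p = probability(["Reason=Suspicion of Alcohol", "TimeBand=12am-4am", "Gender=Male"])
print(f"Probability of the modelled class: {p:.3f}")
```

An odds value below 1 means the attribute makes the modelled class less likely; a value above 1 means it makes it more likely.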
![Page 39: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/39.jpg)
CONCLUSION
![Page 40: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/40.jpg)
CONCLUSION/WHAT WE DISCOVERED
• Four “optimal” attributes to use in classification
• J48: performs well but is not practical
• JRip: most accurate (not by much) but needs tweaking
• Regression: “best” of the 3
![Page 41: Analyzing Road Side Breath Test Data with WEKA](https://reader036.fdocuments.net/reader036/viewer/2022081605/58ee84ee1a28aba0218b45d7/html5/thumbnails/41.jpg)
CONCLUSION / RECOMMENDATION
• Test the dataset for more assumptions: normality, multicollinearity and homoscedasticity.
• Transform the dataset to minimize the errors generated from the biased number of cases belonging to one class (No, i.e. not intoxicated).
• Explore further experiments including other factors that are potential predictors of intoxication, e.g. Offences (how offensive a person is when asked to pull over by police).
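One possible transformation for the class imbalance mentioned above is random undersampling: majority-class records are dropped at random until the classes are the same size. The slides do not say which transformation the authors had in mind, so this is only one option; `undersample` and the seed are hypothetical, and the rows are toy data.

```python
import random

def undersample(rows, target, seed=1):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced

# Toy data with the same kind of imbalance: many "No", few "Yes".
rows = [{"Intoxicated": "No"}] * 9 + [{"Intoxicated": "Yes"}] * 3
balanced = undersample(rows, "Intoxicated")
print(len(balanced), "rows after balancing")
```

The trade-off is that undersampling discards data from the majority class, so results on the balanced set should still be checked against the original class distribution.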