Banknotes Data - Oregon Institute of...
![Page 1: Banknotes Data - Oregon Institute of Technologymath.oit.edu/~overholserr/current/math407/notes/math407_day11.pdf · My set-up My Goal: predict whether a banknote is genuine or not](https://reader031.fdocuments.net/reader031/viewer/2022011919/601c4820b590f5794c2b301c/html5/thumbnails/1.jpg)
Banknotes Data
Is a banknote genuine or not?
The dataset
Goal: predict whether a banknote is genuine or not based on the following four characteristics obtained from wavelet-transformed images of 1372 bills:
• Variance
• Skewness
• Kurtosis
• Entropy
Can be downloaded from the UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/banknote+authentication#
My set-up
My goal: predict whether a banknote is genuine or not based on the following two characteristics obtained from wavelet-transformed images of 1372 bills:
• Variance
• Skewness
I set aside 10% of the data as a test set and will use the remaining data to train a logistic regression model, kNNs, LDA and QDA.
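A minimal sketch of the 10% hold-out split, assuming the full UCI file of 1372 rows. The data frame `banknotes` and its random contents are hypothetical stand-ins, not the real data; only the name `ss` for the training set follows the R output in these notes.

```r
# Hypothetical stand-in for the real data: 1372 rows with the two predictors
# and a 0/1 label. Swap in the UCI file to work with the actual bills.
set.seed(407)                        # arbitrary seed, for reproducibility only
n_total <- 1372
banknotes <- data.frame(
  variance = rnorm(n_total),
  skewness = rnorm(n_total),
  type     = rbinom(n_total, 1, 0.45)
)

# Hold out 10% of the rows (rounded to a whole number) as the test set.
n_test   <- round(0.10 * n_total)
test_idx <- sample(n_total, n_test)
test     <- banknotes[test_idx, ]
ss       <- banknotes[-test_idx, ]   # training set

c(train = nrow(ss), test = nrow(test))
```

With 1372 rows this leaves 1235 training rows and 137 test rows.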
Notation
Random Variables:
• Say Y = 1 if a bill is genuine, 0 if fake.
• $X_v$ = variance of the wavelet-transformed image
• $X_s$ = skewness of the wavelet-transformed image
We have n=1235 observations of these variables in our training set.
Logistic Regression
• Assumes banknotes are “independent” and that the log odds is linear in the predictors:
$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_v + \beta_2 x_s, \quad \text{where } \pi = P(Y = 1 \mid X_v = x_v, X_s = x_s)$$
Logistic regression model
> model1<-glm(type~variance+skewness, data=ss,
family="binomial")
> model1
Call: glm(formula = type ~ variance + skewness,
family = "binomial",
data = ss)
Coefficients:
(Intercept) variance skewness
0.6192 -1.1224 -0.2885
Using the training set of 1235 bills and the method of maximum likelihood, I found the coefficients of the logistic regression model shown above.
Proportion of test bills that were incorrectly classified: 16.79%
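To see how the fitted coefficients turn into a prediction, here is a small worked sketch. The coefficients are copied from the `glm` output above; the bill's values (variance = -1.5, skewness = -2) are chosen only for illustration, and the 0.5 cutoff mentioned afterwards is an assumption, since the notes do not state the cutoff used.

```r
# Coefficients as printed in the glm output above.
beta <- c(intercept = 0.6192, variance = -1.1224, skewness = -0.2885)

# Illustrative bill (values picked for the example, not taken from the data).
x <- c(variance = -1.5, skewness = -2)

# Linear predictor = estimated log odds that type = 1 (genuine).
log_odds <- beta[["intercept"]] +
  beta[["variance"]] * x[["variance"]] +
  beta[["skewness"]] * x[["skewness"]]

# Inverse logit: 1 / (1 + exp(-log_odds)).
p_hat <- plogis(log_odds)

log_odds   # about 2.88
p_hat      # about 0.947
```

Under a 0.5 cutoff this bill would be classified as genuine.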
kNNs
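[Figure: scatter plot of skewness (vertical axis, -15 to 10) against variance (horizontal axis, -4 to 4), with points marked by whether the bill is genuine (no/yes)]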
Using Euclidean distance, the minimum error rate was 6.57% for k=10
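A sketch of how the search over k might look, assuming the `knn()` function from the `class` package (which uses Euclidean distance). The data here are synthetic: two Gaussian clouds whose centres are loosely based on the class means in these notes. The error rates it prints therefore will not match the 6.57% above.

```r
library(class)   # provides knn()

# Synthetic stand-in data (an assumption, not the real banknotes).
set.seed(1)
n        <- 600
type     <- rbinom(n, 1, 0.45)
variance <- rnorm(n, mean = ifelse(type == 1, -1.9, 2.3), sd = 2)
skewness <- rnorm(n, mean = ifelse(type == 1, -1.2, 4.3), sd = 5)
dat      <- data.frame(variance, skewness, type = factor(type))

# 10% hold-out, as in the set-up.
test_idx <- sample(n, round(0.10 * n))
train    <- dat[-test_idx, ]
test     <- dat[test_idx, ]

# Test error rate for each candidate k.
ks  <- 1:25
err <- sapply(ks, function(k) {
  pred <- knn(train[, c("variance", "skewness")],
              test[,  c("variance", "skewness")],
              cl = train$type, k = k)
  mean(pred != test$type)
})

best_k <- ks[which.min(err)]
data.frame(k = ks, error = err)
```

Picking k by its test error makes the reported minimum somewhat optimistic; a separate validation set or cross-validation on the training data would be one way around that.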
kNNs has a pretty good error rate – can we do any better?
Let’s look at the dataset another way…
Do these shapes remind you of a famous distribution?
The Normal Distribution
• Bell-shaped
• Can be described by mean µ and standard deviation σ
• Common: sums or means of enough iid RVs (with finite variance) are approximately normally distributed
How would you describe these distributions in terms of mean and SD?
Given a bill with skewness = -2 and variance = -1.5, would you say it’s real?
| | Genuine: skewness | Genuine: variance | Fake: skewness | Fake: variance |
|---|---|---|---|---|
| mean | -1.19 | -1.89 | 4.31 | 2.28 |
| SD | 5.43 | 1.86 | 5.12 | 2.04 |
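One rough way to answer the question is to compare how likely the bill is under each class. The sketch below uses the means and SDs from the table and assumes, purely for illustration, that skewness and variance are independent normal variables within each class and that both classes are equally common beforehand.

```r
# Class summaries copied from the table above.
genuine <- list(skew_mean = -1.19, skew_sd = 5.43, var_mean = -1.89, var_sd = 1.86)
fake    <- list(skew_mean =  4.31, skew_sd = 5.12, var_mean =  2.28, var_sd = 2.04)

# The bill in question.
x_skew <- -2
x_var  <- -1.5

# Likelihood of the bill under one class, treating the two predictors as
# independent normals within that class (a simplifying assumption).
lik <- function(g) {
  dnorm(x_skew, mean = g$skew_mean, sd = g$skew_sd) *
    dnorm(x_var, mean = g$var_mean, sd = g$var_sd)
}

lik_genuine <- lik(genuine)
lik_fake    <- lik(fake)
ratio       <- lik_genuine / lik_fake
ratio   # roughly 12
```

Under these assumptions the bill is about twelve times more likely under the genuine distribution than under the fake one, so "genuine" is the natural guess.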
Linear Discriminant Analysis (LDA)
Idea: model the distributions of the predictor variables given the class of Y as normally distributed random variables with the same SD in every class (more precisely, the same covariance matrix), and then use Bayes’ Theorem to predict the class of Y given values of the predictors.
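Written out, the Bayes' Theorem step looks like this. The symbols $p_k$ (prior probability of class $k$), $f_k$ (class-conditional density), $\mu_k$ (class mean) and $\Sigma$ (shared covariance matrix) are notation introduced here for illustration; they do not appear in the original slides.

```latex
P(Y = k \mid X = x) \;=\; \frac{p_k \, f_k(x)}{p_0 \, f_0(x) + p_1 \, f_1(x)},
\qquad k \in \{0, 1\},
\qquad f_k(x) = \text{density of } N(\mu_k, \Sigma) \text{ at } x
```

A bill is assigned to whichever class has the larger posterior probability.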
Results of LDA
> model3 <- lda(formula = type ~ variance+skewness, data = ss)
> model3
Call:
lda(type ~ variance + skewness, data = ss)
Prior probabilities of groups:
0 1
0.5465587 0.4534413
Group means:
variance skewness
0 2.280648 4.311509
1 -1.892714 -1.190899
Coefficients of linear discriminants:
LD1
variance -0.46505122
skewness -0.09733833
Why is this called Linear Discriminant Analysis?
Bills on one side of the black line are “real”, the others are “fake”
Error rate of 15.3%
Quadratic Discriminant Analysis (QDA)
What if we’d allowed a quadratic classification border instead of a linear one?
Note: this corresponds to allowing unequal SDs (covariance matrices) across the classes in the normal distributions of the predictors.
Error rate decreases to 14.6%: not much better in this case.
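A sketch of fitting both discriminant models side by side, assuming `lda()` and `qda()` from the `MASS` package. The data are synthetic, with the two classes deliberately given slightly different spreads, so the printed error rates are illustrative only and will not reproduce the 15.3% and 14.6% reported here.

```r
library(MASS)   # provides lda() and qda()

# Synthetic stand-in data (an assumption, not the real banknotes).
set.seed(2)
n        <- 800
type     <- rbinom(n, 1, 0.45)
variance <- rnorm(n, mean = ifelse(type == 1, -1.9, 2.3),
                     sd   = ifelse(type == 1,  1.9, 2.0))
skewness <- rnorm(n, mean = ifelse(type == 1, -1.2, 4.3),
                     sd   = ifelse(type == 1,  5.4, 5.1))
dat      <- data.frame(variance, skewness, type = factor(type))

# 10% hold-out, as in the set-up.
test_idx <- sample(n, round(0.10 * n))
train    <- dat[-test_idx, ]
test     <- dat[test_idx, ]

# LDA pools one covariance matrix; QDA estimates one per class.
fit_lda <- lda(type ~ variance + skewness, data = train)
fit_qda <- qda(type ~ variance + skewness, data = train)

# Test error rates.
err_lda <- mean(predict(fit_lda, newdata = test)$class != test$type)
err_qda <- mean(predict(fit_qda, newdata = test)$class != test$type)
c(LDA = err_lda, QDA = err_qda)
```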
So kNN with k=10 looks like the winner
Based on the test set, we expect to misclassify about 6.57% of bills.
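But wait: is it equally bad to misclassify a genuine bill as it is to misclassify a fake bill?
Types of Misclassification (taking “genuine”, Y = 1, as the positive class):
• False positive: a fake bill classified as genuine
• False negative: a genuine bill classified as fake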
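The two kinds of error can be separated with a confusion matrix. The sketch below uses ten made-up labels and predictions (1 = genuine, 0 = fake); they are hypothetical and unrelated to the banknote results.

```r
# Hypothetical true labels and predictions, made up for illustration.
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 1, 1, 0)

# Rows are the true class, columns the predicted class.
conf_mat <- table(actual = actual, predicted = predicted)
conf_mat

# Overall error rate: share of all bills that were misclassified.
overall_error <- mean(predicted != actual)

# Error rate among genuine bills (genuine called fake).
false_neg_rate <- mean(predicted[actual == 1] == 0)

# Error rate among fake bills (fake called genuine).
false_pos_rate <- mean(predicted[actual == 0] == 1)

c(overall = overall_error, genuine = false_neg_rate, fake = false_pos_rate)
```

Here 3 of 10 bills are misclassified overall, but the rate is 1 in 4 for genuine bills and 2 in 6 for fake ones, which is why the overall rate alone can hide a lopsided classifier.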
Error Rates by Bill Type
| | Overall Error Rate (%) | Misclassified Genuine Bills (%) | Misclassified Fake Bills (%) |
|---|---|---|---|
| Logistic Regression | 16.8 | 17.2 | 16.0 |
| kNNs, k=10 | 6.6 | 2.3 | 14.0 |
| LDA | 15.3 | 17.2 | 12.0 |
| QDA | 14.6 | 17.2 | 10.0 |
QDA has the lowest error rate for fake bills!