Lecture 6: Linear Classification & Logistic Regression
EE-UY 4563/EL-GY 9123: INTRODUCTION TO MACHINE LEARNING
PROF. SUNDEEP RANGAN (WITH MODIFICATION BY YAO WANG)
1
Learning Objectives
Formulate a machine learning problem as a classification problem
◦ Identify features, class variable, training data
Visualize classification data using a scatter plot.
Describe a linear classifier as an equation and on a plot.
◦ Determine visually if data is perfectly linearly separable.
Formulate a classification problem using logistic regression
◦ Binary and multi-class
◦ Describe the logistic and soft-max functions
◦ Logistic function to approximate the probability
Derive the loss function for ML estimation of the weights in logistic regression
Use sklearn packages to fit logistic regression models
Measure the accuracy of classification
Adjust the threshold of classifiers to trade off the types of classification errors. Draw an ROC curve.
Perform LASSO regularization for feature selection
2
Outline
Motivating Example: Classifying a breast cancer test
Linear classifiers
Logistic regression
Fitting logistic regression models
Measuring accuracy in classification
3
Diagnosing Breast Cancer
Fine needle aspiration of suspicious lumps
Cytopathologist visually inspects cells
◦ Sample is stained and viewed under a microscope
Determines if cells are benign or malignant, and provides a grading if malignant
Uses many features:
◦ Size and shape of cells, degree of mitosis, differentiation, …
Diagnosis is not exact
If uncertain, use a more comprehensive biopsy
◦ Additional cost and time
◦ Stress to patient
Can machine learning provide better rules?
4
Grades of carcinoma cells (figure from http://breast-cancer.ca/5a-types/)
Demo on GitHub
https://github.com/sdrangan/introml/blob/master/logistic/breast_cancer.ipynb
5
Data
Univ. Wisconsin study, 1994
569 samples
10 visual features for each sample
Ground truth determined by biopsy
First publication: O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
6
Loading the Data
Follow the standard pandas routine
All code in Lect06_Demo.ipynb
7
Class = 2 => benign
Class = 4 => malignant
Drops missing samples
See the following for an explanation of the attributes:
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
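A minimal sketch of the loading step, assuming the UCI file layout described in the link above. The column names below are illustrative short names (matching the style used in the demo notebook), not part of the raw file:

```python
import pandas as pd

# Illustrative column names, following the UCI attribute documentation.
names = ['id', 'thick', 'size_unif', 'shape_unif', 'marg', 'cell_size',
         'bare', 'chrom', 'normal', 'mit', 'class']
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       'breast-cancer-wisconsin/breast-cancer-wisconsin.data')

# Missing attributes are coded as '?'; read them as NaN, then drop those rows.
df = pd.read_csv(url, names=names, na_values='?')
df = df.dropna()

# Class = 2 => benign, Class = 4 => malignant; map to a binary 0/1 label.
y = (df['class'] == 4).astype(int)
```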
Visualizing the Data
8
Scatter plot of points from each class
Plot not informative
◦ Many points overlap
◦ Relative frequency at each point not visible
Improving the Plot
9
Make circle size proportional to count
Making this plot takes some gymnastics in Python
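One way to do it, as a sketch. This continues from the loading sketch above (so `df` and `y` are assumed defined) and uses the two features from the in-class exercise:

```python
import numpy as np
import matplotlib.pyplot as plt

# For each class, count how many samples land on each (size_unif, marg)
# grid point, and scale the marker area by that count.
for cls, color in [(0, 'green'), (1, 'red')]:
    x1 = df['size_unif'][y == cls]
    x2 = df['marg'][y == cls]
    pts, counts = np.unique(np.column_stack((x1, x2)), axis=0,
                            return_counts=True)
    plt.scatter(pts[:, 0], pts[:, 1], s=20 * counts, c=color, alpha=0.5)
plt.xlabel('size_unif')
plt.ylabel('marg')
plt.show()
```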
In-Class Exercise
Get into groups
◦ At least one member must have a laptop with Jupyter notebook
Determine a classification rule
◦ Predict the class label from the two features
Test in Python
◦ Make the predictions
◦ Measure the accuracy
10
A Possible Classification Rule
From inspection, benign if:
marg + (2/3)(size_unif) < 4
Classification rule from linear constraint
What are other possible classification rules?
Every rule misclassifies some points
What is optimal?
11
Mangasarian's Original Paper
Proposes Multisurface Method-Tree (MSM-T)
◦ Decision tree based on linear rules at each step
Fig. at left from … Pantel, "Breast Cancer Diagnosis and Prognosis," 1995
Best methods today use neural networks
This lecture will look at linear classifiers
◦ These are much simpler
◦ Do not provide the same level of accuracy
But, building block to more complex classifiers
12
Classification
Given features $\mathbf{x}$, determine its class label, $y = 1, \dots, K$
Binary classification: $y = 0$ or $1$
Many applications:
◦ Face detection: Is a face present or not?
◦ Reading a digit: Is the digit 0, 1, …, 9?
◦ Are the cells cancerous or not?
◦ Is the email spam?
Equivalently, determine a classification function: $\hat{y} = f(\mathbf{x}) \in \{1, \dots, K\}$
◦ Like regression, but with a discrete response
◦ May index $\{1, \dots, K\}$ or $\{0, \dots, K-1\}$
13
Outline
Motivating Example: Classifying a breast cancer test
Linear classifiers
Logistic regression
Fitting logistic regression models
Measuring accuracy in classification
14
Linear Classifier
General binary classification rule:
$\hat{y} = f(\mathbf{x}) = 0$ or $1$
Linear classification rule:
◦ Take the linear combination $z = w_0 + \sum_{j=1}^{d} w_j x_j$
◦ Predict the class from $z$:
$\hat{y} = \begin{cases} 1 & z \ge 0 \\ 0 & z < 0 \end{cases}$
Decision regions described by a half-space.
$\mathbf{w} = (w_0, \dots, w_d)$ is called the weight vector
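As a sketch, the rule above is one line of NumPy. The function name and the assumption that `w` and `w0` are already given are illustrative:

```python
import numpy as np

def linear_classify(X, w, w0):
    """Predict 0/1 labels: yhat = 1 if w0 + x . w >= 0, else 0."""
    z = w0 + X @ w            # linear combination for each row of X
    return (z >= 0).astype(int)
```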
15
Breast Cancer Example
From inspection, benign if:
marg + (2/3)(size_unif) < 4
Classification rule from linear constraint
What are other possible classification rules?
Every rule misclassifies some points
What is optimal?
16
Using Linear Regression on Two Features
Bad idea: Use linear regression
◦ Labels are $y \in \{0, 1\}$
◦ Use the linear model: $z = w_0 + w_1 x_1 + w_2 x_2$
◦ Find the linear fit so that $\sum_i (y_i - z_i)^2$ is minimized
◦ Then threshold the linear fit:
$\hat{y} = \begin{cases} 1 & z > 0.5 \\ 0 & z < 0.5 \end{cases}$
This yields the line: $w_0 + w_1 x_1 + w_2 x_2 = 0.5$
Why is the line not as expected?
Should not use MSE as the optimization criterion!
◦ Squared error is not related to classifier accuracy
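A sketch of this (bad) approach with sklearn, assuming `X` holds the two features and `y` the 0/1 labels:

```python
from sklearn.linear_model import LinearRegression

# Fit z = w0 + w1*x1 + w2*x2 by least squares on the 0/1 labels ...
reg = LinearRegression()
reg.fit(X, y)

# ... then threshold the fitted values at 0.5 to get hard labels.
yhat = (reg.predict(X) > 0.5).astype(int)
```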
17
Using a Linear Classifier on All Features
Hard to visualize, but by setting $z = 0.5$ we get a hyperplane in a high-dimensional space
18
Go Through the Demo
Up to linear regression
19
Linear vs. Non-Linear
Linear boundaries are limited
Can only describe very simple regions
But, serves as a building block
◦ Many classifiers use linear rules as a first step
◦ Neural networks, decision trees, …
Breast cancer example:
◦ Is the region linear or non-linear?
20
Perfect Linear Separability
Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$
Binary class label: $y_i = \pm 1$
Perfectly linearly separable if there exists a $\mathbf{w} = (w_0, w_1, \dots, w_d)$ s.t.:
◦ $w_0 + w_1 x_{i1} + \cdots + w_d x_{id} > \gamma$ when $y_i = 1$
◦ $w_0 + w_1 x_{i1} + \cdots + w_d x_{id} < -\gamma$ when $y_i = -1$
$\mathbf{w}$ is the separating hyperplane, $\gamma$ is the margin
Single equation form: $y_i(w_0 + w_1 x_{i1} + \cdots + w_d x_{id}) > \gamma$ for all $i = 1, \dots, n$
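A sketch of checking the single-equation condition in NumPy (labels assumed to be ±1, as on the slide; the function name is illustrative):

```python
import numpy as np

def is_separated(X, y, w0, w, gamma):
    """Check y_i * (w0 + w . x_i) > gamma for every training sample."""
    margins = y * (w0 + X @ w)
    return np.all(margins > gamma)
```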
21
Most Data Not Perfectly Separable
Generally cannot find a separating hyperplane
There will always be some points that are misclassified
Algorithms attempt to find "good" hyperplanes
◦ Reduce the number of misclassified points
◦ Or, some similar metric
Example: Look again at breast cancer data
22
Non-Uniqueness
When one exists, the separating hyperplane is not unique
Example:
◦ If $\mathbf{w}$ is separating, then so is $\alpha\mathbf{w}$ for all $\alpha > 0$
Fig. on right: many separating planes
Which one is optimal?
23
Outline
Motivating Example: Classifying a breast cancer test
Linear classifiers
Logistic regression
Fitting logistic regression models
Measuring accuracy in classification
24
Logistic Model for Binary Classification
Binary classification problem: $y = 0, 1$
Consider the probabilistic model
$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-z}}, \quad P(y = 0 \mid \mathbf{x}) = \frac{e^{-z}}{1 + e^{-z}}$
◦ $z = w_0 + \sum_{j=1}^{d} w_j x_j$
Logistic function: $f(z) = 1/(1 + e^{-z})$
◦ Classical "S" shape. Also called sigmoidal
Value of $f(z)$ does not perfectly predict the class $y$.
◦ Only a probability of $y$
◦ Allows the linear classification to be imperfect.
◦ Training will not require perfect separability
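The logistic (sigmoid) function itself, as a short sketch:

```python
import numpy as np

def logistic(z):
    """Sigmoid f(z) = 1 / (1 + e^{-z}), mapping the reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```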
25
Logistic Model as a "Soft" Classifier
Plot of
$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}}, \quad z = w_1 x$
◦ Markers are random samples
Higher $w_1$: the probability transition becomes sharper
◦ Fewer samples occur across the boundary
As $w_1 \to \infty$, the logistic becomes a "hard" rule:
$P(y = 1 \mid x) \to \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$
26
(Panels: $z = 0.5x$, $z = x$, $z = 2x$, $z = 100x$)
Multi-Class Logistic Regression
Suppose $y \in \{1, \dots, K\}$
◦ $K$ possible classes (e.g. digits, letters, spoken words, …)
Multi-class regression:
◦ $W \in \mathbb{R}^{K \times d}$, $\mathbf{w}_0 \in \mathbb{R}^{K}$: slope matrix and bias
◦ $\mathbf{z} = W\mathbf{x} + \mathbf{w}_0$: creates $K$ linear functions
Then, the class probabilities are given by:
$P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{\ell=1}^{K} e^{z_\ell}}$
27
Softmax Operation
Consider the soft-max function:
$g_k(\mathbf{z}) = \frac{e^{z_k}}{\sum_{\ell=1}^{K} e^{z_\ell}}$
◦ $K$ inputs $\mathbf{z} = (z_1, \dots, z_K)$, $K$ outputs $g(\mathbf{z}) = (g_1(\mathbf{z}), \dots, g_K(\mathbf{z}))$
Properties: $g(\mathbf{z})$ is like a PMF on the labels $\{0, 1, \dots, K-1\}$
◦ $g_k(\mathbf{z}) \in [0, 1]$ for each component $k$
◦ $\sum_{k=1}^{K} g_k(\mathbf{z}) = 1$
Softmax property: when $z_k \gg z_\ell$ for all $\ell \ne k$:
◦ $g_k(\mathbf{z}) \to 1$
◦ $g_\ell(\mathbf{z}) \to 0$ for all $\ell \ne k$
Multi-class logistic regression: assigns the highest probability to class $k$ when $z_k$ is largest, $z_k = \mathbf{w}_k^T \mathbf{x} + w_{0k}$
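A numerically stable sketch of the soft-max; subtracting $\max(z)$ does not change the output but avoids overflow in the exponentials:

```python
import numpy as np

def softmax(z):
    """Map scores z = (z_1, ..., z_K) to a probability vector."""
    ez = np.exp(z - np.max(z))   # shift for numerical stability
    return ez / ez.sum()
```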
28
Multi-Class Logistic Regression: Decision Regions
Each decision region defined by set of hyperplanes
Intersection of linear constraints
Sometimes called a polytope
29
Transform Linear Models
As in regression, logistic models can be applied to transformed features
Step 1: Map $\mathbf{x}$ to transformed features $\phi(\mathbf{x}) = (\phi_1(\mathbf{x}), \dots, \phi_p(\mathbf{x}))^T$
Step 2: Linear weights: $z_k = \sum_{j=1}^{p} W_{kj}\,\phi_j(\mathbf{x})$
Step 3: Soft-max: $P(y = k \mid \mathbf{x}) = g_k(\mathbf{z})$, $g_k(\mathbf{z}) = \frac{e^{z_k}}{\sum_\ell e^{z_\ell}}$
Example transforms:
◦ Standard regression: $\phi(\mathbf{x}) = (1, x_1, \dots, x_d)^T$ ($d$ original features, $d+1$ transformed features)
◦ Polynomial regression: $\phi(x) = (1, x, \dots, x^d)^T$ (1 original feature, $d + 1$ transformed features)
30
Additional transform step
Using Transformed Features
Enables richer class boundaries
Example: Fig B is not linearly separable
But, consider nonlinear features
◦ $\phi(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_2^2)^T$
Then can discriminate the classes with a linear function
◦ $z = (-r^2, 0, 0, 1, 1)\,\phi(\mathbf{x}) = x_1^2 + x_2^2 - r^2$
Blue when $z \le 0$ and green when $z > 0$
31
Outline
Motivating Example: Classifying a breast cancer test
Linear classifiers
Logistic regression
Fitting logistic regression models
Measuring accuracy in classification
32
Learning the Logistic Model Parameters
Consider the general three-part logistic model:
◦ Transform to features: $\mathbf{x} \mapsto \phi(\mathbf{x})$
◦ Linear weights: $\mathbf{z} = W\phi(\mathbf{x})$, $W \in \mathbb{R}^{K \times p}$
◦ Softmax: $P(y = k \mid \mathbf{x}) = g_k(\mathbf{z}) = g_k(W\phi(\mathbf{x}))$
The weight matrix $W$ represents the unknown model parameters
Learning problem:
◦ Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$
◦ Learn the weight matrix $W$
◦ What loss function to minimize?
33
Likelihood Function
Represent the training data in vector form:
◦ Data matrix: $X = (\mathbf{x}_1, \dots, \mathbf{x}_n)^T$
◦ Class label vector: $\mathbf{y} = (y_1, \dots, y_n)^T$
◦ One component for each training sample
Likelihood function:
◦ $P(\mathbf{y} \mid X, W)$ = likelihood (i.e. probability) of the class labels given the inputs $X$ and the weights
◦ A function of the training data $(X, \mathbf{y})$ and the parameters $W$
34
Min and Argmin
Given a function $f(x)$:
$\min_x f(x)$
◦ Minimum value of $f(x)$
◦ A point on the $y$-axis
$\arg\min_x f(x)$
◦ Value of $x$ where $f(x)$ is a minimum
◦ A point on the $x$-axis
Similarly, define $\max_x f(x)$ and $\arg\max_x f(x)$
35
(Figure: $f(x) = (x-1)^2 + 2$, so $\arg\min_x f(x) = 1$ and $\min_x f(x) = 2$)
Maximum Likelihood Estimation
Given training data $(X, \mathbf{y})$
Likelihood function: $P(\mathbf{y} \mid X, W)$
Maximum likelihood estimation:
$\hat{W} = \arg\max_W P(\mathbf{y} \mid X, W)$
◦ Finds the parameters for which the observations are most likely
◦ A very general method in estimation
36
Log Likelihood
Assume the outputs $y_i$ are independent, depending only on $\mathbf{x}_i$. Then the likelihood factors:
$P(\mathbf{y} \mid X, W) = \prod_{i=1}^{n} P(y_i \mid \mathbf{x}_i, W)$
Define the negative log likelihood:
$L(W) = -\ln P(\mathbf{y} \mid X, W) = -\sum_{i=1}^{n} \ln P(y_i \mid \mathbf{x}_i, W)$
The maximum likelihood estimator can be re-written as:
$\hat{W} = \arg\max_W P(\mathbf{y} \mid X, W) = \arg\min_W L(W)$
37
Logistic Loss Function for Binary Classification
Negative log likelihood function: $J(\mathbf{w}) = -\sum_{i=1}^{n} \ln P(y_i \mid \mathbf{x}_i, \mathbf{w})$
$P(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = \frac{1}{1 + e^{-z_i}}, \quad z_i = \mathbf{w}_{1:d}^T \mathbf{x}_i + w_0$
Therefore,
$P(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = \frac{e^{z_i}}{1 + e^{z_i}}, \quad P(y_i = 0 \mid \mathbf{x}_i, \mathbf{w}) = \frac{1}{1 + e^{z_i}}$
Hence,
$\ln P(y_i \mid \mathbf{x}_i, \mathbf{w}) = y_i \ln P(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) + (1 - y_i) \ln P(y_i = 0 \mid \mathbf{x}_i, \mathbf{w}) = y_i z_i - \ln(1 + e^{z_i})$
Loss function = binary cross-entropy:
$J(\mathbf{w}) = \sum_{i=1}^{n} \left[ \ln(1 + e^{z_i}) - y_i z_i \right]$
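A direct sketch of this loss, computed from $z$ exactly as the derivation leaves it (`np.logaddexp(0, z)` evaluates $\ln(1 + e^{z})$ without overflow):

```python
import numpy as np

def bce_loss(w0, w, X, y):
    """Binary cross-entropy J(w) = sum_i [ln(1 + e^{z_i}) - y_i z_i]."""
    z = w0 + X @ w
    return np.sum(np.logaddexp(0, z) - y * z)
```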
38
One-Hot Log Likelihood for Multi-Class Classification
To find the MLE, we re-write the negative log likelihood
Define the "one-hot" vector:
$r_{ik} = \begin{cases} 1 & y_i = k \\ 0 & y_i \ne k \end{cases}, \quad i = 1, \dots, n, \; k = 1, \dots, K$
Then, $\ln P(y_i \mid \mathbf{x}_i, W) = \sum_{k=1}^{K} r_{ik} \ln P(y_i = k \mid \mathbf{x}_i, W)$
Hence, the negative log likelihood is (proof on board):
$L(W) = \sum_{i=1}^{n} \left[ \ln\left( \sum_{\ell} e^{z_{i\ell}} \right) - \sum_{k} z_{ik} r_{ik} \right]$
◦ Sometimes called the cross-entropy
39
Gradient Calculations
To minimize, take partial derivatives: $\frac{\partial L(W)}{\partial W_{kj}} = 0$ for all $k, j$
Define the transform matrix: $A_{ij} = \phi_j(\mathbf{x}_i)$
Hence, $z_{ik} = \sum_{j=1}^{p} A_{ij} W_{kj}$
Estimated class probabilities: $\hat{p}_{ik} = \frac{e^{z_{ik}}}{\sum_\ell e^{z_{i\ell}}}$
Gradient components are (proof on board): $\frac{\partial L(W)}{\partial W_{kj}} = \sum_{i=1}^{n} (\hat{p}_{ik} - r_{ik}) A_{ij} = 0$
◦ $K \times p$ equations and $K \times p$ unknowns
Unfortunately, there is no closed-form solution to these equations
◦ Nonlinear dependence of $\hat{p}_{ik}$ on the terms in $W$
40
Numerical Optimization
41
We saw that we can find minima by setting $\nabla f(x) = 0$
◦ $n$ equations and $n$ unknowns.
◦ May not have a closed-form solution
Numerical methods: find a sequence of estimates $x^t$ with $x^t \to x^*$
◦ Under some conditions, the sequence converges to a "good" minimum
◦ Run in a computer program, such as Python
Next lecture: Will discuss numerical methods to perform optimization
This lecture: Use a built-in Python routine
Logistic Regression in Python
Sklearn uses very efficient numerical optimization.
Mostly internal to the user
◦ Don't need to compute gradients
42
http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
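A minimal sketch of the sklearn call. The names `Xtr`, `ytr`, `Xts` for the train/test split are placeholders:

```python
from sklearn.linear_model import LogisticRegression

# Fit a binary logistic model; sklearn handles the optimization internally.
logreg = LogisticRegression()
logreg.fit(Xtr, ytr)

yhat = logreg.predict(Xts)               # hard 0/1 labels
phat = logreg.predict_proba(Xts)[:, 1]   # soft label P(y = 1 | x)
```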
Outline
Motivating Example: Classifying a breast cancer test
Linear classifiers
Logistic regression
Fitting logistic regression models
Measuring accuracy in classification
43
Errors in Binary Classification
Two types of errors:
◦ Type I error (false positive / false alarm): decide $\hat{y} = 1$ when $y = 0$
◦ Type II error (false negative / missed detection): decide $\hat{y} = 0$ when $y = 1$
The implications of these errors may be different
◦ Think of breast cancer diagnosis
Accuracy of a classifier can be measured by:
◦ $TPR = P(\hat{y} = 1 \mid y = 1)$
◦ $FPR = P(\hat{y} = 1 \mid y = 0)$
◦ Accuracy $= P(\hat{y} = 1, y = 1) + P(\hat{y} = 0, y = 0)$ (percentage of correct classifications)
44
Many Other Metrics
From the previous slide:
◦ $TPR = P(\hat{y} = 1 \mid y = 1)$ = sensitivity
◦ $FPR = P(\hat{y} = 1 \mid y = 0)$ = 1 − specificity
Machine learning often uses (positive = items of interest in retrieval applications):
◦ Recall = Sensitivity = TP/(TP+FN) (how many positives are detected among all positives?)
◦ Precision = TP/(TP+FP) (how many detected positives are actually positive?)
◦ F1-score $= \frac{\text{precision} \cdot \text{recall}}{(\text{precision} + \text{recall})/2} = \frac{2TP}{2TP + FP + FN} = \frac{TP}{TP + (FP + FN)/2}$
◦ Accuracy = (TP+TN)/(TP+FP+TN+FN) (percentage of correct classifications)
Medical tests:
◦ Sensitivity $= P(\hat{y} = 1 \mid y = 1) = TPR$
◦ Specificity $= P(\hat{y} = 0 \mid y = 0) = 1 - FPR$ = true negative rate
◦ Need a good tradeoff between sensitivity and specificity
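All of these metrics are available in sklearn; a sketch, assuming true labels `yts` and hard predictions `yhat` from earlier:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

print('accuracy :', accuracy_score(yts, yhat))
print('precision:', precision_score(yts, yhat))
print('recall   :', recall_score(yts, yhat))   # = sensitivity = TPR
print('F1 score :', f1_score(yts, yhat))
```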
45
Breast Cancer
Measure accuracy on test data
Use 4-fold cross-validation
Sklearn has built-in functions for CV
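A sketch of the cross-validated accuracy measurement (4 folds, as on the slide; `X` and `y` are the full feature matrix and labels):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 4-fold CV: fit on 3 folds, score accuracy on the held-out fold, repeat.
scores = cross_val_score(LogisticRegression(), X, y, cv=4,
                         scoring='accuracy')
print('accuracy per fold:', scores, ' mean:', scores.mean())
```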
46
Go Through the Demo
Up to binary classification with cross-validation
47
Hard Decisions
The logistic classifier outputs a soft label: $P(y = 1 \mid \mathbf{x}) \in [0, 1]$
◦ $P(y = 1 \mid \mathbf{x}) \approx 1 \Rightarrow y = 1$ more likely
◦ $P(y = 0 \mid \mathbf{x}) \approx 1 \Rightarrow y = 0$ more likely
Can obtain a hard label by thresholding:
◦ Set $\hat{y} = 1$ if $P(y = 1 \mid \mathbf{x}) > t$
◦ $t$ = threshold
How to set the threshold?
◦ $t = \frac{1}{2}$ minimizes the overall error rate
◦ Increasing $t$ decreases false positives, but also reduces sensitivity
◦ Decreasing $t$ increases sensitivity, but also increases false positives
48
ROC Curve
Varying the threshold yields a set of classifiers
Trades off FPR (1 − specificity) and TPR (sensitivity)
Can visualize with the ROC curve
◦ Receiver operating characteristic curve
◦ Term from digital communications
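A sketch of drawing the ROC curve from the soft labels (`phat` = predicted $P(y=1\mid\mathbf{x})$ from the earlier sketch):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Sweep the threshold t over phat to trace out (FPR, TPR) pairs.
fpr, tpr, thresholds = roc_curve(yts, phat)
print('AUC =', roc_auc_score(yts, phat))

plt.plot(fpr, tpr)
plt.xlabel('False positive rate (1 - specificity)')
plt.ylabel('True positive rate (sensitivity)')
plt.show()
```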
49
Area Under the Curve (AUC)
Since one may choose a particular threshold based on the desired trade-off between the TPR and FPR, it may not be appropriate to evaluate the performance of a classifier at a single fixed threshold.
AUC is a measure of goodness for a classifier that is independent of the threshold.
A method with a higher AUC achieves, at the same FPR, a higher TPR.
What is the highest possible AUC?
Should report the average AUC over the cross-validation folds
50
Go Through the Demo
Up to binary classification evaluation using the ROC
51
Multi-Class Classification in Python
Two options
One vs. Rest (OVR)
◦ Solve a binary classification problem for each class $k$
◦ For each class $k$, train on modified binary labels (indicating whether the sample is in the class or not):
$\tilde{y}_i = \begin{cases} 1 & \text{if } y_i = k \\ 0 & \text{if } y_i \ne k \end{cases}$
◦ Predict based on the classifier that yields the highest score
Multinomial
◦ Directly solve for the weights of all classes using the multi-class cross-entropy
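Both options are exposed through the same sklearn class; a sketch (the `multi_class` argument follows the sklearn documentation linked earlier, although recent versions default to the multinomial behavior):

```python
from sklearn.linear_model import LogisticRegression

# Option 1: one-vs-rest -- one binary classifier per class.
ovr = LogisticRegression(multi_class='ovr')

# Option 2: multinomial -- a single model trained with the
# multi-class cross-entropy (requires a compatible solver).
mn = LogisticRegression(multi_class='multinomial', solver='lbfgs')
```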
52
Metrics for Multiclass Classification
Use a $K \times K$ confusion matrix
Should normalize the matrix:
◦ Sum over each row = 1
Can compute accuracy:
◦ Per class: this is the diagonal entry
◦ Average: the average of the diagonal entries
53
(Confusion matrix layout: rows = real class $1, \dots, K$; columns = predicted class $1, \dots, K$)
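A sketch of computing and row-normalizing the matrix with sklearn, assuming labels `yts` and predictions `yhat`:

```python
from sklearn.metrics import confusion_matrix

# C[i, j] = number of samples with real class i predicted as class j.
C = confusion_matrix(yts, yhat)

# Normalize so each row sums to 1; the diagonal is the per-class accuracy.
Cnorm = C / C.sum(axis=1, keepdims=True)
print('average per-class accuracy:', Cnorm.diagonal().mean())
```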
LASSO Regularization for Logistic Regression
Similar to linear regression, we can use LASSO regularization with logistic regression
◦ Forces the weight coefficients to be sparse.
Add an L1 penalty: $L(W) = \sum_{i=1}^{n} \left[ \ln\left( \sum_{\ell} e^{z_{i\ell}} \right) - \sum_{k} z_{ik} r_{ik} \right] + \lambda \|W\|_1$
The regularization level $\lambda$ should be chosen via cross-validation, as before
Sklearn implementation:
◦ By default uses the L2 penalty, which reduces the magnitude of the weights
◦ C is the inverse of the regularization strength ($C = 1/\lambda$); must be a positive float.
◦ Should use a large C if you do not want to apply regularization
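A sketch of the L1-penalized fit; in sklearn the `liblinear` and `saga` solvers support the `l1` penalty, and the value `C = 0.1` is only an example (choose it via cross-validation):

```python
from sklearn.linear_model import LogisticRegression

# The L1 penalty drives some weights exactly to zero (feature selection).
# C = 1/lambda: smaller C means stronger regularization.
lasso_logreg = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lasso_logreg.fit(Xtr, ytr)
print('nonzero weights:', (lasso_logreg.coef_ != 0).sum())
```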
Go through the LASSO part of the demo
54
Go Through the Demo
Go through the last part, with LASSO regularization
55
What You Should Know
Formulate a machine learning problem as a classification problem
◦ Identify features, class variable, training data
Visualize classification data using a scatter plot.
Describe a linear classifier as an equation and on a plot.
◦ Determine visually if data is perfectly linearly separable.
Formulate a classification problem using logistic regression
◦ Binary and multi-class
◦ Describe the logistic and soft-max functions
◦ Understand the idea of using the logistic function to approximate the probability
Derive the loss function for ML estimation of the weights in logistic regression
Use sklearn packages to fit logistic regression models
Measure the accuracy of classification: precision, recall, accuracy
Adjust the threshold of classifiers to trade off the types of classification errors. Draw an ROC curve and determine the AUC
Perform LASSO regularization for feature selection
56