Transcript of Lecture 6: Linear Classification & Logistic Regression

Page 1

Lecture 6: Linear Classification & Logistic Regression
EE-UY 4563/EL-GY 9123: INTRODUCTION TO MACHINE LEARNING
PROF. SUNDEEP RANGAN (WITH MODIFICATION BY YAO WANG)

Page 2

Learning Objectives

Formulate a machine learning problem as a classification problem
β—¦ Identify features, class variable, training data

Visualize classification data using a scatter plot.

Describe a linear classifier as an equation and on a plot.
β—¦ Determine visually if data is perfectly linearly separable.

Formulate a classification problem using logistic regression
β—¦ Binary and multi-class
β—¦ Describe the logistic and soft-max functions
β—¦ Use the logistic function to approximate the class probability

Derive the loss function for ML estimation of the weights in logistic regression

Use sklearn packages to fit logistic regression models

Measure the accuracy of classification

Adjust the threshold of a classifier to trade off between types of classification errors. Draw an ROC curve.

Perform LASSO regularization for feature selection

Page 3

Outline

Motivating Example: Classifying a breast cancer test

Linear classifiers

Logistic regression

Fitting logistic regression models

Measuring accuracy in classification

Page 4

Diagnosing Breast Cancer

Fine needle aspiration of suspicious lumps

A cytopathologist visually inspects the cells
β—¦ The sample is stained and viewed under a microscope

Determines whether the cells are benign or malignant, and provides a grading if malignant

Uses many features:
β—¦ Size and shape of cells, degree of mitosis, differentiation, …

Diagnosis is not exact

If uncertain, a more comprehensive biopsy is used
β—¦ Additional cost and time
β—¦ Stress to the patient

Can machine learning provide better rules?

Grades of carcinoma cells: http://breast-cancer.ca/5a-types/

Page 5

Demo on Github

Github: https://github.com/sdrangan/introml/blob/master/logistic/breast_cancer.ipynb

Page 6

Data

Univ. Wisconsin study, 1994

569 samples

10 visual features for each sample

Ground truth determined by biopsy

First publication: O.L. Mangasarian, W.N. Street and W.H. Wolberg, β€œBreast cancer diagnosis and prognosis via linear programming,” Operations Research, 43(4), pp. 570–577, July–August 1995.

Page 7

Loading the Data

Follow the standard pandas routine

All code in Lect06_Demo.ipynb


Class = 2 => benign
Class = 4 => malignant

Drop samples with missing values
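The pandas routine on this slide can be sketched as below. The short column names (`size_unif`, `marg`, …) and the sample rows are assumptions for illustration only; in the actual demo the data is read from the UCI URL rather than an in-memory string:

```python
import io
import pandas as pd

# Short column names matching the slides' usage (assumed, see the UCI .names file)
names = ["id", "thick", "size_unif", "shape_unif", "marg", "cell_size",
         "bare", "chrom", "normal", "mit", "class"]

# Stand-in for pd.read_csv(url, names=names, na_values="?") on the UCI file;
# these rows are synthetic examples in the same format, not the real data
csv = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,5,4,4,5,7,10,3,2,1,2\n"
    "3,3,1,1,1,2,?,3,1,1,4\n"
)
df = pd.read_csv(csv, names=names, na_values="?")

df = df.dropna()                       # drop samples with missing values
y = (df["class"] == 4).astype(int)     # class 4 = malignant -> 1, class 2 = benign -> 0
X = df[["size_unif", "marg"]].values   # the two features used later in the demo
```

The `na_values="?"` argument is what lets `dropna()` remove the rows where the UCI file marks a missing attribute with `?`.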

See the following for an explanation of the attributes: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

Page 8

Visualizing the Data


Scatter plot of points from each class

Plot not informative
β—¦ Many points overlap
β—¦ Relative frequency at each point not visible

Page 9

Improving the Plot

Make the circle size proportional to the count at each point

It takes some gymnastics to make this plot in Python

Page 10

In-Class Exercise

Get into groups
β—¦ At least one member must have a laptop with Jupyter notebook

Determine a classification rule
β—¦ Predict the class label from the two features

Test in Python
β—¦ Make the predictions
β—¦ Measure the accuracy

Page 11

A Possible Classification Rule

From inspection, benign if:
marg + (2/3)(size_unif) < 4

Classification rule from linear constraint

What are other possible classification rules?

Every rule misclassifies some points

What is optimal?

Page 12

Mangasarian’s Original Paper

Proposes the Multisurface Method – Tree (MSM-T)
β—¦ A decision tree based on linear rules at each step

Fig. to the left from:
β—¦ Pantel, β€œBreast Cancer Diagnosis and Prognosis,” 1995

The best methods today use neural networks

This lecture will look at linear classifiers
β—¦ These are much simpler
β—¦ They do not provide the same level of accuracy

But they are a building block for more complex classifiers

Page 13

Classification

Given features x, determine the class label y ∈ {1, …, K}

Binary classification: y = 0 or 1

Many applications:
β—¦ Face detection: Is a face present or not?
β—¦ Reading a digit: Is the digit 0, 1, …, 9?
β—¦ Are the cells cancerous or not?
β—¦ Is the email spam?

Equivalently, determine a classification function: yΜ‚ = f(x) ∈ {1, …, K}
β—¦ Like regression, but with a discrete response
β—¦ May index the classes as {1, …, K} or {0, …, K βˆ’ 1}

Page 14

Outline

Motivating Example: Classifying a breast cancer test

Linear classifiers

Logistic regression

Fitting logistic regression models

Measuring accuracy in classification

Page 15

Linear Classifier

General binary classification rule: yΜ‚ = f(x) = 0 or 1

Linear classification rule:
β—¦ Take the linear combination z = w_0 + Ξ£_{j=1}^d w_j x_j
β—¦ Predict the class from z:
yΜ‚ = 1 if z β‰₯ 0, yΜ‚ = 0 if z < 0

Decision regions described by a half-space.

w = (w_0, …, w_d) is called the weight vector

Page 16

Breast Cancer Example

From inspection, benign if:
marg + (2/3)(size_unif) < 4

Classification rule from linear constraint

What are other possible classification rules?

Every rule misclassifies some points

What is optimal?

Page 17

Using Linear Regression on Two Features

Bad idea: Use linear regression
β—¦ Labels are y ∈ {0, 1}
β—¦ Use the linear model: z = w_0 + w_1 x_1 + w_2 x_2
β—¦ Find the linear fit minimizing Ξ£_i (y_i βˆ’ z_i)Β²
β—¦ Then threshold the linear fit:
yΜ‚ = 1 if z > 0.5, yΜ‚ = 0 if z < 0.5

This yields the decision line: w_0 + w_1 x_1 + w_2 x_2 = 0.5

Why is the line not as expected?

We should not use MSE as the optimization criterion!
β—¦ Squared error is not related to classifier accuracy

Page 18

Using a Linear Classifier on All Features

Hard to visualize, but setting z = 0.5 gives a hyperplane in the high-dimensional feature space

Page 19

Go Through the Demo

Up to linear regression

Page 20

Linear vs. Non-Linear

Linear boundaries are limited

Can only describe very simple regions

But, serves as a building block
β—¦ Many classifiers use linear rules as a first step
β—¦ Neural networks, decision trees, …

Breast cancer example:
β—¦ Is the region linear or non-linear?

Page 21

Perfect Linear Separability

Given training data (x_i, y_i), i = 1, …, N

Binary class label: y_i = Β±1

The data is perfectly linearly separable if there exists w = (w_0, w_1, …, w_d) such that:
β—¦ w_0 + w_1 x_i1 + β‹― + w_d x_id > Ξ³ when y_i = 1
β—¦ w_0 + w_1 x_i1 + β‹― + w_d x_id < βˆ’Ξ³ when y_i = βˆ’1

w defines the separating hyperplane, and Ξ³ is the margin

Single equation form: y_i (w_0 + w_1 x_i1 + β‹― + w_d x_id) > Ξ³ for all i = 1, …, N

Page 22

Most Data Is Not Perfectly Separable

Generally, one cannot find a separating hyperplane

Some points will always be misclassified

Algorithms attempt to find β€œgood” hyperplanes
β—¦ Reduce the number of misclassified points
β—¦ Or optimize some similar metric

Example: Look again at the breast cancer data

Page 23

Non-Uniqueness

When one exists, the separating hyperplane is not unique

Example:
β—¦ If w is separating, then so is Ξ±w for all Ξ± > 0

Fig. on the right: many separating planes

Which one is optimal?

Page 24

Outline

Motivating Example: Classifying a breast cancer test

Linear classifiers

Logistic regression

Fitting logistic regression models

Measuring accuracy in classification

Page 25

Logistic Model for Binary Classification

Binary classification problem: y = 0 or 1

Consider the probabilistic model:
P(y = 1 | x) = 1 / (1 + e^{βˆ’z}),  P(y = 0 | x) = e^{βˆ’z} / (1 + e^{βˆ’z})
β—¦ z = w_0 + Ξ£_{j=1}^d w_j x_j

Logistic function: f(z) = 1 / (1 + e^{βˆ’z})
β—¦ Classical β€œS” shape. Also called the sigmoid

The value of P(y = 1 | x) does not perfectly predict the class y.
β—¦ Only a probability of y
β—¦ Allows the linear classification to be imperfect.
β—¦ Training will not require perfect separability
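The logistic function can be sketched in a few lines of NumPy. The split between positive and negative z is a numerical-stability detail (it avoids overflow in e^{βˆ’z}), not something from the slides:

```python
import numpy as np

def logistic(z):
    """Numerically stable logistic (sigmoid) function 1 / (1 + e^{-z})."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # safe: -z <= 0
    ez = np.exp(z[~pos])                        # safe: z < 0
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-100.0, 0.0, 100.0])
p = logistic(z)   # approximately [0, 0.5, 1]
```

Note P(y = 0 | x) = 1 βˆ’ logistic(z), matching the two expressions on the slide.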

Page 26

Logistic Model as a β€œSoft” Classifier

Plot of P(y = 1 | x) = 1 / (1 + e^{βˆ’z}), z = w_1 x
β—¦ Markers are random samples

Higher w_1: the probability transition becomes sharper
β—¦ Fewer samples occur across the boundary

As w_1 β†’ ∞, the logistic model becomes the β€œhard” rule:
P(y = 1 | x) β‰ˆ 1 if x > 0, 0 if x < 0

Curves shown for z = 0.5x, z = x, z = 2x, z = 100x

Page 27

Multi-Class Logistic Regression

Suppose y ∈ {1, …, K}
β—¦ K possible classes (e.g. digits, letters, spoken words, …)

Multi-class regression:
β—¦ W ∈ R^{KΓ—d}, w_0 ∈ R^K: slope matrix and bias vector
β—¦ z = Wx + w_0: creates K linear functions

Then, the class probabilities are given by:
P(y = k | x) = e^{z_k} / Ξ£_{β„“=1}^K e^{z_β„“}

Page 28

Softmax Operation

Consider the soft-max function:
g_k(z) = e^{z_k} / Ξ£_{β„“=1}^K e^{z_β„“}
β—¦ K inputs z = (z_1, …, z_K), K outputs g(z) = (g_1(z), …, g_K(z))

Properties: g(z) is like a PMF on the labels {1, …, K}
β—¦ g_k(z) ∈ [0, 1] for each component k
β—¦ Ξ£_{k=1}^K g_k(z) = 1

Softmax property: when z_k ≫ z_β„“ for all β„“ β‰  k:
β—¦ g_k(z) β‰ˆ 1
β—¦ g_β„“(z) β‰ˆ 0 for all β„“ β‰  k

Multi-class logistic regression assigns the highest probability to class k when z_k = w_kα΅€ x + w_{0k} is largest
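The soft-max function above can be sketched as follows. Subtracting max(z) before exponentiating is a standard stability trick (it does not change g(z), since the shift cancels in the ratio):

```python
import numpy as np

def softmax(z):
    """Soft-max g_k(z) = e^{z_k} / sum_l e^{z_l}, shifted for numerical stability."""
    z = z - np.max(z)      # shifting z by a constant leaves g(z) unchanged
    ez = np.exp(z)
    return ez / ez.sum()

z = np.array([2.0, 1.0, -1.0])
g = softmax(z)             # a valid PMF: entries in [0, 1], summing to 1
```

With one dominant score, e.g. z = (100, 0, 0), the output is nearly one-hot, illustrating the "softmax property" on the slide.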

Page 29

Multi-Class Logistic Regression Decision Regions

Each decision region is defined by a set of hyperplanes

Intersection of linear constraints

Sometimes called a polytope

Page 30

Transformed Linear Models

As in regression, logistic models can be applied to transformed features

Step 1: Map x to transformed features: Ο†(x) = (Ο†_1(x), …, Ο†_p(x))α΅€

Step 2: Linear weights: z_k = Ξ£_{j=1}^p W_kj Ο†_j(x)

Step 3: Soft-max: P(y = k | z) = g_k(z), g_k(z) = e^{z_k} / Ξ£_β„“ e^{z_β„“}

Example transforms:
β—¦ Standard regression: Ο†(x) = (1, x_1, …, x_d)α΅€ (d original features, d + 1 transformed features)
β—¦ Polynomial regression: Ο†(x) = (1, x, …, x^d)α΅€ (1 original feature, d + 1 transformed features)

Additional transform step

Page 31

Using Transformed Features

Enables richer class boundaries

Example: Fig. B is not linearly separable

But, consider nonlinear features:
β—¦ Ο†(x) = (1, x_1, x_2, x_1Β², x_2Β²)α΅€

Then we can discriminate the classes with a linear function:
β—¦ z = (βˆ’rΒ², 0, 0, 1, 1) Ο†(x) = x_1Β² + x_2Β² βˆ’ rΒ²

Blue when z ≀ 0 and green when z > 0
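The circular boundary above is a linear rule in the transformed features; a minimal sketch (the sample points are made up):

```python
import numpy as np

def phi(X):
    """Nonlinear feature map phi(x) = (1, x1, x2, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2])

r = 1.0
w = np.array([-r**2, 0.0, 0.0, 1.0, 1.0])   # z = x1^2 + x2^2 - r^2

X = np.array([[0.1, 0.2], [2.0, 0.0], [0.5, -0.5], [1.0, 1.0]])
z = phi(X) @ w
label = np.where(z <= 0, "blue", "green")   # blue inside the circle of radius r
```

Although the boundary is a circle in the original (x1, x2) plane, it is a hyperplane in the 5-dimensional Ο†-space.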

Page 32

Outline

Motivating Example: Classifying a breast cancer test

Linear classifiers

Logistic regression

Fitting logistic regression models

Measuring accuracy in classification

Page 33

Learning the Logistic Model Parameters

Consider the general three-part logistic model:
β—¦ Transform to features: x ↦ Ο†(x)
β—¦ Linear weights: z = WΟ†(x), W ∈ R^{KΓ—p}
β—¦ Softmax: P(y = k | x) = g_k(z) = g_k(WΟ†(x))

The weight matrix W represents the unknown model parameters

Learning problem:
β—¦ Given training data (x_i, y_i), i = 1, …, N
β—¦ Learn the weight matrix W
β—¦ What loss function should we minimize?

Page 34

Likelihood Function

Represent the training data in vector form:
β—¦ Data matrix: X = (x_1, …, x_N)α΅€
β—¦ Class label vector: y = (y_1, …, y_N)α΅€
β—¦ One component for each training sample

Likelihood function:
β—¦ P(y | X, W) = likelihood (i.e. probability) of the class labels given the inputs X and weights W
β—¦ A function of the training data (X, y) and the parameters W

Page 35

Min and Argmin

Given a function f(x):

min_x f(x)
β—¦ Minimum value of f(x)
β—¦ A point on the y-axis

arg min_x f(x)
β—¦ Value of x where f(x) is minimized
β—¦ A point on the x-axis

Similarly, define max_x f(x) and arg max_x f(x)

Example: f(x) = (x βˆ’ 1)Β² + 2, so arg min_x f(x) = 1 and min_x f(x) = 2

Page 36

Maximum Likelihood Estimation

Given training data (X, y)

Likelihood function: P(y | X, W)

Maximum likelihood estimation:
Ŵ = arg max_W P(y | X, W)
β—¦ Finds the parameters for which the observations are most likely
β—¦ A very general estimation method

Page 37

Log Likelihood

Assume the outputs y_i are independent, each depending only on x_i. Then the likelihood factors:

P(y | X, W) = Ξ _{i=1}^N P(y_i | x_i, W)

Define the negative log likelihood:

L(W) = βˆ’ln P(y | X, W) = βˆ’Ξ£_{i=1}^N ln P(y_i | x_i, W)

The maximum likelihood estimator can be re-written as:

Ŵ = arg max_W P(y | X, W) = arg min_W L(W)

Page 38

Logistic Loss Function for Binary Classification

Negative log likelihood function: J(w) = βˆ’Ξ£_{i=1}^n ln P(y_i | x_i, w)

P(y_i = 1 | x_i, w) = 1 / (1 + e^{βˆ’z_i}),  z_i = w_{1:p}α΅€ x_i + w_0

Therefore,
P(y_i = 1 | x_i, w) = e^{z_i} / (1 + e^{z_i}),  P(y_i = 0 | x_i, w) = 1 / (1 + e^{z_i})

Hence,
ln P(y_i | x_i, w) = y_i ln P(y_i = 1 | x_i, w) + (1 βˆ’ y_i) ln P(y_i = 0 | x_i, w) = y_i z_i βˆ’ ln(1 + e^{z_i})

Loss function = binary cross-entropy:
J(w) = Ξ£_{i=1}^n [ ln(1 + e^{z_i}) βˆ’ y_i z_i ]
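The binary cross-entropy above can be computed stably with `np.logaddexp`, since ln(1 + e^z) = logaddexp(0, z); a sketch with made-up data:

```python
import numpy as np

def binary_cross_entropy(w0, w, X, y):
    """J(w) = sum_i [ ln(1 + e^{z_i}) - y_i z_i ], computed without overflow."""
    z = w0 + X @ w
    return float(np.sum(np.logaddexp(0.0, z) - y * z))

# Toy example
X = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])
J = binary_cross_entropy(0.5, np.array([1.0, -2.0]), X, y)
```

This agrees term by term with the textbook form βˆ’Ξ£_i [ y_i ln p_i + (1 βˆ’ y_i) ln(1 βˆ’ p_i) ] where p_i = 1/(1 + e^{βˆ’z_i}).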

Page 39

One-Hot Log Likelihood for Multi-Class Classification

To find the MLE, we re-write the negative log likelihood.

Define the β€œone-hot” encoding:
r_ik = 1 if y_i = k, and r_ik = 0 if y_i β‰  k,  for i = 1, …, N and k = 1, …, K

Then, ln P(y_i | x_i, W) = Ξ£_{k=1}^K r_ik ln P(y_i = k | x_i, W)

Hence, the negative log likelihood is (proof on board):
L(W) = Ξ£_{i=1}^N [ ln Ξ£_k e^{z_ik} βˆ’ Ξ£_k z_ik r_ik ]

β—¦ Sometimes called the cross-entropy
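The one-hot form of the loss can be sketched directly from the formula above (the example scores are made up):

```python
import numpy as np

def one_hot(y, K):
    """One-hot matrix: r_ik = 1 if y_i = k, else 0."""
    R = np.zeros((len(y), K))
    R[np.arange(len(y)), y] = 1.0
    return R

def multiclass_nll(Z, R):
    """L(W) = sum_i [ ln sum_k e^{z_ik} - sum_k z_ik r_ik ]  (cross-entropy).

    Z: (N, K) scores z_ik;  R: (N, K) one-hot labels r_ik.
    """
    m = Z.max(axis=1, keepdims=True)                  # stabilize the log-sum-exp
    lse = m[:, 0] + np.log(np.exp(Z - m).sum(axis=1))
    return float(np.sum(lse - np.sum(Z * R, axis=1)))

Z = np.array([[2.0, 0.5, -1.0],
              [0.0, 1.0,  0.5]])
R = one_hot(np.array([0, 2]), K=3)
L = multiclass_nll(Z, R)
```

This is exactly βˆ’Ξ£_i ln(soft-max probability of the true class), which is why it is called the cross-entropy.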

Page 40

Gradient Calculations

To minimize, set the partial derivatives to zero: βˆ‚L(W)/βˆ‚W_kj = 0 for all W_kj

Define the transform matrix: A_ij = Ο†_j(x_i)

Hence, z_ik = Ξ£_{j=1}^p A_ij W_kj

Estimated class probabilities: p_ik = e^{z_ik} / Ξ£_β„“ e^{z_iβ„“}

Gradient components are (proof on board):
βˆ‚L(W)/βˆ‚W_kj = Ξ£_{i=1}^N (p_ik βˆ’ r_ik) A_ij = 0
β—¦ K Γ— p equations and K Γ— p unknowns

Unfortunately, there is no closed-form solution to these equations
β—¦ p_ik depends nonlinearly on the terms of W
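The gradient formula above is easy to check numerically; this sketch computes both the loss and the (p_ik βˆ’ r_ik)A_ij gradient on random data:

```python
import numpy as np

def nll_and_grad(W, A, R):
    """Multi-class NLL and its gradient dL/dW_kj = sum_i (p_ik - r_ik) A_ij.

    W: (K, p) weights; A: (N, p) transformed features A_ij = phi_j(x_i);
    R: (N, K) one-hot labels.
    """
    Z = A @ W.T                                        # scores z_ik
    m = Z.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(Z - m).sum(axis=1))  # ln sum_k e^{z_ik}
    L = float(np.sum(lse - np.sum(Z * R, axis=1)))
    P = np.exp(Z - m)
    P = P / P.sum(axis=1, keepdims=True)               # probabilities p_ik
    return L, (P - R).T @ A                            # (K, p) gradient

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
y = rng.integers(0, 4, size=20)
R = np.zeros((20, 4)); R[np.arange(20), y] = 1.0
W = rng.normal(size=(4, 3))
L, G = nll_and_grad(W, A, R)
```

Since there is no closed-form solution, this loss/gradient pair is exactly what a numerical optimizer (next lecture) would iterate on.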

Page 41

Numerical Optimization

We saw that we can find minima by setting βˆ‡f(x) = 0
β—¦ M equations and M unknowns
β—¦ May not have a closed-form solution

Numerical methods find a sequence of estimates x_t with x_t β†’ x*
β—¦ Under some conditions, the sequence converges to a minimum or some other β€œgood” point
β—¦ Run as a computer program, e.g. in Python

Next lecture: numerical methods to perform the optimization

This lecture: use a built-in Python routine

Page 42

Logistic Regression in Python

Sklearn uses very efficient numerical optimization.

Mostly internal to the user
β—¦ No need to compute gradients
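A minimal sklearn sketch of the fit-and-predict workflow; the two-blob toy data below stands in for the breast-cancer features and is not from the demo:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated Gaussian classes standing in for the demo's features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([3, 3], 1.0, size=(100, 2))])
y = np.r_[np.zeros(100), np.ones(100)]

clf = LogisticRegression()           # the numerical optimization happens inside fit()
clf.fit(X, y)
p_hat = clf.predict_proba(X)[:, 1]   # soft labels P(y = 1 | x)
acc = clf.score(X, y)                # fraction of correct hard predictions
```

Note that `predict_proba` returns the logistic model's probabilities, while `predict` applies the default 0.5 threshold.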

http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Page 43

Outline

Motivating Example: Classifying a breast cancer test

Linear classifiers

Logistic regression

Fitting logistic regression models

Measuring accuracy in classification

Page 44

Errors in Binary Classification

Two types of errors:
β—¦ Type I error (false positive / false alarm): decide yΜ‚ = 1 when y = 0
β—¦ Type II error (false negative / missed detection): decide yΜ‚ = 0 when y = 1

The implications of these errors may be different
β—¦ Think of breast cancer diagnosis

The accuracy of a classifier can be measured by:
β—¦ TPR = P(yΜ‚ = 1 | y = 1)
β—¦ FPR = P(yΜ‚ = 1 | y = 0)
β—¦ Accuracy = P(yΜ‚ = y), the percentage of correct classifications

Page 45

Many Other Metrics

From the previous slide:
β—¦ TPR = P(yΜ‚ = 1 | y = 1) = sensitivity
β—¦ FPR = P(yΜ‚ = 1 | y = 0) = 1 βˆ’ specificity

Machine learning often uses (positive = items of interest in retrieval applications):
β—¦ Recall = Sensitivity = TP/(TP + FN) (How many positives are detected among all positives?)
β—¦ Precision = TP/(TP + FP) (How many detected positives are actually positive?)
β—¦ F1-score = (Precision Β· Recall) / ((Precision + Recall)/2) = 2TP / (2TP + FN + FP)
β—¦ Accuracy = (TP + TN)/(TP + FP + TN + FN) (percentage of correct classifications)

Medical tests:
β—¦ Sensitivity = P(yΜ‚ = 1 | y = 1) = TPR
β—¦ Specificity = P(yΜ‚ = 0 | y = 0) = 1 βˆ’ FPR = true negative rate
β—¦ Need a good tradeoff between sensitivity and specificity
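The count-based definitions above can be sketched in plain Python (the example labels are made up):

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy from raw TP/FP/TN/FN counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                     # = sensitivity = TPR
    f1 = 2 * tp / (2 * tp + fn + fp)            # harmonic mean of precision and recall
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

prec, rec, f1, acc = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                    [1, 0, 1, 0, 1, 0, 1, 0])
```

The two F1 formulas on the slide agree: 2PR/(P + R) reduces to 2TP/(2TP + FN + FP) after substituting the count definitions.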

Page 46

Breast Cancer

Measure accuracy on test data

Use 4-fold cross-validation

Sklearn has built-in functions for CV

Page 47

Go Through the Demo

Up to binary classification with cross-validation

Page 48

Hard Decisions

The logistic classifier outputs a soft label: P(y = 1 | x) ∈ [0, 1]
β—¦ P(y = 1 | x) β‰ˆ 1 β‡’ y = 1 more likely
β—¦ P(y = 0 | x) β‰ˆ 1 β‡’ y = 0 more likely

Can obtain a hard label by thresholding:
β—¦ Set yΜ‚ = 1 if P(y = 1 | x) > t
β—¦ t = threshold

How to set the threshold?
β—¦ Setting t = 1/2 minimizes the overall error rate
β—¦ Increasing t decreases false positives, but also reduces sensitivity
β—¦ Decreasing t increases sensitivity, but also increases false positives
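Thresholding the soft labels is one line; a minimal sketch with made-up probabilities:

```python
import numpy as np

def hard_labels(p1, t):
    """y_hat = 1 when the soft label P(y=1|x) exceeds the threshold t."""
    return (np.asarray(p1) > t).astype(int)

p1 = np.array([0.1, 0.4, 0.55, 0.9])   # e.g. clf.predict_proba(X)[:, 1]
hard_labels(p1, 0.5)                   # -> [0, 0, 1, 1]
```

Raising t can only turn 1s into 0s, which is the monotone trade-off described in the bullets above.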

Page 49

ROC Curve

Varying the threshold yields a set of classifiers

Trades off FPR (1 βˆ’ specificity) and TPR (sensitivity)

Can visualize with the ROC curve
β—¦ Receiver operating characteristic curve
β—¦ The term comes from digital communications

Page 50

Area Under the Curve (AUC)

Since one may choose the threshold based on the desired trade-off between TPR and FPR, it may not be appropriate to evaluate the performance of a classifier at a single fixed threshold.

AUC is a measure of goodness for a classifier that is independent of the threshold.

A method with a higher AUC achieves a higher TPR at the same FPR.

What is the highest possible AUC?

Should report the average AUC over the cross-validation folds.
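The ROC sweep and AUC can be sketched in pure NumPy (this simplified version ignores tied scores; in practice one would use sklearn's `roc_curve` and `roc_auc_score`):

```python
import numpy as np

def roc_points(scores, y):
    """(FPR, TPR) pairs obtained by sweeping the threshold from high to low."""
    order = np.argsort(-np.asarray(scores))       # descending scores
    y = np.asarray(y)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (len(y) - y.sum())])
    return fpr, tpr

def auc(scores, y):
    """Area under the ROC curve via the trapezoid rule."""
    fpr, tpr = roc_points(scores, y)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

scores = np.array([0.9, 0.8, 0.35, 0.2])   # soft labels P(y=1|x)
y      = np.array([1,   1,   0,    0])
auc(scores, y)   # perfectly ranked scores give AUC = 1
```

This answers the slide's question: the highest possible AUC is 1, achieved when every positive is scored above every negative; random guessing gives about 0.5.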

Page 51

Go Through the Demo

Up to binary classification evaluation using ROC

Page 52

Multi-Class Classification in Python

Two options:

One vs. Rest (OVR)
β—¦ Solve a binary classification problem for each class k
β—¦ For each class k, train on modified binary labels (indicating whether the sample is in class k or not):
yΜƒ_i = 1 if y_i = k, and yΜƒ_i = 0 if y_i β‰  k
β—¦ Predict based on the classifier that yields the highest score

Multinomial
β—¦ Directly solve for the weights of all classes using the multi-class cross-entropy
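Both options can be sketched with sklearn; the three-blob toy data is made up, and note that in recent sklearn versions a plain `LogisticRegression` already fits the multinomial (soft-max) model by default:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three Gaussian blobs as toy 3-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)

# Option 1: one binary logistic classifier per class, predict by the highest score
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Option 2: a single multinomial (soft-max) model
multi = LogisticRegression().fit(X, y)

acc_ovr, acc_multi = ovr.score(X, y), multi.score(X, y)
```

On well-separated data the two options give nearly identical predictions; they can differ near class boundaries, since OVR's K independent classifiers are not constrained to produce a single consistent PMF.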

Page 53

Metrics for Multiclass Classification

Use a K Γ— K confusion matrix

Should normalize the matrix:
β—¦ Each row sums to 1

Can compute accuracy:
β—¦ Per class: the diagonal entry
β—¦ Average: the average of the diagonal entries
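The confusion-matrix bookkeeping above can be sketched as follows (the labels are made up):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, K):
    """K x K counts: rows = real class, columns = predicted class."""
    C = np.zeros((K, K))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

def normalize_rows(C):
    return C / C.sum(axis=1, keepdims=True)   # each row sums to 1

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]
Cn = normalize_rows(confusion_matrix(y_true, y_pred, K=3))
per_class_acc = np.diag(Cn)        # per-class accuracy = diagonal entries
avg_acc = per_class_acc.mean()     # average of the diagonal
```

Row-normalizing makes each row the conditional distribution P(yΜ‚ | y = k), so the diagonal entry of row k is exactly the per-class accuracy.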

(Confusion matrix layout: rows = real class 1, 2, …, K; columns = predicted class 1, 2, …, K)

Page 54

LASSO Regularization for Logistic Regression

As in linear regression, we can use LASSO regularization with logistic regression
β—¦ Forces the weight coefficients to be sparse

Add an L1 penalty:
L(W) = Ξ£_{i=1}^N [ ln Ξ£_k e^{z_ik} βˆ’ Ξ£_k z_ik r_ik ] + Ξ» β€–Wβ€–_1

The regularization level Ξ» should be chosen via cross-validation, as before

Sklearn implementation:
β—¦ By default uses an L2 penalty, which reduces the magnitude of the weights
β—¦ C is the inverse of the regularization strength (C = 1/Ξ»); must be a positive float
β—¦ Use a large C if you do not want to apply regularization

Go through the LASSO part of the demo
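A sketch of L1-regularized logistic regression in sklearn. The toy data (only 2 of 10 features informative) and the particular C values are assumptions for illustration; `solver="liblinear"` is one of the sklearn solvers that supports `penalty="l1"`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: only the first 2 of 10 features carry information about the label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# L1 (LASSO) penalty; remember C = 1/lambda, so small C = strong regularization
strong = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
weak = LogisticRegression(penalty="l1", solver="liblinear", C=100.0).fit(X, y)

nz_strong = int(np.sum(strong.coef_ != 0))   # sparse: noise features zeroed out
nz_weak = int(np.sum(weak.coef_ != 0))
```

Inspecting which coefficients survive the strong penalty is the feature-selection use of LASSO described on the slide.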

Page 55

Go Through the Demo

Go through the last part, with LASSO regression

Page 56

What You Should Know

Formulate a machine learning problem as a classification problem
β—¦ Identify features, class variable, training data

Visualize classification data using a scatter plot.

Describe a linear classifier as an equation and on a plot.
β—¦ Determine visually if data is perfectly linearly separable.

Formulate a classification problem using logistic regression
β—¦ Binary and multi-class
β—¦ Describe the logistic and soft-max functions
β—¦ Understand the idea of using the logistic function to approximate the probability

Derive the loss function for ML estimation of the weights in logistic regression

Use sklearn packages to fit logistic regression models

Measure the accuracy of classification: precision, recall, accuracy

Adjust the threshold of classifiers to trade off types of classification errors. Draw an ROC curve and determine the AUC.

Perform LASSO regularization for feature selection

Perform LASSO regularization for feature selection
