C3.1.logistic intro


Transcript of C3.1.logistic intro

Page 1: C3.1.logistic intro

Machine Learning Specialization

Linear classifiers: Logistic regression

Emily Fox & Carlos Guestrin

University of Washington

©2015-2016 Emily Fox & Carlos Guestrin

Page 2: C3.1.logistic intro

Predicting sentiment by topic: An intelligent restaurant review system

Page 3: C3.1.logistic intro

It's a big day & I want to book a table at a nice Japanese restaurant.

Seattle has many ★★★★ sushi restaurants.

What are people saying about the food? The ambiance? …

Page 4: C3.1.logistic intro

Positive reviews not positive about everything

Sample review:

Watching the chefs create incredible edible art made the experience very unique.

My wife tried their ramen and it was pretty forgettable.

All the sushi was delicious! Easily best sushi in Seattle.

Page 5: C3.1.logistic intro

Classifying sentiment of review

"Easily best sushi in Seattle." → Sentence Sentiment Classifier

Page 6: C3.1.logistic intro

Linear classifier: Intuition

Page 7: C3.1.logistic intro

Classifier

Input: x (sentence from review) → Classifier MODEL → Output: y (predicted class, ŷ = +1 or ŷ = -1)

Note: we'll start by talking about 2 classes, and address multiclass later.

Page 8: C3.1.logistic intro

A (linear) classifier

•  Will use training data to learn a weight or coefficient for each word

Word                            Coefficient
good                            1.0
great                           1.5
awesome                         2.7
bad                             -1.0
terrible                        -2.1
awful                           -3.3
restaurant, the, we, where, …   0.0
…                               …

Page 9: C3.1.logistic intro

Scoring a sentence

Word                            Coefficient
good                            1.0
great                           1.2
awesome                         1.7
bad                             -1.0
terrible                        -2.1
awful                           -3.3
restaurant, the, we, where, …   0.0
…                               …

Input xi: Sushi was great, the food was awesome, but the service was terrible.

Score(xi) = 1.2 (great) + 1.7 (awesome) – 2.1 (terrible) = 0.8

Called a linear classifier, because output is weighted sum of input.

Page 10: C3.1.logistic intro

Simple linear classifier

Input: x (sentence from review)

Score(x) = weighted count of words in sentence

If Score(x) > 0:
  ŷ = +1
Else:
  ŷ = -1
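
A minimal sketch of this simple linear classifier in Python, assuming the hand-picked coefficient table from the previous slide (in the course these coefficients are learned from data):

```python
import re

# Word coefficients from the table above; unknown words count as 0.0.
coefficients = {"good": 1.0, "great": 1.2, "awesome": 1.7,
                "bad": -1.0, "terrible": -2.1, "awful": -3.3}

def score(sentence):
    # Weighted count of words in the sentence.
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(coefficients.get(w, 0.0) for w in words)

def predict(sentence):
    # If Score(x) > 0: ŷ = +1, else ŷ = -1
    return +1 if score(sentence) > 0 else -1

x = "Sushi was great, the food was awesome, but the service was terrible."
print(round(score(x), 2), predict(x))  # 0.8 +1
```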

Page 11: C3.1.logistic intro

Training a classifier = Learning the coefficients

Data (x,y): (Sentence1, y1), (Sentence2, y2), …
→ split into Training set and Validation set
→ Learn classifier on the training set, then measure Accuracy on the validation set

Word      Coefficient
good      1.0
awesome   1.7
bad       -1.0
awful     -3.3
…         …

Page 12: C3.1.logistic intro

Decision boundaries

Page 13: C3.1.logistic intro

Suppose only two words had non-zero coefficients

Word       Coefficient
#awesome   1.0
#awful     -1.5

Score(x) = 1.0 #awesome – 1.5 #awful

[Plot: #awful vs. #awesome, with the review below marked at (#awesome = 2, #awful = 1)]

"Sushi was awesome, the food was awesome, but the service was awful."
→ #awesome = 2, #awful = 1, so Score(x) = 2(1.0) – 1(1.5) = 0.5

Page 14: C3.1.logistic intro

Decision boundary example

Word       Coefficient
#awesome   1.0
#awful     -1.5

Score(x) = 1.0 #awesome – 1.5 #awful

[Plot: #awful vs. #awesome; the line 1.0 #awesome – 1.5 #awful = 0 is the decision boundary, with Score(x) > 0 on the #awesome-heavy side and Score(x) < 0 on the #awful-heavy side]

Page 15: C3.1.logistic intro

Decision boundary separates positive & negative predictions

•  For linear classifiers:
   - When 2 coefficients are non-zero ⇒ line
   - When 3 coefficients are non-zero ⇒ plane
   - When many coefficients are non-zero ⇒ hyperplane
•  For more general classifiers ⇒ more complicated shapes
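
As a quick numeric sketch of the two-coefficient case (the line), assuming the running Score(x) = 1.0 #awesome – 1.5 #awful model:

```python
# Which side of the decision boundary 1.0*#awesome - 1.5*#awful = 0 a point falls on.
def side(n_awesome, n_awful, w1=1.0, w2=-1.5):
    s = w1 * n_awesome + w2 * n_awful
    if s > 0:
        return "+1 region"
    if s < 0:
        return "-1 region"
    return "on the boundary"

print(side(3, 1))  # 1.5  → +1 region
print(side(1, 2))  # -2.0 → -1 region
print(side(3, 2))  # 0.0  → on the boundary
```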

Page 16: C3.1.logistic intro

Linear classifier: Model

Page 17: C3.1.logistic intro

[Block diagram: x → Feature extraction → h(x) → ML model → ŷ; the Quality metric compares ŷ against the true y, and the ML algorithm uses it to update the coefficients ŵ]

Page 18: C3.1.logistic intro

Coefficients of classifier

Score(x) = w0 + w1 #awesome + w2 #awful + w3 #great

where x[1] = #awesome, x[2] = #awful, x[3] = #great

Page 19: C3.1.logistic intro

General notation

Output: y ∈ {-1,+1}
Inputs: x = (x[1], x[2], …, x[d])   (d-dim vector)

Notational conventions:
  x[j] = jth input (scalar)
  hj(x) = jth feature (scalar)
  xi = input of ith data point (vector)
  xi[j] = jth input of ith data point (scalar)

Page 20: C3.1.logistic intro

Simple hyperplane

Model: ŷi = sign(Score(xi))

Score(xi) = w0 + w1 xi[1] + … + wd xi[d] = wTxi

feature 1   = 1 (constant)
feature 2   = x[1] … e.g., #awesome
feature 3   = x[2] … e.g., #awful
…
feature d+1 = x[d] … e.g., #ramen

Page 21: C3.1.logistic intro

Decision boundary: effect of changing coefficients

Input        Coefficient   Value
(constant)   w0            0.0
#awesome     w1            1.0
#awful       w2            -1.5

Score(x) = 1.0 #awesome – 1.5 #awful

[Plot: #awful vs. #awesome; the boundary passes through the origin, with Score(x) > 0 on one side and Score(x) < 0 on the other]

Page 22: C3.1.logistic intro

Decision boundary: effect of changing coefficients

Input        Coefficient   Value
(constant)   w0            1.0 (changed from 0.0)
#awesome     w1            1.0
#awful       w2            -1.5

Score(x) = 1.0 + 1.0 #awesome – 1.5 #awful

[Plot: increasing w0 from 0.0 to 1.0 shifts the boundary line, enlarging the Score(x) > 0 region]

Page 23: C3.1.logistic intro

Decision boundary: effect of changing coefficients

Input        Coefficient   Value
(constant)   w0            1.0
#awesome     w1            1.0
#awful       w2            -3.0 (changed from -1.5)

Score(x) = 1.0 + 1.0 #awesome – 3.0 #awful

[Plot: making w2 more negative rotates the boundary line, shrinking the Score(x) > 0 region]

Page 24: C3.1.logistic intro

More generic features… D-dimensional hyperplane

Model: ŷi = sign(Score(xi))

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi)
          = Σj=0…D wj hj(xi) = wTh(xi)

feature 1   = h0(x) … e.g., 1 (constant)
feature 2   = h1(x) … e.g., x[1] = #awesome
feature 3   = h2(x) … e.g., x[2] = #awful,
              or log(x[7]) x[2] = log(#bad) x #awful,
              or tf-idf("awful")
…
feature D+1 = hD(x) … some other function of x[1],…, x[d]
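
A short Python sketch of this generic-feature score; the feature functions and coefficients are illustrative stand-ins (log1p replaces the slide's log so that #bad = 0 stays finite), assuming x is a dict of raw word counts:

```python
import math

# Generic features h_j(x); x is assumed to be a dict of raw word counts.
features = [
    lambda x: 1.0,                                               # h0: constant
    lambda x: x.get("awesome", 0),                               # h1: #awesome
    lambda x: x.get("awful", 0),                                 # h2: #awful
    lambda x: math.log1p(x.get("bad", 0)) * x.get("awful", 0),   # h3: log(1+#bad)·#awful
]
w = [0.0, 1.0, -1.5, -0.5]  # hypothetical coefficients, one per feature

def score(x):
    # Score(x) = sum_j w_j h_j(x) = wTh(x)
    return sum(wj * hj(x) for wj, hj in zip(w, features))

def predict(x):
    # Model: ŷ = sign(Score(x))
    return +1 if score(x) > 0 else -1

x = {"awesome": 2, "awful": 1, "bad": 0}
print(score(x), predict(x))  # 0.5 +1
```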

Page 25: C3.1.logistic intro

[Block diagram, as before, now with the model specified: ŷ = sign(ŵTh(x)), where y is either -1 or +1]

Page 26: C3.1.logistic intro

Are you sure about the prediction? Class probability

Page 27: C3.1.logistic intro

How confident is your prediction?

•  Thus far, we've output a prediction: ŷ = +1 or ŷ = -1
•  But, how sure are you about the prediction?

"The sushi & everything else were awesome!" → Definite: ŷ = +1 with high probability
"The sushi was good, the service was OK" → Not sure: ŷ = +1 with probability 0.5

Page 28: C3.1.logistic intro

Basics of probabilities – quick review

Page 29: C3.1.logistic intro

Basic probability

x = review text                                                y = sentiment
All the sushi was delicious! Easily best sushi in Seattle.     +1
The sushi & everything else were awesome!                      +1
My wife tried their ramen, it was pretty forgettable.          -1
The sushi was good, the service was OK                         +1
…                                                              …

Probability a review is positive is 0.7
→ I expect 70% of rows to have y = +1
  (Exact number will vary for each specific dataset)

Page 30: C3.1.logistic intro

Interpreting probabilities as degrees of belief

P(y=+1) = 0.0 → absolutely sure reviews are negative
P(y=+1) = 0.5 → not sure if reviews are positive or negative
P(y=+1) = 1.0 → absolutely sure reviews are positive

Page 31: C3.1.logistic intro

Key properties of probabilities

•  Probabilities are always between 0 & 1:
   - Two class (e.g., y is +1 or -1): 0 ≤ P(y=+1) ≤ 1
   - Multiple classes (e.g., y is dog, cat or bird): 0 ≤ P(y=dog) ≤ 1, and likewise for each class
•  Probabilities sum up to 1:
   - Two class: P(y=+1) + P(y=-1) = 1
   - Multiple classes: P(y=dog) + P(y=cat) + P(y=bird) = 1

Page 32: C3.1.logistic intro

Conditional probability

x = review text                                                y = sentiment
All the sushi was delicious! Easily best sushi in Seattle.     +1
Sushi was awesome & everything else was awesome! The service was awful, but overall awesome place!   +1
My wife tried their ramen, it was pretty forgettable.          -1
The sushi was good, the service was OK                         +1
awesome … awesome … awful … awesome                            +1
awesome … awesome … awful … awesome                            -1
awesome … awesome … awful … awesome                            +1
…                                                              …

Probability a review with 3 "awesome" and 1 "awful" is positive is 0.9
→ I expect 90% of rows with reviews containing 3 "awesome" & 1 "awful" to have y = +1
  (Exact number will vary for each specific dataset)

Page 33: C3.1.logistic intro

Interpreting conditional probabilities

P(y=+1 | xi="All the sushi was delicious!")   (output label given input sentence)

= 0.0 → absolutely sure review "All the sushi was delicious!" is negative
= 0.5 → not sure if review "All the sushi was delicious!" is positive or negative
= 1.0 → absolutely sure review "All the sushi was delicious!" is positive

Page 34: C3.1.logistic intro

Key properties of conditional probabilities

•  Conditional probabilities are always between 0 & 1:
   - Two class (e.g., y is +1 or -1, xi is review text): 0 ≤ P(y=+1|xi) ≤ 1
   - Multiple classes (e.g., y is dog, cat or bird, xi is image): 0 ≤ P(y=dog|xi) ≤ 1, and likewise for each class
•  Conditional probabilities sum up to 1 over y, but not over x:
   - Two class: P(y=+1|xi) + P(y=-1|xi) = 1
   - Multiple classes: P(y=dog|xi) + P(y=cat|xi) + P(y=bird|xi) = 1

Page 35: C3.1.logistic intro

Using probabilities in classification

Page 36: C3.1.logistic intro

How confident is your prediction?

P(y=+1 | x="The sushi & everything else were awesome!") = 0.99 → Definite
P(y=+1 | x="The sushi was good, the service was OK") = 0.55 → Not sure

(Output label given input sentence.)

Many classifiers provide a degree of certainty: P(y|x)
Extremely useful in practice

Page 37: C3.1.logistic intro

Goal: Learn conditional probabilities from data

Training data: N observations (xi,yi)

x[1] = #awesome   x[2] = #awful   y = sentiment
2                 1               +1
0                 2               -1
3                 3               -1
4                 1               +1
…                 …               …

Optimize quality metric on training data
→ Find best model P̂ by finding best ŵ; useful for predicting ŷ

Page 38: C3.1.logistic intro

Predict most likely class

P̂(y|x) = estimate of class probabilities

Input: x (sentence from review)

If P̂(y=+1|x) > 0.5:
  ŷ = +1
Else:
  ŷ = -1

•  Estimating P̂(y|x) improves interpretability:
   - Predict ŷ = +1 and tell me how sure you are

Page 39: C3.1.logistic intro

Predicting class probabilities with generalized linear models

Page 40: C3.1.logistic intro

[Block diagram, as before, with the model output now the class probability estimate P̂(y|x)]

Page 41: C3.1.logistic intro

Thus far, we focused on decision boundaries

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wTh(xi)

[Plot: #awful vs. #awesome, with the boundary Score(x) = 0 separating Score(x) > 0 from Score(x) < 0]

How do we relate Score(xi) to P̂(y=+1|x,ŵ)?

Page 42: C3.1.logistic intro

Interpreting Score(xi)

Score(xi) = wTh(xi) ranges over (-∞, +∞):

Score(xi) very negative → very sure ŷi = -1 → P̂(y=+1|xi) = 0
Score(xi) = 0.0 → not sure if ŷi = -1 or +1 → P̂(y=+1|xi) = 0.5
Score(xi) very positive → very sure ŷi = +1 → P̂(y=+1|xi) = 1

Page 43: C3.1.logistic intro

Why not just use regression to build classifier?

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi)

-∞ < Score(xi) < +∞, but probabilities P(y=+1|xi) lie between 0 and 1.

How do we link (-∞,+∞) to [0,1]?

Page 44: C3.1.logistic intro

Link function: squeeze real line into [0,1]

P̂(y=+1|xi) = g(wTh(xi))

The link function g maps Score(xi) = wTh(xi), which ranges over (-∞,+∞), into [0,1]. This is called a generalized linear model.

Page 45: C3.1.logistic intro

[Block diagram, as before, with the model P̂(y=+1|x,ŵ) = g(ŵTh(x))]

Page 46: C3.1.logistic intro

Logistic regression classifier: linear score with logistic link function

Page 47: C3.1.logistic intro

Logistic function (sigmoid, logit)

sigmoid(Score) = 1 / (1 + e^(-Score))

Score            -∞    -2     0.0   +2     +∞
sigmoid(Score)   0     0.12   0.5   0.88   1
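
A one-line Python version of the sigmoid, with the table's values checked numerically:

```python
import math

def sigmoid(score):
    # Maps any real score into (0, 1); sigmoid(0) = 0.5.
    return 1.0 / (1.0 + math.exp(-score))

for s in (-2.0, 0.0, 2.0):
    print(s, round(sigmoid(s), 3))  # -2.0 → 0.119, 0.0 → 0.5, 2.0 → 0.881
```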

Page 48: C3.1.logistic intro

Logistic regression model

P(y=+1|xi,w) = sigmoid(Score(xi))

The score Score(xi) = wTh(xi) ranges over (-∞,+∞); the sigmoid squeezes it into [0,1], with Score → -∞ giving 0.0, Score = 0 giving 0.5, and Score → +∞ giving 1.0.

Page 49: C3.1.logistic intro

Understanding the logistic regression model

P(y = +1 | x,w) = 1 / (1 + e^(-wTh(x)))

[Plot: P(y=+1|xi,w) as an S-shaped curve of the score wTh(xi)]
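
Putting the pieces together as a short sketch, assuming h(x) is already computed as a list of feature values (names reused from the earlier sketches):

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def prob_positive(h_x, w):
    # P(y=+1|x,w) = sigmoid(wTh(x))
    score = sum(wj * hj for wj, hj in zip(w, h_x))
    return sigmoid(score)

h_x = [1.0, 2, 1]          # h(x) = (1, #awesome, #awful)
w = [0.0, 1.0, -1.5]       # coefficients from the running example
p = prob_positive(h_x, w)  # sigmoid(0.5) ≈ 0.62
print(round(p, 2), +1 if p > 0.5 else -1)  # 0.62 +1
```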

Page 50: C3.1.logistic intro

Logistic regression ⇒ Linear decision boundary

[Plot: #awful vs. #awesome; P(y=+1|x,w) = 1/(1 + e^(-wTh(x))) equals 0.5 exactly on the line wTh(x) = 0, so the decision boundary is the same line as before]

Page 51: C3.1.logistic intro

Effect of coefficients on logistic regression model

[Three curves of P(y=+1|x,w) = 1/(1 + e^(-wTh(x))) plotted against #awesome - #awful:]

w0        0    |   w0        0    |   w0        -2
w#awesome +1   |   w#awesome +3   |   w#awesome +1
w#awful   -1   |   w#awful   -3   |   w#awful   -1

Larger-magnitude coefficients (middle) make the curve steeper; a negative intercept w0 (right) shifts the curve so that more #awesome is needed before P(y=+1|x) exceeds 0.5.
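
A quick numeric check of the three settings, evaluating P(y=+1|x,w) at a review with #awesome = 2 and #awful = 1 (sigmoid as defined above):

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

settings = [
    (0.0, 1.0, -1.0),   # baseline
    (0.0, 3.0, -3.0),   # larger-magnitude coefficients: steeper, more confident
    (-2.0, 1.0, -1.0),  # negative intercept w0: curve shifted
]
for w0, w1, w2 in settings:
    p = sigmoid(w0 + w1 * 2 + w2 * 1)  # #awesome = 2, #awful = 1
    print((w0, w1, w2), round(p, 3))   # 0.731, 0.953, 0.269
```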

Page 52: C3.1.logistic intro

[Block diagram, as before, with the logistic regression model: P(y=+1|x,ŵ) = sigmoid(ŵTh(x)) = 1 / (1 + e^(-ŵTh(x)))]

Page 53: C3.1.logistic intro

Overview of learning logistic regression model

Page 54: C3.1.logistic intro

Training a classifier = Learning the coefficients

Data (x,y): (Sentence1, y1), (Sentence2, y2), …
→ split into Training set and Validation set
→ Learn classifier, then assess its Quality

Word         Coefficient   Value
(constant)   ŵ0            -2.0
good         ŵ1            1.0
awesome      ŵ2            1.7
bad          ŵ3            -1.0
awful        ŵ4            -3.3
…            …             …

P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵTh(x)))

Page 55: C3.1.logistic intro

Find "best" classifier = Maximize quality metric over all possible w0,w1,w2

Likelihood ℓ(w):
  ℓ(w0=1, w1=0.5, w2=-1.5) = 10^-4
  ℓ(w0=1, w1=1,   w2=-1.5) = 10^-5
  ℓ(w0=0, w1=1,   w2=-1.5) = 10^-6

Best model: highest likelihood ℓ(w) → ŵ = (w0=1, w1=0.5, w2=-1.5)

Find best model coefficients w with gradient ascent!
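
The learning algorithm is covered later in the course; as a hedged preview, here is a minimal gradient-ascent sketch, assuming the standard log-likelihood gradient for logistic regression, ∂ℓ/∂wj = Σi hj(xi)(1[yi=+1] − P(y=+1|xi,w)), run on the toy data from the table in "Goal: Learn conditional probabilities from data":

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Toy data: h(x) = (1, #awesome, #awful), y in {+1, -1}.
data = [([1, 2, 1], +1), ([1, 0, 2], -1), ([1, 3, 3], -1), ([1, 4, 1], +1)]

w = [0.0, 0.0, 0.0]
eta = 0.1  # step size, an arbitrary choice for this sketch
for _ in range(1000):
    grad = [0.0] * len(w)
    for h, y in data:
        p = sigmoid(sum(wj * hj for wj, hj in zip(w, h)))
        err = (1.0 if y == +1 else 0.0) - p  # 1[y=+1] - P(y=+1|x,w)
        for j, hj in enumerate(h):
            grad[j] += hj * err
    w = [wj + eta * gj for wj, gj in zip(w, grad)]

print([round(wj, 2) for wj in w])  # coefficients that climb toward higher ℓ(w)
```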

Page 56: C3.1.logistic intro

Encoding categorical inputs

Page 57: C3.1.logistic intro

Categorical inputs

•  Numeric inputs:
   - #awesome, age, salary, …
   - Intuitive when multiplied by coefficient
     • e.g., 1.5 #awesome
•  Categorical inputs:
   - Gender (Male, Female, …)
   - Country of birth (Argentina, Brazil, USA, …)
   - Zipcode (10005, 98195, …): numeric value, but should be interpreted as category (98195 not about 9x larger than 10005)

How do we multiply a category by a coefficient? Must convert categorical inputs into numeric features.

Page 58: C3.1.logistic intro

Encoding categories as numeric features

Country of birth (Argentina, Brazil, USA, …): 196 categories
→ 1-hot encoding: 196 features h1(x),…,h196(x), one per country.
  x = Brazil gets a 1 in Brazil's feature and 0 in all others; likewise x = Zimbabwe, with the 1 in Zimbabwe's feature.

Restaurant review (text data): 10,000 words in vocabulary
→ Bag of words: 10,000 features h1(x),…,h10000(x), one count per vocabulary word.
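
A minimal 1-hot encoding sketch, assuming a small fixed category list (three countries stand in for the 196 in the slide):

```python
countries = ["Argentina", "Brazil", "USA"]  # stand-in for the 196 categories

def one_hot(country):
    # One feature per category: 1 for the matching category, 0 for all others.
    return [1 if country == c else 0 for c in countries]

print(one_hot("Brazil"))  # [0, 1, 0]
```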

Page 59: C3.1.logistic intro

Multiclass classification using 1 versus all

Page 60: C3.1.logistic intro

Multiclass classification

Input: x (image pixels) → Output: y (object in image)

Page 61: C3.1.logistic intro

Multiclass classification formulation

•  C possible classes: y can be 1, 2, …, C
•  N datapoints:

Data point   x[1]   x[2]   y
x1,y1        2      1      y1
x2,y2        0      2      y2
x3,y3        3      3      y3
x4,y4        4      1      y4

(In the slide, each yi is shown as a picture of the object class.)

Learn: P̂(y=c|x) for each class c = 1, …, C

Page 62: C3.1.logistic intro

1 versus all: Estimate P̂(y=c|x) using 2-class model

Train classifier:
  +1 class: points with yi = c
  -1 class: points with yi ≠ c (all other classes)
This yields P̂(y=+1|x).

Predict:
  P̂(y=c|xi) = P̂(y=+1|xi)
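
A sketch of the training side, assuming a hypothetical binary trainer train_logistic(examples) that returns a function mapping x to P̂(y=+1|x):

```python
def train_one_vs_all(data, num_classes):
    # For each class c, relabel the data (+1 if yi = c, else -1) and train
    # a 2-class model; train_logistic is a hypothetical binary trainer.
    models = []
    for c in range(1, num_classes + 1):
        relabeled = [(x, +1 if y == c else -1) for x, y in data]
        models.append(train_logistic(relabeled))  # returns x -> P̂c(y=+1|x)
    return models
```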

Page 63: C3.1.logistic intro

1 versus all: simple multiclass classification using C 2-class models

Train one 2-class model per class c; its output P̂c(y=+1|xi) serves as the estimate of P̂(y=c|xi), for each c = 1, …, C.

Page 64: C3.1.logistic intro

Predict most likely class

Multiclass training → P̂c(y=+1|x) = estimate of 1 vs all model for each class

Input: xi

max_prob = 0; ŷ = 0
For c = 1,…,C:
  If P̂c(y=+1|xi) > max_prob:
    ŷ = c
    max_prob = P̂c(y=+1|xi)
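
The same loop as runnable Python, assuming models is the list returned by the training sketch above, so models[c-1](x) gives P̂c(y=+1|x):

```python
def predict_multiclass(x, models):
    # Pick the class whose 1-vs-all model assigns the highest P̂c(y=+1|x).
    max_prob, y_hat = 0.0, 0
    for c, model in enumerate(models, start=1):
        p = model(x)
        if p > max_prob:
            y_hat, max_prob = c, p
    return y_hat
```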

Page 65: C3.1.logistic intro

Summary of logistic regression classifier

Page 66: C3.1.logistic intro

[Block diagram, as before, with the final logistic regression model: P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵTh(x)))]

Page 67: C3.1.logistic intro

What you can do now…

•  Describe decision boundaries and linear classifiers
•  Use class probability to express degree of confidence in prediction
•  Define a logistic regression model
•  Interpret logistic regression outputs as class probabilities
•  Describe impact of coefficient values on logistic regression output
•  Use 1-hot encoding to represent categorical inputs
•  Perform multiclass classification using the 1-versus-all approach