Transcript of Lecture 2: Generative Learning (tzhao80/Lectures/Lecture_2.pdf)

CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis

Page 1:

Lecture 2: Generative Learning

Tuo Zhao

Schools of ISYE and CSE, Georgia Tech

Page 2: Generative Learning

Page 3: Generative Learning

Page 4: Modeling Dogs

Page 5: Modeling Cats

Page 6: Discriminative Learning

[Figure: side-by-side "Generative" vs. "Discriminative" panels.]

Page 7: Which One is Better for Classification?

Page 8: Joint and Posterior Distributions

We consider a binary classification problem:

Feature: X ∈ R^d

Response: Y ∈ {0, 1}

Class Prior: P(Y = 1) = p and P(Y = 0) = 1 − p

Posterior: the conditional probability of Y given X, i.e.,

P(Y|X) = P(Y) P(X|Y) / P(X).


Page 13: Discriminative Learning

The posterior is sufficient for prediction:

y = argmax_y P(Y = y | X = x)
  = argmax_y P(Y = y) P(X = x | Y = y) / P(X = x)
  = argmax_y P(Y = y) P(X = x | Y = y)
  = argmax_y P(X = x, Y = y)

Which one to model? The joint distribution or the conditional distribution?
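As a quick numerical check of the chain of equalities above, here is a small sketch on a hypothetical two-class problem (the prior and the class-conditional values at a single point x are made-up numbers): dropping the shared denominator P(X = x) does not change the argmax.

    import numpy as np

    # Hypothetical prior P(Y) and class-conditional values P(X = x | Y) at one point x.
    prior = np.array([0.7, 0.3])          # P(Y = 0), P(Y = 1)
    likelihood = np.array([0.02, 0.05])   # P(X = x | Y = 0), P(X = x | Y = 1)

    joint = prior * likelihood            # P(X = x, Y = y)
    posterior = joint / joint.sum()       # P(Y = y | X = x), i.e. divide by P(X = x)

    # The argmax is the same whether or not we normalize by P(X = x).
    assert joint.argmax() == posterior.argmax()
    print(posterior, posterior.argmax())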

Page 14: Gaussian Discriminant Analysis

Page 15: Gaussian Discriminant Analysis

Multivariate Gaussian Distribution: X ∼ N(µ, Σ)

Probability Density Function:

f(x; µ, Σ) = exp(−(1/2)(x − µ)^T Σ^{-1} (x − µ)) / ((2π)^{d/2} |Σ|^{1/2})

Expectation: E[X] = µ

Covariance: E[(X − µ)(X − µ)^T] = Σ

Standard Gaussian Distribution: µ = 0 and Σ = I_d.
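As a sanity check of the density formula above, here is a minimal NumPy sketch compared against scipy.stats.multivariate_normal at a random point (the dimension d = 3 and the test point are arbitrary choices, not from the slides).

    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_pdf(x, mu, Sigma):
        """Density of N(mu, Sigma) at x, following the formula on this slide."""
        d = mu.shape[0]
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)        # (x - mu)^T Sigma^{-1} (x - mu)
        norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * quad) / norm_const

    rng = np.random.default_rng(0)
    d = 3
    mu = rng.normal(size=d)
    A = rng.normal(size=(d, d))
    Sigma = A @ A.T + d * np.eye(d)                       # a random positive definite covariance
    x = rng.normal(size=d)

    print(gaussian_pdf(x, mu, Sigma))
    print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match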


Page 19: Gaussian Discriminant Analysis

Multivariate Gaussian Distribution: X ∼ N(µ, Σ)

The covariance can also be defined as Cov(Z) = E[ZZ^T] − (E[Z])(E[Z])^T. (You should be able to prove to yourself that these two definitions are equivalent.) If X ∼ N(µ, Σ), then Cov(X) = Σ.

Here are some examples of what the density of a Gaussian distribution looks like:

[Figure: three surface plots of two-dimensional Gaussian densities.]

The left-most figure shows a Gaussian with mean zero (that is, the 2x1 zero vector) and covariance matrix Σ = I (the 2x2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution. The middle figure shows the density of a Gaussian with zero mean and Σ = 0.6I; the rightmost figure shows one with Σ = 2I. We see that as Σ becomes larger, the Gaussian becomes more "spread out," and as it becomes smaller, the distribution becomes more "compressed."

Let's look at some more examples.

[Figure: three more density surface plots.]

The figures above show Gaussians with mean 0, and with covariance matrices respectively

Σ = [1 0; 0 1];   Σ = [1 0.5; 0.5 1];   Σ = [1 0.8; 0.8 1].

The leftmost figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more "compressed" towards the 45° line (given by x1 = x2). We can see this more clearly when we look at the contours of the same three densities:

[Figure: contour plots of the three densities above.]

Here's one last set of examples generated by varying Σ:

[Figure: three contour plots.]

The plots above used, respectively,

Σ = [1 −0.5; −0.5 1];   Σ = [1 −0.8; −0.8 1];   Σ = [3 0.8; 0.8 1].

From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes "compressed" again, but in the opposite direction. Lastly, as we vary the parameters, more generally the contours will form ellipses (the rightmost figure showing an example).

As our last set of examples, fixing Σ = I, by varying µ we can also move the mean of the density around.

[Figure: three density surface plots with shifted means.]

The figures above were generated using Σ = I, and respectively

µ = [1; 0];   µ = [−0.5; 0];   µ = [−1; −1.5].
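A small sketch of the effect described above: drawing samples from N(0, Σ) for three of the covariance matrices and checking how the sample correlation between x1 and x2 grows with the off-diagonal entry (the sample size of 5000 is an arbitrary choice).

    import numpy as np

    rng = np.random.default_rng(0)
    covs = {
        "identity":     np.array([[1.0, 0.0], [0.0, 1.0]]),
        "off-diag 0.5": np.array([[1.0, 0.5], [0.5, 1.0]]),
        "off-diag 0.8": np.array([[1.0, 0.8], [0.8, 1.0]]),
    }

    for name, Sigma in covs.items():
        samples = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=5000)
        corr = np.corrcoef(samples.T)[0, 1]
        # Larger off-diagonal entries pull the samples toward the line x1 = x2.
        print(f"{name:>12}: sample correlation = {corr:.2f}")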

Page 20: Gaussian Discriminant Analysis

Generative View:

Y ∼ Bernoulli(p)

P(Y = y) = p^y (1 − p)^{1−y}

P(X|Y = 0) ∼ N(µ0, Σ)

P(X = x | Y = 0) = exp(−(1/2)(x − µ0)^T Σ^{-1} (x − µ0)) / ((2π)^{d/2} |Σ|^{1/2})

P(X|Y = 1) ∼ N(µ1, Σ)

P(X = x | Y = 1) = exp(−(1/2)(x − µ1)^T Σ^{-1} (x − µ1)) / ((2π)^{d/2} |Σ|^{1/2})
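A minimal sketch of this generative story in NumPy: draw the label from Bernoulli(p), then draw the features from the corresponding class-conditional Gaussian (the particular p, means, and shared covariance below are made-up illustration values).

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical parameters for illustration.
    p = 0.4                                    # P(Y = 1)
    mus = np.array([[0.0, 0.0],                # mu_0
                    [2.0, 1.0]])               # mu_1
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 1.0]])             # shared covariance for both classes

    def sample_gda(n):
        y = rng.binomial(1, p, size=n)                           # Y ~ Bernoulli(p)
        x = np.stack([rng.multivariate_normal(mus[yi], Sigma)    # X | Y = y ~ N(mu_y, Sigma)
                      for yi in y])
        return x, y

    X, Y = sample_gda(500)
    print(X.shape, Y.mean())   # the fraction of ones should be close to p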


Page 23: Generative Learning

Maximum Likelihood Estimation:

L(p, µ0, µ1, Σ) = log ∏_{i=1}^n f(xi, yi; p, µ0, µ1, Σ) = log ∏_{i=1}^n h(xi | yi; p, µ0, µ1, Σ) g(yi; p).

By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimates of the parameters (see problem set 1) to be (in the embedded notes' notation, m is the sample size n and φ is the class prior p):

φ = (1/m) ∑_{i=1}^m 1{y(i) = 1}

µ0 = ∑_{i=1}^m 1{y(i) = 0} x(i) / ∑_{i=1}^m 1{y(i) = 0}

µ1 = ∑_{i=1}^m 1{y(i) = 1} x(i) / ∑_{i=1}^m 1{y(i) = 1}

Σ = (1/m) ∑_{i=1}^m (x(i) − µ_{y(i)})(x(i) − µ_{y(i)})^T.

Pictorially, what the algorithm is doing can be seen as follows:

[Figure: two-class training data with the contours of the two fitted Gaussians and the straight-line decision boundary.]

Shown in the figure are the training set, as well as the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours that are the same shape and orientation, since they share a covariance matrix Σ, but they have different means µ0 and µ1. Also shown in the figure is the straight line giving the decision boundary at which p(y = 1|x) = 0.5. On one side of the boundary we predict y = 1 to be the most likely outcome, and on the other side we predict y = 0.

Discussion: GDA and logistic regression. The GDA model has an interesting relationship to logistic regression: if we view the quantity p(y = 1|x; φ, µ0, µ1, Σ) as a function of x, it turns out to have the logistic form, as the next slides make precise.

Page 24: Generative Learning

Maximum Likelihood Estimation:

L(p, µ0, µ1, Σ) = log ∏_{i=1}^n h(xi | yi; p, µ0, µ1, Σ) g(yi; p).

Convex Minimization:

µ0 = ∑_{i=1}^n xi (1 − yi) / (n − ∑_{i=1}^n yi)   and   µ1 = ∑_{i=1}^n xi yi / ∑_{i=1}^n yi

p = (1/n) ∑_{i=1}^n yi   and   Σ = (1/n) ∑_{i=1}^n (xi − µ_{yi})(xi − µ_{yi})^T

d(d + 1)/2 + 2d + 1 parameters to estimate.
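A minimal NumPy sketch of these closed-form estimators on synthetic data (the data-generating values are illustrative; variable names are mine, not from the slides).

    import numpy as np

    rng = np.random.default_rng(1)

    def fit_gda(X, Y):
        """Closed-form MLE for shared-covariance GDA."""
        n, d = X.shape
        p_hat = Y.mean()                                   # p = (1/n) sum_i y_i
        mu0 = X[Y == 0].mean(axis=0)                       # class-0 sample mean
        mu1 = X[Y == 1].mean(axis=0)                       # class-1 sample mean
        centered = X - np.where(Y[:, None] == 1, mu1, mu0)
        Sigma = centered.T @ centered / n                  # pooled (shared) covariance
        return p_hat, mu0, mu1, Sigma

    # Synthetic data: true p = 0.4, shifted class means, one shared covariance.
    Y = rng.binomial(1, 0.4, size=1000)
    X = rng.multivariate_normal(np.zeros(2), [[1.0, 0.3], [0.3, 1.0]], size=1000)
    X[Y == 1] += np.array([2.0, 1.0])

    print(fit_gda(X, Y))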


Page 29: Gaussian Discriminant Analysis

Prediction: Given X ∈ R^d, we predict

Ŷ = argmax_{Y ∈ {0,1}} P(Y | X; p, µ0, µ1, Σ).

Since we have [Analytical Problem in HW3]

log( P(Y = 1|X) / (1 − P(Y = 1|X)) ) = −(1/2)(µ1 + µ0)^T Σ^{-1} (µ1 − µ0) + (µ1 − µ0)^T Σ^{-1} X + log( p / (1 − p) ),

this is actually a logistic regression model!

But it is different from maximizing the conditional log-likelihood!
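A sketch checking the identity above numerically: compute the log-odds from Bayes' rule with the two Gaussian class-conditionals, and compare with the linear-in-X expression (the parameter values and the query point are hypothetical).

    import numpy as np
    from scipy.stats import multivariate_normal

    # Hypothetical GDA parameters.
    p = 0.4
    mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
    Sigma_inv = np.linalg.inv(Sigma)

    x = np.array([1.5, 0.2])   # an arbitrary query point

    # Log-odds via Bayes' rule.
    log_odds_bayes = (np.log(p) + multivariate_normal(mu1, Sigma).logpdf(x)
                      - np.log(1 - p) - multivariate_normal(mu0, Sigma).logpdf(x))

    # Log-odds via the linear (logistic regression) form from the slide.
    w = Sigma_inv @ (mu1 - mu0)
    b = -0.5 * (mu1 + mu0) @ Sigma_inv @ (mu1 - mu0) + np.log(p / (1 - p))
    log_odds_linear = w @ x + b

    print(log_odds_bayes, log_odds_linear)   # the two numbers should agree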


Page 32: GDA vs. Logistic Regression

Gaussian Discriminant Analysis

Modeling Assumption: Terrible

d(d + 1)/2 + 2d + 1 parameters: Terrible

Simple, with a closed-form solution: Not very useful!

Logistic Regression

Modeling Assumption: More Robust!

d parameters: Fewer!

Needs an iterative optimization algorithm: Not bad!


Page 38: Naive Bayes Classification

Page 39: Naive Bayes Gaussian Discriminant Analysis

Generative View:

Y ∼ Bernoulli(p)

P(X|Y = 0) ∼ N(µ0, Σ)

P(X|Y = 1) ∼ N(µ1, Σ)

Σ = diag(σ1^2, σ2^2, ..., σd^2)

[Figure: Gaussian density and contour plots, duplicating the examples shown on Page 19.]


Page 43: Naive Bayes Gaussian Discriminant Analysis

Conditional Independence:

P(X|Y) = ∏_{j=1}^d P(Xj|Y) = P(X1|Y) P(X2|Y) ··· P(Xd|Y)

A Simpler Decision Rule:

P(Y = 1|X) = P(X|Y = 1) P(Y = 1) / P(X)
           = ∏_{j=1}^d P(Xj|Y = 1) P(Y = 1) / ( ∏_{j=1}^d P(Xj|Y = 1) P(Y = 1) + ∏_{j=1}^d P(Xj|Y = 0) P(Y = 0) )


Page 45: Naive Bayes Gaussian Discriminant Analysis

Maximum Likelihood Estimation:

µ0 = ∑_{i=1}^n xi (1 − yi) / (n − ∑_{i=1}^n yi)   and   µ1 = ∑_{i=1}^n xi yi / ∑_{i=1}^n yi

p = (1/n) ∑_{i=1}^n yi   and   σj^2 = (1/n) ∑_{i=1}^n (xi,j − µ_{yi,j})^2

3d + 1 parameters to estimate.
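A sketch of these estimators and the resulting decision rule with per-feature variances instead of a full covariance matrix, implemented with log-probabilities for numerical stability (the synthetic data is illustrative).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    # Synthetic data: two classes, conditionally independent features.
    Y = rng.binomial(1, 0.3, size=2000)
    X = rng.normal(size=(2000, 3)) + np.where(Y[:, None] == 1, [1.0, -1.0, 0.5], 0.0)

    # MLE from this slide: class means, pooled per-feature variances, class prior.
    p_hat = Y.mean()
    mu0, mu1 = X[Y == 0].mean(axis=0), X[Y == 1].mean(axis=0)
    sigma2 = ((X - np.where(Y[:, None] == 1, mu1, mu0)) ** 2).mean(axis=0)

    def log_odds(x):
        # Sum of per-feature log densities (conditional independence) plus the log prior ratio.
        ll1 = norm.logpdf(x, loc=mu1, scale=np.sqrt(sigma2)).sum()
        ll0 = norm.logpdf(x, loc=mu0, scale=np.sqrt(sigma2)).sum()
        return ll1 - ll0 + np.log(p_hat / (1 - p_hat))

    print(log_odds(np.array([1.0, -1.0, 0.5])))   # positive means predict Y = 1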


Page 48: Naive Bayes Gaussian Discriminant Analysis

Missing values? Example: X = (X1, ..., X_{d−1}) with Xd missing.

P(Y = 1|X) = P(X|Y = 1) P(Y = 1) / P(X)
           = ∏_{j=1}^{d−1} P(Xj|Y = 1) P(Y = 1) / ( ∏_{j=1}^{d−1} P(Xj|Y = 1) P(Y = 1) + ∏_{j=1}^{d−1} P(Xj|Y = 0) P(Y = 0) )
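Under conditional independence a missing feature simply drops out of the product, exactly as the formula above does. A small sketch with hypothetical fitted parameters (e.g. produced by a fit like the one after Page 45):

    import numpy as np
    from scipy.stats import norm

    # Hypothetical fitted naive Bayes GDA parameters.
    p_hat = 0.3
    mu0 = np.array([0.0, 0.0, 0.0])
    mu1 = np.array([1.0, -1.0, 0.5])
    sigma2 = np.array([1.0, 1.0, 1.0])

    def log_odds_partial(x, observed):
        """Log-odds using only the observed coordinates; missing features drop out of the product."""
        ll1 = norm.logpdf(x[observed], loc=mu1[observed], scale=np.sqrt(sigma2[observed])).sum()
        ll0 = norm.logpdf(x[observed], loc=mu0[observed], scale=np.sqrt(sigma2[observed])).sum()
        return ll1 - ll0 + np.log(p_hat / (1 - p_hat))

    x = np.array([1.2, -0.8, np.nan])   # the last feature is missing
    observed = ~np.isnan(x)
    print(log_odds_partial(x, observed))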

Page 49: GDA vs. Naive Bayes GDA

Gaussian Discriminant Analysis

Stronger Modeling Assumption: Terrible

d(d + 1)/2 + 2d + 1 parameters: Terrible

A simple closed-form solution: Not very useful!

Naive Bayes GDA

Even Stronger Modeling Assumption: Terrible!

3d + 1 parameters: Good!

A super simple closed-form solution: Useful sometimes!

Page 50: Naive Bayes Bernoulli Discriminant Analysis

Generative View:

Y ∼ Bernoulli(p)

P(Xj|Y = 0) ∼ Bernoulli(γj^(0))

P(Xj|Y = 1) ∼ Bernoulli(γj^(1))

Conditional Independence:

P(X|Y) = ∏_{j=1}^d P(Xj|Y) = P(X1|Y) P(X2|Y) ··· P(Xd|Y)
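A minimal sketch of this model on binary features, estimating each γ by the per-class feature mean (the small clipping constant that avoids log(0) is my addition, not something stated on the slide).

    import numpy as np

    rng = np.random.default_rng(3)

    # Synthetic binary data: feature j fires with a class-dependent probability.
    Y = rng.binomial(1, 0.5, size=1000)
    gamma_true = np.where(Y[:, None] == 1, [0.8, 0.2, 0.6], [0.3, 0.7, 0.5])
    X = rng.binomial(1, gamma_true)

    eps = 1e-3                                            # clipping to avoid log(0) (my addition)
    p_hat = Y.mean()
    gamma0 = X[Y == 0].mean(axis=0).clip(eps, 1 - eps)    # estimates of gamma_j^(0)
    gamma1 = X[Y == 1].mean(axis=0).clip(eps, 1 - eps)    # estimates of gamma_j^(1)

    def log_odds(x):
        ll1 = (x * np.log(gamma1) + (1 - x) * np.log(1 - gamma1)).sum()
        ll0 = (x * np.log(gamma0) + (1 - x) * np.log(1 - gamma0)).sum()
        return ll1 - ll0 + np.log(p_hat / (1 - p_hat))

    print(log_odds(np.array([1, 0, 1])))   # positive means predict Y = 1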


Page 54: Naive Bayes Poisson Discriminant Analysis

Generative View:

Y ∼ Bernoulli(p)

P(Xj|Y = 0) ∼ Poisson(λj^(0))

P(Xj|Y = 1) ∼ Poisson(λj^(1))

Conditional Independence:

P(X|Y) = ∏_{j=1}^d P(Xj|Y) = P(X1|Y) P(X2|Y) ··· P(Xd|Y)
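The same recipe with Poisson class-conditionals for count-valued features; a brief sketch where each rate λ is estimated by the per-class feature mean (the Poisson MLE), on illustrative synthetic counts.

    import numpy as np
    from scipy.stats import poisson

    rng = np.random.default_rng(4)

    Y = rng.binomial(1, 0.5, size=1000)
    lam_true = np.where(Y[:, None] == 1, [4.0, 1.0], [1.0, 3.0])
    X = rng.poisson(lam_true)                   # count-valued features

    p_hat = Y.mean()
    lam0 = X[Y == 0].mean(axis=0)               # estimates of lambda_j^(0)
    lam1 = X[Y == 1].mean(axis=0)               # estimates of lambda_j^(1)

    def log_odds(x):
        ll1 = poisson.logpmf(x, lam1).sum()
        ll0 = poisson.logpmf(x, lam0).sum()
        return ll1 - ll0 + np.log(p_hat / (1 - p_hat))

    print(log_odds(np.array([5, 0])))   # positive means predict Y = 1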


Page 58: Example: Spam Email Classification

Data Set:

4601 email messages

Goal: predict whether an email message is spam or good.

Features: the frequencies in a message of 48 of the most commonly occurring words in all these email messages.

We coded spam as 1 and email as 0.


Page 63: Example: Spam Email Classification

Transforming Features:

Naive Bayes GDA: Relative Frequency of "free" = (# of "free" in this email) / (# of all words in this email)

Naive Bayes Bernoulli DA: Indicator of "free" = 1 if "free" appears in this email

Naive Bayes Poisson DA: No transformation needed

Coding Problem in HW2.
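A small sketch of the three encodings for a single vocabulary word; the per-email counts of "free" and the total word counts below are made-up example values, not taken from the data set.

    import numpy as np

    # Hypothetical per-email counts of the word "free" and total word counts.
    word_counts = np.array([3, 0, 1, 7])            # occurrences of "free" in 4 emails
    total_words = np.array([120, 80, 45, 200])

    freq_feature = word_counts / total_words             # Naive Bayes GDA: relative frequency
    indicator_feature = (word_counts > 0).astype(int)    # Naive Bayes Bernoulli DA: 0/1 indicator
    count_feature = word_counts                          # Naive Bayes Poisson DA: raw counts

    print(freq_feature, indicator_feature, count_feature, sep="\n")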


Page 67: Multiclass Fisher Discriminant Analysis

Page 68: Revisiting GDA

A Dimensionality Reduction Perspective:

Between-Class Scatter Matrix:

Γ = ∑_{k=0,1} (nk/n) (µk − µ)(µk − µ)^T,

where

µ = (1/n) ∑_{i=1}^n xi,   n1 = ∑_{i=1}^n yi   and   n0 = n − n1.

Rayleigh Quotient Formulation:

w = argmax_w (w^T Γ w) / (w^T Σ w) = argmax_w w^T Γ w   s.t.   w^T Σ w = 1.


Page 70: FDA and Dimension Reduction

Page 71: Multiclass Fisher Discriminant Analysis

Generative View:

Y ∼ Discrete(p1, ..., pm) with ∑_{k=1}^m pk = 1

P(X|Y = k) ∼ N(µk, Σ)

Between-Class Scatter Matrix:

Γ = (1/m) ∑_{k=1}^m (nk/n) (µk − µ)(µk − µ)^T   with   nk = ∑_{i=1}^n 1(yi = k).

Rayleigh Quotient Formulation:

W = argmax_{W ∈ R^{d×r}} trace(W^T Γ W)   s.t.   W^T Σ W = I_r.
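A sketch of solving this trace problem as a generalized eigenvalue problem with scipy.linalg.eigh(Γ, Σ), keeping the top r eigenvectors; the three-class synthetic data, the choice r = 2, and the use of the pooled within-class covariance for Σ are illustrative assumptions.

    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(5)

    # Synthetic data: three classes in d = 4 dimensions.
    means = np.array([[0, 0, 0, 0], [3, 1, 0, 0], [0, 3, 1, 0]], dtype=float)
    Y = rng.integers(0, 3, size=600)
    X = rng.normal(size=(600, 4)) + means[Y]

    n, d, m, r = X.shape[0], X.shape[1], 3, 2
    mu = X.mean(axis=0)
    Sigma = np.zeros((d, d))                  # pooled within-class covariance
    Gamma = np.zeros((d, d))                  # between-class scatter
    for k in range(m):
        Xk = X[Y == k]
        nk = len(Xk)
        centered = Xk - Xk.mean(axis=0)
        Sigma += centered.T @ centered / n
        diff = (Xk.mean(axis=0) - mu)[:, None]
        Gamma += (nk / n) * (diff @ diff.T) / m           # includes the 1/m factor

    # Generalized eigenproblem Gamma w = lambda Sigma w; eigh returns ascending eigenvalues.
    evals, evecs = eigh(Gamma, Sigma)
    W = evecs[:, ::-1][:, :r]                             # top r = 2 directions
    print(evals[::-1][:r])
    print(np.round(W.T @ Sigma @ W, 6))                   # close to I_r, i.e. W^T Sigma W = I_r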


Page 76: Multiclass Fisher Discriminant Analysis

Page 77: Eigenvalue Problem (Rank-1)

Rayleigh Quotient Formulation:

w = argmax_{w ∈ R^d} w^T A w   s.t.   w^T w = 1.

Lagrangian Multiplier Method: λ ∈ R

L(w, λ) = w^T A w − λ(w^T w − 1).

We only need eigenvectors of A, since

∇_w L(w, λ) = 2Aw − 2λw = 0,
∇_λ L(w, λ) = w^T w − 1 = 0.


Page 80: Eigenvalue Problem (Rank-r)

Rayleigh Quotient Formulation:

U = argmax_{U ∈ R^{d×r}} trace(U^T A U)   s.t.   U^T U = I_r.

Lagrangian Multiplier Method: Λ ∈ R^{r×r} and Λ = Λ^T

L(U, Λ) = trace(U^T A U) − trace(Λ^T (U^T U − I_r)).

We only need eigenvectors of A, since

∇_U L(U, Λ) = 2AU − 2UΛ = 0,
∇_Λ L(U, Λ) = U^T U − I_r = 0.


Page 83: Generalized Eigenvalue Problem

Rayleigh Quotient Formulation:

W = argmax_{W ∈ R^{d×r}} trace(W^T Γ W)   s.t.   W^T Σ W = I_r.

Replace U = Σ^{1/2} W:

U = argmax_{U ∈ R^{d×r}} trace(U^T A U)   s.t.   U^T U = I_r,

where A = Σ^{-1/2} Γ Σ^{-1/2}.

Eigenvalue Problem!
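A NumPy sketch of this substitution: whiten with Σ^{-1/2}, solve an ordinary symmetric eigenproblem for U, and map back via W = Σ^{-1/2} U. The small random Γ and Σ are placeholders, and the inverse square root is built from an eigendecomposition (scipy.linalg.sqrtm would be an alternative).

    import numpy as np

    rng = np.random.default_rng(6)
    d, r = 4, 2

    # Placeholder symmetric positive (semi)definite matrices standing in for Gamma and Sigma.
    B = rng.normal(size=(d, d))
    Gamma = B @ B.T
    C = rng.normal(size=(d, d))
    Sigma = C @ C.T + d * np.eye(d)

    # Symmetric inverse square root of Sigma via its eigendecomposition.
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

    A = Sigma_inv_sqrt @ Gamma @ Sigma_inv_sqrt      # A = Sigma^{-1/2} Gamma Sigma^{-1/2}
    evals, U = np.linalg.eigh(A)                     # ordinary eigenproblem, ascending order
    U = U[:, ::-1][:, :r]                            # top r eigenvectors
    W = Sigma_inv_sqrt @ U                           # map back: W = Sigma^{-1/2} U

    print(np.round(W.T @ Sigma @ W, 6))              # close to I_r, the original constraint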


Page 86: Eigenvalue Problem

Power Iteration:

U^(t+1) = QR(Θ U^(t))

When r = 1, we have

u^(t+1) = Θ u^(t) / ∥Θ u^(t)∥_2,

where Θ = Σ^{-1/2} Γ Σ^{-1/2}.

We need T = O(gap · log(1/ε)) iterations to guarantee

|u^T u^(T)| ≥ 1 − ε,   where gap = λ1/(λ1 − λ2).
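A minimal sketch of both variants on a random symmetric matrix: the rank-1 update with normalization, and the block (QR) iteration for r > 1. The matrix Θ here is a random placeholder and the fixed iteration count is arbitrary.

    import numpy as np

    rng = np.random.default_rng(7)
    d, r = 6, 2
    M = rng.normal(size=(d, d))
    Theta = M @ M.T                          # random symmetric PSD stand-in for Theta

    # Rank-1 power iteration.
    u = rng.normal(size=d)
    for _ in range(200):
        v = Theta @ u
        u = v / np.linalg.norm(v)            # u^(t+1) = Theta u^(t) / ||Theta u^(t)||_2

    # Block power iteration (QR / orthogonal iteration) for the top r eigenvectors.
    U = rng.normal(size=(d, r))
    for _ in range(200):
        U, _ = np.linalg.qr(Theta @ U)       # U^(t+1) = QR(Theta U^(t))

    evals, evecs = np.linalg.eigh(Theta)
    print(abs(u @ evecs[:, -1]))             # close to 1: aligned with the top eigenvector
    print(np.round(U.T @ evecs[:, -r:], 3))  # near-orthogonal block: spans the top-r eigenspace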


Page 89: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech

CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis

Quadratic Discriminant Analysis

Generative View:

Y ∼ Bernoulli(p), P(Y = y) = p^y (1 − p)^{1−y}

X | Y = 0 ∼ N(µ0, Σ0):

P(X = x | Y = 0) = exp(−(1/2)(x − µ0)^T Σ0^{-1}(x − µ0)) / ((2π)^{d/2} |Σ0|^{1/2})

X | Y = 1 ∼ N(µ1, Σ1):

P(X = x | Y = 1) = exp(−(1/2)(x − µ1)^T Σ1^{-1}(x − µ1)) / ((2π)^{d/2} |Σ1|^{1/2})
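A small numerical sketch of this generative view (the parameter values below are illustrative, not from the lecture): evaluate both class-conditional Gaussian densities and combine them with the prior via Bayes' rule to obtain the posterior P(Y = 1 | X = x).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative QDA parameters (assumptions for the sketch, not fitted values).
p = 0.4
mu0, Sigma0 = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 1.0]])
mu1, Sigma1 = np.array([2.0, 1.0]), np.array([[1.5, -0.3], [-0.3, 0.8]])

def posterior_y1(x):
    lik0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma0)   # P(X = x | Y = 0)
    lik1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma1)   # P(X = x | Y = 1)
    return p * lik1 / (p * lik1 + (1 - p) * lik0)             # Bayes' rule

print(posterior_y1(np.array([1.0, 0.5])))
```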



Quadratic Discriminant Analysis

Maximum Likelihood Estimation:

L(p, µ0, µ1, Σ0, Σ1) = log ∏_{i=1}^n h(x_i | y_i; µ0, µ1, Σ0, Σ1) g(y_i; p).

Convex Minimization

µ0 = (∑_{i=1}^n x_i (1 − y_i)) / (n − ∑_{i=1}^n y_i)  and  µ1 = (∑_{i=1}^n x_i y_i) / (∑_{i=1}^n y_i),

p = (∑_{i=1}^n y_i) / n  and  Σ_k = (1/n_k) ∑_{i: y_i = k} (x_i − µ_k)(x_i − µ_k)^T,

where n_k is the number of samples with y_i = k.

d(d + 1) + 2d + 1 parameters to estimate.
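A minimal sketch of these closed-form estimators (assumes X is an n × d array and y is a length-n array of 0/1 labels; the function name is illustrative):

```python
import numpy as np

# Closed-form QDA maximum likelihood estimates from the slide.
def qda_mle(X, y):
    p_hat = y.mean()                                    # p = (1/n) Σ y_i
    mu = {k: X[y == k].mean(axis=0) for k in (0, 1)}    # class means µ_0, µ_1
    Sigma = {}
    for k in (0, 1):
        Xk = X[y == k] - mu[k]
        Sigma[k] = Xk.T @ Xk / len(Xk)                  # Σ_k = (1/n_k) Σ (x_i − µ_k)(x_i − µ_k)^T
    return p_hat, mu, Sigma
```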


GDA vs. QDA

Gaussian Discriminant Analysis

Stronger modeling assumption (one shared covariance): Terrible
d(d+1)/2 + 2d + 1 parameters: Terrible
A simple closed-form solution: Not very useful!

Quadratic Discriminant Analysis

Weaker modeling assumption (separate covariances): Still Terrible!
d(d+1) + 2d + 1 parameters: More Terrible!
A simple closed-form solution: Not very useful!


Multiclass Classification


K-Nearest Neighbor Classification

Very intuitive: classify a query point by a majority vote among its K nearest training examples.
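A minimal sketch of this rule (illustrative names; assumes X_train is an n × d array and y_train holds the class labels):

```python
import numpy as np
from collections import Counter

# K-nearest-neighbor classification by majority vote under Euclidean distance.
def knn_predict(X_train, y_train, x_query, K=5):
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distances to all training points
    nearest = np.argsort(dists)[:K]                      # indices of the K closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```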


Model Complexity?

More flexible for larger K? Not really! The effective number of parameters is roughly n/K, so the fit actually becomes more flexible as K decreases (K = 1 gives the most flexible fit).


Curse of Dimensionality


Local Linear Regression

Build linear regression models using ONLY the neighbors of each query point.
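A minimal sketch of this idea (illustrative names; assumes X is an n × d array and y a length-n response vector): fit ordinary least squares on the K nearest neighbors of the query point, then predict at that point.

```python
import numpy as np

# Local linear regression: OLS fit restricted to the K nearest neighbors.
def local_linear_predict(X, y, x_query, K=20):
    nearest = np.argsort(np.linalg.norm(X - x_query, axis=1))[:K]
    Xk = np.hstack([np.ones((K, 1)), X[nearest]])          # add intercept column
    beta, *_ = np.linalg.lstsq(Xk, y[nearest], rcond=None)  # local least-squares fit
    return np.concatenate([[1.0], x_query]) @ beta          # prediction at the query point
```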


Local Logistic Regression

Build logistic regression models using ONLY the neighbors of each query point.
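A minimal sketch of the same idea for classification (illustrative names; uses scikit-learn's LogisticRegression and assumes both classes appear among the K neighbors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Local logistic regression: fit a logistic model on ONLY the K nearest
# neighbors of the query point and return the local estimate of P(Y = 1 | x).
def local_logistic_predict(X, y, x_query, K=50):
    nearest = np.argsort(np.linalg.norm(X - x_query, axis=1))[:K]
    clf = LogisticRegression().fit(X[nearest], y[nearest])   # assumes both classes present
    return clf.predict_proba(x_query.reshape(1, -1))[0, 1]
```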


Multinomial Regression

Given x_1, ..., x_n ∈ R^d, y_1, ..., y_n ∈ {1, 2, ..., m}, and θ*_1, ..., θ*_{m−1} ∈ R^d, for k = 1, ..., m − 1 and i = 1, ..., n,

P(y_i = k) = exp(−x_i^T θ*_k) / (1 + ∑_{j=1}^{m−1} exp(−x_i^T θ*_j)),

P(y_i = m) = 1 / (1 + ∑_{j=1}^{m−1} exp(−x_i^T θ*_j)).

Maximum Likelihood Estimation: Still a convex problem.
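A minimal sketch of these class probabilities (illustrative shapes and names; Theta stores θ*_1, ..., θ*_{m−1} as columns of a d × (m − 1) array):

```python
import numpy as np

# Class probabilities under the slide's parameterization; the m-th class is
# the reference category with score 1 in the shared denominator.
def multinomial_probs(x, Theta):
    scores = np.exp(-x @ Theta)                  # exp(−x^T θ*_k), k = 1, ..., m−1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom        # length-m vector summing to 1

d, m = 3, 4
rng = np.random.default_rng(0)
print(multinomial_probs(rng.standard_normal(d), rng.standard_normal((d, m - 1))))
```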
