3. Generative Algorithms, Machine Learning


Transcript of 3. Generative Algorithms, Machine Learning

Page 1: 3. Generative Algorithms, Machine Learning

Generative Algorithms

By : Shedriko

Page 2: 3. Generative Algorithms, Machine Learning

Preliminary
GDA (Gaussian Discriminant Analysis)
NB (Naive Bayes)

Outline & Content

Page 3: 3. Generative Algorithms, Machine Learning

We learned in the previous chapter about algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0, 1} (such as the perceptron algorithm). These are called discriminative learning algorithms.

Preliminary

Page 4: 3. Generative Algorithms, Machine Learning

Here, we’ll talk about algorithms that instead try to model p(x|y) and p(y). These algorithms are called generative learning algorithms. For example:

If y indicates whether an example is a dog (y = 0) or an elephant (y = 1), then p(x|y = 0) models the distribution of dogs’ features, and p(x|y = 1) models the distribution of elephants’ features.

Preliminary

Page 5: 3. Generative Algorithms, Machine Learning

After modeling p(y) (called the class prior) and p(x|y), our algorithm can then use Bayes’ rule to derive the posterior distribution of y given x:

p(y|x) = p(x|y) p(y) / p(x)

Here the denominator is given by:

p(x) = p(x|y = 1) p(y = 1) + p(x|y = 0) p(y = 0)

If we are calculating p(y|x) in order to make a prediction, we don’t actually need to calculate the denominator, since

arg max_y p(y|x) = arg max_y [ p(x|y) p(y) / p(x) ] = arg max_y p(x|y) p(y)

Preliminary
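For concreteness, here is a minimal sketch (Python, illustrative and not from the original slides; the priors and Gaussian likelihoods are made up) of making a prediction this way: each class is scored by p(x|y) p(y), and p(x) is never computed because it is the same for every class.

```python
from scipy.stats import norm

def predict(x, class_prior, likelihoods):
    """Pick the class y that maximizes p(x|y) * p(y).

    class_prior: dict mapping label -> p(y)
    likelihoods: dict mapping label -> function computing p(x|y)
    The denominator p(x) is omitted because it does not depend on y.
    """
    scores = {y: likelihoods[y](x) * class_prior[y] for y in class_prior}
    return max(scores, key=scores.get)

# Hypothetical one-dimensional example with Gaussian class-conditionals.
prior = {0: 0.7, 1: 0.3}
lik = {0: lambda x: norm.pdf(x, loc=0.0, scale=1.0),
       1: lambda x: norm.pdf(x, loc=3.0, scale=1.0)}
print(predict(2.0, prior, lik))  # -> 1
```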

Page 6: 3. Generative Algorithms, Machine Learning

The multivariate normal distribution

In this model, we’ll assume that p(x|y) is distributed according to a multivariate normal distribution. The multivariate normal distribution in n dimensions (also called the multivariate Gaussian distribution) is parameterized by a mean vector μ ∈ R^n and a covariance matrix Σ ∈ R^(n×n), where Σ is symmetric and positive semi-definite. Also written N(μ, Σ), its density is given by:

p(x; μ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − μ)^T Σ^(−1) (x − μ) )

GDA (Gaussian)

Page 7: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

Here |Σ| denotes the determinant of the matrix Σ. For a random variable X distributed N(μ, Σ), the mean is given by:

E[X] = ∫ x p(x; μ, Σ) dx = μ

The covariance of a vector-valued random variable Z is defined as Cov(Z) = E[(Z − E[Z])(Z − E[Z])^T],

or Cov(Z) = E[Z Z^T] − (E[Z])(E[Z])^T. If X ~ N(μ, Σ), then Cov(X) = Σ.
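As a quick sanity check of this definition, here is a minimal NumPy sketch (illustrative, not part of the slides) that evaluates the density above directly and compares it against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, sigma):
    """Density of N(mu, sigma) at x, computed directly from the formula."""
    n = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(sigma) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -0.5])
print(gaussian_density(x, mu, sigma))
print(multivariate_normal(mean=mu, cov=sigma).pdf(x))  # should match
```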

Page 8: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

Examples of what the density of a Gaussian distribution looks like :

The left-most figure shows a Gaussian with mean zero (that is, the 2x1 zero vector) and covariance matrix Σ = I (the 2x2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution.

Page 9: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

The middle and right-most figures show the densities of Gaussians with zero mean and different scalings of the covariance matrix. We see that as Σ becomes larger, the Gaussian becomes more “spread out”, and as it becomes smaller, the distribution becomes more “compressed”.

Page 10: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

Here are some more examples:

The figures above show Gaussians with mean 0 and with covariance matrices that differ in their off-diagonal entries, respectively.

Page 11: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

The left-most figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more “compressed” towards the 45° line (given by x1 = x2), as can be seen in the contours of these three densities below.

Page 12: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

Here is the last set of examples, generated by varying Σ:

The plots above used three different covariance matrices, respectively.

Page 13: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

From the leftmost and middle figures, we see that by decreasing the diagonal elements of the covariance matrix, the density becomes “compressed” again. More generally, as we vary the parameters, the contours will form ellipses (the rightmost figure).

Page 14: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

Fixing Σ and varying μ, we can also move the mean of the density around, as shown in the plots.
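The kinds of figures discussed in this section can be reproduced with a short script; here is a minimal NumPy/Matplotlib sketch (illustrative, not part of the slides; the particular μ and Σ values are made up) that draws contour plots of 2D Gaussian densities:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# A few (mu, Sigma) pairs to visualize; the exact values are illustrative.
params = [
    (np.zeros(2), np.eye(2)),                            # standard normal
    (np.zeros(2), np.array([[1.0, 0.8], [0.8, 1.0]])),   # correlated components
    (np.array([1.0, -1.0]), np.eye(2)),                  # shifted mean
]

xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack([xs, ys])

fig, axes = plt.subplots(1, len(params), figsize=(12, 4))
for ax, (mu, sigma) in zip(axes, params):
    density = multivariate_normal(mean=mu, cov=sigma).pdf(grid)
    ax.contour(xs, ys, density)
    ax.set_title(f"mu={mu.tolist()}")
plt.show()
```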

Page 15: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

The GDA model

When we have a classification problem in which the input features x are continuous-valued random variables, we can use the GDA model.

It models p(x|y) using a multivariate normal distribution; the model is:

y ~ Bernoulli(φ)
x | y = 0 ~ N(μ0, Σ)
x | y = 1 ~ N(μ1, Σ)

Page 16: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

Writing out the distributions, this is:

p(y) = φ^y (1 − φ)^(1 − y)
p(x | y = 0) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − μ0)^T Σ^(−1) (x − μ0) )
p(x | y = 1) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − μ1)^T Σ^(−1) (x − μ1) )

The parameters of the model are φ, μ0, μ1 and Σ (note that there are two different mean vectors μ0 and μ1, but only one covariance matrix Σ).

Page 17: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

The log-likelihood of the data is given by

ℓ(φ, μ0, μ1, Σ) = log Π_{i=1..m} p(x^(i), y^(i); φ, μ0, μ1, Σ) = log Π_{i=1..m} p(x^(i) | y^(i); μ0, μ1, Σ) p(y^(i); φ)

By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimates:

φ = (1/m) sum_{i=1..m} 1{y^(i) = 1}
μ0 = ( sum_i 1{y^(i) = 0} x^(i) ) / ( sum_i 1{y^(i) = 0} )
μ1 = ( sum_i 1{y^(i) = 1} x^(i) ) / ( sum_i 1{y^(i) = 1} )
Σ = (1/m) sum_{i=1..m} (x^(i) − μ_{y^(i)}) (x^(i) − μ_{y^(i)})^T
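These estimates are just class frequencies, per-class means, and a pooled covariance, so they can be computed in a few lines. A minimal NumPy sketch (illustrative, not part of the slides):

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA.

    X: (m, n) matrix of continuous features, y: (m,) array of 0/1 labels.
    Returns (phi, mu0, mu1, sigma) with a single shared covariance matrix.
    """
    m, n = X.shape
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Center each example by the mean of its own class, then pool.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    sigma = centered.T @ centered / m
    return phi, mu0, mu1, sigma
```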

Page 18: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

Example of two different mean vectors μ0 and μ1, with only one covariance matrix Σ.

Also shown in the figure is the straight line giving the decision boundary at which p(y = 1 | x) = 0.5. On one side of the boundary we’ll predict y = 1, and on the other side y = 0.
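Continuing the sketch above (again illustrative; fit_gda is the hypothetical helper defined earlier), the posterior p(y = 1 | x) and the 0.5 decision rule can be written as:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_posterior(x, phi, mu0, mu1, sigma):
    """p(y = 1 | x) under the fitted GDA model, via Bayes' rule."""
    p1 = multivariate_normal(mean=mu1, cov=sigma).pdf(x) * phi
    p0 = multivariate_normal(mean=mu0, cov=sigma).pdf(x) * (1 - phi)
    return p1 / (p0 + p1)

# Predict y = 1 whenever gda_posterior(x, ...) > 0.5; the set of x where it
# equals exactly 0.5 is the (linear) decision boundary shown in the figure.
```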

Page 19: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

Discussion: GDA and logistic regression

The GDA model has an interesting relationship to logistic regression.

If we view the quantity p(y = 1 | x; φ, μ0, μ1, Σ) as a function of x, we’ll find that it can be expressed in the form

p(y = 1 | x; φ, μ0, μ1, Σ) = 1 / (1 + exp(−θ^T x)),

where θ is some appropriate function of φ, μ0, μ1, Σ.

(This is exactly the form that logistic regression – a discriminative algorithm – uses to model p(y = 1 | x).)
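To see where this comes from, here is a compact version of the usual derivation (a sketch under the GDA assumptions above; the slides themselves do not show the intermediate steps):

```latex
\begin{aligned}
p(y=1\mid x)
  &= \frac{p(x\mid y=1)\,\phi}{p(x\mid y=1)\,\phi + p(x\mid y=0)\,(1-\phi)} \\
  &= \frac{1}{1+\exp\!\big(-\big[\log\tfrac{\phi}{1-\phi}
        + \log p(x\mid y=1) - \log p(x\mid y=0)\big]\big)}.
\end{aligned}
```

Because both class-conditional Gaussians share the same Σ, the quadratic terms x^T Σ^(−1) x cancel in the log-ratio, leaving a function that is linear in x; absorbing the constants into an intercept term gives the logistic form 1 / (1 + exp(−θ^T x)).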

Page 20: 3. Generative Algorithms, Machine Learning

GDA (Gaussian)

Which one is better?

GDA:
- stronger modeling assumptions
- more data efficient (requires less training data to learn “well”)

Logistic regression:
- weaker assumptions, more robust to deviations from the modeling assumptions
- when the data is indeed non-Gaussian (and the dataset is large), it’s better than GDA

For the last reason, logistic regression is used more often in practice than GDA.

Page 21: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

In GDA, the feature vector x was continuous; in Naive Bayes, the xi’s are discrete-valued.

For example: classifying whether an email is unsolicited commercial (spam) email or non-spam email.

We begin our spam filter by specifying the features xi used to represent an email.

We’ll represent an email via a feature vector whose length is equal to the number of words in the dictionary. Specifically, if an email contains the i-th word of the dictionary, then we set xi = 1; otherwise we set xi = 0.
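A minimal sketch of building such a feature vector (Python, with a hypothetical toy vocabulary; a real filter would use tens of thousands of words):

```python
def email_to_features(email_text, vocabulary):
    """Binary feature vector: x[i] = 1 iff the i-th dictionary word occurs."""
    words_in_email = set(email_text.lower().split())
    return [1 if word in words_in_email else 0 for word in vocabulary]

vocab = ["a", "buy", "cheap", "nips", "price"]  # illustrative only
print(email_to_features("Buy cheap meds now", vocab))  # -> [0, 1, 1, 0, 0]
```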

Page 22: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

For instance, the feature vector of an email has a 1 in each coordinate whose dictionary word appears in the email and 0’s in all other coordinates.

The set of words encoded into the feature vector is called the vocabulary, so the dimension of x is equal to the size of the vocabulary.

Then, we have to build a generative model, i.e. model p(x|y). If, say, we have a vocabulary of 50,000 words, then x ∈ {0, 1}^50000 (x is a 50,000-dimensional vector of 0’s and 1’s). If we modeled x explicitly with a multinomial distribution over its 2^50000 possible outcomes, we would end up with a (2^50000 − 1)-dimensional parameter vector (far too many parameters).

Page 23: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

To model p(x|y), we therefore have to make a very strong assumption.

Assume that the xi’s are conditionally independent given y. This is called the NB assumption, and the resulting algorithm is the NB classifier.

For instance: if I tell you y = 1 (spam email), then knowing whether x2087 (whether the word “buy” appears in the message) is 1 tells you nothing extra about x39831 (whether the word “price” appears in the message). This can be written

p(x2087 | y) = p(x2087 | y, x39831)

i.e., x2087 and x39831 are conditionally independent given y.

Page 24: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

We now have:

p(x1, ..., x50000 | y)
= p(x1 | y) p(x2 | y, x1) p(x3 | y, x1, x2) ... p(x50000 | y, x1, ..., x49999)
= p(x1 | y) p(x2 | y) p(x3 | y) ... p(x50000 | y)
= Π_{i=1..n} p(xi | y)

The first equality follows from the usual properties of probabilities (the chain rule).

The second equality follows from the NB assumption (an extremely strong assumption, but one that works well on many problems).

Page 25: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

Our model is parameterized by φ_{j|y=1} = p(xj = 1 | y = 1), φ_{j|y=0} = p(xj = 1 | y = 0), and φ_y = p(y = 1).

As usual, given a training set {(x^(i), y^(i)); i = 1, ..., m}, the joint likelihood of the data is:

L(φ_y, φ_{j|y=0}, φ_{j|y=1}) = Π_{i=1..m} p(x^(i), y^(i))

Maximizing this with respect to φ_y, φ_{j|y=0} and φ_{j|y=1} gives the maximum likelihood estimates:

φ_{j|y=1} = ( sum_i 1{xj^(i) = 1 ∧ y^(i) = 1} ) / ( sum_i 1{y^(i) = 1} )
φ_{j|y=0} = ( sum_i 1{xj^(i) = 1 ∧ y^(i) = 0} ) / ( sum_i 1{y^(i) = 0} )
φ_y = ( sum_i 1{y^(i) = 1} ) / m

(The ∧ symbol means “and”; for example, the numerator of φ_{j|y=1} counts the spam emails in which word j does appear.)

Page 26: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

To make a prediction on a new example with features x, we simply calculate

p(y = 1 | x) = p(x | y = 1) p(y = 1) / p(x)
= ( Π_i p(xi | y = 1) ) φ_y / ( ( Π_i p(xi | y = 1) ) φ_y + ( Π_i p(xi | y = 0) ) (1 − φ_y) )

and pick whichever class has the higher posterior probability.

So far we have used NB where the features xi are binary-valued; the generalization to xi taking values in {1, 2, ..., ki} is straightforward.

Here, we simply model p(xi | y) as multinomial rather than as Bernoulli.

For example: if some input, such as the living area of a house, is continuous-valued, we can discretize it into a small number of buckets, as sketched below.
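A minimal sketch of such a discretization (the bucket edges here are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical bucket edges (square feet); each living area is mapped to a
# discrete value x_i in {1, 2, 3, 4, 5} depending on which bucket it falls in.
bucket_edges = [400, 800, 1200, 1600]

def discretize(living_area_sqft):
    return int(np.digitize(living_area_sqft, bucket_edges)) + 1

print(discretize(890))  # -> 3 (falls in the 800-1200 bucket)
```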

Page 27: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

Laplace smoothing

When the original continuous-valued attributes are not well-modeled by a multivariate normal distribution, discretizing the features and using NB (instead of GDA) will often result in a better classifier.

A simple change will also make the NB algorithm work much better, especially for text classification. For example: suppose you have never received an email containing the word “NIPS” before, and that it is the 35000th word in the dictionary. Your NB spam filter therefore has picked its maximum likelihood estimates of the corresponding parameters to be:

φ_{35000|y=1} = ( sum_i 1{x35000^(i) = 1 ∧ y^(i) = 1} ) / ( sum_i 1{y^(i) = 1} ) = 0
φ_{35000|y=0} = ( sum_i 1{x35000^(i) = 1 ∧ y^(i) = 0} ) / ( sum_i 1{y^(i) = 0} ) = 0

Page 28: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

i.e., the word has never appeared in the training examples (either spam or non-spam). So when the filter tries to decide whether a message containing “nips” is spam, it calculates the class posterior probabilities and obtains

p(y = 1 | x) = ( Π_i p(xi | y = 1) ) p(y = 1) / ( ( Π_i p(xi | y = 1) ) p(y = 1) + ( Π_i p(xi | y = 0) ) p(y = 0) )

Page 29: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

The result is 0/0, because each of the products Π_i p(xi | y) includes the term p(x35000 | y) = 0.

More generally, consider estimating the mean of a multinomial random variable z taking values in {1, ..., k}, parameterized with φ_j = p(z = j). Given m independent observations z^(1), ..., z^(m), the maximum likelihood estimates are

φ_j = ( sum_{i=1..m} 1{z^(i) = j} ) / m

As we saw, some of these estimates might end up as zero (which was the problem). To avoid this, we use Laplace smoothing, which replaces the estimate with

φ_j = ( sum_{i=1..m} 1{z^(i) = j} + 1 ) / (m + k)

Page 30: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

Returning to our NB classifier above, with Laplace smoothing (to avoid estimates of exactly 0) we have:

φ_{j|y=1} = ( sum_i 1{xj^(i) = 1 ∧ y^(i) = 1} + 1 ) / ( sum_i 1{y^(i) = 1} + 2 )
φ_{j|y=0} = ( sum_i 1{xj^(i) = 1 ∧ y^(i) = 0} + 1 ) / ( sum_i 1{y^(i) = 0} + 2 )
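Putting the pieces together, here is a minimal NumPy sketch (illustrative, not from the slides) that estimates the smoothed NB parameters from a binary term-document matrix and scores a new email:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """X: (m, V) binary matrix (word j appears in email i), y: (m,) 0/1 labels.

    Returns Laplace-smoothed estimates phi_y, phi_j_given_spam, phi_j_given_ham.
    """
    phi_y = np.mean(y == 1)
    phi_spam = (X[y == 1].sum(axis=0) + 1) / (np.sum(y == 1) + 2)
    phi_ham = (X[y == 0].sum(axis=0) + 1) / (np.sum(y == 0) + 2)
    return phi_y, phi_spam, phi_ham

def predict_spam(x, phi_y, phi_spam, phi_ham):
    """Return 1 (spam) or 0 (non-spam) for a binary feature vector x.

    Works in log space so products of many small probabilities don't underflow.
    """
    x = np.asarray(x)
    log_p1 = np.log(phi_y) + np.sum(x * np.log(phi_spam) + (1 - x) * np.log(1 - phi_spam))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_ham) + (1 - x) * np.log(1 - phi_ham))
    return int(log_p1 > log_p0)
```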

Page 31: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

Event models for text classification

For text classification, the NB model as presented uses what is called the multi-variate Bernoulli event model. In this model, an email is generated by first randomly determining whether it is spam or non-spam (according to the class prior p(y)), and then running through the dictionary, deciding independently for each word i whether to include it in the email, according to p(xi = 1 | y) = φ_{i|y}. Thus the probability of a message is given by

p(y) Π_{i=1..n} p(xi | y)

Page 32: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

Here is a different, usually better, model called the multinomial event model (MEM). We let xi denote the identity of the i-th word in the email, which takes values in {1, ..., |V|}, where |V| is the size of our vocabulary. For example, if an email starts with “A NIPS ...”, then:

x1 = 1 (“a” is the first word in the dictionary)

x2 = 35000 (if “nips” is the 35000th word in the dictionary)

Page 33: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

In the MEM, we assume that an email is generated via a random process in which spam/non-spam is first determined (according to p(y), as before). Then each word x1, x2, x3, and so on is generated from some multinomial distribution over words (p(xi | y)), independently of the previous words. The overall probability of a message with d words is given by

p(y) Π_{i=1..d} p(xi | y)

(note that each factor is now a multinomial, not a Bernoulli distribution).

The parameters are:

φ_y = p(y = 1),
φ_{k|y=1} = p(xj = k | y = 1) (for any j), and
φ_{k|y=0} = p(xj = k | y = 0).

Page 34: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

The likelihood of the data is given by

L(φ_y, φ_{k|y=0}, φ_{k|y=1}) = Π_{i=1..m} p(x^(i), y^(i)) = Π_{i=1..m} ( Π_{j=1..d_i} p(xj^(i) | y^(i)) ) p(y^(i))

where d_i is the number of words in the i-th email. Maximizing it yields the maximum likelihood estimates:

φ_{k|y=1} = ( sum_{i=1..m} sum_{j=1..d_i} 1{xj^(i) = k ∧ y^(i) = 1} ) / ( sum_{i=1..m} 1{y^(i) = 1} d_i )
φ_{k|y=0} = ( sum_{i=1..m} sum_{j=1..d_i} 1{xj^(i) = k ∧ y^(i) = 0} ) / ( sum_{i=1..m} 1{y^(i) = 0} d_i )
φ_y = ( sum_{i=1..m} 1{y^(i) = 1} ) / m

Page 35: 3. Generative Algorithms, Machine Learning

NB (Naive Bayes)

Applying Laplace smoothing (adding 1 to the numerators and |V| to the denominators) when estimating φ_{k|y=1} and φ_{k|y=0}, the maximum likelihood estimates become:

φ_{k|y=1} = ( sum_{i=1..m} sum_{j=1..d_i} 1{xj^(i) = k ∧ y^(i) = 1} + 1 ) / ( sum_{i=1..m} 1{y^(i) = 1} d_i + |V| )
φ_{k|y=0} = ( sum_{i=1..m} sum_{j=1..d_i} 1{xj^(i) = k ∧ y^(i) = 0} + 1 ) / ( sum_{i=1..m} 1{y^(i) = 0} d_i + |V| )
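As before, these estimates are just smoothed counts; a minimal NumPy sketch (illustrative, not from the slides) of estimating them from tokenized emails:

```python
import numpy as np

def fit_multinomial_event_model(emails, labels, vocab_size):
    """emails: list of lists of word indices in {0, ..., vocab_size - 1};
    labels: list of 0/1 spam labels. Returns Laplace-smoothed parameters."""
    word_counts = np.zeros((2, vocab_size))   # counts of word k in each class
    total_words = np.zeros(2)                 # total word count per class
    for words, y in zip(emails, labels):
        for k in words:
            word_counts[y, k] += 1
        total_words[y] += len(words)
    phi_y = np.mean(np.array(labels) == 1)
    # Laplace smoothing: +1 per word in the numerator, +|V| in the denominator.
    phi_k_given_y = (word_counts + 1) / (total_words[:, None] + vocab_size)
    return phi_y, phi_k_given_y
```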

Page 36: 3. Generative Algorithms, Machine Learning

Thank you….