Page 1:

Natural Language Processing
COMPSCI 423/723

Rohit Kate

1

Page 2:

Classification for NLP:
Naïve Bayes Model

Maximum Entropy Model

Some of these slides have been adapted from Raymond Mooney’s slides from his NLP and Machine Learning courses at UT Austin.

References:
- Sections 6.6 & 6.7 of the Jurafsky & Martin book
- Naïve Bayes portions of the Word Sense Disambiguation chapters in the Jurafsky & Martin and Manning & Schutze books
- Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression by Tom Mitchell, http://www.cs.cmu.edu/~tom/NewChapters.html
- A Maximum Entropy Approach to Natural Language Processing by Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. Computational Linguistics, Vol. 22, No. 1 (1996), pp. 39-71.

2

Page 3:

Naïve Bayes Model

3

Page 4:

Classification in NLP

• Several NLP problems can be formulated as classification problems, a few examples:
  – Information Extraction
    • Given an entity, is it a person name or not?
    • Given two protein names, does the sentence say they interact or not?
  – Word Sense Disambiguation
    • I am out of money. I am going to the bank.
  – Document Classification
    • Given a document, which category does it belong to?
  – Sentiment Analysis
    • Given a passage (e.g. a product or movie review), is it saying positive things or negative things?
  – Textual Entailment
    • Given two sentences, can the second sentence be inferred from the first?

4

Page 5:

Classification

• Usually the classification output variable is denoted by Y and the input variables by Xs
  Y: {river bank, money bank, verb bank}
  X1: Previous word
  X2: Next word
  X3: Part-of-speech of previous word
  X4: Part-of-speech of next word

• Xs are usually called features in NLP
• Coming up with good feature sets for NLP problems is a skill: feature engineering
  – Requires linguistic insights
  – Grasp of the theory behind the classification method

5

Page 6:

Probabilistic Classification

• Often it is useful to know the probabilities of different output values and not just the best output value
  – To have confidence in the output (0.9-0.1 vs 0.6-0.4)
  – These probabilities may be useful for the next stage of NLP processing

• Conditional probability: P(Y|X1,X2,..,Xn)

6

Page 7:

Probabilistic Classification

• If the joint probability distribution P(Y,X1,X2,..,Xn) is given then the conditional probability distribution can be easily estimated

7

Page 8:

Estimating Conditional Probabilities

X1,X2,Y P(Y,X1,X2)

Circle, Red, Positive 0.2

Circle, Red, Negative 0.05

Circle, Blue, Positive 0.02

Circle, Blue, Negative 0.2

Square, Red, Positive 0.02

Square, Red, Negative 0.3

Square, Blue, Positive 0.01

Square, Blue, Negative 0.2

P(positive \mid red, circle) = \frac{P(positive \wedge red \wedge circle)}{P(red \wedge circle)} = \frac{0.20}{0.25} = 0.80

Similarly estimate P(Y|X1,X2) for the remaining values.

8
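For concreteness, a minimal Python sketch (not part of the slides) of reading the conditional probability off the joint table:

import math  # not strictly needed; kept for symmetry with later sketches

# Joint distribution P(X1=shape, X2=color, Y=label) from the table above.
joint = {
    ("circle", "red",  "positive"): 0.20, ("circle", "red",  "negative"): 0.05,
    ("circle", "blue", "positive"): 0.02, ("circle", "blue", "negative"): 0.20,
    ("square", "red",  "positive"): 0.02, ("square", "red",  "negative"): 0.30,
    ("square", "blue", "positive"): 0.01, ("square", "blue", "negative"): 0.20,
}

def conditional(y, x1, x2):
    # P(Y=y | X1=x1, X2=x2) = P(x1, x2, y) / sum over y' of P(x1, x2, y')
    marginal = sum(p for (a, b, _), p in joint.items() if a == x1 and b == x2)
    return joint[(x1, x2, y)] / marginal

print(conditional("positive", "circle", "red"))   # 0.20 / 0.25 = 0.8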

Page 9:

Estimating Joint Probability Distributions Not Easy :-(

• Assuming Y and all Xi are binary, we need 2^(n+1) - 1 entries (parameters) to specify the joint probability distribution

• This is impossible to accurately estimate from a reasonably-sized training set

• Note that P(Y|X1,X2,..,Xn) requires fewer entries (2^n), why? But that is still too many for even a small n

9

Page 10:

Estimating Joint Probability Distributions

• Simplification assumptions are made about the joint probability distribution to reduce the number of parameters to estimate

• Let the random variables be nodes of a graph; there are two major types of simplifications, represented as
  – Directed probabilistic graphical models
    • Simplest: Naïve Bayes model
    • More complex: Hidden Markov Model (HMM)
  – Undirected probabilistic graphical models
    • Simplest: Maximum entropy model
    • More complex: Conditional Random Fields (CRFs)

10

Page 11:

Directed Graphical Models

• Simplification assumption: Some random variables are conditionally independent of others given values for some other random variables

11

Page 12:

Conditional Independence

• Two random variables A and B are conditionally independent given C if P(A ∧ B | C) = P(A|C)P(B|C)

Rain and Thunder are not independent (given that there was Rain, the probability that there was Thunder increases).

But given that there was Lightning (or no Lightning), they are independent:

P(Rain ∧ Thunder | Lightning) = P(Rain | Lightning) P(Thunder | Lightning)

12

Page 13:

Directed Graphical Models

• Also known as Bayesian networks

• Simplification assumption: Some random variables are conditionally independent of others given values for some other random variables

• Simplest directed graphical model: Naïve Bayes

• Naïve Bayes assumption: The features are conditionally independent given the category

13

Page 14:

Naïve Bayes Assumption

• Features are conditionally independent given the category:

P(X_1, X_2, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)

• How do we estimate P(Y|X1,X2,..,Xn) from this?

• Recall Bayes’ theorem: it lets us calculate P(B|A) in terms of P(A|B)

14

Page 15:

Bayes’ Theorem

P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}

Simple proof from the definition of conditional probability:

P(H \mid E) = \frac{P(H \wedge E)}{P(E)}        (Def. cond. prob.)

P(E \mid H) = \frac{P(H \wedge E)}{P(H)}        (Def. cond. prob.)

\Rightarrow P(H \wedge E) = P(E \mid H)\, P(H)

QED:  P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}

15

Page 16:

Naïve Bayes Model

From Bayes’ Theorem:

P(Y \mid X_1, X_2, \ldots, X_n) = \frac{P(X_1, X_2, \ldots, X_n \mid Y)\, P(Y)}{P(X_1, X_2, \ldots, X_n)}

Naïve Bayes assumption:

P(X_1, X_2, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)

Computing marginals and the definition of conditional probability:

P(X_1, X_2, \ldots, X_n) = \sum_{y} P(X_1, X_2, \ldots, X_n \mid Y = y)\, P(Y = y)

Putting these together:

P(Y \mid X_1, X_2, \ldots, X_n) = \frac{P(Y) \prod_{i=1}^{n} P(X_i \mid Y)}{\sum_{y} P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y)}

16

Page 17:

Naïve Bayes Model

• Only need to estimate P(Y) and P(Xi|Y) for all i, that with the naïve Bayes assumption specifies the entire joint probability distribution

• Assuming Y and all Xis are binary, only 2n + 1 parameters instead of 2^(n+1) - 1 parameters: a dramatic reduction

[Figure: directed graphical model representation of Naïve Bayes — a node Y (with P(Y)) has directed edges to X1, X2, X3, …, Xn (each with P(Xi|Y)); analogous to the Lightning → Rain, Thunder example.]

17

Page 18:

Naïve Bayes Example

Probability positive negative

P(Y) 0.5 0.5

P(small | Y) 0.4 0.4

P(medium | Y) 0.1 0.2

P(large | Y) 0.5 0.4

P(red | Y) 0.9 0.3

P(blue | Y) 0.05 0.3

P(green | Y) 0.05 0.4

P(square | Y) 0.05 0.4

P(triangle | Y) 0.05 0.3

P(circle | Y) 0.9 0.3

Test Instance: <medium, red, circle>

P(Label|Size,Color,Shape)

18

Page 19:

Naïve Bayes Example

Probability positive negative

P(Y) 0.5 0.5

P(medium | Y) 0.1 0.2

P(red | Y) 0.9 0.3

P(circle | Y) 0.9 0.3

Test Instance: <medium, red, circle>

P(positive | medium, red, circle)
= P(positive) * P(medium | positive) * P(red | positive) * P(circle | positive) / P(medium, red, circle)
= 0.5 * 0.1 * 0.9 * 0.9 / P(medium, red, circle) = 0.0405 / P(medium, red, circle)
= 0.0405 / 0.0495 = 0.8181

P(negative | medium, red, circle)
= P(negative) * P(medium | negative) * P(red | negative) * P(circle | negative) / P(medium, red, circle)
= 0.5 * 0.2 * 0.3 * 0.3 / P(medium, red, circle) = 0.009 / P(medium, red, circle)
= 0.009 / 0.0495 = 0.1818
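The same computation as a small Python sketch (parameter values are copied from the tables above; the function name is illustrative):

prior = {"positive": 0.5, "negative": 0.5}
likelihood = {
    "positive": {"medium": 0.1, "red": 0.9, "circle": 0.9},
    "negative": {"medium": 0.2, "red": 0.3, "circle": 0.3},
}

def posterior(features):
    # Unnormalized scores: P(y) * prod_i P(x_i | y)
    scores = {}
    for y in prior:
        score = prior[y]
        for f in features:
            score *= likelihood[y][f]
        scores[y] = score
    z = sum(scores.values())          # P(X1,...,Xn), obtained by marginalizing over Y
    return {y: s / z for y, s in scores.items()}

print(posterior(["medium", "red", "circle"]))
# {'positive': 0.8181..., 'negative': 0.1818...}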

19

Page 20:

Estimating Probabilities

• Normally, probabilities are estimated based on observed frequencies in the training data.

• If D contains n_k examples in category y_k, and n_{ijk} of these n_k examples have the j-th value for feature X_i, namely x_{ij}, then:

P(X_i = x_{ij} \mid Y = y_k) = \frac{n_{ijk}}{n_k}

• However, estimating such probabilities from small training sets is error-prone.

• If, due only to chance, a rare feature X_i is always false in the training data, then \forall y_k: P(X_i = true \mid Y = y_k) = 0.

• If X_i = true then occurs in a test example X, the result is that \forall y_k: P(X \mid Y = y_k) = 0, and therefore \forall y_k: P(Y = y_k \mid X) = 0.

20

Page 21:

Probability Estimation Example

Probability positive negative

P(Y) 0.5 0.5

P(small | Y) 0.5 0.5

P(medium | Y) 0.0 0.0

P(large | Y) 0.5 0.5

P(red | Y) 1.0 0.5

P(blue | Y) 0.0 0.5

P(green | Y) 0.0 0.0

P(square | Y) 0.0 0.0

P(triangle | Y) 0.0 0.5

P(circle | Y) 1.0 0.5

Ex Size Color Shape Category

1 small red circle positive

2 large red circle positive

3 small red triangle negative

4 large blue circle negative

Test Instance X: <medium, red, circle>

P(positive | X) = 0.5 * 0.0 * 1.0 * 1.0 / P(X) = 0

P(negative | X) = 0.5 * 0.0 * 0.5 * 0.5 / P(X) = 0

21

Page 22:

Smoothing

• To account for estimation from small samples, probability estimates are adjusted or smoothed.

• Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a “virtual” sample of size m.

• For binary features, p is simply assumed to be 0.5.

P(X_i = x_{ij} \mid Y = y_k) = \frac{n_{ijk} + m\, p}{n_k + m}

22

Page 23:

Laplace Smoothing Example

• Assume the training set contains 10 positive examples:
  – 4: small
  – 0: medium
  – 6: large

• Estimate parameters as follows (if m = 1, p = 1/3)
  – P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
  – P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
  – P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
  – P(small or medium or large | positive) = 1.0
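A minimal Python sketch of this m-estimate, reproducing the numbers above (the function name is illustrative):

def m_estimate(n_feature_and_class, n_class, p, m):
    # P(X_i = x_ij | Y = y_k) = (n_ijk + m*p) / (n_k + m)
    return (n_feature_and_class + m * p) / (n_class + m)

# 10 positive examples, m = 1, p = 1/3, as on this slide.
for name, count in [("small", 4), ("medium", 0), ("large", 6)]:
    print(name, round(m_estimate(count, 10, 1/3, 1), 3))
# small 0.394, medium 0.03, large 0.576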

23

Page 24:

Naïve Bayes Model is a Generative Model

• Models the joint probability distribution P(Y,X1,X2,..,Xn) using P(Y) and P(Xi|Y)

• An assumed generative process: First generate Y according to P(Y) then generate X1,X2,..,Xn independently according to P(X1|Y), P(X2|Y), .., P(Xn|Y) respectively
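A minimal Python sketch of this assumed generative process (the probability values are taken, purely for illustration, from the example table on slide 18):

import random

def generate_example():
    # First draw the category Y, then draw each feature independently given Y.
    y = random.choices(["positive", "negative"], weights=[0.5, 0.5])[0]
    pos = (y == "positive")
    size  = random.choices(["small", "medium", "large"],
                           weights=[0.4, 0.1, 0.5] if pos else [0.4, 0.2, 0.4])[0]
    color = random.choices(["red", "blue", "green"],
                           weights=[0.9, 0.05, 0.05] if pos else [0.3, 0.3, 0.4])[0]
    shape = random.choices(["square", "triangle", "circle"],
                           weights=[0.05, 0.05, 0.9] if pos else [0.4, 0.3, 0.3])[0]
    return y, (size, color, shape)

print(generate_example())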

24

Page 25:

Naïve Bayes Generative Model

[Figure: the generative model drawn as bags — a Category bag containing pos/neg labels, and for each category (Positive, Negative) separate Size (sm/med/lg), Color (red/blue/grn), and Shape (circ/sqr/tri) bags from which feature values are drawn.]

25

Page 26:

Naïve Bayes Inference Problem

[Figure: the same bags as on the previous slide, now used for inference — an observed instance (lg, red, circ) is shown with its Category unknown (??), and the task is to infer which category most likely generated it.]

26

Page 27:

Some Comments on Naïve Bayes Model

• Tends to work well despite strong (or naïve) assumption of conditional independence

• Experiments show it to be quite competitive with other classification methods on standard UCI datasets

• Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases

27

Page 28:

Maximum Entropy Model

28

Page 29:

Maximum Entropy Models

• Very popular in NLP
• Several ways to look at them:
  – Exponential or log-linear classifiers, or multinomial logistic regression
  – Assume a parametric form for the conditional distribution
  – Maximize the entropy of the joint distribution given the constraints
  – Discriminative models instead of generative (directly estimate P(Y|X1,..,Xn) instead of going via P(Y,X1,..,Xn))

29

Page 30:

Linear Regression

• Classification: Predict a discrete value

• Regression: Predict a real value

• Linear Regression: Predict a real value using a linear combination of inputs

Y = W0 + W1*X1 + W2*X2 + … + Wn*Xn

Ws are the weights associated with the features Xs

Example:

price = 16550 - 4900*(# vague adjectives)

30

Page 31:

Estimating Weights in Linear Regression

• Find the Ws that minimize the sum-squared error for the given M training examples

• Statistical packages are available that solve this fast

cost(W) = \sum_{j=1}^{M} \left( Y_{predicted}^{(j)} - Y_{observed}^{(j)} \right)^{2}

W^{*} = \arg\min_{W} cost(W)
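A minimal sketch (with made-up data constructed to match the price example above) using NumPy's least-squares routine, the kind of statistical-package solver the slide refers to:

import numpy as np

# Toy training data: one feature (# vague adjectives) and observed prices.
X = np.array([[0], [1], [2], [3]])
y = np.array([16550, 11650, 6750, 1850])

# Add a column of ones for the intercept W0, then solve min ||X1 W - y||^2.
X1 = np.hstack([np.ones((len(X), 1)), X])
W, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(W)   # approximately [16550, -4900]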

31

Page 32:

Logistic Regression

• But we are interested in probabilistic classification, that is, in predicting P(Y|X1,..,Xn)

• Can we modify linear regression to do that?
  – Nothing constrains its output to lie in [0,1], which is required for a legal probability

• Predict the odds (assume Y is binary) instead of the probability:

\frac{P(Y = true \mid X_1, X_2, \ldots, X_n)}{1 - P(Y = true \mid X_1, X_2, \ldots, X_n)} = \sum_{i=0}^{n} W_i X_i

32

Page 33:

Logistic Regression

• But the LHS lies between 0 and infinity, while the RHS can range from -infinity to infinity

• Take the log of the LHS (known as the logit function):

\ln\left(\frac{P(Y = true \mid X_1, X_2, \ldots, X_n)}{1 - P(Y = true \mid X_1, X_2, \ldots, X_n)}\right) = \sum_{i=0}^{n} W_i X_i

Solving for the probability gives the logistic function:

P(Y = true \mid X_1, X_2, \ldots, X_n) = \frac{1}{1 + e^{-\sum_{i=0}^{n} W_i X_i}}
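A minimal Python sketch of the logistic function applied to a weighted feature sum (the weights are purely illustrative):

import math

def predict_prob(weights, features):
    # features[0] should be 1.0 so that weights[0] acts as the intercept W0.
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))          # P(Y = true | X)

print(predict_prob([-1.0, 2.0, 0.5], [1.0, 1.0, 0.0]))   # sigmoid(1.0) ≈ 0.73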

33

Page 34:

Logistic Regression as a Log-Linear Model

• Logistic regression is basically a linear model, which is demonstrated by taking logs

Assign label Y = true iff

\frac{P(Y = true \mid X_1 \ldots X_n)}{P(Y = false \mid X_1 \ldots X_n)} > 1

i.e., iff \exp\left(\sum_{i=0}^{n} w_i X_i\right) > 1, i.e., iff \sum_{i=0}^{n} w_i X_i > 0

34

Page 35:

Logistic Regression Training

• Weights are set during training to maximize the conditional data likelihood:

W^{*} = \arg\max_{W} \prod_{d \in D} P(Y^{d} \mid X_1^{d}, \ldots, X_n^{d})

where D is the set of training examples and Y^d and X_i^d denote, respectively, the values of Y and X_i for example d.

• Equivalently viewed as maximizing the conditional log likelihood (CLL):

W^{*} = \arg\max_{W} \sum_{d \in D} \ln P(Y^{d} \mid X_1^{d}, \ldots, X_n^{d})

35

Page 36:

Logistic Regression Training

• Use standard gradient descent to find the parameters (weights) that optimize the CLL objective function

• Many other more advanced training methods are available
  – Conjugate gradient
  – Generalized Iterative Scaling (GIS)
  – Improved Iterative Scaling (IIS)
  – Limited-memory quasi-Newton (L-BFGS)

• Packages are available that do these

36
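As one illustration, a minimal sketch assuming scikit-learn is installed (the toy data is made up); here its LogisticRegression is configured with the L-BFGS solver:

from sklearn.linear_model import LogisticRegression

# Toy training data: each row is [X1, X2], y holds the binary labels.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]

clf = LogisticRegression(solver="lbfgs").fit(X, y)
print(clf.predict_proba([[1, 0]]))   # [P(Y=0 | X), P(Y=1 | X)]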

Page 37:

Preventing Overfitting in Logistic Regression

• To prevent overfitting, one can use regularization (smoothing) by penalizing large weights by changing the training objective:

W^{*} = \arg\max_{W} \sum_{d \in D} \ln P(Y^{d} \mid X_1^{d}, \ldots, X_n^{d}, W) \; - \; \frac{\lambda}{2} \lVert W \rVert^{2}

where λ is a constant that determines the amount of smoothing.

• This can be shown to be equivalent to assuming a Gaussian prior for W with zero mean and a variance related to 1/λ.

37

Page 38:


Generative vs. Discriminative Models

• Generative models are not directly designed to maximize the performance of classification. They model the complete joint distribution P(Y,X1,...,Xn).

• But a generative model can also be used to perform any other inference task, e.g. P(X1 | X2, …Xn, Y)– “Jack of all trades, master of none.”

• Discriminative models are specifically designed and trained to maximize performance of classification. They only model the conditional distribution P(Y | X1, …Xn).

• By focusing on modeling the conditional distribution, they generally perform better on classification than generative models when given a reasonable amount of training data.– Master of one trade: Classification P(Y|X1,.. Xn)

38

Page 39:

Multinomial Logistic Regression (Maximum Entropy or MaxEnt)

• So far Y was binary; a generalization for when Y takes multiple values (classes)

• Make the weights dependent on the class c: W_{ci} instead of W_i

P(c \mid X_1 \ldots X_n) = \frac{\exp\left(\sum_{i=0}^{N} W_{ci} X_i\right)}{\sum_{c' \in Classes} \exp\left(\sum_{i=0}^{N} W_{c'i} X_i\right)}

The denominator is a normalization term (Z) so that the probabilities sum to 1.
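A minimal Python sketch of this computation (the weight values are purely illustrative):

import math

def maxent_probs(W, x):
    # Unnormalized score for each class c: exp(sum_i W[c][i] * x[i])
    scores = {c: math.exp(sum(w * xi for w, xi in zip(wc, x))) for c, wc in W.items()}
    z = sum(scores.values())              # normalization term Z
    return {c: s / z for c, s in scores.items()}

W = {"river bank": [0.2, 1.5], "money bank": [0.1, -0.3], "verb bank": [-0.4, 0.0]}
print(maxent_probs(W, [1.0, 1.0]))        # the three probabilities sum to 1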

39

Page 40:

• Usually features take binary values in NLP

• Introduce indicator functions (0 or 1 output) that depend on the input and output class

• Calling the input X, the features are fi(c,x)

Multinomial Logistic Regression (Maximum Entropy or MaxEnt)

P(c \mid X) = \frac{\exp\left(\sum_{i=0}^{N} W_{ci} f_i(c, x)\right)}{\sum_{c' \in Classes} \exp\left(\sum_{i=0}^{N} W_{c'i} f_i(c', x)\right)}

40

Page 41:

A Small MaxEnt Example

• Word Sense Disambiguation:
  Y: {river bank, money bank, verb bank}
  X: Entire sentence
  Features:

f1(river bank,X) = 1 if “river” is in the sentence, 0 otherwise

f2(river bank,X) = 1 if “water” is in the sentence, 0 otherwise

f3(money bank,X) = 1 if “money” is in the sentence, 0 otherwise

f4(money bank,X) = 1 if “deposit” is in the sentence, 0 otherwise

f5(verb bank,X) = 1 if previous word was “to”, 0 otherwise

• Obtain examples of feature values and Y from annotated training data

• Compute weights Wci to maximize the conditional log-likelihood of the training data

• For a test example, predict Y using MaxEnt equation
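A minimal Python sketch of these feature functions and the MaxEnt prediction (the weights below are hypothetical; in practice they would be learned by maximizing the CLL on the annotated data):

import math

CLASSES = ["river bank", "money bank", "verb bank"]

def prev_word(words, target):
    # The word immediately before the first occurrence of target ("" if none).
    i = words.index(target) if target in words else -1
    return words[i - 1] if i > 0 else ""

def features(c, words):
    return [
        1 if c == "river bank" and "river"   in words else 0,                # f1
        1 if c == "river bank" and "water"   in words else 0,                # f2
        1 if c == "money bank" and "money"   in words else 0,                # f3
        1 if c == "money bank" and "deposit" in words else 0,                # f4
        1 if c == "verb bank"  and prev_word(words, "bank") == "to" else 0,  # f5
    ]

W = {c: [1.2, 0.8, 1.5, 0.9, 2.0] for c in CLASSES}   # hypothetical weights W_ci

def predict(sentence):
    words = sentence.lower().split()
    scores = {c: math.exp(sum(w * f for w, f in zip(W[c], features(c, words))))
              for c in CLASSES}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(predict("I am going to the bank to deposit money"))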

41

Page 42:

Why is it Called Maximum Entropy Model?

• Entropy of a random variable Y:

• The more uniform the distribution, the higher the entropy

• It can be shown that standard training for logistic regression gives the distribution with maximum entropy that is empirically consistent with the training data

H(Y) = -\sum_{Y} P(Y) \log_2 P(Y)
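A minimal Python sketch of the entropy computation, illustrating that the uniform distribution has the highest entropy:

import math

def entropy(dist):
    # H(Y) = -sum_y P(y) * log2 P(y)
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit (uniform: maximum for two outcomes)
print(entropy([0.9, 0.1]))    # ≈ 0.47 bits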

42

Page 43:

Undirected Graphical Model

• Also called Markov Network or Random Field
• Undirected graph over a set of random variables, where an edge represents a dependency
• The Markov blanket of a node X in a Markov net is the set of its neighbors in the graph (nodes that have an edge connecting to X)
• Every node in a Markov net is conditionally independent of every other node given its Markov blanket
• Simplest Markov network: MaxEnt model

43

Page 44:

Relation with Naïve Bayes

[Figure: two graphs over Y and X1, X2, …, Xn. Naïve Bayes (generative): directed edges from Y to each Xi. Logistic Regression (conditional, discriminative): the corresponding undirected model connecting Y with each Xi.]

44

Page 45:

Simplification Assumption for MaxEnt

• The probability P(Y|X1..Xn) can be factored as:

• Note there is no product term that has two or more Xis

P(c \mid X_1 \ldots X_n) = \frac{\exp\left(\sum_{i=0}^{N} W_{ci} X_i\right)}{\sum_{c' \in Classes} \exp\left(\sum_{i=0}^{N} W_{c'i} X_i\right)}

45

Page 46:

Naïve Bayes and MaxEnt

• Naïve Bayes can be extended to work with continuous inputs X (like logistic regression)
• Both make the conditional independence assumption
• MaxEnt is not rigidly tied to it, because it tries to maximize the conditional likelihood of the data even when the data disobeys the assumption
• It has been observed that with scarce training data Naïve Bayes performs better, and with sufficient data MaxEnt performs better

46

Page 47:

Classification in General

• Several other classifiers are also available: perceptron, neural networks, support vector machines, k-nearest neighbors, decision trees…
• Naïve Bayes and MaxEnt are based on probabilities
  – They can't handle combinations of features unless those combinations are added as features
  – If the right features are engineered, they work very well
• They are widely used for tasks other than NLP tasks
• All this was for single-label classification (there was only one Y); there are extensions that handle multi-label classification, e.g. sequence labeling with HMMs or CRFs

47

Page 48:

HW 2

• Write Naïve Bayes (P(Y|f1,f2,f3,f4,f5)) and MaxEnt (P(Y|X)) equations for the example shown on slide #41.

48

Page 49:

References for Next Class

• Chapter 5 (part-of-speech tagging) of the Jurafsky & Martin book; Chapter 10 of the Manning & Schutze book

• An Introduction to Conditional Random Fields for Relational Learning by Charles Sutton and Andrew McCallum. Book chapter in Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar. MIT Press, 2006.

49