Page 1

TOPIC 4: LANGUAGE MODELING

NATURAL LANGUAGE PROCESSING (NLP)

CS-724

Wondwossen Mulugeta (PhD), email: [email protected]


Page 2

Topics

Topic 4: Language Modeling

Subtopics:
• Introduction to Language Modeling
• Application
• N-gram model
• Smoothing Techniques
• Model Evaluation


Page 3


What is a Language Model

Language models assign a probability that a sentence is a legal string in a language.

Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications.

• We can view a finite state automaton as a deterministic language model. For example, an automaton that generates "I wish I wish I wish I wish . . ." CANNOT generate "wish I wish" or "I wish I".

Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.
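As a rough sketch (not from the slides), the "I wish" automaton can be written as a two-state machine; the state names and transition table below are illustrative assumptions.

```python
# Minimal sketch of the deterministic "I wish" automaton described above.
# State names and transition table are illustrative, not from the slides.
TRANSITIONS = {
    ("q0", "I"): "q1",      # from the start state we may only read "I"
    ("q1", "wish"): "q0",   # after "I" we may only read "wish", then repeat
}

def accepts(tokens, start="q0", accept={"q0"}):
    """Return True if the automaton can read the whole token sequence."""
    state = start
    for tok in tokens:
        key = (state, tok)
        if key not in TRANSITIONS:
            return False        # e.g. "wish I wish" fails on the first token
        state = TRANSITIONS[key]
    return state in accept

print(accepts("I wish I wish".split()))   # True
print(accepts("wish I wish".split()))     # False
print(accepts("I wish I".split()))        # False
```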

Page 4

Language Models

Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences in a language.

For NLP, a probabilistic model of a language that gives a probability that a string is a member of a language is more useful.

To specify a correct probability distribution, the probability of all sentences in a language must sum to 1.

Page 5

Uses of Language Models

Speech recognition

“I ate a cherry” is a more likely sentence than “Eye eight uh Jerry”

OCR & Handwriting recognition

More probable sentences are more likely correct readings.

Machine translation

More likely sentences are probably better translations.

Generation

More likely sentences are probably better NL generations.

Context sensitive spelling correction

“Their are problems wit this sentence.”

Page 6

Uses of Language Models

A language model also supports predicting the completion of a sentence.

Please turn off your cell _____

Your program does not ______

Predictive text input systems can guess what you are typing and give choices on how to complete it.

Page 7

N-Gram Models

Estimate probability of each word given prior context.

P(phone | Please turn off your cell)

Number of parameters required grows exponentially with the number of words of prior context.

An N-gram model uses only N−1 words of prior context.

Unigram: P(phone)

Bigram: P(phone | cell)

Trigram: P(phone | your cell)

Page 8

N-Gram Models

The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.

Page 9


A probabilistic language model

This is a one-state probabilistic finite-state automaton, also called a unigram language model, together with the state emission distribution for its single state q1.

STOP is not a word, but a special symbol indicating that the automaton stops.

E.g.: “frog said that toad likes frog” STOP

P(string) = 0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01 × 0.2 = 0.0000000000048 (4.8 × 10⁻¹²), where the final factor 0.2 is the emission probability of STOP.
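A minimal sketch of the calculation above; the per-word emission probabilities are the illustrative values from the slide, and the function name is ours.

```python
# Sketch of the unigram calculation above; emission probabilities are the
# slide's illustrative values, with P(STOP) = 0.2.
unigram = {"frog": 0.01, "said": 0.03, "that": 0.04,
           "toad": 0.01, "likes": 0.02, "STOP": 0.2}

def unigram_prob(tokens, model):
    """Multiply per-token emission probabilities (independence assumption)."""
    p = 1.0
    for tok in tokens:
        p *= model[tok]
    return p

tokens = "frog said that toad likes frog STOP".split()
print(unigram_prob(tokens, unigram))   # ~4.8e-12
```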

Page 10

Probabilistic Language Modeling

Goal: compute the probability of a sentence or sequence of words:

P(W) = P(w1,w2,w3,w4,w5…wn)

Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4)

A model that computes either of these:

P(W) or P(wn|w1,w2…wn-1)

This is called a Language Model.

Page 11

How to compute P(W): The Chain Rule

How to compute this joint probability:

P(its, water, is, so, transparent, that)

Intuition: let’s rely on the Chain Rule of Probability

Recall the definition of conditional probabilities

P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

The Chain Rule in General

P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)

P(this is a boy) = P(this) × P(is | this) × P(a | this is) × P(boy | this is a)

Page 12

The Chain Rule applied to compute the joint probability of words in a sentence

P(“its water is so transparent”) =
P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)

In general: $P(w_1 w_2 \dots w_n) = \prod_i P(w_i \mid w_1 w_2 \dots w_{i-1})$
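A small sketch that simply lists the chain-rule factors for a sentence (the helper name is hypothetical):

```python
# Sketch: list the chain-rule factors P(w_i | w_1 ... w_{i-1}) for a sentence.
def chain_rule_factors(sentence):
    words = sentence.split()
    return [f"P({w} | {' '.join(words[:i])})" if i else f"P({w})"
            for i, w in enumerate(words)]

for factor in chain_rule_factors("its water is so transparent"):
    print(factor)
# P(its)
# P(water | its)
# P(is | its water)
# P(so | its water is)
# P(transparent | its water is so)
```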

Page 13

How to estimate these probabilities

Could we just count and divide?

Too many possible sentences!

We’ll never see enough data for estimating these.

$P(\text{the} \mid \text{its water is so transparent that}) = \dfrac{\text{Count(its water is so transparent that the)}}{\text{Count(its water is so transparent that)}}$

Page 14

Markov Assumption

Simplifying assumption (Andrei Markov):

P(the | its water is so transparent that) ≈ P(the | that)

Or maybe:

P(the | its water is so transparent that) ≈ P(the | transparent that)

Page 15

Markov Assumption

In other words, we approximate each component in the product:

$P(w_1 w_2 \dots w_n) \approx \prod_i P(w_i \mid w_{i-k} \dots w_{i-1})$

$P(w_i \mid w_1 w_2 \dots w_{i-1}) \approx P(w_i \mid w_{i-k} \dots w_{i-1})$

Page 16

N-Gram Model Formulas

Word sequences: $w_1^n = w_1 \dots w_n$

Chain rule of probability:

$P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \dots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$

Bigram approximation:

$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$

N-gram approximation:

$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
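A short sketch of the N-gram approximation: each word is conditioned only on the previous N−1 words. The <s> padding and function name are assumptions for illustration.

```python
# Sketch of the N-gram approximation: each word is conditioned only on the
# previous N-1 words. <s> padding is an assumption for illustration.
def ngram_factors(words, N):
    padded = ["<s>"] * (N - 1) + list(words)
    for k in range(N - 1, len(padded)):
        history = padded[k - N + 1:k]
        yield tuple(history), padded[k]

for hist, w in ngram_factors("please turn off your cell phone".split(), N=3):
    print(f"P({w} | {' '.join(hist)})")
# P(please | <s> <s>), P(turn | <s> please), ..., P(phone | your cell)
```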

Page 17

Estimating Probabilities

N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.

To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.

Bigram:

$P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$

N-gram:

$P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$

Page 18

Generative Model & MLE

An N-gram model can be seen as a probabilistic automaton for generating sentences.

Relative frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus T:

$\hat{\theta} = \operatorname{argmax}_{\theta} P(T \mid M(\theta))$

Generation procedure:

Initialize the sentence with N−1 <s> symbols.

Until </s> is generated, stochastically pick the next word based on the conditional probability of each word given the previous N−1 words.
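A minimal sketch of the generation loop above for the bigram case (N = 2); the toy conditional distributions are hypothetical.

```python
import random

# Sketch of the generation procedure above (bigram case, N = 2).
# The toy conditional distributions are hypothetical.
COND = {
    ("<s>",): {"I": 0.6, "Sam": 0.4},
    ("I",):   {"am": 0.7, "do": 0.3},
    ("am",):  {"Sam": 0.5, "</s>": 0.5},
    ("Sam",): {"</s>": 1.0},
    ("do",):  {"</s>": 1.0},
}

def generate(N=2):
    sentence = ["<s>"] * (N - 1)                 # initialize with N-1 <s> symbols
    while sentence[-1] != "</s>":
        history = tuple(sentence[-(N - 1):])     # previous N-1 words
        words, probs = zip(*COND[history].items())
        sentence.append(random.choices(words, probs)[0])  # stochastic pick
    return " ".join(sentence)

print(generate())   # e.g. "<s> I am Sam </s>"
```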

Page 19

Train and Test Corpora

A language model must be trained on a large corpus of text to estimate good parameter values.

Model can be evaluated based on its ability to predict a high probability for a disjoint test corpus (testing on the training corpus would give an optimistically biased estimate).

Ideally, the training (and test) corpus should be representative of the actual application data.

Page 20

Unknown Words

How to handle words in the test corpus that did not occur in the training data, i.e. out of vocabulary (OOV) words?

Train a model that includes an explicit symbol for an unknown word (<UNK>).

Choose a vocabulary in advance and replace other words in the training corpus with <UNK>.

Replace the first occurrence of each word in the training data with <UNK>.
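A minimal sketch of the last option above (replacing the first occurrence of each word type with <UNK>); the function name is ours.

```python
# Sketch: replace the first occurrence of every word type with <UNK>,
# so the model learns probabilities for unknown words.
def mark_first_occurrences(tokens):
    seen = set()
    out = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            out.append("<UNK>")    # first time we meet this word type
        else:
            out.append(tok)
    return out

print(mark_first_occurrences("the cat sat on the mat the cat".split()))
# ['<UNK>', '<UNK>', '<UNK>', '<UNK>', 'the', '<UNK>', 'the', 'cat']
```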

Page 21

Evaluation of Language Models

Ideally, evaluate the use of the model in an end user's application (extrinsic):

Realistic

Expensive

Evaluate on the ability to model a test corpus (intrinsic):

Less realistic

Cheaper

Verify at least once that the intrinsic evaluation correlates with an extrinsic one.

Page 22

Extrinsic evaluation of N-gram models

Best evaluation for comparing models A and B

Put each model in a task

spelling corrector, speech recognizer, machine translation system

Run the task, get an accuracy for A and for B

How many misspelled words corrected properly

How many words translated correctly

Compare accuracy for A and B


Page 23

Difficulty of extrinsic evaluation

Extrinsic evaluation: time-consuming; can take days or weeks.

Intrinsic evaluation: perplexity.

A bad approximation unless the test data looks just like the training data, so it is generally only useful in pilot experiments.


Page 24

Intuition of Perplexity

How well can we predict the next word?

Unigrams provide the lowest accuracy.

A better model is one which assigns a higher probability to the word that actually occurs.

I always order pizza with cheese and ____
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100

The 33rd President of the US was ____

I saw a ____


Page 25

Perplexity

The best language model is one that best predicts an unseen test set (i.e. gives the highest P(sentence)).

Perplexity is the inverse probability of the test set, normalized by the number of words:

$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \dots w_N)}}$

Expanding with the chain rule:

$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_1 \dots w_{i-1})}}$

For bigrams:

$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_{i-1})}}$

Page 26

Perplexity

Measure of how well a model “fits” the test data.

Uses the probability that the model assigns to the test corpus.

Normalizes for the number of words in the test corpus and takes the inverse.

$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \dots w_N)}}$

• Measures the weighted average branching factor in predicting the next word (lower is better).
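A minimal sketch of this computation, assuming the model's per-word probabilities P(w_i | history) have already been collected into a list; logs are used to avoid underflow.

```python
import math

# Sketch: perplexity from a list of per-word model probabilities
# P(w_i | history); uses logs to avoid numeric underflow.
def perplexity(word_probs):
    N = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / N)   # = P(w_1 ... w_N) ** (-1/N)

# Ten words, each assigned probability 1/10:
print(perplexity([0.1] * 10))        # 10.0
```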

Page 27

Perplexity as branching factor

Suppose a sentence consists of only ten words. What is the perplexity of this sentence according to a model that assigns P = 1/10 to each word?
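Applying the definition above:

$$PP(W) = \left(\left(\tfrac{1}{10}\right)^{10}\right)^{-\frac{1}{10}} = \left(\tfrac{1}{10}\right)^{-1} = 10$$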


Page 28

Example

A random sentence contains the following three words, which appear with the probabilities below. Using a unigram model, what is the perplexity of the sequence (he, loves, her)?

P(he) = 2/5

P(loves) = 1/5

P(her) = 2/5

$PP(\text{he, loves, her}) = \left(\dfrac{2}{5} \times \dfrac{1}{5} \times \dfrac{2}{5}\right)^{-\frac{1}{3}}$
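Evaluating this expression:

$$PP(\text{he, loves, her}) = \left(\tfrac{4}{125}\right)^{-\frac{1}{3}} \approx 3.15$$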

Page 29

Lower perplexity = better model

Training: 38 million words; test: 1.5 million words (WSJ).

N-gram order:  Unigram  Bigram  Trigram
Perplexity:    962      170     109

Page 30

Smoothing

Since there is a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so the model incorrectly assigns zero probability to many parameters. This is the sparse data problem; a morphologically rich language such as Amharic is a typical example.

If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).

In practice, parameters are smoothed to reassign some probability mass to unseen events. Adding probability mass to unseen events requires removing it from seen ones (discounting) in order to maintain a joint distribution that sums to 1.
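As a minimal illustration of discounting, the sketch below uses add-one (Laplace) smoothing, the simplest such technique (the next slide lists more advanced ones); the tiny training corpus is illustrative.

```python
from collections import Counter

# Minimal sketch of add-one (Laplace) smoothing for bigram probabilities.
# The two training sentences are illustrative.
sentences = [["<s>", "I", "am", "Sam", "</s>"],
             ["<s>", "Sam", "I", "am", "</s>"]]

bigram_counts = Counter()
unigram_counts = Counter()
for sent in sentences:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

V = len(unigram_counts)   # vocabulary size (including <s> and </s>)

def p_laplace(word, prev):
    """Add-one estimate: unseen bigrams get a small nonzero probability."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("am", "I"))    # seen bigram: (2 + 1) / (2 + 5) ≈ 0.43
print(p_laplace("Sam", "am"))  # seen bigram: (1 + 1) / (2 + 5) ≈ 0.29
print(p_laplace("I", "am"))    # unseen bigram: 1 / (2 + 5) ≈ 0.14, not zero
```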

Page 31

Advanced Smoothing

There are various techniques developed to improve smoothing for language models:

Good-Turing

Interpolation

Backoff

Class-based (cluster) N-grams

Page 32

Model Combination

As N increases, the power (expressiveness) of an N-gram model increases, but the ability to estimate accurate parameters from sparse data decreases (i.e. the smoothing problem gets worse).

A general approach is to combine the results of multiple N-gram models of increasing complexity.

Page 33

Interpolation

Linearly combine estimates of N-gram models of increasing order.

Interpolated Trigram Model:

$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$, where $\sum_i \lambda_i = 1$

• Learn proper values for the λi by training to (approximately) maximize the likelihood of an independent development corpus.

• This is called tuning.

• Recall the model we built for tagging.
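A minimal sketch of the interpolated estimate, assuming the component unigram, bigram, and trigram models already exist; the lambda weights here are placeholders that would normally be tuned on a development corpus.

```python
# Sketch of linear interpolation of unigram, bigram, and trigram estimates.
# Component models and lambda weights are illustrative; in practice the
# lambdas are tuned on a held-out development corpus.
LAMBDAS = (0.5, 0.3, 0.2)   # weights for trigram, bigram, unigram; sum to 1

def p_interpolated(w, prev2, prev1, p_tri, p_bi, p_uni):
    """P-hat(w | prev2 prev1) = l1*P(w|prev2 prev1) + l2*P(w|prev1) + l3*P(w)."""
    l1, l2, l3 = LAMBDAS
    return l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)

# Toy component estimates, just to show the combination:
p = p_interpolated("phone", "your", "cell",
                   p_tri=lambda w, a, b: 0.50,
                   p_bi=lambda w, a: 0.40,
                   p_uni=lambda w: 0.001)
print(p)   # 0.5*0.50 + 0.3*0.40 + 0.2*0.001 = 0.3702
```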

Page 34

Backoff

Only use a lower-order model when data for the higher-order model is unavailable (i.e. its count is zero).

Recursively back off to weaker models until data is available.

Start from the most restrictive model and move down to more relaxed models.

Page 35

A Problem for N-Grams: Long Distance Dependencies

Many times local context does not provide the most useful predictive clues; these are instead provided by long-distance dependencies.

Syntactic dependencies:
“The man next to the large oak tree near the grocery store on the corner is tall.”
“The men next to the large oak tree near the grocery store on the corner are tall.”

Semantic dependencies:
“The bird next to the large oak tree near the grocery store on the corner flies rapidly.”
“The man next to the large oak tree near the grocery store on the corner talks rapidly.”

More complex models of language are needed to handle such dependencies.

Page 36

The cases

Unigram model: probability of the occurrence of a word in a corpus.

Bigram model: probability of the occurrence of a word followed by another word.

Trigram model: probability of the occurrence of a word in a corpus followed by two specific words.

N-gram model: …… n words.

Page 37

N-gram models

We can extend to trigrams, 4-grams, 5-grams.

In general this is an insufficient model of language, because language has long-distance dependencies:

“The computer which I had just put into the machine room on the fifth floor crashed.”

But we can often get away with N-gram models.

Page 38

Estimating bigram probabilities

The Maximum Likelihood Estimate:

$P(w_i \mid w_{i-1}) = \dfrac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$

Page 39

An example

<s> I am Sam </s>

<s> Sam I am </s>

<s> I do not like green eggs and ham </s>

$P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
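A minimal sketch computing these maximum-likelihood bigram estimates from the three sentences above (function and variable names are ours):

```python
from collections import Counter

# Sketch: maximum-likelihood bigram estimates from the three sentences above.
sentences = [["<s>", "I", "am", "Sam", "</s>"],
             ["<s>", "Sam", "I", "am", "</s>"],
             ["<s>", "I", "do", "not", "like", "green",
              "eggs", "and", "ham", "</s>"]]

bigram_counts = Counter()
unigram_counts = Counter()
for sent in sentences:
    unigram_counts.update(sent[:-1])          # histories; </s> never starts a bigram
    bigram_counts.update(zip(sent, sent[1:]))

def p_mle(w, prev):
    """c(prev, w) / c(prev)"""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))    # 2/3 ≈ 0.67
print(p_mle("am", "I"))     # 2/3 ≈ 0.67
print(p_mle("Sam", "am"))   # 1/2 = 0.5
```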

Page 40

Raw bigram counts

Out of 9222 sentences

Page 41

Raw bigram probabilities

Normalize by unigrams:

Result:

Page 42

Bigram estimates of sentence probabilities

P(<s> I want english food </s>) =

P(I|<s>)

× P(want|I)

× P(english|want)

× P(food|english)

× P(</s>|food)

= .000031

Page 43

End of Topic 4