Statistical Natural Language Processing and Applications
Textbooks you can refer to

Jacob Eisenstein: Natural Language Processing. 2018 (draft).

Jurafsky, D. and J. H. Martin: Speech and Language Processing. Prentice-Hall. 2009. 2nd edition (3rd edition, 2019 draft: http://web.stanford.edu/~jurafsky/slp3/).

Yoav Goldberg: A Primer on Neural Network Models for Natural Language Processing (pdf).

Manning, C. D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press. 1999. ISBN 0-262-13360-1.


Goals of HLT

Computers would be a lot more useful if they could handle our email, do our library research, talk to us …

But they are fazed by natural human language.

How can we give computers the ability to handle human language? (Or help them learn it as kids do?)


A few applications of HLT

- Spelling correction, grammar checking … (language learning and evaluation, e.g. TOEFL essay scoring)
- Better search engines
- Information extraction, gisting
- Psychotherapy; Harlequin romances; etc.
- New interfaces:
  - Speech recognition (and text-to-speech)
  - Dialogue systems (USS Enterprise onboard computer)
  - Machine translation; speech translation (the Babel tower??)
  - Trans-lingual summarization, detection, extraction …


Question Answering: IBM’s Watson

Won Jeopardy on February 16, 2011!


Clue: WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDOVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL

Answer: Bram Stoker

Information Extraction

Subject: curriculum meeting

Date: January 15, 2012

To: Dan Jurafsky

Hi Dan, we’ve now scheduled the curriculum meeting.

It will be in Gates 159 tomorrow from 10:00-11:30.

-Chris


Create new Calendar entry

Event: Curriculum mtg
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159

Information Extraction & Sentiment Analysis

nice and compact to carry!

since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!

the camera feels flimsy, is plastic and very light in weight

you have to be very delicate in the handling of this camera


Attribute: size and weight

Attributes: zoom, affordability, size and weight, flash, ease of use

Machine Translation

• Fully automatic
• Helping human translators

Enter Source Text: 这 不过 是 一 个 时间 的 问题 .

Translation from Stanford’s Phrasal: This is only a matter of time.

Language Technology

mostly solved:
- Spam detection: “Let’s go to Agra!” vs. “Buy V1AGRA …”
- Part-of-speech (POS) tagging: Colorless green ideas sleep furiously. → ADJ ADJ NOUN VERB ADV
- Named entity recognition (NER): Einstein met with UN officials in Princeton → PERSON ORG LOC

making good progress:
- Sentiment analysis: Best roast chicken in San Francisco! / The waiter ignored us for 20 minutes.
- Coreference resolution: Carter told Mubarak he shouldn’t run again.
- Word sense disambiguation (WSD): I need new batteries for my mouse.
- Parsing: I can see Alcatraz from the window!
- Machine translation (MT): 第13届上海国际电影节开幕… → The 13th Shanghai International Film Festival…
- Information extraction (IE): You’re invited to our dinner party, Friday May 27 at 8:30 → Party, May 27 (add to calendar)

still really hard:
- Question answering (QA): Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?
- Paraphrase: XYZ acquired ABC yesterday ≈ ABC has been taken over by XYZ
- Summarization: The Dow Jones is up / The S&P500 jumped / Housing prices rose → Economy is good
- Dialog: Where is Citizen Kane playing in SF? → Castro Theatre at 7:30. Do you want a ticket?

Ambiguity makes NLP hard: “Crash blossoms”

- Violinist Linked to JAL Crash Blossoms
- Teacher Strikes Idle Kids
- Red Tape Holds Up New Bridges
- Hospitals Are Sued by 7 Foot Doctors
- Juvenile Court to Try Shooting Defendant
- Local High School Dropouts Cut in Half

Why else is natural language understanding difficult?

- non-standard English: “Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥”
- segmentation issues: the New York-New Haven Railroad (two possible segmentations of the same string)
- idioms: dark horse, get cold feet, lose face, throw in the towel
- neologisms: unfriend, Retweet, bromance
- tricky entity names: Where is A Bug’s Life playing … / Let It Be was recorded … / … a mutation on the for gene …
- world knowledge: Mary and Sue are sisters. / Mary and Sue are mothers.

But that’s what makes it fun!

Levels of Language

Phonetics/phonology/morphology: what words (or subwords) are we dealing with?

Syntax: What phrases are we dealing with? Which words modify one another?

Semantics: What’s the literal meaning?

Pragmatics: What should you conclude from the fact that I said something? How should you react?


What’s hard – ambiguities, ambiguities, all different levels of ambiguities

John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. [from J. Eisner]

- donut: To get a donut (doughnut; spare tire) for his car?

- Donut store: store where donuts shop? or is run by donuts? or looks like a big donut? or made of donut?

- From work: Well, actually, he stopped there from hunger and exhaustion, not just from work.

- Every few hours: That’s how often he thought it? Or that’s for coffee?

- it: the particular coffee that was good every few hours? the donut store? the situation

- Too expensive: too expensive for what? what are we supposed to conclude about what John did?


Statistical NLP

Imagine:

Each sentence W = w1, w2, ..., wn gets a probability P(W|X) in a context X (think of it in the intuitive sense for now)

For every possible context X, sort all the imaginable sentences W according to P(W|X):

Ideal situation: sentences sorted by P(W|X) put the best sentence (most probable in context X) at the top, with “ungrammatical” sentences at (near-)zero probability. (NB: the same holds for interpretation.)


Real World Situation

Unable to specify set of grammatical sentences today using fixed “categorical” rules (maybe never)

Use a statistical “model” based on REAL WORLD DATA and care about the best sentence only (disregarding the “grammaticality” issue):

Wbest = argmaxW P(W)


Language Modeling (and the Noisy Channel)


The Noisy Channel

Prototypical case:

Input → The channel (adds noise) → Output (noisy)

0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...

Model: probability of error (noise):

Example: p(0|1) = .3, p(1|1) = .7, p(1|0) = .4, p(0|0) = .6

The Task:

known: the noisy output; want to know: the input (decoding)
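A minimal Python sketch of this decoding task for the binary channel above; the channel probabilities are the ones from the example, while the uniform prior over input bits and the bit-by-bit independent decoding are assumptions made here:

```python
# Decoding for the binary noisy channel: pick the input bit that
# maximizes p(observed | sent) * p(sent).
channel = {(0, 1): 0.3, (1, 1): 0.7,   # p(observed | sent = 1)
           (1, 0): 0.4, (0, 0): 0.6}   # p(observed | sent = 0)
prior = {0: 0.5, 1: 0.5}               # assumed uniform prior over input bits

def decode_bit(observed):
    return max((0, 1), key=lambda sent: channel[(observed, sent)] * prior[sent])

noisy_output = [0, 1, 1, 0, 0, 1, 1, 0]
print([decode_bit(b) for b in noisy_output])
```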


Noisy Channel Applications

- OCR (straightforward): text → print (adds noise), scan → image
- Handwriting recognition: text → neurons, muscles (“noise”), scan/digitize → image
- Speech recognition (dictation, commands, etc.): text → conversion to acoustic signal (“noise”) → acoustic waves
- Machine Translation: text in target language → translation (“noise”) → source language
- Also: Part-of-speech tagging: sequence of tags → selection of word forms → text


Noisy Channel: The Golden Rule of ...

OCR, ASR, HR, MT, ...

Recall:

p(A|B) = p(B|A) p(A) / p(B) (Bayes formula)

Abest = argmaxA p(B|A) p(A)   (The Golden Rule)

p(B|A): the acoustic/image/translation/lexical model (application-specific name; will explore later)

p(A): the language model
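A tiny sketch of the Golden Rule as an argmax over candidate inputs; the two candidate strings echo the spell-correction example later in these slides, and the channel and language-model probabilities are made-up illustration values:

```python
# A_best = argmax_A p(B|A) * p(A), over a hypothetical candidate set.
candidates = {
    # A: (p(B|A): channel model, p(A): language model) -- illustration values only
    "fifteen minuets": (0.95, 1e-9),
    "fifteen minutes": (0.01, 1e-5),
}
A_best = max(candidates, key=lambda A: candidates[A][0] * candidates[A][1])
print(A_best)   # "fifteen minutes": the language model outweighs the channel model
```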


Probabilistic Language Models

• Today’s goal: assign a probability to a sentence. Why?
• Machine Translation: P(high winds tonite) > P(large winds tonite)
• Spell Correction: “The office is about fifteen minuets from my house” → P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition: P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc., etc.!!

Probabilistic Language Modeling

Goal: compute the probability of a sentence or sequence of words:

P(W) = P(w1,w2,w3,w4,w5…wn)

Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4)

A model that computes either of these:

P(W) or P(wn|w1,w2…wn-1) is called a language model.

Better name: the grammar. But “language model” or LM is standard.

n-gram Language Models

(n-1)th order Markov approximation → n-gram LM:

p(W) =df Πi=1..n p(wi|wi-n+1,wi-n+2,...,wi-1)

In particular (assume vocabulary size |V| = 60k):

0-gram LM: uniform model, p(w) = 1/|V|, 1 parameter

1-gram LM: unigram model, p(w), 6×10^4 parameters

2-gram LM: bigram model, p(wi|wi-1), 3.6×10^9 parameters

3-gram LM: trigram model, p(wi|wi-2,wi-1), 2.16×10^14 parameters


Maximum Likelihood Estimate

MLE: Relative Frequency... best predicts the data at hand (the “training data”)

Trigrams from Training Data T:

count sequences of three words in T: c3(wi-2,wi-1,wi) [NB: notation: just saying that the three words follow each other]

count sequences of two words in T: c2(wi-1,wi):

either use c2(y,z) = Σw c3(y,z,w)

or count differently at the beginning (& end) of data!

p(wi|wi-2,wi-1) =est. c3(wi-2,wi-1,wi) / c2(wi-2,wi-1)
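A minimal Python sketch of this relative-frequency estimate; the toy sentence reuses the training data from the next slide, and obtaining c2 by summing c3 over the third word follows one of the two conventions above:

```python
from collections import Counter

def trigram_mle(tokens):
    """p(wi | wi-2, wi-1) = c3(wi-2, wi-1, wi) / c2(wi-2, wi-1)."""
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
    c2 = Counter()                      # c2(y, z) = sum over w of c3(y, z, w)
    for (y, z, w), n in c3.items():
        c2[(y, z)] += n
    return {(y, z, w): n / c2[(y, z)] for (y, z, w), n in c3.items()}

tokens = "<s> <s> He can buy the can of soda .".split()
p3 = trigram_mle(tokens)
print(p3[("He", "can", "buy")])         # 1.0
```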

LM: an Example

Training data:

<s> <s> He can buy the can of soda.

Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125

p1(can) = .25

Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5,

p2(of|can) = .5, p2(the|buy) = 1,...

Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1,

p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.

(normalized for all n-grams) Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!
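A small sketch that reproduces the unigram probabilities and H(p1) = 2.75 for this toy example, assuming (as here) that the entropy is computed over the 8 real tokens and the <s> symbols are ignored:

```python
import math
from collections import Counter

tokens = "He can buy the can of soda .".split()        # the 8 training tokens, <s> ignored
p1 = {w: c / len(tokens) for w, c in Counter(tokens).items()}

# H(p1) = -sum_w p1(w) * log2 p1(w)
H_p1 = -sum(p * math.log2(p) for p in p1.values())
print(p1["can"], p1["He"], H_p1)                       # 0.25 0.125 2.75
```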


LM: an Example (The Problem)

Cross-entropy:

S = <s> <s> It was the greatest buy of all. (test data)

Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:

all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0.

all bigram probabilities are 0.

all trigram probabilities are 0.

We want to make all probabilities non-zero: data sparseness handling.


Why do we need Nonzero Probs?

To avoid infinite Cross Entropy:

happens when an event is found in test data which has not been seen in training data

H(p) = ∞: prevents comparing data with ≥ 0 “errors”

To make the system more robust

low count estimates:

they typically happen for “detailed” but relatively rare appearances

high count estimates: reliable but less “detailed”


Eliminating the Zero Probabilities: Smoothing

Get new p’(w) (same Ω): almost p(w) but no zeros

Discount w for (some) p(w) > 0: new p’(w) < p(w)

Σw∈discounted (p(w) - p’(w)) = D

Distribute D to all w with p(w) = 0: new p’(w) > p(w)

possibly also to other w with low p(w)

For some w (possibly): p’(w) = p(w)

Make sure Σw∈Ω p’(w) = 1

There are many ways of smoothing


Smoothing by Adding 1 (Laplace)

Simplest but not really usable:

Predicting words w from a vocabulary V, training data T:

p’(w|h) = (c(h,w) + 1) / (c(h) + |V|)

for non-conditional distributions: p’(w) = (c(w) + 1) / (|T| + |V|)

Problem if |V| > c(h) (as is often the case; even >> c(h)!)

Example: Training data: <s> what is it what is small ?   (|T| = 8)
V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12

p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25^2 × .125^2 ≅ .001
p(it is flying.) = .125 × .25 × 0^2 = 0

p’(it) = .1, p’(what) = .15, p’(.) = .05
p’(what is it?) = .15^2 × .1^2 ≅ .0002
p’(it is flying.) = .1 × .15 × .05^2 ≅ .00004

(assume word independence!)
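A minimal sketch of the add-one estimate for the non-conditional case, reproducing the p’(w) values above (word independence assumed, as on the slide):

```python
from collections import Counter

train = "<s> what is it what is small ?".split()                  # |T| = 8
V = "what is it small ? <s> flying birds are a bird .".split()    # |V| = 12
counts = Counter(train)

def p_add_one(w):
    """p'(w) = (c(w) + 1) / (|T| + |V|)"""
    return (counts[w] + 1) / (len(train) + len(V))

print(p_add_one("it"), p_add_one("what"), p_add_one("."))   # 0.1 0.15 0.05

prob = 1.0
for w in "what is it ?".split():        # word independence assumed
    prob *= p_add_one(w)
print(prob)                             # 0.000225, i.e. ≅ .0002
```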


Adding less than 1

Equally simple:

Predicting words w from a vocabulary V, training data T:

p’(w|h) = (c(h,w) + λ) / (c(h) + λ|V|), λ < 1

for non-conditional distributions: p’(w) = (c(w) + λ) / (|T| + λ|V|)

Example: Training data: <s> what is it what is small ?   (|T| = 8)
V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12

p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25^2 × .125^2 ≅ .001
p(it is flying.) = .125 × .25 × 0^2 = 0

Use λ = .1:

p’(it) ≅ .12, p’(what) ≅ .23, p’(.) ≅ .01
p’(what is it?) = .23^2 × .12^2 ≅ .0007
p’(it is flying.) = .12 × .23 × .01^2 ≅ .000003


Perplexity

Perplexity is the probability of the test set, normalized by the number of words:

PP(W) = P(w1 w2 ... wN)^(-1/N)

Chain rule: PP(W) = (Πi=1..N 1/P(wi|w1...wi-1))^(1/N)

For bigrams: PP(W) = (Πi=1..N 1/P(wi|wi-1))^(1/N)

Minimizing perplexity is the same as maximizing probability

The best language model is one that best predicts an unseen test set:
• Gives the highest P(sentence)

[Table: n-gram model perplexities on The Wall Street Journal corpus]
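A minimal sketch of perplexity under a bigram model, computed in log space to avoid underflow; the uniform stand-in model is a made-up placeholder for whatever smoothed bigram estimate would be plugged in:

```python
import math

def perplexity(test_tokens, bigram_prob):
    """PP(W) = P(w1...wN)^(-1/N), via the bigram chain rule.
    bigram_prob(prev, w) must be non-zero for every test bigram (i.e. smoothed)."""
    N = len(test_tokens)
    log_prob, prev = 0.0, "<s>"
    for w in test_tokens:
        log_prob += math.log2(bigram_prob(prev, w))
        prev = w
    return 2 ** (-log_prob / N)

# Placeholder model: uniform over a 12-word vocabulary (plug in a real smoothed LM here).
print(perplexity("what is it ?".split(), lambda prev, w: 1 / 12))   # ≈ 12, the vocabulary size
```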

Text Classification and Naïve Bayes

The Task of Text Classification

Text Classification

• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …


Text Classification: definition

• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}

• Output: a predicted class c ∈ C


Bag of words for document classification

Training classes and their characteristic words:
- NLP: parser, tag, training, translation, language, ...
- Machine Learning: learning, training, algorithm, shrinkage, network, ...
- Garbage Collection: garbage, collection, memory, optimization, region, ...
- Planning: planning, temporal, reasoning, plan, language, ...
- GUI: ...

Test document: parser, language, label, translation, … → which class?


Naïve Bayes Classifier (I)

MAP is “maximum a posteriori” = most likely class:

cMAP = argmaxc∈C P(c|d)

Bayes Rule: cMAP = argmaxc∈C P(d|c) P(c) / P(d)

Dropping the denominator: cMAP = argmaxc∈C P(d|c) P(c)

Multinomial Naïve Bayes Classifier

With the bag-of-words and conditional-independence assumptions:

cNB = argmaxc∈C P(c) Πi=1..n P(wi|c)

Generative Model for Multinomial Naïve Bayes


c = China

X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds
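A minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing on the word likelihoods; the first toy document mirrors the generative-model example above, while the second document and all counts are made up for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (class, word list). Returns priors, add-one-smoothed P(w|c), vocab."""
    vocab = {w for _, words in docs for w in words}
    word_counts = defaultdict(Counter)
    for c, words in docs:
        word_counts[c].update(words)
    priors = {c: n / len(docs) for c, n in Counter(c for c, _ in docs).items()}
    def likelihood(w, c):
        return (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
    return priors, likelihood, vocab

def classify_nb(words, priors, likelihood, vocab):
    """c_NB = argmax_c log P(c) + sum_i log P(wi|c); unknown test words are skipped."""
    return max(priors, key=lambda c: math.log(priors[c])
               + sum(math.log(likelihood(w, c)) for w in words if w in vocab))

# Hypothetical toy corpus (not from the slides):
docs = [("China", "Shanghai and Shenzhen issue bonds".split()),
        ("Japan", "Tokyo exchange lists bonds".split())]
priors, likelihood, vocab = train_nb(docs)
print(classify_nb("Shenzhen bonds".split(), priors, likelihood, vocab))   # China
```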

39