Machine Learning Summer School, Jan. 15, 2015, UT Austin

Application: Natural Language Processing

Raymond J. Mooney, University of Texas at Austin

Natural Language Processing

• NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.

• Also called Computational Linguistics
  – Also concerns how computational methods can aid the understanding of human language

NL Communication (Russell & Norvig, 1995)

Syntax, Semantics, Pragmatics

• Syntax concerns the proper ordering of words and its effect on meaning.
  – The dog bit the boy.
  – The boy bit the dog.
  – * Bit boy dog the the.
  – Colorless green ideas sleep furiously.

• Semantics concerns the (literal) meaning of words, phrases, and sentences.
  – “plant” as a photosynthetic organism
  – “plant” as a manufacturing facility
  – “plant” as the act of sowing

• Pragmatics concerns the overall communicative and social context and its effect on interpretation.
  – The ham sandwich wants another beer. (co-reference, anaphora)
  – John thinks vanilla. (ellipsis)

Modular Comprehension

[Figure: modular comprehension pipeline: sound waves → Acoustic/Phonetic → words → Syntax → parse trees → Semantics → literal meaning → Pragmatics → contextualized meaning.]

Ambiguity

• Natural language is highly ambiguous and must be disambiguated.
  – I saw the man on the hill with a telescope.
  – I saw the Grand Canyon flying to LA.
  – Time flies like an arrow.
  – Horse flies like a sugar cube.
  – Time runners like a coach.
  – Time cars like a Porsche.

Ambiguity is Ubiquitous

• Speech Recognition
  – “recognize speech” vs. “wreck a nice beach”
  – “youth in Asia” vs. “euthanasia”

• Syntactic Analysis
  – “I ate spaghetti with chopsticks” vs. “I ate spaghetti with meatballs.”

• Semantic Analysis
  – “The dog is in the pen.” vs. “The ink is in the pen.”
  – “I put the plant in the window” vs. “Ford put the plant in Mexico”

• Pragmatic Analysis
  – From “The Pink Panther Strikes Again”:
    Clouseau: Does your dog bite?
    Hotel Clerk: No.
    Clouseau: [bowing down to pet the dog] Nice doggie. [Dog barks and bites Clouseau in the hand]
    Clouseau: I thought you said your dog did not bite!
    Hotel Clerk: That is not my dog.

Ambiguity is Explosive

• Ambiguities compound to generate enormous numbers of possible interpretations.

• In English, a sentence ending in n prepositional phrases has over 2^n syntactic interpretations (cf. Catalan numbers).
  – “I saw the man with the telescope”: 2 parses
  – “I saw the man on the hill with the telescope.”: 5 parses
  – “I saw the man on the hill in Texas with the telescope”: 14 parses
  – “I saw the man on the hill in Texas with the telescope at noon.”: 42 parses
  – “I saw the man on the hill in Texas with the telescope at noon on Monday”: 132 parses

Natural Language Tasks

• Processing natural language text involves a variety of syntactic, semantic, and pragmatic tasks, in addition to other problems.


Syntactic Tasks

Word Segmentation

• Breaking a string of characters (graphemes) into a sequence of words.

• In some written languages (e.g. Chinese) words are not separated by spaces.

• Even in English, characters other than white-space can be used to separate words [e.g. , ; . - : ( ) ]

• Examples from English URLs:
  – jumptheshark.com → jump the shark .com
  – myspace.com/pluckerswingbar → myspace .com pluckers wing bar
                                → myspace .com plucker swing bar

Morphological Analysis

• Morphology is the field of linguistics that studies the internal structure of words. (Wikipedia)

• A morpheme is the smallest linguistic unit that has semantic meaning. (Wikipedia)
  – e.g. “carry”, “pre”, “ed”, “ly”, “s”

• Morphological analysis is the task of segmenting a word into its morphemes:
  – carried → carry + ed (past tense)
  – independently → in + (depend + ent) + ly
  – Googlers → (Google + er) + s (plural)
  – unlockable → un + (lock + able) ?
               → (un + lock) + able ?

Part Of Speech (POS) Tagging

• Annotate each word in a sentence with a part-of-speech.

• Useful for subsequent syntactic parsing and word sense disambiguation.

  I   ate the spaghetti with meatballs.
  Pro V   Det N         Prep N

  John saw the saw and decided to   take it  to   the table.
  PN   V   Det N   Con V       Part V    Pro Prep Det N

Phrase Chunking

• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.
  – [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
  – [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]


Syntactic Parsing

• Produce the correct syntactic parse tree for a sentence.


Semantic Tasks

Word Sense Disambiguation (WSD)

• Words in natural language usually have a fair number of different possible meanings.
  – Ellen has a strong interest in computational linguistics.
  – Ellen pays a large amount of interest on her credit card.

• For many tasks (question answering, translation), the proper sense of each ambiguous word in a sentence must be determined.

Semantic Role Labeling (SRL)

• For each clause, determine the semantic role played by each noun phrase that is an argument to the verb (e.g. agent, patient, source, destination, instrument).
  – John drove Mary from Austin to Dallas in his Toyota Prius.
  – The hammer broke the window.

• Also referred to as “case role analysis,” “thematic analysis,” and “shallow semantic parsing”

Semantic Parsing

• A semantic parser maps a natural-language sentence to a complete, detailed semantic representation (logical form).

• For many applications, the desired output is immediately executable by another program.

• Example: Mapping an English database query to Prolog:
    How many cities are there in the US?
    answer(A, count(B, (city(B), loc(B, C), const(C, countryid(USA))), A))


Textual Entailment

• Determine whether one natural language sentence entails (implies) another under an ordinary interpretation.

Textual Entailment Problems from PASCAL Challenge

TEXT: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
HYPOTHESIS: Yahoo bought Overture.
ENTAILMENT: TRUE

TEXT: Microsoft's rival Sun Microsystems Inc. bought Star Office last month and plans to boost its development as a Web-based device running over the Net on personal computers and Internet appliances.
HYPOTHESIS: Microsoft bought Star Office.
ENTAILMENT: FALSE

TEXT: The National Institute for Psychobiology in Israel was established in May 1971 as the Israel Center for Psychobiology by Prof. Joel.
HYPOTHESIS: Israel was established in May 1971.
ENTAILMENT: FALSE

TEXT: Since its formation in 1948, Israel fought many wars with neighboring Arab countries.
HYPOTHESIS: Israel was established in 1948.
ENTAILMENT: TRUE


Pragmatics/Discourse Tasks

Anaphora Resolution/Co-Reference

• Determine which phrases in a document refer to the same underlying entity.
  – John put the carrot on the plate and ate it.
  – Bush started the war in Iraq. But the president needed the consent of Congress.

• Some cases require difficult reasoning.
  – Today was Jack's birthday. Penny and Janet went to the store. They were going to get presents. Janet decided to get a kite. "Don't do that," said Penny. "Jack has a kite. He will make you take it back."

Ellipsis Resolution

• Frequently words and phrases are omitted from sentences when they can be inferred from context.
  – "Wise men talk because they have something to say; fools, because they have to say something." (Plato)
  – Resolved: "Wise men talk because they have something to say; fools talk because they have to say something."


Other Applied Tasks

Text Categorization

• Assigning documents to a fixed set of categories.

• Applications:
  – Web pages
    • Recommending
    • Hierarchical classification into an ontology
  – News articles
    • Personalized newspaper
  – Email messages
    • Prioritizing
    • Spam filtering
  – Sentiment Analysis
    • Classifying reviews (products, restaurants, movies) as positive or negative
  – Authorship Attribution
  – Deception Detection
  – Native Language Identification

Information Extraction (IE)

• Identify phrases in language that refer to specific types of entities and relations in text.

• Named entity recognition is the task of identifying names of people, places, organizations, etc. in text.
  – Michael Dell is the CEO of Dell Computer Corporation and lives in Austin Texas.
    (people, organizations, places)

• Relation extraction identifies specific relations between entities.
  – Michael Dell is the CEO of Dell Computer Corporation and lives in Austin Texas.

Question Answering

• Directly answer natural language questions based on information presented in a corpus of textual documents (e.g. the web).
  – When was Barack Obama born? (factoid)
    • August 4, 1961
  – Who was president when Barack Obama was born?
    • John F. Kennedy
  – How many presidents have there been since Barack Obama was born?
    • 9

Text Summarization

• Produce a short summary of a longer document or article.
  – Article: With a split decision in the final two primaries and a flurry of superdelegate endorsements, Sen. Barack Obama sealed the Democratic presidential nomination last night after a grueling and history-making campaign against Sen. Hillary Rodham Clinton that will make him the first African American to head a major-party ticket. Before a chanting and cheering audience in St. Paul, Minn., the first-term senator from Illinois savored what once seemed an unlikely outcome to the Democratic race with a nod to the marathon that was ending and to what will be another hard-fought battle, against Sen. John McCain, the presumptive Republican nominee….
  – Summary: Senator Barack Obama was declared the presumptive Democratic presidential nominee.

Machine Translation (MT)

• Translate a sentence from one natural language to another.
  – Hasta la vista, bebé → Until we see each other again, baby.

Ambiguity Resolution is Required for Translation

• Syntactic and semantic ambiguities must be properly resolved for correct translation:
  – “John plays the guitar.” → “John toca la guitarra.”
  – “John plays soccer.” → “John juega el fútbol.”

• An apocryphal story is that an early MT system gave the following results when translating from English to Russian and then back to English:
  – “The spirit is willing but the flesh is weak.” → “The liquor is good but the meat is spoiled.”
  – “Out of sight, out of mind.” → “Invisible idiot.”

Resolving Ambiguity

• Choosing the correct interpretation of linguistic utterances requires knowledge of:
  – Syntax
    • An agent is typically the subject of the verb
  – Semantics
    • Michael and Ellen are names of people
    • Austin is the name of a city (and of a person)
    • Toyota is a car company and Prius is a brand of car
  – Pragmatics
  – World knowledge
    • Credit cards require users to pay financial interest
    • Agents must be animate and a hammer is not animate

Manual Knowledge Acquisition

• Traditional, “rationalist” approaches to language processing require human specialists to specify and formalize the required knowledge.

• Manual knowledge engineering is difficult, time-consuming, and error prone.

• “Rules” in language have numerous exceptions and irregularities.
  – “All grammars leak.” (Edward Sapir, 1921)

• Manually developed systems were expensive to develop and their abilities were limited and “brittle” (not robust).

Automatic Learning Approach

• Use machine learning methods to automatically acquire the required knowledge from appropriately annotated text corpora.

• Variously referred to as the “corpus-based,” “statistical,” or “empirical” approach.

• Statistical learning methods were first applied to speech recognition in the late 1970s and became the dominant approach in the 1980s.

• During the 1990s, the statistical training approach expanded and came to dominate almost all areas of NLP.

Learning Approach

[Figure: manually annotated training corpora are fed to machine learning, which produces linguistic knowledge used by an NLP system; the NLP system then turns raw text into automatically annotated text.]


Advantages of the Learning Approach

• Large amounts of electronic text are now available.

• Annotating corpora is easier and requires less expertise than manual knowledge engineering.

• Learning algorithms have progressed to be able to handle large amounts of data and produce accurate probabilistic knowledge.

• The probabilistic knowledge acquired allows robust processing that handles linguistic regularities as well as exceptions.


Text Categorization

Reducing Text Classification to Feature Vector Classification

• Ignore the order of words and treat each word as a feature.

• The value of a word feature is its frequency in the document; this is called the Bag of Words (BOW) representation.

• Usually remove stop words (common function words like “a”, “the”, “to”…).
  – Avoid this for authorship attribution, deception detection, …

• May include features for short phrases (i.e. n-grams).

• May weight features to give higher weight to rare words (e.g. TF/IDF weighting from information retrieval); see the sketch below.
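As a concrete illustration, here is a minimal bag-of-words sketch in Python; it is not from the slides, and the helper names and the tiny stop-word list are assumptions for illustration. It maps documents to word-count vectors and optionally re-weights them with TF-IDF.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "to", "of", "and"}   # tiny illustrative stop list

def bag_of_words(doc, remove_stops=True):
    """Map a document string to a {word: count} feature vector."""
    tokens = doc.lower().split()
    if remove_stops:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

def tfidf_weight(bows):
    """Re-weight each BOW by TF-IDF computed over the given collection."""
    n_docs = len(bows)
    df = Counter()                               # document frequency of each word
    for bow in bows:
        df.update(bow.keys())
    return [{w: tf * math.log(n_docs / df[w]) for w, tf in bow.items()} for bow in bows]

docs = ["the dog bit the boy", "the boy ate the spaghetti with meatballs"]
vectors = tfidf_weight([bag_of_words(d) for d in docs])
```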


Classifiers for Text

• The BOW representation results in very high-dimensional, sparse data.

• Many classifiers tend to over-fit such data, so a classifier with strong regularization is needed (see the sketch below).
  – Support Vector Machine (SVM)
  – Logistic regression with L1 regularization

• BOW classification works well for many applications despite its limitations.
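A minimal sketch of such a regularized text classifier, assuming scikit-learn is available; the toy training and test texts are placeholders, not data from the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["great movie loved it", "terrible plot and bad acting"]
train_labels = ["pos", "neg"]
test_texts = ["loved the acting"]

vectorizer = TfidfVectorizer(stop_words="english")   # BOW features with TF-IDF weighting
X_train = vectorizer.fit_transform(train_texts)      # sparse document-term matrix

# L1-regularized logistic regression keeps the high-dimensional, sparse model in check.
clf = LogisticRegression(penalty="l1", solver="liblinear")
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(test_texts)))
```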


POS Tagging and Sequence Labeling


Part Of Speech Tagging

• Annotate each word in a sentence with a part-of-speech marker.

• Lowest level of syntactic analysis.

• Useful for subsequent syntactic parsing and word sense disambiguation.

  John saw the saw and decided to take it  to the table.
  NNP  VBD DT  NN  CC  VBD     TO VB   PRP IN DT  NN


Beyond Classification Learning

• Standard classification problem assumes individual cases are disconnected and independent (i.i.d.: independently and identically distributed).

• Many NLP problems do not satisfy this assumption and involve making many connected decisions, each resolving a different ambiguity, but which are mutually dependent.

• More sophisticated learning and inference techniques are needed to handle such situations in general.


Sequence Labeling Problem

• Many NLP problems can be viewed as sequence labeling.

• Each token in a sequence is assigned a label.

• Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors (not i.i.d.).

  foo bar blam zonk zonk bar blam


Information Extraction

• Identify phrases in language that refer to specific types of entities and relations in text.

• Named entity recognition is the task of identifying names of people, places, organizations, etc. in text.
  – Michael Dell is the CEO of Dell Computer Corporation and lives in Austin Texas.
    (people, organizations, places)

• Extract pieces of information relevant to a specific application, e.g. used car ads (make, model, year, mileage, price):
  – For sale, 2002 Toyota Prius, 20,000 mi, $15K or best offer. Available starting July 30, 2006.


Semantic Role Labeling

• For each clause, determine the semantic role played by each noun phrase that is an argument to the verb (e.g. agent, patient, source, destination, instrument).
  – John drove Mary from Austin to Dallas in his Toyota Prius.
  – The hammer broke the window.

• Also referred to as “case role analysis,” “thematic analysis,” and “shallow semantic parsing”


Probabilistic Sequence Models

• Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment.

• Two standard models– Hidden Markov Model (HMM)– Conditional Random Field (CRF)

Sample Markov Model for POS

[Figure: a Markov chain over the POS tags Det, Noun, PropNoun, and Verb with start and stop states; the transition probabilities used in the computation below are start → PropNoun 0.4, PropNoun → Verb 0.8, Verb → Det 0.25, Det → Noun 0.95, and Noun → stop 0.1.]

P(PropNoun Verb Det Noun) = 0.4 * 0.8 * 0.25 * 0.95 * 0.1 = 0.0076
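To make the arithmetic explicit, here is a minimal Python sketch; the dictionary of transition probabilities is an assumption and is populated only with the edges used in the computation above.

```python
# Transition probabilities for the edges used above (illustrative subset only).
TRANS = {
    ("start", "PropNoun"): 0.4,
    ("PropNoun", "Verb"): 0.8,
    ("Verb", "Det"): 0.25,
    ("Det", "Noun"): 0.95,
    ("Noun", "stop"): 0.1,
}

def sequence_probability(tags, trans=TRANS):
    """P(tag sequence) = product of transition probabilities from start to stop."""
    path = ["start"] + list(tags) + ["stop"]
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        prob *= trans.get((prev, cur), 0.0)   # unseen transition -> probability 0
    return prob

print(sequence_probability(["PropNoun", "Verb", "Det", "Noun"]))   # 0.0076
```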

Hidden Markov Model

• Probabilistic generative model for sequences.

• Assume an underlying set of hidden (unobserved) states in which the model can be (e.g. parts of speech).

• Assume probabilistic transitions between states over time (e.g. transitions from one POS to another as the sequence is generated).

• Assume a probabilistic generation of tokens from states (e.g. words generated for each POS).

Sample HMM for POS / Sample HMM Generation

[Figure: an HMM with hidden states Det, Noun, PropNoun, and Verb (plus start and stop), the same transition probabilities as the Markov model above, and per-state emission distributions over words (e.g. Det emits “a”, “the”, “that”; Noun emits “cat”, “dog”, “car”, “pen”, “bed”, “apple”; PropNoun emits “John”, “Mary”, “Alice”, “Jerry”, “Tom”; Verb emits “bit”, “ate”, “saw”, “played”, “hit”, “gave”). A sequence of slides animates the model generating the sentence “John bit the apple” by alternately choosing a state transition and emitting a word from the current state.]


Three Useful HMM Tasks

• Observation Likelihood: To classify and order sequences.

• Most likely state sequence (Decoding): To tag each token in a sequence with a label.

• Maximum likelihood training (Learning): To train models to fit empirical training data.


HMM: Observation Likelihood

• Given a sequence of observations, O, and a model with a set of parameters, λ, what is the probability that this observation was generated by this model: P(O| λ) ?

• Allows HMM to be used as a language model: A formal probabilistic model of a language that assigns a probability to each string saying how likely that string was to have been generated by the language.

• Useful for two tasks:
  – Sequence Classification
  – Most Likely Sequence


Sequence Classification

• Assume an HMM is available for each category (i.e. language).

• What is the most likely category for a given observation sequence, i.e. which category’s HMM is most likely to have generated it?

• Used in speech recognition to find the most likely word model to have generated a given sound or phoneme sequence.

  O = “ah s t e n”:  P(O | Austin) > P(O | Boston) ?


Most Likely Sequence

• Of two or more possible sequences, which one was most likely generated by a given model?

• Used to score alternative word sequence interpretations in speech recognition.

  O1 = “dice precedent core”, O2 = “vice president Gore”:
  P(O2 | Ordinary English) > P(O1 | Ordinary English) ?


HMM: Observation Likelihood (Naïve Solution)

• Consider all possible state sequences, Q, of length T that the model could have traversed in generating the given observation sequence.

• Compute the probability of a given state sequence from A, and multiply it by the probabilities of generating each of the given observations in each of the corresponding states in this sequence to get P(O,Q | λ) = P(O | Q, λ) P(Q | λ).

• Sum this over all possible state sequences to get P(O | λ).

• Computationally complex: O(T·N^T).


HMM: Observation Likelihood (Efficient Solution)

• Due to the Markov assumption, the probability of being in any state at any given time t only relies on the probability of being in each of the possible states at time t−1.

• Forward Algorithm: Uses dynamic programming to exploit this fact to efficiently compute observation likelihood in O(TN^2) time.
  – Computes a forward trellis that compactly and implicitly encodes information about all possible state paths. A minimal implementation sketch follows.
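As an illustration, here is a minimal forward-algorithm sketch in Python; the dictionary-based HMM representation and the parameter names `init`, `trans`, and `emit` are assumptions, not code from the slides.

```python
def forward_likelihood(obs, states, init, trans, emit):
    """Compute P(O | lambda) with the forward algorithm in O(T * N^2) time.

    init[s]      = P(first state is s)
    trans[s][s2] = P(next state is s2 | current state is s)
    emit[s][o]   = P(observation o | state s)
    """
    # alpha[s] = P(o_1..o_t, state_t = s)
    alpha = {s: init[s] * emit[s].get(obs[0], 0.0) for s in states}
    for o in obs[1:]:
        alpha = {
            s2: sum(alpha[s] * trans[s].get(s2, 0.0) for s in states) * emit[s2].get(o, 0.0)
            for s2 in states
        }
    return sum(alpha.values())
```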


Most Likely State Sequence (Decoding)

• Given an observation sequence, O, and a model, λ, what is the most likely state sequence, Q = q1, q2, …, qT, that generated this sequence from this model?

• Used for sequence labeling: assuming each state corresponds to a tag, it determines the globally best assignment of tags to all tokens in a sequence, using a principled approach grounded in probability theory.

  John gave the dog an apple.

[A sequence of slides repeats the statement above while animating the decoding of “John gave the dog an apple.” into the tag sequence PropNoun Verb Det Noun Det Noun, choosing among the candidate tags Det, Noun, PropNoun, and Verb for each token.]


HMM: Most Likely State Sequence (Efficient Solution)

• Obviously, one could use a naïve algorithm based on examining every possible state sequence of length T.

• Dynamic programming can instead be used to exploit the Markov assumption and efficiently determine the most likely state sequence for a given observation sequence and model.

• The standard procedure is called the Viterbi algorithm (Viterbi, 1967) and also has O(N^2 T) time complexity. A minimal sketch follows.
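A minimal Viterbi sketch in Python, using the same assumed dictionary-based parameterization as the forward-algorithm sketch above; it is an illustrative implementation, not the slides' own code.

```python
def viterbi(obs, states, init, trans, emit):
    """Return the most likely state sequence for `obs` under the HMM parameters."""
    # delta[s] = probability of the best path for obs[:t+1] ending in state s
    delta = {s: init[s] * emit[s].get(obs[0], 0.0) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_delta, pointers = {}, {}
        for s2 in states:
            # Best predecessor state for s2 at this time step.
            best_prev = max(states, key=lambda s: delta[s] * trans[s].get(s2, 0.0))
            new_delta[s2] = delta[best_prev] * trans[best_prev].get(s2, 0.0) * emit[s2].get(o, 0.0)
            pointers[s2] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state.
    best = max(delta, key=delta.get)
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```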


HMM Learning

• Supervised Learning: All training sequences are completely labeled (tagged).

• Unsupervised Learning: All training sequences are unlabeled (but the number of tags, i.e. states, is generally known).

• Semisupervised Learning: Some training sequences are labeled, most are unlabeled.


Supervised HMM Training

• If training sequences are labeled (tagged) with the underlying state sequences that generated them, then the parameters λ = {A, B} can all be estimated directly.

[Figure: tagged training sequences such as “John ate the apple”, “A dog bit Mary”, “Mary hit the dog”, “John gave Mary the cat.” are fed to supervised HMM training, which produces an HMM over the states Det, Noun, PropNoun, and Verb.]

Supervised Parameter Estimation

• Estimate state transition probabilities based on tag bigram and unigram statistics in the labeled data:

  a_ij = C(q_t = s_i, q_t+1 = s_j) / C(q_t = s_i)

• Estimate the observation probabilities based on tag/word co-occurrence statistics in the labeled data:

  b_j(k) = C(q_i = s_j, o_i = v_k) / C(q_i = s_j)

• Use appropriate smoothing if training data is sparse. A counting sketch follows.
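A minimal sketch of these counts in Python; the input format of (word, tag) pairs and the function name are assumptions. It estimates transition and emission probabilities by relative frequency, without smoothing.

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Estimate A (transitions) and B (emissions) from [(word, tag), ...] sentences."""
    trans_counts = defaultdict(Counter)   # trans_counts[t1][t2] = C(q_t = t1, q_{t+1} = t2)
    emit_counts = defaultdict(Counter)    # emit_counts[t][w]    = C(tag = t, word = w)
    for sent in tagged_sentences:
        tags = ["<start>"] + [t for _, t in sent] + ["<stop>"]
        for t1, t2 in zip(tags, tags[1:]):
            trans_counts[t1][t2] += 1
        for w, t in sent:
            emit_counts[t][w] += 1
    trans = {t1: {t2: c / sum(cs.values()) for t2, c in cs.items()}
             for t1, cs in trans_counts.items()}
    emit = {t: {w: c / sum(cs.values()) for w, c in cs.items()}
            for t, cs in emit_counts.items()}
    return trans, emit

trans, emit = train_hmm([[("John", "PropNoun"), ("bit", "Verb"), ("the", "Det"), ("apple", "Noun")]])
```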


Learning and Using HMM Taggers

• Use a corpus of labeled sequence data to easily construct an HMM using supervised training.

• Given a novel unlabeled test sequence to tag, use the Viterbi algorithm to predict the most likely (globally optimal) tag sequence.


Evaluating Taggers

• Train on a training set of labeled sequences.

• Measure accuracy on a disjoint test set.

• Generally measure tagging accuracy, i.e. the percentage of tokens tagged correctly.

• Accuracy of most modern POS taggers, including HMMs, is 96−97% (for the Penn tagset, trained on about 800K words).
  – Generally matching the human agreement level.


Unsupervised Maximum Likelihood Training

[Figure: unannotated training sequences of phoneme strings (e.g. “ah s t e n”, “a s t i n”, “oh s t u n”, “eh z t en”, …) are fed to HMM training to produce an HMM model for “Austin”.]


Maximum Likelihood Training

• Given an observation sequence, O, what set of parameters, λ, for a given model maximizes the probability that this data was generated from this model (P(O| λ))?

• Used to train an HMM model and properly induce its parameters from a set of training data.

• Only need to have an unannotated observation sequence (or set of sequences) generated from the model. Does not need to know the correct state sequence(s) for the observation sequence(s). In this sense, it is unsupervised.


HMM: Maximum Likelihood Training (Efficient Solution)

• There is no known efficient algorithm for finding the parameters, λ, that truly maximize P(O | λ).

• However, using iterative re-estimation, the Baum-Welch algorithm (a.k.a. forward-backward), a version of a standard statistical procedure called Expectation Maximization (EM), is able to locally maximize P(O | λ).

• In practice, EM is able to find a good set of parameters that provide a good fit to the training data in many cases.


EM Algorithm

• Iterative method for learning a probabilistic categorization model from unsupervised data.

• Initially assume a random assignment of examples to categories.

• Learn an initial probabilistic model by estimating model parameters from this randomly labeled data.

• Iterate the following two steps until convergence:
  – Expectation (E-step): Compute P(ci | E) for each example given the current model, and probabilistically re-label the examples based on these posterior probability estimates.
  – Maximization (M-step): Re-estimate the model parameters from the probabilistically re-labeled data.


Syntactic Parsing

• Produce the correct syntactic parse tree for a sentence.

Context Free Grammars (CFG)

• N: a set of non-terminal symbols (or variables)
• Σ: a set of terminal symbols (disjoint from N)
• R: a set of productions or rules of the form A → α, where A is a non-terminal and α is a string of symbols from (Σ ∪ N)*
• S: a designated non-terminal called the start symbol

(A small sketch of a CFG as data follows.)
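To make the definition concrete, here is a minimal Python sketch of a CFG as data, with a random generator in the spirit of the sentence-generation slide below; the rule format and the choice of which ATIS rules to include are assumptions for illustration.

```python
import random

# A CFG as (N, Sigma, R, S); RULES maps each non-terminal to its right-hand sides.
# Only a few rules from the ATIS grammar below are included.
RULES = {
    "S":       [["NP", "VP"], ["VP"]],
    "NP":      [["Det", "Nominal"], ["Pronoun"]],
    "Nominal": [["Noun"], ["Nominal", "Noun"]],
    "VP":      [["Verb"], ["Verb", "NP"]],
    "Det":     [["the"], ["a"], ["that"]],
    "Noun":    [["book"], ["flight"]],
    "Verb":    [["book"], ["prefer"]],
    "Pronoun": [["I"], ["me"]],
}
START = "S"
NONTERMINALS = set(RULES)   # N; any symbol appearing only on a RHS is a terminal

def generate(symbol=START):
    """Recursively rewrite non-terminals until only terminal symbols remain."""
    if symbol not in RULES:
        return [symbol]                      # terminal symbol
    rhs = random.choice(RULES[symbol])       # pick a production for this non-terminal
    return [tok for part in rhs for tok in generate(part)]

print(" ".join(generate()))   # e.g. "book the flight"
```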

Simple CFG for ATIS English

Grammar:
  S → NP VP
  S → Aux NP VP
  S → VP
  NP → Pronoun
  NP → Proper-Noun
  NP → Det Nominal
  Nominal → Noun
  Nominal → Nominal Noun
  Nominal → Nominal PP
  VP → Verb
  VP → Verb NP
  VP → VP PP
  PP → Prep NP

Lexicon:
  Det → the | a | that | this
  Noun → book | flight | meal | money
  Verb → book | include | prefer
  Pronoun → I | he | she | me
  Proper-Noun → Houston | NWA
  Aux → does
  Prep → from | to | on | near | through

Sentence Generation

• Sentences are generated by recursively rewriting the start symbol using the productions until only terminal symbols remain.

[Figure: derivation (parse tree) for “book the flight through Houston”: S → VP; VP → Verb NP with Verb = book; NP → Det Nominal with Det = the; Nominal → Nominal PP; Nominal → Noun = flight; PP → Prep NP with Prep = through and NP → Proper-Noun = Houston.]

Parsing

• Given a string of terminals and a CFG, determine if the string can be generated by the CFG.
  – Also return a parse tree for the string
  – Also return all possible parse trees for the string

• Must search the space of derivations for one that derives the given string.
  – Top-Down Parsing: Start searching the space of derivations from the start symbol.
  – Bottom-up Parsing: Start searching the space of reverse derivations from the terminal symbols in the string.

Parsing Example

[Figure: parse tree for the input “book that flight”: S → VP; VP → Verb NP with Verb = book; NP → Det Nominal with Det = that; Nominal → Noun = flight.]

Top Down Parsing

[A sequence of slides animates a top-down parse of “book that flight”: the parser first tries S → NP VP with NP → Pronoun, NP → Proper-Noun, and NP → Det Nominal, each of which fails to match “book”; then S → Aux NP VP, which also fails; then S → VP, expanding VP → Verb (which matches “book” but fails to cover “that flight”) and finally VP → Verb NP, matching Verb = book, rejecting NP → Pronoun and NP → Proper-Noun for “that”, and succeeding with NP → Det Nominal, Det = that, Nominal → Noun = flight.]


Dynamic Programming Parsing

• To avoid extensive repeated work, must cache intermediate results, i.e. completed phrases.

• Caching (memoizing) critical to obtaining a polynomial time parsing (recognition) algorithm for CFGs.

• Dynamic programming algorithms based on both top-down and bottom-up search can achieve O(n3) recognition time where n is the length of the input string.


Dynamic Programming Parsing Methods

• The CKY (Cocke-Kasami-Younger) algorithm is based on bottom-up parsing and requires first normalizing the grammar (to Chomsky Normal Form).

• The Earley parser is based on top-down parsing and does not require normalizing the grammar, but is more complex.

• More generally, chart parsers retain completed phrases in a chart and can combine top-down and bottom-up search. A CKY recognizer sketch follows.
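A minimal CKY recognizer sketch in Python; the grammar representation and the tiny hand-converted CNF fragment of the ATIS grammar are assumptions for illustration. It assumes binary rules A → B C and lexical rules A → w.

```python
def cky_recognize(words, lexical, binary, start="S"):
    """Return True iff `words` is derivable from `start`.

    lexical: dict word -> set of non-terminals A with A -> word
    binary:  dict (B, C) -> set of non-terminals A with A -> B C
    Runs in O(n^3 * |grammar|) time using a triangular chart.
    """
    n = len(words)
    # chart[i][j] = set of non-terminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):                  # width of the span
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # split point
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= binary.get((B, C), set())
    return start in chart[0][n]

# Tiny CNF fragment (illustrative, hand-converted from the ATIS grammar above).
lexical = {"book": {"Verb", "VP", "S"}, "that": {"Det"}, "flight": {"Noun", "Nominal"}}
binary = {("Verb", "NP"): {"VP", "S"}, ("Det", "Nominal"): {"NP"}}
print(cky_recognize(["book", "that", "flight"], lexical, binary))   # True
```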


Parsing Conclusions

• Syntactic parse trees specify the syntactic structure of a sentence, which helps determine its meaning.
  – John ate the spaghetti with meatballs with chopsticks.
  – How did John eat the spaghetti? What did John eat?

• CFGs can be used to define the grammar of a natural language.

• Dynamic programming algorithms allow computing a single parse tree in cubic time or all parse trees in exponential time.

Syntactic Ambiguity

• Parsing just produces all possible parse trees.

• It does not address the important issue of ambiguity resolution.


Statistical Parsing

• Statistical parsing uses a probabilistic model of syntax in order to assign probabilities to each parse tree.

• Provides principled approach to resolving syntactic ambiguity.

• Allows supervised learning of parsers from tree-banks of parse trees provided by human linguists.

• Also allows unsupervised learning of parsers from unannotated text, but the accuracy of such parsers has been limited.


Probabilistic Context Free Grammar (PCFG)

• A PCFG is a probabilistic version of a CFG where each production has a probability.

• Probabilities of all productions rewriting a given non-terminal must add to 1, defining a distribution for each non-terminal.

• String generation is now probabilistic where production probabilities are used to non-deterministically select a production for rewriting a given non-terminal.

Simple PCFG for ATIS English

Grammar (Prob):
  S → NP VP               0.8
  S → Aux NP VP           0.1
  S → VP                  0.1   (sum = 1.0)
  NP → Pronoun            0.2
  NP → Proper-Noun        0.2
  NP → Det Nominal        0.6   (sum = 1.0)
  Nominal → Noun          0.3
  Nominal → Nominal Noun  0.2
  Nominal → Nominal PP    0.5   (sum = 1.0)
  VP → Verb               0.2
  VP → Verb NP            0.5
  VP → VP PP              0.3   (sum = 1.0)
  PP → Prep NP            1.0   (sum = 1.0)

Lexicon:
  Det → the 0.6 | a 0.2 | that 0.1 | this 0.1
  Noun → book 0.1 | flight 0.5 | meal 0.2 | money 0.2
  Verb → book 0.5 | include 0.2 | prefer 0.3
  Pronoun → I 0.5 | he 0.1 | she 0.1 | me 0.3
  Proper-Noun → Houston 0.8 | NWA 0.2
  Aux → does 1.0
  Prep → from 0.25 | to 0.25 | on 0.1 | near 0.2 | through 0.2


Sentence Probability

• Assume productions for each node are chosen independently.

• The probability of a derivation is the product of the probabilities of its productions.

[Figure: derivation D1 of “book the flight through Houston”, attaching the PP to the Nominal: S → VP (0.1), VP → Verb NP (0.5), Verb → book (0.5), NP → Det Nominal (0.6), Det → the (0.6), Nominal → Nominal PP (0.5), Nominal → Noun (0.3), Noun → flight (0.5), PP → Prep NP (1.0), Prep → through (0.2), NP → Proper-Noun (0.2), Proper-Noun → Houston (0.8).]

P(D1) = 0.1 x 0.5 x 0.5 x 0.6 x 0.6 x 0.5 x 0.3 x 1.0 x 0.2 x 0.2 x 0.5 x 0.8 = 0.0000216

Syntactic Disambiguation

• Resolve ambiguity by picking the most probable parse tree.

[Figure: derivation D2 of “book the flight through Houston”, attaching the PP to the VP: S → VP (0.1), VP → VP PP (0.3), VP → Verb NP (0.5), Verb → book (0.5), NP → Det Nominal (0.6), Det → the (0.6), Nominal → Noun (0.3), Noun → flight (0.5), PP → Prep NP (1.0), Prep → through (0.2), NP → Proper-Noun (0.2), Proper-Noun → Houston (0.8).]

P(D2) = 0.1 x 0.3 x 0.5 x 0.6 x 0.5 x 0.6 x 0.3 x 1.0 x 0.5 x 0.2 x 0.2 x 0.8 = 0.00001296

Sentence Probability

• The probability of a sentence is the sum of the probabilities of all of its derivations.

  P(“book the flight through Houston”) = P(D1) + P(D2) = 0.0000216 + 0.00001296 = 0.00003456
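A minimal Python sketch of these two computations; representing each derivation as a list of (LHS, RHS) productions is an assumption for illustration. The derivation probability is the product of its rule probabilities, and the sentence probability sums over derivations.

```python
import math

# Probabilities from the ATIS PCFG above, keyed by production (illustrative subset).
RULE_PROB = {
    ("S", "VP"): 0.1, ("VP", "Verb NP"): 0.5, ("VP", "VP PP"): 0.3,
    ("NP", "Det Nominal"): 0.6, ("NP", "Proper-Noun"): 0.2,
    ("Nominal", "Nominal PP"): 0.5, ("Nominal", "Noun"): 0.3, ("PP", "Prep NP"): 1.0,
    ("Verb", "book"): 0.5, ("Det", "the"): 0.6, ("Noun", "flight"): 0.5,
    ("Prep", "through"): 0.2, ("Proper-Noun", "Houston"): 0.8,
}

def derivation_prob(rules):
    """Product of the probabilities of the productions used in a derivation."""
    return math.prod(RULE_PROB[r] for r in rules)

D1 = [("S", "VP"), ("VP", "Verb NP"), ("Verb", "book"), ("NP", "Det Nominal"),
      ("Det", "the"), ("Nominal", "Nominal PP"), ("Nominal", "Noun"), ("Noun", "flight"),
      ("PP", "Prep NP"), ("Prep", "through"), ("NP", "Proper-Noun"), ("Proper-Noun", "Houston")]
D2 = [("S", "VP"), ("VP", "VP PP"), ("VP", "Verb NP"), ("Verb", "book"),
      ("NP", "Det Nominal"), ("Det", "the"), ("Nominal", "Noun"), ("Noun", "flight"),
      ("PP", "Prep NP"), ("Prep", "through"), ("NP", "Proper-Noun"), ("Proper-Noun", "Houston")]

print(derivation_prob(D1))                        # ~2.16e-05
print(derivation_prob(D1) + derivation_prob(D2))  # ~3.456e-05
```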


Three Useful PCFG Tasks

• Observation likelihood: To classify and order sentences.

• Most likely derivation: To determine the most likely parse tree for a sentence.

• Maximum likelihood training: To train a PCFG to fit empirical training data.

PCFG: Most Likely Derivation

• There is an analog to the Viterbi algorithm to efficiently determine the most probable derivation (parse tree) for a sentence.

[Figure: an English PCFG (S → NP VP 0.9, S → VP 0.1, NP → Det A N 0.5, NP → NP PP 0.3, NP → PropN 0.2, A → ε 0.6, A → Adj A 0.4, PP → Prep NP 1.0, VP → V NP 0.7, VP → VP PP 0.3) and the sentence “John liked the dog in the pen.” are given to a PCFG parser; the first slide shows the rejected parse that attaches “in the pen” to the VP, and the next slide shows the chosen parse that attaches it to the NP “the dog”.]


Probabilistic CKY

• CKY can be modified for PCFG parsing by including in each cell a probability for each non-terminal.


PCFG: Supervised Training

• If parse trees are provided for training sentences, a grammar and its parameters can all be estimated directly from counts accumulated from the tree-bank (with appropriate smoothing).

[Figure: a tree bank of parse trees (e.g. trees for “John put the dog in the pen”) is fed to supervised PCFG training, which produces the English PCFG shown above.]

Page 127: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Estimating Production Probabilities

• Set of production rules can be taken directly from the set of rewrites in the treebank.

• Parameters can be directly estimated from frequency counts in the treebank.

127

$$
P(\alpha \rightarrow \beta \mid \alpha)
= \frac{\mathrm{count}(\alpha \rightarrow \beta)}{\sum_{\gamma} \mathrm{count}(\alpha \rightarrow \gamma)}
= \frac{\mathrm{count}(\alpha \rightarrow \beta)}{\mathrm{count}(\alpha)}
$$
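A minimal counting sketch of this maximum-likelihood estimate, assuming each treebank tree is represented as a nested (label, children) tuple; the tree representation and function name are my assumptions, not part of the slides:

```python
from collections import defaultdict

def estimate_pcfg(trees):
    """MLE of production probabilities from a treebank.
    Each tree is a nested tuple (label, [child, child, ...]); leaves are plain strings.
    Returns a dict mapping (lhs, rhs_tuple) -> P(lhs -> rhs | lhs).
    """
    rule_counts = defaultdict(int)
    lhs_counts = defaultdict(int)

    def collect(node):
        if isinstance(node, str):            # terminal leaf: nothing to count
            return
        label, children = node
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(label, rhs)] += 1       # count(alpha -> beta)
        lhs_counts[label] += 1               # count(alpha)
        for c in children:
            collect(c)

    for t in trees:
        collect(t)

    # P(alpha -> beta | alpha) = count(alpha -> beta) / count(alpha)
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
```

Smoothing, as the previous slide notes, would then be applied on top of these raw relative frequencies.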

Page 128: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

128

PCFG: Maximum Likelihood Training

• Given a set of sentences, induce a grammar that maximizes the probability that this data was generated from this grammar.

• Assume the number of non-terminals in the grammar is specified.

• Only an unannotated set of sentences assumed to be generated from the model is needed; correct parse trees for these sentences are not required. In this sense, the training is unsupervised.

Page 129: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

129

PCFG: Maximum Likelihood Training

Training Sentences:

John ate the apple
A dog bit Mary
Mary hit the dog
John gave Mary the cat.
…

[Figure: the unannotated training sentences feeding a PCFG Training module, which outputs an English PCFG (the production rules with their probabilities).]

Page 130: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Inside-Outside

• The Inside-Outside algorithm is a version of EM for unsupervised learning of a PCFG.
  – Analogous to Baum-Welch (forward-backward) for HMMs

• Given the number of non-terminals, construct all possible CNF productions with these non-terminals and observed terminal symbols.

• Use EM to iteratively train the probabilities of these productions to locally maximize the likelihood of the data.
  – See the Manning and Schütze text for details

• Experimental results are not impressive, but recent work imposes additional constraints to improve unsupervised grammar learning.
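The core of the algorithm is the inside pass, which computes the probability that a non-terminal derives a given span (summing over analyses rather than maximizing, as Viterbi CKY does); a minimal sketch, reusing the illustrative CNF grammar format from the CKY sketch above (all names are assumptions):

```python
from collections import defaultdict

def inside_probs(words, lexical, binary):
    """Inside probabilities for a CNF PCFG.
    inside[i][j][A] = P(A =>* words[i..j]), summing over all analyses of the span.
    """
    n = len(words)
    inside = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        for A, p in lexical.get(w, []):
            inside[i][i][A] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for B, pB in inside[i][k].items():
                    for C, pC in inside[k + 1][j].items():
                        for A, pRule in binary.get((B, C), []):
                            inside[i][j][A] += pRule * pB * pC  # sum, not max
    return inside   # P(sentence) = inside[0][n-1][start_symbol]
```

EM then combines this with a matching outside pass to collect expected rule counts, re-estimates the production probabilities, and iterates until the likelihood of the training data stops improving.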

Page 131: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

131

Treebanks

• English Penn Treebank: the standard corpus for evaluating syntactic parsing; it consists of 1.2 M words of text from the Wall Street Journal (WSJ).

• It is typical to train on about 40,000 parsed sentences and to test on a standard disjoint test set of 2,416 sentences.

• Chinese Penn Treebank: 100K words from the Xinhua news service.

• Treebank corpora exist for many other languages; see the Wikipedia article “Treebank”.

Page 132: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

First WSJ Sentence

132

( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))
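If NLTK is available, a bracketed treebank string like this can be loaded and inspected directly; a small illustrative snippet (the treebank format's empty outer bracket is dropped, and the variable name is mine):

```python
from nltk.tree import Tree

# The first WSJ sentence above, minus the treebank's empty outer bracketing.
wsj_first = """(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))
  (VP (MD will)
    (VP (VB join) (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))"""

tree = Tree.fromstring(wsj_first)
print(tree.label())      # 'S'
print(tree.leaves())     # the words of the sentence
tree.pretty_print()      # draw the constituent structure as ASCII art
```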

Page 133: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

133

Parsing Evaluation Metrics

• PARSEVAL metrics measure the fraction of the constituents that match between the computed and human parse trees. If P is the system’s parse tree and T is the human parse tree (the “gold standard”):
  – Recall = (# correct constituents in P) / (# constituents in T)
  – Precision = (# correct constituents in P) / (# constituents in P)

• Labeled precision and labeled recall additionally require the non-terminal label on a constituent node to be correct for that constituent to count as correct.

• F1 is the harmonic mean of precision and recall.
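A minimal sketch of these metrics, assuming each tree has already been reduced to a collection of labeled spans (label, start, end); the span representation and function name are illustrative assumptions:

```python
from collections import Counter

def parseval(system_spans, gold_spans):
    """Labeled PARSEVAL precision, recall, and F1.
    Each argument is a list of (label, start, end) constituents; duplicate spans
    are handled by multiset intersection.
    """
    sys_counts, gold_counts = Counter(system_spans), Counter(gold_spans)
    correct = sum((sys_counts & gold_counts).values())
    precision = correct / sum(sys_counts.values()) if system_spans else 0.0
    recall = correct / sum(gold_counts.values()) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Applied to the example on the next slide (12 constituents in each tree, 10 of them matching), this gives precision = recall = F1 = 10/12 ≈ 83.3%.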

Page 134: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Computing Evaluation Metrics

[Figure: the correct (gold) tree T and the computed tree P for “book the flight through Houston”. In T the PP “through Houston” attaches inside the Nominal (“the flight through Houston”); in P it attaches to the VP.]

# Constituents in P: 12    # Constituents in T: 12    # Correct constituents: 10

Recall = 10/12 = 83.3%    Precision = 10/12 = 83.3%    F1 = 83.3%

Page 135: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

135

Treebank Results

• Results of current state-of-the-art systems on the English Penn WSJ treebank are slightly greater than 90% labeled precision and recall.

Page 136: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Statistical Parsing Conclusions

• Statistical models such as PCFGs allow for probabilistic resolution of ambiguities.

• PCFGs can be easily learned from treebanks.

• Current statistical parsers are quite accurate but not yet at the level of human-expert agreement.

136

Page 137: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Machine Translation

• Automatically translate one natural language into another.

137

Mary didn’t slap the green witch.

Maria no dió una bofetada a la bruja verde.

Page 138: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Statistical MT

• Manually encoding comprehensive bilingual lexicons and transfer rules is difficult.

• SMT acquires knowledge needed for translation from a parallel corpus or bitext that contains the same set of documents in two languages.

• The Canadian Hansards (parliamentary proceedings in French and English) is a well-known parallel corpus.

• The first step is to align the sentences in the corpus, using simple methods based on coarse cues such as sentence length, to produce bilingual sentence pairs (see the sketch below).

138
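A very simplified sketch of such length-based sentence alignment, in the spirit of Gale and Church's method; the allowed bead types, the penalty values, and the function name are illustrative assumptions, not the algorithm used in any particular SMT system:

```python
def align_sentences(src, tgt):
    """Align two lists of sentences by dynamic programming over character lengths."""
    def cost(ss, ts):
        # Crude score: mismatch in total character length, plus a fixed penalty
        # for deleting/inserting a sentence or merging two sentences into one.
        penalty = {(1, 1): 0, (1, 0): 450, (0, 1): 450, (2, 1): 230, (1, 2): 230}
        return abs(sum(map(len, ss)) - sum(map(len, ts))) + penalty[(len(ss), len(ts))]

    n, m = len(src), len(tgt)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]   # allowed alignment beads
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = best[i][j] + cost(src[i:ni], tgt[j:nj])
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Trace back the chosen beads (pairs of sentence groups).
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(beads))
```

The output is a list of bilingual sentence pairs (or pairs of small sentence groups), which is exactly the input the translation-model training described later needs.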

Page 139: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Picking a Good Translation

• A good translation should be faithful and correctly convey the information and tone of the original source sentence.

• A good translation should also be fluent: grammatically well structured and readable in the target language.

• Final objective:

139

$$
\hat{T}_{\text{best}} \;=\; \operatorname*{argmax}_{T \in \text{Target}} \; \text{faithfulness}(T, S) \times \text{fluency}(T)
$$

Page 140: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Noisy Channel Model

• Based on an analogy to the information-theoretic model used to decode messages transmitted through a communication channel that adds errors.

• Assume that the source sentence was generated by a “noisy” transformation of some target-language sentence, then use Bayesian analysis to recover the most likely target sentence that generated it.

140

Translate a foreign-language sentence F = f1, f2, …, fJ into an English sentence Ê = e1, e2, …, eI that maximizes P(E | F)

Page 141: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Bayesian Analysis of Noisy Channel

141

$$
\hat{E}
= \operatorname*{argmax}_{E \in \text{English}} P(E \mid F)
= \operatorname*{argmax}_{E \in \text{English}} \frac{P(F \mid E)\,P(E)}{P(F)}
= \operatorname*{argmax}_{E \in \text{English}} \underbrace{P(F \mid E)}_{\text{Translation Model}}\;\underbrace{P(E)}_{\text{Language Model}}
$$

A decoder determines the most probable translation Ê given F
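In code, the decoder's job reduces to scoring candidate English sentences by the product of the two models (in log space); a toy sketch, where `translation_logprob` and `language_logprob` stand in for whatever models are actually used — both names are assumptions, not real library calls, and a real decoder searches the candidate space rather than enumerating it:

```python
import math

def decode(foreign, candidates, translation_logprob, language_logprob):
    """Pick the candidate English sentence E maximizing log P(F|E) + log P(E)."""
    best, best_score = None, -math.inf
    for english in candidates:
        score = translation_logprob(foreign, english) + language_logprob(english)
        if score > best_score:
            best, best_score = english, score
    return best
```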

Page 142: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Language Model

• Use a standard n-gram language model for P(E).

• Can be trained on a large, unannotated monolingual corpus of the target language E.

• Could use a more sophisticated PCFG language model to capture long-distance dependencies.

• Terabytes of web data have been used to build a large 5-gram model of English.

142
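A minimal add-one-smoothed bigram version of such a language model (a toy sketch for illustration; as the last bullet notes, real systems use much higher-order models trained on far more data, with better smoothing):

```python
import math
from collections import defaultdict

class BigramLM:
    """Add-one-smoothed bigram language model over tokenized sentences."""

    def __init__(self, sentences):
        self.bigrams = defaultdict(int)
        self.unigrams = defaultdict(int)
        self.vocab = {"</s>"}
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.vocab.update(sent)
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.unigrams[prev] += 1

    def logprob(self, sent):
        tokens = ["<s>"] + sent + ["</s>"]
        V = len(self.vocab)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            # add-one smoothing so unseen bigrams get non-zero probability
            total += math.log((self.bigrams[(prev, cur)] + 1) /
                              (self.unigrams[prev] + V))
        return total

lm = BigramLM([["the", "dog", "barks"], ["the", "dog", "bites"]])
print(lm.logprob(["the", "dog", "barks"]))
```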

Page 143: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Word Alignment

• Shows mapping between words in one language and the other.

143

[Figure: word-alignment links between “Mary didn’t slap the green witch.” and “Maria no dió una bofetada a la bruja verde.” — e.g., “didn’t” ↔ “no”, “slap” ↔ “dió una bofetada”, “the” ↔ “la”, “green” ↔ “verde”, “witch” ↔ “bruja”.]

Page 144: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Translation Model using Word Alignment

• Can learn to align from supervised word alignments, but human-aligned bitexts are rare and expensive to construct.

• Typically use an unsupervised EM-based approach to compute a word alignment from unannotated parallel corpus.

144

Page 145: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

One to Many Alignment

• To simplify the problem, it is typically assumed that each word in F aligns to exactly one word in E (while each word in E may generate more than one word in F).

• Some words in F may be generated by the NULL element of E.

• Therefore, alignment can be specified by a vector A giving, for each word in F, the index of the word in E which generated it.

145

NULL(0)  Mary(1)  didn’t(2)  slap(3)  the(4)  green(5)  witch(6)

Maria  no  dió  una  bofetada  a  la  bruja  verde
A =  1   2   3    3      3     0   4    6      5

Page 146: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

IBM Model 1

• The first model proposed in the seminal paper by Brown et al. (1993), as part of CANDIDE, the first complete SMT system.

• Assumes the following simple generative model for producing F from E = e1, e2, …, eI:
  – Choose a length J for the F sentence: F = f1, f2, …, fJ
  – Choose a one-to-many alignment A = a1, a2, …, aJ
  – For each position j in F, generate the word fj from the aligned English word e_aj

146
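The EM-based training mentioned two slides earlier can be sketched for Model 1's word-translation table t(f | e) in a few lines; a toy implementation for illustration only (the data format, uniform initialization, and names are my assumptions):

```python
from collections import defaultdict

def train_ibm_model1(bitext, iterations=10):
    """EM for IBM Model 1 word-translation probabilities t(f|e).
    bitext: list of (foreign_tokens, english_tokens) sentence pairs;
    a NULL token is prepended to every English sentence.
    """
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))        # uniform start for t(f|e)

    for _ in range(iterations):
        count = defaultdict(float)                     # expected count(f, e)
        total = defaultdict(float)                     # expected count(e)
        for fs, es in bitext:
            es = ["NULL"] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in es)      # normalize over alignments of f
                for e in es:
                    frac = t[(f, e)] / norm            # expected alignment count (E-step)
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():                # re-normalize (M-step)
            t[(f, e)] = c / total[e]
    return t
```

Run on sentence pairs like the one above, the expected counts gradually concentrate probability on pairs such as t(bruja | witch) and t(verde | green).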

Page 147: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Sample IBM Model 1 Generation

147

NULL(0)  Mary(1)  didn’t(2)  slap(3)  the(4)  green(5)  witch(6)

Maria  no  dió  una  bofetada  a  la  bruja  verde
A =  1   2   3    3      3     0   4    6      5

Page 148: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

Machine Translation Conclusions

• Statistical MT methods can automatically learn a translation system from a parallel corpus.

• Typically use a noisy-channel model to exploit both a bilingual translation model and a monolingual language model.

• Automatic word alignment methods can learn a translation model from a parallel corpus.

148

Page 149: 1 Machine Learning Summer School Jan. 15 2015, UT Austin Application: Natural Language Processing Raymond J. Mooney University of Texas at Austin.

NLP Software

• Stanford CoreNLP
  – http://nlp.stanford.edu/software/corenlp.shtml

• OpenNLP
  – http://opennlp.apache.org/

• Berkeley NLP
  – http://nlp.cs.berkeley.edu/software.shtml

• Mallet (text categorization, HMMs, CRFs)
  – http://mallet.cs.umass.edu/

• MOSES (machine translation)
  – http://www.statmt.org/moses/

149