Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk


Description

Slides from my 2011 Association for Computational Linguistics paper and talk (joint work with Jason Baldridge and Katrin Erk). It presents Unsupervised Partial Parsing, a simple but very effective method for discovering grammatical phrases (noun phrases and the like).

Transcript of Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Page 1: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Elias Ponvert, Jason Baldridge, Katrin Erk

Department of Linguistics, The University of Texas at Austin

Association for Computational Linguistics, 19–24 June 2011

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 1 / 34

Page 2: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Why unsupervised parsing?

1. Less reliance on annotated training
   "Hello!"

2. Apply to new languages and domains
   "Særær manannær man mæþæn"

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 2 / 34

Page 3: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Assumptions made in parser learning

[Parse tree for "on Sunday , the brown bear sleeps" with category labels S, NP, VP, PP, P, Det, A, N, V]

Getting these labels right AS WELL AS the structure of the tree is hard

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 3 / 34

Page 4: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Assumptions made in parser learning

[The same sentence with only POS labels: on/P Sunday/N ,/, the/Det brown/A bear/N sleeps/V]

So the task is to identify the structure alone

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 3 / 34

Page 5: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Assumptions made in parser learning

Learning operates from gold-standard parts of speech (POS) rather than raw text:

on/P Sunday/N ,/, the/Det brown/A bear/N sleeps/V

From gold POS: Klein & Manning 2003 (CCM); Bod 2006a, 2006b; Klein & Manning 2005 (DMV); successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al. 2009, Spitkovsky et al. 2010a, 2010b, etc.

From raw text: J. Gao et al. 2003, 2004; Seginer 2007; this work

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 3 / 34

Page 6: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Unsupervised parsing: desiderata

Raw text

Standard NLP / extensible

Scalable and fast

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 4 / 34

Page 7: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

A new approach: start from the bottom

Unsupervised Partial Parsing = segmentation of (non-overlapping) multiword constituents

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 5 / 34

Page 8: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Unsupervised segmentation of constituents leaves some room for interpretation

Possible segmentations:
( the cat ) in ( the hat ) knows ( a lot ) about that

( the cat ) ( in the hat ) knows ( a lot ) ( about that )

( the cat in the hat ) knows ( a lot about that )

( the cat in the hat ) ( knows a lot about that )

( the cat in the hat ) ( knows a lot ) ( about that )

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 6 / 34

Page 9: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Defining UPP by evaluation
1. Constituent chunks: non-hierarchical multiword constituents

[Parse tree of "The Cat in the hat knows a lot about that" with the constituent chunks — the lowest multiword constituents — highlighted]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 7 / 34

Page 10: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Defining UPP by evaluation
2. Base NPs: non-recursive noun phrases

[The same parse tree with the base NPs — the non-recursive noun phrases — highlighted]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 7 / 34
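To make the two evaluation targets concrete, here is a small sketch (my illustration, not the authors' evaluation code) that reads both out of a treebank tree, using nltk.Tree and one natural reading of the definitions above:

from nltk import Tree

def constituent_chunks(tree):
    """Multiword constituents that contain no multiword sub-constituent
    (one reading of 'non-hierarchical multiword constituents')."""
    out = []
    for t in tree.subtrees():
        if len(t.leaves()) > 1 and all(
                len(s.leaves()) == 1 for s in t.subtrees() if s is not t):
            out.append(tuple(t.leaves()))
    return out

def base_nps(tree):
    """NP nodes that dominate no other NP node (non-recursive NPs)."""
    out = []
    for t in tree.subtrees(lambda s: s.label() == 'NP'):
        if not any(s.label() == 'NP' for s in t.subtrees() if s is not t):
            out.append(tuple(t.leaves()))
    return out

t = Tree.fromstring(
    "(S (NP (D The) (N Cat)) (PP (P in) (NP (D the) (N hat)))"
    " (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))")
print(constituent_chunks(t))  # [('The','Cat'), ('the','hat'), ('a','lot'), ('about','that')]
print(base_nps(t))            # [('The','Cat'), ('the','hat'), ('a','lot'), ('that',)]

The chunkers themselves only ever predict multiword spans, so single-word NPs like ('that',) can additionally be filtered out when building a gold standard.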

Page 11: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Multilingual data for direct evaluation

English: WSJ    German: Negra    Chinese: CTB

                               Sentences   Types   Tokens
WSJ    Penn Treebank           49K         44K     1M
Negra  Negra German Corpus     21K         49K     300K
CTB    Penn Chinese Treebank   19K         37K     430K

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 8 / 34

Page 12: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Constituent chunks and NPs in the data

WSJ     Chunks 203K   NPs 172K   Chunks ∩ NPs 161K
Negra   Chunks 59K    NPs 33K    Chunks ∩ NPs 23K
CTB     Chunks 92K    NPs 56K    Chunks ∩ NPs 43K

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 9 / 34

Page 13: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

The benchmark: CCL parser

[Diagram for "the cat saw the red dog run": the common cover links representation vs. the corresponding constituency tree]

Seginer (2007 ACL; 2007 PhD, UvA)

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 10 / 34

Page 14: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Hypothesis

Segmentation can be learned by generalizing on phrasal boundaries

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 11 / 34

Page 15: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP as a tagging problem

the/B cat/I in/O the/B hat/I

B: Beginning of a constituent
I: Inside a constituent
O: Not inside a constituent

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 12 / 34
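The B/I/O encoding turns unsupervised partial parsing into a tagging problem. As a small illustration (my sketch, not the authors' code), a tag sequence is read back off as a bracketing like this:

def tags_to_chunks(words, tags):
    """Convert a B/I/O tag sequence into (start, end) chunk spans:
    a chunk starts at a B and extends over the following I's."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag == 'B':
            if start is not None:
                chunks.append((start, i))
            start = i
        elif tag == 'O':
            if start is not None:
                chunks.append((start, i))
            start = None
    if start is not None:
        chunks.append((start, len(tags)))
    return [(s, e, words[s:e]) for s, e in chunks]

words = "the cat in the hat".split()
tags = ['B', 'I', 'O', 'B', 'I']
print(tags_to_chunks(words, tags))
# [(0, 2, ['the', 'cat']), (3, 5, ['the', 'hat'])]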

Page 16: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning from boundaries

#/STOP the/B cat/I in/O the/B hat/I #/STOP

The sentence-boundary markers (#) receive a dedicated STOP tag, so sentence edges act as known phrasal boundaries.

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 13 / 34

Page 17: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning from punctuation

#/STOP on/B sunday/I ,/STOP the/B brown/I bear/I sleeps/O #/STOP

Phrasal punctuation (here the comma) is treated like a sentence boundary and tagged STOP.

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 14 / 34
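Punctuation gives the learner free boundary information: sentence boundaries and phrasal punctuation are both mapped to fixed STOP positions before training. A hedged sketch of that preprocessing (the exact punctuation set is my assumption):

# Sketch of the preprocessing these slides describe: sentence boundaries and
# phrasal punctuation both become STOP positions; everything else gets B/I/O.
PHRASAL_PUNCT = {',', ';', ':', '--', '(', ')', '``', "''"}   # assumed set

def to_sequence(tokens):
    """Wrap a sentence in boundary markers and flag the STOP positions."""
    seq = [('#', True)]                      # '#' = sentence boundary marker
    for tok in tokens:
        seq.append((tok, tok in PHRASAL_PUNCT))
    seq.append(('#', True))
    return seq

print(to_sequence("on sunday , the brown bear sleeps".split()))
# [('#', True), ('on', False), ('sunday', False), (',', True), ('the', False),
#  ('brown', False), ('bear', False), ('sleeps', False), ('#', True)]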

Page 18: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP: Models

Hidden Markov Model
P( the/I | B ) ≈ P( I | B ) · P( the | I )

Probabilistic right linear grammar (PRLG)
P( the/I | B ) = P( I | B ) · P( the | B, I )

(Running example: the/B cat/I in/O the/B hat/I)

Learning: expectation maximization (EM) via forward-backward (run to convergence)

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 15 / 34
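The only difference between the two models is how a single (tag, word) step is scored: the HMM conditions the word on the current tag alone, while the PRLG also conditions it on the previous tag. A minimal sketch, with the parameter tables assumed to come from EM training:

from collections import defaultdict

# Assumed parameter tables (estimated with EM / forward-backward in the paper):
trans     = defaultdict(float)   # trans[(prev_tag, tag)]           = P(tag | prev_tag)
emit_hmm  = defaultdict(float)   # emit_hmm[(tag, word)]            = P(word | tag)
emit_prlg = defaultdict(float)   # emit_prlg[(prev_tag, tag, word)] = P(word | prev_tag, tag)

def step_hmm(prev_tag, tag, word):
    # HMM:  P(word, tag | prev_tag) is approximated as P(tag | prev_tag) * P(word | tag)
    return trans[(prev_tag, tag)] * emit_hmm[(tag, word)]

def step_prlg(prev_tag, tag, word):
    # PRLG: P(word, tag | prev_tag) = P(tag | prev_tag) * P(word | prev_tag, tag)
    return trans[(prev_tag, tag)] * emit_prlg[(prev_tag, tag, word)]

Conditioning the emission on the previous tag gives the PRLG more parameters; it is the stronger chunker in the experiments reported here.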

Page 19: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP: Models

[Same HMM and PRLG definitions as on the previous slide]

Decoding: Viterbi
Smoothing: additive smoothing on emissions

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 15 / 34
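Viterbi decoding and additive smoothing are both standard; the one knob is the additive constant on the emission distributions. A sketch of the smoothed emission estimate (the value of λ and the treatment of unseen words are my assumptions, not taken from the talk):

def smoothed_emission(pair_count, tag_count, vocab_size, lam=0.1):
    """Add-lambda estimate of P(word | tag):
    (count(tag, word) + lam) / (count(tag) + lam * vocab_size)."""
    return (pair_count + lam) / (tag_count + lam * vocab_size)

# e.g. a word seen 3 times with tag B, B seen 1000 times, 10,000-word vocabulary:
print(smoothed_emission(3, 1000, 10000))   # 3.1 / 2000 = 0.00155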

Page 20: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP: Constraints on sequences

the/B cat/I in/O the/B hat/I (with #/STOP at the sentence boundaries)

[Transition diagram over STOP, B, I and O: B must be followed by I with probability 1, so every chunk has at least two words, and I may only follow B or I]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 16 / 34
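These constraints can be enforced directly in the transition table before Viterbi decoding: disallowed tag-to-tag moves get probability zero, so the decoder can never propose, say, a one-word chunk. A sketch using my own encoding of the constraint set:

import numpy as np

TAGS = ['STOP', 'B', 'I', 'O']
IDX = {t: i for i, t in enumerate(TAGS)}

# Allowed transitions (my reading of the constraints on this slide):
#   - B must be followed by I, so every chunk has at least two words
#   - I may only follow B or I, so I's always continue a chunk
ALLOWED = {
    'STOP': {'STOP', 'B', 'O'},
    'B':    {'I'},
    'I':    {'B', 'I', 'O', 'STOP'},
    'O':    {'B', 'O', 'STOP'},
}

def constrain(trans):
    """Zero out disallowed transitions in a |TAGS| x |TAGS| matrix and renormalize rows."""
    mask = np.zeros_like(trans)
    for prev, nexts in ALLOWED.items():
        for nxt in nexts:
            mask[IDX[prev], IDX[nxt]] = 1.0
    t = trans * mask
    return t / t.sum(axis=1, keepdims=True)

# Starting from uniform transitions, B -> I gets probability 1, as on the slide.
print(constrain(np.full((4, 4), 0.25)))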

Page 21: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP evaluation: Setup

Evaluation by comparison to treebank data
Standard train / development / test splits
Precision and recall on matched constituents
Benchmark: CCL
Both systems get tokenization, punctuation and sentence boundaries

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 17 / 34
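For reference, the metric itself is just unlabeled precision and recall over matched constituent spans; a small sketch (not the released evaluation script):

def precision_recall_f1(gold_spans, pred_spans):
    """Unlabeled P/R/F1 over sets of (sentence_id, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    matched = len(gold & pred)
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {(0, 0, 2), (0, 3, 5), (1, 0, 2)}
pred = {(0, 0, 2), (0, 2, 5), (1, 0, 2)}
print(precision_recall_f1(gold, pred))   # roughly (0.667, 0.667, 0.667)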

Page 22: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP evaluation: Chunking (F-score)

[Bar chart: chunking F-score (0–80) on WSJ, Negra and CTB for CCL*, the HMM chunker and the PRLG chunker]

CCL*: non-hierarchical constituents from CCL's first-level parsing output

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 18 / 34

Page 23: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP evaluation: Base NPs (F-score)

[Bar chart: base-NP F-score (0–80) on WSJ, Negra and CTB for CCL*, the HMM chunker and the PRLG chunker]

CCL*: non-hierarchical constituents from CCL's first-level parsing output

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 19 / 34

Page 24: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP: Review

Sequence models can generalize on indicators for phrasal boundaries
Leads to improved unsupervised segmentation

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 20 / 34

Page 25: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Question

Are we limited to segmentation?

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 21 / 34

Page 26: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Hypothesis

Identification of higher level constituents can also be learned by generalizing on phrasal boundaries

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 22 / 34

Page 27: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 1 Segment raw text

there is no asbestos in our products now

[The same sentence segmented into chunks by the level-1 chunker]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 28: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 2 Choose stand-ins for phrases

The level-1 chunks ( is no asbestos ) and ( our products ) are each replaced by a single stand-in word, giving the reduced sequence:

there is in our now

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 29: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 3 Segment text + phrasal stand-ins

there is in our now

[The reduced sequence, with the stand-ins in place, is segmented by the next chunker in the cascade]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 30: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4

[The chunk found at this level — "in" together with the stand-in for ( our products ) — is replaced by its own stand-in, and steps 3–4 repeat on the new reduced sequence:]

there is in now

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 31: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 5 Unwind to output tree

[The cascade levels are unwound into a single tree over "there is no asbestos in our products now", with ( is no asbestos ), ( our products ) and ( in ( our products ) ) among its constituents]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 32: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: Review

Separate models learned at each cascade level
Models share hyper-parameters (smoothing etc.)
Choice of pseudowords as phrasal stand-ins
Pseudoword identification: corpus frequency

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 24 / 34
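Putting the steps from the preceding slides together, here is a hedged sketch of the cascade driver. The per-level chunkers and the corpus-frequency table are assumed inputs; this is an illustration, not the released upparse code:

def pseudoword(chunk, corpus_freq):
    """Stand-in for a chunk: its most corpus-frequent word."""
    return max(chunk, key=lambda w: corpus_freq.get(w, 0))

def cascade(tokens, chunkers, corpus_freq):
    """Run a cascade of chunkers, one per level: chunk the current sequence,
    replace each chunk with a stand-in word, and repeat on the reduced
    sequence. Each chunker maps a token list to non-overlapping (start, end)
    spans. Returns the token sequence and spans seen at each level; unwinding
    them into a tree is the final step ('Unwind to output tree' above)."""
    history = []
    for chunk_level in chunkers:
        spans = chunk_level(tokens)
        if not spans:                       # nothing left to group: stop early
            break
        history.append((list(tokens), spans))
        starts = {s: e for s, e in spans}
        reduced, i = [], 0
        while i < len(tokens):
            if i in starts:                 # replace the whole chunk by its stand-in
                reduced.append(pseudoword(tokens[i:starts[i]], corpus_freq))
                i = starts[i]
            else:
                reduced.append(tokens[i])
                i += 1
        tokens = reduced
    return history

# With the running example, a level-1 chunker that finds ( is no asbestos ) and
# ( our products ) would return spans [(1, 4), (5, 7)] for
# "there is no asbestos in our products now".split(), and the reduced sequence
# passed to level 2 would be ['there', 'is', 'in', 'our', 'now'].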

Page 33: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: Evaluation

[Bar chart: all-constituent F-score (0–60) on WSJ, Negra and CTB for CCL, the cascaded HMM and the cascaded PRLG]

All-constituent F-score; cascade run to convergence

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 25 / 34

Page 34: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

die (the) csu (CSU) tut (does) das (this) in (in) bayern (Bavaria) doch (nevertheless) auch (also) sehr (very) erfolgreich (successfully)
'Nevertheless, the CSU does this in Bavaria very successfully as well.'

[Gold-standard tree vs. cascaded PRLG output (Negra), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 26 / 34

Page 35: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

bei (with) den (the) windsors (Windsors) bleibt (stays) alles (everything) in (in) der (the) familie (family)
'With the Windsors everything stays in the family.'

[Gold-standard tree vs. cascaded PRLG output (Negra), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 26 / 34

Page 36: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

immer (ever) mehr (more) anlagenteile (machine parts) uberaltern (over-age)
'(with) more and more machine parts over-age'

[Cascaded PRLG output (Negra), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 26 / 34

Page 37: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What we’ve learned

Unsupervised identification of base NPs and local constituents is possible
A cascade of chunking models for raw-text parsing has state-of-the-art results

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 27 / 34

Page 38: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Future directions

Improvements to the sequence models
Better phrasal stand-in (pseudoword) construction
Learning joint models rather than a cascade

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 28 / 34

Page 39: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What’s in the paper

Comparison to Klein & Manning's CCM
Discussion of phrasal punctuation
  - the chunkers still do well without punctuation
Analysis of chunking and parsing Chinese
Error analysis

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 29 / 34

Page 40: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Thanks!

Contact: [email protected]
elias.ponvert.net/upparse

This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by the Mike Hogg Endowment Fellowship, the Office of Graduate Studies at The University of Texas at Austin.

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 30 / 34

Page 41: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Appendices

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 31 / 34

Page 42: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

two share a house almost devoid of furniture

[Gold-standard tree vs. cascaded PRLG output (WSJ), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 32 / 34

Page 43: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

what is one to think of all this

[Gold-standard tree vs. cascaded PRLG output (WSJ), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 32 / 34

Page 44: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning curves: Base NPs

[Learning curves, PRLG chunking model on WSJ: base-NP F-score vs. number of training sentences (10–40K) and vs. EM iteration]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 33 / 34

Page 45: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning curves: Base NPs

[Learning curves, PRLG chunking model on Negra: base-NP F-score vs. number of training sentences (5–15K) and vs. EM iteration]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 33 / 34

Page 46: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning curves: Base NPs

[Learning curves, PRLG chunking model on CTB: base-NP F-score vs. number of training sentences (5–15K) and vs. EM iteration]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 33 / 34

Page 47: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What are the models learning?

B  P(w|B): the 21.0, a 8.7, to 6.5, 's 2.8, in 1.9, mr. 1.8, its 1.6, of 1.4, an 1.4, and 1.4

I  P(w|I): % 1.8, million 1.6, be 1.3, company 0.9, year 0.8, market 0.7, billion 0.6, share 0.5, new 0.5, than 0.5

O  P(w|O): of 5.8, and 4.0, in 3.7, that 2.2, to 2.1, for 2.0, is 2.0, it 1.7, said 1.7, on 1.5

HMM Emissions: WSJ

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 34 / 34
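Tables like these are easy to reproduce from a trained model by sorting each tag's emission distribution. A small sketch, assuming the emission table is keyed by (tag, word):

def top_emissions(emit, tag, n=10):
    """Top-n words under P(word | tag), given emit[(tag, word)] = probability."""
    rows = [(w, p) for (t, w), p in emit.items() if t == tag]
    return sorted(rows, key=lambda wp: wp[1], reverse=True)[:n]

# top_emissions(emit_hmm, 'B') on the WSJ model would reproduce the B column above.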

Page 48: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What are the models learning?

B  P(w|B): der (the) 13.0, die (the) 12.2, den (the) 4.4, und (and) 3.3, im (in) 3.2, das (the) 2.9, des (the) 2.7, dem (the) 2.4, eine (a) 2.1, ein (a) 2.0

I  P(w|I): uhr (o'clock) 0.8, juni (June) 0.6, jahren (years) 0.4, prozent (percent) 0.4, mark (the currency) 0.3, stadt (city) 0.3, 000 0.3, millionen (millions) 0.3, jahre (years) 0.3, frankfurter (Frankfurt) 0.3

O  P(w|O): in (in) 3.4, und (and) 2.7, mit (with) 1.7, fur (for) 1.6, auf (on) 1.5, zu (to) 1.4, von (of) 1.3, sich (reflexive pronoun) 1.3, ist (is) 1.3, nicht (not) 1.2

HMM Emissions: Negra

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 34 / 34

Page 49: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What are the models learning?

B  P(w|B): 的 (de, of) 14.3, 一 (one) 3.1, 和 (and) 1.1, 两 (two) 0.9, 这 (this) 0.8, 有 (have) 0.8, 经济 (economy) 0.7, 各 (each) 0.7, 全 (all) 0.7, 不 (no) 0.6

I  P(w|I): 的 (de) 3.9, 了 (perfective aspect) 2.2, 个 (ge, measure word) 1.5, 年 (year) 1.3, 说 (say) 1.0, 中 (middle) 0.9, 上 (on, above) 0.9, 人 (person) 0.7, 大 (big) 0.7, 国 (country) 0.6

O  P(w|O): 在 (at, in) 3.4, 是 (is) 2.4, 中国 (China) 1.4, 也 (also) 1.2, 不 (no) 1.2, 对 (pair) 1.1, 和 (and) 1.0, 的 (de) 1.0, 将 (future tense) 1.0, 有 (have) 1.0

HMM Emissions: CTB

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 34 / 34