Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk


Description

Slides from my 2011 Association for Computational Linguistics paper and talk (joint work with Jason Baldridge and Katrin Erk). It presents Unsupervised Partial Parsing, a simple but very effective method for discovering grammatical phrases (noun phrases and the like).

Transcript of Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Page 1: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Elias Ponvert, Jason Baldridge, Katrin Erk

Department of Linguistics, The University of Texas at Austin

Association for Computational Linguistics, 19–24 June 2011

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 1 / 34

Page 2: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Why unsupervised parsing?

1. Less reliance on annotated training
   "Hello!"

2. Apply to new languages and domains
   "Særær manannær man mæþæn"

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 2 / 34

Page 3: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Assumptions made in parser learning

[Parse tree for "on Sunday , the brown bear sleeps" with category labels S, NP, VP, PP, P, Det, A, N, V]

Getting these labels right AS WELL AS the structure of the tree is hard

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 3 / 34

Page 4: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Assumptions made in parser learning

[The same sentence with only POS labels: on/P Sunday/N ,/, the/Det brown/A bear/N sleeps/V]

So the task is to identify the structure alone

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 3 / 34

Page 5: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Assumptions made in parser learning

Learning operates from gold-standard parts of speech (POS) rather than raw text:

on/P Sunday/N ,/, the/Det brown/A bear/N sleeps/V

From gold POS: Klein & Manning 2003 (CCM); Bod 2006a, 2006b; Klein & Manning 2005 (DMV); successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al. 2009, Spitkovsky et al. 2010a, 2010b, etc.

From raw text: J. Gao et al. 2003, 2004; Seginer 2007; this work

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 3 / 34

Page 6: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Unsupervised parsing: desiderata

Raw text

Standard NLP / extensible

Scalable and fast

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 4 / 34

Page 7: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

A new approach: start from the bottom

Unsupervised Partial Parsing = segmentation of (non-overlapping) multiword constituents

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 5 / 34

Page 8: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Unsupervised segmentation of constituents leaves some room for interpretation

Possible segmentations:
( the cat ) in ( the hat ) knows ( a lot ) about that

( the cat ) ( in the hat ) knows ( a lot ) ( about that )

( the cat in the hat ) knows ( a lot about that )

( the cat in the hat ) ( knows a lot about that )

( the cat in the hat ) ( knows a lot ) ( about that )

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 6 / 34

Page 9: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Defining UPP by evaluation
1. Constituent chunks: non-hierarchical multiword constituents

[Parse tree of "The Cat in the hat knows a lot about that" with the constituent chunks — the lowest multiword constituents — highlighted]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 7 / 34

Page 10: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Defining UPP by evaluation
2. Base NPs: non-recursive noun phrases

[The same parse tree with the base NPs — the non-recursive noun phrases — highlighted]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 7 / 34
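To make the two evaluation targets concrete, here is a small sketch (my illustration, not the authors' evaluation code) that reads both out of a treebank tree, using nltk.Tree and one natural reading of the definitions above:

from nltk import Tree

def constituent_chunks(tree):
    """Multiword constituents that contain no multiword sub-constituent
    (one reading of 'non-hierarchical multiword constituents')."""
    out = []
    for t in tree.subtrees():
        if len(t.leaves()) > 1 and all(
                len(s.leaves()) == 1 for s in t.subtrees() if s is not t):
            out.append(tuple(t.leaves()))
    return out

def base_nps(tree):
    """NP nodes that dominate no other NP node (non-recursive NPs)."""
    out = []
    for t in tree.subtrees(lambda s: s.label() == 'NP'):
        if not any(s.label() == 'NP' for s in t.subtrees() if s is not t):
            out.append(tuple(t.leaves()))
    return out

t = Tree.fromstring(
    "(S (NP (D The) (N Cat)) (PP (P in) (NP (D the) (N hat)))"
    " (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))")
print(constituent_chunks(t))  # [('The','Cat'), ('the','hat'), ('a','lot'), ('about','that')]
print(base_nps(t))            # [('The','Cat'), ('the','hat'), ('a','lot'), ('that',)]

The chunkers themselves only ever predict multiword spans, so single-word NPs like ('that',) can additionally be filtered out when building a gold standard.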

Page 11: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Multilingual data for direct evaluation

English: WSJ    German: Negra    Chinese: CTB

                               Sentences   Types   Tokens
WSJ    Penn Treebank           49K         44K     1M
Negra  Negra German Corpus     21K         49K     300K
CTB    Penn Chinese Treebank   19K         37K     430K

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 8 / 34

Page 12: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Constituent chunks and NPs in the data

WSJ     Chunks 203K   NPs 172K   Chunks ∩ NPs 161K
Negra   Chunks 59K    NPs 33K    Chunks ∩ NPs 23K
CTB     Chunks 92K    NPs 56K    Chunks ∩ NPs 43K

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 9 / 34

Page 13: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

The benchmark: CCL parser

[Diagram for "the cat saw the red dog run": the common cover links representation vs. the corresponding constituency tree]

Seginer (2007 ACL; 2007 PhD, UvA)

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 10 / 34

Page 14: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Hypothesis

Segmentation can be learned by generalizing on phrasal boundaries

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 11 / 34

Page 15: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP as a tagging problem

the/B cat/I in/O the/B hat/I

B: Beginning of a constituent
I: Inside a constituent
O: Not inside a constituent

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 12 / 34
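The B/I/O encoding turns unsupervised partial parsing into a tagging problem. As a small illustration (my sketch, not the authors' code), a tag sequence is read back off as a bracketing like this:

def tags_to_chunks(words, tags):
    """Convert a B/I/O tag sequence into (start, end) chunk spans:
    a chunk starts at a B and extends over the following I's."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag == 'B':
            if start is not None:
                chunks.append((start, i))
            start = i
        elif tag == 'O':
            if start is not None:
                chunks.append((start, i))
            start = None
    if start is not None:
        chunks.append((start, len(tags)))
    return [(s, e, words[s:e]) for s, e in chunks]

words = "the cat in the hat".split()
tags = ['B', 'I', 'O', 'B', 'I']
print(tags_to_chunks(words, tags))
# [(0, 2, ['the', 'cat']), (3, 5, ['the', 'hat'])]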

Page 16: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning from boundaries

#/STOP the/B cat/I in/O the/B hat/I #/STOP

The sentence-boundary markers (#) receive a dedicated STOP tag, so sentence edges act as known phrasal boundaries.

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 13 / 34

Page 17: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning from punctuation

#/STOP on/B sunday/I ,/STOP the/B brown/I bear/I sleeps/O #/STOP

Phrasal punctuation (here the comma) is treated like a sentence boundary and tagged STOP.

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 14 / 34
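Punctuation gives the learner free boundary information: sentence boundaries and phrasal punctuation are both mapped to fixed STOP positions before training. A hedged sketch of that preprocessing (the exact punctuation set is my assumption):

# Sketch of the preprocessing these slides describe: sentence boundaries and
# phrasal punctuation both become STOP positions; everything else gets B/I/O.
PHRASAL_PUNCT = {',', ';', ':', '--', '(', ')', '``', "''"}   # assumed set

def to_sequence(tokens):
    """Wrap a sentence in boundary markers and flag the STOP positions."""
    seq = [('#', True)]                      # '#' = sentence boundary marker
    for tok in tokens:
        seq.append((tok, tok in PHRASAL_PUNCT))
    seq.append(('#', True))
    return seq

print(to_sequence("on sunday , the brown bear sleeps".split()))
# [('#', True), ('on', False), ('sunday', False), (',', True), ('the', False),
#  ('brown', False), ('bear', False), ('sleeps', False), ('#', True)]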

Page 18: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP: Models

Hidden Markov Model
P( the/I | B ) ≈ P( I | B ) · P( the | I )

Probabilistic right linear grammar (PRLG)
P( the/I | B ) = P( I | B ) · P( the | B, I )

(Running example: the/B cat/I in/O the/B hat/I)

Learning: expectation maximization (EM) via forward-backward (run to convergence)

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 15 / 34
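The only difference between the two models is how a single (tag, word) step is scored: the HMM conditions the word on the current tag alone, while the PRLG also conditions it on the previous tag. A minimal sketch, with the parameter tables assumed to come from EM training:

from collections import defaultdict

# Assumed parameter tables (estimated with EM / forward-backward in the paper):
trans     = defaultdict(float)   # trans[(prev_tag, tag)]           = P(tag | prev_tag)
emit_hmm  = defaultdict(float)   # emit_hmm[(tag, word)]            = P(word | tag)
emit_prlg = defaultdict(float)   # emit_prlg[(prev_tag, tag, word)] = P(word | prev_tag, tag)

def step_hmm(prev_tag, tag, word):
    # HMM:  P(word, tag | prev_tag) is approximated as P(tag | prev_tag) * P(word | tag)
    return trans[(prev_tag, tag)] * emit_hmm[(tag, word)]

def step_prlg(prev_tag, tag, word):
    # PRLG: P(word, tag | prev_tag) = P(tag | prev_tag) * P(word | prev_tag, tag)
    return trans[(prev_tag, tag)] * emit_prlg[(prev_tag, tag, word)]

Conditioning the emission on the previous tag gives the PRLG more parameters; it is the stronger chunker in the experiments reported here.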

Page 19: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP: Models

[Same HMM and PRLG definitions as on the previous slide]

Decoding: Viterbi
Smoothing: additive smoothing on emissions

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 15 / 34
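Viterbi decoding and additive smoothing are both standard; the one knob is the additive constant on the emission distributions. A sketch of the smoothed emission estimate (the value of λ and the treatment of unseen words are my assumptions, not taken from the talk):

def smoothed_emission(pair_count, tag_count, vocab_size, lam=0.1):
    """Add-lambda estimate of P(word | tag):
    (count(tag, word) + lam) / (count(tag) + lam * vocab_size)."""
    return (pair_count + lam) / (tag_count + lam * vocab_size)

# e.g. a word seen 3 times with tag B, B seen 1000 times, 10,000-word vocabulary:
print(smoothed_emission(3, 1000, 10000))   # 3.1 / 2000 = 0.00155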

Page 20: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP: Constraints on sequences

the/B cat/I in/O the/B hat/I (with #/STOP at the sentence boundaries)

[Transition diagram over STOP, B, I and O: B must be followed by I with probability 1, so every chunk has at least two words, and I may only follow B or I]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 16 / 34
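These constraints can be enforced directly in the transition table before Viterbi decoding: disallowed tag-to-tag moves get probability zero, so the decoder can never propose, say, a one-word chunk. A sketch using my own encoding of the constraint set:

import numpy as np

TAGS = ['STOP', 'B', 'I', 'O']
IDX = {t: i for i, t in enumerate(TAGS)}

# Allowed transitions (my reading of the constraints on this slide):
#   - B must be followed by I, so every chunk has at least two words
#   - I may only follow B or I, so I's always continue a chunk
ALLOWED = {
    'STOP': {'STOP', 'B', 'O'},
    'B':    {'I'},
    'I':    {'B', 'I', 'O', 'STOP'},
    'O':    {'B', 'O', 'STOP'},
}

def constrain(trans):
    """Zero out disallowed transitions in a |TAGS| x |TAGS| matrix and renormalize rows."""
    mask = np.zeros_like(trans)
    for prev, nexts in ALLOWED.items():
        for nxt in nexts:
            mask[IDX[prev], IDX[nxt]] = 1.0
    t = trans * mask
    return t / t.sum(axis=1, keepdims=True)

# Starting from uniform transitions, B -> I gets probability 1, as on the slide.
print(constrain(np.full((4, 4), 0.25)))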

Page 21: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP evaluation: Setup

Evaluation by comparison to treebank data
Standard train / development / test splits
Precision and recall on matched constituents
Benchmark: CCL
Both systems get tokenization, punctuation and sentence boundaries

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 17 / 34
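For reference, the metric itself is just unlabeled precision and recall over matched constituent spans; a small sketch (not the released evaluation script):

def precision_recall_f1(gold_spans, pred_spans):
    """Unlabeled P/R/F1 over sets of (sentence_id, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    matched = len(gold & pred)
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {(0, 0, 2), (0, 3, 5), (1, 0, 2)}
pred = {(0, 0, 2), (0, 2, 5), (1, 0, 2)}
print(precision_recall_f1(gold, pred))   # roughly (0.667, 0.667, 0.667)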

Page 22: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP evaluation: Chunking (F-score)

[Bar chart: chunking F-score (0–80) on WSJ, Negra and CTB for CCL*, the HMM chunker and the PRLG chunker]

CCL*: non-hierarchical constituents from CCL's first-level parsing output

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 18 / 34

Page 23: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP evaluation: Base NPs (F-score)

[Bar chart: base-NP F-score (0–80) on WSJ, Negra and CTB for CCL*, the HMM chunker and the PRLG chunker]

CCL*: non-hierarchical constituents from CCL's first-level parsing output

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 19 / 34

Page 24: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

UPP: Review

Sequence models can generalize on indicators for phrasal boundaries
Leads to improved unsupervised segmentation

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 20 / 34

Page 25: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Question

Are we limited to segmentation?

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 21 / 34

Page 26: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Hypothesis

Identification of higher level constituents can also be learned by generalizing on phrasal boundaries

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 22 / 34

Page 27: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 1 Segment raw text

there is no asbestos in our products now

[The same sentence segmented into chunks by the level-1 chunker]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 28: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 2 Choose stand-ins for phrases

The level-1 chunks ( is no asbestos ) and ( our products ) are each replaced by a single stand-in word, giving the reduced sequence:

there is in our now

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 29: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 3 Segment text + phrasal stand-ins

there is in our now

[The reduced sequence, with the stand-ins in place, is segmented by the next chunker in the cascade]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 30: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4

[The chunk found at this level — "in" together with the stand-in for ( our products ) — is replaced by its own stand-in, and steps 3–4 repeat on the new reduced sequence:]

there is in now

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 31: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: 5 Unwind to output tree

[The cascade levels are unwound into a single tree over "there is no asbestos in our products now", with ( is no asbestos ), ( our products ) and ( in ( our products ) ) among its constituents]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 23 / 34

Page 32: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: Review

Separate models learned at each cascade level
Models share hyper-parameters (smoothing etc.)
Choice of pseudowords as phrasal stand-ins
Pseudoword identification: corpus frequency

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 24 / 34
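Putting the steps from the preceding slides together, here is a hedged sketch of the cascade driver. The per-level chunkers and the corpus-frequency table are assumed inputs; this is an illustration, not the released upparse code:

def pseudoword(chunk, corpus_freq):
    """Stand-in for a chunk: its most corpus-frequent word."""
    return max(chunk, key=lambda w: corpus_freq.get(w, 0))

def cascade(tokens, chunkers, corpus_freq):
    """Run a cascade of chunkers, one per level: chunk the current sequence,
    replace each chunk with a stand-in word, and repeat on the reduced
    sequence. Each chunker maps a token list to non-overlapping (start, end)
    spans. Returns the token sequence and spans seen at each level; unwinding
    them into a tree is the final step ('Unwind to output tree' above)."""
    history = []
    for chunk_level in chunkers:
        spans = chunk_level(tokens)
        if not spans:                       # nothing left to group: stop early
            break
        history.append((list(tokens), spans))
        starts = {s: e for s, e in spans}
        reduced, i = [], 0
        while i < len(tokens):
            if i in starts:                 # replace the whole chunk by its stand-in
                reduced.append(pseudoword(tokens[i:starts[i]], corpus_freq))
                i = starts[i]
            else:
                reduced.append(tokens[i])
                i += 1
        tokens = reduced
    return history

# With the running example, a level-1 chunker that finds ( is no asbestos ) and
# ( our products ) would return spans [(1, 4), (5, 7)] for
# "there is no asbestos in our products now".split(), and the reduced sequence
# passed to level 2 would be ['there', 'is', 'in', 'our', 'now'].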

Page 33: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Cascaded UPP: Evaluation

[Bar chart: all-constituent F-score (0–60) on WSJ, Negra and CTB for CCL, the cascaded HMM and the cascaded PRLG]

All-constituent F-score; cascade run to convergence

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 25 / 34

Page 34: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

die (the) csu (CSU) tut (does) das (this) in (in) bayern (Bavaria) doch (nevertheless) auch (also) sehr (very) erfolgreich (successfully)
'Nevertheless, the CSU does this in Bavaria very successfully as well.'

[Gold-standard tree vs. cascaded PRLG output (Negra), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 26 / 34

Page 35: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

bei (with) den (the) windsors (Windsors) bleibt (stays) alles (everything) in (in) der (the) familie (family)
'With the Windsors everything stays in the family.'

[Gold-standard tree vs. cascaded PRLG output (Negra), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 26 / 34

Page 36: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

immer (ever) mehr (more) anlagenteile (machine parts) uberaltern (over-age)
'(with) more and more machine parts over-age'

[Cascaded PRLG output (Negra), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 26 / 34

Page 37: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What we’ve learned

Unsupervised identification of base NPs and local constituents is possible
A cascade of chunking models for raw-text parsing has state-of-the-art results

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 27 / 34

Page 38: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Future directions

Improvements to the sequence models
Better phrasal stand-in (pseudoword) construction
Learning joint models rather than a cascade

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 28 / 34

Page 39: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What’s in the paper

Comparison to Klein & Manning's CCM
Discussion of phrasal punctuation
  - the chunkers still do well without punctuation
Analysis of chunking and parsing Chinese
Error analysis

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 29 / 34

Page 40: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Thanks!

Contact: [email protected]
elias.ponvert.net/upparse

This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by the Mike Hogg Endowment Fellowship, the Office of Graduate Studies at The University of Texas at Austin.

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 30 / 34

Page 41: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Appendices

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 31 / 34

Page 42: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

two share a house almost devoid of furniture

[Gold-standard tree vs. cascaded PRLG output (WSJ), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 32 / 34

Page 43: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

More example parses

what is one to think of all this

[Gold-standard tree vs. cascaded PRLG output (WSJ), with constituents marked correct or incorrect]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 32 / 34

Page 44: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning curves: Base NPs

[Learning curves, PRLG chunking model on WSJ: base-NP F-score vs. number of training sentences (10–40K) and vs. EM iteration]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 33 / 34

Page 45: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning curves: Base NPs

[Learning curves, PRLG chunking model on Negra: base-NP F-score vs. number of training sentences (5–15K) and vs. EM iteration]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 33 / 34

Page 46: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Learning curves: Base NPs

[Learning curves, PRLG chunking model on CTB: base-NP F-score vs. number of training sentences (5–15K) and vs. EM iteration]

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 33 / 34

Page 47: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What are the models learning?

B  P(w|B): the 21.0, a 8.7, to 6.5, 's 2.8, in 1.9, mr. 1.8, its 1.6, of 1.4, an 1.4, and 1.4

I  P(w|I): % 1.8, million 1.6, be 1.3, company 0.9, year 0.8, market 0.7, billion 0.6, share 0.5, new 0.5, than 0.5

O  P(w|O): of 5.8, and 4.0, in 3.7, that 2.2, to 2.1, for 2.0, is 2.0, it 1.7, said 1.7, on 1.5

HMM Emissions: WSJ

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 34 / 34
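Tables like these are easy to reproduce from a trained model by sorting each tag's emission distribution. A small sketch, assuming the emission table is keyed by (tag, word):

def top_emissions(emit, tag, n=10):
    """Top-n words under P(word | tag), given emit[(tag, word)] = probability."""
    rows = [(w, p) for (t, w), p in emit.items() if t == tag]
    return sorted(rows, key=lambda wp: wp[1], reverse=True)[:n]

# top_emissions(emit_hmm, 'B') on the WSJ model would reproduce the B column above.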

Page 48: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What are the models learning?

B  P(w|B): der (the) 13.0, die (the) 12.2, den (the) 4.4, und (and) 3.3, im (in) 3.2, das (the) 2.9, des (the) 2.7, dem (the) 2.4, eine (a) 2.1, ein (a) 2.0

I  P(w|I): uhr (o'clock) 0.8, juni (June) 0.6, jahren (years) 0.4, prozent (percent) 0.4, mark (the currency) 0.3, stadt (city) 0.3, 000 0.3, millionen (millions) 0.3, jahre (years) 0.3, frankfurter (Frankfurt) 0.3

O  P(w|O): in (in) 3.4, und (and) 2.7, mit (with) 1.7, fur (for) 1.6, auf (on) 1.5, zu (to) 1.4, von (of) 1.3, sich (reflexive pronoun) 1.3, ist (is) 1.3, nicht (not) 1.2

HMM Emissions: Negra

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 34 / 34

Page 49: Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

What are the models learning?

B  P(w|B): 的 (de, of) 14.3, 一 (one) 3.1, 和 (and) 1.1, 两 (two) 0.9, 这 (this) 0.8, 有 (have) 0.8, 经济 (economy) 0.7, 各 (each) 0.7, 全 (all) 0.7, 不 (no) 0.6

I  P(w|I): 的 (de) 3.9, 了 (perfective aspect) 2.2, 个 (ge, measure word) 1.5, 年 (year) 1.3, 说 (say) 1.0, 中 (middle) 0.9, 上 (on, above) 0.9, 人 (person) 0.7, 大 (big) 0.7, 国 (country) 0.6

O  P(w|O): 在 (at, in) 3.4, 是 (is) 2.4, 中国 (China) 1.4, 也 (also) 1.2, 不 (no) 1.2, 对 (pair) 1.1, 和 (and) 1.0, 的 (de) 1.0, 将 (future tense) 1.0, 有 (have) 1.0

HMM Emissions: CTB

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 34 / 34