Treebanks and Parsing - Computational...

30
Treebanks and Parsing Treebanks Extracting a Grammar Ambiguity Parsing and Treebanks Treebanks and Parsing L645 Dept. of Linguistics, Indiana University Fall 2009 1 / 23

Transcript of Treebanks and Parsing - Computational...

Page 1: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanksTreebanks and Parsing

L645

Dept. of Linguistics, Indiana UniversityFall 2009

1 / 23

Page 2: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Def. Treebank

A treebank is a syntactically annotated corpus.

Issues:

◮ complete analysis vs. partial analysis◮ theory-neutral vs. theory-dependent◮ spoken vs. written language◮ constituents vs. dependency◮ annotate grammatical functions?◮ manual vs. automatic annotation

2 / 23

Page 3: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Some Remarks

◮ treebanking is extremely labor-intensive (i.e. costly)

◮ good planning is therefore necessary

◮ good tools are crucial◮ they speed up the process◮ they help with consistency

◮ a detailed stylebook is essential

3 / 23

Page 4: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Penn WSJ Treebank – Example

( (S (NP-SBJ (NP Pierre Vinken),(ADJP (NP 61 years)

old),)

(VP will(VP join

(NP the board)(PP-CLR as

(NP a nonexecutive director))(NP-TMP Nov. 29)))

.))

5 / 23

Page 5: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Treebanks for English

◮ Penn Treebank◮ BLLIP Treebank◮ The Penn-Helsinki Parsed Corpus of Middle English◮ Susanne Corpus and Christine Project◮ International Corpus of English ICE◮ Lancaster Treebank◮ The Redwoods HPSG Treebank

6 / 23

Page 6: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Treebanks Projects

◮ Basque◮ Eus3LB project

◮ Bulgarian◮ HPSG-based Syntactic Treebank of Bulgarian

(BulTreeBank)◮ Catalan

◮ CAT3LB project◮ Chinese

◮ The Chinese Treebank Project◮ Czech

◮ Prague Dependency Treebank

7 / 23

Page 7: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Treebanks Projects (2)

◮ Danish◮ Danish Dependency Treebank

◮ Dutch◮ The Alpino Treebank

◮ French◮ Project TALANA

◮ German◮ NeGra Project - NeGra Corpus◮ Project TIGER◮ Verbmobil Treebank of Spoken German (TuBa-D/S)◮ The Tubingen Treebank of Written German

(TuBa-D/Z)

8 / 23

Page 8: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Treebanks Projects (3)

◮ Italian◮ Turin University Treebank TUT◮ Italian Syntactic-Semantic Treebank

◮ Portuguese◮ The Floresta Sinta(c)tica project

◮ Slovene◮ Slovene Dependency Treebank

◮ Swedish◮ Swedish Treebank

◮ Turkish◮ METU treebank

9 / 23

Page 9: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

The Annotation Scheme

◮ Should the annotation scheme be dependent on aparticular theory?

◮ Theory-neutrality doesn’t really exist. Everyannotation scheme is at least implicitlytheory-dependent.

◮ Grounding an annotation scheme in a linguistictheory tends to improve consistency of annotations.

10 / 23

Page 10: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Theory-dependent Treebanks

◮ Prague Dependeny Treebank◮ based on Dependency Grammar

◮ The Redwoods HPSG Treebank◮ based on Head-Driven Phrase Structure Grammar

◮ CCGbank◮ translation of the Penn Treebank into a corpus of

Combinatory Categorial Grammar derivations

11 / 23

Page 11: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Theory-neutral Treebanks

Theory-neutral treebanks do not adhere to any particularlinguistic theory

◮ encode those grammatical properties that aredistinguished by many, if not all grammaticalframeworks

Advantage:◮ more widely usable◮ less dependent on whatever version of a particular

grammatical theory may have existed at the timewhen the treebank annotation scheme wasdetermined

◮ examples: Penn Treebank, Negra treebank,Tubingen treebanks

12 / 23

Page 12: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Extracting the CFG Grammar

0 1 2 3 4 5 6 7

500 501 502 503 504 505

506 507 508 509

510

511

wie

PWAV

sieht

VVFIN

das

PDS

bei

APPR

Ihnen

PPER

am

APPRART

Donnerstag

NN

aus

PTKVZ

HD HD HD HD HD VPT

ADJX

PRED

VXFIN

HD −

NX

HD −

NX

HD

NX

ON

PX

FOPP

PX

V−MOD

VF

LK

MF

VC

SIMPX

How does Thursday look for you?

13 / 23

Page 13: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Extracting the CFG Grammar

0 1 2 3 4 5 6 7

500 501 502 503 504 505

506 507 508 509

510

511

wie

PWAV

sieht

VVFIN

das

PDS

bei

APPR

Ihnen

PPER

am

APPRART

Donnerstag

NN

aus

PTKVZ

HD HD HD HD HD VPT

ADJX

PRED

VXFIN

HD −

NX

HD −

NX

HD

NX

ON

PX

FOPP

PX

V−MOD

VF

LK

MF

VC

SIMPX

rules:

SIMPX → VF LK MF VCVF → ADJXLK → VXFINMF → NX PX PXVC → PTKVZ

13 / 23

Page 14: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Extracting the CFG Grammar

0 1 2 3 4 5 6 7

500 501 502 503 504 505

506 507 508 509

510

511

wie

PWAV

sieht

VVFIN

das

PDS

bei

APPR

Ihnen

PPER

am

APPRART

Donnerstag

NN

aus

PTKVZ

HD HD HD HD HD VPT

ADJX

PRED

VXFIN

HD −

NX

HD −

NX

HD

NX

ON

PX

FOPP

PX

V−MOD

VF

LK

MF

VC

SIMPX

or:

SIMPX → VF-NH LK-NH MF-NH VC-NHVF-NH → ADJX-PREDLK-NH → VXFIN-HDMF-NH → NX-ON PX-FOPP PX-VMODVC-NH → PTKVZ-VPT

13 / 23

Page 15: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Ambiguity – A Serious Problem

example sentence:

Volker Tegeler, stellvertretenderGeschaftsf uhrer des Landesverbandes, sagt:

(Volker Tegeler, vice CEO of the national association,says ...)

How many different analyses?◮ training corpus: 15 000 sentences, grammar: ca.

6 000 rules

14 / 23

Page 16: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Ambiguity – A Serious Problem

example sentence:

Volker Tegeler, stellvertretenderGeschaftsf uhrer des Landesverbandes, sagt:

(Volker Tegeler, vice CEO of the national association,says ...)

How many different analyses?◮ training corpus: 15 000 sentences, grammar: ca.

6 000 rules

Answer: don’t know◮ but 58(!) root categories (if complex information is

used in the nodes)

14 / 23

Page 17: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Ambiguity – Sample Analyses

the most probable parse:

R-SIMPX-ON

NX-ON

ENX-ON

NE-n

Volker

NE-n

Tegeler

, NX-ON

NX-ON

ADJX-n

ADJA-n

stellv.

NN

Geschaftsf.

NX-g

ART-g

des

NN-g

Landesv.

, R-SIMPX

VC

VXFIN

VVFIN

sagt

15 / 23

Page 18: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Ambiguity – Sample Analyses

the almost correct parse:

SIMPX-ON

NX-ON

ENX-ON

NE-n

Volker

NE-n

Tegeler

, NX-ON

NX-ON

ADJX-n

ADJA-n

stellv.

NN

Geschaftsf.

NX-g

ART-g

des

NN-g

Landesv.

, SIMPX

VC-FIN

VVFIN

sagt

15 / 23

Page 19: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Ambiguity – Sample Analyses

ADVX

ADVX

ADVX

NX-a

NE

Volker

ADV

Tegeler

, ADVX

ADJD

stellv.

SIMPX-C-OA

NX-OA

NX-OA

NN-a

Geschaftsf.

NX-d

NE-d

des

SIMPX-C

SIMPX

NX-d

NN-d

Landesv.

, SIMPX-C

R-SIMPX

VC

VXFIN

VVFIN

sagt

an “interesting” parse15 / 23

Page 20: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Parsing and Treebanks – How do theyRelate?

◮ Progress in parsing has been guided byimprovements on the Penn Treebank

◮ How much of this is dependent on the annotationscheme?

◮ German: has two treebanks→ ideal for comparinginfluence of annotation scheme on parsing

16 / 23

Page 21: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Negra and TuBa-D/Z

◮ both are based on newspaper texts: FrankfurterRundschau, taz

◮ both use same POS tagset: STTS◮ both annotate constituent structure and

function-argument structure◮ Negra: 20 000 sentences; TuBa-D/Z: 15 000

sentences (version 1), 27 000 sentences (version 3)

◮ different annotation decisions

17 / 23

Page 22: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

A Tree from Negra

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

500 501 502

503

504

505

506

507

Das

PDS

kann

VMFIN

die

ART

Maske

NN

sein

VAINF

,

$,

die

PRELS

ihm

PPER

sein

PPOSAT

Charakter

NN

wie

KOKOM

ein

ART

Stempel

NN

ins

APPRART

Gesicht

NN

gedrückt

VVPP

hat

VAFIN

;

$.

NK NK CM NK NK AC NK

OA DA

NP

MO

PP

MO HD

NP

SB

VP

OC HD

NK NK

S

RC

HD

NP

PD

SB HD

VP

OC

S

That could be the mask that his character

pressed like a stamp on his face.

18 / 23

Page 23: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

A Tree from TuBa-D/Z

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

500 501 502 503 504 505 506 507 508

509 510 511 512 513 514 515

516 517

518 519

520

521

Der

ART

Autokonvoi

NN

mit

APPR

den

ART

Probenbesuchern

NN

fährt

VVFIN

eine

ART

Straße

NN

entlang

PTKVZ

,

$,

die

PRELS

noch

ADV

heute

ADV

Lagerstraße

NN

heißt

VVFIN

.

$.

− HD − HD HD − HD VPT HD HD HD HD

NCX

HD

VXFIN

HD

NCX

OA

NCX

ON

ADVX

− HD

NCX

VXFIN

HD

NCX

HD

PX

ADVX

V−MOD

EN−ADD

PRED

NX

ON

C

MF

VC

R−SIMPX

OA−MOD

VF

LK

MF

VC

NF

SIMPX

The convoy of the rehearsal visitors’ cars goes

along a street that is still called Lagerstraße.

19 / 23

Page 24: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Main Differences in Annotation

◮ phrase structure◮ Negra: extremely flat◮ TuBa-D/Z: premodification flat, postmodification high

◮ clause structure◮ Negra: VP⇒ crossing branches

no unary nodes◮ TuBa-D/Z: topological fields

◮ long-distance relationships◮ Negra: crossing branches◮ TuBa-D/Z: pure tree structure + special labels

20 / 23

Page 25: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Crossing Branches

0 1 2 3 4 5 6 7 8 9 10 11

500 501 502

503

504

Diese

PDAT

Metapher

NN

kann

VMFIN

die

ART

Freizeitmalerin

NN

durchaus

ADV

auch

ADV

auf

APPR

ihr

PPOSAT

Leben

NN

anwenden

VVINF

.

$.

NK NK NK NK MO AC NK NK

NP

OA

PP

MO HD

HD

NP

SB MO

VP

OC

S

The amateur painter can by all means apply

this metaphor also to her life.

21 / 23

Page 26: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Resolved Structure

0 1 2 3 4 5 6 7 8 9 10 11

500 501 502

503

504

Diese

PDAT

Metapher

NN

kann

VMFIN

die

ART

Freizeitmalerin

NN

durchaus

ADV

auch

ADV

auf

APPR

ihr

PPOSAT

Leben

NN

anwenden

VVINF

.

$.

NK NK NK NK MO AC NK NK

PP

MO HD

NP

OA HD

NP

SB MO

VP

OC

S

22 / 23

Page 27: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Resolved Structure

0 1 2 3 4 5 6 7 8 9 10 11

500 501 502

503

504

Diese

PDAT

Metapher

NN

kann

VMFIN

die

ART

Freizeitmalerin

NN

durchaus

ADV

auch

ADV

auf

APPR

ihr

PPOSAT

Leben

NN

anwenden

VVINF

.

$.

NK NK NK NK MO AC NK NK

PP

MO HD

NP

OA HD

NP

SB MO

VP

OC

S

S→ NP VMFIN NP ADV VP

22 / 23

Page 28: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Results

lab. prec. lab. rec. lab. F-scoreNegra no GF 69.96 69.95 69.95

all GF 47.20 56.43 51.41subject 52.50 58.02 55.12acc. object 35.14 36.30 35.71dat. object 8.38 3.58 5.00

TuBa no GF 89.86 88.51 89.18all GF 75.73 74.93 75.33subject 66.82 75.93 71.08acc. object 43.84 47.31 45.50dat. object 24.46 9.96 14.07

23 / 23

Page 29: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Results

lab. prec. lab. rec. lab. F-scoreNegra no GF 69.96 69.95 69.95

all GF 47.20 56.43 51.41subject 52.50 58.02 55.12acc. object 35.14 36.30 35.71dat. object 8.38 3.58 5.00

TuBa no GF 89.86 88.51 89.18all GF 75.73 74.93 75.33subject 66.82 75.93 71.08acc. object 43.84 47.31 45.50dat. object 24.46 9.96 14.07

23 / 23

Page 30: Treebanks and Parsing - Computational Linguisticscl.indiana.edu/~md7/09/645/slides/10-parsing/10d-treebank.pdf · Treebanks and Parsing Treebanks ... the most probable parse: R-SIMPX-ON

Treebanks andParsing

Treebanks

Extracting aGrammar

Ambiguity

Parsing andTreebanks

Results

lab. prec. lab. rec. lab. F-scoreNegra no GF 69.96 69.95 69.95

all GF 47.20 56.43 51.41subject 52.50 58.02 55.12acc. object 35.14 36.30 35.71dat. object 8.38 3.58 5.00

TuBa no GF 89.86 88.51 89.18all GF 75.73 74.93 75.33subject 66.82 75.93 71.08acc. object 43.84 47.31 45.50dat. object 24.46 9.96 14.07

23 / 23