Introduction to NLP: Data-Driven Dependency Parsing


Prof. Reut Tsarfaty, Bar-Ilan University

November 24, 2020

Statistical Parsing: The Big Picture

The Questions
- What kind of Trees?
- What kind of Models?
  - Generative
  - Discriminative
- Which Search Algorithm (Decoding)?
- Which Learning Algorithm (Training)?
- What kind of Evaluation?


Previously on NLP@BIU

Representation: Phrase-Structure Trees
Model: Generative
Objective: Probabilistic
Search: CKY
Train: Maximum Likelihood
Evaluation: Precision/Recall/F1

Today: Introduction to Dependency Parsing

Today: More Modeling Choices:

Representation: Constituency Trees → Dependency Trees
Model: Generative → ?
Objective: Probabilistic → ?
Search: Exhaustive → ?
Train: MLE → ?
Evaluation: F1 Scores → Attachment Scores

Introduction to Dependency Parsing

- The purpose of syntactic structures:
  - Encode predicate-argument structures
  - Who does what to whom? (When, where, why...)
- Properties of dependency structures:
  - Defined as (labeled) binary relations between words
  - Reflect a long linguistic (European) tradition
  - Explicitly represent argument structure

Representation: Labeled vs. Unlabeled

Unlabeled dependency tree for "workers dumped sacks into bins":
ROOT → dumped; dumped → workers; dumped → sacks; dumped → into; into → bins

Labeled dependency tree:
root(ROOT → dumped); subj(dumped → workers); dobj(dumped → sacks); prep(dumped → into); pobj(into → bins)

Representation: Functional vs. Lexical

Functional dependencies:
root(ROOT → dumped); subj(dumped → workers); dobj(dumped → sacks); prep(dumped → into); pobj(into → bins)

Lexical dependencies:
root(ROOT → dumped); subj(dumped → workers); dobj(dumped → sacks); nmod(dumped → bins); case(bins → into)

Discussions: Options and Schemes

Vertical vs. Horizontal Representation: http://nlp.stanford.edu:8080/corenlp/

The Universal Dependencies Initiative: https://universaldependencies.org/

Let’s Analyse!

- The cat sat on the mat.
- The cat is on the mat.
- The cat, which I met, is sitting on the mat.
- The dog and the cat sat on the big and fluffy mat.

You should know how to read/analyse these!


Dependency Trees: Formal Definition

- A labeled dependency tree is a labeled directed tree T with:
  - a set V of nodes, labeled with words (including ROOT)
  - a set A of arcs, labeled with dependency types
  - a linear precedence order < on V
- Notation:
  - An arc ⟨v1, v2⟩ connects head v1 with dependent v2
  - An arc ⟨v1, l, v2⟩ connects head v1 with dependent v2, with label l ∈ L
  - A node v0 (ROOT) serves as the unique root of the tree

Properties of Dependency Trees

A dependency tree T is:
- connected: for every node i there is a node j such that i → j or j → i
- acyclic: if i → j then not j →* i
- single-headed: if i → j then not k → j for any k ≠ i
- projective: if i → j then i →* k for every k such that min(i, j) < k < max(i, j)
  (a minimal projectivity check is sketched below)
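A minimal projectivity check over a head-array encoding of the tree; the encoding and the ROOT-at-index-0 convention are assumptions for this sketch, not part of the slides:

def is_projective(heads):
    # heads[i] = index of the head of token i; index 0 is ROOT with heads[0] = -1
    n = len(heads)
    for d in range(1, n):
        h = heads[d]
        lo, hi = min(h, d), max(h, d)
        for k in range(lo + 1, hi):
            # every token strictly between h and d must be a descendant of h
            a = k
            while a not in (-1, h):
                a = heads[a]
            if a != h:
                return False
    return True

# "workers dumped sacks into bins": ROOT=0, workers=1, dumped=2, sacks=3, into=4, bins=5
print(is_projective([-1, 2, 0, 2, 2, 4]))  # True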

Non-Projective Dependency Trees

Many parsing algorithms are restricted to projective dependency trees. Is this a problem?

Statistics from the CoNLL-X Shared Task 2006:
- NPD = non-projective dependencies
- NPS = non-projective sentences

Language     % NPD   % NPS
Dutch         5.4     36.4
German        2.3     27.8
Czech         1.9     23.2
Slovene       1.9     22.2
Portuguese    1.3     18.9
Danish        1.0     15.6

We will (mostly) focus on projective dependencies.

Evaluation Metrics

- Unlabeled Attachment Score (UAS): the percentage of correctly attached arcs (i, j) out of the total number n of arcs in the tree:
  UAS = |A_pred ∩ A_gold| / n, over unlabeled arcs (i, j)
- Labeled Attachment Score (LAS): the percentage of arcs with the correct head and the correct label:
  LAS = |A_pred ∩ A_gold| / n, over labeled arcs (i, l, j)
  (see the scoring sketch below)
- Root Accuracy: the percentage of sentences with the correct root dependency
- Exact Match: the percentage of sentences whose parse is identical to the gold tree
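A minimal sketch of computing UAS and LAS for a single sentence; the (head, label) list encoding of a tree is an assumption for illustration:

def attachment_scores(gold, pred):
    # each tree: a list of (head, label) pairs indexed by dependent position
    assert len(gold) == len(pred)
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las

# "workers dumped sacks into bins": heads indexed from 1, ROOT = 0
gold = [(2, "subj"), (0, "root"), (2, "dobj"), (2, "prep"), (4, "pobj")]
pred = [(2, "subj"), (0, "root"), (2, "dobj"), (5, "prep"), (4, "pobj")]
print(attachment_scores(gold, pred))   # (0.8, 0.8)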

Models for Dependency Parsing

The Parsing Objective:

y* = arg max_{y ∈ GEN(x)} Score(y)

The Modeling Choices:

Representation: Dependency Trees
Model: ?
Decoder: ?
Trainer: ?
Evaluation: Attachment Scores

Modeling Methods

Our Modeling Tasks:

y* = arg max_{y ∈ GEN(x)} Score(y)

- GEN: how do we generate all trees y?
- Score: how do we score any tree y?
- argmax: how do we find the best tree y?

Modeling Methods

- Conversion-Based: convert phrase-structure trees to dependency trees
- Grammar-Based: generative methods based on PCFGs
- Graph-Based: globally optimized, restricted features
- Transition-Based: locally optimal, unrestricted features
- Neural-Based

Modeling Methods (1)

- Conversion-Based: convert PS trees using a head table
- Grammar-Based
- Graph-Based
- Transition-Based

[Slide: the phrase-structure tree of "workers dumped sacks into bins": TOP, S, NP (workers), VP with V (dumped), NP (sacks), and PP (P into, NP bins)]

The head table maps each parent category to a search direction and a priority list of candidate head children (a minimal head-finding sketch follows the table):

VP → VBD VBN MD VBZ VB VBG VBP VP
NP ← NN NX JJR CD JJ JJS RB
ADJP ← NNS QP NN ADVP JJ VBN VBG
ADVP → RB RBR RBS FW ADVP TO CD JJR
S ← VP S SBAR ADJP UCP NP
SQ ← VBZ VBD VBP VB MD PRD VP SQ
SBAR ← S SQ SINV SBAR FRAG IN DT
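A minimal head-finding sketch over a simplified version of the table above; the Python encoding (a direction symbol plus a priority list) and the added PP rule for the running example are assumptions, not the course's implementation:

HEAD_RULES = {
    "VP": (">", ["VBD", "VBN", "MD", "VBZ", "VB", "VBG", "VBP", "VP"]),
    "NP": ("<", ["NN", "NX", "JJR", "CD", "JJ", "JJS", "RB"]),
    "S":  ("<", ["VP", "S", "SBAR", "ADJP", "UCP", "NP"]),
    "PP": (">", ["P", "IN"]),   # added for the running example (an assumption)
}

def find_head(label, children):
    # return the index of the head child, given the parent label and the child labels
    direction, priorities = HEAD_RULES.get(label, (">", []))
    order = list(range(len(children)))
    if direction == "<":
        order.reverse()
    for cat in priorities:
        for i in order:
            if children[i] == cat:
                return i
    return order[0]                     # fall back to the first candidate child

print(find_head("S", ["NP", "VP"]))     # 1: the VP child heads the S
print(find_head("NP", ["DT", "NN"]))    # 1: the NN child heads the NP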

Where Do Dependency Trees Come From?

[Slide: the phrase-structure tree of "workers dumped sacks into bins", the starting point of the conversion]

[Slide: the same tree after head annotation, each constituent labeled with its lexical head: S/dumped, NP/workers, VP/dumped, V/dumped, NP/sacks, PP/into, P/into, NP/bins]

[Slide: the bi-lexical dependencies read off the head-annotated tree: dumped → workers, dumped → sacks, dumped → into, into → bins]


The resulting dependency tree:

ROOT → dumped; dumped → workers; dumped → sacks; dumped → into; into → bins

Modeling Methods (2)

✓ Conversion-Based
- Grammar-Based
- Graph-Based
- Transition-Based

Grammar-Based Dependency Parsing

The Basic Idea
- Treat bi-lexical dependencies as constituents
- Decode using a chart-based algorithm (e.g., CKY)
- Learn using standard MLE methods
- Evaluate over the set of resulting dependencies as usual

Relevant Studies
- Original version: [Hays 1964]
- Link Grammar: [Sleator and Temperley 1991]
- Earley-style left-corner: [Lombardo and Lesmo 1996]
- Bilexical grammars: [Eisner 1996a, 1996b, Eisner 2000]

http://cs.jhu.edu/~jason/papers/eisner.coling96.pdf


Grammar-Based Dependency Parsing

The Objective Function:

t* = arg max_{t ∈ GEN(x)} P(t)

The Modeling Choices:

Representation: Dependency Trees
Model: PCFG
Decoder: Adapted CKY
Trainer: Smoothed MLE
Evaluation: Attachment Scores

Modeling Methods (3)

✓ Conversion-Based
✓ Grammar-Based
- Graph-Based
- Transition-Based

Graph-Based Dependency Parsing

The Basic Idea
- Define a global arc-factored model
- Treat the search as an MST (maximum spanning tree) problem
- Treat the learning as a classification problem
- Evaluate over the set of gold dependencies as usual


Graph-Based Dependency Parsing

Step 1: Defining the Arc-Factored Model

t* = arg max_{t ∈ GEN(V)} w · Φ(t)
   = arg max_{t ∈ GEN(V)} Σ_{(i→j) ∈ t} w · φ_arc(i → j)
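A minimal sketch of arc-factored scoring, where the score of a tree is the sum of dot products between the weight vector and each arc's sparse feature vector; all helper names here are illustrative assumptions:

from collections import defaultdict

def score_arc(weights, feats):
    # dot product of the weight vector with a sparse binary feature vector
    return sum(weights[f] for f in feats)

def score_tree(weights, arcs, feat_fn):
    # arcs: iterable of (head, dep) pairs; feat_fn maps (head, dep) to a feature list
    return sum(score_arc(weights, feat_fn(h, d)) for h, d in arcs)

weights = defaultdict(float)   # to be learned, e.g. by the perceptron sketched below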

Graph-Based Dependency Parsing

Step 2: Defining Feature Templates

Name                             φ_i(had, OBJ, effect)    w_i
Unigram head                     "had"                    w_unihead
Unigram dep                      "effect"                 w_unidep
Unigram head POS                 VB                       w_uniheadpos
Unigram dep POS                  NN                       w_unideppos
Bigram head-dep                  "had-effect"             w_bigram
Bigram headpos-deppos            VB-NN                    w_bigrampos
Labeled bigram head-dep          "had-OBJ-effect"         w_bigramlabel
Labeled bigram headpos-deppos    VB-obj-NN                w_bigramposlabel
In-between POS                   VB-IN-NN                 w_inbetween
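A minimal feature-extraction sketch mirroring the templates above; the exact feature-name strings and the (word, POS) sentence encoding are assumptions:

def arc_features(sent, h, d, label=None):
    # sent: list of (word, pos) pairs with a dummy ROOT entry at index 0
    hw, hp = sent[h]
    dw, dp = sent[d]
    feats = [
        f"uni_head={hw}", f"uni_dep={dw}",
        f"uni_head_pos={hp}", f"uni_dep_pos={dp}",
        f"bigram={hw}-{dw}", f"bigram_pos={hp}-{dp}",
    ]
    if label is not None:
        feats += [f"bigram_label={hw}-{label}-{dw}",
                  f"bigram_pos_label={hp}-{label}-{dp}"]
    # in-between features: POS tags of the tokens strictly between h and d
    lo, hi = min(h, d), max(h, d)
    feats += [f"in_between={hp}-{sent[k][1]}-{dp}" for k in range(lo + 1, hi)]
    return feats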


Graph-Based Dependency Parsing

Step 3: Learning, e.g. with the Perceptron
- Theory:
  - Find w that assigns a higher score to the gold tree y_i than to any other y ∈ Y
  - If such a separation exists, the algorithm will learn to separate the correct structure from the incorrect structures
- Practice:
  - Training requires repeated inference-and-update steps
  - Computing feature values is time-consuming
  - The averaged-perceptron variant is preferred
  (a training-loop sketch follows)
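A minimal structured-perceptron training sketch, assuming a decode(weights, sent) routine (e.g., CLE over arc scores) and the arc_features sketch above; averaging is omitted:

from collections import defaultdict

def features_of_tree(sent, heads):
    # bag of arc features for a full tree (heads[d] = head of token d)
    feats = defaultdict(int)
    for d in range(1, len(sent)):
        for f in arc_features(sent, heads[d], d):
            feats[f] += 1
    return feats

def perceptron_train(data, decode, epochs=10):
    # data: list of (sent, gold_heads) pairs
    weights = defaultdict(float)
    for _ in range(epochs):
        for sent, gold in data:
            pred = decode(weights, sent)        # inference step
            if pred != gold:                    # update only on mistakes
                for f, v in features_of_tree(sent, gold).items():
                    weights[f] += v
                for f, v in features_of_tree(sent, pred).items():
                    weights[f] -= v
    return weights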

Graph-Based Dependency Parsing

Step 4: Finding the Maximum Spanning Tree
The Chu-Liu-Edmonds Algorithm

Runtime complexity: O(n²)
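A sketch of the first step of Chu-Liu-Edmonds only: each non-root node greedily picks its best head, and we then look for a cycle among the chosen arcs; the contraction/expansion step that resolves cycles is omitted, so this is an illustration rather than a full implementation:

def greedy_heads(scores):
    # scores[h][d] = score of the arc h -> d; node 0 is ROOT
    n = len(scores)
    heads = [0] * n
    for d in range(1, n):
        heads[d] = max((h for h in range(n) if h != d), key=lambda h: scores[h][d])
    return heads

def find_cycle(heads):
    # return a set of nodes forming a cycle under heads, or None
    n = len(heads)
    for start in range(1, n):
        seen, node = [], start
        while node != 0 and node not in seen:
            seen.append(node)
            node = heads[node]
        if node != 0:                        # the walk revisited a node on its own path
            return set(seen[seen.index(node):])
    return None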

Graph-Based Dependency Parsing

Step 3: Online Learning
Perceptron / MIRA (Margin Infused Relaxed Algorithm)

Step 4: Max-Spanning-Tree Decoding
The Chu-Liu-Edmonds Algorithm (CLE)

http://repository.upenn.edu/cgi/viewcontent.cgi?article=1056&context=cis_reports

Graph-Based Dependency Parsing

The Objective Function:

t* = arg max_{t ∈ GEN(x)} Σ_{a ∈ arcs(t)} w · Φ(a)

The Modeling Choices:

Representation: Dependency Trees
Model: Graph-Based, Arc-Factored
Decoder: MST/CLE, O(n²)
Trainer: Perceptron/MIRA
Evaluation: Attachment Scores

Modeling Methods (4)

✓ Conversion-Based
✓ Grammar-Based
✓ Graph-Based
- Transition-Based

Transition-Based Dependency Parsing

The Basic Idea
- Define a transition system
- Define an oracle algorithm for decoding
- Approximate the oracle algorithm via learning
- Evaluate over dependency arcs as usual

http://stp.lingfil.uu.se/~nivre/docs/BeyondMaltParser.pdf

Transition-Based Dependency Parsing

Defining Configurations
A parser configuration is a triplet c = (S, Q, A), where:
- S = a stack [..., w_i]_S of partially processed nodes
- Q = a queue [w_j, ...]_Q of remaining input nodes
- A = a set of labeled arcs (w_i, l, w_j)

Initialization: c_0 = ([w_0]_S, [w_1, ..., w_n]_Q, {}), where w_0 = ROOT
Termination: c_t = ([w_0]_S, []_Q, A)

Transition-Based Dependency Parsing

Defining Transitions (a minimal implementation sketch follows this list)
- Shift:
  ([...]_S, [w_i, ...]_Q, A) → ([..., w_i]_S, [...]_Q, A)
- Arc-Left(l):
  ([..., w_i, w_j]_S, Q, A) → ([..., w_j]_S, Q, A ∪ {(w_j, l, w_i)})
- Arc-Right(l):
  ([..., w_i, w_j]_S, Q, A) → ([..., w_i]_S, Q, A ∪ {(w_i, l, w_j)})
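A minimal implementation sketch of this transition system; the Configuration encoding (an index stack, an index queue, and a set of (head, label, dep) triples) is an assumption for illustration:

class Configuration:
    def __init__(self, words):
        self.stack = [0]                             # w0 = ROOT
        self.queue = list(range(1, len(words) + 1))  # token indices 1..n
        self.arcs = set()                            # (head, label, dep) triples

def shift(c):
    c.stack.append(c.queue.pop(0))

def arc_left(c, label):
    # stack top two items ... w_i w_j: w_j becomes the head of w_i
    wj, wi = c.stack.pop(), c.stack.pop()
    c.arcs.add((wj, label, wi))
    c.stack.append(wj)

def arc_right(c, label):
    # stack top two items ... w_i w_j: w_i becomes the head of w_j
    wj = c.stack.pop()
    c.arcs.add((c.stack[-1], label, wj))

def is_terminal(c):
    return not c.queue and len(c.stack) == 1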

Transition-Based Dependency Parsing

Demo Deck

Transition-Based Dependency Parsing

Deterministic Parsing
Given an oracle O that correctly predicts the next transition O(c), parsing is deterministic:

PARSE(w_1, ..., w_n)
1. c ← ([w_0]_S, [w_1, ..., w_n]_Q, {})
2. while Q_c ≠ [] or |S_c| > 1
3.   t ← O(c)
4.   c ← t(c)
5. return T = (w_0, w_1, ..., w_n, A_c)

Transition-Based Dependency Parsing

Data-Driven Parsing
We approximate the oracle O using a classifier Predict(c) that predicts the next transition from features of c, feats(c).

PARSE(w_1, ..., w_n)
1. c ← ([w_0]_S, [w_1, ..., w_n]_Q, {})
2. while Q_c ≠ [] or |S_c| > 1
3.   t ← Predict(w, feats(c))
4.   c ← t(c)
5. return T = (w_0, w_1, ..., w_n, A_c)
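A runnable version of the PARSE loop above, reusing the Configuration and transition functions from the earlier sketch; the greedy predict(weights, config) classifier is an assumption:

def parse(words, predict, weights):
    # words: the input tokens (without ROOT); predict returns a (name, label) pair
    c = Configuration(words)
    while c.queue or len(c.stack) > 1:
        name, label = predict(weights, c)        # e.g. ("arc_left", "subj")
        if name == "shift":
            shift(c)
        elif name == "arc_left":
            arc_left(c, label)
        else:
            arc_right(c, label)
    return c.arcs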

Transition-Based Dependency Parsing

Feature Engineering

name           feature          weight
S[0] word      effect           w1
S[0] pos       NN               w2
S[1] word      little           w3
S[1] pos       JJ               w4
Q[0] word      on               w5
Q[0] pos       P                w6
Q[1] word      financial        w7
Q[1] pos       JJ               w8
Root(A) word   had              w9
Root(A) pos    VB               w10
S[0]-S[1]      effect→little    w11
S[1]-S[0]      little→effect    w12
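A minimal sketch of extracting such features from a configuration; the feature-name strings and the sentence encoding (a list of (word, POS) pairs with a dummy ROOT entry at index 0) are assumptions:

def config_features(c, sent):
    # sent: list of (word, pos) pairs with a dummy ROOT entry at index 0
    feats = []
    if c.stack:
        w, p = sent[c.stack[-1]]
        feats += [f"s0_word={w}", f"s0_pos={p}"]
    if len(c.stack) > 1:
        w, p = sent[c.stack[-2]]
        feats += [f"s1_word={w}", f"s1_pos={p}"]
    for k in range(min(2, len(c.queue))):
        w, p = sent[c.queue[k]]
        feats += [f"q{k}_word={w}", f"q{k}_pos={p}"]
    if len(c.stack) > 1:
        feats.append(f"s1_s0={sent[c.stack[-2]][0]}->{sent[c.stack[-1]][0]}")
    return feats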


Transition-Based Dependency Parsing

- An oracle O can be approximated by a (linear) classifier:
  Predict(c) = arg max_t w · Φ(c, t)
- History-based features Φ(c, t):
  - Features over input words relative to S and Q
  - Features over the (partial) dependency tree defined by A
  - Features over the (partial) transition sequence so far
- Learning w from treebank data (see the sketch below):
  - Reconstruct the oracle transition sequence for each sentence
  - Construct a training data set D = {(c, t) | O(c) = t}
  - Maximize the accuracy of the local predictions O(c) = t
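A minimal sketch of building the local training set by replaying an oracle over gold trees; the oracle(config, gold_arcs) function stands in for a standard arc-standard oracle and is an assumption here:

def build_training_data(treebank, oracle):
    # treebank: list of (sent, gold_arcs) pairs, where sent has a dummy ROOT
    # entry at index 0 and gold_arcs is a set of (head, label, dep) triples
    data = []
    for sent, gold_arcs in treebank:
        c = Configuration(sent[1:])              # tokens without the ROOT entry
        while c.queue or len(c.stack) > 1:
            t = oracle(c, gold_arcs)             # the gold next transition
            data.append((config_features(c, sent), t))
            name, label = t
            if name == "shift":
                shift(c)
            elif name == "arc_left":
                arc_left(c, label)
            else:
                arc_right(c, label)
    return data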

Transition-Based Dependency Parsing

Online Learning
Online learning algorithms, as in the graph-based case.

Step 4: Greedy Decoding
Greedy: at each step, select the maximum-scoring transition.

Recent Advances: Neural-Network Models

The Basic Claim: both graph-based and transition-based models benefit from the move to neural networks.
- Same overall approach and algorithms as before, but:
  → Replace the linear classifier with a non-linear one (an MLP).
  → Use pre-trained word embeddings.
  → Replace the feature extractor with a BiLSTM.
  (a minimal sketch follows this list)
- Further explorations:
  → Semi-supervised learning.
  → Multi-task learning.
- Remaining challenges:
  → Out-of-domain parsing (e.g., Twitter).
  → Parsing morphologically rich languages (e.g., Hebrew).
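A minimal PyTorch sketch of this recipe for the graph-based case: a BiLSTM encodes the sentence and an MLP scores every candidate arc from the head and dependent states; module names and dimensions are illustrative assumptions, not the architecture of any particular parser:

import torch
import torch.nn as nn

class BiLSTMArcScorer(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                               batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, word_ids):
        # word_ids: (1, n) tensor of token indices, with ROOT at position 0
        states, _ = self.encoder(self.embed(word_ids))   # (1, n, 2*hidden)
        vecs = states.squeeze(0)                         # (n, 2*hidden)
        n = vecs.size(0)
        heads = vecs.unsqueeze(1).expand(n, n, -1)       # row i = candidate head i
        deps = vecs.unsqueeze(0).expand(n, n, -1)        # column j = dependent j
        return self.mlp(torch.cat([heads, deps], dim=-1)).squeeze(-1)  # (n, n) arc scores

# scorer = BiLSTMArcScorer(vocab_size=10000)
# scores = scorer(torch.tensor([[0, 12, 7, 5]]))  # scores[i, j] = score of arc i -> j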

Summarising Dependency Parsing

- Dependency trees as labeled bi-lexical dependencies
- Data-driven parsing trained over dependency treebanks
- Varied methods:
  - Conversion-based (rules)
  - Grammar-based (probabilistic)
  - Graph-based (linear, globally optimized)
  - Transition-based (linear, locally optimized)
- Neural-network models work the same, but with:
  - a non-linear objective (e.g., an MLP)
  - better word representations (e.g., word embeddings)
  - better (automatic) feature extraction (e.g., a BiLSTM)
- English is "solved"; what about other languages?
  - Stanford CoreNLP: https://corenlp.run
  - The UD Initiative: https://universaldependencies.org/
  - UDPipe: http://lindat.mff.cuni.cz/services/udpipe/
  - ONLP: nlp.biu.ac.il/~rtsarfaty/onlp/hebrew/


NLP@BIU: Where We’re At

So Far
✓ Part 1: Introduction (classes 1-2)
✓ Part 2: Words/Sequences (classes 3-4)
✓ Part 3: Sentences/Trees (classes 5-6)
→ Part 4: Meanings (Prof. Ido Dagan, starting class 7)

To Be Continued...