
Asma Naseer

CHUNKING / SHALLOW PARSING

INTRODUCTION

Shallow Parsing or Partial Parsing
First proposed by Steven Abney (1991)
Breaking text up into small pieces
Each piece is parsed separately [1]

INTRODUCTION (CONTINUE . . .)

Words are not arranged flatly in a sentence; they are grouped into smaller parts called phrases

The girl was playing in the street

اس نے احمد کو کتاب دی (He gave the book to Ahmad)

INTRODUCTION (CONTINUE . . .)

Chunks are non-recursive (a chunk does not contain a phrase of the same category as itself)

NP → D? AdjP? AdjP? N

The big red balloon

[NP [D The] [AdjP [Adj big]] [AdjP [Adj red]] [N balloon]] [1]

INTRODUCTION (CONTINUE . . .)

Each phrase is dominated by a head h

A man proud of his son

A proud man

The root of the chunk has h as its s-head (semantic head)

The head of a noun phrase is usually a noun or pronoun [1]

CHUNK TAGGING

Chunk tagging schemes: IOBE, IOB, IO

CHUNK TAGGING (CONTINUE . . .)

IOB (Inside, Outside, Begin) tags: B-NP, I-NP, O and B-VP, I-VP, O

قائد اعظم محمد علی جناح نے قوم سے خطاب کیا (Quaid-e-Azam Muhammad Ali Jinnah addressed the nation)

[B-NP قائد] [I-NP اعظم] [I-NP محمد] [I-NP علی] [I-NP جناح] [O نے] [B-NP قوم] [O سے] [B-NP خطاب] [O کیا]
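The IOB encoding above can be produced mechanically from a bracketed chunk structure. Below is a minimal sketch using NLTK's conversion utilities on the English example sentence from the introduction; the tree is constructed by hand purely for illustration.

```python
# A minimal sketch of IOB chunk tagging with NLTK's conversion utilities.
# The chunk structure below is hand-built for illustration only.
from nltk import Tree
from nltk.chunk import tree2conlltags, conlltags2tree

# "The girl was playing in the street" with NP, VP and PP chunks marked.
chunked = Tree('S', [
    Tree('NP', [('The', 'DT'), ('girl', 'NN')]),
    Tree('VP', [('was', 'VBD'), ('playing', 'VBG')]),
    Tree('PP', [('in', 'IN')]),
    Tree('NP', [('the', 'DT'), ('street', 'NN')]),
])

# Flatten the tree into per-token (word, POS, IOB-chunk) triples.
for word, pos, iob in tree2conlltags(chunked):
    print(f"{word}\t{pos}\t{iob}")
# The -> B-NP, girl -> I-NP, was -> B-VP, playing -> I-VP, in -> B-PP, ...

# The inverse conversion rebuilds the chunk tree from the IOB triples.
print(conlltags2tree(tree2conlltags(chunked)))
```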

RESEARCH WORK

Rule Based vs. Statistical Based Chunking [2]
Use of Support Vector Learning for Chunk Identification [5]
A Context Based Maximum Likelihood Approach to Chunking [6]
Chunking with Maximum Entropy Models [7]
Single-Classifier Memory-Based Phrase Chunking [8]
Hybrid Text Chunking [9]
Shallow Parsing as POS Tagging [3]

RULE BASED VS STATISTICAL BASED CHUNKING

Two techniques are used

Regular expression rules
○ Shallow parse based on regular expressions

N-gram statistical tagger (machine-learning based chunking)
○ NLTK (Natural Language Toolkit) based on the TnT tagger (Trigrams'n'Tags)
○ Basic idea: reuse a POS tagger for chunking

RULE BASED VS STATISTICAL BASED CHUNKING ( CONTINUE… )

Regular expression rules
○ The regular expressions must be developed manually (see the sketch below)

N-gram statistical tagger
○ Can be trained on gold-standard chunked data
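A minimal sketch of the regular-expression approach using NLTK's RegexpParser is shown below. The grammar is illustrative and assumes Penn Treebank POS tags; the rules actually developed in [2] are not reproduced here.

```python
# A minimal sketch of regex-rule chunking with NLTK's RegexpParser.
# The grammar below is illustrative, not the rule set used in [2].
import nltk

grammar = r"""
  NP: {<DT|PRP\$>?<JJ>*<NN.*>+}   # optional determiner, adjectives, then noun(s)
  VP: {<MD>?<VB.*>+}              # optional modal followed by verb forms
"""
chunker = nltk.RegexpParser(grammar)

sentence = [('The', 'DT'), ('girl', 'NN'), ('was', 'VBD'),
            ('playing', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('street', 'NN')]
print(chunker.parse(sentence))
# (S (NP The/DT girl/NN) (VP was/VBD playing/VBG) in/IN (NP the/DT street/NN))
```

Refining such a grammar by hand is exactly the iterative rule-development loop described on the following slides, whereas the n-gram tagger only needs chunked training data.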

RULE BASED VS STATISTICAL BASED CHUNKING ( CONTINUE… )

Focus is on verb and noun phrase chunking

Noun phrases
A noun or pronoun is the head
Also contain
○ Determiners, i.e. articles, demonstratives, numerals, possessives and quantifiers
○ Adjectives
○ Complements (ad-positional phrases, relative clauses)

Verb phrases
A verb is the head
Often one or two complements
Any number of adjuncts

RULE BASED VS STATISTICAL BASED CHUNKING ( CONTINUE… )

Training NLTK on chunk data

Starts with an empty rule set
○ 1. Define or refine a rule
○ 2. Run the chunker on the training data
○ 3. Compare the results with the previous run

Repeat steps 1-3 until performance no longer improves significantly

Issue: the corpus contains 211,727 phrases in total; a subset of 1,000 phrases was used

RULE BASED VS STATISTICAL BASED CHUNKING ( CONTINUE… )

Training TnT on chunk data

Chunking is treated as statistical tagging, in two steps
○ Parameter generation: create model parameters from the training corpus
○ Tagging: tag each word with a chunk label
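Below is a minimal sketch of these two steps using NLTK's TnT implementation and the CoNLL-2000 corpus (available via nltk.download('conll2000')). Mapping each POS tag to a chunk label is one simple way to cast chunking as tagging; the exact setup in [2] may differ.

```python
# A minimal sketch of chunking as statistical tagging with NLTK's TnT implementation,
# trained on CoNLL-2000 (WSJ sections 15-18 for training, section 20 for testing).
from nltk.corpus import conll2000
from nltk.chunk import tree2conlltags
from nltk.tag import tnt

def to_tag_sequences(chunked_sents):
    # Re-encode each sentence as (POS tag, chunk label) pairs.
    return [[(pos, chunk) for _, pos, chunk in tree2conlltags(sent)]
            for sent in chunked_sents]

train = to_tag_sequences(conll2000.chunked_sents('train.txt', chunk_types=['NP', 'VP']))
test = to_tag_sequences(conll2000.chunked_sents('test.txt', chunk_types=['NP', 'VP']))

# Step 1: parameter generation (n-gram counts from the training corpus).
chunk_tagger = tnt.TnT()
chunk_tagger.train(train)

# Step 2: tagging -- label each POS tag of a test sentence with a chunk label.
pos_sequence = [pos for pos, _ in test[0]]
print(chunk_tagger.tag(pos_sequence))
```

A full evaluation would tag all of section 20 and score the output with the metrics defined on the results slides.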

RULE BASED VS STATISTICAL BASED CHUNKING ( CONTINUE… )

Data set

WSJ: Wall Street Journal, a New York newspaper
○ US and international business
○ Financial news

Training: sections 15-18
Testing: section 20
Both tagged with POS and IOB chunk tags
Special characters are treated as other POS; punctuation is tagged as O

RULE BASED VS STATISTICAL BASED CHUNKING ( CONTINUE… )

Results

Precision: P = |reference ∩ test| / |test|
Recall: R = |reference ∩ test| / |reference|
F-measure: F(α) = 1 / (α/P + (1-α)/R), with α = 0.5
F-rate: F = (2 · P · R) / (P + R)
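A small helper that computes these scores from reference and predicted chunk sets might look as follows; chunks are represented as (start, end, type) triples, and this is only a sketch, not the evaluation script used in [2].

```python
# A small sketch computing precision, recall and F from chunk sets.
def chunk_scores(reference, test):
    reference, test = set(reference), set(test)
    correct = len(reference & test)
    precision = correct / len(test) if test else 0.0
    recall = correct / len(reference) if reference else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

# Toy chunks as (start, end, type) spans; only illustrative values.
ref = {(0, 2, 'NP'), (2, 4, 'VP'), (5, 7, 'NP')}
hyp = {(0, 2, 'NP'), (2, 3, 'VP'), (5, 7, 'NP')}
print(chunk_scores(ref, hyp))  # (0.666..., 0.666..., 0.666...)
```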

RULE BASED VS STATISTICAL BASED CHUNKING ( CONTINUE… )

Results

NLTK
        P         R         F-Measure
VP      79.3 %    80.1 %    79.7 %
NP      76.5 %    84.4 %    80.3 %

TnT
        P         R         F-Measure
VP      79.59 %   82.35 %   80.95 %
NP      78.36 %   76.76 %   77.55 %

USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION

SVMs (large-margin classifiers)
Introduced by Vapnik (1995)
Two-class pattern recognition problem
Good generalization performance
High accuracy in text categorization without overfitting (Joachims, 1998; Taira and Haruno, 1999)

USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION ( CONTINUE… )

Training data: (x1, y1), …, (xl, yl), with xi ∈ R^n and yi ∈ {+1, -1}

xi is the i-th sample, represented by an n-dimensional vector
yi is the (positive or negative) class label of the i-th sample

In SVM
Positive and negative examples are separated by a hyperplane
SVM finds the optimal hyperplane

USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION ( CONTINUE… )

[Figure: two possible separating hyperplanes; SVM chooses the one with the maximum margin]

USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION ( CONTINUE… )

Chunks in the CoNLL-2000 shared task are IOB-tagged
Each chunk type belongs to either I or B, e.g. I-NP or B-NP
22 chunk tag types are found in CoNLL-2000
The chunking problem is classification into these 22 types
SVM is a binary classifier, so it is extended to k classes
One class vs. all others
Pairwise classification (see the sketch below)
○ k * (k - 1) / 2 classifiers: 22 * 21 / 2 = 231 classifiers
○ A majority vote decides the final class
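The pairwise scheme is what scikit-learn's SVC uses internally, so a toy sketch can illustrate the mechanics; this is not the original TinySVM-based setup of [5].

```python
# A minimal sketch of the pairwise ("one vs. one") strategy with scikit-learn.
# SVC trains k*(k-1)/2 binary classifiers internally and combines them by voting,
# mirroring the 231-classifier setup described above (22 * 21 / 2 = 231).
import numpy as np
from sklearn.svm import SVC

k = 22
print(k * (k - 1) // 2)        # 231 pairwise classifiers for 22 chunk tags

# Toy 2-D data with 3 classes, just to show the one-vs-one mechanics.
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
y = ['B-NP', 'B-NP', 'I-NP', 'I-NP', 'B-VP', 'B-VP']

clf = SVC(kernel='linear', decision_function_shape='ovo')
clf.fit(X, y)
print(clf.decision_function([[5, 5.5]]).shape)  # (1, 3): one score per class pair
print(clf.predict([[5, 5.5]]))                  # ['I-NP']
```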

USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION ( CONTINUE… )

The feature vector consists of
Words: w
POS tags: t
Chunk tags: c

To identify the chunk tag ci at the i-th word:
wj, tj (j = i-2, i-1, i, i+1, i+2)
cj (j = i-2, i-1)

All features are expanded to binary values (0 or 1)
The total dimension of the feature vector becomes 92,837
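A sketch of how such a feature vector can be built is shown below: words and POS tags in a ±2 window plus the two preceding chunk tags, expanded into binary dimensions with a DictVectorizer. The feature names and padding token are illustrative; on the full CoNLL-2000 training data the setup in [5] reaches 92,837 dimensions.

```python
# A minimal sketch of the feature vector described above.
from sklearn.feature_extraction import DictVectorizer

def features(words, tags, chunks, i):
    feats = {}
    for j in range(-2, 3):                      # w_{i-2}..w_{i+2}, t_{i-2}..t_{i+2}
        k = i + j
        feats[f'w{j}'] = words[k] if 0 <= k < len(words) else '__PAD__'
        feats[f't{j}'] = tags[k] if 0 <= k < len(tags) else '__PAD__'
    for j in (-2, -1):                          # c_{i-2}, c_{i-1} (already decided)
        k = i + j
        feats[f'c{j}'] = chunks[k] if 0 <= k < len(chunks) else '__PAD__'
    return feats

words = ['The', 'girl', 'was', 'playing']
tags = ['DT', 'NN', 'VBD', 'VBG']
chunks = ['B-NP', 'I-NP']                       # chunk tags assigned so far

vec = DictVectorizer()
X = vec.fit_transform([features(words, tags, chunks, 2)])  # event for i = 2 ('was')
print(X.shape)   # every (feature, value) pair becomes one binary dimension
```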

USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION ( CONTINUE… )

Results

Training the 231 classifiers took about 1 day on a Linux PC (Celeron 500 MHz, 512 MB)
Chunk types: ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP
Precision = 93.45 %
Recall = 93.51 %
Fβ=1 = 93.48 %

A CONTEXT BASED MAXIMUM LIKELIHOOD APPROACH TO CHUNKING

Training

POS-tag based
Construct symmetric n-contexts from the training corpus
1-context: the most common chunk label for each tag
3-context: the tag together with the tag before and after it [t-1, t0, t+1]
5-context: [t-2, t-1, t0, t+1, t+2]
7-context: [t-3, t-2, t-1, t0, t+1, t+2, t+3]

A CONTEXT BASED MAXIMUM LIKELIHOOD APPROACH TO CHUNKING (CONTINUE . . .)

Training

For each context, find the most frequent label
CC → [O CC]
PRP CC RP → [B-NP CC]

To save storage space, an n-context is added only if it differs from its nearest lower-order context

A CONTEXT BASED MAXIMUM LIKELIHOOD APPROACH TO CHUNKING (CONTINUE . . .)

Testing

Construct the maximum context for each tag
Look it up in the database of most likely patterns
If the largest context is not found, the context is diminished step by step
The only rule for chunk labeling is to look up [t-3, t-2, t-1, t0, t+1, t+2, t+3] … [t0] until the context is found (see the sketch below)
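A compact sketch of the whole train-and-back-off procedure is given below; the storage optimisation (only storing an n-context when it differs from its lower-order context) is omitted, and details differ from [6].

```python
# A minimal sketch of the context-based maximum-likelihood chunker with backoff.
from collections import Counter, defaultdict

def context(tags, i, n):
    """Symmetric n-context around position i, padded at the sentence edges."""
    half = n // 2
    return tuple(tags[j] if 0 <= j < len(tags) else '__PAD__'
                 for j in range(i - half, i + half + 1))

def train(tagged_corpus, sizes=(1, 3, 5, 7)):
    counts = {n: defaultdict(Counter) for n in sizes}
    for tags, chunk_labels in tagged_corpus:          # parallel POS / chunk sequences
        for i, label in enumerate(chunk_labels):
            for n in sizes:
                counts[n][context(tags, i, n)][label] += 1
    # Keep only the most frequent label per context.
    return {n: {ctx: c.most_common(1)[0][0] for ctx, c in table.items()}
            for n, table in counts.items()}

def chunk(tags, model, sizes=(7, 5, 3, 1)):
    labels = []
    for i in range(len(tags)):
        label = 'O'
        for n in sizes:                               # largest context first, then back off
            ctx = context(tags, i, n)
            if ctx in model[n]:
                label = model[n][ctx]
                break
        labels.append(label)
    return labels

# Toy training data: one sentence of POS tags with its chunk labels.
corpus = [(['DT', 'NN', 'VBD', 'IN', 'DT', 'NN'],
           ['B-NP', 'I-NP', 'B-VP', 'B-PP', 'B-NP', 'I-NP'])]
model = train(corpus)
print(chunk(['DT', 'NN', 'VBD'], model))
```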

A CONTEXT BASED MAXIMUM LIKELIHOOD APPROACH TO CHUNKING (CONTINUE . . .)

Results

The best results are achieved with the 5-context
Chunk types: ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP
○ Precision = 86.24 %
○ Recall = 88.25 %
○ Fβ=1 = 87.23 %

CHUNKING WITH MAXIMUM ENTROPY MODELS

Maximum Entropy models are exponential models
Collect as much information as possible
Frequencies of events relevant to the process

The MaxEnt model has the form

P(w|h) = (1 / Z(h)) · exp( Σi λi fi(h, w) )

fi(h, w) is a binary-valued feature describing an event
λi describes how important fi is
Z(h) is a normalization factor

CHUNKING WITH MAXIMUM ENTROPY MODELS (CONTINUE . . .)

Attributes used

Information in the WSJ corpus
Current word
POS tag of the current word
Surrounding words
POS tags of the surrounding words

Context
Left context: 3 words
Right context: 2 words

Additional information
Chunk tags of the previous 2 words
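A minimal sketch of a chunker built from exactly these attributes is shown below, with scikit-learn's LogisticRegression standing in for the MaxEnt trainer used in [7]; the training sentence is a toy example.

```python
# A minimal MaxEnt-style chunker sketch: current word/POS, 3 words of left context,
# 2 words of right context, and the two previous chunk tags as features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def event(words, tags, chunks, i):
    get = lambda seq, k: seq[k] if 0 <= k < len(seq) else '__PAD__'
    feats = {'w0': words[i], 't0': tags[i]}
    for j in (-3, -2, -1, 1, 2):                # 3 words left, 2 words right
        feats[f'w{j}'] = get(words, i + j)
        feats[f't{j}'] = get(tags, i + j)
    feats['c-1'] = get(chunks, i - 1)           # chunk tags of the previous two words
    feats['c-2'] = get(chunks, i - 2)
    return feats

# Tiny illustrative training sentence of (word, POS, chunk) triples.
sent = [('The', 'DT', 'B-NP'), ('girl', 'NN', 'I-NP'),
        ('was', 'VBD', 'B-VP'), ('playing', 'VBG', 'I-VP')]
words, tags, chunks = zip(*sent)

X = [event(words, tags, chunks, i) for i in range(len(sent))]
y = list(chunks)

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict([event(words, tags, chunks, 1)]))
```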

CHUNKING WITH MAXIMUM ENTROPY MODELS (CONTINUE . . .)

Results

Tagging accuracy = 95.5 % = (# of correctly tagged words) / (total # of words)
Recall = 91.86 % = (# of correctly proposed base NPs) / (# of correct base NPs)
Precision = 92.08 % = (# of correctly proposed base NPs) / (# of proposed base NPs)
Fβ=1 = 91.97 %, where Fβ = ((β² + 1) · Recall · Precision) / (β² · Precision + Recall)

HYBRID TEXT CHUNKING

Context-based lexicon and HMM-based chunker

Statistics were first used for chunking by Church (1988)
Corpus frequencies were used
Non-recursive noun phrases were identified

Skut & Brants (1998) modified Church's approach and used a Viterbi tagger

HYBRID TEXT CHUNKING (CONTINUE . . .)

Error-driven HMM-based text chunker
Memory is decreased by keeping only positive lexical entries
HMM-based text chunker with a context-dependent lexicon

Given G1..n = g1, g2, …, gn, find the optimal tag sequence T1..n = t1, t2, …, tn that maximizes log P(T1..n | G1..n), where

log P(T1..n | G1..n) = log P(T1..n) + log [ P(T1..n, G1..n) / ( P(T1..n) · P(G1..n) ) ]
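Decoding the best tag sequence is typically done with the Viterbi algorithm. The sketch below implements plain first-order HMM Viterbi decoding over chunk tags with hand-set toy probabilities; the actual chunker in [9] uses the mutual-information decomposition above together with a context-dependent lexicon.

```python
# A minimal first-order HMM Viterbi decoder over chunk tags (toy probabilities).
import math

def viterbi(observations, states, log_init, log_trans, log_emit):
    """Return the most likely state sequence for the observation sequence.
    -20.0 acts as the log of a tiny smoothing probability for unseen emissions."""
    V = [{s: log_init[s] + log_emit[s].get(observations[0], -20.0) for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            scores[s] = (V[-1][best_prev] + log_trans[best_prev][s]
                         + log_emit[s].get(obs, -20.0))
            pointers[s] = best_prev
        V.append(scores)
        back.append(pointers)
    best = max(states, key=lambda s: V[-1][s])   # follow back-pointers from the end
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

states = ['B-NP', 'I-NP', 'O']
log_init = {'B-NP': math.log(0.6), 'I-NP': math.log(0.1), 'O': math.log(0.3)}
log_trans = {
    'B-NP': {'B-NP': math.log(0.1), 'I-NP': math.log(0.6), 'O': math.log(0.3)},
    'I-NP': {'B-NP': math.log(0.2), 'I-NP': math.log(0.5), 'O': math.log(0.3)},
    'O':    {'B-NP': math.log(0.5), 'I-NP': math.log(0.1), 'O': math.log(0.4)},
}
log_emit = {
    'B-NP': {'DT': math.log(0.7), 'NN': math.log(0.3)},
    'I-NP': {'NN': math.log(0.8), 'DT': math.log(0.2)},
    'O':    {'VBD': math.log(0.9)},
}
print(viterbi(['DT', 'NN', 'VBD'], states, log_init, log_trans, log_emit))
# ['B-NP', 'I-NP', 'O']
```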

SHALLOW PARSING AS POS TAGGING

CoNLL-2000 data for training and testing
Ratnaparkhi's maximum-entropy based POS tagger
No change to its internal operation
The information supplied for training is increased

SHALLOW PARSING AS POS TAGGING (CONTINUE . . .)

Shallow parsing vs. POS tagging
Shallow parsing requires more of the surrounding POS/lexical syntactic environment

Training configurations
Words: w1 w2 w3
POS tags: t1 t2 t3
Chunk types: c1 c2 c3
Suffixes or prefixes

SHALLOW PARSING AS POS TAGGING (CONTINUE . . .)

The amount of information is gradually increased (see the sketch after this list)
Word: w1
Tag: t1
Word, tag, chunk label: (w1 t1 c1)
○ The current chunk label is obtained through another model trained on configurations of words and tags (w1 t1)

To deal with sparseness
○ t1, t2
○ c1
○ c2 (last two letters)
○ w1 (first two letters)
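The re-encoding that lets an unmodified POS tagger do chunking can be sketched as below: each CoNLL token is rewritten so that the chosen information (word, tag, or word plus tag) plays the role of the "word" and the chunk label plays the role of the "tag". A trigram tagger with backoff stands in for Ratnaparkhi's maximum-entropy tagger, and the suffix/prefix sparseness tricks of [3] are not reproduced.

```python
# A minimal sketch of "shallow parsing as POS tagging": re-encode CoNLL-2000 tokens
# so that an off-the-shelf tagger maps the chosen information to chunk labels.
import nltk
from nltk.corpus import conll2000
from nltk.chunk import tree2conlltags

def encode(sent, configuration):
    rows = tree2conlltags(sent)                   # (word, POS, chunk) triples
    if configuration == 'word':                   # Word w1
        return [(w, c) for w, t, c in rows]
    if configuration == 'tag':                    # Tag t1
        return [(t, c) for w, t, c in rows]
    return [(f'{w}|{t}', c) for w, t, c in rows]  # word and tag together

train = [encode(s, 'word+tag') for s in conll2000.chunked_sents('train.txt')]

# Any sequence tagger can now be trained on these pairs; a trigram tagger with
# backoff is used here purely as a stand-in for the maximum-entropy tagger of [3].
tagger = nltk.UnigramTagger(train)
tagger = nltk.BigramTagger(train, backoff=tagger)
tagger = nltk.TrigramTagger(train, backoff=tagger)

sample = [('The', 'DT'), ('company', 'NN'), ('said', 'VBD')]
print(tagger.tag([f'{w}|{t}' for w, t in sample]))
```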


SHALLOW PARSING AS POS TAGGING (CONTINUE . . .)

Overall results

Configuration           Precision   Recall    Fβ=1
Word w1                 88.06 %     88.71 %   88.38 %
Tag t1                  88.15 %     88.07 %   88.11 %
(w1 t1 c1)              89.79 %     90.70 %   90.24 %
Sparseness handling     91.65 %     92.23 %   91.94 %

SHALLOW PARSING AS POS TAGGING (CONTINUE . . .)

Error analysis: three groups of errors

Difficult syntactic constructs
○ Punctuation
○ Treatment of ditransitive vs. transitive VPs
○ Adjective vs. adverbial phrases

Mistakes made by annotators in the training or testing data
○ Noise
○ POS errors
○ Odd annotation decisions

Errors peculiar to the approach (see the check below)
○ The exponential distribution assigns non-zero probability to all events
○ The tagger may assign illegal chunk labels (e.g. an I-NP tag that does not continue an NP)
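The last error type is easy to detect automatically; a small sketch of such a legality check for IOB sequences:

```python
# A small sketch of the "illegal chunk label" check mentioned above: in IOB tagging,
# an I-X tag is only legal if it continues a chunk of the same type X.
def illegal_positions(iob_tags):
    bad = []
    prev = 'O'
    for i, tag in enumerate(iob_tags):
        if tag.startswith('I-'):
            chunk_type = tag[2:]
            if prev not in (f'B-{chunk_type}', f'I-{chunk_type}'):
                bad.append(i)
        prev = tag
    return bad

print(illegal_positions(['B-NP', 'I-NP', 'O', 'I-NP', 'B-VP', 'I-VP']))  # [3]
```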

SHALLOW PARSING AS POS TAGGING (CONTINUE . . .)

Comments

PPs are easy to identify
ADJPs and ADVPs are hard to identify correctly (more syntactic information is required)
Performance on NPs can be further improved
Performance using w1 or t1 alone is almost the same; using both features together improves performance

REFERENCES

[1] Philip Brooks, "A Simple Chunk Parser", May 8, 2003.
[2] Igor Boehm, "Rule Based vs. Statistical Chunking of the CoNLL Data Set".
[3] Miles Osborne, "Shallow Parsing as POS Tagging".
[4] Hans van Halteren, "Chunking with WPDV Models".
[5] Taku Kudoh and Yuji Matsumoto, "Use of Support Vector Learning for Chunk Identification", in Proceedings of CoNLL-2000 and LLL-2000, pages 142-144, Portugal, 2000.
[6] Christer Johansson, "A Context Sensitive Maximum Likelihood Approach to Chunking".
[7] Rob Koeling, "Chunking with Maximum Entropy Models".
[8] Jorn Veenstra and Antal van den Bosch, "Single-Classifier Memory-Based Phrase Chunking".
[9] GuoDong Zhou, Jian Su and TongGuan Tey, "Hybrid Text Chunking".