Date posted: 19-Dec-2015
Asma Naseer
CHUNKING: SHALLOW PARSING
INTRODUCTION
Shallow parsing, also called partial parsing
First proposed by Steven Abney (1991)
Breaks text up into small pieces
Each piece is parsed separately [1]
INTRODUCTION (CONTINUED)
Words are not arranged flatly in a sentence but are grouped into smaller parts called phrases
The girl was playing in the street
اس نے احمد کو کتاب دی ("He gave the book to Ahmad")
INTRODUCTION (CONTINUED)
Chunks are non-recursive (a chunk does not contain a phrase of the same category as itself)
NP → D? AdjP? AdjP? N
The big red balloon
[NP [D The] [AdjP [Adj big]] [AdjP [Adj red]] [N balloon]] [1]
INTRODUCTION (CONTINUED)
Each phrase is dominated by a head h
A man proud of his son.
A proud man
The root of the chunk has h as its s-head (semantic head)
The head of a noun phrase is usually a noun or pronoun [1]
CHUNK TAGGING
Tagging schemes: IOBE, IOB, IO
CHUNK TAGGING (CONTINUED)
IOB (Inside, Outside, Begin)
I-NP, O-NP, B-NP
I-VP, O-VP, B-VP
قائد اعظم محمد علی جناح نے قوم سے خطاب کیا ("Quaid-e-Azam Muhammad Ali Jinnah addressed the nation")
[B-NP قائد اعظم] [I-NP محمد] [I-NP علی] [I-NP جناح]
[O-NP نے] [B-NP قوم] [O-NP سے] [B-NP خطاب]
[O-NP کیا]
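The IOB scheme can be sketched in a few lines of Python. This is only an illustration: the function name `to_iob` and the input format (a list of `(label, words)` pairs, with `None` marking words outside any chunk) are assumptions, not part of the slides. The sketch emits a bare `O` for outside words, as in CoNLL-style data, where the slides write `O-NP`.

```python
def to_iob(chunks):
    """Convert (label, words) pairs into per-word IOB tags.

    `label` is a chunk type such as "NP", or None for words outside any chunk.
    """
    tagged = []
    for label, words in chunks:
        for position, word in enumerate(words):
            if label is None:
                tag = "O"                 # outside any chunk
            elif position == 0:
                tag = "B-" + label        # first word of a chunk
            else:
                tag = "I-" + label        # later word inside the same chunk
            tagged.append((word, tag))
    return tagged


# Hypothetical chunking of the example sentence from the introduction.
sentence = [("NP", ["The", "girl"]), ("VP", ["was", "playing"]),
            (None, ["in"]), ("NP", ["the", "street"])]
iob = to_iob(sentence)
```

Dropping the `B-` prefix and tagging every chunk word with `I-` would give the simpler IO scheme, at the cost of not being able to separate two adjacent chunks of the same type.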
RESEARCH WORK
Rule Based Vs Statistical Based Chunking [2]
Use of Support Vector Learning for Chunk Identification [5]
A Context Based Maximum Likelihood Approach to Chunking [6]
Chunking with Maximum Entropy Models [7]
Single-Classifier Memory-Based Phrase Chunking [8]
Hybrid Text Chunking [9]
Shallow Parsing as POS Tagging [3]
RULE BASED VS STATISTICAL BASED CHUNKING
Two techniques are used
Regular expression rules
○ Shallow parse based on regular expressions
N-gram statistical tagger (machine-based chunking)
○ NLTK (Natural Language Toolkit) based on TnT Tagger (Trigrams'n'Tags)
○ Basic idea: reuse a POS tagger for chunking
RULE BASED VS STATISTICAL BASED CHUNKING (CONTINUED)
Regular expression rules
○ Regular expressions must be developed manually
N-gram statistical tagger
○ Can be trained on gold-standard chunked data
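A hand-written regular-expression rule might look like the sketch below. It uses only Python's standard `re` module rather than any particular toolkit; the rule itself (optional determiner, any number of adjectives, one or more nouns), the Penn Treebank style tag names and the helper name `regex_np_chunks` are assumptions chosen for illustration.

```python
import re

# Hand-written NP rule over POS tags:
# optional determiner, any number of adjectives, one or more nouns.
NP_RULE = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NNS?>)+")


def regex_np_chunks(tagged):
    """Return (start, end) token spans matched by NP_RULE; end is exclusive."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    spans = []
    for match in NP_RULE.finditer(tag_string):
        # Counting "<" before a character offset recovers the token index.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        spans.append((start, end))
    return spans


tagged = [("The", "DT"), ("big", "JJ"), ("red", "JJ"), ("balloon", "NN"),
          ("floated", "VBD"), ("over", "IN"), ("the", "DT"), ("street", "NN")]
np_spans = regex_np_chunks(tagged)
```

Every new construction needs another rule written by hand, which is the cost the slide points to; a statistical tagger instead learns its parameters from chunked training data.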
RULE BASED VS STATISTICAL BASED CHUNKING (CONTINUED)
Focus is on verb and noun phrase chunking
Noun phrases
A noun or pronoun is the head
Also contain
○ Determiners, i.e. articles, demonstratives, numerals, possessives and quantifiers
○ Adjectives
○ Complements (adpositional phrases, relative clauses)
Verb phrases
The verb is the head
Often one or two complements
Any number of adjuncts
RULE BASED VS STATISTICAL BASED CHUNKING (CONTINUED)
Training NLTK on chunk data
Starts with an empty rule set
○ 1. Define or refine a rule
○ 2. Execute the chunker on the training data
○ 3. Compare results with the previous run
Repeat steps 1 to 3 until performance no longer improves significantly
Issues: 211,727 phrases in total; a subset of 1,000 phrases was used
RULE BASED VS STATISTICAL BASED CHUNKING (CONTINUED)
Training TnT on chunk data
Chunking is treated as statistical tagging
Two steps
○ Parameter generation: create model parameters from the training corpus
○ Tagging: tag each word with a chunk label
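The two steps can be illustrated with a deliberately small model. The sketch below is not TnT's algorithm (TnT is a trigram tagger); it swaps in a greedy bigram model with unigram back-off, purely to show parameter generation followed by tagging. The function names and the toy training data are assumptions.

```python
from collections import Counter, defaultdict


def generate_parameters(training_sentences):
    """Step 1: count chunk labels per (previous label, POS tag) and per POS tag."""
    bigram = defaultdict(Counter)
    unigram = defaultdict(Counter)
    for sentence in training_sentences:
        previous = "<s>"                      # sentence-start marker
        for pos, label in sentence:
            bigram[(previous, pos)][label] += 1
            unigram[pos][label] += 1
            previous = label
    return bigram, unigram


def tag_sentence(pos_tags, bigram, unigram, default="O"):
    """Step 2: greedily label each POS tag, backing off from bigram to unigram counts."""
    labels = []
    previous = "<s>"
    for pos in pos_tags:
        if (previous, pos) in bigram:
            label = bigram[(previous, pos)].most_common(1)[0][0]
        elif pos in unigram:
            label = unigram[pos].most_common(1)[0][0]
        else:
            label = default                   # POS tag never seen in training
        labels.append(label)
        previous = label
    return labels


# Toy training corpus of (POS tag, chunk label) pairs, invented for illustration.
train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBD", "B-VP"), ("IN", "O"), ("DT", "B-NP"), ("NN", "I-NP")],
]
bigram, unigram = generate_parameters(train)
predicted = tag_sentence(["DT", "NN", "VBD"], bigram, unigram)
```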
RULE BASED VS STATISTICAL BASED CHUNKING (CONTINUED)
Data set
WSJ: Wall Street Journal newspaper, NY
○ US
○ International business
○ Financial news
Training: sections 15-18
Testing: section 20
Both tagged with POS and IOB
Special characters are treated as other POS; punctuation marks are tagged as O
RULE BASED VS STATISTICAL BASED CHUNKING (CONTINUED)
Results
Precision: P = |reference ∩ test| / |test|
Recall: R = |reference ∩ test| / |reference|
F-measure: Fα = 1 / (α/P + (1-α)/R), with α = 0.5
F-rate: F = (2 * P * R) / (R + P)
RULE BASED VS STATISTICAL BASED CHUNKING (CONTINUED)
Results
NLTK
      P         R         F-Measure
VP    79.3 %    80.1 %    79.7 %
NP    76.5 %    84.4 %    80.3 %
TnT
      P         R         F-Measure
VP    79.59 %   82.35 %   80.95 %
NP    78.36 %   76.76 %   77.55 %
USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION
SVM (large margin classifiers)
Introduced by Vapnik (1995)
Two-class pattern recognition problem
Good generalization performance
High accuracy in text categorization without overfitting (Joachims, 1998; Taira and Haruno, 1999)
USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION (CONTINUED)
Training data: (x1, y1), ..., (xl, yl), with xi ∈ R^n and yi ∈ {+1, -1}
xi is the i-th sample, represented by an n-dimensional vector
yi is the class label (positive or negative) of the i-th sample
In an SVM
○ Positive and negative examples are separated by a hyperplane
○ The SVM finds the optimal hyperplane
USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION (CONTINUED)
[Figure: two possible separating hyperplanes]
USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION (CONTINUED)
Chunks in the CoNLL-2000 shared task are IOB tagged
Each chunk type belongs to either I or B, e.g. I-NP or B-NP
22 types of chunks are found in CoNLL-2000
The chunking problem is the classification of these 22 types
An SVM is a binary classifier, so it is extended to k classes
One class vs. all others
Pairwise classification
○ k * (k-1) / 2 classifiers: 22 * 21 / 2 = 231 classifiers
○ Majority decides the final class
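Pairwise classification with majority voting can be sketched as follows. The binary decision function here is a toy stand-in for a trained SVM, and all names (`pairwise_classifier_count`, `majority_vote`, `toy_decide`) and the scores are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_classifier_count(k):
    """Number of one-vs-one classifiers needed for k classes."""
    return k * (k - 1) // 2


def majority_vote(classes, decide, sample):
    """Run every pairwise classifier on `sample` and return the class with most votes.

    `decide(a, b, sample)` stands in for a trained binary SVM and must return a or b.
    """
    votes = Counter(decide(a, b, sample) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Toy stand-in for trained SVMs: prefer the class with the higher score.
def toy_decide(a, b, sample):
    return a if sample[a] >= sample[b] else b


classes = ["B-NP", "I-NP", "B-VP", "O"]
scores = {"B-NP": 0.2, "I-NP": 0.9, "B-VP": 0.1, "O": 0.4}
winner = majority_vote(classes, toy_decide, scores)
n_classifiers = pairwise_classifier_count(22)
```

With 22 chunk classes, `pairwise_classifier_count(22)` gives the 231 classifiers quoted on the slide.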
USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION (CONTINUED)
Feature vector consists of
○ Words: w
○ POS tags: t
○ Chunk tags: c
To identify chunk ci at the i-th word
○ wj, tj (j = i-2, i-1, i, i+1, i+2)
○ cj (j = i-2, i-1)
All features are expanded to binary values, either 0 or 1
The total dimension of the feature vector becomes 92,837
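A minimal sketch of this feature window and its binary expansion is shown below. The feature-name format (`w[-1]=big` and so on), the helper names and the tiny feature index are assumptions; the real system builds its index over the whole training corpus.

```python
def window_features(words, pos_tags, chunk_tags, i):
    """Collect the context features used to predict the chunk tag at position i.

    Words and POS tags come from positions i-2..i+2; chunk tags only from the
    already-decided positions i-2 and i-1. Out-of-range positions are skipped.
    """
    features = set()
    for offset in range(-2, 3):
        j = i + offset
        if 0 <= j < len(words):
            features.add("w[%+d]=%s" % (offset, words[j]))
            features.add("t[%+d]=%s" % (offset, pos_tags[j]))
    for offset in (-2, -1):
        j = i + offset
        if 0 <= j < len(chunk_tags):
            features.add("c[%d]=%s" % (offset, chunk_tags[j]))
    return features


def to_binary_vector(features, feature_index):
    """Expand named features into a 0/1 vector using a fixed feature-to-column map."""
    vector = [0] * len(feature_index)
    for name in features:
        if name in feature_index:
            vector[feature_index[name]] = 1
    return vector


words = ["The", "big", "red", "balloon"]
pos_tags = ["DT", "JJ", "JJ", "NN"]
chunk_tags = ["B-NP", "I-NP"]          # tags decided so far, left to right
features = window_features(words, pos_tags, chunk_tags, 2)
feature_index = {name: column for column, name in enumerate(sorted(features))}
vector = to_binary_vector(features, feature_index)
```

Because every distinct word, tag and chunk label at every window position gets its own column, the vector is very sparse and its dimension grows with the vocabulary, which is how the count reaches tens of thousands.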
USE OF SUPPORT VECTOR LEARNING FOR CHUNK IDENTIFICATION (CONTINUED)
Results
It took about 1 day to train the 231 classifiers on a PC running Linux (Celeron 500 MHz, 512 MB)
Chunk types: ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP
Precision = 93.45 %
Recall = 93.51 %
Fβ=1 = 93.48 %
A CONTEXT BASED MAXIMUM LIKELIHOOD APPROACH TO CHUNKING
Training
Based on POS tags
Construct symmetric n-contexts from the training corpus
○ 1-context: most common chunk label for each tag
○ 3-context: the tag together with the tags before and after it [t-1, t0, t+1]
○ 5-context: [t-2, t-1, t0, t+1, t+2]
○ 7-context: [t-3, t-2, t-1, t0, t+1, t+2, t+3]
A CONTEXT BASED MAXIMUM LIKELIHOOD APPROACH TO CHUNKING (CONTINUED)
Training
For each context, find the most frequent label
○ CC → [O CC]
○ PRP CC RP → [B-NP CC]
To save storage space, an n-context is added only if it differs from its nearest lower-order context
A CONTEXT BASED MAXIMUM LIKELIHOOD APPROACH TO CHUNKING (CONTINUED)
Testing
Construct the maximum context for each tag
Look it up in the database of most likely patterns
If the largest context is not found, the context is diminished step by step
The only rule for chunk labeling is to look up [t-3, t-2, t-1, t0, t+1, t+2, t+3] ... [t0] until the context is found
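The train-then-back-off procedure can be sketched as below. This is an illustrative reading of the slides, not the author's implementation: it stores every context rather than pruning those that agree with their lower-order context, pads sentence edges with a `<pad>` symbol, and uses invented toy data.

```python
from collections import Counter, defaultdict


def contexts(tags, i, max_width=3):
    """Yield symmetric contexts around position i, widest first: 7-, 5-, 3-, 1-context."""
    padded = ["<pad>"] * max_width + list(tags) + ["<pad>"] * max_width
    center = i + max_width
    for width in range(max_width, -1, -1):
        yield tuple(padded[center - width:center + width + 1])


def train(sentences, max_width=3):
    """Map every observed context to its most frequent chunk label."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tags = [tag for tag, _ in sentence]
        for i, (_, label) in enumerate(sentence):
            for context in contexts(tags, i, max_width):
                counts[context][label] += 1
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def label_tags(tags, table, max_width=3, default="O"):
    """Label each tag using the widest context that is present in the table."""
    labels = []
    for i in range(len(tags)):
        label = default
        for context in contexts(tags, i, max_width):
            if context in table:
                label = table[context]
                break
        labels.append(label)
    return labels


# Toy corpus of (POS tag, chunk label) pairs, invented for illustration.
corpus = [
    [("NN", "B-NP"), ("CC", "O"), ("NN", "B-NP")],
    [("NN", "B-NP"), ("CC", "O"), ("NN", "B-NP")],
    [("PRP", "B-NP"), ("CC", "B-NP"), ("RP", "B-PRT")],
]
table = train(corpus)
seen = label_tags(["PRP", "CC", "RP"], table)     # wide context is found
unseen = label_tags(["VB", "CC", "VB"], table)    # falls back to the 1-context
```

In the second call, no wider context around `CC` was seen in training, so the lookup shrinks to the 1-context and returns the most common label for `CC` on its own.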
A CONTEXT BASED MAXIMUM LIKELIHOOD APPROACH TO CHUNKING (CONTINUED)
Results
The best results are achieved for the 5-context
Chunk types: ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP
○ Precision = 86.24%
○ Recall = 88.25%
○ Fβ=1 = 87.23%
CHUNKING WITH MAXIMUM ENTROPY MODELS
Maximum entropy models are exponential models
Collect as much information as possible
○ Frequencies of events relevant to the process
A MaxEnt model has the form
$P(w \mid h) = \frac{1}{Z(h)} \exp\left(\sum_i \lambda_i f_i(h, w)\right)$
$f_i(h, w)$ is a binary-valued feature describing an event
$\lambda_i$ describes how important $f_i$ is
$Z(h)$ is a normalization factor
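The exponential form can be evaluated directly once features and weights are given. In the sketch below the two feature functions and their weights are hypothetical; in practice the weights λi are estimated from training data, which this example does not attempt.

```python
import math


def maxent_probabilities(history, outcomes, features, weights):
    """Return P(w | h) for each outcome w under an exponential (MaxEnt) model.

    `features` is a list of binary functions f_i(h, w); `weights` holds the
    matching lambda_i values.
    """
    scores = {
        w: math.exp(sum(lam * f(history, w) for f, lam in zip(features, weights)))
        for w in outcomes
    }
    z = sum(scores.values())              # normalization factor Z(h)
    return {w: score / z for w, score in scores.items()}


# Hypothetical features for illustration only.
features = [
    lambda h, w: 1 if h["pos"] == "DT" and w == "B-NP" else 0,
    lambda h, w: 1 if h["prev_chunk"] == "B-NP" and w == "I-NP" else 0,
]
weights = [2.0, 1.5]
probs = maxent_probabilities({"pos": "DT", "prev_chunk": "O"},
                             ["B-NP", "I-NP", "O"], features, weights)
```

Only the first feature fires for this history, so `B-NP` receives the score e^2 while the other two outcomes keep the score 1, and dividing by Z(h) turns the scores into probabilities.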
CHUNKING WITH MAXIMUM ENTROPY MODELS (CONTINUED)
Attributes used
Information in the WSJ corpus
○ Current word
○ POS tag of the current word
○ Surrounding words
○ POS tags of the surrounding words
Context
○ Left context: 3 words
○ Right context: 2 words
Additional information
○ Chunk tags of the previous 2 words
CHUNKING WITH MAXIMUM ENTROPY MODELS (CONTINUED)
Results
Tagging accuracy = 95.5%
○ (number of correctly tagged words) / (total number of words)
Recall = 91.86%
○ (number of correct proposed base NPs) / (number of correct base NPs)
Precision = 92.08%
○ (number of correct proposed base NPs) / (number of proposed base NPs)
Fβ=1 = 91.97%
○ $F_\beta = \frac{(\beta^2 + 1) \cdot \text{Recall} \cdot \text{Precision}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$
HYBRID TEXT CHUNKING
Context-based lexicon and HMM-based chunker
Statistics were used for chunking by Church (1988)
○ Corpus frequencies were used
○ Non-recursive noun phrases were identified
Skut & Brants (1998) modified Church's approach and used a Viterbi tagger
HYBRID TEXT CHUNKING (CONTINUED)
Error-driven HMM-based text chunker
○ Memory is decreased by keeping only positive lexical entries
HMM-based text chunker with context-dependent lexicon
○ Given $G_1^n = g_1, g_2, \ldots, g_n$
○ Find the optimal sequence $T_1^n = t_1, t_2, \ldots, t_n$
○ Maximize $\log P(T_1^n \mid G_1^n)$
$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) + \log \frac{P(T_1^n, G_1^n)}{P(T_1^n)\, P(G_1^n)}$
SHALLOW PARSING AS POS TAGGING
CoNLL-2000: used for testing and training
Ratnaparkhi’s maximum entropy based POS tagger
○ No change in internal operation
○ Information for training is increased
SHALLOW PARSING AS POS TAGGING (CONTINUED)
Shallow parsing vs POS tagging
○ Shallow parsing requires more of the surrounding POS/lexical syntactic environment
Training configurations
○ Words: w1 w2 w3
○ POS tags: t1 t2 t3
○ Chunk types: c1 c2 c3
○ Suffixes or prefixes
SHALLOW PARSING AS POS TAGGING (CONTINUED)
Amount of information is gradually increased
Word w1
Tag t1
Word, tag, chunk label (w1 t1 c1)
○ Current chunk label is accessed through another model with configurations of words and tags (w1 t1)
To deal with sparseness
○ t1, t2
○ c1
○ c2 (last two letters)
○ w1 (first two letters)
SHALLOW PARSING AS POS TAGGING (CONTINUED)
[Slides illustrating each configuration in turn: Word w1; Tag t1; (w1 t1 c1); Sparseness handling]
SHALLOW PARSING AS POS TAGGING (CONTINUED)
Overall results
Configuration         Precision   Recall    Fβ=1
Word w1               88.06%      88.71%    88.38%
Tag t1                88.15%      88.07%    88.11%
(w1 t1 c1)            89.79%      90.70%    90.24%
Sparseness handling   91.65%      92.23%    91.94%
SHALLOW PARSING AS POS TAGGING (CONTINUED)
Error analysis
Three groups of errors
Difficult syntactic constructs
○ Punctuation
○ Treatment of ditransitive VPs and transitive VPs
○ Adjective vs. adverbial phrases
Mistakes made in training or testing by the annotator
○ Noise
○ POS errors
○ Odd annotation decisions
Errors peculiar to the approach
○ The exponential distribution assigns non-zero probability to all events
○ The tagger may assign illegal chunk labels (I-NP while w is not NP)
SHALLOW PARSING AS POS TAGGING (CONTINUED)
Comments
PPs are easy to identify
ADJP and ADVP are hard to identify correctly (more syntactic information is required)
Performance on NPs can be further improved
Performance using w1 or t1 alone is almost the same; using both features improves performance
REFERENCES
[1] Philip Brooks, “A Simple Chunk Parser”, May 8, 2003.
[2] Igor Boehm, “Rule Based vs. Statistical Chunking of CoNLL Data Set”.
[3] Miles Osborne, “Shallow Parsing as POS Tagging”.
[4] Hans van Halteren, “Chunking with WPDV Models”.
[5] Taku Kudoh and Yuji Matsumoto, “Use of Support Vector Learning for Chunk Identification”, in Proceedings of CoNLL-2000 and LLL-2000, pages 142-144, Portugal, 2000.
[6] Christer Johansson, “A Context Sensitive Maximum Likelihood Approach to Chunking”.
[7] Rob Koeling, “Chunking with Maximum Entropy Models”.
[8] Jorn Veenstra and Antal van den Bosch, “Single-Classifier Memory-Based Phrase Chunking”.
[9] GuoDong Zhou, Jian Su and TongGuan Tey, “Hybrid Text Chunking”.