Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection


Description

This paper presents a novel way of improving POS tagging on heterogeneous data. First, two separate models are trained (generalized and domain-specific) from the same data set by controlling lexical items with different document frequencies. During decoding, one of the models is selected dynamically given the cosine similarity between each sentence and the training data. This dynamic model selection approach, coupled with a one-pass, left-to-right POS tagging algorithm, is evaluated on corpora from seven different genres. Even with this simple tagging algorithm, our system shows comparable results against other state-of-the-art systems, and gives higher accuracies when evaluated on a mixture of the data. Furthermore, our system is able to tag about 32K tokens per second. We believe that this model selection approach can be applied to more sophisticated tagging algorithms and improve their robustness even further.

Transcript of Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection


Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection

Jinho D. Choi and Martha Palmer
Institute of Cognitive Science, University of Colorado Boulder

Supervised Learning

Domain Adaptation

Dynamic Model Selection

Part-of-speech Tagging

Training

Decoding

Dynamic Model Selection

Experimental setup

• Training corpus: the Wall Street Journal Sections 2-21 from OntoNotes v4.0 (731,677 tokens, 30,060 sentences).
• Tagging algorithm: a one-pass, left-to-right POS tagging algorithm.
• Machine learning algorithm: LIBLINEAR L2-regularized, L1-loss support vector classification.
• Evaluation corpora: seven genres, listed in the table under Experiments.
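The one-pass, left-to-right tagging loop can be sketched as below. This is a minimal sketch, not the authors' implementation: `classify` is a stand-in for the trained LIBLINEAR classifier, and the feature set (here just the token and the previous tag) is an illustrative assumption.

```python
def tag_left_to_right(tokens, classify):
    """One-pass, left-to-right tagging: each token is classified once,
    and features may include tags already assigned to its left."""
    tags = []
    for i, token in enumerate(tokens):
        # the previously predicted tag is available as a feature;
        # "<s>" is a hypothetical sentence-initial placeholder
        prev = tags[i - 1] if i > 0 else "<s>"
        tags.append(classify(token, prev))
    return tags
```

Because each token is visited exactly once and no search over tag sequences is performed, decoding is linear in sentence length, which is what makes the reported tagging speed possible.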

Comparisons

Experiments

Conclusion

• Our dynamic model selection approach improves the robustness of POS tagging on heterogeneous data and shows noticeably faster tagging speed than the two other systems.
• We believe that this approach can be applied to more sophisticated tagging algorithms and improve their robustness even further.

ClearNLP

• Open source projects: clearnlp.googlecode.com, clearparser.googlecode.com
• Contact: Jinho D. Choi ([email protected])


Simplified word form

• In a simplified word form, all numerical expressions are replaced with 0.
• A lowercase simplified word form (LSW) is a decapitalized simplified word form.
• Simplified word forms give more generalization to lexical features than their original forms.

Regular expressions

• A simplified word form is derived by applying the following regular expressions sequentially to the original word form, w.
• ‘replaceAll’ is a function that replaces all matches of the regular expression (the 1st parameter) in w with the specified string (the 2nd parameter).

1. w.replaceAll(\d%, 0) e.g., 1% → 0

2. w.replaceAll(\$\d, 0) e.g., $1 → 0

3. w.replaceAll(^\.\d, 0) e.g., .1 → 0

4. w.replaceAll(\d(,|:|-|\/|\.)\d, 0) e.g., 1,2|1:2|1-2|1/2|1.2 → 0

5. w.replaceAll(\d+, 0) e.g., 1234 → 0
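The five rules above can be sketched in Python. The function names `simplify` and `lsw` are hypothetical, and reapplying each rule to a fixed point is an assumption for nested cases like 1,2,3 (the poster only says the rules are applied sequentially).

```python
import re

# Rules in the order given on the poster; each replaces a match with "0".
RULES = [
    re.compile(r"\d%"),               # 1.  1%   -> 0
    re.compile(r"\$\d"),              # 2.  $1   -> 0
    re.compile(r"^\.\d"),             # 3.  .1   -> 0
    re.compile(r"\d(,|:|-|/|\.)\d"),  # 4.  1,2 | 1:2 | 1-2 | 1/2 | 1.2 -> 0
    re.compile(r"\d+"),               # 5.  1234 -> 0
]

def simplify(word):
    """Simplified word form: apply each rule until it no longer fires."""
    for pattern in RULES:
        prev = None
        while prev != word:  # reapply to a fixed point (assumption)
            prev = word
            word = pattern.sub("0", word)
    return word

def lsw(word):
    """Lowercase simplified word form (LSW)."""
    return simplify(word).lower()
```

For example, `simplify("1,2")` yields `"0"`, and `lsw("A1")` yields `"a0"`, so many distinct numeric tokens collapse onto a single lexical feature.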

Pre-processing

[Diagram: Supervised learning vs. domain adaptation. A model trained on the training data performs GOOD on in-domain target data but POOR on other target data. Adapting the training data to each target domain (training data′ → model′, training data′′ → model′′) yields FAIR results on each, but raises two questions: How many models do we need to build? Do we always know about the target data? Our approach does not assume the target data: from the training data alone, a domain-specific Model D (GOOD in domain, FAIR? out of domain) and a generalized Model G (FAIR? across domains) are built, and one of the two models is selected dynamically.]

Tagging accuracies of all tokens (in %)

           BC     BN     CN     MD     MZ     NW     WB     Total
Model D    91.81  95.27  87.36  90.74  93.91  97.45  93.93  92.97
Model G    92.65  94.82  88.24  91.46  93.24  97.11  93.51  93.05
G over D   50.63  36.67  68.80  40.22  21.43   9.51  36.02  41.74
Model S    92.26  95.13  88.18  91.34  93.88  97.46  93.90  93.21
Stanford   87.71  95.50  88.49  90.86  92.80  97.42  94.01  92.50
SVMTool    87.82  95.13  87.86  90.54  92.94  97.31  93.99  92.32

Evaluation corpora

Genre                          All Tokens  Unknown Tok's  Sentences
BN  Broadcasting news          31,704      3,077          2,076
BC  Broadcasting conversation  31,328      1,284          1,969
CN  Clinical notes             35,721      6,077          3,170
MD  Medpedia articles          34,022      4,755          1,850
MZ  Magazine                   32,120      2,663          1,409
NW  Newswire                   39,590        983          1,640
WB  Web-text                   34,707      2,609          1,738

Tagging accuracies of unknown tokens (in %)

           BC     BN     CN     MD     MZ     NW     WB     Total
Model S    60.97  77.73  68.69  67.30  75.97  88.40  76.27  70.54
Stanford   19.24  87.31  71.20  64.82  66.28  88.40  78.15  64.32
SVMTool    19.08  78.35  66.51  62.94  65.23  86.88  76.47  47.65

Tagging speeds (tokens / sec.)

Stanford     421
SVMTool    1,163
Model S   31,914

Acknowledgments

• This work was supported by the SHARP program funded by ONC: 90TR0002/01. The content is solely the responsibility of the authors and does not necessarily represent the official views of the ONC.

[Diagram: Training. (1) Separate the training data into documents 1..N. (2) Extract two sets of lexical features by document frequency: DF(LSW) > th_D and DF(LSW) > th_G. (3) Build two models: the domain-specific Model D, using lexical features whose DF(LSW) > 1, and the generalized Model G, using lexical features whose DF(LSW) > 2.]
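The document-frequency split of the lexical features can be sketched as below. This is a minimal sketch assuming documents are given as lists of LSWs; the function name is hypothetical, and the default thresholds follow the poster (th_D = 1, th_G = 2).

```python
from collections import Counter

def split_lexical_features(documents, th_d=1, th_g=2):
    """Return the lexical feature sets for the domain-specific model
    (DF(LSW) > th_d) and the generalized model (DF(LSW) > th_g),
    where DF is the number of documents an LSW appears in."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each LSW at most once per document
    feats_d = {w for w, c in df.items() if c > th_d}
    feats_g = {w for w, c in df.items() if c > th_g}
    return feats_d, feats_g
```

Since th_G > th_D, Model G's features are a subset of Model D's: Model G keeps only the more widely attested (hence more general) lexical items, while Model D also retains domain-specific ones.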

[Diagram: Decoding. For each input sentence: is the cosine similarity between the LSWs of the input sentence and those of Model D greater than a threshold? If YES, Model D tags the sentence; if NO, Model G does. Either way, the tagged output sentences are produced.]
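The decoding-time selection can be sketched as follows, assuming Model D's lexical features are available as a bag of LSWs. The function names and the default threshold are illustrative; the poster does not give the threshold value.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_model(sentence_lsws, model_d_lsws, threshold=0.1):
    """Pick Model D for in-domain-looking sentences, else Model G."""
    sim = cosine(Counter(sentence_lsws), Counter(model_d_lsws))
    return "D" if sim > threshold else "G"
```

A sentence sharing many LSWs with Model D's lexicon (e.g., newswire vocabulary) is routed to the domain-specific model, while out-of-domain sentences with little lexical overlap fall back to the generalized model.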