Adaptive Parser-Centric Text Normalization
![Page 1: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/1.jpg)
Adaptive Parser-Centric Text Normalization
Congle Zhang*, Tyler Baldwin**, Howard Ho**, Benny Kimelfeld**, Yunyao Li**
* University of Washington, ** IBM Research - Almaden
![Page 2: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/2.jpg)
• Text sources: public text, web text, private text (social media, news, SEC, USPTO, internal data, subscription data)
• Text analytics
• Applications: marketing, financial investment, drug discovery, law enforcement, …
Text analytics is the key to discovering hidden value in text
![Page 3: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/3.jpg)
DREAM
![Page 4: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/4.jpg)
REALITY
![Page 5: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/5.jpg)
Image from http://samasource.org
![Page 6: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/6.jpg)
CAN YOU READ THIS ON THE FIRST ATTEMPT?
![Page 7: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/7.jpg)
ay woundent of see ’ em
I would not have seen them.
![Page 8: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/8.jpg)
When a machine reads it
Results from Google translation:
• Chinese: 唉看见他们woundent
• Spanish: ay woundent de verlas
• Japanese: ローマ法王進呈の AY woundent
• Portuguese: ay woundent de vê-los
• German: ay woundent de voir 'em
![Page 9: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/9.jpg)
Text Normalization
• Informal writing → standard written form
• ay woundent of see ’ em → (normalize) → I would not have seen them.
![Page 10: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/10.jpg)
Challenge: Grammar
• Text normalization ≠ mapping out-of-vocabulary non-standard tokens to their in-vocabulary standard form
• Input: ay woundent of see ’ em
• Token mapping alone: would not of see them
• vs. the grammatical sentence: I would not have seen them.
![Page 11: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/11.jpg)
Challenge: Domain Adaptation
Tailor the same text normalization solution to the different writing styles of different data sources
![Page 12: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/12.jpg)
Challenge: Evaluation
• Previous: word error rate & BLEU score
• However,
– Words are not equally important
– Non-word information (punctuation, capitalization) can be important
– Word reordering is important
• How does the normalization actually impact the downstream applications?
![Page 13: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/13.jpg)
Adaptive Parser-Centric Text Normalization
• Grammatical sentence
• Domain transferable
• Parsing performance
![Page 14: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/14.jpg)
Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion
![Page 15: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/15.jpg)
Model: Replacement Generator
• Replacement <i,j,s>: replace tokens x_i … x_{j-1} with s
• Domain customization
– Generic (cross-domain) replacements
– Domain-specific replacements
Input: Ay(1) woudent(2) of(3) see(4) ‘em(5)

| Replacement | Type |
|---|---|
| <2,3,”would not”> | Edit |
| <1,2,”Ay”> | Same |
| <1,2,”I”> | Edit |
| <1,2,ε> | Delete |
| <6,6,”.”> | Insert |
| … | … |
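The triple notation can be made concrete with a small sketch. The `Replacement` class and its `kind` method are hypothetical names introduced here for illustration, and the empty string stands in for ε; the classification rules are inferred from the five labelled examples on the slide, not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replacement:
    """<i, j, s>: replace tokens x_i .. x_{j-1} (1-based) with the string s."""
    i: int
    j: int
    s: str  # the empty string stands in for ε (assumption of this sketch)

    def kind(self, tokens: List[str]) -> str:
        """Classify the replacement the way the slide labels its examples."""
        if self.i == self.j:
            return "Insert"   # covers no input token
        if self.s == "":
            return "Delete"   # covered tokens are dropped
        original = " ".join(tokens[self.i - 1:self.j - 1])
        return "Same" if self.s == original else "Edit"


tokens = ["Ay", "woudent", "of", "see", "'em"]
examples = [
    Replacement(2, 3, "would not"),
    Replacement(1, 2, "Ay"),
    Replacement(1, 2, "I"),
    Replacement(1, 2, ""),
    Replacement(6, 6, "."),
]
kinds = [r.kind(tokens) for r in examples]
```

With the slide's input, `kinds` comes out in the same order as the labels in the table: Edit, Same, Edit, Delete, Insert.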
![Page 16: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/16.jpg)
Model: Boolean Variables
• Associate a unique Boolean variable X_r with each replacement r
– X_r = true: replacement r is used to produce the output sentence
• Example: X_<2,3,”would not”> = true gives the output “… would not …”
![Page 17: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/17.jpg)
Model: Normalization Graph
• A graphical model
Input: Ay woudent of see ‘em
Graph nodes: *START*, <1,2,”Ay”>, <1,2,”I”>, <2,3,”would”>, <2,4,”would not have”>, <3,4,”of”>, <4,5,”seen”>, <4,6,”see him”>, <5,6,”them”>, <6,6,”.”>, *END*
![Page 18: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/18.jpg)
Model: Legal Assignment
• Soundness
– No two true replacements overlap
– <1,2,”Ay”> and <1,2,”I”> cannot both be true
• Completeness
– Every input token is captured by at least one true replacement
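The two conditions can be checked by counting how many true replacements cover each input token. This is a minimal sketch under stated assumptions: `is_legal` is a hypothetical helper, replacements are plain `(i, j, s)` tuples, and insertions (where i = j) cover no token, so they never violate either condition here.

```python
from typing import List, Tuple

Replacement = Tuple[int, int, str]  # (i, j, s) covers tokens i .. j-1, 1-based


def is_legal(chosen: List[Replacement], n_tokens: int) -> bool:
    """True if the chosen (true) replacements are sound and complete."""
    covered = [0] * (n_tokens + 1)  # index 0 is unused
    for i, j, _ in chosen:
        for pos in range(i, j):
            covered[pos] += 1
    sound = all(count <= 1 for count in covered[1:])     # no overlaps
    complete = all(count >= 1 for count in covered[1:])  # every token captured
    return sound and complete


path = [(1, 2, "I"), (2, 4, "would not have"), (4, 6, "see him"), (6, 6, ".")]
overlapping = [(1, 2, "Ay")] + path
incomplete = [(1, 2, "I"), (2, 3, "would")]
```

For the five-token input, `is_legal(path, 5)` holds, while the `overlapping` and `incomplete` assignments are rejected.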
![Page 19: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/19.jpg)
Model: Legal = Path
• A legal assignment: a path from start to end
(Figure: the normalization graph, with a path from *START* to *END*.)
Output: I would not have see him.
![Page 20: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/20.jpg)
Model: Assignment Probability
• Log-linear model; feature functions on edges
(Figure: the normalization graph.)
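The slide's formula did not survive in this transcript. A log-linear model over start-to-end paths is commonly written as below; the symbols y (a legal path), e (an edge on that path), f_k (feature functions), θ_k (weights), and Z_θ (the normalizer) are notation assumed for this sketch, not necessarily the authors'.

```latex
p_\theta(y \mid x) \;=\; \frac{1}{Z_\theta(x)}
  \exp\Big( \sum_{e \in y} \sum_{k} \theta_k \, f_k(e, x) \Big),
\qquad
Z_\theta(x) \;=\; \sum_{y'} \exp\Big( \sum_{e \in y'} \sum_{k} \theta_k \, f_k(e, x) \Big)
```

Under this form the score of a path is a sum of per-edge scores, so maximizing the probability is the same as maximizing total edge weight along the path.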
![Page 21: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/21.jpg)
![Page 22: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/22.jpg)
Inference
• Select the assignment with the highest probability
• Computationally hard on general graphical models …
• But in our model it boils down to finding the longest path in a weighted directed acyclic graph
![Page 23: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/23.jpg)
Inference
• Weighted longest path
(Figure: the normalization graph, with the selected path from *START* to *END*.)
Output: I would not have see him.
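A weighted longest path in a directed acyclic graph can be found with a topological sort followed by a dynamic program. The sketch below is illustrative only: `longest_path` is a hypothetical helper, nodes are abbreviated to their replacement strings, and the edge weights are made up so that the slide's output path wins.

```python
from collections import defaultdict

START, END = "*START*", "*END*"


def longest_path(edges, start=START, end=END):
    """Return (score, path) of the heaviest start-to-end path in a DAG.

    edges maps (u, v) pairs to a numeric weight.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = {start, end}
    for (u, v), w in edges.items():
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))

    # Kahn's algorithm: topological order of the DAG.
    frontier = [n for n in nodes if indeg[n] == 0]
    order = []
    while frontier:
        u = frontier.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)

    # Dynamic program: best[v] is the heaviest score of any start-to-v path.
    best, back = {start: 0.0}, {}
    for u in order:
        if u not in best:
            continue  # not reachable from start
        for v, w in succ[u]:
            if v not in best or best[u] + w > best[v]:
                best[v], back[v] = best[u] + w, u
    if end not in best:
        raise ValueError("end is not reachable from start")

    path = [end]
    while path[-1] != start:
        path.append(back[path[-1]])
    return best[end], path[::-1]


# Hypothetical weights, chosen only for illustration.
edges = {
    (START, "Ay"): 0.2, (START, "I"): 1.0,
    ("Ay", "would"): 0.3, ("Ay", "would not have"): 0.9,
    ("I", "would"): 0.3, ("I", "would not have"): 0.9,
    ("would", "of"): 0.1, ("of", "seen"): 0.2, ("of", "see him"): 0.4,
    ("would not have", "seen"): 0.3, ("would not have", "see him"): 0.8,
    ("seen", "them"): 0.3, ("them", "."): 0.5, ("see him", "."): 0.5,
    (".", END): 0.0,
}
score, path = longest_path(edges)
```

With these weights the selected path reads "I would not have see him ." between the start and end markers. Because the graph is acyclic, the search runs in time linear in the number of nodes and edges.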
![Page 24: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/24.jpg)
![Page 25: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/25.jpg)
Learning
• Perceptron-style algorithm
– Update the weights by comparing (1) the most probable output under the current weights with (2) the gold sequence
• Input: (1) informal sentence: Ay woudent of see ‘em; (2) gold sentence: I would not have seen them.; (3) graph
• Output: (1) weights of features
![Page 26: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/26.jpg)
Learning: Gold vs. Inferred
(Figure: the normalization graph with two paths marked: the gold sequence, and the most probable sequence with the current θ.)
![Page 27: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/27.jpg)
Learning: Update Weights on the Differential Edges
(Figure: the normalization graph with the edges on which the gold and inferred paths differ.)
• Increase w_i, so that the gold sequence becomes “longer”
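A perceptron-style update on the differential edges might look like the sketch below. The slide only mentions increasing weights so that the gold path gets longer; also decreasing the weights of features on the inferred path is the standard structured-perceptron rule and is an assumption here. The function names, feature names, and learning rate `lr` are all hypothetical.

```python
from collections import Counter


def path_features(path, edge_features):
    """Sum the feature vectors of all edges along a path."""
    total = Counter()
    for edge in zip(path, path[1:]):
        total.update(edge_features.get(edge, {}))
    return total


def perceptron_update(weights, gold_path, predicted_path, edge_features, lr=1.0):
    """Move weights toward the gold path and away from the predicted one.

    Edges shared by both paths cancel out, so only differential edges matter.
    """
    gold = path_features(gold_path, edge_features)
    pred = path_features(predicted_path, edge_features)
    new_weights = dict(weights)
    for name in set(gold) | set(pred):
        new_weights[name] = new_weights.get(name, 0.0) + lr * (gold[name] - pred[name])
    return new_weights


gold_path = ["*START*", "I", "would not have", "seen", "them", ".", "*END*"]
pred_path = ["*START*", "I", "would not have", "see him", ".", "*END*"]
# Hypothetical features, for illustration only.
edge_features = {
    ("*START*", "I"): {"lineage:edit": 1.0},
    ("would not have", "seen"): {"pos:verb_form": 1.0},
    ("seen", "them"): {"lineage:edit": 1.0},
    ("them", "."): {"lineage:insert_punct": 1.0},
    ("would not have", "see him"): {"lineage:slang": 1.0},
    ("see him", "."): {"lineage:insert_punct": 1.0},
}
weights = perceptron_update({}, gold_path, pred_path, edge_features)
```

Starting from empty weights, features that appear only on the gold path go up, the feature that appears only on the inferred path goes down, and the punctuation feature shared by both paths stays at zero.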
![Page 28: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/28.jpg)
![Page 29: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/29.jpg)
Instantiation: Replacement Generators

| Generator | From | To |
|---|---|---|
| leave intact | good | good |
| edit distance | bac | back |
| lowercase | NEED | need |
| capitalize | it | It |
| Google spell | dispaear | disappear |
| contraction | wouldn’t | would not |
| slang language | ima | I am going to |
| insert punctuation | ε | . |
| duplicated punctuation | !? | ! |
| delete filler | lmao | ε |
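A few of the simpler generators in the table can be sketched as dictionary lookups and string operations. Everything here is illustrative: `generate` is a hypothetical function, the three small dictionaries contain only the table's own examples, and each replacement carries the name of the generator that produced it as a fourth field.

```python
from typing import List, Tuple

# (i, j, s, generator name); the empty string stands in for ε.
Replacement = Tuple[int, int, str, str]

# Toy dictionaries holding only the examples from the table (assumption).
CONTRACTIONS = {"wouldn't": "would not"}
SLANG = {"ima": "I am going to"}
FILLERS = {"lmao"}


def generate(tokens: List[str]) -> List[Replacement]:
    """Propose candidate replacements for every token position."""
    out: List[Replacement] = []
    for pos, tok in enumerate(tokens, start=1):
        span = (pos, pos + 1)
        out.append((*span, tok, "leave intact"))
        if tok != tok.lower():
            out.append((*span, tok.lower(), "lowercase"))
        capitalized = tok[:1].upper() + tok[1:]
        if capitalized != tok:
            out.append((*span, capitalized, "capitalize"))
        key = tok.lower()
        if key in CONTRACTIONS:
            out.append((*span, CONTRACTIONS[key], "contraction"))
        if key in SLANG:
            out.append((*span, SLANG[key], "slang language"))
        if key in FILLERS:
            out.append((*span, "", "delete filler"))
    end = len(tokens) + 1
    out.append((end, end, ".", "insert punctuation"))  # insertion after the last token
    return out


candidates = generate(["ima", "NEED", "it", "lmao"])
```

The generators deliberately over-produce: choosing among the candidates is left to the model's weights, and recording the generator name makes a lineage feature possible.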
![Page 30: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/30.jpg)
Instantiation: Features
• N-gram
– Frequency of the phrases induced by an edge
• Part-of-speech
– Encourage certain behavior, such as avoiding the deletion of noun phrases
• Positional
– Capitalize words after stop punctuation
• Lineage
– Which generator spawned the replacement
![Page 31: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/31.jpg)
![Page 32: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/32.jpg)
Evaluation Metrics: Compare Parses
• Input sentence → human expert → gold sentence → parser → gold parse
• Input sentence → normalizer → normalized sentence → parser → normalized parse
• Compare the gold parse with the normalized parse
• Focus on subjects, verbs, and objects (SVO)
![Page 33: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/33.jpg)
Evaluation Metrics: Example

| | Test | Gold | SVO |
|---|---|---|---|
| Sentence | I kinda wanna get ipad NEW | I kind of want to get a new iPad. | |
| Verbs | verb(get) | verb(want), verb(get) | precision_v = 1/1, recall_v = 1/2 |
| Subjects and objects | subj(get,I), subj(get,wanna), obj(get,NEW) | subj(want,I), subj(get,I), obj(get,iPad) | precision_so = 1/3, recall_so = 1/3 |
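The numbers in this example follow from treating each parse as a set of relations and counting the overlap. The sketch below assumes exact string matching between relations; `precision_recall` is a hypothetical helper, not the authors' evaluation script.

```python
from fractions import Fraction


def precision_recall(test, gold):
    """Precision and recall of the test relations against the gold relations."""
    hits = len(test & gold)
    precision = Fraction(hits, len(test)) if test else Fraction(0)
    recall = Fraction(hits, len(gold)) if gold else Fraction(0)
    return precision, recall


test_verbs = {"verb(get)"}
gold_verbs = {"verb(want)", "verb(get)"}
test_so = {"subj(get,I)", "subj(get,wanna)", "obj(get,NEW)"}
gold_so = {"subj(want,I)", "subj(get,I)", "obj(get,iPad)"}

verb_scores = precision_recall(test_verbs, gold_verbs)  # (1, 1/2)
so_scores = precision_recall(test_so, gold_so)          # (1/3, 1/3)
```

Only subj(get,I) appears in both subject-object sets, which is why both precision_so and recall_so are 1/3.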
![Page 34: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/34.jpg)
Evaluation: Baselines
• w/oN: without normalization
• Google: Google spell checker
• w2wN: word-to-word normalization [Han and Baldwin 2011]
• Gw2wN: gold standard for word-to-word normalizations of previous work (whenever available)
![Page 35: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/35.jpg)
Evaluation: Domains
• Twitter [Han and Baldwin 2011]
– Gold: grammatical sentences
• SMS [Choudhury et al 2007]
– Gold: grammatical sentences
• Call-center log: proprietary
– Text-based responses about users’ experience with a call center for a major company
– Gold: grammatical sentences
![Page 36: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/36.jpg)
Evaluation: Twitter
• Twitter-specific replacement generators
– Hashtags (#), ats (@), and retweets (RT)
– Generators that allow either the initial symbol or the entire token to be deleted
![Page 37: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/37.jpg)
Evaluation: Twitter

| System | Verb Pre | Verb Rec | Verb F1 | Subject-Object Pre | Subject-Object Rec | Subject-Object F1 |
|---|---|---|---|---|---|---|
| w/oN | 83.7 | 68.1 | 75.1 | 31.7 | 38.6 | 34.8 |
| Google | 88.9 | 78.8 | 83.5 | 36.1 | 46.3 | 40.6 |
| w2wN | 87.5 | 81.5 | 84.4 | 44.5 | 58.9 | 50.7 |
| Gw2wN | 89.8 | 83.8 | 86.7 | 46.9 | 61.0 | 53.0 |
| generic | 91.7 | 88.9 | 90.3 | 53.6 | 70.2 | 60.8 |
| domain specific | 95.3 | 88.7 | 91.9 | 72.5 | 76.3 | 74.4 |

Domain-specific generators yielded the best overall performance
![Page 38: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/38.jpg)
Evaluation: Twitter
Without domain-specific generators, our system still outperformed the word-to-word normalization approaches (generic: F1 90.3 / 60.8; w2wN: 84.4 / 50.7)
![Page 39: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/39.jpg)
Evaluation: Twitter
Even perfect word-to-word normalization (Gw2wN) is not good enough!
![Page 40: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/40.jpg)
Evaluation: SMS
• SMS-specific replacement generator
– Mapping dictionary of SMS abbreviations
![Page 41: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/41.jpg)
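Evaluation: SMS

| System | Verb Pre | Verb Rec | Verb F1 | Subject-Object Pre | Subject-Object Rec | Subject-Object F1 |
|---|---|---|---|---|---|---|
| w/oN | 76.4 | 48.1 | 59.0 | 19.5 | 21.5 | 20.4 |
| Google | 85.1 | 61.6 | 71.5 | 22.4 | 26.2 | 24.1 |
| w2wN | 78.5 | 61.5 | 68.9 | 29.9 | 36.0 | 32.6 |
| Gw2wN | 87.6 | 76.6 | 81.8 | 38.0 | 50.6 | 43.4 |
| generic | 86.5 | 77.4 | 81.7 | 35.5 | 47.7 | 40.7 |
| domain specific | 88.1 | 75.0 | 81.0 | 41.0 | 49.5 | 44.8 |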
![Page 42: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/42.jpg)
Evaluation: Call-Center
• Call-center-specific generator
– Mapping dictionary of call-center abbreviations (e.g. “rep.” → “representative”)
![Page 43: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/43.jpg)
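Evaluation: Call-Center

| System | Verb Pre | Verb Rec | Verb F1 | Subject-Object Pre | Subject-Object Rec | Subject-Object F1 |
|---|---|---|---|---|---|---|
| w/oN | 98.5 | 97.1 | 97.8 | 69.2 | 66.1 | 67.6 |
| Google | 99.2 | 97.9 | 98.5 | 70.5 | 67.3 | 68.8 |
| generic | 98.9 | 97.4 | 98.1 | 71.3 | 67.9 | 69.6 |
| domain specific | 99.2 | 97.4 | 98.3 | 87.9 | 83.1 | 85.4 |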
![Page 44: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/44.jpg)
Discussion
• Domain transfer with a small amount of effort is possible
• Performing normalization is indeed beneficial to dependency parsing
– Simple word-to-word normalization is not enough
![Page 45: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/45.jpg)
Conclusion
• Normalization framework with an eye toward domain adaptation
• Parser-centric view of normalization
• Our system outperformed competitive baselines over three different domains
• Dataset to spur future research
– https://www.cs.washington.edu/node/9091/
![Page 46: Adaptive Parser-Centric Text Normalization](https://reader035.fdocuments.net/reader035/viewer/2022070318/55783d44d8b42a1f5b8b4c9b/html5/thumbnails/46.jpg)
Team