Constantine Lignos, University of Pennsylvania
Department of Computer and Information Science, Institute for Research in Cognitive Science
Johns Hopkins Human Language Technology Center of Excellence
2/18/2013
Symbolic Constraints and Statistical Methods: Use Together for Best Results
I. Introduction

My focus: computationally efficient solutions to hard natural language problems.
Insane Heuristic (Siklóssy, 1974): “To solve a problem, solve a harder problem.”
bake BAKE
baker BAKE +(er)
bakers BAKE +(er) +(s)
baking BAKE +(ing)
bakes BAKE +(s)
Morphology learning: MORSEL

Unsupervised learning of morphological structure, state-of-the-art for Finnish and English. https://github.com/ConstantineLignos/MORSEL (BUCLD 2009; Morpho Challenge 2009, best paper 2010; Research on Language and Computation, 2010)
Symbolic constraints and statistical methods

• With the right constraints on the learning problem, simple learning strategies work
• Where should these constraints come from?
• Common practice: error-motivated
  • Posterior regularization (Graça et al., 2011): constrain POS tags to have a verb in every sentence, sparse tag distribution
  • Constrain a word segmenter to put a vowel in every word (Venkataraman, 2001)
• I suggest: all language comes from the mind; search within its constraints.
II. Infant word segmentation

Word segmentation is an unsexy problem.

Whereareyougoing? Howdoesabunnyrabbitwalk? Doeshewalklikeyouordoeshegohophophop? Whatareyoudoing? Sweepbroom. Isthatabroom? Ithoughtitwasabrush. — Adam’s mother (Brown, 1973)
Where are you going? How does a bunny rabbit walk? Does he walk like you or does he go hop hop hop? What are you doing? Sweep broom. Is that a broom? I thought it was a brush. Adam’s mother (Brown, 1973)
Modeling of WS

• Transitional probability minima (Saffran, 1996 et seq.; Swingley, 2005) by themselves are a dead end (Yang, 2004)
• Likewise for phonotactics-only accounts (Daland and Pierrehumbert, 2011)
• Models using a lexicon and Bayesian inference (Borschinger and Johnson, 2011; Goldwater et al., 2009; Johnson and Goldwater, 2009; Pearl et al., 2011)
  • Goal of the learner is to develop a high-quality lexicon for segmentation; context can be crucial
  • Integrates progress in unsupervised learning (Teh, 2006)
  • High performance. Developmental impact?
Marr's three levels of description/analysis

• Computational: the system's goal
  • What problem does the system solve? Input/output constraints
• Algorithmic: the system's method
  • How does the system do what it does: details of the processes and representations used
• Implementational: the system's means
  • How is the system physically realized?
Connecting to development

• Can we connect these models to development?
  • Primarily models at the computational level, not the algorithmic (Marr, 1983)
• What about the how?
• My interest: how can we explain the behavior of children as they learn to segment words?
Our modeling goal:

Build the simplest model that:
• Aligns with infants’ capabilities
• Replicates infants’ behavior in a principled fashion
• Performs reasonably at the task
How do infants segment speech?

• Possible strategy: identification of words in isolation (Peters, 1983; Pinker et al., 1984)
  • Unlikely to be sufficient (Aslin et al., 1996), but probably helpful (Brent and Siskind, 2001)
• Attending to multiple cues in the input, most popularly:
  • Bootstrapping from known words (Bortfeld et al., 2005; Dahan and Brent, 1999)
  • More easily identify novel words at beginnings and ends of utterances at 8 months (Seidl & Johnson, 2006)
  • Dominant stress pattern of the language (Jusczyk et al., 1999)
Infant word learning

• Infants show first clear signs of identifying words at 6 months (Bergelson and Swingley, 2012)
• At that age, no:
  • Stress preference (Jusczyk et al., 1993, 1999)
  • Phoneme transition preference/phonotactics (Mattys et al., 1999)
• But:
  • Syllable as primary perceptual unit, sensitivity over time:
    • Birth: number of syllables (Bijeljac-Babic et al., 1993); 2 months: preference for syllabifiable input (Bertoncini & Mehler, 1981), holistic syllables (Jusczyk & Derrah, 1987)
  • Ability to identify cohesive chunks (Goodsitt et al., 1993)
Overview of the proposed algorithm

• Segmenter has a lexicon of potential words it builds over time
  • Starts empty; words are added based on the segmentation of each utterance
  • Each word has a score
• Use words in the lexicon to divide an utterance by subtractively segmenting from the front of the utterance
• Operates online
  • Processes one utterance at a time
  • Cannot remember previous utterances or how it segmented them, only the lexicon
• Operates left-to-right in each utterance to insert word boundaries between syllables
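As a concrete illustration, here is a minimal Python sketch of the subtractive step described above. All names are hypothetical, and it simplifies the real learner: greedy only, no trust or beam, and once no known word matches, the whole remainder is treated as one new word rather than advancing syllable by syllable.

```python
def segment(utterance, lexicon):
    """Greedily subtract known words from the front of an utterance.

    utterance: list of syllables; lexicon: dict mapping word tuples to scores.
    """
    words, i = [], 0
    while i < len(utterance):
        # Find every known word that matches starting at position i.
        matches = [utterance[i:j] for j in range(i + 1, len(utterance) + 1)
                   if tuple(utterance[i:j]) in lexicon]
        if matches:
            # Subtract the highest-scoring matching word.
            best = max(matches, key=lambda w: lexicon[tuple(w)])
            words.append(best)
            i += len(best)
        else:
            # Simplification: treat the whole remainder as one new word.
            words.append(utterance[i:])
            break
    # Reinforcement: reward every word used; new words enter with score 1.
    for w in words:
        lexicon[tuple(w)] = lexicon.get(tuple(w), 0) + 1
    return words

lexicon = {("mo", "mmy's"): 3.0}  # "mommy's" was learned earlier
print(segment(["mo", "mmy's", "tea"], lexicon))  # subtracts "mommy's"; "tea" becomes a new word
```

With an empty lexicon, the first call simply stores the whole utterance (the "bigdrum" case on the next slides).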
Useful constraints

• Like infants, start work from the outside in
• Restrict the search for possible segmentations to words we already know when possible
• Use simple reinforcement learning to reward useful words and penalize bad ones
• I’ll work through some examples
  • Orthography for easy reading; input is syllabified phonemes
In the beginning…

• Just add whole utterances to the lexicon
  • Gets words in isolation for free, but often more than one word

Input: big drum — Lexicon: (empty)
Treat everything as one word, add to lexicon: bigdrum
Subtractive Segmentation

• Use words in the lexicon to break up the utterance
• Increase a word’s score when it is used
• Add new words to the lexicon

Input: mo mmy’s tea — Lexicon: mommy’s ...
Treat the remainder as a word, add to lexicon: tea
Trust

• Add new words to the lexicon based on whether we trust them (they touch an utterance boundary)

Input: is that a che cker — Lexicon: a is that red ...
Treat the remainder as a word, add to lexicon: checker

Input: is that che cker red — don’t trust this: the remainder does not touch an utterance boundary
Multiple hypotheses

• For multiple possible subtractions, two options:
  • Greedy approach (Lignos and Yang, 2010)
  • Pursue two (or more) hypotheses (beam search)
• Two hypotheses allow for penalization: reduce the score of the word that started the losing hypothesis

Input: is that a che cker — Lexicon: a is isthat that ...
is and that lead to a higher mean score than isthat
Scoring hypotheses

• Prefer the hypothesis that uses the higher-scoring words
  • Winner is rewarded, and its words’ scores go up: “rich get richer”
• Geometric mean of the scores of the words used as the metric:
  • Useful for compound splitting (Koehn and Knight, 2003; Lignos, 2010)
  • Doesn’t have a bias for fewer or more words, but seeks a higher average score
• Post-hoc testing shows this outperforms other obvious metrics (MDL/MLE, other means) (ask me)
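A small sketch of this metric with hypothetical word scores; unknown words are given a score of one, as if just added to the lexicon.

```python
from math import prod

def hyp_score(words, lexicon):
    """Geometric mean of the scores of the words in a hypothesized
    segmentation. Unknown words get score 1 (Laplace-like)."""
    scores = [lexicon.get(w, 1) for w in words]
    return prod(scores) ** (1 / len(scores))

# Invented scores for illustration; real scores come from reinforcement.
lexicon = {"is": 50, "that": 40, "a": 80, "isthat": 6}
h1 = ["is", "that", "a", "checker"]   # "checker" is novel, score 1
h2 = ["isthat", "a", "checker"]
best = max([h1, h2], key=lambda h: hyp_score(h, lexicon))  # h1 wins
```

Because the mean is geometric, h1 is not penalized merely for using more words; it wins because its words average a higher score.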
u ← the syllables of the utterance, initially with no word boundaries
i ← 0
while i < len(u) do
    if USS requires a word boundary then
        Insert a word boundary and advance i, updating the lexicon as needed
    else if Subtractive Segmentation can be performed then
        Subtract the highest-scoring word and advance i, updating the lexicon as needed
    else
        Advance i by one syllable
    end if
end while
w ← the syllables between the last boundary inserted (or the beginning of the utterance if no boundaries were inserted) and the end of the utterance
Increment w's score in the lexicon, adding it to the lexicon if needed

Figure 3: An algorithm combining USS and Subtractive Segmentation
explore multiple hypotheses at once, using a simple beam search. New hypotheses are added to support multiple possible subtractive segmentations. For example, using the utterance above, at the beginning of segmentation either part or partof could be subtracted from the utterance, and both possible segmentations can be evaluated. The learner scores these hypotheses in a fashion similar to the greedy segmentation, but using a function based on the score of all words used in the utterance. The geometric mean has been used in compound splitting (Koehn and Knight, 2003), a task in many ways similar to word segmentation, so we adopt it as the criterion for selecting the best hypothesis. For a hypothesized segmentation H comprised of words w_i ... w_n, a hypothesis is chosen as follows:
argmax_H ( ∏_{w_i ∈ H} score(w_i) )^(1/n)
For any w not found in the lexicon we must assign a score; we assign it a score of one, as that would be its value assuming it had just been added to the lexicon, an approach similar to Laplace smoothing.
Returning to the previous example, while the score of partof is greater than that of part, the score of of is much higher than either, so if both partof an apple and part of an apple are considered, the high score of of causes the latter to be chosen. When beam search is employed, only words used in the winning hypothesis are rewarded, similar to the greedy case where there are no other hypotheses.
In addition to preferring segmentations that use words of higher score, it is useful to reduce the score of words that led to the consideration of a losing hypothesis. In the previous example we may want to penalize partof so that we are less likely to choose a future segmentation that includes it. Setting the beam size to two, forcing each hypothesis to develop greedily after an ambiguous subtraction, causes two hypotheses to form, so we are guaranteed a unique word to penalize. In the previous example partof causes the split between the two hypotheses in the beam, and thus the learner penalizes it to discourage using it in the future.

Table 1: Learner and baseline performance (word boundaries)

Algorithm | Precision | Recall | F-Score
No Stress Information
Syllable Baseline | 81.68 | 100.0 | 89.91
Subtractive Seg. | 91.66 | 89.13 | 90.37
Subtractive Seg. + Beam 2 | 92.74 | 88.69 | 90.67
Word-level Stress
USS Only | 91.53 | 18.82 | 31.21
USS + Subtractive Seg. | 93.76 | 92.02 | 92.88
USS + Subtractive Seg. + Beam 2 | 94.20 | 91.87 | 93.02

5 Results

5.1 Evaluation

To evaluate the performance of our model, we measured performance on child-directed speech, using the same corpus used in a number of previous studies that used syllabified input (Yang, 2004; Gambell and Yang, 2004; Lignos and Yang, 2010). The eval-
Predictions

• Default assumption of utterance = word → infants will start with oversized units and words in isolation
• Rich-get-richer scoring → as the learner is exposed to more data, it will tend to use high-frequency elements
• Penalization → use of collocations will decrease with time
Our evaluation corpus

• Constructed from the Brown (1973) subset of CHILDES English (Adam, Eve, Sarah), ~60k utterances
• Pronunciations and stress for each word from CMUDICT, algorithmically syllabified
• Sample input:

B.IH.G|D.R.AH.M
HH.AO.R.S
HH.UW|IH.Z|DH.AE.T
Evaluation

• A’ calculated over syllable boundaries
  • Balance of hit rate and false alarm rate for discriminating word boundaries
  • Trapezoidal approximation of the area under the ROC curve
• F-score used for word token identification
  • Balance of precision (how often a word identified in an utterance is correct) and recall (how many correct words were found)
  • Ex: is that a lady segmented as isthat a lady
    • Precision 2/3: a, lady; isthat
    • Recall 2/4: a, lady; is, that
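The precision/recall example above can be checked with a short sketch: a predicted token counts as correct only when both of its boundaries match the gold segmentation.

```python
def token_prf(gold, predicted):
    """Token precision and recall over two segmentations of the same string."""
    def spans(words):
        # Convert a word sequence into (start, end) character spans.
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out
    g, p = spans(gold), spans(predicted)
    correct = len(g & p)
    return correct / len(p), correct / len(g)

prec, rec = token_prf(["is", "that", "a", "lady"], ["isthat", "a", "lady"])
# prec = 2/3 (a, lady correct; isthat wrong), rec = 2/4 (is, that missed)
```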
Evaluation

• Evaluated the segmenter in three forms:
  • Subtractive segmentation
  • Subtractive segmentation with trust, only adding words to the lexicon if they touch an utterance boundary
  • Subtractive segmentation with trust and multiple hypotheses, considering two hypothetical segmentations and penalizing the loser
• Errors computed over the testing corpus from the first utterance
  • Worst-case evaluation for an online learner: zero training
• Word boundaries taken as orthographic boundaries in the input
  • Aligns with morphological and phonological definitions of word
Performance

• Trust and multiple hypotheses significantly reduce the false alarm rate
• Perfect memory is not crucial: evaluating A’ and F-score with imperfect memory yields similar results
• Segments 60k utterances in < 1 second
• Beam performs better than exhaustive search (ask me)

Method | Hit | False Alarm | A’ | Word F-Score
Baseline: syllable = word | 1.0 | 1.0 | 0 | 0.753
Subtractive Segmentation | 0.992 | 0.776 | 0.795 | 0.797
+Trust | 0.961 | 0.468 | 0.860 | 0.841
+Multiple Hypotheses | 0.953 | 0.401 | 0.875 | 0.849
Infant error patterns

• Undersegmentation at a young age (Brown, 1973; Clark, 1977; Peters, 1977, 1983)
  • Function word collocations: that-a, it’s, isn’t-it
  • Phrases as a single unit: look at that, what’s that, oh boy, all gone, open the door
  • Function-content collocations: picture-of, whole-thing
• Oversegmentation at older ages (Peters, 1983)
  • Function word oversegmentation: behave/be have, tulips/two lips, Miami/my Ami/your Ami
• Errors can still occur in adulthood (Cutler and Norris, 1988)
Most frequent error tokens

• Divided into early (first 10k utterances) and late (last 10k utterances) stages of learning
• Coded the most frequent incorrect words in the output as:
  • Function: overuse of a function word (away → a way)
  • Function collocation: two function words (that’s a → that’sa)
  • Content collocation: content and content/function word (a ball → aball)
  • Other
• Distribution changes over time (chi-squared p < .0001)

Time | Function | Func. Colloc. | Cont. Colloc. | Other
Early | 44.2% | 37.0% | 8.5% | 10.4%
Late | 70.6% | 1.0% | 1.2% | 27.3%
Learning curve

• The learner starts undersegmenting; as it learns, it achieves balance with slight oversegmentation

[Figure: precision, recall, and F-score (y-axis 0.3–1.0) over the first 30,000 utterances]
Extensions: Joint morphology learning

• Learned on the output of segmentation
• Initial lexicon is of high enough quality to learn common suffixes
• Acquisition order similar to Brown (1973):

Predicted order | Suffix | Example | Brown Average Rank
1 | -ɪŋ | bake-ing | 2.33
2 | -z | happen-s | 7.9
3 | -d | happen-ed | 9
4 | -ɚ | bake-er | –
5 | -t | check-ed | 9
6 | -ən | broke-en | –
7 | -iː | smell-y | –
Word segmentation: Conclusions

• By working at the algorithmic level, we can use information from experimental results to build systems
• Taking advantage of what we know about infant behavior (syllables, segmentation strategies) leads to good performance and an efficient algorithm
• A simple reward-based approach predicts the learning patterns of infants
III. Codeswitching

i'm at work babe. trabajando like a good girl. que haces. estoy entusiamada porque voy al mall con gina!!! I have a bit of a situation mañana marcame porfas, creo que no voy a tener coche!
Value of detecting codeswitching

• Create single-language chunks for normalization/MT:
  • “y u no…” vs. “manzanas y bananas”
• Characterize communicants using language knowledge

que te paso? → Spanish Model
Why are u upset? → English Model
Speaker knows English and Spanish; hearer knows English and Spanish
Challenges

• Only existing codeswitching corpora are interview-scale
• Assume no text normalization or part-of-speech tagging
• Short messages, short phrases
• Technical terms (“USB”) and brand names (“Apple”)
Assumptions

• We can collect corpora primarily consisting of each language of interest
  • Collections of primarily Spanish (6.8M) and English (2.8M) tweets
• Try to make the fewest assumptions about the match between training data and applications
  • May not be the same medium (e.g., Twitter, SMS)
  • May not have the same dominant language in the pair
• No annotated codeswitched data for training
Codeswitchador methodology

• Label each token using its source language
• Threshold on the number of hits to perform language identification and codeswitching detection
  • Require 2 tokens for a language to call it present
  • If multiple languages are present, label the message codeswitched
Models 1 and 1.5

• Language ratio: log P(w | Spanish) / log P(w | English)
  • Examples: la 2.37, me 1.15
• Model 1: label words with ratio above 1 Spanish, below as English (MLE)
• Intuition for ratio ≈ 1:
  • Syntactic context constrains codeswitching (Belazi et al., 1994)
  • Syntactic heads for English/Spanish on the left
• Model 1.5: label ratio ≈ 1 words and OOV by left context
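A hedged sketch of a Model 1-style labeler combined with the two-token threshold from the previous slide. The unigram probabilities are invented for illustration, and the ratio is written in the one orientation that makes "above 1 → Spanish" come out right for negative log-probabilities; the talk's exact definition may differ.

```python
from math import log

# Hypothetical unigram probabilities; real models would be estimated
# from the large Spanish and English tweet collections.
P_ES = {"que": 0.02, "te": 0.015, "paso": 0.001,
        "why": 1e-7, "are": 1e-6, "u": 1e-5, "upset": 1e-8}
P_EN = {"que": 1e-6, "te": 1e-6, "paso": 1e-7,
        "why": 0.003, "are": 0.02, "u": 0.001, "upset": 1e-4}
FLOOR = 1e-9  # smoothing floor for out-of-vocabulary tokens

def lang_ratio(w):
    # Log-probabilities are negative, so this ratio exceeds 1 exactly
    # when the word is more probable under the Spanish model.
    return log(P_EN.get(w, FLOOR)) / log(P_ES.get(w, FLOOR))

def label(tokens):
    langs = ["es" if lang_ratio(t) > 1 else "en" for t in tokens]
    # Require at least 2 tokens of a language to call it present;
    # if both languages pass the threshold, the message is codeswitched.
    present = {l for l in ("es", "en") if langs.count(l) >= 2}
    tag = ("codeswitched" if len(present) > 1
           else (present.pop() if present else "unknown"))
    return langs, tag

langs, tag = label("que te paso why are u upset".split())
# tag == "codeswitched": both languages contribute at least 2 tokens
```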
Data collection*

• Accuracy verified using a test set created by bilingual Mechanical Turk annotators
  • Instructed to account for standard borrowings
• Labeled 10,000 tweets from MITRE Spanish Twitter (N = 93,136 tweets) (Burger et al., 2011) for source language
  • Token inter-annotator agreement: 96.52%
• Tokens labeled with basic entity tags (name, title) and excluded from token evaluation

*Many thanks to Jacob Lozano Ramirez for implementing the HIT interface and performing adjudication.
LID and Codeswitching accuracy

• Task: mark each tweet (N = 7,018) as Spanish, English, or Codeswitched
• Application: split data into single- or mixed-language sets

All-message LID/CS accuracy:
Baseline 0.529 | Function 0.564 | Model 1 0.779 | Model 1.5 0.815
Codeswitching detection

• Task: mark each tweet (N = 7,018) as Codeswitched or not
• Application: identify codeswitched tweets/users

 | Baseline | Function | Model 1 | Model 1.5
CS Prec. | 0.529 | 0.963 | 0.715 | 0.797
CS Rec. | 1 | 0.205 | 0.969 | 0.874
CS F1 | 0.692 | 0.338 | 0.823 | 0.833
Token language identification

• Task: among codeswitched tweets (N = 3,711), identify the source language of each token
  • Inter-annotator agreement: 96.52%
• Application: identify tokens/spans for normalization/MT

Token LID accuracy:
Baseline 0.9 | Model 1 0.947 | Model 1.5 0.967
Extensions

• Applications in linguistic research: what are the constraints on codeswitching in large-scale data? (ask me)
• Could apply more powerful learning methods:
  • Simple one-state-per-language HMM: equivalent performance
  • SVM-HMM with morphological features: equivalent performance
  • Higher overfitting risk with these models
• Most improvement is likely to come from better OOV modeling
• Comparison with standard LID systems
  • Not yet fair, as they are meant to do many-way LID, not two-way
Codeswitching: Conclusions

• A simple approach that takes advantage of linguistic constraints works as well as human annotators
• Fast, deterministic decoding and minimal storage (1 bit/word)
• Promising future for improving text normalization, communicant characterization, and machine translation
IV. Conclusions

Care for some more insane heuristic?

Yes, please. “Necessity is the mother of invention,” and considering additional constraints (e.g., fast, cognitively plausible) has led to good results.
Natural language for robot control

Natural language used to generate linear temporal logic plans (AAAI Language Grounding Workshop 2012; HRI 2012). https://github.com/PennNLP/SLURP (supported by ARO MURI)
Natural language for robot control

[System diagram: user input from an iPad interface goes to an NLP pipeline (language processing, semantic parsing), which passes orders and world knowledge to LTLBroom for planning; the resulting automaton runs in LTLMoP's logical controller, which exchanges world updates, control commands, sensor data, and map information with the robot controller. Orders/responses flow back to the user. Robot platforms from Cornell and UMass Lowell.]
Morphological processing

Developing an algorithmic-level model of morphological processing (Chicago Linguistic Society 2012)

play + ing → playing

How do we represent and process morphologically complex forms? Does frequency affect our answer? Use big data to compare whole-word, decompositional, and dual-route models: whole-word storage, complete decomposition, and dual-route.
Conclusions

• Combining the best of statistical learning and symbolic constraints leads to good results and fast systems
• Cognitive/linguistic constraints provide robust and a priori motivated restrictions on language systems

Thanks to: Charles Yang, Mitch Marcus, NSF IGERT #50504487, ARO MURI (SUBTLE) W911NF-07-1-0216

Constantine Lignos
[email protected]
http://www.seas.upenn.edu/~lignos