Constantine Lignos, University of Pennsylvania
Department of Computer and Information Science, Institute for Research in Cognitive Science
Johns Hopkins Human Language Technology Center of Excellence
2/18/2013
Symbolic Constraints and Statistical Methods: Use Together for Best Results
I. Introduction

My focus: computationally efficient solutions to hard natural language problems.
Insane Heuristic (Siklóssy, 1974): “To solve a problem, solve a harder problem.”
bake BAKE
baker BAKE +(er)
bakers BAKE +(er) +(s)
baking BAKE +(ing)
bakes BAKE +(s)
Morphology learning: MORSEL

Unsupervised learning of morphological structure, state-of-the-art for Finnish and English. https://github.com/ConstantineLignos/MORSEL (BUCLD 2009; Morpho Challenge 2009, best paper 2010; Research on Language and Computation, 2010)
Symbolic constraints and statistical methods

• With the right constraints on the learning problem, simple learning strategies work
• Where should these constraints come from?
• Common practice: error-motivated
  • Posterior regularization (Graça et al., 2011): constrain POS tags to have a verb in every sentence, sparse tag distribution
  • Constrain a word segmenter to put a vowel in every word (Venkataraman, 2001)
• I suggest: all language comes from the mind; search within its constraints.
II. Infant word segmentation

Word segmentation is an unsexy problem.

Whereareyougoing? Howdoesabunnyrabbitwalk? Doeshewalklikeyouordoeshegohophophop? Whatareyoudoing? Sweepbroom. Isthatabroom? Ithoughtitwasabrush. — Adam’s mother (Brown, 1973)
Where are you going? How does a bunny rabbit walk? Does he walk like you or does he go hop hop hop? What are you doing? Sweep broom. Is that a broom? I thought it was a brush. Adam’s mother (Brown, 1973)
Modeling of WS

• Transitional probability minima (Saffran, 1996 et seq.; Swingley, 2005) by themselves are a dead end (Yang, 2004)
• Likewise for phonotactics-only accounts (Daland and Pierrehumbert, 2011)
• Models using a lexicon and Bayesian inference (Borschinger and Johnson, 2011; Goldwater et al., 2009; Johnson and Goldwater, 2009; Pearl et al., 2011)
  • Goal of the learner is to develop a high-quality lexicon for segmentation; context can be crucial
  • Integrates progress in unsupervised learning (Teh, 2006)
  • High performance. Developmental impact?
Marr's three levels of description/analysis

• Computational: the system's goal
  • What problem does the system solve? Input/output constraints
• Algorithmic: the system's method
  • How does the system do what it does: details of the processes and representations used
• Implementational: the system's means
  • How is the system physically realized?
Connecting to development

• Can we connect these models to development?
  • Primarily models at the computational level, not the algorithmic (Marr, 1983)
• What about the how?
• My interest: how can we explain the behavior of children as they learn to segment words?
Our modeling goal:

Build the simplest model that:
• Aligns with infants’ capabilities
• Replicates infants’ behavior in a principled fashion
• Performs reasonably at the task
How do infants segment speech?

• Possible strategy: identification of words in isolation (Peters, 1983; Pinker et al., 1984)
  • Unlikely to be sufficient (Aslin et al., 1996), but probably helpful (Brent and Siskind, 2001)
• Attending to multiple cues in the input, most popularly:
  • Bootstrapping from known words (Bortfeld et al., 2005; Dahan and Brent, 1999)
  • More easily identify novel words at beginnings and ends of utterances at 8 months (Seidl & Johnson, 2006)
  • Dominant stress pattern of the language (Jusczyk et al., 1999)
Infant word learning

• Infants show first clear signs of identifying words at 6 months (Bergelson and Swingley, 2012)
• At that age, no:
  • Stress preference (Jusczyk et al., 1993, 1999)
  • Phoneme transition preference/phonotactics (Mattys et al., 1999)
• But:
  • Syllable as primary perceptual unit, sensitivity over time:
    • Birth: number of syllables (Bijeljac-Babic et al., 1993); 2 months: preference for syllabifiable input (Bertoncini & Mehler, 1981), holistic syllables (Jusczyk & Derrah, 1987)
  • Ability to identify cohesive chunks (Goodsitt et al., 1993)
Overview of the proposed algorithm

• Segmenter has a lexicon of potential words it builds over time
  • Starts empty; words are added based on the segmentation of each utterance
  • Each word has a score
• Use words in the lexicon to divide an utterance by subtractively segmenting from the front of the utterance
• Operates online
  • Processes one utterance at a time
  • Cannot remember previous utterances or how it segmented them, only the lexicon
• Operates left-to-right in each utterance to insert word boundaries between syllables
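As a concrete illustration, here is a minimal Python sketch of the subtractive step described above. All names are hypothetical, and it simplifies the real learner: greedy only, no trust or beam, and once no known word matches, the whole remainder is treated as one new word rather than advancing syllable by syllable.

```python
def segment(utterance, lexicon):
    """Greedily subtract known words from the front of an utterance.

    utterance: list of syllables; lexicon: dict mapping word tuples to scores.
    """
    words, i = [], 0
    while i < len(utterance):
        # Find every known word that matches starting at position i.
        matches = [utterance[i:j] for j in range(i + 1, len(utterance) + 1)
                   if tuple(utterance[i:j]) in lexicon]
        if matches:
            # Subtract the highest-scoring matching word.
            best = max(matches, key=lambda w: lexicon[tuple(w)])
            words.append(best)
            i += len(best)
        else:
            # Simplification: treat the whole remainder as one new word.
            words.append(utterance[i:])
            break
    # Reinforcement: reward every word used; new words enter with score 1.
    for w in words:
        lexicon[tuple(w)] = lexicon.get(tuple(w), 0) + 1
    return words

lexicon = {("mo", "mmy's"): 3.0}  # "mommy's" was learned earlier
print(segment(["mo", "mmy's", "tea"], lexicon))  # subtracts "mommy's"; "tea" becomes a new word
```

With an empty lexicon, the first call simply stores the whole utterance (the "bigdrum" case on the next slides).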
Useful constraints

• Like infants, start work from the outside in
• Restrict the search for possible segmentations to words we already know when possible
• Use simple reinforcement learning to reward useful words and penalize bad ones
• I’ll work through some examples
  • Orthography for easy reading; input is syllabified phonemes
In the beginning…

• Just add whole utterances to the lexicon
  • Gets words in isolation for free, but often more than one word

Input: big drum — Lexicon: (empty)
Treat everything as one word, add to lexicon: bigdrum
Subtractive Segmentation

• Use words in the lexicon to break up the utterance
• Increase a word’s score when it is used
• Add new words to the lexicon

Input: mo mmy’s tea — Lexicon: mommy’s ...
Treat the remainder as a word, add to lexicon: tea
Trust

• Add new words to the lexicon based on whether we trust them (they touch an utterance boundary)

Input: is that a che cker — Lexicon: a is that red ...
Treat the remainder as a word, add to lexicon: checker

Input: is that che cker red — don’t trust this: the remainder does not touch an utterance boundary
Multiple hypotheses

• For multiple possible subtractions, two options:
  • Greedy approach (Lignos and Yang, 2010)
  • Pursue two (or more) hypotheses (beam search)
• Two hypotheses allow for penalization: reduce the score of the word that started the losing hypothesis

Input: is that a che cker — Lexicon: a is isthat that ...
is and that lead to a higher mean score than isthat
Scoring hypotheses

• Prefer the hypothesis that uses the higher-scoring words
  • Winner is rewarded, and its words’ scores go up: “rich get richer”
• Geometric mean of the scores of the words used as the metric:
  • Useful for compound splitting (Koehn and Knight, 2003; Lignos, 2010)
  • Doesn’t have a bias for fewer or more words, but seeks a higher average score
• Post-hoc testing shows this outperforms other obvious metrics (MDL/MLE, other means) (ask me)
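A small sketch of this metric with hypothetical word scores; unknown words are given a score of one, as if just added to the lexicon.

```python
from math import prod

def hyp_score(words, lexicon):
    """Geometric mean of the scores of the words in a hypothesized
    segmentation. Unknown words get score 1 (Laplace-like)."""
    scores = [lexicon.get(w, 1) for w in words]
    return prod(scores) ** (1 / len(scores))

# Invented scores for illustration; real scores come from reinforcement.
lexicon = {"is": 50, "that": 40, "a": 80, "isthat": 6}
h1 = ["is", "that", "a", "checker"]   # "checker" is novel, score 1
h2 = ["isthat", "a", "checker"]
best = max([h1, h2], key=lambda h: hyp_score(h, lexicon))  # h1 wins
```

Because the mean is geometric, h1 is not penalized merely for using more words; it wins because its words average a higher score.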
u ← the syllables of the utterance, initially with no word boundaries
i ← 0
while i < len(u) do
    if USS requires a word boundary then
        Insert a word boundary and advance i, updating the lexicon as needed
    else if Subtractive Segmentation can be performed then
        Subtract the highest-scoring word and advance i, updating the lexicon as needed
    else
        Advance i by one syllable
    end if
end while
w ← the syllables between the last boundary inserted (or the beginning of the utterance if no boundaries were inserted) and the end of the utterance
Increment w's score in the lexicon, adding it to the lexicon if needed

Figure 3: An algorithm combining USS and Subtractive Segmentation
explore multiple hypotheses at once, using a simple beam search. New hypotheses are added to support multiple possible subtractive segmentations. For example, using the utterance above, at the beginning of segmentation either part or partof could be subtracted from the utterance, and both possible segmentations can be evaluated. The learner scores these hypotheses in a fashion similar to the greedy segmentation, but using a function based on the score of all words used in the utterance. The geometric mean has been used in compound splitting (Koehn and Knight, 2003), a task in many ways similar to word segmentation, so we adopt it as the criterion for selecting the best hypothesis. For a hypothesized segmentation H comprised of words w_i ... w_n, a hypothesis is chosen as follows:
argmax_H ( ∏_{w_i ∈ H} score(w_i) )^(1/n)
For any w not found in the lexicon we must assign a score; we assign it a score of one, as that would be its value assuming it had just been added to the lexicon, an approach similar to Laplace smoothing.
Returning to the previous example, while the score of partof is greater than that of part, the score of of is much higher than either, so if both partof an apple and part of an apple are considered, the high score of of causes the latter to be chosen. When beam search is employed, only words used in the winning hypothesis are rewarded, similar to the greedy case where there are no other hypotheses.
In addition to preferring segmentations that use words of higher score, it is useful to reduce the score of words that led to the consideration of a losing hypothesis. In the previous example we may want to penalize partof so that we are less likely to choose a future segmentation that includes it. Setting the beam size to two, forcing each hypothesis to develop greedily after an ambiguous subtraction, causes two hypotheses to form, so we are guaranteed a unique word to penalize. In the previous example partof causes the split between the two hypotheses in the beam, and thus the learner penalizes it to discourage using it in the future.

Table 1: Learner and baseline performance (word boundaries)

Algorithm | Precision | Recall | F-Score
No Stress Information
Syllable Baseline | 81.68 | 100.0 | 89.91
Subtractive Seg. | 91.66 | 89.13 | 90.37
Subtractive Seg. + Beam 2 | 92.74 | 88.69 | 90.67
Word-level Stress
USS Only | 91.53 | 18.82 | 31.21
USS + Subtractive Seg. | 93.76 | 92.02 | 92.88
USS + Subtractive Seg. + Beam 2 | 94.20 | 91.87 | 93.02

5 Results

5.1 Evaluation

To evaluate the performance of our model, we measured performance on child-directed speech, using the same corpus used in a number of previous studies that used syllabified input (Yang, 2004; Gambell and Yang, 2004; Lignos and Yang, 2010). The eval-
Predictions

• Default assumption of utterance = word → infants will start with oversized units and words in isolation
• Rich-get-richer scoring → as the learner is exposed to more data, it will tend to use high-frequency elements
• Penalization → use of collocations will decrease with time
Our evaluation corpus

• Constructed from the Brown (1973) subset of CHILDES English (Adam, Eve, Sarah), ~60k utterances
• Pronunciations and stress for each word from CMUDICT, algorithmically syllabified
• Sample input:

B.IH.G|D.R.AH.M
HH.AO.R.S
HH.UW|IH.Z|DH.AE.T
Evaluation

• A’ calculated over syllable boundaries
  • Balance of hit rate and false alarm rate for discriminating word boundaries
  • Trapezoidal approximation of the area under the ROC curve
• F-score used for word token identification
  • Balance of precision (how often a word identified in an utterance is correct) and recall (how many correct words were found)
  • Ex: is that a lady segmented as isthat a lady
    • Precision 2/3: a, lady; isthat
    • Recall 2/4: a, lady; is, that
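The precision/recall example above can be checked with a short sketch: a predicted token counts as correct only when both of its boundaries match the gold segmentation.

```python
def token_prf(gold, predicted):
    """Token precision and recall over two segmentations of the same string."""
    def spans(words):
        # Convert a word sequence into (start, end) character spans.
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out
    g, p = spans(gold), spans(predicted)
    correct = len(g & p)
    return correct / len(p), correct / len(g)

prec, rec = token_prf(["is", "that", "a", "lady"], ["isthat", "a", "lady"])
# prec = 2/3 (a, lady correct; isthat wrong), rec = 2/4 (is, that missed)
```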
Evaluation

• Evaluated the segmenter in three forms:
  • Subtractive segmentation
  • Subtractive segmentation with trust, only adding words to the lexicon if they touch an utterance boundary
  • Subtractive segmentation with trust and multiple hypotheses, considering two hypothetical segmentations and penalizing the loser
• Errors computed over the testing corpus from the first utterance
  • Worst-case evaluation for an online learner: zero training
• Word boundaries taken as orthographic boundaries in the input
  • Aligns with morphological and phonological definitions of word
Performance

• Trust and multiple hypotheses significantly reduce the false alarm rate
• Perfect memory is not crucial: evaluating A’ and F-score with imperfect memory yields similar results
• Segments 60k utterances in < 1 second
• Beam performs better than exhaustive search (ask me)

Method | Hit | False Alarm | A’ | Word F-Score
Baseline: syllable = word | 1.0 | 1.0 | 0 | 0.753
Subtractive Segmentation | 0.992 | 0.776 | 0.795 | 0.797
+Trust | 0.961 | 0.468 | 0.860 | 0.841
+Multiple Hypotheses | 0.953 | 0.401 | 0.875 | 0.849
Infant error patterns

• Undersegmentation at a young age (Brown, 1973; Clark, 1977; Peters, 1977, 1983)
  • Function word collocations: that-a, it’s, isn’t-it
  • Phrases as a single unit: look at that, what’s that, oh boy, all gone, open the door
  • Function-content collocations: picture-of, whole-thing
• Oversegmentation at older ages (Peters, 1983)
  • Function word oversegmentation: behave/be have, tulips/two lips, Miami/my Ami/your Ami
• Errors can still occur in adulthood (Cutler and Norris, 1988)
Most frequent error tokens

• Divided into early (first 10k utterances) and late (last 10k utterances) stages of learning
• Coded the most frequent incorrect words in the output as:
  • Function: overuse of a function word (away → a way)
  • Function collocation: two function words (that’s a → that’sa)
  • Content collocation: content and content/function word (a ball → aball)
  • Other
• Distribution changes over time (chi-squared p < .0001)

Time | Function | Func. Colloc. | Cont. Colloc. | Other
Early | 44.2% | 37.0% | 8.5% | 10.4%
Late | 70.6% | 1.0% | 1.2% | 27.3%
Learning curve

• The learner starts undersegmenting; as it learns, it achieves balance with slight oversegmentation

[Figure: precision, recall, and F-score (y-axis 0.3–1.0) over the first 30,000 utterances]
Extensions: Joint morphology learning

• Learned on the output of segmentation
• Initial lexicon is of high enough quality to learn common suffixes
• Acquisition order similar to Brown (1973):

Predicted order | Suffix | Example | Brown Average Rank
1 | -ɪŋ | bake-ing | 2.33
2 | -z | happen-s | 7.9
3 | -d | happen-ed | 9
4 | -ɚ | bake-er | –
5 | -t | check-ed | 9
6 | -ən | broke-en | –
7 | -iː | smell-y | –
Word segmentation: Conclusions

• By working at the algorithmic level, we can use information from experimental results to build systems
• Taking advantage of what we know about infant behavior (syllables, segmentation strategies) leads to good performance and an efficient algorithm
• A simple reward-based approach predicts the learning patterns of infants
III. Codeswitching

i'm at work babe. trabajando like a good girl. que haces. estoy entusiamada porque voy al mall con gina!!! I have a bit of a situation mañana marcame porfas, creo que no voy a tener coche!
Value of detecting codeswitching

• Create single-language chunks for normalization/MT:
  • “y u no…” vs. “manzanas y bananas”
• Characterize communicants using language knowledge

que te paso? → Spanish Model
Why are u upset? → English Model
Speaker knows English and Spanish; hearer knows English and Spanish
Challenges

• Only existing codeswitching corpora are interview-scale
• Assume no text normalization or part-of-speech tagging
• Short messages, short phrases
• Technical terms (“USB”) and brand names (“Apple”)
Assumptions

• We can collect corpora primarily consisting of each language of interest
  • Collections of primarily Spanish (6.8M) and English (2.8M) tweets
• Try to make the fewest assumptions about the match between training data and applications
  • May not be the same medium (e.g., Twitter, SMS)
  • May not have the same dominant language in the pair
• No annotated codeswitched data for training
Codeswitchador methodology

• Label each token using its source language
• Threshold on the number of hits to perform language identification and codeswitching detection
  • Require 2 tokens for a language to call it present
  • If multiple languages are present, label the message codeswitched
Models 1 and 1.5

• Language ratio: log P(w | Spanish) / log P(w | English)
  • Examples: la 2.37, me 1.15
• Model 1: label words with ratio above 1 Spanish, below as English (MLE)
• Intuition for ratio ≈ 1:
  • Syntactic context constrains codeswitching (Belazi et al., 1994)
  • Syntactic heads for English/Spanish on the left
• Model 1.5: label ratio ≈ 1 words and OOV by left context
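A hedged sketch of a Model 1-style labeler combined with the two-token threshold from the previous slide. The unigram probabilities are invented for illustration, and the ratio is written in the one orientation that makes "above 1 → Spanish" come out right for negative log-probabilities; the talk's exact definition may differ.

```python
from math import log

# Hypothetical unigram probabilities; real models would be estimated
# from the large Spanish and English tweet collections.
P_ES = {"que": 0.02, "te": 0.015, "paso": 0.001,
        "why": 1e-7, "are": 1e-6, "u": 1e-5, "upset": 1e-8}
P_EN = {"que": 1e-6, "te": 1e-6, "paso": 1e-7,
        "why": 0.003, "are": 0.02, "u": 0.001, "upset": 1e-4}
FLOOR = 1e-9  # smoothing floor for out-of-vocabulary tokens

def lang_ratio(w):
    # Log-probabilities are negative, so this ratio exceeds 1 exactly
    # when the word is more probable under the Spanish model.
    return log(P_EN.get(w, FLOOR)) / log(P_ES.get(w, FLOOR))

def label(tokens):
    langs = ["es" if lang_ratio(t) > 1 else "en" for t in tokens]
    # Require at least 2 tokens of a language to call it present;
    # if both languages pass the threshold, the message is codeswitched.
    present = {l for l in ("es", "en") if langs.count(l) >= 2}
    tag = ("codeswitched" if len(present) > 1
           else (present.pop() if present else "unknown"))
    return langs, tag

langs, tag = label("que te paso why are u upset".split())
# tag == "codeswitched": both languages contribute at least 2 tokens
```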
Data collection*

• Accuracy verified using a test set created by bilingual Mechanical Turk annotators
  • Instructed to account for standard borrowings
• Labeled 10,000 tweets from MITRE Spanish Twitter (N = 93,136 tweets) (Burger et al., 2011) for source language
  • Token inter-annotator agreement: 96.52%
• Tokens labeled with basic entity tags (name, title) and excluded from token evaluation

*Many thanks to Jacob Lozano Ramirez for implementing the HIT interface and performing adjudication.
LID and Codeswitching accuracy

• Task: mark each tweet (N = 7,018) as Spanish, English, or Codeswitched
• Application: split data into single- or mixed-language sets

All-message LID/CS accuracy:
Baseline 0.529 | Function 0.564 | Model 1 0.779 | Model 1.5 0.815
Codeswitching detection

• Task: mark each tweet (N = 7,018) as Codeswitched or not
• Application: identify codeswitched tweets/users

 | Baseline | Function | Model 1 | Model 1.5
CS Prec. | 0.529 | 0.963 | 0.715 | 0.797
CS Rec. | 1 | 0.205 | 0.969 | 0.874
CS F1 | 0.692 | 0.338 | 0.823 | 0.833
Token language identification

• Task: among codeswitched tweets (N = 3,711), identify the source language of each token
  • Inter-annotator agreement: 96.52%
• Application: identify tokens/spans for normalization/MT

Token LID accuracy:
Baseline 0.9 | Model 1 0.947 | Model 1.5 0.967
Extensions

• Applications in linguistic research: what are the constraints on codeswitching in large-scale data? (ask me)
• Could apply more powerful learning methods:
  • Simple one-state-per-language HMM: equivalent performance
  • SVM-HMM with morphological features: equivalent performance
  • Higher overfitting risk with these models
• Most improvement is likely to come from better OOV modeling
• Comparison with standard LID systems
  • Not yet fair, as they are meant to do many-way LID, not two-way
Codeswitching: Conclusions

• A simple approach that takes advantage of linguistic constraints works as well as human annotators
• Fast, deterministic decoding and minimal storage (1 bit/word)
• Promising future for improving text normalization, communicant characterization, and machine translation
IV. Conclusions

Care for some more insane heuristic?

Yes, please. “Necessity is the mother of invention,” and considering additional constraints (e.g., fast, cognitively plausible) has led to good results.
Natural language for robot control

Natural language used to generate linear temporal logic plans (AAAI Language Grounding Workshop 2012; HRI 2012). https://github.com/PennNLP/SLURP (supported by ARO MURI)
Natural language for robot control

[System diagram: user input from an iPad interface goes to an NLP pipeline (language processing, semantic parsing), which passes orders and world knowledge to LTLBroom for planning; the resulting automaton runs in LTLMoP's logical controller, which exchanges world updates, control commands, sensor data, and map information with the robot controller. Orders/responses flow back to the user. Robot platforms from Cornell and UMass Lowell.]
Morphological processing

Developing an algorithmic-level model of morphological processing (Chicago Linguistic Society 2012)

play + ing → playing

How do we represent and process morphologically complex forms? Does frequency affect our answer? Use big data to compare whole-word, decompositional, and dual-route models: whole-word storage, complete decomposition, and dual-route.
Conclusions

• Combining the best of statistical learning and symbolic constraints leads to good results and fast systems
• Cognitive/linguistic constraints provide robust and a priori motivated restrictions on language systems

Thanks to: Charles Yang, Mitch Marcus, NSF IGERT #50504487, ARO MURI (SUBTLE) W911NF-07-1-0216

Constantine Lignos
[email protected]
http://www.seas.upenn.edu/~lignos