
Page 1: Symbolic Constraints and Statistical Methods: Use Together for Best Results

Constantine Lignos
University of Pennsylvania

Department of Computer and Information Science
Institute for Research in Cognitive Science

Johns Hopkins Human Language Technology Center of Excellence

2/18/2013

Symbolic Constraints and Statistical Methods: Use Together for Best Results

Page 2

I.  Introduction  

Page 3

My focus: Computationally efficient solutions to hard natural language problems.

Page 4

Insane Heuristic (Siklóssy, 1974): "To solve a problem, solve a harder problem."

Page 5

bake  BAKE  

baker  BAKE  +(er)  

bakers  BAKE  +(er)  +(s)  

baking  BAKE  +(ing)  

bakes  BAKE  +(s)  

Morphology  learning:  MORSEL  


Unsupervised learning of morphological structure, state-of-the-art for Finnish and English. https://github.com/ConstantineLignos/MORSEL (BUCLD 2009; Morpho Challenge 2009, best paper 2010; Research on Language and Computation, 2010)

Page 6

Symbolic  constraints  and  statistical  methods  


•  With  the  right  constraints  on  the  learning  problem,  simple  learning  strategies  work  

•  Where  should  these  constraints  come  from?  

•  Common practice: error-motivated
•  Posterior regularization (Graça et al., 2011): Constrain POS tags to have a verb in every sentence, sparse tag distribution

•  Constrain word segmenter to put a vowel in every word (Venkataraman, 2001)

•  I  suggest:  all  language  comes  from  the  mind;  search  within  its  constraints.  

Page 7

II.  Infant  word  segmentation  

Page 8

Word segmentation is an unsexy problem.

Page 9

Whereareyougoing? Howdoesabunnyrabbitwalk? Doeshewalklikeyouordoeshegohophophop? Whatareyoudoing? Sweepbroom. Isthatabroom? Ithoughtitwasabrush.

Adam's mother (Brown, 1973)

Page 10

Page 11

Where are you going? How does a bunny rabbit walk? Does he walk like you or does he go hop hop hop? What are you doing? Sweep broom. Is that a broom? I thought it was a brush.

Adam's mother (Brown, 1973)

Page 12

Modeling  of  WS  


•  Transitional probability minima (Saffran, 1996 et seq.; Swingley, 2005) by themselves are a dead end (Yang, 2004)

•  Likewise for phonotactics-only accounts (Daland and Pierrehumbert, 2011)

•  Models using a lexicon and Bayesian inference (Börschinger and Johnson, 2011; Goldwater et al., 2009; Johnson and Goldwater, 2009; Pearl et al., 2011)
•  Goal of the learner is to develop a high-quality lexicon for segmentation; context can be crucial

•  Integrates progress in unsupervised learning (Teh, 2006)
•  High performance. Developmental impact?

Page 13

Marr's  three  levels  of  description/analysis  


•  Computational: the system's goal
•  What problem does the system solve? Input/output constraints

•  Algorithmic: the system's method
•  How does the system do what it does: details of the processes and representations used

•  Implementational: the system's means
•  How is the system physically realized?

Page 14

Connecting  to  development  


•  Can we connect these models to development?
•  Primarily models at the computational level, not algorithmic (Marr, 1983)

•  What about the how?

•  My  interest:  how  can  we  explain  the  behavior  of  children  as  they  learn  to  segment  words?  

 

Page 15

Our  modeling  goal:  


Build  the  simplest  model  that:  

•  Aligns with infants' capabilities
•  Replicates infants' behavior in a principled fashion
•  Performs reasonably at the task

 

Page 16

How  do  infants  segment  speech?  

•  Possible strategy: identification of words in isolation (Peters, 1983; Pinker et al., 1984)
•  Unlikely to be sufficient (Aslin et al., 1996), but probably helpful (Brent and Siskind, 2001)

•  Attending to multiple cues in the input, most popularly:
•  Bootstrapping from known words (Bortfeld et al., 2005; Dahan and Brent, 1999)

•  More easily identify novel words at beginnings and ends of utterances at 8 months (Seidl & Johnson, 2006)

•  Dominant stress pattern of the language (Jusczyk et al., 1999)

Page 17

Infant  word  learning  


•  Infants show the first clear signs of identifying words at 6 months (Bergelson and Swingley, 2012)

•  At that age, no:
•  Stress preference (Jusczyk et al., 1993, 1999)
•  Phoneme transition preference/phonotactics (Mattys et al., 1999)

•  But:
•  Syllable as the primary perceptual unit, sensitivity over time:
•  Birth: number of syllables (Bijeljac-Babic et al., 1993); 2 months: preference for syllabifiable input (Bertoncini & Mehler, 1981), holistic syllables (Jusczyk & Derrah, 1987)

•  Ability to identify cohesive chunks (Goodsitt et al., 1993)

Page 18

Overview  of  the  proposed  algorithm  

•  Segmenter has a lexicon of potential words it builds over time
•  Starts empty; words are added based on the segmentation of each utterance
•  Each word has a score

•  Use words in the lexicon to divide an utterance by subtractively segmenting from the front of the utterance

•  Operates online
•  Processes one utterance at a time
•  Cannot remember previous utterances or how it segmented them, only the lexicon

•  Operates left-to-right in each utterance to insert word boundaries between syllables
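The loop described above can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions, not the talk's implementation: utterances are tuples of syllables, scores are simple counts, and when no known word matches, the whole remainder is treated as a new word (as in the slides' examples).

```python
# Minimal sketch of online subtractive segmentation (simplified; the names
# and the exact reward scheme here are illustrative assumptions).

def segment(utterance, lexicon):
    """Segment one utterance left-to-right, updating the lexicon in place."""
    words = []
    i = 0
    while i < len(utterance):
        # Try to subtract the highest-scoring known word starting at position i.
        best = None
        for j in range(len(utterance), i, -1):
            candidate = utterance[i:j]
            if candidate in lexicon and (best is None or lexicon[candidate] > lexicon[best]):
                best = candidate
        if best is not None:
            words.append(best)
            lexicon[best] += 1          # reward words that get used
            i += len(best)
        else:
            # No known word starts here: treat the remainder as a new word.
            # It touches the utterance boundary, so (simplified "trust") add it.
            remainder = utterance[i:]
            words.append(remainder)
            lexicon[remainder] = lexicon.get(remainder, 0) + 1
            i = len(utterance)
    return words

lexicon = {}
# First exposure: empty lexicon, so the whole utterance is stored as one word.
print(segment(("big", "drum"), lexicon))        # [('big', 'drum')]
# Later, a known word is subtracted from the front; the remainder is new.
lexicon[("mom", "my's")] = 5
print(segment(("mom", "my's", "tea"), lexicon)) # [('mom', "my's"), ('tea',)]
```

The real learner advances syllable by syllable when nothing applies; collapsing the remainder into one word keeps the sketch short while matching the bigdrum and mommy's tea examples on the next slides.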

Page 19

Useful  constraints  


•  Like  infants,  start  work  from  the  outside  in  

•  Restrict the search for possible segmentations to words we already know when possible

•  Use  simple  reinforcement  learning  to  reward  useful  words  and  penalize  bad  ones  

 

•  I'll work through some examples
•  Orthography shown for easy reading; the input is syllabified phonemes

Page 20

In  the  beginning…  

•  Just add whole utterances to the lexicon
•  Gets words in isolation for free, but often more than one word

 

 

 

 

Example: bigdrum
→ Treat everything as a word, add to lexicon
Lexicon: big drum (stored as a single unit)

Page 21

Subtractive  Segmentation  

•  Use words in the lexicon to break up the utterance
•  Increase a word's score when it is used
•  Add new words to the lexicon

Example: mo-mmy's tea, with mommy's already in the lexicon
→ Subtract mommy's; treat the remainder, tea, as a word and add it to the lexicon

Page 22

Trust  

•  Add new words to the lexicon based on whether we trust them (they touch an utterance boundary)

Example: is that a che-cker (lexicon: a, is, that, red, ...)
→ The remainder checker touches the utterance boundary: treat it as a word, add it to the lexicon

But in: is that che-cker red
→ Don't trust this: the remainder does not touch an utterance boundary

Page 23

Multiple  hypotheses  

•  For multiple possible subtractions, two options:
•  Greedy approach (Lignos and Yang, 2010)
•  Pursue two (or more) hypotheses (beam search)

•  Two hypotheses allow for penalization: reduce the score of the word that started the losing hypothesis

Example: is that a che-cker (lexicon: a, is, isthat, that, ...)
→ Hypothesis 1: isthat | a | che-cker; Hypothesis 2: is | that | a | che-cker
→ is and that lead to a higher mean score than isthat

Page 24

Scoring  hypotheses  

•  Prefer the hypothesis that uses the higher-scoring words
•  The winner is rewarded, so word scores go up: "rich get richer"

•  Geometric mean of the scores of the words used as the metric:
•  Useful for compound splitting (Koehn and Knight, 2003; Lignos, 2010)

•  Doesn't have a bias for fewer or more words, but seeks a higher average score

•  Post-hoc testing shows this outperforms other obvious metrics (MDL/MLE, other means) (ask me)

u ← the syllables of the utterance, initially with no word boundaries
i ← 0
while i < len(u) do
    if USS requires a word boundary then
        Insert a word boundary and advance i, updating the lexicon as needed
    else if Subtractive Segmentation can be performed then
        Subtract the highest-scoring word and advance i, updating the lexicon as needed
    else
        Advance i by one syllable
    end if
end while
w ← the syllables between the last boundary inserted (or the beginning of the utterance if no boundaries were inserted) and the end of the utterance
Increment w's score in the lexicon, adding it to the lexicon if needed

Figure 3: An algorithm combining USS and Subtractive Segmentation

explore multiple hypotheses at once, using a simple beam search. New hypotheses are added to support multiple possible subtractive segmentations. For example, using the utterance above, at the beginning of segmentation either part or partof could be subtracted from the utterance, and both possible segmentations can be evaluated. The learner scores these hypotheses in a fashion similar to the greedy segmentation, but using a function based on the score of all words used in the utterance. The geometric mean has been used in compound splitting (Koehn and Knight, 2003), a task in many ways similar to word segmentation, so we adopt it as the criterion for selecting the best hypothesis. For a hypothesized segmentation H comprised of words w_i ... w_n, a hypothesis is chosen as follows:

argmax_H ( ∏_{w_i ∈ H} score(w_i) )^(1/n)

For any w not found in the lexicon we must assign a score; we assign it a score of one, as that would be its value assuming it had just been added to the lexicon, an approach similar to Laplace smoothing.
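The geometric-mean criterion and the penalization step can be sketched as follows. The lexicon scores are invented for illustration (loosely following the isthat example earlier in the talk), and `hypothesis_score` is a hypothetical helper name.

```python
import math

# Sketch of the geometric-mean selection criterion. Unseen words get a
# score of 1, as described above; all scores here are made up.

def hypothesis_score(words, lexicon):
    """Geometric mean of the scores of the words used in a hypothesis."""
    scores = [lexicon.get(w, 1) for w in words]
    return math.prod(scores) ** (1 / len(scores))

lexicon = {"is": 40, "that": 35, "isthat": 10, "a": 60}
h1 = ["isthat", "a", "checker"]
h2 = ["is", "that", "a", "checker"]
winner = max([h1, h2], key=lambda h: hypothesis_score(h, lexicon))
print(winner)  # ['is', 'that', 'a', 'checker']

# Penalization: the word unique to the losing hypothesis ("isthat" here)
# has its score reduced so it is less likely to win in the future.
lexicon["isthat"] = max(1, lexicon["isthat"] - 1)
```

Note that the geometric mean is length-normalized, so the criterion has no built-in bias toward segmentations with fewer or more words.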

Returning to the previous example, while the score of partof is greater than that of part, the score of of is much higher than either, so if both partof an apple and part of an apple are considered, the high score of of causes the latter to be chosen. When beam search is employed, only words used in the winning hypothesis are rewarded, similar to the greedy case where there are no other hypotheses.

In addition to preferring segmentations that use words of higher score, it is useful to reduce the score of words that led to the consideration of a losing hypothesis. In the previous example we may want to penalize partof so that we are less likely to choose a future segmentation that includes it. Setting the beam size to two, forcing each hypothesis to develop greedily after an ambiguous subtraction, causes two hypotheses to form, so we are guaranteed a unique word to penalize. In the previous example partof causes the split between the two hypotheses in the beam, and thus the learner penalizes it to discourage using it in the future.

Table 1: Learner and baseline performance (word boundaries)

Algorithm                          Precision  Recall  F-Score
No Stress Information
  Syllable Baseline                81.68      100.0   89.91
  Subtractive Seg.                 91.66      89.13   90.37
  Subtractive Seg. + Beam 2        92.74      88.69   90.67
Word-level Stress
  USS Only                         91.53      18.82   31.21
  USS + Subtractive Seg.           93.76      92.02   92.88
  USS + Subtractive Seg. + Beam 2  94.20      91.87   93.02

5 Results

5.1 Evaluation

5.1 Evaluation

To evaluate the performance of our model, we measured performance on child-directed speech, using the same corpus used in a number of previous studies that used syllabified input (Yang, 2004; Gambell and Yang, 2004; Lignos and Yang, 2010).

Page 25

Predictions  


•  Default assumption of utterance = word →
   Infants will start with oversized units and words in isolation

•  Rich-get-richer scoring →
   As the learner is exposed to more data, it will tend to use high-frequency elements

•  Penalization →
   Use of collocations will decrease with time

 

Page 26

Our  evaluation  corpus  

•  Constructed from the Brown (1973) subset of CHILDES English (Adam, Eve, Sarah), ~60k utterances

•  Pronunciations and stress for each word from CMUDICT, algorithmically syllabified

•  Sample  input:  

B.IH.G|D.R.AH.M  HH.AO.R.S  HH.UW|IH.Z|DH.AE.T  
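A parser for this input format might look like the sketch below: each line is one utterance, syllables are separated by `|`, and phonemes within a syllable by `.` (word boundaries are not marked in the learner's input). The function name is a hypothetical choice.

```python
# Sketch of a parser for the sample input format above.

def parse_utterance(line):
    """Return the utterance as a list of syllables, each a list of phonemes."""
    return [syllable.split(".") for syllable in line.strip().split("|")]

print(parse_utterance("B.IH.G|D.R.AH.M"))
# [['B', 'IH', 'G'], ['D', 'R', 'AH', 'M']]
print(parse_utterance("HH.UW|IH.Z|DH.AE.T"))
# [['HH', 'UW'], ['IH', 'Z'], ['DH', 'AE', 'T']]
```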

Page 27

Evaluation  

•  A' calculated over syllable boundaries
•  Balance of hit rate and false alarm rate for discriminating word boundaries
•  Trapezoidal approximation of the area under the ROC curve

•  F-score used for word token identification
•  Balance of precision (how often a word identified in an utterance is correct) and recall (how many correct words were found)

•  Ex: is that a lady segmented as isthat a lady
•  Precision 2/3: a, lady; isthat
•  Recall 2/4: a, lady; is, that
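Both measures can be sketched in a few lines. Word-token precision and recall are computed over multisets of hypothesized vs. true words; for A', the sketch uses the standard closed form of the two-segment (trapezoidal) ROC approximation given hit rate H and false-alarm rate F, which I assume is the computation intended here.

```python
from collections import Counter

# Sketch of the two evaluation measures (helper names are illustrative).

def word_prf(predicted, gold):
    """Word-token precision, recall, and F-score over multisets of words."""
    correct = sum((Counter(predicted) & Counter(gold)).values())
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f

def a_prime(h, f):
    """Two-segment trapezoidal AUC approximation, assuming h >= f."""
    if h == f:
        return 0.5
    return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))

# The example above: "is that a lady" segmented as "isthat a lady".
p, r, _ = word_prf(["isthat", "a", "lady"], ["is", "that", "a", "lady"])
print(round(p, 3), r)                   # 0.667 0.5
print(round(a_prime(0.953, 0.401), 3))  # 0.875
```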

Page 28

Evaluation  

•  Evaluated the segmenter in three forms:
•  Subtractive segmentation

•  Subtractive segmentation with trust, only adding words to the lexicon if they touch an utterance boundary

•  Subtractive segmentation with trust and multiple hypotheses, considering two hypothetical segmentations and penalizing the loser

•  Errors computed over the testing corpus from the first utterance
•  Worst-case evaluation for an online learner: zero training

•  Word boundaries taken as orthographic boundaries in the input
•  Aligns with morphological and phonological definitions of word

Page 29

Performance  

•  Trust and multiple hypotheses significantly reduce the false alarm rate
•  Perfect memory is not crucial: evaluating A' and F-score with imperfect memory yields similar results
•  Segments 60k utterances in < 1 second

•  Beam performs better than exhaustive search (ask me)

Method                     Hit    False Alarm  A'     Word F-Score
Baseline: syllable = word  1.0    1.0          0      0.753
Subtractive Segmentation   0.992  0.776        0.795  0.797
+ Trust                    0.961  0.468        0.860  0.841
+ Multiple Hypotheses      0.953  0.401        0.875  0.849

Page 30

Infant  error  patterns  


•  Undersegmentation at a young age (Brown, 1973; Clark, 1977; Peters, 1977, 1983)
•  Function word collocations: that-a, it's, isn't-it
•  Phrases as a single unit: look at that, what's that, oh boy, all gone, open the door

•  Function-content collocations: picture-of, whole-thing

•  Oversegmentation at older ages (Peters, 1983)
•  Function word oversegmentation: behave/be have, tulips/two lips, Miami/my Ami/your Ami

•  Errors can still occur in adulthood (Cutler and Norris, 1988)

Page 31: SymbolicConstraintsand StatisticalMethods:Use … · 2020. 3. 13. · criterion for selecting the best hypothesis. For a hypothesized segmentation H comprised of words w i ... iPad)Interface)

Most  frequent  error  tokens  

•  Divided into early (first 10k utterances) and late (last 10k utterances) stages of learning

•  Coded the most frequent incorrect words in the output as:
•  Function: Overuse of a function word (away → a way)

•  Function collocation: Two function words (that's a → that'sa)

•  Content collocation: Content and content/function word (a ball → aball)

•  Other

•  Distribution changes across time (chi-squared p < .0001)

Time   Function  Func. Colloc.  Cont. Colloc.  Other
Early  44.2%     37.0%          8.5%           10.4%
Late   70.6%     1.0%           1.2%           27.3%

Page 32

Learning  curve  

•  The learner starts out undersegmenting; as it learns, it achieves a balance with slight oversegmentation

[Figure: precision, recall, and F-score (y-axis 0.3 to 1.0) over the first 30,000 utterances.]

Page 33

Extensions:  Joint  morphology  learning  


•  Learned on the output of segmentation

•  Initial lexicon is of high enough quality to learn common suffixes

•  Acquisition order similar to Brown (1973):

Predicted order  Suffix  Example    Brown Average Rank
1                -ɪŋ     bake-ing   2.33
2                -z      happen-s   7.9
3                -d      happen-ed  9
4                -ɚ      bake-er    -
5                -t      check-ed   9
6                -ən     broke-en   -
7                -i:     smell-y    -

Page 34

Word  segmentation:  Conclusions  


•  By working at the algorithmic level, we can use information from experimental results to build systems

•  Taking advantage of what we know about infant behavior (syllables, segmentation strategies) leads to good performance and an efficient algorithm

•  A simple reward-based approach predicts the learning patterns of infants

Page 35

III.  Codeswitching  

Page 36

i'm at work babe. trabajando like a good girl. que haces. estoy entusiamada porque voy al mall con gina!!!

I have a bit of a situation mañana marcame porfas, creo que no voy a tener coche!

Page 37

Value  of  detecting  codeswitching  

•  Create single-language chunks for normalization/MT:
•  "y u no…" vs. "manzanas y bananas"

•  Characterize  communicants  using  language  knowledge  

Example: "que te paso?" → Spanish Model; "Why are u upset?" → English Model
⇒ Speaker knows English and Spanish; hearer knows English and Spanish

Page 38

Challenges  


•  Only existing codeswitching corpora are interview-scale
•  Assume no text normalization or part-of-speech tagging

•  Short  messages,  short  phrases  

•  Technical  terms  (“USB”)  and  brand  names  (“Apple”)  

Page 39

Assumptions  


•  We can collect corpora primarily consisting of each language of interest
•  Collections of primarily Spanish (6.8M) and English (2.8M) tweets

•  Try to make the fewest assumptions about the match between training data and applications
•  May not be the same medium (e.g., Twitter, SMS)

•  May not have the same dominant language in the pair

•  No annotated codeswitched data for training

Page 40

Codeswitchador  methodology  


•  Label each token using its source language
•  Threshold on the number of hits to perform language identification and codeswitching detection
•  Require 2 tokens for a language to call it present
•  If multiple languages are present, label the message codeswitched
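The message-level decision rule above can be sketched as follows. The "unknown" fallback for messages where no language reaches the threshold is my assumption; the slides do not say how that case is handled.

```python
from collections import Counter

# Sketch of the message-level labeling rule: a language counts as present
# only if at least `threshold` tokens carry its label.

def label_message(token_languages, threshold=2):
    counts = Counter(token_languages)
    present = [lang for lang, n in counts.items() if n >= threshold]
    if len(present) > 1:
        return "codeswitched"
    if len(present) == 1:
        return present[0]
    return "unknown"   # assumption: no language reached the threshold

print(label_message(["en", "en", "es", "es", "es"]))  # codeswitched
print(label_message(["en", "en", "es"]))              # en (a single "es" hit is below threshold)
```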

Page 41

Models 1 and 1.5

•  Language ratio: ratio(w) = log(P(w | English)) / log(P(w | Spanish))
•  Example ratios: la 2.37, me 1.15

•  Model 1: Label words with ratio above 1 as Spanish, below as English (MLE)

•  Intuition for ratio ≈ 1:
•  Syntactic context constrains codeswitching (Belazi et al., 1994)
•  Syntactic heads for English/Spanish are on the left

•  Model 1.5:
•  Label ratio ≈ 1 words and OOV by left context
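A sketch of the Model 1 word-labeling rule is below. The unigram probabilities are invented for illustration (in the talk they come from large monolingual Twitter collections), and the numerator/denominator convention is my assumption, chosen so that words much more probable in Spanish get ratios above 1, matching the "above 1 → Spanish" rule and the la ≈ 2.37 example.

```python
import math

# Sketch of the Model 1 language-ratio rule. All probabilities below are
# hypothetical estimates, not values from the talk's corpora.

def language_ratio(word, p_english, p_spanish):
    # Both log-probabilities are negative; a word far more probable in
    # Spanish makes the numerator more negative than the denominator,
    # pushing the ratio above 1.
    return math.log(p_english[word]) / math.log(p_spanish[word])

p_english = {"la": 2e-6, "me": 1e-3}   # hypothetical unigram probabilities
p_spanish = {"la": 4e-3, "me": 3e-3}
for word in ("la", "me"):
    ratio = language_ratio(word, p_english, p_spanish)
    label = "Spanish" if ratio > 1 else "English"
    print(word, round(ratio, 2), label)  # la 2.38 Spanish / me 1.19 Spanish
```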

Page 42

Data  collection*  


•  Accuracy verified using a test set created by bilingual Mechanical Turk annotators
•  Instructed to account for standard borrowings

•  Labeled 10,000 tweets from the MITRE Spanish Twitter collection (N = 93,136 tweets) (Burger et al., 2011) for source language
•  Token inter-annotator agreement: 96.52%

•  Tokens labeled with basic entity tags (name, title) were excluded from token evaluation

 

*Many thanks to Jacob Lozano Ramirez for implementing the HIT interface and performing adjudication.

Page 43

LID  and  Codeswitching  accuracy  


•  Task: Mark each tweet (N = 7,018) as Spanish, English, or codeswitched

•  Application: split data into single- or mixed-language sets

All Message LID/CS Accuracy: Baseline 0.529, Function 0.564, Model 1 0.779, Model 1.5 0.815

Page 44

Codeswitching  detection  


•  Task: Mark each tweet (N = 7,018) as codeswitched or not

•  Application: identify codeswitched tweets/users

Codeswitching Detection

          Baseline  Function  Model 1  Model 1.5
CS Prec.  0.529     0.963     0.715    0.797
CS Rec.   1.000     0.205     0.969    0.874
CS F1     0.692     0.338     0.823    0.833

Page 45

Token language identification


•  Task: Among codeswitched tweets (N = 3,711), identify the source language of each token
•  Inter-annotator agreement: 96.52%

•  Application: identify tokens/spans for normalization/MT

 

Token LID Accuracy: Baseline 0.9, Model 1 0.947, Model 1.5 0.967

Page 46

Extensions  


•  Applications in linguistic research: what are the constraints on codeswitching in large-scale data? (ask me)

•  Could apply more powerful learning methods:
•  Simple one-state-per-language HMM: equivalent performance

•  SVM-HMM with morphological features: equivalent performance

•  Higher overfitting risk with these models

•  Most improvement is likely to come from better OOV modeling

•  Comparison with standard LID systems
•  Not yet fair, as they are meant to do many-way LID, not two-way

Page 47

Codeswitching:  Conclusions  


•  A simple approach that takes advantage of linguistic constraints works as well as human annotators

•  Fast, deterministic decoding and minimal storage (1 bit/word)

•  Promising future for improving text normalization, communicant characterization, and machine translation

 

Page 48

IV.  Conclusions  


Page 49

Care  for  some  more  insane  heuristic?      


Yes, please. "Necessity is the mother of invention," and considering additional constraints (e.g., fast, cognitively plausible) has led to good results.

Page 50

Natural  language  for  robot  control  


Natural language used to generate linear temporal logic plans (AAAI Language Grounding Workshop 2012; HRI 2012). https://github.com/PennNLP/SLURP (Supported by ARO MURI)

Page 51

Natural  language  for  robot  control  

[Architecture diagram: an iPad interface takes user input and exchanges orders/responses with the Language Processing stage (NLP pipeline, semantic parsing); the Planning stage (LTLBroom, LTLMoP) turns orders and world knowledge into an automaton for the logical controller, which exchanges world updates, sensor data, map information, and control commands with the robot controller. Collaborators: Cornell, UMass Lowell.]

Page 52

Morphological  processing  


Developing an algorithmic-level model of morphological processing (Chicago Linguistic Society 2012)

How do we represent and process morphologically complex forms (e.g., playing = play + ing)? Does frequency affect our answer?

Use big data to compare whole-word storage, complete decomposition, and dual-route models.

Page 53

Conclusions  


•  Combining the best of statistical learning and symbolic constraints leads to good results and fast systems

•  Cognitive/linguistic constraints provide robust and a priori motivated restrictions on language systems

Page 54

Thanks to: Charles Yang, Mitch Marcus, NSF IGERT #50504487, ARO MURI (SUBTLE) W911NF-07-1-0216

Constantine Lignos
[email protected]
http://www.seas.upenn.edu/~lignos
