BUCLD2011

Statistical word segmentation of Zipfian frequency distributions
Chigusa Kurumada (Linguistics, Stanford), Stephan C. Meylan (Psychology, Stanford), Michael C. Frank (Psychology, Stanford)

Description:

Kurumada, C., Meylan, S. C., & Frank, M. C. (2011b). "Statistical word segmentation of Zipfian frequency distributions." Paper presented at BUCLD 36, November 5.

Transcript of BUCLD2011

Page 1: BUCLD2011

Statistical word segmentation of Zipfian frequency distributions

Chigusa Kurumada Linguistics, Stanford

Stephan C. Meylan Psychology, Stanford

Michael C. Frank Psychology, Stanford

Page 2: BUCLD2011


Segmentation of running speech

I l o ve y o u

Saffran, Newport, et al. (1996); Saffran, Aslin, et al. (1996); Jusczyk (1997); Perruchet et al. (1998); Aslin (1998); Brent (1999); Swingley (2005); Thiessen et al. (2005); Monaghan & Christiansen (2010), among others

Page 3: BUCLD2011


Example

Listen to a Japanese-speaking mother’s speech and find “words”

Page 4: BUCLD2011


Where is your daddy?

Page 5: BUCLD2011


Words occur at different frequencies

Page 6: BUCLD2011


The naturalistic word frequency distribution

Zipfian distribution

Zipf (1965)
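In a Zipfian distribution, a word's frequency is roughly inversely proportional to its frequency rank. A minimal illustration in Python; the 300-token corpus size matches the exposure in Experiment 1, but the exponent and vocabulary size below are illustrative assumptions, not values from the talk:

```python
import numpy as np

def zipfian_counts(n_types, exponent=1.0, n_tokens=300):
    """Expected token counts when word probability falls off as
    1 / rank**exponent (Zipf's law), normalized to n_tokens."""
    ranks = np.arange(1, n_types + 1)
    probs = ranks ** -float(exponent)
    probs /= probs.sum()
    return probs * n_tokens

# Expected counts for 6 word types in a 300-token exposure corpus
print(zipfian_counts(6))  # most frequent word ~122 tokens, rarest ~20
```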

Page 7: BUCLD2011


This talk

Effects of a Zipfian distribution of word frequencies in speech segmentation

• 2 large-scale web-based segmentation experiments

The skewed distribution supports word segmentation

• Implications for existing models

Page 8: BUCLD2011


A potential problem for statistical word segmentation?

[Figure: transitional probabilities (TPs) over the syllable sequence "pre-tty-ba-by" (e.g., TP = 1.0 vs. TP = 0.2), contrasting uniform and Zipfian word frequency distributions (Saffran, Newport, & Aslin, 1996; Goldwater et al., 2009)]
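Transitional probability between adjacent syllables is standardly computed as TP(x → y) = frequency(xy) / frequency(x). A minimal sketch of that computation over an unsegmented syllable stream; the toy stream below is made up for illustration:

```python
from collections import Counter

def transitional_probabilities(syllables):
    """TP(x -> y) = count(x followed by y) / count(x), over a syllable stream."""
    unigrams = Counter(syllables[:-1])
    bigrams = Counter(zip(syllables[:-1], syllables[1:]))
    return {(x, y): c / unigrams[x] for (x, y), c in bigrams.items()}

# Toy stream: "pretty baby pretty doggy" as syllables, no word boundaries
stream = ["pre", "tty", "ba", "by", "pre", "tty", "do", "ggy"]
tps = transitional_probabilities(stream)
print(tps[("pre", "tty")])  # within-word TP is high (here 1.0)
print(tps[("tty", "ba")])   # across-boundary TP is lower (here 0.5)
```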

Page 9: BUCLD2011


Question 1: Is segmentation of a Zipfian language more difficult?

[Design: 6, 12, 24, or 36 word types × uniform vs. Zipfian frequency distribution]

Page 10: BUCLD2011


Experiment 1: Task (on Mechanical Turk)

Exposure: 300 word tokens

Subjects: 246 individuals across the 8 conditions
(6, 12, 24, or 36 types × uniform/Zipfian)

Test: 2-alternative forced-choice (2AFC) task
go-la-bu vs. la-bu-bi
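A minimal sketch of how exposure streams like these could be generated: 300 word tokens drawn from 6, 12, 24, or 36 nonce-word types under either a uniform or a Zipfian (1/rank) distribution. The syllable inventory and the trisyllabic word shape are illustrative assumptions, not the actual experimental materials:

```python
import random

def make_exposure(n_types, distribution="uniform", n_tokens=300, seed=0):
    """Build an unsegmented exposure stream from an artificial lexicon."""
    rng = random.Random(seed)
    syllables = ["go", "la", "bu", "bi", "mo", "pa", "du", "ti", "ke", "ro"]
    lexicon = []
    while len(lexicon) < n_types:  # unique trisyllabic nonce words
        word = "".join(rng.choice(syllables) for _ in range(3))
        if word not in lexicon:
            lexicon.append(word)
    if distribution == "uniform":
        weights = [1.0] * n_types
    else:  # Zipfian: probability proportional to 1 / frequency rank
        weights = [1.0 / rank for rank in range(1, n_types + 1)]
    tokens = rng.choices(lexicon, weights=weights, k=n_tokens)
    return "".join(tokens)  # no word boundaries, as in the exposure phase

print(make_exposure(6, "zipfian")[:60])
```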

Page 11: BUCLD2011


Results 1: Proportion correct in each condition

[Figure: proportion correct as a function of the number of word types (6, 12, 24, 36), shown separately for the uniform and Zipfian conditions]

Page 12: BUCLD2011


Results 2: Effects of the (log) input token frequency

Page 13: BUCLD2011


Experiment 1: Summary

The standard 2AFC paradigm

• Robust segmentation ability

• Strong effects of unigram (log) frequencies

No effect of uniform vs. Zipfian distribution

Page 14: BUCLD2011


Which one’s Daddy? Is it Daddy? That’s Daddy. Is that Daddy too?

Segmentation from the chunk-finding perspective

Chunking (Orban et al. 2008)

Bortfeld et al. (2005)

mommy’s sock (familiar word + new word)

Brent & Cartwright (1996), Brent (1999), Goldwater et al. (2009), Perruchet & Vinter (1998)

Dahan & Brent (1999), Conway et al. (2010), van de Weijer (2001), Cunillera et al. (2010), Lew-Williams et al. (2011)

Page 15: BUCLD2011


Question 2

Is segmentation based on a Zipfian distribution more accurate when words are presented in context?

[Design: 6, 9, 12, or 24 word types × uniform vs. Zipfian frequency distribution]

Page 16: BUCLD2011


Experiment 2: Task

Orthographic manual segmentation (50 sentences)

Unlike the 2AFC (go-la-bu vs. mo-go-la):
• words are presented in context
• active search for words
• time-course of learning
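Responses in this task can be scored as word recall, as reported on the following slides: the proportion of true word tokens whose boundaries the participant reproduces exactly. A minimal scoring sketch, assuming responses are strings with participant-inserted spaces; the study's exact scoring procedure may differ:

```python
def word_recall(response, gold_words):
    """Proportion of gold word tokens whose exact span appears in the
    participant's segmentation (same boundaries, no internal breaks)."""
    def spans(words):
        out, pos = [], 0
        for w in words:
            out.append((pos, pos + len(w)))
            pos += len(w)
        return out
    resp_spans = set(spans(response.split()))
    gold_spans = spans(gold_words)
    hits = sum(span in resp_spans for span in gold_spans)
    return hits / len(gold_spans)

# Gold sentence "golabu mogola bidula"; the participant missed one boundary
print(word_recall("golabu mogolabidula",
                  ["golabu", "mogola", "bidula"]))  # 1 of 3 words -> ~0.33
```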

Page 17: BUCLD2011


Results 1: 6 word types

[Figure: recall (% correct) over trials, Uniform vs. Zipfian]

Page 18: BUCLD2011


[Figure: recall (% correct) over trials for 6, 9, 12, and 24 word types, Uniform vs. Zipfian]

Page 19: BUCLD2011


A mixed logit model predicting correct segmentation

[Diagram: log frequency of the target word, the word before, and the word after as predictors of correct segmentation; p-values shown: p < 0.001, p = 0.9, p < 0.001]
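The slide names a mixed logit model with log-frequency predictors. The sketch below is a simplified stand-in: a plain fixed-effects logistic regression via statsmodels on simulated data, omitting the random effects (e.g., by subject) that a true mixed logit would include; all data and column names are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial-level data: one row per scored word token.
# correct         : 1 if the word was segmented correctly, else 0
# log_freq_target : log token frequency of the target word in the exposure
# log_freq_before : log frequency of the immediately preceding word
# log_freq_after  : log frequency of the immediately following word
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "log_freq_target": rng.normal(3, 1, n),
    "log_freq_before": rng.normal(3, 1, n),
    "log_freq_after": rng.normal(3, 1, n),
})
# Simulated outcome: higher target frequency -> more likely correct
p = 1 / (1 + np.exp(-(-2 + 0.8 * df["log_freq_target"])))
df["correct"] = rng.binomial(1, p)

# Fixed-effects-only approximation of the analysis named on the slide
model = smf.logit(
    "correct ~ log_freq_target + log_freq_before + log_freq_after",
    data=df,
).fit()
print(model.summary())
```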

Page 20: BUCLD2011


Contextual bootstrapping

The average log frequency of all the words that appeared to the left of the target word: p < 0.001
The average log frequency of all the words that appeared to the right of the target word: p < 0.07

No main effect of, or interaction with, distribution type (i.e., uniform vs. Zipfian).
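A minimal sketch of how the contextual predictors described here could be computed: for each target word token, the average log frequency of the words to its left and to its right within the sentence. The function and variable names are illustrative, not taken from the talk:

```python
import math
from collections import Counter

def context_predictors(sentences):
    """For each word token, compute the average log frequency of the words
    to its left and to its right within the same sentence."""
    freqs = Counter(w for sent in sentences for w in sent)
    rows = []
    for sent in sentences:
        logf = [math.log(freqs[w]) for w in sent]
        for i, word in enumerate(sent):
            left = sum(logf[:i]) / i if i > 0 else None
            right = (sum(logf[i + 1:]) / (len(sent) - i - 1)
                     if i < len(sent) - 1 else None)
            rows.append({"word": word, "left_mean_logf": left,
                         "right_mean_logf": right})
    return rows

# Toy segmented corpus: lists of words per sentence
corpus = [["golabu", "mogola"], ["golabu", "bidula", "mogola"]]
print(context_predictors(corpus)[0])
```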

Page 21: BUCLD2011



Experiment 2: Summary

• Clear advantage of a Zipfian distribution

• The advantage is mediated by (log) token frequency

Page 22: BUCLD2011


Conclusion

I l o ve y o u

The Zipfian structure of natural language supports word recognition in context

Page 23: BUCLD2011


Thanks to: Stanford Language Cognition Lab, Eve Clark, Tom Wasow, Dan Jurafsky, and Noah Goodman (Stanford), T. Florian Jaeger (University of Rochester), Josh Tenenbaum (MIT)

For the full text of this paper, visit the Stanford Language Cognition Lab website: http://langcog.stanford.edu/publications.html

Thank you!

Page 24: BUCLD2011


Meghan Sumner website: http://www.stanford.edu/~sumner/