BUCLD2011


Kurumada, C., Meylan, S.C., & Frank, M.C. (2011b). “Statistical word segmentation of Zipfian frequency distributions”. Paper presented at BUCLD 36, November 5th.


Statistical word segmentation of Zipfian frequency distributions

Chigusa Kurumada Linguistics, Stanford

Stephan C. Meylan Psychology, Stanford

Michael C. Frank Psychology, Stanford

2

Segmentation of running speech

I l o ve y o u

Saffran, Newport et al. (1996); Saffran, Aslin et al. (1996); Jusczyk (1997); Perruchet et al. (1998); Aslin (1998); Brent (1999); Swingley (2005); Thiessen et al. (2005); Monaghan & Christiansen (2010), among others

3

Example

Listen to a Japanese-speaking mother’s speech and find “words”

4

Where is your daddy?

5

Words occur at different frequencies

6

The naturalistic word frequency distribution

Zipfian distribution

Zipf (1965)
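A minimal sketch of what a Zipfian versus a uniform frequency distribution looks like for a small lexicon. It assumes frequency proportional to 1/rank (exponent 1) and a largest-remainder rounding rule; both are illustrative choices, and the function names are hypothetical rather than taken from the talk.

```python
def zipfian_counts(n_types, n_tokens, exponent=1.0):
    """Split n_tokens over n_types with counts roughly proportional to 1 / rank**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_types + 1)]
    total = sum(weights)
    expected = [n_tokens * w / total for w in weights]
    # Floor the expected counts, then hand leftover tokens to the largest remainders
    # so that the counts sum exactly to n_tokens (assumed rounding rule).
    counts = [int(e) for e in expected]
    leftover = n_tokens - sum(counts)
    by_remainder = sorted(range(n_types), key=lambda i: expected[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def uniform_counts(n_types, n_tokens):
    """Split n_tokens over n_types as evenly as possible."""
    base, extra = divmod(n_tokens, n_types)
    return [base + (1 if i < extra else 0) for i in range(n_types)]


print(zipfian_counts(6, 300))
print(uniform_counts(6, 300))
```

In the Zipfian case a few types take most of the tokens and the rest form a long tail; in the uniform case every type is heard equally often.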

7

This talk

Effects of a Zipfian distribution of word frequencies in speech segmentation

• 2 large-scale web-based segmentation experiments

The skewed distribution supports word segmentation

• Implications for existing models

8

A potential problem for statistical word segmentation?

Pre-tty-ba-by

TP = 0.2

(Saffran, Newport, & Aslin, 1996)

(Goldwater et al., 2009)

Uniform Zipfian

TP = 1.0
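A rough sketch of how transitional probabilities (TPs) between syllables can be estimated, using TP(x→y) = count(xy) / count(x). The toy stream is invented for illustration: it is built so that the within-word transition "pre→tty" comes out at 1.0 and the cross-boundary transition "tty→ba" at 0.2. Reading the two values on the slide this way is an assumption.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Return {(x, y): P(y | x)} estimated from adjacent syllable pairs."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # The final syllable never starts a pair, so it is left out of the denominator counts.
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Hypothetical toy stream: "pre-tty" is always followed by one of five different words.
following_words = [["ba", "by"], ["do", "ggy"], ["flow", "er"], ["bir", "die"], ["hor", "sey"]]
stream = []
for word in following_words:
    stream += ["pre", "tty"] + word

tps = transitional_probabilities(stream)
print(tps[("pre", "tty")])  # within-word transition
print(tps[("tty", "ba")])   # transition across a word boundary
```

With skewed frequencies, a frequent word pair can push a cross-boundary TP upward, which is the potential problem this slide raises.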

9

Question 1: Is segmentation of a Zipfian language more difficult?

Conditions: 6, 12, 24, or 36 word types, each with a uniform or a Zipfian frequency distribution

10

Experiment 1: Task (on Mechanical Turk)

Exposure: 300 word tokens

Subjects: 246 individuals in the 8 conditions

(6, 12, 24, 36 types * uniform/zipfian)

Test: two-alternative forced-choice (2AFC) task

go-la-bu la-bu-bi
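One common way to build a 2AFC foil in this literature is a "part-word" that straddles a word boundary. The sketch below assumes that construction. The slides do not say how the foils were made, and the second word "bi-da-ku" is hypothetical, chosen only so that the foil matches the example pair.

```python
def part_word(first, second, overlap=2):
    """Build a foil from the last `overlap` syllables of `first` plus the start of `second`."""
    if not 0 < overlap < len(first):
        raise ValueError("overlap must be between 1 and len(first) - 1")
    needed = len(first) - overlap  # syllables borrowed from the following word
    if needed > len(second):
        raise ValueError("second word is too short to complete the foil")
    return first[-overlap:] + second[:needed]


word = ["go", "la", "bu"]
next_word = ["bi", "da", "ku"]  # hypothetical following word
foil = part_word(word, next_word)
print("-".join(word), "vs.", "-".join(foil))
```

The foil has the same length as the word, so a participant can only prefer the word if they tracked which syllable sequences cohere.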

11

Results 1: Proportion correct in each condition

[Figure: proportion correct by number of word types (6, 12, 24, 36), with separate panels for the uniform and Zipfian conditions]

12

Results 2: Effects of the (log) input token frequency

13

Experiment 1: Summary

The standard 2AFC paradigm

• Robust segmentation ability

• Strong effects of unigram (log) frequencies

No effects of uniform vs. Zipfian

14

Which one’s Daddy? Is it Daddy? That’s Daddy. Is that Daddy too?

Segmentation from the chunk-finding perspective

Chunking (Orban et al. 2008)

Bortfeld et al. (2005)

mommy’s (familiar) sock (new)

Brent & Cartwright (1996), Brent (1999), Goldwater et al. (2009), Perruchet & Vinter (1998)

Dahan & Brent (1999), Conway et al. (2010), van de Weijer (2001), Cunillera et al. (2010), Lew-Williams et al. (2011)

15

Question 2

Is segmentation based on a Zipfian distribution more accurate when words are presented in context?

Conditions: 6, 9, 12, or 24 word types, each with a uniform or a Zipfian frequency distribution

16

Experiment 2: Task

Orthographic manual segmentation (50 sentences)

Unlike the 2AFC (go-la-bu vs. mo-go-la):

• words are presented in context

• active search for words

• time-course of learning

17

Results 1: 6 word types

[Figure: recall (% correct) across trials, uniform vs. Zipfian]
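A sketch of one way recall could be scored for a manual segmentation response. The slides do not define the measure precisely, so token-level recall is an assumption here: a word counts as correct only if both of its boundaries are marked and no boundary falls inside it. The example words are hypothetical.

```python
def spans(words):
    """Return the set of (start, end) character spans covered by each word."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def token_recall(gold, response):
    """Fraction of gold words whose exact span also appears in the response."""
    if "".join(gold) != "".join(response):
        raise ValueError("gold and response must segment the same string")
    gold_spans = spans(gold)
    return len(gold_spans & spans(response)) / len(gold_spans)


gold = ["golabu", "bida", "golabu"]        # hypothetical sentence
response = ["golabu", "bi", "dagolabu"]    # hypothetical participant answer
print(token_recall(gold, response))
```

Here only the first word is recovered exactly, so one of the three gold words counts toward recall.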

18

[Figure: recall (% correct) across trials for 6, 9, 12, and 24 word types, uniform vs. Zipfian]

19

A mixed logit model predicting correct segmentation

LogFrequency(p<0.001)

LogFrequency(p=0.9)

LogFrequency(p<0.001)

target word / word before / word after

20

Contextual bootstrapping

The average log frequency of all the words that appeared on the left (p<0.001)

No main effect or interaction with the distribution types (i.e., uniform vs. Zipfian).

The average log frequency of all the words that appeared on the right (p<0.07)

target word
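The two context predictors can be sketched as follows: for a target word in a sentence, take the mean log frequency of all words to its left and of all words to its right. Natural log, the helper name, and the toy counts are assumptions for illustration; this is not the authors' analysis code.

```python
import math


def context_log_frequency(sentence, index, counts):
    """Mean log frequency of the words left and right of sentence[index].

    Returns (left, right); a side with no words yields None.
    """
    def mean_log(words):
        if not words:
            return None
        return sum(math.log(counts[w]) for w in words) / len(words)

    return mean_log(sentence[:index]), mean_log(sentence[index + 1:])


counts = {"golabu": 100, "bida": 10, "moku": 1}   # hypothetical token counts
sentence = ["golabu", "bida", "moku"]
left, right = context_log_frequency(sentence, 1, counts)
print(left, right)
```

A target flanked by frequent words gets high context values, which is the situation the slide associates with better segmentation.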

21


Experiment 2: Summary

• Clear advantage of a Zipfian distribution

• The advantage is mediated by (log) token frequency

22

Conclusion

I l o ve y o u

The Zipfian structure of natural language supports word recognition in context

23

Thanks to: Stanford Language Cognition Lab, Eve Clark, Tom Wasow, Dan Jurafsky, and Noah Goodman (Stanford), T. Florian Jaeger (University of Rochester), Josh Tenenbaum (MIT)

For a full text of this paper, visit the Stanford Language Cognition Lab website: http://langcog.stanford.edu/publications.html

Thank you!

24

Meghan Sumner website: http://www.stanford.edu/~sumner/