BUCLD 2011

Statistical word segmentation of Zipfian frequency distributions

Chigusa Kurumada, Linguistics, Stanford

Stephan C. Meylan, Psychology, Stanford

Michael C. Frank, Psychology, Stanford

Segmentation of running speech

I l o ve y o u

Saffran, Newport et al. (1996); Saffran, Aslin et al. (1996); Jusczyk (1997); Perruchet et al. (1998); Aslin (1998); Brent (1999); Swingley (2005); Thiessen et al. (2005); Monaghan & Christiansen (2010), among others

Example

Listen to a Japanese-speaking mother’s speech and find “words”

Where is your daddy?

Words occur at different frequencies

The naturalistic word frequency distribution

Zipfian distribution

Zipf (1965)

This talk

Effects of a Zipfian distribution of word frequencies in speech segmentation

• 2 large-scale web-based segmentation experiments

The skewed distribution supports word segmentation

• Implications for existing models

A potential problem for statistical word segmentation?

Pre-tty-ba-by

TP = 0.2

(Saffran, Newport, & Aslin, 1996)

(Goldwater et al., 2009)

Uniform Zipfian

TP = 1.0
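As a rough illustration of the transitional probability (TP) statistic on this slide, the following Python sketch estimates forward TPs between adjacent syllables in a toy stream. The toy word sequence, the syllable strings, and the function name `transitional_probabilities` are assumptions made for illustration; they are not the stimuli or the analysis from the talk.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate forward TP(a -> b) = count(a followed by b) / count(a)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a successor in the stream.
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Hypothetical toy stream: "pretty baby pretty doggy", one syllable per item.
stream = ["pre", "tty", "ba", "by", "pre", "tty", "do", "ggy"]
tps = transitional_probabilities(stream)

print(tps[("pre", "tty")])  # within-word transition: 1.0
print(tps[("tty", "ba")])   # across a word boundary: 0.5
```

In this toy stream the within-word transition is fully predictable, while the transition out of "tty" is split across the words that can follow it. How often each following word occurs changes the cross-boundary TP, which is where the frequency distribution of the lexicon matters.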

Question 1: Is segmentation of a Zipfian language more difficult?

Conditions: 6, 12, 24, or 36 word types, each with a uniform or a Zipfian frequency distribution

Experiment 1: Task (on Mechanical Turk)

Exposure: 300 word tokens

Subjects: 246 individuals in the 8 conditions

(6, 12, 24, 36 types × uniform/Zipfian)
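One way to picture the two frequency conditions is to split the 300 exposure tokens across the word types either evenly or in proportion to 1/rank. The Python sketch below does this; the 1/rank weighting, the rounding scheme, and the function names are assumptions for illustration, not necessarily how the experimental stimuli were constructed.

```python
def uniform_counts(n_types, n_tokens=300):
    """Split tokens as evenly as possible across word types."""
    base, extra = divmod(n_tokens, n_types)
    return [base + (1 if i < extra else 0) for i in range(n_types)]

def zipfian_counts(n_types, n_tokens=300):
    """Allocate tokens in proportion to 1/rank (assumed Zipf exponent of 1)."""
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    total = sum(weights)
    counts = [int(n_tokens * w / total) for w in weights]
    # Tokens lost to flooring go to the most frequent types first.
    leftover = n_tokens - sum(counts)
    for i in range(leftover):
        counts[i % n_types] += 1
    return counts

for n_types in (6, 12, 24, 36):
    u = uniform_counts(n_types)
    z = zipfian_counts(n_types)
    print(n_types, "types | uniform:", u[0], "to", u[-1],
          "| zipfian:", z[0], "to", z[-1])
```

Under these assumptions, the most frequent word in a Zipfian condition is heard far more often than any word in the matching uniform condition, while the rarest words are heard less often.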

Test: 2-alternative forced-choice (2AFC) task

go-la-bu la-bu-bi

Results 1: Proportion correct in each condition

(Plot: proportion correct by number of word types (6, 12, 24, 36), uniform vs. Zipfian)

Results 2: Effects of (log) input token frequency

Experiment 1: Summary

The standard 2AFC paradigm

• Robust segmentation ability

• Strong effects of unigram (log) frequencies

No effect of uniform vs. Zipfian

Which one’s Daddy? Is it Daddy? That’s Daddy. Is that Daddy too?

Segmentation from the chunk-finding perspective

Chunking (Orban et al. 2008)

Bortfeld et al. (2005)

mommy’s (familiar) sock (new)

Brent & Cartwright (1996), Brent (1999), Goldwater et al. (2009), Perruchet & Vinter (1998)

Dahan & Brent (1999), Conway et al. (2010), van de Weijer (2001), Cunillera et al. (2010), Lew-Williams et al. (2011)

Question 2: Is segmentation based on a Zipfian distribution more accurate when words are presented in context?

Conditions: 6, 9, 12, or 24 word types, each with a uniform or a Zipfian frequency distribution

Experiment 2: Task

Orthographic manual segmentation (50 sentences)

Unlike the 2AFC:

• words are presented in context

• active search for words

• time-course of learning

go-la-bu mo-go-la
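A manual segmentation response can be scored against the true word boundaries of a sentence. The Python sketch below shows one plausible scoring: the fraction of true word tokens whose start and end positions are both recovered. The scoring rule, the example sentence, and the function names are assumptions for illustration; the talk does not spell out its exact scoring procedure.

```python
def word_spans(words):
    """Return the set of (start, end) character spans for a list of words."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def token_recall(gold, response):
    """Fraction of gold word tokens whose exact span appears in the response."""
    gold_spans = word_spans(gold.split())
    resp_spans = word_spans(response.split())
    return len(gold_spans & resp_spans) / len(gold_spans)

# Hypothetical sentence built from syllables like those used in the slides.
gold = "golabu mogola labubi"
response = "golabu mogo lalabubi"  # one boundary misplaced

print(token_recall(gold, gold))      # perfect segmentation: 1.0
print(token_recall(gold, response))  # only "golabu" is recovered
```

Scoring whole tokens rather than individual boundaries means a single misplaced boundary costs both of the words it touches.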

Results 1: Recall (% correct) over trials, uniform vs. Zipfian

(Panels: 6, 9, 12, and 24 word types)

A mixed logit model predicting correct segmentation

LogFrequency (p < 0.001)

LogFrequency (p = 0.9)

LogFrequency (p < 0.001)

target word, word before, word after

Contextual bootstrapping

The average log frequency of all the words that appeared on the left (p<0.001)

No main effect or interaction with the distribution types (i.e., uniform vs. Zipfian).

The average log frequency of all the words that appeared on the right (p<0.07)

target word

Zipfian

uniform

Experiment 2: Summary

• Clear advantage of a Zipfian distribution

• The advantage is mediated by (log) token frequency

Conclusion

I l o ve y o u

The Zipfian structure of natural language supports word recognition in context

Thanks to: Stanford Language Cognition Lab, Eve Clark, Tom Wasow, Dan Jurafsky, and Noah Goodman (Stanford), T. Florian Jaeger (University of Rochester), Josh Tenenbaum (MIT)

For a full text of this paper, visit the Stanford Language Cognition Lab website: http://langcog.stanford.edu/publications.html

Thank you!

Meghan Sumner website: http://www.stanford.edu/~sumner/
