Post on 28-Mar-2016
Statistical word segmentation of Zipfian frequency distributions
Chigusa Kurumada Linguistics, Stanford
Stephan C. Meylan Psychology, Stanford
Michael C. Frank Psychology, Stanford
2
Segmentation of running speech
I l o ve y o u
Saffran, Newport et al. (1996); Saffran, Aslin et al. (1996); Jusczyk (1997); Perruchet et al. (1998); Aslin (1998); Brent (1999); Swingley (2005); Thiessen et al. (2005); Monaghan & Christiansen (2010), among others
3
Example
Listen to a Japanese-speaking mother’s speech and find “words”
4
Where is your daddy?
5
Words occur at different frequencies
6
The naturalistic word frequency distribution
Zipfian distribution
Zipf (1965)
7
This talk
Effects of a Zipfian distribution of word frequencies in speech segmentation
• 2 large-scale web-based segmentation experiments
The skewed distribution supports word segmentation
• Implications for existing models
8
A potential problem for statistical word segmentation?
Pre-tty-ba-by
TP = 0.2
(Saffran, Newport, & Aslin, 1996)
(Goldwater et al., 2009)
Uniform Zipfian
TP = 1.0
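The TP values on this slide are transitional probabilities between adjacent syllables. As a minimal sketch, assuming the usual forward definition TP(x → y) = count(xy) / count(x), they can be computed from a syllable stream as follows. The function name and the toy stream are made up for illustration and are not the stimuli from the talk.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Forward TP(x -> y) = count(x followed by y) / count(x with a successor)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a following syllable.
    first_counts = Counter(syllables[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}

# Toy stream (hypothetical): "pre" is always followed by "tty",
# while the syllable after "tty" varies.
stream = "pre tty ba by pre tty do ggy pre tty ba by".split()
tps = transitional_probabilities(stream)

for pair in [("pre", "tty"), ("tty", "ba"), ("tty", "do")]:
    print(pair, round(tps[pair], 2))
```

In this toy stream the within-word transition is fully predictable, while transitions across a word boundary are not, which is the contrast a TP-based learner relies on.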
9
Question 1: Is segmentation of a Zipfian language more difficult?
[Figure: panels for 6, 12, 24, and 36 types; uniform vs. Zipfian]
10
Experiment 1: Task (on Mechanical Turk)
Exposure: 300 word tokens
Subjects: 246 individuals in the 8 conditions
(6, 12, 24, 36 types * uniform/zipfian)
Test: 2 alternative forced choice task
go-la-bu la-bu-bi
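The exposure conditions above can be sketched as follows. This is an illustration only: it assumes Zipfian counts proportional to 1/rank and a simple random shuffle, neither of which is stated on the slides, and the word forms are hypothetical.

```python
import random

def token_counts(n_types, n_tokens, zipfian):
    """Split n_tokens across n_types, evenly or in proportion to 1/rank."""
    weights = [1.0 / r if zipfian else 1.0 for r in range(1, n_types + 1)]
    total = sum(weights)
    counts = [max(1, round(n_tokens * w / total)) for w in weights]
    # Rounding can leave the sum off by a few tokens; absorb the
    # difference in the most frequent type so the total is exact.
    counts[0] += n_tokens - sum(counts)
    return counts

def make_stream(words, counts, seed=0):
    """Return a shuffled list of word tokens with the given per-type counts."""
    tokens = [w for w, c in zip(words, counts) for _ in range(c)]
    random.Random(seed).shuffle(tokens)
    return tokens

# Hypothetical trisyllabic items, one per word type.
words = ["golabu", "padoti", "bidaku", "tupiro", "kimale", "sunofe"]
uniform = token_counts(6, 300, zipfian=False)
zipf = token_counts(6, 300, zipfian=True)
stream = make_stream(words, zipf)
```

The sketch does not prevent immediate repetitions of the same word, which a real stimulus list might need to control.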
11
Results 1: Proportion correct in each condition
[Figure: proportion correct by number of word types (6, 12, 24, 36); Uniform and Zipfian panels]
12
Results 2: Effects of the (log) input token frequency
13
Experiment 1: Summary
The standard 2AFC paradigm
• Robust segmentation ability
• Strong effects of unigram (log) frequencies
No effects of uniform vs. Zipfian
14
Which one’s Daddy? Is it Daddy? That’s Daddy. Is that Daddy too?
Segmentation from the chunk-finding perspective
Chunking (Orban et al. 2008)
Bortfeld et al. (2005)
mommy’s sock familiar new
Brent & Cartwright (1996), Brent (1999), Goldwater et al. (2009), Perruchet & Vinter (1998)
Dahan & Brent (1999), Conway et al. (2010), van de Weijer (2001), Cunillera et al. (2010), Lew-Williams et al. (2011)
15
Question 2
Is segmentation based on a Zipfian distribution more accurate when words are presented in context?
[Figure: conditions with 6, 9, 12, and 24 word types; uniform vs. Zipfian]
16
Experiment 2: Task
Orthographic manual segmentation (50 sentences)
Unlike the 2AFC (go-la-bu vs. mo-go-la):
• words are presented in context
• active search for words
• time-course of learning
17
Results 1: 6 word types
[Figure: recall (% correct) over trials; Uniform vs. Zipfian]
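One common way to score a manual segmentation for recall is to count how many true words were recovered with both boundaries in the right place. The sketch below assumes that scoring rule and space-delimited responses; the slides do not spell out the exact measure, and the example strings are hypothetical.

```python
def word_spans(segmented):
    """Map a space-segmented string to a set of (start, end) character spans."""
    spans, pos = set(), 0
    for word in segmented.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans

def recall(gold, response):
    """Fraction of gold words whose exact span appears in the response."""
    gold_spans = word_spans(gold)
    return len(gold_spans & word_spans(response)) / len(gold_spans)

# Hypothetical sentence: the participant finds the first word
# but misplaces the boundary between the second and third.
gold = "golabu padoti bidaku"
response = "golabu pado tibidaku"
score = recall(gold, response)
```

Here only one of the three gold words is recovered exactly, so the sentence scores one third.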
18
[Figure: recall (% correct) for 6, 9, 12, and 24 word types; Uniform vs. Zipfian]
19
A mixed logit model predicting correct segmentation
Log frequency (p < 0.001)
Log frequency (p = 0.9)
Log frequency (p < 0.001)
target word, word before, word after
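A model of this kind could be written as below. This is a sketch under assumptions: the by-participant random intercept u_i is a guess, since the slides do not state the random-effects structure, and the coefficient names are introduced here only for illustration.

```latex
\operatorname{logit} P(\text{correct}_{ij})
  = \beta_0
  + \beta_1 \log f_{\text{target}}
  + \beta_2 \log f_{\text{before}}
  + \beta_3 \log f_{\text{after}}
  + u_i,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
```

Here i indexes participants, j indexes target-word tokens, and each f is the token frequency of the target word, the word before it, or the word after it.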
20
Contextual bootstrapping
The average log frequency of all the words that appeared on the left (p < 0.001)
No main effect or interaction with the distribution types (i.e., uniform vs. Zipfian).
The average log frequency of all the words that appeared on the right (p < 0.07)
target word
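The two context predictors can be illustrated with a short sketch. It assumes that "on the left" and "on the right" mean the immediately adjacent word, pooled over every occurrence of the target; the slides leave this open, so treat that reading, the function name, and the toy sentences as assumptions.

```python
import math
from collections import Counter

def mean_neighbor_log_freq(sentences, target):
    """Mean log token frequency of the words immediately left/right of target."""
    freq = Counter(w for s in sentences for w in s)
    left, right = [], []
    for s in sentences:
        for i, w in enumerate(s):
            if w != target:
                continue
            if i > 0:
                left.append(math.log(freq[s[i - 1]]))
            if i + 1 < len(s):
                right.append(math.log(freq[s[i + 1]]))

    def mean(xs):
        # NaN signals that the target never had a neighbor on that side.
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(left), mean(right)

# Toy corpus (hypothetical): "a" and "b" occur 3 times each, "c" once.
sentences = [["a", "b", "c"], ["a", "b"], ["b", "a"]]
left_mean, right_mean = mean_neighbor_log_freq(sentences, "b")
```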
21
Experiment 2: Summary
[Figure legend: Zipfian, uniform]
• Clear advantage of a Zipfian distribution
• The advantage is mediated by (log) token frequency
22
Conclusion
I l o ve y o u
The Zipfian structure of natural language supports word recognition in context
23
Thanks to: Stanford Language Cognition Lab, Eve Clark, Tom Wasow, Dan Jurafsky, and Noah Goodman (Stanford), T. Florian Jaeger (University of Rochester), Josh Tenenbaum (MIT)
For a full text of this paper, visit the Stanford Language Cognition Lab website: http://langcog.stanford.edu/publications.html
Thank you!
24